Bruxelles, 2006-03-10

Computer Aided Document Indexing System (CADIS) with Eurovoc

Bojana Dalbelo Bašić, Faculty of Electrical Engineering and Computing, University of Zagreb, [email protected]

Marko Tadić, Faculty of Humanities and Social Sciences, University of Zagreb, [email protected]




Project AIDE

idea for the project: September 2004, at a conference at the JRC, Ispra

interdisciplinary collaboration of three institutions:

Croatian Information Documentation Referral Agency (HIDRA)

Department of Electronics, Microelectronics, Computer and Intelligent Systems (ZEMRIS), Faculty of Electrical Engineering and Computing, University of Zagreb

Institute of Linguistics (ZZL), Faculty of Humanities and Social Sciences, University of Zagreb


AIDE – collaborating institutions

HIDRA

collecting, processing, providing public access to, and promoting the official documentation of the Republic of Croatia

coordinator Maja Cvitaš, M.A.

ZEMRIS

research in the field of artificial intelligence, neural networks, machine learning, data and text mining

coordinators prof. Bojana Dalbelo Bašić and Jan Šnajder

ZZL

research in computational linguistics and development of language technologies for Croatian

coordinator prof. Marko Tadić


AIDE – project objective

Development of an intelligent system for automatic indexing of the official documentation of the Republic of Croatia with descriptors from the Eurovoc thesaurus


AIDE – how?

automatic indexing, how?

program which “learns to index”

Joint Research Center of EC (JRC), Ispra, Italy:

at least 10,000 manually indexed documents

3-5 descriptors per document

10-15 documents per descriptor

indexed documents stored in XML format

Steinberger (2003)

compiling a corpus of Croatian indexed documents for machine learning of automatic indexing with Eurovoc descriptors

situation with Croatian documentation in 2004:

only a few hundred documents had been indexed

manual indexing: painfully slow
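The corpus figures listed above (number of indexed documents, descriptors per document, documents per descriptor) can be checked mechanically. The following is only a minimal sketch, assuming the indexed corpus is available as a mapping from a document ID to the set of Eurovoc descriptor IDs assigned to it. The function name `corpus_stats`, the data layout, and the sample IDs are hypothetical and are not taken from CADIS or from the JRC system.

```python
from collections import Counter

def corpus_stats(indexed_docs):
    """Summarise a manually indexed corpus.

    indexed_docs: dict mapping a document ID to the set of descriptor
    IDs assigned to it (a hypothetical layout, not the CADIS format).
    """
    n_docs = len(indexed_docs)
    # Count how many documents carry each descriptor.
    docs_per_descriptor = Counter(
        d for descriptors in indexed_docs.values() for d in descriptors
    )
    total_assignments = sum(docs_per_descriptor.values())
    n_descriptors = len(docs_per_descriptor)
    return {
        "documents": n_docs,
        "avg_descriptors_per_document":
            total_assignments / n_docs if n_docs else 0.0,
        "avg_documents_per_descriptor":
            total_assignments / n_descriptors if n_descriptors else 0.0,
    }

# Toy corpus with made-up document and descriptor IDs.
sample = {
    "doc1": {"d10", "d20", "d30"},
    "doc2": {"d10", "d20"},
    "doc3": {"d10"},
}
stats = corpus_stats(sample)
print(stats)
```

Comparing these averages with the figures quoted on the slide gives a rough indication of how far a growing corpus is from being usable for machine learning.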


AIDE – how?

how could we speed up the manual indexing?

plan:

develop a workstation for computer-aided document indexing

conduct research and development of algorithms in the field of computational linguistics and language technologies

build that knowledge into the workstation and turn it into the Computer Aided Document Indexing System (CADIS)


CADIS: two windows

Document window

Eurovoc browser window


Document Window


CADIS features

Enhanced user interface

list of descriptors appearing in document


CADIS features

Descriptors and non-descriptors marked in document
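The slides do not describe how CADIS locates thesaurus terms in the text. As an illustration only, the sketch below marks descriptors and non-descriptors by case-insensitive exact phrase matching. The function names `mark_terms` and `render`, the bracket notation, and the sample term lists are assumptions made for this example. Exact matching ignores inflected word forms, which a language-specific module would have to handle.

```python
import re

def mark_terms(text, descriptors, non_descriptors):
    """Return (start, end, kind, matched_text) for each thesaurus term found."""
    kinds = {t.lower(): "descriptor" for t in descriptors}
    kinds.update({t.lower(): "non-descriptor" for t in non_descriptors})
    # Longest terms first so multi-word terms win over their parts.
    ordered = sorted(kinds, key=len, reverse=True)
    if not ordered:
        return []
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return [
        # "unknown" is a fallback for rare Unicode case-folding mismatches.
        (m.start(), m.end(), kinds.get(m.group(0).lower(), "unknown"), m.group(0))
        for m in pattern.finditer(text)
    ]

def render(text, marks):
    """Wrap descriptors in [...] and everything else in {...}."""
    out, pos = [], 0
    for start, end, kind, _ in marks:
        left, right = ("[", "]") if kind == "descriptor" else ("{", "}")
        out.append(text[pos:start] + left + text[start:end] + right)
        pos = end
    out.append(text[pos:])
    return "".join(out)

# Made-up sentence and term lists, for illustration only.
text = "The law on customs duties amends the earlier tariff rules."
marks = mark_terms(text, descriptors=["customs duties"], non_descriptors=["tariff"])
marked = render(text, marks)
print(marked)
```

The list of marks can also serve as the "list of descriptors appearing in document" by keeping only the entries whose kind is `descriptor`.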


CADIS features

Lists of n-grams
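A list of n-grams of the kind named above can be produced with a few lines of code. This is a minimal sketch, assuming simple word tokenisation on alphanumeric runs and lower-casing; the function name `ngram_counts` and the sample text are hypothetical, and the tokenisation used by CADIS itself is not described in the slides.

```python
import re
from collections import Counter

def ngram_counts(text, n):
    """Count word n-grams in text (lower-cased, punctuation dropped)."""
    tokens = re.findall(r"\w+", text.lower())
    # Zip n shifted copies of the token list to form consecutive n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

# Made-up sample text, for illustration only.
text = ("official documentation of the Republic of Croatia "
        "and official documentation of the EU")
bigrams = ngram_counts(text, 2)
for gram, count in bigrams.most_common(3):
    print(count, gram)
```

Calling the function with `n` set to 1, 2, 3 and so on yields separate lists of unigrams, bigrams and trigrams for the same document.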


CADIS features

Integration of corpus analysis

greyed n-grams are statistically relevant in the corpus
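The slides do not say which statistic CADIS uses to decide that an n-gram is relevant in the corpus. As one possible reading, the sketch below uses TF-IDF, a standard technique swapped in here for illustration: an n-gram scores high when it is frequent in the current document but occurs in few corpus documents. The function names, the smoothing, and the threshold value are all assumptions.

```python
import math
from collections import Counter

def tfidf_scores(doc_ngrams, corpus_ngram_sets):
    """Score each n-gram of one document by TF-IDF against a corpus.

    doc_ngrams: Counter of n-gram -> count in the document being indexed.
    corpus_ngram_sets: list of sets, one per corpus document, holding
    the n-grams that occur in that document.
    """
    n_docs = len(corpus_ngram_sets)
    scores = {}
    for gram, tf in doc_ngrams.items():
        # Document frequency: number of corpus documents containing the n-gram.
        df = sum(1 for s in corpus_ngram_sets if gram in s)
        # Smoothed inverse document frequency; always at least 1.0 when df <= n_docs.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[gram] = tf * idf
    return scores

def relevant_ngrams(doc_ngrams, corpus_ngram_sets, threshold):
    """Return n-grams scoring at least the (hypothetical) threshold, best first."""
    scores = tfidf_scores(doc_ngrams, corpus_ngram_sets)
    kept = [g for g, s in scores.items() if s >= threshold]
    return sorted(kept, key=lambda g: (-scores[g], g))

# Made-up corpus and document, for illustration only.
corpus = [
    {"of the", "customs duties"},
    {"of the", "public health"},
    {"of the", "official documentation"},
]
doc = Counter({"of the": 2, "customs duties": 2})
scores = tfidf_scores(doc, corpus)
print(relevant_ngrams(doc, corpus, threshold=3.0))
```

In this toy data the n-gram that appears in every corpus document scores lower than the one that appears in a single document, even though both occur equally often in the document being indexed.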


CADIS features

Manual marking of significant n-grams: an important step towards automatic indexing


Eurovoc browser window


Further development

CADIS for other languages?

already available for Croatian and English

usable for other languages without the linguistic module

cooperation with the respective language technology experts is needed to develop a linguistic module for other languages

partners for EU project proposals for the next step

AIDE

research on machine learning and text mining

use that knowledge to turn the workstation into an intelligent system for Automatic Indexing of Documents with Eurovoc

establish a publicly accessible service for automatic indexing of the official documentation of the Republic of Croatia


http://textmining.zemris.fer.hr


Conclusion

CADIS is unique in Europe

Web info at:

HIDRA: www.hidra.hr/hidra/aide/aide.htm

ZEMRIS: textmining.zemris.fer.hr

for download contact: [email protected]
