a domain based approach to information retrieval in digital libraries - rotella, ferilli, leuzzi
Università degli Studi di Bari Aldo Moro
Dipartimento di Informatica
A Domain Based Approach to Information Retrieval in Digital Libraries
F. Rotella, S. Ferilli, F. Leuzzi
[email protected], {fabio.leuzzi, rotella.fulvio}@gmail.com
8th Italian Research Conference on Digital Libraries
Bari, Italy, February 9-10, 2012
L.A.C.A.M. http://lacam.di.uniba.it:8000
Overview
Introduction & Objectives
Keyword Extraction
Word Sense Disambiguation
Synset Clustering
A Multistrategy Similarity Measure
Document Partitioning
User Query Processing
A Preliminary Evaluation
Conclusions & Future Works
A Domain Based Approach to Information Retrieval in Digital Libraries - F. Rotella, S. Ferilli, F. Leuzzi
Introduction
Some repositories leave the responsibility for quality to the authors.
+ Anybody can produce and distribute documents.
= Possibly low average quality of the repository contents.
Users are often overwhelmed by documents that only apparently satisfy their information needs.
Introduction
Possible way out: Information Retrieval systems
Numerical/statistical manipulation of (key)words has been widely explored in the literature
Still unable to fully solve the problem
Achieving better retrieval performance requires going beyond a simple lexical interpretation of user queries
Pass through an understanding of their semantic content and aims
Ontological taxonomy
WordNet
WordNet Domains
Objectives
Improving the fruition (usability) of a Digital Library (DL)
Use of advanced techniques for document retrieval
Try to overcome the ambiguity of natural language
Inspired by the typical behavior of humans:
take into account the possible meanings of words
select the most appropriate one according to the context of the discourse
Keyword Extraction
Each document in the digital library is progressively split into paragraphs, sentences, and single words
Integrated in the DOMINUS framework
The syntactic structure of the sentences and the lemmas are obtained
Via the integrated Stanford Parser
Classical VSM TF*IDF weighting
Two filters:
Only nouns considered
The representation of adverbs, verbs and adjectives in WordNet is different
Only the top 10% keywords for each document
To be noise-tolerant
To limit the possibility of including non-discriminative and very general words in the representation of a document
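The weighting and filtering described above could be sketched as follows. This is only an illustrative sketch: the function name `top_keywords`, the length-normalized TF, the natural-log IDF and the toy documents are assumptions, since the slides only state that classical TF*IDF is applied to nouns and that the top 10% of keywords are kept.

```python
import math
from collections import Counter

def top_keywords(docs, fraction=0.10):
    """docs maps a document id to its list of noun lemmas (assumed already
    extracted). Returns, per document, the top `fraction` of lemmas by TF*IDF."""
    n_docs = len(docs)
    # Document frequency: in how many documents each lemma occurs.
    df = Counter()
    for lemmas in docs.values():
        df.update(set(lemmas))
    result = {}
    for doc_id, lemmas in docs.items():
        tf = Counter(lemmas)
        # Assumed variant: relative term frequency times log inverse document frequency.
        weights = {
            lemma: (count / len(lemmas)) * math.log(n_docs / df[lemma])
            for lemma, count in tf.items()
        }
        # Sort by descending weight; ties are broken alphabetically.
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        keep = max(1, math.ceil(len(ranked) * fraction))
        result[doc_id] = ranked[:keep]
    return result

# Toy corpus (hypothetical): only nouns are present, as required by the first filter.
docs = {
    "d1": ["music", "opera", "music", "melody", "history",
           "music", "opera", "genre", "style", "form"],
    "d2": ["history", "politics", "parliament"],
    "d3": ["history", "religion", "doctrine"],
}
keywords = top_keywords(docs)
```

Here "history" occurs in every document, so its IDF is zero and it can never be selected, which is the kind of very general word the second filter is meant to exclude.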
Word Sense Disambiguation
Domain Driven
One Domain per Discourse assumption: many uses of a word in a coherent portion of text tend to share the same domain.
Prevalent domain identification:
Extraction of all synsets for each term
Extraction of all domains for each synset
Choice of the synset belonging to the prevalent domain
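A minimal sketch of these three steps, under stated assumptions: the dictionaries `synsets_of` and `domains_of` are hypothetical stand-ins for WordNet and WordNet Domains lookups, the sense labels are invented, each (synset, domain) pair counts as one vote, and a word with no sense in the prevalent domain falls back to its first listed sense. None of these details is given in the slides.

```python
from collections import Counter

def prevalent_domain(words, synsets_of, domains_of):
    """Count the domains of all synsets of all words and return the most frequent one."""
    counts = Counter()
    for word in words:
        for synset in synsets_of.get(word, []):
            counts.update(domains_of.get(synset, []))
    if not counts:
        return None
    # Highest count wins; ties are broken alphabetically (an assumption).
    return min(counts, key=lambda d: (-counts[d], d))

def disambiguate(words, synsets_of, domains_of):
    """Pick, for each word, a synset that belongs to the prevalent domain."""
    domain = prevalent_domain(words, synsets_of, domains_of)
    chosen = {}
    for word in words:
        candidates = synsets_of.get(word, [])
        in_domain = [s for s in candidates if domain in domains_of.get(s, [])]
        if in_domain:
            chosen[word] = in_domain[0]
        elif candidates:
            # Assumed fallback: keep the first listed sense.
            chosen[word] = candidates[0]
    return domain, chosen

# Hypothetical toy lexicon.
synsets_of = {
    "mass": ["mass#physics", "mass#music"],
    "opera": ["opera#music"],
    "melody": ["melody#music"],
}
domains_of = {
    "mass#physics": ["physics"],
    "mass#music": ["music", "religion"],
    "opera#music": ["music"],
    "melody#music": ["music"],
}
domain, senses = disambiguate(["mass", "opera", "melody"], synsets_of, domains_of)
```

In this toy example the ambiguous word "mass" is resolved to its musical sense because the surrounding keywords make "music" the prevalent domain.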
Synset Clustering
Pairwise complete link agglomerative strategy
Each synset generates a singleton cluster
For each pair of clusters: if the complete link property holds, merge the involved clusters
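The clustering loop can be sketched as below. The slides do not define the complete link property precisely, so this sketch assumes it means that every pair of synsets across the two clusters has similarity at or above a threshold; the threshold value, the similarity table and the synset labels are all hypothetical.

```python
from itertools import combinations

def complete_link(c1, c2, sim, threshold):
    """Assumed property: every cross-cluster pair is at least `threshold` similar."""
    return all(sim(a, b) >= threshold for a in c1 for b in c2)

def cluster_synsets(synsets, sim, threshold):
    # Each synset starts as a singleton cluster.
    clusters = [frozenset([s]) for s in synsets]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            if complete_link(clusters[i], clusters[j], sim, threshold):
                # Merge the two clusters and rescan from the start.
                clusters[i] = clusters[i] | clusters[j]
                del clusters[j]
                merged = True
                break
    return clusters

# Hypothetical symmetric similarity table; unlisted pairs default to 0.1.
TABLE = {
    ("opera", "mass"): 0.8,
    ("opera", "rock"): 0.7,
    ("mass", "rock"): 0.3,
    ("belief", "doctrine"): 0.9,
}

def toy_sim(a, b):
    return TABLE.get((a, b), TABLE.get((b, a), 0.1))

clusters = cluster_synsets(["opera", "mass", "rock", "belief", "doctrine"], toy_sim, 0.5)
```

With this toy table "rock" stays in its own cluster: it is close to "opera" but not to "mass", so the complete link condition fails for the merged cluster.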
A Multistrategy Similarity Measure
Cooperating Techniques for Extracting Conceptual Taxonomies from Text - S. Ferilli, F. Leuzzi, F. Rotella
3 components are summed and normalized, in ]0,1[:
depth (ancestors)
breadth (direct neighbors)
breadth (inverse neighbors)
WordNet relationships are considered
A Multistrategy Similarity Measure
Considered Relationships
member meronymy: the latter synset is a member meronym of the former;
substance meronymy: the latter synset is a substance meronym of the former;
part meronymy: the latter synset is a part meronym of the former;
similarity: the latter synset is similar in meaning to the former;
antonym: specifies antonymous word;
attribute: defines the attribute relation between noun and adjective synset pairs in which the adjective is a value of the noun;
additional information: additional information about the first word can be obtained by seeing the second word;
part of speech based: specifies two different relations based on the parts of speech involved;
participle: the adjective first word is a participle of the verb second word;
hypernymy: the latter synset is a hypernym of the former.
Document Partitioning
SynsetWord structure:
Original word
TF*IDF weight
Synset
The Pairwise Clustering step returned a set of synset clusters
For each document in the collection
Each of its SynsetWords votes with its TF*IDF weight
The first three clusters are chosen from the ranked list
They represent the intensional description of the document
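The voting step might look like the sketch below. It assumes that a SynsetWord votes for every cluster that contains its synset, which the slides imply but do not state; the `SynsetWord` tuple mirrors the three fields listed above, while the function name, synset labels and weights are hypothetical.

```python
from collections import defaultdict, namedtuple

# Mirrors the SynsetWord structure: original word, TF*IDF weight, synset.
SynsetWord = namedtuple("SynsetWord", ["word", "weight", "synset"])

def describe_document(synset_words, clusters, top_k=3):
    """Return the indexes of the `top_k` clusters with the highest total vote."""
    votes = defaultdict(float)
    for sw in synset_words:
        for idx, cluster in enumerate(clusters):
            # Assumption: a SynsetWord votes for clusters containing its synset.
            if sw.synset in cluster:
                votes[idx] += sw.weight
    ranked = sorted(votes, key=lambda i: (-votes[i], i))
    return ranked[:top_k]

# Hypothetical clusters and document.
clusters = [{"s1", "s2"}, {"s3"}, {"s4"}, {"s5"}]
document = [
    SynsetWord("opera", 0.5, "s1"),
    SynsetWord("mass", 0.2, "s2"),
    SynsetWord("doctrine", 0.9, "s3"),
    SynsetWord("market", 0.1, "s4"),
    SynsetWord("atom", 0.05, "s5"),
]
description = describe_document(document, clusters)
```

Cluster 1 ranks first (0.9), cluster 0 second (0.5 + 0.2) and cluster 2 third, so these three form the intensional description of the toy document.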
User Query Elaboration
Overview
Same grammatical preprocessing as in the previous phase
Query usually very short
No keyword extraction: all nouns retained for the next operations
Domain-driven WSD is unreliable
For each word, all corresponding synsets in WordNet are kept
A single lexical query yields many semantic queries
All possible combinations of synsets
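How one lexical query expands into many semantic queries can be illustrated with a Cartesian product. This is a sketch only: `synsets_of` is a hypothetical lookup standing in for WordNet, and the sense labels are invented.

```python
from itertools import product

def semantic_queries(nouns, synsets_of):
    """Return every combination that takes exactly one synset per query noun."""
    candidate_lists = [synsets_of[n] for n in nouns if synsets_of.get(n)]
    if not candidate_lists:
        return []
    return list(product(*candidate_lists))

# Hypothetical senses for the nouns of the query "Ornaments and melodies".
synsets_of = {
    "ornament": ["ornament#1", "ornament#2"],
    "melody": ["melody#1", "melody#2"],
}
queries = semantic_queries(["ornament", "melody"], synsets_of)
```

The number of semantic queries is the product of the number of senses of each noun (here 2 x 2 = 4), which stays manageable only because queries are usually very short.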
User Query Elaboration
A Brute Force WSD
For each combination:
a similarity is evaluated against each cluster that has at least one associated document
using the same similarity function as for clustering
Twofold objective:
finding the combination of synsets that represents the best word sense disambiguation
obtaining the most similar cluster to the involved words
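The brute-force search described above can be sketched as a double loop over combinations and clusters. The `jaccard` function is a deliberately simple stand-in, not the multistrategy similarity measure used in the approach; the cluster contents, document ids and sense labels are hypothetical.

```python
def jaccard(combo, cluster_synsets):
    """Stand-in similarity (an assumption): overlap between the two synset sets."""
    a, b = set(combo), set(cluster_synsets)
    return len(a & b) / len(a | b)

def best_combination(combos, clusters, sim):
    """clusters is a list of (synsets, documents) pairs.
    Returns (score, combination, cluster index) for the best match, or None."""
    best = None
    for combo in combos:
        for idx, (synsets, documents) in enumerate(clusters):
            if not documents:
                # Only clusters with at least one associated document are considered.
                continue
            score = sim(combo, synsets)
            if best is None or score > best[0]:
                best = (score, combo, idx)
    return best

combos = [
    ("reincarnation#1", "life#1"),
    ("reincarnation#1", "life#2"),
]
clusters = [
    ({"reincarnation#1", "life#1", "belief#1"}, ["doc7"]),
    ({"life#2", "organism#1"}, []),
    ({"opera#1"}, ["doc2"]),
]
score, combo, cluster_index = best_combination(combos, clusters, jaccard)
```

The single maximum serves both objectives at once: the winning combination is the chosen disambiguation, and the cluster it was scored against is the most similar one.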
User Query Elaboration
Query Results
The best combination is used to obtain the list of clusters ranked by descending relevance, which can be used as the answer to the user's search.
The results are then displayed to the user: the first n sets of documents are shown, where n is the minimum value that yields at least 10 results.
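The choice of n can be sketched as a simple accumulation over the ranked list. One assumption is made explicit here: a document associated with more than one cluster is counted once. The function name and document ids are hypothetical.

```python
def select_results(ranked_document_sets, min_results=10):
    """Take document sets in rank order until at least `min_results`
    distinct documents have been collected."""
    shown = []
    seen = set()
    for documents in ranked_document_sets:
        shown.append(documents)
        # Assumption: a document shared by several clusters counts once.
        seen.update(documents)
        if len(seen) >= min_results:
            break
    return shown

# Hypothetical document sets, already ranked by descending cluster relevance.
ranked = [
    ["d1", "d2", "d3", "d4"],
    ["d5", "d6", "d7", "d8", "d9"],
    ["d9", "d10", "d11"],
    ["d12"],
]
shown = select_results(ranked)
```

In this toy ranking the first two sets give only 9 distinct documents, so a third set is added and n is 3.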
A Preliminary Evaluation
The Quality of Clusters
86 documents, 4 topics: 27 general science and physics; 21 music; 15 politics; 23 religion.
Query: Reincarnation and eternal life
Best combination:
synset: 106191212; lemmas: reincarnation; gloss: the Hindu or Buddhist doctrine that a person may be reborn successively into one of five classes of living beings (god or human or animal or hungry ghost or denizen of Hell) depending on the person's own actions;
synset: 100006269; lemmas: life; gloss: living things collectively.
Most similar cluster:
synset: 106191212; lemmas: reincarnation; gloss: the Hindu or Buddhist doctrine that a person may be reborn successively into one of five classes of living beings (god or human or animal or hungry ghost or denizen of Hell) depending on the person's own actions;
synset: 105943300; lemmas: doctrine, philosophical system, philosophy and school of thought; gloss: a belief (or system of beliefs) accepted as authoritative by some group or school;
synset: 105941423; lemmas: belief; gloss: any cognitive content held as true.
Query: Ornaments and melodies
Best combination:
synset: 103169390; lemmas: decoration, ornament and ornamentation; gloss: something used to beautify;
synset: 107028373; lemmas: air, line, melodic line, melodic phrase, melody, strain and tune; gloss: a succession of notes forming a distinctive sequence.
Most similar cluster:
synset: 107025900; lemmas: classical, classical music and serious music; gloss: traditional genre of music conforming to an established form and appealing to critical interest and developed musical taste;
synset: 107033753; lemmas: mass; gloss: a musical setting for a Mass;
synset: 107026352; lemmas: opera; gloss: a drama set to music, consists of singing with orchestral accompaniment and an orchestral overture and interludes;
synset: 107071942; lemmas: genre, music genre, musical genre and musical style; gloss: an expressive style of music;
synset: 107064715; lemmas: rock, rock 'n' roll, rock and roll, rock music, rock'n'roll and rock-and-roll; gloss: a genre of popular music originating in the 1950s, a blend of black rhythm-and-blues with white country-and-western.
A Preliminary Evaluation
The Quality of Clusters
# | Query | Outcomes | Precision | Recall
1 | Ornaments and melodies | [1 to 9] music, [10 to 11] religion | 0.82 (1.0) | 0.43 (9/21)
2 | Reincarnation and eternal life | [1 to 9] religion, [10] science | 0.9 (1.0) | 0.39 (9/23)
3 | Traditions and folks | [1 to 4] music, [5 to 6] religion, [7 to 10] music | 0.8 (1.0) | 0.38 (8/21)
4 | Limits of theory of relativity | [1 to 2] science, [3] politics, [4 to 5] religion, [6 to 15] science | 0.8 | 0.44 (12/27)
5 | Capitalism vs communism | [1 to 3] politics, [4] science, [5 to 6] religion, [7 to 11] politics, [12] science, [13] music | 0.61 (0.77) | 0.53 (8/15)
6 | Markets and new economy | [1] politics, [2] music, [3] science, [4 to 8] politics, [9 to 10] religion | 0.6 (0.7) | 0.4 (6/15)
7 | Relationship between democracy and parliament | [1 to 3] politics, [4] science, [5 to 6] politics, [7 to 10] religion | 0.5 (0.6) | 0.33 (5/15)
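The first precision value and the recall value in each row of the table above can be recomputed from the Outcomes column and the topic sizes of the collection. The sketch below does this for query 1, assuming that the relevant topic is the one the query is about ("music", 21 documents); the function name is hypothetical, and the parenthesized precision values are not reproduced because the slides do not say how they were obtained.

```python
def precision_recall(outcomes, target_topic, relevant_total):
    """outcomes is a list of (first_rank, last_rank, topic) ranges.
    Returns (precision, recall) with respect to `target_topic`."""
    retrieved = sum(last - first + 1 for first, last, _ in outcomes)
    hits = sum(last - first + 1
               for first, last, topic in outcomes if topic == target_topic)
    return hits / retrieved, hits / relevant_total

# Query 1, "Ornaments and melodies": ranks 1 to 9 music, 10 to 11 religion.
outcomes = [(1, 9, "music"), (10, 11, "religion")]
precision, recall = precision_recall(outcomes, "music", relevant_total=21)
print(round(precision, 2), round(recall, 2))  # 0.82 0.43
```

The result matches the first row: 9 of the 11 retrieved documents are about music, and 9 of the 21 music documents in the collection were retrieved.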
A Preliminary Evaluation
Synthesis of Outcomes
Conclusions
Proposed an approach to extract information from digital libraries
Go beyond simple lexical matching, toward the semantic content underlying queries
The approach consists of:
An off-line preprocessing of the entire corpus
Find sets of synsets as intensional descriptions of the documents
An on-line phase on the queries
Find the most suitable sense by evaluating all possible combinations of synsets against each intensional description of the documents
In order to propose the most relevant ones as results
Preliminary experiments show that this approach can be viable.
Future Works
Replacing the ODD (One Domain per Discourse) assumption with a more elaborate strategy for WSD
Avoiding the pre-processing step
To handle cases in which new documents are progressively added to the collection
Including adverbs, verbs and adjectives
To improve the quality of the semantic representatives of the documents
Exploring other approaches for choosing better intensional descriptions of each document