XML RETRIEVAL Tarık Teksen Tutal 21.07.2011



XML RETRIEVAL

Tarık Teksen Tutal

21.07.2011


INFORMATION RETRIEVAL

XML (Extensible Markup Language)

XQuery

Text Centric vs Data Centric


BASIC XML CONCEPTS


XML

Ordered, Labeled Tree

XML Element

XML Attribute

XML DOM (Document Object Model): Standard for accessing and processing XML documents.


XML STRUCTURE

An Example:


XML DOM OBJECT

XML DOM Object of the Sample in the Previous Slide

Nodes in a Tree

Parse the Tree Top-Down
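
The sample document and its tree picture did not survive in this transcript. As a minimal sketch of a top-down walk, the snippet below parses a hypothetical stand-in document (the `play` markup is an assumption, not the slide's own example) with Python's `xml.etree.ElementTree`. ElementTree is not the W3C DOM API, but it exposes the same ordered, labeled tree of elements and attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the slide's lost example document.
SAMPLE = """
<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <title>Macbeth's castle</title>
      <verse>Will I with wine and wassail ...</verse>
    </scene>
  </act>
</play>
""".strip()


def walk(element, depth=0):
    """Yield (depth, tag, attributes, text) for each element, parent before children."""
    text = (element.text or "").strip()
    yield depth, element.tag, dict(element.attrib), text
    for child in element:  # children are visited in document order
        yield from walk(child, depth + 1)


root = ET.fromstring(SAMPLE)
nodes = list(walk(root))
for depth, tag, attrs, text in nodes:
    # Indentation mirrors the depth of the node in the tree.
    print("  " * depth + tag, attrs, text)
```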


XPATH

Standard for enumerating paths in an XML document collection

Query language for selecting nodes from an XML document

Defined by the World Wide Web Consortium (W3C)
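
A short sketch of path selection, assuming Python's `xml.etree.ElementTree`, which supports only a limited subset of XPath. The document below is a hypothetical example, not taken from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical document used only to illustrate path expressions.
DOC = ET.fromstring(
    "<play><title>Macbeth</title>"
    "<act><scene><title>Macbeth's castle</title></scene></act></play>"
)

# "title" selects only title elements that are direct children of the root.
direct_titles = [e.text for e in DOC.findall("title")]

# ".//title" selects title elements at any depth below the root.
all_titles = [e.text for e in DOC.findall(".//title")]

# ".//scene/title" selects title elements whose parent is a scene element.
scene_titles = [e.text for e in DOC.findall(".//scene/title")]

print(direct_titles, all_titles, scene_titles)
```

A full XPath engine offers many more axes and functions than this subset; the three expressions above are chosen because ElementTree is known to accept them.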


SCHEMA

Puts Constraints on the Structure of Allowable XML

Two Standards for Schemas:

XML DTD

XML Schema


CHALLENGES IN XML RETRIEVAL


STRUCTURED DOCUMENT RETRIEVAL PRINCIPLE

A system should always retrieve the most specific part of a document that answers the query.

In a «Cookbook» collection, if a user queries «Apple Pie», the system should return the relevant chapter, «Apple Pie», of the book «Apple Desserts», not the entire book.

In the same example, however, if the user queries «Apple», the book should be returned instead of a chapter.


INDEXING UNIT

Unstructured:

Files on PC, Pages on the Web, E-Mail Messages etc.

Structured:

Non-Overlapping Pseudodocuments

Top-Down

Bottom-Up

All


INDEXING UNIT

Non-Overlapping Pseudodocuments

Pseudodocuments May Not Be Coherent Units for the User


INDEXING UNIT

Top-Down

Start with one of the largest units (e.g. book in a book collection)

Postprocess search results to find, for each book, the subelement that is the best hit.

May fail to return the best element, since the relevance of a book is generally not a good predictor of the relevance of its subelements.


INDEXING UNIT

Bottom-Up

Search all leaves and select the relevant ones

Extend them to larger units in postprocessing

May fail to return the best element, since the relevance of a subelement is generally not a good predictor of the relevance of larger units.


INDEXING UNIT

Index All the Elements

Not Useful to Index Some Elements (e.g. ISBN)

Creates redundancy (Deeper Level Elements are Returned Several Times)


NESTED ELEMENTS

To Get Rid of Redundancy,

Discard All Small Elements

Discard All Element Types that Users do not Look at (Based on the Logs of a Working XML Retrieval System)

Discard All Element Types that Assessors Generally do not Judge to be Relevant (If Relevance Assessments are Available)

Only Keep Element Types that a System Designer or Librarian has Deemed to be Useful Search Results


NESTED ELEMENTS

Remove Nested Elements in a Postprocessing Step

Collapse Several Nested Elements in the Results List and then Highlight Results
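
One possible postprocessing step, sketched under assumptions: each hit is identified by a path of positional steps such as `("book[1]", "chapter[2]")`, and among hits that are nested in one another only the highest-scoring one is kept. The path encoding and the function names are hypothetical; the slides do not prescribe a particular procedure.

```python
def is_nested(a, b):
    """True if path a is an ancestor or a descendant of path b (or the same element)."""
    shorter, longer = sorted((a, b), key=len)
    return longer[:len(shorter)] == shorter


def remove_nested(results):
    """Keep only the best-scoring element out of any group of nested hits.

    results: list of (path, score) pairs, where path is a tuple of steps.
    Returns the kept pairs in descending score order.
    """
    kept = []
    for path, score in sorted(results, key=lambda r: r[1], reverse=True):
        # Skip this hit if it contains, or is contained in, a better hit.
        if not any(is_nested(path, other) for other, _ in kept):
            kept.append((path, score))
    return kept


hits = [
    (("book[1]",), 0.4),
    (("book[1]", "chapter[2]"), 0.9),
    (("book[1]", "chapter[2]", "section[1]"), 0.7),
    (("book[2]",), 0.5),
]
filtered = remove_nested(hits)
print(filtered)
```

Greedy selection by score is only one design choice; a system could equally prefer the smallest or the largest element in each nested group.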


VECTOR SPACE MODEL FOR XML RETRIEVAL


LEXICALIZED SUBTREES

Goal: each dimension of the vector space encodes a word together with its position within the XML tree

Map XML documents to lexicalized subtrees

Take each text node (leaf) and break it into multiple nodes, one for each word.

E.g. split Bill Gates into Bill and Gates

Define the dimensions of the vector space to be lexicalized subtrees of documents – subtrees that contain at least one vocabulary term


LEXICALIZED SUBTREES


LEXICALIZED SUBTREES

Queries and documents can be represented as vectors in this space of lexicalized subtrees

Matches can then be computed, for example, by using the Vector Space Formalism

Vector Space Formalism: Unstructured vs Structured Retrieval

Dimensions: Vocabulary Terms vs Lexicalized Subtrees


DIMENSIONS: TRADEOFF

Dimensionality of Space vs Accuracy of Results

Restrict Dimensions to Vocabulary Terms:

Standard Vector Space Retrieval System

Does Not Match the Structure of the Query

Separate Lexicalized Dimension for Each Subtree:

Dimensionality of the Space Becomes too Large


DIMENSIONS: COMPROMISE

Index All Paths that End with a Single Vocabulary Term (XML-Context Term Pairs)

Structural Term <c, t>: a pair of XML-context c and vocabulary term t
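
A minimal sketch of extracting structural terms, assuming that the context c is written as the "/"-joined path of element tags from the root and that terms are lowercased, whitespace-separated words. The function name and these conventions are illustrative assumptions. Text that follows a child element (ElementTree's `tail`) is ignored to keep the sketch short.

```python
import xml.etree.ElementTree as ET


def structural_terms(xml_text):
    """Return (context, term) pairs, one per word occurrence in element text."""
    root = ET.fromstring(xml_text)
    pairs = []

    def visit(element, path):
        path = path + [element.tag]
        # Split the text node into one term per word, e.g. "Bill Gates" -> bill, gates.
        for term in (element.text or "").split():
            pairs.append(("/".join(path), term.lower()))
        for child in element:
            visit(child, path)

    visit(root, [])
    return pairs


pairs = structural_terms(
    "<book><title>Microsoft</title><author>Bill Gates</author></book>"
)
print(pairs)
```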


CONTEXT RESEMBLANCE

Measures the similarity between a path in a query and a path in a document

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd if and only if we can transform cq into cd by inserting additional nodes


CONTEXT RESEMBLANCE

CR(cq4 , cd2) = 3/4 = 0.75

CR(cq4 , cd3) = 3/5 = 0.6
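
The CR formula itself is not preserved in this transcript. The two values above are consistent with the usual definition CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise. The sketch below assumes that definition and represents paths as lists of tags; the concrete paths are hypothetical, chosen only to have the lengths 2, 3 and 4 that reproduce 3/4 and 3/5.

```python
def matches(cq, cd):
    """True if cq can be turned into cd by inserting nodes (cq is a subsequence of cd)."""
    remaining = iter(cd)
    # "node in remaining" advances the iterator, so the order of nodes is respected.
    return all(node in remaining for node in cq)


def context_resemblance(cq, cd):
    """CR(cq, cd) under the assumed definition; 0.0 when the paths do not match."""
    if matches(cq, cd):
        return (1 + len(cq)) / (1 + len(cd))
    return 0.0


cq = ["book", "title"]                      # |cq| = 2
cd_a = ["book", "chapter", "title"]         # |cd| = 3
cd_b = ["book", "part", "chapter", "title"] # |cd| = 4

print(context_resemblance(cq, cd_a))  # 3/4
print(context_resemblance(cq, cd_b))  # 3/5
print(context_resemblance(["title", "book"], cd_a))  # no match
```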


DOCUMENT SIMILARITY MEASURE

Final Score for a Document

Variant of the Cosine Measure

Also called «SimNoMerge»

Not a True Cosine Measure Since Its Value can be Larger than 1.0


DOCUMENT SIMILARITY MEASURE

V is the vocabulary of non-structural terms

B is the set of all XML contexts

weight(q, t, c) and weight(d, t, c) are the weights of term t in XML context c in query q and document d, respectively

Standard weighting, e.g. idf_t x wf_t,d, where idf_t depends on which elements we use to compute df_t.
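
The formula that these definitions belong to is missing from the transcript. The form usually given for SimNoMerge in the textbook literature (for example in Manning, Raghavan and Schütze, Introduction to Information Retrieval, chapter 10) is shown here as a reconstruction, not as the slide's verified content. CR denotes the context resemblance function.

```latex
\mathrm{SimNoMerge}(q, d) =
  \sum_{c_k \in B} \sum_{c_l \in B} \mathrm{CR}(c_k, c_l)
  \sum_{t \in V} \mathrm{weight}(q, t, c_k)\,
  \frac{\mathrm{weight}(d, t, c_l)}
       {\sqrt{\sum_{c \in B,\, t \in V} \mathrm{weight}^2(d, t, c)}}
```

Because CR can pair one query context with several document contexts, the sum can exceed 1.0, which is why the measure is not a true cosine.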


SIMNOMERGE ALGORITHM

SCOREDOCUMENTSWITHSIMNOMERGE(q, B, V, N, normalizer)
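
The body of the algorithm is not preserved here. The sketch below follows the commonly described procedure: for every structural term of the query, visit every document context with non-zero context resemblance, walk its postings, accumulate CR x query weight x document weight, and normalize at the end. The data structures (dictionaries keyed by context and term, a `postings` argument in place of V and N) are assumptions made for a self-contained example, so the signature differs from the one on the slide.

```python
from collections import defaultdict


def context_resemblance(cq, cd):
    """(1 + |cq|) / (1 + |cd|) if cq is a subsequence of cd, else 0.0 (assumed definition)."""
    remaining = iter(cd)
    if all(node in remaining for node in cq):
        return (1 + len(cq)) / (1 + len(cd))
    return 0.0


def score_documents_with_simnomerge(query, contexts, postings, normalizer):
    """Score documents against a structured query.

    query:      {(cq, t): weight}, cq a tuple of tags, t a term
    contexts:   the set B of document contexts (tuples of tags)
    postings:   {(c, t): [(doc_id, weight), ...]}
    normalizer: {doc_id: normalization factor}
    """
    scores = defaultdict(float)
    for (cq, term), wq in query.items():
        for c in contexts:
            cr = context_resemblance(cq, c)
            if cr > 0:
                for doc_id, wd in postings.get((c, term), []):
                    scores[doc_id] += cr * wq * wd
    return {doc_id: s / normalizer[doc_id] for doc_id, s in scores.items()}


# Hypothetical toy index with two documents.
contexts = {("book", "title"), ("book", "chapter", "title")}
postings = {
    (("book", "title"), "gates"): [(1, 1.0)],
    (("book", "chapter", "title"), "gates"): [(2, 1.0)],
}
query = {(("title",), "gates"): 1.0}
scores = score_documents_with_simnomerge(query, contexts, postings, {1: 1.0, 2: 1.0})
print(scores)
```

In this toy run document 1 scores higher than document 2 because its context `book/title` is closer to the query context `title`.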


EVALUATION OF XML RETRIEVAL


INEX

Initiative for the Evaluation of XML Retrieval

Yearly standard benchmark evaluation that has produced test collections (documents, sets of queries, and relevance judgments)

Originally based on a collection of IEEE journal articles (since 2006 INEX has used the much larger English Wikipedia test collection)

The relevance of documents is judged by human assessors.


INEX TOPICS

Content Only (CO)

Regular Keyword Queries Like in Unstructured IR

Content and Structure (CAS)

Structural Constraints in Addition to Keywords

Relevance Assessments are More Complicated


INEX RELEVANCE ASSESSMENTS

INEX 2002 defined component coverage and topical relevance as orthogonal dimensions of relevance

Component Coverage: Evaluates Whether the Element Retrieved is «Structurally» Correct

Topical Relevance


INEX RELEVANCE ASSESSMENTS

Component Coverage:

Exact coverage (E): The information sought is the main topic of the component and the component is a meaningful unit of information

Too small (S): The information sought is the main topic of the component, but the component is not a meaningful (self-contained) unit of information

Too large (L): The information sought is present in the component, but is not the main topic

No coverage (N): The information sought is not a topic of the component

Topical Relevance: Highly Relevant (3), Fairly Relevant (2), Marginally Relevant (1) and Nonrelevant (0)


COMBINING THE RELEVANCE DIMENSIONS

Not all combinations are possible, e.g. 3N (a highly relevant component cannot have no coverage)

Quantization:
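
The quantization function shown on this slide is not preserved. The table below is the one commonly quoted from Manning, Raghavan and Schütze (Introduction to Information Retrieval, chapter 10); whether the slide used exactly these values is an assumption. The sketch maps each (topical relevance, component coverage) pair to a single grade and uses the summed grades in place of a binary relevant count. The helper names are hypothetical.

```python
# Assumed quantization table: keys are relevance digit + coverage letter.
QUANTIZATION = {
    "3E": 1.00,
    "2E": 0.75, "3L": 0.75,
    "1E": 0.50, "2L": 0.50, "2S": 0.50,
    "1S": 0.25, "1L": 0.25,
    "0N": 0.00,
}


def quantize(relevance, coverage):
    """Map a (relevance, coverage) assessment to one grade in [0, 1]."""
    return QUANTIZATION[f"{relevance}{coverage}"]


def graded_precision(assessments):
    """Precision with summed grades instead of a binary relevant/nonrelevant count.

    assessments: list of (relevance, coverage) pairs for the retrieved components.
    """
    if not assessments:
        return 0.0
    return sum(quantize(r, c) for r, c in assessments) / len(assessments)


retrieved = [(3, "E"), (2, "S"), (0, "N")]
print(graded_precision(retrieved))
```

Combinations that are missing from the table, such as 3N, raise a `KeyError`, which matches the point that not every pairing is a valid assessment.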


INEX EVALUATION MEASURES

Precision and Recall can be applied

Sum Grades vs Binary Relevance

Overlap is not accounted for: nested elements can appear in the same search result

Recent INEX focus: develop algorithms and evaluation measures that return non-redundant results lists and evaluate them properly.