Ranked Information Retrieval on XML Data

Seminar “Informationsorganisation und -suche mit XML”

Dr. Ralf Schenkel, summer semester 2003

Saarland University

July 8, 2003

Bernadette Blum, Christian Nicolaus, Markus Uhl


Outline

1. Introduction to Information Retrieval

2. Information Retrieval on XML Data

3. Approaches

3.1 ELIXIR
- The ELIXIR language
- The ELIXIR query processing algorithm
- Experiments, Conclusion

3.2 XRANK
- Data model
- Ranking function
- Data structures and algorithms
- Experiments

4. Conclusion

1. Introduction to Information Retrieval

• Definition:

– Information Retrieval (IR) is the technology for searching in collections (corpora, intranets, Web) of weakly structured documents: text, HTML, XML, ...

– search engines, digital libraries, similarity search on scientific data

• Vector space model (text analysis):

– based on word occurrence frequency

– documents and queries are vectors

– result ranking based on similarity metric in vector space

1. Introduction to Information Retrieval (II)

• Link analysis (structure analysis):

– weighting documents

– improve result ranking

PageRank approach (I):

– web as directed graph G

– “random walk” of a web surfer

• follow hyperlinks with probability (1-ε)

• “random jump” with probability ε

1. Introduction to Information Retrieval (III)

PageRank approach (II):

\[ r(q) = \frac{\varepsilon}{n} + (1-\varepsilon) \sum_{(p,q) \in G} \frac{r(p)}{\mathrm{outdegree}(p)} \]

– first term: probability of a “random jump” to q (n documents)

– second term: probability of following a hyperlink to q

[Figure: a document with three outgoing hyperlinks, each followed with probability (1-ε)/3, plus “random jump” edges]
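The random-walk model can be tried out with a few lines of Python. This is a minimal sketch, assuming the usual recurrence r(q) = ε/n + (1-ε) · Σ r(p)/outdegree(p) over all pages p linking to q. The function name `pagerank`, the value ε = 0.15, the treatment of pages without outgoing links and the toy graph `web` are illustrative assumptions, not taken from the slides.

```python
def pagerank(links, eps=0.15, iterations=50):
    """Power iteration for r(q) = eps/n + (1 - eps) * sum_{(p,q)} r(p) / outdegree(p).

    links maps every page to the list of pages it links to
    (every link target must also appear as a key).
    """
    nodes = sorted(links)
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        # Random jump: every page receives eps / n.
        new_rank = {q: eps / n for q in nodes}
        for p in nodes:
            targets = links[p]
            if targets:
                # Follow one of the outgoing hyperlinks with equal probability.
                for q in targets:
                    new_rank[q] += (1 - eps) * rank[p] / len(targets)
            else:
                # Assumption: a page without links spreads its rank evenly.
                for q in nodes:
                    new_rank[q] += (1 - eps) * rank[p] / n
        rank = new_rank
    return rank


web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # hypothetical three-page web
ranks = pagerank(web)
```

In this toy graph page c is linked from both a and b, so it ends up with a higher rank than b, which is only linked from a.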

2. Information Retrieval on XML Data

• XML: standard for exchange of structured data and documents

• existing query languages (e.g. XML-QL, Quilt, XQL, … XQuery)

– no ranked or weighted results based on textual similarity

– but extensions (XXL, XIRQL …)

2 Approaches

– ELIXIR: SQL-like approach

– XRANK: keyword-based approach

3.1 ELIXIR

• ELIXIR = “expressive and efficient language for XML information retrieval”

• extension to XML-QL: similarity operator “~”

• “~” computed by WHIRL

• returns best r answers

ELIXIR – The ELIXIR language

• Syntax:

– XML-QL syntax (SQL-like)

    CONSTRUCT <item>$b</>
    WHERE <items.book year=$yb>$b</> in "db.xml",
          <items.cd>$c</> in "db.xml",
          $yb > 1990,
          $b ~ $c.

– CONSTRUCT clause: output format

– WHERE clause: pattern statements + predicates

– $yb > 1990: boolean operators

– $b ~ $c: ELIXIR’s similarity operator

• similarity calculation even between 2 variables (→ expressiveness)

• no nested queries

ELIXIR – The ELIXIR language (II)

WHIRL (I):

• Word-based Heterogeneous Information Retrieval Logic

• extends DATALOG with “~”

• only relational data

• efficiently supports ranked IR

• Syntax (Horn clause):

    output($y, $a, $t) :- book($y, $a, $t), $y>1950, $t~$a.

– output($y, $a, $t): output relation

– book($y, $a, $t): input relation

– clause body: conjunction of relational predicates

– $y>1950: boolean operator

– $t~$a: similarity operator

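WHIRL (II):

• Similarity computation “~”:

– standard IR term vector techniques

– weighting terms (TF-IDF values)

– cosine measure:

\[ \mathrm{sim}(d,d') = \frac{d \cdot d'}{\lVert d \rVert \, \lVert d' \rVert} = \frac{\sum_{t \in V} d_t \, d'_t}{\lVert d \rVert \, \lVert d' \rVert} \]

(V: vocabulary of distinct terms; terms t ∈ V; documents d, d’ ∈ R^{|V|})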
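The term-vector similarity can be illustrated with a small Python sketch. It assumes whitespace tokenization and the weight log(1 + TF) · log(1 + N/DF); the exact weighting used by WHIRL may differ, and the names `tfidf_vectors` and `sim` are chosen here for illustration only. Vectors are scaled to unit length, so the cosine reduces to a dot product.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting: log(1 + TF) * log(1 + N / DF).
        vec = {t: math.log(1 + c) * math.log(1 + n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def sim(d, d_prime):
    """Cosine measure; for unit-length vectors this is the dot product."""
    return sum(w * d_prime.get(t, 0.0) for t, w in d.items())


titles = ["Traditional Ukrainian cookery", "Ukrainian folk music", "Shooting Elvis"]
vectors = tfidf_vectors(titles)
```

Here `sim(vectors[0], vectors[1])` is positive because both titles share the term "ukrainian", while `sim(vectors[0], vectors[2])` is 0 because the titles have no term in common.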

ELIXIR – The ELIXIR language (III)

ELIXIR – The ELIXIR query processing algorithm

Example (naïve approach):

XML-QL query Q2:

    CONSTRUCT <tuple><b>$b</><c>$c</></>
    WHERE <items.book>$b</> in "db.xml",
          <items.cd>$c</> in "db.xml"

Similarity computation for every tuple ($b, $c)

⇒ full cross product!

ELIXIR – The ELIXIR query processing algorithm (II)

Problem:

full cross product !

Solution:

• do not simply map the full XML data into the relational model

• invoke WHIRL as a “subroutine” (→ efficiency)

Avoid generating full cross product!

ELIXIR – The ELIXIR query processing algorithm (III)

2 pattern statements with variables that are compared with a similarity predicate ⇒ distinct Q2j queries

ELIXIR – The ELIXIR query processing algorithm (IV)

Start query Q1

3 Stages: intermediate queries Q2, Q3, Q4

1. Partition into a set, Q21 … Q2N, of XML-QL queries

– avoid generating full cross product

– ordinary predicates

2. WHIRL query Q3

– similarity predicates

– ordered table of the r best answers

3. XML-QL query Q4

– transformation of Q3’s output

– XML structure as specified by Q1

Example (Step I – Partition into Q2n queries):

XML-QL query Q21:

    CONSTRUCT <tuple><b>$b</></>
    WHERE <items.book>$b</> in "db.xml"

Output of Q21:

    <q21><tuple><b>Traditional Ukrainian cookery</></>
         <tuple><b>Being and nothingness</></>
         <tuple><b>Shooting Elvis</></></>

XML-QL query Q22:

    CONSTRUCT <tuple><c>$c</></>
    WHERE <items.cd>$c</> in "db.xml"

Output of Q22:

    <q22><tuple><c>Ukrainian folk music</></>
         <tuple><c>Being there</></>
         <tuple><c>Milk cow blues</></></>

Avoid generating full cross product!

ELIXIR – The ELIXIR query processing algorithm (V)

Example (Step II – WHIRL query Q3):

WHIRL query Q3:

    q3($b) :- q21($b), q22($c), $b ~ $c.

Input q21:

    <q21><tuple><b>Traditional Ukrainian cookery</></>
         <tuple><b>Being and nothingness</></>
         <tuple><b>Shooting Elvis</></></>

Input q22:

    <q22><tuple><c>Ukrainian folk music</></>
         <tuple><c>Being there</></>
         <tuple><c>Milk cow blues</></></>

Output q3:

    <q3><tuple><b>Traditional Ukrainian cookery</></>
        <tuple><b>Being and nothingness</></></>

ELIXIR – The ELIXIR query processing algorithm (VI)

Example (Step III – XML-QL query Q4):

Input q3:

    <q3><tuple><b>Traditional Ukrainian cookery</></>
        <tuple><b>Being and nothingness</></></>

XML-QL query Q4:

    CONSTRUCT <item>$b</>
    WHERE <q3.tuple><b>$b</></> in "q3.xml"

Final XML output:

    <results><item>Traditional Ukrainian cookery</>
             <item>Being and nothingness</></>
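The three stages can be mimicked end to end on the example data. The following Python sketch is an illustration under stated assumptions, not the ELIXIR implementation: the lists `books` and `cds` stand in for the outputs of Q21 and Q22, `similarity_join` is a hypothetical stand-in for the WHIRL query Q3 that simply scores every pair of the already filtered bindings (WHIRL itself finds the r best answers without doing so), and the TF-IDF weighting is assumed.

```python
import heapq
import math
from collections import Counter

# Stage 1: bindings produced by the partitioned XML-QL queries Q21 and Q22.
books = ["Traditional Ukrainian cookery", "Being and nothingness", "Shooting Elvis"]
cds = ["Ukrainian folk music", "Being there", "Milk cow blues"]


def unit_vectors(texts, corpus):
    """Unit-length TF-IDF vectors; document frequencies are taken over corpus."""
    n = len(corpus)
    df = Counter(t for doc in corpus for t in set(doc.lower().split()))
    vectors = []
    for text in texts:
        tf = Counter(text.lower().split())
        vec = {t: math.log(1 + c) * math.log(1 + n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def similarity_join(left, right, r):
    """Stage 2 stand-in: the r best (score, b, c) bindings for b ~ c."""
    corpus = left + right
    left_vecs = unit_vectors(left, corpus)
    right_vecs = unit_vectors(right, corpus)
    scored = (
        (sum(w * cv.get(t, 0.0) for t, w in bv.items()), b, c)
        for b, bv in zip(left, left_vecs)
        for c, cv in zip(right, right_vecs)
    )
    return heapq.nlargest(r, scored)


q3 = similarity_join(books, cds, r=2)

# Stage 3: wrap the surviving $b bindings in the structure requested by Q1.
q4 = "<results>" + "".join(f"<item>{b}</item>" for _, b, _ in q3) + "</results>"
```

With r = 2 the two surviving books are the ones sharing a term with a CD title ("Ukrainian" and "Being"), which matches the final output of the example.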

ELIXIR – The ELIXIR query processing algorithm (VII)

ELIXIR – Experiments, Conclusion

Experiments:

Total processing time …

– … depends on details of each query and input data

– … increases only marginally with the number of answers r

– … increases linearly with the number of similarity join predicates

– partitioning of the initial query (step 1) dominates (expensive parsing and traversal)

ELIXIR – Experiments, Conclusion (II)

Conclusion:

• ELIXIR extends XML-QL by supporting IR similarity features for ranking

• similarity joins even between 2 variables (expressiveness)

• Algorithm:

– rewrite the original ELIXIR query into a series of intermediate XML-QL and WHIRL queries

– no full cross product, only filtered tuples of variable bindings (efficiency)

• But …

– only non-nested queries

– strict three-stage approach may be suboptimal in some cases (partition)

XRANK: Ranked Keyword Search over XML Documents

Introduction

XRANK - Keyword Search over XML documents

• results: XML elements that contain all searched keywords

• ranking: at granularity of XML elements, based on hyperlink structure

• advantages:

– user does not have to learn a query language

– no knowledge about the structure of XML documents is needed

• generalized keyword search engine (both HTML and XML are possible)

Data Model

• G = (V, CE, HE) : collection of XML documents

• V : set of XML elements (tags and attributes)

• CE : set of containment edges

• HE : set of hyperlink edges

• (u,v) ∈ CE ⇔ v is a sub-element of u

• (u,v) ∈ HE ⇔ u contains a hyperlink to v

• contains(v,k) ⇔ v (in)directly contains the keyword k

Example: XML Graph

[Figure: example XML graph; legend: XML element, value]

Keyword Query Results (1)

How to define results of keyword search queries over XML documents?

• elements that contain all keywords, where no sub-element contains all keywords

• elements with at least one sub-element containing all keywords and at least one sub-element containing some keywords

Ranking Elements

How to rank XML elements?

• extension of PageRank at the granularity of elements

• objective importance of XML elements, based on the hyperlinked and nested structure of XML elements

⇒ ElemRank

ElemRank (1)

n : # XML elements

nc(u) : # sub-elements of u

nh(u) : # outgoing hyperlinks from u

CE^-1 = { (v,u) | (u,v) ∈ CE } “reverse containment edges”

E = HE ∪ CE ∪ CE^-1

[Figure: element u with nc(u) = 3 and nh(u) = 3; legend: containment edge, reverse containment edge, hyperlink edge]

α : prob. for following a hyperlink

β : prob. for using a containment edge

γ : prob. for using a reverse containment edge

1-α-β-γ : prob. for a random jump

(0 ≤ α, β, γ ≤ 1)

[Figure: example graph with containment, reverse containment and hyperlink edges, labelled with their transition probabilities (random-jump share ε/10)]

\[ e(v) = \frac{1-\alpha-\beta-\gamma}{n} + \alpha \sum_{(u,v) \in HE} \frac{e(u)}{n_h(u)} + \beta \sum_{(u,v) \in CE} \frac{e(u)}{n_c(u)} + \gamma \sum_{(u,v) \in CE^{-1}} e(u) \]

– first term: random navigation

– second term: via hyperlinks

– third term: via forward containment

– fourth term: via reverse containment
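ElemRank can be computed iteratively, like PageRank. The sketch below assumes the recurrence e(v) = (1-α-β-γ)/n + α · Σ e(u)/nh(u) over hyperlink edges + β · Σ e(u)/nc(u) over containment edges + γ · Σ e(u) over reverse containment edges. The symbol names α, β, γ, their default values, the function name `elem_rank` and the four-element example graph are illustrative assumptions.

```python
def elem_rank(elements, ce, he, alpha=0.35, beta=0.25, gamma=0.25, iterations=100):
    """Iterate the ElemRank recurrence on the graph G = (V, CE, HE).

    ce: containment edges (parent, child); he: hyperlink edges (source, target).
    """
    n = len(elements)
    nc = {u: sum(1 for parent, _ in ce if parent == u) for u in elements}
    nh = {u: sum(1 for source, _ in he if source == u) for u in elements}
    e = {v: 1.0 / n for v in elements}
    for _ in range(iterations):
        # Random jump to any of the n elements.
        new_e = {v: (1 - alpha - beta - gamma) / n for v in elements}
        for u, v in he:
            new_e[v] += alpha * e[u] / nh[u]   # via hyperlinks
        for parent, child in ce:
            new_e[child] += beta * e[parent] / nc[parent]  # forward containment
            new_e[parent] += gamma * e[child]              # reverse containment
        e = new_e
    return e


# Hypothetical graph: a root with two CDs, one title, and one hyperlink.
elements = ["cds", "cd1", "cd2", "title1"]
ce = [("cds", "cd1"), ("cds", "cd2"), ("cd1", "title1")]
he = [("cd2", "cd1")]
ranks = elem_rank(elements, ce, he)
```

In this graph cd1 receives rank from its parent, from its sub-element and through the hyperlink from cd2, so it is ranked above cd2.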

Ranking Function (1)

ranking functions should take into account:

– result specificity

– hyperlinks

– keyword proximity

contains(v,k) ⇒ ∃ sequence of containment edges (v1,v2), ..., (vn-1,vn) with v1 = v s.t. vn directly contains k

\[ r(v,k) = \mathrm{ElemRank}(v_n) \cdot \mathrm{decay}^{\,n-1} \qquad (0 \le \mathrm{decay} \le 1) \]

– ElemRank(vn): based on hyperlinked structure

– decay^(n-1): result specificity

Ranking Function (2)

• m occurrences of keyword k ⇒ computation of r1, ..., rm

\[ r^*(v,k) = f(r_1, \ldots, r_m) \]

(with accumulation function f, e.g. max or sum)

• query q consists of keywords k1, ..., kn

\[ R(v,q) = \Big( \sum_{1 \le i \le n} r^*(v,k_i) \Big) \cdot p(v,k_1,\ldots,k_n) \]

p = keyword proximity measure
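The ranking function can be written down almost literally in Python. This is a minimal sketch, assuming r(v,k) = ElemRank(vn) · decay^(n-1) per occurrence, an accumulation function f (max or sum) per keyword, and an overall rank that sums the per-keyword ranks and multiplies by the proximity measure p. The decay value 0.75, the proximity value 0.5 and the ElemRank numbers in the example are made up for illustration.

```python
def keyword_rank(occurrences, decay=0.75, f=max):
    """r*(v,k) = f(r_1, ..., r_m) with r_i = ElemRank(v_n) * decay**(n - 1).

    occurrences: list of (elem_rank, path_len) pairs, one per occurrence of k
    below v, where path_len is the number n of elements on the containment
    path from v down to the element that directly contains k.
    """
    return f(rank * decay ** (path_len - 1) for rank, path_len in occurrences)


def overall_rank(per_keyword, proximity, decay=0.75, f=max):
    """R(v,q) = (sum over keywords k_i of r*(v,k_i)) * p(v,k_1,...,k_n)."""
    return sum(keyword_rank(occ, decay, f) for occ in per_keyword) * proximity


# Hypothetical element v: one keyword occurs in a child (n = 2, ElemRank 80),
# the other in a grandchild (n = 3, ElemRank 88); assumed proximity p = 0.5.
score = overall_rank([[(80, 2)], [(88, 3)]], proximity=0.5)
```

The deeper an occurrence sits below v, the more its ElemRank is damped, which is how the decay factor rewards specific results.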

    ...
        <song> <title> Radio Song </title> <time> 4:12 </time> </song>
        <song> <title> Losing My Religion </title> <time> 4:26 </time> </song>
        ...
      </CD>
      <CD id="2"> <title> R.E.M. – Automatic For... </title> ... </CD>
      ...
    </CDs>

XRANK Architecture

[Figure: XML documents → ElemRank computation → XML elements with ElemRanks → index structures & algorithms; the Query Evaluator receives a keyword search query, performs data access on the index structures and returns a ranked result list]

Naïve Approach

• naïve inverted list: contains all XML elements that contain the keyword

    key1  elem11 elem12 ...
    key2  elem21 elem22 ...

• problems: space overhead, spurious results, inaccurate ranking

Dewey IDs

    0 <CDs>
      0.0 <CD>
        0.0.0 <title> R.E.M. – Out Of Time
        0.0.1 <song>
          0.0.1.0 <title> Radio Song
          0.0.1.1 <time> 4:12
        0.0.2 <song>
          0.0.2.0 <title> Losing My Religion
          0.0.2.1 <time> 4:26
        ...
      0.1 <CD>
        0.1.0 <title> R.E.M. – Automatic For The People
        ...
      ...
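Dewey IDs can be assigned with a simple recursive traversal: the ID of an element is the ID of its parent followed by the element's position among its siblings. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `dewey_ids` and the abbreviated document are assumptions for illustration, and only elements (not attributes) are numbered.

```python
import xml.etree.ElementTree as ET


def dewey_ids(root, prefix="0"):
    """Yield (Dewey ID, element) pairs in document order."""
    yield prefix, root
    for position, child in enumerate(root):
        # Child ID = parent ID + "." + position among the siblings.
        yield from dewey_ids(child, f"{prefix}.{position}")


doc = ET.fromstring(
    "<CDs>"
    "<CD><title>R.E.M. - Out Of Time</title>"
    "<song><title>Radio Song</title><time>4:12</time></song>"
    "<song><title>Losing My Religion</title><time>4:26</time></song></CD>"
    "<CD><title>R.E.M. - Automatic For The People</title></CD>"
    "</CDs>"
)
ids = dict(dewey_ids(doc))
```

For this document `ids["0.0.2.0"]` is the title element of the second song, and every ancestor of an element can be read off as a prefix of its ID.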

DIL – Data Structure

• Dewey inverted list:

– contains the Dewey IDs of all XML elements that directly contain the keyword

– sorted by Dewey ID (ascending)

    keyword   | Dewey ID | ElemRank | position list
    R.E.M.    | ...      | ...      | ...
    Religion  | 0.0.2.0  | 88       | [2]

DIL – Query Processing (1)

• key idea: computation of longest common prefix (lcp) of Dewey IDs
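The longest common prefix of two Dewey IDs identifies the deepest element that contains both of them. A minimal Python sketch, with the helper names `lcp` and `lcp_all` chosen here for illustration; it covers only the prefix computation, not the full DIL merge over the sorted inverted lists.

```python
from functools import reduce


def lcp(a, b):
    """Longest common prefix of two Dewey IDs, compared component by component."""
    common = []
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        common.append(x)
    return ".".join(common)


def lcp_all(dewey_ids):
    """Deepest element that contains all of the given elements."""
    return reduce(lcp, dewey_ids)


# One occurrence of each keyword: a CD title (0.0.0) and a song title (0.0.2.0).
ancestor = lcp("0.0.0", "0.0.2.0")
```

Here `ancestor` is "0.0", the first CD element, which is the most specific element containing both keyword occurrences. Because the components are compared as whole numbers rather than characters, IDs such as "0.1" and "0.10" are not confused.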

RDIL – Data Structure

• ranked Dewey inverted list:

– each Dewey ID in the list has a position in the B+-tree

– B+-tree sorted by Dewey ID (ascending)

– inverted list sorted by ElemRank (descending)

[Figure: inverted list for “R.E.M.” with (Dewey ID, ElemRank) entries, e.g. ElemRank 80, and a B+-tree on the Dewey IDs (0.0.0, 0.1.0, …)]

RDIL – Query Processing (1)

[Figure, in several steps: inverted lists for key1, key2 and key3 with entries entry11 … entry33, each with a B+-tree on Dewey ID. The lcp with Dewey ID11, then with Dewey ID21, is computed and the results are placed in the result heap.]

threshold Ω = ∑ of the rankings at the current positions of the inverted lists

max. reachable ranking ≤ Ω

RDIL algorithm stops if

threshold Ω < lowest ElemRank in result heap

because

max. reachable ranking ≤ Ω < lowest ElemRank in result heap

⇒ max. reachable ranking < lowest ElemRank in result heap
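The stop condition follows the pattern of threshold-style top-k algorithms. The sketch below is a deliberately simplified illustration of that pattern, not the RDIL algorithm itself: it assumes every list holds (element, score) pairs sorted by score descending, uses a dictionary lookup in place of the B+-tree probes, scores a result by the plain sum of its per-keyword scores, and leaves out the lcp computation, decay and proximity. The function name `top_k` and the example lists are hypothetical.

```python
import heapq


def top_k(lists, k):
    """Return the k best elements occurring in all lists, plus the rounds needed."""
    lookup = [dict(lst) for lst in lists]   # random access, stand-in for the B+-trees
    heap, seen, rounds = [], set(), 0       # min-heap of (total score, element)
    for depth in range(min(len(lst) for lst in lists)):
        rounds += 1
        for lst in lists:                   # read the lists in turn
            elem, _ = lst[depth]
            if elem in seen:
                continue
            seen.add(elem)
            if all(elem in table for table in lookup):   # contains all keywords
                total = sum(table[elem] for table in lookup)
                heapq.heappush(heap, (total, elem))
                if len(heap) > k:
                    heapq.heappop(heap)     # drop the currently lowest result
        # No unseen element can score higher than the sum of the current entries.
        threshold = sum(lst[depth][1] for lst in lists)
        if len(heap) == k and threshold < heap[0][0]:
            break                           # max. reachable ranking < lowest in heap
    return sorted(heap, reverse=True), rounds


lists = [
    [("e1", 90), ("e2", 60), ("e3", 20), ("e4", 10)],
    [("e1", 80), ("e3", 50), ("e2", 30), ("e4", 5)],
]
best, rounds = top_k(lists, k=1)
```

In this example the scan stops after two of the four rounds, because the threshold has dropped to 110 while the result heap already holds an element with score 170.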

XRANK Architecture

[Figure: XML documents → ElemRank computation → XML elements with ElemRanks → DIL / RDIL; the Query Evaluator receives a keyword search query, performs data access on DIL / RDIL and returns a ranked result list]

Experimental Results (1)

[Figure: high keyword correlation; x-axis: number of keywords (1 to 5)]

Experimental Results (2)

[Figure: low keyword correlation; x-axis: number of keywords (1 to 5)]

Comparison DIL - RDIL

DIL:

• inverted lists sorted by Dewey ID

• computes longest common prefix on Dewey IDs

• extracts the minimum of all remaining Dewey IDs

• all lists are completely scanned

• outperforms RDIL if keyword correlation is low

RDIL:

• inverted lists sorted by ElemRank

• chooses next list sequentially

• stops if a certain threshold is reached

• outperforms DIL if keyword correlation is high

Conclusion

2 Approaches

ELIXIR:

– SQL-like, structure-based search

– extends XML-QL by supporting IR similarity features for ranking

– ranked results based only on textual similarity (even between 2 variables)

XRANK:

– keyword-based search à la Google

– ranked results based on textual similarity

– hierarchical and hyperlinked structure
