Ranked Information Retrieval on XML Data
Seminar “Informationsorganisation und -suche mit XML” (Information Organization and Search with XML)
Dr. Ralf Schenkel, summer semester 2003
Saarland University
July 8, 2003
Bernadette Blum, Christian Nicolaus, Markus Uhl
Ranked Information Retrieval on XML Data 2/48
Outline
1. Introduction to Information Retrieval
2. Information Retrieval on XML Data
3. Approaches
   1. ELIXIR
      - The ELIXIR language
      - The ELIXIR query processing algorithm
      - Experiments, Conclusion
   2. XRANK
      - Data model
      - Ranking function
      - Data structures and algorithms
      - Experiments
4. Conclusion
1. Introduction to Information Retrieval
• Definition:
– Information Retrieval (IR) is the technology for searching in collections (corpora, intranets, Web) of weakly structured documents: text, HTML, XML, ...
– search engines, digital libraries, similarity search on scientific data
• Vector space model (text analysis):
– based on word occurrence frequency
– documents and queries are vectors
– result ranking based on similarity metric in vector space
1. Introduction to Information Retrieval (II)
• Link analysis (structure analysis):
– weighting documents
– improve result ranking
PageRank approach (I):
– web as directed graph G
– “random walk” of a web surfer
• follow hyperlinks with probability (1-ε)
• “random jump” with probability ε
1. Introduction to Information Retrieval (III)
PageRank approach (II):
\[ r(q) = \varepsilon \cdot \frac{1}{n} + (1-\varepsilon) \cdot \sum_{(p,q) \in G} \frac{r(p)}{\mathrm{outdegree}(p)} \]
– ε · 1/n: “random jump” to document q (ε = probability of a “random jump”, n = number of documents)
– (1-ε) · Σ …: reaching q via hyperlinks (1-ε = probability of following a hyperlink)
[Figure: a document with three outgoing hyperlinks, each followed with probability (1-ε)/3, and “random jumps” with probability ε/5 to each of five documents.]
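The PageRank recurrence can be evaluated by repeated substitution (power iteration). The following is a minimal sketch, not taken from the slides: the function name `pagerank`, the toy graph, the value `eps=0.15` and the handling of pages without out-links are all assumptions made for illustration.

```python
def pagerank(out_links, eps=0.15, iterations=100):
    """Iterate r(q) = eps/n + (1-eps) * sum over (p,q) in G of r(p)/outdegree(p)."""
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every document receives the random-jump share eps/n.
        new_rank = {v: eps / n for v in nodes}
        for p in nodes:
            targets = out_links[p]
            if not targets:
                # Assumption: a page without out-links spreads its rank uniformly.
                for q in nodes:
                    new_rank[q] += (1 - eps) * rank[p] / n
            else:
                for q in targets:
                    new_rank[q] += (1 - eps) * rank[p] / len(targets)
        rank = new_rank
    return rank


# Hypothetical four-page web: d links to c, but nothing links to d.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

In this toy graph, page `d` has no incoming links, so its rank is exactly the random-jump share eps/n, while `c`, which collects links from three pages, ends up with the highest rank.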
2. Information Retrieval on XML Data
• XML: standard for exchange of structured data and documents
• existing query languages (e.g. XML-QL, Quilt, XQL, … XQuery)
– no ranked or weighted results based on textual similarity
– but extensions (XXL, XIRQL …)
Two approaches:
• ELIXIR: SQL-like approach
• XRANK: keyword-based approach
3.1 ELIXIR
• ELIXIR = “expressive and efficient language for XML information retrieval”
• extension to XML-QL: similarity operator “~”
• “~” computed by WHIRL
• returns best r answers
ELIXIR – The ELIXIR language
• Syntax:
– XML-QL Syntax (SQL-like)
CONSTRUCT <item>$b</>                           ← output format
WHERE <items.book year=$yb>$b</> in "db.xml",   ← pattern statement
      <items.cd>$c</> in "db.xml",              ← pattern statement
      $yb > 1990,                               ← predicate (boolean operator)
      $b ~ $c.                                  ← predicate (ELIXIR’s similarity operator)
• similarity calculation even between 2 variables (→ expressiveness)
• no nested queries
ELIXIR – The ELIXIR language (II)
WHIRL (I):
• Word-based Heterogeneous Information Retrieval Logic
• extends DATALOG with “~”
• only relational data
• efficiently supports ranked IR
• Syntax (Horn clause):
output($y, $a, $t) :- book($y, $a, $t), $y>1950, $t~$a.
– output($y, $a, $t): output relation
– book($y, $a, $t): input relation
– body: conjunction of relational predicates
– $y>1950: boolean operator
– $t~$a: similarity operator
ELIXIR – The ELIXIR language (III)
WHIRL (II):
• Similarity computation “~”:
– standard IR term vector techniques
– weighting terms (TF-IDF values)
– cosine measure:
\[ \mathrm{sim}(d, d') = \frac{\sum_{t \in V} d_t \cdot d'_t}{\lVert d \rVert \cdot \lVert d' \rVert} \]
(V: vocabulary of distinct terms; terms t ∈ V; documents d, d' ∈ ℝ^{|V|})
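As a rough illustration of the cosine measure over TF-IDF term vectors, here is a small sketch. The weighting `tf * log(1 + n/df)` is an assumption chosen for simplicity; WHIRL's exact weighting scheme may differ. The helper names `tfidf_vectors` and `cosine` are hypothetical, and the titles are borrowed from the book/CD example used elsewhere in these slides.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn each document into a sparse {term: weight} vector."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Assumed weighting: term frequency times a smoothed inverse document frequency.
        vectors.append({t: tf[t] * math.log(1 + n / df[t]) for t in tf})
    return vectors


def cosine(d1, d2):
    """sim(d, d') = sum_t d_t * d'_t / (||d|| * ||d'||)."""
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


titles = ["Traditional Ukrainian cookery", "Ukrainian folk music", "Milk cow blues"]
vecs = tfidf_vectors(titles)
sim_related = cosine(vecs[0], vecs[1])    # the titles share the term "ukrainian"
sim_unrelated = cosine(vecs[0], vecs[2])  # the titles share no term
```

Titles that share no term get similarity 0, which is what later allows the query processor to ignore most pairs.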
ELIXIR – The ELIXIR query processing algorithm
Example (naïve approach):
XML-QL query Q2:
<q2>
CONSTRUCT <tuple><b>$b</><c>$c</></>
WHERE <items.book>$b</> in "db.xml",
<items.cd>$c</> in "db.xml"
</>
→ similarity computation for every tuple ($b, $c): full cross product!
ELIXIR – The ELIXIR query processing algorithm (II)
Problem: full cross product!
ELIXIR – The ELIXIR query processing algorithm (III)
Solution:
• do not simply map the full XML data into the relational model
• invoke WHIRL as a “subroutine” (→ efficiency)
Avoid generating the full cross product!
ELIXIR – The ELIXIR query processing algorithm (IV)
Start: query Q1
Three stages with intermediate queries Q2, Q3, Q4:
1. Partition Q1 into a set Q2_1 … Q2_N of XML-QL queries
   – avoids generating the full cross product
   – evaluates the ordinary predicates
   – two pattern statements whose variables are compared with a similarity predicate => distinct Q2_j queries
2. WHIRL query Q3
   – evaluates the similarity predicates
   – produces an ordered table of the r best answers
3. XML-QL query Q4
   – transforms Q3’s output
   – into the XML structure specified by Q1
ELIXIR – The ELIXIR query processing algorithm (V)
Example (Step I – partition into Q2_n queries):
XML-QL query Q21:
<q21>
CONSTRUCT <tuple><b>$b</></>
WHERE <items.book>$b</> in "db.xml"
</>
Result of Q21:
<q21><tuple><b>Traditional Ukrainian cookery</></>
<tuple><b>Being and nothingness</></>
<tuple><b>Shooting Elvis</></></>
XML-QL query Q22:
<q22>
CONSTRUCT <tuple><c>$c</></>
WHERE <items.cd>$c</> in "db.xml"
</>
Result of Q22:
<q22><tuple><c>Ukrainian folk music</></>
<tuple><c>Being there</></>
<tuple><c>Milk cow blues</></></>
Avoid generating the full cross product!
ELIXIR – The ELIXIR query processing algorithm (VI)
Example (Step II – WHIRL query Q3):
WHIRL query Q3:
q3($b) :- q21($b), q22($c), $b ~ $c.
Input q21:
<q21><tuple><b>Traditional Ukrainian cookery</></>
<tuple><b>Being and nothingness</></>
<tuple><b>Shooting Elvis</></></>
Input q22:
<q22><tuple><c>Ukrainian folk music</></>
<tuple><c>Being there</></>
<tuple><c>Milk cow blues</></></>
Output q3:
<q3><tuple><b>Traditional Ukrainian cookery</></>
<tuple><b>Being and nothingness</></></>
ELIXIR – The ELIXIR query processing algorithm (VII)
Example (Step III – XML-QL query Q4):
Input q3:
<q3><tuple><b>Traditional Ukrainian cookery</></>
<tuple><b>Being and nothingness</></></>
XML-QL query Q4:
<results>
CONSTRUCT <item>$b</>
WHERE <q3.tuple><b>$b</></> in "q3.xml"
</>
Final XML output:
<results><item>Traditional Ukrainian cookery</>
<item>Being and nothingness</></>
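The three stages can be mimicked in a few lines of Python. This is only a sketch under loud assumptions: `DB` is a toy stand-in for "db.xml", the function names are hypothetical, and a Jaccard overlap of word sets replaces WHIRL's TF-IDF cosine similarity. What it does show is the point of the partitioning: candidate pairs come from an inverted index over the CD titles, so book/CD pairs without a shared term are never scored.

```python
import heapq
from collections import defaultdict

# Toy stand-in for "db.xml"; the titles are the ones used in the slides' example.
DB = {
    "book": ["Traditional Ukrainian cookery", "Being and nothingness", "Shooting Elvis"],
    "cd": ["Ukrainian folk music", "Being there", "Milk cow blues"],
}


def tokens(text):
    return set(text.lower().split())


def stage1_partition(db):
    """Q21 and Q22: one selection per pattern statement, never a cross product."""
    return list(db["book"]), list(db["cd"])


def stage2_similarity_join(books, cds, r):
    """Q3: keep the r books that are most similar to some CD title."""
    index = defaultdict(set)  # term -> positions of the CDs containing it
    for j, cd in enumerate(cds):
        for term in tokens(cd):
            index[term].add(j)
    scored = []
    for book in books:
        bt = tokens(book)
        # Only CDs sharing at least one term with the book are candidates.
        candidates = set().union(*(index[t] for t in bt if t in index))
        best = max(
            (len(bt & tokens(cds[j])) / len(bt | tokens(cds[j])) for j in candidates),
            default=0.0,
        )
        if best > 0:
            scored.append((best, book))
    return [book for _, book in heapq.nlargest(r, scored)]


def stage3_construct(items):
    """Q4: wrap the ranked answers in the output structure requested by Q1."""
    return "<results>" + "".join(f"<item>{b}</item>" for b in items) + "</results>"


books, cds = stage1_partition(DB)
answers = stage2_similarity_join(books, cds, r=2)
output = stage3_construct(answers)
```

With r = 2 the sketch returns the same two books as the example, although their relative order depends on the similarity function that is plugged in.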
ELIXIR – Experiments, Conclusion
Experiments:
Total processing time …
– … depends on the details of each query and on the input data
– … increases only marginally with the number of answers r
– … increases linearly with the number of similarity join predicates
– … is dominated by the partitioning of the initial query (Step 1), because parsing and traversal are expensive
ELIXIR – Experiments, Conclusion (II)
Conclusion:
• ELIXIR extends XML-QL by supporting IR similarity features for ranking
• similarity joins even between 2 variables (expressiveness)
• Algorithm:
– rewrites the original ELIXIR query as a series of intermediate XML-QL and WHIRL queries
– no full cross product, only filtered tuples of variable bindings (efficiency)
• But …
– only non-nested queries
– the strict three-stage approach may be suboptimal in some cases (partition)
XRANK: Ranked Keyword Search over XML Documents
Introduction
XRANK: keyword search over XML documents
• results: XML elements that contain all searched keywords
• ranking: at the granularity of XML elements, based on the hyperlink structure
• advantages:
– the user does not have to learn a query language
– no knowledge about the structure of the XML documents is needed
• generalized keyword search engine (both HTML and XML are possible)
Data Model
• G = (V, CE, HE): collection of XML documents
• V: set of XML elements (tags and attributes)
• CE: set of containment edges
• HE: set of hyperlink edges
• (u,v) ∈ CE ⇔ v is a sub-element of u
• (u,v) ∈ HE ⇔ u contains a hyperlink to v
• contains(v,k) ⇔ v (in)directly contains the keyword k
Example: XML Graph
[Figure: example XML graph; the legend distinguishes XML elements from values.]
Keyword Query Results (1)
How to define the results of keyword search queries over XML documents?
• elements that contain all keywords while no sub-element contains all keywords
∪
• elements with at least one sub-element containing all keywords and at least one sub-element containing some keywords
Ranking Elements
How to rank XML elements?
• ElemRank: extension of PageRank to the granularity of elements
• objective importance of XML elements, based on the hyperlinked and nested structure of XML elements
ElemRank (1)
• n: number of XML elements
• n_c(u): number of sub-elements of u
• n_h(u): number of outgoing hyperlinks from u
• CE⁻¹ = {(v,u) | (u,v) ∈ CE}: “reverse containment edges”
• E = HE ∪ CE ∪ CE⁻¹
[Figure: element u with n_c(u) = 3 and n_h(u) = 3; legend: containment edge, reverse containment edge, hyperlink edge.]
ElemRank (2)
• α: probability of following a hyperlink
• β: probability of using a containment edge
• γ: probability of using a reverse containment edge
• 1 - α - β - γ: probability of a random jump
[Figure: example graph with containment, reverse containment and hyperlink edges. Each edge is labeled with its transition probability: the probability of its edge type divided by the number of such edges leaving the element (3 or 1 in the example), plus the random-jump share ε/10.]
ElemRank (3)
\[ e(v) = \frac{1-\alpha-\beta-\gamma}{n} + \alpha \sum_{(u,v) \in HE} \frac{e(u)}{n_h(u)} + \beta \sum_{(u,v) \in CE} \frac{e(u)}{n_c(u)} + \gamma \sum_{(u,v) \in CE^{-1}} \frac{e(u)}{1} \qquad (0 \le \alpha, \beta, \gamma \le 1) \]
– (1-α-β-γ)/n: random navigation
– α term: via hyperlinks
– β term: via forward containment edges
– γ term: via reverse containment edges
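The ElemRank recurrence can be iterated in the same way as PageRank. The sketch below is illustrative only: the three edge-type probabilities are named `alpha`, `beta` and `gamma` (hyperlink, containment, reverse containment), their default values are arbitrary, and the tiny element graph with names such as `CD1` and `song1` is made up.

```python
def elem_rank(children, hyperlinks, alpha=0.35, beta=0.25, gamma=0.25, iterations=200):
    """Iterate the ElemRank recurrence over containment and hyperlink edges.

    children:   {element: [sub-elements]}      (containment edges CE)
    hyperlinks: {element: [linked elements]}   (hyperlink edges HE)
    """
    elements = set(children) | set(hyperlinks)
    for targets in list(children.values()) + list(hyperlinks.values()):
        elements.update(targets)
    # Reverse containment edges: every sub-element points back to its parent.
    parent = {c: u for u, subs in children.items() for c in subs}
    n = len(elements)
    e = {v: 1.0 / n for v in elements}
    for _ in range(iterations):
        new = {v: (1 - alpha - beta - gamma) / n for v in elements}  # random jump
        for u in elements:
            for v in hyperlinks.get(u, []):
                new[v] += alpha * e[u] / len(hyperlinks[u])   # via hyperlinks
            for v in children.get(u, []):
                new[v] += beta * e[u] / len(children[u])      # via forward containment
            if u in parent:
                new[parent[u]] += gamma * e[u]                # via reverse containment (divisor 1)
        e = new
    return e


# Hypothetical element graph: two CDs, one song that links to the second CD.
children = {"CDs": ["CD1", "CD2"], "CD1": ["title1", "song1"], "CD2": ["title2"]}
hyperlinks = {"song1": ["CD2"]}
ranks = elem_rank(children, hyperlinks)
```

Because an element without hyperlinks or sub-elements passes on less than its full weight, the values produced by this literal reading of the formula do not sum to 1; they are only meaningful relative to each other.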
Ranking Function (1)
Ranking functions should take into account:
• result specificity
• hyperlinks
• keyword proximity
contains(v,k) ⇒ ∃ sequence of containment edges (v1,v2), ..., (vn-1,vn) with v1 = v such that vn directly contains k
r(v,k) = ElemRank(vn) · decay^(n-1)   (0 ≤ decay ≤ 1)
– ElemRank(vn): based on the hyperlinked structure
– decay^(n-1): result specificity
Ranking Function (2)
• m occurrences of keyword k ⇒ computation of r1, ..., rm
  r*(v,k) = f(r1, ..., rm)   (with accumulation function f, e.g. max or sum)
• query q consists of keywords k1, ..., kn
  R(v,q) = (Σ_{i=1..n} r*(v,ki)) · p(v,k1, ..., kn)
  p = keyword proximity measure
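A small worked example may help with how the pieces combine. In the sketch below, each keyword occurrence is given as a pair (ElemRank of the element that directly contains the keyword, number n of elements on the path from v down to it). The function names, the decay value 0.5 and the constant proximity 1.0 are assumptions for illustration; the slides do not define the proximity measure in detail.

```python
def keyword_rank(occurrences, decay, f=max):
    """r*(v,k) = f(r_1, ..., r_m) with r_i = ElemRank(v_n) * decay**(n-1)."""
    return f(elem_rank * decay ** (n - 1) for elem_rank, n in occurrences)


def overall_rank(per_keyword_occurrences, decay, proximity=1.0, f=max):
    """R(v,q) = (sum over keywords of r*(v,k_i)) * p(v, k_1, ..., k_n)."""
    return sum(keyword_rank(occ, decay, f) for occ in per_keyword_occurrences) * proximity


# Hypothetical result element v = first <CD>:
#   keyword 1 occurs in a <title> one level below v       (ElemRank 75, n = 2)
#   keyword 2 occurs in a song <title> two levels below v (ElemRank 88, n = 3)
score = overall_rank([[(75, 2)], [(88, 3)]], decay=0.5)
# 75 * 0.5 + 88 * 0.25 = 37.5 + 22.0 = 59.5
```

The deeper occurrence contributes less even though its ElemRank is higher, which is how decay rewards more specific results.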
<CDs>
  <CD id="1">
    <title> R.E.M. – Out Of Time </title>
    <song> <title> Radio Song </title> <time> 4:12 </time> </song>
    <song> <title> Losing My Religion </title> <time> 4:26 </time> </song>
    ...
  </CD>
  <CD id="2">
    <title> R.E.M. – Automatic For... </title>
    ...
  </CD>
  ...
</CDs>
XRANK Architecture
[Figure: architecture diagram with the components XML documents, ElemRank computation, XML elements with ElemRanks, index structures & algorithms, and Query Evaluator (data access to the index structures; input: keyword search query; output: ranked result list).]
Naïve Approach
• naïve inverted list: contains all XML elements that contain the keyword
key1 → elem11, elem12, ...
key2 → elem21, elem22, ...
etc.
• drawbacks: space overhead, spurious results, inaccurate ranking
Dewey IDs
0 <CDs>
  0.0 <CD>
    0.0.0 <title> R.E.M. – Out Of Time
    0.0.1 <song>
      0.0.1.0 <title> Radio Song
      0.0.1.1 <time> 4:12
    0.0.2 <song>
      0.0.2.0 <title> Losing My Religion
      0.0.2.1 <time> 4:26
    ...
  0.1 <CD>
    0.1.0 <title> R.E.M. – Automatic For The People
    ...
  ...
DIL – Data Structure
• Dewey inverted list (DIL):
– contains the Dewey IDs of all XML elements that directly contain the keyword
– sorted by Dewey ID (ascending)
Keyword    Dewey ID   ElemRank   position list
R.E.M.     0.0.0      75         [0]
           0.1.0      80         [0]
           …
Religion   0.0.2.0    88         [2]
           …
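To make the data structure concrete, the following sketch builds such a list from a small document. It is an illustration under assumptions: the function name `build_dil` is hypothetical, keywords are lower-cased whitespace tokens, Dewey IDs are stored as integer tuples, text following child elements is ignored, and the ElemRank column is left out because it would come from a separate computation.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_dil(root):
    """Map keyword -> [(Dewey ID, position list)] for elements that directly contain it."""
    index = defaultdict(list)

    def visit(elem, dewey):
        positions = defaultdict(list)
        for pos, word in enumerate((elem.text or "").split()):
            positions[word.lower()].append(pos)
        for word, plist in positions.items():
            index[word].append((dewey, plist))
        for i, child in enumerate(elem):
            visit(child, dewey + (i,))

    visit(root, (0,))
    for entries in index.values():
        entries.sort()  # ascending Dewey ID order
    return dict(index)


doc = ET.fromstring(
    "<CDs>"
    "<CD><title>R.E.M. - Out Of Time</title>"
    "<song><title>Losing My Religion</title><time>4:26</time></song></CD>"
    "<CD><title>R.E.M. - Automatic For The People</title></CD>"
    "</CDs>"
)
dil = build_dil(doc)
```

Only the elements that directly contain a keyword are stored; their ancestors are implied by the Dewey ID prefixes, which avoids the space overhead of the naïve list.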
DIL – Query Processing (1)
• key idea: computation of the longest common prefix (lcp) of Dewey IDs
[Figure, step 1: Dewey stack with the columns Dewey ID, rank[1], rank[2], posList[1], posList[2] and pot_result after the first Dewey ID has been read.]
DIL – Query Processing (2)
[Figure, step 2: Dewey stacks (columns Dewey ID, rank[1], rank[2], posList[1], posList[2], pot_result) before and after the second Dewey ID is read; the components shared by both Dewey IDs are marked as lcp.]
DIL – Query Processing (3)
[Figure, step 3: Dewey stacks (columns Dewey ID, rank[1], rank[2], posList[1], posList[2], pot_result) for the first three Dewey IDs; the lcp of consecutive Dewey IDs is marked in each transition.]
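The role of the longest common prefix can be shown without the stack machinery. The sketch below is not the single-pass Dewey stack algorithm from the figures; it is a simplified, set-based evaluation that returns the most specific elements containing all keywords. The function names are hypothetical, and ranks and position lists are left out.

```python
def lcp(a, b):
    """Longest common prefix of two Dewey IDs given as integer tuples."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


def containing_elements(dewey_list):
    """All elements (prefixes) that directly or indirectly contain the keyword."""
    return {d[:i] for d in dewey_list for i in range(1, len(d) + 1)}


def most_specific_results(inverted_lists):
    """Elements containing every keyword that have no descendant doing the same."""
    common = set.intersection(*(containing_elements(lst) for lst in inverted_lists))
    return sorted(
        c for c in common
        if not any(o != c and o[:len(c)] == c for o in common)
    )


rem = [(0, 0, 0), (0, 1, 0)]   # Dewey IDs for the keyword "R.E.M."
religion = [(0, 0, 2, 0)]      # Dewey IDs for the keyword "Religion"
shared = lcp(rem[0], religion[0])
results = most_specific_results([rem, religion])
```

For the two keywords the lcp of `0.0.0` and `0.0.2.0` is `0.0`, the first CD, which is also the only most specific result; the root `0` contains both keywords as well but is dropped because a descendant already does.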
RDIL – Data Structure
• ranked Dewey inverted list (RDIL):
– each Dewey ID in the list has a position in the B+-tree
– B+-tree sorted by Dewey ID (ascending)
– inverted list sorted by ElemRank (descending)
Keyword    Dewey ID   ElemRank
R.E.M.     0.1.0      80
           0.0.0      75
           …
B+-tree on Dewey IDs: 0.0.0, 0.1.0, …
RDIL – Query Processing (1)
[Figure: one inverted list per keyword (key1: entry11, entry12, entry13, ...; key2: entry21, entry22, entry23, ...; key3: entry31, entry32, entry33, ...), each sorted by ElemRank and each with a B+-tree on Dewey IDs. The lcp with Dewey ID11 is computed and the result is placed in the result heap.]
RDIL – Query Processing (2)
[Figure: the same inverted lists and B+-trees; next, the lcp with Dewey ID21 is computed and the result is placed in the result heap, etc.]
RDIL – Query Processing (3)
[Figure: the same inverted lists and B+-trees, with the current entries of the lists highlighted.]
∑ Ranking = threshold Ω
max. reachable ranking ≤ Ω
RDIL – Query Processing (4)
The RDIL algorithm stops if
  threshold Ω < lowest ElemRank in result heap
because
  max. reachable ranking ≤ Ω < lowest ElemRank in result heap
  ⇒ max. reachable ranking < lowest ElemRank in result heap
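The stopping rule is the one used by threshold-style top-m algorithms. The sketch below is a generic, heavily simplified version and not XRANK's RDIL implementation: every candidate element is assumed to already have one score per keyword, a dictionary lookup stands in for the B+-tree probe and lcp computation, and all names and numbers are made up.

```python
import heapq


def top_m(lists, m):
    """lists: {keyword: [(score, element), ...]}, each sorted by score descending.

    Returns the best m (total score, element) pairs and the number of rounds read.
    """
    lookup = {k: {elem: s for s, elem in entries} for k, entries in lists.items()}
    seen, heap = set(), []  # min-heap: the lowest total score sits at heap[0]
    rounds = 0
    depth = max((len(entries) for entries in lists.values()), default=0)
    for pos in range(depth):
        rounds = pos + 1
        last_read = []
        for k, entries in lists.items():
            if pos >= len(entries):
                last_read.append(0)  # exhausted list: nothing unseen remains in it
                continue
            score, elem = entries[pos]
            last_read.append(score)
            if elem in seen:
                continue
            seen.add(elem)
            # Only elements that occur in every keyword list qualify as results.
            if all(elem in lookup[j] for j in lists):
                total = sum(lookup[j][elem] for j in lists)
                heapq.heappush(heap, (total, elem))
                if len(heap) > m:
                    heapq.heappop(heap)
        # No unseen element can score higher than the sum of the last-read scores.
        threshold = sum(last_read)
        if len(heap) == m and threshold < heap[0][0]:
            break
    return sorted(heap, reverse=True), rounds


# Hypothetical per-keyword scores for three candidate elements.
lists = {
    "rem": [(80, "0.1"), (75, "0.0"), (10, "0.2")],
    "religion": [(88, "0.0"), (20, "0.1"), (5, "0.2")],
}
best, rounds_read = top_m(lists, m=1)
```

In this example the loop stops after two of the three rounds: the threshold has dropped to 75 + 20 = 95, which is below the best total of 163 already in the heap, so the remaining entries need not be read.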
XRANK Architecture
[Figure: the architecture diagram again, with DIL / RDIL as the index structures: XML documents, ElemRank computation, XML elements with ElemRanks, DIL / RDIL, and Query Evaluator (data access to DIL / RDIL; input: keyword search query; output: ranked result list).]
Experimental Results (1)
high keyword correlation:
[Figure: execution time in seconds (axis from 0 to 1.2) versus number of keywords (1 to 5) for DIL and RDIL.]
Experimental Results (2)
low keyword correlation:
[Figure: execution time in seconds (axis from 0 to 2) versus number of keywords (1 to 5) for DIL and RDIL.]
Comparison DIL - RDIL
DIL:
• inverted lists sorted by Dewey ID
• computes the longest common prefix on Dewey IDs
• extracts the minimum of all remaining Dewey IDs
• all lists are completely scanned
• outperforms RDIL if keyword correlation is low
RDIL:
• inverted lists sorted by ElemRank
• chooses the next list sequentially
• stops if a certain threshold is reached
• outperforms DIL if keyword correlation is high
Conclusion
Two approaches:
ELIXIR:
– SQL-like, structure-based search
– extends XML-QL by supporting IR similarity features for ranking
– ranked results based only on textual similarity (even between 2 variables)
XRANK:
– keyword-based search à la Google
– ranked results based on textual similarity
– hierarchical and hyperlinked structure