TRANSCRIPT
Martin Theobald
Max Planck Institute for Computer Science / Stanford University
Joint work with Ralf Schenkel, Gerhard Weikum

TopX: Efficient & Versatile Top-k Query Processing for Semistructured Data
Example: XPath-like full-text search (NEXI-style "about" predicates) over heterogeneous, non-schematic XML documents containing passages such as:

"Native XML data base systems can store schemaless data ..."
"Data management systems control data acquisition, storage, and retrieval. Systems evolved from flat files ..."
"What does XML add for retrieval? It adds formal ways ..."

Query:

//article[.//bib[about(.//item, "W3C")]]
  //sec[about(.//, "XML retrieval")]
    //par[about(.//, "native XML databases")]

i.e., find paragraphs about "native XML databases" inside sections about "XML retrieval", within articles whose bibliography contains an item mentioning "W3C".

[Figure: two sample article trees (article, sec, par, bib, item, title, url nodes) with the elements matching the query highlighted.]
RANKING / VAGUENESS / PRUNING

Goal: Efficiently retrieve the best (top-k) results of a similarity query

• Extend existing threshold algorithms for inverted lists [Güntzer, Balke & Kießling, VLDB '00; Fagin, PODS '01] to XML data and XPath-like full-text search
• Support non-schematic, heterogeneous data sources
• Efficiently support IR-style vague search
• Combined inverted index for content & structure
• Avoid full index scans; postpone expensive random accesses to large disk-resident data structures
• Exploit cheap disk space for redundant index structures
XML-IR: History and Related Work

IR on structured docs (SGML), ca. 1995: OED etc. (U Waterloo), HySpirit (U Dortmund), HyperStorM (GMD Darmstadt), WHIRL (CMU)

Web query languages: Lorel (Stanford U), Araneus (U Roma), W3QS (Technion Haifa), WebSQL (U Toronto)

XML query languages: XML-QL (AT&T Labs), XPath 1.0 (W3C), XPath 2.0 (W3C), XQuery 1.0 (W3C), XPath 2.0 & XQuery 1.0 Full-Text (W3C), TeXQuery (AT&T Labs), NEXI (INEX benchmark)

IR on XML, ca. 2000-2005: XIRQL & HyRex (U Dortmund), XXL & TopX (U Saarland / MPII), ApproXQL (U Berlin / U Munich), ELIXIR (U Dublin), JuruXML (IBM Haifa), XSearch (Hebrew U), Timber (U Michigan), XRank & Quark (Cornell U), FleXPath (AT&T Labs), XKeyword (UCSD)

Commercial software: MarkLogic, Verity?, IBM?, Oracle?, ...
[Figure: TopX system architecture.
Frontends: Web interface, Web service, API.
Indexer/Crawler feeding a DBMS / inverted lists over a unified text & XML schema; index metadata (selectivities, histograms, correlations); ontology/large thesaurus (WordNet, OpenCyc, etc.).
TopX query processor at query processing time: scan threads performing sequential accesses (SA) on the index lists, scheduled random accesses (RA) for auxiliary predicates, a candidate queue and candidate cache, and the top-k queue.]
Outline:
1. Top-k XPath Processing
2. Probabilistic Index Access Scheduling
3. Probabilistic Candidate Pruning
4. Dynamic Query Expansion
5. Experiments: TREC & INEX Benchmarks
Data Model

• XML trees (no XLinks or ID/IDref attributes)
• Pre-/postorder node labels
• Redundant full-content text nodes (with stemming, no stopwords)

<article>
  <title>XML Data Management</title>
  <abs>XML management systems vary widely in their expressive power.</abs>
  <sec>
    <title>Native XML Data Bases.</title>
    <par>Native XML data base systems can store schemaless data.</par>
  </sec>
</article>

Node labels (pre, post): article (1, 6), title (2, 1), abs (3, 2), sec (4, 5), title (5, 3), par (6, 4)

Full-content of article1: "xml data manage xml manage system vary wide expressive power native xml data base native xml data base system store schemaless data"
Full-content of sec4: "native xml data base native xml data base system store schemaless data"

Full-content term frequencies: ftf("xml", article1) = 4, ftf("xml", sec4) = 2
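The data model above can be sketched in a few lines: a depth-first traversal assigns the pre-/postorder labels, and the full-content term frequency ftf counts a term over the concatenated text of an element's whole subtree. This is an illustrative sketch, not TopX code; the class and function names are made up for the example.

```python
# Illustrative sketch: pre-/postorder labeling of an XML element tree and
# full-content term frequencies (ftf) over whole subtrees.
from collections import Counter

class Elem:
    def __init__(self, tag, text="", children=None):
        self.tag, self.text = tag, text
        self.children = children or []
        self.pre = self.post = None

def label(root):
    pre_ctr, post_ctr = [0], [0]
    def dfs(e):
        pre_ctr[0] += 1
        e.pre = pre_ctr[0]          # preorder rank: assigned on entry
        for c in e.children:
            dfs(c)
        post_ctr[0] += 1
        e.post = post_ctr[0]        # postorder rank: assigned on exit
    dfs(root)

def full_content(e):
    """Concatenated (already stemmed, stopword-free) text of e's subtree."""
    parts = [e.text] if e.text else []
    parts += [full_content(c) for c in e.children]
    return " ".join(parts)

def ftf(term, e):
    return Counter(full_content(e).split())[term]

# The sample document from the slide, with terms already stemmed:
doc = Elem("article", children=[
    Elem("title", "xml data manage"),
    Elem("abs", "xml manage system vary wide expressive power"),
    Elem("sec", children=[
        Elem("title", "native xml data base"),
        Elem("par", "native xml data base system store schemaless data"),
    ]),
])
label(doc)
```

Running this reproduces the slide's labels, e.g. article (1, 6) and sec (4, 5), and the frequencies ftf("xml", article1) = 4 and ftf("xml", sec4) = 2.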
Scoring Model [INEX '06/'07]

• XML-specific extension to Okapi BM25 (originating from probabilistic IR on unstructured text)
• ftf instead of tf; ef (element frequency) instead of df
• Element-type-specific length normalization
• Tunable parameters k1 and b
• Example: bib["transactions"] vs. par["transactions"] are scored against different element-type statistics
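To make the bullets concrete, here is a sketch of a BM25-style element score in the spirit described above: ftf replaces tf, element frequency ef replaces df, and lengths are normalized per element type. The exact constants and formula details in the TopX papers may differ; this uses the standard BM25 form as an assumption.

```python
# Hedged sketch of an Okapi-BM25-style score for one (tag, term) condition
# on one element; constants and exact form are illustrative assumptions.
import math

def bm25_element_score(ftf, ef, n_elems, elem_len, avg_len, k1=1.2, b=0.7):
    """ftf      -- full-content term frequency of the term in the element
    ef       -- element frequency: number of elements of this tag type
                containing the term
    n_elems  -- total number of elements of this tag type in the corpus
    elem_len -- full-content length of this element (in terms)
    avg_len  -- average full-content length over this tag type"""
    K = k1 * ((1 - b) + b * elem_len / avg_len)   # per-tag-type length norm
    idf = math.log((n_elems - ef + 0.5) / (ef + 0.5))
    return (k1 + 1) * ftf / (K + ftf) * idf
```

Because n_elems and avg_len are kept per tag type, bib["transactions"] and par["transactions"] naturally receive different scores even for identical ftf values.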
A naive "merge-then-sort" approach requires between O(mn) and O(mn²) runtime and O(mn) access cost.
Fagin's NRA [PODS '01] at a Glance

Corpus: d1, ..., dn; query q = (t1, t2, t3); here k = 1.
Find the top-k documents that maximize s(t1, dj) + s(t2, dj) + ... + s(tm, dj), using non-conjunctive ("andish") evaluation.

Inverted index, each list sorted by descending score (e.g. s(t1, d10) = 0.8, s(t2, d10) = 0.6, s(t3, d10) = 0.7):
  t1: d78:0.9  d23:0.8  d10:0.8  d1:0.7  d88:0.2 ...
  t2: d64:0.8  d23:0.6  d10:0.6  d78:0.1 ...
  t3: d10:0.7  d78:0.5  d64:0.4  d99:0.2  d34:0.1 ...

Scan depth 1:
  Rank | Doc | Worst-score | Best-score
  1    | d78 | 0.9         | 2.4
  2    | d64 | 0.8         | 2.4
  3    | d10 | 0.7         | 2.4

Scan depth 2:
  1    | d78 | 1.4         | 2.0
  2    | d23 | 1.4         | 1.9
  3    | d64 | 0.8         | 2.1
  4    | d10 | 0.7         | 2.1

Scan depth 3:
  1    | d10 | 2.1         | 2.1
  2    | d78 | 1.4         | 2.0
  3    | d23 | 1.4         | 1.8
  4    | d64 | 1.2         | 2.0
  STOP! No candidate's best-score can exceed the top-1 worst-score of 2.1.

1.  NRA(q, L):
2.  scan all lists Li (i = 1..m) in parallel & consider doc d at position posi:
3.    E(d) := E(d) ∪ {i};
4.    highi := s(ti, d);
5.    worstscore(d) := ∑_{i ∈ E(d)} s(ti, d);
6.    bestscore(d) := worstscore(d) + ∑_{i ∉ E(d)} highi;
7.    if worstscore(d) > min-k then
8.      add d to top-k
9.      min-k := min{worstscore(d') | d' ∈ top-k};
10.   else if bestscore(d) > min-k then
11.     candidates := candidates ∪ {d};
12.   if max{bestscore(d') | d' ∈ candidates} ≤ min-k then
13.     return top-k;
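The NRA pseudocode above can be run end-to-end on the slide's example. The sketch below is a naive in-memory version (no disk, round-robin sorted access only); the deeper list entries are reconstructed from the figure and are illustrative.

```python
# Runnable sketch of Fagin's NRA on the slide's example lists (k = 1).
def nra(lists, k):
    """lists maps each query term to [(doc, score), ...] sorted by
    descending score; returns the top-k (doc, worstscore) pairs."""
    terms = list(lists)
    m = len(terms)
    seen = {}                                   # doc -> {list index: score}
    high = [lists[t][0][1] for t in terms]      # last score read per list
    max_depth = max(len(lists[t]) for t in terms)
    for depth in range(max_depth):
        for i, t in enumerate(terms):           # one round of SAs
            if depth < len(lists[t]):
                doc, score = lists[t][depth]
                high[i] = score
                seen.setdefault(doc, {})[i] = score
        bounds = {}
        for doc, scores in seen.items():
            worst = sum(scores.values())
            best = worst + sum(high[i] for i in range(m) if i not in scores)
            bounds[doc] = (worst, best)
        topk = sorted(bounds, key=lambda d: bounds[d][0], reverse=True)[:k]
        min_k = bounds[topk[-1]][0] if len(topk) == k else 0.0
        # threshold test: nobody outside the top-k, seen or entirely
        # unseen, can still exceed min-k
        if (all(bounds[d][1] <= min_k for d in bounds if d not in topk)
                and sum(high) <= min_k):
            break
    return [(d, bounds[d][0]) for d in topk]

# The slide's example; entries beyond scan depth 3 are reconstructed:
lists = {
    "t1": [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)],
    "t2": [("d64", 0.8), ("d23", 0.6), ("d10", 0.6), ("d78", 0.1)],
    "t3": [("d10", 0.7), ("d78", 0.5), ("d64", 0.4), ("d99", 0.2)],
}
top1 = nra(lists, k=1)
```

On this input the scan stops at depth 3 and returns d10 with worst-score 2.1, exactly as in the slide's tables.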
Inverted Block-Index for Content & Structure

• Mostly sorted (= sequential) access to large element blocks on disk
• Group elements in descending order of (max-score, docid)
• Block-scan all elements per doc for a given (tag, term) key
• Stored as inverted files or database tables
• Two B+-tree indexes over the full range of attributes (IOTs in Oracle)

Query: //sec[about(.//, "XML") and about(.//title, "native")]//par[about(.//, "retrieval")]

sec["xml"] (SA):
  eid | docid | score | pre | post | max-score
  46  | 2     | 0.9   | 2   | 15   | 0.9
  9   | 2     | 0.5   | 10  | 8    | 0.9
  171 | 5     | 0.85  | 1   | 20   | 0.85
  84  | 3     | 0.1   | 1   | 12   | 0.1

title["native"] (SA):
  eid | docid | score | pre | post | max-score
  216 | 17    | 0.9   | 2   | 15   | 0.9
  72  | 3     | 0.8   | 14  | 10   | 0.8
  51  | 2     | 0.5   | 4   | 12   | 0.5
  671 | 31    | 0.4   | 12  | 23   | 0.4

par["retrieval"] (SA):
  eid | docid | score | pre | post | max-score
  3   | 1     | 1.0   | 1   | 21   | 1.0
  28  | 2     | 0.8   | 8   | 14   | 0.8
  182 | 5     | 0.75  | 3   | 7    | 0.75
  96  | 4     | 0.75  | 6   | 4    | 0.75

(Each list is scanned sequentially; random accesses (RA) resolve individual entries when needed.)
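The block layout described above can be sketched as follows: one inverted list per (tag, term) key, documents ordered by descending per-document max-score, and all matching elements of a document kept together as one block so a single sequential access scans them all. The function and tuple layout are illustrative assumptions, not the TopX storage format.

```python
# Sketch (assumed layout, following the slide): build per-(tag, term)
# lists whose blocks are ordered by descending per-document max-score.
from collections import defaultdict

def build_block_index(postings):
    """postings: iterable of (tag, term, eid, docid, score, pre, post)."""
    lists = defaultdict(list)
    for tag, term, eid, docid, score, pre, post in postings:
        lists[(tag, term)].append((eid, docid, score, pre, post))
    index = {}
    for key, entries in lists.items():
        # the per-document max-score decides where the whole block goes
        maxscore = defaultdict(float)
        for eid, docid, score, pre, post in entries:
            maxscore[docid] = max(maxscore[docid], score)
        docs = sorted(maxscore, key=lambda d: (-maxscore[d], d))
        index[key] = [
            (d, maxscore[d],
             sorted((e for e in entries if e[1] == d), key=lambda e: -e[2]))
            for d in docs
        ]
    return index

# The sec["xml"] list from the slide:
postings = [
    ("sec", "xml", 46, 2, 0.9, 2, 15),
    ("sec", "xml", 9, 2, 0.5, 10, 8),
    ("sec", "xml", 171, 5, 0.85, 1, 20),
    ("sec", "xml", 84, 3, 0.1, 1, 12),
]
blocks = build_block_index(postings)[("sec", "xml")]
```

The resulting block order is doc 2 (max-score 0.9, elements 46 and 9 together), then doc 5 (0.85), then doc 3 (0.1), matching the slide's table.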
Navigational Element Index

• Additional index for tag paths
• RAs on a B+-tree index using (docid, tag) as key
• Few & judiciously scheduled "expensive predicate" probes
• Schema-oblivious indexing & querying: non-schematic XML data (no DTD required)
• Supports full NEXI syntax & all 13 XPath axes (+ level)

Query: //sec[about(.//title, "native")]//par[about(.//, "retrieval")]

sec (navigational, resolved by RA):
  eid | docid | pre | post
  46  | 2     | 2   | 15
  9   | 2     | 10  | 8
  171 | 5     | 1   | 20
  84  | 3     | 1   | 12

The scored conditions title["native"] and par["retrieval"] are scanned sequentially (SA) as before; the unscored sec tag condition is probed by RAs on the navigational index.
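The reason the pre/post columns suffice for structural predicates: with pre-/postorder labels, the descendant axis reduces to two integer comparisons, so the navigational probes never traverse the tree. A minimal sketch (tuple layout is an assumption for illustration):

```python
# u is an ancestor of v  iff  same doc, pre(u) < pre(v), post(u) > post(v).
def is_ancestor(anc, desc):
    """anc, desc: (docid, pre, post) triples from the element index."""
    return (anc[0] == desc[0]        # same document
            and anc[1] < desc[1]     # anc is opened before desc
            and anc[2] > desc[2])    # anc is closed after desc

# From the slide's tables: the sec element (docid 2, pre 2, post 15)
# contains the element with (docid 2, pre 10, post 8).
inside = is_ancestor((2, 2, 15), (2, 10, 8))
```

This test is what lets structural joins between, e.g., sec and par matches of the same document be evaluated directly on index entries.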
TopX Query Processing Example

Query: //sec[about(.//, "XML") and about(.//title, "native")]//par[about(.//, "retrieval")], k = 2

sec["xml"]:
  eid | docid | score | pre | post
  46  | 2     | 0.9   | 2   | 15
  9   | 2     | 0.5   | 10  | 8
  171 | 5     | 0.85  | 1   | 20
  84  | 3     | 0.1   | 1   | 12

title["native"]:
  eid | docid | score | pre | post
  216 | 17    | 0.9   | 2   | 15
  72  | 3     | 0.8   | 14  | 10
  51  | 2     | 0.5   | 4   | 12
  671 | 31    | 0.4   | 12  | 23

par["retrieval"]:
  eid | docid | score | pre | post
  3   | 1     | 1.0   | 1   | 21
  28  | 2     | 0.8   | 8   | 14
  182 | 5     | 0.75  | 3   | 7
  96  | 4     | 0.75  | 6   | 4

[Figure: round-robin scans over the three lists. Each scanned element is joined into a per-document candidate with worst- and best-score bounds, e.g. element 46 of doc 2 starts at worst = 0.9, best = 2.9; a "pseudo-doc" tracks the best-score bound for entirely unseen documents (2.9, then 2.8, 2.75, 2.65, ...). As elements 216 (doc 17), 3 (doc 1), 171/182 (doc 5), 72/84 (doc 3), and 28/51 (doc 2) are scanned, the bounds tighten and structural joins via pre/post containment combine sec, title, and par matches within a document, e.g. doc 2 reaches worst = 2.2 via elements 46, 28, 51 and doc 5 reaches worst = 1.6 via 171, 182. The candidate-queue threshold rises min-2 = 0.0 → 0.5 → 0.9 → 1.0 → 1.6, pruning candidates whose best-score drops below it, until the top-2 results remain.]
Part 2: Probabilistic Index Access Scheduling
Index Access Scheduling [VLDB '06]

SA scheduling:
• Look-ahead Δi through precomputed score histograms (e.g., Δ1,3 = 0.8 vs. Δ3,3 = 0.2 after three more steps on lists 1 and 3)
• Knapsack-based optimization of the expected score reduction

RA scheduling:
• 2-phase probing: schedule RAs "late & last", i.e., clean up the candidate queue if ...

• Extended probabilistic cost model for integrating SA & RA scheduling

[Figure: inverted block index with per-list score columns, e.g. (1.0, 0.9, 0.8, 0.8), (1.0, 0.9, 0.9, 0.2), (1.0, 0.9, 0.7, 0.6), illustrating how different lists yield different look-ahead score reductions Δi.]
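The knapsack view of SA scheduling can be sketched as a small dynamic program: given, for each list, the expected score-bound reduction after j more sequential steps (precomputed from score histograms), choose a batch size per list that maximizes the total expected reduction under a step budget. This is an assumed simplification of the idea, not TopX internals, and the Δ values are illustrative.

```python
# Hedged sketch of knapsack-based SA scheduling over expected score
# reductions; delta[i][j] = expected bound reduction of list i after
# j more sequential steps (delta[i][0] == 0.0).
def schedule_sa(delta, budget):
    """Return (best total reduction, batch size per list) within budget."""
    best = {0: (0.0, [])}            # used steps -> (reduction, batches)
    for row in delta:
        nxt = {}
        for b, (red, sizes) in best.items():
            for j, d in enumerate(row):
                nb = b + j
                if nb > budget:
                    break            # rows are indexed by step count
                cand = (red + d, sizes + [j])
                if nb not in nxt or cand[0] > nxt[nb][0]:
                    nxt[nb] = cand
        best = nxt
    return max(best.values())
```

With a budget of 2 steps and reductions [[0.0, 0.3, 0.8], [0.0, 0.1, 0.15], [0.0, 0.1, 0.2]], the optimizer spends both steps on list 1 (reduction 0.8) rather than spreading them, mirroring the Δ1,3 vs. Δ3,3 contrast on the slide.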
Part 3: Probabilistic Candidate Pruning
Probabilistic Candidate Pruning [VLDB '04]

Indexing time: sample the score distribution of each index list into a histogram, e.g.:

title["native"]:
  eid | ... | max-score
  216 | ... | 0.9
  72  | ... | 0.8
  51  | ... | 0.5

par["retrieval"]:
  eid | ... | max-score
  3   | ... | 1.0
  28  | ... | 0.8
  182 | ... | 0.75

Query processing time: convolutions of the score distributions (assuming independence) yield the probability that a candidate d can still make it into the top-k:

  P[d gets in the final top-k] = P[worstscore(d) + ∑_{i ∉ E(d)} Si > min-k]

where Si is the random variable for d's still-unseen score in list i.

Probabilistic candidate pruning: drop d from the candidate queue if P[d gets in the final top-k] < ε; this gives probabilistic guarantees for precision & recall.
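The pruning test above can be sketched with fixed-width score histograms: convolve the histograms of the lists where d is still unseen, then sum the probability mass above the remaining gap min-k minus worstscore(d). Bucket granularity and the independence assumption follow the slide; the concrete numbers below are made up.

```python
# Sketch of the probabilistic pruning test via histogram convolution.
def convolve(h1, h2):
    """h1, h2: probability mass over fixed-width score buckets 0..len-1."""
    out = [0.0] * (len(h1) + len(h2) - 1)
    for i, p in enumerate(h1):
        for j, q in enumerate(h2):
            out[i + j] += p * q      # independence assumption
    return out

def prob_in_topk(worst, unseen_hists, min_k, bucket_width):
    """Estimate P[worst + sum of unseen scores > min_k]."""
    total = [1.0]                    # point mass at score 0
    for h in unseen_hists:
        total = convolve(total, h)
    need = min_k - worst             # gap the unseen scores must close
    return sum(p for i, p in enumerate(total) if i * bucket_width > need)
```

For example, with bucket width 0.1 and one unseen list whose histogram is [0.5, 0.5] (score 0.0 or 0.1, equally likely), a candidate with worstscore 1.0 against min-k = 1.05 survives with probability 0.5; a candidate would be dropped once this probability falls below ε.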
Part 4: Dynamic Query Expansion
Dynamic Query Expansion [SIGIR '05]

• Incremental merging of inverted lists for expansions ti,1 ... ti,m in descending order of s(ti,j, d)
• Best-match score aggregation
• Specialized expansion operators:
  – Incremental Merge operator
  – Nested Top-k operator (efficient phrase matching)
• Boolean (but ranked) retrieval mode
• Supports any sorted inverted index for text, structured records & XML

[Figure: TREC Robust topic #363, query Top-k(transport, tunnel, ~disaster). Nested top-k operators scan the lists for "transport" (d66, d93, d95, ..., d101) and "tunnel" (d95, d17, d11, ..., d99) by SA, while an Incremental Merge operator expands ~disaster over the lists for "disaster" (d42, d11, d92, ..., d21), "accident" (d78, d10, d11, ..., d1), and "fire" (d37, d42, d32, ..., d87), emitting d42, d11, d92, d37, ...]
Incremental Merge Operator

Expansion ~t = {t1, t2, t3} with similarities derived from large-corpus term correlations (thesaurus lookups / relevance feedback):
  sim(t, t1) = 1.0, sim(t, t2) = 0.9, sim(t, t3) = 0.5

Index lists, scores scaled by sim on the fly (initial high-scores come from index-list metadata, e.g. histograms):
  t1: d78:0.9  d23:0.8  d10:0.8  d1:0.4   d88:0.3  ...
  t2: d64:0.8  d23:0.8  d10:0.7  d12:0.2  d78:0.1  ...  (× 0.9 → 0.72, 0.72, 0.63, 0.18, 0.09)
  t3: d11:0.9  d78:0.9  d64:0.7  d99:0.7  d34:0.6  ...  (× 0.5 → 0.45, 0.45, 0.35, 0.35, 0.3)

Merged output for ~t (SA, in descending combined score):
  d78:0.9  d23:0.8  d10:0.8  d64:0.72  d23:0.72  d10:0.63  d11:0.45  d78:0.45  d1:0.4  ...

Meta histograms seamlessly integrate Incremental Merge operators into probabilistic scheduling and candidate pruning.
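The merge itself can be sketched with a heap holding one cursor per expansion list: each list's scores are scaled by sim(t, ti) on the fly, and the next globally best entry is produced lazily, without materializing the full expansion. The lists and similarities below are the slide's example; the generator is an illustrative sketch, not TopX code.

```python
# Sketch of the Incremental Merge operator as a lazy heap-based merge.
import heapq

def incremental_merge(lists, sims):
    """lists[i]: [(doc, score), ...] sorted by descending score;
    sims[i]: similarity weight applied to list i's scores."""
    heap = []
    for i, lst in enumerate(lists):
        if lst:
            doc, score = lst[0]
            # negate for max-heap behaviour; (list, position) breaks ties
            heapq.heappush(heap, (-(score * sims[i]), i, 0, doc))
    while heap:
        neg, i, pos, doc = heapq.heappop(heap)
        yield doc, -neg                      # next-best scaled entry
        if pos + 1 < len(lists[i]):          # advance this list's cursor
            ndoc, nscore = lists[i][pos + 1]
            heapq.heappush(heap, (-(nscore * sims[i]), i, pos + 1, ndoc))

t1 = [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.4), ("d88", 0.3)]
t2 = [("d64", 0.8), ("d23", 0.8), ("d10", 0.7), ("d12", 0.2), ("d78", 0.1)]
t3 = [("d11", 0.9), ("d78", 0.9), ("d64", 0.7), ("d99", 0.7), ("d34", 0.6)]
sims = [1.0, 0.9, 0.5]
merged = list(incremental_merge([t1, t2, t3], sims))
```

On this input the operator emits d78:0.9, d23:0.8, d10:0.8, d64:0.72, d23:0.72, d10:0.63, ..., matching the slide's merged list; a consumer such as a nested top-k operator pulls only as many entries as its threshold test requires.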
Part 5: Experiments: TREC & INEX Benchmarks
TREC Terabyte Benchmark '05/'06

• Extensive crawl of the .gov domain (2004): 25 million documents, 426 GB of text data
• 50 ad-hoc-style keyword queries, e.g. "reintroduction of gray wolves", "Massachusetts textile mills"
• Primary cost metrics: Cost = #SA + (cR/cS) · #RA, and wall-clock runtime

[Figures: TREC Terabyte cost comparison of scheduling strategies [VLDB '06]; TREC Terabyte wall-clock runtimes [VLDB '06 / TREC '06].]
INEX Benchmark '06/'07

• New XML-ified Wikipedia corpus: 660,000 documents with 130,000,000 elements (6.6 GB of XML data)
• 125 NEXI queries, each in a content-only (CO) and a content-and-structure (CAS) formulation:
  CO:  +"state machine" figure Mealy Moore
  CAS: //article[about(., "state machine")]//figure[about(., Mealy) or about(., Moore)]
• Primary cost metric: Cost = #SA + (cR/cS) · #RA
TopX vs. Full-Merge

[Figure: cost (in millions of accesses) vs. k ∈ {10, 20, 50, 100, 500, 1,000} for CAS and CO queries, comparing Full-Merge against TopX with ε = 0.0 and ε = 0.1.]

Significant cost savings over large ranges of k; CAS is cheaper than CO!
Static vs. Dynamic Expansions

• Query expansions with up to m = 292 keywords & phrases
• Balanced amount of sorted vs. random disk accesses
• Adaptive scheduling w.r.t. the cR/cS cost ratio
• Dynamic expansions outperform static expansions & full-merge in both efficiency & effectiveness

[Figure: #SA and #RA (in millions) for CAS Full-Merge vs. CAS TopX with static expansion vs. CAS TopX with dynamic expansion.]
Efficiency vs. Effectiveness

[Figure: relative precision and relative cost (0.0 to 1.0) for CAS and CO queries as the pruning threshold ε varies from 0.0 to 1.0.]

Very good precision/runtime ratio for probabilistic pruning.
Official INEX '06 Results

Retrieval effectiveness: ranks 3-5 out of ~60 submitted runs.

Conclusions & Outlook

• Scalable XML-IR and vague search
• Mature system; reference engine for INEX topic development & interactive tracks
• Efficient and versatile Java prototype for text, XML, and structured data (Oracle backend)
• Very efficient prototype reimplementation for text data in C++ (over its own file structures); a C++ version for XML is currently in production at MPI
• More features: graph top-k, proximity search, XQuery subset, ...