TRANSCRIPT
Martin Theobald
Max Planck Institute for Computer Science / Stanford University
Joint work with Ralf Schenkel, Gerhard Weikum

TopX: Efficient & Versatile Top-k Query Processing for Semistructured Data
Example: XPath-like full-text search (NEXI-style "about" predicates) over heterogeneous, non-schematic XML documents containing passages such as:

"Native XML data base systems can store schemaless data ..."
"Data management systems control data acquisition, storage, and retrieval. Systems evolved from flat files ..."
"What does XML add for retrieval? It adds formal ways ..."

Query:

//article[.//bib[about(.//item, "W3C")]]
  //sec[about(.//, "XML retrieval")]
    //par[about(.//, "native XML databases")]

i.e., find paragraphs about "native XML databases" inside sections about "XML retrieval", within articles whose bibliography contains an item mentioning "W3C".

[Figure: two sample article trees (article, sec, par, bib, item, title, url nodes) with the elements matching the query highlighted.]
RANKING / VAGUENESS / PRUNING

Goal: Efficiently retrieve the best (top-k) results of a similarity query

• Extend existing threshold algorithms for inverted lists [Güntzer, Balke & Kießling, VLDB '00; Fagin, PODS '01] to XML data and XPath-like full-text search
• Support non-schematic, heterogeneous data sources
• Efficiently support IR-style vague search
• Combined inverted index for content & structure
• Avoid full index scans; postpone expensive random accesses to large disk-resident data structures
• Exploit cheap disk space for redundant index structures
XML-IR: History and Related Work

IR on structured docs (SGML), ca. 1995: OED etc. (U Waterloo), HySpirit (U Dortmund), HyperStorM (GMD Darmstadt), WHIRL (CMU)

Web query languages: Lorel (Stanford U), Araneus (U Roma), W3QS (Technion Haifa), WebSQL (U Toronto)

XML query languages: XML-QL (AT&T Labs), XPath 1.0 (W3C), XPath 2.0 (W3C), XQuery 1.0 (W3C), XPath 2.0 & XQuery 1.0 Full-Text (W3C), TeXQuery (AT&T Labs), NEXI (INEX benchmark)

IR on XML, ca. 2000-2005: XIRQL & HyRex (U Dortmund), XXL & TopX (U Saarland / MPII), ApproXQL (U Berlin / U Munich), ELIXIR (U Dublin), JuruXML (IBM Haifa), XSearch (Hebrew U), Timber (U Michigan), XRank & Quark (Cornell U), FleXPath (AT&T Labs), XKeyword (UCSD)

Commercial software: MarkLogic, Verity?, IBM?, Oracle?, ...
[Figure: TopX system architecture.
Frontends: Web interface, Web service, API.
Indexer/Crawler feeding a DBMS / inverted lists over a unified text & XML schema; index metadata (selectivities, histograms, correlations); ontology/large thesaurus (WordNet, OpenCyc, etc.).
TopX query processor at query processing time: scan threads performing sequential accesses (SA) on the index lists, scheduled random accesses (RA) for auxiliary predicates, a candidate queue and candidate cache, and the top-k queue.]
Outline:
1. Top-k XPath Processing
2. Probabilistic Index Access Scheduling
3. Probabilistic Candidate Pruning
4. Dynamic Query Expansion
5. Experiments: TREC & INEX Benchmarks
Data Model

• XML trees (no XLinks or ID/IDref attributes)
• Pre-/postorder node labels
• Redundant full-content text nodes (with stemming, no stopwords)

<article>
  <title>XML Data Management</title>
  <abs>XML management systems vary widely in their expressive power.</abs>
  <sec>
    <title>Native XML Data Bases.</title>
    <par>Native XML data base systems can store schemaless data.</par>
  </sec>
</article>

Node labels (pre, post): article (1, 6), title (2, 1), abs (3, 2), sec (4, 5), title (5, 3), par (6, 4)

Full-content of article1: "xml data manage xml manage system vary wide expressive power native xml data base native xml data base system store schemaless data"
Full-content of sec4: "native xml data base native xml data base system store schemaless data"

Full-content term frequencies: ftf("xml", article1) = 4, ftf("xml", sec4) = 2
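The data model above can be sketched in a few lines: a depth-first traversal assigns the pre-/postorder labels, and the full-content term frequency ftf counts a term over the concatenated text of an element's whole subtree. This is an illustrative sketch, not TopX code; the class and function names are made up for the example.

```python
# Illustrative sketch: pre-/postorder labeling of an XML element tree and
# full-content term frequencies (ftf) over whole subtrees.
from collections import Counter

class Elem:
    def __init__(self, tag, text="", children=None):
        self.tag, self.text = tag, text
        self.children = children or []
        self.pre = self.post = None

def label(root):
    pre_ctr, post_ctr = [0], [0]
    def dfs(e):
        pre_ctr[0] += 1
        e.pre = pre_ctr[0]          # preorder rank: assigned on entry
        for c in e.children:
            dfs(c)
        post_ctr[0] += 1
        e.post = post_ctr[0]        # postorder rank: assigned on exit
    dfs(root)

def full_content(e):
    """Concatenated (already stemmed, stopword-free) text of e's subtree."""
    parts = [e.text] if e.text else []
    parts += [full_content(c) for c in e.children]
    return " ".join(parts)

def ftf(term, e):
    return Counter(full_content(e).split())[term]

# The sample document from the slide, with terms already stemmed:
doc = Elem("article", children=[
    Elem("title", "xml data manage"),
    Elem("abs", "xml manage system vary wide expressive power"),
    Elem("sec", children=[
        Elem("title", "native xml data base"),
        Elem("par", "native xml data base system store schemaless data"),
    ]),
])
label(doc)
```

Running this reproduces the slide's labels, e.g. article (1, 6) and sec (4, 5), and the frequencies ftf("xml", article1) = 4 and ftf("xml", sec4) = 2.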
Scoring Model [INEX '06/'07]

• XML-specific extension to Okapi BM25 (originating from probabilistic IR on unstructured text)
• ftf instead of tf; ef (element frequency) instead of df
• Element-type-specific length normalization
• Tunable parameters k1 and b
• Example: bib["transactions"] vs. par["transactions"] are scored against different element-type statistics
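To make the bullets concrete, here is a sketch of a BM25-style element score in the spirit described above: ftf replaces tf, element frequency ef replaces df, and lengths are normalized per element type. The exact constants and formula details in the TopX papers may differ; this uses the standard BM25 form as an assumption.

```python
# Hedged sketch of an Okapi-BM25-style score for one (tag, term) condition
# on one element; constants and exact form are illustrative assumptions.
import math

def bm25_element_score(ftf, ef, n_elems, elem_len, avg_len, k1=1.2, b=0.7):
    """ftf      -- full-content term frequency of the term in the element
    ef       -- element frequency: number of elements of this tag type
                containing the term
    n_elems  -- total number of elements of this tag type in the corpus
    elem_len -- full-content length of this element (in terms)
    avg_len  -- average full-content length over this tag type"""
    K = k1 * ((1 - b) + b * elem_len / avg_len)   # per-tag-type length norm
    idf = math.log((n_elems - ef + 0.5) / (ef + 0.5))
    return (k1 + 1) * ftf / (K + ftf) * idf
```

Because n_elems and avg_len are kept per tag type, bib["transactions"] and par["transactions"] naturally receive different scores even for identical ftf values.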
A naive "merge-then-sort" approach requires between O(mn) and O(mn²) runtime and O(mn) access cost.
Fagin's NRA [PODS '01] at a Glance

Corpus: d1, ..., dn; query q = (t1, t2, t3); here k = 1.
Find the top-k documents that maximize s(t1, dj) + s(t2, dj) + ... + s(tm, dj), using non-conjunctive ("andish") evaluation.

Inverted index, each list sorted by descending score (e.g. s(t1, d10) = 0.8, s(t2, d10) = 0.6, s(t3, d10) = 0.7):
  t1: d78:0.9  d23:0.8  d10:0.8  d1:0.7  d88:0.2 ...
  t2: d64:0.8  d23:0.6  d10:0.6  d78:0.1 ...
  t3: d10:0.7  d78:0.5  d64:0.4  d99:0.2  d34:0.1 ...

Scan depth 1:
  Rank | Doc | Worst-score | Best-score
  1    | d78 | 0.9         | 2.4
  2    | d64 | 0.8         | 2.4
  3    | d10 | 0.7         | 2.4

Scan depth 2:
  1    | d78 | 1.4         | 2.0
  2    | d23 | 1.4         | 1.9
  3    | d64 | 0.8         | 2.1
  4    | d10 | 0.7         | 2.1

Scan depth 3:
  1    | d10 | 2.1         | 2.1
  2    | d78 | 1.4         | 2.0
  3    | d23 | 1.4         | 1.8
  4    | d64 | 1.2         | 2.0
  STOP! No candidate's best-score can exceed the top-1 worst-score of 2.1.

1.  NRA(q, L):
2.  scan all lists Li (i = 1..m) in parallel & consider doc d at position posi:
3.    E(d) := E(d) ∪ {i};
4.    highi := s(ti, d);
5.    worstscore(d) := ∑_{i ∈ E(d)} s(ti, d);
6.    bestscore(d) := worstscore(d) + ∑_{i ∉ E(d)} highi;
7.    if worstscore(d) > min-k then
8.      add d to top-k
9.      min-k := min{worstscore(d') | d' ∈ top-k};
10.   else if bestscore(d) > min-k then
11.     candidates := candidates ∪ {d};
12.   if max{bestscore(d') | d' ∈ candidates} ≤ min-k then
13.     return top-k;
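The NRA pseudocode above can be run end-to-end on the slide's example. The sketch below is a naive in-memory version (no disk, round-robin sorted access only); the deeper list entries are reconstructed from the figure and are illustrative.

```python
# Runnable sketch of Fagin's NRA on the slide's example lists (k = 1).
def nra(lists, k):
    """lists maps each query term to [(doc, score), ...] sorted by
    descending score; returns the top-k (doc, worstscore) pairs."""
    terms = list(lists)
    m = len(terms)
    seen = {}                                   # doc -> {list index: score}
    high = [lists[t][0][1] for t in terms]      # last score read per list
    max_depth = max(len(lists[t]) for t in terms)
    for depth in range(max_depth):
        for i, t in enumerate(terms):           # one round of SAs
            if depth < len(lists[t]):
                doc, score = lists[t][depth]
                high[i] = score
                seen.setdefault(doc, {})[i] = score
        bounds = {}
        for doc, scores in seen.items():
            worst = sum(scores.values())
            best = worst + sum(high[i] for i in range(m) if i not in scores)
            bounds[doc] = (worst, best)
        topk = sorted(bounds, key=lambda d: bounds[d][0], reverse=True)[:k]
        min_k = bounds[topk[-1]][0] if len(topk) == k else 0.0
        # threshold test: nobody outside the top-k, seen or entirely
        # unseen, can still exceed min-k
        if (all(bounds[d][1] <= min_k for d in bounds if d not in topk)
                and sum(high) <= min_k):
            break
    return [(d, bounds[d][0]) for d in topk]

# The slide's example; entries beyond scan depth 3 are reconstructed:
lists = {
    "t1": [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)],
    "t2": [("d64", 0.8), ("d23", 0.6), ("d10", 0.6), ("d78", 0.1)],
    "t3": [("d10", 0.7), ("d78", 0.5), ("d64", 0.4), ("d99", 0.2)],
}
top1 = nra(lists, k=1)
```

On this input the scan stops at depth 3 and returns d10 with worst-score 2.1, exactly as in the slide's tables.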
Inverted Block-Index for Content & Structure

• Mostly sorted (= sequential) access to large element blocks on disk
• Group elements in descending order of (max-score, docid)
• Block-scan all elements per doc for a given (tag, term) key
• Stored as inverted files or database tables
• Two B+-tree indexes over the full range of attributes (IOTs in Oracle)

Query: //sec[about(.//, "XML") and about(.//title, "native")]//par[about(.//, "retrieval")]

sec["xml"] (SA):
  eid | docid | score | pre | post | max-score
  46  | 2     | 0.9   | 2   | 15   | 0.9
  9   | 2     | 0.5   | 10  | 8    | 0.9
  171 | 5     | 0.85  | 1   | 20   | 0.85
  84  | 3     | 0.1   | 1   | 12   | 0.1

title["native"] (SA):
  eid | docid | score | pre | post | max-score
  216 | 17    | 0.9   | 2   | 15   | 0.9
  72  | 3     | 0.8   | 14  | 10   | 0.8
  51  | 2     | 0.5   | 4   | 12   | 0.5
  671 | 31    | 0.4   | 12  | 23   | 0.4

par["retrieval"] (SA):
  eid | docid | score | pre | post | max-score
  3   | 1     | 1.0   | 1   | 21   | 1.0
  28  | 2     | 0.8   | 8   | 14   | 0.8
  182 | 5     | 0.75  | 3   | 7    | 0.75
  96  | 4     | 0.75  | 6   | 4    | 0.75

(Each list is scanned sequentially; random accesses (RA) resolve individual entries when needed.)
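The block layout described above can be sketched as follows: one inverted list per (tag, term) key, documents ordered by descending per-document max-score, and all matching elements of a document kept together as one block so a single sequential access scans them all. The function and tuple layout are illustrative assumptions, not the TopX storage format.

```python
# Sketch (assumed layout, following the slide): build per-(tag, term)
# lists whose blocks are ordered by descending per-document max-score.
from collections import defaultdict

def build_block_index(postings):
    """postings: iterable of (tag, term, eid, docid, score, pre, post)."""
    lists = defaultdict(list)
    for tag, term, eid, docid, score, pre, post in postings:
        lists[(tag, term)].append((eid, docid, score, pre, post))
    index = {}
    for key, entries in lists.items():
        # the per-document max-score decides where the whole block goes
        maxscore = defaultdict(float)
        for eid, docid, score, pre, post in entries:
            maxscore[docid] = max(maxscore[docid], score)
        docs = sorted(maxscore, key=lambda d: (-maxscore[d], d))
        index[key] = [
            (d, maxscore[d],
             sorted((e for e in entries if e[1] == d), key=lambda e: -e[2]))
            for d in docs
        ]
    return index

# The sec["xml"] list from the slide:
postings = [
    ("sec", "xml", 46, 2, 0.9, 2, 15),
    ("sec", "xml", 9, 2, 0.5, 10, 8),
    ("sec", "xml", 171, 5, 0.85, 1, 20),
    ("sec", "xml", 84, 3, 0.1, 1, 12),
]
blocks = build_block_index(postings)[("sec", "xml")]
```

The resulting block order is doc 2 (max-score 0.9, elements 46 and 9 together), then doc 5 (0.85), then doc 3 (0.1), matching the slide's table.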
Navigational Element Index

• Additional index for tag paths
• RAs on a B+-tree index using (docid, tag) as key
• Few & judiciously scheduled "expensive predicate" probes
• Schema-oblivious indexing & querying: non-schematic XML data (no DTD required)
• Supports full NEXI syntax & all 13 XPath axes (+ level)

Query: //sec[about(.//title, "native")]//par[about(.//, "retrieval")]

sec (navigational, resolved by RA):
  eid | docid | pre | post
  46  | 2     | 2   | 15
  9   | 2     | 10  | 8
  171 | 5     | 1   | 20
  84  | 3     | 1   | 12

The scored conditions title["native"] and par["retrieval"] are scanned sequentially (SA) as before; the unscored sec tag condition is probed by RAs on the navigational index.
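The reason the pre/post columns suffice for structural predicates: with pre-/postorder labels, the descendant axis reduces to two integer comparisons, so the navigational probes never traverse the tree. A minimal sketch (tuple layout is an assumption for illustration):

```python
# u is an ancestor of v  iff  same doc, pre(u) < pre(v), post(u) > post(v).
def is_ancestor(anc, desc):
    """anc, desc: (docid, pre, post) triples from the element index."""
    return (anc[0] == desc[0]        # same document
            and anc[1] < desc[1]     # anc is opened before desc
            and anc[2] > desc[2])    # anc is closed after desc

# From the slide's tables: the sec element (docid 2, pre 2, post 15)
# contains the element with (docid 2, pre 10, post 8).
inside = is_ancestor((2, 2, 15), (2, 10, 8))
```

This test is what lets structural joins between, e.g., sec and par matches of the same document be evaluated directly on index entries.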
TopX Query Processing Example

Query: //sec[about(.//, "XML") and about(.//title, "native")]//par[about(.//, "retrieval")], k = 2

sec["xml"]:
  eid | docid | score | pre | post
  46  | 2     | 0.9   | 2   | 15
  9   | 2     | 0.5   | 10  | 8
  171 | 5     | 0.85  | 1   | 20
  84  | 3     | 0.1   | 1   | 12

title["native"]:
  eid | docid | score | pre | post
  216 | 17    | 0.9   | 2   | 15
  72  | 3     | 0.8   | 14  | 10
  51  | 2     | 0.5   | 4   | 12
  671 | 31    | 0.4   | 12  | 23

par["retrieval"]:
  eid | docid | score | pre | post
  3   | 1     | 1.0   | 1   | 21
  28  | 2     | 0.8   | 8   | 14
  182 | 5     | 0.75  | 3   | 7
  96  | 4     | 0.75  | 6   | 4

[Figure: round-robin scans over the three lists. Each scanned element is joined into a per-document candidate with worst- and best-score bounds, e.g. element 46 of doc 2 starts at worst = 0.9, best = 2.9; a "pseudo-doc" tracks the best-score bound for entirely unseen documents (2.9, then 2.8, 2.75, 2.65, ...). As elements 216 (doc 17), 3 (doc 1), 171/182 (doc 5), 72/84 (doc 3), and 28/51 (doc 2) are scanned, the bounds tighten and structural joins via pre/post containment combine sec, title, and par matches within a document, e.g. doc 2 reaches worst = 2.2 via elements 46, 28, 51 and doc 5 reaches worst = 1.6 via 171, 182. The candidate-queue threshold rises min-2 = 0.0 → 0.5 → 0.9 → 1.0 → 1.6, pruning candidates whose best-score drops below it, until the top-2 results remain.]
Part 2: Probabilistic Index Access Scheduling
Index Access Scheduling [VLDB '06]

SA scheduling:
• Look-ahead Δi through precomputed score histograms (e.g., Δ1,3 = 0.8 vs. Δ3,3 = 0.2 after three more steps on lists 1 and 3)
• Knapsack-based optimization of the expected score reduction

RA scheduling:
• 2-phase probing: schedule RAs "late & last", i.e., clean up the candidate queue if ...

• Extended probabilistic cost model for integrating SA & RA scheduling

[Figure: inverted block index with per-list score columns, e.g. (1.0, 0.9, 0.8, 0.8), (1.0, 0.9, 0.9, 0.2), (1.0, 0.9, 0.7, 0.6), illustrating how different lists yield different look-ahead score reductions Δi.]
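The knapsack view of SA scheduling can be sketched as a small dynamic program: given, for each list, the expected score-bound reduction after j more sequential steps (precomputed from score histograms), choose a batch size per list that maximizes the total expected reduction under a step budget. This is an assumed simplification of the idea, not TopX internals, and the Δ values are illustrative.

```python
# Hedged sketch of knapsack-based SA scheduling over expected score
# reductions; delta[i][j] = expected bound reduction of list i after
# j more sequential steps (delta[i][0] == 0.0).
def schedule_sa(delta, budget):
    """Return (best total reduction, batch size per list) within budget."""
    best = {0: (0.0, [])}            # used steps -> (reduction, batches)
    for row in delta:
        nxt = {}
        for b, (red, sizes) in best.items():
            for j, d in enumerate(row):
                nb = b + j
                if nb > budget:
                    break            # rows are indexed by step count
                cand = (red + d, sizes + [j])
                if nb not in nxt or cand[0] > nxt[nb][0]:
                    nxt[nb] = cand
        best = nxt
    return max(best.values())
```

With a budget of 2 steps and reductions [[0.0, 0.3, 0.8], [0.0, 0.1, 0.15], [0.0, 0.1, 0.2]], the optimizer spends both steps on list 1 (reduction 0.8) rather than spreading them, mirroring the Δ1,3 vs. Δ3,3 contrast on the slide.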
Part 3: Probabilistic Candidate Pruning
Probabilistic Candidate Pruning [VLDB '04]

Indexing time: sample the score distribution of each index list into a histogram, e.g.:

title["native"]:
  eid | ... | max-score
  216 | ... | 0.9
  72  | ... | 0.8
  51  | ... | 0.5

par["retrieval"]:
  eid | ... | max-score
  3   | ... | 1.0
  28  | ... | 0.8
  182 | ... | 0.75

Query processing time: convolutions of the score distributions (assuming independence) yield the probability that a candidate d can still make it into the top-k:

  P[d gets in the final top-k] = P[worstscore(d) + ∑_{i ∉ E(d)} Si > min-k]

where Si is the random variable for d's still-unseen score in list i.

Probabilistic candidate pruning: drop d from the candidate queue if P[d gets in the final top-k] < ε; this gives probabilistic guarantees for precision & recall.
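The pruning test above can be sketched with fixed-width score histograms: convolve the histograms of the lists where d is still unseen, then sum the probability mass above the remaining gap min-k minus worstscore(d). Bucket granularity and the independence assumption follow the slide; the concrete numbers below are made up.

```python
# Sketch of the probabilistic pruning test via histogram convolution.
def convolve(h1, h2):
    """h1, h2: probability mass over fixed-width score buckets 0..len-1."""
    out = [0.0] * (len(h1) + len(h2) - 1)
    for i, p in enumerate(h1):
        for j, q in enumerate(h2):
            out[i + j] += p * q      # independence assumption
    return out

def prob_in_topk(worst, unseen_hists, min_k, bucket_width):
    """Estimate P[worst + sum of unseen scores > min_k]."""
    total = [1.0]                    # point mass at score 0
    for h in unseen_hists:
        total = convolve(total, h)
    need = min_k - worst             # gap the unseen scores must close
    return sum(p for i, p in enumerate(total) if i * bucket_width > need)
```

For example, with bucket width 0.1 and one unseen list whose histogram is [0.5, 0.5] (score 0.0 or 0.1, equally likely), a candidate with worstscore 1.0 against min-k = 1.05 survives with probability 0.5; a candidate would be dropped once this probability falls below ε.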
Part 4: Dynamic Query Expansion
Dynamic Query Expansion [SIGIR '05]

• Incremental merging of inverted lists for expansions ti,1 ... ti,m in descending order of s(ti,j, d)
• Best-match score aggregation
• Specialized expansion operators:
  – Incremental Merge operator
  – Nested Top-k operator (efficient phrase matching)
• Boolean (but ranked) retrieval mode
• Supports any sorted inverted index for text, structured records & XML

[Figure: TREC Robust topic #363, query Top-k(transport, tunnel, ~disaster). Nested top-k operators scan the lists for "transport" (d66, d93, d95, ..., d101) and "tunnel" (d95, d17, d11, ..., d99) by SA, while an Incremental Merge operator expands ~disaster over the lists for "disaster" (d42, d11, d92, ..., d21), "accident" (d78, d10, d11, ..., d1), and "fire" (d37, d42, d32, ..., d87), emitting d42, d11, d92, d37, ...]
Incremental Merge Operator

Expansion ~t = {t1, t2, t3} with similarities derived from large-corpus term correlations (thesaurus lookups / relevance feedback):
  sim(t, t1) = 1.0, sim(t, t2) = 0.9, sim(t, t3) = 0.5

Index lists, scores scaled by sim on the fly (initial high-scores come from index-list metadata, e.g. histograms):
  t1: d78:0.9  d23:0.8  d10:0.8  d1:0.4   d88:0.3  ...
  t2: d64:0.8  d23:0.8  d10:0.7  d12:0.2  d78:0.1  ...  (× 0.9 → 0.72, 0.72, 0.63, 0.18, 0.09)
  t3: d11:0.9  d78:0.9  d64:0.7  d99:0.7  d34:0.6  ...  (× 0.5 → 0.45, 0.45, 0.35, 0.35, 0.3)

Merged output for ~t (SA, in descending combined score):
  d78:0.9  d23:0.8  d10:0.8  d64:0.72  d23:0.72  d10:0.63  d11:0.45  d78:0.45  d1:0.4  ...

Meta histograms seamlessly integrate Incremental Merge operators into probabilistic scheduling and candidate pruning.
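The merge itself can be sketched with a heap holding one cursor per expansion list: each list's scores are scaled by sim(t, ti) on the fly, and the next globally best entry is produced lazily, without materializing the full expansion. The lists and similarities below are the slide's example; the generator is an illustrative sketch, not TopX code.

```python
# Sketch of the Incremental Merge operator as a lazy heap-based merge.
import heapq

def incremental_merge(lists, sims):
    """lists[i]: [(doc, score), ...] sorted by descending score;
    sims[i]: similarity weight applied to list i's scores."""
    heap = []
    for i, lst in enumerate(lists):
        if lst:
            doc, score = lst[0]
            # negate for max-heap behaviour; (list, position) breaks ties
            heapq.heappush(heap, (-(score * sims[i]), i, 0, doc))
    while heap:
        neg, i, pos, doc = heapq.heappop(heap)
        yield doc, -neg                      # next-best scaled entry
        if pos + 1 < len(lists[i]):          # advance this list's cursor
            ndoc, nscore = lists[i][pos + 1]
            heapq.heappush(heap, (-(nscore * sims[i]), i, pos + 1, ndoc))

t1 = [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.4), ("d88", 0.3)]
t2 = [("d64", 0.8), ("d23", 0.8), ("d10", 0.7), ("d12", 0.2), ("d78", 0.1)]
t3 = [("d11", 0.9), ("d78", 0.9), ("d64", 0.7), ("d99", 0.7), ("d34", 0.6)]
sims = [1.0, 0.9, 0.5]
merged = list(incremental_merge([t1, t2, t3], sims))
```

On this input the operator emits d78:0.9, d23:0.8, d10:0.8, d64:0.72, d23:0.72, d10:0.63, ..., matching the slide's merged list; a consumer such as a nested top-k operator pulls only as many entries as its threshold test requires.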
Part 5: Experiments: TREC & INEX Benchmarks
TREC Terabyte Benchmark '05/'06

• Extensive crawl of the .gov domain (2004): 25 million documents, 426 GB of text data
• 50 ad-hoc-style keyword queries, e.g. "reintroduction of gray wolves", "Massachusetts textile mills"
• Primary cost metrics: Cost = #SA + (cR/cS) · #RA, and wall-clock runtime

[Figures: TREC Terabyte cost comparison of scheduling strategies [VLDB '06]; TREC Terabyte wall-clock runtimes [VLDB '06 / TREC '06].]
INEX Benchmark '06/'07

• New XML-ified Wikipedia corpus: 660,000 documents with 130,000,000 elements (6.6 GB of XML data)
• 125 NEXI queries, each in a content-only (CO) and a content-and-structure (CAS) formulation:
  CO:  +"state machine" figure Mealy Moore
  CAS: //article[about(., "state machine")]//figure[about(., Mealy) or about(., Moore)]
• Primary cost metric: Cost = #SA + (cR/cS) · #RA
TopX vs. Full-Merge

[Figure: cost (in millions of accesses) vs. k ∈ {10, 20, 50, 100, 500, 1,000} for CAS and CO queries, comparing Full-Merge against TopX with ε = 0.0 and ε = 0.1.]

Significant cost savings over large ranges of k; CAS is cheaper than CO!
Static vs. Dynamic Expansions

• Query expansions with up to m = 292 keywords & phrases
• Balanced amount of sorted vs. random disk accesses
• Adaptive scheduling w.r.t. the cR/cS cost ratio
• Dynamic expansions outperform static expansions & full-merge in both efficiency & effectiveness

[Figure: #SA and #RA (in millions) for CAS Full-Merge vs. CAS TopX with static expansion vs. CAS TopX with dynamic expansion.]
Efficiency vs. Effectiveness

[Figure: relative precision and relative cost (0.0 to 1.0) for CAS and CO queries as the pruning threshold ε varies from 0.0 to 1.0.]

Very good precision/runtime ratio for probabilistic pruning.
Official INEX '06 Results

Retrieval effectiveness: ranks 3-5 out of ~60 submitted runs.

Conclusions & Outlook

• Scalable XML-IR and vague search
• Mature system; reference engine for INEX topic development & interactive tracks
• Efficient and versatile Java prototype for text, XML, and structured data (Oracle backend)
• Very efficient prototype reimplementation for text data in C++ (over its own file structures); a C++ version for XML is currently in production at MPI
• More features: graph top-k, proximity search, XQuery subset, ...