
XML Prefiltering as a String Matching Problem

Christoph Koch (1), Stefanie Scherzinger (2), Michael Schmidt (3)

(1) Cornell University, (2) IBM Boeblingen, (3) Freiburg University

24th International Conference on Data Engineering

April 9, Cancun (Mexico), 2008

Motivation

XML data is often processed ad hoc, e.g. in streaming scenarios and by main-memory-based processors

Low main memory consumption then becomes the key prerequisite for performance

XML prefiltering is an established technique that aims at decreasing main memory consumption

We present a novel approach to XML prefiltering based on efficient string matching techniques

XML Prefiltering

Buffer only data that is relevant to query evaluation

Prefiltering/Projection

Static analysis of the XQuery/XPath expression

Identify parts of the input document that are relevant to query evaluation

Discard parts of the input document that are not relevant to query evaluation

A. Marian and J. Siméon: Projecting XML Documents. In Proc. VLDB’03, pages 213–224, 2003

S. Bréssan, B. Catania, Z. Lacroix, Y. G. Li and A. Maddalena: Accelerating Queries by Pruning XML Documents. TKDE, 54(2):211–240, 2005

V. Benzaken, G. Castagna, D. Colazzo, and K. Nguyen: Type-Based XML Projection. In Proc. VLDB’06, 2006

XML Prefiltering

XQuery: <q> { /site//australia//description } </q>

Relevant Paths: { /site//australia//description# }

[Figure: XML document tree with the nodes site, regions, africa, asia, australia, item, and description, and the text node “PDA”.]

XML Prefiltering

Existing Approaches
1. Analysis of the input query, extraction of relevant paths
2. Tokenization of the input document
3. Compilation of an automaton that projects the document token by token

Our Approach
1. Analysis of the input query, extraction of relevant paths
2. Use efficient string matching techniques to locate the relevant parts of the input document (without parsing and tokenizing the document)

Challenge: take string matching algorithms to the second dimension, to navigate in tree-structured data

String Matching Techniques

Example: Boyer-Moore Algorithm

Text: String matching for beginners

Search for keyword “begin” (length of keyword = 5)

[Figure: the keyword is aligned with the text at successive positions and shifted to the right after each mismatch, until a match is found.]

Similar algorithms exist for multi-keyword search (e.g., Commentz-Walter Algorithm)
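The shifting idea on this slide can be sketched in a few lines of Python. This is not the full Boyer-Moore algorithm but its simplified Horspool variant, which keeps only the bad-character shift; the function name `horspool_search` is a hypothetical choice for this illustration.

```python
def horspool_search(text: str, keyword: str) -> int:
    """Return the 0-based index of the first occurrence of keyword, or -1."""
    m, n = len(keyword), len(text)
    if m == 0:
        return 0
    # Shift table: for each character of the keyword (except the last one),
    # the distance from its rightmost occurrence to the keyword's end.
    shift = {ch: m - 1 - i for i, ch in enumerate(keyword[:-1])}
    pos = 0
    while pos + m <= n:
        if text[pos:pos + m] == keyword:
            return pos
        # Look at the text character under the keyword's last position.
        # Characters that do not occur in the keyword allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return -1


text = "String matching for beginners"
print(horspool_search(text, "begin"))
```

Because most text characters do not occur in the keyword, the window usually advances by the full keyword length, so large parts of the text are never inspected. The result is a 0-based index, whereas the position scale on the slide starts at 1.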

String Matching and XML Prefiltering

String matching techniques were originally designed for search in flat, unstructured text

But: XML is structured, and prefiltering requires us to keep track of axis relations in the input paths (such as child and descendant relations)

XML schema knowledge (e.g., in the form of DTDs) provides us with structural information that can be exploited for target-oriented search

The Runtime Automaton

Fragment of the XMark DTD:

<!DOCTYPE site [
  <!ELEMENT site (regions)>
  <!ELEMENT regions (africa, asia, australia)>
  <!ELEMENT africa (item*)>
  <!ELEMENT asia (item*)>
  <!ELEMENT australia (item*)>
  <!ELEMENT item (location,name,payment,description,shipping,incategory+)>
  <!ELEMENT incategory EMPTY>
  <!ATTLIST incategory category ID #REQUIRED>
  ...
]>

We restrict ourselves to non-recursive DTDs, which can be transformed into a finite automaton

The ideas are also applicable in the context of recursive DTDs

The Runtime Automaton

[Figure: runtime automaton derived from the DTD fragment. Its transitions are labeled with the opening and closing tags of site, regions, africa, asia, australia, and item. The item child tags (location, name, payment, description, shipping, incategory) are drawn in full for one item and abbreviated as “(<item> child tags)” for the others.]

The Runtime Automaton

[Figure: the full runtime automaton for the DTD fragment, with the initial state highlighted.]

Search for string “<site”

The Runtime Automaton

[Figure: the full runtime automaton for the DTD fragment, with the state after <australia> highlighted.]

Search for strings “<item” and “</australia” in parallel

The Runtime Automaton

[Figure: the full runtime automaton for the DTD fragment, before it is reduced with respect to the relevant path.]

Relevant path: { /site//australia//description# }

The Runtime Automaton

[Figure: reduced automaton, step 1. The item states below africa and asia are gone. Remaining tags: <site>, <regions>, <africa>, </africa>, <asia>, </asia>, <australia>, <item> with its child tags, </item>, </australia>, </site>.]

Relevant path: { /site//australia//description# }

The Runtime Automaton

[Figure: reduced automaton, step 2. The states for regions, africa, and asia are gone. Remaining tags: <site>, <australia>, <item> with its child tags, </item>, </australia>, </site>.]

Relevant path: { /site//australia//description# }

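The Runtime Automaton

[Figure: reduced automaton, step 3. The states for <item>, location, name, and payment are gone. Remaining tags: <site>, <australia>, <description>, </description>, <shipping>, </shipping>, <incategory>, </item>, </australia>, </site>.]

Relevant path: { /site//australia//description# }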

The Runtime Automaton

[Figure: fully reduced automaton. Remaining tags: <site>, <australia>, <description>, </description>, </australia>, </site>.]

Relevant path: { /site//australia//description# }

Static Compilation into Lookup Tables

Automaton A
state  matched tag      next state
s      <site>           p0
p0     <australia>      p1
p1     <description>    p2
p1     </australia>     q1
p2     </description>   q2
q2     <description>    p2
q2     </australia>     q1
q1     </site>          q0

Frontier Vocabulary V
state  tags
s      {<site>}
p0     {<australia>}
p1     {<description>, </australia>}
p2     {</description>}
q0     {}
q1     {</site>}
q2     {<description>, </australia>}

Action Table T
state  action
s      no operation
p0     copy tag
p1     copy tag
p2     copy on
q0     copy tag
q1     copy tag
q2     copy off

[Figure: optimized runtime automaton with the states s, p0, p1, p2, q2, q1, q0 and the transitions listed in table A.]
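The three tables are closely related: the frontier vocabulary of a state is exactly the set of tags on its outgoing transitions. The following Python sketch encodes table A as a dictionary and reads V off it. The dictionary layout and the name `frontier_vocabulary` are assumptions made for this illustration, not the representation used by the SMP prototype.

```python
# Transition table A from the slide: (state, matched tag) -> next state.
A = {
    ("s", "<site>"): "p0",
    ("p0", "<australia>"): "p1",
    ("p1", "<description>"): "p2",
    ("p1", "</australia>"): "q1",
    ("p2", "</description>"): "q2",
    ("q2", "<description>"): "p2",
    ("q2", "</australia>"): "q1",
    ("q1", "</site>"): "q0",
}


def frontier_vocabulary(transitions):
    """Map every state to the set of tags that can be matched next."""
    states = {q for q, _ in transitions} | set(transitions.values())
    vocab = {q: set() for q in states}
    for q, tag in transitions:
        vocab[q].add(tag)
    return vocab


V = frontier_vocabulary(A)
for state in sorted(V):
    print(state, sorted(V[state]))
```

The final state q0 has no outgoing transitions, so its vocabulary is empty, as in the table above.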

Static Compilation into Lookup Tables

[Figure: extract from the original runtime automaton (s, <site>, <regions>, <africa>, </africa>, <asia>, </asia>, <australia>) next to the corresponding extract from the optimized runtime automaton (s, <site>, p0, <australia>, p1).]

Shortest possible XML string between <site> and <australia>:

s = “<regions><africa/><asia/>” with |s| = 25

Initially skip 25 characters

Initial Jump Table J
state         jump
p0            25
q2            43
other states  0
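The jump of 25 characters can be recomputed from the DTD. The sketch below is a minimal Python illustration under several assumptions: the content models are reduced to a hypothetical list of (child, minimum occurrences) pairs, attributes and whitespace are ignored, and the helper names `min_len` and `min_gap_to_child` are invented for this example. Anything ignored here can only make the real document longer, so the result remains a safe lower bound.

```python
# Hypothetical, simplified encoding of the DTD fragment:
# element -> list of (child element, minimum number of occurrences).
DTD = {
    "site": [("regions", 1)],
    "regions": [("africa", 1), ("asia", 1), ("australia", 1)],
    "africa": [("item", 0)],
    "asia": [("item", 0)],
    "australia": [("item", 0)],
}


def min_len(element):
    """Length of the shortest serialization of element (non-recursive DTD)."""
    body = sum(n * min_len(child) for child, n in DTD.get(element, []))
    if body == 0:
        return len(f"<{element}/>")  # empty element, e.g. <africa/>
    return len(f"<{element}>") + body + len(f"</{element}>")


def min_gap_to_child(parent, target):
    """Fewest characters between parent's opening tag and target's opening tag."""
    gap = 0
    for child, n in DTD[parent]:
        if child == target:
            return gap
        gap += n * min_len(child)
    raise ValueError(f"{target} is not a child of {parent}")


# After <site> has been matched: skip <regions>, then the shortest africa and asia.
jump = len("<regions>") + min_gap_to_child("regions", "australia")
print(jump)
```

The computation adds up 9 characters for <regions>, 9 for <africa/>, and 7 for <asia/>, which matches the entry J[p0] = 25.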

The Runtime Algorithm

Runtime Core Algorithm

q := s; // current state
c := 0; // cursor position

while q is not final do
begin
  (1) Perform initial jump J[q]
  (2) Perform keyword search for tags V[q] until a tag t is matched (starting from current cursor position c)
  (3) Assign q := A[q, t]
  (4) Perform action T[q]
end

Lean runtime algorithm

Operates on top of the precompiled tables

Uses efficient string-matching techniques to locate keywords (step (2))
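The core loop can be made concrete with a short Python sketch built on the tables from the previous slides. Several simplifications are assumptions of this illustration: Python's built-in `str.find` stands in for the Boyer-Moore and Commentz-Walter searches of step (2), tags are matched as complete strings without attributes, and "copy on" and "copy off" are taken to include the matched tags. The function name `prefilter` and the abridged sample document are likewise hypothetical.

```python
# Lookup tables for the path /site//australia//description#.
A = {  # transitions: (state, matched tag) -> next state
    ("s", "<site>"): "p0",
    ("p0", "<australia>"): "p1",
    ("p1", "<description>"): "p2",
    ("p1", "</australia>"): "q1",
    ("p2", "</description>"): "q2",
    ("q2", "<description>"): "p2",
    ("q2", "</australia>"): "q1",
    ("q1", "</site>"): "q0",
}
T = {"s": "no operation", "p0": "copy tag", "p1": "copy tag", "p2": "copy on",
     "q0": "copy tag", "q1": "copy tag", "q2": "copy off"}
J = {"p0": 25, "q2": 43}  # initial jumps; every other state jumps 0
V = {}  # frontier vocabulary, read off the transition table
for state, tag in A:
    V.setdefault(state, set()).add(tag)


def prefilter(doc, start="s", final="q0"):
    """Project doc by running the table-driven loop; returns the kept text."""
    out, q, c, copy_from = [], start, 0, None
    while q != final:
        c += J.get(q, 0)                               # (1) initial jump
        hits = [(doc.find(t, c), t) for t in V[q]]     # (2) keyword search
        i, t = min((i, t) for i, t in hits if i >= 0)  # nearest matched tag
        c = i + len(t)
        q = A[q, t]                                    # (3) state transition
        if T[q] == "copy tag":                         # (4) action of new state
            out.append(t)
        elif T[q] == "copy on":
            copy_from = i          # remember where the copied region starts
        elif T[q] == "copy off":
            out.append(doc[copy_from:c])
    return "".join(out)


DOC = (
    '<site><regions><africa><item><location>United States</location>'
    '<name>T V</name><payment>Creditcard</payment>'
    '<description>LCD flat panel</description>'
    '<shipping>Within country</shipping><incategory category="3"/></item>'
    '</africa><asia/><australia><item><location>Egypt</location>'
    '<name>PDA</name><payment>Check</payment>'
    '<description>Palm Zire 71</description><shipping/>'
    '<incategory category="3"/></item></australia></regions></site>'
)
print(prefilter(DOC))
```

On this document the sketch prints `<site><australia><description>Palm Zire 71</description></australia></site>`. The description inside africa is never copied, because in state p0 the search looks only for <australia>.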

A Sample Run

Relevant path: { /site//australia//description# }

Input (first part):

<site><regions><africa><item><location>United States</location><name>T V</name><payment>Creditcard</payment><description>15’’LCD-FlatPanel</description><shipping>Within country</shipping><incategory category="3"/></item></africa><asia/><australia><item><location>

Current state: q = s
Initial Jump: J[q=s] = 0
Frontier Voc.: V[q=s] = {<site>}
Matched tag “<site>”: A[s,<site>] = p0
T[q=p0] = copy tag (<site>)

Current state: q = p0
Initial Jump: J[q=p0] = 25
Frontier Voc.: V[q=p0] = {<australia>}
Matched tag “<australia>”: A[p0,<australia>] = p1
T[q=p1] = copy tag (<australia>)

Current state: q = p1
Initial Jump: J[q=p1] = 0
Frontier Voc.: V[q=p1] = {</australia>, <description>}

A Sample Run

Relevant path: { /site//australia//description# }

Input (continued):

Egypt</location><name>PDA</name><payment>Check</payment><description>Palm Zire 71</description><shipping/><incategory category="3"/></item></australia></regions></site>

Current state: q = p1
Initial Jump: J[q=p1] = 0
Frontier Voc.: V[q=p1] = {</australia>, <description>}
Matched tag “<description>”: A[p1,<description>] = p2
T[q=p2] = copy on

Current state: q = p2
Initial Jump: J[q=p2] = 0
Frontier Voc.: V[q=p2] = {</description>}
Matched tag “</description>”: A[p2,</description>] = q2
T[q=q2] = copy off

Experiments

Prototype implementation in C++: SMP

Settings
  Core2 Duo IBM ThinkPad Z61p, T2500 2.00GHz CPU with 1GB RAM
  Ubuntu Linux 6.06 LTS

Data sets: XMark, Medline, Protein Sequence

Document Sizes: 1MB up to 5,000MB

Queries: XMark queries, user-defined XPath queries

Query Engines
  XQuery: Qizx/open, Saxon
  XPath: SPEX

Experimental Results

Projection of a 5,000MB XMark document for selected XMark benchmark queries

              XM1        XM5        XM10       XM14       XM20
Proj. Size    67.64MB    22.10MB    307.63MB   1357.28MB  38.52MB
Memory        1.64MB     1.68MB     1.96MB     1.64MB     1.67MB
Elapsed Time  4min 12s   4min 12s   4min 55s   5min 21s   4min 10s
Usr+Sys       31.00s     19.91s     54.94s     53.71s     31.67s
CPU           12.52%     8.05%      13.85%     17.07%     12.92%
Char. Comp.   18.86%     9.87%      22.38%     21.24%     18.67%

Projection Characteristics for Selected XMark Benchmark Queries

Experimental Results

[Figure: throughput comparison of SMP projection for XMark (average over all queries on the 5,000MB document) vs. bare XML document tokenization performed by the Xerces C++ parser.]

SMP is faster than all projection systems that rely on a prior tokenization of the input XML document

Experimental Results

QizX XQuery Engine
                            Success  TimeFail  MemFail
1000MB without projection   0        0         18
1000MB with projection      18       0         0
5000MB without projection   0        0         18
5000MB with projection      15       2         1

Success Rates for 18 XMark Queries with and without Projection, where TimeFail: >1 hour, MemFail: >1GB

When coupled with projection, in-memory XQuery engines scale up to documents in the Gigabyte range

Experimental Results

Throughput improvement

656MB Medline document

5 user-defined XPath queries

Evaluated with the SPEX XPath engine

Summary

Efficient string matching techniques, originally designed for keyword search in flat text, can be used for search and navigation in unparsed XML documents

A novel approach to XML prefiltering on top of these ideas reduces XML prefiltering to a sequence of simple string matching tasks

Extensive experimental evaluation demonstrates
  persistently high throughput and scalability of our XML prefiltering system
  significant improvements for both XQuery and XPath engines when coupled with document prefiltering

Thank You for Your Attention!

Y. Diao et al.: “Path Sharing and Predicate Evaluation for High-Performance XML Filtering” in TODS, 2003.

T. J. Green et al.: “Processing XML streams with deterministic automata and stream indexes” in TODS, 2004.

D. Olteanu: “SPEX: Streamed and Progressive Evaluation of XPath” in TKDE, 2007.

X. Li and G. Agrawal: “Efficient Evaluation of XQuery over Streaming Data” in VLDB, 2005.

A. Marian and J. Simeon: “Projecting XML Documents” in VLDB, 2003.

V. Benzaken, G. Castagna, D. Colazzo, and K. Nguyen: “Type-Based XML Projection” in VLDB, 2006.

M. Schmidt, S. Scherzinger, and C. Koch: “Combined Static and Dynamic Analysis for Effective Buffer Minimization in Streaming XQuery Evaluation” in ICDE, 2007.

A. V. Aho: “Algorithms for finding patterns in strings” in Handbook of Theoretical Comp. Sc., Volume A, 1990.

B. W. Watson and G. Zwaan: “A taxonomy of sublinear multiple keyword pattern matching algorithms” in Sci. Comput. Program., 1996.

D. E. Knuth, J. H. Morris (Jr.), and V. R. Pratt: “Fast Pattern Matching in Strings” in SIAM J. Computing, 1977.

R. S. Boyer and J. S. Moore: “A Fast String Searching Algorithm,” in Commun. ACM, 1977.

A. V. Aho and M. J. Corasick: “Efficient string matching: An aid to bibliographic search” in CACM, 1975.

B. Commentz-Walter: “A String Matching Algorithm Fast on the Average” in Proc. ICALP, 1979.

A. Berlea and H. Seidl: “Binary Queries for Document Trees” in Nordic J. of Computing, 2004.

J. Jaakkola and P. Kilpelainen: “Nested text-region algebra” TR C-1999-2, Univ. of Helsinki, 1999.

M. Takeda et al: “Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-byte Character Texts, and Semi-structured Texts” in Proc. SPIRE, 2002.

M. Altinel et al.: “Efficient Filtering of XML Documents for Selective Dissemination of Information” in ICDE, 2000.

A. Bruggemann-Klein and D. Wood: “One-Unambiguous Regular Languages” in Inform. and Comp., 1998.

J.-M. Champarnaud: “Subset Construction Complexity for Homogeneous Automata, Position Automata and ZPC-Structures” in Theor. Comput. Sci., 2001.

Additional Resources

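The Runtime Automaton

In some cases, intermediate states must be kept to keep track of axis relations

Relevant path: { /a/b }

[Figure: original automaton with two branches below <a>: <a> <b> </b> </a> and <a> <c> <b> </b> </c> </a>.]

NOT CORRECT: [Figure: reduced automaton <a> <b> </b> </a>, shown next to the document <a><c><b></b></c></a>, in which b is not a child of a.]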

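The Runtime Automaton

In some cases, intermediate states must be kept to keep track of axis relations

Relevant path: { /a/b }

[Figure: original automaton with two branches below <a>: <a> <b> </b> </a> and <a> <c> <b> </b> </c> </a>.]

CORRECT: [Figure: reduced automaton that keeps the intermediate states <c> and </c> next to <a> <b> </b> </a>, shown next to the document <a><c><b></b></c></a>.]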
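The difference between the two reduced automata can be reproduced with plain string search. In the following Python sketch, `naive_match` and `correct_match` are hypothetical helper names, `str.find` again stands in for the actual string matching algorithms, and the code assumes a well-formed document in which c is never nested inside c (a non-recursive DTD).

```python
doc = "<a><c><b></b></c></a>"  # b is a grandchild of a, so /a/b selects nothing


def naive_match(doc):
    """Reduced automaton without the <c> states: after <a>, search only for <b>."""
    start = doc.find("<a>") + len("<a>")
    return doc.find("<b>", start)  # also finds a <b> nested inside <c>


def correct_match(doc):
    """Keep <c> and </c> as intermediate states so that their content is skipped."""
    c = doc.find("<a>") + len("<a>")
    while True:
        hits = [(doc.find(t, c), t) for t in ("<b>", "<c>", "</a>")]
        i, t = min((i, t) for i, t in hits if i >= 0)  # nearest tag wins
        if t == "<b>":
            return i       # a true child of a
        if t == "</a>":
            return -1      # a was closed without a child b
        c = doc.find("</c>", i) + len("</c>")  # skip the whole c subtree


print(naive_match(doc), correct_match(doc))
```

On the sample document the naive search reports the nested b at index 6, a false hit, while the version with the intermediate states reports no match (-1).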

Medline XPath Queries

M1 /MedlineCitationSet//CollectionTitle

M2 /MedlineCitationSet//DataBank[DataBankName/text()="PDB"]/AccessionNumberList

M3 /MedlineCitationSet//PersonalNameSubjectList/PersonalNameSubject[LastName/text()="Hippocrates" or DatesAssociatedWithName="Oct2006"]/TitleAssociatedWithName

M4 /MedlineCitationSet//CopyrightInformation[contains(text(),"NASA")]

M5 /MedlineCitationSet/MedlineCitation[contains(MedlineJournalInfo//text(),"Sterilization")]/DateCompleted

XMark Queries

XM1

let $auction := doc("auction.xml") return
for $b in $auction/site/people/person[@id = "person0"]
return $b/name/text()

XM5

let $auction := doc("auction.xml") return
count(
  for $i in $auction/site/closed_auctions/closed_auction
  where $i/price/text() >= 40
  return $i/price
)

XMark Queries

XM10

let $auction := doc("auction.xml") return
for $i in distinct-values($auction/site/people/person/profile/interest/@category)
let $p :=
  for $t in $auction/site/people/person
  where $t/profile/interest/@category = $i
  return
    <personne>
      <statistiques>
        <sexe>{$t/profile/gender/text()}</sexe>
        <age>{$t/profile/age/text()}</age>
        <education>{$t/profile/education/text()}</education>
        <revenu>{fn:data($t/profile/@income)}</revenu>
      </statistiques>
      <coordonnees>
        <nom>{$t/name/text()}</nom>
        <rue>{$t/address/street/text()}</rue>
        <ville>{$t/address/city/text()}</ville>
        <pays>{$t/address/country/text()}</pays>
        <reseau>
          <courrier>{$t/emailaddress/text()}</courrier>
          <pagePerso>{$t/homepage/text()}</pagePerso>
        </reseau>
      </coordonnees>
      <cartePaiement>{$t/creditcard/text()}</cartePaiement>
    </personne>
return <categorie>{<id>{$i}</id>, $p}</categorie>

XMark Queries

XM14

let $auction := doc("auction.xml")
return for $i in $auction/site//item
where contains(string(exactly-one($i/description)),"gold")
return $i/name/text()

XMark Queries

XM20

let $auction := doc("auction.xml")
return <result>
  <preferred>
    {count($auction/site/people/person/profile[@income >= 100000])}
  </preferred>
  <standard>
    {count($auction/site/people/person/profile[@income < 100000 and @income >= 30000])}
  </standard>
  <challenge>
    {count($auction/site/people/person/profile[@income < 30000])}
  </challenge>
  <na>
    {count(for $p in $auction/site/people/person
           where empty($p/profile/@income)
           return $p)}
  </na>
</result>
