
XML Prefiltering as a String Matching Problem

Christoph Koch (1), Stefanie Scherzinger (2), Michael Schmidt (3)

(1) Cornell University, (2) IBM Boeblingen, (3) Freiburg University

24th International Conference on Data Engineering

April 9, Cancun (Mexico), 2008

Motivation

XML data is often processed ad hoc, e.g. in streaming scenarios and main-memory-based processors

Low main memory consumption then becomes the key prerequisite to performance

XML prefiltering is an established technique that aims at decreasing main memory consumption

We present a novel approach to XML prefiltering based on efficient string matching techniques

XML Prefiltering

Buffer only data that is relevant to query evaluation

Prefiltering/Projection:
  Static analysis of the XQuery/XPath expression
  Identify parts of the input document that are relevant to query evaluation
  Discard parts of the input document that are not relevant to query evaluation

A. Marian and J. Siméon: Projecting XML Documents. In Proc. VLDB’03, pages 213–224, 2003

S. Bréssan, B. Catania, Z. Lacroix, Y. G. Li and A. Maddalena: Accelerating Queries by Pruning XML Documents. TKDE, 54(2):211–240, 2005

V. Benzaken, G. Castagna, D. Colazzo, and K. Nguyen: Type-Based XML Projection. In Proc. VLDB’06, 2006

XML Prefiltering

XQuery: <q> { /site//australia//description } </q>

Relevant Paths: { /site//australia//description# }

[Figure: XML document tree showing the nodes regions, africa, asia, australia and description, and the text "PDA".]

XML Prefiltering

Existing Approaches
1. Analysis of the input query, extraction of relevant paths
2. Tokenization of the input document
3. Compilation of an automaton that projects the document token by token

Our Approach
1. Analysis of the input query, extraction of relevant paths
2. Use efficient string matching techniques to locate the relevant parts of the input document (without parsing and tokenizing the document)

Challenge: take string matching algorithms to the second dimension, to navigate in tree-structured data

String Matching Techniques

Example: Boyer-Moore Algorithm

Search for the keyword "begin" (length of keyword = 5) in the text "String matching for beginne"

[Figure: the keyword is aligned at successive positions of the text (position scale 1 to 25) and shifted to the right until it matches.]

Similar algorithms exist for multi-keyword search (e.g., Commentz-Walter Algorithm)
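The skipping behaviour of the example can be illustrated with a short Python sketch. It uses the Horspool simplification of Boyer-Moore (bad-character shift only), which is a swapped-in variant chosen for brevity and not necessarily the variant used in the talk. The function name `horspool_find` and the comparison counter are assumptions made for illustration.

```python
def horspool_find(text, pattern):
    """Return (index of the first match or -1, number of character comparisons).

    Horspool variant of Boyer-Moore: compare right to left, then shift the
    pattern according to the text character under the pattern's last position.
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return 0, 0
    # Shift per character: distance from its last occurrence in pattern[:-1]
    # to the end of the pattern. Characters not listed shift by m.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    comparisons = 0
    pos = 0
    while pos + m <= n:
        j = m - 1
        while j >= 0:
            comparisons += 1
            if text[pos + j] != pattern[j]:
                break
            j -= 1
        else:
            # Inner loop finished without a mismatch: full match at pos.
            return pos, comparisons
        pos += shift.get(text[pos + m - 1], m)
    return -1, comparisons


if __name__ == "__main__":
    index, comparisons = horspool_find("String matching for beginners", "begin")
    print(index, comparisons)
```

The point of the sketch is that the number of inspected characters can stay well below the length of the text, because most alignments are rejected after a single comparison and the keyword is then shifted by up to its full length.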

String Matching and XML Prefiltering

String matching techniques were originally designed for search in flat and unstructured text

But: XML is structured, and prefiltering requires us to keep track of axis relations in the input paths (such as child and descendant relations)

XML schema knowledge (e.g., in the form of DTDs) provides us with structural information that can be exploited for target-oriented search

The Runtime Automaton

<!DOCTYPE site [
  <!ELEMENT site (regions)>
  <!ELEMENT regions (africa, asia, australia)>
  <!ELEMENT africa (item*)>
  <!ELEMENT asia (item*)>
  <!ELEMENT australia (item*)>
  <!ELEMENT item (location, name, payment, description, shipping, incategory+)>
  <!ELEMENT incategory EMPTY>
  <!ATTLIST incategory category ID #REQUIRED>
  ...
]>

Fragment of the XMark DTD

We restrict ourselves to non-recursive DTDs, which can be transformed into a finite automaton

The ideas are also applicable in the context of recursive DTDs

[Figure: the runtime automaton for the DTD fragment, shown in several animation steps. Its states are labelled with tags such as <site>, </africa><asia>, </asia>, <item>, <name>, </name>, </payment><payment>, </description>, </shipping>, </item>, </australia> and </site>; the <item> child tags form their own part of the automaton.]

Search for string "<site"

Search for strings "<item" and "</australia" in parallel

Relevant path: { /site//australia//description# }

[Figure: in the remaining steps the automaton is shown with progressively fewer states for this path; the last steps retain only tags such as <site>, </description>, </australia> and </site>.]

Static Compilation into Lookup Tables

Automaton A (state, matched tag, next state)
s    <site>           p0
p0   <australia>      p1
p1   <description>    p2
p1   </australia>     q1
p2   </description>   q2
q2   <description>    p2
q2   </australia>     q1
q1   </site>          q0

Frontier Vocabulary V
s    {<site>}
p0   {<australia>}
p1   {<description>, </australia>}
p2   {</description>}
q1   {</site>}
q2   {<description>, </australia>}

Action Table T
s    no operation
p0   copy tag
p1   copy tag
p2   copy on
q0   copy tag
q1   copy tag
q2   copy off

[Figure: the runtime automaton corresponding to these tables.]
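In the tables of this slide, the frontier vocabulary of a state coincides with the set of tags that label its outgoing transitions in A. The following Python sketch derives V from A under that observation; the dictionary encoding and the function name `frontier_vocabulary` are assumptions made for illustration, not the layout used by the SMP prototype.

```python
# Transition table A from the slide, keyed by (state, matched tag).
A = {
    ("s", "<site>"): "p0",
    ("p0", "<australia>"): "p1",
    ("p1", "<description>"): "p2",
    ("p1", "</australia>"): "q1",
    ("p2", "</description>"): "q2",
    ("q2", "<description>"): "p2",
    ("q2", "</australia>"): "q1",
    ("q1", "</site>"): "q0",
}


def frontier_vocabulary(transitions):
    """Collect, per state, the tags that label its outgoing transitions."""
    vocab = {}
    for state, tag in transitions:
        vocab.setdefault(state, set()).add(tag)
    return vocab


V = frontier_vocabulary(A)

if __name__ == "__main__":
    for state in sorted(V):
        print(state, sorted(V[state]))
```

The final state q0 has no outgoing transitions and therefore no entry in V, which matches the slide.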

Static Compilation into Lookup Tables

[Figure: extract from the original runtime automaton and extract from the optimized runtime automaton, covering the transitions from s to p0 on <site> and from p0 to p1 on <australia>.]

Shortest possible XML string between <site> and <australia>:

s = "<regions><africa/><asia/>" with |s| = 25

Initially skip 25 characters

Initial Jump Table J
p0             25
other states   0
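The jump value of 25 can be reproduced from the DTD fragment with a small Python sketch. It is a simplification under stated assumptions: content models are plain sequences, each child is encoded with a minimum occurrence count (0 for `item*`), attributes and whitespace are ignored, and an element with empty content is serialized as `<name/>`. The names `DTD`, `min_len` and `initial_jump` are hypothetical.

```python
# Hypothetical encoding of the DTD fragment: element -> [(child, min occurrences)].
DTD = {
    "site": [("regions", 1)],
    "regions": [("africa", 1), ("asia", 1), ("australia", 1)],
    "africa": [("item", 0)],
    "asia": [("item", 0)],
    "australia": [("item", 0)],
}


def min_len(elem):
    """Length of the shortest serialization of elem (no attributes, no whitespace)."""
    body = sum(n * min_len(child) for child, n in DTD.get(elem, []))
    if body == 0:
        return len(f"<{elem}/>")
    return len(f"<{elem}>") + body + len(f"</{elem}>")


def initial_jump(src, target):
    """Minimum number of characters between the end of <src> and the start
    of <target>, or None if target is not reachable below src."""
    skipped = 0
    for child, n in DTD.get(src, []):
        if child == target:
            return skipped
        inner = initial_jump(child, target)
        if inner is not None:
            # Descend into child: its opening tag must be passed first.
            return skipped + len(f"<{child}>") + inner
        # Target is not below this child: skip its shortest serialization.
        skipped += n * min_len(child)
    return None


if __name__ == "__main__":
    print(initial_jump("site", "australia"))
    print(len("<regions><africa/><asia/>"))
```

Both printed values agree with the slide: the opening tag `<regions>` contributes 9 characters, `<africa/>` 9 and `<asia/>` 7.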

The Runtime Algorithm

q := s;   // current state
c := 0;   // cursor position

while q is not final do
begin
  (1) Perform initial jump J[q]
  (2) Perform keyword search for tags V[q] until a tag t is matched (starting from current cursor position c)
  (3) Assign q := A[q, t]
  (4) Perform action T[q]
end

Runtime Core Algorithm

Lean runtime algorithm

Operates on top of the precompiled tables

Uses efficient string-matching techniques to locate keywords (step (2))
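The core loop can be sketched in Python on top of the tables A, T and J from the preceding slides. This is a minimal sketch under several assumptions and not the SMP implementation: `str.find` stands in for the multi-keyword search (the talk names Commentz-Walter), tags are located by prefixes such as `<site`, the frontier vocabulary V[q] is taken to be the keys of A[q], `copy on` is interpreted as starting the copy at the matched tag and `copy off` as ending it after the matched tag, and self-closing matches such as `<australia/>` are not handled. The name `prefilter` and the shortened sample document are hypothetical.

```python
# A[q][keyword] -> next state; the keys of A[q] serve as frontier vocabulary V[q].
A = {
    "s": {"<site": "p0"},
    "p0": {"<australia": "p1"},
    "p1": {"<description": "p2", "</australia": "q1"},
    "p2": {"</description": "q2"},
    "q2": {"<description": "p2", "</australia": "q1"},
    "q1": {"</site": "q0"},
}
# Action table T[q], applied when q is entered.
T = {"s": "no operation", "p0": "copy tag", "p1": "copy tag", "p2": "copy on",
     "q0": "copy tag", "q1": "copy tag", "q2": "copy off"}
# Initial jump table J[q]; states not listed jump 0 characters.
J = {"p0": 25}
FINAL = {"q0"}


def prefilter(doc: str) -> str:
    q, c = "s", 0          # current state, cursor position
    out = []               # projected output fragments
    copy_from = None       # start offset of an open "copy on" region
    while q not in FINAL:
        c += J.get(q, 0)                                   # (1) initial jump
        hits = [(doc.find(k, c), k) for k in A[q]]         # (2) keyword search
        hits = [(pos, k) for pos, k in hits if pos >= 0]
        if not hits:
            raise ValueError(f"no frontier keyword found in state {q}")
        pos, keyword = min(hits)                           # earliest match wins
        end = doc.index(">", pos) + 1                      # end of the matched tag
        q = A[q][keyword]                                  # (3) state transition
        action = T[q]                                      # (4) action
        if action == "copy tag":
            out.append(doc[pos:end])
        elif action == "copy on":
            copy_from = pos
        elif action == "copy off":
            out.append(doc[copy_from:end])
            copy_from = None
        c = end
    return "".join(out)


SAMPLE = (
    "<site><regions><africa><item><name>T V</name>"
    "<description>LCD</description></item></africa><asia/>"
    "<australia><item><name>PDA</name>"
    "<description>Palm Zire 71</description></item></australia>"
    "</regions></site>"
)

if __name__ == "__main__":
    print(prefilter(SAMPLE))
```

On the shortened sample, the description inside africa is never inspected as a match candidate, because in state p0 only `<australia` is searched for; only the description below australia reaches the output.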

A Sample Run

Input document:

<site><regions><africa><item><location>United States</location><name>T V</name><payment>Creditcard</payment><description>15’’LCD-FlatPanel</description><shipping>Within country</shipping><incategory category="3"/></item></africa><asia/><australia><item ><location>Egypt</location><name>PDA</name><payment>Check</payment><description>Palm Zire 71</description><shipping/><incategory category="3"/></item></australia></regions></site>

Current state: q = s
  Initial Jump: J[q=s] = 0
  Frontier Voc.: V[q=s] = {<site>}
  Matched tag "<site>": A[s, <site>] = p0
  T[q=p0] = copy tag (<site>)

Current state: q = p0
  Initial Jump: J[q=p0] = 25
  Frontier Voc.: V[q=p0] = {<australia>}
  Matched tag "<australia>": A[p0, <australia>] = p1
  T[q=p1] = copy tag (<australia>)

Current state: q = p1
  Frontier Voc.: V[q=p1] = {</australia>, <description>}
  Matched tag "<description>": A[p1, <description>] = p2
  T[q=p2] = copy on

Current state: q = p2
  Frontier Voc.: V[q=p2] = {</description>}
  Matched tag "</description>": A[p2, </description>] = q2
  T[q=q2] = copy off

Experiments

Prototype implementation in C++: SMP

Settings
  Core2 Duo IBM ThinkPad Z61p T2500 2.00GHz CPU with 1GB RAM
  Ubuntu Linux 6.06 LTS

Data sets: XMark, Medline, Protein Sequence

Document sizes: 1MB up to 5,000MB

Queries: XMark queries, user-defined XPath queries

Query engines
  XQuery: Qizx/open, Saxon
  XPath: SPEX

Experimental Results

Projection of a 5,000MB XMark document for selected XMark benchmark queries

                XM1        XM5        XM10       XM14        XM20
Proj. Size      67.64MB    22.10MB    307.63MB   1357.28MB   38.52MB
Memory          1.64MB     1.68MB     1.96MB     1.64MB      1.67MB
Elapsed Time    4min 12s   4min 12s   4min 55s   5min 21s    4min 10s
Usr+Sys         31.00s     19.91s     54.94s     53.71s      31.67s
CPU             12.52%     8.05%      13.85%     17.07%      12.92%
Char. Comp.     18.86%     9.87%      22.38%     21.24%      18.67%

Projection Characteristics for Selected XMark Benchmark Queries

Throughput comparison: SMP projection for XMark (average over all queries on the 5,000MB document) vs. tokenization of the bare XML document performed by the Xerces C++ parser

[Figure: throughput chart.]

SMP is faster than all projection systems that rely on a prior tokenization of the input XML document

QizX XQuery Engine            Success   TimeFail   MemFail
1000MB without projection     0         0          18
1000MB with projection        18        0          0
5000MB without projection     0         0          18
5000MB with projection        15        2          1

Success Rates for 18 XMark Queries with and without Projection, where TimeFail: >1hour, MemFail: >1GB

When coupled with projection, in-memory XQuery engines scale up to documents in the Gigabyte range

Throughput improvement
  656MB Medline document
  5 user-defined XPath queries
  Evaluated with the SPEX XPath engine

Summary

Efficient string matching techniques, originally designed for keyword search in flat text, can be used for search and navigation in unparsed XML documents

A novel approach to XML prefiltering on top of these ideas reduces XML prefiltering to a sequence of simple string matching tasks

Extensive experimental evaluation demonstrates
  persistently high throughput and scalability of our XML prefiltering system
  significant improvements for both XQuery and XPath engines when coupled with document prefiltering

Thank You for Your Attention!

Y. Diao et al.: “Path Sharing and Predicate Evaluation for High-Performance XML Filtering” in TODS, 2003.

T. J. Green et al.: “Processing XML streams with deterministic automata and stream indexes” in TODS, 2004.

D. Olteanu: “SPEX: Streamed and Progressive Evaluation of XPath” in TKDE, 2007.

X. Li and G. Agrawal: “Efficient Evaluation of XQuery over Streaming Data” in VLDB, 2005.

A. Marian and J. Simeon: “Projecting XML Documents” in VLDB, 2003.

V. Benzaken, G. Castagna, D. Colazzo, and K. Nguyen: “Type-Based XML Projection” in VLDB, 2006.

M. Schmidt, S. Scherzinger, and C. Koch: “Combined Static and Dynamic Analysis for Effective Buffer Minimization in Streaming XQuery Evaluation” in ICDE, 2007.

A. V. Aho: “Algorithms for finding patterns in strings” in Handbook of Theoretical Comp. Sc., Volume A, 1990.

B. W. Watson and G. Zwaan: “A taxonomy of sublinear multiple keyword pattern matching algorithms” in Sci. Comput. Program., 1996.

D. E. Knuth, J. H. Morris (Jr.), and V. R. Pratt: “Fast Pattern Matching in Strings” in SIAM J. Computing, 1977.

R. S. Boyer and J. S. Moore: “A Fast String Searching Algorithm,” in Commun. ACM, 1977.

A. V. Aho and M. J. Corasick: “Efficient string matching: An aid to bibliographic search” CACM, 1975.

B. Commentz-Walter: “A String Matching Algorithm Fast on the Average” in Proc. ICALP, 1979.

A. Berlea and H. Seidl: “Binary Queries for Document Trees” in Nordic J. of Computing, 2004.

J. Jaakkola and P. Kilpelainen: “Nested text-region algebra” TR C-1999-2, Univ. of Helsinki, 1999.

M. Takeda et al.: “Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-byte Character Texts, and Semi-structured Texts” in Proc. SPIRE, 2002.

M. Altinel et al.: “Efficient Filtering of XML Documents for Selective Dissemination of Information” in ICDE, 2000.

A. Bruggemann-Klein and D. Wood: “One-Unambiguous Regular Languages” in Inform. and Comp., 1998.

J.-M. Champarnaud: “Subset Construction Complexity for Homogeneous Automata, Position Automata and ZPC-Structures” in Theor. Comput. Sci., 2001.

Additional Resources

In some cases, intermediate states must be kept to keep track of axis relations

Example: relevant path { /a/b }

[Figure: two runtime automata for { /a/b } over the tags <a> and </a>. The first one, without the intermediate states, is marked NOT CORRECT; the second one, which keeps them, is marked CORRECT.]

Medline XPath Queries

M1 /MedlineCitationSet//CollectionTitle

M2 /MedlineCitationSet//DataBank[DataBankName/text()="PDB"]/AccessionNumberList

M3 /MedlineCitationSet//PersonalNameSubjectList/PersonalNameSubject[LastName/text()="Hippocrates" or DatesAssociatedWithName="Oct2006"]/TitleAssociatedWithName

M4 /MedlineCitationSet//CopyrightInformation[contains(text(),"NASA")]

M5 /MedlineCitationSet/MedlineCitation[contains(MedlineJournalInfo//text(),"Sterilization")]/DateCompleted

XMark Queries

let $auction := doc("auction.xml") return

for $b in $auction/site/people/person[@id = "person0"]

return $b/name/text()

let $auction := doc("auction.xml") return
count(
  for $i in $auction/site/closed_auctions/closed_auction
  where $i/price/text() >= 40
  return $i/price
)

XMark Queries

let $auction := doc("auction.xml") return
for $i in distinct-values($auction/site/people/person/profile/interest/@category)
let $p :=
  for $t in $auction/site/people/person
  where $t/profile/interest/@category = $i
  return
    <personne>
      <statistiques>
        <sexe>{$t/profile/gender/text()}</sexe>
        <age>{$t/profile/age/text()}</age>
        <education>{$t/profile/education/text()}</education>
        <revenu>{fn:data($t/profile/@income)}</revenu>
      </statistiques>
      <coordonnees>
        <nom>{$t/name/text()}</nom>
        <rue>{$t/address/street/text()}</rue>
        <ville>{$t/address/city/text()}</ville>
        <pays>{$t/address/country/text()}</pays>
        <reseau>
          <courrier>{$t/emailaddress/text()}</courrier>
          <pagePerso>{$t/homepage/text()}</pagePerso>
        </reseau>
      </coordonnees>
      <cartePaiement>{$t/creditcard/text()}</cartePaiement>
    </personne>
return <categorie>{<id>{$i}</id>, $p}</categorie>

XMark Queries

let $auction := doc("auction.xml")

return for $i in $auction/site//item

where contains(string(exactly-one($i/description)),"gold")

return $i/name/text()

XMark Queries

let $auction := doc("auction.xml")
return <result>
  <preferred>
    {count($auction/site/people/person/profile[@income >= 100000])}
  </preferred>
  <standard>
    {count($auction/site/people/person/profile[@income < 100000 and @income >= 30000])}
  </standard>
  <challenge>
    {count($auction/site/people/person/profile[@income < 30000])}
  </challenge>
  <na>
    {count(for $p in $auction/site/people/person
           where empty($p/profile/@income)
           return $p)}
  </na>
</result>
