Post on 21-Jan-2016
GPX-Matcher - A Generic Boolean Predicate-based
XPath Expression Matcher
Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen
Middleware Systems Research Group
University of Toronto
MIDDLEWARE SYSTEMS RESEARCH GROUP
MSRG.ORG
EDBT’2011
An X-ToPSS Project
http://msrg.org/tags/x-topss
The Problem in a Nutshell
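[Figure: two parallel pipelines]
XML Filtering: XML documents are filtered against XPath Expressions (XPE, millions of them), yielding the matched XPEs
Pub/Sub Engine: events/publications are matched against subscriptions (Boolean expressions), yielding the matched subscriptions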
Publish/Subscribe Systems
[Figure: publishers send publications to a broker; subscribers register subscriptions with the broker and receive notifications]
Example: stock markets (NYSE, NASDAQ, TSX)
Publications: IBM=84, MSFT=27, INTC=19, JNJ=58, ORCL=12, HON=24, AMGN=58
Subscriptions: IBM > 85, ORCL < 10, JNJ > 60
Pub/Sub Matching Algorithms
• Rete algorithm [Forgy, late 70s]
– A graph structure to correlate events and process rules (solves a more general problem)
• SIFT [Yan et al. TODS‘94]
– Predicate counting, among other techniques
• Gough algorithm [Gough et al. ACSC‘95]
– Based on a finite state representation of subscriptions
• Gryphon algorithm [Aguilera et al. PODC‘99]
– Decision tree over predicates
• Clustering algorithm [Fabret et al. SIGMOD‘01]
– Clusters subscriptions based on common predicates
• k-Index [Whang et al. VLDB‘09]
• Hardware-based matching acceleration [Sadoghi et al. VLDB‘10]
• BE-Tree [Sadoghi & Jacobsen, SIGMOD’2011]
The Key Question
Can XML filtering benefit from the efficient publish/subscribe matching algorithms that have been developed over more than three decades?
XML Filtering Challenges
• Filter XML according to XPEs
• Efficiently, at Internet-scale, for millions of XPEs, and for many XML documents per unit of time
XML Filtering Systems
• Growing need for XML filtering
– Application-level firewalls
– Malware detection and prevention
– Document routing
– RSS aggregators
– XML-based messaging and application integration
• Selected industry players (XML appliances)
– Solace Systems
– IBM DataPower
– Talerian
– Sarvega (Intel)
• XML filtering systems are publish/subscribe systems
• XPath & XML are subscription and publication, respectively
The Core Problem
• XML Document Filtering Problem
– Given a set of XPath expressions Q and an XML document d, find all expressions in Q that are matched by d
• An expression q is matched by an XML document d if and only if q selects a non-empty set of nodes in d
– XPath expressions are used to select entire documents or fragments of documents
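The problem statement above can be sketched as a brute-force baseline that evaluates every expression against the document, one at a time. This is only an illustration of the definition, not the GPX-Matcher approach: it assumes Python's `xml.etree.ElementTree`, whose limited XPath support evaluates paths relative to the root element, so `./figure` stands in for `/section/figure`. The function name `matched_expressions` and the sample queries are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample document (assumption): a section with a nested and a direct figure.
DOC = "<section><subsection><figure/></subsection><figure/></section>"


def matched_expressions(doc_xml, expressions):
    """Return the expressions that select a non-empty node set in the document."""
    root = ET.fromstring(doc_xml)
    # findall returns a list of selected nodes; an empty list means "not matched".
    return [q for q in expressions if root.findall(q)]


# Paths are written relative to the root element <section>.
queries = ["./figure", "./subsection/figure", ".//figure", "./*/figure", "./table"]
print(matched_expressions(DOC, queries))
```

Evaluating each expression separately costs time proportional to the number of expressions per document, which is exactly what becomes prohibitive for millions of XPEs.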
Agenda
• Supported XPath Language
• Mapping XML Filtering to Pub/Sub Matching
– XPath encoding
– XML encoding
• Experimental results
• Outlook
XML and XPath
XML fragment:
<section>
<subsection> <figure> … </figure> </subsection>
<figure> … </figure>
</section>
XML tree: section has the children subsection and figure; subsection has the child figure
XML paths:
section-subsection-figure
section-figure
XPath queries:
/section/subsection/figure
section/figure
/section//subsection/figure
section//figure
/section/*/figure
*/figure
Legend: each tag between operators is a location step; "/" is the child operator; "//" is the descendant operator; "*" is a wildcard; a query that starts with "/" is an absolute query, any other query is a relative query
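The step from XML tree to XML paths can be sketched as a depth-first traversal that emits one tag sequence per root-to-leaf path. This is a minimal sketch using `xml.etree.ElementTree`; the function name `root_to_leaf_paths` is hypothetical, and the elided figure bodies of the fragment are left empty.

```python
import xml.etree.ElementTree as ET


def root_to_leaf_paths(doc_xml):
    """List every root-to-leaf path of the document as a list of tag names."""
    paths = []

    def walk(node, prefix):
        path = prefix + [node.tag]
        children = list(node)
        if not children:
            paths.append(path)  # a leaf closes one document path
        for child in children:
            walk(child, path)

    walk(ET.fromstring(doc_xml), [])
    return paths


DOC = "<section><subsection><figure/></subsection><figure/></section>"
for p in root_to_leaf_paths(DOC):
    print("-".join(p))
# prints:
# section-subsection-figure
# section-figure
```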
XPath 2.0 Subset Considered
• Absolute path expressions
– /a/b
• Relative path expressions
– a/b/c
• Descendant operators in path expressions
– a/b//a/d
• Wildcards in path expressions
– a/*/*/b
• Not discussed, but shown how to address
– Filter predicates in path expressions
• <path>[@x>1]/<path>
– Nested path filters (the XPE becomes a tree)
• <path>[a/b]/<path>
Our Question(s)
• How can we map XPath expressions onto subscriptions?
– Conjunctive Boolean formula over predicates
– S = (a1 op v1) ∧ (a2 op v2) ∧ … ∧ (an op vn)
• How can we map XML documents onto publications?
– Set of attribute-value pairs
– P = {(a1, v1), (a2, v2), …, (am, vm)}
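To make the two target representations concrete, the following sketch evaluates conjunctive subscriptions against a publication. It assumes a publication carries at most one value per attribute, so a dictionary can stand in for the set of attribute-value pairs. The values come from the stock example on the Publish/Subscribe Systems slide; subscription `s4` and all identifiers are hypothetical additions for illustration.

```python
import operator

# Supported predicate operators (an assumption; the slides only write "op").
OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt,
       "<=": operator.le, ">=": operator.ge}


def matches(subscription, publication):
    """True if every (attribute, op, value) predicate holds for the publication."""
    return all(
        attr in publication and OPS[op](publication[attr], value)
        for attr, op, value in subscription
    )


publication = {"IBM": 84, "ORCL": 12, "JNJ": 58}
subscriptions = {
    "s1": [("IBM", ">", 85)],
    "s2": [("ORCL", "<", 10)],
    "s3": [("JNJ", ">", 60)],
    "s4": [("IBM", "<", 85), ("JNJ", ">", 50)],  # hypothetical conjunction
}
matched = [sid for sid, s in subscriptions.items() if matches(s, publication)]
print(matched)  # ['s4']
```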
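Predicate Calculus
• Single-tag predicate: $p_t \;\mathrm{op}\; v$, where $p_t$ is the position of tag $t$ in the path
• Double-tags predicate: $d(p_{t_1}, p_{t_2}) \;\mathrm{op}\; v$, where $d$ is the distance in location steps between tags $t_1$ and $t_2$
• End-tag predicate: $p_t^{\dashv} \ge v$, where $p_t^{\dashv}$ is the number of location steps from tag $t$ to the end of the path
• Length-constraint predicate: $\mathit{length} \ge v$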
Single-tag Predicate Example
• XPath expression: /b/…
• Predicate: $p_b = 1$ (tag b at position 1)
• Example document path: b-a-c, encoded as (b, 1), (a, 2), (c, 3)
Double-tags Predicate Example I
• XPath expression: … a/b …
• Predicate: $d(p_a, p_b) = 1$ (the distance between tag a and tag b is one location step)
• Example document path: x-a-b, encoded as (x, 1), (a, 2), (b, 3)
Double-tags Predicate Example II
• XPath expression: a//b
• Predicate: $d(p_a, p_b) \ge 1$ (the distance between tag a and tag b is at least one location step)
• Example document path: a-x-b, encoded as (a, 1), (x, 2), (b, 3)
End-tag Predicate Example
• XPath expression: /a/*/*
• Predicate: $p_a^{\dashv} \ge 2$ (tag a is at least two location steps away from the path end)
• Example document path: a-x-y, encoded as (a, 1), (x, 2), (y, 3), (length, 3)
Length-constraint Predicate Example
• XPath expression: */*/*
• Predicate: $\mathit{length} \ge 3$ (the length of the path is at least 3)
• Example document path: x-y-z, encoded as (x, 1), (y, 2), (z, 3), (length, 3)
Putting it Together: XPath Query Encoding Example
Q1: a/b//a
Q2: a//b/d
Q3: a/*/*/*//b/d
With occurrence numbers:
Q1: a1/b1//a2
Q2: a1//b1/d1
Q3: a1/*/*/*//b1/d1
Encoded as conjunctions of predicates:
Q1: P1 ∧ P2, with P1: $d(p_{a_1}, p_{b_1}) = 1$ and P2: $d(p_{b_1}, p_{a_2}) \ge 1$
Q2: P3 ∧ P4, with P3: $d(p_{a_1}, p_{b_1}) \ge 1$ and P4: $d(p_{b_1}, p_{d_1}) = 1$
Q3: P5 ∧ P4, with P5: $d(p_{a_1}, p_{b_1}) \ge 4$ and P4: $d(p_{b_1}, p_{d_1}) = 1$
Our XPath encoding grows linearly in the size of the XPath expression
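The encoding of Q1 to Q3 can be sketched as a single left-to-right pass over the location steps. This is an illustrative reading of the slides, not the authors' implementation: the function name `encode_xpe`, the tuple layout, and the labels `pos`, `dist`, `end`, and `length` (for the single-tag, double-tags, end-tag, and length-constraint predicates) are assumptions. Wildcards are never emitted as tags; they only widen the distance to the next tag, and a `//` step turns an exact distance into a lower bound.

```python
import re
from collections import Counter


def encode_xpe(expr):
    """Encode an XPath expression made of tags, '/', '//' and '*' as predicates."""
    steps = re.findall(r"(//|/)?([^/]+)", expr)
    predicates = []
    occurrences = Counter()
    prev = None    # previous concrete tag with occurrence number, e.g. "a1"
    gap = 0        # location steps since prev (or since the path start)
    loose = False  # True when the distance is only a lower bound
    for i, (axis, tag) in enumerate(steps):
        gap += 1
        if axis == "//" or (i == 0 and axis != "/"):
            loose = True  # descendant step, or a relative expression start
        if tag == "*":
            continue
        occurrences[tag] += 1
        name = f"{tag}{occurrences[tag]}"
        op = ">=" if loose else "="
        if prev is None:
            if (op, gap) != (">=", 1):  # "position >= 1" is always true
                predicates.append(("pos", name, op, gap))
        else:
            predicates.append(("dist", prev, name, op, gap))
        prev, gap, loose = name, 0, False
    if gap > 0:  # trailing wildcards
        if prev is None:
            predicates.append(("length", ">=", gap))
        else:
            predicates.append(("end", prev, ">=", gap))
    return predicates


print(encode_xpe("a/*/*/*//b/d"))
# [('dist', 'a1', 'b1', '>=', 4), ('dist', 'b1', 'd1', '=', 1)]
```

Each location step contributes at most one predicate, which is consistent with the linear growth stated on the slide.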
XML Document Path Encoding
Document path: a-b-c-d
With occurrence numbers: a1-b1-c1-d1 (without duplicate tags, i.e., all occurrence numbers are 1)
Publication (attribute-value pairs):
(length, 4),
(a1, 1), (b1, 2), (c1, 3), (d1, 4)
(a1, b1, 1), (a1, c1, 2), (a1, d1, 3),
(b1, c1, 1), (b1, d1, 2),
(c1, d1, 1)
The resulting set of attribute-value “pairs” has O(n²) elements for a path of n tags.
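The document path encoding can be sketched as follows: one length entry, one position entry per tag, and one distance entry per ordered tag pair, which is where the quadratic size comes from. The dictionary layout and the name `encode_path` are assumptions for illustration. The sketch numbers repeated tags by occurrence but emits only a single publication; any further handling of duplicate tags is not covered here.

```python
from collections import Counter
from itertools import combinations


def encode_path(tags):
    """Encode one root-to-leaf path as a publication (attribute -> value)."""
    occurrences = Counter()
    names = []
    for tag in tags:
        occurrences[tag] += 1
        names.append(f"{tag}{occurrences[tag]}")  # a -> a1, second b -> b2
    publication = {"length": len(names)}
    for pos, name in enumerate(names, start=1):
        publication[name] = pos                   # (tag, position)
    for (i, first), (j, second) in combinations(enumerate(names, start=1), 2):
        publication[(first, second)] = j - i      # (tag1, tag2, distance)
    return publication


pub = encode_path("a-b-c-d".split("-"))
print(pub["length"], pub["c1"], pub[("a1", "d1")])  # 4 3 3
print(len(pub))  # 11 entries: 1 length + 4 positions + 6 distances
```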
Mapping XML Filtering to Pub/Sub Matching
[Figure: XPath Expressions (XPE, millions of them) become subscriptions (Boolean expressions) and XML documents become events/publications in the Pub/Sub Engine; the matched subscriptions are the matched XPEs]
Matching Algorithms
• Pick any pub/sub matching algorithm
• We used
– Counting algorithm [exact origin is unknown]
– Clustering algorithm [Fabret, Jacobsen et al., 2001]
• Both are two-phased matching algorithms
1. Predicate matching: Match all predicates.
2. Subscriptions matching: Match subscriptions using the result from step 1.
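The two phases can be sketched with the counting variant. Predicates P1 to P5 and queries Q1 to Q3 are written out as they follow from the expressions a/b//a, a//b/d, and a/*/*/*//b/d; the publication is the pairwise-distance part of a hypothetical document path a-c-b-d, chosen here for illustration. The data layout and function names are assumptions, not the authors' C implementation.

```python
import operator

OPS = {"=": operator.eq, ">=": operator.ge}

# Double-tags predicates: id -> ((tag1, tag2), op, value).
PREDICATES = {
    "P1": (("a1", "b1"), "=", 1),
    "P2": (("b1", "a2"), ">=", 1),
    "P3": (("a1", "b1"), ">=", 1),
    "P4": (("b1", "d1"), "=", 1),
    "P5": (("a1", "b1"), ">=", 4),
}
QUERIES = {"Q1": ["P1", "P2"], "Q2": ["P3", "P4"], "Q3": ["P5", "P4"]}

# Inverted index: for each predicate, the queries that contain it.
PRED_TO_QUERIES = {}
for qid, pids in QUERIES.items():
    for pid in pids:
        PRED_TO_QUERIES.setdefault(pid, []).append(qid)


def match_predicates(publication):
    """Phase 1: evaluate each distinct predicate once."""
    return {
        pid for pid, (attr, op, value) in PREDICATES.items()
        if attr in publication and OPS[op](publication[attr], value)
    }


def match_subscriptions(satisfied):
    """Phase 2 (counting): a query matches when all of its predicates were hit."""
    hits = dict.fromkeys(QUERIES, 0)
    for pid in satisfied:
        for qid in PRED_TO_QUERIES[pid]:
            hits[qid] += 1
    return sorted(q for q, n in hits.items() if n == len(QUERIES[q]))


# Distances for the hypothetical path a-c-b-d (a1-c1-b1-d1).
publication = {("a1", "c1"): 1, ("a1", "b1"): 2, ("a1", "d1"): 3,
               ("c1", "b1"): 1, ("c1", "d1"): 2, ("b1", "d1"): 1}
satisfied = match_predicates(publication)
print(sorted(satisfied), match_subscriptions(satisfied))  # ['P3', 'P4'] ['Q2']
```

Because P4 is shared by Q2 and Q3, it is evaluated once in phase 1 and counted for both queries in phase 2.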
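Predicate Matching: Single Tag Predicate
Applies to the single-tag, end-tag, and length-constraint predicates
Publication:
(length, 4),
(a1, 1), (b1, 2), (c1, 3), (d1, 4)
(a1, b1, 1), (a1, c1, 2), (a1, d1, 3),
(b1, c1, 1), (b1, d1, 2),
(c1, d1, 1)
[Figure: each tag of the publication is hashed (hash on the tag) into a table organized by predicate operator (=) and predicate value (1 to 4). The entry for tag a and value 1 holds the predicate $p_a = 1$ with id i; the entry for tag c and value 3 holds the predicate $p_c = 3$ with id j. Matched predicates set bits i and j in the predicate bit vector.]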
Subscription Matching: Clustering Algorithm
• Cluster queries based on the access predicates
• Access predicates are shared by all queries in a cluster
• Only check clusters whose access predicates are matched
• Open question: how to choose an effective access predicate
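The clustering idea can be sketched as grouping queries under one access predicate each and skipping whole clusters whose access predicate is not satisfied. Since the choice of access predicate is left open, the sketch simply takes a selector function; picking the first or the last predicate of a query is an assumption made for illustration, as are all names and the sample data.

```python
from collections import defaultdict

# Queries as lists of predicate ids (hypothetical sample data).
QUERIES = {"Q1": ["P1", "P2"], "Q2": ["P3", "P4"], "Q3": ["P5", "P4"]}


def build_clusters(queries, choose_access):
    """Group query ids by the access predicate selected for each query."""
    clusters = defaultdict(list)
    for qid, pids in queries.items():
        clusters[choose_access(pids)].append(qid)
    return clusters


def match(clusters, queries, satisfied):
    """Return (matched queries, number of queries actually checked)."""
    matched, checked = [], 0
    for access, qids in clusters.items():
        if access not in satisfied:
            continue  # the whole cluster is skipped without looking at its queries
        for qid in qids:
            checked += 1
            if all(pid in satisfied for pid in queries[qid]):
                matched.append(qid)
    return matched, checked


satisfied = {"P3", "P4"}  # result of the predicate matching phase
by_first = build_clusters(QUERIES, lambda pids: pids[0])
by_last = build_clusters(QUERIES, lambda pids: pids[-1])
print(match(by_first, QUERIES, satisfied))  # (['Q2'], 1)
print(match(by_last, QUERIES, satisfied))   # (['Q2'], 2)
```

Both choices return the same matches, but they differ in how many queries have to be inspected, which is why the selection of the access predicate matters.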
Experimental Evaluation
• All algorithms implemented in C
– GPX – the base encoding with counting
– GPX-ap – the base encoding with clustering (access pred.)
– YFilter & BPA
• DTDs used for generating workloads
– NITF DTD (News Industry Text Format)
– PSD DTD (Protein Sequence Database)
• Total filtering time averaged over 500 XML documents
– XML parsing time is negligible in the overall filtering time
• Intel Quad-Core 2.66 GHz, 4GB
Scalability in Number of XPEs
All XPEs are distinct
[Figure: plot annotated with “1 ms vs. 18 ms”, “ap on first”, and “ap on last”]
Scalability in Number of XPEs
XPEs workload contains duplicates
Effect of Path Length
Effect of Wildcards
Conclusions
• Novel XML/XPath encoding
• Leverages existing matching techniques
• Differs significantly from predominantly automata-based related work
• Outperforms related approach by an order of magnitude under many experimental conditions
Thank You!
• To learn more about X-ToPSS, please see
– http://msrg.org/tags/x-topss
Agenda
• XML-based Filtering Systems
• Mapping XML Filtering to Pub/Sub Matching
– XPath encoding
– XML encoding
• Experimental results
• Outlook
Content-based Publish/Subscribe
• Subscription: Boolean expression over predicates, where each predicate is an attribute-operator-value triple
(subject = news) ∧ (topic = travel) ∧ (date > 21.2.2011)
• Publication (a.k.a. event): Set of attribute-value pairs
(subject, news), (topic, travel), (date, 21.2.2011), …
The Pub/Sub Matching Problem
• Given an event, e, and a set of subscriptions, S, determine all subscriptions, s ∈ S, that match e.
[Figure: an event / publication is compared against the subscriptions to produce the matches]
Wide Applicability
• Selective information dissemination
• Location-based services
• Personalization, alerting services
• Application integration
• Service & resource discovery
• Network and distributed system management
• Monitoring, surveillance, and control
• Workforce management
• Workload management & job scheduling
• Business activity monitoring
• Business process management, monitoring, and execution
Matching Algorithm Techniques
• Amortized storage & processing
• Access predicates
• Cost model-driven subscription partitioning
• Cache-conscious data structure layout
• Asynchronous cache-level pre-fetching
• Event queue re-ordering and batch processing
• Parallelization of algorithms for SMP & multi-core
• FPGA-based acceleration (hardware-level)
eXtensible Markup Language
• XML – de facto standard for data exchange
– Web Services, data and application integration, information dissemination
• XPath – XML query language
– Also used as basis for other query languages (e.g., XQuery, XPointer, XSLT)
Our Research Goal
• Solve the XML filtering problem using content-based pub/sub matching algorithms.
• Why
– Build on and exploit several decades' worth of insights, rather than construct special-purpose solutions.
In a Nutshell
[Figure: the XML tree (section, subsection, figure, figure) is decomposed into the paths section-subsection-figure and section-figure, which are matched against the encoded XPath expressions]
Special-purpose XML/XPath Filtering Algorithms
• XFilter [Altinel et al. VLDB‘00]
• WebFilter [Pereira et al. VLDB’01]
• YFilter [Diao et al. TODS‘03]
• XTrie [Chan et al. ICDE‘03]
• AFilter [Candan et al. VLDB‘06]
• BPA [Huo & Jacobsen, ICDE‘06]
• BoXFilter [Moro et al. VLDB‘07]
• pFiST [Kwon et al. DKE’08]
From XML Filtering to Publish/Subscribe Matching
• XPath expressions are encoded in a predicate calculus
• XML documents are expressed as a set of paths from the root to a leaf in the document tree
– Each path is translated into sets of attribute-value pairs (tags and their location in the path)
• Matching algorithm
– The attribute-value pairs are matched against the predicates with traditional pub/sub matching algorithms
Possible Extensions
• Extend predicate calculus to encompass other XPath 2.0 features
• Alternative encodings
• Exploit DTD or schema information
• Exploit information about XPath expressions processed
X-ToPSS: XML-based Toronto Publish/Subscribe System
• Distributed, content-based publish/subscribe (cf. ICDCS’08)
– Exploit DTDs (Document Type Definition) to optimize subscription routing in distributed pub/sub systems
– Explain covering and merging optimizations for XML/XPath
• Alternative predicate-based XML/XPath matching algorithm that cannot exploit traditional pub/sub schemes (cf. ICDE’06)
• Encoding presented herein, cf. EDBT’2011 (forthcoming)
http://msrg.org/tags/x-topss
Example: XPath Query Encoding
[Figure: predicate index for the double-tags predicates. The first tag is hashed to its second tags (a1 → b1; b1 → a2, d1); under each tag pair the predicates are stored by operator and value (1 to 4), and each entry holds a predicate identifier (pid) from P1 to P5.]
Predicates shown: $d(p_{a_1}, p_{b_1})$ with values 1 and 4, $d(p_{b_1}, p_{a_2})$ with value 1, and $d(p_{b_1}, p_{d_1})$ with value 1
That’s Like Database Querying!
• Query and subscription are very similar.
• Data tuples and publication are very similar.
• However, the two problem statements are inverse.
[Figure: a query is evaluated against stored data tuples (about the past), whereas a publication is evaluated against stored subscriptions (about the future); both sides are labeled “sets of tuples”]
XML Document Path Encoding Example (with Duplicate Tags)
Document path: a-b-c-b-d
With occurrence numbers: a1-b1-c1-b2-d1 (a1-b1-c1-b1-d1 is crossed out as invalid)
Publication for a1-b1-c1-b2-d1:
(length, 5),
(a1, 1), (b1, 2), (c1, 3), (b2, 4), (d1, 5)
(a1, b1, 1), (a1, c1, 2), (a1, b2, 3), (a1, d1, 4),
(b1, c1, 1), (b1, b2, 2), (b1, d1, 3),
(c1, b2, 1), (c1, d1, 2),
(b2, d1, 1)
Publication for a1- -c1-b1-d1:
(length, 5),
(a1, 1), (c1, 3), (b1, 4), (d1, 5)
(a1, c1, 2), (a1, b1, 3), (a1, d1, 4),
(c1, b1, 1), (c1, d1, 2),
(b1, d1, 1)
Predicate Matching: Double Tags Predicate
Applies to predicates of the form $d(p_{t_1}, p_{t_2}) \;\mathrm{op}\; v$
Publication:
(length, 4),
(a1, 1), (b1, 2), (c1, 3), (d1, 4)
(a1, b1, 1), (a1, c1, 2), (a1, d1, 3),
(b1, c1, 1), (b1, d1, 2),
(c1, d1, 1)
[Figure: two-level hash index. Hash on the first tag, then hash on (occ # first tag, second tag, occ # second tag), giving the entries a1 → b1 and b1 → a2, d1; each entry is organized by predicate operator and predicate value (1 to 4).]
Matching Algorithm
1. Match all predicates (predicate matching) and record results in predicate bit vector
2. Match subscriptions based on predicate bit vector (subscriptions matching)
From here on, nothing is really new (we re-use pub/sub matching algorithms, as promised).
Subscription Matching: Counting Algorithm
• For each predicate, associate the queries that contain it: P1 → Q1; P2 → Q1; P3 → Q2; P4 → Q2, Q3; P5 → Q3
• For each query, record the number of predicates: Q1 = P1 ∧ P2 (2), Q2 = P3 ∧ P4 (2), Q3 = P4 ∧ P5 (2)
• For each query, count the number of satisfied predicates
• Example: P3 and P4 match, so the counts are Q1: 0, Q2: 2, Q3: 1, and Q2 is matched
Related Work: YFilter [Diao et al. TODS‘03]
Q1: a/b//a
Q2: a//b/d
Q3: a/*/*/*//b/d
[Figure: the automaton built for Q1 to Q3, with transitions labeled a, b, d, *, and ε, one accepting state for Q1 and one shared by Q2 and Q3]
Longer-term Vision
• Map matching problems for different languages onto an efficient pub/sub matching kernel
• For example, for:
– Graph-structured query / data (RSS, RQL)
– Tree-structured query / data (XML / XPath)
– Regular expressions / sentences
– Etc.