lazy query evaluation for active xml
UNIVERSITY of PENNSYLVANIA Grigoris Karvounarakis October 04
Lazy Query Evaluation for Active XML
Abiteboul, Benjelloun, Cautis, Manolescu, Milo, Preda
INRIA Futurs
presented by: Grigoris Karvounarakis Univ. of Pennsylvania CIS 650
October 14, 2004
Active XML
[Figure: an Active XML document, with its function nodes labeled]
Tree Pattern Queries
[Figure: a tree pattern query, with its result nodes and a descendant edge labeled]
Tree Pattern Queries
Similar to Pattern Trees from the TAX/TLC algebra
+ variable nodes, used to bind variables to sub-trees (variable nodes with the same name must be mapped to elements with the same tag name)
+ result nodes
Embedding (of a query q into a doc d) = Match
Result of embedding = bindings of output variables on witness tree
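The notion of an embedding can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the paper's algorithm: the `Node` and `PNode` classes, the `embeddings` function, and the sample restaurant data ("Chez Nous", "5 Main St") are made up for the example, text values are modelled as leaf nodes, and the same-name constraint on variable nodes is not enforced.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Optional


@dataclass
class Node:
    """Document node; text values are modelled as leaf nodes (an assumption)."""
    label: str
    children: list = field(default_factory=list)


@dataclass
class PNode:
    """Pattern node. Label "*" matches anything; var names an output variable."""
    label: str
    edge: str = "/"              # "/" = child edge, "//" = descendant edge
    var: Optional[str] = None
    children: list = field(default_factory=list)


def descendants(node):
    for child in node.children:
        yield child
        yield from descendants(child)


def embeddings(p, n):
    """Yield one {variable: document node} dict per embedding of p rooted at n."""
    if p.label not in ("*", n.label):
        return
    options = []
    for pc in p.children:
        candidates = n.children if pc.edge == "/" else descendants(n)
        found = [b for c in candidates for b in embeddings(pc, c)]
        if not found:
            return               # a required branch cannot be matched
        options.append(found)
    own = {p.var: n} if p.var else {}
    for combo in product(*options):
        bindings = dict(own)
        for b in combo:
            bindings.update(b)
        yield bindings


def leaf(tag, value):
    return Node(tag, [Node(value)])


# Hypothetical sample document and query, loosely following the hotel example.
doc = Node("nyHotels", [Node("hotel", [
    leaf("name", "Best Western"), leaf("rating", "*****"),
    Node("nearby", [Node("restaurant", [
        leaf("name", "Chez Nous"), leaf("address", "5 Main St"),
        leaf("rating", "*****")])])])])

query = PNode("nyHotels", children=[PNode("hotel", children=[
    PNode("name", children=[PNode("Best Western")]),
    PNode("rating", children=[PNode("*****")]),
    PNode("nearby", children=[PNode("restaurant", edge="//", children=[
        PNode("name", children=[PNode("*", var="X")]),
        PNode("address", children=[PNode("*", var="Y")]),
        PNode("rating", children=[PNode("*****")])])])])])

results = [{v: n.label for v, n in b.items()} for b in embeddings(query, doc)]
```

Here `results` holds a single binding, with X bound to the restaurant name and Y to its address; changing the hotel rating in `doc` leaves no embedding at all.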
No embedding …
… but if we evaluate function call 1 (marked in the figure)
Embedding Example
[Figure sequence: an embedding of the query into the document, binding the output variables X and Y]
Relevant rewriting
Function node 1 (getNearbyRestos) is a relevant function node
In general, a function node is relevant if there exists some rewriting of the document in which some of the nodes it produces belong to a match
Rewriting the document by invoking relevant function nodes produces relevant rewritings d1 →v1 d2 →v2 … dn
A document that contains no calls that are relevant to a query q is said to be complete for q
Problem definition
Given an Active XML document d and a query q, find an efficient way to evaluate the query over the document
Naïve approach: interleave query evaluation with function calls
Better: try to compute (a superset of) the relevant function calls for q and execute q over the rewriting of d (that results from executing these function calls)
Efficiency tradeoff:
  time to compute the approximation of the set of relevant functions (larger for a more accurate approximation)
  time to execute the function calls (smaller for a more accurate approximation)
  time to execute the query over the resulting rewriting of the document (smaller document for a more accurate approximation)
Outline
Definitions
Finding relevant calls
Sequencing relevant calls
Improving accuracy
Reducing detection time
Conclusions - Discussion
Linear Path Queries
/*()
/nyHotels/*()
/nyHotels/hotel/*()
/nyHotels/hotel/name/*()
/nyHotels/hotel/rating/*()
/nyHotels/hotel/nearby/*()
/nyHotels/hotel/nearby//*()
/nyHotels/hotel/nearby//restaurant/*()
/nyHotels/hotel/nearby//restaurant/name/*()
/nyHotels/hotel/nearby//restaurant/address/*()
/nyHotels/hotel/nearby//restaurant/rating/*()
Correct, but usually inaccurate
Ignores filtering conditions on the path from the root, or in other branches, that could make some of the functions irrelevant (e.g. there is no chance that a getNearbyRestos() function node under a hotel is relevant if the hotel rating is not “*****”)
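The list of linear path queries shown earlier can be derived mechanically from the query tree. The following is a minimal sketch, assuming a hypothetical `QNode` class that keeps only element labels and edge types (values and variables are left out); it is meant to reproduce that list, not to reflect how the paper's system generates LPQs.

```python
from dataclasses import dataclass, field


@dataclass
class QNode:
    label: str
    edge: str = "/"              # edge from the parent: "/" or "//"
    children: list = field(default_factory=list)


def linear_path_queries(root):
    """One LPQ per position in the query tree where a function call could sit."""
    queries = ["/*()"]           # a call in place of the root itself

    def visit(node, prefix):
        path = prefix + node.edge + node.label
        queries.append(path + "/*()")
        emitted_descendant = False
        for child in node.children:
            # A descendant edge also admits calls at any depth below this node.
            if child.edge == "//" and not emitted_descendant:
                queries.append(path + "//*()")
                emitted_descendant = True
            visit(child, path)

    visit(root, "")
    return queries


query = QNode("nyHotels", children=[QNode("hotel", children=[
    QNode("name"), QNode("rating"),
    QNode("nearby", children=[QNode("restaurant", edge="//", children=[
        QNode("name"), QNode("address"), QNode("rating")])])])])

lpqs = linear_path_queries(query)
```

None of the generated paths mentions the “*****” condition on the hotel rating, which is exactly the inaccuracy noted above.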
Node Focused Queries
For each node in the query tree, replace it with an OR node (to add a *() branch that matches any function, as with LPQs)
Then, for every node v in the resulting query tree, create qv = q – {v and its subtree}, with an output node fv at the position of the *() OR-sibling of v
Each such query tree involves the path from the root to the node (as in an LPQ) plus any parts of the tree that would have to be matched anyway for the whole query tree to match.
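The construction just described can be sketched as follows. This is an illustrative reading of the slide, not the paper's definition: the `QNode` class, the `or_call` and `output` flags, and the use of child-index tuples to identify nodes are all assumptions, and only element nodes of the example query are modelled.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class QNode:
    label: str
    edge: str = "/"
    children: list = field(default_factory=list)
    or_call: bool = False        # True: a *() call may stand in for this node
    output: bool = False         # True: this is the output node f_v


def add_or_branches(node):
    node.or_call = True
    for child in node.children:
        add_or_branches(child)


def positions(node, pos=()):
    """Yield every node position as a tuple of child indices from the root."""
    yield pos
    for i, child in enumerate(node.children):
        yield from positions(child, pos + (i,))


def nfq_for(q, pos):
    """q_v: q with OR branches, minus v's subtree, with an output call at v's place."""
    call = QNode("*()", output=True)
    if not pos:
        return call              # v is the root: only the call remains
    q_v = copy.deepcopy(q)
    add_or_branches(q_v)
    parent = q_v
    for i in pos[:-1]:
        parent = parent.children[i]
    call.edge = parent.children[pos[-1]].edge   # keep v's edge type
    parent.children[pos[-1]] = call
    return q_v


def node_focused_queries(q):
    return {pos: nfq_for(q, pos) for pos in positions(q)}


query = QNode("nyHotels", children=[QNode("hotel", children=[
    QNode("name"), QNode("rating"),
    QNode("nearby", children=[QNode("restaurant", edge="//")])])])

nfqs = node_focused_queries(query)
restaurant_nfq = nfqs[(0, 2, 0)]   # v = the restaurant node
```

In `restaurant_nfq` the name and rating branches of hotel are still present, so a call under nearby is only retrieved when the rest of the hotel pattern can still match.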
NFQ Example
[Figure sequence: the example query tree with a *() branch added at every node. Node labels: nyHotels, hotel, name (“Best Western”), rating (“*****”), nearby, restaurant, name, address, rating (“*****”); output variables X, Y. The later steps show an NFQ consisting of nyHotels with a single *() branch.]
Another NFQ Example
[Figure sequence: starting from the full query tree, the NFQ shown keeps nyHotels, hotel, name (“Best Western”), rating (“*****”) and nearby, with their *() branches, and drops the restaurant subtree.]
Node Focused Queries
Assuming that functions can return data of arbitrary type, the function nodes that are relevant for a query q are precisely the ones retrieved by the NFQs of q
Sequencing relevant calls
Naïve NFQA algorithm:
1. Evaluate all NFQs
2. Pick one of the returned functions, say fv
3. Evaluate the function and rewrite the document (d →fv d’)
4. Repeat until all NFQs return empty results (i.e., there are no more relevant calls)
After every iteration, although the NFQs remain the same, their results can change (since evaluating functions at step 3 above can introduce new function nodes or make some results irrelevant)
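The loop above can be sketched in Python. Everything concrete here is an assumption made for illustration: the `Node` class, the `services` table that stands in for remote calls, and especially `calls_under`, a crude stand-in for real NFQ evaluation that treats a call as relevant when its parent label is in a given set.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    is_call: bool = False


def calls_under(doc, labels):
    """Stand-in for NFQ evaluation: (parent, call) pairs whose parent label is in labels."""
    found = []

    def visit(node):
        for child in node.children:
            if child.is_call and node.label in labels:
                found.append((node, child))
            visit(child)

    visit(doc)
    return found


def nfqa(doc, find_relevant, services):
    """Naive loop: evaluate, pick one call, invoke it, rewrite, repeat."""
    invoked = []
    while True:
        relevant = find_relevant(doc)             # step 1: evaluate all NFQs
        if not relevant:                          # step 4: nothing relevant left
            return invoked
        parent, call = relevant[0]                # step 2: pick one call
        result = services[call.label]()           # step 3: invoke it ...
        pos = next(i for i, c in enumerate(parent.children) if c is call)
        parent.children[pos:pos + 1] = result     # ... and splice in the result
        invoked.append(call.label)


# Hypothetical services; getHotels returns data that contains further calls.
services = {
    "getHotels": lambda: [Node("hotel", [
        Node("nearby", [Node("getNearbyRestos", is_call=True)]),
        Node("museums", [Node("getNearbyMuseums", is_call=True)])])],
    "getNearbyRestos": lambda: [Node("restaurant")],
    "getNearbyMuseums": lambda: [Node("museum")],
}
doc = Node("nyHotels", [Node("getHotels", is_call=True)])
invoked = nfqa(doc, lambda d: calls_under(d, {"nyHotels", "nearby"}), services)
```

In this toy run the getNearbyRestos call only appears after getHotels has been invoked, which is why the relevance check is repeated on every iteration, and getNearbyMuseums is never invoked because it is never reported as relevant.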
Improving NFQA
“Predict” when NFQ results could not possibly have changed, and avoid reevaluating them
Identify dependences between NFQs and the effect of executing the functions they return
Influence of NFQs
[Figure: NFQ1 is nyHotels with a single *() branch; NFQ2 is the NFQ from the previous example (nyHotels, hotel, name “Best Western”, rating “*****”, nearby, with *() branches)]
NFQ1 can influence NFQ2, but not vice versa
NFQ1 may influence NFQ2 iff the output function node of NFQ1 is an ancestor (in the query tree) of the output node of NFQ2
Two NFQs belong to the same layer if they may influence each other (directly or transitively)
Inside every layer, we have to reevaluate every NFQ after every function call
Multiple equivalent NFQs (i.e., in the same layer) can only exist under //, so that, not knowing the output type, both nodes could appear as descendants of each other
e.g. //a, //b: in /a/b, //a matches /a and //b matches /a/b, while in /b/a, //b matches /b and //a matches /b/a
L1 < L2 iff some NFQ in L1 may influence (directly or transitively) some NFQ in L2
We have to process L1 before L2 (without having to process L1 again afterwards)
When processing of L1 has finished, the OR-nodes corresponding to returned functions are redundant, and thus the NFQs in L2 can be simplified by removing them
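Given a "may influence" relation, layers and their processing order can be computed by grouping mutually reachable NFQs. The sketch below assumes the relation is already available as a dictionary (how it is derived from the query tree is not modelled), and the NFQ names are hypothetical.

```python
def layers(influences):
    """Group NFQs that may influence each other; return groups in processing order.

    influences maps an NFQ name to the set of NFQ names it may directly influence.
    """
    nodes = set(influences) | {m for targets in influences.values() for m in targets}

    def closure(start):
        # Everything reachable from start, including start itself.
        seen, stack = {start}, [start]
        while stack:
            for nxt in influences.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reach = {n: closure(n) for n in nodes}
    # Same layer = mutual reachability (direct or transitive influence both ways).
    groups = {frozenset(m for m in reach[n] if n in reach[m]) for n in nodes}
    # A layer that influences another one reaches strictly more NFQs, so sorting
    # by decreasing closure size puts L1 before L2 whenever L1 < L2.
    ordered = sorted(groups, key=lambda g: (-len(reach[min(g)]), sorted(g)))
    return [sorted(g) for g in ordered]


# Hypothetical relation: NFQ1 influences NFQ2; NFQ2 and NFQ3 influence each other.
order = layers({"NFQ1": {"NFQ2"}, "NFQ2": {"NFQ3"}, "NFQ3": {"NFQ2"}})
```

With this input, NFQ1 forms its own layer and is processed first; NFQ2 and NFQ3 share a layer, so both would be reevaluated after every call made while processing it.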
Parallelizing calls
Let qlin be the linear path from the root to the output node of NFQ q, not inclusive (note: qlin is a regular expression)
Two NFQs q, q’ that belong to the same layer are independent iff there are no common words in the regular languages of qlin, q’lin
E.g.: //a, //b are independent
But //a//c and //b//c are not (e.g. both match /a/b/c)
If all NFQs in a layer are independent, we can call all functions returned by the same NFQ in a step of NFQA in parallel
Other sufficient conditions could exist, too …
Using types
Use function return type to “predict” shape of data that a function call can return
Similar to checking for the existence of a possible rewriting
If this shape cannot match the (corresponding part of the) query pattern, the function calls can be discarded
In some cases, one can go further and restrict not only the output type but also the specific names of functions that could match
Refined NFQs:
Use the set of function names of appropriate return type instead of *()
Use F-guides (later) to make them even more refined
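Replacing *() by a set of function names can be sketched with a very coarse notion of type. In this illustration a function's "return type" is reduced to the set of element names that may occur in its result; the `return_labels` table and its contents are assumptions made for the example, not signatures taken from the paper.

```python
def refine_call_branch(expected_label, return_labels):
    """Names of the functions whose declared result may contain expected_label."""
    return sorted(name for name, labels in return_labels.items()
                  if expected_label in labels)


# Hypothetical signatures: function name -> element names that may occur in its result.
return_labels = {
    "getHotels": {"hotel", "name", "rating", "nearby"},
    "getNearbyRestos": {"restaurant", "name", "address", "rating"},
    "getNearbyMuseums": {"museum", "name", "address"},
}

# The *() branch under nearby must eventually produce a restaurant element.
restaurant_calls = refine_call_branch("restaurant", return_labels)
museum_calls = refine_call_branch("museum", return_labels)
```

Under these assumed signatures the *() branch under nearby narrows to getNearbyRestos alone, so a getNearbyMuseums call in the same place would no longer be retrieved.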
Refined NFQ example
[Figure sequence: the NFQ over nyHotels, hotel, name (“Best Western”), rating (“*****”) and nearby, first with generic *() branches, then with some *() branches replaced by the function names getRating and getNearbyRestos]
Pushing queries
Similar to pushing selections on scans in relational queries or pushing queries to data sources in mediator systems
Reduce amount of (useless) data that are transferred (assuming functions correspond to remote (web) services), by filtering irrelevant matches and projecting only on output variable nodes
Lenient rewriting
Trade accuracy for efficiency
Use XPath or LPQs instead of NFQs (faster processing)
Use a lenient form of type checking (ignoring order and cardinality of elements)
Function call guides
Similar to dataguides, for function calls
One occurrence for each path that leads to some function node + pointers to function nodes
paths that don’t lead to functions are left out
pointers to getRating calls
pointers to getNearbyRestos, getNearbyMuseums calls
pointers to getHotels calls
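A function call guide of this kind can be sketched as a map from each path to the calls found under it. The `Node` class, the `build_fguide` function, and the sample document are assumptions for illustration; a real F-guide would be a tree-shaped summary rather than a flat dictionary.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    is_call: bool = False


def build_fguide(root):
    """Map each path that leads to at least one call to the list of those calls."""
    guide = defaultdict(list)

    def visit(node, path):
        for child in node.children:
            if child.is_call:
                guide[path].append(child)      # pointer to the call node
            else:
                visit(child, path + "/" + child.label)

    visit(root, "/" + root.label)
    return dict(guide)


def call(name):
    return Node(name, is_call=True)


# Hypothetical document with two hotels and several pending calls.
doc = Node("nyHotels", [
    Node("hotel", [
        Node("name", [Node("Best Western")]),
        Node("rating", [call("getRating")]),
        Node("nearby", [call("getNearbyRestos"), call("getNearbyMuseums")])]),
    Node("hotel", [
        Node("name", [Node("Pennsylvania")]),
        Node("rating", [Node("*****")]),
        Node("nearby", [call("getNearbyRestos")])]),
    call("getHotels"),
])

guide = build_fguide(doc)
summary = {path: [c.label for c in calls] for path, calls in guide.items()}
```

Both hotels share the single entry for /nyHotels/hotel/nearby, and /nyHotels/hotel/name has no entry at all because no call sits under it.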
Use F-guides for:
Generation of refined NFQs (use the return type within the appropriate F-guide part to get only function names that can indeed appear in the corresponding tree fragment)
Efficient approximation of relevant function nodes:
  evaluate queries (NFQs) on the F-guide
  evaluate queries on the original document using LPQs
Initial filtering: can get rid of NFQs for nodes that don’t have any children in the F-guide
Conclusions
Active XML: interesting new area
Nothing fundamentally novel
Applies known tools (distributed processing, lazy evaluation) in a new context, giving new life to documents
Greatest challenge: formulate the right research questions well
Answers to these well-formulated questions are fairly easy.
Contributions of this paper:
Formulates such an interesting question
Thorough understanding of different aspects of the problem (accuracy vs. performance and their effect on overall efficiency)
Questions?