
XML Path Matching: Implementing the X-scan operator and investigating ways to optimize X-scan

Participant Name: Guoquan Lee Participant Email: [email protected]

Faculty Advisor: Dr. Zachary G. Ives Faculty Advisor Email: [email protected]


Abstract

The emergence of the Extensible Markup Language (XML) as a standard for data representation on the Web is expected to facilitate data integration from different data sources. Data integration across a wide area necessitates a query processor that can query data sources on demand, receive streamed XML data from them, and combine and restructure the data in new XML output (Ives et al., 2002). The Tukwila data integration system proposed by Zachary G. Ives et al. is the first system that focuses on network-bound, dynamic XML data sources. The Tukwila architecture extends adaptive query processing and relational-engine techniques into the XML domain. While the Tukwila system provides better overall query performance than existing systems (Ives et al., 2002), there are still aspects of the system that need to be investigated. In this project we focused on the x-scan operator, which takes an XML text stream and a set of regular path expressions as inputs and incrementally outputs a stream of tuples assigning binding values to each variable in the XQuery expression. We extended x-scan to handle XQuery expressions with steps that use the descendant-or-self axis, and then optimized the extended x-scan.


1 Introduction

The need for data integration has existed for years. However, the heterogeneity of the sources’ data formats has made it difficult to integrate data from different sources. The emergence of the Extensible Markup Language (XML) as a standard for data representation on the Web is expected to facilitate data integration from different data sources. Since XML is emerging as a standard for data exchange, there arises the need for a query language designed specifically for XML data sources. To this end, the World Wide Web Consortium formed a working group to design XQuery, a query language for XML.

2 Related Work

Processing and integrating XML data poses many challenges. XML data is typically available across a network and the data can only be obtained through parsing a stream of XML data (Ives et al, 2002). Following Ives et al, we define such data sources as “network-bound” because query performance is bounded by the network speed and parsing times. Integration of network-bound XML data has yet to receive the attention of many researchers. “Most XML work concentrated on designing XML warehouses, exporting XML from relational databases, adding information retrieval-style indexing techniques to databases, and on supporting query subscriptions or continuous queries that provide new results as documents change or are added” (Ives et al, 2002). However, data integration over external data sources which may be dynamic necessitates a query processor that can request data from each of the sources, combine this data, and make further requests to the data sources if applicable. Thus, Ives et al propose an XML query processing architecture, implemented in the Tukwila system, which seeks to extend the relational processing models into the XML realm to provide adaptive XML query processing capabilities and support efficient network-bound querying for dynamic, external data sources. Features in the architecture include: support for efficient processing of scalar and structured XML content; a pair of streaming XML input operators which are the enablers of the adaptive query processing architecture; and a set of physical-level algebraic operators for combining and structuring XML content and supporting XQuery. While the Tukwila system provides better overall query performance than existing systems (Ives et al, 2002), there are still aspects of the system that need to be investigated. In this project, our focus is on enhancing the x-scan operator of the existing system.

Given an XML text stream and a set of regular path expressions as inputs, x-scan incrementally outputs a stream of tuples assigning binding values to each variable which has been matched in the XML stream. A binding value is typically a tree, in which case the tuple contains a reference to data within the Tukwila XML Tree Manager. The Tree Manager contains only a subset of the XML text that is fed into the system. XML text that is not required for further processing is not stored in the Tree Manager.

In this project we extended x-scan to handle a subset of XQuery expressions with steps that use the descendant-or-self axis. The descendant-or-self axis (abbreviated //) matches all descendants of the current context node as well as the context node itself. While the descendant-or-self axis is commonly used (Brundage 2004), many XML processing systems do not support XQuery expressions with such steps because the feature adds considerable complexity to XML path matching (Ives, 2004).

After extending x-scan to handle a subset of XQuery expressions with steps that use the descendant-or-self axis, we optimized the new x-scan operator.

3 Technical Approach

The existing x-scan algorithm provides the theoretical foundation for the extended x-scan; modifications were implemented to handle XQuery expressions with steps that use the descendant-or-self axis.

3.1 Overview of the existing x-scan algorithm¹

An illustration of the x-scan process is given in Figure 1. The XML stream is processed by an event-driven Simple API for XML (SAX) parser, which creates a series of notifications. The XML data is stored in the XML Tree Manager and is also matched against a series of finite state machines (responsible for XPath pattern matching). These state machines produce output binding values, which are then combined to produce binding tuples.

Figure 1: X-scan process²

Figure 2: XQuery Example³

¹ Please refer to Ives et al. (2002) for more details.
² Figure reproduced with permission from Ives et al. (2002).
³ Figure reproduced with permission from Ives et al. (2002).

Figure 3: State machines corresponding to XPaths in Figure 2⁴

Basic XPath expressions are a restricted form of regular expressions. Thus x-scan converts each XPath into a regular expression and generates its equivalent deterministic finite state machine. XPath expressions originating at the document root are initialized to the active mode, and the active machines’ states are updated as x-scan encounters subelements and attributes during document parsing. Figure 3 shows the state machines created for the example query of Figure 2. Initially, only the top-level machine (Mb in our example) is active. When any machine reaches an accepting state, it produces a binding for the variable associated with it. The machine then activates all of its dependent state machines, and they remain active while x-scan is scanning the value of the binding. In our example, Mn and Mt remain active while we scan children of b.
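To make the state-machine idea above concrete, here is a minimal sketch of a matcher for a linear child-axis path such as /b/c, driven by start-element and end-element events. It is an illustration only, not Tukwila's code: the class name PathMatcher, its methods, and the integer state encoding are all assumptions made for this example, and it covers only the simplest case (no attributes, no alternation, no dependent machines).

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Sentinel state: the current element lies off the path.
constexpr int kDead = -1;

// Hypothetical matcher for a linear child-axis path such as /b/c.
// State k means "the first k steps of the path have been matched".
class PathMatcher {
public:
    explicit PathMatcher(std::vector<std::string> steps)
        : steps_(std::move(steps)) {
        stack_.push_back(0);  // start state at the document root
    }

    // Called for each start-element event. Returns true when the machine
    // enters its accepting state, i.e. this element is a binding.
    bool startElement(const std::string& name) {
        const int depth = static_cast<int>(steps_.size());
        const int cur = stack_.back();
        int next = kDead;
        if (cur != kDead && cur < depth && steps_[cur] == name) {
            next = cur + 1;  // this element matches the next path step
        }
        stack_.push_back(next);
        return next == depth;
    }

    // Called for each end-element event: fall back to the previous state.
    void endElement() {
        assert(stack_.size() > 1 && "end tag without a matching start tag");
        stack_.pop_back();
    }

private:
    std::vector<std::string> steps_;  // element names along the path
    std::vector<int> stack_;          // one state per open element
};
```

For the path /b/c, feeding the events for `<b><c/><d><c/></d></b>` reports a binding for the first c only, because the second c sits under d and is therefore off the path. The stack is what lets the machine return to its earlier state when an end tag arrives.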

Associated with each machine is a table for binding values. As a machine reaches an accept state, it adds an entry containing its bound subtree values, and also an association with the entry’s parent binding (shown in Figure 1 as a dashed arrow from parent to child). In our example, Mb’s table would just store values of b, while Mn and Mt would store author/editor names and titles, respectively, and these would be associated with their corresponding b values. The final output of x-scan is essentially a join of the entries maintained by the machines, done for matching parent-child pairs.

3.2 The extended x-scan algorithm

In the existing Tukwila system, there is a consumer thread and a producer thread. In this project, we focus on the producer thread. The x-scan operator acts as the producer thread: it produces binding tuples and places them into a buffer shared with the consumer thread. For the purpose of this project, the consumer thread⁵ removes binding tuples from the buffer that the x-scan operator fills. Further details on the consumer thread are outside this paper’s scope⁶.

Most of the underlying concepts of XML path matching remain the same for XQuery expressions with steps that use the descendant-or-self axis as for expressions without such steps. For example, the creation of deterministic finite state machines from XPaths and the determination of when to turn a finite state machine on or off do not change. However, the extension of the x-scan operator requires us to simulate multiple instances of each deterministic finite automaton (DFA) running in parallel for each SAX event received. Thus, a new class diagram was built for the x-scan operator (depicted in Figure 4 and expressed in Unified Modeling Language). Each DFA can now have multiple instances of itself running.

⁴ Figure reproduced with permission from Ives et al. (2002).
⁵ A class diagram for the consumer thread that will be used in this project is given in Appendix 1.
⁶ For more details on the consumer thread, refer to Ives et al. (2002).

Figure 4: Class diagram for the producer portion of the extended x-scan

Referring to Figure 4, the XMLParser acts as the SAX parser that parses the input XML stream. The DFAManager’s role is to manage all the DFAs in the x-scan operation. Each DFAManager keeps track of the active DFAs during the execution of the x-scan process. Each DFA has a thread table entry which it uses to keep track of all its threads. Each DFA also keeps a list of the active threads. Each thread is in turn linked to an IOManager. The role of the IOManager is to write into the XML Tree Manager and the output buffer accessible to the consumer thread.

3.3 Principal technical challenges faced

We faced four major technical challenges in this project. First, as we were dealing with a substantial amount of input data, memory management was a challenging task. Second, implementing the x-scan operator required us to solve a problem similar to the producer-consumer problem in Computer Science, and preventing deadlocks and race conditions was not easy. Third, extending x-scan to handle a restricted set of XQuery expressions with // was another major technical challenge. Fourth, the “multi-threading” approach that this feature necessitated meant that there was more bookkeeping to do than in the existing x-scan. Thus, optimizing the extended x-scan was non-trivial.

3.3.1 Memory Management

The extended x-scan operator was written in C++. While C++ gives us better performance in XML path matching, the tradeoff is that the programmer is in charge of memory management. Memory management is especially challenging when the XML files being processed may be a few hundred MB in size. It is hard to determine the point where a memory violation has occurred because the program does not exit immediately; the effect is that the output produced by the program is inconsistent.

To aid us in memory management, we used Electric Fence to help us detect two common programming bugs: software that overruns the boundaries of a malloc() memory allocation, and software that touches a memory allocation that has been released by free().

Initially, our extended x-scan was not producing the right results when it was processing files of a few MB. We used Electric Fence to check for memory bugs and it pointed out exactly where our code touched memory allocations that had been freed. This problem had not surfaced when we tested our x-scan against files smaller than 1KB. After rectifying the problems pointed out by Electric Fence, our extended x-scan produced the correct output.

We then moved on to test the extended x-scan against files of a few hundred MB, and again found that it was not producing the right output. In this case, Electric Fence was not feasible because it uses at least two virtual memory pages for each of its allocations. Thus, to troubleshoot our extended x-scan with very large files, we used the GDB debugger, which allows us to see what is going on “inside” another program while it executes. While GDB does not point out exactly where the memory bug is, we used it to detect where the segmentation fault occurred. From that point, we modified the code to remove the memory bug.

3.3.2 The Pseudo Producer-Consumer Problem

Integrating the x-scan operator into Tukwila leads to a problem similar to the well-known producer-consumer problem in Computer Science. In our case, the producer thread does XML path matching and produces the binding tuples; the consumer thread consumes the binding tuples for further processing in other components of the Tukwila system. The producer and consumer threads share a common buffer. The major difference between the problem that we are trying to solve and the classical producer-consumer problem is that we do not have a fixed-size buffer. Nonetheless, preventing race conditions and deadlocks is equally challenging. Algorithms to solve the producer-consumer problem were carefully studied and modified appropriately to serve x-scan’s purposes.

In our solution, there are two global variables: num_tuples_produced (the number of binding tuples that the producer thread has placed in the shared output buffer) and num_tuples_consumed (the number of binding tuples that the consumer thread has processed from the shared output buffer). Only the producer thread updates num_tuples_produced and only the consumer thread updates num_tuples_consumed. For each binding tuple that the producer thread produces, it increments num_tuples_produced by 1. For each binding tuple that the consumer thread consumes, it increments num_tuples_consumed by 1. In addition, the consumer thread keeps a counter of its location in the output buffer. Since we do not have a fixed-size buffer, there is no need for the producer thread to go to sleep after producing a certain number of binding tuples. Instead, the producer thread keeps parsing and processing the XML stream until the “dispatcher” gives control to the consumer thread. When the consumer thread runs, it immediately computes the difference between num_tuples_produced and num_tuples_consumed and stores it in a variable y. The consumer thread then goes to the shared output buffer, seeks to its current position and processes y binding tuples from the output buffer.

3.3.3 Extending x-scan to handle a subset of //
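Before turning to the // extension, the counter scheme of Section 3.3.2 can be sketched roughly as follows. This is a hedged illustration rather than the project's code: the class name SharedOutput, its methods, and the Tuple alias are hypothetical, and the mutex is an assumption added so that the sketch stays safe under real preemptive threads (the text does not say how access to the buffer and counters is synchronized).

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a binding tuple.
using Tuple = std::string;

// Unbounded shared output buffer with the two counters from the text.
class SharedOutput {
public:
    // Producer side: append one tuple and increment num_tuples_produced.
    void produce(Tuple t) {
        std::lock_guard<std::mutex> lock(mu_);  // assumption: lock-based sync
        buffer_.push_back(std::move(t));
        ++num_tuples_produced_;
    }

    // Consumer side: compute y = produced - consumed once, then process
    // exactly y tuples starting at the consumer's current position.
    std::vector<Tuple> consumeAvailable() {
        std::lock_guard<std::mutex> lock(mu_);
        const std::size_t y = num_tuples_produced_ - num_tuples_consumed_;
        std::vector<Tuple> out;
        out.reserve(y);
        for (std::size_t i = 0; i < y; ++i) {
            out.push_back(buffer_[position_++]);
            ++num_tuples_consumed_;
        }
        assert(num_tuples_consumed_ <= num_tuples_produced_);
        return out;
    }

private:
    std::mutex mu_;
    std::vector<Tuple> buffer_;            // grows without a fixed bound
    std::size_t num_tuples_produced_ = 0;  // written only by the producer
    std::size_t num_tuples_consumed_ = 0;  // written only by the consumer
    std::size_t position_ = 0;             // consumer's place in the buffer
};
```

Because the buffer is unbounded, produce never blocks waiting for space, which mirrors the observation that the producer has no reason to sleep. A consumer that finds y equal to zero simply returns an empty batch.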

Because the descendant-or-self axis (abbreviated //) matches all descendants of the current context node as well as the context node itself, extending x-scan to handle // required us to handle the case where multiple instances of a DFA run in parallel for each SAX event received. To simulate multiple instances of the DFA carrying out XML path matching for each SAX event received, it suffices to spawn a new “thread” for each new instance of the DFA. We can think of the DFA as a process in a traditional operating system. Each DFA acts to group resources together. In this scenario, the resources include the state function, the set of states, the start state and the end state. Multiple running instances of the DFA can be thought of as multiple threads of execution. In our case, a thread has a cursor that keeps track of where it currently is in the DFA. Each thread also keeps a stack of the states that it passed through to reach the current state⁷.
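The following sketch shows one way to simulate several DFA instances at once for a path containing //, such as /b//c. It is an assumption-laden illustration, not the extended x-scan itself: the names Step and DescendantMatcher are hypothetical, and for brevity it keeps a single stack holding the set of live cursors at each depth instead of a separate state stack per thread as described above.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// One step of a path. descendant == true means the step is preceded by //.
struct Step {
    std::string name;
    bool descendant;
};

// Hypothetical matcher: each cursor in the current set plays the role of
// one "thread", i.e. one running instance of the automaton.
class DescendantMatcher {
public:
    explicit DescendantMatcher(std::vector<Step> steps)
        : steps_(std::move(steps)) {
        std::set<std::size_t> start;
        start.insert(0);  // a single thread at the start state
        stack_.push_back(start);
    }

    // Start-element event. Returns true if some thread reaches acceptance.
    bool startElement(const std::string& name) {
        std::set<std::size_t> next;
        for (std::size_t k : stack_.back()) {
            if (k == steps_.size()) continue;  // this thread already accepted
            if (steps_[k].descendant) next.insert(k);        // keep looking deeper
            if (steps_[k].name == name) next.insert(k + 1);  // spawn advanced thread
        }
        const bool accepted = next.count(steps_.size()) > 0;
        stack_.push_back(std::move(next));
        return accepted;
    }

    // End-element event: restore the threads that were live one level up.
    void endElement() {
        assert(stack_.size() > 1 && "end tag without a matching start tag");
        stack_.pop_back();
    }

    std::size_t activeThreads() const { return stack_.back().size(); }

private:
    std::vector<Step> steps_;
    std::vector<std::set<std::size_t>> stack_;  // live cursors per open element
};
```

With the path /b//c, a c element at any depth below b is accepted, and after such a match two cursors are live: one that accepted and one that continues to look for further c elements nested inside. That growth in live cursors is the extra bookkeeping that // introduces.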

To show the interaction of the different objects in the new x-scan operator, we make use of a collaboration diagram expressed in UML (see Figure 5). We illustrate one of the scenarios that might take place during x-scan execution. Assume that the XMLParser encounters a start element tag. The XMLParser object will then call the start_tag_handler method of the DFAManager. The DFAManager will iterate through its list of active DFA objects and send those objects a start_tag_received message. Each active DFA object will in turn send its active Thread objects a T_start_tag_received message. If bindings are found for the variable associated with the DFA and this DFA has no dependent DFAs, the Thread object will send a write_to_output_buffer message to the IOManager object, which will in turn write the binding tuple to the buffer shared with the consumer thread. Note that this collaboration diagram abstracts away many of the details of the processing which occurs when the DFA/Context is sent a SAX event. The reader is referred to the state diagrams for the DFA and Thread included in Appendices 2 and 3 respectively for more details on the DFA/Thread responses to the events that the DFA/Thread might detect.
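The message flow just described can be sketched as a chain of calls. The class and method names below follow the text, but everything else is an assumption for illustration: the signatures, the string-valued tuples, and the trivial rule that a thread "matches" a single element name are placeholders, not the project's actual interfaces.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Writes finished binding tuples to the buffer shared with the consumer.
class IOManager {
public:
    void write_to_output_buffer(const std::string& tuple) {
        output_buffer.push_back(tuple);
    }
    std::vector<std::string> output_buffer;  // stand-in for the shared buffer
};

// One running instance of a DFA. Placeholder logic: a binding is found
// when the element name equals the target name.
class Thread {
public:
    Thread(std::string target, IOManager* io)
        : target_(std::move(target)), io_(io) {
        assert(io_ != nullptr);
    }
    void T_start_tag_received(const std::string& name, bool has_dependents) {
        // Only a DFA without dependent DFAs writes its binding directly.
        if (name == target_ && !has_dependents) {
            io_->write_to_output_buffer(name);
        }
    }
private:
    std::string target_;
    IOManager* io_;
};

// Groups the threads (instances) that belong to one automaton.
class DFA {
public:
    void start_tag_received(const std::string& name) {
        for (Thread& t : active_threads) {
            t.T_start_tag_received(name, has_dependents);
        }
    }
    bool has_dependents = false;
    std::vector<Thread> active_threads;
};

// Receives parser events and forwards them to every active DFA.
class DFAManager {
public:
    void start_tag_handler(const std::string& name) {
        for (DFA* d : active_dfas) {
            d->start_tag_received(name);
        }
    }
    std::vector<DFA*> active_dfas;
};
```

In this arrangement the parser only ever talks to the DFAManager, so adding or retiring a DFA or a thread does not change the parser-facing interface.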

⁷ In the original x-scan, the DFA kept a stack of the states it passed through to reach the current state. This stack is needed for XML path matching because the DFA must return to the previous state when it receives an end tag.


Figure 5: Collaboration Diagram of the extended x-scan process⁸

⁸ Figure 5 is drawn using UML. Following UML notation, the * in this figure means multiple objects of that class.

3.3.4 Optimization of x-scan

Optimizing x-scan is a challenging problem. Supporting a subset of XQuery expressions with // necessitates the “multi-threading” approach that we use in the extended x-scan. The tradeoff is that more bookkeeping needs to be done, so the extended x-scan will always be slower than the x-scan in the existing system.

To improve the performance of the extended x-scan, we used a profiling tool called gprof. Profiling allows us to learn where our program spent its time and which functions called which other functions while it was executing. The information that gprof provides gave a strong indication of which methods were candidates for rewriting to make the extended x-scan run faster.

The first optimization that we applied to the extended x-scan was to identify and use the data structures that are efficient in our context. For example, we switched from linked lists to hash tables for our bookkeeping.

The second optimization was to reduce the I/O cost of writing to the Tree Manager by reducing duplication of information. For example, instead of writing the full name of an element’s start tag and end tag, we save on I/O costs by using “tokenization” (mapping the name of the start tag to a number and establishing certain conventions to denote the end tag of the same element).

The third optimization was to cut down on unnecessary object creation. For example, in cases where only one instance of a certain class is required, the Singleton design pattern was used instead of creating multiple instances. References were passed where appropriate instead of creating new objects each time. Reducing unnecessary object creation yielded large gains in performance.

Timings were collected to show the difference between the original extended x-scan and the optimized extended x-scan. The timings reported were from a 1.4GHz Linux machine. The XQuery expressions that were used were of the form:

for $x in document("p.xml")/b/c, $y in $x/d

Note that p.xml denotes the XML file we are parsing; b, c and d are element names. The XML files were kindly provided by Dr. Zachary Ives. Please refer to Table 1.

File Size    Optimized extended x-scan    Unoptimized extended x-scan⁹
10MB         1m36.091s                    Approximately 6 minutes
62MB         5m19.809s                    Approximately 20 minutes

Table 1: Difference in Timing Results
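Returning briefly to the tokenization optimization described earlier in this section, the idea can be sketched as a small hash-table-backed mapping from tag names to integers. The specifics here are assumptions made for illustration: the class name TagTokenizer is hypothetical, and using the negated start token to denote an end tag is just one possible convention, since the text does not state which convention was adopted.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical tokenizer: each distinct tag name is written once and is
// referred to by a small integer afterwards.
class TagTokenizer {
public:
    // Returns the token for a start tag, assigning the next free positive
    // integer the first time a name is seen.
    int startToken(const std::string& name) {
        auto it = ids_.find(name);
        if (it != ids_.end()) return it->second;
        const int id = static_cast<int>(names_.size()) + 1;
        ids_.emplace(name, id);
        names_.push_back(name);
        return id;
    }

    // Assumed convention: an end tag is the negated start token.
    int endToken(const std::string& name) { return -startToken(name); }

    // Recovers the tag name from either a start token or an end token.
    const std::string& name(int token) const {
        const int id = token < 0 ? -token : token;
        assert(id >= 1 && id <= static_cast<int>(names_.size()));
        return names_[id - 1];
    }

private:
    std::unordered_map<std::string, int> ids_;  // name -> token
    std::vector<std::string> names_;            // token - 1 -> name
};
```

With this scheme a long element name that occurs thousands of times in a document costs one stored string plus one integer per occurrence, which is the kind of duplication the optimization aims to remove.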

⁹ The figures given for the unoptimized extended x-scan are rough approximations. We did not retain the slower version of x-scan and the figures given here are correct to the nearest minute. The important thing to note is that we managed to boost the performance of x-scan by around 4 times after optimization.

Timings (average of 2 separate runs) for the optimized extended x-scan for files of various sizes are given in Table 2 below.

File Size (MB)    Time
0.000078          0.012 seconds
0.004             0.034 seconds
0.2               1.310 seconds
2                 14.028 seconds
6                 38.205 seconds
10                1 minute 36.091 seconds
52                3 minutes 58.653 seconds
62                5 minutes 19.809 seconds
216               8 minutes 3.215 seconds
325               16 minutes 2.315 seconds
511               39 minutes 43.001 seconds

Table 2: Timing Results

3.3.5 Interesting finding

We ran the optimized extended x-scan on a 1.8GHz Linux machine and found that it was a full minute slower than the result attained using a 1.4GHz Linux machine. This surprising result may be due to the smaller cache size of the 1.8GHz machine compared to the 1.4GHz machine (256KB versus 512KB).

4 Conclusion

In this project, we built a component of a larger query processing system. Specifically, we extended the x-scan operator to handle a subset of XQuery expressions with //. Supporting // required us to adopt a “multi-threading” approach. In addition to attaining a working knowledge of XML, XQuery, and C++, we were also exposed to the internal workings of a query processor.

During the course of this project, we applied the concepts learned in Operating Systems to solve the task at hand. We realized that the interaction between the x-scan operator (which produces binding tuples) and other operators in Tukwila (which do further processing on the binding tuples produced by x-scan) could lead to deadlocks and race conditions. To prevent deadlocks and race conditions, we treated the problem as a pseudo producer-consumer problem and tackled it by modifying the algorithm used to solve the theoretical producer-consumer problem.

During the testing phase of our project, we learned that memory management becomes very important when dealing with large files. Bizarre results from the x-scan most often boiled down to memory bugs. The big challenge was to spot the location where the memory violation occurred. Tools like GDB and Electric Fence were used to aid us in finding memory bugs.

We then proceeded to optimize this new implementation. Surprisingly, fancy algorithms were not needed to achieve a significant improvement in performance. We learned that looking at the code and reducing duplication of information and unnecessary object creation could boost performance greatly.

Tukwila was built using Visual C++. Currently, work is in progress to make Tukwila run correctly on Linux. We believe that once Tukwila runs correctly on Linux, integration of our extended x-scan into the system will be feasible. To ease the integration process, we have worked out a set of APIs that can be used to interpret the results that the extended x-scan generates.

In retrospect, we could have written our own code for automata manipulation instead of using the third-party Automata Standard Template Library (ASTL). For XML path matching, we are primarily interested in a restricted form of regular expressions. We do not really need the flexibility provided by ASTL, which supports the full range of regular expressions. Writing our own code for automata manipulation would give greater flexibility in boosting the speed of x-scan, as it might be the case that ASTL is not optimized for performance.

Another design decision that we would have taken differently is to identify the pointers that had to be shared right at the start of the project. We started the project using shared pointers (provided by the Boost Library). Using shared pointers where they were not required meant that x-scan ran slower than it could have, as shared pointers impose a performance overhead (due to the bookkeeping that the shared pointer itself does). It was only in the second semester, when we were looking to enhance the performance of x-scan, that we started using raw pointers. This was a slow and tedious process: we had to factor in memory management that we had not considered carefully at the start of the project, and it required substantial changes to the code. We felt that more time invested earlier in the project would have saved us time in effecting this change and given us more time to investigate other forms of optimization for x-scan.

Currently, the new x-scan is able to handle a single // in the XQuery expression that is passed to it. It might be worthwhile to investigate ways to optimize the extended x-scan by devising an algorithm that does less bookkeeping, as the current implementation does a lot of bookkeeping to support //. Also, it is possible to use this new x-scan as a foundation and extend it to handle multiple //. As the new x-scan handles a restricted set of predicate expressions, another future extension could be to allow x-scan to handle logical operators in the predicate. Indeed, there are many ways that x-scan can be extended to handle a less restricted set of XQuery; however, time and resources prevent us from expanding it to support XQuery in its full glory. The XQuery expressions that are most widely used are the prime candidates for future extensions to x-scan.


Appendix 1: Class diagram for the consumer portion of x-scan


Appendix 2: State chart diagram for the DFA expressed in UML


Appendix 3: State chart diagram for the Thread class expressed in UML


References

Abiteboul, Serge, Peter Buneman, and Dan Suciu. Data on the Web: From Relations to Semistructured Data and XML. San Francisco, CA: Morgan Kaufmann Publishers, 2000.

Brundage, Michael. XQuery: The XML Query Language. Boston, MA: Addison Wesley, 2004.

Chamberlin, D. “XQuery: An XML query language.” IBM Systems Journal 41.4 (2002): 597-615.

Gehrke, Johannes, and Raghu Ramakrishnan. Database Management Systems. Third Edition. New York, NY: McGraw Hill, 2003.

Harold, Elliotte R., and W. Scott Means. XML in a Nutshell: A Desktop Quick Reference. Second Edition. Sebastopol, CA: O’Reilly, 2002.

Ives, Zachary G., Alon Y. Halevy, and Daniel S. Weld. “An XML Query Engine for Network-Bound Data.” VLDB Journal 11.4 (2002).

Ives, Zachary G. Project meeting with author. November 2004.

Priestley, Mark. Practical Object-Oriented Design with UML. International Edition. Singapore: McGraw-Hill, 2000.

Tanenbaum, Andrew S. Modern Operating Systems. Second Edition. Upper Saddle River, NJ: Prentice-Hall, Inc., 2001.
