[ieee 2013 12th ieee international conference on trust, security and privacy in computing and...

On Change Detection of XML Schemas

Abdullah Baqasah, Eric Pardede, Wenny Rahayu Department of Computer Science and Computer

Engineering, La Trobe University Melbourne, Australia

{ambaqasah@students., e.pardede@, w.rahayu@}latrobe.edu.au

Irena Holubova (Mlynkova) Department of Software Engineering, Charles University

Prague, Czech Republic [email protected]

Abstract— Change detection of XML data has emerged as an important research issue in the last decade; however, the majority of change detection algorithms focus on XML documents rather than schemas. This is because documents that contain data are deemed more significant than the schema itself. This paper looks at the problem from a different perspective by maintaining XML Schema (XSD) changes and providing a more meaningful description of the detected changes. Our proposed algorithm, XS-Diff, stores XML Schema versions in a relational database, where the detection and storage of delta changes are performed on relational tables. We demonstrate the correctness of the proposed algorithm on a set of synthetic data. Our experimental results also show that XS-Diff is a more meaningful method than other change detection methods, providing deltas that are optimal or near-optimal and semantically correct.

Keywords— Change detection; XML Schema; XML data modelling; Algorithm

I. INTRODUCTION

XML (eXtensible Markup Language) has been widely used for representing, storing and manipulating data from different data sources. Due to its dynamic nature, XML data tends to change over time in different ways, and the respective system is required to detect such changes properly. As a consequence, an increasing number of studies have been dedicated to the detection of changes in XML documents [1], [2], [3], [4]. In some situations, rules imposed by XML schema languages, such as Document Type Definitions (DTDs) or XML Schemas (XSDs) [5], are used to specify and enforce XML document structure. These schemas may also change over time to reflect real-world changes and changes in user requirements, to correct mistakes in the initial design, or to extend or restrict complexType definitions. A tool for schema change detection can be useful at least in 1) maintaining (revalidating) associated XML documents when their schema evolves, 2) incremental maintenance of a relational schema generated by a schema-conscious approach for storing XML data, and 3) XML versioning.

In this paper, we address the problem of XML Schema change detection based on the three motivations above. Specifically, in XML version management systems, where schema changes need to be tracked by comparing the previous version with the edited one, the versioning system may use the change detection algorithm to merge multiple versions of the same file that were edited by multiple users in parallel. Thus, the detection algorithm would be a useful tool in parallel processing of XML data. We propose a relational-based algorithm called XS-Diff for detecting changes to XML Schemas. The relational model is used because previous research, such as [6] and [7], has shown that DOM-based change detection methods (e.g., [1] and [2]) may often fail to find semantically correct and optimal changes. Moreover, they suffer from scalability problems since they are not able to handle large XML documents. We have critically evaluated the XS-Diff algorithm in terms of correctness and result quality. Our results show that the algorithm, compared to other available tools such as X-Diff and DeltaXML (a tool developed by [8]), can produce a delta (the difference between the two XSD versions in the form of relational tables) that is meaningful with respect to XML Schema characteristics, and that it is capable of detecting optimal or near-optimal delta changes.

The resulting delta can be used in different scenarios. For example, it can be used to graphically represent (as trees) the changes between any two XML Schemas. In the context of XML Schema versioning, it can be used to generate one version from a previous one. Furthermore, given a particular date/time within a series of schema versions, a set of deltas can be used to rebuild the version at the nominated date/time. Schema versioning and its dependencies are outside the scope of this paper and are left as future work.

The organisation of the paper is as follows. Section II briefly describes the related work in the research domain. Section III defines the XML Schema model used in the XS-Diff algorithm. Section IV presents the relational tables used for the detection process. The detection algorithm is then introduced in Section V, along with a formal definition of the basic edit operations. In Section VI, we examine our approach on a set of synthetic XSDs that represent different schema changes and show how XS-Diff can overcome limitations of existing methods. In the final section we conclude the new XML Schema change detection approach and outline possible future work.

II. RELATED WORK

The XML Schema change detection problem is closely related to the problem of XML data change detection. The types of schema change primitives are also related to the XML Schema evolution area. We discuss each area of related work in the following sub-sections.

2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications

978-0-7695-5022-0/13 $26.00 © 2013 IEEE

DOI 10.1109/TrustCom.2013.119


A. Change Detection in XML Documents

Change detection has been studied for XML documents in the XyDiff [1], X-Diff [2], and diffX [3] algorithms. The XyDiff algorithm detects changes between ordered XML documents in a bottom-up fashion. In ordered documents both the parent-child relationship and the left-to-right ordering among siblings are significant, while only the parent-child relationship is important in unordered documents. The algorithm also takes advantage of XML specifications; for example, it handles attributes differently from text and element nodes, and it takes DTD ID attributes into account when matching elements. In addition to the usual insert, delete, and update operations, the algorithm supports a move operation. Despite its high performance and speed, XyDiff may fail to find an optimal or near-optimal result (e.g., a minimal edit script) because of the greedy rules the algorithm imposes.

In contrast to XyDiff, the X-Diff algorithm handles unordered XML documents in a top-down fashion, but it does not support the move operation. It achieves optimal changes by integrating key XML structure characteristics with standard tree-to-tree correction methods. [3] states that the algorithm is best suited for detecting changes between versions of XMLized relational data sets. In that case, the changes usually occur at the leaf nodes (the content) of the trees while the internal nodes (the structure) remain intact, which does not fit semi-structured XML documents.

Although XyDiff and X-Diff will detect differences between XML document versions, [6] argues that they may often fail to detect semantically correct and optimal changes. Furthermore, they suffer from scalability problems due to the use of DOM to represent the compared trees.

diffX algorithm performs the isolated tree fragment mapping technique [3] to minimize the size of the resulting edit script and optimize the runtime of mapping the nodes. The algorithm provides edit operations for both atomic nodes and non-atomic sub-trees. It considers primitive operations like insert, delete and move in addition to the more advanced operation replace which is introduced to enhance the minimality of the edit script. The algorithm can only handle ordered elements and unordered attributes. Thus, it does not hold well for XML Schemas where both ordered and unordered components must be handled.

DeltaXML, a commercial tool developed by [8], compares XML documents and also represents the changes in XML. The operations supported by DeltaXML are insert, delete and update, while move is not supported. As a result, when an XML document node (element or attribute) is moved from one part of the tree to another, the tool detects that change as a deletion from the first tree followed by an insertion into the second. Therefore, it generates a delta that is non-optimal in terms of the number of operations required for the transformation.

To address the scalability problems of the XML document change detection techniques described above, a number of algorithms based on relational database systems have been proposed. The idea is to exploit the relational environment to store XML data and then apply a set of SQL queries to find changes between the stored documents. Xandy-U [9] and Xandy-O [10] are two methods proposed for detecting changes to unordered and ordered XML documents, respectively, using a schema-oblivious technique to store the data in relational tables. The authors of Xandy-U and Xandy-O also propose HELIOS [4] and OXONE [11], which use the schema-conscious approach for detecting changes to unordered and ordered XML documents respectively. The change detection algorithm in all four methods is essentially the same and consists of two phases: the best matching sub-trees computation phase and the change detection phase.

For ordinary XML documents with the general insert, delete, update, and move operations, the previous methods work well. However, XML Schema, which is a specialised form of XML document, has a richer variety of edit operations. For example, in XML Schema a cardinality is associated with elements. The cardinality (represented by the minOccurs and maxOccurs attributes in the element declarations) indicates the minimum and maximum number of occurrences of a specific element, and it may change as the schema evolves. Hence, XML document change detection cannot be used directly to handle this kind of schema change.

B. Change Detection in XML Schemas

For the change detection of XML schemas (e.g., XML Schema or DTD instances), only a few approaches have been proposed, such as DTD-Diff [6] and DiffDog [12]. DTD-Diff, as suggested by the name, is used to find changes between versions of a DTD. The algorithm takes two DTD files representing the first and second versions as input and returns a list of changes. It defines types of changes for element type declarations, attribute declarations, and entity declarations. Because a DTD is represented as a content tree, the paper also addresses the types of changes to element contents. As the XML Schema format is different from that of a DTD, our proposed changes between the old and the new versions are different from the ones that occur between DTD versions. In fact, XML Schema has a richer variety of change operations than DTD. This motivates us to explore a new approach developed specifically to handle XML Schemas.

Altova provides a commercial tool called DiffDog [12] that includes XML Schema differencing functionality. DiffDog helps to efficiently update associated XML documents when their schemas evolve. The tool maps the root elements of both versions of the schema and then maps the corresponding child elements. It also maps global elements and complex types. Despite its power and efficiency in propagating changes to the related XML documents, it ignores any matching of global attribute declarations or simple type definitions between the two versions. This is because the main purpose of the tool is to automatically identify identical elements and to let users map differences and generate XSL transformations to update XML data files. Because the generated file serves this different purpose, it does not contain all the information required to transform the first schema into the second one, which is an essential goal in this paper.


C. XML Schema Evolution

XML Schema change detection cannot be handled well without investigating the techniques that are used to manage XML schema evolution. The evolution concept is important because our proposed change operations are based on the evolution primitives defined by those techniques. For example, [13] studied the impact of XML Schema updates on the validity of related documents. The authors first devised a set of atomic evolution primitives to be applied to the basic components of the schema. Then, they presented some high-level evolution primitives that are composed of the atomic ones. The resulting approach keeps track of schema changes and identifies the document portions that are affected by those updates and therefore require revalidation.

In that work, the evolution primitives are classified into three main categories: insertion, modification, and deletion of the main XML Schema components: element, simple type, and complex type. We extend the proposed classification by considering more schema components, such as attributes, model groups, and restriction facets. Thus, our approach allows finer granularity in identifying schema changes.

III. XML SCHEMA MODEL

A. XML Schema Components and Tree Representation

According to [14], the constructs of XML Schema can be categorised into basic, advanced, and auxiliary constructs. Our main emphasis is on the basic components: element, attribute, simpleType and complexType. element components are used to assign names and simple/complex data types to XML elements, whereas attribute components are used to assign names and simple data types to XML attributes. simpleType components are used to define a simple data type for an element or attribute, whereas complexType components are used to define the content model of elements (i.e., sequence, choice, all, or their combinations) and their possible attributes.

We also study other complementary schema components (appearing as child nodes of the main components in the schema tree) including model groups and facets. Model groups such as sequence, choice, or all always appear as a part of a complex type definition or as a part of a named model group. Facets, on the other hand, always appear as leaf nodes in the schema tree and are used to restrict complex or simple types. Examples of facets are: minInclusive, maxInclusive, enumeration, and pattern.

For detecting XML Schema changes, we first define its tree model as follows:

Def. 1 (XSD Tree). An XSD is a tree T = (AD, ED, CT, ST, MG, F), where

• AD is a set of attribute declaration nodes of the form ad=(adN, adT, adF, adD, adU), where adN is the name of the node, adT is the type of the node (i.e., a built-in or user-derived simple type) and adF, adD, and adU are the values of the fixed, default, and use attributes of the attribute declaration node respectively,

• ED is a set of element declaration nodes of the form ed=(edN, edT, edMn, edMx, edO), where edN is the name of the node, edT is the type of the node (elements may have simple or complex types), edMn and edMx are the values of the minOccurs and maxOccurs attributes of the element declaration node respectively and edO (used only when the element is declared locally) is the order of the element node among its siblings,

• CT is a set of complex type definition nodes of the form ct=(ctN, ctP, ctT, ctD, ctB), where ctN is the name (if defined globally) of the complex type node, ctP is the path expression id of the parent (i.e., element or schema) of the node, ctT={complexContent, simpleContent} is the content type of the node, ctD={restriction, extension} is the derivation method, and ctB is the base type of the node,

• ST is the set of simple type definition nodes of the form st=(stN, stP, stD, stB), where stN is the name (if defined globally) of the simple type node, stP is the path expression id of the parent (i.e., element or attribute) of the node, stD={restriction, list, union} is the derivation type of the node and stB is the value of the 'base', 'itemType', or 'memberTypes' attribute,

TABLE I. TWO VERSIONS OF XML SCHEMA (XSD1 AND XSD2)

XSD1:

    <xs:schema>
      <xs:attribute name="A1" type="xs:string"/>
      <xs:element name="E1" type="E1T"/>
      <xs:complexType name="E1T">
        <xs:sequence>
          <xs:element name="E2" type="xs:string"/>
          <xs:element name="E3" type="xs:string"/>
          <xs:element name="E4" type="E4T"/>
        </xs:sequence>
        <xs:attribute ref="A1"/>
      </xs:complexType>
      <xs:complexType name="E4T">
        <xs:sequence>
          <xs:element name="E5" maxOccurs="5">
            <xs:complexType>
              <xs:sequence>
                <xs:element name="E6" type="xs:string"/>
                <xs:element name="E7">
                  <xs:simpleType>
                    <xs:restriction base="xs:integer">
                      <xs:maxExclusive value="100"/>
                    </xs:restriction>
                  </xs:simpleType>
                </xs:element>
              </xs:sequence>
            </xs:complexType>
          </xs:element>
        </xs:sequence>
      </xs:complexType>
    </xs:schema>

XSD2:

    <xs:schema>
      <xs:element name="E6" type="xs:string"/>
      <xs:element name="E1" type="E1T"/>
      <xs:complexType name="E1T">
        <xs:sequence>
          <xs:element name="E8">
            <xs:complexType>
              <xs:sequence>
                <xs:element name="E2" type="xs:string"/>
                <xs:element name="E3" type="xs:string"/>
              </xs:sequence>
            </xs:complexType>
          </xs:element>
          <xs:element name="E4" type="E4T"/>
        </xs:sequence>
        <xs:attribute name="A1" type="xs:string"/>
      </xs:complexType>
      <xs:complexType name="E4T">
        <xs:sequence>
          <xs:element name="E5" maxOccurs="10">
            <xs:complexType>
              <xs:sequence>
                <xs:element ref="E6"/>
                <xs:element name="E7">
                  <xs:simpleType>
                    <xs:restriction base="xs:integer">
                      <xs:maxExclusive value="100"/>
                      <xs:minInclusive value="1"/>
                    </xs:restriction>
                  </xs:simpleType>
                </xs:element>
              </xs:sequence>
            </xs:complexType>
          </xs:element>
        </xs:sequence>
      </xs:complexType>
    </xs:schema>


[Figure: tree diagrams of T1 and T2. Each node is labelled with its node type and path id (e.g., ad2, ed3, ct4) and, where applicable, its name; local elements are annotated with their local order; the local-to-global and global-to-local migrations are marked. The legend distinguishes migrated/moved, deleted, inserted, updated, and unchanged nodes.]

Fig. 1. XML Schema trees T1 and T2 of the versions in TABLE I (T1 represents XSD1 and T2 represents XSD2)

• MG is the set of model group operators of the form mg=(mgT, mgMn, mgMx, mgO), where mgT={sequence, choice, all} is the operator of the model group defined under a particular complex type node, mgMn and mgMx are the 'minOccurs' and 'maxOccurs' attributes of the model group node respectively, and mgO is the order of the model group node among its siblings, and

• F is the set of facet nodes of the form f=(fN, fV), where fN is the name of the facet and fV is its value.
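To make the tuple notation of Def. 1 concrete, the following minimal sketch models the six node sets as Python dataclasses. The class and field names are our own illustrative choices, not part of XS-Diff; the only facts taken from outside the definition are the XSD defaults of 1 for minOccurs and maxOccurs.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class AttributeDecl:
    """ad = (adN, adT, adF, adD, adU)"""
    name: str
    type: Optional[str] = None
    fixed: Optional[str] = None
    default: Optional[str] = None
    use: Optional[str] = None


@dataclass(frozen=True)
class ElementDecl:
    """ed = (edN, edT, edMn, edMx, edO)"""
    name: str
    type: Optional[str] = None
    min_occurs: str = "1"               # XSD default for minOccurs
    max_occurs: str = "1"               # XSD default; may be "unbounded"
    local_order: Optional[int] = None   # set only for local elements


@dataclass(frozen=True)
class ComplexType:
    """ct = (ctN, ctP, ctT, ctD, ctB)"""
    name: Optional[str]                 # None for an anonymous type
    parent_pathid: int
    content_type: Optional[str] = None  # complexContent | simpleContent
    derivation: Optional[str] = None    # restriction | extension
    base: Optional[str] = None


@dataclass(frozen=True)
class SimpleType:
    """st = (stN, stP, stD, stB)"""
    name: Optional[str]
    parent_pathid: int
    derivation: Optional[str] = None    # restriction | list | union
    base: Optional[str] = None


@dataclass(frozen=True)
class ModelGroup:
    """mg = (mgT, mgMn, mgMx, mgO)"""
    operator: str                       # sequence | choice | all
    min_occurs: str = "1"
    max_occurs: str = "1"
    order: Optional[int] = None


@dataclass(frozen=True)
class Facet:
    """f = (fN, fV)"""
    name: str
    value: str


@dataclass
class XsdTree:
    """T = (AD, ED, CT, ST, MG, F)"""
    AD: List[AttributeDecl] = field(default_factory=list)
    ED: List[ElementDecl] = field(default_factory=list)
    CT: List[ComplexType] = field(default_factory=list)
    ST: List[SimpleType] = field(default_factory=list)
    MG: List[ModelGroup] = field(default_factory=list)
    F: List[Facet] = field(default_factory=list)


# A few nodes of T1 taken from Table I (not the whole tree).
t1 = XsdTree(
    AD=[AttributeDecl("A1", "xs:string")],
    ED=[ElementDecl("E5", max_occurs="5", local_order=1)],
    F=[Facet("maxExclusive", "100")],
)
```

Keeping one class per node type mirrors the later storage design, in which each node type is held separately so that only nodes of the same type are compared.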

Table I and Fig. 1 show the input XSD versions and their corresponding tree representations respectively. Each node in Fig. 1 holds information about that node in the schema (e.g., node ad2 in T1 represents the attribute node with “A1” as its name and “string” as its data type). In the next stage, the storage tables are created on the basis of these node types. Thus, we can easily match nodes of the same type (e.g., match attribute nodes from both versions) to find changes between them. At this point, we have explained the first step: parsing the input schemas and creating the corresponding trees. In what follows, we describe in detail how the underlying relational storage structure is used to store XSD trees.

IV. XML SCHEMA STORAGE IN RDBMS

The design of the relational tables used by the XS-Diff algorithm plays a key role in the change detection process. XML documents can be stored in relational databases using two approaches: 1) a structure-mapping approach, where a database schema is created based on the document structure or DTD and 2) a model-mapping approach, where the database schema is created based on the constructs of the XML document model (e.g., elements, attributes, and text nodes). We believe that model-mapping is the more suitable approach for our method to efficiently store XML Schema instances because the DTD or any structure definition for the schemas is not available. For this reason, our relational model XS-Rel extends a scalable technique, XRel [15], for storing and retrieving XML documents using relational databases. As in the XRel approach, the path expression is used as the unit of decomposition of XML Schema trees. For instance, the path from the root node schema to the first element node E1 in Fig. 1 can be denoted #/schema#/element[E1]. We note that an XML Schema is an XML document designed for the specific purpose of defining the legal structure of XML documents. Therefore, certain rules must be followed when writing an XML Schema (e.g., only element, attribute, simpleType and complexType nodes may exist at the top level of the schema tree). Based on this observation, we adapt the XRel storage technique into XS-Rel to make it more suitable for storing and later querying XML Schema components. For example, unlike XRel, where tables are created based on the type of XML document nodes (i.e., element, attribute, or text node), XS-Rel creates tables based on the XML Schema components presented in the previous section. The relational schema for XS-Rel is defined as follows:

• attributedecl(xsid, pathid, pPathid, name, type, fixed, default, use, isRef, isGlob)

• elementdecl(xsid, pathid, pPathid, name, type, minO, maxO, locO, isRef, isGlob)

• complextype(xsid, pathid, pPathid, name, contType, der, base, isGlob)

• simpletype(xsid, pathid, pPathid, name, der, base, isGlob)

• modelgroup(xsid, pathid, pPathid, modelG, minO, maxO, locO)

• facet(xsid, pathid, pPathid, name, value)

• xsdocument(xsid, documentpath)

• path(pathid, pathexp)

The first six relations reflect the node types presented in Def. 1. To capture the exact position of each node in the two input versions, we store the unique schema id (assigned by the system) in xsid field and the unique path id in pathid field.


Together, xsid and pathid form the primary key of each of the six relations. Additional information is also stored in those tables for use during the detection process. For example, the isRef field indicates whether the attribute/element declaration has a ref attribute or not (see node ad9 in T1 in Fig. 1).

The xsdocument table is used to store both the unique identifier of the schema xsid and the respective XML Schema file path documentpath. This will allow us to store information about XML Schema versions to be compared. The path table is used to store all unique paths from the root node (schema) to every node in the tree. It contains two attributes: pathid and pathexp to store path unique ids and path expressions respectively. Examples of node unique paths extracted from Fig. 1 are shown in Table II.
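The XS-Rel relations can be sketched as SQLite tables as follows. This is a minimal illustration only: the column types, the quoting of the reserved word default, the use of SQLite itself, and the file names xsd1.xsd and xsd2.xsd are assumptions; the paper specifies only the relation and column names and the (xsid, pathid) key.

```python
import sqlite3

# Column types are assumed; only names and the composite key come from the text.
XS_REL_DDL = """
CREATE TABLE xsdocument (xsid INTEGER PRIMARY KEY, documentpath TEXT);
CREATE TABLE path (pathid INTEGER PRIMARY KEY, pathexp TEXT);
CREATE TABLE attributedecl (
    xsid INTEGER, pathid INTEGER, pPathid INTEGER, name TEXT, type TEXT,
    fixed TEXT, "default" TEXT, use TEXT, isRef INTEGER, isGlob INTEGER,
    PRIMARY KEY (xsid, pathid));
CREATE TABLE elementdecl (
    xsid INTEGER, pathid INTEGER, pPathid INTEGER, name TEXT, type TEXT,
    minO TEXT, maxO TEXT, locO INTEGER, isRef INTEGER, isGlob INTEGER,
    PRIMARY KEY (xsid, pathid));
CREATE TABLE complextype (
    xsid INTEGER, pathid INTEGER, pPathid INTEGER, name TEXT,
    contType TEXT, der TEXT, base TEXT, isGlob INTEGER,
    PRIMARY KEY (xsid, pathid));
CREATE TABLE simpletype (
    xsid INTEGER, pathid INTEGER, pPathid INTEGER, name TEXT,
    der TEXT, base TEXT, isGlob INTEGER,
    PRIMARY KEY (xsid, pathid));
CREATE TABLE modelgroup (
    xsid INTEGER, pathid INTEGER, pPathid INTEGER, modelG TEXT,
    minO TEXT, maxO TEXT, locO INTEGER,
    PRIMARY KEY (xsid, pathid));
CREATE TABLE facet (
    xsid INTEGER, pathid INTEGER, pPathid INTEGER, name TEXT, value TEXT,
    PRIMARY KEY (xsid, pathid));
"""


def create_xs_rel(conn):
    """Create the eight XS-Rel relations in the given connection."""
    conn.executescript(XS_REL_DDL)


conn = sqlite3.connect(":memory:")
create_xs_rel(conn)

# Register the two versions (file names are hypothetical).
conn.executemany("INSERT INTO xsdocument VALUES (?, ?)",
                 [(1, "xsd1.xsd"), (2, "xsd2.xsd")])
conn.execute("INSERT INTO path VALUES (?, ?)",
             (12, "#/schema#/ct[E4T]#/seq[0]#/element[E5]"))

# E5 (node ed12) keeps path id 12 in both versions; only maxOccurs differs.
conn.executemany(
    "INSERT INTO elementdecl VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [(1, 12, 11, "E5", None, "1", "5", 1, 0, 0),
     (2, 12, 11, "E5", None, "1", "10", 1, 0, 0)])

rows = conn.execute(
    "SELECT xsid, maxO FROM elementdecl WHERE pathid = 12 ORDER BY xsid"
).fetchall()
```

Because the path table is shared by both versions, a node that keeps its position receives the same pathid under both xsid values, which is what the composite key is meant to allow.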

Path expressions in Table II can be divided into three groups. The first group contains the nodes with unique path ids 2, 6, and 19, which are located only in tree T1 of Fig. 1. Nodes of this group are affected by migration, move, or deletion operations, and are thus not found at the same positions in T2. In the second group, the nodes with path ids 21 and 24 are affected by migration, move, or insertion operations and are located only in tree T2. Finally, nodes 4 and 12 in the last group represent nodes that are matched in both T1 and T2.
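As an illustration of how such path expressions might be generated, the sketch below walks an XSD with Python's standard xml.etree.ElementTree parser. The step syntax is inferred from the examples in Table II; the handling of anonymous types (CT/ST labels), the st prefix, the model-group index, and the decision to skip restriction and facet nodes are assumptions made to keep the example short.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"
GROUPS = {"sequence": "seq", "choice": "choice", "all": "all"}


def step(node, group_index):
    """Return the path step for one schema node, or None if it is skipped."""
    tag = node.tag.replace(XS, "")
    if tag in ("element", "attribute"):
        # Declarations with a ref attribute are labelled by the referenced name.
        return f"#/{tag}[{node.get('name') or node.get('ref')}]"
    if tag == "complexType":
        return f"#/ct[{node.get('name', 'CT')}]"
    if tag == "simpleType":
        return f"#/st[{node.get('name', 'ST')}]"
    if tag in GROUPS:
        return f"#/{GROUPS[tag]}[{group_index}]"
    return None  # restriction, facets, annotations, ... are not covered here


def path_expressions(node, prefix="#/schema"):
    """Collect root-to-node path expressions in document order."""
    paths = []
    group_index = 0  # position among sibling model groups (assumed meaning)
    for child in node:
        s = step(child, group_index)
        if s is None:
            continue
        if child.tag.replace(XS, "") in GROUPS:
            group_index += 1
        path = prefix + s
        paths.append(path)
        paths.extend(path_expressions(child, path))
    return paths


# A fragment of XSD1 from Table I, with the namespace declaration added.
XSD1_FRAGMENT = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:attribute name="A1" type="xs:string"/>
  <xs:element name="E1" type="E1T"/>
  <xs:complexType name="E1T">
    <xs:sequence>
      <xs:element name="E2" type="xs:string"/>
    </xs:sequence>
    <xs:attribute ref="A1"/>
  </xs:complexType>
</xs:schema>"""

paths = path_expressions(ET.fromstring(XSD1_FRAGMENT))
```

On this fragment the function yields, among others, the expressions listed for path ids 2, 4, and 6 in Table II. Assigning each distinct expression a numeric id would then give the content of the path table.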

In the next section we briefly describe XS-Diff algorithm phases and define their necessary change operations to be used in the change detection model.

V. XS-DIFF ALGORITHM

In this section, we present our novel algorithm XS-Diff to detect changes between two XML Schemas that are stored in the relational database. For simplicity and better understanding, the algorithm is split into five different phases: (1) Find matching components, (2) Detect migrated and moved components, (3) Detect deleted and inserted components, (4) Detect element order changes, and (5) Detect updated components. Each phase of the proposed algorithm will be discussed in detail in the following sub-sections. The pseudo-code of XS-Diff algorithm is depicted in Table III.

TABLE II. EXAMPLE OF T1 AND T2 TREE NODES DECOMPOSED IN ‘PATH’ TABLE

pathid  pathexp
2       #/schema#/attribute[A1]
4       #/schema#/ct[E1T]
6       #/schema#/ct[E1T]#/seq[0]#/element[E2]
12      #/schema#/ct[E4T]#/seq[0]#/element[E5]
19      #/schema#/ct[E4T]#/seq[0]#/element[E5]#/ct[CT]#/attribute[A2]
21      #/schema#/ct[E1T]#/seq[0]#/element[E8]
24      #/schema#/ct[E1T]#/seq[0]#/element[E8]#/ct[CT]#/seq[0]#/element[E2]

A. Phase 1: Finding Matching Components

In this phase, we only match identical components (nodes) in the first and second versions of the schema (as shown in Def. 2). We base our matching on the equivalence of path expressions (signatures) from the first and second versions. For example, the E5 element (node ed12 in Fig. 1) in the first and second tree versions of the schema shares the same pathid 12 (as shown in Table II). Hence, we store the resulting match in a temporary table corresponding to the type of the matched components (in this example temp_match_elementdecl). Simultaneously, we store the unique path (pathid 12) of the matched nodes in the path table, as seen in the example in Table II. We apply specific SQL queries for the matching and storing phase. Note that two matched components are considered identical if they share exactly the same path, even if they do not have the same values for their properties (e.g., the edMnOld and edMnNew values in the temp_match_elementdecl table, representing the old and new values of minOccurs respectively). This allows us to speed up finding updated nodes at a later stage by using these temporary tables. Line 08 in the algorithm performs this step. In what follows, we formally define the first phase of the change detection model.

Def. 2 (Finding Matching Components). Let $S_1$ and $S_2$ be two corresponding sets of nodes of the same type (e.g., $AD_1$ and $AD_2$) in the first and second versions of the XML Schema trees $T_1$ and $T_2$ respectively. $s_i \in S_1$ and $s_j \in S_2$ are matching components iff $s_i$ and $s_j$ have the same node signature, i.e., $\mathrm{pathid}(s_i) = \mathrm{pathid}(s_j)$.
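A matching query in the spirit of Def. 2 could look like the following sketch, which self-joins a reduced elementdecl table on pathid. The SQL is our own guess at what such a query might be, not the query used by XS-Diff; the columns edMxOld and edMxNew are named by analogy with edMnOld and edMnNew, and the sample rows cover only three elements of Fig. 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE elementdecl (
    xsid INTEGER, pathid INTEGER, name TEXT, minO TEXT, maxO TEXT,
    PRIMARY KEY (xsid, pathid));

-- Reduced sample: E2 exists only in version 1, E8 only in version 2,
-- and E5 keeps path id 12 in both versions.
INSERT INTO elementdecl VALUES
    (1, 6,  'E2', '1', '1'),
    (1, 12, 'E5', '1', '5'),
    (2, 12, 'E5', '1', '10'),
    (2, 21, 'E8', '1', '1');

-- Phase 1: components match iff their path ids are equal.
CREATE TABLE temp_match_elementdecl AS
    SELECT o.pathid, o.name,
           o.minO AS edMnOld, n.minO AS edMnNew,
           o.maxO AS edMxOld, n.maxO AS edMxNew
    FROM elementdecl AS o
    JOIN elementdecl AS n ON o.pathid = n.pathid
    WHERE o.xsid = 1 AND n.xsid = 2;
""")

matched = conn.execute(
    "SELECT pathid, name, edMxOld, edMxNew FROM temp_match_elementdecl"
).fetchall()

# Keeping old and new property values side by side lets a later phase
# find updated components without touching the base tables again.
updated = conn.execute(
    "SELECT pathid FROM temp_match_elementdecl "
    "WHERE edMnOld <> edMnNew OR edMxOld <> edMxNew"
).fetchall()
```

Here only E5 is matched, and because its maxOccurs changes from 5 to 10 it is also the only candidate for an update.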

B. Phase 2: Detection of Migrated and Moved Components

A component is intuitively considered to be moved if its old path in the first version changes to a new path in the second one. In this phase, we distinguish between two kinds of move changes: migration and move between sub-trees. Migration covers changes of the scope of an element, attribute, or simple/complex type from local to global (and vice versa). This type of change does not exist between XML document versions; therefore, we introduce it as a new type of change in our XS-Diff algorithm. Lines 11 to 28 of the algorithm in Table III show the steps to store migrated components. For example, line 25 is executed to detect the migration of the attribute A1 in Fig. 1 from global (node ad2 in T1) to local (node ad2 in T2). Likewise, line 22 is executed to detect the migration of element E6 from local (node ed15 in T1) to global (node ed20 in T2). Note that this operation requires node ed15 in T2 to be updated so that it contains the reference to the new global element.

The move operation, on the other hand, is widely adopted in many XML document change detection approaches, such as [1] and [3]. We consider the movement of element, attribute, model group, or facet nodes from one part of the tree to another. For example, elements E2 and E3 (nodes ed6 and ed7 respectively) in T1 are grouped under a new element E8 in T2 and thus moved to new positions (nodes ed24 and ed25) in T2 in Fig. 1. Lines 29 to 38 in the algorithm store the moved components. The formal definitions of the migration and move operations are omitted due to space limitations.
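Because the formal definition of move is omitted, the sketch below encodes one simple interpretation: a local declaration is moved when it has no path match, and an unmatched declaration of the same name appears under a different parent in the new version. The function find_moved and its dictionary-based input are hypothetical.

```python
def find_moved(old, new):
    """old/new map pathid -> (name, parent_pathid) for local declarations.

    Returns (name, old_pathid, new_pathid) triples for unmatched nodes that
    reappear under a different parent with the same name.
    """
    matched = old.keys() & new.keys()
    moved = []
    for o_id, (o_name, o_parent) in sorted(old.items()):
        if o_id in matched:
            continue
        for n_id, (n_name, n_parent) in sorted(new.items()):
            if n_id not in matched and n_name == o_name and n_parent != o_parent:
                moved.append((o_name, o_id, n_id))
    return moved


# Local elements under complex type E1T in Fig. 1.
old_elements = {6: ("E2", 5), 7: ("E3", 5), 8: ("E4", 5)}
new_elements = {8: ("E4", 5), 21: ("E8", 5), 24: ("E2", 23), 25: ("E3", 23)}

moved_components = find_moved(old_elements, new_elements)
```

For the example of Fig. 1 this pairs ed6 with ed24 (E2) and ed7 with ed25 (E3), while E8 stays unpaired and is left for the insertion phase.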

C. Phase 3: Detection of Deleted and Inserted Components

Intuitively, deleted components are available only in the old tree version T1, whereas inserted components are available only in the new version T2. The idea behind detecting deletions of elements and attributes is to remove the matched, migrated, and moved nodes from the set of elements/attributes existing in the old version T1. In doing so, we ensure that the remaining nodes in T1 do not have any matched nodes in T2 and that they are neither migrated nor moved between the two versions of the tree. Thus, we guarantee a more accurate and optimal delta by avoiding undesirable repeated operations, such as detecting a move and then also detecting a delete and an insert of the same node.

Deletion is also possible for simple and complex types. In this case, we eliminate the set of nodes in T1 from the set of matched nodes and the set of migrated nodes. For simplicity, we assume that simple/complex type nodes ideally perform migration instead of move operation to change their positions. Model group and facet nodes can also be deleted because of deletion of their parents (complex types and simple types respectively). An example of component deletion can be seen in A2 attribute (node ad19) that is deleted from T1 in Fig. 1.

We can think of the insertion operation as the reverse of deletion. Inserted components are available only in the new version of the schema, T2. Element E8 at node ed21, its complex type at node ct22, and the sequence model group at node mg23 in T2 in Fig. 1 are examples of component insertions.
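The elimination idea behind Phase 3 can be expressed with plain set differences, as in the sketch below. The function name and the representation of migrations and moves as (old_pathid, new_pathid) pairs are assumptions; the path ids are read off Fig. 1.

```python
def find_deleted_and_inserted(old_ids, new_ids, migrated, moved):
    """old_ids/new_ids: sets of path ids of one node type in T1 and T2.

    migrated/moved: iterables of (old_pathid, new_pathid) pairs from Phase 2.
    """
    matched = old_ids & new_ids
    relocated_old = {o for o, _ in migrated} | {o for o, _ in moved}
    relocated_new = {n for _, n in migrated} | {n for _, n in moved}
    deleted = old_ids - matched - relocated_old
    inserted = new_ids - matched - relocated_new
    return deleted, inserted


# Element declarations of Fig. 1.
deleted_elements, inserted_elements = find_deleted_and_inserted(
    old_ids={3, 6, 7, 8, 12, 15, 16},
    new_ids={3, 8, 12, 15, 16, 20, 21, 24, 25},
    migrated=[(15, 20)],            # E6: local to global
    moved=[(6, 24), (7, 25)],       # E2 and E3 regrouped under E8
)

# Attribute declarations of Fig. 1.
deleted_attributes, inserted_attributes = find_deleted_and_inserted(
    old_ids={2, 9, 19},
    new_ids={9},
    migrated=[(2, 9)],              # A1: global to local
    moved=[],
)
```

The result contains no deleted elements and a single inserted element (E8, path id 21), while the only deleted attribute is A2 (path id 19). Without the subtraction of migrated and moved nodes, E2, E3, and A1 would each be reported twice, once as a deletion and once as an insertion.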

TABLE III. XS-DIFF ALGORITHM

01 Input: xs1, xs2 schema documents
02 Output: migration and move tables, deletion and
03   insertion tables, element order changes
04   table, and update tables containing changes
05
06 /* Phase1: find matching components */
07 For every node n in xs1 and xs2 parsed schemas, do
08   temp_match_<n> = find_unique_<n>_path();
09
10 /* Phase2: find migrated and moved components */
11 For every node n in xs1 and xs2, do {
12   If(isSimpleType(n)) then {
13     migrated_simpletype = find_gtl_simpletype();
14     migrated_simpletype = find_ltg_simpletype();
15   }
16   Else if(isComplexType(n)) then {
17     migrated_complextype = find_gtl_complextype();
18     migrated_complextype = find_ltg_complextype();
19   }
20   Else if(isElement(n)) then {
21     migrated_element = find_gtl_element();
22     migrated_element = find_ltg_element();
23   }
24   Else if(isAttribute(n)) then {
25     migrated_attribute = find_gtl_attribute();
26     migrated_attribute = find_ltg_attribute();
27   }
28 }
29 For every node n in xs1 and xs2, do {
30   If(isElement(n)), then
31     moved_components = find_moved_element();
32   Else if(isAttribute(n)), then
33     moved_components = find_moved_attribute();
34   Else if(isFacet(n)), then
35     moved_components = find_moved_facet();
36   Else if(isModelgroup(n)), then
37     moved_components = find_moved_modelgroup();
38 }
39
40 /* Phase3: find deleted and inserted components */
41 For every node n in xs1 and not in xs2, do {
42   If(isElement(n)), then
43     deleted_elements = find_deleted_element();
44   Else if(isAttribute(n)), then
45     deleted_attributes = find_deleted_attribute();
46   Else if(isComplextype(n)), then
47     deleted_complext = find_deleted_complext();
48   Else if(isSimpletype(n)), then
49     deleted_simplet = find_deleted_simplet();
50   Else if(isModelgroup(n)), then
51     deleted_modelgroup = find_deleted_modelgroup();
52   Else if(isFacet(n)), then
53     deleted_facet = find_deleted_facet();
54 }
55 /* insertion is similar to deletion and omitted
56    for paper limit */
57
58 /* Phase4: find element order changes */
59 /* Step1: populate elem_order_changes table */
60 elem_order_changes = populate_element_order_c();
61 /* Step2.a: update elem_order_changes affected
62    by sibling move */
63 int countMoved = 0;
64 For every tuple e in elem_order_changes, do {
65   For every tuple m in moved_components table, do {
66     If((e.pPathid==m.oldPPathid)&&(e.oldLocOrder>m.oldLocOrder)), then
67       countMoved++;
68   }
69   update_elem_order_changes_by_move();
70   countMoved = 0;
71 }
72 /* Step 2.b: update elem_order_changes affected
73    by sibling delete */
74 int countDeleted = 0;
75 For every tuple e in elem_order_changes, do {
76   For every tuple d in deleted_elements, do {
77     If((e.pPathid==d.pPathid)&&(e.oldLocOrder>d.locOrder)), then
78       countDeleted++;
79   }
80   update_elem_order_changes_by_delete();
81   countDeleted = 0;
82 }
83 /* Step 2.c: update elem_order_changes affected
84    by sibling insert */
85 int countInserted = 0;
86 For every tuple e in elem_order_changes, do {
87   For every tuple i in inserted_elements, do {
88     If((e.pPathid==i.pPathid)&&(e.oldLocOrder>=i.locOrder)), then
89       countInserted++;
90   }
91   update_elem_order_changes_by_insert();
92   countInserted = 0;
93 }
94 /* Step 3: prune element_order_changes table by
95    deleting identical oldLocOrder fields */
96 prune_element_order_changes();
97
98 /* Phase 5: find updated components */
99 For every node n in xs1 and xs2, do {
100   If(isAttribute(n)), then
101     updated_attributes = find_updated_attribute();
102   Else if(isElement(n)), then
103     updated_elements = find_updated_elements();
104   Else if(isComplextype(n)), then
105     updated_complext = find_updated_complext();
106   Else if(isSimplet(n)), then
107     updated_simplet = find_updated_simplet();
108   Else if(isModelgroup(n)), then
109     updated_modelgroups = find_updated_modelgroup();
110 }

D. Phase 4: Detection of Element Order Changes

The intuitive method to determine an element order change is to check the locO column of the elementdecl table defined in Section IV (the local order appears under each local element node in both trees in Fig. 1). If the local order in the first version differs from the one in the second version, it is reported as an order change. However, this method may produce non-optimal deltas that contain unnecessary operations. For example, E2 (node ed6) and E3 (node ed7) in T1 in Fig. 1 are considered moved nodes, and E8 (node ed21) in T2 in the same figure is a newly inserted node. If we do not take these move and insert operations into account, we may detect that E4 (node ed8) has changed its local order from 3 to 2, and the generated delta will include a superfluous order change operation.

To overcome this problem, we adapt the idea from [10]: we simulate the insertions, deletions, and moves that occur under the same parent and only then detect the order changes. Lines 60 to 96 in the algorithm show the three steps that resolve the order change problem. In the first step, the algorithm stores all matched elements that have the same parent node (only elements under a sequence model group are considered) in the elem_order_changes table. The following is the relational schema for this table:

elem_order_changes (pathid, pPathid, name, oldLocO, newLocO)

In the second step, the algorithm sequentially updates the oldLocO column based on the number of move, delete, and insert operations on the sibling elements, using dedicated SQL queries for each kind of operation. Finally, the elem_order_changes table is pruned by deleting the tuples whose oldLocO and newLocO columns hold the same value.
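The three steps can be sketched in a few lines of Python. This is only an illustrative in-memory approximation, not the SQL implementation used by XS-Diff: the function name detect_order_changes, the dictionary layout, and the sample data are assumptions made for the example. It also assumes that moved or deleted siblings in front of an element shift it to the left, and it applies insertions in ascending order so that consecutive insertions are each counted.

```python
def detect_order_changes(matched, moved, deleted, inserted):
    """Return (name, adjusted_old_order, new_order) for genuine order changes."""
    changes = []
    for e in matched:
        parent, old = e["pPathid"], e["oldLocO"]
        # Steps 2.a and 2.b: every sibling that moved away or was deleted
        # in front of e shifts e one position to the left.
        gone = sum(1 for m in moved
                   if m["oldPPathid"] == parent and m["oldLocOrder"] < old)
        gone += sum(1 for d in deleted
                    if d["pPathid"] == parent and d["locOrder"] < old)
        adjusted = old - gone
        # Step 2.c: every sibling inserted at or before e shifts e to the right.
        for pos in sorted(i["locOrder"] for i in inserted if i["pPathid"] == parent):
            if adjusted >= pos:
                adjusted += 1
        # Step 3: prune tuples whose simulated old order equals the new order.
        if adjusted != e["newLocO"]:
            changes.append((e["name"], adjusted, e["newLocO"]))
    return changes


# Hypothetical data: E2 and E3 leave parent 10, E8 is inserted as its first child.
moved = [{"oldPPathid": 10, "oldLocOrder": 1}, {"oldPPathid": 10, "oldLocOrder": 2}]
inserted = [{"pPathid": 10, "locOrder": 1}]

# E4 goes from raw order 3 to 2 only because of its siblings: nothing is reported.
unchanged = detect_order_changes(
    [{"name": "E4", "pPathid": 10, "oldLocO": 3, "newLocO": 2}], moved, [], inserted)

# E4 and E5 additionally swap places: both are reported, E6 is pruned.
swapped = detect_order_changes(
    [{"name": "E4", "pPathid": 10, "oldLocO": 3, "newLocO": 3},
     {"name": "E5", "pPathid": 10, "oldLocO": 4, "newLocO": 2},
     {"name": "E6", "pPathid": 10, "oldLocO": 5, "newLocO": 4}], moved, [], inserted)

print(unchanged)
print(swapped)
```

In the first call the raw comparison would report a change from 3 to 2, whereas the simulated old order already equals the new order, so the tuple is pruned.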

E. Phase 5: Detection of Updated Components

The concept of an updated component in XML Schema is similar to that of updating a component of an XML document in general. In an XML document, the update operation is only applicable to attribute and text nodes because they are considered leaf nodes [2]. For instance, attributes from the old and the new version are considered updated if their values are modified. In the context of XML Schema, we define the update operation based on the node paths located in the path table. To catch the update changes for a schema component, we first check its corresponding temporary table (e.g., temp_match_elementdecl is the temporary matching table for the element declaration component) to see whether any of its properties have changed. Lines 99 to 110 in the algorithm are used in this phase.

An example of component update can be seen in element E5 in both schemas XSD1 and XSD2 in Table I. The value of the attribute maxOccurs has changed from 5 in the first schema version into 10 in the second one.
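As a rough illustration of this property comparison, the sketch below contrasts the properties of one matched component in both versions. The function find_updated_properties and the dictionary representation are assumptions made for the example (XS-Diff itself performs this comparison with SQL over its temporary matching tables), and the minOccurs value is hypothetical; only the maxOccurs change from 5 to 10 is taken from the text.

```python
def find_updated_properties(old_props, new_props):
    """Map each property whose value differs to its (old, new) pair."""
    keys = old_props.keys() | new_props.keys()
    return {k: (old_props.get(k), new_props.get(k))
            for k in sorted(keys)
            if old_props.get(k) != new_props.get(k)}


# Matched element declaration E5 in both versions (minOccurs is hypothetical).
e5_old = {"name": "E5", "minOccurs": "1", "maxOccurs": "5"}
e5_new = {"name": "E5", "minOccurs": "1", "maxOccurs": "10"}

updates = find_updated_properties(e5_old, e5_new)
print(updates)
```

A component would be recorded as updated only when the returned mapping is non-empty.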

VI. EXPERIMENTAL RESULTS

In this section we evaluate the XS-Diff algorithm in terms of correctness and result quality. Correctness, as suggested by [16], can be defined as the property that the differencing algorithm finds a set of operations sufficient to transform the old version of the XML document into the new version. Result quality, on the other hand, examines the minimality and the semantic correctness of the generated delta. To assess both, we compare the algorithm with two available tools, X-Diff and DeltaXML, discussed in the related work section. Although methods for XML schema differencing such as DTD-Diff and DiffDog are available, we do not compare the algorithm with them for the following reasons. In the case of the DTD-Diff algorithm, DTD changes differ substantially from XML Schema changes. In the case of the DiffDog tool, the output file (XSLT) that it produces cannot be compared to our resulting delta. The XSLT file can only be used to update the related XML data files, which is the main purpose of that tool, and it does not contain all the information needed to transform one XSD into another. Furthermore, DiffDog does not detect XML Schema changes automatically; the user has to map differences manually before generating the XSLT transformations.

A. Experimental Settings and Datasets

The XS-Diff algorithm is implemented using SQL queries on the MySQL 5.5.24-log RDBMS. We use the Java programming language with the XSOM (XML Schema Object Model) parser [17] to parse the input schemas and store them in the XS-Rel relational tables. All experiments are based on XSD files modified from the ones available on the W3C¹ and DATYPIC² websites.

B. Correctness Analysis

Correctness, as defined earlier, applies to the XS-Diff algorithm because its inputs (XSDs) are themselves in XML document format. To verify the correctness of our algorithm, we run it on a set of XSDs. On each sample, we vary the number of schema components and the type of changes for each component to ensure that all possible changes are covered. The list of edit operations grouped by the type of the schema component is given in Table IV.

In the first case C1, we concentrate on element/attribute changes including migration, move, deletion, insertion, and update changes. Element order changes, which are addressed in Phase 4 of the algorithm, are also examined. This case contains 19 kinds of edit operations with 28 expected occurrences of those operations. As can be seen in Table IV, XS-Diff detects all changes (ΣE = ΣD) and records them as tuples in the XS-Rel delta tables.

In the next case C2, we demonstrate simple type changes by covering migration, deletion, insertion, and update changes. The related facet changes are partially covered in this case. The results were also promising since XS-Diff detects all targeted changes and reports 24 total occurrences of these changes.

We focus on complex type and model group changes in case C3. As shown in column C3, all 25 expected changes are detected correctly by the algorithm. We notice that some of these operations are affected by others. For example, the model group move operation is clearly affected by complex type migration. Typical XML document change detection tools would not register this kind of association between schema components, and this would clearly affect the overall result quality of the detection system. The result quality is discussed in the next sub-section.

¹ http://www.w3.org/TR/xmlschema-0/
² http://www.datypic.com/books/defxmlschema1/


TABLE IV. PROPOSED EDIT OPERATIONS AND THEIR APPLICATION ON SAMPLES (C: COVERED, E: EXPECTED OCCURRENCES, D: DETECTED OCCURRENCES)

Component          Operation      C1       C2       C3       C4       C5
                                  C E D    C E D    C E D    C E D    C E D
attribute          migration      ✓ 2 2    - 0 0    - 0 0    - 0 0    - 0 0
                   move           ✓ 1 1    - 0 0    ✓ 2 2    - 0 0    ✓ 1 1
                   deletion       ✓ 1 1    - 0 0    - 0 0    - 0 0    - 0 0
                   insertion      ✓ 1 1    ✓ 1 1    - 0 0    - 0 0    - 0 0
                   update         ✓ 3 3    ✓ 1 1    - 0 0    - 0 0    - 0 0
element            migration      ✓ 2 2    - 0 0    - 0 0    - 0 0    - 0 0
                   move           ✓ 2 2    - 0 0    ✓ 3 3    ✓ 2 2    ✓ 1 1
                   deletion       ✓ 1 1    ✓ 2 2    ✓ 2 2    ✓ 1 1    ✓ 1 1
                   insertion      ✓ 2 2    ✓ 1 1    ✓ 2 2    ✓ 3 3    ✓ 1 1
                   order change   ✓ 2 2    - 0 0    - 0 0    ✓ 2 1    ✓ 3 3
                   update         ✓ 4 4    ✓ 3 3    ✓ 4 4    - 0 0    - 0 0
complex type       migration      - 0 0    - 0 0    ✓ 3 3    - 0 0    - 0 0
                   deletion       - 0 0    - 0 0    ✓ 1 1    - 0 0    - 0 0
                   insertion      ✓ 1 1    - 0 0    ✓ 1 1    - 0 0    - 0 0
                   update         - 0 0    - 0 0    ✓ 1 1    - 0 0    - 0 0
simple type        migration      - 0 0    ✓ 4 4    - 0 0    - 0 0    - 0 0
                   deletion       ✓ 1 1    ✓ 1 1    - 0 0    - 0 0    - 0 0
                   insertion      ✓ 1 1    ✓ 1 1    - 0 0    - 0 0    - 0 0
                   update         - 0 0    ✓ 1 1    - 0 0    - 0 0    - 0 0
model group        move           - 0 0    - 0 0    ✓ 1 1    - 0 0    - 0 0
                   deletion       - 0 0    - 0 0    ✓ 1 1    ✓ 1 1    - 0 0
                   insertion      ✓ 1 1    - 0 0    ✓ 1 1    ✓ 2 2    - 0 0
                   update         - 0 0    - 0 0    ✓ 1 1    - 0 0    - 0 0
restriction facet  move           - 0 0    ✓ 7 7    ✓ 2 2    - 0 0    - 0 0
                   deletion       ✓ 1 1    - 0 0    - 0 0    - 0 0    - 0 0
                   insertion      ✓ 2 2    ✓ 2 2    - 0 0    - 0 0    - 0 0
Total (E D)                       28 28    24 24    25 25    11 10    7 7

Case C4 is dedicated to catching nesting and un-nesting changes of model groups and their consequences for other operations such as element order change. Column C4 of the table shows that the algorithm detects a total of ten change occurrences and misses only one of the two element order changes. The element related to the missed change was not at the same level as its sibling elements but was a child of a choice model group defined at that level. To explain this further, consider element E3 in Fig. 2: after the choice model group is modified (node cho and its element child E4 are deleted), E3 is assumed to be in the third position. Our algorithm, as stated in Phase 4, compares only sibling elements under the same sequence model group and therefore does not report this change. Such an order change (moving the order of element E3 from 3 to 4) is nevertheless interesting and should be considered as a future enhancement to the algorithm.

Finally, we consider attribute/element moves in the presence of their global definitions in case C5. As shown in Table IV, the algorithm correctly finds all seven changes expected in this case.

C. Result Quality

In this sub-section we discuss our experimental results against other tools available for XML change detection, namely X-Diff [2] and the commercial DeltaXML system [8]. We compare the XS-Diff algorithm to these methods based on the availability of the source/tool, the similarity of the input format (XML in our case), and the purpose of the resulting output delta (to translate one version of XML into another). We study the result quality of the delta generated by XS-Diff from two perspectives: minimality and semantic correctness of the generated operations. Minimality can be defined by the number of edit operations generated by the algorithm versus the optimal number. Although this notion is important in some applications to save the space required for storing changes, it becomes less significant if the generated changes are semantically incorrect, especially in the context of XML Schemas.

1) Minimality

With the previous definition of minimality in mind, we calculate the result quality as the ratio between the number of change operations found by a tool and the optimal number. To count the optimal operations for each test, we use a set of minimal operations (insert, delete, update, and move) that are supported by the majority of previous works on XML change detection. These operations are also applicable to XML Schema changes and are thus used as a guideline for introducing the optimal operations for the schema. Based on that, we test the result quality of the generated operations on the five cases discussed in the previous sub-section; the ratios are plotted in Fig. 3.

We observe that XS-Diff is capable of detecting optimal or near-optimal delta changes. The ratio of X-Diff ranges from 0.5 to 1. This is because X-Diff supports two composite operations on non-leaf nodes: insert subtree and delete subtree. Although composite operations can clearly minimise the delta by combining atomic operations, they may lead to deltas that are semantically incorrect, as we will see in the following sub-section. XS-Diff does not support such operations since the proposed changes are limited to atomic nodes. The ratio of DeltaXML is greater than the optimal one in cases C1, C2, and C5. This is because DeltaXML does not support the move operation and therefore reports such a change as a sequence of deletion and insertion operations. XS-Diff, in contrast, supports the move operation and introduces the new set of migration operations in order to create an optimal delta.
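The following sketch illustrates how the ratio behaves in the situations described above. The function quality_ratio and all operation counts are hypothetical and chosen only for illustration; they are not measurements from our experiments.

```python
def quality_ratio(detected_ops: int, optimal_ops: int) -> float:
    """Ratio of the operations a tool reports to the optimal number."""
    if optimal_ops <= 0:
        raise ValueError("optimal_ops must be positive")
    return detected_ops / optimal_ops


# Hypothetical change 1: a single element is moved (optimal delta: 1 move).
move_aware = quality_ratio(1, 1)      # tool that reports one move
delete_insert = quality_ratio(2, 1)   # tool that reports delete + insert

# Hypothetical change 2: four atomic changes inside one subtree, reported
# by a tool as one "delete subtree" plus one "insert subtree".
subtree_based = quality_ratio(2, 4)

print(move_aware, delete_insert, subtree_based)
```

A ratio above 1 indicates redundant operations, while a ratio below 1 indicates that atomic changes were folded into composite ones.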

We also notice that DeltaXML, on the other hand, produces a ratio less than 1 in C3 and C4. Case C3 examines complex type components, while model group components are tested in both C3 and C4. From the XML Schema point of view, the most important change for a complex type is the migration change. Since DeltaXML is mainly designed for detecting XML document changes, it does not support this kind of XML Schema change, meaning that it detects the migration change as a pair of sub-tree insertion and deletion.

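[Figure: trees T1 and T2. In T1, the sequence (seq) under complex type CT contains E1, E2, the choice group cho (with children E3 and E4), and E5. In T2, cho and E4 are deleted and E3 appears directly under seq alongside E1, E2, and E5. Legend: moved, deleted, element order change, local order.]

Fig. 2. Effect of model group (cho) delete on element (E3) order change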

Again, this clearly reduces the ratio of DeltaXML to 0.7 in C3 and to less than 0.4 in C4, but it may result in semantically incorrect changes.

2) Semantic Correctness

To appreciate the importance of semantic correctness in this study, consider the following example.

Fig. 3. Result quality

Consider the two element declarations E2 and E3 at nodes ed6 and ed7, respectively, in T1 in Fig. 1. They are joined together under the newly inserted element named E8 at node ed21 in T2. X-Diff does not investigate this part of the tree. Instead, it deletes the whole subtree rooted at node ct4 from T1 and inserts the updated subtree into T2. Similarly, DeltaXML updates the name of E2 in T1 to the new value E8 in T2, deletes the type string of E2, and inserts the complex type subtree rooted at node ct22 into T2. However, the changes detected by these methods are semantically incorrect since they do not reflect the actual schema changes. The more meaningful operations would be to insert E8 at node ed21, insert its complex type ct22, insert the model group mg23, and finally move both E2 and E3 from the old positions ed6 and ed7 to the new positions ed24 and ed25, respectively.

In summary, we found that typical XML document change detection methods may not be suitable for detecting changes to XML Schemas. By measuring the result quality of XS-Diff in comparison to other methods, we show that XS-Diff, in addition to its high level of correctness, seems to be a promising solution for detecting XML Schema changes.

VII. CONCLUSION AND FUTURE WORK

Since XML document change detection tools are not designed for detecting changes to XSDs, they do not take the semantic and structural issues of such input schemas into account. In this paper, we propose the XS-Diff algorithm for detecting meaningful changes to XML Schemas using relational databases. The algorithm can be utilized in the revalidation of XML documents, the incremental maintenance of a relational schema generated by a schema-conscious approach for storing XML data, and the traditional support of XML versioning. The key feature of XS-Diff is that it computes the changes by considering the tree structure of XML Schema.

We verified the correctness of XS-Diff by testing it on different XSDs. We also examined the result quality in comparison to other XML document change detection tools, in particular X-Diff and DeltaXML. We notice that both X-Diff and DeltaXML produce deltas that fall below the optimal mark in some cases, at the expense of semantic correctness. As future work, we plan to improve the algorithm by incorporating more advanced schema components such as named model groups and attribute groups.

REFERENCES

[1] G. Cobena, S. Abiteboul, and A. Marian, "Detecting changes in XML documents", in Proceedings of the 18th International Conference on Data Engineering, 2002. pp. 41-52.

[2] Y. Wang, D.J. DeWitt, and J.-Y. Cai, "X-Diff: an effective change detection algorithm for XML documents", in Proceedings of the 19th International Conference on Data Engineering, 2003. pp. 519-530.

[3] R. Al-Ekram, A. Adma, and O. Baysal, "diffX: an algorithm to detect changes in multi-version XML documents", in Proceedings of the 2005 Conference of the Centre for Advanced Studies on Collaborative Research, 2005. pp. 1-11.

[4] E. Leonardi and S.S. Bhowmick, "Detecting changes on unordered XML documents using relational databases: a schema-conscious approach", in Proceedings of the 14th ACM International Conference on Information and Knowledge Management. 2005. pp. 509-516.

[5] W3C. "XML Schema". [Available from: http://www.w3.org/standards/xml/schema].

[6] E. Leonardi, T.T. Hoai, S.S. Bhowmick, and S. Madria, "DTD-Diff: a change detection algorithm for DTDs", Data and Knowledge Engineering, vol. 61, no. 2, pp. 384-402, 2007.

[7] E. Leonardi and S.S. Bhowmick, "XANADUE: a system for detecting changes to XML data in tree-unaware relational databases", in Proceedings of the 2007 ACM SIGMOD international conference on Management of data. 2007. pp. 1137-1140.

[8] "DeltaXML". [Available from: www.deltaxml.com].

[9] E. Leonardi, S.S. Bhowmick, and S. Madria, "Xandy: detecting changes on large unordered XML documents using relational databases", in Database Systems for Advanced Applications. 2005. pp. 711-723.

[10] E. Leonardi and S.S. Bhowmick, "Xandy: a scalable change detection technique for ordered XML documents using relational databases", Data and Knowledge Engineering, vol. 59, no. 2, pp. 476 - 507, 2006.

[11] E. Leonardi and S.S. Bhowmick, "Oxone: a scalable solution for detecting superior quality deltas on ordered large XML documents", in Proceedings of the 25th International Conference on Conceptual Modeling, 2006. pp. 196-211.

[12] "Comparing XML Schemas with DiffDog". [Available from: http://www.altova.com/technote20.html].

[13] G. Guerrini, M. Mesiti, and D. Rossi, "Impact of XML schema evolution on valid documents", in Proceedings of the 7th Annual ACM International Workshop on Web Information and Data Management, 2005. pp. 39-44.

[14] I. Mlynkova, "Equivalence of XSD constructs and its exploitation in similarity evaluation", in Proceedings of the OTM 2008 Confederated International Conferences Part II, 2008. pp. 1253-1270.

[15] M. Yoshikawa, T. Amagasa, T. Shimura, and S. Uemura, "XRel: a path-based approach to storage and retrieval of XML documents using relational databases", ACM Transaction on Internet Technology, vol. 1, no. 1, pp. 110-141, 2001.

[16] G. Cobena, T. Abdessalem, and Y. Hinnach, "A comparative study for XML change detection", TR. 2002.

[17] "XML Schema Object Model". [Available from: http://xsom.java.net/].

[Fig. 3 plot: y-axis "Ratio" (0 to 1.5), x-axis "XML Schema cases" (C1 to C5); series: XS-Diff, X-Diff, DeltaXML]