structured content-based query answers for improving information quality

24
Structured content-based query answers for improving information quality Loan T. H. Vo & Jinli Cao & Wenny Rahayu Received: 16 October 2013 /Revised: 23 January 2014 /Accepted: 28 February 2014 # Springer Science+Business Media New York 2014 Abstract Extensible markup language (XML) has been widely adopted as a standard to exchange and integrate data over multiple sources. This allows users to explore large datasets through a declarative query interface, such as XQuery and XPath. However, the results of queries posted to such heterogeneous data sources are often inconsistent due to the anomalies arising from structural and semantic inconsistencies. This significantly affects the ability of the system to provide accurate query answers. Most of the prior work on finding consistent query answers (CQAs) lacks the full extensibility to find the CQAs relating to the requirements of data constraints holding conditionally on XML data with inconsistent structures. This paper proposes an approach, called SC2QA, which utilizes XML conditional functional dependency (XCSD) to compute consistent answers for queries posted to arbitrary XML data to improve information quality. An XCSD is a structured and content-based functional dependency holding conditionally on certain objects with diverse structures. The query answer is calculated by qualifying queries with appropriate information derived from the interaction between the query and the XCSDs. Experiments have been conducted on synthetic datasets to demonstrate the effectiveness of SC2QA. Keywords Data quality . Inconsistent data . Consistent query answers . Data repairs . Constraints 1 Introduction The problem of data consistency management in inconsistent data has been widely studied in the database community. Consistent data is formally obtained following two approaches, data repair and consistent query answers [3]. Data repair aims to find the consistent parts in an inconsistent data source which minimally differ from the original one [3, 24, 36]. The inconsistent source is often first transformed, by means of deletions or additions, into a World Wide Web DOI 10.1007/s11280-014-0287-z L. T. H. Vo(*) : J. Cao : W. Rahayu Department of Computer Science Engineering, La Trobe University, Melbourne, Australia e-mail: [email protected] J. Cao e-mail: [email protected] W. Rahayu e-mail: [email protected]

Upload: wenny

Post on 20-Jan-2017

212 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Structured content-based query answers for improving information quality

Structured content-based query answers for improvinginformation quality

Loan T. H. Vo & Jinli Cao & Wenny Rahayu

Received: 16 October 2013 /Revised: 23 January 2014 /Accepted: 28 February 2014# Springer Science+Business Media New York 2014

Abstract Extensible markup language (XML) has been widely adopted as a standard toexchange and integrate data over multiple sources. This allows users to explore large datasetsthrough a declarative query interface, such as XQuery and XPath. However, the results ofqueries posted to such heterogeneous data sources are often inconsistent due to the anomaliesarising from structural and semantic inconsistencies. This significantly affects the ability of thesystem to provide accurate query answers. Most of the prior work on finding consistent queryanswers (CQAs) lacks the full extensibility to find the CQAs relating to the requirements ofdata constraints holding conditionally on XML data with inconsistent structures. This paperproposes an approach, called SC2QA, which utilizes XML conditional functional dependency(XCSD) to compute consistent answers for queries posted to arbitrary XML data to improveinformation quality. An XCSD is a structured and content-based functional dependencyholding conditionally on certain objects with diverse structures. The query answer is calculatedby qualifying queries with appropriate information derived from the interaction between thequery and the XCSDs. Experiments have been conducted on synthetic datasets to demonstratethe effectiveness of SC2QA.

Keywords Dataquality. Inconsistent data .Consistent queryanswers .Data repairs .Constraints

1 Introduction

The problem of data consistency management in inconsistent data has been widely studied inthe database community. Consistent data is formally obtained following two approaches, datarepair and consistent query answers [3]. Data repair aims to find the consistent parts in aninconsistent data source which minimally differ from the original one [3, 24, 36]. Theinconsistent source is often first transformed, by means of deletions or additions, into a

World Wide WebDOI 10.1007/s11280-014-0287-z

L. T. H. Vo (*) : J. Cao :W. RahayuDepartment of Computer Science Engineering, La Trobe University, Melbourne, Australiae-mail: [email protected]

J. Caoe-mail: [email protected]

W. Rahayue-mail: [email protected]

Page 2: Structured content-based query answers for improving information quality

consistent one which is then used for calculating query answers [10]. A consistent queryanswer (CQA) is defined as the common part of answers to the query on all possible repairs ofthe source [3]. Nevertheless, repairing data might also result in side-effects, for example, itmight produce incorrect answers to queries and introduce new inconsistencies. Finding allpossible repairs for inconsistent data to compute a consistent query answer is impossible andimpractical since an infinite number of repairs might exist. Restoring consistency in inconsis-tent data might also be a computationally complex and non-deterministic process. Moreover,one of the main goals of a database system is to compute answers to queries [24]. This meansthat finding consistent query answers is more important than repairing data. Hence, it ispreferable to leave the data inconsistencies to avoid losing information due to the data repairand instead, manage the potential inconsistencies in answers to queries posted to that source,that is, finding the parts of data which are consistent in query answers.

As XML data is often inconsistent with respect to a set of data constraints, data constraintsare often taken into account during the process of calculating query answers [20, 19, 27, 37].The work in [32] studies the problem of computing query answers from inconsistent data withrespect to a given DTD. Other work [19, 35] focuses on finding CQAs from inconsistent datawith respect to a set of XML functional dependencies (XFDs). Flesca et al. [19] introduce anapproach to compute repairs based on replacing the node associated and introducing functions,indicating the reliability of node information. Consistent query answers are then computed byremoving unreliable nodes which are marked by repairs. The work in [35] introduces theconcept of a minimal labeled XML document which is a representation of all possible repairswhich can be used to find consistent query answers. However, such existing work generallylacks the full extensibility of cases to find CQAs from inconsistent data with respect to dataconstraints holding conditionally on XML data with diverse structures. Such limitations arecaused for two main reasons. First, XFDs are defined to represent data constraints globallyenforced to the entire document [2, 38], whereas XML data are often obtained by integratingdata from different sources constrained by local data rules. Thus, they are unable, in somecases, to validate conditional semantics locally expressed in some fragments within an XMLdocument. Second, the existing XFD notions are incapable of validating data consistencies insources with diverse structures. This is because checking for data consistency against an XFDrequires objects to have perfectly identical structures [38], whereas XML data is organizedhierarchically, allowing a certain degree of freedom in the structural definition. Two structuresdescribing the same object are not completely equal [33, 42, 43]. In such cases, utilizing XFDspecifications cannot validate data consistency for calculating consistent query answers. Thispaper addresses the issue of computing consistent answers for queries posted to an inconsistentXML source with respect to a set of XCSDs. XCSDs are data constraints in which functionaldependencies are incorporated with conditions to specify the scope of constraints and with asimilarity threshold to specify similar objects on which the XCSDs holds. In other words,XCSDs are dependencies representing relationships between groups of similar objects underparticular conditions. Thus, XCSDs allow validating data inconsistency caused by semanticand structural inconsistencies. Semantic inconsistencies occur when business rules on the samedata vary across different fragments [36]. Structural inconsistencies arise when the same realworld concept is expressed in different ways, with different choices of elements and structures,that is, the same data is organized differently [33, 43].

Let us use examples to illustrate the influence of inconsistency caused by structural andsemantic inconsistency in XML data to the answers of queries posted to that data. Ourdiscussions are based on a simplified Flight Bookings data tree T, as shown in Figure 1. EachBooking contains information on Carrier, Departure, Arrival, Transit, Fare and Tax. The valuesof elements are recorded under the element names. The data tree T is inconsistent with respect

World Wide Web

Page 3: Structured content-based query answers for improving information quality

to three XCSDs, as shown in Figure 2. We suppose that all XCSDs have a similarity thresholdof 1, which means that representations of the same object must have the same structure.According to constraint 1, any Booking having the Carrier "Tiger Airways" has the Taxidentified by the Fare; however, Booking(2, 1) and Booking(12, 1) contain the same Fare of"600" but the values of Tax are different, which are 150 and 0, respectively. These areinconsistent with respect to constraint 1. Booking(2, 1) and Booking(12,1) are also inconsis-tent in their structures with respect to constraint 1. While Departure(4, 2) and Arrival(5, 2) aredirect children of Booking(2, 1), Departure(15, 3) and Arrival(16, 3) are the grandchildren ofBooking(12, 1). Considering an XPath query Q: /Bookings/Booking posted to T, for aconsistent query answer as in definition [3], the information of Departure, Arrival and Taxof Booking (2, 1) and Booking (12, 1) are excluded.

The same situation occurs to Booking(22, 1) with respect to constraint 2. That is, for anyBooking having the Carrier "Tiger Airways" and Departure of "SIN" only arrives at "MEL"';however, Booking(22, 1) contains Departure of "SIN" but Arrival of "SYD". Considering thequery Q:/Bookings/Booking posted to T, for a consistent query answer, Booking(22, 1) isexcluded. For constraint 3, if there exists a Booking having the Carrier "Air Asia", Departureof "KUL" and Arrival of "SYD", then there must exist a Transit node with a value of "MEL".

(1,0)Bookings

(2,1)Booking

(3, 2)Carrier"Tiger

Airways"

(7, 2)Tax

"150"

(6, 2)Fare

"600"

(5, 2)Arrival"MEL"

(4, 2)Departure

"SIN"

(12,1)Booking

(18, 2)Tax"0"

(17, 2)Fare

"600"

(14, 2)Trip

(16, 3)Arrival"MEL"

(15, 3)Departure

"SIN"

(13, 2)Carrier"Tiger

Airways"

(32,1)Booking

(33, 2)Carrier

"Air Asia"

(37, 2)Tax

"180"

(36, 2)Fare

"700"

(35, 2)Arrival"SYD"

(34, 2)Departure

"KUL"

(22,1)Booking

(23, 2)Carrier"Tiger

Airways"

(28, 2)Tax

"180"

(26, 2)Transit"MEL"

(25, 2)Arrival"SYD"

(24, 2)Departure

"SIN"

(27, 2)Fare

"600"

Figure 1 An inconsistent Flight Bookings data tree with respect to a set of XCSDs

Figure 2 XCSDs on the Flight Bookings data tree

World Wide Web

Page 4: Structured content-based query answers for improving information quality

Booking(32, 1) is missing the information of a Transit node which results in inaccurateanswers to queries relating to Booking(32, 1).

In this paper, we propose an approach called SC2QA to compute consistent answers toqueries posted to arbitrary XML data with respect to a set of XCSDs. Our approach is based onthe semantics of XCSDs to compute the query answers by qualifying the query with appro-priate information derived from the interaction between the query and the XCSDs. XCSDs arenot constraints on databases; they are constraints used to compute answers to queries which arespecified locally with the query at hand. SC2QA is flexible for users to specify a set of XCSDsat the query time. The semantics of XCSDs are integrated into the query planner to computethe query answer. The conditions in XCSDs are used to specify candidate objects qualified tothe query. The similarity threshold in XCSDs is used to indicate how similar objects can beconsidered to be qualified for queries, which allows the retrieval of information from objectswith diverse structures. The inconsistencies of the involved objects are repaired locally,following the semantics of each related XCSD, to obtain consistent information. A queryanswer, called a customized consistent query answer (CCQA), is calculated from these datawhich are considered to be consistent with respect to certain preferred XCSDs. We runexperiments on synthetic data to verify the effectiveness of SC2QA. We prove that thealgorithm is correct in the sense that the result retrieves consistent information.

The main contributions of this paper are as follows: (i) to present a precise definition of acustomized consistent answer, (ii) to propose a novel approach to compute the customizedconsistent query answers, (iii) to analyse the computational complexity of our approach, and(iv) to demonstrate the experiment evaluations. The rest of paper is organized into sevensections. Section 2 presents existing work related to our identified problem. Section 3 presentsthe preliminaries. Section 4 describes our proposed SC2QA to compute the customizedconsistent query answer. The complexity analysis of SC2QA is presented in section 5.Section 6 demonstrates the experiment evaluations. Section 7 concludes the paper.

2 Related work

In this section, we review existing work commonly used to manage consistent data ininconsistent sources. In particular, we consider data repair and consistent query answerapproaches in relational databases and XML data.

Relational database: the management of consistent data in inconsistent data has beenextensively studied in relational databases [1, 7, 8, 12]. Consistent data is formallyobtained following two directions, namely data repair and consistent query answers. Datarepair aims to find consistent parts from inconsistent data which differs from a giveninconsistent database in a minimal way [3]. A database D is consistent with respect to aset of integrity constraints ICs if D satisfies ICs. Otherwise, D is inconsistent with respectto ICs. R is a repair of D if R satisfies IC andΔ(D, R)= (D\R) ∪(R \ D) minimal under setinclusion. Computing data consistency with respect to ICs can be achieved only throughtuple deletions. That is, R is obtained fromD by eliminating tuples. R is considered to be aminimal repair of D if R satisfies ICs and is maximally contained in D, i.e. there does notexist R′ such that R′ satisfies ICs and R⊂R′ ⊂ D. For example, given inconsistent data D={(a, b, c),(a, c, d), (a, c, e), (b, g, h)}, D has two repairs R1= {(a, b, c), (b, g, h)} and R2={(a, c, d), (a, c, e), (b, g, h)}. Δ(D, R1) and Δ(D, R2) are minimal under set inclusion.

Since a large number of repairs might exist for an inconsistent database, most existingwork has only focused on computational methodologies to retrieve consistent answers for

World Wide Web

Page 5: Structured content-based query answers for improving information quality

a query posed on an inconsistent database, regardless of its inconsistency [3, 13]. Aconsistent query answer is defined as the common part of answers to the query on allpossible repairs of the source. Arenas et al. [3] introduce a query rewriting algorithm tocompute consistent query answers based on query rewriting. The basic idea is to enforceconstraints locally, at the level of data which appear in the query to avoid the explicitcomputation of data repairs. In particular, the original query Q posted to D is rewritteninto a new query Q′ such that the answers to Q′ in D are the consistent answers to Q fromD. Q′ is constructed by adding conditions from ICs to Q to enforce the satisfaction ofconstraints in ICs. However, the work in [3] has a very limited applicability since itapplies first order queries and does not include disjunction or quantification, or binaryuniversal integrity constraints.

Chomicki [13] presents a framework for computing consistent query answers based on agraph-theoretic representation of repairs. It considers relational algebra queries withoutprojection and denial constraints. This work handles union queries which can extractindefinite disjunctive information from an inconsistent database. Arenas et al. [4] applylogic programming based on answer sets to retrieve consistent information from aninconsistent database. This work concentrates mainly on logic programs for binary integrityconstraints. The work in [6] studies the decidability status of consistent query answering bycombining instances, ICs and query as input. The notion of consistent query answers arealso extended to the case of aggregate queries [5, 22]. Arenas et al. [5] investigate theproblem of consistent answers of aggregate queries in the presence of functional depen-dencies. The work in [22] provides data violating a set of aggregate constraints. Theseconstraints are defined on numerical attributes (such as Sale Price, Tax, etc.) and are notintrinsically involved in other forms of constraints. Most above cited approaches supposethat tuple insertions and deletions are the basic primitives for repairing inconsistent data.

Recently, database repairs and consistent query answering have been considered in thecontext of conditional functional dependencies (CFDs) [14, 17, 18, 26, 25]. CFD ap-proaches aim at capturing the consistency of data by enforcing bindings of semanticallyrelated values. Fan et al. [17] develop techniques to detect violations of CFDs in SQL by asingle query. Cong et al. [14] investigate repairing inconsistent databases based on CFDsdemonstrating that CFDs are more effective than traditional FDs in identifying andrepairing real-life inconsistent data. The work in [18] studies the interaction between recordmatching and data repairing. A uniform framework which unifies repairing and matchingoperations has been proposed to repair a database based on CFDs and master data. Thesealgorithms find deterministic fixes and reliable fixes based on confidence and entropyanalysis, respectively. The work in [26] focuses on attribute-based repairs which areobtained by modifying attribute values in a minimal way. Query answering is cleaned byidentifying impossible portions of a query answer. Kolahi et al. [25] propose a constant-factor approximation algorithm which can be used to find repairs with respect to condi-tional functional dependencies. However, due to the different structure of data and thedifferent nature of constraints, existing approaches to traditional structured querying inrelational databases cannot easily be applied to XML data [11]. This is due to two mainreasons. Firstly, different notions of equality are used for constraints. Whereas relationalequality simply is the equality of values, the equality of two objects in XML has to becompared according to both structure and data [47]. Secondly, CFD algorithms cannotscale well when the XML data structure is complex. This is because applying thesealgorithms to XML data requires an XML document be transformed into a single relationaltable. When the structure of schema is complex, the number of attributes in the transformedrelation is large. The number of tuples also increases multiplicatively when the XML

World Wide Web

Page 6: Structured content-based query answers for improving information quality

document contains data with complex data types (e.g. maxOccurs in XML Schema). Forexample, if each Booking contains two Trips (refer to Figure 2), then the number of tuplesin the transformed relation would double. Therefore, generalizing relational approaches towork on XML data is nontrivial.XML data: the notions of repair and consistent query answers have been generalized tothe context of XML data. The work in [21, 19, 34] finds consistent data with respect to aset of XML functional dependencies. The data repair in [19] is found based on replacingnode values and introducing functions, indicating the reliability of node information. Tanet al. [34] study the problem of data repair by making the smallest modifications in termsof repair cost. Flesca et al. [20] study the existence of repairs with respect to a set ofintegrity constraints and a DTD. The existence of repairs using minimal sets of updateoperations is investigated. The work in [29] considers the problem of data repair withrespect to a set of functional dependencies in the merged format of XML data. This workextends the XFDs to be satisfied by comparing sub-trees in a specified context of the data.Yakout et al. [44] present an approach using machine learning as a guide for repairingdata. However, such approaches may correct data improperly, and worse, might result inother inconsistencies when repairing the data. Moreover, the concept of data repair isoften used as an auxiliary notion to define consistent query answers and existingapproaches do not give any algorithms to compute data consistency.

The problem of finding consistent query answers can be considered as a principledway to manage data inconsistency [3, 32, 37, 46]. The work in [32] studies the problem ofcomputing query answers with respect to a given DTD. This work presents a validity-sensitive method of querying XML data, which extracts more information from invaliddata sources than the standard query evaluation. Tan et al. [37] propose an approach tocompute consistent query answers from virtually integrated data with respect to a set ofdata constraints. However, they do not take into account constraints which hold condi-tionally on similar objects, as in our work. Query rewriting techniques have been widelyused as powerful methods to calculate query answers [16, 15, 28, 46]. The work in [16,15, 28] introduces techniques for query rewriting in the representation of constraints. Yuet al. [46] propose a technique incorporating target constraints into query rewriting tocalculate query answers through target schemas. However, we found that such work isinapplicable for the scenarios which we consider. To the best of our knowledge, none ofthe existing work on finding query answers properly combines both structural and datasemantics to calculate query answers, as in our approach.

3 Preliminaries

In this section, we present some preliminary concepts, including: (i) XPath notations, (ii)definition of data tree, (iii) definition of node-value equality and definition of XML conditionalstructural functional dependency, (iv) a theorem about the superior of XCSDs to XCFDs whichis necessary to indicate that our approach SC2QA only needs to take into account theconsistency of data with respect to XCSDs, and (v) a revision of the structural similaritymethods which we used in our approach to measure the similarity of sub-trees.

3.1 XPath

We use XPath expression [41] to form a relative path; “.” (self): select the context node. “.//”:select the descendants of the context node, "[]": means qualifier and "*": means wildcards. For

World Wide Web

Page 7: Structured content-based query answers for improving information quality

example, .//Carrier: select Carrier descendants of the context node Booking; .//Trip/Departure:select all Departure elements which are children of Trip. An XPath is simple if it is free of ".//"and "*". Any XPath Q can be replaced by a set of simple XPath {Q1, Q2,…,Qn}.

Definition 1 (XML data tree)An XML data tree is defined as T= (V, E, F, root), where

& V is a finite set of nodes in T, each node v ∈V consists of a label l and an id that uniquelyidentify v in T. The id assigned to each node in the XML data tree, as shown in Figure 1, isin a pre-order traversal. Each id is a pair (order, depth), where order is an increasing integer(e.g. 1, 2, 3…) used as a key to identify a node in the tree; depth label is the number ofedges traversing from the root to that node in the tree, e.g. 1 assigning for /Bookings/Booking. The depth of the root is 0.

& E ⊆ V × V is the set of edges.& F is a set of value assignments, each f (v)= s ∈F is to assign a string s to each node v ∈V. If v

is a simple node or an attribute node, then s is the content of node v, otherwise if v hasmultiple descendant nodes, then s is a concatenation of all descendants’ content.

& root is a distinguished node called the root of the data tree.

An XML data tree defined as above possesses the following properties:For any nodes vi, vj ∈ V:

& If there exists an edge(vi, vj) ∈E, then vi is the parent node of vj, denoted as parent(vj), andvj is a child node of vi, denoted as child(vi).

& If there exists a set of nodes {vk1,..,vkn} such that vi = parent(vk1),..,vkn = parent(vj), then viis called an ancestor of vj, denoted as ancestor(vj) and vj is called a descendant of vi,denoted as descendant(vi).

& If vi and vj have the same parent, then vi and vj are called sibling nodes.& Given a path p= {v1v2…vn}, a path expression is denoted as path(p)= /l1/../ln, where lk is

the label of node vk for all k ∈[1,.., n].& Let v= (l, id, c) be a node of data tree T, where c is the content of v. If there exists a path p′

extending a path p by adding content c into the path expression of p such that p′= /li/../lj /c,then p′ is called a text path.

& {v[X]} is a set of nodes under the sub-tree rooted at v. If {v[X]} contains only one node, itis simply written as v[X].

In this paper, we investigate customized consistent query answers with respect to a set ofXCSDs. An XCSD can be used to express either XCFDs or XFDs. Therefore, we first reviewthe definition XCSDs [40] and point out the relationships amongst XFDs, XCFDs and XCSDs.

Definition 2 (XML Conditional Structural Functional Dependency)Given an XML data tree T= (V, E, F, root), an XML Conditional Structural Dependency

(XCSD) holding on T is defined as:

8 ¼ Pl : α½ � C½ �; X→Yð Þ;where

& α is a similarity threshold indicating that each path pi in 8 can be replaced by a similarpath pj, with the similarity between pi and pj being greater than or equal to α, α ∈(0, 1].

World Wide Web

Page 8: Structured content-based query answers for improving information quality

The greater value of α, the more similarity between the replaced path pj and the originalpath pi in 8 is required. The default value of α is 1, implying that the replaced paths haveto be exactly equivalent to the original path in 8 .

& C is a condition which is restrictive for the functional dependency Pl: X→ Y holding on asubset of T. The condition C has the form: C ¼ ex1θex2θ…θexn; where exi is a Booleanexpression associated to particular elements. “θ” is a logical operator either AND (^) or OR(∨). C is optional; if C is empty then 8 holds for the whole document.

& X and Yare groups of nodes under sub-trees rooted at node-label l and nodes of each grouphave similar root paths. X and Y are exclusive.

& X→ Y indicates a relationship between nodes in X and Y, such that any two sub-treessharing the same values for X also share the same values for Y, that is, the values of nodesin X uniquely identify the value of node in Y.

Theorem 1 An XCSD is superior to XML functional dependency (XFD) [45] and XMLconditional functional dependency (XCFD) [39].

Proof For an XCSD 8 ¼ Pl : α½ � C½ �; X→Yð Þ , suppose that there exist {X1, X2,…Xn} similarto X, and {Y1, Y2,…Ym} is similar to Y with respect to the value of the similarity threshold α.Then 8 can be expressed as a set of XCFDs Pl : C½ �; X i→Y j

� �, where i=1..n and j=1..m.An

XCFD can also be expressed as an XCSD with a similarity threshold of 1. Similarly, 8 can beexpressed as a set of XFDs in the form of Pl: (Xg → Yh), where g=1..n and h=1..m and anXFD Pl: X→ Y is a special case of XCSD with a similarity threshold of 1 and the condition Cis empty. □

XCSDs can be used to express the semantics of both XCFDs and XFDs. Therefore, in thispaper we only discuss finding CQAs posted to an inconsistent XML data with respect to a setof XCSDs. An XCSD might hold on an object represented by variable structures. In suchcases, checking for similar structures is necessary to validate the conformation of the object tothat XCSD. The similarity between sub-trees, is in general, independent of the particulartechnique adopted to measure the semantics between two XML elements. Any techniquewhose aim is to assess whether two elements refer to the same object can be used. In the nextsection, we review the similarity method [40] which we used in our approach to measure thestructural similarity between two sub-trees.

3.2 Structural similarity measurement

Our method follows the idea of structure-only XML similarity [9, 23, 31]. That is, thesimilarity between sub-trees is evaluated, based on their structural properties, and data valuesare disregarded. We consider that each sub-tree is a set of paths, and each path starts from theroot node and ends at the leaf nodes of the sub-tree. Subsequently, the similarity between twosub-trees is evaluated, based on the similarity of two corresponding sets of paths. The moresimilar paths the two sub-trees have, the more similar the two sub-trees are.

3.2.1 Sub-tree similarity

Given two sub-trees R and R′ rooted at nodes having the same node-label l in T. Rand R′ contain m and n paths respectively: R =(p1,..,pm) and R′ = (q1,..,qn), where

World Wide Web

Page 9: Structured content-based query answers for improving information quality

each path starts from the root node of the sub-tree. The similarity between two sub-trees R and R′, denoted by

dT R;R0ð Þ; is computed as : dT R;R0ð Þ ¼

Xi

wi:w0i

ffiffiffiffiffiffiffiffiffiffiffiffiffiXi

w2i

r:

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiXi

w0i2

s ;

where wi and wi′ are the path similarities’ weights of pi and qi in the correspondingsub-trees R and R′, and the value of dT(R, R′) [0, 1] represents that the similarity oftwo sub-trees changes from dissimilar to similar status. By defining dP(pi, qj) as thepath similarity of two paths pi and qj, the weight wi of path pi in R to R′ is calculatedas the maximum of all dP(pi ,qj), where 1≤j ≤n. The term of path similarity dP(pi ,qj)is described in the next subsection.

List 1 represents the SubTree_Similarity algorithm to calculate the similarity between twosub-trees. The algorithm first calculates the weight wi of each path pi in R to R′ for all 1≤i ≤m(line 2–3). Then the weight w′j of each path qj in R′ to R is calculated for all 1≤j ≤n (line 5–6).This means two sets of weights (w1,..,wm ) and (w1,..,wn) are computed. If the cardinalities ofthe two sets are not equal, then the weights of 0 are added to the smaller set to ensure the twosets have the same cardinality (7–11). The similarity of R and R′ is calculated based on thesetwo sets of weights using a Cosine Similarity formula (13–15). In the following subsection, wedescribe how to measure the similarity between paths.

3.2.2 Path similarity

Path Similarity is used to measure the similarity of two paths, where each path is considered aset of nodes. Consequently, the similarity of two paths is evaluated based on the informationfrom two sets of nodes, which includes common-nodes, gap and length difference. Thecommon-nodes refer to a set of nodes shared by two paths. The number of common-nodesindicates the level of relevance between two paths. The gap denotes that pairs of adjacentnodes in one path appear in the other path in a relative order but there exist a number ofintermediate nodes between two nodes of each pair. The numbers of gaps and the lengths ofgaps have a significant impact on the similarity between two paths. The longer gap length orthe larger number of gaps will result in less similarity between two paths. Finally, the lengthdifference indicates the difference in the number of nodes in two paths, which in turn, indicatesthe level of dissimilarity between two paths.

We also take into account the node’s positions in measuring the similarity between paths.Nodes located at different positions in a path have different influence-scopes to that path. Wesuppose that a node in a higher level is more important in terms of semantic meaning andhence, it is assigned more weight than a node in a lower level. The weight of a node v havingthe depth of d is calculated as μ(v)= (λ)d, where λ is a coefficient factor and 0<λ<=1. Thevalue of λ depends on the length of paths.

List 2 represents the Path_Similarity algorithm to calculate the similarity of two paths p=(v1,..,vm) and q= (w1,..,wn), where v1 and w1 have the same node-label l, and m and n are thenumbers of nodes in p and q, respectively. The similarity of two paths p and q, dP(p, q), iscalculated from three metrics, common-node weight, average-gap weight and length differencereflecting the above factors common-nodes, gap and length difference (line 1). The common-node weight, fc, is calculated as the weight of nodes having the same node-labels from twopaths. The set of nodes having the same node-label between p and q, called common node-

World Wide Web

Page 10: Structured content-based query answers for improving information quality

labels, is the intersection of two node-label sets of p and q (line 3). Assuming that there exist klabels in common, the common-node weight can be calculated as:

f c p; qð Þ ¼

Xi¼1

k

μ við Þ:μ wið ÞffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiXi¼1

k

μ við Þ2s

:

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiXi¼1

k

μ wið Þ2s ;

where μ(vi) and μ(wi) are the weights of two nodes vi and wi in p and q, respectively. vi and wi

have the same node-label. The coefficient factor λ= min(|p|,|q|)/max(|p|,|q|) (line 3).The average-gap weight, fa, is calculated as the average weight of gaps in two paths. The

calculation of fa comprises three steps. First, the algorithm finds the longest gap and thenumber of gaps between two paths (line 7–9). Second, the gap’s weights from one path against

List 1 The algorithm for SubTree_Similarity

World Wide Web

Page 11: Structured content-based query answers for improving information quality

List 2 The algorithm for Path_Similarity

World Wide Web

Page 12: Structured content-based query answers for improving information quality

the other path and vice versa are calculated. Each gap’s weight is calculated based on the totalweights of nodes and the number of nodes in the longest gap in that path. The gap’s weight ofp against q is calculated by:

gw p; qð Þ ¼

Xi¼1

g

μ við Þ

gj j ;

where g is the length of the longest gap of p and q, and the coefficient factor λ=|g|/|q|. Thesame process is applied to calculate the gap’s weight of q against p (line 10). Finally, theaverage of the gaps is calculated based on two calculated gap’s weights and the numbers ofgaps in two paths (line 11). The length difference, fl, is the difference in the number of nodebetween two paths (line 21)

For example, given two paths p= "Booking/Departure", q= "Booking/Trip/Departure", wecalculate the similarity score of p and q as follows:

& Calculating the common node weight

lp= {Booking, Departure}lq= {Booking, Trip, Departure}comLab(p, q) = lp∩lq= {Booking, Departure}

The depths of "Booking" and "Departure" in p and q are {1, 2} and {1, 3}The weights in p are{2/3, (2/3)2} and in q are {2/3, (2/3)3}.

f c p; qð Þ ¼ 2=3:2=3þ 2=3ð Þ2: 2=3ð Þ3� �

= 2=3ð Þ2 þ 2=3ð Þ4� �1=2

: 2=3ð Þ2 þ 2=3ð Þ6� �1=2

� �¼ 0:99

& Calculating the average gap weightCalculating gw(p, q):

noG1 = 1; gap1max= "Trip";| gap1max | =1;

Assuming that the depth("Trip") is 2

gw(p, q)= 0.11Calculating gw(q, p)

noG2 =2; gap2max="Booking/Departure";| gap2max | =2;

Assume that depth("Booking")=1 and depth("Departure")= 2.

gw(q, p)= 1The average gap weight fa(q, p) = (1/9 * 1+ 1* 2) /3 = 0.7

& Calculating the length difference: fl(p, q)= 1/3 =0.33& The similarity score of p and q: dP(p, q)= 0.99-(0.7+0.33)/3= 0.64

If the similarity score is larger than a given similarity threshold, then we conclude that twopaths are similar; otherwise, the two paths are not similar. The similarity score of 1 indicatesthat the two paths are the same.

World Wide Web

Page 13: Structured content-based query answers for improving information quality

In the next section, we propose a formal approach to calculate the consistent query answer.We first present the concepts used in our approach, including a definition of consistent data anda definition of data repair. We also mention repairing principles applied to repair inconsistentdata which includes repair cost and data repair values. Then, we present the details of ourapproach.

4 SC2QA algorithm: structure content-aware approach for customized consistent queryanswers

In this paper, we only focus on calculating consistent query answers for queries posted to anarbitrary XML data with respect to a set of XCSDs. Consistent data is defined as follows:

Definition 3 (Consistent data)Given a data tree T= (V, E, F, root) and a set of XCSDs ∑, T is consistent with ∑ denoted

as T|= ∑, if T satisfies every predefined XCSD in ∑, otherwise T is inconsistent denoted asT |≠ ∑.

A data tree T is said to satisfy an XCSD 8 ¼ Pl : α½ � C½ �; X→Yð Þ denoted as T|= 8 if anytwo sub-trees R and R′ rooted at vi and vj in T having dt(R, R′) ≥ α and if {vi[X]}=v {vj[X]} then{vi[Y]}=v {vj[Y]} under the condition C , where vi and vj have the same root node-label l.

The concept of data repair is used as an auxiliary to describe the definition of a customizedconsistent query answer in inconsistent data. We define the notion of data repair and the detailof data value repair principles and the repair cost model in the next section.

4.1 Data repair

Repair R is found based on the semantics of candidate XCSDs in ∑which relates to queryQ tomodify the inconsistent data in the original data tree T such that R is consistent with respect to∑ and R isminimal different with original data T. Minimal here means repair Rmust be the onethat takes as few repair operations as possible to preserve the information from the originaldata.

Definition 4 (Data repair)A repair of Twith respect to ∑ is a data tree node R such that: (i) R conforms to ∑; and (ii)

there does not exist any other repair R′ of T such that R′ conforms to ∑, cost(T, R′) < cost(T, R),where cost(T, R) is the repair cost used to transform T to R.

In our work, repair R is found by locally repairing every inconsistent data node. A noderepair is defined as follows:

Definition 5 (Node repair)A repair of a node v with respect to 8 i is a node v′ such that: (i) v′ conforms to 8 i; (ii) there

does not exist any other repair v″ of v such that v″ conforms to 8 i and (iii) cost(v, v″) < cost(v, v′),where cost(v, v′) is the repair cost used to transform v to v′.

Data value repair principles: The computation of the repair data is based on the semanticsof XCSDs to modify the values of nodes or add missing data. We do not invent newvalues as in [20, 19]. Suppose that a node v in T violates an XCSD8 ¼ Pl : α½ � C½ �; X→Yð Þ , the value of v is repaired based on a set of values occurring

World Wide Web

Page 14: Structured content-based query answers for improving information quality

in the data tree T or it is a value deduced based on the semantics of XCSDs. The details ofdata repair computation are as follows:

i) Values of node modification:

& If v relates to a constant expression in 8 , the value of v is updated by the value of thecorresponding constant such that v satisfies 8 .

& If the violation of node v relates to a variable expression in X∪Y, this means v violates8 with another node v′, then the value of the violation node is modified with respectto the value v′. That is, (i) if the value of v is null and v′ is constant, the value of v′ isset to that constant, (ii) if v and v′ are not null and they contain different values, suchviolations have to be resolved by repairing other nodes relating to another expressionsin 8 .

ii) Node insertions: suppose that a sub-tree Ti is inconsistent with respect to an XCSD8 ¼ Pl : α½ � C½ �; X→Yð Þ due to missing a node v, v is inserted into that sub-treebased on the semantics of XCSD such that T is consistent with respect to theconsidered XCSD.

Example 1 Booking(22, 1) (in Figure 1) violates constraint 8 2 = PBooking: [1][./Carrier ="Tiger Airways" ], (./Departure="SIN" →./Arrival= "MEL") (in Figure 2).This is because Booking(22, 1) contains Carrier of "Tiger Airways", Departure of"SIN" but Arrival of "SYD". The Arrival = "SYD" causes a violation to the RHS of8 2 which is a constant expression. Therefore, the value of Arrival should be changed to“SYD” as that in 8 2. Considering Booking(12,1), Booking (2,1) in Figure 1 and 8 1 =PBooking: [1] [./Carrier ="Tiger Airways" ], (./Fare →./Tax) in Figure 2. Booking(12, 1)violates constraint 1 with Booking(2, 1). According to constraint 1: any Bookinghaving the Carrier "Tiger Airways" has Tax identified by the Fare, while both Book-ing(12, 1) and Booking (2, 1) have the same Carrier of "Tiger Airway" and the Fare of"600", the Tax "0" and "150". Therefore, we modify the value of Tax in Booking(12, 1)to "150". Booking(32, 1) in Figure 1 violates constraint 3 in Figure 2. According toconstraint 3: for any Booking the Carrier "Air Asia", Departure of "KUL" and Arrivalof "SYD", there must exist a Transit of "MEL". Booking(32, 1) does not satisfyconstraint 3 since the information of Transit is missing. Thus, Transit of "MEL" isinserted into Booking(32, 1).

Repair Cost: there are several different ways to resolve a violation. In our approach, we usea cost model to give priority to the repair which is considered to be qualified to a query. Theresult with a lower transformation cost is considered to be closer to the original data and ispreferred over the ones with higher costs. The cost used to repair an inconsistent node isweighted by the total number of operations applied to correct the node so that it satisfies allrelevant XCSDs. The cost used to repair an inconsistent data tree is weighted by the total costapplied to all violation nodes such that the data tree satisfies all relevant XCSDs.

Observe that the deletions which may lose information, node modification and nodeinsertion in general can preserve more information from the original data. Thus, our approachonly considers modification and insertion operations. We assume that each operation isassigned a weight in a range between 0 and 1. A cost of the modification operation is thecost of updating a node value. The cost of the insertion operation is the cost to insert a singlenode which is a descendant of a considered sub-tree. We prefer node updating to node insertionsince inserting a node is more complicated than updating a node value. In our approach, we

World Wide Web

Page 15: Structured content-based query answers for improving information quality

assume that the cost of node insertion is three times the modification cost. The repair cost of aT to a repair R is defined as

cost T ;Rð Þ ¼X

i¼ 1::n½ �cost vi; v0i

� �;

where vi is original node and vi′ is the transformed node.In addition to choosing a repair with the lowest cost, we also use cost threshold, denoted as

γ, to ignore the repairs which are too different to the original. Users choose their preferredrepair cost threshold when specifying queries. A cost threshold for each considered node isindicated by the percentage of the repair cost threshold with the number of candidate nodesand the number of XCSDs. Such an evaluation mechanism allows retrieving the desiredinformation which is closer to the original data source

γnode ¼γ

Cv : Cϕ;

where γ is the repair cost threshold of the total data, Cv is the number of candidate nodesrelating to the query and C8 is the number of candidate XCSDs. In this paper, we set the valueof γ to 1.

Example 2 Suppose that Booking (42, 1) in Figure 3 is also included in the data tree T inFigure 2. Booking (42, 1) is inconsistent with respect to constraints 1, 2 & 3 in Figure 2. Atleast two alternative ways exist to correct Booking (42, 1) with different results. Assume thatthe total number of candidate nodes is 4, the modification cost is 0.01 and the insertion cost is0.03. The cost repair threshold for each node is γnode= 1/(4 * 3) ≈ 0.08.

c) corrections follows constraints 3

a) Inconsistent data

b) corrections follows constraints 1&2

(42,1)Booking

(43, 2)Carrier"Tiger

Airway"

(47, 2)Tax

"150"

(45, 2)Arrival"MEL"

(44, 2)Departure

"SIN"

(46, 2)Fare

"600"

(42,1)Booking

(43, 2)Carrier"Air

Asia"

(48, 2)Tax

"150"

(46, 2)Transit"MEL"

(45, 2)Arrival"SYD"

(44, 2)Departure

"KUL"

(47, 2)Fare

"600"

(42 ,1)Booking

(43, 2)Carrier

""

(47, 2)Tax

"150"

(45, 2)Arrival""

(44, 2 )Departure

""

(46, 2)Fare

"600"

Figure 3 Repairing inconsistent data

World Wide Web

Page 16: Structured content-based query answers for improving information quality

The first repair v1 is followed by constraints 1&2. According to constraint 1, 'Any Bookinghaving the Carrier "Tiger Airways", the Tax is identified by the Fare'. The Tax and the Fare ofBooking(42, 1) are the same with the Tax and the Fare of Booking(2, 1). Thus, the value of theCarrier of Booking(42, 1) is updated by "Tiger Airways" with repair cost of 0.01. According toconstraint 2 'Any Booking having Carrier of "Tiger Airways" and Departure of "SIN", thevalue of Arrival is "MEL" ', the Carrier of Booking(42, 1) is "Tiger Airways". Hence, the valueof Departure of Booking(42, 1) is updated by "SIN" with repair cost of 0.01 and Arrival ismodified by "MEL" with repair cost of 0.01.

The total cost repair: cost(Booking(42, 1), v1)=(0.01+0.01+0.01)=0.03<0.08=γnodeThe second repair v2 follows constraints 3. According to constraint 3: for 'Any Booking

having Carrier of "Air Asia", Departure of "KUL" and Arrival of "SYD", there must exist aTransit of "MEL"'. Following constraint 3, Carrier of Booking(42, 1) is updated by "Air Asia"with a cost of 0.01, Departure is updated by "KUL" with repair cost of 0.01, Arrival is updatedby "SYD" with repair cost of 0.01, and insert a node: Transit of "MEL" with a cost of 0.03.

The total repair cost cost(Booking (42, 1), v2 )=(0.01+0.01+0.01+0.03)=0.06<0.08=γnode . The repair costs of the two cases are 0.03 and 0.06, respectively which satisfythe repair cost threshold for a node (i.e. 0.08). However, the former repair is preferred over thelatter. Indeed, v1 is closer to original Booking (42, 1) than v2.

Definition 6 (Customized consistent query answer- CCQA)Given a data tree T, a set of XCSDs ∑ over T and a query Q , the customized consistent

query answer of the query Q on Twith respect to ∑, denoted as Qc(T, ∑) = ∪i=1..k Qc(Rvi, ∑),whereQc(Rvi,∑) is a consistent query answer of Q on the sub-tree Rvi, Rvi is a repair of sub-treerooted at vi w.r.t ∑, k is the number of related nodes to Q and vi is a related nodes to the queryQ.

A formal approach to calculate the customized consistent query answer will be presented inthe next section.

4.2 Calculating customized consistent query answers

Given an inconsistent XML data tree T and a set of XCSDs∑, we aim to compute a customizedconsistent answer for queryQ posted to T. Our approach is based on the semantics of XCSDs tofind consistent data, preserving the original data source. The original inconsistent data isevaluated at each constraint. The answer is calculated by qualifying the query with appropriateinformation derived from the interaction between the query and the XCSDs. The result of thequery is a data tree which is constructed by appropriate projections on the data qualified to Q.The similarity parameters in XCSDs may result in a large number of candidate objects qualifiedto the query, causing difficulties for computing CCQAs. Therefore, we restrict all XCSDshaving the same similarity threshold to avoid dealing with a various number of candidateobjects. Our algorithm traverses the data source following a breadth-first manner as in [23].

In particular, SC2QA consists of four processes (List 3). First, the function selDC isperformed to select all candidates XCSDs (8 i,…,8 k) relating to the posted query Q. Thisprocess is based on the comparison between the context paths of XCSDs and paths in Q withrespect to the similarity threshold of XCSDs. Second, the function selCanNode is called toselect all candidate nodes (v1, v2,…, vn) relating to query Q, based on the semantics of thecandidate XCSDs, where vi is an ancestor of target nodes of XCSDs. Third, valXCSD isperformed to validate XCSDs, vi is the domain for validating candidate XCSDs. The process ofvalidating XCSDs is performed locally at each candidate node to retrieve consistent data. Foreach violation node vk, the list of violation XCSDs, repair values and repair costs are calculated

World Wide Web

Page 17: Structured content-based query answers for improving information quality

and stored in vk.list(8 kj, rkj, ckj), and node vk is marked as a violation. Finally, the result data isgenerated in a top-down fashion by combining all consistent data and repaired data. A result isconsidered valid only if the repair cost of inconsistent data is under a certain cost threshold. Thatis, if the repair cost of data is above a given cost threshold, then the corresponding repair is notconsidered to be qualified for query Q and query Q is not actually executed. Thus, for eachviolation node vk, we choose the repair (8 kj, rkj, ckj) with the lowest overall repair cost.Commonly, we choose the node repair cost for vk with the lowest cost.

Example 3 Finding CQAs for a XPath query Q: /Bookings/Booking[Carrier='Tiger Airway'],posted to Bookings data tree in Figure 1 with respect to three data constraints 8 4, 8 5 and 8 6

φ4 ¼ Pl : 0:6½ � :=Carrier ¼00 Tiger Airways00½ � :=Fare→:=Taxð Þφ5 ¼ Pl : 0:6½ � :=Carrier ¼00 Tiger Airways00½ �; :=Departure ¼00 SIN00→:=Arrival ¼00 MEL00ð Þφ6 ¼ Pl : 0:6½ � :=Carrier ¼00 Air Asia00½ �; :=Departure ¼00 KUL00; :=Arrival ¼00 SYD00ð Þ→ :=Transit ¼00 MEL00ð Þ

We follow the process described in section 4.2 to find a CCQA for Q. First, we select allcandidate XCSDs relating to query Q which include 8 4 and 8 5. Second, we select allcandidate nodes relating to query Q which include CN= {Booking(2, 1), Booking(12, 1) ,Booking(22, 1 )}. Third, we validate XCSDs: for each candidate node in CN, we check for thesatisfaction of candidate XCSDs and find consistent data for that node. We find Booking(2, 1)is similar to Booking(12, 1). This is because following the sub-trees similarity algorithmdescribed in List 1 to calculate the similarity between sub-trees T1 rooted at Booking(12,1) andT2 at Booking(2,1), suppose we have dT(T1, T2)= 0.64>0.6. Booking (2, 1) satisfies constraints8 4 and 8 5. Booking (12, 1) violates 8 4 with Booking(2, 1) since the Tax of the two nodesdoes not satisfy the condition that Tax is identified by the Fare. They contain the same Fare of"600" but the values of Tax are different. While the Tax of Booking(2, 1) is a constant of"150", the Tax of Booking(12, 1) is constant of "0". Thus, we replace the inconsistency of theTax by updating the Tax in Booking(12, 1) with a value of "150" in the answer. The Tax(18, 2)node is marked. Booking(22, 1) violates 8 4 and 8 5. According to 8 4, the Fare identifies theTax, the Fare of Booking(22, 1) is the same at that of Booking(2, 1) but the Tax is different.Thus, the Tax of Booking(22, 1) is corrected based on the semantics of 8 4. That is, Tax is

List 3 The SC2QAs algorithm

World Wide Web

Page 18: Structured content-based query answers for improving information quality

"150" and Tax(28, 2) is marked. According to 8 5, if Departure of "SIN", then Arrival must be"MEL" but the Arrival of Booking (22, 1) is "SYD" which is replaced by "MEL" in the answerbased on 8 5. The result of query Q will include Bookings {Booking (2, 1), Booking (12, 1),Booking (22, 1)} with modified values at marked nodes.

List 4 Utility functions of SC2QA

World Wide Web

Page 19: Structured content-based query answers for improving information quality

5 Complexity analysis and completeness

Complexity analysis: The complexity of the SC2QA algorithm mostly depends on the size ofthe XML data source, which is determined by the number of elements, the number of XCSDsand the complexity of query Q on T. The SC2QA algorithm first performs the selDC to find aset of candidate XCSDs related to the query Q. The complexity of selDC depends on thecomplexity of query Q on T and the size of context paths of XCSDs. In the worst case, weassume that all n nodes in T are satisfied by Q. Let |∑| be the total number of XCSDs and m bethe maximum size of the context paths of XCSDs 8 . Thus, the selDC makes nm|∑| randomaccesses to the dataset. Then, the function selCanNode is called to select candidate nodesrelated to the query Q with respect to the conditions of the XCSDs. The selCanNode dependson the number of XCSDs and the size of the XML data source. The worst case occurs when alln nodes in T related to Q and every XCSD in ∑ having a condition, without considering thenumber of expressions in the conditions, the function selCanNode makes n|∑| randomaccesses to the dataset.

Third, the valXCSD is called to validate candidate XCSDs against every candidatenode. In the worse case, we assume that all |∑| XCSDs are in the set of candidateXCSDs, where the set of candidate nodes includes all n nodes in T and each nodeviolates all |∑| candidate XCSDs. The valXCSD makes n|∑| random accesses to thedataset. Finally, the SC2QA traverses the data tree T in a top-down manner to obtain thequery result. Every node in T is visited once which means this step needs n randomaccesses to the dataset. In summary, the SC2QA algorithm has time complexity ofO(n(m|∑|+ 2|∑| + 1)). SC2QA needs to maintain a copy of data source T at a time.Hence, the space complexity is bounded by O(n). However, in practice, the number ofrelated nodes can be significantly smaller than n and the number of XCSDs relating to Qis also smaller than the total number of XCSDs |∑|. Therefore, the time complexity canbe reduced significantly.

The following theorem states that the SC2QA algorithm must be terminated and returnscustomized consistent query answers.

Theorem 2 (Termination) Let Q be a query on a data tree T and ∑ be a set of XCSDs. TheSC2QA always terminates and generates a consistent query answer Qc(T, ∑).

Proof Although in each step of algorithm SC2QA, a violation node with respect to a candidateXCSDs is resolved, it might also introduce new violations. SC2QA proceeds until no moreviolation nodes exist. However, the set of candidate XCSDs and the number of candidatenodes are limited. Thus, SC2QA always terminates. □

Theorem 3 (Correctness) Let Q be a query on a data tree T and ∑ be a set of XCSDs. Thequery answer obtained by SC2QA is always customized consistent with the given XCSDs.

Proof Suppose thatQc(T, ∑)= ∪i=1..k Qc(Rvi, ∑) is the answer ofQ. This means each data nodein the CCQA satisfies all XCSDs with the lowest repair cost.

If there exists an answer Qc(Rvi, ∑) for sub-tree Rvi rooted at vi which violates a constraint8m, then there exists at least a node vj in the Rvi violates 8 m. In such a case, Qc(Rvi, ∑) is notincluded in the answer Qc(T, ∑) and a repairing with respect to the given XCSDs is impossible.Otherwise, it is a contradiction with the data value repair principles that each Qc(Rvi, ∑) isconsidered valid in the answer set only if it is computed from all consistent data with a repaircost under a certain cost threshold. □

World Wide Web

Page 20: Structured content-based query answers for improving information quality

6 Experiment evaluation

Data: We run experiments on synthetic data to evaluate the calculation efficiency forSC2QA. This is to avoid the noise in real data. Our dataset is an extension of the "FlightBookings" data shown in Figure 1 which includes three additional elements, namely, theID of the customer CID, the departure time DTime and arrival time ATime. The datasetcovers common features in XML data, including structural diversity and various datarules. The original dataset contained 100 Bookings. The DirtyXMLGenerator [30] madeby Sven Puhlmann was used to generate the synthetic datasets. We specified that thepercentage of duplicates of an object is 100 % to generate a dataset containing similarBookings. From 100 duplicate Bookings, we specified 30 % of data was missing from theoriginal objects so that the dataset becomes inconsistent due to missing data. Thegenerated dataset contains 200 Bookings, named GD1. Then we used the generated dataGD1 to create GD2. The percentage of missing data was specified at 15 % of the numberof objects in GD1. Our purpose is to generate datasets containing the same number ofinconsistent data caused by missing data. GD2 contains 400 Bookings. Similarly, wegenerated GD3, GD4 and GD5 contain 800 Bookings, 1600 Bookings and 3200 Book-ings and the percentages of missing data were specified at 7.5 %, 3.75 % and 1.875 %,respectively. We evaluated on 5 constraints, consisting of 3 constant XCSDs and 2variable XCSDs. We ran experiments on a PC with an Intel i5, 3.2GHz CPU and 8GBRAM. The implementation was in Java and data was stored in MySQL.Parameters: Table 1 is a set of XCSDs and Table 2 is a set of queries which are used inexperiments. The repair cost threshold γ is set to 1.Efficiency and Effectiveness: the number of XCSDs and the query influence on thecomplexity of SC2QA. The dataset GD1 is fixed, the number of conditions on the queryand the XCSDs change. We consider the efficiency and effectiveness of cases where: (i)the query Q is computed with respect to different types of XCSDs including constantXCSDs and variable XCSDs; and (ii) the number of conditions in query Q increases, andthe number of XCSDs is stable. For each query, we recorded the run time.

The results in Figure 4 are the execution times of query Q1 under various types ofXCSDs. The results show that the SC2QA runs more efficiently when utilizing constant

Table 1 Set of XCSDs used in experiments

C1 8 1=Pl:[0.6][./Carrier=′ ′Tiger Airways ′ ′](./Fare→./Tax)

C2 8 2=Pl:[0.6][./Carrier=′ ′Air Asia ′ ′],(./Departure,./Arrival)→(./Tax)

C3 8 3=Pl:[0.6][./Carrier=′ ′Tiger Airways ′ ′],(./Departure= ′ ′SIN ′ ′→./Arrival= ′ ′MEL′ ′)

C4 8 4=Pl:[0.6][./Carrier=′ ′Air Asia ′ ′],(./Departure= ′ ′KUL ′ ′,./Arrival= ′ ′SYD′ ′)→(./Transit= ′ ′MEL′ ′)

C5 8 5=Pl:[0.6][./Carrier=′ ′Air Asia ′ ′],(./Departure= ′ ′SIN ′ ′,./Arrival= ′ ′SYD′ ′)→(./Tax= ′ ′200 ′ ′)

Table 2 Set of queries used in experiments

Q1 /Bookings/Booking

Q2 /Bookings/Booking[Carrier= 'Tiger Airways']

Q3 /Bookings/Booking[Carrier= 'Tiger Airways' and Departure='SIN' ]

Q4 /Bookings/Booking[(Carrier= 'Tiger Airway' and Departure= 'SIN' and Fare = '600']

Q5 /Bookings/Booking[(Carrier= 'Air Asia' or Carrier= 'Tiger Airways') and Departure='SIN']

World Wide Web

Page 21: Structured content-based query answers for improving information quality

XCSDs than variable XCSDs. This is because XCSDs with constant expressions providemore information for validating and correcting violation data than those with variableexpressions. The variable XCSDs hold on a large number of objects, which requires moreprocessing time. For instance, the execution times of query Q1 under 8 3 are around 50 %the execution times of either 8 1 or 8 2. The same situation occurs to 8 4 and 8 5

Figure 5 represents the execution times of queries Q1–Q5 where the number ofconditions varies from 0 to 3 and the number of XCSDs remains the same (i.e 8 1-8 5). Execution time depends on the number of conditions in the query. In cases where thenumber of conditions in the queries increases, the execution time is also slightlyincreased. That is, the time required to analyse the interactions between the query andthe XCSDs increases.Scalability in data size: We investigated the scalability of our approach with respect to thedataset size. We fixed the set of XCSDs 8 1- 8 5 and the query Q5. We ran SC2QA fordatasets ranging from 200 Bookings to 3200 Bookings. For each query, we recorded therun time. The results in Figure 6 show that the algorithm scales linearly for the datasizes.The run time increases gradually due to the number of Bookings affecting to the run time.This is due to the increased number of possible elements required to validate and correct.We also observed that the number of inconsistent elements strongly affects the run time.For example, two datasets GD1 and GD5 contain nearly the same number of inconsistentBookings. GD1 is 16 times smaller than GD5 which contain 200 Bookings and 3200Bookings, respectively; however, the run time of GD1 is only about 8 times faster thanthat of GD5. It is more time consuming to undertake correction than validation.

0

5

10

15

20

25

30

35

C1 C2 C3 C4 C5XCSDs

Tim

e (s

econ

ds)

Figure 4 Execution times on GD1: constant XCSDs vs variable XCSDs

0

10

20

30

40

50

60

70

80

Q1 Q2 Q3 Q4 Q5Queries

Tim

es (

seco

nds)

Figure 5 Execution times on GD1 when varying the number of conditions in queries

World Wide Web

Page 22: Structured content-based query answers for improving information quality

7 Conclusion

This paper introduced an approach utilizing XCSDs to compute customized consistent queryanswers for queries posted to an inconsistent data source to improve information quality. Ourapproach is based on the semantics of XCSDs to find consistent data from involved objects.By identifying every inconsistent node locally with respect to each XCSD, SC2QA is able tocollect the information as consistent as possible. Experiments on a synthetic dataset are used toevaluate the effectiveness and efficiency of SC2QA. The results show that SC2QAworks moreefficiently for constant XCSDs than variable XCSDs. It also works more effectively in caseswhere the number of conditions in the query is small. We proved that customized queryanswers computed by SC2QA are always consistent with respect to a set of preferred XCSDs.We believe that our approach can be applied to arbitrary XML data sources.

References

1. Afrati, F.N., Kolaitis, P.G.: Repair checking in inconsistent databases: algorithms and complexity, ICDT ‘09Proceedings of the 12th International Conference onDatabase Theory St. Petersburg, Russia, pp. 31–41. (2009)

2. Arenas, M.: Normalization theory for XML. SIGMOD Rec. 35(4), 57–64 (2006)3. Arenas, M., Bertossi, L., Chomicki, J.: Consistent query answers in inconsistent databases, PODS ‘99,

Philadelphia, Pennsylvania, USA, ACM, pp. 68–79. (1999)4. Arenas, M., Bertossi, L., Chomicki, J.: Answer sets for consistent query answering in inconsistent databases.

Theory Pract Log Program 3(4), 393–424 (2003)5. Arenas, M., Bertossi, L., Chomicki, J., He, X., Raghavan, V., Spinrad, J.: Scalar aggregation in inconsistent

databases. Theor. Comput. Sci. 296(3), 405–434 (2003)6. Arenas, M., Bertossi, L.: On the Decidability of Consistent Query Answering, In proc. Alberto Mendelzon

Int. Workshop on Foundations of Data Management, (2010)7. Bertossi, L.: Consistent query answering in databases. SIGMOD Rec. 35(2), 68–76 (2006)8. Bertossi, L.: Database repairing and consistent query answering. In: Synthesis Lectures on Data

Management. Morgan & Claypool Publishers (2011)9. Buttler, D.: A Short Survey of Document Structure Similarity Algorithms, Proceedings of the 5th

International Conference on Internet Computing, USA, pp. 3–9. (2004)10. Cate, B.T., Fontaine, G., Kolaitis, P.G.: On the data complexity of consistent query answering. Proceedings

of the 15th International Conference on Database Theory, Berlin, Germany, ACM, pp. 22–33. (2012)

0

50

100

150

200

250

300

350

400

450

200 400 800 1600 3200

# of Bookings in dataset

Tim

es (

Sec

on

ds)

Figure 6 Scalability in data size

World Wide Web

Page 23: Structured content-based query answers for improving information quality

11. Ceravolo, P., Liu, C., Jarrar, M., Sattler, K.-U.: Special issue on querying the data web. World Wide Web14(5–6), 461–463 (2011)

12. Chomicki, J.: Consistent Query Answering: Five Easy Pieces 11th International Conference on Databasetheory, Springer LNCS, 1–17. (2007)

13. Chomicki, J., Marcinkowski, J., Staworko, S.: Computing consistent query answers using conflicthypergraphs, CIKM ‘04 Proceedings of the Thirteenth ACM International Conference on Information andKnowledge Management, ACM Press, pp. 417–426. (2004)

14. Cong, G., Fan, W., Geerts, F., Jia, X., Ma, S.: Improving Data Quality: Consistency and Accuracy,VLDB‘07, Vienna, Austria, VLDB Endowment, pp. 315–326. (2007)

15. Deutsch, A., Tannen, V.: Reformulation of XML Queries and Constraints, Proceedings of the 9thInternational Conference on Database Theory, Springer-Verlag, pp. 225–241. (2002)

16. Deutsch, A., Popa, L., Tannen, V.: Query reformulation with constraints. SIGMOD Rec. 35(1), 65–73 (2006)17. Fan, W., Geerts, F., Jia, X., Kementsietsidis, A.: Conditional functional dependencies for capturing data

inconsistencies. ACM Trans. Database Syst. 33(2), 1–48 (2008)18. Fan, W., Li, J., Ma, S., Tang, N., Yu, W.: Interaction between record matching and data repairing, SIGMOD

‘11, Athens, Greece, ACM pp. 469–480. (2011)19. Flesca, S., Furfaro, F., Greco, S., Zumpano, E.: Repairs and consistent answers for XML data with functional

dependencies. In: Database and XMLTechnologies, pp. 238–253. Springer, Berlin (2003)20. Flesca, S., Furfaro, F., Greco, S., Zumpano, E.: Querying and repairing inconsistent XML data. In: WISE

2005, pp. 175–188. Springer, Berlin (2005)21. Flesca, S., Furfaro, F., Greco, S., Zumpano, E.: Repairing Inconsistent XML Data with Functional

Dependencies. In: Encyclopedia of Database Technologies and Applications, Idea Group, 542–547. (2005)22. Flesca, S., Furfaro, F., Parisi, F.: Querying and repairing inconsistent numerical databases. ACM Trans.

Database Syst. 35(2), 1–50 (2010)23. Ghodke, S., Bird, S., Zhang, R.: A Breadth-First Representation for Tree Matching in Large Scale Forest-

Based Translation, 5th International Joint Conference on Natural Language Processing Chiang Mai,Thailand, IJCNLP2011 pp. 785–793. (2011)

24. Giacomo, G.D., Lembo, D., Lenzerini, M., Rosati, R.: Tackling inconsistencies in data integration throughsource preferences Workshop on Information Quality in Information Systems - QDB, Paris, pp. 27–34.(2004)

25. Kolahi, S., Lakshmanan, L.V.S.: On approximating optimum repairs for functional dependency violations,ICDT ‘09 Proceedings of the 12th International Conference on Database Theory St. Petersburg, Russia,ACM, pp. 53–62. (2009)

26. Kolahi, S., Lakshmanan, L.V.S.: Exploiting conflict structures in inconsistent databases, ADBIS‘10Proceedings of the 14th East European Conference on Advances in Databases and Information Systems,Novi Sad, Serbia, Springer-Verlag, pp. 320–335. (2010)

27. Lee, K.-H., Whang, K.-Y., Han, W.-S.: XMin: minimizing tree pattern queries with minimality guarantee.World Wide Web 13(3), 343–371 (2010)

28. Manolescu, I., Florescu, D., Kossmann, D.: Answering XML Queries on Heterogeneous Data Sources,Proceedings of the 27th International Conference on Very Large Data Bases, Roma, Italy, pp. 241–250.(2001)

29. Ng, W.: Repairing Inconsistent Merged XML Data, Database and Expert Systems Applications. (2003).30. Puhlmann, S., Naumann, F., Eis, M.: The Dirty XML Generator. (2004)31. Rafiei, D., Moise, D.L., Sun, D.: Finding Syntactic Similarities Between XML Documents, Proceedings of

the 17th International Conference on Database and Expert Systems Applications, DEXA‘06, pp. 512–516.(2006)

32. Staworko, S., Chomicki, J.: Validity-Sensitive Querying of XML Databases, EDBT Workshops, pp. 164–177. (2006)

33. Tagarelli, A.: Exploring dictionary-based semantic relatedness in labeled tree data. Inf. Sci. 220(20), 244–268 (2013)

34. Tan, Z., Zhang, L.: Repairing XML functional dependency violations. Inf. Sci. 181(23), 5304–5320 (2011)35. Tan, Z., Wang, W., Shi, B.: Extending Tree Automata to Obtain Consistent Query Answer from Inconsistent

XML Document Proceedings of the First International Multi-Symposium on Computer and ComputationalSciences (IMSCCS‘06), pp. 488–495. (2006)

36. Tan, Z., Zhang, Z., Wang, W., Shi, B.: Computing repairs for inconsistent XML document using chase. In:Anvances in Data and Web Management, pp. 293–304. Springer, Berlin (2007)

37. Tan, Z., Liu, C., Wang, W., Shi, B.: Consistent query answers from virtually integrated XML data. J. Syst.Softw. 83(12), 2566–2578 (2010)

38. Vincent, M.W., Liu, J., Liu, C.: Strong functional dependencies and their application to normal forms inXML. ACM Trans. Database Syst. 29(3), 445–462 (2004)

World Wide Web

Page 24: Structured content-based query answers for improving information quality

39. Vo, L.T.H., Cao, J., Rahayu, W.: Discovering Conditional Functional Dependencies in XML Data,Australasian Database Conference, pp. 143–152. (2011)

40. Vo, L.T.H., Cao, J., Rahayu, W., Nguyen, H.-Q.: Structured content-aware discovery for improving XMLdata consistency. Inform. Sci. 248(1), 168–190 (2013)

41. W3C, XML Path Language (XPath), (1999)42. Weis, M., Naumann, F.: Detecting Duplicate Objects in XML Documents, Proceedings of the 2004

international workshop on Information quality in information systems, Paris, France, ACM, pp. 10–19.(2004)

43. Weis, M., Naumann, F.: DogmatiX Tracks down Duplicates in XML, Proceedings of the 2005 ACMSIGMOD international conference on Management of data, Baltimore, Maryland, ACM pp. 431–442.(2005)

44. Yakout, M., Elmagarmid, A.K., Neville, J., Ouzzani, M.: GDR: a system for guided data repair, SIGMOD,pp. 1223–1226. (2010)

45. Yu, C., Jagadish, H.V.: XML Schema refinement through redundancy detection and normalization. VLDB17(2), 203–223 (2008)

46. Yu, C., Popa, L.: Constraint-based XML query rewriting for data integration, SIGMOD ‘04, Paris, France,pp. 371–382. (2004)

47. Yu, C., Jagadish, H.V.: Efficient Discovery of XML Data Redundancies, Proceedings of the 32ndInternational Conference on Very Large Databases, Seoul, Korea, VLDB Endowment pp. 103–114. (2006)

World Wide Web