
Decision Support Systems 44 (2008) 495–508

Referential integrity quality metrics

Carlos Ordonez a,*, Javier García-García b,1

a University of Houston, Department of Computer Science, Houston, TX 77204, USA
b Universidad Nacional Autónoma de México, Facultad de Ciencias, UNAM, Mexico City, CU 04510, Mexico

Received 12 September 2006; received in revised form 7 May 2007; accepted 18 June 2007
Available online 23 June 2007

Abstract

Referential integrity is an essential global constraint in a relational database that keeps the database in a complete and consistent state. In this work, we assume the database may violate referential integrity and relations may be denormalized. We propose a set of quality metrics, defined at four granularity levels: database, relation, attribute and value, that measure referential completeness and consistency. Quality metrics are efficiently computed with standard SQL queries that incorporate two query optimizations: left outer joins on foreign keys and early foreign key grouping. Experiments evaluate our proposed metrics and SQL query optimizations on real and synthetic databases, showing they can help in detecting and explaining referential errors.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Referential integrity; Foreign key; Metric; Database integration

* Corresponding author. E-mail addresses: [email protected] (C. Ordonez), [email protected] (J. García-García).
1 The second author was sponsored by the UNAM IT project "Macroproyecto de Tecnologías para la Universidad de la Información y la Computación".
doi:10.1016/j.dss.2007.06.004

1. Introduction

Referential integrity is a fundamental global constraint in a relational database [8], which basically ensures that a foreign key value exists in the referenced relation. Referential integrity issues are found in database integration, data quality assurance, data warehousing and data modeling. Referential integrity is violated or relaxed for practical reasons. Database integration represents a common scenario, where similar tables coming from multiple source databases (OLTP systems) have different referential integrity constraints and each DBMS provides distinct mechanisms and rules to enforce referential integrity [20]. Therefore, source databases may violate referential integrity and their integration may uncover additional referential integrity problems. Performance is the second common reason, where referential integrity checking is disabled to allow fast insertions in batch mode. Finally, the logical data model behind a relational database evolves, incorporating new attributes and new relations not defined before, causing old data to violate new referential integrity constraints. In short, referential integrity is an important and broad problem in modern relational databases.

The issues outlined above motivated us to revisit the fundamental concept of referential integrity. We propose several Quality Metrics (QMs) that measure completeness and consistency with respect to referential integrity. Our QMs not only cover normalized databases, whose relations are incomplete when they have missing foreign key values, but also measure the inconsistency that arises when a relation is not normalized and an attribute value, determined by a foreign key, does not match the corresponding attribute value in the referenced relation. This is common when tables are denormalized, views are materialized or similar tables from different source databases are integrated. This article continues our work on referential integrity [20] (see Section 5).

The article is organized as follows. Section 2 introduces definitions on relational databases, referential integrity and denormalized databases. Section 3 presents our main contributions. We introduce QMs that measure completeness and consistency in a relational database with respect to referential integrity. QMs quantify referential errors at the database, relation, attribute and value level, on both normalized and denormalized databases. We introduce a basic set of SQL queries to compute QMs and identify two main optimizations. Section 4 presents an experimental evaluation with real and synthetic databases. We demonstrate the usefulness of QMs on real databases and we study query optimization with large synthetic databases. Section 5 discusses related work. Section 6 presents the conclusions and directions for future research.

2. Definitions

We denote a relational database by D(R, I), where R is a set of N relations R = {R1, R2, ..., RN}, Ri is a set of tuples and I is a set of referential integrity constraints. A relation Ri of degree di is denoted by Ri(Ai1, Ai2, ..., Aidi), where each attribute comes from some domain. One attribute from Ri is the primary key (PK), called Ki, and the remaining attributes Aij are functionally dependent on Ki, denoted Ki → Aij. To simplify exposition we use simple PKs. Relations are manipulated with the standard relational algebra operators σ, Π, ⋈ (i.e. with SPJ queries [7]) and aggregations. The cardinality (size) of Ri is denoted by |Ri| and ni (to avoid confusion with N).

A referential integrity constraint, belonging to I, between Ri and Rj is a statement of the form Ri(K) → Rj(K), where Ri is the referencing relation, Rj is the referenced relation, K is a foreign key (FK) in Ri and K is the primary key (PK) of Rj. To simplify exposition we assume the common attribute K has the same name on both relations Ri and Rj. In a valid database state with respect to I, the following two conditions hold: (1) Ri·K and Rj·K have the same domains; (2) for every tuple in Ri there must exist a tuple in Rj such that Ri·K = Rj·K. The primary key of a relation (i.e. Rj·K) is not allowed to have nulls. But in general, for practical reasons, the foreign key Ri·K is allowed to have nulls. We also study integrity over inclusion dependencies (IDs). An ID constraint between attributes Ai and Aj from relations Ri and Rj holds when ΠAi(Ri) ⊆ ΠAj(Rj).

We relax the concept of referential integrity. Assume D(R, I) is an invalid state with respect to I. That is, some relations Ri from R violate referential integrity constraints. We call the valid state a strict state. A database state where there exist referential errors is called relaxed. We assume Ri may contain tuples having Ri·K values that do not exist in Rj·K, or Ri·K may be null (η). To make our definition more precise, if Ri·K is null in some tuple we will consider such a tuple incorrect. This is motivated by the fact that a null reference provides no information and the ⋈ operator eliminates Ri tuples with a null value on K. Our goal is to quantify referential integrity violations by defining quality metrics of relaxed states.

We add one practical aspect: relations can be denormalized. We compute referential integrity metrics on attributes such that each attribute is either a foreign key (as defined above) or a foreign attribute (functionally dependent on some foreign key). We use K to refer to a foreign key and F to a foreign attribute, in a generic manner. Recall we assume primary and foreign keys consist of one attribute to simplify exposition, but in practice a foreign key may consist of two or more attributes (e.g. composite keys). Notice a foreign attribute introduces a potential source of inconsistency; this notion will be formalized in the next section. Relation Ri has ki foreign keys and fi foreign attributes. Attributes that depend only on the primary key, and not on any foreign key, are assumed to be correct. Therefore, we do not define quality metrics on them.
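To make these definitions concrete, the following DDL sketch illustrates a referenced relation, a foreign key K and a denormalized foreign attribute F. The table and column names are hypothetical (they echo the store example used later in Section 3.4), and the FOREIGN KEY clause is deliberately omitted since we assume constraints may be disabled or violated.

-- Sketch with hypothetical tables; constraints intentionally not declared.
CREATE TABLE city (
  cityId   INT PRIMARY KEY,   -- PK of the referenced relation Rj
  cityName VARCHAR(30)        -- functionally dependent on cityId
);
CREATE TABLE store (
  storeId  INT PRIMARY KEY,   -- PK of the referencing relation Ri
  cityId   INT,               -- foreign key K; nullable and unenforced
  cityName VARCHAR(30)        -- foreign attribute F copied from city;
                              -- may become inconsistent in a relaxed state
);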

3. Referential integrity QMs

3.1. Operations violating referential integrity

Referential integrity can be violated basically for two reasons: (1) a relation (table) coming from a different source database is integrated into a central database, such as a data warehouse; (2) a database operation at the row level introduces a referential integrity violation when a constraint is disabled or non-existent. This is a common scenario in a data warehouse, which stores tables coming from multiple source databases. In general, referential integrity constraints are disabled with two reasons in mind: adding new data more efficiently and avoiding the elimination of tuples with invalid references. The database operations that can cause a relaxed state are: (a) inserting or updating a foreign key value Ri·K in Ri such that Ri·K is null or does not exist in Rj·K at the time of insertion or updating (completeness issue); (b) deleting or updating tuples from Rj whose primary key Rj·K leaves a set of referencing tuples in Ri without a corresponding matching tuple with the same Ri·K value (completeness issue); (c) updating or inserting an attribute F in Rj, functionally dependent on Rj·K, that is used as a foreign attribute in Ri, leaving both attributes with inconsistent values or, conversely, updating F in Ri but not in Rj (inconsistency issue). This case arises when referencing tables are denormalized, when views are materialized or when query results are stored.

There are a variety of DBMS mechanisms to enforce referential integrity [10] in databases that are assumed to be in 3NF. A common mechanism is to forbid insert or update operations on a referencing relation that produce tuples with nonexistent foreign keys on the referenced relation. Orthogonal referential actions associated with delete or update operations on referenced relations commonly are: restrict, cascade, nullify, set a default value or no action at all. These actions are commonly used to neutralize a possible referential integrity violation. In general, a DBMS does not enforce referential integrity on non-key attributes, because databases are assumed to be in 3NF and views are materialized. Triggers [10] can be used to detect referential violations.

3.2. Absolute and relative error

Let X represent an attribute (foreign key K or foreign attribute F), a relation or a database. We define the absolute error of object X as a(X), given by its number of incorrect references. Let t(X) be the total count of references in X. The relative error of X is defined as r(X) = a(X) / t(X).

Absolute error is useful at fine granularity levels or when there are few referential violations. In contrast, relative error is useful at coarse granularity levels or when the number of referential violations is large.
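For instance, with hypothetical counts (not taken from any of our databases): if a foreign key column X holds t(X) = 50,000 references, of which a(X) = 1000 are invalid or null, then

r(X) = a(X) / t(X) = 1000 / 50,000 = 0.02 = 2%.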

3.3. Hierarchical definition of QMs

We introduce referential integrity QMs at four granularities: (1) database; (2) relation; (3) attribute; (4) value. QMs are intended to give both a global picture and a micro-view of the database. Database level QMs constitute the first step towards discovering referential errors. Relation level QMs isolate those relations with referential problems. Attribute QMs provide specific information that may be used to explain and fix specific referential integrity errors. QMs are hierarchically defined, starting with the attribute QMs and ending with the database QMs.

Attribute and attribute value QMs are defined over foreign keys and foreign attributes. QMs over foreign keys represent a measure of completeness. QMs over foreign attributes represent a measure of consistency. Therefore, denormalization introduces potential inconsistency.

3.3.1. Attribute level QMs
Given an attribute K that is a foreign key, a referencing tuple in Ri is considered correct if there exists a matching tuple in the referenced relation Rj with the same K value. For a foreign attribute F in a referencing table that is denormalized, a referencing tuple is considered correct if Ri·F = Rj·F for the matching tuple in the referenced table Rj, where Ri·K = Rj·K. Otherwise, a tuple is considered incorrect. Therefore, a tuple in Ri with null values in K or F is considered incorrect. We even consider the case that a foreign attribute is also a foreign key, where we treat it as a foreign attribute so that inconsistency can be detected. Summarizing, referential integrity for K (a foreign key) requires just an existential check, but referential integrity for F (a foreign attribute) also requires an equality test.

We define the attribute level metric on Ri with referential integrity Ri(K) → Rj(K) on foreign key K. We start by defining the attribute level absolute error metric, which is defined in terms of unmatched values and nulls. If K is a foreign key in Ri referencing Rj then

a(Ri·K) = |Ri| − |Ri ⋈K Rj|   (1)

This QM is a measure of completeness. In this case, null values and unmatched foreign keys contribute to absolute error. Motivated by the fact that sometimes foreign keys are allowed to have nulls, especially in data warehouses, we provide a less restrictive definition, where K is allowed to have nulls. In this case error is computed over a subset of Ri, where K is not null. That is, tuples where K is null are assumed correct.

a(Ri·K) = |σnot null(K)(Ri)| − |Ri ⋈K Rj|   (2)
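For instance, Eq. (2) can be evaluated directly with a query along the following lines (a sketch where Ri, Rj and K are placeholder names; Section 3.5 discusses formulations and optimizations in detail):

-- Sketch: absolute error a(Ri.K) under Eq. (2).
SELECT count(*) AS aK
FROM Ri LEFT OUTER JOIN Rj ON Ri.K = Rj.K
WHERE Ri.K IS NOT NULL   -- Eq. (2): tuples with a null FK are assumed correct
  AND Rj.K IS NULL;      -- unmatched references

Dropping the Ri.K IS NOT NULL predicate yields Eq. (1), since null foreign keys never match and are then counted as errors.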

Let F be a foreign attribute in Ri dependent on a foreign key attribute K and let P be the primary key of Ri, where Ri references Rj. That is, Ri·P → Ri·K, Ri·P → Ri·F and Rj·K → Rj·F. But notice Ri·K → Ri·F is not necessarily valid, because R can be in a relaxed (inconsistent) state. We define the absolute error for a foreign attribute as:

a(Ri·F) = |Ri| − |σRi·F=Rj·F(Ri ⋈K Rj)|   (3)

This QM is a measure of consistency. Notice the metric for a foreign attribute is strict in the sense that, in order to consider a reference a correct one, not only must its foreign key exist, but the foreign attribute must also have the same value, and such value must be different from null. A comparison between two values in which either value is null returns null by three-valued logic [8] and therefore increases a(Ri·F). Notice that in Eq. (3) we adopted a conservative strategy, in the sense that if the foreign key of the corresponding foreign attribute does not match, then it will be counted as an error, no matter the value of the foreign attribute. If F is also a foreign key then we treat it as a foreign attribute so that inconsistency can be detected. If we only need the number of inconsistent values from F where the foreign key is valid, we can compute a(Ri·F) − a(Ri·K). In an analogous manner, we provide a less restrictive definition, where K can be null. Notice that when F is null it will still be considered incorrect, since comparison with any value is undefined.

a(Ri·F) = |σnot null(K)(Ri)| − |σRi·F=Rj·F(Ri ⋈K Rj)|   (4)

We can now define the attribute level relative error QM. Notice this metric is undefined for empty relations. If Aij is either a foreign key or a foreign attribute in Ri then

r(Ri·Aij) = a(Ri·Aij) / ni   (5)

The QM r(Ri·K) over a foreign key K is a measure of completeness and r(Ri·F) on a foreign attribute F is a measure of consistency.

3.3.2. Value level QMs
We define a value level QM for one foreign key value kp ∈ Ri·K as a(Ri·K, kp) = |σK=kp(Ri)| − |σK=kp(Ri ⋈K Rj)|, for each key value kp, where kp may be null. This QM provides the frequency of invalid key values, including null. This QM is a measure of completeness. The value QM for each foreign attribute value fq ∈ Ri·F is similarly defined, counting those tuples that contain values kp ∈ K and fq ∈ F in Ri and Ri ⋈K Rj. This QM measures consistency and provides the frequency of each non-matching foreign attribute value. If nulls in K are considered correct then fq will be considered correct if it is null. When nulls are not considered correct, a null value in K will cause fq to be considered incorrect, regardless of the value of fq. Therefore, incompleteness in K leads to inconsistency in F. Relative error for attribute values is defined dividing by ni = |Ri|.
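A sketch of the corresponding query (placeholder names as before): grouping the left outer join by the foreign key returns the frequency of each invalid value, with all null foreign keys collapsing into a single η group, since GROUP BY places nulls together.

-- Sketch: value level QMs for foreign key K.
SELECT Ri.K, count(*) AS a
FROM Ri LEFT OUTER JOIN Rj ON Ri.K = Rj.K
WHERE Rj.K IS NULL       -- unmatched or null references only
GROUP BY Ri.K
ORDER BY a DESC;         -- critical values first, as in Tables 5 and 6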

3.3.3. Statistics and correlations on Attribute QMsWe introduce univariate and bivariate statistics to

understand the probability distribution behind referentialerrors in some attribute and also, to discover inter-

relationships among two unrelated attributes. Wecalculate univariate statistics on attribute QMs includingthe minimum, maximum, mean and standard deviationof absolute error (frequency) of invalid attribute values.Statistics on QMs are useful to explain why errorshappen and to develop a database repair plan. Aswe shallsee in Section 4, these statistics can give us specificinformation on the probability distribution behindinvalid values in a relation. For instance, we can knowhow many tuples there are per invalid value on average.To have an idea about the scatter of frequencies of invalidvalues we compare their mean and standard deviation.The minimum and maximum frequencies help us detectcritical values (outliers). We also introduce an errorcorrelation measure between two attributes on the samerelation. In general, the correlation between a foreign keyand a functionally dependent foreign attribute is close to1. Error correlation can reveal interesting facts amongtwo attributes that do not have any functional depen-dency between them. A correlation between twounrelated foreign keys or two unrelated foreign attributesprovides valuable insight to explain why referentialerrors happen.
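These statistics follow directly from the value level QMs. A minimal sketch, assuming the result of the previous value level query was stored in a hypothetical table valueQM(K, a); min, avg, max and stddev are available in PostgreSQL (the DBMS used in Section 4.2), though other systems may name the standard deviation aggregate differently:

-- Sketch: univariate statistics on value level absolute errors.
SELECT min(a), avg(a), max(a), stddev(a)
FROM valueQM;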

3.3.4. Relation level QMs
We define relation QMs for table Ri with ki foreign keys and fi foreign attributes. When Ri is normalized, fi = 0.

Based on attribute QMs, we define independent error aggregations on foreign keys and foreign attributes. The following QM measures the completeness of Ri:

aK(Ri) = Σ_{j=1..ki} a(Ri·Kj)   (6)

This QM measures completeness, since it tells us how many foreign key values are not matched in each Rj or are simply null. Similarly, we define a consistency QM for Ri as:

aF(Ri) = Σ_{j=1..fi} a(Ri·Fj)   (7)

We define relative error QMs for completeness and consistency, respectively:

rK(Ri) = aK(Ri) / (ki ni)   (8)

rF(Ri) = aF(Ri) / (fi ni)   (9)

rK() and rF() are η (undefined) when ki = 0, fi = 0 or ni = 0. Also, aF(Ri) is η when Ri is normalized. Observe that relation level QMs quantify completeness and consistency in a relation based on all its attributes.
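As a worked illustration with hypothetical numbers, suppose Ri has ni = 1000 tuples and ki = 2 foreign keys, with attribute level errors a(Ri·K1) = 10 and a(Ri·K2) = 30. Then, by Eqs. (6) and (8):

aK(Ri) = 10 + 30 = 40,   rK(Ri) = aK(Ri) / (ki ni) = 40 / (2 × 1000) = 2%.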

3.3.5. Database level QMs
We define database absolute error QMs, for completeness and consistency respectively:

aK(R) = Σ_{i=1..N} aK(Ri)   (10)

aF(R) = Σ_{i=1..N} aF(Ri)   (11)

Finally, we define global relative error QMs:

rK(R) = aK(R) / Σ_{i=1..N} (ki ni)   (12)

rF(R) = aF(R) / Σ_{i=1..N} (fi ni)   (13)

Our database level QMs exclude relations Ri with ki = 0 (e.g. lookup relations without references). If ki = 0 for some Ri then Ri does not contribute to relative error.

3.3.6. Discussion on QMs
Completeness and consistency are measured separately. In our proposal we do not give different weights to references coming from large or small relations. An incorrect reference in a small relation (e.g. a lookup table) is as important as an incorrect reference in a large relation (e.g. a transaction table). Correlations and further multidimensional statistical models can be used to explain relationships between referential errors in different attributes. QMs can be ranked by the cardinality of the referencing relation, so that references in smaller or larger relations carry more or less weight based on the user's preferences. Table 1 summarizes attribute-level QMs, which can be thought of as the atomic metrics from which relation and database QMs are aggregated. Value QMs can be considered a zoomed view of attribute QMs. Fig. 1 shows the hierarchical relationship among QMs; everything is derived from attribute level QMs.

Table 1
Attribute level QMs

Data quality dimension  Absolute error  Relative error
Completeness            a(Ri·K)         r(Ri·K)
Consistency             a(Ri·F)         r(Ri·F)

Fig. 1. QMs diagram.

3.4. Example

We now present examples of QMs using a store database schema. Since the goal of our QMs is to help a user discover and explain referential errors in a database, we present QMs going from the highest level down to the most detailed level, in opposite order from their definition. We start by showing global measures of referential integrity quality at the database level in Table 2. The first important observation is that rK() and rF() are not zero, which indicates there exist referential errors in the database. The second observation is that incompleteness and inconsistency have similar importance. If it is desired to maintain a database in a strict state, this QM table can be periodically refreshed and monitored by a set of SQL queries (Section 3.5).

Table 2
Database level QMs

R         N   rK()   rF()
Store DB  14  0.02%  0.02%

Going down one level, we can discover which specific relations are giving us problems. Table 3 helps us identify those relations with errors. For each relation there are completeness and consistency QMs with a(K) and a(F). Relation product has a mixture of completeness and consistency problems. Relation txHeader has only consistency problems in its foreign attributes (e.g. customerId). Relation txLine reveals completeness and consistency problems: some foreign keys have invalid values or referenced relations are incomplete.

Table 3
Relation level QMs

Ri        ki  fi  ni          aK()  aF()
categ     0   0   20          η     η
product   4   5   100,109     2444  3100
txHeader  2   2   2,003,569   0     220
txLine    4   2   19,130,964  7200  7495

The next level gives us more specific evidence about referential problems, as shown in Table 4. To make the output more concise we only show absolute error, but relative error can be easily derived. Relation categ has no references, and its attribute categName is neither K nor F; therefore, this attribute does not contribute to absolute or relative error. Relation city indicates there exists a serious completeness problem in the database, with a high fraction of invalid foreign keys. Relation product indicates denormalization introduces important consistency problems, since it has many inconsistent values in categName. In relation txLine we can see that the main source of referential problems is invalid foreign keys, but there are 7495 − 7200 = 295 rows with inconsistent values for prodPrice even when the foreign key exists.

Table 4
Attribute level QMs

Ri       Aij        K or F  a()   r(), %
categ    categName  ∅       0     0.00
city     stateId    K       12    9.01
product  categId    K       2444  2.00
product  categName  F       3100  3.00
txLine   prodId     K       7200  0.02
txLine   prodPrice  F       7495  0.02

Finally, at the finest storage granularity level, Tables 5 and 6 show value level QMs, exhibiting unmatched K values and inconsistent F values in a denormalized relation, as well as their respective frequencies. Values are sorted in descending order by their absolute error to quickly identify critical values in order to develop a database repair plan. In other words, when a(K) or a(F) have skewed distributions we can learn which values can help us fix most errors. Value level metrics for F consider every existing combination between K and F, which can help explain why denormalized values do not match and assess their relative importance. In this case we can see that the functional dependency between K and F does not hold. When a database is in a strict state, these value level tables are empty. Another important aspect is that these QM tables represent extended statistics on the database metadata. Therefore, QM tables can be queried by an end-user to explore a database for completeness and consistency.

Table 5
Value level QMs for foreign key K

Ri     stateId  a(Ri·K)
store  η        4
store  0        3
store  blank    2
store  T.       2
store  XX       1

Table 6
Value level QMs for foreign attribute F

Ri     cityId  cityName     a(Ri·K, Ri·F)
store  LA      η            43
store  SF      blank        34
store  SF      Los Angeles  1
store  SF      η            67
store  η       New York     1

3.5. QMs calculation with SQL

Absolute error metrics are distributive [13], based on the fact that they are counts. Therefore, from a query evaluation perspective, most of the effort is spent on computing attribute level or value level QMs. In contrast, relative error at a coarser granularity level cannot be computed from relative errors at finer levels, because relative error is a quotient; this is a consequence of the relative QM being an algebraic [13] function.
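To illustrate the distributive property: once attribute level absolute errors are stored in a metadata table, relation level QMs follow from plain sums. A sketch, assuming a hypothetical table attrQM(relName, attName, attType, a), where attType is 'K' or 'F':

-- Sketch: aggregating attribute level QMs into relation level QMs.
SELECT relName,
 sum(case when attType = 'K' then a end) AS aK,
 sum(case when attType = 'F' then a end) AS aF   -- null (η) if fi = 0
FROM attrQM
GROUP BY relName;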

We present two query optimizations that are particularly useful to compute QMs. The first optimization favors a left outer join over a set containment computation to compute absolute QMs. The second optimization evaluates error aggregations grouping by foreign keys either before or after joining Ri and the referenced relations Rj.

3.5.1. Set containment and left outer join
The following query computes a(Ri·K) as defined above, with the constraint Ri(K) → Rj(K), assuming the primary key of Ri is P. We call this query "set containment". This query is generally computed with a nested loop algorithm.

SELECT count(*) FROM Ri WHERE Ri.K NOT IN
(SELECT Ri.K FROM Ri JOIN Rj ON Ri.K = Rj.K);

We introduce an equivalent query to compute a(Ri·K) that may be more efficient when the DBMS uses a different join algorithm, like a merge-sort join. We call it "left outer join".

SELECT count(*)
FROM Ri LEFT OUTER JOIN Rj ON Ri.K = Rj.K
WHERE Rj.K IS NULL;

A similar framework can be used for foreign attributes. The following query computes a(Ri·K) and a(Ri·F) in a single table scan over Ri. In general, the left outer join optimization will be applied by default.

SELECT
 sum(case when Rj.K is null then 1 end) AS aK,
 sum(case when Rj.K is null or Ri.F <> Rj.F then 1 end) AS aF
FROM Ri LEFT OUTER JOIN Rj ON Ri.K = Rj.K;


3.5.2. Early or late foreign key grouping
This optimization evaluates a "group-by" clause on foreign keys either before or after joining Ri and the referenced relations Rj. The rationale behind this optimization is reducing the size of Ri before joining with Rj. We also call this optimization the "cube" variant, because we build a cube whose dimensions are foreign keys before joining. The early foreign key grouping query optimization with Ri and Rj computes a(Ri·K) as follows:

INSERT INTO RK SELECT K, count(*) AS a
FROM Ri GROUP BY K;
SELECT sum(a) FROM RK LEFT OUTER JOIN Rj
ON RK.K = Rj.K WHERE Rj.K IS NULL;

On the other hand, the late group-by evaluation on Ri and Rj to get a(Ri·K) is shown below. The derived table T can be materialized to efficiently query value level QMs.

INSERT INTO Rij SELECT Ri.K
FROM Ri LEFT OUTER JOIN Rj
ON Ri.K = Rj.K WHERE Rj.K IS NULL;
SELECT sum(a) FROM
(SELECT K, count(*) AS a FROM Rij GROUP BY K) T;

The early foreign key grouping optimization can be generalized to Ri being joined with m relations R1, R2, ..., Rm on foreign keys K1, K2, ..., Km, respectively.
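As an illustration of this generalization, a sketch for m = 2 (hypothetical referenced tables R1 and R2 with keys K1 and K2) groups Ri into a small cube once and then probes both referenced relations; since K1 and K2 are primary keys of R1 and R2, the outer joins do not duplicate rows:

-- Sketch: early FK grouping generalized to two foreign keys.
INSERT INTO RK SELECT K1, K2, count(*) AS a
FROM Ri GROUP BY K1, K2;
SELECT
 sum(case when R1.K1 is null then a end) AS aK1,
 sum(case when R2.K2 is null then a end) AS aK2
FROM RK LEFT OUTER JOIN R1 ON RK.K1 = R1.K1
        LEFT OUTER JOIN R2 ON RK.K2 = R2.K2;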

3.6. Algorithm to get QMs

We represent referential integrity constraints by a directed graph G = (V, E), where V = {1, ..., N}, vertex i corresponds to Ri and E = {(i, j) | Ri(K) → Rj(K)}. That is, each directed edge represents one reference. We introduce an algorithm that computes referential integrity QMs based on G. The algorithm traverses the logical data model behind the database, moving from referenced relations towards referencing relations, as specified by G. Each vertex i, corresponding to relation Ri, stores information about its primary key, foreign keys, foreign attributes and functional dependencies. Each edge (reference) makes the algorithm create an SQL query between Ri and Rj. The algorithm produces metadata tables with QMs at different levels (Fig. 2).

Fig. 2. Algorithm to compute QMs.

4. Experimental evaluation

We implemented our program in the Java language, connecting to different relational DBMSs through the JDBC interface. The program received G (the referential integrity directed graph) as input and generated SQL statements to create the four QM tables, one for each granularity level.

We used three real databases and one synthetic database, generated with the TPC-H DBGEN program. All times are reported in seconds, unless stated otherwise. We used multiple relational DBMSs from different software companies. To avoid discussion of each DBMS's specific features, performance and SQL compliance, we omit their real names. This was also a request from end users to protect their privacy, while allowing us to report our findings.

We used standard SQL (ANSI) in our QMs implementation, making it portable across all relational DBMSs.

Page 8: Referential integrity quality metrics


4.1. Real databases

Real databases were used to assess the usefulness of our approach and the relative importance of QMs. In the experiments presented below, we asked several information technology organizations to give us permission to discover referential integrity problems in their databases. The real databases came from a university, a government organization and a retail company. In our discussion we change the actual table and column names to protect privacy. Organizations did not agree to reveal attribute value QMs, since they reveal specific information about their operation. In general, query running times are given in seconds.

4.1.1. Educational institution
We now present interesting results on a database from a public higher educational institution. Since this was a production environment, other database processes were running concurrently with our program, but the workload was similar in all cases; we report the average running time. Due to security restrictions we could only explore one important historic table containing student information, including student name, year, semester, course names, course grades, credits and grade point average. We were not allowed to explore value level QMs, which show specific referential errors for specific students. The experiment goal was to validate that student identifications were actually valid, according to a reference table containing biographic and demographic information for all students ever registered at the institution. Results were discussed with the Information Technology manager, two database developers and a system administrator. Results are presented in Table 7.

Table 7
Educational institution: attribute level QMs

Ri       K          rK     time
student  studentId  19.0%  707

The IT manager was aware there existed some referential integrity issues. Database developers were surprised there was such a high fraction of missing foreign keys for student identification numbers. The system administrator stated the historic table was constantly updated with new semester course grades, but he said this table had a few changes in its definition in the past. Table 8 gives statistical information about the probabilistic distribution of invalid references. It is noteworthy that there exists a value that contributes significantly to a(), with 9 invalid references. The coefficient of variation σ/μ states there is not much variation around μ: most values contribute equally to a().

Table 8
Educational institution: attribute QM statistics

Ri       Aij        min  μ  max  σ
student  studentId  1    3  9    1.52

4.1.2. Government organization
Applying our approach on a different database, we now present QMs on a driver's license database from a state in Mexico (state name omitted due to privacy restrictions). This database contained driver's license information and vehicle registration information for people living in certain counties. The database had historic tables containing personal information, driving records, traffic fines and payments. There were additional reference tables containing personal identifiers issued by the government, similar to the US Social Security Number. These results were obtained during an early stage of our project, when we did not yet compute all QMs. Results were discussed with the IT manager, a database applications developer and a database administrator. The database level QMs are shown in Table 9, where we can see there are minor referential integrity problems. Incompleteness of foreign keys is clearly more important than inconsistency, even though the database is denormalized. This was good news for the IT manager, who did not anticipate having any important referential issues. Going down one level, Table 10 shows a couple of relation level QMs. Users became aware referential problems were prevalent specifically in relation License, which was the most important relation being continuously updated. To a lesser extent, there were both completeness and consistency problems in relation person. Then, going to a more granular storage level, attribute level metrics are shown in Table 11. QMs a() and r() are significantly high for the county attribute. On the other hand, our prototype uncovered inconsistency problems in attributes brName and model. Before computing our QMs, users had told us there could be referential violations. Nevertheless, they had not imagined there were serious problems in many cases.

Table 9
Government database; database level QMs

R       N   rK     rF
GovtDB  23  0.43%  0.01%

Table 10
Government database; relation level QMs

Ri     ki  ni         a          r, %
docum  3   4,779,940  2,081,465  14.52
genId  3   2,493,855  15,198     0.14

Table 11
Government database; attribute level QMs

Ri     Aij      K or F  a          r, %
docum  county   K       2,076,530  43.443
docum  citizen  K       4935       0.103
docum  paymts   K       0          0.000
genId  brName   F       5628       0.226
genId  model    F       4562       0.183


4.1.3. Retail company
We present error correlation results with a database from a retail company. We focused on a summary fact table, having ni = 125,208 rows, used to perform store-level data mining analysis and to build monthly and annual OLAP reports. This fact table had been built from a database containing 6 billion transactions. From an analytical perspective this was one of the most important tables to perform OLAP and data mining analysis for this company. We ran our QMs and they took just 10 s; the correlation matrix was derived in 2 s. Table 12 shows attribute-level QMs only for those attributes with non-zero error. Clearly, completeness is a major issue, since more than 7% of rows have an invalid FK for storeId. Looking at foreign attributes, we can see a(format) is close to a(storeId), which indicates we cannot tell whether format is consistent with the value in the referenced relation; given our conservative definition we assume it is incorrect. On the other hand, a(region) is far from a(storeId), which reveals a serious consistency problem. In fact, 15,036 − 9468 = 5568 values are inconsistent despite the fact that the FK storeId exists. Table 13 shows a high correlation between storeId and format, which is a consequence of the functional dependency. However, region shows a lower correlation to either attribute, which tells us that this attribute shows inconsistency in an independent manner. In practical terms, repairing referential errors in storeId takes care of format, but region requires a separate (independent) repair action.

Table 12
Retail database; attribute-level QMs

Attribute  K/F  a()     r(), %
storeId    K    9468    7.56
format     F    9672    7.72
region     F    15,036  12.01

Table 13
Retail database; a() error correlation

ρ           a(storeId)  a(format)  a(region)
a(storeId)  1.000
a(format)   0.988       1.000
a(region)   0.774       0.764      1.000

4.1.4. Users feedback summary
In general, users expected their databases to be clean and they asked us to keep their specific organization names and specific record information confidential. In some cases, they also requested to keep the DBMS brand confidential as well. We requested access to critical databases that were updated and refreshed continuously. It did not matter if such databases were denormalized, since our program was prepared to handle them. In particular, we focused on analyzing large historic tables whenever possible, since those tables are used in a wide variety of queries and reports. Under the IT manager's supervision we were allowed to compute our QMs with automatically generated SQL queries. When the referential integrity prototype had generated results, these results were discussed with the IT managers, database developers and database administrators. We presented results hierarchically, going from the database level QMs down to attribute level QMs. Based on the findings, users requested to browse a sample of records with referential errors. Some users were intrigued by results and asked us to explain how our QMs were computed with SQL queries.

We asked the IT managers the following questions. Are you surprised QMs indeed uncovered some referential problems? What level of granularity of QMs would you compute on a frequent basis? Which is a more critical dimension in your database: completeness or consistency? For which tables is it critical to maintain referential integrity? We asked database developers and database administrators the following questions. Why do you think table X has referential errors? When table X is denormalized, how is it computed? How frequently is database X updated? Is table X updated in batch or continuously, inserting one record at a time? Is there an explanation for attribute Y to have high levels of referential errors, compared to other attributes?

In general, users' feedback about our tool was positive. Users stated that QMs helped them to discover unexpected referential integrity problems and to ensure data quality policies and procedures were working properly. Since most tables were normalized, QMs on foreign keys were more interesting. Database level QMs were not particularly useful on the three real databases because relative error was high. Attribute level QMs on foreign keys (completeness) were particularly useful in OLTP systems or isolated databases in which referential integrity was routinely enforced. That is, completeness was more important for databases where records were inserted by transactions. Attribute level QMs on foreign attributes (consistency) were valuable to identify stale records in large fact tables and to detect inconsistent attributes in denormalized tables used for data mining purposes. That is, consistency was more important for a data warehouse where there exists a large fact table with historic information and where there are denormalized tables computing aggregations from the fact table. Database and relation level QMs helped detect unforeseen referential integrity issues. Attribute and value level QMs helped diagnose (explain) referential errors and prepare a repair plan. Users' feedback is summarized in Table 14.

Table 14
Users' feedback

QM level   Usefulness  Application       User
Database   Low         Detection         IT manager
Relation   Medium      Detection         IT manager/DBA
Attribute  High        Diagnosis         DBA/developer
Value      High        Diagnosis/repair  DBA/developer

We now present a more detailed discussion of users' comments. A system manager stated that QMs revealed problems he was not aware of, and another IT manager stated that QMs helped him obtain a clear idea about data quality. A DBA for a data warehouse stated that QMs could enrich metadata to test data loading scripts in order to detect problems while tables are being refreshed. An IT manager said QMs could justify a plan to improve data quality when integrating databases into a central data warehouse. QMs provided varied usefulness depending on each type of user. QMs at the relation and database levels were more useful for IT managers, since they give a high level referential integrity quality assessment. QMs at the attribute and value level were more useful for database application developers, who suggested integrating QMs with the testing of database application code. FK value level QMs (invalid FK frequencies) were interesting to all users (e.g. a FK appearing in many relations) because they provided evidence about problems, but users stated they preferred to run the prototype on their own. An IT manager and a DBA stated that QMs should be collected over a long period of time (e.g. every week), especially for large historic tables containing transaction data, to track data quality and prevent future data quality problems.

4.2. Synthetic databases

We conducted our experiments on a database serverwith one Intel Xeon CPU at 1 GHz with 256 MB ofmain memory and 108 GB on disk. The relationalDBMS we used was Postgres V8.0.1.

Our synthetic databases were generated by the TPC-H DBGEN program, with scaling factors 1 and 2. We inserted referential integrity errors in the referencing fact table (lineitem) with different relative errors (0.1%, 0.2%, ..., 1%, 2%, ..., 10%). The invalid values were inserted following several different probability distribution functions (pdfs), including geometric, uniform, zipf and normal, and in three foreign keys (l_orderkey, l_partkey and l_suppkey).

The results we present in this section, unless stated otherwise, use a default scale factor of 1. The referencing table lineitem and the referenced tables orders, part and supplier have the following sizes: 6 M, 1.5 M, 200 k and 10 k tuples, respectively. Invalid FK values were randomly inserted according to four different pdfs, as shown in Table 15. The minimum number of referential errors was approximately 6000 and the maximum was 600,000. Elapsed times are indicated in seconds.

Table 15
Probability distributions used to insert invalid values

pdf        Function                           Parameters
Uniform    1/h                                h = 6000
Zipf       1/(k^s H_{M,s})                    M = 40,000; s = 1
Geometric  (1 − p)^(n−1) p                    p = 1/2
Normal     (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))   μ = 3000; σ² = 1000

4.2.1. Query optimizations
First, we evaluated the left outer join optimization on foreign keys, summarized in Table 16. Performance degrades significantly for the set containment query as the cardinality of the referencing relation increases. On the other hand, it is noteworthy that time grows more slowly for the left outer join query. To explain why set containment queries are slower than left outer join queries, we obtained the query plans for both types of queries on two DBMSs. We found that the query optimizer in DBMS X first produced a sequential scan for the nested subquery and then a nested loop join to determine the negated set containment. The optimizer from DBMS Y produced a nested loop join for the same query, which was more efficient. On the other hand, the left outer join was generally computed with a merge sort or hash join. These results supported using a left outer join by default.

Table 16
Query optimization: left outer join and set containment

ni       Left outer join  Set containment
10,000   25               60
20,000   25               240
30,000   29               600
50,000   35               1560
100,000  37               6120

We compared the performance between the late and the early foreign key grouping optimization for small groups of invalid foreign key values. If the number of distinct values in the foreign key was similar to the relation cardinality (i.e. large), applying early foreign key grouping was counterproductive. But if the number of distinct values on foreign keys was smaller, then this optimization produced a significant speedup. In Table 17 we present performance for different cardinalities, considering foreign key groups of sizes 4 and 5 (meaning each invalid value appeared in 4 or 5 tuples on average). Times become better for group size 5. Small referenced relations or large referencing relations having few distinct FK values make early foreign key grouping an essential optimization, since it significantly reduces the size of the relation to be joined with all its referenced relations.

Table 17
Query optimization: early and late foreign key grouping

ni         Late FK grouping  Early FK grouping
                             Size 4  Size 5
120,000    54                67      56
2,500,000  91                136     85
5,000,000  165               172     134

Fig. 3 analyzes the impact of referential errors on time performance with two representative pdfs. We can see a() has a marginal impact on performance as it increases. The impact is more important for the zipf distribution. Summarizing, the size of a relation is far more important than the number of invalid values.

Fig. 3. Early FK grouping variant with two pdfs.

4.2.2. Statistics on QMs
Table 18 shows a summary of statistics of QMs with a fixed r() = 1% for the lineitem table and the partId column. Observe that for the geometric pdf one invalid value produces many referential errors. On the other hand, for the uniform pdf all invalid values contribute evenly to referential errors. The geometric pdf has the highest standard deviation, meaning there are values far from the mean that significantly contribute to error.

Table 18
Univariate statistics with different probability distributions

pdf        min  μ     max     σ
Uniform    1    10    22      3
Zipf       1    7     6002    82
Geometric  1    3505  29,873  7825
Normal     1    120   372     121

5. Related work

There is extensive work on maintaining referential integrity. For instance, Refs. [17,19] identify conditions to avoid referential problems during data manipulation. Ref. [14] presents a model for referential integrity maintenance during run-time execution. There is a proposal to check deferred referential integrity using a metadata network [5], where inclusion dependencies are considered and each foreign key value is verified to have a corresponding matching key value. No analysis is done on measuring inconsistency, nor on algorithm performance. Implementation of update and delete referential actions, full and partial types, and reference types are features that have not been fully completed in commercial DBMSs [24].

The SQL language has supported referential integrity by defining foreign keys and referential actions. A survey on SQL query optimization is given in Ref. [7], explaining when to perform a "group-by" operation before a join operation (early or eager), instead of joining tables first and then performing the "group-by" operation (late or lazy). This optimization is applicable when the number of result groups is small and when a different evaluation order does not change result correctness. Our early foreign key grouping is a particular case of this optimization, but we also consider null key values and we generalize it to build a cube of foreign keys. SQL aggregations are extended to return approximately consistent answer sets when there exist invalid FKs [20]; this approach dynamically detects referential errors and improves answer sets in two ways: (1) by distributing aggregated values in a measure attribute from invalid FK values among valid FK values; (2) by exploiting valid FK values in another attribute to make the distribution more accurate.

To our knowledge, data quality metrics have not been proposed with respect to referential integrity. A proposal of simple metrics for data quality in relational databases is given in Ref. [18], where completeness and soundness metrics are introduced. These metrics compare the state of the database to the data from the real world such a database is supposed to represent. Our proposed QMs measure the completeness of references among relations, and to some extent QMs on foreign attributes measure soundness. We do not deal with the problem of measuring whether a valid reference is indeed valid. Concerning this aspect, Ref. [25] introduces an attribute-based model that incorporates data quality indicators and focuses on two dimensions: interpretability and believability. In Ref. [22] the authors introduce an algebra to measure completeness in several data quality dimensions, considering basic relational set operations, null values and the open/closed world assumptions, but ignoring referential integrity.

A model linking data quality assessment to user requirements is proposed in Ref. [6]. In Ref. [21] a classification of objective quality measures is presented. In particular, their simple ratio function measures the ratio of desired outcomes to total outcomes. The authors suggest that consistency could be measured by these means. This is related to our referential integrity error definitions. In Ref. [16] the authors investigate the correct integration of relationship instances integrated from different source databases, focusing on detecting semantic conflicts and reconciling them. This proposal can be used as pre-processing to get QMs, since semantic conflicts must be solved before quantifying referential integrity errors. Referential integrity metrics have been studied from the model design point of view. Ref. [4] introduces two metrics to aid model designers in making better decisions by analyzing referential paths and the number of foreign keys in a schema. The authors do not consider invalid keys or denormalized relations. In contrast, our approach is applicable after the database logical data model has evolved from its original design, or for database integration. The authors in Ref. [23] introduce metrics to evaluate the effectiveness of conceptual models for a data warehouse, which complement logical and physical design tools.

We now explain our approach from a broader perspective. Data quality problems related to referential integrity are described in Ref. [15]. The authors distinguish between operational and diagnostic data quality metrics. In that work, referential violations between two tables are called poor join paths. Our metrics at the database and relation levels are diagnostic, and frequencies of invalid foreign key values are operational, since they provide insight into how to fix referential errors.

Healthcare data warehouses represent a prominent example, where data quality (missing information, inconsistency), data integration from diverse sources (structured and semistructured) and privacy (confidentiality of patient records) make database management difficult [2]; several of these issues are related to referential integrity maintenance among tables coming from different sources.

On a closely related topic, Refs. [1,11] study dataquality issues in data warehouses considering the timedimension for aggregate queries; in our work we simplydetect referential integrity errors on the current state ofthe database, but we do not track when referential errorswere introduced. Therefore, discovering and explainingreferential errors back in time is an interesting issue forfuture work.

Discovering database relationships and repairing inconsistent databases are important related problems. In Ref. [9] the authors propose techniques to automatically identify PK/FK relationships between tables coming from different databases; in our case we assume such relationships are manually specified by the user or come from a logical data model. Therefore, this approach can be applied as a pre-processing phase before computing referential integrity QMs. In Ref. [3] the authors introduce a cost framework to aid in the restoration of a database with integrity constraint (user-defined rather than referential) violations via value modifications. Constraints in an inconsistent database can be satisfied by incorporating constraint checks in equivalent rewritten queries [12]. Both approaches apply heuristics to repair databases, either statically (by updating tables) or dynamically (by rewriting queries). In contrast, our work diagnoses problems and gives detailed information to the user in order to explain and repair referential errors. Therefore, our proposal can serve as a pre-processing step before applying any heuristic to repair invalid values.

6. Conclusions

We proposed a comprehensive set of quality metrics (QMs) for referential integrity, which can be applied in data warehousing, database integration and data quality assurance. Our QMs measure completeness in the case of foreign keys and consistency in the case of foreign attributes in denormalized databases. QMs are hierarchically defined at four granularities: database, relation, attribute and attribute value. Quality metrics are of two basic types: absolute and relative error. Absolute error is useful at fine granularity levels or when there are few referential violations. Relative error is adequate at coarser granularity levels or when the number of referential violations is relatively large. We introduced univariate and bivariate statistics on attribute level QMs to further understand the probability distribution of invalid foreign key values. We explained how to efficiently calculate QMs with SQL queries. Specifically, we presented two query optimizations. The first optimization favors a left outer join over a set containment to use a hash or merge-sort join algorithm instead of a nested loop algorithm. The second optimization performs a group-by operation on foreign keys before a join (pushing aggregation, early group-by) to reduce the size of the referencing relation. This optimization is effective for large relations with many foreign keys, where the number of distinct values per foreign key is small. Experiments evaluated referential integrity QMs with real and synthetic databases on different DBMSs. We obtained interesting results on real databases, and end-user opinion was positive. QMs at the database and relation level were more useful for managers, whereas value and attribute level QMs were more interesting to DBAs and application programmers. We studied quality metrics for attributes following four different probability distributions. Attribute values with a high relative error in skewed distributions can be used to fix critical referential problems. Univariate statistics and correlations can help understand the probability distribution and co-occurrence of referential errors. On the time performance side, a left outer join evaluation was generally more efficient than a set containment, due to fast join evaluation algorithms in different DBMSs. In addition, early foreign key grouping was more efficient than late foreign key grouping when the number of distinct foreign key values was small relative to the relation cardinality.

Our work can be extended to repair a database considering the frequency of matching values for foreign keys and foreign attributes. Our statistics on attribute level metrics can be extended to apply multidimensional data mining techniques, such as clustering and factor analysis. We want to study how to efficiently repair a denormalized database to leave it in a strict state, based on a plan derived from quality metrics. Incremental computation is needed to keep quality metrics up to date in ever-growing data warehouses. Finally, we believe our ideas may be applicable to semistructured data, such as text or XML.

References

[1] D.J. Berndt, J.W. Fisher, Understanding dimension volatility in data warehouses, INFORMS CIST Conference, 2001.
[2] D.J. Berndt, J.W. Fisher, J. Studnicki, A.R. Hevner, Healthcare data warehousing and quality assurance, IEEE Computer 34 (12) (2001) 56–65.
[3] P. Bohannon, W. Fan, M. Flaster, R. Rastogi, A cost-based model and effective heuristic for repairing constraints by value modification, ACM SIGMOD Conference, 2005, pp. 143–154.
[4] C. Calero, M. Piattini, M. Genero, Empirical validation of referential integrity metrics, Information & Software Technology 43 (15) (2001) 949–957.
[5] S.J. Cammarata, P. Ramachandra, D. Shane, Extending a relational database with deferred referential integrity checking and intelligent joins, ACM SIGMOD Conference, 1989, pp. 88–97.
[6] C. Cappiello, C. Francalanci, B. Pernici, Data quality assessment from the user's perspective, ACM IQIS, 2004, pp. 68–73.
[7] S. Chaudhuri, An overview of query optimization in relational systems, ACM PODS Conference, 1998, pp. 34–43.
[8] E.F. Codd, Extending the database relational model to capture more meaning, ACM TODS 4 (4) (1979) 397–434.
[9] T. Dasu, T. Johnson, S. Muthukrishnan, V. Shkapenyuk, Mining database structure; or, how to build a data quality browser, ACM SIGMOD Conference, 2002, pp. 240–251.
[10] R. Elmasri, S.B. Navathe, Fundamentals of Database Systems, 3rd edition, Addison-Wesley, Redwood City, California, 2000.
[11] J.W. Fisher, D.J. Berndt, Understanding dimension volatility in data warehouses, INFORMS Conference on Information Systems and Technology (CIST), 2001.
[12] A. Fuxman, E. Fazli, R.J. Miller, ConQuer: efficient management of inconsistent databases, ACM SIGMOD Conference, 2005, pp. 155–166.
[13] J. Gray, A. Bosworth, A. Layman, H. Pirahesh, Data cube: a relational aggregation operator generalizing group-by, cross-tab and sub-total, ICDE Conference, 1996, pp. 152–159.
[14] B.M. Horowitz, A run-time execution model for referential integrity maintenance, IEEE ICDE Conference, 1992, pp. 548–556.
[15] T. Johnson, A. Marathe, T. Dasu, Database exploration and Bellman, IEEE Data Engineering Bulletin 26 (3) (2003) 34–39.


[16] E.P. Lim, R.H. Chiang, The integration of relationship instances from heterogeneous databases, Decision Support Systems 29 (2) (2000) 153–167.
[17] V.M. Markowitz, Safe referential structures in relational databases, VLDB Conference, 1991, pp. 123–132.
[18] A. Motro, I. Rakov, Estimating the quality of data in relational databases, IQ Conference, 1996, pp. 94–110.
[19] D. Mukhopadhyay, G. Thomas, Practical approaches to maintaining referential integrity in multidatabase systems, RIDE-IMS, July 1993, pp. 42–49.
[20] C. Ordonez, J. García-García, Consistent aggregations in databases with referential integrity errors, ACM IQIS Workshop, 2006.
[21] L. Pipino, Y.W. Lee, R.Y. Wang, Data quality assessment, Communications of the ACM 45 (4) (2002) 211–218.
[22] M. Scannapieco, C. Batini, Completeness in the relational model: a comprehensive framework, IQ Conference, 2004.
[23] M. Serrano, C. Calero, J. Trujillo, S. Luján-Mora, M. Piattini, Empirical validation of metrics for conceptual models of data warehouses, CAiSE, 2004, pp. 506–520.
[24] C. Türker, M. Gertz, Semantic integrity support in SQL:1999 and commercial (object-)relational database management systems, VLDB Journal 10 (4) (2001) 241–269.
[25] R.Y. Wang, M.P. Reddy, H.B. Kon, Toward quality data: an attribute-based approach, Decision Support Systems 13 (3-4) (1995) 349–372.

Carlos Ordonez received his BS degree in Applied Mathematics and his MS degree in Computer Science, both from the UNAM University, Mexico, in 1992 and 1996 respectively. He received a PhD degree in Computer Science from the Georgia Institute of Technology, USA, in 2000. Dr. Ordonez is currently an Assistant Professor at the University of Houston. He has published more than 20 research articles on database systems and data mining.

Javier García-García received a degree in applied mathematics (actuarial sciences) from the UNAM University, Mexico, in 1982 and an MS degree in management of information technology from the National Polytechnic Institute (IPN), Mexico, in 1987. He worked managing IT departments in several public and private institutions for more than 20 years. Currently, he is a PhD candidate at the UNAM University, conducting research on relational database systems.