UTILITY COST PERSPECTIVES IN DATA QUALITY MANAGEMENT

ADIR EVEN, Ben-Gurion University of the Negev, Beer-Sheva, 84105, Israel
G. SHANKARANARAYANAN, Babson College, Babson Park, MA 02457

Journal of Computer Information Systems, Winter 2009
Received: February 25, 2009. Revised: May 7, 2009. Accepted: May 27, 2009.

ABSTRACT

The growing costs of managing data demand a closer examination of associated cost-benefit tradeoffs. As a step towards developing an economic perspective of data management, specifically data quality management, this study describes a value-driven model of data products and the processes that produce them. The contribution to benefit (utility) is associated with the use of data products, and costs are attributed to the different data processing stages. Utility/cost tradeoffs are thus linked to design and administrative decisions at the different processing stages. By modeling and quantifying the economic impact of these decisions, this study shows how economically superior data quality management policies may be developed. To illustrate this, the study uses the model to develop a data quality management policy for online error correction. The results indicate that decisions that consider economic tradeoffs can be very different from decisions that are driven by technical and functional requirements only.

KEYWORDS: Data Quality Management, Data Warehouse, Information Value, Metadata

INTRODUCTION

Data is an organizational resource that enables efficient business processes, supports decision making, and generates revenue as a commodity. This recognition, along with superior data collection and management technologies, has significantly increased the volumes of data managed by organizations [27]. The resources for data management activities (such as acquisition, processing, storage, and delivery), as well as investments in related technologies, are also increasing with the increasing data volumes [26]. These draw attention to the economic aspects of data management — to what extent do data resources contribute to business value and profitability? Does the contribution justify the associated implementation and maintenance costs?

Addressing these questions requires an in-depth examination of the business benefits attributed to data resources, the costs of managing them, and the implications of the associated cost/benefit tradeoffs for data management. We argue that current data management practices are driven primarily by functionality and technical efficiency, and rarely by economic considerations. In this paper we advance the notion of examining data management through an economic lens. We specifically focus on data quality improvement, an important data management activity. We propose a model for managing a data manufacturing process and its outputs. Our model examines the business benefit attributed to data resources, conceptualized as utility, and the costs involved in managing them. We suggest that design and administrative decisions introduce significant utility/cost tradeoffs in data manufacturing and its associated processes. We show that understanding and modeling these tradeoffs can help identify optimal policies for improving the quality of data resources.

We posit that concepts of information value and economics can help improve data management in general, and data quality management in particular, from a business viewpoint.
While management efforts focus on technical and functional aspects, there is a need to measure performance within a business context and evaluate the associated economics accordingly. Maximizing economic performance in data management implies increasing the utility gained from the data and/or reducing the costs of managing it. In this study, we use net-benefit, the difference between utility and cost, as the indicator of economic performance and attempt to maximize net-benefit. While the costs of data management are reasonably well understood, the value gained is largely unknown and difficult to quantify [12]. Data management activities are deeply embedded within information systems, and the value gained by a specific activity is an inextricable part of the overall utility gained from the system. Further, the decision to invest in data management solutions is not always a "yes/no" decision. It may involve the gamut of options in the continuum between the two extremes (e.g., implement a subset instead of the entire data resource). Each option may be associated with a different cost and may affect the utility contributed by the data resource differently. While investments in a data resource may increase the utility gained from that resource, there could be a point beyond which additional investments do not significantly increase the utility gained.

The objective of this paper is to highlight the economic factors that impact data management decisions. We focus on a specific data management activity, data quality management. As a foundation for conceptualizing the economic effects, we adopt the Total Data Quality Management (TDQM) view of data management environments as manufacturing processes, and of their outputs as data products [25]. We associate the utility with the use of these products in different business contexts and attribute the costs to investments in acquiring data resources and in the technologies and processes used to manage them. We link utility and cost to design and maintenance decisions in a data management environment. Unifying these pieces into a single high-level framework allows us to assess design alternatives better and identify ones that maximize economic performance. The optimal error correction policy that we develop in this paper demonstrates an application of the high-level framework in the context of managing and maintaining data quality.

Our model suggests that considering economic tradeoffs, in addition to technical and functional aspects, may yield significantly different quality management policies. We further show that in certain scenarios, it may make sense to allow some data imperfections to remain if the incremental utility gained by correcting these errors does not justify the costs.

This paper makes several interesting contributions. First, it proposes an innovative model for modeling processes that manufacture data products. The model captures the design characteristics at each manufacturing stage as a metadata vector. Consequently, it enables a better understanding of design decisions that affect economic tradeoffs at each stage and possible dependencies between stages. Second, by modeling design characteristics in this manner, this paper positions the design characteristics and related design and maintenance decisions in an economic context. This permits linking costs and utility and helps understand relevant tradeoffs when designing the production of a data product. Third, the paper instantiates the model and, using this instantiation, develops an optimal policy for online error correction in a data manufacturing system.

In the remainder of the paper, we first review relevant data management literature. We then develop the model of the data manufacturing process that helps evaluate decision alternatives by considering their impact on utility, cost, and the accumulating effect of random hazards. We further demonstrate the application of the model for developing a utility-driven policy for optimal online error correction. We finally conclude by highlighting the key contributions of this study and suggesting directions for future research.

BACKGROUND

Data management is a complex task that requires significant resources and managerial efforts. A data management environment is a collection of processes and systems that follow a multi-stage architecture [15]. In this study we view such an environment as a manufacturing process that creates data products. Data products can be used internally, sold to other firms, or embedded within product offerings. These can be, for example, reports that are used in some decision task, or datasets that are exported to other systems or used for further analyses and data mining. The process view of the data management environment underlies Total Data Quality Management (TDQM), a paradigm that addresses data quality policies and quality improvement efforts [25]. This view has been adopted by several quality improvement methodologies (e.g., [6], [17], [22]). This paper adopts the manufacturing process view of data management environments as the foundation to model and understand the business value of data quality management activities.

The typical data process has a complex structure that includes multiple inputs and multiple outputs [3]. Literature has suggested techniques for modeling the process by mapping it onto a directed network of processing stages ([6], [22]). In this paper, we focus on a set of sequential processing stages that create a set of data products. For example, in a data warehouse (DW), a large volume of data is processed using multiple different stages ([15], [16]). Data is typically gathered from multiple sources, including sources external to the organization. This data is then cleansed, aggregated, and transformed into the design (schema) of the data warehouse repository. The transformed data is stored in a staging area from which it is loaded into the data warehouse.

From the warehouse, datasets are extracted and processed for reporting, extracting business intelligence, data mining, and other applications. The warehouse, hence, has multiple data processing stages, and we associate one or more outputs with each processing stage — intermediate outputs as well as the final data products that are delivered to consumers. Similar sequential processes exist in other data management environments as well. For instance, sales data in a firm may be cleansed and stored in a sales data repository. Inventory data from downstream partners may be collected and formatted based on the firm's data requirements. This inventory data may then be combined with the sales data to create a demand forecast by another processing stage. The generated demand forecast (stored for use elsewhere), combined with the firm's own inventory levels, may then be used to generate the capacity or requirements planning report. Besides this, sales reports, inventory reports, and periodic forecasts may also be generated using the same set of sequential processes. The model developed in this paper is applicable to any data environment that is composed of sequential data manufacturing processes — be it the operational environment described above or the backend of the DW described earlier.

The quality of data resources is identified as a key factor for the success of firms [19]. Due to the complexity and the high volumes of data processed, data management environments such as data warehouses are particularly vulnerable to data quality defects ([15], [21]). Studies have developed optimal inspection and correction policies which target a maximal data quality level ([10], [24]). Parssian et al. examine the propagation of errors originating from different data sources through the DW processing stages, and their effect on the output [17]. Cui et al. develop a lineage-tracking mechanism for detecting the source of errors identified at final DW stages [11]. However, a vast majority of the literature does not address the value gained by improving the quality of data, and the few studies that do discuss it only qualitatively (e.g., [19], [25]). Our research offers a quantitative assessment of the value gained. Ballou et al. offer a quantitative approach for understanding the effect of quality management decisions on economic performance [6]. This approach, along with others that optimize utility within quality tradeoffs ([4], [5]), influences the framework proposed here.

Metadata is important for managing the complexity of data environments and for managing data quality in these environments [15]. In this study, metadata is abstracted data about the data managed and the systems that create and manage the data [21]. Metadata captures an abstraction of the design and administration choices related to different components of the data manufacturing environment such as infrastructure, model, process, contents, representation, and administration ([20], [21]). Metadata characteristics can be broadly classified as design characteristics and maintenance characteristics. This categorization reflects the different options for optimization at the various stages of implementing the data environment. Design characteristics reflect the long-term decisions that are typically examined in the early stages (e.g., decisions regarding system design, infrastructure, and architecture). As suggested by Baldwin and Clark, the set of design parameters, together with associated constraints and interdependencies, defines a space of all possible designs [1]. The chosen design, the configuration of design characteristics, is a point in this design space. Maintenance characteristics reflect short-term, ongoing decisions that are made when the system is operational (e.g., regarding maintenance, performance, and troubleshooting).

Such decisions may alter and enhance the preliminary design. The process model that we develop conceptualizes the set of variables related to design decisions as metadata. We view the metadata as representing the set of input characteristics that describe the data manufacturing processes. We posit that these characteristics affect the output of the processes, the data product. A value-driven evaluation of the different metadata configurations (design alternatives) can help optimize decisions related to the design of data resources (e.g., the design of a tabular dataset in [14]) and decisions on process maintenance.

A VALUE-DRIVEN PROCESS MODEL FOR DATA MANUFACTURE

Systems that manage organizational data and create data products are often conceptualized as multi-stage manufacturing processes (e.g., [6], [17], [22]). Based on this conceptualization, the model developed in this study links the processing stages and the final output, the data product, to utility/cost tradeoffs. Our formulation is influenced by dynamic-programming techniques [9]. It represents the data processes as a directed network of multiple processing stages and links the related design and maintenance decisions to economic performance. The basis for such decisions, the input to the model, is a set of design characteristics. The objective of the model is to identify an optimal configuration of design characteristics that maximizes the net-benefit. We define net-benefit as the difference between utility and cost.

This modeling approach provides a powerful tool for managing data processes. It allows analyzing data flows, detecting possible sources of quality errors, quantifying their accumulated effect on the net-benefit gained by using the data product, and evaluating decision alternatives for maximizing the overall benefit. The analytical development of such a model is complex. So, as a first step, this study focuses on sequential processes with the final stage representing the data product output. Analyzing sequential (single input/output per stage) cases first is a common analytical approach in developing data quality management models (e.g., [2], [24]).

The proposed model for sequential processes is shown in Figure 1. It incorporates the processing stages, the characteristics of the dataset at each processing stage represented as a metadata vector, the transformations that the dataset undergoes at each stage represented by changes to the metadata vector, and the random quality defects that affect quality at one or more of these processing stages. These constructs, the associated costs, the utility attributed to the final output, and the net-benefit gained are described in more detail in the following paragraphs.

Processing Stages (S_n, n = 0..N): The model has N+1 processing stages, indexed by [n]. S_0 represents the initial data acquisition from a source and S_N, the final stage, represents the data product, the final output that may be used by multiple data consumers. Stages S_1 through S_{N-1} represent the intermediary processing and storage stages. The dataset flows through these stages, undergoing some processing at each stage.

Metadata Vector (X_n, n = 0..N): The metadata vector is a collection of characteristics, such as the time of last update, number of records, and quality level, that describes the dataset at stage S_n. Each characteristic is represented by one element of the metadata vector. Data quality may be measured along several dimensions such as accuracy, completeness, and timeliness [23]. Each quality measurement may be considered as an element of the metadata vector.

Stage Transformations (R_n, {r_n}, n = 0..N): The transformation associated with stage S_n represents the effect of the processing performed on the dataset at that stage. It is reflected by the changes between the metadata vector associated with the input dataset entering stage S_n and the metadata vector associated with the dataset that leaves stage S_n: X_n = R_n(X_{n-1}*). Stage transformations are typically stochastic, but the model assumes that, when needed, an equivalent deterministic approximation can be obtained. The stage transformation R_n can be chosen from among a set {r_n} of different feasible alternatives.

Random Quality Defects (W_n, n = 0..N): Quality defects may occur in a dataset at any of its processing stages. For notation purposes, the model assumes that quality defects at stage S_n occur (or are identified) after the data has been processed at that stage and before it is transferred to the next stage (S_{n+1}). Quality defects are represented as a random deviation vector W_n, similar in its dimensionality to the metadata vector X_n and with a known probability distribution F_n(W). The metadata X_n* of the actual dataset that is transferred from one stage to the following stage is the difference between the output metadata X_n and the deviation vector W_n. As our model is a first step in examining the utility/cost perspectives, we have adopted the view that the elements of W_n are independent, to permit parsimonious model development. Accordingly, the transformation R_{n+1} associated with the next stage can be defined as:

X_n* = X_n - W_n, and X_{n+1} = R_{n+1}(X_n*) = R_{n+1}(X_n - W_n), where    Eq. (1)

X_n – The metadata vector of a dataset after processing at stage S_n
W_n – The metadata deviation due to random quality defects at stage S_n
X_n* – The metadata forwarded to stage S_{n+1}, considering the effect of quality defects
R_{n+1} – The transformation associated with the following stage S_{n+1}

FIGURE 1: Sequential data processing model
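To make the sequential formulation concrete, the following minimal sketch (in Python) propagates a two-element metadata vector through one stage according to Eq. (1). The vector contents, the deviation values, and the linear form of the stage transformation are illustrative assumptions, not part of the model itself.

```python
import numpy as np

# A minimal sketch of Eq. (1), assuming a two-element metadata vector
# [number of records, completeness] and a simple deterministic stage
# transformation. All values are hypothetical.

def stage_transformation(x):
    # R_{n+1}: a deterministic approximation of the processing at the next
    # stage; here it keeps 99% of the records and preserves completeness.
    return np.array([0.99 * x[0], x[1]])

x_n = np.array([10_000.0, 0.98])  # X_n: metadata after processing at stage S_n
w_n = np.array([120.0, 0.03])     # W_n: random deviation due to quality defects

x_n_star = x_n - w_n                     # X_n* = X_n - W_n
x_next = stage_transformation(x_n_star)  # X_{n+1} = R_{n+1}(X_n*)
print(x_next)                            # metadata entering stage S_{n+1}
```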

The chosen transformations associated with each stage have an impact on the quality defects that affect the dataset as it passes through that stage. Further, the quality defects are also impacted by the current state of the dataset, which is represented by its associated metadata vector. Hence, the probability distribution function of quality defects at stage S_n, F_n(W), is affected by X_n and R_n. We have assumed that the error at each stage is dependent on and is affected by the metadata vector element (i.e., the state of the dataset) corresponding to that error in the previous stage and the processing/transformation performed at that stage. Hence, we are not treating the errors at one stage as being completely independent of the errors in the previous stage. For instance, in sequential data processes, such as the one in the back-end of a data warehouse, one of the common errors is "drops" — loss of records. Based on our observations, we recognize that the number of records lost in one stage is independent of the number of records lost in the previous stage. But it is dependent on the number of records in the dataset in that stage. While it has been shown that an error is compounded — the magnitude of error in a stage is larger than the magnitude of error in the previous stage — the exact dependency between the error magnitudes is not explicitly defined in research. We believe our model here is reasonable because the metadata element (e.g., number of records) is determined by the error in the previous stage, and the error in this stage is dependent on the metadata element.

The metadata vector X_n describes the dataset characteristics along many different dimensions. For instance, one element of the vector may be the "number of records" in the dataset. Other elements may include quality characteristics such as the "number of missing data elements", representing a measure of the completeness of the dataset, and the overall accuracy of the dataset. The metadata vector is hence a "snapshot" of the dataset characteristics. The elements of the random quality defect vector represent the delta change in a characteristic of the dataset due to some quality defect. For instance, one element might be "10" — indicating the delta change in the number of records in the dataset after some processing. Note that the dimensionality of W_n is identical to the dimensionality of X_n. Each element of W_n hence represents the change to the corresponding element of X_n, representing some characteristic of the dataset.

In constructing this model we assume that the errors created in the dataset are detectable using the metadata vectors of the input and the output datasets. Processing errors are of several different types. In this paper we focus on the effect of a processing error in terms of the quality of the output dataset. For instance, a processing error can reduce the accuracy of the dataset from 90% to 60%. The metadata vector would capture this reduction in accuracy level as a characteristic. Similarly, processing errors can result in invalid or missing values in the output dataset, causing a drop in the validity or completeness of the output dataset. They can also cause record loss, the error that we focus on to instantiate the error-correction policy described later.

Cost (C_n, n = 0..N): The data manufacturing process may be associated with different types of costs. Costs can be attributed to, for example, data acquisition, investments in hardware, software development and licensing, ongoing maintenance, monitoring, and system administration efforts. Some costs are fixed while others vary, depending on the uncertain and dynamic behavior of the data environment. In this study, we focus primarily on the variable cost. The cost C_n is expressed in monetary units and linked to the chosen and applied transformation R_n at stage S_n. The cost is affected by the input data (X_{n-1}, W_{n-1}), as processing a large dataset of poor quality is likely to be more costly than processing a small and clean dataset. Similarly, the cost is affected by the targeted output (X_n, W_n), as producing a large and error-free dataset is likely to be more expensive. Finally, the cost will be affected by the chosen transformation R_n — processing is likely to be more expensive, for example, if manual intervention is often required. By mapping these factors, the cost at S_n is C_n = c_n(R_n, X_{n-1}, W_{n-1}, X_n, W_n), where c_n is the mapping function that maps the costs associated with stage S_n.

Utility (U_i, i = 1..I): The final data product, the output of stage S_N, is described by a metadata vector X_N* = X_N - W_N. This data product can be used by multiple data consumers in different business contexts. We view each context as a possible usage of the data product. We assume I possible usages indexed [i], each associated with a utility U_i. The utility reflects the business value contribution of using the data (data product), measured in monetary units. The utility is affected by the metadata characteristics of the final output: U_i = u_i(X_N*), where u_i is a utility function associated with usage [i].

Net Benefit (B): The net-benefit gained is defined as the difference between the overall utility gained by the use of the data product and the overall cost associated with creating and managing it. Considering the stochastic behavior, the net-benefit and its expected value over time can be defined as:

B = \sum_{i=1}^{I} U_i - \sum_{n=0}^{N} C_n, and E[B] = \sum_{i=1}^{I} E[U_i] - \sum_{n=0}^{N} E[C_n]    Eq. (2)
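As an illustration of the cost mapping c_n, the usage utilities u_i, and Eq. (2), the sketch below computes the net-benefit for a hypothetical two-stage process with two usage contexts. The functional forms and all monetary coefficients are assumptions made for the example, not values from the paper.

```python
# A sketch of the cost mapping c_n, the usage utilities u_i, and Eq. (2)
# for a hypothetical two-stage process with two usage contexts.

def stage_cost(records_in, defects_in, records_out, manual_fraction):
    # c_n: larger or dirtier inputs, larger targeted outputs, and more
    # manual intervention all make a stage more expensive to run.
    return (0.002 * records_in + 1.5 * defects_in
            + 0.001 * records_out + 500.0 * manual_fraction)

def usage_utility(records, completeness):
    # u_i: utility of one usage context, driven by the final metadata X_N*.
    return 0.05 * records * completeness

costs = [stage_cost(10_000, 120, 9_900, 0.1),   # C_0
         stage_cost(9_900, 80, 9_850, 0.0)]     # C_1
utilities = [usage_utility(9_850, 0.97),        # U_1
             usage_utility(9_850, 0.90)]        # U_2

net_benefit = sum(utilities) - sum(costs)       # B, per Eq. (2)
print(round(net_benefit, 2))
```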

Data exiting a specific stage has certain characteristics (such as quality level, number of missing values, and number of missing records). Improvement processes targeting this data will change these characteristics, and the extent of change is dependent on the type of transformation (applied by the quality improvement effort/process). Costs associated with each transformation may be different, and the utility of the finished product can vary with the transformation. The purpose of this model is to select the optimal transformation R_n at each stage from among the corresponding feasible set of transformations {r_n}, such that the overall net-benefit B (or the expected net-benefit E[B]) is maximized.

The utility of the data is determined by the set of decisions it is used for. It is to be noted that the model developed here addresses scenarios in which the usages (business contexts) of the data product are known in advance and the data product fulfils the requirements of these usages. In cases where the usage is unknown, real-option models are needed to model the utility. We have not addressed real-option models in this paper.

It is also important to note that the model assumes that utility can be estimated and allocated to each processing stage within the set of processing stages. In general, we have assumed that the utility of data would increase with increasing data quality levels — a reasonable assumption in most data management scenarios. Having described our general model and the model parameters, we now show how this model may be instantiated for a specific case, to develop an optimal error correction policy.

The key assumptions of the model are:

1. Sequential model — the process consists of multiple stages. There is only one input and one output at each stage.

2. The random quality defects vector assumes that each element of the vector is independent of the other elements within the vector. The quality defect element at any stage is dependent on the corresponding metadata element (describing the state of the dataset) and upon the processing performed. The defects follow a known probability distribution.

3. The cost is affected by the characteristics of the input data and by the targeted characteristics of the output.

4. Utility is affected by the characteristics of the output dataset. Utility can be determined for the dataset and allocated to each processing stage in the model. Utility is determined based on known usages and decisions. If the usages and decisions associated with the dataset are not known, we need real-options modeling, which is not addressed in this paper.
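Under these assumptions, the expected net-benefit E[B] of a candidate design can be estimated by simulation when closed-form expressions are unavailable. A minimal sketch follows for a three-stage process with binomial record loss; the loss ratios, per-stage costs, and utility coefficient are all hypothetical assumptions.

```python
import numpy as np

# A Monte Carlo sketch of E[B] for a three-stage sequential process under
# assumptions 1-4: binomial record loss per stage, fixed per-stage costs,
# and utility linear in the surviving records. All parameters are hypothetical.

rng = np.random.default_rng(7)

def simulate_once(x0, loss_ratios, stage_costs, utility_per_record):
    records = x0
    for q in loss_ratios:
        records -= rng.binomial(records, q)   # random record loss ("drops")
    return utility_per_record * records - sum(stage_costs)

samples = [simulate_once(10_000, [0.02, 0.05, 0.01], [50, 80, 40], 0.10)
           for _ in range(5_000)]
print(np.mean(samples))   # Monte Carlo estimate of the expected net-benefit E[B]
```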

AN OPTIMAL ONLINE ERROR CORRECTION POLICY

Error detection and correction is an archetypal data quality management approach (the others being process control and process design — [19], [25]). It addresses short-term "cures" for immediate quality improvement needs rather than long-term solutions that target root causes [19]. Error detection compares data to a baseline that is perceived as being correct (e.g., the "real world", business rules, or another dataset). Error correction methods may include such alternatives as editing the data manually, fixing it using automated processes, classifying some parts of the dataset as unfit and preventing them from being used, retransmitting (or recapturing) the dataset (or subsets of it), or even possibly leaving the data "as is" when the quality improvement necessary is far too expensive to justify the effort. This study suggests that approaching very high quality might be sub-optimal when viewed from an economic perspective. Inspection and correction alternatives introduce tradeoffs between the implementation issues and the level of quality obtained, i.e., between cost and utility. Given these tradeoffs, the goal of high quality might come at a cost that cannot be justified by the benefit gained.

We use our process model to develop an optimal error correction policy that illustrates the tradeoffs between utility and cost and helps understand the importance of economic evaluation. To develop the policy we adopt dynamic programming (DP) and make the following assumptions:

• Transformations {R_n} are applied online, after the damage caused by the random quality defect has been assessed. This represents the "fixes" applied after knowing what the errors in the data are, subsequent to the data being processed.

• Assessing damage at stage S_n does not require knowledge of the behavior at processing stages prior to S_{n-1}. We are interested in fixing the damage, not in what caused it, for the purposes of this model.

• The stochastic behavior at stage S_n does not depend on the stochastic behavior at other stages.

• Costs and utilities are orthogonal (i.e., independent) and hence modeled as sum-additive.

With these assumptions, the DP algorithm for the optimal choice of stage transformations is:

• At S_n we define the "forward profitability" function J_n = \max_{R_n \in \{r_n\}} E[J_{n+1} - C_n]

• The boundary condition for forward profitability at the final stage S_N is J_{N+1} = \sum_{i=1}^{I} U_i

• The optimal policy for choosing a stage transformation is: R_n = \arg\max_{R_n \in \{r_n\}} E[J_{n+1} - C_n]
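A single step of this algorithm amounts to picking, from the feasible set {r_n}, the transformation with the highest expected forward profitability net of cost. The sketch below illustrates the choice for three hypothetical correction alternatives at one stage; the alternatives and their expected values are assumptions for the example.

```python
# A sketch of one step of the DP algorithm: choose, from the feasible set
# {r_n}, the transformation maximizing E[J_{n+1} - C_n].

def choose_transformation(candidates):
    # candidates: (name, E[J_{n+1}] under this choice, E[C_n]) triples
    return max(candidates, key=lambda r: r[1] - r[2])[0]

candidates = [
    ("no_action",   900.0,   0.0),   # leave the defects as they are
    ("auto_fix",  1_000.0,  60.0),   # automated correction
    ("manual_fix", 1_020.0, 200.0),  # manual correction
]
print(choose_transformation(candidates))  # "auto_fix": 940 beats 900 and 820
```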

It has been shown that under the given assumptions, such a policy maximizes expected profitability over time [9]. However, this base algorithm is very high-level and insufficient for our needs. The model constructs must be further analyzed and quantified to model the tradeoffs and understand the maximization criteria. In the error correction scenario that we analyze further, the primary metadata component affecting utility and cost is the number of records in the dataset.

We assume a tabular dataset in which all records have an identical field structure. The number of records is proportional to the size of the dataset. If the number of records in the dataset were to be reduced due to errors, say, during data transfer, the utility of the dataset may be reduced to some extent. (Reduction in the number of records, or "drops", is a classic backend issue in data warehouses.) In the absence of errors in the manufacturing processes, the number of records in the dataset at the start of the staged model should remain unchanged throughout the set of processing stages. However, the quality of the records in the dataset might be damaged due to transfer and processing failures. Each stage typically has a built-in capability to recover the affected records (e.g., by reprocessing or manually correcting the damaged data subset). The cost of recovering the damaged records increases with the number of records; the recovery cost is linearly proportional to the number of records that need reprocessing. By reprocessing and fixing records, the value lost due to damaged records may be recovered. The goal is to develop an online error correction policy that helps decide whether or not to recover the loss in value at each stage, such that the overall expected net-benefit is maximized. This can be formulated as follows:

• The process has N+1 stages S_n, n = 0..N

• The expected net-benefit, as suggested by the general model (Eq. 2), is the difference between the overall expected utility and the overall expected cost

• Utility and cost are affected by the number of records. Accordingly, the metadata observed at S_n is modeled as a scalar X_n ≥ 0, and the actual number of records at the stage is denoted x_n

• W_n is a random variable that represents the record loss at S_n; w_n is the actual record loss at S_n, 0 < w_n < x_n

• F_n^W(w, x) = P(W_n < w | X_n = x) is the conditional record-loss distribution function at S_n. We assume that F_n^W(w, x) can be obtained once x_n is known

• Once x_n is known, the loss ratio at S_n is defined as q_n(x_n) = E[W_n | X_n = x_n] / x_n, where q_n(x_n) is within [0, 1]. We assume that the expected loss at S_n is approximately linear with the number of records (i.e., E[W_n | X_n = x_n] ≈ q_n x_n), and that q_n can be approximated in advance

• R_n, the transformation at S_n, is coded either as 1, when the value loss is recovered, or as 0, when no action is taken. Hence, the transformation can be formulated as X_n = X_{n-1} - (1 - R_n) W_{n-1}

• C_n, the cost at S_n, is zero if no action is taken (R_n = 0), or increases with the record loss if corrective measures are taken (R_n = 1). This model assumes a correction cost that is linearly proportional to the record loss, i.e., C_n = c_n R_n W_{n-1}

• The final number of records is denoted X_N*. We assume no record loss at S_N, hence X_N* = X_N. The utility of usage [i] is assumed to be linear with the number of records: U_i = a_i X_N

• The forward profitability at S_N (the boundary condition) is defined as J_{N+1} = \sum_{i=1}^{I} U_i = \sum_{i=1}^{I} a_i X_N = a X_N, where a = \sum_{i=1}^{I} a_i

Following this formulation, the suggested DP algorithm can be described as:

1. Obtain the forward profitability function J_n, defined as J_n = \max_{R_n} E[J_{n+1} - C_n]

2. The optimal policy for choosing the stage transformation is R_n* = \arg\max_{R_n} E[J_n] = \arg\max_{R_n} E[J_{n+1} - C_n]

3. The boundary forward profitability function at the final stage is J_{N+1} = a X_N

Proposition: the process is optimal (i.e., yields the maximum expected net-benefit) when the following backward-recursion decision rules are applied at stage S_{N-i}, i = 0..N:

1. Obtain the marginal value a_{N-i} = a_{N-i+1} - q_{N-i} Min{c_{N-i+1}, a_{N-i+1}}

2. Recover the value loss (i.e., R_{N-i} = 1) if c_{N-i} < a_{N-i}, or take no action (i.e., R_{N-i} = 0) otherwise

3. The forward profitability function is given by J_{N-i} = a_{N-i} X_{N-i-1} - Min{a_{N-i}, c_{N-i}} W_{N-i-1}. For the boundary condition, consider a_{N+1} = a, c_{N+1} = 0, and q_N = 0

Proof by induction: First observe that for i = 0:

X_N = X_{N-1} - (1 - R_N) W_{N-1}, and

J_N = \max_{R_N \in \{0,1\}} { E[J_{N+1} - C_N] }
    = \max_{R_N \in \{0,1\}} { a E[X_N] - c_N R_N E[W_{N-1}] }
    = \max_{R_N \in \{0,1\}} { a E[X_{N-1}] - a E[W_{N-1}] + a R_N E[W_{N-1}] - c_N R_N E[W_{N-1}] }

The transformation R_N is chosen after X_{N-1} and W_{N-1} are known; hence the expected values can be replaced with deterministic expressions: E[X_{N-1}] = X_{N-1} and E[W_{N-1}] = W_{N-1}. Therefore,

J_N = \max_{R_N \in \{0,1\}} { a X_{N-1} - a W_{N-1} + a R_N W_{N-1} - c_N R_N W_{N-1} }

For R_N = 0: J_N = a X_{N-1} - a W_{N-1}
For R_N = 1: J_N = a X_{N-1} - c_N W_{N-1}

To maximize, if a > c_N the record loss is recovered (R_N = 1); otherwise no action is taken (R_N = 0). Therefore J_N = a X_{N-1} - Min{a, c_N} W_{N-1}, and the formulation is confirmed for i = 0. Observing that a_{N+1} = a, c_{N+1} = 0, and q_N = 0:

a_N = a_{N+1} - q_N Min{a_{N+1}, c_{N+1}} = a, and J_N = a_N X_{N-1} - Min{a_N, c_N} W_{N-1}

Now we show that if the formulation holds for [i-1], it holds for [i]. At stage S_{N-i}:

X_{N-i} = X_{N-i-1} - (1 - R_{N-i}) W_{N-i-1}, and

J_{N-i} = \max_{R_{N-i} \in \{0,1\}} { E[J_{N-i+1} - C_{N-i}] }
    = \max_{R_{N-i} \in \{0,1\}} { a_{N-i+1} E[X_{N-i}] - Min{a_{N-i+1}, c_{N-i+1}} E[W_{N-i}] - c_{N-i} R_{N-i} E[W_{N-i-1}] }
    = \max_{R_{N-i} \in \{0,1\}} { a_{N-i+1} E[X_{N-i-1} - (1 - R_{N-i}) W_{N-i-1}] - Min{a_{N-i+1}, c_{N-i+1}} E[W_{N-i}] - c_{N-i} R_{N-i} E[W_{N-i-1}] }

The transformation R_{N-i} is chosen after X_{N-i-1} and W_{N-i-1} are known; hence, the expected values can be replaced with deterministic expressions: E[X_{N-i-1}] = X_{N-i-1} and E[W_{N-i-1}] = W_{N-i-1}. Also replace E[W_{N-i}] = E[W_{N-i} | X_{N-i} = x_{N-i}] = q_{N-i} (X_{N-i-1} - (1 - R_{N-i}) W_{N-i-1}). As a result:

J_{N-i} = \max_{R_{N-i} \in \{0,1\}} { a_{N-i+1} X_{N-i-1} - a_{N-i+1} (1 - R_{N-i}) W_{N-i-1} - Min{a_{N-i+1}, c_{N-i+1}} q_{N-i} (X_{N-i-1} - (1 - R_{N-i}) W_{N-i-1}) - c_{N-i} R_{N-i} W_{N-i-1} }
    = \max_{R_{N-i} \in \{0,1\}} { (a_{N-i+1} - Min{a_{N-i+1}, c_{N-i+1}} q_{N-i}) X_{N-i-1} - (a_{N-i+1} - Min{a_{N-i+1}, c_{N-i+1}} q_{N-i}) W_{N-i-1} + (a_{N-i+1} - Min{a_{N-i+1}, c_{N-i+1}} q_{N-i} - c_{N-i}) R_{N-i} W_{N-i-1} }

We now denote a_{N-i} = a_{N-i+1} - q_{N-i} Min{a_{N-i+1}, c_{N-i+1}} and obtain:

J_{N-i} = \max_{R_{N-i} \in \{0,1\}} { a_{N-i} X_{N-i-1} - a_{N-i} W_{N-i-1} + (a_{N-i} - c_{N-i}) R_{N-i} W_{N-i-1} }

For R_{N-i} = 0: J_{N-i} = a_{N-i} X_{N-i-1} - a_{N-i} W_{N-i-1}
For R_{N-i} = 1: J_{N-i} = a_{N-i} X_{N-i-1} - c_{N-i} W_{N-i-1}

To maximize, we recover (R_{N-i} = 1) if a_{N-i} X_{N-i-1} - a_{N-i} W_{N-i-1} ≤ a_{N-i} X_{N-i-1} - c_{N-i} W_{N-i-1}, or, after simplifying the expression, if c_{N-i} < a_{N-i}. Further, J_{N-i} can be expressed as: J_{N-i} = a_{N-i} X_{N-i-1} - Min{a_{N-i}, c_{N-i}} W_{N-i-1}. The suggested formulation holds for [i-1] and hence, by induction, it holds for [i] as well. Q.E.D.

A special case that we examine in more detail is when the marginal costs are fixed at all stages (i.e., c_n = c, n = 0..N). We consider two possible scenarios in this case:

(a) The marginal cost of fixing a quality defect is larger than the marginal utility gained by the data product produced after fixing the quality defect (c > a):

• Since c > a_N = a, no action is taken at the final stage S_N (hence, R_N = 0)
• Min{a_N, c} = a_N and a_{N-1} = (1 - q_{N-1}) a_N < c; hence, no action is taken at stage S_{N-1}
• Similarly, since a_{N-i} keeps decreasing with [i], no action is taken at any prior stage

(b) The marginal cost of fixing a quality defect is equal to or smaller than the marginal utility gained by the data product produced after fixing the quality defect (c ≤ a):

• Since c ≤ a = a_N, the value loss is recovered at stage S_N (hence, R_N = 1)
• Min{a_N, c} = c and a_{N-1} = a_N - q_{N-1} c < a_N
• As a_{N-i} decreases with [i], when a_{N-i} ≤ c, no action is taken for all stages before and including stage S_{N-i}. Value loss is recovered for all stages subsequent to S_{N-i}

The above scenarios are illustrated in Figures 2 through 4. In Figure 2, the marginal utility (increasing line) is always higher than the marginal cost (fixed). Therefore, the record loss must be recovered at all stages. In Figure 4, the marginal utility is always lower than the marginal cost and no action is necessary. In Figure 3, initially the marginal utility is lower than the marginal cost and so no action is needed. However, at a certain stage the marginal utility supersedes the marginal cost. From that stage onwards, record loss must always be recovered. It must be noted that this policy addresses recovery of records that were lost or damaged at the immediately preceding stage only. Had recovery been applied to the overall record loss (i.e., addressing loss at all prior stages as well), costs at later stages would have been significantly higher, resulting in a substantially different optimal policy.
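The backward-recursion rules of the Proposition are straightforward to implement. The following sketch computes the marginal values a_{N-i} and the resulting recovery decisions for the fixed-marginal-cost case; the values of a, c, and q are hypothetical, and the printed policy reproduces the Scenario 2 pattern (no action at early stages, recovery at later stages).

```python
# A sketch of the Proposition's backward recursion, assuming a scalar total
# marginal utility a, per-stage marginal costs c[n], and loss ratios q[n].
# All numbers are hypothetical. policy[n] = 1 means "recover the loss at S_n".

def optimal_policy(a, c, q):
    N = len(c) - 1
    a_next, c_next = a, 0.0          # boundary: a_{N+1} = a, c_{N+1} = 0
    a_marg = [0.0] * (N + 1)
    policy = [0] * (N + 1)
    for n in range(N, -1, -1):                           # stage S_{N-i}, i = 0..N
        a_marg[n] = a_next - q[n] * min(c_next, a_next)  # rule 1: marginal value
        policy[n] = 1 if c[n] < a_marg[n] else 0         # rule 2: recover or not
        a_next, c_next = a_marg[n], c[n]
    return policy, a_marg

# Fixed marginal cost (c = 1.0 at every stage) with a = 2.0 and q = 0.25:
policy, a_marg = optimal_policy(a=2.0, c=[1.0] * 8, q=[0.25] * 8)
print([round(v, 3) for v in a_marg])  # marginal value a_n grows toward S_N
print(policy)                         # [0, 0, 0, 0, 1, 1, 1, 1] (cf. Figure 3)
```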

The above model and its instantiation for an optimal error correction policy have important implications for managing data quality in multi-stage data environments. Literature in data quality suggests that all errors need to be addressed ([19], [23], [25]). The model and the subsequent analyses, however, suggest that permitting propagation of quality defects to the next process stage may not be all bad — especially if the cost of fixing errors offsets the gain in utility obtained from fixing the (erroneous) data. Further, the results suggest that in some data processing scenarios, fixing quality defects may be beneficial only at the later stages of the process and should be avoided at the early stages. Again, this is different from literature that suggests fixing errors as early as possible. For instance, in data warehouses, it is recommended that errors be addressed in the operational databases and in the extraction and transformation stages. However, our model suggests that it might be less expensive and more profitable to fix the errors in the data warehouse. Attempting to fix all errors as they occur at each stage can result in sub-optimal economic performance.

Stage transformations, described earlier, are design choices associated with the design of data quality management activities that help recover record loss — namely, the choice whether or not to apply a maintenance activity to improve quality. These choices are made after the characteristics of the dataset (represented by the metadata vector X) and the quality hazards (represented by the quality hazards vector W) are known. The decision whether or not to pursue data correction is based on knowing the quality deficiencies, understanding the costs associated with correcting them, and understanding the added utility contributed by the corrected data. Moreover, the decisions are made dynamically, as the data is processed, and the actual costs, quality deficiencies, and utility contribution can be evaluated. In this study, the multi-staged data manufacturing has been modeled as being sequential. We have simplified the model to permit a parsimonious formulation, a strategy that is typically adopted in developing analytical models. In reality, however, data manufacturing environments may not be sequential. The model should be enhanced to address scenarios of multiple inputs and/or outputs per stage.

The proposed model is a first step towards integrating economic considerations into managing data quality and maintaining data manufacturing processes. Key issues with operationalizing the model are the determination of costs and utilities. Fixed costs are relatively easier to estimate, but have a smaller impact on the model outcomes. The variable costs have a larger impact but are harder to estimate. One way of obtaining good estimates of variable costs is by interviewing the administrators to obtain estimates of the person-hours needed to locate a specific error and to implement an ad-hoc solution to fix the error and recover the data, and then extrapolating the associated costs from the data gathered.

Estimating utility may turn out to be even more challenging. The utility of information has been defined as the difference in benefit in the presence of full information versus partial or no information [7]. The value of data products materializes through usage and experience, which are context-dependent and often require successful integration with complementary resources [12]. Hence, a data product may offer different utilities under the different contexts in which it is used. Data, therefore, cannot be associated with utility in general and has to be associated with utility in the context of a specific usage or task. In certain data management contexts utility assessment is possibly less challenging than in others.

FIGURE 2: Fixed marginal cost, Scenario 1

FIGURE 3: Fixed marginal cost, Scenario 2

FIGURE 4: Fixed marginal cost, Scenario 3

Marketing and customer relationship management (CRM), for example, offer techniques for estimating the relative value of customers — e.g., Recency, Frequency, and Monetary analysis (RFM) [18], and Customer Lifetime Value (CLV) [8]. Such assessments can be a good proxy for utility in our model, as they offer a monetary estimate of the utility associated with customer records. An in-depth analysis of each context is necessary to determine a good proxy for utility in that context.

The model proposed in this study assumes that the utility of the data product is known or can be determined. In real life, it is likely that the utility has some uncertainty. When a DW is created, some of the usages (business and decision contexts) for the data are known. Some others are known only after users have used the data for a period of time, or when a new business problem arises and users attempt to figure out how to use the existing data to solve the new problem. In its current form, our model does not address utility uncertainty. More advanced models that employ real options are needed. Another alternative to consider is investing some minimal quality management effort at each stage until the utility is better revealed (i.e., the associated uncertainty is reduced), at which time the decision on how to completely address that quality issue can be made. Further, the model assumes that the multiple different utilities of a given data product are sum-additive. In other words, the utilities are independent and orthogonal. Although this assumption has been adopted in the data quality literature (e.g., [3]), future research must address situations where utilities may be interdependent.
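As a concrete illustration of the CLV-based utility proxy discussed above, the sketch below derives a per-record marginal utility a_i from hypothetical CLV estimates and aggregates it across usage contexts; the figures and weights are assumptions for the example, not values endorsed by the study.

```python
# A sketch of a CLV-based utility proxy: derive the marginal utility per
# customer record for one usage context and aggregate across contexts.

clv_estimates = [120.0, 80.0, 200.0, 45.0]   # sampled customer CLVs, in $
a_per_record = sum(clv_estimates) / len(clv_estimates)   # a_i for one usage

# Relative reliance of each usage context on the customer records
usage_weights = {"campaign_targeting": 1.0, "churn_scoring": 0.4}
a_total = sum(w * a_per_record for w in usage_weights.values())  # a = sum a_i

print(a_per_record, a_total)   # 111.25 155.75
```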

CONCLUSIONS AND FUTURE DIRECTIONS

Data management is critical to the success of organizations. The technical and functional aspects of data management have been studied. The economic aspects, such as the business value contribution of data resources and the costs associated with managing them, have not been thoroughly researched and quantified. We suggest that an economic underpinning of design and maintenance decisions in data management in general, and data quality management in particular, can help organizations achieve optimal economic performance from data resources. We contribute by looking inside the "black box" of the data manufacturing processes and their data product outputs. While business value is attributed to the output, the costs are linked to the implementation and administration of the manufacturing stages (processes) that create the output. Value (utility) and cost are both affected by design and maintenance decisions. We unify these factors into a single framework that allows better assessment of design and maintenance alternatives, where the goal is to maximize net-benefit, the difference between utility and cost. The directed network representation of the data manufacturing process highlights the impact of local decisions on the cost at subsequent stages, on the value of the final output, and on overall profitability. This approach integrates business perspectives (costs, value, and profitability) and technical characteristics (data design and maintenance). It thus contributes to identifying a superior alignment of the two for data quality management.

The optimal error correction policy developed here demonstrates an application of the general framework for ongoing data maintenance decisions. The results highlight the importance of the economic perspective of data management — error correction decisions that take into account utility and cost tradeoffs may turn out to be significantly different from decisions that are driven by technical and functional considerations. Data imperfections are indeed undesirable.

However, it may make economic sense to let the defects remain if the value obtained by correcting these imperfections cannot justify the related costs. The framework and the instantiated online error correction model allow quantitative assessment of the utility and cost tradeoffs involved. They assist in developing economically optimal error correction policies.

Limitations of this study offer directions for further research. The study examined a sequential model and considered a simplified decision scenario with a limiting set of assumptions and conditions. The focus was limited deliberately, to better illustrate and formulate the value-driven model and obtain a closed-form analytical solution. Further development of the process model ought to reflect operational data process settings, addressing complex scenarios such as source integration, repeated processing, and multiple data products.

Empirically validating the model and its theoretical findings is another important research issue. This can be pursued by examining the suggested value and cost formulations in real-life (or well-simulated) settings, identifying the data and system characteristics that influence utility and cost, and quantifying the dynamics of data processing. Such an empirical validation is challenging for many reasons. One of them is that business processes that integrate data are typically complex in terms of the processing stages and technologies involved. They also involve complementary resources such as human knowledge and financial resources. Hence, quantifying the utility contribution of data resources can make validation challenging and may require techniques for mapping and attributing value within business processes and allocating it appropriately between the different resources involved.

REFERENCES

[1] Baldwin, C.Y. and Clark, K.B., Design Rules, Vol. 1: The Power of Modularity, MIT Press, Cambridge, MA, 2000.

[2] Ballou, D.P. and Pazer, H.L., "Process Improvement vs. Enhanced Inspection in Optimized Systems," International Journal of Production Research, 23(6), 1985, pp. 1233-1245.

[3] Ballou, D.P. and Pazer, H.L., "Modeling Data and Process Quality in Multi-Input, Multi-Output Information Systems," Management Science, 31(2), 1985, pp. 150-162.

[4] Ballou, D.P. and Pazer, H.L., "Designing Information Systems to Optimize the Accuracy-Timeliness Tradeoff," Information Systems Research, 6(1), 1995, pp. 51-72.

[5] Ballou, D.P. and Pazer, H.L., "Modeling Completeness versus Consistency Tradeoffs in Information Decision Systems," IEEE Transactions on Knowledge and Data Engineering, 15(1), 2003, pp. 240-243.

[6] Ballou, D.P., Wang, R., Pazer, H. and Tayi, G.K., "Modeling Information Manufacturing Systems to Determine Information Product Quality," Management Science, 44(4), 1998, pp. 462-484.

[7] Banker, R.D. and Kauffman, R.J., "The Evolution of Research on Information Systems: A Fiftieth-Year Survey of the Literature in Management Science," Management Science, 50(3), 2004, pp. 281-298.

[8] Berger, P.D. and Nasr, N.I., "Customer Lifetime Value: Marketing Models and Applications," Journal of Interactive Marketing, 12(1), 1998, pp. 17-30.

[9] Bertsekas, D.P., Dynamic Programming, 2nd ed., Athena Scientific, Nashua, NH, 2000.

[10] Chengalur, I.N., Ballou, D.P. and Pazer, H.L., "Dynamically Determined Optimal Inspection Strategies for Serial Production Processes," International Journal of Production Research, 30(1), 1992, pp. 169-187.

[11] Cui, Y., Widom, J. and Wiener, J.L., "Tracing the Lineage of View Data in a Warehousing Environment," ACM Transactions on Database Systems, 25(2), 2000, pp. 179-227.

[12] Davern, M.J. and Kauffman, R.J., "Discovering Potential and Realizing Value from Information Technology Investments," Journal of Management Information Systems, 16(4), 2000, pp. 121-143.

[13] Even, A. and Shankaranarayanan, G., "Utility-Driven Assessment of Data Quality," The DATA BASE for Advances in Information Systems, 38(2), May 2007, pp. 76-93.

[14] Even, A., Shankaranarayanan, G. and Berger, P.D., "Economics-Driven Data Management: An Application to the Design of Tabular Datasets," IEEE Transactions on Knowledge and Data Engineering, 19(6), June 2007, pp. 818-831.

[15] Jarke, M., Lenzerini, M., Vassiliou, Y. and Vassiliadis, P., Fundamentals of Data Warehouses, Second Edition, Springer, New York, NY, 2003.

[16] Kimball, R., Reeves, L., Ross, M. and Thornthwaite, W., The Data Warehouse Lifecycle Toolkit, Wiley Computer Publishing, New York, NY, 2000.

[17] Parssian, A., Sarkar, S. and Jacob, V.S., "Assessing Data Quality for Data Products — Impact of Selection, Projection, and Cartesian Product," Management Science, 50(7), 2004, pp. 967-982.

[18] Petrison, L.A., Blattberg, R.C. and Wang, P., "Database Marketing: Past, Present, and Future," Journal of Direct Marketing, 11(4), 1997, pp. 109-125.

[19] Redman, T.C., Data Quality for the Information Age, Artech House, Boston, MA, 1996.

[20] Sen, A., "Metadata Management: Past, Present and Future," Decision Support Systems, 37(1), April 2004, pp. 151-173.

[21] Shankaranarayanan, G. and Even, A., "Managing Metadata in Data Warehouses: Pitfalls and Possibilities," Communications of the AIS, 14(13), 2004, pp. 247-274.

[22] Shankaranarayanan, G., Ziad, M. and Wang, R.Y., "Managing Data Quality in Dynamic Decision Environments: An Information Product Approach," Journal of Database Management, 14(4), 2003, pp. 14-32.

[23] Strong, D.M., Lee, Y.W. and Wang, R.Y., "Data Quality in Context," Communications of the ACM, 40(5), May 1997, pp. 103-110.

[24] Tayi, G.K. and Ballou, D.P., "An Integrated Production-Inventory Model with Reprocessing and Inspection," International Journal of Production Research, 26(8), 1988, pp. 1299-1315.

[25] Wang, R.Y., "A Product Perspective on Total Data Quality Management," Communications of the ACM, 41(2), 1998, pp. 58-65.

[26] West, L.A. Jr., "Private Markets for Public Goods: Pricing Strategies of Online Database Vendors," Journal of Management Information Systems, 17(1), 2000, pp. 59-84.

[27] Wixom, B.H. and Watson, H.J., "An Empirical Investigation of the Factors Affecting Data Warehousing Success," MIS Quarterly, 25(1), 2001, pp. 17-41.