database complexity metrics - ceur-ws.orgceur-ws.org/vol-1284/paper9.pdf · database complexity...

Database Complexity MetricsCoraI Calero, Mario Piattini, Marcela Genero

GmpO ALARCOS - Departamento de Inform;iticaUniversidad de Castilla-La Mancha- SPAIN

email: {ccalero, mpiattin meenero l @inf-ct.uclm.es

AbstractMetrics are useful mechanisms for improving

the quality of soji`vvare products and also fordetermim.ng the best ways ro help practitioners andresearchers. UnfortunateLy, almost all the metricsput forward focus on program characteristz`csdisregardz.ng databases. However, databases arebecomz`ng more compLex, and it is necessary tomeasure schemata complexity in order to understand,monz.tor, control, predict and z.mprove databasedevelopment and ma"zntenance projects. In fizz'spaper, we will present dierent measures in order tomeasure the complexz-ty that aJTects themaintainabz"Iz.ty of the relational, object-relatz`onaland active database schemas. However it is notenough to propose the metrics, a formaL validatz.onis also needed for know!ng thez.r mathematz.caLcharacteristics. We wz.II present the two maintendencz.es in metrics formal validation, axz`omatz.capproaches and measurement theory" However,research !nto sofiware measurement is neededfrom a theoretz`caL but also jrom a practical point ofview ( (121). For thz`s reason, we wz`LL also presentsome of the experz�ments that we have developedforthe dijferent ia`nds of databases.

1.IntroducBon

Software engineers have been proposing largequantities of metrics for software products,processes and resources ([l0], [23], [37]). Metricsare useful mechanisms for improving the quality ofsoftware products and also for determining the bestways to help practitioners and researchers ([26]).Unfortunately, almost all the metrics put forwardfocus on program characteristics (e"g" McCabe([21]) cyclomatic number) disregarding databases([34]). As far as databases are concerned, metricshave been used for comparing data models ratherthan the schemata itself. Several authors ([2], [16],[17], [18], [32], [33]) have compared the most wellknown models such as NIAM and relationalusing different metrics. Although we think thiswork is interesting, metrics for comparing schemataare needed mostly for practical purposes, like

choosing between different design alternatives orgiving designers limit values for certaincharacteristics (analogously to value 10 for McCabe complexity of prograrns). Some recentproposals have been published for conceptualschemata ([20], [24], [28]) but for conventionaldatabases, such as relational ones, nothing has beenproposed, excepting normalization theory.

Databases are becoming more complex, and it isnecessary to measure schemata complexity in order tounderstand, monitor, control, predict and improvedatabase development and maintenance projects. Inmodem Information Systems (IS), the database hasbecome a crucial component, so there is a need topropose and study some measures to assess itsquality.

Database quality depends on several factors:functionality, reliability, usability, efficiency,maintainability and portability ([15]). Our focus ison maintainability because maintenance accountsfor 60 to 90 percent of life cycle costs and it isconsidered the most important concern for modemIS departments ([1l], [22], [29]).

The International Standard, ISO/IEC 9126,distinguishes five subcharacteristics formaintainability: analysability, changeability,stability, testability and compliance. Analysability,changeability and testability are in turn influencedby complexity ([191). However, a generalcomplexity measure is "the impossibLe holy grail"([9]), i.e" it is impossible to get one value thatcaptures all the complexity factors of a database,Henderson-Sellers ([14]) dis6nguishes three typesof complexity: computational, psychological andrepresentational, nd for psychological complexityhe considers three components: problemcomplexity, human cognitive factors and productcomplexity. The last one is our focus and for ourpurposes the product will be databases.

Our goal is to propose internal measures fordatabases, which can characterise their complexityhelping to assess database maintainability (theexternal quality characteristic). In the next sectionwe will present the framework followed to defineand validate database metrics.

Section three summarizes the proposed metricsfor relational, object-relational and active

QuaTIC'2OOI / 79

databases. In section four the formal validation ofsome of these metrics is described. Some empiricalvalidations are presented in section five andconclusions and future work will be presented insections six and seven respectively.

2. A Framework for Developing andValidating Database Metrics.

As we have said previously, our goal is todefine metrics for controlling databasemaintainability. However, metrics definition mustbe done in a methodological way, so it is necessaryto follow a number of steps co ensure the reliabilityof the proposed metrics. Figure 2 presents themethod we apply for the metrics proposal.

(such as [37] or [36]) specify a generalframework in which measures should bedefined. Measurement theory gives cleardefinitions of terminology, a sound basisof software measures, criteria forexperimentation, conditions for validationof software measures, foundations ofprediction models, empirical properties ofsoftware measures, and criteria formeasurement scales.Empirical vaBdaOon. The goal of thisstep is to prove the practical utility of theproposed metrics~ Although there arevarious ways of performing this step,basically we can divide the empiricalvalidation into experimentation and casestudies. Experimentation is usually madeusing controlled experiments and the casestudies usually work with real data. Bothof them are necessary, the controlledexperiments as a first approach and thecase studies for backing up the results .

METRIC~ ~,~ .--. ~ . .

Figure 2. Steps followed in the definition andvalidation of the database metrics

In this figure we have three main activities:

Metrics definition The first step is theproposal of metrics. Although it lookssimple, it is an important one in ensuringmetrics are correctly defined. Thisdefinition is made taking into account thespecific characteristics of the database wewant to measure and the experience ofdatabase designers and administrators ofthese databases.TheoreBeal validation. The second step isthe formal validation of the metrics. Theformal validation helps us to know whenand how to apply the metrics. There aretwo main tendencies in metrics validation:the frameworks based on axiomaticapproaches and the ones based on themeasurement theory. The goal of the firstones is merely definitional. The most well-known frameworks of this type are thoseproposed by [351, [3] and [251. Themeasurement theory-based frameworks

In this section, we present the different metricsthat we have proposed for relational. objectrelational and active databases. For each kind ofdatabase, a brief summary of its maincharacteristics is given and an example usingANSI/ISO SQL:1999 code is used to illustrate thecalculation of the proposed metrics.

Metrics for Rela6onul DatabasesTraditionally, the only indicator used to

measure the "quality" of relational databases hasbeen the normalization theory, with which [13]propose to obtain a normalization ratio. However,we think that normalization is not enough tomeasure complexity in relational databases, so wepropose the following four metrics in addition tonormalization ([5]):

Number of aun.buzes (NA)NA is the number of attributes in all the tables

of the schema.

Depth Referential Tree (DRT)DRT is defined as the lenooth of the longest

referential path in the database schema. Cycles areonly considered once.

80 / QuaTIC2001

Number of Foreign Keys (NFK)The NFK metric is defined as the number of

foreign keys in the schema.

Cohesion of the schema (COS)COS is defined as the sum of the square of the

number of tables in each unrelated subgraph of thedatabase schemata that is:

|USj 2 |US| number of unrelated subgraphsCOS - .S JV7 USi - NTUSi number of tables in the

z=1 ` related subgraph "i"

We apply the previous metrics tothe following example (suppliers-and-partsdatabase) taken from [61:

CREATE TAELE S( S# S#,

SNAME NAME,STATUS STATUS,CITY CITY,PRIMARY KEY (S#));

CREATE TABLE P( P# P#PNAME NAME,COLOR COLOR,WEIGHT WEIGHT,CITY CITY,PRIMARY KEY (P#));

CREATE TABLE SP( S# S#,P# P#,QTY QTY,PRIMARY KEY (S#, P#),FOREIGN KEY

NCES S,FOREIGN KEY

REFERENCES P);

In this schema the value of the metrics are: NA= 12, DRT = 1, NPK = 2, COS = 9.

Metrics for Object-Relotional DatabasesAn object-relational database schema is

composed of a number of tables related byreferential integrity, which have columns that canbe defined over simple or complex (user-defined)data types. We define the next metrics for object-relational databases ([4J):

Schema Si~e (SS)We define the size of a system as the sum of the

size of every table (TS) in the schema:

(S#)

(P#)

This relational database schema can berepresented as

a relational graph (see figure 3).

The table size (TS) measures the size not onlyin terms of the simple columns (defined usingsimple domains) bot also in terms of complexcolumns (defined using user-defined classes).Formally it can be defined as the sum of the totalsize of the simple columns (TSSC) and the totalsize of the complex columns (TSCC) in the table.TSSC is simply the number of simple columns inthe table (considering that each simple domain hasa size equal to one). TSCC is defined as the sum ofcomplex columns size (CCS). The size of acomplex column is no more than the size of theclass hierarchy above the columnand is defined asweighted by the number of complex columns whichuse the hierarchy. Finally, the size of a classhierarchy is defined as the sum of the size of eachclass on the hierarchy. For more details about theprecise definition of this metric see [4J.

Complexz"ty of references between tables (DRT,NFK)

In object-relational databases, othercharacteristics of relational databases are preserved.Metrics related with the referential integrity, suchas NI:;K and DRT proposed in the previous section,can also be used.

We can apply these metrics to the followingexample:

S

SP

P

CREATE TYPE project AS (name CHAR(10),

budget FLOAT);Fiooure 3. Relational graph for the example

QuaTIC'2001 / 81

CREATE TYf E employee AS (emp_num INTEGER,level INTEGER,salary_base FLOAT,proj project)method calc_salary()

RETURNS DECIMAL(72));

Let us assume that all methods have acyclomatic complexity equal to 1. In table 1, wepresent the value of the size of each data type.

Table 1. Size values of the data types

Then, the values for the other metrics are: SHC= 6, CCS = 3, TSCC = 6, TSSC - 2, TS = 8, SS -8.

Mebies for Active DatabasesWhen measuring active databases, we can make

use of the notion of a triggering graph as defined in[I]. A triggering graph is a pair <S, L> where Srepresents the set of ECA rules, and L is a set ofdirected arcs where an arc is drawn from Si to Sj ifSis action causes the happening of an eventoccurrence that participates in Sjs events. Thisnotion of triggering graph is modified by [71 in twoaspects. Firstly, arcs are weighted by the number ofpotential event occurrences produced by thetriggering rule Ci.e. Si) that could affect the ruletriggered off (i.e. Sjs event). Secondly, the nodesS are extended with the set of transactions T. Atransaction is an atomic set of (database) actionswhere any of these actions could correspond to anevent triggering one or more rules. Therefore, Tnodes will have outgoing links but never incominglinks, as we assume that a transaction can never betriggered from within a rules action or anothertransaction.

The active database could be characterised by thefollowing triggering graph measures ([8]):

. NA the minimum number of anchorsrequired to encompass the whole set ofpotential causes of Si. An anchor is atransaction node of the triggering graph,

which has a link (either directly ortransitively) with at least one cause of Si.D, the distance. This measure corresponds tothe length of the longest path that connectsSi with any of its anchors.TP, the triggering potential. Given atriggering graph < S, L >, and a node of thegraph, rule Si, the number of causes of Si, isthe sum of weights of the incoming arcsarriving at Si. The triggering potential for arule R is the quotient between the number ofpotential causes of Si, and Sis eventcardinality~

CREATE TRIGGER ONEAFTER DELETE ON TABLE3FOR EACH ROWWHEN (OLD.NUMBER=3)BEGIN

DELETE FROM TABLE4 WHERE TABLE4.S#=:TABLE3"J#;END ONE;CREATE TRIGGER TWOAFTER DELETE ON TABLE4FOR EACH ROWWHEN COLD.NAME=.SMITH)BEGIN

DELETE FROM TABLE5 WHERE TABLE4.S#=:�OLD.S#:END TWO;

For trigger ONE, NA = I, TP = I and D = I, fortrigger TWO, NA = 1, TP = 1, and D = 2.

4. Metrics Formaf VaRdation

There are two main tendencies in metricsvalidation: the frameworks based on axiomaticapproaches and the ones based on the measurementtheory. The goal of the first ones is merelydefinitional~ On this kind of formal framework, aset of formal properties is defined for a givensoftware attribute and it is possible to use thisproperty set for classifying the proposed measures.

The most well-known frameworks of this typeare those proposed by [351, [3], and [25]. The maingoal of axiomatisation in software metrics researchis the clarification of concepts to ensure that newmetrics are in some sense valid. The measurementtheory based frameworks (such as [37} or {36])specify a general framework in which measuresshould be defined, Measurement theory gives cleardefinitions of terminology, a sound basis ofsoftware measures, criteria for experimentation,conditions for validation of software measures,foundations of prediction models, empiricalproperties of software measures, and criteria formeasurement scales. The discussion of scale typesis important for statistical operations.

In this section, we will present the results of theformal verification of the presented metrics with

Name datatype

SHOT SADTSAS + CAS

DTS

ROJECT 0 2+O 2EMPLOYE I 3+2 6

82 J QuaTIC'2OOI

results of an experimentation depend on careful,rigorous and complete experimental design. Aclaim that a measure is valid because it is a goodpredictor of some interesting attribute can bejustified only by formulating a hypothesis about therelationship and then testing the hypothesis ([10]).

In the rest of this section, we summarizedifferent experiments that we have done with someof the metrics discussed in this chapter. All of theseinitial experiments require further experimentationin order to validate the findings. However, theseresults can be useful as a starting point for futureresearch.

A complete description of the experiments canbe found in [5] for relational database metrics, in[27] for object-relational ones and in [8] for activeones.

Relational experimentOur objective was to demonstrate that the

metrics related with referential integrity (DRT and) can be used for measuring the complexity of

the relational database schema, which influences inthe relational database understandability. Theparticipants of this study were computer sciencestudents at the University of Castilla-La Mancha(Spain), who were enrolled in a two-semesterdatabases course. Based on the results of thisexperiment, we concluded that the number offoreign keys in a relational database schema is amore solid indicator of its understandability thanthe length of the referential tree. This metric is notrelevant by itself, but can modulate the effect of thenumber of foreign keys in a database system

Object-relational experimentFive object-relational databases were used in

this experiment with an average of 10 relations perdatabase. Five subjects participated in theexperiment. All of them were experienced in bothrelational databases and object-orientedprogramming. To analyze the usefulness of themetrics, we used two techniques: C4.5 ([30]), amachine learning algorithms, and RoC ([31]), arobust Bayesian classifier. In conclusion, both thetechniques discover that the table size is a goodindicator for the understandability of a table. Thedepth of the referential tree is also presented as anindicator by C4.5 but not clearly by RoC. Thenumber of foreign keys does not seem to have areal impact on the understandability of a table.

Active experimentOur objective was to assess the influence of D

and TP in rule interaction understandability.However, such understanding could be influenced

RE NFKL

VE DRT

N

A NAL

COS

BRIANO ETAL(1996)

ZUSE(1998)

COMPLEXHY ABOVE THE ORDINA

LENGT ABOVE THE ORDINAL

SLZ ABOVE THE ORDINA

512 RATIO

SS SIZ ABOVE THE ORDINA

TS SIZE ABOVE THE ORDINA

NA COMPLEXITY ABOVE THE ORDINA

TP NOTCLASSIPIAB

ABOVE THE ORDINA

D LENGT ABOVE THE ORDINA

Table 4. Summary of metrics formal validation

With the axiomatic approach results we canknow, for example for relational databases that weneed some metrics for capturing cohesion andcoupling and covering all the characteristicsdefined by the framework From the measurementtheory results we can know what kind ofoperations it is possible to make with the definedmetrics, the statistics that it is possible to apply tothem etc.

5. Metrics Empirical VaBdaHon

In the past, empirical validation bas been aninformal process relying on the credibility of theproposer. Often , when a measure was identifiedtheoretically as an effective measure of complexity,practitioners and researchers began to use themetric without questioning its validity. Today,many researchers and practitioners assume thatvalidation of a measure (from a theoretical point ofview) is not sufficient for widespread acceptance.They expect the empirical validation to demonstratethat the measure itself can be validated. Useful

|

QuaTIC'2001 / 83

by how the reasoning is conducted. As rules can beseen as cause-and-effect links, two questions can beposed by the user: "what effects can a rule produce(forward reasoning)?" or "how can an effect beproduced (backward reasoning)?". The participantsof the experiment were final-year computer sciencestudents at the University of the Basque Country,who were enrolled in an advance database courseThe students were already familiar with relationaldatabase, and some laboratories were previouslyconducted on the definition of triggers. For theforward experiment, we concluded that thetriggering potential in a database schema is a solidindicator of its understandability, and that thedistance is not relevant by itself and cannotmodulate the effect of the triggering potential. Forthe backward experiment we concluded that bothmetrics are solid indicators of its understandability.

All the experiments described need to bereplicated in order to obtain more consistent results.However, controlled experiments made in alaboratory are useful as a starting point but presentsome problems such a the large number of variablesthat can cause differences- Therefore, it isconvenient to also run case studies working withreal data.

6. Conclusions and Future Work

Databases are becoming more complex, and it isnecessary to measure schemata complexity in orderto understand, monitor, control, predict andimprove database development and maintenanceprojects. Database metrics could help designers,choosing between alternative semanticallyequivalent schemata, to select the mostmaintainable one and understand their contributionto the overall IS maintainability.

We have put forward different measures (forinternal attributes) in order to measure thecomplexity that affects the maintainability (anexternal attribute) of the relational, object-relationaland active database schemas.

However it is not enough to propose themetrics, a formal validation is also needed forknowing their mathematical characteristics. Wehave presented the two main tendencies in metricsformal validation, axiomatic approaches andmeasurement theory. Although the informationobtained from both techniques is different, the finalobjective is the same, to obtain objectivemathematical information of the metrics we areworking on.

However, as we have indicated previously,research into software measurement is neededalso from a practical point of view. We have

presented some of the experiments that we havedeveloped for the different kinds of databases.Nevertheless controlled experiments have problems(like the large number of variables that causesdifferences) and limits (they do not scale up, aredone in a class in training situations, are made invitro and face a variety of threats of validity).Therefore it is convenient to run multiple studies,mixing controlled experiments and case studies. Forthese reasons, a more in depth empirical evaluationis under way in collaboration with industrial andpublic organizations in "rear situations.

ACKNOWLEDGEMENTThis research is part of the MANTICA project,

partially supported by the CICYT and the EuropeanUnion (CICYT-1FD97-0168) and by the CALIDATproject carried out by Cronos Iberica (supported bythe Consejeria de Educaci6n de la Comunidad deMadrid, Nr, 09/0013/1999)

REFERENCES

1-

?..

3.

4.

5.

6.

7.

8.

Aiken, A., Hellerstein, J.M. and Widow, J.(1995). Static analysis techniques forpredicting the behaviour of active databaserules. ACM Transactions on Databases, 20 (1),3 41.Batra, D., Hoffer, J.A. and Bostrom, R.P.(1990). A comparison of user performancebetween the relational and the extended entityrelationship models in the discovery phase ofdatabase design. CACM, 33 (2), 126 139.Briand, L.C., Morasca, S. And Basin, V.(1996). Property-based software engineeringmeasurement. IEEE Transactions on SoftwareEngineering, Vol.22(1). pp~ 68-85.Calero, C`, Piattini, M, Ruiz, F~ and Polo, M.(1999) Validation of metrics for Object"Relational Databases, International Workshopon Quantitative Approaches in Object-OrientedSoftware Engineering (ECOOP99), Lisbon(Portugal), 14-18 JuneCalero, C., Piattini, M , Genero, M., Serrano,M. and Caballero, I. (2000). Metrics forRelational Databases Maintainability,.CKAI52000, Cardiff, UK.Date., C.J. (1995). An Introduction to DatabaseSystems. 6th. Addison-Wesley, Reading,Massachusetts-Diaz, 0. and Piattini, M (1999). Metrics foractive databases maintainability. CAISE99.Heidelberg, June 1618.Dfaz. O., Piattini. M y Calero, C. (2001).Measuring active databases maintainability.

84 1 QuaTIC2001

Moody, D~ L. (1998). Metrics for Evaluatingthe Quality of Entity Relationship Models.Proc. of the Seventeenth InternationalConference on Conceptual Mode|ling (ER'98),Singapore.Morasca, S, and Briand, I .C. (1997). Towardsa Theoretical Framework for measuringsoftware attributes. Proceeding of the FourthInternational,Software Metrics Symposium,119-126.Pf|eager, S. L. (1997). Assessing SoftwareMeasurement. IEEE Software. March/April,25-26.Piattini, M., Calero, C., Sahraoui, H. andLoams, H. (2000) An empirical Study withobject-relational database metrics,ECOOP:2OOO, Cannes, I3 June.Piattini, M., Genera, M,, Calero, C., Polo, M.and Ruiz, F. (2001). Metrics for ManagingQuality in InfOrrnation Mode|ling. In:"Information Modeling in the NewMiJJennium" M. Rossi and K. Siau (eds.), IdeaGroup Publishing, USA.Pigoski, T.M. (1997). Practical SoftwareMaintenance. Wiley Computer Publishing.New York, USA.Quinlan, J.R.. C4.5: Programs for MachineLearning, (1993), Morgan KaufmannPublishers.Barnaul, M. and Sebastiani, P. Bayesianmethods for intelligent data analysis. In MBerthold and D.J. Hand, editors, AnIntroduction to Intelligent Data Analysis, (NewYork, 1999). Springer.Rossi, M. and Brinkkemper, S. (1996).Complexity Metrics for Systems DevelopmentMethods and Techniques. Information Systems21(2), 209-227.Shoval, P. and Even-Chaime, M (1987).Database schema design: An experimentalcomparison between normalization andinformation analysis. Database 18(3), 3039.Sneed, H.M and Foshag. O. (1998). MeasuringLegacy Database Structures. Proc of TheEuropean Software Measurement ConferenceFESMA 98, Antwerp, May 6-8, Coombes, VanHuysduynen and Peelers (eds.), 199-211.Weyuker, E.J. (1988). Evaluating softwarecomplexity measures. IFFE Transactions onSoftware Engineering Vol.14(9). pp. 1357-1365.Whitmire, S.A. (1997), Object Oriented DesignMeasurement, Ed. Wiley.2use, H. (1998). A Framework of SoftwareMeasurement. Berlin, Walter de Gruyter

24.

25.

26.

27.

28.

29.

30.

31.

32.

33.

34"

35.

36.

37.

Accepted for publication in InformationSystems JournalFenton, N. (1994). Software Measurement: ANecessary Scientific Basis. IFRE Transactionson Software Engineering, 20(3), 199-206.Fenton, N, and PfJeeger S. L. ( / 997). SoftwareMetrics: A Rigorous Approach 2nd. edition.London, Chapman & Hall.Frazer, A. (1992)~ Reverse engineering-hype,hope or here?. In; P.A.V. Hall, Software Reuseand Reverse Engineering in Practice. Chapman& Hall.Glass, R. (1996). The Relationship BetweenTheory and Practice in Software Engineering.

Software, November, Vol.39 (11). pp 11-

Gray R.ff.M., Carey B.N., McGlynn N.A.,Pengelly A.D., (1991), Design metrics fordatabase systems. BT Technology J, Vol 9, 4Oct. pp. 69-79Henderson-Sellers, B- (1996). Object-OrientedMefrics - Measures of CompJexity. Prentice-Hall, Upper Saddle River, New Jersey.ISO, (1994). Software Product EvaluationQuality Characteristics and Guidelines for theirUse. ISO/IEC Standard 9126, Geneva.Jarvenpaa, S. and Machesky, J~ (1986). Enduser learning behavior in data analysis and datamodeling tools. Proc. of the 7th Int. Conf. onInformation Systems, San Diego, 152-167.Juhn, S. and Naumann, J. (1985). Theeffectiveness of data representationcharacteristics on user validation. Proc. of the6th Int. Conf. on Information Systems,Indianapolis, 212-226.Kim, Y. G. and March, S.T. (1995).Comparing Data Modeling Forrnalisms~CACM 38(6), /03-1 15.Li, H F. and Cheng, W.K. (1987). An empiricalstudy of software metrics. IEEE Trans. onSoftware Engineering, 13 (6), 679-708~MacDonell, S.G., Shepperd, MJ. and Sallis,P"J. (1997). Metrics for Database Systems; AnEmpirical Study. Proc. Fourth InternationalSoftware Metrics Symposium - Metrics'97,Albuquerque. IEEE Computer Society, 99-107.McCabe, T.J. (1976). A complexity measure.IEEE Trans. Software Engineering 2(5), 308-320.McClure, C. (1992) The Three R�s of SoftwareAutomation: Re engineering, Repository,Reusability. Englewood Cliffs: Prentice-Hall.Melton, A. (ed.) (1996). SoftwareMeasurement. London, International ThomsonComputer Press.

9.

1 O.

ll"

12.

13,

14.

15.

16.

17.

18.

19.

20.

21.

22.

QuaTIC 2001 / 85

database complexity metrics - ceur-ws.orgceur-ws.org/vol-1284/paper9.pdf · database complexity...

Documents