A Method for Analyzing and Reducing Data Redundancy in Object-Oriented Databases

Shuguang Hong

Department of Computer Information Systems

P.O. Box 4015

Georgia State University

Atlanta, GA 30302-4015

Tel: (404)651-3887

Fax: (404)651-3842

[email protected]

March 1994

ABSTRACT

One important research issue in database systems is to reduce data redundancy. This subject has been studied extensively in relational database systems; however, little research on this subject has been done in object-oriented databases. We argue in this paper that, if poorly designed, an object-oriented database could raise a serious data redundancy problem that is very harmful to data integrity. To address this problem, we present an analysis method for detecting data redundancy in object-oriented conceptual schemas. The proposed method is based on two types of dependencies that capture how attributes and operations (methods) of a class together represent real-world entities. Using the dependencies, data redundancy can be revealed, and three well-defined forms of classes are proposed to reduce the redundancy. Unlike relation normalization, the proposed method analyzes data redundancy not only within classes but also between classes. The proposed method addresses unique characteristics of object-oriented data models, including complex object structures, class inheritance, and relationships. The major objective of this research is to validate and improve the quality of databases by reducing data redundancy.

Key Words. Data Redundancy, Object-Oriented Databases, Database Management Systems


1. Introduction

Data redundancy occurs if a piece of information is replicated multiple times in a database. An example could be to store several similar copies of information about each customer: one for sales, one for marketing, and one for sales support. Whenever new customer information is entered, or the information about a customer is modified (e.g., changed to a different address), one must update the customer information in all copies in the database. If not done correctly, the update could result in inconsistent information. That is, depending on which part of the database is queried, different answers could be given. It might happen that the sales record shows a customer living in Atlanta, but in the marketing record, the same customer has a Boston address.

Researchers have exerted considerable effort to tackle the data redundancy problem in relational databases. In his milestone articles ([Codd 70], [Codd 72]), Codd introduced relation normalization, which sprouted extensive research on relation normalization theories ([Armstrong 74], [Beeri 77], [Beeri 78], [Fagin 77]). Maier [Maier 83] and Ullman [Ullman 88] provided comprehensive surveys that contain many of the theorems and proofs of relation normalization. Relation normalization is based on the analysis of dependency relationships (functional, multivalued, and join dependencies) among the attributes of a relation. The dependencies form the foundation for defining normalized relations (first through fifth normal forms). Each normal form reduces one type of data redundancy.

Some data redundancy in a relational database results from the limitations of the relational data model itself. Because of the atomic value constraint, repeating group data must be stored in multiple rows in a relation. Such a relation is shown below, and it is adapted from an example given by Kent in [Kent 83].

Employee   Language   Skill
Kim        Chinese    Cooking
Kim        English    Typing
Kim        English    Martial Arts
Smith      French     Cooking
Smith      German     Cooking
...        ...        ...

In the above example, if an employee has several skills and knows several foreign languages, the languages and skills of the employee have to be replicated because the primary key of the relation consists of all the attributes. As Kent pointed out, such replication causes data redundancy [Kent 83].

To remove such a limitation, Makinouchi introduced nested relations [Makinouchi 77]. A nested relation permits an attribute to contain repeating values. In the employee example, an employee's skills would be stored as one repeating value group, with the languages stored as another. However, a different form of data redundancy still exists in nested relations. To reduce the data redundancy, Ozsoyoglu and Yuan ([Ozsoyoglu 87], [Ozsoyoglu 89]) extended the relation normalization theory and introduced the nested normal forms based on multivalued dependencies.

Batini, Ceri, and Navathe have applied the relation normalization concept to entity-relationship modeling [Batini 92]. They found that data redundancy existed in poorly designed entity-relationship models. For example, if an entity type included attributes about both employee information and department information, the department information would be replicated in all employee entities. To reduce the data redundancy, they proposed several normal forms for entity-relationship models [Batini 92]. Those normal forms can help the user to detect and reduce data redundancy during the design of an entity-relationship schema. Their work is representative of the research on normalization of entity-relationship modeling ([Brown 83], [Ling 85], [Embley 88]).

We argue that object-oriented databases are not "data redundancy proof." Like other database systems, the quality of an object-oriented database depends on the designer. As shown later in Section 3, data redundancy cannot be avoided if the designer lacks knowledge and experience of databases in general, and of object-oriented databases in particular. Unfortunately, the problem in object-oriented databases has not been given much attention, and not enough research has addressed this issue. The issue of data redundancy in object-oriented databases was raised by Hong [Hong 91], and normal forms similar to relational normal forms were proposed for object-oriented databases. In the same effort, Andonoff applied the nested normal forms to object-oriented databases. He proposed two normal forms, 1-ONF and 2-ONF, and synthesis algorithms for the design of object-oriented schemas [Andonoff 92].

Continuing the research effort, we present an analysis method in this paper¹. The key contributions of the proposed method are to:

1. Extend relation normalization to deal with object-oriented modeling concepts, such as classes, complex objects, methods, and inheritance.

2. Expand the scope of relation normalization and analyze data redundancy within classes as well as between classes.

3. Provide a set of rules for analyzing and reducing data redundancies, which can be incorporated into existing database design tools or CASE tools using a knowledge-based implementation.

This research differs from the earlier research efforts ([Hong 91], [Andonoff 92]) in that it proposes a complete analysis of data redundancy in object-oriented databases. The earlier work gave a partial analysis of the redundancy problem and did not address issues such as object methods, data redundancy between classes, and the impact on database performance.

Our approach differs from the research that regards normalization as a design process that synthesizes a normalized schema from dependencies (e.g., [Bernstein 76], [Fagin 77], [Andonoff 92]). This synthesis approach has been criticized by Batini, Ceri, and Navathe as ". . . dependencies are inappropriate and obscure means for capturing requirements in the application domain." (p. 161, [Batini 92]). We agree with them that normalization should be a process to validate and improve the quality of databases and should not be the only design process. We adopted the decomposition approach ([Fagin 77]) and attempted to reduce data redundancy by restructuring database schemas, which distinguishes our approach from Andonoff's synthesis approach [Andonoff 92].

The remainder of this paper is organized as follows. Section 2 defines an object-oriented model that is used as a reference model for the analysis method. Section 3 presents an example of a poorly designed database and raises the data redundancy issue in object-oriented databases. Section 4 formalizes the concept of abstractions and lays the theoretical foundation for the method. Section 5 introduces three well-defined forms to reduce data redundancy within classes, while Section 6 analyzes data redundancy between classes. Section 7 summarizes this research and outlines a plan for future research.

¹ We chose the term method instead of methodology because the latter implies a much broader scope and represents a paradigm. The term method could be confused with object methods (operations). We hope that the context within which the term is used makes the reference clear. When necessary, we add the prefix analysis or object to the term in the discussion.

2. An Object-Oriented Model

Unlike the relational data model, the community of object-oriented databases has not reached an agreement on a standard model for object-oriented databases despite standardization efforts ([Atkinson 89], [Bancilhon 94], [CADF 90], [Cattell 93], [Kim 94]). The model discussed here only serves as a reference model for our research. Instead of including all commonly accepted object-oriented modeling concepts, this reference model is based on what was proposed by [Atkinson 89], [Zdonik 90], and [Kim 90] and collects the concepts that are relevant to the analysis of the data redundancy problem in object-oriented databases. Thus, some concepts are deliberately omitted. The discussion in this section assumes that the reader is familiar with the basic concepts of object-oriented databases.

In a conceptual schema, the definition of a class consists of a set of attributes and methods,

C = ({a}, {M})

where a is an attribute and M is a method.

The attributes of a class specify the structural characteristics of the instances (objects) of the class, while the methods model the common behavioral characteristics of all its objects. An object in the database is a representation or an abstraction of a real-world entity. Every object in a database has a unique object identity assigned by the database system. The primary key concept that is extensively used in other databases is not required in object-oriented databases. The reference model assumes total encapsulation; that is, all attributes are protected from direct access by other objects.

Figure 1 shows the definition of an Inventory class. This class tries to capture information about: vendors (suppliers), the contact person of the vendors, the warehouse where items are stocked, the manager of the warehouse, and the inventory of items. The inventory information of items includes: item number, description, unit cost, retail price, quantity on hand, reorder point, and year-to-date sold. The methods within this class are self-explanatory. We assume that Supplier, Person, and Address are classes defined elsewhere. The key words private and public define object encapsulation. For convenience, the class examples presented in this paper are written in a syntax that is similar to C++. However, our analysis method is independent of any programming language.

An attribute consists of a name n and a type t,

a = <n, t>

In the reference model, attribute types are classified into the following categories:

(1) Single-valued. The attribute type is either primitive (e.g., integer, string) or a class (object reference). For example, attribute item# in the Inventory class is of a primitive type (string), while attribute vendor references a Supplier object.

(2) Collection. The attribute type is a list or a set of values of the same type. The values may be of primitive types or objects. A list implies a sequential order among the values, while a set does not have any order among the values. Objects in a collection must belong to the class of the collection or to the ancestors (superclasses) of the class. For instance, the attribute contact in Inventory is a set containing Person objects and objects of the superclass(es) of Person, if any.


(3) Group. The attribute contains a group of values of different types, either primitive values or object references. In programming languages, this structure is commonly referred to as a record or structure. There is no attribute of this type in the Inventory example.

class Inventory {
private:
    string item#;              //-- item inventory information
    string description;        //   including price, quantity, etc.
    real cost_per_unit;
    real retail_price;
    integer qty_on_hand;
    integer reorder_point;
    integer year_to_date_sold;
    Supplier vendor;           //-- supplier information including supplier
    set<Person> contact;       //   and the contact person of the supplier
    Address warehouse_addr;    //-- warehouse information
    string warehouse#;
    string manager_emp#;       //-- manager information including employee #,
    string manager_name;       //   name, salary, title, address
    real salary;
    string title;
    Address manager_addr;

public:
    ship(integer qty);         // reduce quantity-on-hand & check stock
    change_sale_price(real new_price);
    change_vendor(Supplier new_vendor, Person new_contact);
    change_contact(Person new_contact);
    change_location(Address new_addr);
    inventory_report( );       // generate inventory report
    sale_report( );            // produce sales report
    warehouse_report( );       // show what items are stocked in
    update_manager( );         // change manager's information
    . . .
}

Figure 1: The Definition of Class Inventory

From the conceptual modeling point of view, objects as attribute values actually represent relationships between objects. For example, the vendor attribute expresses a relationship between Inventory (containing object) and Supplier (referenced object). Some object-oriented models classify those relationships as associations (is-member-of) and aggregations (is-part-of). We do not explicitly express such relationships in the reference model and treat the relationships as "regular" attributes of complex types. We believe that such simplification has no impact on our analysis of data redundancy. We will revisit this issue later in Section 4.1.

Methods are defined within a class and applied to all objects of the class. The effect of a method on an object is modeled as:


M(N, I, R) ⇒ (U, W, G)

where
    N - the method name
    I - a set of input arguments, {<name, type>}
    R - a set of attributes whose values may be read by the method
    U - a set of output arguments, including the returned value type, {<name, type>}
    W - a set of attributes whose values may be modified by the method
    G - a set of messages that may be initiated from the method, {[o, m, p]}, where o is the message receiver, m is the message, and p is a set of message parameters, {<n, t>}

The attributes (W) that are modified by a method represent the side effect of the method. For example, the ship method in Inventory may subtract the specified amount (qty) from qty_on_hand and year_to_date_sold. If the quantity of an item drops below reorder_point, this method may send a reorder message to a purchase department. The effect of the ship method is described as:

M(ship, {<qty, integer>}, {qty_on_hand, reorder_point, year_to_date_sold})
    ⇒ ({ }, {qty_on_hand, year_to_date_sold},
        {[purchase_dept, reorder, {<item#, string>, <quantity, integer>}]})

As described above, the representation of a method describes the static effect of the method and does not represent its run-time effect. The run-time effect of a method varies with the run-time environment, which includes the specific input argument values and the state of the object to which the method is applied. However, the static effect of a method covers all possible side effects and outgoing messages of the method, independent of any specific run-time environment. How to accurately predict the dynamic effect of a method at a specific run time is beyond the scope of this research.
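As an illustration, the static effect M(N, I, R) ⇒ (U, W, G) can be recorded in a simple data structure. The following C++ sketch is one possible encoding, populated with the ship example above; the type and field names (MethodEffect, Argument, Message) are illustrative choices and not part of the reference model.

    #include <set>
    #include <string>
    #include <vector>

    // One possible record of the static effect of a method: M(N, I, R) => (U, W, G).
    struct Argument { std::string name, type; };   // <name, type>
    struct Message {                               // [o, m, p]
        std::string receiver;                      // o - the message receiver
        std::string message;                       // m - the message
        std::vector<Argument> params;              // p - message parameters
    };
    struct MethodEffect {
        std::string name;                          // N
        std::vector<Argument> inputs;              // I
        std::set<std::string> reads;               // R - attributes that may be read
        std::vector<Argument> outputs;             // U
        std::set<std::string> writes;              // W - attributes that may be modified
        std::vector<Message> messages;             // G - messages that may be initiated
    };

    // The ship method of Inventory, transcribed from the example above.
    MethodEffect ship_effect() {
        return MethodEffect{
            "ship",
            {{"qty", "integer"}},
            {"qty_on_hand", "reorder_point", "year_to_date_sold"},
            {},                                    // no output arguments
            {"qty_on_hand", "year_to_date_sold"},
            {{"purchase_dept", "reorder",
              {{"item#", "string"}, {"quantity", "integer"}}}}
        };
    }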

Classes may be organized into class hierarchies in which subclasses inherit all attributes and methods from their superclasses. Our model permits multiple inheritance, and a subclass may override inherited attributes and methods.

3. Data Redundancy and Information Loss in Object-Oriented Databases

If we examine the Inventory class in Figure 1 closely, we can discover several sources of data redundancy. The first one is the warehouse, described by the attributes warehouse_addr and warehouse#, which is replicated in every item stocked in the warehouse. Similarly, the manager information that consists of manager_emp#, manager_name, salary, title, and manager_addr is repeated in every item. Grouping warehouse and manager information with items in one class creates a large amount of redundant data because one warehouse usually keeps thousands of items. In addition, if vendors always have designated contact persons for their customers, this class should keep either vendor information or contact information, but not both. Including both vendors and contacts in the class replicates either the vendors or the contact persons.

Like other databases, valuable information in an object-oriented database may be lost due to redundant data. In the Inventory example, removing all items supplied by a particular vendor could unintentionally lose the contact person information of that vendor, assuming that the contact person information is kept only within the Inventory objects in the database. Similarly, if a warehouse is closed, the manager information would be accidentally destroyed because warehouse and manager information is grouped in one class. The information loss problem is similar to deletion anomalies in relation normalization.


Similarly, the modification of replicated objects requires special care. If not all copies of an object are updated correctly, data inconsistency would occur. This problem is called update anomalies in relation normalization.

We believe that the data redundancy issue cannot be ignored. One might argue that if the user follows an object-oriented design methodology (e.g., Booch's methodology [Booch 94]), data redundancy could be avoided. This assumption has two flaws. First, no methodology claims automatic reduction of data redundancy in the design of databases. A methodology helps the user to reduce data redundancy, but it does not guarantee it. Second, users who have adequate knowledge and experience in both design methodology and database systems are scarce. With the help of design tools with user-friendly graphical user interfaces, the task of designing databases is gradually shifting from experts to the hands of end-users, which is characterized as the empowerment of end-users. The database knowledge and experience of those "designers" are far from adequate. As long as the design process depends upon the skills of the designer, the quality of the product is in the hands of the designer. Poor database designs similar to the Inventory class given in Figure 1 can surely happen.

The design of object-oriented databases makes it even more difficult to assume a qualified designer. An industry trend among object-oriented database systems is to adopt object-oriented programming languages (e.g., Smalltalk or C++) as their database languages (e.g., [GemStone], [ObjectStore], [Ontos], [Cattell 93]). With a few changes, an ordinary program written in an object-oriented programming language, such as C++, can define classes and create objects in an object-oriented database. One major benefit of adopting these languages is to allow programmers to work with databases transparently and avoid the impedance mismatch problem [Zdonik 90]. However, programmers usually do not have sufficient knowledge and experience in designing object-oriented databases. The Inventory example in Figure 1 was taken from the actual design of an object-oriented database. Therefore, it is very important to aid the user in the design of object-oriented databases. The analysis method attempts to fill this gap and help the designer analyze, detect, and correct the data redundancy problem.

4. Formalization of Abstractions

By analyzing the Inventory class shown in Figure 1, we can group the attributes and methods of Inventory into four distinct clusters, Item, Vendor, Warehouse, and Manager, as listed in Figure 2. These clusters are called abstractions in the analysis method.


Item = {item#, description, vendor, cost_per_unit, retail_price, qty_on_hand, reorder_point,
        year_to_date_sold, ship(...), change_sale_price(...), sale_report(...),
        warehouse_report(...), ...}
Vendor = {vendor, contact, change_vendor(...), change_contact(...), ...}
Warehouse = {warehouse#, warehouse_addr, change_location(...), warehouse_report(...), ...}
Manager = {warehouse#, warehouse_addr, manager_emp#, manager_name, salary, title,
           manager_addr, update_manager( ), ...}

Figure 2: Abstractions in Class Inventory

As discussed in Section 3 earlier, grouping these abstractions in Inventory results in data redundancy. The redundancy cannot be avoided because whenever one abstraction occurs in an object, the other accompanying abstractions are replicated in the same object. In this example, the manager abstraction is identical for all items stocked in a warehouse. Grouping the manager abstraction with the item abstraction causes replication of the manager information in all those item objects. Therefore, recognizing abstractions in class definitions is the first step toward the analysis of data redundancy.

From the data modeling point of view, abstractions are the representations of real-world entities. If attributes and object methods describe the same real-world entity, there must be some internal force that ties them together in an abstraction. The internal force is characterized as cohesion in software engineering. Borrowing the software engineering concept, attributes and object methods that represent a real-world entity are said to exhibit strong cohesion. The analysis method uses two types of relationships, a-dependency and m-dependency, to capture the cohesion within an abstraction. The former models how attributes of an abstraction are connected to each other, and the latter links object methods of an abstraction to the attributes of the abstraction.

4.1 A-Dependency

Cohesion among attributes of an abstraction can be described as how the attributes are related to each other. Because the aggregation of the meanings of those attributes represents the semantics of the abstraction, those attributes should appear together most frequently in a database. That is, if one of those attributes shows up in the database, it is very likely that the other attributes of the same abstraction appear at the same time. This cohesion has been captured in relation normalization as dependencies (functional, multivalued, etc.). Following a similar approach, we define a concept, called a-dependency, to reveal the cohesion among the attributes of an abstraction.

Definition 1 (a-dependency). Let X and Y be two non-empty subsets of attributes of a class C. Y is a-dependent on X if every X value always occurs with the same set of Y values in all objects of C, independent of any other attributes of C. This relationship is denoted as:

X a → Y

Figure 3 lists the a-dependencies identified in the class Inventory.


{item#, description, vendor} a → {cost_per_unit, retail_price, qty_on_hand, reorder_point, year_to_date_sold}

{vendor} a → {contact}

{item#, description, vendor} a → {warehouse#, warehouse_addr}

{warehouse#, warehouse_addr} a → {manager_emp#, manager_name}

{manager_emp#, manager_name} a → {salary, title, manager_addr}

Figure 3: A List of A-Dependencies in Inventory

Readers who are familiar with the dependencies in relation normalization will find that Definition 1 is similar to the definition of multivalued dependency in relation normalization, with several extensions:

(1) If both X and Y are atomic, the a-dependency is a multivalued dependency.

(2) If X (Y) is an object (non-atomic), the a-dependency is equivalent to the multivalued dependency after the object is replaced by its object identity. After substituting objects with their identities, objects can be regarded as "atomic."

(3) X may be a group or a collection of attributes [x1, x2, ..., xn]. If each xi is atomic, the a-dependency is a multivalued dependency

    (x1, x2, ..., xn) -->> Y

    If xi is an object, the a-dependency is equivalent to a multivalued dependency after each object xi is substituted by its object identity. The substitution equally applies to the situations in which one of or both of X and Y are groups or collections.

One may question the substitution of objects with their identities in the above discussion. We believe that the substitution does not change or restrict the analysis. The simplification is based on two facts. First, every object has a unique identity that is interchangeable with the object for the purpose of identifying the object. Second, the a-dependency is to reveal how attributes are related to each other. That is, we are interested in the inter-relationships (dependencies) among attributes, but not in their internal structures. Our analysis should yield the same set of a-dependencies regardless of whether object identities or objects are involved in the analysis. Therefore, the substitution allows us to include complex objects in our analysis without having to examine the internal structures of the objects. This is the reason that we did not distinguish relationships from attributes in our reference model, as discussed in Section 2. Because of the simplification, we use attribute names exclusively in our discussion regardless of the complexity of their attribute types.

The simplification has one more benefit. After replacing objects by their identities, the a-dependency is equivalent to the multivalued dependency of relation normalization. Hence, including complex objects in our analysis does not require significantly redoing the theoretical work on relational dependencies, and the analysis can utilize the research results of relation normalization where applicable.

4.2 Inference Rules for A-Dependencies

By establishing the equivalence between a-dependencies and multivalued dependencies, the inference rules proposed by Beeri [Beeri 77] and Armstrong [Armstrong 74] can be applied to a-dependencies. In relation normalization, normalized relations can be synthesized from dependencies using the inference rules. Following that approach, we may calculate the transitive closures of a-dependencies so that each transitive closure forms an abstraction. For instance, an abstraction initially contains one pair of a-dependencies, and a new pair of a-dependencies is added into the abstraction if the pair can be logically deduced from the existing dependencies in the abstraction based on those inference rules.

However, the calculation of transitive closures needs to be restricted to serve the purpose of the redundancy analysis. Not all a-dependencies exhibit the same degree of cohesion among the attributes of an abstraction. For example, all attributes in the Inventory class are somehow in one closure if we follow Beeri's inference rules. From the redundancy analysis done earlier, we observed that redundancy is caused if the degree of cohesion is ignored. In Inventory, for instance, the manager information is transitively related to the item information and results in redundancy. Therefore, we must consider the degree of cohesion during the transitive closure computation and only use those a-dependencies that exhibit strong cohesion to form an abstraction.

We restate Beeri's axioms [Beeri 77] for identifying weak a-dependencies. We omit the proof of these rules; interested readers may refer to ([Armstrong 74], [Beeri 77], [Maier 83], [Ullman 88], [Ozsoyoglu 89]) for a discussion of the dependency theories. Instead, we interpret the rules in the context of identifying abstractions. In the following rules, U represents all attributes of a class, and XY is a shorthand for X ∪ Y for attribute subsets X and Y of U. (A sketch of how some of these checks might be mechanized follows the rules.)

Rule 1.1: Reflexive. X a → Y is weak if Y ⊆ X.
Explanation: If Y is a subset of X, this dependency always exists. Since it does not discover any new dependencies, it should be excluded.

Rule 1.2: Complementation. If X a → Y, then X a → U − XY is weak.
Explanation: This rule discards dependencies that have nothing to do with an abstraction. Those dependencies trivially exist because of the definition of the a-dependency.

Rule 1.3: Augmentation. If X a → Y and Z ⊆ W, then XW a → YZ is weak.
Explanation: This rule is derived from Rule 1.1. Since Z is a subset of W, which is added to the left hand side, the appearance of Z on the right hand side of the dependency is redundant and can be dropped.

Rule 1.4: Transitive. If X a → Y and Y a → Z, then X a → Z − Y is weak.
Explanation: Transitive dependencies exhibit weak cohesion. Therefore, X a → Z − Y should be omitted.

Rule 1.5: Union. If X a → Y and X a → Z, then X a → YZ is weak.
Explanation: The union of the two dependencies is an obvious deduction and does not disclose any new dependencies.

Rule 1.6: Pseudotransitive. If X a → Y and YW a → Z, then XW a → Z − WY is weak.
Explanation: This rule is derived from Rule 1.4. For the same reason as in Rule 1.4, the dependency XW a → Z − WY exhibits weak cohesion and should be discarded.

Rule 1.7: Intersection. If X a → Y and X a → Z, then X a → Y ∩ Z is weak.
Explanation: This rule can be derived from the dependency definition. Because the dependency X a → Y ∩ Z does not reveal new cohesion, it should be dropped.

Rule 1.8: Difference. If X a → Y and X a → Z, then X a → Y − Z is weak.
Explanation: Same as Rule 1.7 above.
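The following C++ sketch illustrates how two of these rules might be mechanized; it flags an a-dependency as weak when it is reflexive (Rule 1.1) or can be obtained transitively from two other dependencies in a given set (Rule 1.4). The types and helper names are illustrative assumptions, and the remaining rules could be added in the same style.

    #include <algorithm>
    #include <iterator>
    #include <set>
    #include <string>
    #include <vector>

    using AttrSet = std::set<std::string>;
    struct ADep { AttrSet lhs, rhs; };                       // X a-> Y

    static bool is_subset(const AttrSet& a, const AttrSet& b) {      // a is a subset of b
        return std::includes(b.begin(), b.end(), a.begin(), a.end());
    }
    static AttrSet set_minus(const AttrSet& a, const AttrSet& b) {   // a minus b
        AttrSet r;
        std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                            std::inserter(r, r.end()));
        return r;
    }

    // Flags a dependency d as weak under Rule 1.1 (reflexive: Y is a subset of X)
    // or Rule 1.4 (transitive: X a-> Y and Y a-> Z are given, and d is X a-> Z - Y).
    bool is_weak(const ADep& d, const std::vector<ADep>& all) {
        if (is_subset(d.rhs, d.lhs)) return true;            // Rule 1.1
        for (const ADep& d1 : all)
            for (const ADep& d2 : all)
                if (d1.lhs == d.lhs && d2.lhs == d1.rhs &&
                    set_minus(d2.rhs, d1.rhs) == d.rhs)
                    return true;                             // Rule 1.4
        return false;
    }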


4.3 M-Dependency

The cohesion of object methods with respect to an abstraction is measured by the relationships between attributes and the methods that use the attributes. Methods are integral parts of an abstraction and represent the behavior of real-world entities. In other words, if a method belongs to an abstraction, this method must use, i.e., read or modify, some of the attributes of the abstraction. The cohesion between a method and an abstraction is modeled as an m-dependency.

Definition 2 (m-dependency). Let M be a method of a class C, M(N, I, R) ⇒ (U, W, G), and let X = R ∪ W ≠ ∅ be the set of attributes that are used (read or modified) by M. The relationship between M and X is called an m-dependency, denoted as:

X m → M

Message passing between objects is not included in the definition of m-dependencies. In message passing, one object (the receiver) responds to the request of another object (the sender). For example, an Inventory object may send a message to a Vendor object and request the Vendor object's address. The m-dependency does not model such a link between a message sender (method) and the attributes of the receiver that uses the attributes on behalf of the sender. Such inter-object dependencies represent weak (remote) dependencies and do not help model the cohesion between methods and attributes within a class.

Figure 4 lists some of the m-dependencies in the Inventory class. In those dependencies, the methods on the right hand side of a dependency use the attributes on the left hand side of the dependency. We omitted the arguments of those methods since they are irrelevant to our analysis.

{qty_on_hand, year_to_date_sold, reorder_point} m → {ship(...)}

{cost_per_unit, retail_price} m → {change_sale_price(...)}

{vendor, contact} m → {change_vendor(...)}

{contact} m → {change_contact(...)}

{warehouse_addr} m → {change_location(...)}

{item#, description, vendor, qty_on_hand} m → {inventory_report(...)}

{item#, description, vendor, cost_per_unit, retail_price, year_to_date_sold} m → {sale_report(...)}

{salary, title, manager_addr} m → {update_manager( )}

{item#, description, vendor, qty_on_hand, warehouse#, warehouse_addr} m → {warehouse_report(...)}

Figure 4: Some of the M-Dependencies in Inventory


4.4 Inference Rules of M-Dependencies

Following a similar approach as for a-dependencies, several inference rules are defined for m-dependencies. These rules are directly derived from the definition of the m-dependency. We omit the proof of the rules since the proof is trivial. In these rules, M is a method of a class C, and MiMj is a shorthand for the execution of Mi and Mj of C in some order; X and Y are attribute sets of C, and XY denotes X ∪ Y.

Rule 2.1: Pseudoaugmentation. If X m → M and Y m → M, then XY m → M.
Explanation: If M uses X and Y, it obviously uses the union of them.

Rule 2.2: Union. If X m → Mi and X m → Mj, then X m → MiMj.
Explanation: If both Mi and Mj use the same set of attributes, the combined computation of them surely manipulates the same set of attributes.

Rule 2.3: Pseudodecomposition. If X m → M and Y ⊂ X, then Y m → M.
Explanation: If a method uses a set of attributes, it surely uses subsets of the attributes.

The m-dependency helps eliminate attributes that are never used by any method. The rationale behind this action is that every attribute of a class should be used by at least one method in the class. Otherwise, the attribute serves no purpose. Because of data encapsulation, those unused attributes become inaccessible outside of the class. The following rule formulates the action:

Rule 2.4: Dependent reduction. Let U be all attributes of a class C and a ∈ U be an attribute. If there does not exist a method M in C such that {a} m → M, then all a-dependencies of C, Xi a → Yi, should be reduced to Xi − {a} a → Yi − {a}, where i = 1, 2, ..., m.

The rule above assumes total encapsulation. If partial encapsulation is permitted, this rule is applicable only to private attributes.

4.5 Abstractions and Semantics Descriptors

Since each dependency represents a piece of information about a real-world entity, an abstraction is naturally a cluster of attributes and methods that are linked together by a-dependencies and m-dependencies. In other words, those dependencies define the aggregate meaning (semantics) of an abstraction. After applying the inference rules discussed earlier to a transitive closure, only the dependencies that directly contribute to the semantics of the abstraction remain. The strong cohesion among the remaining attributes determines the boundary of an abstraction.

Definition 3 (abstraction). An abstraction A within a class consists of a set of attributes, a-dependencies, methods, and m-dependencies, denoted as:

A = (Λ, Ψ, Μ, Π)

where
    Λ − a set of attributes
    Ψ − a set of non-weak a-dependencies: S a → Yi, S ⊂ Λ and Yi ⊂ Λ, for i = 1, 2, ..., n
    Μ − a set of methods
    Π − a set of non-redundant m-dependencies: Wj m → Mj, Wj ⊆ Λ and Mj ⊂ Μ, for j = 1, 2, ..., l

Note that in the above definition a method belongs to an abstraction if the method uses at least one attribute that belongs to the abstraction. As shown in Figure 2, for example, the method warehouse_report(...) belongs to the abstractions Item and Warehouse because it uses attributes from both abstractions.

By Definition 3, all a-dependencies in an abstraction share a common left hand side (S). The role played by S is similar to the superkey concept in relational databases, and S can be used as an indicator of or a reference to an abstraction. For the convenience of discussion, we name S the semantics descriptor of the abstraction, as defined below:

Definition 4 (semantics descriptor). The semantics descriptor of an abstraction is the set of left hand side attributes that are shared by all a-dependencies of the abstraction.
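A minimal sketch of Definitions 3 and 4 as a data structure is given below; the names are illustrative only. Because all a-dependencies of an abstraction share the same left hand side, the semantics descriptor can be read off any one of them.

    #include <set>
    #include <string>
    #include <vector>

    using AttrSet = std::set<std::string>;
    struct ADep { AttrSet lhs, rhs; };                            // S a-> Yi
    struct MDep { AttrSet lhs; std::set<std::string> methods; };  // Wj m-> Mj

    // A = (Λ, Ψ, Μ, Π) of Definition 3.
    struct Abstraction {
        AttrSet attributes;                    // Λ
        std::vector<ADep> a_deps;              // Ψ - non-weak a-dependencies
        std::set<std::string> methods;         // Μ
        std::vector<MDep> m_deps;              // Π - non-redundant m-dependencies

        // Definition 4: the left hand side shared by all a-dependencies.
        AttrSet semantics_descriptor() const {
            return a_deps.empty() ? AttrSet{} : a_deps.front().lhs;
        }
    };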

4.6 Inference Rules Affecting Abstractions

Errors may occur in semantics descriptors. Incorrect semantics descriptors may result in either too narrow or too broad boundaries. A narrow boundary may exclude rightful attributes and result in fragmented abstractions. In contrast, a broad boundary may include unnecessary attributes and fail to reveal potential data redundancy. Therefore, semantics descriptors must be checked before they are used to determine abstraction boundaries.

To help the designer verify and restructure a-dependencies, we constructed three additional inference rules to correct the errors. The first rule expands semantics descriptors so that abstractions include necessary attributes, while the last two rules attempt to reduce the boundary of an abstraction.

Rule 3.1: Augmentation. Let X, Y and Z be three non-empty sets of attributes. If X a → Z and Y a → Z, then XY a → Z.
Explanation: This rule is derived from the definition of a-dependencies. If Z is a-dependent on X and Z is a-dependent on Y, then surely Z is a-dependent on both X and Y. Thus, we can combine the left hand sides of the two a-dependencies. For example, if

{item#, description} a → {warehouse#, warehouse_addr}
{item#, vendor} a → {warehouse#, warehouse_addr}

then {item#, description, vendor} a → {warehouse#, warehouse_addr}.

Rule 3.2: Minimality. Let X, Y and Z be non-empty sets of attributes. If XY a → Z, Y a → Z, and there does not exist X a → Z, then XY a → Z should be reduced to Y a → Z.
Explanation: If Z depends on only part of the left hand side attributes, then the left hand side of the dependency should be reduced to the subset of attributes on which Z is dependent. That is, X can be dropped from the left hand side of the relationship. For example, if

{cost_per_unit, warehouse#, warehouse_addr} a → {manager_emp#, manager_name}
{warehouse#, warehouse_addr} a → {manager_emp#, manager_name}

and there does not exist

{cost_per_unit} a → {manager_emp#, manager_name}

then cost_per_unit should not be included in the left hand side of the first dependency. This rule is similar to the full functional dependency in relation normalization. (A sketch of this reduction is given after Rule 3.3.)

Rule 3.3: Self-reduction. Let X, Y and Z be non-empty, disjoint sets of attributes. If XYZ a → W and Y a → Z, then XYZ a → W should be reduced to XY a → W.
Explanation: If some attributes within the left hand side of an a-dependency themselves form a second a-dependency, then the right hand side of the second dependency should be excluded from the left hand side of the original dependency. For example, if

{item#, description, vendor, contact} a → {cost_per_unit, retail_price, ...}

and {vendor} a → {contact}

then contact should be dropped from the left hand side of the first dependency, i.e.,

{item#, description, vendor} a → {cost_per_unit, retail_price, ...}

Intuitively, an a-dependency may signal an abstraction. That is, Y a → Z may indicate a separate abstraction. As shown in the above example, {vendor} a → {contact} indeed forms a different abstraction that represents vendors and their contact persons. Hence, this rule reveals abstractions that might hide inside other abstractions.
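The reduction prescribed by Rule 3.2 can be checked mechanically. The sketch below is a simple illustration (the type and function names are assumptions, not part of the method): it shrinks the left hand side of a dependency to a smaller left hand side that already determines the same right hand side, provided the dropped attributes alone do not.

    #include <algorithm>
    #include <iterator>
    #include <set>
    #include <string>
    #include <vector>

    using AttrSet = std::set<std::string>;
    struct ADep { AttrSet lhs, rhs; };

    static bool proper_subset(const AttrSet& a, const AttrSet& b) {   // a is a proper subset of b
        return a.size() < b.size() &&
               std::includes(b.begin(), b.end(), a.begin(), a.end());
    }

    // Rule 3.2: if XY a-> Z and Y a-> Z are known but X a-> Z is not,
    // reduce the left hand side of XY a-> Z to Y.
    ADep minimize_lhs(ADep d, const std::vector<ADep>& all) {
        for (const ADep& other : all) {
            if (other.rhs != d.rhs || !proper_subset(other.lhs, d.lhs)) continue;
            AttrSet dropped;                                   // X = d.lhs - other.lhs
            std::set_difference(d.lhs.begin(), d.lhs.end(),
                                other.lhs.begin(), other.lhs.end(),
                                std::inserter(dropped, dropped.end()));
            bool x_alone_determines = std::any_of(all.begin(), all.end(),
                [&](const ADep& a) { return a.lhs == dropped && a.rhs == d.rhs; });
            if (!x_alone_determines) d.lhs = other.lhs;        // XY a-> Z becomes Y a-> Z
        }
        return d;
    }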

4.7 Processes for Identifying Abstractions

Identifying abstractions involves several steps. The first step is to identify a-dependencies and verify the a-dependencies using the inference rules discussed earlier. The second step is to identify and verify m-dependencies. The third step is to form semantics descriptors and determine boundaries of abstractions.

Identifying a-dependencies in a conceptual schema relies on both the understanding of the application and the semantic analysis of the attributes of classes. This task cannot be completely automated and requires input from the designer. The inference rules discussed in Section 4.2 and Rule 2.4 in Section 4.4 can help the designer derive additional a-dependencies and eliminate weak a-dependencies. However, if the analysis process is applied to an existing database that contains both a conceptual schema and data, some of the a-dependencies can be calculated by adopting the algorithms developed for relation normalization. Those algorithms compute dependencies by querying the data in the database. The discussion of those algorithms is beyond the scope of this paper. Interested readers may refer to ([Maier 83], [Ozsoyoglu 89]) for detailed discussion.

Determining m-dependencies can be done by using a straightforward algorithm. For each method of a class, the algorithm starts with an empty set of attributes. The algorithm scans each statement of the method, and an attribute is added into the attribute set if the attribute is used (read or modified) by the method. When all statements of the method have been parsed, an m-dependency between the set of attributes and the method is established. The inference rules given in Section 4.4 are used to verify the dependencies produced by the algorithm.
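A deliberately naive version of this scan is sketched below. Instead of parsing statements, it merely searches the method's source text for each attribute name of the class, which is only an approximation (it ignores scoping, comments, and name clashes); a real implementation would parse the method body as described above.

    #include <set>
    #include <string>
    #include <vector>

    // Approximate the attribute set used by a method by searching its source text
    // for each attribute name of the class. A real implementation would parse the
    // statements of the method body instead of matching substrings.
    std::set<std::string> attributes_used(const std::string& method_source,
                                          const std::vector<std::string>& class_attributes) {
        std::set<std::string> used;
        for (const std::string& attr : class_attributes)
            if (method_source.find(attr) != std::string::npos)
                used.insert(attr);        // attribute read or modified by the method
        return used;                      // the left hand side of the m-dependency
    }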

After all a-dependencies and m-dependencies have been correctly established, each left hand side of the a-dependencies forms a semantics descriptor. An abstraction is the union of all a-dependencies that have the same semantics descriptor. Finally, methods are classified into abstractions based on the attributes used by the methods. Rules 3.1, 3.2, and 3.3 help restructure the dependencies and expand or reduce semantics descriptors.
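The grouping step can be sketched as follows; the code assumes that weak a-dependencies have already been filtered out by the inference rules, and the struct and function names are illustrative. Each bucket of a-dependencies sharing a left hand side becomes one abstraction, and a method is attached to every abstraction whose attributes it uses.

    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    using AttrSet = std::set<std::string>;
    struct ADep { AttrSet lhs, rhs; };
    struct MDep { AttrSet lhs; std::string method; };

    struct Abstraction {
        AttrSet descriptor;                 // shared left hand side (Definition 4)
        AttrSet attributes;                 // descriptor plus all right hand sides
        std::set<std::string> methods;      // methods that use attributes of the abstraction
    };

    std::vector<Abstraction> form_abstractions(const std::vector<ADep>& a_deps,
                                               const std::vector<MDep>& m_deps) {
        std::map<AttrSet, Abstraction> by_lhs;              // one bucket per semantics descriptor
        for (const ADep& d : a_deps) {
            Abstraction& a = by_lhs[d.lhs];
            a.descriptor = d.lhs;
            a.attributes.insert(d.lhs.begin(), d.lhs.end());
            a.attributes.insert(d.rhs.begin(), d.rhs.end());
        }
        for (auto& entry : by_lhs)                          // a method joins every abstraction
            for (const MDep& m : m_deps)                    // whose attributes it uses
                for (const std::string& attr : m.lhs)
                    if (entry.second.attributes.count(attr))
                        entry.second.methods.insert(m.method);
        std::vector<Abstraction> result;
        for (const auto& entry : by_lhs) result.push_back(entry.second);
        return result;
    }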

Figure 5 shows the identified abstractions in the class Inventory. The dependencies in the figure have been listed in Figure 3 and Figure 4 and have been checked against the inference rules. The boundaries of the abstractions in the diagram are encircled by dotted lines, and the semantics descriptor of each abstraction is printed in bold face. For the convenience of discussion, each abstraction is designated by a name that is attached to the dotted lines. If an attribute is involved in more than one abstraction, the attribute is listed in all these abstractions to avoid a crowded picture. For example, the attribute vendor is listed in both the Vendor and Item abstractions.

[Figure 5 (diagram not reproduced): the four abstractions Item, Vendor, Warehouse, and Manager identified in class Inventory, with their attributes, methods, a-dependencies, and m-dependencies encircled by dotted lines and the semantics descriptors printed in bold.]

Figure 5: Abstractions in the Inventory Class

5. Class Normalization

Given the dependencies and the abstractions, data redundancy in a conceptual schema can be analyzed. Following the approach of relation normalization, we introduce three well-defined forms of classes to measure the degree of data redundancy in a class. The higher the well-defined form of a class, the less redundancy the class contains and the greater the impact on database performance.

5.1 First Well-Defined Form

The first well-defined form of classes serves as the starting point for the analysis of data redundancy. If a class is not in the first well-defined form, the analysis discussed in this paper may not be applicable.

Definition 5 (first well-defined form): A class is in the first well-defined form (1WDF) if it is based on the object-oriented model discussed in Section 2.


All examples presented in this paper are assumed to be in 1WDF. From the previous discussion in Section 3, classes in 1WDF may contain data redundancy if the class contains more than one abstraction. The next two well-defined forms attempt to formalize a process to reduce data redundancy in 1WDF classes.

5.2 Second Well-Defined Form

If a class contains two abstractions, those two abstractions may be exclusive, inclusive, or overlapped. Two exclusive abstractions do not have any overlapping attributes or methods, and there are no crossing dependencies, as defined below:

Definition 6 (exclusive abstractions): Let A and B be two abstractions. A and B are exclusive if

(1) they do not contain any common attributes or methods, and

(2) there are no a-dependencies or m-dependencies that cross-link attributes and/or methods of A and B.

Based on the dependencies shown in Figure 5, there are three non-exclusive abstraction pairs, (Item, Vendor), (Item, Warehouse), and (Warehouse, Manager), because the abstractions in each pair have overlapping attributes. There are three exclusive abstraction pairs, (Item, Manager), (Manager, Vendor), and (Warehouse, Vendor). Such inter-abstraction connections in Inventory are depicted in Figure 6. In the figure, each semantics descriptor is drawn as a box, and the dependencies are condensed into symbols and lines that link the symbols to the semantics descriptors.

[Figure 6 (diagram not reproduced): the semantics descriptors of Item, Vendor, Warehouse, and Manager drawn as boxes, with condensed dependency links showing which abstraction pairs are exclusive and which are non-exclusive; the primary abstraction Item is shaded.]

Figure 6: Exclusive and Inclusive Abstractions in Class Inventory

If a class contains multiple abstractions, one of the abstractions must be selected as the primary abstraction of the class. Although the selection is arbitrary, the candidate that is most appropriate to the meaning of the class should be the primary abstraction. In our Inventory example, Item is assumed to be the primary abstraction, which is drawn as a shaded box in Figure 6. The definition of the primary abstraction of a class is given below:


Definition 7 (primary abstraction): The primary abstraction of a class is designated as the abstraction whose meaning best matches what the class intends to represent.

The degree of data redundancy in a class depends on two factors: the number of abstractions in the class and the inter-connections among the abstractions. Exclusive abstractions contribute to a higher degree of data redundancy than non-exclusive abstractions because there is nothing to relate them in one class. For instance, the exclusive abstraction pair (Item, Manager) groups these two unrelated abstractions into Inventory, which results in redundant manager data. Therefore, the first step to reduce data redundancy is to remove exclusive abstractions with respect to the primary abstraction of a class. This process results in second well-defined form classes, as stated below:

Definition 8 (second well-defined form): A class is in the second well-defined form (2WDF) if it is in 1WDF and all abstractions in it are non-exclusive with respect to the primary abstraction of the class.

Given a class C and its primary abstraction A, the following rules check whether C is in 2WDF:

Rule 4.1: For each abstraction B, there exists at least one attribute X in B such that X is involved in an a-dependency with the primary abstraction A, or

Rule 4.2: For each abstraction B, there exists at least one method M in B such that M is in both A and B.

Applying those rules to the abstractions identified in the Inventory class, as shown in Figure 5 and Figure 6, we can determine that Inventory is not in 2WDF because Manager is exclusive with respect to the primary abstraction Item.
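A check corresponding to Definition 6 and Rules 4.1 and 4.2 could look like the following sketch. The representation is simplified (the cross-linking dependencies of Definition 6 are assumed to have been folded into the attribute and method sets of each abstraction), and the names are illustrative.

    #include <algorithm>
    #include <set>
    #include <string>
    #include <vector>

    struct Abstraction {
        std::set<std::string> attributes;
        std::set<std::string> methods;
    };

    static bool disjoint(const std::set<std::string>& a, const std::set<std::string>& b) {
        return std::none_of(a.begin(), a.end(),
                            [&](const std::string& x) { return b.count(x) != 0; });
    }

    // Definition 6: A and B are exclusive if they share no attributes and no methods
    // (cross-linking dependencies are assumed to be reflected in these sets).
    bool exclusive(const Abstraction& a, const Abstraction& b) {
        return disjoint(a.attributes, b.attributes) && disjoint(a.methods, b.methods);
    }

    // Rules 4.1/4.2: the class is in 2WDF if no abstraction is exclusive with
    // respect to the primary abstraction.
    bool in_2wdf(const Abstraction& primary, const std::vector<Abstraction>& others) {
        return std::none_of(others.begin(), others.end(),
                            [&](const Abstraction& b) { return exclusive(primary, b); });
    }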

Given a 1WDF class C and its primary abstraction A, the following rules provide suggestions to the designer on how to improve C to be a 2WDF class:

Rule 4.3: Any attributes and methods that do not belong to any abstraction in C must be moved out of C.

Rule 4.4: For each abstraction B that is exclusive with respect to the primary abstraction A, B is suggested to be moved out of the class to form a new class or to be merged into an existing class.

Rule 4.5: For each abstraction B to be moved out, compare B with each abstraction E remaining in C. If E has overlapping attributes with B, the overlapping attributes need to be restructured as follows:

A) If the number of overlapping attributes is 1 and the attribute is either a single-valued attribute or a collection, that attribute may be replicated in both B and E;

B) If the number of overlapping attributes is 1 and the attribute is a group (structure), or if the number of overlapping attributes is greater than 1, the overlapping attributes in B and E may be replaced by an object reference that points to a new class that contains the overlapping attributes, or let only one of them, say E, contain the overlapping attributes and let the other one, say B, use message passing to request the overlapping attributes from E.

Rule 4.6: If there are methods that manipulate attributes that belong to both B and other abstractions remaining in C, those methods need to be rewritten.

Rule 4.7: Any new classes resulting from the restructuring are subject to the same redundancy analysis discussed in this section.

The rules above can be used to form an analysis report and provide suggestions to the designer. Applying those rules to our Inventory example, the Manager abstraction should be moved out of class Inventory. The remaining class is named Inventory2, as listed in Figure 7. We assume that the Manager abstraction forms a new class Warehouse_Manager, whose definition is omitted.

class Inventory2 {
private:
    string item#;              //-- item information
    string description;
    real cost_per_unit;
    real retail_price;
    integer qty_on_hand;
    integer reorder_point;
    integer year_to_date_sold;
    Supplier vendor;           //-- supplier information for this item
    set<Person> contact;       //   the contact person of the supplier
    Address warehouse_addr;    //-- warehouse information
    string warehouse#;         //   warehouse number

public:
    ship(integer qty);         // reduce quantity-on-hand & check stock
    change_sale_price(real new_price);
    change_vendor(Supplier new_vendor, Person new_contact);
    change_contact(Person new_contact);
    change_location(Address new_addr, string new_w#);
    inventory_report( );       // generate inventory report
    sale_report( );            // produce sales report
    warehouse_report( );       // show what items are stocked in
    . . .
}

Figure 7: The Inventory Class in Second Well-Defined Form

5.3 Third Well-Defined Form

Classes in 2WDF may still contain multiple abstractions and have redundant data. The Inventory2 class is such an example. It still contains the Item, Warehouse, and Vendor abstractions, and some of the data redundancy we discovered earlier in Section 3 remains in Inventory2. For instance, warehouse and vendor information is replicated with every item, as depicted in Figure 6. The third well-defined form aims at restructuring classes so that a class has only one abstraction.

Definition 9 (third well-defined form): A class is in the third well-defined form (3WDF) if it is in 2WDF and it contains only one abstraction.

To reduce the data redundancy existing in the Inventory2 class, the Vendor and Warehouse abstractions should be moved out, and they may form separate classes. It is possible that those abstractions have overlapping attributes and methods. The rules for handling the overlapping attributes and methods between those abstractions are identical to the rules used for restructuring 2WDF classes. Please refer to Rules 4.4 to 4.7 discussed in the previous subsection for details.

Following those restructuring rules, the class Inventory2 is broken down into three classes, Item, Supplier, and Warehouse, as shown in Figure 8, Figure 9, and Figure 10, respectively. Note that we have introduced a new attribute location in Item that refers to a Warehouse object. We also assume the Vendor abstraction is merged into the existing class Supplier, as shown in Figure 10.

class Item {
private:
    string item#;
    string description;
    real cost_per_unit;
    real retail_price;
    integer qty_on_hand;
    integer reorder_point;
    integer year_to_date_sold;
    Supplier vendor;           // supplier for this item
    Warehouse location;        // reference to a warehouse object

public:
    ship(integer qty);         // reduce quantity-on-hand & check stock
    change_sale_price(real new_price);
    change_vendor(Supplier new_vendor);
    inventory_report( );       // generate inventory report
    sale_report( );            // produce sales report
    warehouse_report( );       // show what items are stocked in
    . . .
}

Figure 8: The Item Class in 3WDF

class Warehouse {
private:
    Address warehouse_addr;
    string warehouse#;
    Warehouse_Manager manager;

public:
    change_location(Address new_addr, string new_w#);
    change_manager(Warehouse_Manager new_manager);
    . . .
}

Figure 9: The Warehouse Class in 3WDF


class Supplier {
private:
    string name;
    set<Person> contact;       // the contact person of the supplier
    string phone_number;
    . . .

public:
    change_contact(Person new_contact);
    . . .
}

Figure 10: The Supplier Class in 3WDF

5.4 Justification for Separating 3WDF from 2WDF

One may question the need for both 2WDF and 3WDF. By definition, 3WDF implies 2WDF. Why not directly normalize a class from 1WDF to 3WDF? There are two considerations for separating 2WDF and 3WDF. The first consideration is that they define different degrees of data redundancy. 2WDF removes from a class the attributes and methods that have no semantic connection to the abstraction represented by the class, while 3WDF removes attributes and methods that are inter-connected but should be separated because of the data redundancy they cause. They address different data redundancy problems.

The second consideration is that they impose different performance penalties. As in relation normalization, normalized classes incur extra costs for data retrieval. The extra cost can be measured by the average time required to retrieve data from the original class compared to the average time required to retrieve data from the normalized classes. Using this measure, the extra overhead for 2WDF is smaller than that for 3WDF because the frequency of retrieving data from exclusive abstractions together is often less than that of retrieving data from non-exclusive abstractions. For example, we separated the abstraction Manager from the original class Inventory and obtained a 2WDF class Inventory2, as shown in Figure 7. Database queries that retrieve inventory information can be satisfied by accessing the Inventory2 objects.

To obtain 3WDF classes, the non-exclusive abstractions in Inventory2 were broken down into several 3WDF classes, namely, Item (Figure 8), Warehouse (Figure 9), and Supplier (Figure 10). The same set of inventory queries that involved only one class now has to retrieve data from these 3WDF classes. That is, the cost of retrieving Item, Warehouse, and Supplier information from three separate 3WDF classes is greater than that of retrieving the same information from one 2WDF class (Inventory2). Therefore, separating 2WDF from 3WDF permits designers to make their own design decisions based on performance considerations.

6. Global Data Redundancy Analysis

Data redundancy may also occur between classes. For example, two similar inventory classes might be defined in a conceptual schema, which may result from design errors or from the integration of schema segments designed by a team of designers. Like relation normalization, the analysis discussed in the previous subsections can discover data redundancy within classes but fails to detect data redundancy between classes.


Data redundancy between classes can be removed by analyzing identified abstractions among classes. An abstraction represents a real-world entity. If the same real-world entity has been repeatedly represented in a conceptual schema, several similar abstractions should emerge from the identification of abstractions in the conceptual schema. Hence, taking abstractions out of their class boundaries and comparing the abstractions on the schema level can reveal data redundancy between classes.

6.1 Comparison of Attributes

Intuitively, the structural similarity of abstractions can be defined as the overlap of the attributes of the abstractions. However, attributes that represent the same real-world entity may have different names in different abstractions. Moreover, the same attribute name may represent two totally different real-world entities. The former is called a synonym naming conflict and the latter a homonym naming conflict in schema integration research ([Batini 86], [Batini 92], [Gotthard 92]). These naming conflicts require the designer to identify and resolve them. For reasons of space, the discussion of the conflicts and their resolution is omitted; interested readers may refer to [Batini 86], [Batini 92], and [Gotthard 92] for an excellent survey and further discussion.

Briefly, resolving naming conflicts involves two steps. The first step is to identify all naming conflicts and classify them into synonyms and homonyms. The second step is to rename the conflicting names so that attributes that represent the same real-world entity use the same attribute name, and attributes that represent different real-world entities have different attribute names. The remaining discussion in this section assumes that naming conflicts have been resolved in this way.
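As an illustration of these two steps, the following sketch (Python, illustrative only; the function name, the data layout, and the prefixing scheme for homonyms are assumptions, not part of the proposed method) applies a designer-supplied synonym table and homonym list to the attribute names of a set of classes.

def resolve_attribute_names(classes, synonyms, homonyms):
    # classes:  {class_name: set of attribute names}
    # synonyms: {attribute_name: canonical_name}, chosen by the designer in step 1
    # homonyms: set of attribute names that denote different real-world entities
    #           in different classes and therefore must be made distinct
    resolved = {}
    for cname, attrs in classes.items():
        renamed = set()
        for attr in attrs:
            attr = synonyms.get(attr, attr)       # unify synonyms under one name
            if attr in homonyms:
                attr = f"{cname.lower()}_{attr}"  # give homonyms distinct names
            renamed.add(attr)
        resolved[cname] = renamed
    return resolved

# Example: 'supplier' in Warehouse is a synonym of 'vendor' in Item.
# resolve_attribute_names({"Item": {"vendor"}, "Warehouse": {"supplier"}},
#                         synonyms={"supplier": "vendor"}, homonyms=set())
# returns {"Item": {"vendor"}, "Warehouse": {"vendor"}}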

6.2 Comparison of Methods

Like attributes, methods may have naming conflicts. This issue is not addressed by the aforementioned research on view integration ([Batini 86], [Batini 92], [Gotthard 92]). The conflict resolution presented here is based on our earlier research on schema integration in object-oriented databases (Hong and Kumar [Hong 93]). The similarity of methods, on which the naming conflicts are based, is defined as follows:

Definition 10 (Similarity of Methods) Given a method M1 of class C1 and a method M2 of C2:

M1(N1, I1, R1) ⇒ (U1, W1, G1)
M2(N2, I2, R2) ⇒ (U2, W2, G2)

M1 and M2 are

(a) Identical iff I1 = I2, R1 = R2, U1 = U2, W1 = W2, and G1 = G2

(b) Inclusive iff I1 ⊂ I2, R1 ⊂ R2, U1 ⊂ U2, W1 ⊂ W2, and G1 ⊂ G2, or vice versa

(c) Overlapped iff the two methods are not inclusive, but I1 ∩ I2 ≠ ∅, R1 ∩ R2 ≠ ∅, U1 ∩ U2 ≠ ∅, W1 ∩ W2 ≠ ∅, and G1 ∩ G2 ≠ ∅;

(d) Disjoint iff the two methods do not qualify for (a), (b), or (c) above.

The resolution of method naming conflicts follows an approach similar to that for attribute naming conflicts. If M1 and M2 are identical but N1 ≠ N2, there is a synonym naming conflict, and the two methods must be given the same name. In the other three cases, if N1 = N2, there is a homonym naming conflict, and N1 or N2 should be renamed so that the two different methods have different names.
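Definition 10 and these renaming rules can be mechanized once the sets I, R, U, W, and G of each method are known. The sketch below (Python, illustrative; representing each component as a frozenset of names is an assumption, not the paper's notation) classifies a pair of methods and flags the corresponding naming conflict.

def method_similarity(m1, m2):
    # m1, m2: dicts mapping "I", "R", "U", "W", "G" to frozensets of names
    comps = ("I", "R", "U", "W", "G")
    if all(m1[c] == m2[c] for c in comps):
        return "identical"
    if all(m1[c] < m2[c] for c in comps) or all(m2[c] < m1[c] for c in comps):
        return "inclusive"                 # every component properly contained
    if all(m1[c] & m2[c] for c in comps):
        return "overlapped"                # not inclusive, but every component shares something
    return "disjoint"

def method_name_conflict(n1, m1, n2, m2):
    kind = method_similarity(m1, m2)
    if kind == "identical" and n1 != n2:
        return "synonym conflict: give both methods the same name"
    if kind != "identical" and n1 == n2:
        return "homonym conflict: give the two methods different names"
    return "no naming conflict"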


Inclusive and overlapped methods need to be restructured. Inclusive methods are resolved by extracting the common part from the containing method to form a new method. The resolution of overlapped methods, however, is not trivial and depends on the designer's judgment: if two overlapped methods indeed share overlapping operations, those operations may be separated out as a new method; if the two methods have little in common, they may be regarded as disjoint methods. Interested readers may refer to [Hong 93] for a detailed discussion of method restructuring.

6.3 Comparison of Abstractions

After both attribute naming conflicts and method naming conflicts have been resolved, the similarity of abstractions can be compared on the schema level. The degree of similarity of abstractions is defined as follows:

Definition 11 (Similarity of Abstractions) Given two abstractions A and B, A and B are

(a) Identical if both A and B contain the same set of a-dependencies and m-dependencies,

(b) Inclusive if the dependencies contained in B are a proper subset of those contained in A, or vice versa,

(c) Overlapped if A and B contain two different sets of dependencies, but they have some common dependencies,

(d) Disjoint if A and B have no common dependencies.

Ideally, the abstractions identified in a conceptual schema should all be disjoint. The global analysis of data redundancy therefore compares abstractions on the schema level and reveals abstractions that are not disjoint; such non-disjoint abstractions may result in schema-level data redundancy. Given two abstractions A and B, the following rules are suggested to remove the data redundancy (a small illustrative sketch follows the rules).

1. Rule 5.1. If A and B are identical or inclusive, merge these two abstractions.

2. Rule 5.2. If A and B are overlapped, either extract the common portion of the abstractions into a new abstraction or remove the common portion from one of the abstractions so that only one of the abstractions contains the common portion.
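Under the assumption that each abstraction can be represented as the set of its a-dependencies and m-dependencies (encoded here simply as hashable values, which is an illustrative choice rather than the paper's notation), Definition 11 and Rules 5.1 and 5.2 can be sketched as follows in Python.

def abstraction_similarity(a, b):
    # a, b: frozensets of dependencies (a-dependencies and m-dependencies)
    if a == b:
        return "identical"
    if a < b or b < a:
        return "inclusive"
    if a & b:
        return "overlapped"
    return "disjoint"

def redundancy_action(a, b):
    kind = abstraction_similarity(a, b)
    if kind in ("identical", "inclusive"):
        return "Rule 5.1: merge the two abstractions"
    if kind == "overlapped":
        return ("Rule 5.2: extract the common dependencies into a new abstraction, "
                "or keep them in only one of the two abstractions")
    return "disjoint: no data redundancy between these abstractions"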

Note that the global analysis described here does not explicitly mention possible relationships between abstractions. Relationships (aggregations and associations) between abstractions have been captured in our reference model as object references and treated as normal attributes with complex attribute types, as discussed in Section 2 and Section 4.1. Relationships have therefore been considered during the attribute naming conflict resolution. Assume that two objects, say Manager and Item, both reference the same object, say Warehouse. If the object reference attributes in Manager and Item do not have the same attribute name (relationship name), these two attributes (relationships) are regarded as a naming conflict, and the naming conflict resolution would require both attributes to have the same name. Therefore, the comparison of the similarity of abstractions already includes the relationships between abstractions.

Ideally, the global analysis should be performed after the classes in a conceptual schema have been normalized. If all classes are in 3WDF, the global analysis is simplified because each abstraction corresponds to a class, and the comparison and resolution of similar abstractions is easier. However, the global analysis may also be performed immediately after abstractions have been identified. In general, the global analysis may interleave with the analysis of individual classes, as discussed in the following subsection.


6.4 Conceptual Schema Analysis Processes

Object-oriented databases have complex structures. A class may be involved in class hierarchies (is-a relationships) and may contain object references (association or aggregation relationships). Those complex structures affect the analysis of data redundancy within classes and the order in which classes are analyzed. The complex structures within classes have been analyzed by the class normalization and the global analysis discussed earlier. This subsection deals with the order in which classes are analyzed with respect to class hierarchies and object references (relationships).

The analysis of classes involved in a class hierarchy follows a top-down order. The analysis begins with the highest class(es) in the hierarchy and proceeds down to their subclasses. This top-down order is required because a superclass may be broken down into several classes as a result of the normalization, and breaking up a superclass affects the inheritance structure of its subclasses. For example, if class Inventory had subclasses, breaking Inventory down into several classes as shown in Section 5 would change what those subclasses could inherit from Inventory. Hence, a class should be analyzed only after all its superclasses have been analyzed.

The analysis of a subclass includes all inherited attributes and methods. Because of inheritance, a subclass inherits all of its superclass's dependencies and abstractions. The subclass may introduce its own attributes and methods and may override some inherited ones, which affects the identification of dependencies and abstractions in the subclass. Thus, the analysis of a subclass cannot simply assume the inherited abstractions; the inherited dependencies must be re-evaluated. Once the dependencies are determined, the abstractions of the class can be identified. The analysis of data redundancy in a subclass is then done in the same way as for other classes.

The analysis of classes containing object references follows an inside-out order. A class that refers to other objects may itself be referenced by others. For example, the class Inventory contains the attribute vendor, which refers to class Supplier; Inventory itself may in turn be referenced by another class, say Cost-Center, as the inventory of the cost center. These classes form a chain of references. The normalization of a class in the chain may affect all classes that contain references to it. For instance, breaking class Inventory into Item, Warehouse, Supplier, and Manager would affect the class Cost-Center, in which the original reference to Inventory would be changed to Item. After all the references have been updated, the dependencies in the containing class need to be re-evaluated. Thus, the analysis starts with the innermost classes that are referenced by others and moves outward to the classes that contain references to them. This process is iterated until all classes in a reference chain have been analyzed.
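The two ordering constraints, superclasses before subclasses and referenced classes before referencing classes, can both be read as "must be analyzed first" edges, so one way to obtain an analysis order is a topological sort, as in the sketch below (Python, illustrative; the graph encoding and the use of graphlib are assumptions, and circular reference chains, which graphlib rejects, would have to be broken by the designer and handled iteratively as described above).

from graphlib import TopologicalSorter  # Python 3.9+

def analysis_order(superclasses, references):
    # superclasses: {subclass: set of its direct superclasses}
    # references:   {referencing class: set of classes it references}
    # Every superclass and every referenced class is placed before the
    # classes that depend on it.
    must_come_first = {}
    for sub, supers in superclasses.items():
        must_come_first.setdefault(sub, set()).update(supers)
    for src, refs in references.items():
        must_come_first.setdefault(src, set()).update(refs)
    for group in list(must_come_first.values()):
        for cls in group:
            must_come_first.setdefault(cls, set())
    return list(TopologicalSorter(must_come_first).static_order())

# Example with hypothetical classes:
# analysis_order({"PerishableItem": {"Item"}},
#                {"Cost-Center": {"Item"}, "Item": {"Supplier", "Warehouse"}})
# places Supplier and Warehouse before Item, and Item before PerishableItem
# and Cost-Center.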

Unlike class hierarchies, object references do not cause the exchange of attributes or methods between referencing and referenced objects. An object contained in (referenced by) a class is treated as a "black box." For example, the object references in class Inventory, such as vendor and contact, are treated as ordinary attributes. Following the inside-out order outlined in the previous paragraph, the analysis of Inventory need not consider the internal, complex structures of the referenced objects (vendor and contact) because those objects have already been analyzed before the analysis reaches Inventory. Thus, the analysis of a containing class can treat the referenced objects as "black boxes" without restricting or limiting the analysis of data redundancy.

Figure 11 summarizes the proposed steps for analyzing a conceptual schema. These steps do not imply a sequential order among the four major steps; the major steps may be performed in parallel or in any order. Furthermore, the steps are not a waterfall process: several iterations may be required, depending on the complexity of the conceptual schema being analyzed.

7. Conclusion

We presented an analysis method to address the data redundancy problem in object-oriented databases. The contribution of this method is to extend relation normalization to the analysis of object-oriented schemas. We expanded the concept of dependency to the a-dependency, which covers complex objects, and introduced the m-dependency to describe the relationship between attributes and methods. Based on these dependencies, we proposed the concept of abstraction to identify the semantic clusters within a class. These concepts help formalize the analysis of data redundancy and reveal the redundancy.

1: Analysis of class hierarchies
   1.1. Identify class hierarchies
   1.2. Find the highest superclasses that have not been analyzed
   1.3. Perform data redundancy analysis on those classes
   1.4. Normalize the classes that contain data redundancy
   1.5. Repeat steps 1.2, 1.3, and 1.4 until all classes in the hierarchy have been analyzed

2: Analysis of object references
   2.1. Form object reference chains
   2.2. Find the innermost referenced classes that have not been analyzed
   2.3. Perform data redundancy analysis on those classes
   2.4. Normalize the classes that contain data redundancy
   2.5. Repeat steps 2.2, 2.3, and 2.4 until all classes in the reference chain have been analyzed

3: Analysis of other classes
   3.1. Perform data redundancy analysis on each class that is not involved in any class hierarchy or object reference chain
   3.2. Normalize the classes that contain data redundancy

4: Global analysis of classes
   4.1. Identify and resolve attribute naming conflicts
   4.2. Identify and resolve method naming conflicts
   4.3. Determine the degree of similarity of abstractions
   4.4. Remove data redundancy in the non-disjoint abstractions

Figure 11: Data Redundancy Analysis Steps

Traditionally, the normalization process is limited to the redundancy within relations. We extended the process to include the analysis of data redundancy at the conceptual schema level and to detect redundancy between classes. Because data redundancy is very harmful to data integrity, this research presents a systematic study of the data redundancy problem in object-oriented databases.

To validate the method, we implemented a prototype as a "proof of concept" to test the analysis process and the set of rules. The prototype was implemented using Kappa [KAPPA], a knowledge-based application development system. Test examples were selected from both student projects and textbooks on databases and on systems analysis and design. The database schemas in student projects that contained redundant data were used to test whether the method can discover the redundancy and provide a meaningful diagnosis. The examples selected from several well-known textbooks were used to ensure that the method does not raise false alarms. The prototype was reported in [Hong 93]. This earlier research helped refine the method and the rules presented in this paper.

However, the proposed method, in its current form, is descriptive: it can aid the designer in the analysis of data redundancy, but it cannot completely automate the analysis process. Several key concepts assume the involvement of the designer. For instance, the identification of a-dependencies depends on the designer's judgment. Similarly, the identification and resolution of naming conflicts in the global data redundancy analysis rely on the designer. Completely automating the analysis of the semantics of conceptual schemas is a very difficult research challenge, and it may be impossible to eliminate human involvement in the analysis altogether. Nevertheless, there is much room to improve the analysis method. Future research will construct a rigorous theory for the proposed method and develop algorithms for automating as much of the analysis as feasible.

On the application side, we plan to incorporate the method and the rules into a CASE/database-design tool. A proposed architecture of such a tool is depicted in Figure 12. Because of the wide availability of database design tools in both academic laboratories and the software industry (e.g., [Bragger 85], [Byrce 86], [Deux 91], [Ellis 91], [Troger 89], GemStone [GemStone], ObjectStore [ObjectStore], ONTOS [Ontos]), we plan to implement the proposed method as a component of an existing database design tool. This component will take over once the user completes the design of the conceptual schema of an object-oriented database; it will report the data redundancy in the schema back to the user and assist the user in reducing it.

[Figure 12 is a diagram: the object-oriented conceptual schema definitions are fed into a data redundancy analysis component, driven by a rule base, which reports the detected data redundancy and suggestions back to the designer.]

Figure 12: Architecture of a Database Design Tool


8. References

[Andonoff 92] Andonoff, E., "Normalization of Object-Oriented Conceptual Schema," Proceedings of CAiSE, pp. 349-462, 1992.

[Armstrong 74] Armstrong, W.W., "Dependency Structures of Data Base Relationships," Proceedings of the IFIP Congress, Vol. 14, No. 2, pp. 245-286, 1974.

[Atkinson 89] Atkinson, M., et al., "The Object-Oriented Database System Manifesto," Proceedings of the 1st Conference on Deductive and Object-Oriented Databases, Dec. 1989.

[Bancilhon 94] Bancilhon, F. and Ferran, G., "ODMG-93: The Object Database Standard," Bulletin of the Technical Committee on Data Engineering, Vol. 17, No. 4, December 1994, pp. 3-14.

[Batini 92] Batini, C., Ceri, S., and Navathe, S.B., Conceptual Database Design: An Entity-Relationship Approach, The Benjamin/Cummings Publishing Company, Inc., 1992.

[Batini 86] Batini, C., Lenzerini, M., and Navathe, S.B., "A Comparative Analysis of Methodologies for Database Schema Integration," ACM Computing Surveys, Vol. 18, No. 2, December 1986, pp. 323-364.

[Beeri 77] Beeri, C., Fagin, R., and Howard, J., "A Complete Axiomatization for Functional and Multivalued Dependencies in Database Relations," Proceedings of ACM SIGMOD, August 1977, pp. 47-63.

[Beeri 78] Beeri, C., Bernstein, P.A., and Goodman, N., "A Sophisticate's Introduction to Database Normalization Theory," in Proceedings of the Fourth International Conference on Very Large Data Bases, Berlin, 1978.

[Bernstein 76] Bernstein, P., "Synthesizing Third Normal Form Relations from Functional Dependencies," ACM Transactions on Database Systems, Vol. 1, No. 4, 1976.

[Booch 94] Booch, G., Object-Oriented Analysis and Design with Applications, Second Edition, The Benjamin/Cummings Publishing Company, Inc., 1994.

[Bragger 85] Bragger, R.P., et al., "Gambit: An Interactive Database Design Tool for Data Structures, Integrity Constraints, and Transactions," IEEE Transactions on Software Engineering, Vol. SE-11, No. 7, July 1985, pp. 574-583.

[Brown 83] Brown, R. and Parker, D.S., "LAURA: A Formal Data Model and Her Logical Design Methodology," in Proceedings of the Ninth International Conference on Very Large Data Bases, Florence, 1983.

[Byrce 86] Byrce, D. and Hull, R., "SNAP: A Graphics-Based Schema Manager," Proceedings of the IEEE International Conference on Data Engineering, Feb. 1986, pp. 151-164.

[CADF 90] The Committee for Advanced DBMS Function, "Third-Generation Database System Manifesto," ACM SIGMOD Record, Vol. 19, No. 3, September 1990, pp. 31-44.

[Cattell 93] Cattell, R., Ed., The Object Database Standard: ODMG-93, Morgan Kaufmann Publishers, San Mateo, California, 1993.

[Codd 70] Codd, E.F., "A Relational Model of Data for Large Shared Data Banks," Communications of the ACM, Vol. 13, No. 6, June 1970, pp. 377-387.

[Codd 72] Codd, E.F., "Further Normalization of the Data Base Relational Model," in Data Base Systems (Courant Computer Science Symposium 6), Prentice-Hall, 1972.


[Deux 91] Deux, O., et al., "The O2 System," Communications of the ACM, Vol. 34, No. 10, Oct. 1991, pp. 35-48.

[Embley 88] Embley, D. and Ling, T.W., "Synergistic Database Design with an Extended Entity-Relationship Model," in Proceedings of the Eighth International Conference on Entity-Relationship Approach, Toronto, 1988.

[Ellis 91] Ellis, H.C. and Demurjian, S.A., "ADAM: A Graphical, Object-Oriented Database Design Tool and Code Generator," in Proceedings of the ACM Computer Science Conference, 1991.

[Fagin 77] Fagin, R., "The Decomposition versus the Synthetic Approach to Relational Database Design," in Proceedings of the Third International Conference on Very Large Data Bases, Tokyo, 1977.

[GemStone] GemStone Documentation, Servio Corp., Beaverton, OR.

[Gotthard 92] Gotthard, W., Lockemann, P.C., and Neufeld, A., "System-Guided View Integration for Object-Oriented Databases," IEEE Transactions on Knowledge and Data Engineering, Vol. 4, No. 1, Feb. 1992, pp. 1-22.

[Hong 91] Hong, S., "A Class Normalization Approach to the Design of Object-Oriented Databases," in Proceedings of Technology of Object-Oriented Languages and Systems (TOOLS U.S.), Prentice Hall, July 1991, pp. 63-71.

[Hong 92] Hong, S., Duhl, J., and Craig, H., "DBDesigner: A Tool for Object-Oriented Database Applications," Journal of Database Administration (now Journal of Database Management), Vol. 3, Summer Issue, 1992, pp. 3-11.

[Hong 93] Hong, S. and Kumar, K., "An Approach for View-Integration of Object-Oriented Models: Integrating Object-Behavior," in Proceedings of the Beijing International Conference of Young Computer Scientists, Summer 1993, pp. 8.63-8.67.

[KAPPA] Kappa User's Guide, IntelliCorp, Inc.

[Kent 83] Kent, W., "A Simple Guide to Five Normal Forms in Relational Database Theory," Communications of the ACM, Vol. 26, No. 2, Feb. 1983, pp. 120-125.

[Kim 90] Kim, W., Introduction to Object-Oriented Databases, MIT Press, 1990.

[Kim 87] Kim, W., et al., "Composite Object Support in an Object-Oriented Database System for Engineering Applications," Proceedings of OOPSLA '87, Oct. 1987, pp. 118-125.

[Kim 88] Kim, W., et al., "Integrating an Object-Oriented Programming System with a Database System," Proceedings of OOPSLA '88, Sept. 1988, pp. 142-152.

[Kim 94] Kim, W., "Observations on the ODMG-93 Proposal," ACM SIGMOD Record, March 1994, pp. 4-9.

[Ling 85] Ling, T.W., "A Normal Form for Entity-Relationship Diagrams," in Proceedings of the IEEE International Conference on Data Engineering, 1985, pp. 24-35.

[Maier 83] Maier, D., The Theory of Relational Databases, Computer Science Press, 1983.

[Makinouchi 77] Makinouchi, A., "A Consideration on Normal Form of Not-Necessarily-Normalized Relations in the Relational Data Model," in Proceedings of the Conference on Very Large Data Bases, Tokyo, 1977, pp. 447-453.

[ObjectStore] ObjectStore Documentation, Object Design Inc., Burlington, MA.


[Ontos] Ontos System Documentation, Ontos Inc., Burlington, MA.

[Ozsoyoglu 87] Ozsoyoglu, Z.M. and Yuan, L.Y., "A New Normal Form for Nested Relations," ACM Transactions on Database Systems, Vol. 12, No. 1, 1987, pp. 111-136.

[Ozsoyoglu 89] Ozsoyoglu, Z.M. and Yuan, L.Y., "On the Normalization in Nested Relational Databases," in Nested Relations and Complex Objects in Databases, Lecture Notes in Computer Science, S. Abiteboul, P.C. Fischer, and H.-J. Schek, Eds., Springer-Verlag, 1989, pp. 243-271.

[Rumbaugh 91] Rumbaugh, J., et al., Object-Oriented Modeling and Design, Prentice Hall, 1991.

[Troger 89] Troger, O.D., "RIDL: A Tool for the Computer Aided Engineering of Large Databases in the Presence of Integrity Constraints," in Proceedings of ACM SIGMOD, June 1989, pp. 418-429.

[Ullman 88] Ullman, J.D., Principles of Database and Knowledge-Base Systems, Volumes 1 & 2, Computer Science Press, 1988.

[Zdonik 90] Zdonik, S.B. and Maier, D., "Fundamentals of Object-Oriented Databases," in Readings in Object-Oriented Database Systems, S. Zdonik and D. Maier, Eds., Morgan Kaufmann Publishers, Inc., 1990.