sql unit 17 normalization

221
SQL Unit 17 Normalization prepared by Kirk Scott 1

Upload: grant-fernandez

Post on 02-Jan-2016

39 views

Category:

Documents


2 download

DESCRIPTION

SQL Unit 17 Normalization. prepared by Kirk Scott. 1. Normal Forms 2. First Normal Form 3. Second Normal Form 4. Third Normal Form 5. Boyce- Codd Normal Form 6. Fourth Normal Form 7. Fifth Normal Form 8. Domain Key Normal Form 9. Nulls and Integrity. 1. Normal Forms. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: SQL Unit  17 Normalization

1

SQL Unit 17Normalization

prepared by Kirk Scott

Page 2: SQL Unit  17 Normalization

2

• 1. Normal Forms• 2. First Normal Form• 3. Second Normal Form• 4. Third Normal Form• 5. Boyce-Codd Normal Form• 6. Fourth Normal Form• 7. Fifth Normal Form• 8. Domain Key Normal Form• 9. Nulls and Integrity

Page 3: SQL Unit  17 Normalization

3

1. Normal Forms

• The benefits of relational database theory can be summarized as follows:

• There is a step-by-step way of arriving at a correct design

• There is a way of detecting flaws in a design• The design process has to do with the problem

domain, not with computer-related questions

Page 4: SQL Unit  17 Normalization

4

• The database designer and user are protected from questions related to the implementation of the dbms and the hardware it’s running on

• Finally, if the design is correct, it will be possible to:

• Store all desired information in it;• Update the information on an ongoing basis;• Retrieve any/all of the information as needed.

Page 5: SQL Unit  17 Normalization

5

• Correct designs are based on what are called normal forms.

• This section presents the background information to the design process.

• It also discusses and illustrates the use of normal forms.

Page 6: SQL Unit  17 Normalization

6

Identifying Entities

• At its most basic level, design of a database depends on determining what you want to store information about.

• When deciding what the base tables will be, you are trying to identify entities.

• From a language point of view, this involves identifying nouns which do not modify other things.

Page 7: SQL Unit  17 Normalization

7

Identifying Attributes

• Identifying the entities leads to identifying their attributes.

• Attribute names usually end up being nouns too, but you figure out what they are when you try to describe entities, and the descriptions usually involve adjectives.

• One of the key points of database design is that you only store information about the entities and attributes you need to.

Page 8: SQL Unit  17 Normalization

8

• There may be many possible entities • All entities may have a long list of potential

attributes• But you limit yourself to only those things you

will need to retrieve information about in the future.

Page 9: SQL Unit  17 Normalization

9

Identifying Keys

• You are familiar with primary keys and foreign keys. • When trying to organize the attributes around

entities in the design, the idea is to equate an entity with a primary key field

• Then group the attributes with the entities that they describe.

• Relationships between tables are captured by embedding the primary keys of one or more tables as foreign keys in other tables.

Page 10: SQL Unit  17 Normalization

10

Functions, Determination, and Dependency

• When described in general, the foregoing sounds sensible enough.

• That’s why the book claims that if you can model successfully, the result will be a correct design

• In practice it can be difficult to do without formal guidelines.

• This is what the normal forms provide.

Page 11: SQL Unit  17 Normalization

11

• The normal forms are based on and described in terms of an idea taken from math.

• One field in a table may functionally determine another.

• Stated in reverse order: The other field depends functionally on the one.

Page 12: SQL Unit  17 Normalization

12

• This is an example of a mathematical function:• y = f(x), for example, y = x2

• y is a function of x. • x is in the domain and y is in the range. • x functionally determines y• Or, y functionally depends on x.

Page 13: SQL Unit  17 Normalization

13

• For a mathematical function, you find the dependent value by doing some sort of computation on the determining value.

• The key point underlying a function is the following:

• For each value of x, there can only be one corresponding value of y.

• x uniquely determines y.

Page 14: SQL Unit  17 Normalization

14

• The analogy in database design is the following: • The primary key of a table should functionally

determine the values of the other fields in the table.

• In other words, the non-key fields should functionally depend on the primary key field.

• Just as the primary key uniquely identifies a record, it uniquely determines the values of the fields in the record

Page 15: SQL Unit  17 Normalization

15

• Take this small table for example:• This is its schema:• Person(SSN, name, dob)

SSN Name dob123-45-6789 Bob 1/1/01…

Page 16: SQL Unit  17 Normalization

16

• You don’t find a person’s name or birthdate by doing a computation on their social security number.

• However, given any one social security number, there is exactly one corresponding name and exactly one corresponding date of birth.

• It is true that different people with different social security numbers may have the same name and the same date of birth, but this is not a problem.

Page 17: SQL Unit  17 Normalization

17

• The point of the primary key field is that it is the unique identifier that makes it possible to distinguish between these two people.

• This idea came up at the beginning of the course

• The point now is that the name and date of birth fields functionally depend on the social security number field.

Page 18: SQL Unit  17 Normalization

18

• A new notation can be used to indicate this. • In this notation, the arrows go from the field

that functionally determines another field, to the field that is dependent.

• This is illustrated on the next overhead.

Page 19: SQL Unit  17 Normalization

19

Page 20: SQL Unit  17 Normalization

20

Normal Forms

• Some of the normal forms are identified by number, for example 1st, 2nd, and 3rd normal forms.

• Others are identified by name, for example Boyce-Codd normal form, named after the people who discovered it.

• These four normal forms are abbreviated 1NF, 2NF, 3NF, and BCNF, respectively.

• There are also higher normal forms, 4th, 5th, and domain key normal forms (4NF, 5NF, DKNF).

Page 21: SQL Unit  17 Normalization

21

• The normal forms have to do with finding dependencies in tables which spring from fields other than the primary key.

• These dependencies are undesirable and may be referred to as stray dependencies.

• The normal forms make increasingly strict statements about the kinds of stray dependencies that have to be eliminated from correctly designed tables.

• Designs containing stray dependencies are said to violate the normal forms.

Page 22: SQL Unit  17 Normalization

22

Eliminating Dependencies

• The design process using normal forms consists of repetitive steps:

• Make a design• Identify stray dependencies (normal form violations)• Redesign to eliminate the dependencies• Once you’ve eliminated all occurrences of one type

of violation, you will have promoted the design into the next higher normal form

• Repeat until you’ve reached the highest normal form

Page 23: SQL Unit  17 Normalization

23

• The rule of thumb at every stage is to remove stray dependencies in the following way:

• Make any field which determines other fields the primary key of a new table, and move the fields that depend on that field to the new table.

• Make sure that the new table is connected to the old table by a primary key, foreign key pair.

Page 24: SQL Unit  17 Normalization

24

Anomalies

• Design problems that are based on violations of normal forms lead to what are called anomalies.

• The hallmark of a problematic design is that the same information is stored multiple times.

• In other words, there is redundancy in the database.

• Depending on the nature of the redundancy, this can lead to problems when inserting data, when updating data, and when deleting data.

Page 25: SQL Unit  17 Normalization

25

Justifying Normal Forms

• The use of normal forms may seem unnecessarily theoretical at first.

• However, they provide a convenient way of identifying problems in designs and then eliminating them.

• Normal forms are what justify these claims about relation databases:– There is a step-by-step way of arriving at a correct

design– There is a way of detecting flaws in a design

Page 26: SQL Unit  17 Normalization

26

The Plan of Action for the Following Sections

• Each of the following sections will present a normal form in this way:

• A definition of the normal form will be given. • A scenario for information to be held in a

database will be given, with the underlying assumptions given.

• An example database design which violates the normal form will be given

Page 27: SQL Unit  17 Normalization

27

• The violation will be shown using a diagram with the notation indicating functional dependencies.

• The desired functional dependencies from the primary key will be shown using arrows below the field names.

• Undesired, stray dependencies, which need to be eliminated in order to correct the design, will be shown using arrows above the field names.

Page 28: SQL Unit  17 Normalization

28

• Anomalies resulting from the incorrect design will be discussed.

• In general, there will be insert, update, and delete anomalies

• Finally, a corrected design will be given.

Page 29: SQL Unit  17 Normalization

29

Basis for Examples

• All of the examples will be based on the general topic of cars, salespeople, customers, and car sales.

• Some of the field names are abbreviated, and some of the fields clearly belong together in some way.

• Here is a little preliminary explanation regarding the fields that will be in the examples.

• Not all of the fields will appear in all of the examples.

Page 30: SQL Unit  17 Normalization

30

• vin: vehicle identification number. Vehicles have makes, models, and years.

• spno, spname: Salesperson number and name.

• custno, custname: Customer number and name.

• A car sale has a salesprice and a date.

Page 31: SQL Unit  17 Normalization

31

2. First Normal Form

• 1NF Definition:• Formally (Watson):• A relation is in first normal form if and only if

all columns are single-valued.• Informally: • Data is stored in flat files; there can be no

repeating groups in a record. • (This was mentioned in the very first unit.)

Page 32: SQL Unit  17 Normalization

32

• The assumptions underlying the design are that a salesperson can sell many cars, but each car can only be sold by one salesperson.

• In this design, each car is only sold once, so the design captures information about the sales of new cars.

• These assumptions don’t cause the problem.• It is the implementation of them that causes

the problem.

Page 33: SQL Unit  17 Normalization

33

• Here is the design that violates 1NF: • Carsale(spno, spname, {vin, salesprice})

• The example design uses {} notation to indicate repeating groups of fields.

• A diagram with arrows illustrating this design is given on the next overhead

Page 34: SQL Unit  17 Normalization

34

Page 35: SQL Unit  17 Normalization

35

• In general, normal form violations have insert, update, and delete anomalies.

• It would be possible to consider such problems with a 1NF violation

• However, the general form of anomalies is much clearer if they’re introduced with 2NF and higher normal form violations

• The repeating group alone is a sufficient problem to make this kind of design incorrect and anomalies will not be discussed in this first case

Page 36: SQL Unit  17 Normalization

36

• The solution to all basic normal form violations is the same:

• Break out the stray dependency out into a separate table

• In this case, break out the information contained in the repeating group

Page 37: SQL Unit  17 Normalization

37

• As stated in the assumptions, one salesperson can sell many cars, but each car is sold only once, so there is a 1-m relationship between the two tables in the resulting design.

• The primary key of the table containing salesperson information will have to be embedded as a foreign key in the table containing car information.

Page 38: SQL Unit  17 Normalization

38

• Here is the corrected design:• Salesperson(spno, spname)• Carsale(vin, salesprice, spno f.k.)

Page 39: SQL Unit  17 Normalization

39

3. Second Normal Form

• 2NF Definition:• Formally (Watson):• A relation is in second normal form if an only if

it is in first normal form, and all nonkey columns are dependent on the key

• Informally: • In a table with a concatenated primary key

field, there can be no stray dependencies that originate in just part of the primary key field.

Page 40: SQL Unit  17 Normalization

40

• The basic idea is that all nonkey fields have to depend on the whole key.

• Stating it in this way will lead to a useful mnemonic device which will be given later.

• When you lay it out in this way, you begin to realize that 2NF deals with tables that have concatenated key fields, where a dependency from only one field of the key might be possible.

Page 41: SQL Unit  17 Normalization

41

• In this example the underlying assumptions are that the same car can come back to the lot and be sold more than once.

• It can be sold by the same salesperson more than once, but not on the same day.

• It can also be sold by different salespeople at different times.

• Although unlikely, the design is made so that two different salespeople could sell the same car on the same date.

Page 42: SQL Unit  17 Normalization

42

• The design doesn’t contain any information about customers, but the scenario would be that one customer brought the car back, and a different salesperson sold it again.

• It seems unlikely that the same customer would buy the same car twice, whether on the same date or different dates.

Page 43: SQL Unit  17 Normalization

43

• In summary, this design works for used car sales and both the date and the salesperson information are needed, along with the car information, to distinguish between different sales.

Page 44: SQL Unit  17 Normalization

44

• Here is the design that violates 2NF:• • Carsale(vin, spno, date, spname)

• A diagram with arrows illustrating this is given on the next overhead

Page 45: SQL Unit  17 Normalization

45

Page 46: SQL Unit  17 Normalization

46

• This faulty design has insert, update, and delete anomalies.

• Suppose a salesperson has not yet sold a car. • In this case, it is not possible to insert information about

that salesperson. • On the other hand, a salesperson may make many sales. • This means that the same information about that

salesperson would be stored in more than one record in the table.

• This is redundancy.

Page 47: SQL Unit  17 Normalization

47

• Not only is the redundancy itself wasteful, it leads to the update anomaly.

• Suppose the salesperson’s name changes. • Then it’s necessary to update multiple records

to reflect this fact, not just one.

Page 48: SQL Unit  17 Normalization

48

• The delete anomaly is related to the insert anomaly.

• Suppose that as part of the maintenance of the database, on a yearly basis the sales table is cleared.

• When you delete the last record containing a sale by a particular salesperson, you not only get rid of the sales record, you also lose the salesperson’s name.

Page 49: SQL Unit  17 Normalization

49

• As usual, the solution to the problem is to break the stray dependency out into a table of its own.

• Each car sale has only one salesperson, but each salesperson can be involved in many sales, so this is a 1-m many relationship.

• The salesperson information is stored in a table by itself, and the primary key of the salesperson table is embedded as a foreign key in the car sale table.

Page 50: SQL Unit  17 Normalization

50

• Here is the corrected design:

• Salesperson(spno, spname)• Carsale(vin, date, spno f.k.)

Page 51: SQL Unit  17 Normalization

51

4. Third Normal Form

• 3NF Definition:• Formally (Watson):• A relation is in third normal form if an only if it

is in second normal form and has no transitive dependencies

• Informally: • There can be no stray dependencies from one

non-key field to another.

Page 52: SQL Unit  17 Normalization

52

• In this example, for the sake of simplicity, it is assumed that new cars are being sold and they can only be sold once.

• Information about the customer is also recorded with the sale.

• Each car can only be bought by one customer. • It would be possible for a customer to buy

more than one car.

Page 53: SQL Unit  17 Normalization

53

• Here is the design that violates 3NF:

• Carsale(vin, custno, custname, salesprice, date)

• A diagram with arrows illustrating this is given on the next overhead

Page 54: SQL Unit  17 Normalization

54

Page 55: SQL Unit  17 Normalization

55

• This design also has insert, update, and delete anomalies and the pattern of the anomalies is the same as in the previous example.

• They all stem from the presence of the stray dependency in the design.

• If you have a potential customer who has not yet bought a car, it is impossible to insert information about that person.

Page 56: SQL Unit  17 Normalization

56

• If a customer has bought more than one car, the customer information is stored redundantly.

• In that case, if the customer’s name changes, it’s necessary to change multiple records.

• Finally, if the sales table is cleared on a regular basis, when you delete the last sales record for a given customer, you not only get rid of the sales record, you also lose the customer’s name.

Page 57: SQL Unit  17 Normalization

57

• There is a situation that can arise in database designs that appears to be a violation of 3NF, but isn’t.

• The most common example of this situation is a table which includes a city, state, and zip code as part of an address.

• The postal service has divided up the country into zones which are identified by zip codes.

Page 58: SQL Unit  17 Normalization

58

• None of these zones cross city or state boundaries.

• That means that a zip code, a non key field, determines the city and state.

• It is not necessary for someone to break this dependency out of their database design.

• The rule of thumb is that if you are not responsible for maintaining the dependency, then you can ignore it.

Page 59: SQL Unit  17 Normalization

59

• The postal service has a table somewhere with zip code as the primary key and all of the descriptive fields about zip code that exist.

• The post office maintains this. • A table not maintained by the postal service

can contain addresses with zip codes and completely ignore the fact that there may in reality be a dependency.

Page 60: SQL Unit  17 Normalization

60

5. Boyce-Codd Normal Form

• BCNF Definition:• Formally (Watson):• A relation is in Boyce-Codd normal form if and

only if every determinant is a candidate key• Informally: • There can be no stray dependencies from a

non-key field to a field in the key

Page 61: SQL Unit  17 Normalization

61

• Notice that the formal definition of BCNF differs from that of the foregoing definitions

• If does not include the phrase “in 3NF and…” • BCNF is actually a summation of 1NF through

3NF which covers one other case which is not covered by the previous normal forms.

• BCNF will be explained by presenting an example which includes this additional case

Page 62: SQL Unit  17 Normalization

62

• For the purposes of this example suppose that the same car can be sold by the same salesperson more than once, but only one sale of that car is possible per date.

• Suppose also that this dealership has a system for assigning prospective customers to specific salespeople, so that each salesperson is associated with an exclusive list of clients.

Page 63: SQL Unit  17 Normalization

63

• It would be normal to assume that this system is implemented in some sort of table.

• Such a table is not shown here—it will become part of the solution to the problem.

• The point now is to show the problem that this assumption leads to in the table of interest in the original, incorrect design.

Page 64: SQL Unit  17 Normalization

64

• Here is the design which violates BCNF:

• Carsale(vin, spno, date, custno)

• A diagram with arrows illustrating this is given on the next overhead

Page 65: SQL Unit  17 Normalization

65

Page 66: SQL Unit  17 Normalization

66

• It is obvious that there is a stray dependency in the design.

• Technically we haven’t seen this kind of dependency before, because it goes into the primary key.

• If you studied the previous normal form definitions, you would discover that they don’t cover this case.

Page 67: SQL Unit  17 Normalization

67

• The anomalies in this design are analogous to the anomalies in the previous designs.

• It is not possible to insert information about the relationship between a given customer and salesperson without a sales record which matches them.

• If the given customer has bought from the same salesperson many times, their relationship is in multiple records.

Page 68: SQL Unit  17 Normalization

68

• An update would require changes in multiple records.

• Finally, if you’re down to the last record containing information about a particular salesperson-customer pairing, deleting the record would cause the information to be lost.

• As usual, the solution to the problem is to break out the stray dependency in a separate table.

Page 69: SQL Unit  17 Normalization

69

• Here is the corrected design:

• Carsale(vin, date, custno f.k.)• Customer-Salesperson(custno, spno)

• Which p.k. should be embedded as a f.k. is clear• Under the assumptions of the example, one

customer can buy many cars, but not vice-versa

Page 70: SQL Unit  17 Normalization

70

• The original design was intuitive, but wrong• It recorded information linking salespeople

directly to their sales, but this turned out to be problematic

• The new design is correct and solves the problem

• However, it is not necessarily 100% intuitive

Page 71: SQL Unit  17 Normalization

71

• Customers are uniquely associated with a single salesperson

• The car sale record now tells you who bought the car• If you want to know who the salesperson was for a

given sale, you have to do a join with the Customer-Salesperson table.

• To a user who doesn’t understand normalization and the assumptions underlying the database, this will seem like an unnecessary complication

• The “one table database” mindset is a hard one to break

Page 72: SQL Unit  17 Normalization

72

What is Allowed under BCNF

• There is another aspect of BCNF that needs to be explained.

• Consider a design which includes both a university-generated student id number and a social security number.

• It would seem to violate BCNF as explained above:

Page 73: SQL Unit  17 Normalization

73

• Here is the design which seems to violate BCNF:

• Student(studentIDno, SSN, name)

• A diagram with arrows illustrating this is given on the next overhead

Page 74: SQL Unit  17 Normalization

74

Page 75: SQL Unit  17 Normalization

75

• The additional part of BCNF is that if the stray dependency results from another field which also could have been chosen as a primary key for the table, then it is not a normal form violation.

• This is what is meant by the formal definition: • A relation is in Boyce-Codd normal form if and

only if every determinant is a candidate key

Page 76: SQL Unit  17 Normalization

76

• In the example, both studentIDno and SSN are valid, unique identifiers of students.

• You might want to record both. • It is simply necessary to choose one of them

as the primary key of the field. • The presence of the other one in the table

does no harm.

Page 77: SQL Unit  17 Normalization

77

A Summary of the Normal Forms up through BCNF

• Up through BCNF the normal forms can be explained in terms of stray dependencies.

• An easy way to remember the requirements for these normal forms is the following statement:

• Every field in a table has to depend on the key, the whole key, and nothing but the key.

Page 78: SQL Unit  17 Normalization

78

• Because they are increasingly strict, the normal forms can be thought of as nested.

• When checking a design, you begin with the lowest normal form, make sure there are no violations, and move on to the following ones.

• This is what makes the design process step-by-step.

Page 79: SQL Unit  17 Normalization

79

• This idea can be represented using a Venn diagram.

• The idea is that the set of designs which is in some normal form is always a subset of those designs which meet the conditions for a lower normal form.

• A diagram of this is shown on the following overhead

Page 80: SQL Unit  17 Normalization

80

1NF

2NF

3NF

BCNF…

Page 81: SQL Unit  17 Normalization

81

6. Fourth Normal Form

• It was claimed earlier that there are only three kinds of relationships: 1-1, 1-m, and m-n.

• This is not entirely true. • There may be many-to-many-to-many

relationships (relationships between 3 different types of entities at the same time, m-m-m),

Page 82: SQL Unit  17 Normalization

82

• In theory there is no reason why there can’t be relationships among 4 or more different types of entities at the same time.

• Fourth and fifth normal form, 4NF and 5NF, have to do with cases like these.

Page 83: SQL Unit  17 Normalization

83

• 4NF and 5NF have been previewed earlier by saying they are related to the question of a cyclical design vs. a star design

• This is still true.• Observe that there are essentially two options

for capturing a m-m-m relationship: either a cycle or a star.

Page 84: SQL Unit  17 Normalization

84

• The higher normal forms are not strictly related to the question of stray dependencies

• The presentation of them will not follow exactly the same plan as the lower normal forms.

• However, the same basic ideas will be covered: A definition plus examples and explanations of how to do it right and what can go wrong

Page 85: SQL Unit  17 Normalization

85

Fourth Normal Form

• 4NF Definition:• Formally (Watson):• A relation is in fourth normal form if it is in

Boyce-Codd normal form and all multi-valued dependencies on the relation are functional dependencies.

• Informally: • An informal definition will have to wait…

Page 86: SQL Unit  17 Normalization

86

• The formal definition illustrates why the presentation of the higher normal forms will differ from that of the lower ones.

• It is not possible to understand the formal definition or give a meaningful informal definition without further explanation.

Page 87: SQL Unit  17 Normalization

87

• There are three things to observe about the definition:

• 1. It goes back to the model of, “in the previous normal form and…”

• The definition of 4NF depends on the definition of BCNF plus another condition

Page 88: SQL Unit  17 Normalization

88

• 2. On the other hand, it is worded “if” rather than “if and only if”

• I don’t know whether the author was daydreaming or whether this change is theoretically important.

• In any case, it is of no practical importance• We can safely consider the two halves of the

proposition to be logically equivalent

Page 89: SQL Unit  17 Normalization

89

• 3. The most important thing in the definition requiring explanation is the introduction of a new concept: multivalued dependencies.

• The new concept is actually “independent multivalued dependencies”.

• It’s hard to understand a definition when it’s based on concepts that haven’t been explained yet.

Page 90: SQL Unit  17 Normalization

90

• The rest of this section will be organized in this way:

• 1. A correct example of a m-m-m relationship implemented in a star design will be given.

• 2. Then the problems that underlie the definition of the normal form will be examined.

Page 91: SQL Unit  17 Normalization

91

Correct Example

• Suppose that a given car can be sold more than one time.

• In other words, you’re dealing in used cars. • Suppose also that salespeople can sell more

than one different car, and customers can buy more than one different car.

• This means that there are three 1-m relationships: Car to Sale, Salesperson to Sale, and Customer to Sale.

Page 92: SQL Unit  17 Normalization

92

• For the three base tables, Car, Salesperson, and Customer, there could be one table in the middle, Carsale, which brought all three together.

• The idea can be represented using ER modeling. • This results in the star shaped design shown on

the next overhead.• It is reminiscent of the design of the example

database.

Page 93: SQL Unit  17 Normalization

93

Car

CarsaleSales-person

Customer

Page 94: SQL Unit  17 Normalization

94

• The relationships are captured by embedding primary keys as foreign keys, and a valid design can be given as follows:

• Car(vin, make, model, year)• Salesperson(spno, spname)• Customer(custno, custname)• Carsale(vin, spno, custno, date)

Page 95: SQL Unit  17 Normalization

95

What Can Go Wrong?

• New assumptions are necessary in order to create an example with 4NF problems.

• The oddness of the assumptions will probably already suggest that the results will be problematic.

• 1. Assume that it is desirable to keep a record of all prospective customers salespeople have talked to.

• Under this assumption, also assume that a given customer only deals with one salesperson.

Page 96: SQL Unit  17 Normalization

96

• 2. Assume also that it’s desirable to keep a record of all cars salespeople have sold.

• Once the car is sold, it’s not important what customer it was sold to.

• All that’s of interest is the salesprice, for example, so that commissions can be calculated.

• 3. Finally, assume that all of this data is to be stored in one table.

Page 97: SQL Unit  17 Normalization

97

• This set of assumptions captures the elements of the definition of 4NF.

• The relationship between salespeople and customers is a multivalued dependency.

• In other words, it’s a one-to-many relationship, not a functional relationship.

• If you like notation, it can be indicated with a double-headed arrow:

• Salesperson—>>Customer

Page 98: SQL Unit  17 Normalization

98

• A table capturing this relationship alone would have two columns, one for salesperson and one for customer.

• Each row would consist of a pairing.• In the table overall, a given salesperson could

appear more than once.

Page 99: SQL Unit  17 Normalization

99

• The relationship between salespeople and cars is also a multivalued dependency.

• It’s a one-to-many relationship, not a functional relationship.

• Salesperson—>>Car• It too, could be captured in a single table that

matched pairs, with salespeople potentially appearing more than once.

Page 100: SQL Unit  17 Normalization

100

• The two multivalued dependencies are independent of each other.

• As spelled out in the assumptions, we’re not keeping a record a salesperson, a car, and the customer that it was sold to.

• We’re keeping track of cars sold, and independently, prospective customers that have been talked to.

Page 101: SQL Unit  17 Normalization

101

• Finally, the kicker assumption is that both of these independent multivalued dependencies are to be capture in a single table.

• As someone with background in database modeling, you doubtless recognize that there are three entities in the scenario and three tables will be desirable in a solution.

• Just hang on; we’re not there yet.

Page 102: SQL Unit  17 Normalization

102

• An example table is given on the next overhead.

• This will be followed by explanations.

Page 103: SQL Unit  17 Normalization

103

Salesperson Customer Car

Bob Jay Mustang

Bob Jay Camaro

Bob Kay Mustang

Bob Kay Camaro

Sue Ann Corolla

Sue Ann Civic

Sue Jim Corolla

Sue Jim Civic

Page 104: SQL Unit  17 Normalization

104

• Salesperson multidetermines the other fields.• In other words, there is no functional

dependency from salesperson to the other fields.

• That means that salesperson alone isn’t the key.

Page 105: SQL Unit  17 Normalization

105

• The multivalued dependencies are independent of each other.

• There is no relationship between customers and cars.

• Therefore, those fields don’t determine each other.

• Neither one can serve independently as a key.

Page 106: SQL Unit  17 Normalization

106

• The conclusion is that the table is all key.• Furthermore, there are no dependencies

among the constituent parts of the concatenated key.

• Therefore, it is impossible for this design to violate any of the normal forms up through BCNF, which all depend on stray dependencies.

Page 107: SQL Unit  17 Normalization

107

• However, the design is full of anomalies.• Look at the data in the table.• Salesperson Bob talked to two customers and

sold two cars.• The result is four rows in the table, each row

one of the four possible combinations of customer and car.

Page 108: SQL Unit  17 Normalization

108

• Why is this necessary?• Because the table is all key, none of the

individual fields can be null.• In order to record a customer, for example,

there has to be a car.• Similarly, in order to record a car, there has to

be a customer.

Page 109: SQL Unit  17 Normalization

109

• Why do every possible pairing?• Note that this is incidentally pairing just because

the example is based on two values for each field.• The reality is that you’re getting a Cartesian

product for the multidetermined fields, the number in one, m, times the number in the other, n.

• Why not do what’s shown on the following overhead instead?

Page 110: SQL Unit  17 Normalization

110

Salesperson Customer Car

Bob Jay Mustang

Bob Kay Camaro

Sue Ann Corolla

Sue Jim Civic

Page 111: SQL Unit  17 Normalization

111

• The problem with this is that if you want to delete the customer Jay, you lose the car Mustang—or vice-versa.

• If you delete the customer Kay, you lose the car Camaro—or vice-versa.

• So doubling up (doing the Cartesian product) is a consistent way to protect from data loss.

Page 112: SQL Unit  17 Normalization

112

• However, the doubling up (Cartesian product) is what causes the classic anomalies.

• To insert a single customer, you have to insert as many rows as there are cars for that salesperson.

• To insert a single car, you have to insert as many rows as there are customers for that salesperson.

Page 113: SQL Unit  17 Normalization

113

• Now that there are multiple rows for each of the multidetermined fields, updating a single instance means updating every row.

• Also, deleting a single instance will require deleting many rows.

Page 114: SQL Unit  17 Normalization

114

• In short, this design doesn’t violate any of the foregoing normal forms.

• However, it’s full of awful anomalies.• Therefore, 4NF was defined, stating that a

design like this is not correct.

Page 115: SQL Unit  17 Normalization

115

• The assumptions imply two one-to-many relationships.

• Deriving a correct design seems pretty straightforward.

• Have three tables, one each for Salesperson, Customer and Car.

• Embed the primary key of Salesperson in Customer and in Car.

Page 116: SQL Unit  17 Normalization

116

• Books will sometimes tell you that you are unlikely to find a mess like this in practice.

• Anyone who makes a design by working from 1NF through BCNF, rooting out stray dependencies, will most likely have broken out all entities into their own tables.

• Even if that hasn’t happened due to normalization, if they reach this point, they are likely to realize intuitively that this is a bad design.

Page 117: SQL Unit  17 Normalization

117

• Anyone familiar with database design principles would not try to put three types of entities together in a single table in the first place.

• On the other hand, people who are unfamiliar with the rules sometimes think that they should try and cram as much information into a single table as possible.

Page 118: SQL Unit  17 Normalization

118

• If that happens, then a violation such as this is possible.

• This is the classic worst-case scenario of the one-table mindset.

• If you are ever asked to convert a “database” maintained as an Excel spreadsheet into relational form, expect to see nightmares like this.

Page 119: SQL Unit  17 Normalization

119

7. Fifth Normal Form

• 5NF Definition:• Formally (Watson):• A relation is in fifth normal form if and only if

every join dependency of the relation is a consequence of candidate keys of the relation

• Informally: • See the following commentary and

explanation

Page 120: SQL Unit  17 Normalization

120

• It is apparent that 5NF is like BCNF in this way:• It doesn’t depend on the earlier normal forms.• It is a summative definition of what

characteristic a correct design should have.• Unfortunately, because its definition is so high

level and theoretical, what it means is not clear.• Like the definition of 4NF, it involves a concept

that hasn’t been explained yet.

Page 121: SQL Unit  17 Normalization

121

• What is a join dependency?• Rather than trying to define this concept, an

example will be pursued.• After looking at what the example reveals, an

informal description of 5NF will be given.

Page 122: SQL Unit  17 Normalization

122

Example

• Consider the Salesperson, Customer, and Car entities again

• In the 4NF example, the relationship between Salesperson and Customer and the relationship between Salesperson and Car were independent

• But suppose that we’re again interested in capturing some sort of relationship which directly ties together a salesperson, a customer, and a car

Page 123: SQL Unit  17 Normalization

123

• In the sample database used for SQL practice, this kind of relationship was captured by a star design.

• The Carsale table was in the middle of the star• Each of Car, Customer, and Salesperson were

in a one-to-many relationship with it.

Page 124: SQL Unit  17 Normalization

124

• You may have realized that there is another way to relate together all three of the base tables.

• What if each pairing of Salesperson, Customer, and Car were related in an m-m relationship?

• This results in a design with a cycle in it.• This is shown in the ER diagram on the

following overhead.

Page 125: SQL Unit  17 Normalization

125

Car

Customer-Salesperson

Sales-person

Customer

Salesperson-Car

Car-Customer

Page 126: SQL Unit  17 Normalization

126

• The design could also be represented in this way:

• Car(vin, make, model, year)• Car-Customer(vin f.k., custno f.k.)• Customer(custno, custname)• Customer-Salesperson(custno f.k., spno f.k.)• Salesperson(spno, spname)• Salesperson-Car(spno f.k., vin f.k., date)

Page 127: SQL Unit  17 Normalization

127

• In general, designs with cycles in them are difficult to understand.

• What you want to determine can be stated in two different ways:

• What assumptions might make a cyclical design rather than a star design correct?

• What can go wrong with a cyclical design which would make it incorrect for a given situation?

Page 128: SQL Unit  17 Normalization

128

• If you traced all of the links in the design with the cycle, you would find that every car is connected to every salesperson is connected to every customer.

• Put in business terms, the cyclical design for car sales embodies this assumption:

• At one time or another every salesperson has sold every car to every customer.

• In other words, all sales that could have possibly occurred did occur.

Page 129: SQL Unit  17 Normalization

129

• Obviously, the foregoing is madness.• At this point the car sale example breaks down. • It is unrealistic to think that every car would be

sold by every salesperson and bought by every customer.

• It is difficult to imagine a scenario where these new assumptions would hold true.

• We are dealing with assumptions that simply aren’t realistic for this example.

Page 130: SQL Unit  17 Normalization

130

Cycles vs. Star Models

• Although not realistic, the scenario is still informative.

• The cyclical design is contrary to the star design in this way:

• In the star design the table in the middle captures information for that subset of possible sales that actually occurred.

Page 131: SQL Unit  17 Normalization

131

• Finally, this returns to the question of 5NF and what a correct design would be.

• Suppose you did have a situation where every possible pair of relationships actually does exist.

• Which design is better, the one with the star or the one with the cycle?

Page 132: SQL Unit  17 Normalization

132

• Under this scenario, with a start design, the table in the middle becomes the 3-way Cartesian product of the primary keys of the three base tables.

• This harks back to the 4NF example.• The problem there was a table consisting of

Cartesian products of independent multivalued dependencies

Page 133: SQL Unit  17 Normalization

133

• We now have the converse problem—everything really is related to everything else.

• If we know that each pairwise relationship is valid, why are we creating a table in the middle that forms an unneeded Cartesian product?

• In a theoretical sense, at least, the design with the cycle is better.

Page 134: SQL Unit  17 Normalization

134

• Are there anomalies in the star design under these assumptions?

• The anomalies aren’t unmanageable roadblocks, but there is a practical reason why the star design is undesirable:

• Suppose that p Salespeople are related to m Customers and n Cars.

• Suppose you wanted to add one salesperson to the design with a star under these assumptions.

Page 135: SQL Unit  17 Normalization

135

• That would involve adding m x n triplets to the table in the middle.

• In the cyclical design you would “only” have to manage m additions to the Salesperson-Customer table and n additions to the Salesperson-Car table.

Page 136: SQL Unit  17 Normalization

136

• Consider the problem overall• The complete Cartesian product in the table in

the middle of a star design would contain m x n x p triples

• With a cyclical design you would have three pairwise tables in the middle with m x n, n x p, and p x m entries each

• Once m, n, and p are above 3, there are fewer entries to manage with the cyclical design

Page 137: SQL Unit  17 Normalization

137

A More Realistic Example

• The book comes up with a reasonably realistic example of where the cyclical design might apply

• Suppose firms have contracts with consultants• Consultants provide advice based on their skill

sets• And the firms require advice on skills in

consultants’ skill sets

Page 138: SQL Unit  17 Normalization

138

• Add this assumption to the suppositions above:• If a company has a contract with a consultant,

and the consultant has a particular skill, and the company requires advice that depends on that skill, then a consultant under contract, by definition, can or will or does provide advice involving the skill.

• In shorthand, the relationships are “universal” (or promiscuous…)

Page 139: SQL Unit  17 Normalization

139

• This is a more general model than saying that a consultant is only contracted to provide advice on a given skill set.

• It is a realistic model from the firms’ point of view.• They may want to enforce contracts like this with

their consultants.• Whether it is completely realistic or not, if all of

the assumptions hold true, then the cyclical model is the technically correct one.

Page 140: SQL Unit  17 Normalization

140

• ER diagrams of this example are given on the following overheads.

• The first diagram is a start design where the new assumption doesn’t apply

• The second diagram is the cyclical design where the new assumption does apply

• In the captions of the diagrams, the “rule” referred to is the assumption that the relationships among entities are “universal”

Page 141: SQL Unit  17 Normalization

141

Page 142: SQL Unit  17 Normalization

142

Page 143: SQL Unit  17 Normalization

143

Cycles vs. Linear Models

• There is another aspect of this question that can be considered.

• Would it make a difference if you broke the cycle by removing one of the tables in the middle?

• You could still form a join query that included all of the base tables, but you would definitely lose information from the model.

Page 144: SQL Unit  17 Normalization

144

• Suppose you removed “Firm Requires Skill” for example.

• Now every company would be indiscriminately contracting with every consultant for every skill they had rather than every skill they had that the company required

• The same kind of argument would apply to removing any of the tables in the middle

Page 145: SQL Unit  17 Normalization

145

• Situations where every entity in every base table is related to every other entity in every other base table are rare.

• However, they can occur.• In those cases the cyclical design is

theoretically better.

Page 146: SQL Unit  17 Normalization

146

A Concrete Illustration

• Even though the car example isn’t realistic, because it’s familiar, it’s possible to develop a concrete example that illustrates that the cyclical design does ultimately capture the Cartesian product of the base tables

• The example also illustrates how information is lost if you forget to include one of the tables, making the design a linear one instead of a cyclical one

Page 147: SQL Unit  17 Normalization

147

• These tables are shown on the following overheads:• Car• Customer• Salesperson• Car-Customer• Car-Salesperson• Customer-Salesperson• The tables in the middle contain the full m x n

Cartesian product of the base tables they connect

Page 148: SQL Unit  17 Normalization

148

CarcarID carName

1 jkl

2 uio

3 fds

Page 149: SQL Unit  17 Normalization

149

CustomercustomerID customerName

1 Chris

2 Chet

3 Chad

Page 150: SQL Unit  17 Normalization

150

SalespersonsalespersonID salespersonName

1 Sam

2 Sid

3 Sue

Page 151: SQL Unit  17 Normalization

151

Car-CustomercarID customerID

1 1

1 2

1 3

2 1

2 2

2 3

3 1

3 2

3 3

Page 152: SQL Unit  17 Normalization

152

Car-SalespersoncarID SalespersonID

1 1

1 2

1 3

2 1

2 2

2 3

3 1

3 2

3 3

Page 153: SQL Unit  17 Normalization

153

Customer-SalespersoncustomerID salespersonID

1 1

1 2

1 3

2 1

2 2

2 3

3 1

3 2

3 3

Page 154: SQL Unit  17 Normalization

154

• The following overhead shows a screenshot of the cyclical relationships between the tables.

Page 155: SQL Unit  17 Normalization

155

Page 156: SQL Unit  17 Normalization

156

• A join query that goes all the way around the cycle is given on the next overhead.

• The results of the query are shown on the overhead following the next one.

Page 157: SQL Unit  17 Normalization

157

• SELECT *• FROM Car, Car-Customer, Customer, Customer-Salesperson,

Salesperson, Car-Salesperson• WHERE Car.carID = Car-Customer.carID• And Car.carID = Car-Salesperson.carID• And Customer.customerID = Car-Customer.customerID• And Customer.customerID = Customer-Salesperson.customerID • And Salesperson.salespersonID = Car-Salesperson.salespersonID • And Salesperson.salespersonID = Customer-

Salesperson.salespersonID;

Page 158: SQL Unit  17 Normalization

158

CycleQuery

Car.carID

carName

Car-Customer.carID

Car-Customer.customerID

Customer.customerID

customerName

Customer-Salesperson.customerID

Customer-Salesperson.SalespersonID

Salesperson.SalespersonID

salespersonName

Car-Salesperson.carID

Car-Salesperson.SalespersonID

1 jkl 1 1 1 Chris 1 1 1 Sam 1 1

2 uio 2 1 1 Chris 1 1 1 Sam 2 1

3 fds 3 1 1 Chris 1 1 1 Sam 3 1

1 jkl 1 2 2 Chet 2 1 1 Sam 1 1

2 uio 2 2 2 Chet 2 1 1 Sam 2 1

3 fds 3 2 2 Chet 2 1 1 Sam 3 1

1 jkl 1 3 3 Chad 3 1 1 Sam 1 1

2 uio 2 3 3 Chad 3 1 1 Sam 2 1

3 fds 3 3 3 Chad 3 1 1 Sam 3 1

1 jkl 1 1 1 Chris 1 2 2 Sid 1 2

2 uio 2 1 1 Chris 1 2 2 Sid 2 2

3 fds 3 1 1 Chris 1 2 2 Sid 3 2

1 jkl 1 2 2 Chet 2 2 2 Sid 1 2

2 uio 2 2 2 Chet 2 2 2 Sid 2 2

3 fds 3 2 2 Chet 2 2 2 Sid 3 2

1 jkl 1 3 3 Chad 3 2 2 Sid 1 2

2 uio 2 3 3 Chad 3 2 2 Sid 2 2

3 fds 3 3 3 Chad 3 2 2 Sid 3 2

1 jkl 1 1 1 Chris 1 3 3 Sue 1 3

2 uio 2 1 1 Chris 1 3 3 Sue 2 3

3 fds 3 1 1 Chris 1 3 3 Sue 3 3

1 jkl 1 2 2 Chet 2 3 3 Sue 1 3

2 uio 2 2 2 Chet 2 3 3 Sue 2 3

3 fds 3 2 2 Chet 2 3 3 Sue 3 3

1 jkl 1 3 3 Chad 3 3 3 Sue 1 3

2 uio 2 3 3 Chad 3 3 3 Sue 2 3

3 fds 3 3 3 Chad 3 3 3 Sue 3 3

Page 159: SQL Unit  17 Normalization

159

• The main thing to note about the previous results is that there are 3 x 3 x 3 = 27 rows

• The join query that goes all the way around the cycle produces the Cartesian product of the three base tables.

Page 160: SQL Unit  17 Normalization

160

• The previous join query with the last joining condition removed is shown on the next overhead.

• The results of the query are shown on the overhead following the next one.

• They are impossible to read because there are 81 rows.

Page 161: SQL Unit  17 Normalization

161

• SELECT *• FROM Car, Car-Customer, Customer, Customer-

Salesperson, Salesperson, Car-Salesperson• WHERE Car.carID = Car-Customer.carID• And Car.carID = Car-Salesperson.carID• And Customer.customerID = Car-Customer.customerID• And Customer.customerID = Customer-

Salesperson.customerID • And Salesperson.salespersonID = Car-

Salesperson.salespersonID;

Page 162: SQL Unit  17 Normalization

162

CycleQueryMinusOneLink

Car.carID

carName

Car-Customer.carID

Car-Customer.customerID

Customer.customerID

customerName

Customer-Salesperson.customerID

Customer-Salesperson.SalespersonID

Salesperson.SalespersonID

salespersonName

Car-Salesperson.carID

Car-Salesperson.SalespersonID

1 jkl 1 1 1 Chris 1 1 1 Sam 1 1

1 jkl 1 1 1 Chris 1 2 1 Sam 1 1

1 jkl 1 1 1 Chris 1 3 1 Sam 1 1

1 jkl 1 1 1 Chris 1 1 2 Sid 1 2

1 jkl 1 1 1 Chris 1 2 2 Sid 1 2

1 jkl 1 1 1 Chris 1 3 2 Sid 1 2

1 jkl 1 1 1 Chris 1 1 3 Sue 1 3

1 jkl 1 1 1 Chris 1 2 3 Sue 1 3

1 jkl 1 1 1 Chris 1 3 3 Sue 1 3

1 jkl 1 2 2 Chet 2 1 1 Sam 1 1

1 jkl 1 2 2 Chet 2 2 1 Sam 1 1

1 jkl 1 2 2 Chet 2 3 1 Sam 1 1

1 jkl 1 2 2 Chet 2 1 2 Sid 1 2

1 jkl 1 2 2 Chet 2 2 2 Sid 1 2

1 jkl 1 2 2 Chet 2 3 2 Sid 1 2

1 jkl 1 2 2 Chet 2 1 3 Sue 1 3

1 jkl 1 2 2 Chet 2 2 3 Sue 1 3

1 jkl 1 2 2 Chet 2 3 3 Sue 1 3

1 jkl 1 3 3 Chad 3 1 1 Sam 1 1

1 jkl 1 3 3 Chad 3 2 1 Sam 1 1

1 jkl 1 3 3 Chad 3 3 1 Sam 1 1

1 jkl 1 3 3 Chad 3 1 2 Sid 1 2

1 jkl 1 3 3 Chad 3 2 2 Sid 1 2

1 jkl 1 3 3 Chad 3 3 2 Sid 1 2

1 jkl 1 3 3 Chad 3 1 3 Sue 1 3

1 jkl 1 3 3 Chad 3 2 3 Sue 1 3

1 jkl 1 3 3 Chad 3 3 3 Sue 1 3

2 uio 2 1 1 Chris 1 1 1 Sam 2 1

2 uio 2 1 1 Chris 1 2 1 Sam 2 1

2 uio 2 1 1 Chris 1 3 1 Sam 2 1

2 uio 2 1 1 Chris 1 1 2 Sid 2 2

2 uio 2 1 1 Chris 1 2 2 Sid 2 2

2 uio 2 1 1 Chris 1 3 2 Sid 2 2

2 uio 2 1 1 Chris 1 1 3 Sue 2 3

2 uio 2 1 1 Chris 1 2 3 Sue 2 3

2 uio 2 1 1 Chris 1 3 3 Sue 2 3

2 uio 2 2 2 Chet 2 1 1 Sam 2 1

2 uio 2 2 2 Chet 2 2 1 Sam 2 1

2 uio 2 2 2 Chet 2 3 1 Sam 2 1

2 uio 2 2 2 Chet 2 1 2 Sid 2 2

2 uio 2 2 2 Chet 2 2 2 Sid 2 2

2 uio 2 2 2 Chet 2 3 2 Sid 2 2

2 uio 2 2 2 Chet 2 1 3 Sue 2 3

2 uio 2 2 2 Chet 2 2 3 Sue 2 3

2 uio 2 2 2 Chet 2 3 3 Sue 2 3

2 uio 2 3 3 Chad 3 1 1 Sam 2 1

2 uio 2 3 3 Chad 3 2 1 Sam 2 1

2 uio 2 3 3 Chad 3 3 1 Sam 2 1

2 uio 2 3 3 Chad 3 1 2 Sid 2 2

2 uio 2 3 3 Chad 3 2 2 Sid 2 2

2 uio 2 3 3 Chad 3 3 2 Sid 2 2

2 uio 2 3 3 Chad 3 1 3 Sue 2 3

2 uio 2 3 3 Chad 3 2 3 Sue 2 3

2 uio 2 3 3 Chad 3 3 3 Sue 2 3

3 fds 3 1 1 Chris 1 1 1 Sam 3 1

3 fds 3 1 1 Chris 1 2 1 Sam 3 1

3 fds 3 1 1 Chris 1 3 1 Sam 3 1

3 fds 3 1 1 Chris 1 1 2 Sid 3 2

3 fds 3 1 1 Chris 1 2 2 Sid 3 2

3 fds 3 1 1 Chris 1 3 2 Sid 3 2

3 fds 3 1 1 Chris 1 1 3 Sue 3 3

3 fds 3 1 1 Chris 1 2 3 Sue 3 3

3 fds 3 1 1 Chris 1 3 3 Sue 3 3

3 fds 3 2 2 Chet 2 1 1 Sam 3 1

3 fds 3 2 2 Chet 2 2 1 Sam 3 1

3 fds 3 2 2 Chet 2 3 1 Sam 3 1

3 fds 3 2 2 Chet 2 1 2 Sid 3 2

3 fds 3 2 2 Chet 2 2 2 Sid 3 2

3 fds 3 2 2 Chet 2 3 2 Sid 3 2

3 fds 3 2 2 Chet 2 1 3 Sue 3 3

3 fds 3 2 2 Chet 2 2 3 Sue 3 3

3 fds 3 2 2 Chet 2 3 3 Sue 3 3

3 fds 3 3 3 Chad 3 1 1 Sam 3 1

3 fds 3 3 3 Chad 3 2 1 Sam 3 1

3 fds 3 3 3 Chad 3 3 1 Sam 3 1

3 fds 3 3 3 Chad 3 1 2 Sid 3 2

3 fds 3 3 3 Chad 3 2 2 Sid 3 2

3 fds 3 3 3 Chad 3 3 2 Sid 3 2

3 fds 3 3 3 Chad 3 1 3 Sue 3 3

3 fds 3 3 3 Chad 3 2 3 Sue 3 3

3 fds 3 3 3 Chad 3 3 3 Sue 3 3

Page 163: SQL Unit  17 Normalization

163

• The query missing one joining condition is the equivalent of a design where one of the tables in the middle is missing

• In such a design you’ve lost information• However, the loss of design information means

that when you run the query you get more result records

• In other words, what’s lost is connecting information that would restrict the results

Page 164: SQL Unit  17 Normalization

164

Finally, an Informal Definition of 5NF

• Recall the formal definition of 5NF:• A relation is in fifth normal form if and only if

every join dependency of the relation is a consequence of candidate keys of the relation

• For the purposes of comparison, also recall the definition of BCNF:

• A relation is in Boyce-Codd normal form if and only if every determinant is a candidate key

Page 165: SQL Unit  17 Normalization

165

• BCNF seems to be directed towards the definition of the correct design of a table internally, without reference to other tables

• 5NF seems to be directed towards the definition of the correct design of a table in relation or reference to other tables

• The term “join dependency” hasn’t been defined, but it will be informally defined now

Page 166: SQL Unit  17 Normalization

166

• A general description of what 5NF says is the following:

• A design is correct if two conditions are met: • All real relationships between entities are

captured by the design; • no false relationships between entities are

captured by the design.

Page 167: SQL Unit  17 Normalization

167

• Empirically, the informal definition can be translated into two statement about queries that involve joining (join dependencies between) 2 or more tables.

• 1. In a correct design in 5NF, you can write a query that will successfully pull all “real” information (based on the tables and their relationships) out of the db.

• It may be informative to consider some related negative statements.

Page 168: SQL Unit  17 Normalization

168

• There is no information “trapped” in the database that you cannot retrieve with a query.

• There is no information that you have failed to include due to a flaw in the database design.

• You can relate this back to the cycle vs. linear model observations.

• For example, you haven’t left a (pk-fk, join) relationship out of the design.

Page 169: SQL Unit  17 Normalization

169

• 2. The second part of the informal definition of 5NF is equally important.

• It is not possible to write a query that would pull out results which did not hold true

• In other words, you haven’t included a “false” relationship in the design

• You can relate this back to the cycle vs. star model observations.

Page 170: SQL Unit  17 Normalization

170

• If the star was the correct design, you would only be able to pull out that specific of triplets that were correct.

• If you mistakenly implemented a cycle, ultimately you would be able to write a query that in a sense pulled out the full Cartesian product of the base tables.

• If that is not correct, the model should not make it possible.

Page 171: SQL Unit  17 Normalization

171

General Rules for Normalizing

• Even if a cycle is theoretically the correct design, for practical and understandability reasons, you might go with star.

• Consider firms, consultants, and skills again• Suppose the hiring of consultants constitutes a

contract• Suppose that contracts have attributes of their

own, like dates and pay rates

Page 172: SQL Unit  17 Normalization

172

• At that point a contract may become a base entity and should probably be implemented as a base table.

• If the relationships are universal, you’ll just have to deal with the fact that the table in the middle of the star is a Cartesian product.

• In any case, it will be necessary to figure out the assumptions regarding dates and pay, and the cyclical solution may not be any prettier.

Page 173: SQL Unit  17 Normalization

173

Normalizing Overall

• The general rule with normal forms is similar to the rule with ER modeling.

• You take the model to the highest normal form that is desirable.

• If you haven’t taken it to the highest normal form possible, you document the decision.

Page 174: SQL Unit  17 Normalization

174

• The whole sweep of numbered normal forms can be considered now.

• At the very least you would expect to work a design up through BCNF.

• 4NF is considered kind of exceptional because when modeling from scratch you would never arrive in that situation.

Page 175: SQL Unit  17 Normalization

175

• As noted, you might have to deal with 4NF if you are working with an existing “database” designed by a one-table fanatic.

• 5NF is the final extreme.• Whether you choose to implement 5NF, it is

definitely worthwhile to be conscious of the implications of cycles in a final design.

• If you choose to avoid them, document that decision.

Page 176: SQL Unit  17 Normalization

176

8. Domain Key Normal Form

• The highest normal form, domain-key normal form (DKNF), is a theoretical statement of which attributes belong in tables and the relationships among the attributes.

• This form is not numbered.• Like BCNF and 5NF, it is a summative normal form.• DKNF encompasses all of the other normal forms.• A design in DKNF is fully normalized.

Page 177: SQL Unit  17 Normalization

177

• DKNF Definition:• Formally (Watson):• A relation is in domain-key normal form if and only if

every constraint on the relation is a logical consequence of the domain constraints and the key constraints that apply to the relation.

• Informally: • Unless you’ve steeped yourself in the relational theory

of normalization, the meaning of this statement is virtually impossible to interpret.

Page 178: SQL Unit  17 Normalization

178

• DKNF is based on the idea of domains and constraints.

• These ideas have come up before, but they themselves haven’t been formally defined yet.

• Practically speaking, even if understood, unlike the numbered normal forms, DKNF does not give you a step-by-step procedure for creating a design, determining whether or not it has violations, and fixing them.

Page 179: SQL Unit  17 Normalization

179

• Although this normal form is of little practical use, it raises the ideas of domains and constraints, which are important.

• They are the basis for correctly capturing relationships between tables.

• The remainder of this section will consist of remarks on these topics.

Page 180: SQL Unit  17 Normalization

180

Domains and Data Types

• Domains and data types are not the same thing• When a table is created, a complete definition

has to tell the data type of each field. • Some fields may hold numeric values, some

may hold strings of characters, some may hold dates, etc.

• If a field holds strings of characters, its length, or maximum length also has to be stated.

Page 181: SQL Unit  17 Normalization

181

• So, for example, a person’s last name may be defined as containing a maximum of 24 characters.

• TEXT(24) or CHAR(24) doesn’t tell you that the field holds a name.

Page 182: SQL Unit  17 Normalization

182

• The social security number came up earlier illustrating another aspect of this.

• A social security number has 9 digits. • Although the social security number is called a

number, it is never used numerically. • There is no need to add, subtract, multiply, or

divide it, and a good design will prevent that.

Page 183: SQL Unit  17 Normalization

183

• The wise choice is to define this field as a character field containing 9 characters where valid characters in this field are limited to digits.

• The point of this is that even though the name of a field tells you something meaningful about it, the name doesn’t necessarily accurately reflect the preferred data type for it.

Page 184: SQL Unit  17 Normalization

184

Domains

• Most of the time, the name of a field is descriptive of the kind of information it can hold.

• So for a “last name” field in a table containing information about people, it is informally clear what this means.

• In general, a name would consist of a sequence of letters of the alphabet.

Page 185: SQL Unit  17 Normalization

185

• Names could come from any language or culture, translated into the English alphabet.

• Some names do contain numeric information, usually indicated with Roman numerals, for example, John Smith I, John Smith II, etc.

• It would not be practical to come up with a formula that mathematically defined all possible values.

• Still, the general idea is clear.

Page 186: SQL Unit  17 Normalization

186

• In a technical sense, the term domain refers to the whole set of values that could appear as valid data in that field.

• However, the domain is a semantic concept. • A field’s domain is a description of the

meaning of the field and the kind of data it can contain.

Page 187: SQL Unit  17 Normalization

187

• This goes back to the description of an attribute and the requirements on it.

• An attribute captures one characteristic of the entity which it belongs to.

• Throughout a table, a field should capture the same characteristic for the different entities recorded in each row.

• Different records should not be recording different kinds of information in the same field.

Page 188: SQL Unit  17 Normalization

188

Different Fields, Not on the Same Domain

• The idea of domains can be further clarified by giving examples of cases where fields are not on the same domain.

• A person’s last name field may be defined as 24 characters.

• A city field could also be defined the same way. • There may be cases where a person’s name is

the same as the name of a city.

Page 189: SQL Unit  17 Normalization

189

• There is a city of Lincoln in England. • Abraham Lincoln’s ancestors might have came

from that area. • There is also a city of Lincoln in Nebraska,

which was named after Abraham Lincoln. • Even though there may be an intersection of

the values in the city and last name fields, you could argue that conceptually, city name and person last name are two distinct domains.

Page 190: SQL Unit  17 Normalization

190

• On the other hand, you may be maintaining some sort of database where the underlying concept of interest is name itself, whether describing a person or a place.

• In that case, you could argue that the two fields, city name and person last name, really are on the same domain.

• From a table point of view, instead of two tables, PeopleName and PlaceName, you could have one table, GenericName, with a single name field.

Page 191: SQL Unit  17 Normalization

191

• Another example of two fields that are not obviously on the same domain would be social security number and zip code.

• A full zip code consists of 5 plus 4, or 9 digits, like a social security number.

• Both might be defined as character fields containing 9 characters.

Page 192: SQL Unit  17 Normalization

192

• However, social security numbers and zip codes have nothing in common.

• There are doubtless cases where someone’s social security number matches some zip code somewhere in the country, but this is purely coincidental.

• On the other hand, you may be maintaining some sort of database where the underlying concept is random numeric characteristics…

Page 193: SQL Unit  17 Normalization

193

Domains and Primary Key to Foreign Key Relationships

• The relationship between tables has been explained by the process of embedding the primary key of one table as a foreign key in another.

• The foreign key table has to have a field with a suitable name, that is defined to hold the same type of data as the primary key field.

• In other words, the primary key to foreign key relationship requires fields on the same domain.

Page 194: SQL Unit  17 Normalization

194

Fields on Common Domains in General

• There can be other relationships between tables which are the result of domains, but not the direct result of embedding keys.

• Going back to one of the earlier examples, a database may distinguish between mothers and children as different kinds of entities, and store them in different tables.

Page 195: SQL Unit  17 Normalization

195

• Each of these tables may have social security number fields and last name fields.

• You would not expect a mother and child to have the same social security number.

• This would be a mistake.

Page 196: SQL Unit  17 Normalization

196

• However, in most cases you would expect mothers and children to have the same last names.

• The idea is that a social security number is a social security number, regardless of what table it appears in.

• The idea of a social security number defines a domain.

Page 197: SQL Unit  17 Normalization

197

• Similarly, if the last name fields in both the mother and child tables were defined as containing 24 characters, a last name is a last name, regardless of what table it appears in.

• Incidentally, a last name would be a last name even if the fields were not of the same length

Page 198: SQL Unit  17 Normalization

198

• Both social security number and last name define a domains.

• You might expect to see social security numbers involved in primary key to foreign key relationships.

• Because names aren’t unique, you don’t expect that of them.

Page 199: SQL Unit  17 Normalization

199

• You can have name fields in two different tables.

• Even though they’re not a pk to fk pair, they are still on the same domain.

• In theory, if you wanted to, you could write join queries on those fields

Page 200: SQL Unit  17 Normalization

200

• The point of this discussion is that a domain is a cross-table concept.

• Any given database may contain many different fields in its tables, but the database will contain fewer domains than fields because various fields are on the same domain.

• In a sense, the idea of a domain is more fundamental than the idea of a field.

• A field is just a manifestation of a domain.

Page 201: SQL Unit  17 Normalization

201

The Definition of DKNY, Again

• This was the formal definition of DKNY:• A relation is in domain-key normal form if and

only if every constraint on the relation is a logical consequence of the domain constraints and the key constraints that apply to the relation.

• What are domain constraints?• They are constraints on the values that are valid in

a field based on the meaning of the field.

Page 202: SQL Unit  17 Normalization

202

• Thinking back to SQL, these are some constraints that can be concretely identified in all cases, regardless of the particular field in question:

• Entity integrity (primary key)• Relational integrity (foreign key)• Unique• Not null

Page 203: SQL Unit  17 Normalization

203

• In very general terms, DKNF says that a database is correctly designed if the fields in the tables, the constraints on the individual fields, and the dependencies among the tables are the result of correctly defining the domains and choosing which domain each field is on.

• It reaches beyond the question of whether one table is correctly designed, because the relationships between tables also depends on domain choice.

Page 204: SQL Unit  17 Normalization

204

• The DKNY definition is so high-level and cosmic that it boils down to saying this:

• A database is correctly designed if and only if it’s correctly designed.

• If the design stems from and captures the logical meaning of the problem, the design is correct.

• If the design is correct, then it must be the case that it stems from and captures the logical meaning of the problem…

Page 205: SQL Unit  17 Normalization

205

9. Nulls and Integrity

• This section is a review of material that was explained earlier.

• It will not be gone over in class.• However, it is provided below in its entirety in

case you want to read the overheads yourself.

Page 206: SQL Unit  17 Normalization

206

9. Nulls and Integrity

• The term “null” refers to the idea that a particular field in a particular record may not have data in it.

• In general, this is permissible. • Cases often arise in practice where the

information doesn’t exist or isn’t known.

Page 207: SQL Unit  17 Normalization

207

• It would be impractical to insist that all fields always contain data.

• If that restriction were imposed, people would get around it by putting in bogus values for information that didn’t exist or wasn’t known.

• However, filling a database with bogus values is not a very good idea.

Page 208: SQL Unit  17 Normalization

208

• When a database management system supports null values in fields, it’s important to understand what this does not mean.

• It does not mean that the fields contain the sequence of characters “null”.

• It also does not mean that the field contains invisible blanks.

• Blank spaces themselves are a form of character. • What it means is that there is absolutely nothing in the

field, and the database management system is able to recognize fields that are in that state.

Page 209: SQL Unit  17 Normalization

209

• The term integrity in database management systems refers to the validity and consistency of data entered into a database.

• The phrase “entity integrity” is the formal expression of a requirement that was stated informally earlier.

• Entity integrity puts the following requirement on a correctly implemented database:

• Every table has a primary key field, and no part of the primary key field can be null for any record in the table.

• Clearly, if all or part of a key were allowed to be null, that would defeat the purpose that the primary key field be the unique identifier for every record in the table.

Page 210: SQL Unit  17 Normalization

210

• As seen in the long discussion of normal forms, it is the primary key to foreign key relationships that support the interconnection between related entities that have been separated into different tables by the design process.

• Once this has been done, it is critically important that the data maintaining the relationships be valid and consistent.

• The phrase “referential integrity” has the following meaning: • Every value that appears in a foreign key field also has to appear

as a value in the corresponding primary key field. • This can also be stated negatively: • There can be no foreign key value that does not have a

corresponding primary key value.

Page 211: SQL Unit  17 Normalization

211

• The meaning and importance of referential integrity can be most easily explained with a small example showing a violation of it.

• Consider the tables shown on the following overhead:

Page 212: SQL Unit  17 Normalization

212

Mothermid Name1 Lily2 Matilda

Childkid name mida Ned 3b Ann 2c June

Page 213: SQL Unit  17 Normalization

213

• Child b, Ann, is shown as having a mother with mid equal to 3.

• There is no such mother in the Mother table. This is literally nonsense.

• There is no sense in which this can be correct and this is what referential integrity forbids.

Page 214: SQL Unit  17 Normalization

214

• This example also illustrates two other things, which are related to “non-existent” values.

• Observe that mother 1, Lily does not have any matching records in the Child table.

• This does not violate referential integrity. • It suggests that the Mother table is misnamed, and

should be named the Woman table, but it is reasonable to think that you might be recording information about women and children and some women will not have children.

Page 215: SQL Unit  17 Normalization

215

• The other thing visible in the table is that child c, June, does not have a mother listed.

• In other words, the foreign key field is null. • This also does not violate referential integrity. • As with null in any situation, it may mean that

the mother is not known.

Page 216: SQL Unit  17 Normalization

216

• Nobody literally doesn’t have a mother, but if the woman table only records information on living women, for example, then for an orphan, the mother “wouldn’t exist”.

• It is unlikely that you would rename the table “Children and Orphans”—but the idea is that the null value is allowed and this in some sense affects the meaning of what kinds of entities are entered into the table.

Page 217: SQL Unit  17 Normalization

217

• Referential integrity leads to one last consideration.

• The idea behind normalization was to get the stray dependency out of one table and break it into two.

• The problem with stray dependencies was redundancy and anomalies.

• By breaking a design into two tables with a primary to foreign key pair, you introduce interrelationship constraints.

Page 218: SQL Unit  17 Normalization

218

• Put simply, the question is this: What do you do with foreign key values if the corresponding primary key values in another table are deleted or updated?

• A fully-featured database management system will enforce referential integrity constraints.

• The default settings for these constraints are summarized in these two phrases:

• On delete, restrict; on update, cascade.

Page 219: SQL Unit  17 Normalization

219

• If these defaults are implemented, this is a fuller explanation of what they mean in terms of the concrete mother and child example:

• On delete, restrict: • No mother record can be deleted if she has

corresponding child records in the other table.

• To allow the deletion would lead to a referential integrity violation.

Page 220: SQL Unit  17 Normalization

220

• On update, cascade: • If the primary key value of a mother record is

updated, if she has corresponding child records in the other table, the foreign key values in those records is automatically updated to reflect the change.

• This problem arises less frequently because once a primary key value is assigned to an entity, it is rarely changed.

Page 221: SQL Unit  17 Normalization

221

The End