database normalization

20

Click here to load reader

Upload: vaibhav-kathuria

Post on 17-Jun-2015

190 views

Category:

Technology


0 download

DESCRIPTION

NORMALIZATION

TRANSCRIPT

Page 1: Database normalization

In relational database design, the process of organizing data to minimize redundancy is called normalization or the process of decomposing relation with anomalies to produce smaller ,well-structured relation. Normalization usually involves dividing a database into two or more tables and defining relationships between the tables. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships.

Normalization is a process of decomposing a set of relations(table) with anomalies to produce smaller and well structured relations that contain minimum or no redundancy. The basic objectives of normalization are to reduce redundancy, which means that information is to be stored only once. Storing information several times leads to wastage of storage space and increase in the total size of the data stored. Normalization provides the designer with a systematic and scientific process of grouping of attributes in a relation. 

The normalization process can be defined as a procedure of analyzing and successive reduction of the given relational schemas based on their FDs and primary keys to achieve the desirable properties of - 

(i) Minimizing redundancy,(ii) minimizing insertion, deletion and update anomalies.

* Database normalization

Page 2: Database normalization

The objectives of the normalization process are

1- To create a formal framework for analyzing relation schemas based on their

keys and on the functional dependencies among their attributes.

2- To free relations from undesirable insertion, update and deletion anomalies.

3- To reduce the need for restructuring the relations as new data types are

introduced.

4- To carry out series of tests on individual relation schema so that the

relational database can be normalized to some degree. When a test fails, the

relation violating that test must be decomposed into relations that individually

meet the normalization tests.

Page 3: Database normalization

*Database Normalization

*The main goal of Database Normalization is to

restructure the logical data model of a database

to:

*Eliminate redundancy

*Organize data efficiently

*Reduce the potential for data anomalies.

Page 4: Database normalization

*First normal form

*First normal form (1NF) is a property of a relation in a relational

database. A relation is in first normal form if the domain of each

attribute contains only atomic values, and the value of each

attribute contains only a single value from that domain.

*First normal form is an essential property of a relation in a

relational database. Database normalization is the process of

representing a database in terms of relations in standard normal

forms, where first normal is a minimal requirement.

Page 5: Database normalization

*ExamplesThe following scenario illustrates how a database design might violate first normal form.

Suppose a designer wishes to record the names and telephone numbers of customers. He defines a customer table which looks like this:

Customer ID First Name SurnameTelephone Number

123 Robert Ingram 555-861-2025

456 Jane Wright 555-403-1659

789 Maria Fernandez 555-808-9633

Customer

The designer then becomes aware of a requirement to record multiple telephone numbers for some customers. He reasons that the simplest way of doing this is to allow the "Telephone Number" field in any given record to contain more than one value:

Customer ID First Name SurnameTelephone Number

123 Robert Ingram 555-861-2025

456 Jane Wright555-403-1659555-776-4100

789 Maria Fernandez 555-808-9633

Customer

Assuming, however, that the Telephone Number column is defined on some telephone number-like domain, such as the domain of 12-character strings, the representation above is not in first normal form. It is in violation of first normal form as a single field has been allowed to contain multiple values. A typical relational database management system will not allow fields in a table to contain multiple values in this way.

Page 6: Database normalization

*A design that complies with 1NFA design that is unambiguously in first normal form makes use of two tables: a Customer Name table and a Customer Telephone Number table.

Customer Name

Customer ID

First Name Surname

123 Robert Ingram

456 Jane Wright

789 Maria Fernandez

Customer Telephone Number

Customer ID Telephone Number

123 555-861-2025

456 555-403-1659

456 555-776-4100

789 555-808-9633

Repeating groups of telephone numbers do not occur in this design. Instead, each Customer-to-Telephone Number link appears on its own record. With Customer ID as key, a one-to-many relationship exists between the two tables. A record in the "parent" table, Customer Name, can have many telephone number records in the "child" table, Customer Telephone Number, but each telephone number belongs to one, and only one customer. It is worth noting that this design meets the additional requirements for second and third normal form.

Page 7: Database normalization

*Second normal form

Second normal form (2NF) is a normal form used in database normalization.

A table that is in first normal form (1NF) must meet additional criteria if it is to qualify for second normal form. Specifically: a table is in 2NF if and only if it is in 1NF and no non-prime attribute is dependent on any proper subset of any candidate key of the table. A non-prime attribute of a table is an attribute that is not a part of any candidate key of the table.

Put simply, a table is in 2NF if and only if it is in 1NF and every non-prime attribute of the table is dependent on the whole of a candidate key.

Page 8: Database normalization

*ExampleConsider a table describing employees' skills:

Employees' Skills

Employee Skill Current Work Location

Brown Light Cleaning 73 Industrial Way

Brown Typing 73 Industrial Way

Harrison Light Cleaning 73 Industrial Way

Jones Shorthand 114 Main Street

Jones Typing 114 Main Street

Jones Whittling 114 Main Street

Neither {Employee} nor {Skill} is a candidate key for the table.

This is because a given Employee might need to appear more than once (he might have multiple Skills), and a given Skill might need to appear more than once (it might be possessed by multiple Employees). Only the composite key {Employee, Skill} qualifies as a candidate key for the table.

The remaining attribute, Current Work Location, is dependent on only part of the candidate key, namely Employee. Therefore the table is not in 2NF. Note the redundancy in the way Current Work Locations are represented: we are told three times that Jones works at 114 Main Street, and twice that Brown works at 73 Industrial Way. This redundancy makes the table vulnerable to update anomalies: it is, for example, possible to update Jones' work location on his "Shorthand" and "Typing" records and not update his "Whittling" record. The resulting data would imply contradictory answers to the question "What is Jones' current work location?"

Page 9: Database normalization

*A 2NF alternative to this design would represent the same information in two tables: an "Employees" table with candidate key {Employee}, and an "Employees' Skills" table with candidate key {Employee, Skill}:

Employees

Employee Current Work Location

Brown 73 Industrial Way

Harrison 73 Industrial Way

Jones 114 Main Street

Employees' Skills

Employee Skill

Brown Light Cleaning

Brown Typing

Harrison Light Cleaning

Jones Shorthand

Jones Typing

Jones Whittling

Page 10: Database normalization

* Third normal form

The third normal form (3NF) is a normal form used in database normalization

Codd's definition states that a table is in 3NF if and only if both of the following conditions hold:

* The relation R (table) is in second normal form (2NF)

* Every non-prime attribute of R is non-transitively dependent (i.e. directly dependent) on every superkey of R.

* A non-prime attribute of R is an attribute that does not belong to any candidate key of R.

 A transitive dependency is a functional dependency in which X → Z (X determines Z) indirectly, by virtue of X → Y and Y → Z (where it is not the case that Y → X).

A 3NF definition that is equivalent to Codd's, but expressed differently, was given by Carlo Zaniolo in 1982. This definition states that a table is in 3NF if and only if, for each of its functional dependencies X → A, at least one of the following conditions holds:

* X contains A (that is, X → A is trivial functional dependency), or

* X is a superkey, or

* Every element of A-X, the set difference between A and X, is a prime attribute (i.e., each

attribute in A-X is contained in some candidate key

Page 11: Database normalization

* An example of a 2NF table that fails to meet the requirements of 3NF is:

Tournament Year Winner Winner Date of Birth

Indiana Invitational 1998 Al Fredrickson 21 July 1975

Cleveland Open 1999 Bob Albertson 28 September 1968

Des Moines Masters 1999 Al Fredrickson 21 July 1975

Indiana Invitational 1999 Chip Masterson 14 March 1977

Tournament Winners

Because each row in the table needs to tell us who won a particular Tournament in a particular Year, the composite key {Tournament, Year} is a minimal set of attributes guaranteed to uniquely identify a row. That is, {Tournament, Year} is a candidate key for the table.

The breach of 3NF occurs because the non-prime attribute Winner Date of Birth is transitively dependent on the candidate key {Tournament, Year} via the non-prime attribute Winner. The fact that Winner Date of Birth is functionally dependent on Winner makes the table vulnerable to logical inconsistencies, as there is nothing to stop the same person from being shown with different dates of birth on different records.

In order to express the same facts without violating 3NF, it is necessary to split the table into two:Tournament Winners

Tournament Year Winner

Indiana Invitational

1998 Al Fredrickson

Cleveland Open 1999 Bob Albertson

Des Moines Masters

1999 Al Fredrickson

Indiana Invitational

1999 Chip Masterson

Player Dates of Birth

Player Date of Birth

Chip Masterson 14 March 1977

Al Fredrickson 21 July 1975

Bob Albertson 28 September 1968

Page 12: Database normalization

*Normalization beyond 3NF

Most 3NF tables are free of update, insertion, and deletion anomalies. Certain types

of 3NF tables, rarely met with in practice, are affected by such anomalies; these

are tables which either fall short of Boyce–Codd normal form (BCNF) or, if they

meet BCNF, fall short of the higher normal forms 4NF or 5NF.

Boyce–Codd normal form (or BCNF or 3.5NF) is a normal form used in database normalization. It is a slightly stronger version of the third normal form (3NF).

* If a relational schema is in BCNF then all redundancy based on functional dependency has been removed, although other types of redundancy may still exist. A relational schema R is in Boyce–Codd normal form if and only if for every one of its dependencies X → Y, at least one of the following conditions hold:

* X → Y is a trivial functional dependency (Y ⊆ X)

* X is a superkey for schema R

Page 13: Database normalization

*3NF tables not meeting BCNF (Boyce–Codd normal form)

Only in rare cases does a 3NF table not meet the requirements of BCNF. A 3NF table which does not have multiple overlapping candidate keys is guaranteed to be in BCNF. Depending on what its functional dependencies are, a 3NF table with two or more overlapping candidate keys may or may not be in BCNF.

* An example of a 3NF table that does not meet BCNF is:

Today's Court Bookings

CourtStart Time

End Time Rate Type

1 09:30 10:30 SAVER

1 11:00 12:00 SAVER

1 14:00 15:30 STANDARD

2 10:00 11:30 PREMIUM-B

2 11:30 13:30 PREMIUM-B

2 15:00 16:30 PREMIUM-A

Page 14: Database normalization

* Each row in the table represents a court booking at a tennis club that has one hard court (Court 1) and one grass court (Court 2)

* A booking is defined by its Court and the period for which the Court is reserved

* Additionally, each booking has a Rate Type associated with it. There are four distinct rate types:

* SAVER, for Court 1 bookings made by members

* STANDARD, for Court 1 bookings made by non-members

* PREMIUM-A, for Court 2 bookings made by members

* PREMIUM-B, for Court 2 bookings made by non-members

Note that even though in the above table Start Time and End Time attributes have no

duplicate values for each of them, we still have to admit that in some other days two different

bookings on court 1 and court 2 could start at the same time or end at the same time. This is

the reason why {Start Time} and {End Time} cannot be considered as the table's super keys.

However, only S1, S2, S3 and S4 are candidate keys (that is, minimal superkeys for that

relation) because e.g. S1 ⊂ S5, so S5 cannot be a candidate key.

Page 15: Database normalization

* In Today's Court Bookings table, there are no non-prime attributes: that is, all attributes belong to some candidate key. Therefore the table adheres to both 2NF and 3NF.

* The table does not adhere to BCNF. This is because of the dependency Rate Type → Court, in which the determining attribute (Rate Type) is neither a candidate key nor a superset of a candidate key.

* Dependency Rate Type → Court is respected as a Rate Type should only ever apply to a single Court.

*The design can be amended so that it meets BCNF:Rate Types

Rate Type Court Member Flag

SAVER 1 Yes

STANDARD 1 No

PREMIUM-A 2 Yes

PREMIUM-B 2 No

Today's Bookings

Rate Type Start Time End Time

SAVER 09:30 10:30

SAVER 11:00 12:00

STANDARD 14:00 15:30

PREMIUM-B 10:00 11:30

PREMIUM-B 11:30 13:30

PREMIUM-A 15:00 16:30

The candidate keys for the Rate Types table are {Rate Type} and {Court, Member Flag}; the candidate keys for the Today's Bookings table are {Rate Type, Start Time} and {Rate Type, End Time}. Both tables are in BCNF

Page 16: Database normalization

*Fourth normal formFourth normal form (4NF) is a normal form used in database normalization , 4NF is the next level of normalization after Boyce–Codd normal form (BCNF).

Whereas the second, third, and Boyce–Codd normal forms are concerned with functional dependencies, 4NF is concerned with a more general type of dependency known as a multivalued dependency.

A Table is in 4NF if and only if, for every one of its non-trivial multivalued dependencies X  Y, X is a super key—that is, X is either a candidate key or a superset thereof

Consider the following example:

Pizza Delivery Permutations

Restaurant Pizza Variety Delivery Area

A1 Pizza Thick Crust Springfield

A1 Pizza Thick Crust Shelbyville

A1 Pizza Thick Crust Capital City

A1 Pizza Stuffed Crust Springfield

A1 Pizza Stuffed Crust Shelbyville

A1 Pizza Stuffed Crust Capital City

Elite Pizza Thin Crust Capital City

Elite Pizza Stuffed Crust Capital City

Vincenzo's Pizza Thick Crust Springfield

Vincenzo's Pizza Thick Crust Shelbyville

Vincenzo's Pizza Thin Crust Springfield

Vincenzo's Pizza Thin Crust Shelbyville

Page 17: Database normalization

* Each row indicates that a given restaurant can deliver a given variety of pizza to a given area.

* The table has no non-key attributes because its only key is {Restaurant, Pizza Variety, Delivery Area}. Therefore it meets all normal forms up to BCNF. If we assume, however, that pizza varieties offered by a restaurant are not affected by delivery area, then it does not meet 4NF. The problem is that the table features two non-trivial multivalued dependencies on the {Restaurant} attribute (which is not a super key).

The dependencies are:

* {Restaurant}  {Pizza Variety}

* {Restaurant}  {Delivery Area}

* These non-trivial multivalued dependencies on a non-superkey reflect the fact that the varieties of pizza a restaurant offers are independent from the areas to which the restaurant delivers. This state of affairs leads to redundancy in the table:

for example, we are told three times that A1 Pizza offers Stuffed Crust, and if A1 Pizza starts producing Cheese Crust pizzas then we will need to add multiple rows, one for each of A1 Pizza's delivery areas. There is, moreover, nothing to prevent us from doing this incorrectly: we might add Cheese Crust rows for all but one of A1 Pizza's delivery areas, thereby failing to respect the multivalued dependency {Restaurant}  {Pizza Variety}.

* To eliminate the possibility of these anomalies, we must place the facts about varieties offered into a different table from the facts about delivery areas, yielding two tables that are both in 4NF:

Page 18: Database normalization

Varieties By Restaurant

Restaurant Pizza Variety

A1 Pizza Thick Crust

A1 Pizza Stuffed Crust

Elite Pizza Thin Crust

Elite Pizza Stuffed Crust

Vincenzo's Pizza Thick Crust

Vincenzo's Pizza Thin Crust

Delivery Areas By Restaurant

Restaurant Delivery Area

A1 Pizza Springfield

A1 Pizza Shelbyville

A1 Pizza Capital City

Elite Pizza Capital City

Vincenzo's Pizza Springfield

Vincenzo's Pizza Shelbyville

In contrast, if the pizza varieties offered by a restaurant sometimes did legitimately vary from one delivery area to another, the original three-column table would satisfy 4NF

Page 19: Database normalization

*Fifth normal form (5NF), also known as project-join normal form (PJ/NF) is a level of database normalization designed to reduce redundancy in relational databases recording multi-valued facts by isolating semantically related multiple relationships. A table is said to be in the 5NF if and only if every non-trivial join dependency in it is implied by the candidate keys.

Consider the following example: Traveling Salesman

Brand Product Type

Jack Schneider Acme Vacuum Cleaner

Jack Schneider Acme Breadbox

Willy Loman Robusto Pruning Shears

Willy Loman Robusto Vacuum Cleaner

Willy Loman Robusto Breadbox

Willy Loman Robusto Umbrella Stand

Louis Ferguson Robusto Vacuum Cleaner

Louis Ferguson Robusto Telescope

Louis Ferguson Acme Vacuum Cleaner

Louis Ferguson Acme Lava Lamp

Louis Ferguson Nimbus Tie Rack

Traveling Salesman Product Availability By Brand

Page 20: Database normalization

* The table's predicate is: Products of the type designated by Product Type, made by the brand designated by Brand, are available from the traveling salesman designated by Traveling Salesman.

* In the absence of any rules restricting the valid possible combinations of Traveling Salesman, Brand, and Product Type, the three-attribute table above is necessary in order to model the situation correctly.

* Suppose, however, that the following rule applies: A Traveling Salesman has certain Brands and certain Product Types in his repertoire. If Brand B1 and Brand B2 are in his repertoire, and Product Type P is in his repertoire, then (assuming Brand B1 and Brand B2 both make Product Type P), the Traveling Salesman must offer products of Product Type P those made by Brand B1 and those made by Brand B2.

* In that case, it is possible to split the table into three:Product Types By Traveling Salesman

Traveling Salesman

Product Type

Jack SchneiderVacuum Cleaner

Jack Schneider Breadbox

Willy Loman Pruning Shears

Willy LomanVacuum Cleaner

Willy Loman Breadbox

Willy Loman Umbrella Stand

Louis Ferguson Telescope

Louis FergusonVacuum Cleaner

Louis Ferguson Lava Lamp

Louis Ferguson Tie Rack

Brands By Traveling Salesman

Traveling

Salesman

Brand

Jack Schneider

Acme

Willy Loman

Robusto

Louis Ferguson

Robusto

Louis Ferguson

Acme

Louis Ferguson

Nimbus

Product Types By Brand

Brand Product Type

Acme Vacuum Cleaner

Acme Breadbox

Acme Lava Lamp

Robusto Pruning Shears

Robusto Vacuum Cleaner

Robusto Breadbox

Robusto Umbrella Stand

Robusto Telescope

Nimbus Tie Rack