database normalization
DESCRIPTION
NORMALIZATIONTRANSCRIPT
In relational database design, the process of organizing data to minimize redundancy is called normalization or the process of decomposing relation with anomalies to produce smaller ,well-structured relation. Normalization usually involves dividing a database into two or more tables and defining relationships between the tables. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships.
Normalization is a process of decomposing a set of relations(table) with anomalies to produce smaller and well structured relations that contain minimum or no redundancy. The basic objectives of normalization are to reduce redundancy, which means that information is to be stored only once. Storing information several times leads to wastage of storage space and increase in the total size of the data stored. Normalization provides the designer with a systematic and scientific process of grouping of attributes in a relation.
The normalization process can be defined as a procedure of analyzing and successive reduction of the given relational schemas based on their FDs and primary keys to achieve the desirable properties of -
(i) Minimizing redundancy,(ii) minimizing insertion, deletion and update anomalies.
* Database normalization
The objectives of the normalization process are
1- To create a formal framework for analyzing relation schemas based on their
keys and on the functional dependencies among their attributes.
2- To free relations from undesirable insertion, update and deletion anomalies.
3- To reduce the need for restructuring the relations as new data types are
introduced.
4- To carry out series of tests on individual relation schema so that the
relational database can be normalized to some degree. When a test fails, the
relation violating that test must be decomposed into relations that individually
meet the normalization tests.
*Database Normalization
*The main goal of Database Normalization is to
restructure the logical data model of a database
to:
*Eliminate redundancy
*Organize data efficiently
*Reduce the potential for data anomalies.
*First normal form
*First normal form (1NF) is a property of a relation in a relational
database. A relation is in first normal form if the domain of each
attribute contains only atomic values, and the value of each
attribute contains only a single value from that domain.
*First normal form is an essential property of a relation in a
relational database. Database normalization is the process of
representing a database in terms of relations in standard normal
forms, where first normal is a minimal requirement.
*ExamplesThe following scenario illustrates how a database design might violate first normal form.
Suppose a designer wishes to record the names and telephone numbers of customers. He defines a customer table which looks like this:
Customer ID First Name SurnameTelephone Number
123 Robert Ingram 555-861-2025
456 Jane Wright 555-403-1659
789 Maria Fernandez 555-808-9633
Customer
The designer then becomes aware of a requirement to record multiple telephone numbers for some customers. He reasons that the simplest way of doing this is to allow the "Telephone Number" field in any given record to contain more than one value:
Customer ID First Name SurnameTelephone Number
123 Robert Ingram 555-861-2025
456 Jane Wright555-403-1659555-776-4100
789 Maria Fernandez 555-808-9633
Customer
Assuming, however, that the Telephone Number column is defined on some telephone number-like domain, such as the domain of 12-character strings, the representation above is not in first normal form. It is in violation of first normal form as a single field has been allowed to contain multiple values. A typical relational database management system will not allow fields in a table to contain multiple values in this way.
*A design that complies with 1NFA design that is unambiguously in first normal form makes use of two tables: a Customer Name table and a Customer Telephone Number table.
Customer Name
Customer ID
First Name Surname
123 Robert Ingram
456 Jane Wright
789 Maria Fernandez
Customer Telephone Number
Customer ID Telephone Number
123 555-861-2025
456 555-403-1659
456 555-776-4100
789 555-808-9633
Repeating groups of telephone numbers do not occur in this design. Instead, each Customer-to-Telephone Number link appears on its own record. With Customer ID as key, a one-to-many relationship exists between the two tables. A record in the "parent" table, Customer Name, can have many telephone number records in the "child" table, Customer Telephone Number, but each telephone number belongs to one, and only one customer. It is worth noting that this design meets the additional requirements for second and third normal form.
*Second normal form
Second normal form (2NF) is a normal form used in database normalization.
A table that is in first normal form (1NF) must meet additional criteria if it is to qualify for second normal form. Specifically: a table is in 2NF if and only if it is in 1NF and no non-prime attribute is dependent on any proper subset of any candidate key of the table. A non-prime attribute of a table is an attribute that is not a part of any candidate key of the table.
Put simply, a table is in 2NF if and only if it is in 1NF and every non-prime attribute of the table is dependent on the whole of a candidate key.
*ExampleConsider a table describing employees' skills:
Employees' Skills
Employee Skill Current Work Location
Brown Light Cleaning 73 Industrial Way
Brown Typing 73 Industrial Way
Harrison Light Cleaning 73 Industrial Way
Jones Shorthand 114 Main Street
Jones Typing 114 Main Street
Jones Whittling 114 Main Street
Neither {Employee} nor {Skill} is a candidate key for the table.
This is because a given Employee might need to appear more than once (he might have multiple Skills), and a given Skill might need to appear more than once (it might be possessed by multiple Employees). Only the composite key {Employee, Skill} qualifies as a candidate key for the table.
The remaining attribute, Current Work Location, is dependent on only part of the candidate key, namely Employee. Therefore the table is not in 2NF. Note the redundancy in the way Current Work Locations are represented: we are told three times that Jones works at 114 Main Street, and twice that Brown works at 73 Industrial Way. This redundancy makes the table vulnerable to update anomalies: it is, for example, possible to update Jones' work location on his "Shorthand" and "Typing" records and not update his "Whittling" record. The resulting data would imply contradictory answers to the question "What is Jones' current work location?"
*A 2NF alternative to this design would represent the same information in two tables: an "Employees" table with candidate key {Employee}, and an "Employees' Skills" table with candidate key {Employee, Skill}:
Employees
Employee Current Work Location
Brown 73 Industrial Way
Harrison 73 Industrial Way
Jones 114 Main Street
Employees' Skills
Employee Skill
Brown Light Cleaning
Brown Typing
Harrison Light Cleaning
Jones Shorthand
Jones Typing
Jones Whittling
* Third normal form
The third normal form (3NF) is a normal form used in database normalization
Codd's definition states that a table is in 3NF if and only if both of the following conditions hold:
* The relation R (table) is in second normal form (2NF)
* Every non-prime attribute of R is non-transitively dependent (i.e. directly dependent) on every superkey of R.
* A non-prime attribute of R is an attribute that does not belong to any candidate key of R.
A transitive dependency is a functional dependency in which X → Z (X determines Z) indirectly, by virtue of X → Y and Y → Z (where it is not the case that Y → X).
A 3NF definition that is equivalent to Codd's, but expressed differently, was given by Carlo Zaniolo in 1982. This definition states that a table is in 3NF if and only if, for each of its functional dependencies X → A, at least one of the following conditions holds:
* X contains A (that is, X → A is trivial functional dependency), or
* X is a superkey, or
* Every element of A-X, the set difference between A and X, is a prime attribute (i.e., each
attribute in A-X is contained in some candidate key
* An example of a 2NF table that fails to meet the requirements of 3NF is:
Tournament Year Winner Winner Date of Birth
Indiana Invitational 1998 Al Fredrickson 21 July 1975
Cleveland Open 1999 Bob Albertson 28 September 1968
Des Moines Masters 1999 Al Fredrickson 21 July 1975
Indiana Invitational 1999 Chip Masterson 14 March 1977
Tournament Winners
Because each row in the table needs to tell us who won a particular Tournament in a particular Year, the composite key {Tournament, Year} is a minimal set of attributes guaranteed to uniquely identify a row. That is, {Tournament, Year} is a candidate key for the table.
The breach of 3NF occurs because the non-prime attribute Winner Date of Birth is transitively dependent on the candidate key {Tournament, Year} via the non-prime attribute Winner. The fact that Winner Date of Birth is functionally dependent on Winner makes the table vulnerable to logical inconsistencies, as there is nothing to stop the same person from being shown with different dates of birth on different records.
In order to express the same facts without violating 3NF, it is necessary to split the table into two:Tournament Winners
Tournament Year Winner
Indiana Invitational
1998 Al Fredrickson
Cleveland Open 1999 Bob Albertson
Des Moines Masters
1999 Al Fredrickson
Indiana Invitational
1999 Chip Masterson
Player Dates of Birth
Player Date of Birth
Chip Masterson 14 March 1977
Al Fredrickson 21 July 1975
Bob Albertson 28 September 1968
*Normalization beyond 3NF
Most 3NF tables are free of update, insertion, and deletion anomalies. Certain types
of 3NF tables, rarely met with in practice, are affected by such anomalies; these
are tables which either fall short of Boyce–Codd normal form (BCNF) or, if they
meet BCNF, fall short of the higher normal forms 4NF or 5NF.
Boyce–Codd normal form (or BCNF or 3.5NF) is a normal form used in database normalization. It is a slightly stronger version of the third normal form (3NF).
* If a relational schema is in BCNF then all redundancy based on functional dependency has been removed, although other types of redundancy may still exist. A relational schema R is in Boyce–Codd normal form if and only if for every one of its dependencies X → Y, at least one of the following conditions hold:
* X → Y is a trivial functional dependency (Y ⊆ X)
* X is a superkey for schema R
*3NF tables not meeting BCNF (Boyce–Codd normal form)
Only in rare cases does a 3NF table not meet the requirements of BCNF. A 3NF table which does not have multiple overlapping candidate keys is guaranteed to be in BCNF. Depending on what its functional dependencies are, a 3NF table with two or more overlapping candidate keys may or may not be in BCNF.
* An example of a 3NF table that does not meet BCNF is:
Today's Court Bookings
CourtStart Time
End Time Rate Type
1 09:30 10:30 SAVER
1 11:00 12:00 SAVER
1 14:00 15:30 STANDARD
2 10:00 11:30 PREMIUM-B
2 11:30 13:30 PREMIUM-B
2 15:00 16:30 PREMIUM-A
* Each row in the table represents a court booking at a tennis club that has one hard court (Court 1) and one grass court (Court 2)
* A booking is defined by its Court and the period for which the Court is reserved
* Additionally, each booking has a Rate Type associated with it. There are four distinct rate types:
* SAVER, for Court 1 bookings made by members
* STANDARD, for Court 1 bookings made by non-members
* PREMIUM-A, for Court 2 bookings made by members
* PREMIUM-B, for Court 2 bookings made by non-members
Note that even though in the above table Start Time and End Time attributes have no
duplicate values for each of them, we still have to admit that in some other days two different
bookings on court 1 and court 2 could start at the same time or end at the same time. This is
the reason why {Start Time} and {End Time} cannot be considered as the table's super keys.
However, only S1, S2, S3 and S4 are candidate keys (that is, minimal superkeys for that
relation) because e.g. S1 ⊂ S5, so S5 cannot be a candidate key.
* In Today's Court Bookings table, there are no non-prime attributes: that is, all attributes belong to some candidate key. Therefore the table adheres to both 2NF and 3NF.
* The table does not adhere to BCNF. This is because of the dependency Rate Type → Court, in which the determining attribute (Rate Type) is neither a candidate key nor a superset of a candidate key.
* Dependency Rate Type → Court is respected as a Rate Type should only ever apply to a single Court.
*The design can be amended so that it meets BCNF:Rate Types
Rate Type Court Member Flag
SAVER 1 Yes
STANDARD 1 No
PREMIUM-A 2 Yes
PREMIUM-B 2 No
Today's Bookings
Rate Type Start Time End Time
SAVER 09:30 10:30
SAVER 11:00 12:00
STANDARD 14:00 15:30
PREMIUM-B 10:00 11:30
PREMIUM-B 11:30 13:30
PREMIUM-A 15:00 16:30
The candidate keys for the Rate Types table are {Rate Type} and {Court, Member Flag}; the candidate keys for the Today's Bookings table are {Rate Type, Start Time} and {Rate Type, End Time}. Both tables are in BCNF
*Fourth normal formFourth normal form (4NF) is a normal form used in database normalization , 4NF is the next level of normalization after Boyce–Codd normal form (BCNF).
Whereas the second, third, and Boyce–Codd normal forms are concerned with functional dependencies, 4NF is concerned with a more general type of dependency known as a multivalued dependency.
A Table is in 4NF if and only if, for every one of its non-trivial multivalued dependencies X Y, X is a super key—that is, X is either a candidate key or a superset thereof
Consider the following example:
Pizza Delivery Permutations
Restaurant Pizza Variety Delivery Area
A1 Pizza Thick Crust Springfield
A1 Pizza Thick Crust Shelbyville
A1 Pizza Thick Crust Capital City
A1 Pizza Stuffed Crust Springfield
A1 Pizza Stuffed Crust Shelbyville
A1 Pizza Stuffed Crust Capital City
Elite Pizza Thin Crust Capital City
Elite Pizza Stuffed Crust Capital City
Vincenzo's Pizza Thick Crust Springfield
Vincenzo's Pizza Thick Crust Shelbyville
Vincenzo's Pizza Thin Crust Springfield
Vincenzo's Pizza Thin Crust Shelbyville
* Each row indicates that a given restaurant can deliver a given variety of pizza to a given area.
* The table has no non-key attributes because its only key is {Restaurant, Pizza Variety, Delivery Area}. Therefore it meets all normal forms up to BCNF. If we assume, however, that pizza varieties offered by a restaurant are not affected by delivery area, then it does not meet 4NF. The problem is that the table features two non-trivial multivalued dependencies on the {Restaurant} attribute (which is not a super key).
The dependencies are:
* {Restaurant} {Pizza Variety}
* {Restaurant} {Delivery Area}
* These non-trivial multivalued dependencies on a non-superkey reflect the fact that the varieties of pizza a restaurant offers are independent from the areas to which the restaurant delivers. This state of affairs leads to redundancy in the table:
for example, we are told three times that A1 Pizza offers Stuffed Crust, and if A1 Pizza starts producing Cheese Crust pizzas then we will need to add multiple rows, one for each of A1 Pizza's delivery areas. There is, moreover, nothing to prevent us from doing this incorrectly: we might add Cheese Crust rows for all but one of A1 Pizza's delivery areas, thereby failing to respect the multivalued dependency {Restaurant} {Pizza Variety}.
* To eliminate the possibility of these anomalies, we must place the facts about varieties offered into a different table from the facts about delivery areas, yielding two tables that are both in 4NF:
Varieties By Restaurant
Restaurant Pizza Variety
A1 Pizza Thick Crust
A1 Pizza Stuffed Crust
Elite Pizza Thin Crust
Elite Pizza Stuffed Crust
Vincenzo's Pizza Thick Crust
Vincenzo's Pizza Thin Crust
Delivery Areas By Restaurant
Restaurant Delivery Area
A1 Pizza Springfield
A1 Pizza Shelbyville
A1 Pizza Capital City
Elite Pizza Capital City
Vincenzo's Pizza Springfield
Vincenzo's Pizza Shelbyville
In contrast, if the pizza varieties offered by a restaurant sometimes did legitimately vary from one delivery area to another, the original three-column table would satisfy 4NF
*Fifth normal form (5NF), also known as project-join normal form (PJ/NF) is a level of database normalization designed to reduce redundancy in relational databases recording multi-valued facts by isolating semantically related multiple relationships. A table is said to be in the 5NF if and only if every non-trivial join dependency in it is implied by the candidate keys.
Consider the following example: Traveling Salesman
Brand Product Type
Jack Schneider Acme Vacuum Cleaner
Jack Schneider Acme Breadbox
Willy Loman Robusto Pruning Shears
Willy Loman Robusto Vacuum Cleaner
Willy Loman Robusto Breadbox
Willy Loman Robusto Umbrella Stand
Louis Ferguson Robusto Vacuum Cleaner
Louis Ferguson Robusto Telescope
Louis Ferguson Acme Vacuum Cleaner
Louis Ferguson Acme Lava Lamp
Louis Ferguson Nimbus Tie Rack
Traveling Salesman Product Availability By Brand
* The table's predicate is: Products of the type designated by Product Type, made by the brand designated by Brand, are available from the traveling salesman designated by Traveling Salesman.
* In the absence of any rules restricting the valid possible combinations of Traveling Salesman, Brand, and Product Type, the three-attribute table above is necessary in order to model the situation correctly.
* Suppose, however, that the following rule applies: A Traveling Salesman has certain Brands and certain Product Types in his repertoire. If Brand B1 and Brand B2 are in his repertoire, and Product Type P is in his repertoire, then (assuming Brand B1 and Brand B2 both make Product Type P), the Traveling Salesman must offer products of Product Type P those made by Brand B1 and those made by Brand B2.
* In that case, it is possible to split the table into three:Product Types By Traveling Salesman
Traveling Salesman
Product Type
Jack SchneiderVacuum Cleaner
Jack Schneider Breadbox
Willy Loman Pruning Shears
Willy LomanVacuum Cleaner
Willy Loman Breadbox
Willy Loman Umbrella Stand
Louis Ferguson Telescope
Louis FergusonVacuum Cleaner
Louis Ferguson Lava Lamp
Louis Ferguson Tie Rack
Brands By Traveling Salesman
Traveling
Salesman
Brand
Jack Schneider
Acme
Willy Loman
Robusto
Louis Ferguson
Robusto
Louis Ferguson
Acme
Louis Ferguson
Nimbus
Product Types By Brand
Brand Product Type
Acme Vacuum Cleaner
Acme Breadbox
Acme Lava Lamp
Robusto Pruning Shears
Robusto Vacuum Cleaner
Robusto Breadbox
Robusto Umbrella Stand
Robusto Telescope
Nimbus Tie Rack