temple university – cis dept. cis616– principles of data management
DESCRIPTION
Temple University – CIS Dept. CIS616– Principles of Data Management. V. Megalooikonomou Database Design and Normalization (based on notes by Silberchatz,Korth, and Sudarshan and notes by C. Faloutsos at CMU). Overview. Relational model formal query languages commercial query languages (SQL) - PowerPoint PPT PresentationTRANSCRIPT
Temple University – CIS Dept.CIS616– Principles of Data Management
V. Megalooikonomou
Database Design and Normalization
(based on notes by Silberchatz,Korth, and Sudarshan and notes by C. Faloutsos at CMU)
Overview Relational model
formal query languages commercial query languages (SQL)
Integrity constraints domain I.C., foreign keys functional dependencies
Functional Dependencies DB design and normalization
Overview - detailed
DB design and normalization pitfalls of bad design decomposition normal forms
Design ‘good’ tables sub-goal#1: define what ‘good’ means sub-goal#2: fix ‘bad’ tables
in short: “we want tables where the attributes
depend on the primary key, on the whole key, and nothing but the key”
Let’s see why, and how:
Goal
Pitfalls
takes1 (ssn, c-id, grade, name, address)
Ssn c-id Grade Name Address
123cs331 A smith Main
Pitfalls
‘Bad’ - why? because: ssn->address, name
Ssn c-id Grade Name Address
123 cs331 A smith Main
123 cs351 B smith Main
123 cs211 A smith Main
Pitfalls
Redundancy space (inconsistencies) insertion/deletion anomalies:
Pitfalls
insertion anomaly: “jones” registers, but takes no class -
no place to store his address!
Ssn c-id Grade Name Address
123 cs331 A smith Main
… … … … …
234 null null jones Forbes
Pitfalls
deletion anomaly: delete the last record of ‘smith’ (we
lose his address!)
Ssn c-id Grade Name Address
123 cs331 A smith Main
123 cs351 B smith Main
123 cs211 A smith Main
Solution: decomposition
split offending table in two (or more), e.g.:
Ssn c-id Grade Name Address123 cs331A smith Main123 cs351B smith Main123 cs211A smith Main
? ?
Overview - detailed
DB design and normalization pitfalls of bad design decomposition
lossless join dependency preserving
normal forms
Decompositions
there are ‘bad’ decompositions we want:
lossless and dependency preserving
Decompositions - lossy:
R1(ssn, grade, name, address) R2(c-id,grade)
Ssn c-id Grade Name Address123 cs331 A smith Main123 cs351 B smith Main
234 cs211 A jones Forbes
ssn->name, address
ssn, c-id -> grade
Ssn Grade Name Address123 A smith Main123 B smith Main
234 A jones Forbes
c-id Grade
cs331 Acs351 B
cs211 A
Decompositions - lossy:
can not recover original table with a join!
Ssn c-id Grade Name Address
123 cs331 A smith Main123 cs351 B smith Main
234 cs211 A jones Forbes
ssn->name, address
ssn, c-id -> grade
Ssn Grade Name Address123 A smith Main123 B smith Main
234 A jones Forbes
c-id Grade
cs331 Acs351 B
cs211 A
Decompositions – lossy: Another example
Decomposition of R = (A, B) into R1 = (A), R2 = (B)
A B
121
A
B
12
rA(r) B(r)
A (r) B (r)A B
1212
Decompositions
example of non-dependency preserving
S# address status123 London E125 Paris E
234 Pitts. A
S# -> address, status
address -> status
S# status123 E125 E
234 A
S# address123 London125 Paris
234 Pitts.
S# -> address S# -> status
Decompositions
is it lossless?
S# address status123 London E125 Paris E
234 Pitts. A
S# -> address, status
address -> status
S# status123 E125 E
234 A
S# address123 London125 Paris
234 Pitts.
S# -> address S# -> status
Decompositions - lossless
Definition: Consider schema R, with FD ‘F’. R1, R2 is a lossless join decomposition
of R if we always have:
An easier criterion?
rrr 21
Decomposition - lossless
Theorem: lossless join decomposition if the
joining attribute is a superkey in at least one of the new tables
Formally:
221
121
RRR
orRRR
Decomposition - losslessexample:
Ssn c-id Grade Name Address
123 cs331 A smith Main123 cs351 B smith Main
234 cs211 A jones Forbes
ssn->name, address
ssn, c-id -> grade
Ssn c-id Grade123 cs331 A123 cs351 B
234 cs211 A
Ssn Name Address123 smith Main234 jones Forbes
ssn->name, addressssn, c-id -> grade
R1R2
Overview - detailed
DB design and normalization pitfalls of bad design decomposition
lossless join decomp. dependency preserving
normal forms
Decomposition - depend. pres.
informally: we don’t want the original FDs to span two tables - counter-example:
S# address status123 London E125 Paris E
234 Pitts. A
S# -> address, status
address -> status
S# status123 E125 E
234 A
S# address123 London125 Paris
234 Pitts.
S# -> address S# -> status
Decomposition - depend. pres.
dependency preserving decomposition:
S# address status123 London E125 Paris E
234 Pitts. A
S# -> address, status
address -> status
S# address123 London125 Paris
234 Pitts.
S# -> address address -> status
address statusLondon EParis E
Pitts. A
(but: S#->status ?)
Decomposition - depend. pres.
informally: we don’t want the original FDs to span two tables
more specifically: … the FDs of the canonical cover
Let Fi be the set of dependencies F+ that include only attributes in Ri.
Preferably the decomposition should be dependency preserving, that is, (F1 F2 … Fn)
+ = F+
Otherwise, checking updates for violation of functional dependencies may require computing joins expensive
Decomposition - depend. pres.
why is dependency preservation good?
S# address123 London125 Paris
234 Pitts.
S# -> address address -> status
address statusLondon EParis E
Pitts. A
S# status123 E125 E
234 A
S# address123 London125 Paris
234 Pitts.
S# -> address S# -> status
(address->status: ‘lost’)
Decomposition - depend. pres.
A: eg., record that ‘Philly’ has status ‘A’
S# address123 London125 Paris
234 Pitts.
S# -> address address -> status
address statusLondon EParis E
Pitts. A
S# status123 E125 E
234 A
S# address123 London125 Paris
234 Pitts.
S# -> addressS# -> status
(address->status: ‘lost’)
Decomposition - depend. pres.
To check if a dependency is preserved in a decomposition of R into R1, R2, …, Rn we apply the following test (with attribute closure done w.r.t. F)
result = while (changes to result) do
for each Ri in the decompositiont = (result Ri)+ Ri
result = result t If result contains all attributes in , then functional dependency is
preserved
We apply the test on all dependencies in F to check if a decomposition is dependency preserving
The test takes polynomial time Computing F+ and (F1 F2 … Fn)
+ needs exponential time
Decomposition - conclusions
decompositions should always be lossless joining attribute -> superkey
whenever possible, we want them to be dependency preserving (occasionally, impossible - see ‘STJ’ example later…)
Normalization using FD When decomposing a relation schema R with a set of
functional dependencies F into R1, R2,…, Rn we want:
Lossless-join decomposition: otherwise … information loss
No redundancy: relations Ri preferably should be in either Boyce-Codd Normal Form or Third Normal Form
Dependency preservation: Let Fi be the set of dependencies in F+ that include only attributes in Ri.
Preferably the decomposition should be dependency preserving, i.e., (F1 F2 … Fn)
+ = F+
Otherwise, checking updates for violation of functional dependencies may require computing joins expensive
Normalization using FD - Example
R = (A, B, C)F = {A B, B C)
R1 = (A, B), R2 = (B, C) Lossless-join decomposition:
R1 R2 = {B} and B BC Dependency preserving
R1 = (A, B), R2 = (A, C) Lossless-join decomposition:
R1 R2 = {A} and A AB Not dependency preserving
(cannot check B C without computing R1 R2)
Overview - detailed
DB design and normalization pitfalls of bad design decomposition ( how to fix the problem) normal forms ( how to detect the
problem) BCNF, 3NF, (1NF, 2NF)
Normal forms - BCNF
We saw how to fix ‘bad’ schemas -but what is a ‘good’ schema?
Answer: ‘good’, if it obeys a ‘normal form’,i.e., a set of rules
Typically: Boyce-Codd Normal Form (BCNF)
Normal forms - BCNF
Defn.: Rel. R is in BCNF w.r.t. F, if
informally: everything depends on the full key, and nothing but the key
semi-formally: every determinant (of the cover) is a candidate key
Normal forms - BCNF
Example and counter-example:
Ssn Name Address123 smith Main123 smith Main
234 jones Forbes
ssn->name, address
Ssn c-id Grade Name Address123 cs331 A smith Main123 cs351 B smith Main
234 cs211 A jones Forbes
ssn->name, address
ssn, c-id -> grade
Normal forms - BCNF
Formally: for every FD a->b in F+ a->b is trivial (a is a superset of b) or a is a superkey (or both)
Normal forms - BCNF
Theorem: given a schema R and a set of FD ‘F’, we can always decompose it to schemas R1, … Rn, so that R1, … Rn are in BCNF and the decomposition is lossless
(…but, some decomp. might lose dependencies)
BCNF Decomposition
How? ….essentially, break off FDs of the cover
eg. TAKES1(ssn, c-id, grade, name, address)ssn -> name, addressssn, c-id -> grade
Normal forms - BCNF
eg. TAKES1(ssn, c-id, grade, name, address)ssn -> name, address ssn, c-id -> grade
name
addressgrade
c-id
ssn
Normal forms - BCNF
Ssn c-id Grade Name Address
123 cs331 A smith Main123 cs351 B smith Main
234 cs211 A jones Forbes
ssn->name, address
ssn, c-id -> grade
Ssn c-id Grade123 cs331 A123 cs351 B
234 cs211 A
Ssn Name Address123 smith Main123 smith Main
234 jones Forbes
ssn->name, addressssn, c-id -> grade
Normal forms - BCNF
pictorially: we want a ‘star’ shape
name
addressgrade
c-id
ssn
:not in BCNF
Normal forms - BCNF
pictorially: we want a ‘star’ shape
B
C
A G
E
Dor
F
H
Normal forms - BCNF
or a star-like: (e.g., 2 cand. keys):STUDENT(ssn, st#, name, address)
name
address
ssn
st#
=
name
address
ssn
st#
Normal forms - BCNF
but not:
or
B
C
A
D
G
E
D
F
H
BCNF Decomposition
result := {R};done := false;compute F+;while (not done) do
if (there is a schema Ri in result that is not in BCNF)then begin
let be a nontrivial functionaldependency that holds on Ri
such that Ri is not in F+, and = ;
result := (result – Ri) (Ri – ) (, ); end
else done := true;
Note: each Ri is in BCNF, and decomposition is lossless-join
Normal forms - 3NF
consider the ‘classic’ case:STJ( Student, Teacher, subJect)
T-> JS,J -> T
is it BCNF? S
T
J
Normal forms - 3NF
STJ( Student, Teacher, subJect)T-> J S,J -> T
How to decompose it to BCNF?
S
T
J
Normal forms - 3NF
STJ( Student, Teacher, subJect)T-> J S,J -> T
1) R1(T,J) R2(S,J) (BCNF? - lossless? - dep. pres.? )
2) R1(T,J) R2(S,T) (BCNF? - lossless? - dep. pres.? )
Normal forms - 3NF
STJ( Student, Teacher, subJect)T-> J S,J -> T
1) R1(T,J) R2(S,J) (BCNF? Y+Y - lossless? N - dep. pres.? N )
2) R1(T,J) R2(S,T) (BCNF? Y+Y - lossless? Y - dep. pres.? N )
Normal forms - 3NF
STJ( Student, Teacher, subJect)T-> J S,J -> T
in this case: impossible to have both BCNF and dependency preservation
Welcome 3NF (…a weaker normal form)!
Normal forms - 3NF
STJ( Student, Teacher, subJect)T-> J S,J -> T
S
J
T
informally, 3NF ‘forgives’ the red arrow in the can. cover
Normal forms - 3NF
STJ( Student, Teacher, subJect)T-> J S,J -> T
S
J
T
Formally, a rel. R with FDs ‘F’ is in 3NF if:
for every a->b in F+:
•it is trivial or
•a is a superkey or
•each b-a attr.: part of a cand. key
Normal forms - 3NF
R = (J, K, L)F = {JK L, L K}Two candidate keys = JK and JL
R is not in BCNF Any decomposition of R will fail to preserve
JK L BCNF decomposition has (JL) and (LK)
Testing for JK L requires a join
R is in 3NFJK L JK is a superkeyL K K is contained in a candidate key
There is some redundancy in this schema…
Normal forms - 3NF TESTING FOR 3NF
Optimization: Need to check only FDs in F, need not check all FDs in F+
Use attribute closure to check, for each dependency , if is a superkey
If is not a superkey, we have to verify if each attribute in is contained in a candidate key of R this test is more expensive; it involves finding
candidate keys testing for 3NF is NP-hard Interestingly, decomposition into 3NF (described
shortly) can be done in polynomial time
Decomposition into 3NF
Let Fc be a canonical cover for F;i := 0;for each functional dependency in Fc do
if none of the schemas Rj, 1 j i contains then begin
i := i + 1;Ri :=
endif none of the schemas Rj, 1 j i contains a candidate key for R
then begini := i + 1;Ri := any candidate key for R;
end return (R1, R2, ..., Ri)
The dependencies are preserved by building explicitly a schema for each given dependency
Guarantees a lossless-join decomposition by having at least one schema containing a candidate key for the schema being decomposed
Normal forms - 3NF
how to bring a schema to 3NF?In short ….for each FD in the cover, put it in a
table
Normal forms - 3NF vs BCNF
If ‘R’ is in BCNF, it is always in 3NF (but not the reverse)
In practice, aim for BCNF; lossless join; and dep. preservation
if impossible, we accept 3NF; but insist on lossless join and dep.
preservation 3NF has problems with transitive
dependecies
3NF vs BCNF (cont.) Example of problems due to redundancy in 3NF
R = (J, K, L)F = {JK L, L K}
A schema that is in 3NF but not in BCNF has the problems of repetition of information (e.g., the relationship l1, k1) need to use null values (e.g., to represent the relationship
l2, k2 where there is no corresponding value for J).
J L Kj1
j2
j3
null
l1
l1
l1
l2
k1
k1
k1
k2
Normal forms - more details
why ‘3’NF? what is 2NF? 1NF? 1NF: attributes are atomic (i.e., no
set-valued attr., a.k.a. ‘repeating groups’)Ssn Name Dependents123 Smith Peter
MaryJohn
234 Jones AnnMichael
not 1NF
Normal forms - more details
2NF: 1NF and non-key attr. fully depend on the keycounter-example: TAKES1(ssn, c-id, grade, name, address)ssn -> name, address ssn, c-id -> grade
name
addressgradec-id
ssn
Normal forms - more details
3NF: 2NF and no transitive dependencies counter-example:
B
C
A
Din 2NF, but not in 3NF
Normal forms - more details
4NF, multivalued dependencies etc:
later… in practice, E-R diagrams usually
lead to tables in BCNF
Overview - conclusions
DB design and normalization pitfalls of bad design decompositions (lossless, dep.
preserving) normal forms (BCNF or 3NF)
“everything should depend on the key, the whole key, and nothing but the key”