dept. of cis, temple univ. cis661 – principles of data management
Post on 31-Dec-2015
17 Views
Preview:
DESCRIPTION
TRANSCRIPT
1
Dept. of CIS, Temple Univ. CIS661 – Principles of Data Management
V. MegalooikonomouDatabase design and
normalization(based on slides by C. Faloutsos at CMU)
Overview Relational model
formal query languages commercial query languages (SQL)
Integrity constraints domain I.C., foreign keys functional dependencies
DB design and normalization
Design ‘good’ tables sub-goal#1: define what ‘good’ means sub-goal#2: fix ‘bad’ tables
in short: “we want tables where the attributes
depend on the primary key, on the whole key, and nothing but the key”
Let’s see why, and how:
Goal
Pitfalls
‘Bad’ - why? because: ssn->address, name
Ssn c-id Grade Name Address
123 413 A smith Main
123 415 B smith Main
123 211 A smith Main
Pitfalls insertion anomaly:
“jones” registers, but takes no class - no place to store his address!
Ssn c-id Grade Name Address
123 413 A smith Main
… … … … …
234 null null jones Forbes
Pitfalls deletion anomaly:
delete the last record of ‘smith’ (we lose his address!)
Ssn c-id Grade Name Address
123 413 A smith Main
123 415 B smith Main
123 211 A smith Main
Solution: decomposition split offending table in two (or
more), eg.:Ssn c-id Grade Name Address123 413 A smith Main123 415 B smith Main123 211 A smith Main
? ?
Overview - detailed DB design and normalization
pitfalls of bad design decomposition
lossless join decomp. dependency preserving
normal forms
Decompositions - lossy:R1(ssn, grade, name, address) R2(c-
id,grade)
Ssn c-id Grade Name Address123 413 A smith Main123 415 B smith Main
234 211 A jones Forbes
ssn->name, address
ssn, c-id -> grade
Ssn Grade Name Address123 A smith Main123 B smith Main
234 A jones Forbes
c-id Grade413 A415 B
211 A
Decompositions - lossy:can not recover original table with a
join!
Ssn c-id Grade Name Address123 413 A smith Main123 415 B smith Main
234 211 A jones Forbes
ssn->name, address
ssn, c-id -> grade
Ssn Grade Name Address123 A smith Main123 B smith Main
234 A jones Forbes
c-id Grade413 A415 B
211 A
Decompositions - lossy: Another example
Decomposition of R = (A, B) into
R1 = (A) R2 = (B)A B
121
A
B
12
rA(r) B(r)
A (r) B (r)A B
1212
Decompositions
example of non-dependency preserving
S# address status123 London E125 Paris E
234 Pitts. A
S# -> address, status
address -> status
S# status123 E125 E
234 A
S# address123 London125 Paris
234 Pitts.
S# -> address S# -> status
Decompositions
is it lossless?S# address status123 London E125 Paris E
234 Pitts. A
S# -> address, status
address -> status
S# status123 E125 E
234 A
S# address123 London125 Paris
234 Pitts.
S# -> address S# -> status
Decompositions - lossless
Definition: Consider schema R, with FD ‘F’. R1, R2 is a lossless join
decomposition of R if we always have:
An easier criterion?
rrr 21
Decomposition - lossless
Theorem: lossless join decomposition if the
joining attribute is a superkey in at least one of the new tables
Formally:
221
121
RRR
orRRR
Decomposition - losslessexample:
Ssn c-id Grade Name Address123 413 A smith Main123 415 B smith Main
234 211 A jones Forbes
ssn->name, address
ssn, c-id -> grade
Ssn c-id Grade123 413 A123 415 B
234 211 A
Ssn Name Address123 smith Main234 jones Forbes
ssn->name, addressssn, c-id -> grade
R1R2
Overview - detailed DB design and normalization
pitfalls of bad design decomposition
lossless join decomp. dependency preserving
normal forms
Decomposition - depend. pres.
informally: we don’t want the original FDs to span two tables - counter-example:
S# address status123 London E125 Paris E
234 Pitts. A
S# -> address, status
address -> status
S# status123 E125 E
234 A
S# address123 London125 Paris
234 Pitts.
S# -> address S# -> status
Decomposition - depend. pres.
dependency preserving decomposition:
S# address status123 London E125 Paris E
234 Pitts. A
S# -> address, status
address -> status
S# address123 London125 Paris
234 Pitts.
S# -> address address -> status
address statusLondon EParis E
Pitts. A
(but: S#->status ?)
Decomposition - depend. pres.
informally: we don’t want the original FDs to span two tables.
More specifically: … the FDs of the canonical cover.
Let Fi be the set of dependencies F+ that include only attributes in Ri.
Preferably the decomposition should be dependency preserving, that is, (F1 F2 … Fn)
+ = F+
Otherwise, checking updates for violation of functional dependencies may require computing joins, which is expensive.
Decomposition - depend. pres.
why is dependency preservation good?
S# address123 London125 Paris
234 Pitts.
S# -> address address -> status
address statusLondon EParis E
Pitts. A
S# status123 E125 E
234 A
S# address123 London125 Paris
234 Pitts.
S# -> address S# -> status
(address->status: ‘lost’)
Decomposition - depend. pres.
A: eg., record that ‘Philly’ has status ‘A’
S# address123 London125 Paris
234 Pitts.
S# -> address address -> status
address statusLondon EParis E
Pitts. A
S# status123 E125 E
234 A
S# address123 London125 Paris
234 Pitts.
S# -> addressS# -> status
(address->status: ‘lost’)
Decomposition - depend. pres.
To check if a dependency is preserved in a decomposition of R into R1, R2, …, Rn we apply the following test (with attribute closure done w.r.t. F) result =
while (changes to result) dofor each Ri in the decomposition
t = (result Ri)+ Ri
result = result t If result contains all attributes in , then the functional dependency
is preserved. We apply the test on all dependencies in F to check if a decomposition is dependency preserving This procedure takes polynomial time, instead of the exponential time to compute F+ and (F1 F2 … Fn)
+
Decomposition - conclusions decompositions should always be
lossless joining attribute -> superkey
whenever possible, we want them to be dependency preserving (occasionally, impossible - see ‘STJ’ example later…)
Normalization using FD When we decompose a relation schema R with a set of
functional dependencies F into R1, R2,.., Rn we want: Lossless-join decomposition: Otherwise decomposition
would result in information loss. No redundancy: The relations Ri preferably should be in
either Boyce-Codd Normal Form or Third Normal Form. Dependency preservation: Let Fi be the set of
dependencies F+ that include only attributes in Ri. Preferably the decomposition should be dependency
preserving, that is, (F1 F2 … Fn)+ = F+
Otherwise, checking updates for violation of functional dependencies may require computing joins, which is expensive.
Normalization using FD - Example
R = (A, B, C)F = {A B, B C)
R1 = (A, B), R2 = (B, C) Lossless-join decomposition:
R1 R2 = {B} and B BC Dependency preserving
R1 = (A, B), R2 = (A, C) Lossless-join decomposition:
R1 R2 = {A} and A AB Not dependency preserving
(cannot check B C without computing R1 R2)
Overview - detailed DB design and normalization
pitfalls of bad design decomposition (-> how to fix the
problem) normal forms (-> how to detect the
problem) BCNF, 3NF (1NF, 2NF)
Normal forms - BCNFWe saw how to fix ‘bad’ schemas -but what is a ‘good’ schema?
Answer: ‘good’, if it obeys a ‘normal form’,ie., a set of rules.
Typically: Boyce-Codd Normal Form (BCNF)
Normal forms - BCNF
Defn.: Rel. R is in BCNF wrt F, if informally: everything depends on
the full key, and nothing but the key
semi-formally: every determinant (of the cover) is a candidate key
Normal forms - BCNF
Example and counter-example:
Ssn Name Address123 smith Main123 smith Main
234 jones Forbes
ssn->name, address
Ssn c-id Grade Name Address123 413 A smith Main123 415 B smith Main
234 211 A jones Forbes
ssn->name, address
ssn, c-id -> grade
Normal forms - BCNF
Formally: for every FD a->b in F+ a->b is trivial (a superset of b) or a is a superkey (or both)
Normal forms - BCNF
Theorem: given a schema R and a set of FD ‘F’, we can always decompose it to schemas R1, … Rn, so that R1, … Rn are in BCNF and the decompositions are lossless.
(but, some decomp. might lose dependencies)
BCNF Decompositionresult := {R};
done := false;compute F+;while (not done) do
if (there is a schema Ri in result that is not in BCNF)then begin
let be a nontrivial functionaldependency that holds on Ri
such that Ri is not in F+, and = ;
result := (result – Ri) (Ri – ) (, );
endelse done := true;
Note: each Ri is in BCNF, and decomposition is lossless-join.
BCNF Decomposition
How? ….essentially, break off FDs of the cover
eg. TAKES1(ssn, c-id, grade, name, address)ssn -> name, addressssn, c-id -> grade
Normal forms - BCNFeg. TAKES1(ssn, c-id, grade, name, address)
ssn -> name, address ssn, c-id -> grade
name
addressgrade
c-id
ssn
Normal forms - BCNF
Ssn c-id Grade Name Address123 413 A smith Main123 415 B smith Main
234 211 A jones Forbes
ssn->name, address
ssn, c-id -> grade
Ssn c-id Grade123 413 A123 415 B
234 211 A
Ssn Name Address123 smith Main123 smith Main
234 jones Forbes
ssn->name, addressssn, c-id -> grade
Normal forms - BCNF
or a star-like: (eg., 2 cand. keys):STUDENT(ssn, st#, name, address)
name
address
ssn
st#
=
name
address
ssn
st#
Normal forms - 3NF
consider the ‘classic’ case:STJ( Student, Teacher, subJect)
T-> JS,J -> T
is it BCNF?S
T
J
Normal forms - 3NF
STJ( Student, Teacher, subJect)T-> J S,J -> T
1) R1(T,J) R2(S,J) (BCNF? - lossless? - dep. pres.?
)
2) R1(T,J) R2(S,T) (BCNF? - lossless? - dep. pres.?
)
Normal forms - 3NF
STJ( Student, Teacher, subJect)T-> J S,J -> T
1) R1(T,J) R2(S,J) (BCNF? Y+Y - lossless? N - dep. pres.? N )
2) R1(T,J) R2(S,T) (BCNF? Y+Y - lossless? Y - dep. pres.? N )
Normal forms - 3NF
STJ( Student, Teacher, subJect)T-> J S,J -> T
in this case: impossible to have both BCNF and dependency preservationWelcome 3NF (a weaker normal
form)!
Normal forms - 3NF
STJ( Student, Teacher, subJect)T-> J S,J -> T
S
J
T
informally, 3NF ‘forgives’ the red arrow in the can. cover
Normal forms - 3NFSTJ( Student,
Teacher, subJect)T-> J S,J -> T
S
J
T
Formally, a rel. R with FDs ‘F’ is in 3NF if: for every a->b in F+:
•it is trivial or
•a is a superkey or
•each b-a attr.: part of a cand. key
Normal forms - 3NF R = (J, K, L)
F = {JK L, L K}Two candidate keys = JK and JL
R is not in BCNF Any decomposition of R will fail to preserve
JK L BCNF decomposition has (JL) and (LK)
Testing for JK L requires a join
R is in 3NFJK L JK is a superkeyL K K is contained in a candidate key
There is some redundancy in this schema…
Normal forms - 3NF TESTING FOR 3NF
Optimization: Need to check only FDs in F, need not check all FDs in F+.
Use attribute closure to check, for each dependency , if is a superkey.
If is not a superkey, we have to verify if each attribute in is contained in a candidate key of R
this test is rather more expensive, since it involve finding candidate keys
testing for 3NF has been shown to be NP-hard Interestingly, decomposition into third normal form
(described shortly) can be done in polynomial time
Decomposition into 3NF Let Fc be a canonical cover for F;
i := 0;for each functional dependency in Fc do
if none of the schemas Rj, 1 j i contains then begin
i := i + 1;Ri :=
endif none of the schemas Rj, 1 j i contains a candidate key for R
then begini := i + 1;Ri := any candidate key for R;
end return (R1, R2, ..., Ri)
The dependencies are preserved by building explicitly a schema for each given dependency
Guarantees a lossless-join decomposition by having at least one schema containing a candidate key for the schema being decomposed
Normal forms - 3NF
how to bring a schema to 3NF?In short ….for each FD in the cover, put it in a
table
Normal forms - 3NF vs BCNF If ‘R’ is in BCNF, it is always in 3NF (but
not the reverse) In practice, aim for
BCNF; lossless join; and dep. preservation if impossible, we accept
3NF; but insist on lossless join and dep. Preservation
3NF has problems with transitive dependecies
3NF vs BCNF (cont.) Example of problems due to redundancy in 3NF
R = (J, K, L)F = {JK L, L K}
A schema that is in 3NF but not in BCNF has the problems of repetition of information (e.g., the relationship l1, k1) need to use null values (e.g., to represent the relationship
l2, k2 where there is no corresponding value for J).
J L Kj1
j2
j3
null
l1
l1
l1
l2
k1
k1
k1
k2
Normal forms - more details why ‘3’NF? what is 2NF? 1NF? 1NF: attributes are atomic (ie., no
set-valued attr., a.k.a. ‘repeating groups’)Ssn Name Dependents
123 Smith PeterMaryJohn
234 Jones AnnMichael
not 1NF
Normal forms - more details
2NF: 1NF and non-key attr. fully depend on the key
counter-example: TAKES1(ssn, c-id, grade, name, address)ssn -> name, address ssn, c-id -> grade
name
addressgradec-id
ssn
Normal forms - more details
3NF: 2NF and no transitive dependencies counter-example:
B
C
A
Din 2NF, but not in 3NF
Normal forms - more details 4NF, multivalued dependencies
etc: later… in practice, E-R diagrams usually
lead to tables in BCNF
top related