dept. of cis, temple univ. cis661 – principles of data management

Dept. of CIS, Temple Univ. CIS661 – Principles of Data Management

V. MegalooikonomouDatabase design and

normalization(based on slides by C. Faloutsos at CMU)

Overview Relational model

formal query languages commercial query languages (SQL)

Integrity constraints domain I.C., foreign keys functional dependencies

DB design and normalization

Overview - detailed DB design and normalization

pitfalls of bad design decomposition normal forms

Design ‘good’ tables sub-goal#1: define what ‘good’ means sub-goal#2: fix ‘bad’ tables

in short: “we want tables where the attributes

depend on the primary key, on the whole key, and nothing but the key”

Let’s see why, and how:

Pitfalls

takes1 (ssn, c-id, grade, name, address)

Ssn c-id Grade Name Address

123 413 A smith Main

Pitfalls

‘Bad’ - why? because: ssn->address, name

123 415 B smith Main

Pitfalls Redundancy

space (inconsistencies) insertion/deletion anomalies:

Pitfalls insertion anomaly:

“jones” registers, but takes no class - no place to store his address!

… … … … …

234 null null jones Forbes

Pitfalls deletion anomaly:

delete the last record of ‘smith’ (we lose his address!)

123 415 B smith Main

Solution: decomposition split offending table in two (or

more), eg.:Ssn c-id Grade Name Address123 413 A smith Main123 415 B smith Main123 211 A smith Main

pitfalls of bad design decomposition

lossless join decomp. dependency preserving

normal forms

Decompositions there are ‘bad’ decompositions: we

want lossless and dependency preserving

Decompositions - lossy:R1(ssn, grade, name, address) R2(c-

id,grade)

Ssn c-id Grade Name Address123 413 A smith Main123 415 B smith Main

234 211 A jones Forbes

ssn->name, address

ssn, c-id -> grade

Ssn Grade Name Address123 A smith Main123 B smith Main

234 A jones Forbes

c-id Grade413 A415 B

Decompositions - lossy:can not recover original table with a

ssn->name, address

ssn, c-id -> grade

Ssn Grade Name Address123 A smith Main123 B smith Main

234 A jones Forbes

c-id Grade413 A415 B

Decompositions - lossy: Another example

Decomposition of R = (A, B) into

R1 = (A) R2 = (B)A B

rA(r) B(r)

A (r) B (r)A B

Decompositions

example of non-dependency preserving

S# address status123 London E125 Paris E

234 Pitts. A

S# -> address, status

address -> status

S# status123 E125 E

S# address123 London125 Paris

234 Pitts.

S# -> address S# -> status

Decompositions

is it lossless?S# address status123 London E125 Paris E

234 Pitts. A

address -> status

S# status123 E125 E

234 Pitts.

Decompositions - lossless

Definition: Consider schema R, with FD ‘F’. R1, R2 is a lossless join

decomposition of R if we always have:

An easier criterion?

rrr 21

Decomposition - lossless

Theorem: lossless join decomposition if the

joining attribute is a superkey in at least one of the new tables

Formally:

Decomposition - losslessexample:

ssn->name, address

ssn, c-id -> grade

Ssn c-id Grade123 413 A123 415 B

234 211 A

Ssn Name Address123 smith Main234 jones Forbes

ssn->name, addressssn, c-id -> grade

pitfalls of bad design decomposition

lossless join decomp. dependency preserving

normal forms

Decomposition - depend. pres.

informally: we don’t want the original FDs to span two tables - counter-example:

234 Pitts. A

address -> status

S# status123 E125 E

234 Pitts.

dependency preserving decomposition:

234 Pitts. A

address -> status

234 Pitts.

S# -> address address -> status

address statusLondon EParis E

Pitts. A

(but: S#->status ?)

informally: we don’t want the original FDs to span two tables.

More specifically: … the FDs of the canonical cover.

Let Fi be the set of dependencies F+ that include only attributes in Ri.

Preferably the decomposition should be dependency preserving, that is, (F1 F2 … Fn)

+ = F+

Otherwise, checking updates for violation of functional dependencies may require computing joins, which is expensive.

why is dependency preservation good?

234 Pitts.

Pitts. A

S# status123 E125 E

234 Pitts.

(address->status: ‘lost’)

A: eg., record that ‘Philly’ has status ‘A’

234 Pitts.

Pitts. A

S# status123 E125 E

234 Pitts.

S# -> addressS# -> status

(address->status: ‘lost’)

To check if a dependency is preserved in a decomposition of R into R1, R2, …, Rn we apply the following test (with attribute closure done w.r.t. F) result =

while (changes to result) dofor each Ri in the decomposition

t = (result Ri)+ Ri

result = result t If result contains all attributes in , then the functional dependency

is preserved. We apply the test on all dependencies in F to check if a decomposition is dependency preserving This procedure takes polynomial time, instead of the exponential time to compute F+ and (F1 F2 … Fn)

Decomposition - conclusions decompositions should always be

lossless joining attribute -> superkey

whenever possible, we want them to be dependency preserving (occasionally, impossible - see ‘STJ’ example later…)

Normalization using FD When we decompose a relation schema R with a set of

functional dependencies F into R1, R2,.., Rn we want: Lossless-join decomposition: Otherwise decomposition

would result in information loss. No redundancy: The relations Ri preferably should be in

either Boyce-Codd Normal Form or Third Normal Form. Dependency preservation: Let Fi be the set of

dependencies F+ that include only attributes in Ri. Preferably the decomposition should be dependency

preserving, that is, (F1 F2 … Fn)+ = F+

Otherwise, checking updates for violation of functional dependencies may require computing joins, which is expensive.

Normalization using FD - Example

R = (A, B, C)F = {A B, B C)

R1 = (A, B), R2 = (B, C) Lossless-join decomposition:

R1 R2 = {B} and B BC Dependency preserving

R1 = (A, B), R2 = (A, C) Lossless-join decomposition:

R1 R2 = {A} and A AB Not dependency preserving

(cannot check B C without computing R1 R2)

pitfalls of bad design decomposition (-> how to fix the

problem) normal forms (-> how to detect the

problem) BCNF, 3NF (1NF, 2NF)

Normal forms - BCNFWe saw how to fix ‘bad’ schemas -but what is a ‘good’ schema?

Answer: ‘good’, if it obeys a ‘normal form’,ie., a set of rules.

Typically: Boyce-Codd Normal Form (BCNF)

Normal forms - BCNF

Defn.: Rel. R is in BCNF wrt F, if informally: everything depends on

the full key, and nothing but the key

semi-formally: every determinant (of the cover) is a candidate key

Normal forms - BCNF

Example and counter-example:

Ssn Name Address123 smith Main123 smith Main

234 jones Forbes

ssn->name, address

ssn, c-id -> grade

Normal forms - BCNF

Formally: for every FD a->b in F+ a->b is trivial (a superset of b) or a is a superkey (or both)

Normal forms - BCNF

Theorem: given a schema R and a set of FD ‘F’, we can always decompose it to schemas R1, … Rn, so that R1, … Rn are in BCNF and the decompositions are lossless.

(but, some decomp. might lose dependencies)

BCNF Decompositionresult := {R};

done := false;compute F+;while (not done) do

if (there is a schema Ri in result that is not in BCNF)then begin

let be a nontrivial functionaldependency that holds on Ri

such that Ri is not in F+, and = ;

result := (result – Ri) (Ri – ) (, );

endelse done := true;

Note: each Ri is in BCNF, and decomposition is lossless-join.

BCNF Decomposition

How? ….essentially, break off FDs of the cover

eg. TAKES1(ssn, c-id, grade, name, address)ssn -> name, addressssn, c-id -> grade

Normal forms - BCNFeg. TAKES1(ssn, c-id, grade, name, address)

ssn -> name, address ssn, c-id -> grade

addressgrade

Normal forms - BCNF

ssn->name, address

ssn, c-id -> grade

Ssn c-id Grade123 413 A123 415 B

234 211 A

Ssn Name Address123 smith Main123 smith Main

234 jones Forbes

ssn->name, addressssn, c-id -> grade

Normal forms - BCNF

pictorially: we want a ‘star’ shape

addressgrade

:not in BCNF

Normal forms - BCNF

pictorially: we want a ‘star’ shape

Normal forms - BCNF

or a star-like: (eg., 2 cand. keys):STUDENT(ssn, st#, name, address)

address

Normal forms - BCNF

but not:

Normal forms - 3NF

consider the ‘classic’ case:STJ( Student, Teacher, subJect)

T-> JS,J -> T

is it BCNF?S

Normal forms - 3NF

STJ( Student, Teacher, subJect)T-> J S,J -> T

How to decompose it to BCNF?

Normal forms - 3NF

1) R1(T,J) R2(S,J) (BCNF? - lossless? - dep. pres.?

2) R1(T,J) R2(S,T) (BCNF? - lossless? - dep. pres.?

Normal forms - 3NF

1) R1(T,J) R2(S,J) (BCNF? Y+Y - lossless? N - dep. pres.? N )

2) R1(T,J) R2(S,T) (BCNF? Y+Y - lossless? Y - dep. pres.? N )

Normal forms - 3NF

in this case: impossible to have both BCNF and dependency preservationWelcome 3NF (a weaker normal

form)!

Normal forms - 3NF

informally, 3NF ‘forgives’ the red arrow in the can. cover

Normal forms - 3NFSTJ( Student,

Teacher, subJect)T-> J S,J -> T

Formally, a rel. R with FDs ‘F’ is in 3NF if: for every a->b in F+:

•it is trivial or

•a is a superkey or

•each b-a attr.: part of a cand. key

Normal forms - 3NF R = (J, K, L)

F = {JK L, L K}Two candidate keys = JK and JL

R is not in BCNF Any decomposition of R will fail to preserve

JK L BCNF decomposition has (JL) and (LK)

Testing for JK L requires a join

R is in 3NFJK L JK is a superkeyL K K is contained in a candidate key

There is some redundancy in this schema…

Normal forms - 3NF TESTING FOR 3NF

Optimization: Need to check only FDs in F, need not check all FDs in F+.

Use attribute closure to check, for each dependency , if is a superkey.

If is not a superkey, we have to verify if each attribute in is contained in a candidate key of R

this test is rather more expensive, since it involve finding candidate keys

testing for 3NF has been shown to be NP-hard Interestingly, decomposition into third normal form

(described shortly) can be done in polynomial time

Decomposition into 3NF Let Fc be a canonical cover for F;

i := 0;for each functional dependency in Fc do

if none of the schemas Rj, 1 j i contains then begin

i := i + 1;Ri :=

endif none of the schemas Rj, 1 j i contains a candidate key for R

then begini := i + 1;Ri := any candidate key for R;

end return (R1, R2, ..., Ri)

The dependencies are preserved by building explicitly a schema for each given dependency

Guarantees a lossless-join decomposition by having at least one schema containing a candidate key for the schema being decomposed

Normal forms - 3NF

how to bring a schema to 3NF?In short ….for each FD in the cover, put it in a

Normal forms - 3NF vs BCNF If ‘R’ is in BCNF, it is always in 3NF (but

not the reverse) In practice, aim for

BCNF; lossless join; and dep. preservation if impossible, we accept

3NF; but insist on lossless join and dep. Preservation

3NF has problems with transitive dependecies

3NF vs BCNF (cont.) Example of problems due to redundancy in 3NF

R = (J, K, L)F = {JK L, L K}

A schema that is in 3NF but not in BCNF has the problems of repetition of information (e.g., the relationship l1, k1) need to use null values (e.g., to represent the relationship

l2, k2 where there is no corresponding value for J).

J L Kj1

Normal forms - more details why ‘3’NF? what is 2NF? 1NF? 1NF: attributes are atomic (ie., no

set-valued attr., a.k.a. ‘repeating groups’)Ssn Name Dependents

123 Smith PeterMaryJohn

234 Jones AnnMichael

not 1NF

Normal forms - more details

2NF: 1NF and non-key attr. fully depend on the key

counter-example: TAKES1(ssn, c-id, grade, name, address)ssn -> name, address ssn, c-id -> grade

addressgradec-id

Normal forms - more details

3NF: 2NF and no transitive dependencies counter-example:

Din 2NF, but not in 3NF

Normal forms - more details 4NF, multivalued dependencies

etc: later… in practice, E-R diagrams usually

lead to tables in BCNF

Overview - conclusions

DB design and normalization pitfalls of bad design decompositions (lossless, dep.

preserving) normal forms (BCNF or 3NF)

“everything should depend on the key, the whole key, and nothing but the key”

dept. of cis, temple univ. cis661 – principles of data management

address r2cid

bad decompositions

megalooikonomoudatabase

schema r

exampledecomposition

pitfallstakes1 ssn

bad tablesin short

statusaddress statuss

Documents

cis configuration assessment tool cis-cat · cis slackware...

fuel injection fundamentals - porsche club of america...

masahiro machida (kyoto univ.) shu-ichiro inutsuka (kyoto...

cis radio stream radio voice procedure for cis instructors

algorithm analysis, asymptotic notations cisc4080 cis...

virtual data integration helena galhardas dei ist (based on...

smart skins- embedded sensor...

health safety & environment management system...

cis 319 cis/319 cis319 new 2015 tutorials

c1s aug2017dc1 - belfuse.com · - small size, 1206 smd type...

university of cyprus - ucy.ac.cy · pdf filecen cen cis cis...

dynamic programming cisc5835, algorithms for big data cis...

determination of cis,cis- and cis,trans muconic acid from...

graph: representation and traversal - fordham university ·...

directing precursor flux to optimize cis,cis-muconic acid...

Çankırı karatekin university...ders kodu tdi-101 ata-101...

temple university – cis dept. cis661– principles of...

temple university – cis dept. cis661 – principles of...

cis 068 welcome to cis 068 ! stacks and recursion

how cis affected me ? by deasia mosley 8 th what is cis? cis...