database normalization revisited: an information-theoretic approach
DESCRIPTION
Database Normalization Revisited: An information-theoretic approach. Leonid Libkin Joint work with Marcelo Arenas and Solmaz Kolahi. Sources. M. Arenas, L. An information-theoretic approach to normal forms for relational and XML data , PODS’03; J.ACM, 2005. - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/1.jpg)
Database Normalization Revisited:
An information-theoretic approach
Leonid Libkin
Joint work with Marcelo Arenas and Solmaz Kolahi
![Page 2: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/2.jpg)
Sources
M. Arenas, L. An information-theoretic approach to normal forms for relational and XML data,
PODS’03; J.ACM, 2005. S. Kolahi, L. Redundancy vs dependency-
preservation in normalisation: an information-theoretic analysis of 3NF,
PODS’06.
![Page 3: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/3.jpg)
Outline
Part 1 - Database Normalization from the 1970s and 1980s
Part 2: Classical theory re-done: new justification for normal forms:• BCNF and relatives (academic, eliminate
redundancies)
• 3NF (practical, may leave some redundancies)
Part 3: An XML application
2
![Page 4: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/4.jpg)
If you haven’t taught “Intro to DB” lately…
Design: decide how to represent the information in a particular data model.
• Even for simple application domains there is a large number of ways of representing the data of interest.
We have to design the schema of the database.
• Set of relations.
• Set of attributes for each relation.
• Set of data dependencies.
3
![Page 5: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/5.jpg)
Normalization Theory Today
Normalization theory for relational databases was developed in the 70s and 80s.
Why do we need normalization theory today?• New data models have emerged: XML.
• XML documents can contain redundant information.
Redundant information in XML documents:• Can be discovered if the user provides semantic
information.
• Can be eliminated.
15
![Page 6: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/6.jpg)
Designing a Database: An Example
Attributes: number, title, section, room.
Data dependency: every course number is associated with only one title.
Relational Schema:
R(number, title, section, room),
number title
GOOD alternative:
S(number, title), number title
T(number, section, room), Ø
4
BAD alternative:
![Page 7: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/7.jpg)
Problems with BAD:Redundancies and Update Anomalies
number title section room
CSC258 Computer Organization 1 LP266
CSC258 Computer Organization 2 GB258
CSC258 Computer Organization 3 GB248
CSC434 Database Systems 1 GB248
5
![Page 8: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/8.jpg)
Deletion Anomaly
number title section room
CSC258 Computer Organization I
1 LP266
CSC258 Computer Organization I
2 GB258
CSC258 Computer Organization I
3 GB248
CSC434 Database Systems 1 GB248CSC434 is not given in this term.
6
![Page 9: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/9.jpg)
Deletion Anomaly
number title section room
CSC258 Computer Organization I
1 LP266
CSC258 Computer Organization I
2 GB258
CSC258 Computer Organization I
3 GB248
CSC434 Database Systems 1 GB248CSC434 is not given in this term.
6
![Page 10: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/10.jpg)
Deletion Anomaly
number title section room
CSC258 Computer Organization I
1 LP266
CSC258 Computer Organization I
2 GB258
CSC258 Computer Organization I
3 GB248
CSC434 is not given in this term.
Additional effect: all the information about CSC434 was deleted.
6
![Page 11: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/11.jpg)
Avoiding Update Anomalies
number
title
CSC258
Computer Organization
CSC434
Database Systems
number
section room
CSC258
1 LP266
CSC258
2 GB258
CSC258
3 GB248
CSC434
1 GB248
8
![Page 12: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/12.jpg)
Avoiding Update Anomalies
number
title
CSC258
Computer Organization I
CSC434
Database Systems
number
section room
CSC258
1 LP266
CSC258
2 GB258
CSC258
3 GB248
CSC434
1 GB248
The instance does not store redundant information.
8
![Page 13: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/13.jpg)
Avoiding Update Anomalies
number
title
CSC258
Computer Organization I
CSC434
Database Systems
number
section room
CSC258
1 LP266
CSC258
2 GB258
CSC258
3 GB248
CSC434
1 GB248CSC434 is not given in this term.
8
![Page 14: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/14.jpg)
Avoiding Update Anomalies
number
title
CSC258
Computer Organization I
CSC434
Database Systems
number
section room
CSC258
1 LP266
CSC258
2 GB258
CSC258
3 GB248
CSC434 is not given in this term.
The title of CSC434 is not removed from the instance.
8
![Page 15: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/15.jpg)
Normalization Theory
Main idea: a normal form defines a condition that a well designed database should satisfy.
Normal form: syntactic condition on the database schema.• Defined for a class of data dependencies.
Main problems:
• How to test whether a database schema is in a particular normal form.
• How to transform a database schema into an equivalent one satisfying a particular normal form.
10
![Page 16: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/16.jpg)
BCNF: a Normal Form for FDs
Functional dependency (FD) over R(A1, …, An) : X Y , X, Y {A1, …, An}.
X Y : two rows with the same X-values must have the same Y-values.• Number Title in our example
Key dependency : X A1 …. An
• X is a key: two distinct rows must have distinct X-values.
11
![Page 17: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/17.jpg)
BCNF: a Normal Form for FDs
is a set of FD over R(A1, …, An).
Relation schema R(A1, …, An), is in BCNF if for every nontrivial X Y in , X is a key.
A relational schema is in BCNF if every relation schema is in BCNF.
In BCNF: S(number, title), number title
T(number, section, room),
Not in BCNF:
R(number, title, section, room),
number title
In BCNF: S(number, title), number title
T(number, section, room),
In BCNF: S(number, title), number title
T(number, section, room), Ø
12
![Page 18: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/18.jpg)
BCNF Decomposition
Relation schema: R(X,Y,Z), • Not in BCNF: implies X Y and but not X A, for
every A Z.
Basic decomposition: replace R(X,Y,Z) by S(X,Y) and T(X,Z).
Example:
13
R(number, title, section, room),
number title
S(number, title), number title
T(number, section, room), Ø
![Page 19: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/19.jpg)
Lossless Decomposition
number title section room
CSC258 Computer Organization 1 LP266
CSC258 Computer Organization 2 GB258
CSC434 Database Systems 1 GB248
number
title
CSC258
Computer Organization
CSC434
Database Systems
number
section room
CSC258
1 LP266
CSC258
2 GB258
CSC434
1 GB248
∏number, title (R) ∏number, section, room (R)
14
![Page 20: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/20.jpg)
Lossless Decomposition
number title section room
CSC258 Computer Organization 1 LP266
CSC258 Computer Organization 2 GB258
CSC434 Database Systems 1 GB248
number
title
CSC258
Computer Organization
CSC434
Database Systems
number
section room
CSC258
1 LP266
CSC258
2 GB258
CSC434
1 GB248
S JoinT
14
![Page 21: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/21.jpg)
How to justify good designs?
What is a good database design?
• Well-known solutions: BCNF, 4NF, 3NF…
But what is it that makes a database design good?
• Elimination of update anomalies.
• Existence of algorithms that produce good designs: lossless decomposition, dependency preservation.
34
![Page 22: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/22.jpg)
Problems with traditional approaches
Many papers tried to justify normal forms. Problem: tied very closely to the relational
model. Relied on well-defined notions of
queries/updates.
These days we want to deal with other data models, in particular XML.
We need an approach that extends to other models, in particular, XML.
![Page 23: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/23.jpg)
Justification of Normal Forms
Problematic to evaluate XML normal forms.
• No XML update language has been standardized.
• No XML query language yet has the same “yardstick” status as relational algebra.
• We do not even know if implication of XML FDs is decidable!
We need a different approach.
• It must be based on some intrinsic characteristics of the data.
• It must be applicable to new data models.
• It must be independent of query/update/constraint issues.
Our approach is based on information theory.
35
![Page 24: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/24.jpg)
Information Theory
Entropy measures the amount of information provided by a certain event.
Assume that an event can have n different outcomes with probabilities p1, …, pn.
Amount of information gained by knowing that event i occurred :Average amount of information gained (entropy) :
Entropy is maximal if each pi = 1/n :
36
ip
1log
∑=
n
i ii p
p1
1log
nlog
![Page 25: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/25.jpg)
Entropy and Redundancies
Database schema: R(A,B,C), A B
Instance I:
Pick a domain properly containing adom(I) :
• Probability distribution: P(4) = 0 and P(a) = 1/5, a ≠ 4
• Entropy: log 5 ≈ 2.322
A B C
1 2 3
1 2 4
A B C
1 2 3
1 2 4
A B C
1 2
1 2 4
A B C
1 2 3
1 2 4
A B C
1 3
1 2 4
Pick a domain properly containing adom(I) : {1, …, 6}
• Probability distribution: P(2) = 1 and P(a) = 0, a ≠ 2
• Entropy: log 1 = 0
{1, …, 6}
37
![Page 26: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/26.jpg)
Entropy and Normal Forms
Let be a set of FDs over a schema S.
Theorem (S,) is in BCNF if and only if for every instance of (S,) and for every domain properly containing adom(I), each position carries non-zero amount of information (entropy > 0).
A similar result holds for 4NF and MVDs.
This is a clean characterization of BCNF and 4NF, but the measure is not accurate enough ...
38
![Page 27: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/27.jpg)
Problems with the Measure
The measure cannot distinguish between different types of data dependencies.
It cannot distinguish between different instances of the same schema:
A B C
1 2 3
1 2 4
1 5
A B C
1 2 3
1 4
entropy = 0
R(A,B,C), A B
entropy = 0
39
![Page 28: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/28.jpg)
A General Measure
Instance I of schema R(A,B,C), A B :
A B C
1 2 3
1 2 4
40
![Page 29: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/29.jpg)
A General Measure
Instance I of schema R(A,B,C), A B :
Initial setting: pick a position p Pos(I) and pick k such that adom(I) {1, …, k}. For example, k = 7.
A B C
1 2 3
1 2 4
40
![Page 30: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/30.jpg)
A General Measure
Instance I of schema R(A,B,C), A B :
Initial setting: pick a position p Pos(I) and pick k such that adom(I) {1, …, k}. For example, k = 7.
A B C
1 2 3
1 2 4
40
![Page 31: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/31.jpg)
A General Measure
Instance I of schema R(A,B,C), A B :
Initial setting: pick a position p Pos(I) and pick k such that adom(I) {1, …, k}. For example, k = 7.
A B C
1 3
1 2 4
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}.
40
![Page 32: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/32.jpg)
A General Measure
Instance I of schema R(A,B,C), A B :
A B C
1 3
1 2 4
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}.
40
![Page 33: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/33.jpg)
A General Measure
Instance I of schema R(A,B,C), A B :
A B C
3
1 2
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}.
40
![Page 34: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/34.jpg)
A General Measure
Instance I of schema R(A,B,C), A B :
A B C
3
1 2
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}.
P(2 | X) =
40
![Page 35: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/35.jpg)
A General Measure
Instance I of schema R(A,B,C), A B :
A B C
2 3
1 2
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}.
P(2 | X) =
40
![Page 36: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/36.jpg)
A General Measure
Instance I of schema R(A,B,C), A B :
A B C
1 2 3
1 2 1
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}.
P(2 | X) =
40
![Page 37: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/37.jpg)
A General Measure
Instance I of schema R(A,B,C), A B :
A B C
4 2 3
1 2 7
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}.
P(2 | X) =
40
![Page 38: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/38.jpg)
A General Measure
Instance I of schema R(A,B,C), A B :
A B C
1 2 3
1 2 3
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}.
P(2 | X) = 48/
40
![Page 39: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/39.jpg)
A General Measure
Instance I of schema R(A,B,C), A B :
A B C
3
1 2
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}.
P(2 | X) = 48/
For a ≠ 2, P(a | X) =
40
![Page 40: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/40.jpg)
A General Measure
Instance I of schema R(A,B,C), A B :
A B C
a 3
1 2
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}.
P(2 | X) = 48/
For a ≠ 2, P(a | X) =
40
![Page 41: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/41.jpg)
A General Measure
Instance I of schema R(A,B,C), A B :
A B C
2 a 3
1 2 7
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}.
P(2 | X) = 48/
For a ≠ 2, P(a | X) =
40
![Page 42: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/42.jpg)
A General Measure
Instance I of schema R(A,B,C), A B :
A B C
1 a 3
1 2 6
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}.
P(2 | X) = 48/
For a ≠ 2, P(a | X) = 42
(48 + 6 * 42) = 0.16
/ (48 + 6 * 42) = 0.14
Entropy ≈ 2.8057 (log 7 ≈ 2.8073)
40
![Page 43: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/43.jpg)
A General Measure
Instance I of schema R(A,B,C), A B :
A B C
1 3
1 2 4
Value : we consider the average over all sets X Pos(I) – {p}.
•Average: 2.4558 < log 7 (maximal entropy)
•It corresponds to conditional entropy.
•It depends on the value of k ...40
![Page 44: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/44.jpg)
A General Measure: Relative Information Content (RIC)
Previous value: RICIk(Σ|p)
For each k, we consider the ratio: RICIk(Σ|p) / log k
• How close the given position p is to having the maximum possible information content.
General measure (Arenas, L. 2003):
RICI(Σ|p) = limk ∞ RICIk(Σ|p) / log k
41
![Page 45: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/45.jpg)
Basic Properties
The measure is well defined:
For every set of first order constraints Σ, every instance I of Σ, and every position p in I, RICI(Σ|p) exists.
Bounds: 0 ≤ RICI(Σ|p) ≤ 1
Closer to 1 = Less redundancy
42
![Page 46: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/46.jpg)
Basic Properties
The measure does not depend on a particular representation of constraints.
It overcomes the limitations of the simple measure: R(A,B,C), A B
A B C
1 2 3
1 2 4
1 5
A B C
1 2 3
1 4
0.875 0.781
43
![Page 47: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/47.jpg)
Well-Designed Databases
Definition A database specification (S,) is well-designed if for every I inst(S,) and every p Pos(I), RICI(Σ|p) = 1.
In other words, every position in every instance carries the maximum possible amount of information.
44
![Page 48: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/48.jpg)
Relational Databases (Arenas, L.’03)
is a set of data dependencies over a schema S:
= Ø: (S,) is well-designed.
is a set of FDs: (S,) is well-designed if and only if (S,) is in BCNF.
is a set of FDs and MVDs: (S,) is well-designed if and only if (S,) is in 4NF.
is a set of FDs and JDs:
• If (S,) is in PJ/NF or in 5NFR, then (S,) is well-designed. The converse is not true.
• A syntactic characterization of being well-designed is given in [AL03].
45
![Page 49: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/49.jpg)
Decidability Issues
If Σ is a set of First-Order integrity constraints, then the problem of verifying whether a relational schema is well-designed is undecidable.
If Σ contains only universal constraints (FDs, MVDs, JDs, …), then the problem becomes decidable.
High complexity (coNEXPTIME) by reduction to the (complement) of Bernays-Schönfinkel satisfiability.
46
![Page 50: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/50.jpg)
3NF
BCNF is the most popular textbook normal form.
In practice 3NF is much more common.
From Oracle's “General Database Design FAQ”:
after defining 1NF, 2NF, and 3NF, it says: that there are other normal forms but “their
definitions are of academic concern only, and are rarely required for practical purposes”
![Page 51: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/51.jpg)
Reminder: 3NF
A candidate key: a minimal (wrt subset) key A prime attribute: an attribute that belongs to
a candidate key.
BCNF: For a nontrivial FD X A, where A is an attribute, X must be a key.
3NF (Bernstein/Zaniolo): For a nontrivial FD X A, X must be a key OR A must be prime.
![Page 52: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/52.jpg)
Why 3NF?
Because some relational schemas do not have decompositions that are both:
• In BCNF, and
• Preserve all functional dependencies
Example: ABC, AB C, C A On the other hand, there always exists a
lossless dependency preserving 3NF decomposition.
![Page 53: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/53.jpg)
Redundancies vs Dependency-Preservation
To achieve complete elimination of redundancies (BCNF), one has to pay in terms of dependency preservation.
Losing constraints is often undesirable (database integrity must be enforced).
What is we only consider normal form that guarantee lossless dependency-preserving decomposition?
Which is best? Is it 3NF?
![Page 54: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/54.jpg)
3NF: how low can one go?
Is there a lower bound for RICI(Σ|p) if Σ is in 3NF?
PROPOSITION (Kolahi DBPL 2005) For every ε > 0, there exists a 3NF schema Σ, an
instance I and a position p so that RICI(Σ|p) < ε.
BUT: I has many attributes (increasing with 1/ε) Σ can be further decomposed into better 3NF
designs using the standard synthesis algorithm.
![Page 55: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/55.jpg)
How good is 3NF?
Let NF be a dependency-preserving normal form (guaranteeing lossless dependency-preserving decompositions) based on functional dependencies.
The guaranteed information content of NF is inf { c in [0,1] | for all schemas Σ, exists an NF-decomposition Σ1,…, Σm such that RI(Σi|p) ≥ c
for positions p in all instances I of Σi}
PRICE(NF) = 1 – Guaranteed Information Content(NF)
![Page 56: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/56.jpg)
Why 3NF? -- Answer
PRICE(NF): the smallest amount of information content loss one needs to tolerate to achieve dependency-preservation.
PRICE(NF) > 0 (BCNF isn’t dependency-preserving)
PRICE(NF) is lower ==> NF is better.
THEOREM (Kolahi, L.)
• PRICE(3NF) = ½.
• PRICE(NF) ≥ ½ for every other dependency- preserving NF.
![Page 57: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/57.jpg)
Why is PRICE(3NF)=1/2?
We said earlier that RICI(Σ|p) could be below any given ε > 0.
But those schemas can are “bad” 3NFs that can be further decomposed into “good” 3NFs, and for “good” 3NFs we guarantee PRICE=1/2.
“Good” 3NF = 3NF schemas produced by the standard synthesis algorithm.
So the result justifies not only 3NF but also the algorithm that is most commonly used to produce 3NF designs.
![Page 58: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/58.jpg)
Comparing normal forms
We can use the information-theoretic measure to compare normal forms.
Define, for a condition P, the set of possible values of the information-theoretic measure:
POSSP(m) = { RICI(Σ|p) | I has m attributes,
Σ satisfies P } Define the GAIN function: inf POSSP1(m)
GAINP1/P2 (m) = ------------------
inf POSSP2(m)
![Page 59: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/59.jpg)
Comparison of normal forms
THEOREM (Kolahi, L.) For all m > 2:
•GAIN3NF/All (m) = 2
•GAIN”good” 3NF/All = 2m-2
•GAIN”good” 3NF/3NF = 2m-3
![Page 60: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/60.jpg)
The measure extends beyond relations
It can be used to reason about designs in other datamodels:• Nested relational
• XML
In particular it can be used to justify a normal form proposed recently for XML:• Called XNF (Arenas, L., 2002)
• Generalizes BCNF to XML documents
![Page 61: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/61.jpg)
XML Databases
XML schema: (D,).
• D is a DTD.
• is a set of data dependencies over D.
We would like to evaluate XML normal forms.
The notion of being well-designed extends from relations to XML.
• The measure is robust; we just need to define the set of positions in an XML tree T: Pos(T).
47
![Page 62: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/62.jpg)
Positions in an XML Tree
DBLP
conf conf
title issueissue
article articlearticle
@yeartitle title @year
“ICDT”
author @yeartitleauthor“1999” “1999”“Dong” “2001”“Jarke”“. . .” “. . .” “. . .”
“ICDT”
“1999” “1999”“Dong” “2001”“Jarke”“. . .” “. . .” “. . .”
48
![Page 63: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/63.jpg)
XML normalization
DBLP
conf conf
title issueissue
article articlearticle
@yeartitle title @year
@year
“ICDT”
@year
author @yeartitleauthor“1999”
“1999”
“1999”“Dong” “2001”“Jarke”
“2001”
“. . .” “. . .” “. . .”
20
![Page 64: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/64.jpg)
XNF: an XML normal form
XNF is achieved by repeated transformations of two kinds:
•As above in the DBLP example, and
•Splitting multiple attributes of the same element type in the same manner as in the case of BCNF normalization for relations.
There is also a formal definition which is a natural analog of BCNF in the XML context.
![Page 65: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/65.jpg)
Well-Designed XML Data
We consider k such that adom(T) {1, …,k}.
For each k : RICTk(Σ|p)
We consider the ratio: RICTk(Σ|p) / log k
General measure: RICT(Σ|p) = limk ∞ RICT
k(Σ|p) / log k
49
![Page 66: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/66.jpg)
XNF: XML Normal Form
For arbitrary XML data dependencies:
Definition An XML specification (D,) is well-designed if for every T inst(D,) and every p Pos(T), RICT(Σ|p) = 1.
For functional dependencies:
Theorem An XML specification (D,) is in XNF if and only if (D,) is well-designed.
50
![Page 67: Database Normalization Revisited: An information-theoretic approach](https://reader035.vdocuments.us/reader035/viewer/2022070401/56813747550346895d9edbd1/html5/thumbnails/67.jpg)
Future Work
What is an analog of 3NF for XML? We would like to develop better
characterizations of normalization algorithms using our measure.
• Why is the “usual” BCNF decomposition algorithm good?
• Why does it always stop?
What else can this measure be used for? What about nonuniform distributions?
• Are they meaningful here?
• If so, how do the results change?
52