
An Information-Theoretic Approach to Normal Forms for Relational and XML Data

Marcelo Arenas, Leonid Libkin

University of Toronto


Motivation

What is a good database design?

• Well-known solutions: BCNF, 4NF, …

But what is it that makes a database design good?

• Elimination of update anomalies.

• Existence of algorithms that produce good designs: lossless decomposition, dependency preservation.

Previous work was specific to the relational model.

• Classical problems have to be revisited in the XML context.


Motivation

It is problematic to evaluate XML normal forms.

• No XML update language has been standardized.

• No XML query language yet has the same “yardstick” status as relational algebra.

• We do not even know if implication of XML FDs is decidable!

We need a different approach.

• It must be based on some intrinsic characteristics of the data.

• It must be applicable to new data models.

• It must be independent of query/update/constraint issues.

Our approach is based on information theory.


Outline

Information theory.

A simple information-theoretic measure.

A general information-theoretic measure.

Definition of being well-designed.

Relational databases.

XML databases.


Information Theory

Entropy measures the amount of information provided by a certain event.

Assume that an event can have n different outcomes with probabilities $p_1, \ldots, p_n$.

Amount of information gained by knowing that outcome i occurred: $\log \frac{1}{p_i}$

Average amount of information gained (entropy): $\sum_{i=1}^{n} p_i \log \frac{1}{p_i}$

Entropy is maximal if each $p_i = 1/n$: $\log n$
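The entropy formula can be checked numerically. The following is a minimal sketch, not part of the slides: the function name `entropy` and the choice of base-2 logarithms are assumptions (the slides do not name a base, although the value log 5 ≈ 2.322 quoted later is consistent with base 2).

```python
from math import log2

def entropy(probs):
    """Average information sum_i p_i * log(1 / p_i).

    Outcomes with probability zero contribute nothing to the sum.
    """
    return sum(p * log2(1 / p) for p in probs if p > 0)

n = 5
uniform = [1 / n] * n                  # every outcome equally likely
skewed = [0.6, 0.1, 0.1, 0.1, 0.1]     # illustrative non-uniform distribution

print(entropy(uniform), log2(n))       # the uniform distribution attains log n
print(entropy(skewed))                 # a non-uniform distribution stays below log n
```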


Entropy and Redundancies

Database schema: R(A,B,C), A → B

Instance I:

A B C
1 2 3
1 2 4

Pick a domain properly containing adom(I): {1, …, 6}

Position of the value 3 (attribute C, first row): remove the value and ask which values could stand in its place.

• Probability distribution: P(4) = 0 and P(a) = 1/5, a ≠ 4 (with 4 the two rows would coincide)

• Entropy: log 5 ≈ 2.322

Position of the value 2 (attribute B, first row):

• Probability distribution: P(2) = 1 and P(a) = 0, a ≠ 2 (the value is forced by A → B)

• Entropy: log 1 = 0
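The two entropies of this example can be reproduced by brute force. The sketch below is illustrative only: it assumes positions are addressed as (row, column) pairs and that a value is "possible" at a position when substituting it leaves two distinct rows satisfying A → B. The names `is_instance` and `position_entropy` are not from the slides.

```python
from math import log2

ROWS = [(1, 2, 3), (1, 2, 4)]   # instance I of R(A, B, C)
DOMAIN = range(1, 7)            # {1, ..., 6}, properly contains adom(I)

def is_instance(rows):
    """Rows must be distinct and satisfy the FD A -> B."""
    if len(set(rows)) != len(rows):
        return False
    b_for_a = {}
    return all(b_for_a.setdefault(a, b) == b for a, b, _ in rows)

def position_entropy(row, col):
    """Allowed values at (row, col) and the entropy of the uniform distribution on them."""
    allowed = []
    for value in DOMAIN:
        changed = list(ROWS[row])
        changed[col] = value
        candidate = list(ROWS)
        candidate[row] = tuple(changed)
        if is_instance(candidate):
            allowed.append(value)
    return allowed, log2(len(allowed))

print(position_entropy(0, 2))   # C of the first row: every value except 4 is possible
print(position_entropy(0, 1))   # B of the first row: only 2 is possible
```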


Entropy and Normal Forms

Let Σ be a set of FDs over a schema S.

Theorem (S, Σ) is in BCNF if and only if for every instance I of (S, Σ) and for every domain properly containing adom(I), each position carries a non-zero amount of information (entropy > 0).

A similar result holds for 4NF and MVDs.

This is a clean characterization of BCNF and 4NF, but the measure is not accurate enough ...
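For comparison, the syntactic side of the theorem (BCNF) can be tested with the textbook attribute-closure method. This is a sketch under stated assumptions: FDs are encoded as (left-hand side, right-hand side) pairs of Python sets, and `closure` and `is_bcnf` are illustrative names, not taken from the slides.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under FDs given as (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_bcnf(schema, fds):
    """BCNF: the left-hand side of every nontrivial FD is a superkey."""
    return all(rhs <= lhs or closure(lhs, fds) >= schema for lhs, rhs in fds)

R = {"A", "B", "C"}
print(is_bcnf(R, [({"A"}, {"B"})]))        # A -> B: A is not a key of R(A, B, C)
print(is_bcnf(R, [({"A"}, {"B", "C"})]))   # A -> BC: A is a key
```

The first schema is the running example R(A,B,C), A → B, whose instance has a position with entropy 0, in line with the theorem.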


Problems with the Measure

The measure cannot distinguish between different types of data dependencies.

It cannot distinguish between different instances of the same schema R(A,B,C), A → B. For the position of B in the last row:

A B C
1 2 3
1 2 4
1 2 5

entropy = 0

A B C
1 2 3
1 2 4

entropy = 0


A General Measure

Instance I of schema R(A,B,C), A → B:

A B C
1 2 3
1 2 4

Initial setting: pick a position p ∈ Pos(I) and pick k such that adom(I) ⊆ {1, …, k}. For example, p is the position of B in the first row and k = 7.

Computation: for every X ⊆ Pos(I) – {p}, compute the probability distribution P(a | X), a ∈ {1, …, k}.

Example: X contains the positions of C in the first row and of A and B in the second row; the remaining values are removed:

A B C
_ _ 3
1 2 _

P(a | X) is proportional to the number of ways of filling the removed positions with values from {1, …, k}, once a is placed at p, so that the result is still a two-row instance satisfying A → B. For instance, (4, 2, 3), (1, 2, 7) is counted for a = 2, while (1, 2, 3), (1, 2, 3) is not.

• P(2 | X) = 48 / (48 + 6 · 42) = 0.16

• For a ≠ 2, P(a | X) = 42 / (48 + 6 · 42) = 0.14

Entropy ≈ 2.8057 (log 7 ≈ 2.8073)


A General Measure

Instance I of schema R(A,B,C), A → B, with p the position of B in the first row:

A B C
1 2 3
1 2 4

Value: we consider the average of the entropy over all sets X ⊆ Pos(I) – {p}.

• Average: 2.4558 < log 7 (maximal entropy)

• It corresponds to conditional entropy.

• It depends on the value of k ...
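The numbers on these slides can be reproduced by exhaustive enumeration. The sketch below is not the authors' implementation. It assumes that P(a | X) is the fraction of legal completions of the positions outside X, where "legal" means the two rows stay distinct and satisfy A → B, and that p is the B entry of the first row. All function names are illustrative.

```python
from itertools import combinations, product
from math import log2

ROWS = [(1, 2, 3), (1, 2, 4)]     # instance I of R(A, B, C) with A -> B
P = (0, 1)                        # position p: attribute B of the first row
OTHERS = [(r, c) for r in range(2) for c in range(3) if (r, c) != P]

def is_instance(rows):
    """Rows must stay distinct and satisfy the FD A -> B."""
    if len(set(rows)) != len(rows):
        return False
    b_for_a = {}
    return all(b_for_a.setdefault(a, b) == b for a, b, _ in rows)

def count_completions(known, k):
    """Number of legal ways to fill the unknown positions with values from 1..k."""
    free = [pos for pos in OTHERS if pos not in known]
    total = 0
    for values in product(range(1, k + 1), repeat=len(free)):
        cell = {**known, **dict(zip(free, values))}
        rows = [tuple(cell[(r, c)] for c in range(3)) for r in range(2)]
        total += is_instance(rows)
    return total

def distribution(x, k):
    """P(a | X) for a = 1..k, where X is the set of positions whose values are kept."""
    kept = {pos: ROWS[pos[0]][pos[1]] for pos in x}
    counts = [count_completions({**kept, P: a}, k) for a in range(1, k + 1)]
    return [c / sum(counts) for c in counts]

def entropy(probs):
    return sum(q * log2(1 / q) for q in probs if q > 0)

def inf_k(k):
    """Average entropy over all subsets X of Pos(I) - {p}."""
    subsets = [x for n in range(len(OTHERS) + 1) for x in combinations(OTHERS, n)]
    return sum(entropy(distribution(x, k)) for x in subsets) / len(subsets)

x = [(0, 2), (1, 0), (1, 1)]      # keep C of row 1 and A, B of row 2
dist = distribution(x, 7)
print(dist[1], dist[0], entropy(dist))   # P(2 | X), P(1 | X), entropy for this X
print(inf_k(7), log2(7))                 # average over all X versus maximal entropy
```

The enumeration is exponential in the number of positions, so it is only practical for toy instances such as this one.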


A General Measure

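Previous value: $\mathrm{Inf}^k_I(p \mid \Sigma)$

For each k, we consider the ratio: $\dfrac{\mathrm{Inf}^k_I(p \mid \Sigma)}{\log k}$

• How close the given position p is to having the maximum possible information content.

General measure: $\mathrm{Inf}_I(p \mid \Sigma) = \lim_{k \to \infty} \dfrac{\mathrm{Inf}^k_I(p \mid \Sigma)}{\log k}$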
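As a small worked example of the ratio, the k = 7 average quoted earlier (2.4558) can be normalised by log 7. The variable names are illustrative and base-2 logarithms are assumed, as in the entropy values on the slides.

```python
from math import log2

inf_7 = 2.4558            # average entropy reported on the slides for k = 7
ratio = inf_7 / log2(7)   # normalise by the maximal entropy log k

print(round(ratio, 4))    # → 0.8748
```

This is already close to 0.875, one of the two values of the general measure quoted on a later slide for instances of this schema.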


Basic Properties

The measure is well defined:

For every set Σ of first-order constraints defined over a schema S, every I ∈ inst(S, Σ), and every p ∈ Pos(I): $\mathrm{Inf}_I(p \mid \Sigma)$ exists.

Bounds: $0 \le \mathrm{Inf}_I(p \mid \Sigma) \le 1$


Basic Properties

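The measure does not depend on a particular representation of constraints. If Σ1 and Σ2 are equivalent: $\mathrm{Inf}_I(p \mid \Sigma_1) = \mathrm{Inf}_I(p \mid \Sigma_2)$

It overcomes the limitations of the simple measure. For R(A,B,C), A → B and the position of B in the last row:

A B C
1 2 3
1 2 4
1 2 5

measure = 0.781

A B C
1 2 3
1 2 4

measure = 0.875

The instance with more copies of the redundant value carries less information at that position.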


Well-Designed Databases

Definition A database specification (S, Σ) is well-designed if for every I ∈ inst(S, Σ) and every p ∈ Pos(I), $\mathrm{Inf}_I(p \mid \Sigma) = 1$.

In other words, every position in every instance carries the maximum possible amount of information.

We would like to test this definition in the relational world ...


Relational Databases

Σ is a set of data dependencies over a schema S:

Σ = ∅: (S, Σ) is well-designed.

Σ is a set of FDs: (S, Σ) is well-designed if and only if (S, Σ) is in BCNF.

Σ is a set of FDs and MVDs: (S, Σ) is well-designed if and only if (S, Σ) is in 4NF.

Σ is a set of FDs and JDs:

• If (S, Σ) is in PJ/NF or in 5NFR, then (S, Σ) is well-designed. The converse is not true.

• A syntactic characterization of being well-designed is given in the paper.


Relational Databases

The problem of verifying whether a relational schema is well-designed is undecidable.

If the schema contains only universal constraints (FDs, MVDs, JDs, …), then the problem becomes decidable.

Now we would like to apply our definition in the XML world ...


XML Databases

XML specification: (D, Σ).

• D is a DTD.

• Σ is a set of data dependencies over D.

We would like to evaluate XML normal forms.

The notion of being well-designed extends from relations to XML.

• The measure is robust; we just need to define the set of positions in an XML tree T: Pos(T).


Positions in an XML Tree

[Figure: an XML tree for a DBLP document. The root DBLP has conf children; a conf has a title (“ICDT”) and issue children; an issue has article children carrying title (“. . .”), author (“Dong”, “Jarke”) and @year (“1999”, “1999”, “2001”) values. The positions of the tree are the places where these values occur.]
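To make the notion of a position concrete, the sketch below lists the value-carrying places of a small document. It is an illustration only: the XML text is a hypothetical reconstruction shaped like the tree on this slide (the slide elides the article titles and does not show which article each author belongs to), and treating both attribute values and text values as positions is an assumption made here, not a definition taken from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical document shaped like the DBLP tree on the slide; titles are elided there.
DOC = """
<DBLP>
  <conf title="ICDT">
    <issue>
      <article year="1999"><title>...</title><author>Dong</author></article>
      <article year="1999"><title>...</title></article>
    </issue>
    <issue>
      <article year="2001"><title>...</title><author>Jarke</author></article>
    </issue>
  </conf>
</DBLP>
"""

def positions(root):
    """List (path, kind, value) for every attribute value and non-empty text value."""
    found = []

    def walk(node, path):
        here = f"{path}.{node.tag}" if path else node.tag
        for name, value in node.attrib.items():
            found.append((f"{here}.@{name}", "attribute", value))
        if node.text and node.text.strip():
            found.append((here, "text", node.text.strip()))
        for child in node:
            walk(child, here)

    walk(root, "")
    return found

pos = positions(ET.fromstring(DOC))
for entry in pos:
    print(entry)
print(len(pos), "positions")
```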


Well-Designed XML Data

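We consider k such that adom(T) ⊆ {1, …, k}.

For each k: $\mathrm{Inf}^k_T(p \mid \Sigma)$

We consider the ratio: $\mathrm{Inf}^k_T(p \mid \Sigma) / \log k$

General measure: $\mathrm{Inf}_T(p \mid \Sigma) = \lim_{k \to \infty} \dfrac{\mathrm{Inf}^k_T(p \mid \Sigma)}{\log k}$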


XNF: XML Normal Form

XNF was proposed in [AL02].

It was defined for XML FDs:

DBLP.conf.@title → DBLP.conf

DBLP.conf.issue → DBLP.conf.issue.article.@year

It eliminates two types of anomalies.

• One of them is inspired by the type of anomalies found in relational databases containing FDs.
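The two XML FDs can be read as checks on a document: no two conf elements share a title, and all articles of one issue share a year. The sketch below tests both on a hypothetical DBLP-like fragment. The XML text and the function names are illustrative assumptions; only the element and attribute names follow the FD paths on the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical DBLP-like fragment; names follow the FD paths on the slide.
DOC = """
<DBLP>
  <conf title="ICDT">
    <issue>
      <article year="1999"/>
      <article year="1999"/>
    </issue>
    <issue>
      <article year="2001"/>
    </issue>
  </conf>
</DBLP>
"""

def issue_determines_year(root):
    """DBLP.conf.issue -> DBLP.conf.issue.article.@year: one year per issue."""
    return all(
        len({a.get("year") for a in issue.findall("article")}) <= 1
        for issue in root.findall("./conf/issue")
    )

def title_determines_conf(root):
    """DBLP.conf.@title -> DBLP.conf: no two conf elements share a title."""
    titles = [c.get("title") for c in root.findall("conf")]
    return len(titles) == len(set(titles))

root = ET.fromstring(DOC)
print(issue_determines_year(root), title_determines_conf(root))

# The year is stored once per article, so within an issue it is repeated.
redundant = sum(len(i.findall("article")) - 1 for i in root.findall("./conf/issue"))
print(redundant, "year value(s) are determined by a sibling article")
```

The repeated year values are the XML counterpart of the redundant B value in the relational example.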


XNF: XML Normal Form

[Figure: the DBLP tree from the earlier slide (conf, title “ICDT”, issue, article, author “Dong” and “Jarke”, titles “. . .”), with @year values (“1999”, “2001”) shown at the issue nodes as well as at the article nodes.]


XNF: XML Normal Form

For arbitrary XML data dependencies:

Definition An XML specification (D, Σ) is well-designed if for every T ∈ inst(D, Σ) and every p ∈ Pos(T), $\mathrm{Inf}_T(p \mid \Sigma) = 1$.

For functional dependencies:

Theorem An XML specification (D, Σ) is in XNF if and only if (D, Σ) is well-designed.


Normalization Algorithms

The information-theoretic measure can also be used for reasoning about normalization algorithms.

For BCNF and XNF decomposition algorithms:

Theorem After each step of these decomposition algorithms, the amount of information in each position does not decrease.
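For reference, the textbook BCNF decomposition step that the theorem refers to can be sketched as follows. This is the standard algorithm, not code from the slides. FDs are assumed to be (lhs, rhs) pairs of sets, and the violation search over all attribute subsets is exponential, so it is meant for small schemas only. The sketch does not compute the information measure.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under FDs given as (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def find_violation(schema, fds):
    """Return (X, X+ restricted to schema) for an X that determines more than
    itself without being a key of schema, or None if schema is in BCNF."""
    for size in range(1, len(schema)):
        for combo in combinations(sorted(schema), size):
            x = set(combo)
            x_plus = closure(x, fds) & schema
            if x < x_plus < schema:
                return x, x_plus
    return None

def decompose(schema, fds):
    """Split on a violating X into X+ and X ∪ (schema - X+), recursively."""
    violation = find_violation(schema, fds)
    if violation is None:
        return [schema]
    x, x_plus = violation
    return decompose(x_plus, fds) + decompose(x | (schema - x_plus), fds)

result = decompose({"A", "B", "C"}, [({"A"}, {"B"})])
print(result)   # R(A, B, C) with A -> B splits into AB and AC
```

Each recursive call works on a strictly smaller attribute set, which is one simple way to see that this version terminates.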


Future Work

We would like to consider more complex XML constraints and characterize good designs they give rise to.

We would like to characterize 3NF by using the measure developed in this paper.

• In general, we would like to characterize “non-perfect” normal forms.

We would like to develop better characterizations of normalization algorithms using our measure.

• Why is the “usual” BCNF decomposition algorithm good? Why does it always stop?


A Normal Form for FDs and JDs

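Let Σ be a set of FDs and JDs over a schema S.

Theorem (S, Σ) is well-designed if and only if for every R ∈ S and every nontrivial JD

$\forall\, (R(\bar{x}_1) \wedge R(\bar{x}_2) \wedge \cdots \wedge R(\bar{x}_m) \rightarrow R(\bar{x}))$

implied by Σ, there exists M ⊆ {1, ..., m} such that:

1. $\bar{x} \subseteq \bigcup_{i \in M} \bar{x}_i$

2. For every i, j ∈ M, Σ implies $\forall\, (R(\bar{x}_1) \wedge R(\bar{x}_2) \wedge \cdots \wedge R(\bar{x}_m) \rightarrow \bar{x}_i = \bar{x}_j)$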


A Normal Form for FDs and JDs (cont’d)

Schema: S = { R(A,B,C) } and Σ = { ⋈[AB, AC, BC], AB → C, AC → B }.

(S, Σ) is not in PJ/NF: {AB → ABC, AC → ABC} does not imply ⋈[AB, AC, BC].

(S, Σ) is not in 5NFR: ⋈[AB, AC, BC] is strong-reduced and BC is not a superkey.

(S, Σ) is well-designed.