compsci516 data intensive computing systems lecture 6a ... · compsci516 data intensive computing...

28
CompSci 516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization – 2/2 Instructor: Sudeepa Roy 1 Duke CS, Fall 2017 CompSci 516: Database Systems

Upload: others

Post on 27-Jun-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

CompSci 516DataIntensiveComputingSystems

Lecture6aDesignTheoryandNormalization– 2/2

Instructor:Sudeepa Roy

1DukeCS,Fall2017 CompSci516:DatabaseSystems

Page 2: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

Announcements• HW1deadline:– Dueon09/21(Thurs),11:55pm,nolatedays

• Projectproposaldeadline:– Preliminaryideaandteammembersdueby09/18(Mon)byemailtotheinstructor

– Proposaldueonsakai by09/25(Mon),11:55pm

• Everyoneshouldbeinagroupnow– otherwiselettheinstructorknowasap

DukeCS,Fall2017 CompSci516:DatabaseSystems 2

Page 3: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

Today

• FinishNormalizationfromLecture5• StartDatabaseInternals

• Recap– Whyredundancyisbad– Functionaldependencies– Closureofattributesandfunctionaldependencies

DukeCS,Fall2017 CompSci516:DatabaseSystems 3

Acknowledgement:• Thefollowingslideshavebeencreatedadaptingtheinstructormaterialofthe[RG]bookprovidedbytheauthorsDr.Ramakrishnan andDr.Gehrke.• SomeslideshavebeenadaptedfromslidesbyProf.JunYang

Page 4: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

NormalForms

Risin4NF⇒ RisinBCNF⇒ Risin3NF⇒ Risin2NF(ahistoricalone)⇒ Risin1NF(everyfieldhasatomicvalues)

DukeCS,Fall2017 CompSci516:DatabaseSystems 4

BCNF

3NF

2NF

1NF

OnlyBCNFand4NFarecoveredintheclass

4NF

Page 5: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

Boyce-CoddNormalForm(BCNF)

• RelationRwithFDsF isinBCNF if,forallX→AinF– AϵX(calledatrivial FD),or– XcontainsakeyforR

• i.e.Xisasuperkey

DukeCS,Fall2017 CompSci516:DatabaseSystems 5

Nextlecture:BCNFdecompositionalgorithm

Page 6: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

Decomposition

• Eliminatesredundancy• Togetbacktotheoriginalrelation:

6

uid uname twitterid gid fromDate

142 Bart @BartJSimpson dps 1987-04-19

123 Milhouse @MilhouseVan_ gov 1989-12-17

857 Lisa @lisasimpson abc 1987-04-19

857 Lisa @lisasimpson gov 1988-09-01

456 Ralph @ralphwiggum abc 1991-04-25

456 Ralph @ralphwiggum gov 1992-09-01

… … … … …

uid uname twitterid

142 Bart @BartJSimpson

123 Milhouse @MilhouseVan_

857 Lisa @lisasimpson

456 Ralph @ralphwiggum

… … …

uid gid fromDate

142 dps 1987-04-19

123 gov 1989-12-17

857 abc 1987-04-19

857 gov 1988-09-01

456 abc 1991-04-25

456 gov 1992-09-01

… … …

(ontwitter)

• Userid• username• Twitterid• Groupid• JoiningDate

(toagroup)

DukeCS,Fall2017 CompSci516:DatabaseSystems

Page 7: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

uid twitterid

142 @BartJSimpson

123 @MilhouseVan_

857 @lisasimpson

456 @ralphwiggum

… …

uid uname

142 Bart

123 Milhouse

857 Lisa

456 Ralph

… …

Unnecessarydecomposition

• Fine:joinreturnstheoriginalrelation• Unnecessary:noredundancyisremoved;schemaismore

complicated(anduid isstoredtwice!)7

uid uname twitterid

142 Bart @BartJSimpson

123 Milhouse @MilhouseVan_

857 Lisa @lisasimpson

456 Ralph @ralphwiggum

… … …

DukeCS,Fall2017 CompSci516:DatabaseSystems

Page 8: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

uid fromDate

142 1987-04-19

123 1989-12-17

857 1987-04-19

857 1988-09-01

456 1991-04-25

456 1992-09-01

… …

Baddecomposition

• Associationbetweengid andfromDate islost• Joinreturnsmorerowsthantheoriginalrelation

8

uid gid fromDate

142 dps 1987-04-19

123 gov 1989-12-17

857 abc 1987-04-19

857 gov 1988-09-01

456 abc 1991-04-25

456 gov 1992-09-01

… … …uid gid

142 dps

123 gov

857 abc

857 gov

456 abc

456 gov

… …

DukeCS,Fall2017 CompSci516:DatabaseSystems

Page 9: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

Losslessjoindecomposition

• Decomposerelation𝑅 intorelations𝑆 and𝑇– 𝑎𝑡𝑡𝑟𝑠 𝑅 = 𝑎𝑡𝑡𝑟𝑠 𝑆 ∪ 𝑎𝑡𝑡𝑟𝑠 𝑇– 𝑆 = 𝜋,--./ 0 𝑅– 𝑇 = 𝜋,--./ 1 𝑅

• Thedecompositionisalosslessjoindecompositionif,givenknownconstraintssuchasFD’s,wecanguaranteethat𝑅 =𝑆 ⋈ 𝑇

• 𝑅 ⊆ 𝑆 ⋈ 𝑇 or𝑅 ⊇ 𝑆 ⋈ 𝑇 ?

• Anydecompositiongives𝑅 ⊆ 𝑆 ⋈ 𝑇 (why?)– Alossy decompositionisonewith𝑅 ⊂ 𝑆 ⋈ 𝑇

9DukeCS,Fall2017 CompSci516:DatabaseSystems

Page 10: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

uid gid fromDate

142 dps 1987-04-19

123 gov 1989-12-17

857 abc 1987-04-19

857 gov 1988-09-01

456 abc 1991-04-25

456 gov 1992-09-01

… … …

uid gid fromDate

142 dps 1987-04-19

123 gov 1989-12-17

857 abc 1988-09-01

857 gov 1987-04-19

456 abc 1991-04-25

456 gov 1992-09-01

… … …

Loss?ButIgotmorerows!

• “Loss”refersnottothelossoftuples,buttothelossofinformation– Or,theabilitytodistinguishdifferentoriginalrelations

10

Nowaytotellwhichistheoriginalrelation

uid fromDate

142 1987-04-19

123 1989-12-17

857 1987-04-19

857 1988-09-01

456 1991-04-25

456 1992-09-01

… …

uid gid

142 dps

123 gov

857 abc

857 gov

456 abc

456 gov

… …DukeCS,Fall2017 CompSci516:DatabaseSystems

Page 11: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

BCNFdecompositionalgorithm

• FindaBCNFviolation– Thatis,anon-trivialFD𝑋 → 𝑌 in𝑅 where𝑋 isnotasuperkeyof𝑅

• Decompose𝑅 into𝑅8 and𝑅9,where– 𝑅8 hasattributes𝑋 ∪ 𝑌– 𝑅9 hasattributes𝑋 ∪ 𝑍,where𝑍 containsallattributesof𝑅 thatareinneither𝑋 nor𝑌

• RepeatuntilallrelationsareinBCNF

• Alsogivesalosslessdecomposition!

11DukeCS,Fall2017 CompSci516:DatabaseSystems

Page 12: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

BCNFdecompositionexample- 1

• CSJDPQV,keyC,F={JP→ C,SD→ P,J→ S}– TodealwithSD→P,decomposeintoSDP,CSJDQV.– TodealwithJ→ S,decomposeCSJDQVintoJSandCJDQV

• IsJP→ CaviolationofBCNF?

• Note:– severaldependenciesmaycauseviolationofBCNF– Theorderinwhichwepickthemmayleadtoverydifferentsetsof

relations– theremaybemultiplecorrectdecompositions(canpickJ→ Sfirst)

DukeCS,Fall2017 CompSci516:DatabaseSystems 12

Page 13: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

BCNFdecompositionexample- 2

13

UserJoinsGroup (uid,uname,twitterid,gid,fromDate)

uid→ uname,twitteridtwitterid→ uiduid,gid→ fromDate

BCNFviolation:uid→ uname,twitterid

User (uid,uname,twitterid) Member(uid,gid,fromDate)

BCNFBCNF

uid→ uname,twitteridtwitterid→ uid

uid,gid→ fromDate

DukeCS,Fall2017 CompSci516:DatabaseSystems

Page 14: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

14

UserJoinsGroup (uid,uname,twitterid,gid,fromDate)

uid→ uname,twitteridtwitterid→ uiduid,gid→ fromDate

BCNFviolation:twitterid→ uid

UserId (twitterid,uid)

Member(twitterid,gid,fromDate)

BCNF

BCNF

twitterid→ unametwitterid,gid→ fromDate

UserJoinsGroup’ (twitterid,uname,gid,fromDate)

BCNFviolation:twitterid→ uname

UserName (twitterid,uname)BCNF

applyArmstrong’saxiomsandrules!

DukeCS,Fall2017 CompSci516:DatabaseSystems

BCNFdecompositionexample- 3

Page 15: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

Recap

• Functionaldependencies:ageneralizationofthekeyconcept

• Non-keyfunctionaldependencies:asourceofredundancy

• BCNFdecomposition:amethodforremovingredundancies– BCNFdecompositionisalosslessjoindecomposition

• BCNF:schemainthisnormalformhasnoredundancyduetoFD’s

15DukeCS,Fall2017 CompSci516:DatabaseSystems

Page 16: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

BCNF=noredundancy?

• User (uid,gid,place)– Ausercanbelongtomultiplegroups– Ausercanregisterplacesshe’svisited– Groupsandplaceshavenothingtodowithother– FD’s?

• None– BCNF?

• Yes– Redundancies?

• Tons!

16

uid gid place

142 dps Springfield

142 dps Australia

456 abc Springfield

456 abc Morocco

456 gov Springfield

456 gov Morocco

… … …

DukeCS,Fall2017 CompSci516:DatabaseSystems

Page 17: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

Multivalueddependencies

• Amultivalueddependency(MVD)hastheform𝑋 ↠ 𝑌,where𝑋 and𝑌 aresetsofattributesinarelation𝑅

• 𝑋 ↠ 𝑌 meansthatwhenevertworowsin𝑅 agreeonalltheattributesof𝑋,thenwecanswaptheir𝑌 componentsandgettworowsthatarealsoin𝑅

17

𝑿 𝒀 𝒁𝑎 𝑏8 𝑐8𝑎 𝑏9 𝑐9… … …

𝑿 𝒀 𝒁𝑎 𝑏8 𝑐8𝑎 𝑏9 𝑐9𝑎 𝑏9 𝑐8𝑎 𝑏8 𝑐9… … …

DukeCS,Fall2017 CompSci516:DatabaseSystems

Page 18: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

MVDexamples

User(uid,gid,place)• uid↠ gid• uid↠ place– Intuition:givenuid,attributesgid andplaceare“independent”

• uid,gid↠ place– Trivial:LHS∪ RHS=allattributesof𝑅

• uid,gid↠ uid– Trivial:LHS⊇ RHS

18DukeCS,Fall2017 CompSci516:DatabaseSystems

Page 19: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

CompleteMVD+FDrules

• FDreflexivity,augmentation,andtransitivity• MVDcomplementation:If𝑋 ↠ 𝑌,then𝑋 ↠ 𝑎𝑡𝑡𝑟𝑠 𝑅 − 𝑋 − 𝑌

• MVDaugmentation:If𝑋 ↠ 𝑌 and𝑉 ⊆ 𝑊,then𝑋𝑊 ↠ 𝑌𝑉

• MVDtransitivity:If𝑋 ↠ 𝑌 and𝑌 ↠ 𝑍,then𝑋 ↠ 𝑍 − 𝑌

• Replication(FDisMVD):If𝑋 → 𝑌,then𝑋 ↠ 𝑌

• Coalescence:If𝑋 ↠ 𝑌 and𝑍 ⊆ 𝑌 andthereissome𝑊 disjointfrom𝑌 suchthat𝑊 → 𝑍,then𝑋 → 𝑍

19

Tryprovingthingsusingthese!?

Verifytheseyourself!

DukeCS,Fall2017 CompSci516:DatabaseSystems

Page 20: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

Anelegantsolution:“chase”

• GivenasetofFD’sandMVD’s𝒟,doesanotherdependency𝑑 (FDorMVD)followfrom𝒟?

• Procedure– Startwiththepremiseof𝑑,andtreatthemas“seed”tuplesinarelation

– Applythegivendependenciesin𝒟 repeatedly• IfweapplyanFD,weinferequalityoftwosymbols• IfweapplyanMVD,weinfermoretuples

– Ifweinfertheconclusionof𝑑,wehaveaproof– Otherwise,ifnothingmorecanbeinferred,wehaveacounterexample

20

Readthisslideafterlookingattheexamples

DukeCS,Fall2017 CompSci516:DatabaseSystems

Page 21: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

Proofbychase• In𝑅 𝐴, 𝐵, 𝐶, 𝐷 ,does𝐴 ↠ 𝐵 and𝐵 ↠ 𝐶implythat𝐴 ↠ 𝐶?

21

𝑨 𝑩 𝑪 𝑫𝑎 𝑏8 𝑐8 𝑑8𝑎 𝑏9 𝑐9 𝑑9

𝑨 𝑩 𝑪 𝑫𝑎 𝑏8 𝑐9 𝑑8𝑎 𝑏9 𝑐8 𝑑9

Have: Need:

𝑎 𝑏9 𝑐8 𝑑8𝑎 𝑏8 𝑐9 𝑑9

𝐴 ↠ 𝐵

𝑎 𝑏9 𝑐8 𝑑9𝑎 𝑏9 𝑐9 𝑑8

𝐵 ↠ 𝐶

𝑎 𝑏8 𝑐9 𝑑8𝑎 𝑏8 𝑐8 𝑑9

𝐵 ↠ 𝐶

AA

DukeCS,Fall2017 CompSci516:DatabaseSystems

Page 22: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

Anotherproofbychase• In𝑅 𝐴, 𝐵, 𝐶, 𝐷 ,does𝐴 → 𝐵 and𝐵 → 𝐶 implythat𝐴 → 𝐶?

22

𝑨 𝑩 𝑪 𝑫𝑎 𝑏8 𝑐8 𝑑8𝑎 𝑏9 𝑐9 𝑑9

Have: Need:𝑐8 = 𝑐9

𝐴 → 𝐵 𝑏8 = 𝑏9𝐵 → 𝐶 𝑐8 = 𝑐9

A

Ingeneral,withbothMVD’sandFD’s,chasecangeneratebothnewtuplesandnewequalities

DukeCS,Fall2017 CompSci516:DatabaseSystems

Page 23: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

Counterexamplebychase• In𝑅 𝐴, 𝐵, 𝐶, 𝐷 ,does𝐴 ↠ 𝐵𝐶 and𝐶𝐷 → 𝐵implythat𝐴 → 𝐵?

23

𝑨 𝑩 𝑪 𝑫𝑎 𝑏8 𝑐8 𝑑8𝑎 𝑏9 𝑐9 𝑑9

Have: Need:𝑏8 = 𝑏9

𝑎 𝑏9 𝑐9 𝑑8𝑎 𝑏8 𝑐8 𝑑9

𝐴 ↠ 𝐵𝐶

D

Counterexample!

DukeCS,Fall2017 CompSci516:DatabaseSystems

Page 24: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

4NF

• Arelation𝑅 isinFourthNormalForm(4NF)if– Foreverynon-trivialMVD𝑋 ↠ 𝑌 in𝑅,𝑋 isasuperkey

– Thatis,allFD’sandMVD’sfollowfrom“key→otherattributes”(i.e.,noMVD’sandnoFD’sbesideskeyfunctionaldependencies)

• 4NFisstrongerthanBCNF– BecauseeveryFDisalsoaMVD

24DukeCS,Fall2017 CompSci516:DatabaseSystems

Page 25: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

4NFdecompositionalgorithm

• Finda4NFviolation– Anon-trivialMVD𝑋 ↠ 𝑌 in𝑅 where𝑋 isnot asuperkey

• Decompose𝑅 into𝑅8 and𝑅9,where– 𝑅8 hasattributes𝑋 ∪ 𝑌– 𝑅9 hasattributes𝑋 ∪ 𝑍 (where𝑍 contains𝑅 attributesnotin𝑋 or𝑌)

• Repeatuntilallrelationsarein4NF

• AlmostidenticaltoBCNFdecompositionalgorithm• Anydecompositionona4NFviolationislossless

25DukeCS,Fall2017 CompSci516:DatabaseSystems

Page 26: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

4NFdecompositionexample

26

uid gid place

142 dps Springfield

142 dps Australia

456 abc Springfield

456 abc Morocco

456 gov Springfield

456 gov Morocco

… … …

User (uid,gid,place)4NFviolation:uid↠gid

Member(uid,gid) Visited(uid,place)4NF 4NFuid gid

142 dps

456 abc

456 gov

… …

uid place

142 Springfield

142 Australia

456 Springfield

456 Morocco

… …

DukeCS,Fall2017 CompSci516:DatabaseSystems

Page 27: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

Otherkindsofdependenciesandnormalforms

• Dependencypreservingdecompositions• Joindependencies• Inclusiondependencies• 5NF,3NF,2NF• Seebookifinterested(notcoveredinclass)

DukeCS,Fall2017 CompSci516:DatabaseSystems 27

Page 28: CompSci516 Data Intensive Computing Systems Lecture 6a ... · CompSci516 Data Intensive Computing Systems Lecture 6a Design Theory and Normalization –2/2 ... –Due on 09/21 (Thurs),

Summary

• PhilosophybehindBCNF,4NF:Datashoulddependonthekey,thewholekey,andnothingbutthekey!– Youcouldhavemultiplekeysthough

• Redundancyisnotdesiredtypically– notalways,mainlyduetoperformancereasons

• Functional/multivalueddependencies– captureredundancy• Decompositions– eliminatedependencies• Normalforms

– Guaranteescertainnon-redundancy– BCNF,and4NF

• Losslessjoin• HowtodecomposeintoBCNF,4NF• Chase

28DukeCS,Fall2017 CompSci516:DatabaseSystems