lecture 17 joins - github pages · 2017-12-13 · how do i choose the right graduate program?...

109
Lecture 17: Joins Lecture 17

Upload: others

Post on 07-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Lecture17:Joins

Lecture17

Page 2: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

GraduateSchoolInformationPanel

ShouldIattendgraduateschoolinCS?

Thursday,Nov9@3:00PM1240CS

HowdoIprepareacompetitiveapplication?

JoinusforaliveQ&AwithCSfaculty,graduatestudents,anda

graduateschooladmissionscoordinator!

HowdoIchoosetherightgraduateprogram?

Page 3: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Lecture17:Joins

Lecture17

Page 4: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Today’sLecture

1. Recap:Select,Project

2. Joins

3. JoinsandBufferManagement

4

Lecture17

Page 5: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

1.Recap

5

Lecture17

Page 6: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Lecture17

LogicalPlan=How

SELECT P.buyerFROM Purchase P, Person QWHERE P.buyer=Q.nameAND Q.city=‘Madison’

SELECTSELECTcity=‘Madison’

JOINbuyer=name

PROJECTonbuyer

Purchase Person

Page 7: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Lecture17

PhysicalPlan=What

SELECT P.buyerFROM Purchase P, Person QWHERE P.buyer=Q.nameAND Q.city=‘Madison’

TableScan IndexScan

NestedLoopJoin

Hash-basedProject

Purchase Person

Page 8: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Lecture16

SelectOperator

accesspath =waytoretrievetuplesfromatable• FileScan• scantheentirefile• I/Ocost:O(N),whereN=#pages

• IndexScan:• useanindexavailableonsomepredicate• I/Ocost:itvariesdependingontheindex

Page 9: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Lecture16

IndexMatching

• Wesaythatanindexmatches aselectionpredicateiftheindexcanbeusedtoevaluateit• Consideraconjunction-onlyselection.Anindexmatches(partof)apredicateif• Hash:onlyequalityoperation&thepredicateincludesall indexattributes

• B+Tree:theattributesareaprefixofthesearchkey(anyopsarepossible)

Page 10: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Lecture16

ChoosingtheRightIndex

• Selectivity ofanaccesspath=fraction ofdatapagesthatneedtoberetrieved• Wewanttochoosethemostselectivepath!• Estimatingtheselectivityofanaccesspathisahardproblem

Page 11: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Lecture17

Projection

Simplecase:SELECT R.a, R.d• scanthefileandforeachtupleoutputR.a,R.d

Hardcase:SELECT DISTINCT R.a, R.d• projectouttheattributes• eliminateduplicatetuples(thisisthedifficultpart!)

Page 12: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Lecture17

Projection:Sort-based

Wecanimproveuponthenaïvealgorithmbymodifyingthesortingalgorithm:

1. InPass0 ofsorting,projectouttheattributes

2. Insubsequentpasses,eliminatetheduplicateswhilemergingtheruns

Page 13: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Lecture17

Projection:Hash-based

2-phasealgorithm:

• partitioning• projectoutattributesandsplittheinputintoB-1partitionsusingahashfunctionh

• duplicateelimination• readeachpartitionintomemoryanduseanin-memoryhashtable(withadifferent hashfunction)toremoveduplicates

Page 14: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

2.Joins

14

Lecture17

Page 15: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Whatyouwilllearnaboutinthissection

1. RECAP:Joins

2. NestedLoopJoin(NLJ)

3. BlockNestedLoopJoin(BNLJ)

4. IndexNestedLoopJoin(INLJ)

15

Lecture17

Page 16: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

1.NestedLoopJoins

16

Lecture17

Page 17: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Whatyouwilllearnaboutinthissection

1. RECAP:Joins

2. NestedLoopJoin(NLJ)

3. BlockNestedLoopJoin(BNLJ)

4. IndexNestedLoopJoin(INLJ)

17

Lecture17

Page 18: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

RECAP:Joins

Lecture17>Joins

Page 19: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

19

Joins:Example

Example: Returnsallpairsoftuplesr ∈ 𝑅, 𝑠 ∈ 𝑆suchthat𝑟. 𝐴 = 𝑠. 𝐴

A D

3 7

2 2

2 3

A B C

1 0 1

2 3 4

2 5 2

3 1 1

R SA B C D

2 3 4 2

𝐑 ⋈ 𝑺

Lecture17>Joins

SELECT R.A,B,C,DFROM R, SWHERE R.A = S.A

Page 20: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

20

Joins:Example

Example: Returnsallpairsoftuplesr ∈ 𝑅, 𝑠 ∈ 𝑆suchthat𝑟. 𝐴 = 𝑠. 𝐴

A D

3 7

2 2

2 3

A B C

1 0 1

2 3 4

2 5 2

3 1 1

R SA B C D

2 3 4 2

2 3 4 3

𝐑 ⋈ 𝑺

Lecture17>Joins

SELECT R.A,B,C,DFROM R, SWHERE R.A = S.A

Page 21: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

21

Joins:Example

Example: Returnsallpairsoftuplesr ∈ 𝑅, 𝑠 ∈ 𝑆suchthat𝑟. 𝐴 = 𝑠. 𝐴

A D

3 7

2 2

2 3

A B C

1 0 1

2 3 4

2 5 2

3 1 1

R SA B C D

2 3 4 2

2 3 4 3

2 5 2 2

𝐑 ⋈ 𝑺

Lecture17>Joins

SELECT R.A,B,C,DFROM R, SWHERE R.A = S.A

Page 22: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

22

Joins:Example

Example: Returnsallpairsoftuplesr ∈ 𝑅, 𝑠 ∈ 𝑆suchthat𝑟. 𝐴 = 𝑠. 𝐴

A D

3 7

2 2

2 3

A B C

1 0 1

2 3 4

2 5 2

3 1 1

R SA B C D

2 3 4 2

2 3 4 3

2 5 2 2

2 5 2 3

𝐑 ⋈ 𝑺

Lecture17>Joins

SELECT R.A,B,C,DFROM R, SWHERE R.A = S.A

Page 23: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

23

Joins:Example

Example: Returnsallpairsoftuplesr ∈ 𝑅, 𝑠 ∈ 𝑆suchthat𝑟. 𝐴 = 𝑠. 𝐴

A D

3 7

2 2

2 3

A B C

1 0 1

2 3 4

2 5 2

3 1 1

R SA B C D

2 3 4 2

2 3 4 3

2 5 2 2

2 5 2 3

3 1 1 7

𝐑 ⋈ 𝑺

Lecture17>Joins

SELECT R.A,B,C,DFROM R, SWHERE R.A = S.A

Page 24: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

24

Semantically:ASubsetoftheCrossProduct

SELECT R.A,B,C,DFROM R, SWHERE R.A = S.A

Example: Returnsallpairsoftuplesr ∈ 𝑅, 𝑠 ∈ 𝑆suchthat𝑟. 𝐴 = 𝑠. 𝐴

A D

3 7

2 2

2 3

A B C

1 0 1

2 3 4

2 5 2

3 1 1

R S A B C D

2 3 4 2

2 3 4 3

2 5 2 2

2 5 2 3

3 1 1 7

×CrossProduct

Filterbyconditions(r.A =s.A)

… Canweactuallyimplementajoininthisway?

𝐑 ⋈ 𝑺

Lecture17>Joins

Page 25: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Notes

• Wewrite𝐑 ⋈ 𝑺 tomeanjoinRandSbyreturningalltuplepairswhereallsharedattributesareequal

• Wewrite𝐑 ⋈ 𝑺 onA tomeanjoinRandSbyreturningalltuplepairswhereattribute(s)Aareequal

• Forsimplicity,we’llconsiderjoinsontwotables andwithequalityconstraints(“equijoins”)

Howeverjoinscanmerge>2tables,andsomealgorithmsdosupportnon-equalityconstraints!

Lecture17>Joins

Page 26: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

26

NestedLoopJoins

Lecture17>NLJ

Page 27: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Notes• Weareagainconsidering“IOaware”algorithms:careaboutdiskIO

• GivenarelationR,let:• T(R)=#oftuplesinR• P(R)=#ofpagesinR

• Notealsothatweomitceilingsincalculations…goodexercisetoputbackin!

Lecture17>NLJ

Recallthatweread/writeentirepageswithdiskIO

Page 28: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

NestedLoopJoin(NLJ)

Compute R ⋈ 𝑆𝑜𝑛𝐴:for r in R:for s in S:if r[A] == s[A]:yield (r,s)

Lecture17>NLJ

Page 29: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

NestedLoopJoin(NLJ)

Compute R ⋈ 𝑆𝑜𝑛𝐴:for r in R:for s in S:if r[A] == s[A]:yield (r,s)

Lecture17>NLJ

P(R)

1. LoopoverthetuplesinR

NotethatourIOcostisbasedonthenumberofpagesloaded,notthenumberoftuples!

Cost:

Page 30: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

NestedLoopJoin(NLJ)

Compute R ⋈ 𝑆𝑜𝑛𝐴:for r in R:for s in S:if r[A] == s[A]:yield (r,s)

Lecture17>NLJ

P(R)+T(R)*P(S)

HavetoreadallofSfromdiskforeverytupleinR!

1. LoopoverthetuplesinR

2. ForeverytupleinR,loopoverallthetuplesinS

Cost:

Page 31: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

NestedLoopJoin(NLJ)

Compute R ⋈ 𝑆𝑜𝑛𝐴:for r in R:for s in S:if r[A] == s[A]:yield (r,s)

Lecture17>NLJ

P(R)+T(R)*P(S)

NotethatNLJcanhandlethingsotherthanequalityconstraints…justcheckintheifstatement!

1. LoopoverthetuplesinR

2. ForeverytupleinR,loopoverallthetuplesinS

3. Checkagainstjoinconditions

Cost:

Page 32: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

NestedLoopJoin(NLJ)

Compute R ⋈ 𝑆𝑜𝑛𝐴:for r in R:for s in S:if r[A] == s[A]:yield (r,s)

Lecture17>NLJ

P(R)+T(R)*P(S)+OUT

1. LoopoverthetuplesinR

2. ForeverytupleinR,loopoverallthetuplesinS

3. Checkagainstjoinconditions

4. Writeout(topage,thenwhenpagefull,todisk)

Cost:

WhatwouldOUTbeifourjoinconditionistrivial(ifTRUE)?

OUT couldbebiggerthanP(R)*P(S)…butusuallynotthatbad

Page 33: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

NestedLoopJoin(NLJ)

Compute R ⋈ 𝑆𝑜𝑛𝐴:for r in R:for s in S:if r[A] == s[A]:yield (r,s)

Lecture17>NLJ

P(R)+T(R)*P(S)+OUT

WhatifR(“outer”)andS(“inner”)switched?

Cost:

P(S)+T(S)*P(R)+OUT

Outervs.innerselectionmakesahugedifference-DBMSneedstoknowwhichrelationissmaller!

Page 34: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

IO-AwareApproach

Lecture17>BNLJ

Page 35: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

BlockNestedLoopJoin(BNLJ)

Compute R ⋈ 𝑆𝑜𝑛𝐴:for each B-1 pages pr of R:

for page ps of S:for each tuple r in pr:

for each tuple s in ps:if r[A] == s[A]:

yield (r,s)

Lecture17>BNLJ

P(𝑅)

GivenB+1pagesofmemory

1. LoadinB-1pagesofRatatime(leaving1pageeachfreeforS&output)

Cost:

Note:Therecouldbesomespeeduphereduetothefactthatwe’rereadinginmultiplepagessequentiallyhoweverwe’llignorethishere!

Page 36: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

BlockNestedLoopJoin(BNLJ)

Compute R ⋈ 𝑆𝑜𝑛𝐴:for each B-1 pages pr of R:

for page ps of S:for each tuple r in pr:

for each tuple s in ps:if r[A] == s[A]:

yield (r,s)

Lecture17>BNLJ

P 𝑅 +𝑃 𝑅𝐵 − 1𝑃(𝑆)

GivenB+1pagesofmemory

Note:Fastertoiterateoverthesmaller relationfirst!

1. LoadinB-1pagesofRatatime(leaving1pageeachfreeforS&output)

2. Foreach(B-1)-pagesegmentofR,loadeachpageofS

Cost:

Page 37: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

BlockNestedLoopJoin(BNLJ)

Compute R ⋈ 𝑆𝑜𝑛𝐴:for each B-1 pages pr of R:

for page ps of S:for each tuple r in pr:

for each tuple s in ps:if r[A] == s[A]:

yield (r,s)

Lecture17>BNLJ

GivenB+1pagesofmemory

1. LoadinB-1pagesofRatatime(leaving1pageeachfreeforS&output)

2. Foreach(B-1)-pagesegmentofR,loadeachpageofS

3. Checkagainstthejoinconditions

BNLJcanalsohandlenon-equalityconstraints

Cost:

P 𝑅 +𝑃 𝑅𝐵 − 1𝑃(𝑆)

Page 38: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

BlockNestedLoopJoin(BNLJ)

Compute R ⋈ 𝑆𝑜𝑛𝐴:for each B-1 pages pr of R:

for page ps of S:for each tuple r in pr:

for each tuple s in ps:if r[A] == s[A]:

yield (r,s)

Lecture17>BNLJ

P 𝑅 +; <=>?

𝑃(𝑆) +OUT

GivenB+1pagesofmemory

1. LoadinB-1pagesofRatatime(leaving1pageeachfreeforS&output)

2. Foreach(B-1)-pagesegmentofR,loadeachpageofS

3. Checkagainstthejoinconditions

4. Writeout

Cost:

Again,OUT couldbebiggerthanP(R)*P(S)…butusuallynotthatbad

Page 39: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

BNLJvs.NLJ:BenefitsofIOAware

• InBNLJ,byloadinglargerchunksofR,weminimizethenumberoffulldiskreads ofS• WeonlyreadallofSfromdiskforevery(B-1)-pagesegmentofR!• Stillthefullcross-product,butmoredoneonlyinmemory

P 𝑅 +; <=>?

𝑃(𝑆) +OUTP(R)+T(R)*P(S)+OUTNLJ BNLJ

BNLJisfasterbyroughly(=>?)@(<);(<)

!

Lecture17>BNLJ

Page 40: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

BNLJvs.NLJ:BenefitsofIOAware

• Example:• R:500pages• S:1000pages• 100tuples/page• Wehave12pagesofmemory(B=11)

• NLJ:Cost=500+50,000*1000 =50MillionIOs~=140hours

• BNLJ:Cost=500+ABB∗?BBB?B

=50Thousand IOs~=0.14hours

Lecture17>BNLJ

Averyrealdifferencefromasmallchangeinthealgorithm!

IgnoringOUThere…

Page 41: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

SmarterthanCross-Products

Lecture17>INLJ

Page 42: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

SmarterthanCross-Products:FromQuadratictoNearlyLinear• Alljoinsthatcomputethefullcross-product havesomequadraticterm• Forexamplewesaw:

• Nowwe’llseesome(nearly)linearjoins:• ~O(P(R)+P(S)+OUT),whereagainOUT couldbequadraticbutisusuallybetter

P 𝑅 +𝑷 𝑹=>?

𝑷(𝑺) +OUT

P(R)+T(R)P(S)+OUTNLJ

BNLJ

Wegetthisgainbytakingadvantageofstructure- movingtoequalityconstraints(“equijoin”)only!

Lecture17>INLJ

Page 43: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

IndexNestedLoopJoin(INLJ)

Compute R ⋈ 𝑆𝑜𝑛𝐴:Given index idx on S.A:

for r in R:s in idx(r[A]):

yield r,s

Lecture17>INLJ

P(R)+T(R)*L+OUT

àWecanuseanindex (e.g.B+Tree)toavoiddoingthefullcross-product!

whereListheIOcosttoaccessallthedistinctvaluesintheindex;assumingthesefitononepage,L~3 isgoodest.

Cost:

Page 44: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

3.JoinsandMemory

44

Lecture17

Page 45: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Whatyouwilllearnaboutinthissection

1. Sort-MergeJoin(SMJ)

2. HashJoin(HJ)

3. SMJvs.HJ

45

Lecture17

Page 46: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Sort-MergeJoin(SMJ)

46

Lecture17

Page 47: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Whatyouwilllearnaboutinthissection

1. Sort-MergeJoin

2. “Backup”&TotalCost

3. Optimizations

47

Lecture17

Page 48: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

SortMergeJoin(SMJ):BasicProcedure

TocomputeR ⋈ 𝑆𝑜𝑛𝐴:

1. SortR,SonAusingexternalmergesort

2. Scan sortedfilesand“merge”

3. [Mayneedto“backup”- seenextsubsection]

NotethatifR,SarealreadysortedonA,SMJwillbeawesome!

Lecture17>SMJ

Notethatweareonlyconsideringequalityjoinconditionshere

Page 49: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

SMJExample:R ⋈ 𝑆𝑜𝑛𝐴with3pagebuffer

• Forsimplicity:Leteachpagebeonetuple,andletthefirstvaluebeA

DiskMainMemory

BufferR (5,b) (3,j)(0,a)

S (7,f) (0,j)(3,g)

WeshowthefileHEAD,whichisthenextvaluetoberead!

Lecture17>SMJ

Page 50: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

SMJExample:R ⋈ 𝑆𝑜𝑛𝐴with3pagebuffer

1.SorttherelationsR,Sonthejoinkey(firstvalue)

DiskMainMemory

BufferR (5,b) (3,j)(0,a)

S (7,f) (0,j)(3,g)

(3,j) (5,b)(0,a)

(3,g) (7,f)(0,j)

Lecture17>SMJ

Page 51: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

SMJExample:R ⋈ 𝑆𝑜𝑛𝐴with3pagebuffer

2.Scanand“merge”onjoinkey!

DiskMainMemory

BufferR

S (3,g) (7,f)

(3,j) (5,b)

Output

(0,j)

(0,a)(0,a)

(0,j)

Lecture17>SMJ

Page 52: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

SMJExample:R ⋈ 𝑆𝑜𝑛𝐴with3pagebuffer

2.Scanand“merge”onjoinkey!

DiskMainMemory

BufferR

S (3,g) (7,f)

(3,j) (5,b)

Output

(0,j)(0,a)

(0,a)

(0,j)(0,a,j)

Lecture17>SMJ

Page 53: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

SMJExample:R ⋈ 𝑆𝑜𝑛𝐴with3pagebuffer

2.Scanand“merge”onjoinkey!

DiskMainMemory

BufferR

S (3,g) (7,f)

(3,j) (5,b)

Output

(0,a)

(0,j)

(0,a,j)

(3,j,g)

(3,j)

(3,g)

(5,b)

(7,f)

Lecture17>SMJ

Page 54: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

SMJExample:R ⋈ 𝑆𝑜𝑛𝐴with3pagebuffer

2.Done!

DiskMainMemory

BufferR

S 3,g 7,f

3,j 5,b

Output

(0,a)

(0,j)

(0,a,j)

(3,j)

(3,g)

(3,j,g)

(5,b)

(7,f)

Lecture17>SMJ

Page 55: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Whathappenswithduplicatejoinkeys?

Lecture17>Backup

Page 56: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

MultipletupleswithSameJoinKey:“Backup”

1.Startwithsortedrelations,andbeginscan/merge…

DiskMainMemory

BufferR

S 3,g 7,f

3,j 5,b

Output

(0,j)

(0,g)

(0,b)

(7,f)

(0,a)

(0,j)

(0,a)

(0,j)

Lecture17>Backup

Page 57: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

1.Startwithsortedrelations,andbeginscan/merge…

DiskMainMemory

BufferR

S 3,g 7,f

3,j 5,b

Output

(0,j)

(0,g)

(0,b)

(7,f)

(0,a)

(0,a)(0,j)

(0,j) (0,a,j)

MultipletupleswithSameJoinKey:“Backup”

Lecture17>Backup

Page 58: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

1.Startwithsortedrelations,andbeginscan/merge…

DiskMainMemory

BufferR

S (0,g) 7,f

(0,j) 5,b

Output

(0,b)

(7,f)

(0,a)

(0,a)(0,j)

(0,a,j)

(0,a,g)

MultipletupleswithSameJoinKey:“Backup”

(0,g)

(0,j)

Lecture17>Backup

Page 59: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

1.Startwithsortedrelations,andbeginscan/merge…

DiskMainMemory

BufferR

S 0,g 7,f

0,j 5,b

Output

(0,j) (0,b)

(7,f)

(0,a)

(0,a,j)

(0,g)

(0,a,g)

(0,j)

Haveto“backup”inthescanofSandreadtuplewe’vealreadyread!

(0,j)

MultipletupleswithSameJoinKey:“Backup”

(0,j)

Lecture17>Backup

Page 60: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Backup

• Atbest,nobackupà scantakesP(R)+P(S)reads• Forex:ifnoduplicatevaluesinjoinattribute

• Atworst(e.g.fullbackupeachtime),scancouldtakeP(R)*P(S)reads!• Forex:ifallduplicatevaluesinjoinattribute,i.e.alltuplesinRandShavethesamevalueforthejoinattribute

• Roughly:ForeachpageofR,we’llhavetobackupandreadeachpageofS…

• Oftennotthatbadhowever,pluswecan:• Leavemoredatainbuffer(forlargerbuffers)• Can“zig-zag”(seeanimation)

Lecture17>Backup

Page 61: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

SMJ:Totalcost

• CostofSMJiscostofsorting RandS…

• Plusthecostofscanning:~P(R)+P(S)• Becauseofbackup:inworstcaseP(R)*P(S);butthiswouldbeveryunlikely

• Plusthecostofwritingout:~P(R)+P(S)butinworstcaseT(R)*T(S)

~Sort(P(R))+Sort(P(S))+P(R)+P(S) +OUT

Recall:Sort(N)≈ 2𝑁 log=𝑵

𝟐(𝑩R𝟏)+ 1

Note:thisisusingrepacking,whereweestimatethatwecancreateinitialrunsoflength~2(B+1)

Lecture17>Backup

Page 62: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

SMJvs.BNLJ:SteelCageMatch

• Ifwehave100bufferpages,P(R)= 1000pagesandP(S)=500pages:• Sortbothintwopasses:2*2*1000+2*2*500=6,000IOs• Mergephase1000+500=1,500IOs• =7,500IOs+OUT

WhatisBNLJ?• 500+1000* ABB

TU=6,500IOs+OUT

• But,ifwehave35bufferpages?• SortMergehassamebehavior(still2passes)• BNLJ?15,500IOs+OUT!

SMJis~linearvs.BNLJisquadratic…Butit’sallaboutthememory.

Lecture17>Backup

Page 63: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

ASimpleOptimization:MergesMerged!

• SMJiscomposedofasortphaseandamergephase

• Duringthesortphase,runpassesofexternalmergesortonRandS• Supposeatsomepoint,RandShave<=B(sorted)runsintotal

• Wecoulddotwomerges(foreachofR&S)atthispoint,completethesortphase,andstartthemergephase…

• OR,wecouldcombinethem:doone B-waymergeandcompletethejoin!

GivenB+1bufferpages

Lecture17>Backup

Page 64: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Merge/JoinPhase

SortPhase(Ext.MergeSort)

Un-OptimizedSMJ

SR

Split&sortSplit&sort

MergeMerge

MergeMerge

GivenB+1bufferpages

Joinedoutputfilecreated!

Unsortedinputrelations

Lecture17>Backup

Page 65: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Merge/JoinPhase

SortPhase(Ext.MergeSort)

SimpleSMJOptimization

SR

Split&sortSplit&sort

MergeMerge

GivenB+1bufferpages

Joinedoutputfilecreated!

Unsortedinputrelations

<=Btotalruns

B-WayMerge/Join

Lecture17>Backup

Page 66: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

SimpleSMJOptimization

• Now,onthislastpass,weonlydoP(R)+P(S)IOstocompletethejoin!

• IfwecaninitiallysplitRandSintoBtotalrunseachoflengthapprox.<=2(B+1),assumingrepackingletsuscreateinitialrunsof~2(B+1)- thenweonlyneed3(P(R)+P(S))+OUT forSMJ!• 2R/Wperpagetosortrunsinmemory,1RperpagetoB-waymerge/join!

• Howmuchmemoryforthistohappen?• ; < R;(V)

=≤ 2 𝐵 + 1 ⇒ ~P R + P S ≤ 2𝐵Z

• Thus,𝐦𝐚𝐱{𝐏 𝐑 , 𝐏 𝐒 } ≤ 𝑩𝟐 isanapproximatesufficientcondition

GivenB+1bufferpages

IfthelargerofR,Shas<=B2 pages,thenSMJcosts3(P(R)+P(S))+OUT!

Lecture17>Backup

Page 67: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

TakeawaypointsfromSMJ

Ifinputalreadysortedonjoinkey,skipthesorts.• SMJisbasicallylinear.• Nastybutunlikelycase:Manyduplicatejoinkeys.

SMJneedstosortbothrelations• Ifmax{P(R),P(S)}<B2 thencostis3(P(R)+P(S))+OUT

Lecture17>Summary

Page 68: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

HashJoin(HJ)

68

Lecture17

Page 69: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Whatyouwilllearnaboutinthissection

1. HashJoin

2. Memoryrequirements

69

Lecture17

Page 70: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Recall:Hashing

• Magicofhashing:• AhashfunctionhB mapsinto[0,B-1]• Andmapsnearlyuniformly

• Ahashcollision iswhenx!=ybuthB(x)=hB(y)• Notehoweverthatitwillnever occurthatx=ybuthB(x)!=hB(y)

• WehashonanattributeA,soourhasfunctionishB(t)hastheformhB(t.A).• Collisionsmaybemorefrequent.

Lecture17>HJ

Page 71: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

HashJoin:High-levelprocedure

TocomputeR ⋈ 𝑆𝑜𝑛𝐴:

1. PartitionPhase:Usingone(shared)hashfunctionhB,partitionRandSintoB buckets

2. MatchingPhase:Takepairsofbucketswhosetupleshavethesamevaluesforh,andjointhese1. UseBNLJhere;orhashagainà eitherway,operatingonsmallpartitionsso

fast!

Noteagainthatweareonlyconsideringequalityconstraintshere

Wedecompose theproblemusinghB,thencompletethejoin

Lecture17>HJ

Page 72: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

HashJoin:High-levelprocedure

1.PartitionPhase:Usingone(shared)hashfunctionhB,partitionRandSintoB buckets

Disk

R

S

(3,j)(0,j)

(0,a)(0,a)

(3,b)

(5,b)(0,a)(0,j)

Disk

R1

S1

hB

S2

R2

Moredetailinasecond…

(0,a)(0,a)

(0,j)

(3,j)(3,b)

(0,a)(0,j)

(5,b)(5,b)

Noteournewconvention:pageseachhavetwotuples(oneperrow)

Lecture17>HJ

Page 73: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

HashJoin:High-levelprocedure

Disk

R

S

(3,j)(0,j)

(0,a)(0,a)

(3,b)

(5,b)(0,a)(0,j)

Disk

R1

S1

hB

S2

R2

(0,a)(0,a)

(0,j)

(0,a)(0,j)

(5,b)(5,b)

Joinmatchingbuckets

2.MatchingPhase:TakepairsofbucketswhosetupleshavethesamevaluesforhB,andjointhese

Lecture17>HJ

(3,j)(3,b)

Page 74: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

HashJoin:High-levelprocedure

Disk

R

S

(3,j)(0,j)

(0,a)(0,a)

(3,b)

(5,b)(0,a)(0,j)

Disk

R1

S1

hB

S2

R2

(0,a)(0,a)

(0,j)

(0,a)(0,j)

(5,b)(5,b)

Don’thavetojointheothers!E.g.(S1 andR2)!

2.MatchingPhase:TakepairsofbucketswhosetupleshavethesamevaluesforhB,andjointhese

Lecture17>HJ

(3,j)(3,b)

Page 75: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

HashJoinPhase1:Partitioning

Goal:Foreachrelation,partitionrelationintobuckets suchthatifhB(t.A)=hB(t’.A)theyareinthesamebucket

GivenB+1bufferpages,wepartitionintoBbuckets:• WeuseBbufferpagesforoutput(oneforeachbucket),and1forinput

• The“dual”ofsorting.• Foreachtupletininput,copytobufferpageforhB(t.A)• Whenpagefillsup,flushtodisk.

Lecture17>HJ

Page 76: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Howbigaretheresultingbuckets?

• GivenNinputpages,wepartitionintoBbuckets:• à Ideallyourbucketsareeachofsize~N/Bpages

• Whathappensiftherearehashcollisions?• Bucketscouldbe>N/B• We’lldoseveralpasses…

• Whathappensifthereareduplicatejoinkeys?• Nothingwecandohere…couldhavesomeskew insizeofthebuckets

GivenB+1buffer pages

Lecture17>HJ

Page 77: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Howbigdowewant theresultingbuckets?

• Ideally,ourbucketswouldbeofsize≤ 𝑩− 𝟏 pages• 1 forinputpage, 1foroutputpage,B-1 foreachbucket

• Recall:IfwewanttojoinabucketfromRandonefromS,wecandoBNLJinlineartimeifforoneofthem(wlog sayR),𝑷(𝑹) ≤ 𝑩 − 𝟏!• Andmoregenerally,beingabletofitbucketinmemoryisadvantageous

• Wecankeeppartitioningbucketsthatare>B-1pages,untiltheyare≤ 𝑩− 𝟏 pages• Usinganewhashkeywhichwillsplitthem… We’llcalleachofthese

a“pass”again…

GivenB+1buffer pages

RecallforBNLJ:

P 𝑅 +𝑃 𝑅 𝑃(𝑆)𝐵 − 1

Lecture17>HJ

Page 78: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

WepartitionintoB=2 bucketsusinghashfunctionh2 sothatwecanhaveonebufferpageforeachpartition(andoneforinput)

HashJoinPhase1:Partitioning

Disk

R

(3,j)(0,j)

GivenB+1=3buffer pages

(5,b) (5,a)(0,j)

(0,a)(3,a)

Forsimplicity,we’lllookatpartitioningoneofthetworelations- wejustdothesamefortheotherrelation!

Recall:ourgoalwillbetogetB=2buckets ofsize<=B-1à 1pageeach

Lecture17>HJ

Page 79: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

1.WereadpagesfromRintothe“input”pageofthebuffer…

MainMemory

Buffer

HashJoinPhase1:Partitioning

Inputpage

0 1

Output(bucket)pages

Disk

R

GivenB+1=3buffer pages

(3,j)(0,j)

(5,b) (5,a)(0,j)

(0,a)(3,a)

Lecture17>HJ

Page 80: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

2.Thenweusehashfunctionh2 tosortintothebuckets,whicheachhaveonepageinthebuffer

MainMemory

Buffer

HashJoinPhase1:Partitioning

Inputpage

0 1

Output(bucket)pages

Disk

R

GivenB+1=3buffer pages

(3,a)

h2(0)=0

(0,a)(3,a)

(0,a)

(3,j)(0,j)

(5,b) (5,a)(0,j)

Lecture17>HJ

Page 81: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

MainMemory

Buffer

HashJoinPhase1:Partitioning

Inputpage

0 1

Output(bucket)pages

Disk

R

GivenB+1=3buffer pages

(3,a)

h2(3)=1

(0,a) (3,a)

(3,j)(0,j)

(5,b) (5,a)(0,j)

2.Thenweusehashfunctionh2 tosortintothebuckets,whicheachhaveonepageinthebuffer

Lecture17>HJ

Page 82: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

3.Werepeatuntilthebufferbucketpagesarefull…

MainMemory

Buffer

HashJoinPhase1:Partitioning

Inputpage

0 1

Output(bucket)pages

Disk

R

GivenB+1=3buffer pages

(0,a) (3,a)

(3,j)(0,j)

(5,b) (5,a)(0,j)

Lecture17>HJ

Page 83: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

3.Werepeatuntilthebufferbucketpagesarefull…

MainMemory

Buffer

HashJoinPhase1:Partitioning

Inputpage

0 1

Output(bucket)pages

Disk

R

GivenB+1=3buffer pages

(0,a) (3,a)(0,j)

(5,b) (5,a)(0,j)

h2(3)=1

(3,j)(0,j)

(3,a)(3,j)

Lecture17>HJ

Page 84: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

3.Werepeatuntilthebufferbucketpagesarefull…

MainMemory

Buffer

HashJoinPhase1:Partitioning

Inputpage

0 1

Output(bucket)pages

Disk

R

GivenB+1=3buffer pages

(0,a) (3,a)(0,j)

(5,b) (5,a)(0,j)

h2(0)=0

(3,a)(3,j)

(0,a)(0,j)

Lecture17>HJ

Page 85: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

3.Werepeatuntilthebufferbucketpagesarefull…thenflushtodisk

MainMemory

Buffer

HashJoinPhase1:Partitioning

Inputpage

0 1

Output(bucket)pages

Disk

R

GivenB+1=3buffer pages

(5,b) (5,a)(0,j)

B0

B1

(3,a)(3,j)

(0,a)(0,j)

Lecture17>HJ

Page 86: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

3.Werepeatuntilthebufferbucketpagesarefull…thenflushtodisk

MainMemory

Buffer

HashJoinPhase1:Partitioning

Inputpage

0 1

Output(bucket)pages

Disk

R

GivenB+1=3buffer pages

(5,b)

B0

B1

(0,a)(0,j)

(3,a)(3,j)

(5,a)(0,j)

Lecture17>HJ

Page 87: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Notethatcollisionscanoccur!

MainMemory

Buffer

HashJoinPhase1:Partitioning

Inputpage

0 1

Output(bucket)pages

Disk

R

GivenB+1=3buffer pages

(5,b)

B0

B1

(0,a)(0,j)

(3,a)(3,j)

(0,j)

h2(5)=1

Collision!!!

(5,a)(0,j)

(5,a)

Lecture17>HJ

h2(5)=h2(3)=1

Page 88: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Finishthispass…

MainMemory

Buffer

HashJoinPhase1:Partitioning

Inputpage

0 1

Output(bucket)pages

Disk

R

GivenB+1=3buffer pages

(5,b)

B0

B1

(0,a)(0,j)

(3,a)(3,j)

(0,j)

h2(0)=0

(5,a)(0,j)

Lecture17>HJ

Page 89: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Finishthispass…

MainMemory

Buffer

HashJoinPhase1:Partitioning

Inputpage

0 1

Output(bucket)pages

Disk

R

GivenB+1=3buffer pages

B0

B1

(0,a)(0,j)

(3,a)(3,j)

(5,a)(0,j)

(5,b)

Lecture17>HJ

Page 90: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Finishthispass…

MainMemory

Buffer

HashJoinPhase1:Partitioning

Inputpage

0 1

Output(bucket)pages

Disk

R

GivenB+1=3buffer pages

B0

B1

(0,a)(0,j)

(3,a)(3,j)

(5,a)(0,j)(5,b)

h2(5)=1

(5,a)(5,b)

Lecture17>HJ

h2(5)=h2(3)=1

Collision!!!

Page 91: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Finishthispass…

MainMemory

Buffer

HashJoinPhase1:Partitioning

Inputpage

0 1

Output(bucket)pages

Disk

R

GivenB+1=3buffer pages

B0

B1

(0,a)(0,j)

(3,a)(3,j)

(0,j) (5,a)(5,b)

Lecture17>HJ

Page 92: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

HashJoinPhase1:Partitioning

Disk

GivenB+1=3buffer pages

B0

B1

(0,a)(0,j)

(3,a)(3,j)

(0,j)

(5,a)(5,b)

WewantedbucketsofsizeB-1=1…howeverwegotlargeronesdueto:

(1)Duplicatejoinkeys

(2)Hashcollisions

Lecture17>HJ

Page 93: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

HashJoinPhase1:Partitioning

Disk

GivenB+1=3buffer pages

B0

B1

(0,a)(0,j)

(3,a)(3,j)

(0,j)

(5,a)(5,b)

Totakecareoflargerbucketscausedby(2)hashcollisions,wecanjustdoanotherpass!

Whathashfunctionshouldweuse?

Doanotherpasswithadifferenthashfunction,h’2,ideallysuchthat:

h’2(3)!=h’2(5)

Lecture17>HJ

Page 94: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

HashJoinPhase1:Partitioning

Disk

GivenB+1=3buffer pages

B0

B1

(0,a)(0,j)

(3,a)(3,j)

(0,j)

Totakecareoflargerbucketscausedby(2)hashcollisions,wecanjustdoanotherpass!

Whathashfunctionshouldweuse?

Doanotherpasswithadifferenthashfunction,h’2,ideallysuchthat:

h’2(3)!=h’2(5)

B2(5,a)(5,b)

Lecture17>HJ

Page 95: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

HashJoinPhase1:Partitioning

Disk

GivenB+1=3buffer pages

B0

B1

(0,a)(0,j)

(3,a)(3,j)

(0,j)

Whataboutduplicatejoinkeys?Unfortunatelythisisaproblem…butusuallynotahugeone.

B2(5,a)(5,b)

Wecallthisunevennessinthebucketsizeskew

Lecture17>HJ

Page 96: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

NowthatwehavepartitionedRandS…

Lecture17>HJ

Page 97: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

HashJoinPhase2:Matching

• Now,wejustjoinpairsofbucketsfromRandSthathavethesamehashvaluetocompletethejoin!

Disk

R

S

(3,j)(0,j)

(0,a)(0,a)

(3,b)

(5,b)(0,a)(0,j)

Disk

R1

S1

hB

S2

R2

(0,a)(0,a)

(0,j)

(0,a)(0,j)

(5,b)(5,b)

Joinmatchingbuckets

Lecture17>HJ

(3,j)(3,b)

Page 98: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

HashJoinPhase2:Matching

• Notethatsincex=yà h(x)=h(y),weonlyneedtoconsiderpairsofbuckets(onefromR,onefromS)thathavethesamehashfunctionvalue

• Ifourbucketsare~𝑩 − 𝟏 pages, canjoineachsuchpairusingBNLJinlineartime;recall(withP(R)=B-1):

BNLJCost: P 𝑅 +; < ;(V)=>?

= 𝑃(𝑅) + (=>?);(V)=>?

=P(R)+P(S)

Joiningthepairsofbucketsislinear!(Aslongassmallerbucket<=B-1pages)

Lecture17>HJ

Page 99: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

HashJoinPhase2:Matching

h(1)=0

h(1)=0

h(2)=0

h(3)=1

h(3)=1

h(4)=1

h(5)=2

h(6)=2

h(6)=2

h(1) h(1) h(2) h(2) h(3) h(4) h(5) h(6) h(6)

R.Ahashedvalues

S.Ahashedvalues

R ⋈ 𝑆𝑜𝑛𝐴

Lecture17>HJ

Page 100: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

HashJoinPhase2:Matching

h(1)=0

h(1)=0

h(2)=0

h(3)=1

h(3)=1

h(4)=1

h(5)=2

h(6)=2

h(6)=2

h(1) h(1) h(2) h(2) h(3) h(4) h(5) h(6) h(6)

R.Ahashedvalues

S.Ahashedvalues

R ⋈ 𝑆𝑜𝑛𝐴

Toperformthejoin,weideallyjustneedtoexplorethedarkblueregions

=thetupleswithsamevaluesofthejoinkeyA

A=1

A=2

A=3

A=4A=5

A=6

Lecture17>HJ

Page 101: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

HashJoinPhase2:Matching

h(1)=0

h(1)=0

h(2)=0

h(3)=1

h(3)=1

h(4)=1

h(5)=2

h(6)=2

h(6)=2

h(1) h(1) h(2) h(2) h(3) h(4) h(5) h(6) h(6)

R.Ahashedvalues

S.Ahashedvalues

R ⋈ 𝑆𝑜𝑛𝐴

WithajoinalgorithmlikeBNLJthatdoesn’ttakeadvantageofequijoinstructure,we’dhavetoexplorethiswholegrid!

Lecture17>HJ

Page 102: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

HashJoinPhase2:Matching

h(1)=0

h(1)=0

h(2)=0

h(3)=1

h(3)=1

h(4)=1

h(5)=2

h(6)=2

h(6)=2

h(1) h(1) h(2) h(2) h(3) h(4) h(5) h(6) h(6)

R.Ahashedvalues

S.Ahashedvalues

R ⋈ 𝑆𝑜𝑛𝐴h(A)=0

h(A)=1

h(A)=2

WithHJ,weonlyexploretheblueregions

=thetupleswithsamevaluesofh(A)!

WecanapplyBNLJtoeachoftheseregions

Lecture17>HJ

Page 103: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

HashJoinPhase2:Matching

R.Ahashedvalues

S.Ahashedvalues

R ⋈ 𝑆𝑜𝑛𝐴h'(A)=0

h'(A)=2

AnalternativetoapplyingBNLJ:

Wecouldalsohashagain,andkeepdoingpassesinmemorytoreducefurther!

h'(A)=1

h'(A)=3 h'(A)

=4

h'(A)=5

Lecture17>HJ

Page 104: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

HowmuchmemorydoweneedforHJ?

• GivenB+1bufferpages

• Suppose(reasonably)thatwecanpartitionintoBbucketsin2passes:• ForR,wegetBbucketsofsize~P(R)/B• Tojointhesebucketsinlineartime,weneedthesebucketstofitinB-1pages,sowehave:

+WLOG:AssumeP(R)<=P(S)

𝐵 − 1 ≥𝑃 𝑅𝐵 ⇒ ~𝑩𝟐 ≥ 𝑷(𝑹)

Quadraticrelationshipbetweensmallerrelation’ssize&memory!

Lecture17>Memoryrequirements

Page 105: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

HashJoinSummary

• Givenenoughbufferpagesasonpreviousslide…

• Partitioning requiresreading+writingeachpageofR,S• à 2(P(R)+P(S))IOs

• Matching (withBNLJ)requiresreadingeachpageofR,S• à P(R)+P(S)IOs

• Writingoutresults couldbeasbadasP(R)*P(S)…butprobablyclosertoP(R)+P(S)

HJtakes~3(P(R)+P(S))+OUT IOs!

Lecture17>Memoryrequirements

Page 106: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

SMJvs.HJ

Lecture17

Page 107: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Sort-Mergev.HashJoin

• Givenenoughmemory,bothSMJandHJhaveperformance:

• “Enough”memory=

• SMJ:B2 >max{P(R),P(S)}

• HJ:B2 >min{P(R),P(S)}

HashJoinsuperiorifrelationsizesdiffergreatly.Why?

~3(P(R)+P(S))+OUT

Lecture17

Page 108: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

FurtherComparisonsofHashandSortJoins

• HashJoinsarehighlyparallelizable.

• Sort-Mergelesssensitivetodataskew andresultissorted

Lecture17

Page 109: Lecture 17 Joins - GitHub Pages · 2017-12-13 · How do I choose the right graduate program? Lecture 17: Joins Lecture 17. Today’s Lecture 1. Recap: Select, Project 2. Joins 3

Summary

• SawIO-awarejoinalgorithms• Massivedifference

• Memorysizeskeyinhashversussortjoin• HashJoin=Littledog(dependsonsmallerrelation)

• Skewisalsoamajorfactor

Lecture17