joinspjm/co572/lectures/joins.pdf · revision youshouldknowaboutjoins...
TRANSCRIPT
![Page 1: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/1.jpg)
Joins
Holger Pirk
1 / 60
![Page 2: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/2.jpg)
Purpose of this lecture
You shouldUnderstand and know the basic join algorithms: nested loop (withvariants), hash, sort-mergeBe able to select an appropriate join implementation for a givensituation
2 / 60
![Page 3: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/3.jpg)
Why Join?
3 / 60
![Page 4: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/4.jpg)
Why care?
Joins are everywhereIn part due to to whole normalization business
I These are mostly Foreign-Key joins (we’ll talk about those in thecontext of indexing)
In part because combining (joining) data produces valueI These are more complicated (and interesting)
ExamplesFind users that have bought the same productsFind the shortest route visting 5 of London’s best sightsFind advertisements that worked
I (lead to users searching for a specific term within a timeframe)
4 / 60
![Page 5: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/5.jpg)
Revision
You should know about joinsJoins are basically cross products with selections on top
I select R.r, S.s from R,S where R.id = S.id
Quick quiz: join or cross productselect R.r, S.s from R,S where R.r = R.id
Quick quiz: join or cross productselect R.r from R,S where R.r = "something"
Quick quiz: join or cross productselect R.r from R,S where R.r = S.s
5 / 60
![Page 6: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/6.jpg)
Revision
You should know about joinsJoins are basically cross products with selections on top
I select R.r, S.s from R,S where R.id = S.id
Quick quiz: join or cross productselect R.r, S.s from R,S where R.r = R.id
Quick quiz: join or cross productselect R.r from R,S where R.r = "something"
Quick quiz: join or cross productselect R.r from R,S where R.r = S.s
5 / 60
![Page 7: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/7.jpg)
Revision
You should know about joinsJoins are basically cross products with selections on top
I select R.r, S.s from R,S where R.id = S.id
Quick quiz: join or cross productselect R.r, S.s from R,S where R.r = R.id
Quick quiz: join or cross productselect R.r from R,S where R.r = "something"
Quick quiz: join or cross productselect R.r from R,S where R.r = S.s
5 / 60
![Page 8: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/8.jpg)
Revision
You should know about joinsJoins are basically cross products with selections on top
I select R.r, S.s from R,S where R.id = S.id
Quick quiz: join or cross productselect R.r, S.s from R,S where R.r = R.id
Quick quiz: join or cross productselect R.r from R,S where R.r = "something"
Quick quiz: join or cross productselect R.r from R,S where R.r = S.s
5 / 60
![Page 9: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/9.jpg)
Left and Right Outer JoinsLeft Join
A left join RLon S returns every row in R , even if no rows in S match. In
such cases where no row in S matches a row from R , the columns of S arefilled with NULL values.
Right Join
A right join RRon S returns every row in S , even if no rows in R match. In
such cases where no row in R matches a row from S , the columns of R arefilled with NULL values.
Outer Join
An outer join ROon S returns every row in R , even if no rows in S match,
and also returns every row in S even if no row in R matches.
ROon S ≡ (R
Lon S) ∪ (R
Ron S)
6 / 60
![Page 10: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/10.jpg)
Revision Quiz
What are the bounds of for the size of the output (O) of this queryselect * from R full outer join S on (R.r = S.s)
AnswersA |R| ≤ |O| ≤ |R| × |S |B max(|R|, |S |) ≤ |O| ≤ max(|R|, |S |, |R| × |S |)C |R|+ |S | ≤ |O| ≤ max(|R|+ |S |, |R| × |S |)D |S | ≤ |O| ≤ |R| × |S |
7 / 60
![Page 11: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/11.jpg)
On matching criteria
The matching functionselect * from R full outer join S on (R.r = S.s)
The matching function need not be equalityIf it is, we call the join an equi-join (these are the most importantjoins)If it is an inequality constraint (< or >), we call them inequality joins
select count(*) from event, marker where event.timebetween marker.time and marker.time+60
If it is an <> (!= in C syntax), we call it an anti-joinAll other joins are called Theta joins
8 / 60
![Page 12: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/12.jpg)
Nested Subqueries in where-clauses: the In operator
Definitiontrue if the attribute value is contained in the setsubquery must only return a single column
The simple case: statically defined setsselect * from Customer where name IN ("holger", "sam")
The correlated caseselect Name from Customer where CustomerID IN (select CustomerID from Order)
equivalent toselect distinct Name from Customer, Orderwhere Customer.CustomerID = Order.CustomerID
9 / 60
![Page 13: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/13.jpg)
Join algorithms
10 / 60
![Page 14: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/14.jpg)
Nested Loop - the naive one
11 / 60
![Page 15: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/15.jpg)
Nested Loop - the naive one
Implementationfor (size_t i = 0; i < leftRelationSize; i++) {
auto leftInput = left[ i];for (size_t j = 0; j < rightRelationSize; j++) {
auto rightInput = right[ j];if(leftInput[leftAttribute] == rightInput[rightAttribute])
writeToOutput({leftInput, rightInput});}
}
Example dataR: 10, 17, 7, 16, 12, 8, 13
S: 8, 16, 12, 1, 17, 2, 7
12 / 60
![Page 16: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/16.jpg)
Nested Loop Join - the naive one
PropertiesSimpleSequential I/OTrivial to parallelize (no dependent loop iterations)
EffortΘ(|left| × |right|)Can be reduced to Θ( |left|×|right|2 ) if value uniqueness can be assumed
Remember: Databases are I/O bound. . .. . . well, they are on disk. . .. . . so, lets see how many pages we need to read
13 / 60
![Page 17: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/17.jpg)
Worksheet
ParametersAssume a nested loop join of two disk-resident tables in N-ary formThe first table has 500,000 tuples of three integer attributesThe second has 50,000 tuples of two integer attributesPages are plain (unslotted) 4KBThe page buffer can hold 10 pagesThe replacement strategy of the buffer manager is LRU
How many page replacements need to be performed?
14 / 60
![Page 18: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/18.jpg)
Calculation
15 / 60
![Page 19: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/19.jpg)
Sort-Merge Joins - a join for special cases
16 / 60
![Page 20: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/20.jpg)
Sort-Merge Joins - a join for special cases
Implementation (assuming values are unique and sorted)auto leftI = 0;auto rightI = 0;while (leftI < leftInputSize && rightI < rightInputSize) {
auto leftInput = left[ leftI];auto rightInput = right[ rightI];if(leftInput[leftAttribue] < rightInput[rightAttribue])
leftI++;else if(rightInput[rightAttribue] < leftInput[leftAttribue])
rightI++;else {
writeToOutput({leftInput, rightInput});rightI++;leftI++;
}}
Example dataR = {7, 8, 10, 12, 13, 16, 17};
S = {1, 2, 7, 8, 12, 16, 17};
17 / 60
![Page 21: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/21.jpg)
Sort-Merge Joins - a join for special casesExample data
R = {7, 8, 10, 12, 13, 16, 17};, S = {1, 2, 7, 8, 12, 16, 17};
18 / 60
![Page 22: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/22.jpg)
Why Sort-Merge Joins works
InvariantsAssume, w.l.o.g., that the value on the rightis less than the value on the leftAll values succeeding the value on the rightare greater than the value on right⇒ No value beyond the value on the rightcan be a join partner⇒ The value on the left has no joinpartners succeeding the value on the right⇒ The cursor on the left can be advanced
Visualisation
>7>5
57
Merge Direction
Comparison
19 / 60
![Page 23: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/23.jpg)
Sort-Merge Joins - a join for special cases
EffortO(|left| × log |left|+ |right| × log |right|+ |left|+ |right|)
I Assuming uniqueness
PropertiesSequential I/O in the merge phaseTricky to parallelizeWorks for inequality joins
I Careful when advancing the cursors
20 / 60
![Page 24: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/24.jpg)
Worksheet: Sort-Merge Equality Joins
ParametersYou have two sorted relations with a single attribute
I R: 2, 5, 6, 7, 9, 11, 12I S: 1, 4, 5, 6, 10, 11, 13
Evaluate the query select r,s from R,S where r = sI Use a sort-merge joinI You can assume the values in each relation are unique
21 / 60
![Page 25: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/25.jpg)
Calculation
22 / 60
![Page 26: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/26.jpg)
Hash joins - the special case that is so important
23 / 60
![Page 27: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/27.jpg)
Hash joins - the special case that is so importantImplementationextern int* hashTable;
for (size_t i = 0; i < buildSideSize; i++) {auto buildInput = build[ i];auto hashValue = hash(buildInput[buildAttribute]);while (hashTable[hashValue])
hashValue = nextSlot(hashValue);hashTable[hashValue] = buildInput;
}
for (size_t i = 0; i < probeSideSize; i++) {auto probeInput = probe[ i];auto hashValue = hash(probeInput[probeAttribute]);while (hashTable[hashValue] &&
hashTable[hashValue][buildAttribute] != probeInput[probeAttribute])hashValue = nextSlot(hashValue);
if(hashTable[hashValue][buildAttribute] == probeInput[probeAttribute])writeToOutput({hashTable[hashValue], probeInput});
}
NomenclatureWe distinguish build-side (the side that is buffered in the hashtable)and probe-side (the one used to look up tuples in the hashtabl)
24 / 60
![Page 28: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/28.jpg)
Hash join details. . . the hash functionHash-function requirements
Pure no stateKnown output domain we need to know the range of generated valuesContiguous output domain we do not want holes in the output domain
Nice to haveUniform all values should be equally likely
Typical examplesCRC32MD5MurmurHash
I This is one of the fastest "decent" hash-functionsModulo-Division
I Arguably the simplest function you could have
25 / 60
![Page 29: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/29.jpg)
Conflict Handling
When a slot is already filled but there is space in the table. . .We need to put the value somewhere. . .The conflict handling strategy prescribes where
RequirementsLocality (but not too much :-))No holes (probe all slots)
Many exist - let’s talk about threeLinear probingQuadratic probingRehashing: specifically, cyclic group probing
26 / 60
![Page 30: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/30.jpg)
Linear Probing
DescriptionWhen a slot is filled, try the next one (distance 1). . .. . . and the next one (distance 2). . .. . . continue until you find one that is free (3,4,5,6, etc.)
27 / 60
![Page 31: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/31.jpg)
Hash-join with modulo hashing and linear probing
Implementationextern int* hashTable;
for(size_t i = 0; i < buildSideSize; i++) {auto buildInput = build[i];auto hashValue = buildInput[buildAttribute] % 10; // hash-functionwhile(hashTable[hashValue])
hashValue = (hashValue++ % 10); // probe functionhashTable[hashValue] = buildInput;
}
for(size_t i = 0; i < probeSideSize; i++) {auto probeInput = probe[i];auto hashValue = probeInput[probeAttribute] % 10;while(hashTable[hashValue] && //
hashTable[hashValue][buildAttribute] != probeInput[probeAttribute])hashValue = (hashValue++ % 10);
if(hashTable[hashValue][buildAttribute] == probeInput[probeAttribute])writeToOutput({hashTable[hashValue], probeInput});
}
28 / 60
![Page 32: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/32.jpg)
Hash-join with modulo hashing and linear probingExample data (, linear probing)
auto hash = [](auto v){return v % 10; };
auto probe = [](auto v) { return (v + 1) % 10; }
Probe = {7, 8, 10, 12, 13, 16, 17}; Build = {1, 2, 7, 8, 12, 16, 17};
29 / 60
![Page 33: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/33.jpg)
Linear Probing
DescriptionWhen a slot is filled, try the next one (distance 1). . .. . . and the next one (distance 2). . .. . . continue until you find one that is free (3,4,5,6, etc.)
DisadvantagesLeads to long probe-chains for high-locality dataFor example, 9,8,7,6,5,4,3,2,2
30 / 60
![Page 34: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/34.jpg)
Quadratic Probing
DescriptionWhen a slot is filled, try the next one (distance 1). . .. . . double the distance (distance 2). . .. . . continue until you find one that is free (4, 8, 16, etc.)
AdvantagesSimpleGood locality in the first three probes
I Bad after that (that is the point)
DisadvantagesThe first few probes are still likely to incur conflictsFor example, 9,8,7,6,5,4,3,2,2
31 / 60
![Page 35: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/35.jpg)
Rehashing, specifically, cyclic group probing
The challenge:Randomize the probesMake them deterministicMake sure all slots are probed
32 / 60
![Page 36: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/36.jpg)
Cyclic group probing
Hardcore math to the rescueThere is a way to generate a (pseudorandom) sequence of numbers
I while making sure every number in a range is generatedA multiplicative group of integers modulo n
I The equation is f (x) = (x × g) mod n for certaing g and n
How do we determine g and n (n is the size of our hash table)I n = pk with p an odd prime and k > 0I There is no easy way to calculate the primitive roots (g) for a given nI Using Mathematica: PrimitiveRootList@11 = {2,6,7,8}I group = 6, 3, 7, 9, 10, 5, 8, 4, 2, 1, 6, ...
33 / 60
![Page 37: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/37.jpg)
Worksheet: Hash-building
Parameters (Hashtable size = 11)You have two relations with a single attribute
I Probe: 6, 7, 9, 12, 11, 15, 2I Build: 1, 4, 5, 6, 15, 11, 14
Evaluate the query select r,s from R,S where s = rI Use a hash-joinI You can assume the values in each relation are unique
The hash function is f (x) = x mod 10 + 1
Which is the best probing strategy in this case(with respect to the comparisons in the build phase)?
I Linear ProbingI Quadratic ProbingI Rehashing with the cyclic group f (x) = (x × 6) mod 11
34 / 60
![Page 38: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/38.jpg)
Calculation
35 / 60
![Page 39: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/39.jpg)
Hash joins - the special case that is so important
PropertiesSequential I/O on the inputs
I (Pseudo-random access to the hashtable during build and probe)
Parallelizable over the values on the probe sideParallelizing the build is tricky (Research opportunities!)
EffortO(|build | × |probe|) in the worst caseΘ(|build |+ |probe|) in the best case
36 / 60
![Page 40: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/40.jpg)
Hash Joins practicalities
Cyclic groups are the theory. . .The practice is a lot less elegantPeople simply rehash using some arbitrary hash-function
I (often the same they used for initial hashing)I does this guarantee hole-freeness?
Hashing is expensiveEspecially good hashing
I Lots of CPU cycles (often more expensive than multiple data accesses)
Slots are often allocated in bucketsBuckets are slots with space for more than one tupleYou will sometimes see people implementing buckets as linked lists
I A horrible idea if you care about lookup performance (inserts are okay)
37 / 60
![Page 41: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/41.jpg)
Hash Joins practicalities
Hashtables are arrays tooThey occupy spaceThey are usually overallocated by at least a factor two
I i.e., you allocate twice as many slots as (estimated) tuple inputs(obviously adapting the hash-function)
They are probed randomly in the probe phase (a lot)I You really want to make sure they stay in the buffer pool/cache
For this class, assume that, if the hashtable does not fit, every accessis a page fault (consistent with what we established previously)Rule of thumb: use Hash Joins when one relation is much smaller thanthe other
38 / 60
![Page 42: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/42.jpg)
Partitioning. . . for buffer pool friendlyness
39 / 60
![Page 43: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/43.jpg)
Hashjoins
RecallRule of thumb: use Hash Joins when one relation is much smaller thanthe other
How do you make sure of that?You partition!
40 / 60
![Page 44: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/44.jpg)
Partitioning - when you don’t have enough memory
Fundamental premise:Sequential access is much cheaper than random access
I Difference grows with the page sizeI Assume: Random value access cost cI Sequential value access cost c
pagesizeI Why?
Assume your hashtable does not fit in the buffer poolI.e., if the relation is larger than half the buffer poolIt can pay off to invest in an extra pass for partitioning
41 / 60
![Page 45: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/45.jpg)
Partitioning - an example
AssumeYour smaller relation is exactly the size of the buffer pool minus onepage
I Hashtable is overallocated by a factor 2I Probability of a page fault upon hash is roughly 50 percent
Partitioning it into 4 smaller partitions costs you one scanI Plus writing output (which is also one scan)
Now, you join the large relation with each of the 4 smaller, buffer poolresident relations
I Costs: 4 scans over the larger relationI Remember: we assume |buidl | << |probe|, i .e., costsroughlyquadrupleBut: each of the hash lookups is now hitting the buffer pool
I Disk seek latency: 3ms, RAM access latency: 30ns – a factor of 100K
42 / 60
![Page 46: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/46.jpg)
Partitioning - an example
Visualization
1412175
9…
inputBuffer Pool
(size 5)2011
partitioning function (mod 4)
Pages on disk
when page is filled(semi-random but
without bandwidth waste)
for every tuple
43 / 60
![Page 47: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/47.jpg)
Partitioning
BonusYou can parallelize the processing of each of the smaller joins
I because they are disjointYou can partition the larger relation as well. . .
I . . . and only join the overlapping partitionsI this is the state of the art in join processing
44 / 60
![Page 48: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/48.jpg)
Blocked Nested Loops - the slightly smarter NLJ
45 / 60
![Page 49: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/49.jpg)
Blocked Nested Loops - the slightly smarter
Implementation (assuming all pages are filled)auto tuplesOnLeftInputPages = 512;auto tuplesOnRightInputPages = 1024;for (size_t i = 0; i < leftRelationSize; i+= tuplesOnLeftInputPages)
for (size_t j = 0; j < rightRelationSize; j+= tuplesOnRightInputPages)for (size_t i_page = 0; i_page < tuplesOnLeftInputPages; i_page++) {
auto leftInput = left[ i+i_page];for (size_t j_page = 0; j_page < tuplesOnRightInputPages; j_page++) {
auto rightInput = right[ j+j_page];if(leftInput[leftAttribute] == rightInput[rightAttribute])
writeToOutput({leftInput, rightInput});}
}
46 / 60
![Page 50: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/50.jpg)
Visualization
47 / 60
![Page 51: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/51.jpg)
Blocked Nested Loops - the slightly smarter
PropertiesSequential I/OTrivial to parallelize (no dependent loop iterations)
EffortSame number of comparisons as nested loop
I Θ(|left| × |right|)I Can be reduced to Θ( |left|×|right|2 ) if value uniqueness can be assumed
Better I/O behaviour
48 / 60
![Page 52: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/52.jpg)
Worksheet
ParametersAssume a nested loop join of two disk-resident tables in N-ary formThe first table has 500,000 tuples of three integer attributesThe second has 50,000 tuples of two integer attributesPages are plain (unslotted) 4KBThe page buffer can hold 10 pagesThe replacement strategy of the buffer manager is LRU
How many page replacements need to be performed?
49 / 60
![Page 53: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/53.jpg)
Calculation
50 / 60
![Page 54: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/54.jpg)
Indexed Nested Loops - when you know what you’re joining
51 / 60
![Page 55: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/55.jpg)
Indexed Nested Loops - when you know what you’re joining
AssumptionsOne side has an indexWhat is an index, you ask?
I Well, let’s, for now, assume it is sortedI We’re going to talk about this more later
The other is not
IdeaScan the unindexed side (like you would in the nested loops case)Look up values on the inner side using the appropriate lookup function
I In the case of a sorted relation, that means binary searching
52 / 60
![Page 56: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/56.jpg)
Indexed Nested Loops - when you know what you’re joining
PropertiesSequential I/O on the unindexed sideQuasi-random on the indexed-sideParallelizable over the values on the unindexed side
EffortΘ(|unindexed | × | log indexed |)O(|left| × |right|)Note: I/O behaviour on the indexed side is not-quite random
DisadvantagesWe are basically cheating: we pushed part of the computation intoanother componentMaintaining sortedness is expensive
53 / 60
![Page 57: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/57.jpg)
Recall
54 / 60
![Page 58: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/58.jpg)
Ensuring sortedness
In C++class Table {
vector<Tuple> storage;bool isSorted = true;bool isDense = true;
public:void insert(Tuple t) {
storage.push_back(t);sort(storage.begin(), storage.end(), [](auto l, auto r) { return l.id <
r.id; });↪→isDense = true;for(size_t i = 1; i < storage.size(); i++)
isDense &= storage[i].id == storage[i - 1].id + 1;};vector<Tuple> findByKey(int id);
};
55 / 60
![Page 59: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/59.jpg)
Conclusion
56 / 60
![Page 60: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/60.jpg)
So, which join algorithm should I use?
My DBMS Professor:The answer is always sorting or hashing
The truth is more complicatedSort-Merge join if
I Relations are sortedI Relations are unsorted but have similar sizeI If you are evaluating an inequality-join
Indexed nested loop join if one relation has an index (e.g., is sorted)Hashjoin if one relation is much smaller than the other (less than 10%)Nested loop join if one relation is tiny (less than 20 values)Blocked nested loops join for theta-joins
57 / 60
![Page 61: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/61.jpg)
Joins vs. Cross Products
Function calls
Cross Product Hash Join0
5.0×1010
1.0×1011
1.5×1011
58 / 60
![Page 62: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/62.jpg)
Joins vs. Cross Products
Function calls
Cross Product Hash Join0
5.0×1010
1.0×1011
1.5×1011
58 / 60
![Page 63: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/63.jpg)
Joins vs. Cross Products
What factor is Hashjoin bounding performance?
CPU I/O0
5.0×106
1.0×107
1.5×107
59 / 60
![Page 64: Joinspjm/CO572/lectures/Joins.pdf · Revision Youshouldknowaboutjoins Joinsarebasicallycrossproductswithselectionsontop I select R.r, S.s from R,S where R.id = S.id Quickquiz: joinorcrossproduct](https://reader033.vdocuments.us/reader033/viewer/2022050508/5f997298db7ea208ef67c6ef/html5/thumbnails/64.jpg)
The end
60 / 60