Cache Conscious Algorithms for Relational Query Processing

by Ambuj Shatdal, Chander Kant, Jeffrey F. Naughton

Presentation of Group 9: Wong Suet-Fai Newman, Chau Man-Hau Dee


Page 1:

Cache Conscious Algorithms for Relational Query Processing

by Ambuj Shatdal, Chander Kant, Jeffrey F. Naughton

Presentation of Group 9

Wong Suet-Fai Newman

Chau Man-Hau Dee

Page 2:

Introduction

Why cache performance is so important: the performance gap between the processor and memory.

– Memory access speeds: annual improvement of only 25%
– Processor clock speeds: increase by about 60% every year

Page 3:

Introduction (Cont.)

Wrong perception: once data is in memory, it is accessed as fast as it could be.

Reality of access times:
– In cache: 2-4 processor cycles
– In main memory: 15-25 cycles

If we can keep data in cache, the result: 8%-200% faster!

Page 4:

Introduction (Cont.)

Ways to improve cache performance:
– Larger cache
– Better algorithms

Goal: to show the benefits of redesigning traditional query processing algorithms so that they make better use of the cache.

This paper focuses on join and aggregation algorithms.

Page 5:

Major Parameters of Cache

1. Capacity(C): how big it is

2. Block Size(B): how many bytes to fetch each time

3. Associativity (A): the number of unique places in the cache a particular block may reside in.

i. Fully-associative: A = C/B, i.e. data from any address can be stored in any cache location.

ii. Direct-mapped: A = 1, i.e. each block can reside in exactly one location.

iii. A-way set-associative: 1 < A < C/B (a compromise between fully-associative and direct-mapped)

Most caches are direct-mapped or have a very small set-associativity, and use an LRU replacement policy.
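These parameters can be made concrete with a small sketch (hypothetical cache sizes; `cache_set` is an illustrative helper, not from the paper):

```python
def cache_set(addr: int, C: int, B: int, A: int) -> int:
    """Return the set index a byte address maps to.
    C = capacity in bytes, B = block size in bytes, A = associativity.
    A direct-mapped cache has A == 1; a fully-associative one has
    A == C // B (a single set), so every address maps to set 0."""
    num_sets = C // (B * A)
    block_number = addr // B
    return block_number % num_sets

C, B = 8192, 32  # assumed 8 KB cache with 32-byte blocks
# Two addresses exactly C bytes apart collide in a direct-mapped cache:
assert cache_set(0x0000, C, B, A=1) == cache_set(0x2000, C, B, A=1)
# In a fully-associative cache there is only one set, so no such conflicts:
assert cache_set(0x0000, C, B, A=C // B) == 0
```

The first assertion is exactly the conflict-miss scenario described on the next slide: distinct memory locations mapped to the same cache index.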

Page 6:

Three types of cache miss:

1. Compulsory misses – the very first reference to a cache block, i.e. the cache line was not accessed before.

2. Capacity misses – the cache cannot contain all the blocks needed during execution of a program.

3. Conflict misses – also called collision or interference misses. A reference that hits in a fully-associative cache but misses in an A-way set-associative cache, i.e. placement restrictions (not fully-associative) cause useful blocks to be displaced, e.g. different memory locations that are mapped to the same cache index.

Page 7:

Optimization Techniques

Background:
– Algorithm optimization: ensure as few cache misses as possible, without much CPU overhead
– Not concerned with the exact cache configuration: block size and associativity
– Use of a cache profiler (cprof) to localize the optimization space

Page 8:

Technique 1. Blocking

Restructures the algorithm to reuse chunks of data that can fit into the cache, e.g. with cache size BKSZ:

Original:
  for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
      process(a[i], b[j])

Blocked:
  for (bkNo = 0; bkNo < N / BKSZ; bkNo++)
    for (i = 0; i < M; i++)
      for (j = bkNo*BKSZ; j < (bkNo+1)*BKSZ; j++)
        process(a[i], b[j])
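A minimal runnable sketch of the same restructuring (toy `process` that just sums products; `bksz` stands in for the number of b-elements whose working set fits in cache):

```python
def unblocked(a, b):
    total = 0
    for i in range(len(a)):
        for j in range(len(b)):          # streams all of b for every a[i]
            total += a[i] * b[j]
    return total

def blocked(a, b, bksz):
    # Process b in cache-sized chunks: each chunk of b is reused across
    # every a[i] before moving on, so it stays cache-resident.
    total = 0
    for start in range(0, len(b), bksz):
        chunk = b[start:start + bksz]
        for i in range(len(a)):
            for bj in chunk:
                total += a[i] * bj
    return total

a, b = list(range(100)), list(range(200))
assert unblocked(a, b) == blocked(a, b, bksz=32)
```

Both versions compute the same result; only the order of accesses to b changes.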

Page 9:

Technique 2. Partitioning

To distribute data into partitions, e.g. in sorting:

  quicksort(relation[n]):
    partition relation into blocks (size < BKSZ)
    for each partition r
      quicksort(r)
    merge partitions

Trade-off: the overhead of partition creation, but usually the benefit gained outweighs it.
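A runnable sketch of the idea (range-partitioning on numeric keys is assumed here as the partitioning step; because bucket i holds only keys smaller than those in bucket i+1, the "merge" is a simple concatenation):

```python
import random

def partitioned_sort(keys, bksz):
    # Range-partition into roughly bksz-sized buckets, sort each bucket
    # (its working set fits in cache), then concatenate in bucket order.
    lo, hi = min(keys), max(keys)
    nparts = max(1, len(keys) // bksz)
    width = (hi - lo) / nparts + 1e-9   # avoid division by zero
    buckets = [[] for _ in range(nparts)]
    for k in keys:
        idx = min(int((k - lo) / width), nparts - 1)
        buckets[idx].append(k)
    out = []
    for b in buckets:
        b.sort()               # each sort runs over a cache-sized partition
        out.extend(b)
    return out

data = [random.randrange(10**6) for _ in range(10_000)]
assert partitioned_sort(data, bksz=512) == sorted(data)
```

The result is identical to a direct sort; the gain is that each `b.sort()` call touches a data set small enough to stay cached.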

Page 10:

Difference between blocking and partitioning:
– Blocking: restructure the algorithm; no change in the layout of data
– Partitioning: the layout of data is reorganized to maximize use of the cache

Page 11:

Technique 3. Extracting Relevant Data

Reducing the data required, e.g. in sorting: instead of sorting whole records, we extract only the sort key and a pointer to the record, so more relevant data can fit into the cache.

Technique 4. Loop Fusion

Unfused (two passes over the array):

  for (i = 0; i < N; i++) {
    extractKey(a[i]);
    extractPointer(a[i]);
  }
  for (i = 0; i < N; i++)
    buildHashTable(a[i]);

Fused (one pass):

  for (i = 0; i < N; i++) {
    extractKey(a[i]);
    extractPointer(a[i]);
    buildHashTable(a[i]);
  }
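The same transformation as a runnable sketch (hypothetical `Record` type; the fused loop touches each record once while it is still cache-resident instead of streaming the array twice):

```python
from collections import namedtuple

Record = namedtuple("Record", ["key", "payload"])

def build_unfused(records):
    # Pass 1: extract (key, pointer) pairs; pass 2: build the hash table.
    pairs = [(r.key, i) for i, r in enumerate(records)]
    table = {}
    for key, ptr in pairs:
        table.setdefault(key, []).append(ptr)
    return table

def build_fused(records):
    # One pass: extract and insert while the record is still in cache.
    table = {}
    for i, r in enumerate(records):
        table.setdefault(r.key, []).append(i)
    return table

rs = [Record(k % 5, str(k)) for k in range(20)]
assert build_unfused(rs) == build_fused(rs)
```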

Page 12:

Technique 5. Data Clustering

Group related attributes together, e.g. at the physical database design level, fields that are accessed contemporaneously are stored together.

This paper concentrates on reducing capacity misses. It focuses on improving temporal and spatial locality of the memory accesses rather than optimal memory layout of relations.

Page 13:

Performance Evaluation

Experiment 1. Hash Joins: two sets of tuples, R and S; build a hash table on R, then probe it with S.

1. Basic Hash Join:

  BuildHashTable(H[R]);
  for each s in S
    Probe(s, H[R]);

2. With Key Extraction:

  for each r in R {
    ExtractKeyPointers(r);
    BuildHashTable(H[R]);
  }
  for each s in S
    Probe(s, H[R]);
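A runnable sketch of the key-extraction variant (toy tuples; `ExtractKeyPointers` is modeled by pulling out just the join key and the tuple's index before building the table):

```python
def hash_join_with_extraction(R, S, key=lambda t: t[0]):
    # Build phase: store only (key -> list of R-indices), not whole tuples,
    # so the hash table's working set is small enough to stay in cache.
    table = {}
    for i, r in enumerate(R):
        table.setdefault(key(r), []).append(i)
    # Probe phase: look up each S tuple and emit matching index pairs.
    result = []
    for j, s in enumerate(S):
        for i in table.get(key(s), []):
            result.append((i, j))
    return result

R = [(1, "a"), (2, "b"), (2, "c")]
S = [(2, "x"), (3, "y"), (1, "z")]
assert hash_join_with_extraction(R, S) == [(1, 0), (2, 0), (0, 2)]
```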

Page 14:

3. Partitioned:

  ExtractKeyPointers_And_Partition(R)
  ExtractKeyPointers_And_Partition(S)
  for each partition i
    BuildHashTable(H[R[i]])
    for each s in S[i]
      Probe(s, H[R[i]])
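A runnable sketch of the partitioned variant (hash-partitioning both inputs so each per-partition hash table fits in cache; `nparts` is a free parameter, not a value from the paper):

```python
def partitioned_hash_join(R, S, nparts=4, key=lambda t: t[0]):
    # Partition phase: route each tuple (as a key/index pair) to a bucket
    # by hashing its join key, so matching tuples land in the same bucket.
    Rp = [[] for _ in range(nparts)]
    Sp = [[] for _ in range(nparts)]
    for i, r in enumerate(R):
        Rp[hash(key(r)) % nparts].append((key(r), i))
    for j, s in enumerate(S):
        Sp[hash(key(s)) % nparts].append((key(s), j))
    # Join phase: one small, cache-resident hash table per partition.
    result = []
    for p in range(nparts):
        table = {}
        for k, i in Rp[p]:
            table.setdefault(k, []).append(i)
        for k, j in Sp[p]:
            for i in table.get(k, []):
                result.append((i, j))
    return sorted(result)

R = [(1, "a"), (2, "b"), (2, "c")]
S = [(2, "x"), (3, "y"), (1, "z")]
assert partitioned_hash_join(R, S) == [(0, 2), (1, 0), (2, 0)]
```

It produces the same matches as the unpartitioned join; only the access pattern differs.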

Page 15:

Types of Cache Misses – BaseHash(R,S):

Step     Compulsory  Capacity  Conflict  Total
Build    37500       118731    2181      158412
Probe    25159       193137    2352      220648
Overall  62659       311868    4533      379060

Types of Cache Misses – Extraction(R,S):

Step        Compulsory  Capacity  Conflict  Total
Extract(R)  25000       50000     5         75005
Build       25000       42692     1791      69483
Probe       25159       165491    6055      196705
Overall     75159       258183    7851      341193

Overhead of attribute and pointer extraction, but a reduction in cache misses in the build and probe phases. Overall performance: 7.2% faster than BaseHash, with about 10% fewer cache misses.

Types of Cache Misses – Partitioned(R,S):

Step          Compulsory  Capacity  Conflict  Total
Partition(R)  37514       51250     174       88938
Partition(S)  25045       51212     244       76501
Build         2192        40445     9837      52474
Probe         25118       13860     27146     66124
Overall       89869       156767    37401     284037

Improvement: dividing the relations into partitions ensures the hash table fits in cache during the build and probe phases, so there are fewer cache misses. Overall performance: 6.6% faster than BaseHash, with about 25% fewer cache misses.

Algorithm    Cache Misses  Time   Speedup
Base         379060        0.699  --
Extraction   341193        0.652  7.20%
Partitioned  284037        0.656  6.60%

Page 16:

Findings in Hash Joins experiment:
– BaseHash has the fewest compulsory misses
– The results hold in general, not just on specific machines
– The compiler plays a strong role: it affects the efficiency greatly

Algorithm    DECst'n 5k/125   DEC 3k/300       HP 9k/720        SUN 10/51
             Time   Speedup   Time   Speedup   Time   Speedup   Time   Speedup
Base         0.699  ----      0.203  ----      0.472  ----      0.349  ----
Extraction   0.652  7.20%     0.198  2.50%     0.434  8.70%     0.291  19.90%
Partitioned  0.656  6.60%     0.186  9.10%     0.432  9.30%     0.324  7.70%

Page 17:

Experiment 2. The Sort Merge Join

BaseSort(R,S):
  ExtractKeyPointers(R)
  ExtractKeyPointers(S)
  Sort(R)
  Sort(S)
  Merge(R,S)

ImmediateSort(R,S):
  ExtractKeyPointers(R)
  Sort(R)
  ExtractKeyPointers(S)
  Sort(S)
  Merge(R,S)

Page 18:

PartitionedSort(R,S):
  ExtractKeyPointer_and_Partitioned(R)
  for each partition i
    Sort(R[i])
  ExtractKeyPointer_and_Partitioned(S)
  for each partition i
    Sort(S[i])
  for each partition i
    Merge(R[i],S[i])

ImprovedSort(R,S):
  ExtractKeyPointer_and_Partitioned(R)
  ExtractKeyPointer_and_Partitioned(S)
  for each partition i
    Sort(R[i])
    Sort(S[i])
    Merge(R[i],S[i])
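A runnable sketch of the partitioned sort-merge idea (range-partitioning both inputs on a numeric join key is assumed, so partition i of R can only join with partition i of S, and each per-partition sort and merge runs over a cache-sized chunk):

```python
def partitioned_sort_merge_join(R, S, nparts=4, key=lambda t: t[0]):
    # Range-partition both inputs on the join key with the same split
    # points, so matching tuples always land in the same partition.
    all_keys = [key(t) for t in R] + [key(t) for t in S]
    lo, hi = min(all_keys), max(all_keys)
    width = (hi - lo) / nparts + 1e-9   # avoid division by zero

    def part(t):
        return min(int((key(t) - lo) / width), nparts - 1)

    Rp = [[] for _ in range(nparts)]
    Sp = [[] for _ in range(nparts)]
    for t in R:
        Rp[part(t)].append(t)
    for t in S:
        Sp[part(t)].append(t)

    result = []
    for p in range(nparts):
        rs = sorted(Rp[p], key=key)
        ss = sorted(Sp[p], key=key)
        # Merge-join of two sorted runs: advance j past smaller S keys,
        # then scan the group of equal keys for each r.
        j = 0
        for r in rs:
            while j < len(ss) and key(ss[j]) < key(r):
                j += 1
            k = j
            while k < len(ss) and key(ss[k]) == key(r):
                result.append((r, ss[k]))
                k += 1
    return result

R = [(2, "b"), (1, "a"), (2, "c")]
S = [(2, "x"), (1, "z"), (3, "y")]
out = partitioned_sort_merge_join(R, S)
assert sorted(out) == [((1, "a"), (1, "z")),
                       ((2, "b"), (2, "x")),
                       ((2, "c"), (2, "x"))]
```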

Page 19:

Algorithm    DECst'n 5k/125   DEC 3k/300       HP 9k/720        SUN 10/51
             Time   Speedup   Time   Speedup   Time   Speedup   Time   Speedup
Base         1.789  ----      0.504  ----      0.794  ----      0.672  ----
Immediate    1.769  1.10%     0.495  1.80%     0.793  1.30%     0.667  0.70%
Partitioned  1.344  33.10%    0.336  50.00%    0.648  22.50%    0.532  26.30%
Improved     1.301  37.50%    0.327  54.10%    0.640  24.10%    0.523  28.50%

Finding in Sort Merge experiment:

Much better improvement in Sort Merge than in Hash Join, because the sort operation is much more memory-intensive and computationally expensive.

Page 20:

Experiment 3. Nested Loops

Traditionally, people think that nothing can be done to improve a nested-loops join whose data is already in memory.

BaseNestedLoop(R,S):
  ExtractKeyPointers(R)
  ExtractKeyPointers(S)
  for each tuple r in R
    for each tuple s in S
      if join(r,s) then produce result

BlockedNestedLoop(R,S):
  ExtractKeyPointers(R)
  ExtractKeyPointers(S)
  for each block b of S
    for each tuple r in R
      for each tuple s in b
        if join(r,s) then produce result
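The blocked variant as a runnable sketch (toy join predicate; each cache-sized block of S is joined against all of R while it is still resident, instead of streaming the whole of S once per R tuple):

```python
def blocked_nested_loop_join(R, S, bksz, pred):
    # Outer loop walks S in blocks of bksz tuples; the inner two loops
    # join the current (cache-resident) block against every R tuple.
    result = []
    for start in range(0, len(S), bksz):
        block = S[start:start + bksz]
        for r in R:
            for s in block:
                if pred(r, s):
                    result.append((r, s))
    return result

R = list(range(5))
S = list(range(10))
out = blocked_nested_loop_join(R, S, bksz=4, pred=lambda r, s: r == s)
assert sorted(out) == [(i, i) for i in range(5)]
```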

Page 21:

Algorithm   DECst'n 5k/125     DEC 3k/300        HP 9k/720         SUN 10/51
            Time     Speedup   Time    Speedup   Time    Speedup   Time    Speedup
NestedLoop  2244.05  ----      490.11  ----      569.10  ----      413.51  ----
Blocked     741.54   202.60%   205.11  138.90%   305.71  86.20%    348.16  18.80%

The SUN 10/51 performance improvement is not significant because it has a 1 MB secondary cache, which helps a lot even in the BaseNestedLoop case.

Page 22:

Experiment 4. Aggregation

Hash Based Aggregation

1. BaseHash(R):
  for each tuple t in R
    Hash(t)
    Insert/update the hash table entry for the group

2. Extraction(R):
  for each tuple t in R
    ExtractKeyPointer(t)
    Hash(t)
    Insert/update the hash table entry for the group
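A minimal runnable sketch of the base hash aggregation (a SUM aggregate over toy tuples is assumed; per-group state lives in a hash table keyed by the grouping attribute):

```python
def hash_aggregate(R, group_key, value):
    # One pass: hash each tuple's group key and update that group's
    # running aggregate (a SUM here) in the hash table.
    table = {}
    for t in R:
        k = group_key(t)
        table[k] = table.get(k, 0) + value(t)
    return table

R = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
assert hash_aggregate(R, lambda t: t[0], lambda t: t[1]) == {"a": 4, "b": 6}
```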

Page 23:

Algorithm   DECst'n 5k/125   DEC 3k/300       HP 9k/720        SUN 10/51
            Time   Speedup   Time   Speedup   Time    Speedup   Time   Speedup
BaseHash    0.465  ----      0.096  ----      0.277   ----      0.171  ----
Extraction  0.465  0.00%     0.097  -1.00%    0.282   -1.80%    0.170  0.60%

No improvement!

Reason: the hash table is accessed only once, so all the misses are compulsory and key/pointer extraction doesn't help.

Lesson: cache optimizations can be subtle and specific to a particular algorithm.

Page 24:

Parametric Studies

Page 25:
Page 26:

Choices of Result Generation in join algorithms

1. On the Fly: the result tuple is produced as soon as a match is found in the join.

2. Lazy: when a match is found, two pointers to the corresponding tuples are stored, generating an in-memory join index. The result is generated later, depending upon need.

Algorithm          On the Fly  Lazy
Extraction         1.492       1.527
PartitionedHash    1.602       1.527
BaseSortMerge      2.586       2.590
ImprovedSortMerge  2.156       2.132

Why “Lazy” algorithm is not much slower than “On the Fly” ?
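The lazy strategy can be sketched as follows (matches are recorded as an in-memory join index of (R-index, S-index) pairs; wide result tuples are materialized only on demand):

```python
def lazy_hash_join(R, S, key=lambda t: t[0]):
    # Build and probe as usual, but store only index pairs (a join index)
    # instead of materializing result tuples during the probe.
    table = {}
    for i, r in enumerate(R):
        table.setdefault(key(r), []).append(i)
    join_index = [(i, j) for j, s in enumerate(S)
                  for i in table.get(key(s), [])]

    # Materialization happens later, only for the pairs actually needed.
    def materialize(pairs):
        return [R[i] + S[j] for i, j in pairs]

    return join_index, materialize

R = [(1, "a"), (2, "b")]
S = [(2, "x"), (1, "y")]
idx, mat = lazy_hash_join(R, S)
assert idx == [(1, 0), (0, 1)]
assert mat(idx[:1]) == [(2, "b", 2, "x")]
```

Because the probe loop only appends two small indices per match, its cache footprint is close to the on-the-fly version, which is consistent with the similar timings in the table above.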

Page 27:

Conclusions

Main memory should not be the end of optimization for database algorithms.

Designing algorithms with the cache in mind can significantly improve their performance.

Most of the time we have to use a cache profiler to find the poorly performing parts of the code.

Page 28:


Page 29:

~ The End ~

Q & A