Towards a Theory of Cache-Efficient Algorithms
Summary for the seminar:
Analysis of algorithms in hierarchical memory – Spring 2004
by Gala Golan
The RAM Model
In the previous lecture we discussed caching in an operating system.
We saw a lower bound on sorting:

$$\Omega\!\left(\frac{N}{B}\log_{M/B}\frac{N}{B}\right)$$

where N = the number of sorted elements, B = the number of elements in each block, and M = the memory size.
The I/O Model
1. A datum can be accessed only from fast memory.
2. B elements are brought to memory in each access.
3. Computation cost << I/O cost.
4. A block of data can be placed anywhere in fast memory.
5. I/O operations are explicit.
The Cache Model
1. A datum can be accessed only from fast memory. √
2. B elements are brought to memory in each access. √
3. Computation cost << I/O cost. Modified: L denotes the normalized cache latency; accessing a block from the cache costs 1, from main memory costs L.
4. A block of data can be placed anywhere in fast memory. Modified: a fixed mapping distributes main memory blocks among the cache frames.
5. I/O operations are explicit. Modified: the cache is not visible to the programmer.
Notation
- I(M,B) – the I/O model.
- C(M,B,L) – the cache model.
- n = N/B, m = M/B – the size of the data and of the memory in blocks (instead of elements).

The goal of algorithm design is to minimize running time = (number of cache accesses) + L × (number of memory accesses).
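This cost model is easy to encode. A minimal sketch of the objective above; the traffic counts and the value of L are made-up examples, only the formula comes from the slide:

```python
def running_time(cache_accesses: int, memory_accesses: int, L: int) -> int:
    """Running time in C(M,B,L): each cache access costs 1, each memory access costs L."""
    return cache_accesses + L * memory_accesses

# Hypothetical workload: 10^6 cache hits, 10^4 memory accesses, latency L = 100.
print(running_time(10**6, 10**4, L=100))  # 2000000: here the two terms are balanced
```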
Reminder – Cache Associativity
Associativity specifies the number of different frames in which a memory block can reside:
- Fully associative
- Direct mapped
- 2-way set associative
[Diagram: the three mapping schemes.]
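To make the mapping concrete, here is a small sketch of how a block address selects its candidate frames under each scheme; the modulo indexing is the textbook convention, not something stated on the slide:

```python
def candidate_frames(block_addr: int, num_frames: int, ways: int) -> list[int]:
    """Frames in which a block may reside in a cache of num_frames frames.

    ways = 1          -> direct mapped (exactly one frame)
    ways = num_frames -> fully associative (any frame)
    otherwise         -> k-way set associative (one set of `ways` frames)
    """
    num_sets = num_frames // ways
    s = block_addr % num_sets              # set index of this block
    return list(range(s * ways, (s + 1) * ways))

print(candidate_frames(13, num_frames=8, ways=1))  # direct mapped: [5]
print(candidate_frames(13, num_frames=8, ways=2))  # 2-way: [2, 3]
print(candidate_frames(13, num_frames=8, ways=8))  # fully associative: [0..7]
```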
Emulation Theorem
An algorithm A in I(M,B) using T block transfers and I processing time can be converted to an equivalent algorithm Ac in C(M,B,L) that runs in O(I + (L+B)T) steps.
The additional memory requirement is m blocks.
In other words: an algorithm that is efficient in main memory can be made efficient in cache.
Proof (1)–(6)
[Six diagrams: the cache C[] (m frames), main memory Mem[] (n blocks), and a buffer Buf[] of m blocks kept in main memory. Individual blocks a, b, q are copied between Mem[] and Buf[], mirroring the block transfers of the I/O algorithm.]
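The idea behind the diagrams can be sketched in code. This is a hedged toy version under one reading of the figures: Buf[] stands in for the I/O model's fast memory, every block transfer of A becomes a block copy between Mem[] and Buf[], and all computation is redirected to Buf[]. The names and cost accounting are illustrative; one emulated transfer costs O(L + B), which is where the O(I + (L+B)T) total comes from.

```python
# Toy emulation of an I(M,B) algorithm in the cache model C(M,B,L).
B, m, n, L = 4, 8, 64, 100          # made-up parameters
Mem = [[0] * B for _ in range(n)]   # main memory: n blocks of B elements
Buf = [[0] * B for _ in range(m)]   # mirrors the I/O model's fast memory
cost = 0                            # running time in cache-model steps

def read_block(b: int, q: int) -> None:
    """Emulate the I/O operation 'bring Mem block b into fast-memory frame q'."""
    global cost
    cost += L + B        # one miss to fetch the block, plus B element copies
    Buf[q] = Mem[b][:]

def write_block(q: int, b: int) -> None:
    """Emulate the I/O operation 'write fast-memory frame q back to Mem block b'."""
    global cost
    cost += L + B
    Mem[b] = Buf[q][:]

read_block(3, 1)     # A reads a block...
Buf[1][0] += 1       # ...computes on it in fast memory (counted in I)...
write_block(1, 3)    # ...and writes it back: 2*(L+B) steps plus the computation.
```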
Block efficient algorithms
For a block efficient algorithm, computation is done on at least a constant fraction of the elements in the blocks transferred.
In such a case B·T = O(I), so an algorithm for I(M,B) can be emulated in C(M,B,L) in O(I + L·T) steps.
The algorithms for sorting, FFT, and matrix transposition are block efficient.
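Why the B·T term disappears is a one-line calculation: block efficiency gives $B\cdot T = O(I)$, hence

$$O\bigl(I + (L+B)\,T\bigr) = O(I + B\,T + L\,T) = O(I + L\,T).$$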
Extension to set-associative cache
In a k-way set-associative cache, when all k frames of a set are occupied, the hardware uses LRU to choose which block of the set to replace for the referenced block.
In the emulation technique described before we do not have explicit control over replacement.
Instead, a property of LRU will be used, and the cache will be used only partially.
Optimal Replacement Algorithm for Cache
OPT or MIN – a hypothetical algorithm that minimizes cache misses for a given (finite) access trace.
- Offline – it knows in advance which blocks will be accessed next.
- Evicts the block whose next access lies furthest in the future.
- Proven optimal – better than any online algorithm.
- Proposed by Belady in 1966.
- Used to theoretically test the efficiency of online algorithms.
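Belady's rule is short enough to state as code. A minimal sketch; the linear scan of the remaining trace on every miss is for clarity, not efficiency:

```python
def belady_misses(trace: list[int], cache_size: int) -> int:
    """Count misses when evicting the block reused furthest in the future (OPT/MIN)."""
    cache, misses = set(), 0
    for t, block in enumerate(trace):
        if block in cache:
            continue                     # hit: nothing to do
        misses += 1
        if len(cache) == cache_size:     # cache full: apply Belady's rule
            def next_use(b: int) -> float:
                future = trace[t + 1:]
                return future.index(b) if b in future else float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses

print(belady_misses([1, 2, 3, 1, 2, 4, 1, 2], cache_size=3))  # -> 4
```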
LRU vs. OPT
For any constant factor c > 1, LRU with fast memory size m makes at most c times as many misses as OPT with fast memory size (1-1/c)m.
For example, LRU with cache size m makes at most 3 times as many misses as OPT with memory of size (2/3)m.
[Diagram: with m = 9, OPT runs on 6 = (1 − 1/3)·9 blocks and incurs X misses, while LRU on all 9 blocks incurs at most 3X misses.]
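The bound can be probed experimentally. A hedged sketch with a random trace; lru_misses is a standard OrderedDict implementation, and belady_misses repeats the previous sketch in compressed form:

```python
import random
from collections import OrderedDict

def lru_misses(trace: list[int], cache_size: int) -> int:
    """Count misses under LRU replacement."""
    cache, misses = OrderedDict(), 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            if len(cache) == cache_size:
                cache.popitem(last=False)  # evict the least recently used
            cache[block] = None
    return misses

def belady_misses(trace: list[int], cache_size: int) -> int:
    """OPT/MIN, as in the previous sketch (compressed)."""
    cache, misses = set(), 0
    for t, b in enumerate(trace):
        if b in cache:
            continue
        misses += 1
        if len(cache) == cache_size:
            nxt = lambda x: trace.index(x, t + 1) if x in trace[t + 1:] else len(trace)
            cache.remove(max(cache, key=nxt))
        cache.add(b)
    return misses

# c = 3: LRU with m = 9 frames vs OPT with (1 - 1/3)*9 = 6 frames.
# The theorem guarantees the first count is at most 3 times the second.
trace = [random.randrange(30) for _ in range(10_000)]
print(lru_misses(trace, 9), "<= 3 *", belady_misses(trace, 6))
```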
Extension to set-associative cache – Cont.
Similarly, LRU with cache size m makes at most twice as many misses as OPT with memory of size m/2.
We emulate the I/O algorithm using only half the size of Buf[]: instead of k cache lines for every set, there are now k/2.
These k/2 blocks are managed optimally, by the optimality of the I/O algorithm.
In the real cache, the k lines will be managed by LRU and will incur at most twice the misses.
Extension to set-associative cache – Cont.
[Diagram: as before, the cache C[] (m frames) and main memory Mem[] (n blocks), but only half of Buf[] is used.]
Generalized Emulation Theorem
An algorithm A in I(M/2,B) using T block transfers and I processing time can be converted to an equivalent algorithm Ac in the k-way set-associative cache model C(M,B,L) that runs in O(I + (L+B)T) steps.
The additional memory requirement is m/2 blocks.
The cache complexity of sorting
The lower bound for sorting in I(M,B) is

$$\Omega\!\left(\frac{N}{B}\log_{M/B}\frac{N}{B}\right)$$

The lower bound for sorting in C(M,B,L) is

$$\Omega\!\left(N\log N + L\cdot\frac{N}{B}\log_{M/B}\frac{N}{B}\right)$$

Here I = computations and T = I/O operations: the cache-model bound is Ω(I + L·T), with I = Ω(N log N) for comparison sorting and T bounded as in the I/O model.
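For intuition, the two bounds can be evaluated for concrete parameters; the numbers below are invented for illustration:

```python
from math import log2

# Hypothetical parameters: N elements, block size B, memory size M, latency L.
N, B, M, L = 2**20, 2**3, 2**15, 100

io_bound = (N / B) * log2(N / B) / log2(M / B)    # (N/B) * log_{M/B}(N/B) block transfers
cache_bound = N * log2(N) + L * io_bound          # N log N + L * (N/B) log_{M/B}(N/B) steps
print(f"I/O bound:   {io_bound:,.0f} block transfers")
print(f"cache bound: {cache_bound:,.0f} cache-model steps")
```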
Cache Miss Classes
Compulsory Miss – a block is being referenced for the first time
Capacity Miss – a block was evicted from the cache because the cache is too small to hold the algorithm's working set
Conflict Miss – a block was evicted from the cache because another block was mapped to the same set.
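The three classes can be separated experimentally with a standard trick: run a fully associative LRU cache of the same size alongside the direct-mapped one; a miss that the fully associative cache would also incur is compulsory or capacity, and the rest are conflicts. A sketch, with a made-up trace:

```python
from collections import OrderedDict

def classify_misses(trace: list[int], num_frames: int) -> dict[str, int]:
    """Classify the misses of a direct-mapped cache with num_frames frames."""
    direct = {}             # frame index -> resident block
    shadow = OrderedDict()  # fully associative LRU cache of the same size
    seen = set()            # blocks referenced at least once
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}
    for block in trace:
        shadow_hit = block in shadow
        if shadow_hit:
            shadow.move_to_end(block)
        else:
            if len(shadow) == num_frames:
                shadow.popitem(last=False)
            shadow[block] = None
        frame = block % num_frames
        if direct.get(frame) == block:
            continue                       # hit in the direct-mapped cache
        if block not in seen:
            counts["compulsory"] += 1      # first-ever reference
        elif shadow_hit:
            counts["conflict"] += 1        # only the fixed mapping evicted it
        else:
            counts["capacity"] += 1        # a same-size LRU would miss too
        direct[frame] = block
        seen.add(block)
    return counts

print(classify_misses([0, 4, 0, 4, 1, 2, 3, 0], num_frames=4))
# -> {'compulsory': 5, 'capacity': 1, 'conflict': 2}
```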
Average case performance of merge-sort in the cache model
We want to estimate the number of cache misses incurred by the algorithm:
- Compulsory misses are unavoidable.
- Capacity misses are minimized by the I/O algorithm.
- We can quantify the expected number of conflict misses.
When does a conflict miss occur?
- s cache sets are available for k runs S_1…S_k.
- The expected number of elements in any run S_i is N/k.
- A leading block is a cache line containing a leading element of a run; b_i is the leading block of S_i.
- A conflict occurs when two leading blocks are mapped to the same cache set.
When does a conflict miss occur – Cont.
Formally: a conflict miss occurs for element S_{i,j+1} when there is at least one element x in a leading block b_k, k ≠ i, such that S_{i,j} < x < S_{i,j+1} and S(b_i) = S(b_k).
[Diagram: runs S_i and S_k; x lies in the leading block of S_k, between the consecutive elements j and j+1 of S_i.]
How many conflict misses to expect
P_i = the probability of a conflict for element i, 1 ≤ i ≤ N.
Assume a uniform distribution of:
- the leading blocks among the cache sets;
- the leading element within its leading block.
If k is Ω(s), then P_i is Ω(1), so each pass incurs Ω(N) conflict misses (see the sketch below).
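Under the uniformity assumption this is a birthday-style calculation: the probability that none of the other k − 1 leading blocks shares S_i's cache set is (1 − 1/s)^(k−1). A small sketch with hypothetical values of s:

```python
# P(conflict) ~ 1 - (1 - 1/s)**(k - 1) under the uniform-mapping assumption.
# For k = s this tends to 1 - 1/e ~ 0.63: a constant, so Omega(N) of the N
# merged elements are expected to suffer a conflict miss per pass.
s = 512                        # number of cache sets (hypothetical)
for k in (s // 8, s // 2, s):  # number of runs being merged
    p = 1 - (1 - 1 / s) ** (k - 1)
    print(f"k = {k:3d}: P(conflict) ~ {p:.2f}")
```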
How many conflict misses to expect – Cont.
The expected number of conflict misses throughout merge-sort is

$$\Omega\!\left(N\log_{M/B}\frac{N}{B}\right)$$

that is, Ω(N) misses in each of the log_{M/B}(N/B) passes. By choosing k << s we reduce the probability of conflict misses, but we incur more capacity misses.
Conclusions
There is a way to transform I/O-efficient algorithms into cache-efficient algorithms.
The transformation applies only to a blocking, direct-mapped cache that does not distinguish between reads and writes.
The constants hidden in these asymptotic bounds matter in practice.