high performance computing for science and …
TRANSCRIPT
Computational Science and Engineering Lab, ETH Zürich
HIGH PERFORMANCE COMPUTING FOR SCIENCE AND ENGINEERING I 2020
Exercise 01 Amdahl’s law - Cache
Derivation
$T$: Total execution time of a program
$p$: Fraction of the program that can be parallelized
$n$: Number of processors that run the parallel part of the program

$T = (1 - p)T + pT$
$T(n) = (1 - p)T + \frac{p}{n}T$
$\text{Speedup} = \frac{T}{T(n)} = \frac{(1 - p)T + pT}{(1 - p)T + \frac{p}{n}T} = \frac{1}{1 - p + \frac{p}{n}}$
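The speedup formula derived above can be evaluated numerically to see how quickly it saturates. The following is a minimal sketch; the function names `amdahl_speedup` and `max_speedup` are illustrative and not part of the exercise material. For n going to infinity the term p/n vanishes, so the speedup is bounded by 1 / (1 - p).

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Speedup 1 / ((1 - p) + p / n) for parallel fraction p on n processors."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a fraction between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


def max_speedup(p: float) -> float:
    """Upper bound of the speedup for n -> infinity: 1 / (1 - p)."""
    return float("inf") if p == 1.0 else 1.0 / (1.0 - p)


# Tabulate the speedup of a program that is 90% parallelizable.
for n in (1, 2, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.9, n), 2))
print("upper bound:", round(max_speedup(0.9), 2))
```

Even with 1024 processors, a program with p = 0.9 stays below a speedup of 10, because the serial part (1 - p)T is not reduced.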
Cache Design
• Cache = Temporary local memory
• Closer to CPU = Faster, but smaller
• Example: Intel Skylake*

         Size      Latency [cycles]   Associativity
L1       32 kB     4                  8
L2       256 kB    14                 8
L3       3-24 MB   34-85              4-16
DRAM     GB-TB     100+               N/A

[Figure: CPU with four cores, each with its own L1D, L1I and L2 cache, and one L3 cache]
* https://www.agner.org/optimize/microarchitecture.pdf
Cache Line
• Data stored in chunks of 64 bytes
• If a single double (8 bytes) is accessed, 56 more bytes will be read
• 7 more doubles for free!
[Figure: L1D (32 kB) divided into cache lines of typically 64 bytes each: bytes 0-63, 64-127, 128-191, …, 32704-32767; example line: bytes 4160-4223]
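The cache line arithmetic on this slide can be checked with a few integer operations. This is a small sketch under the assumption that the data starts at byte address 0 and is aligned to a cache line; the helper names are made up for illustration.

```python
CACHE_LINE_BYTES = 64      # cache line size from the slide
L1D_BYTES = 32 * 1024      # 32 kB L1 data cache from the slide
DOUBLE_BYTES = 8           # size of one double


def cache_line_index(byte_address: int) -> int:
    """Index of the 64-byte line that contains the given byte address."""
    return byte_address // CACHE_LINE_BYTES


def line_byte_range(byte_address: int) -> tuple:
    """First and last byte address of the line containing byte_address."""
    start = cache_line_index(byte_address) * CACHE_LINE_BYTES
    return start, start + CACHE_LINE_BYTES - 1


def lines_touched(num_doubles: int, stride: int = 1) -> int:
    """Distinct cache lines read when accessing num_doubles doubles,
    starting at byte 0, with the given stride (in elements)."""
    return len({cache_line_index(i * stride * DOUBLE_BYTES)
                for i in range(num_doubles)})


print("lines in L1D:", L1D_BYTES // CACHE_LINE_BYTES)
print("doubles per line:", CACHE_LINE_BYTES // DOUBLE_BYTES)
print("line of byte 4200:", line_byte_range(4200))
print("64 doubles, stride 1:", lines_touched(64, 1), "lines")
print("64 doubles, stride 8:", lines_touched(64, 8), "lines")
```

Reading 64 contiguous doubles touches 8 lines, while reading 64 doubles with a stride of 8 elements touches 64 lines, so the "7 more doubles for free" only help when neighbouring elements are actually used.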
Cache Usage: Matrix Matrix Multiplication
4 Cache Lines of 2 Elements Each
A and B are stored row-major and distinguished by color. Let's ignore C accesses for simplicity.
[Figure: System Memory holding A x B = C, next to Cache Lines 1 to 4]
Compulsory Misses: 6 Capacity Misses: 0
Compulsory Misses: 6 Capacity Misses: 4
10 Misses / 2 Elements ~ 5 Misses / Element
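The miss counts above follow from which cache lines a single element of C needs. The sketch below lists those lines for the naive algorithm. It assumes 4 x 4 matrices (the size is not stated in the transcript, but it is consistent with 6 compulsory misses) and assumes B is placed directly after A in memory; both are illustrative choices, as are the function names.

```python
N = 4             # assumed matrix size, not stated in the transcript
LINE_ELEMS = 2    # elements per cache line, as on the slide


def a_offset(i: int, k: int) -> int:
    """Offset of A[i][k] in row-major storage."""
    return i * N + k


def b_offset(k: int, j: int) -> int:
    """Offset of B[k][j]; B is assumed to follow A directly in memory."""
    return N * N + k * N + j


def lines_for_element(i: int, j: int):
    """Cache lines of A and B read to compute C[i][j] = sum_k A[i][k] * B[k][j]."""
    a_lines = {a_offset(i, k) // LINE_ELEMS for k in range(N)}
    b_lines = {b_offset(k, j) // LINE_ELEMS for k in range(N)}
    return a_lines, b_lines


a00, b00 = lines_for_element(0, 0)
a01, b01 = lines_for_element(0, 1)
print("C[0][0] needs", len(a00), "lines of A and", len(b00), "lines of B")
print("C[0][1] needs the same lines:", (a00, b00) == (a01, b01))
```

Under these assumptions one element needs 2 lines of A (a contiguous row) and 4 lines of B (a column with stride N), that is 6 lines for a cache that holds only 4. The second element reuses exactly the same 6 lines, so it causes no new compulsory misses, but lines that were evicted in the meantime have to be fetched again as capacity misses. How many depends on the replacement policy, which the transcript does not specify.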
Cache Optimization: Blocked Multiplication
Idea: Let's solve the multiplication in blocks, storing partial solutions to $C_{i,j}$
[Figure: System Memory holding A x B = C, next to Cache Lines 1 to 4]
Compulsory Misses: 4 Capacity Misses: 0
Compulsory Misses: 6 Capacity Misses: 0
Compulsory Misses: 10 Capacity Misses: 0
Compulsory Misses: 12 Capacity Misses: 0
12 Misses / 4 Elements (in 16 quarters)
~ 3 Misses / Element
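The blocking idea can be written as three extra loops over blocks around the usual triple loop. The following is a sketch, not the course's reference solution: the function names and the block size parameter `bs` are illustrative, and matrices are stored as flat row-major lists. Python will not show the cache benefit itself because of interpreter overhead; the code only illustrates the loop structure and checks that both versions compute the same result.

```python
def matmul_naive(A, B, n):
    """C = A * B for n x n matrices stored as flat row-major lists."""
    C = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i * n + k] * B[k * n + j]
            C[i * n + j] = s
    return C


def matmul_blocked(A, B, n, bs):
    """Same product, computed block by block with block size bs.

    Each (ii, jj, kk) step adds the partial product of one block of A
    and one block of B to the corresponding block of C."""
    C = [0.0] * (n * n)
    for ii in range(0, n, bs):
        for jj in range(0, n, bs):
            for kk in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i * n + k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i * n + j] += a * B[k * n + j]
    return C


n = 4
A = [float(v) for v in range(n * n)]
B = [float(v % 5) for v in range(n * n)]
print(matmul_blocked(A, B, n, 2) == matmul_naive(A, B, n))
```

The block size would be chosen so that one block each of A, B and C fits into the cache at the same time; the `min(...)` bounds handle matrix sizes that are not a multiple of `bs`.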
Memory access speed vs. array size?
Setup:
  $k$:   0 1 2 3 4 5 … N-1
  $a_k$: 2 7 5 1 15 8 … 42
Repeat M times: $k \leftarrow a_k$
Question: $\eta = \frac{M}{\text{total time}} = ?$
Expectation: Larger N -> array doesn't fit into (small) caches -> slower than smaller N
Variant 1: $a_k$ = a random one-cycle permutation (Sattolo’s algorithm, see skeleton code)
Variant 2: $a_k = (k + 1) \bmod N$
Variant 3: $a_k = \left(k + \frac{\text{cache line size}}{\text{sizeof(int)}}\right) \bmod N$
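The three ways of filling the array can be sketched as follows. This is not the skeleton code referred to on the slide; the function names are illustrative, and the defaults of 64 bytes per cache line and 4 bytes per int are assumptions. Sattolo's algorithm differs from an ordinary shuffle only in that the swap partner `j` is drawn strictly below `i`, which guarantees a single cycle through all N entries.

```python
import random


def sattolo_cycle(n, rng=random):
    """Random permutation of 0..n-1 that forms one single cycle (Sattolo's algorithm)."""
    a = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)          # 0 <= j < i, so j is never equal to i
        a[i], a[j] = a[j], a[i]
    return a


def unit_stride(n):
    """a_k = (k + 1) % N: visit the array element by element."""
    return [(k + 1) % n for k in range(n)]


def line_stride(n, cache_line_bytes=64, int_bytes=4):
    """a_k = (k + cache line size / sizeof(int)) % N: jump one cache line per step.
    The sizes are assumed defaults, not values taken from the slides."""
    step = cache_line_bytes // int_bytes
    return [(k + step) % n for k in range(n)]


def cycle_length(a, start=0):
    """Number of steps k <- a[k] until the walk returns to start."""
    k, steps = a[start], 1
    while k != start:
        k = a[k]
        steps += 1
    return steps


a = sattolo_cycle(1000, random.Random(0))
print("random cycle length:", cycle_length(a))
print("unit stride cycle length:", cycle_length(unit_stride(1000)))
print("line stride cycle length:", cycle_length(line_stride(1024)))
```

With the assumed sizes the line-stride walk advances 16 entries per step, so for N = 1024 it returns to its start after 64 steps and touches one entry per cache line.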
Question 1: Task
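The measurement itself is a pointer chase: repeat k <- a_k M times and divide M by the elapsed time. The sketch below shows only the structure of such a measurement; the names `chase` and `measure_eta` are hypothetical. In Python the interpreter overhead dominates, so the cache effects expected for large N are better observed with compiled code such as the exercise's skeleton code.

```python
import time


def chase(a, m, k=0):
    """Repeat k <- a[k] m times and return the final index."""
    for _ in range(m):
        k = a[k]
    return k


def measure_eta(a, m):
    """Return (eta, final index), with eta = m / total time in steps per second."""
    start = time.perf_counter()
    k = chase(a, m)
    elapsed = time.perf_counter() - start
    return m / elapsed, k


# Example with the sequential pattern a_k = (k + 1) % N for two array sizes.
for n in (1 << 10, 1 << 16):
    a = [(k + 1) % n for k in range(n)]
    eta, _ = measure_eta(a, 200_000)
    print(n, f"{eta:.3e} steps/s")
```

Returning the final index and using it keeps the loop from being optimized away in compiled languages; each step depends on the previous one, so the time per step reflects the memory access latency.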