high performance computing for science and …

Computational Science and Engineering Lab, ETH Zürich

HIGH PERFORMANCE COMPUTING FOR SCIENCE AND ENGINEERING I, 2020

Exercise 01: Amdahl’s law - Cache

Q1: Amdahl’s law

Q2: Linear Algebra and Cache

Q3: Cache Size and Cache Speed

Q1: Amdahl’s law

Derivation

$T$: Total execution time of a program

$p$: Fraction of the program that can be parallelized

$n$: Number of processors that run the parallel part of the program

Split the total time into a serial part and a parallelizable part:

$$T = (1 - p)\,T + p\,T$$

With $n$ processors, only the parallelizable part shrinks:

$$T(n) = (1 - p)\,T + \frac{p}{n}\,T$$

$$\text{Speedup} = \frac{T}{T(n)} = \frac{(1 - p)\,T + p\,T}{(1 - p)\,T + \frac{p}{n}\,T} = \frac{1}{1 - p + \frac{p}{n}}$$
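The speedup formula can be checked numerically. The following is a minimal sketch; the function name `amdahl_speedup` and the sample values of p and n are illustrative assumptions, not part of the exercise.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Speedup T / T(n) = 1 / (1 - p + p / n) for parallel fraction p on n processors."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a fraction between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


if __name__ == "__main__":
    # Sample values chosen for illustration only.
    for n in (1, 2, 4, 16, 1024):
        print(f"p = 0.9, n = {n:5d}: speedup = {amdahl_speedup(0.9, n):.2f}")
```

As n grows, the speedup approaches 1 / (1 - p), so for p = 0.9 it stays below 10 no matter how many processors are used.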

Q2: Linear Algebra and Cache

Cache Design

• Cache = Temporary local memory

• Closer to CPU = Faster, but smaller

[Figure: CPU with two cores, each with its own L1D and L1I cache]

• Example: Intel Skylake*

        Size      Latency [cycles]   Associativity
L1      32 kB     4                  8
L2      256 kB    14                 8
L3      3-24 MB   34-85              4-16
DRAM    GB-TB     100+               N/A

* https://www.agner.org/optimize/microarchitecture.pdf
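To get a feeling for these sizes, one can convert the L1 and L2 capacities into a number of doubles. This is a rough sketch that assumes 1 kB means 1024 bytes and 8-byte doubles; the helper names are hypothetical, and it ignores that other data shares the cache.

```python
import math

# Sizes copied from the Skylake table; 1 kB is taken to mean 1024 bytes (an assumption).
CACHE_BYTES = {"L1": 32 * 1024, "L2": 256 * 1024}
DOUBLE_BYTES = 8  # assumed size of one double


def doubles_that_fit(level: str) -> int:
    """Number of doubles that fit into the given cache level."""
    return CACHE_BYTES[level] // DOUBLE_BYTES


def largest_square_matrix(level: str) -> int:
    """Largest n such that one n x n matrix of doubles fits into the level."""
    return math.isqrt(doubles_that_fit(level))


if __name__ == "__main__":
    for level in CACHE_BYTES:
        print(level, doubles_that_fit(level), largest_square_matrix(level))
```

Under these assumptions a single 64 x 64 matrix of doubles already fills L1 completely, which is why matrix algorithms quickly depend on the slower levels.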

Cache Line

• Data stored in chunks of 64 bytes

• If a single double (8 bytes) is accessed, 56 more bytes will be read

• 7 more doubles for free!

[Figure: L1D drawn as a column of cache lines (typically 64 bytes each), labelled by byte range: 0-63, 64-127, 128-191, …, 4160-4223, …, 32704-32767]
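The byte ranges of a cache line follow from integer division by the line size. The sketch below assumes 64-byte lines and an array of 8-byte doubles that starts on a line boundary; the function names are illustrative.

```python
CACHE_LINE_BYTES = 64  # line size stated on the slide
DOUBLE_BYTES = 8       # size of one double


def cache_line_range(byte_address: int) -> tuple[int, int]:
    """Return the first and last byte of the cache line holding byte_address."""
    first = (byte_address // CACHE_LINE_BYTES) * CACHE_LINE_BYTES
    return first, first + CACHE_LINE_BYTES - 1


def same_line(i: int, j: int) -> bool:
    """True if doubles i and j of a line-aligned array share a cache line."""
    line_i = (i * DOUBLE_BYTES) // CACHE_LINE_BYTES
    line_j = (j * DOUBLE_BYTES) // CACHE_LINE_BYTES
    return line_i == line_j


if __name__ == "__main__":
    print(cache_line_range(4200))          # line that contains byte 4200
    print(same_line(0, 7), same_line(7, 8))
```

Doubles 0 to 7 share one line, so reading the first one brings in the other seven; double 8 starts the next line.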

Cache Usage: Matrix Matrix Multiplication

4 Cache Lines of 2 Elements Each

A and B are stored row-major and distinguished by color. Let's ignore C accesses for simplicity.

[Animation: System Memory and Cache Lines 1 to 4]


Compulsory Misses: 6, Capacity Misses: 0


Compulsory Misses: 6, Capacity Misses: 4

10 Misses / 2 Elements ~ 5 Misses / Element
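The access pattern behind these counts can be sketched by listing which cache lines one element of C needs. The matrix dimension is not stated on the slides; the sketch assumes 4 x 4 matrices, 2 elements per line, and matrices that start on a line boundary. The names `line_of` and `lines_for_c` are hypothetical.

```python
N = 4              # assumed matrix dimension (not stated on the slides)
LINE_ELEMENTS = 2  # elements per cache line, as in the example


def line_of(matrix: str, row: int, col: int) -> tuple[str, int]:
    """Cache line holding element (row, col) of a row-major N x N matrix."""
    return matrix, (row * N + col) // LINE_ELEMENTS


def lines_for_c(i: int, j: int) -> set[tuple[str, int]]:
    """Distinct lines of A and B read by the inner product for C[i][j]."""
    touched = set()
    for k in range(N):
        touched.add(line_of("A", i, k))  # row i of A: contiguous in memory
        touched.add(line_of("B", k, j))  # column j of B: stride of N elements
    return touched


if __name__ == "__main__":
    print(sorted(lines_for_c(0, 0)))
    print(lines_for_c(0, 0) == lines_for_c(0, 1))
```

Under these assumptions one element of C touches 6 distinct lines (2 from A, 4 from B), more than the 4 lines the cache holds. The neighbouring element needs the same 6 lines again, so some of them have to be reloaded, which is where capacity misses come from.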

Cache Optimization: Blocked Multiplication

Idea: Let's solve the multiplication in blocks, storing partial solutions to $C_{i,j}$

[Animation: System Memory and Cache Lines 1 to 4]

Compulsory Misses: 12, Capacity Misses: 0

12 Misses / 4 Elements (in 16 quarters)

~ 3 Misses / Element
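The blocking idea can be written as three extra loops over blocks, with each block product added as a partial sum to C[i][j]. This is a minimal sketch of the loop structure only: the function names and the default block size are assumptions, and Python lists of lists are not contiguous in memory, so it does not reproduce the miss counts above.

```python
def matmul_naive(A, B):
    """Reference triple loop: each C[i][j] is finished before moving on."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_blocked(A, B, block=2):
    """Multiply block by block, accumulating partial sums into C[i][j]."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for jj in range(0, n, block):
            for kk in range(0, n, block):
                # Work only on one block of A and one block of B at a time.
                for i in range(ii, min(ii + block, n)):
                    for j in range(jj, min(jj + block, n)):
                        partial = 0.0
                        for k in range(kk, min(kk + block, n)):
                            partial += A[i][k] * B[k][j]
                        C[i][j] += partial
    return C


if __name__ == "__main__":
    A = [[float(4 * i + j) for j in range(4)] for i in range(4)]
    B = [[float(i - j) for j in range(4)] for i in range(4)]
    print(matmul_blocked(A, B) == matmul_naive(A, B))
```

In a compiled language with contiguous row-major arrays, the block size would be chosen so that the blocks of A and B being combined fit into the cache together.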

Q3: Cache Size and Cache Speed

Memory access speed vs. array size?

Setup: an array $a$ holding $N$ indices

$k$:    0 1 2 3 4 5 … N-1
$a_k$:  2 7 5 1158 … 42

Repeat M times: $k \to a_k$ (jump to the index stored at position $k$)

Question:

$$\eta = \frac{M}{\text{total time}} = \,?$$
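The measurement can be sketched as a pointer-chasing loop timed with a wall clock. This assumes that each of the M repetitions replaces the current index k by a[k]; the names `chase` and `measure_eta` are hypothetical and not taken from the skeleton code. In Python the interpreter overhead dominates, so the sketch shows the logic rather than real cache effects.

```python
import time


def chase(a: list[int], steps: int, start: int = 0) -> int:
    """Follow k = a[k] for the given number of steps and return the final k."""
    k = start
    for _ in range(steps):
        k = a[k]
    return k


def measure_eta(a: list[int], steps: int) -> float:
    """Return steps / elapsed seconds for one pointer-chasing run."""
    t0 = time.perf_counter()
    chase(a, steps)
    elapsed = time.perf_counter() - t0
    return steps / max(elapsed, 1e-12)  # guard against a zero timer difference


if __name__ == "__main__":
    n = 1 << 16
    a = [(k + 1) % n for k in range(n)]  # simple sequential chain for the demo
    print(f"eta = {measure_eta(a, 1_000_000):.3e} steps per second")
```

Because every step depends on the value loaded in the previous one, the loads cannot overlap, which is what makes this loop a probe of memory latency.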

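Expectation: Larger N → the array doesn't fit into the (small) caches → slower than smaller N

Variant 1: a random one-cycle permutation (Sattolo’s algorithm, see skeleton code)

Variant 2: $a_k = (k + 1) \,\%\, N$

Variant 3: $a_k = \left(k + \frac{\text{cache line size}}{\text{sizeof(int)}}\right) \,\%\, N$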
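The three index arrays can be built as follows. This is a sketch, not the skeleton code: the function names are hypothetical, and a 64-byte cache line and a 4-byte int are assumed for Variant 3.

```python
import random

CACHE_LINE_BYTES = 64  # assumed cache line size
INT_BYTES = 4          # assumed sizeof(int)


def variant1(n: int, rng: random.Random) -> list[int]:
    """Random one-cycle permutation via Sattolo's algorithm."""
    a = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        a[i], a[j] = a[j], a[i]
    return a


def variant2(n: int) -> list[int]:
    """Sequential access: a_k = (k + 1) % N."""
    return [(k + 1) % n for k in range(n)]


def variant3(n: int) -> list[int]:
    """Jump by one cache line: a_k = (k + line size / sizeof(int)) % N."""
    stride = CACHE_LINE_BYTES // INT_BYTES
    return [(k + stride) % n for k in range(n)]


def cycle_length(a: list[int], start: int = 0) -> int:
    """Number of jumps k = a[k] needed to return to start."""
    k, steps = a[start], 1
    while k != start:
        k = a[k]
        steps += 1
    return steps


if __name__ == "__main__":
    a = variant1(1000, random.Random(0))
    print(cycle_length(a), cycle_length(variant2(1000)), cycle_length(variant3(1024)))
```

Variants 1 and 2 visit every element before returning to the start. With the assumed stride of 16 ints, Variant 3 touches one int per cache line and, when 16 divides N, visits only N / 16 elements.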


Question 1: Task

Variant 1: a random one-cycle permutation

Variant 2: $a_k = (k + 1) \,\%\, N$

Variant 3: $a_k = \left(k + \frac{\text{cache line size}}{\text{sizeof(int)}}\right) \,\%\, N$

[Figure: rough sketch of $\log \eta$]
