Computational Science and Engineering Lab, ETH Zürich
HIGH PERFORMANCE COMPUTING FOR SCIENCE AND ENGINEERING I 2020
Exercise 01: Amdahl's law - Cache


Overview:
Q1: Amdahl's law
Q2: Linear Algebra and Cache
Q3: Cache Size and Cache Speed

Q1: Amdahl’s law

Derivation

T : Total execution time of the program (on a single processor)
p : Fraction (percentage) of the program that can be parallelized
n : Number of processors that run the parallel part of the program

Split the serial time into a serial and a parallelizable part:
T = (1 − p)T + pT

With n processors, only the parallel part is sped up:
T(n) = (1 − p)T + (p/n)T

Speedup = T / T(n) = [(1 − p)T + pT] / [(1 − p)T + (p/n)T] = 1 / (1 − p + p/n)
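To make the formula concrete, here is a minimal C++ sketch (not part of the exercise material; the parallel fraction p = 0.9 is an assumed value) that evaluates the speedup for a few processor counts:

```cpp
// Minimal illustration of Amdahl's law: S(n) = 1 / ((1 - p) + p / n).
#include <cstdio>

double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main() {
    const double p = 0.9;  // assumed: 90% of the program can be parallelized
    for (int n : {1, 2, 4, 8, 16, 1 << 20}) {
        std::printf("n = %7d  S(n) = %.3f\n", n, amdahl_speedup(p, n));
    }
    // As n grows, S(n) approaches 1 / (1 - p) = 10: the serial fraction
    // bounds the achievable speedup no matter how many processors are used.
    return 0;
}
```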

Q2: Linear Algebra and Cache

Cache Design

[Figure: cache hierarchy of a multi-core CPU. Each Core has private L1D, L1I, and L2 caches; a single L3 cache is shared by all cores.]

• Cache = Temporary local memory
• Closer to CPU = Faster, but smaller
• Example: Intel Skylake*

        Size      Latency [cycles]   Associativity
L1      32 kB     4                  8
L2      256 kB    14                 8
L3      3-24 MB   34-85              4-16
DRAM    GB-TB     100+               N/A

* https://www.agner.org/optimize/microarchitecture.pdf
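As a practical aside (not part of the exercise), these numbers can be checked on the machine at hand. On Linux with glibc, for example, sysconf exposes the cache geometry; the _SC_LEVEL* constants used below are glibc extensions and may report 0 where the information is unavailable:

```cpp
// Query cache sizes on Linux/glibc (sketch; _SC_LEVEL* are glibc extensions).
#include <cstdio>
#include <unistd.h>

int main() {
    std::printf("L1d size:      %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    std::printf("L1d line size: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    std::printf("L2 size:       %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    std::printf("L3 size:       %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}
```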

Cache Line

• Data is stored in chunks of 64 bytes (one cache line)
• If a single double (8 bytes) is accessed, 56 more bytes will be read
• 7 more doubles for free!

[Figure: a 32 kB L1D cache viewed as cache lines (typically 64 bytes), covering byte ranges 0-63, 64-127, 128-191, ..., 4160-4223, ..., 32704-32767.]
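A minimal sketch of this effect (assuming 64-byte lines and 8-byte doubles; not part of the exercise material): summing every 8th element of a large array performs 8x fewer additions than summing all of them, but touches exactly the same cache lines, so it is far from 8x faster.

```cpp
// Sequential vs. strided sum over an array much larger than the caches.
// With a stride of 8 doubles (= one assumed 64-byte cache line), every
// cache line is still loaded, so memory traffic, not arithmetic, dominates.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t N = 1u << 26;              // 64M doubles ~ 512 MB
    std::vector<double> a(N, 1.0);

    auto time_sum = [&](std::size_t stride) {
        auto t0 = std::chrono::steady_clock::now();
        double s = 0.0;
        for (std::size_t i = 0; i < N; i += stride) s += a[i];
        auto t1 = std::chrono::steady_clock::now();
        std::printf("stride %zu: sum = %g, time = %.1f ms\n", stride, s,
                    std::chrono::duration<double, std::milli>(t1 - t0).count());
    };

    time_sum(1);   // uses all 8 doubles of every cache line
    time_sum(8);   // uses 1 double per cache line, but still touches every line
    return 0;
}
```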

Cache Usage: Matrix Matrix Multiplication

[Figure: A x B = C computed element by element, with the current accesses to A and B highlighted. A and B are stored row-major, differentiated by color; the cache holds 4 cache lines of 2 elements each. C accesses are ignored for simplicity.]

Stepping through the naive multiplication:
• At first only new data is touched: Compulsory Misses: 6, Capacity Misses: 0
• The data needed for one output element does not all fit into the 4 available cache lines, so lines are evicted and must be reloaded for the next element: Compulsory Misses: 6, Capacity Misses: 4

10 Misses / 2 Elements ~ 5 Misses / Element
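For reference, a minimal sketch of the naive triple-loop multiplication discussed above (row-major storage, square N x N matrices; the names are illustrative and this is not the exercise skeleton):

```cpp
// Naive matrix-matrix multiplication C = A * B, row-major storage.
// The innermost loop walks B column-wise with stride N, so consecutive
// iterations touch different cache lines: the source of the capacity misses.
#include <vector>

void matmul_naive(const std::vector<double>& A,
                  const std::vector<double>& B,
                  std::vector<double>& C, int N) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            double sum = 0.0;
            for (int k = 0; k < N; ++k)
                sum += A[i * N + k] * B[k * N + j];  // B accessed with stride N
            C[i * N + j] = sum;
        }
}
```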

Cache Optimization: Blocked Multiplication

Idea: Let's solve the multiplication in blocks, storing partial solutions to C_ij.

[Figure: the same A x B = C multiplication carried out block by block; the cache again holds 4 cache lines of 2 elements each.]

Stepping through the blocked multiplication, the miss count grows from 4 to 6 to 10 and finally to 12 compulsory misses, while the capacity misses stay at 0: each block fits entirely into the cache, so nothing has to be reloaded.

Compulsory Misses: 12, Capacity Misses: 0
12 Misses / 4 Elements (in 16 quarters) ~ 3 Misses / Element
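A minimal sketch of such a blocked (tiled) multiplication (row-major, square matrices, block size BS dividing N; illustrative only, not the exercise skeleton):

```cpp
// Blocked matrix-matrix multiplication: each (i,j) tile of C accumulates
// partial solutions while the matching tiles of A and B stay hot in cache.
// C must be zero-initialized by the caller.
#include <vector>

void matmul_blocked(const std::vector<double>& A,
                    const std::vector<double>& B,
                    std::vector<double>& C, int N, int BS) {
    for (int ib = 0; ib < N; ib += BS)
        for (int jb = 0; jb < N; jb += BS)
            for (int kb = 0; kb < N; kb += BS)
                // C[ib:ib+BS, jb:jb+BS] += A[ib:ib+BS, kb:kb+BS] * B[kb:kb+BS, jb:jb+BS]
                for (int i = ib; i < ib + BS; ++i)
                    for (int j = jb; j < jb + BS; ++j) {
                        double sum = C[i * N + j];   // partial solution to C_ij
                        for (int k = kb; k < kb + BS; ++k)
                            sum += A[i * N + k] * B[k * N + j];
                        C[i * N + j] = sum;
                    }
}
```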

Q3: Cache Size and Cache Speed

Memory access speed vs. array size?

Setup: an array a of N ints, where every entry a_k is itself an index into the array.

[Figure: indices 0, 1, 2, 3, 4, 5, ..., N-1 above the array contents, e.g. 2, 7, 5, ..., 42.]

Benchmark: repeat M times: k ← a_k (pointer chasing).

Question: η = M / total time = ?

The array can be filled in three ways:
Variant 1: a_k = a random one-cycle permutation (Sattolo's algorithm, see skeleton code)
Variant 2: a_k = (k + 1) % N
Variant 3: a_k = (k + cache line size / sizeof(int)) % N

Expectation: larger N -> the array no longer fits into the (small) caches -> slower than for smaller N

Question 1: Task
For each of the three variants, consider how η depends on N.

[Figure: rough sketch of log η vs. log N.]
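The skeleton code is not reproduced here, so the following is only a minimal sketch of how such a pointer-chasing benchmark could look (the array sizes, repetition count M, and variable names are illustrative choices):

```cpp
// Pointer-chasing sketch for Q3: the array a holds indices into itself, and
// following k <- a_k repeatedly exposes the memory latency for a working set
// of N ints. Variant 1 (random one-cycle permutation, Sattolo) is shown.
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const std::size_t M = 1u << 24;                        // number of chased links
    std::mt19937 gen(42);

    for (std::size_t N = 1u << 10; N <= (1u << 26); N <<= 2) {   // 4 kB .. 256 MB
        std::vector<int> a(N);

        // Variant 1: random one-cycle permutation via Sattolo's algorithm.
        for (std::size_t k = 0; k < N; ++k) a[k] = static_cast<int>(k);
        for (std::size_t i = N - 1; i > 0; --i) {
            std::uniform_int_distribution<std::size_t> dist(0, i - 1);
            std::swap(a[i], a[dist(gen)]);
        }
        // Variant 2 would instead use a[k] = (k + 1) % N, and Variant 3
        // a[k] = (k + cache_line_size / sizeof(int)) % N.

        auto t0 = std::chrono::steady_clock::now();
        int k = 0;
        for (std::size_t m = 0; m < M; ++m) k = a[k];      // the chase: k <- a_k
        auto t1 = std::chrono::steady_clock::now();

        double seconds = std::chrono::duration<double>(t1 - t0).count();
        std::printf("N = %9zu  eta = %.3e accesses/s  (k = %d)\n",
                    N, M / seconds, k);   // print k so the loop is not optimized away
    }
    return 0;
}
```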