TRANSCRIPT
Performance Optimizations for NUMA-Multicore Systems
Zoltán Majó
Department of Computer Science, ETH Zurich, Switzerland
2
About me
ETH Zurich: research assistant (research: performance optimizations; assistant: lectures)
TUCN: network engineer at the Student Communications Center; assistant at the Department of Computer Science
3
Computing
Unlimited need for performance
4
Performance optimizations
One goal: make programs run fast
Idea: pick a good algorithm
Reduce the number of operations executed
Example: sorting
5
Sorting
[Chart: number of operations vs. input size (n) for n^2 and n*log(n); x-axis: input size (n), y-axis: execution time [T]]
6
Sorting
[Chart: number of operations vs. input size (n); the n^2 curve corresponds to bubble sort]
7
Sorting
[Chart: number of operations vs. input size (n); n^2 = bubble sort, n*log(n) = quicksort]
8
Sorting
[Chart: number of operations vs. input size (n); at the largest input size shown, quicksort (n*log(n)) is 11X faster than bubble sort (n^2)]
9
Sorting
We picked a good algorithm, so the work is done
Are we really done?
Make sure our algorithm runs fast
Operations take time
We assumed 1 operation = 1 time unit T
10
Quicksort performance
[Chart: execution time [T] vs. input size (n) for quicksort with 1 op = 1 T]
11
Quicksort performance
[Chart: execution time [T] vs. input size (n) for quicksort with 1 op = 1 T and 1 op = 2 T]
12
Quicksort performance
[Chart: execution time [T] vs. input size (n) for quicksort with 1 op = 1 T, 2 T, and 4 T]
13
Quicksort performance
[Chart: execution time [T] vs. input size (n) for quicksort with 1 op = 1 T, 2 T, 4 T, and 8 T]
14
Quicksort performance
[Chart: execution time [T] vs. input size (n); bubble sort with 1 op = 1 T is 32% faster than quicksort with 1 op = 8 T]
15
Latency of operations
The best algorithm is not enough
Operations are executed on hardware
[Diagram: CPU pipeline. Stage 1: dispatch operation; Stage 2: execute operation; Stage 3: retire operation]
16
Latency of operations
The best algorithm is not enough
Operations are executed on hardware
Hardware must be used efficiently
[Diagram: CPU pipeline. Stage 1: dispatch operation; Stage 2: execute operation; Stage 3: retire operation]
17
Outline
Introduction: performance optimizations
Cache-aware programming
Scheduling on multicore processors
Using run-time feedback
Data locality optimizations on NUMA-multicores
Conclusions
ETH scholarship
18
Memory accesses
[Diagram: CPU connected to RAM; 230 cycles access latency]
19
Memory accesses
[Diagram: CPU connected to RAM; 230 cycles access latency]
Total access latency = ?
Total access latency = 16 x 230 cycles = 3680 cycles
20
Caching
[Diagram: CPU connected to RAM; 230 cycles access latency]
21
Caching
[Diagram: cache between CPU and RAM; cache access 30 cycles, RAM access 200 cycles; data moves between RAM and cache in blocks]
22
Caching
[Diagram: cache between CPU and RAM; cache access 30 cycles, RAM access 200 cycles; data moves between RAM and cache in blocks]
23
Caching
[Diagram: cache between CPU and RAM; cache access 30 cycles, RAM access 200 cycles; data moves between RAM and cache in blocks]
24
Caching
[Diagram: cache between CPU and RAM; cache access 30 cycles, RAM access 200 cycles; data moves between RAM and cache in blocks]
25
Caching
[Diagram: cache between CPU and RAM; cache access 30 cycles, RAM access 200 cycles; data moves between RAM and cache in blocks]
26
Hits and misses
[Diagram: cache between CPU and RAM; cache access 30 cycles, RAM access 200 cycles]
Cache miss: data not in cache = 230 cycles
Cache hit: data in cache = 30 cycles
27
Total access latency
[Diagram: cache between CPU and RAM; cache access 30 cycles, RAM access 200 cycles]
Total access latency = ?
Total access latency = 4 misses + 12 hits = 4 x 230 cycles + 12 x 30 cycles = 1280 cycles
28
Benefits of caching
Comparison:
Architecture w/o cache: T = 230 cycles
Architecture w/ cache: Tavg = 80 cycles → 2.9X improvement
Do caches always help?
Can you think of an access pattern with bad cache usage?
29
Caching
[Diagram: cache between CPU and RAM; cache access 30 cycles, RAM access 200 cycles; an access pattern that strides across blocks misses on every access]
30
Cache-aware programming
Today's example: matrix-matrix multiplication (MMM)
Number of operations: n^3
Compare naïve and optimized implementations with the same number of operations
31
MMM: naïve implementation

for (i=0; i<N; i++)
  for (j=0; j<N; j++) {
    sum = 0.0;
    for (k=0; k<N; k++)
      sum += A[i][k]*B[k][j];
    C[i][j] = sum;
  }

[Diagram: C = A x B; element C[i][j] combines row i of A with column j of B]
32
MMM
[Diagram: cache between CPU and RAM; cache access 30 cycles, RAM access 200 cycles; C, A, B in RAM]
[Table: cache hits out of 4 total accesses for A[][] and B[][], filled in during the animation]
33
MMM
[Diagram: a block of A[][] is loaded into the cache; subsequent accesses to it hit]
[Table: A[][]: 3 hits out of 4 accesses so far]
34
MMM
[Diagram: accesses to the cached block of A[][] hit]
[Table: A[][]: 3 hits out of 4 accesses so far]
35
MMM
[Diagram: accesses to the cached block of A[][] hit]
[Table: A[][]: 3 hits out of 4 accesses so far]
36
MMM
[Diagram: blocks of A[][] and B[][] loaded into the cache]
[Table: A[][]: 3/4 cache hits; B[][]: 0/4 cache hits]
37
MMM: cache performance
Hit rate:
Accesses to A[][]: 3/4 = 75%
Accesses to B[][]: 0/4 = 0%
All accesses: 38%
Can we do better?
38
Cache-friendly MMM

Cache-unfriendly MMM (ijk):

for (i=0; i<N; i++)
  for (j=0; j<N; j++) {
    sum = 0.0;
    for (k=0; k<N; k++)
      sum += A[i][k]*B[k][j];
    C[i][j] = sum;
  }

Cache-friendly MMM (ikj):

for (i=0; i<N; i++)
  for (k=0; k<N; k++) {
    r = A[i][k];
    for (j=0; j<N; j++)
      C[i][j] += r*B[k][j];
  }

[Diagram: C = A x B; in ikj, row i of C is updated using A[i][k] and row k of B]
39
MMM
[Diagram: cache between CPU and RAM; blocks of B[][] and C[][] loaded into the cache]
[Table: C[][]: 3/4 cache hits; B[][]: 3/4 cache hits]
40
Cache-friendly MMM
Cache-unfriendly MMM (ijk)
A[][]: 3/4 = 75% hit rate
B[][]: 0/4 = 0% hit rate
All accesses: 38% hit rate
Cache-friendly MMM (ikj)
C[][]: 3/4 = 75% hit rate
B[][]: 3/4 = 75% hit rate
All accesses: 75% hit rate
Better performance due to cache-friendliness?
41
Performance of MMM
[Chart: execution time [s] (log scale) vs. matrix size (512 to 8192) for ijk (cache-unfriendly) and ikj (cache-friendly)]
42
Performance of MMM
[Chart: execution time [s] (log scale) vs. matrix size; ikj is up to 20X faster than ijk]
43
Cache-aware programming
Two versions of MMM: ijk and ikj
Same number of operations (~n^3); ikj performs 20X better than ijk
Good performance depends on two aspects: a good algorithm and an implementation that takes hardware into account
Hardware offers many possibilities for inefficiencies; we consider only the memory system in this lecture
44
Outline
Introduction: performance optimizations
Cache-aware programming
Scheduling on multicore processors
Using run-time feedback
Data locality optimizations on NUMA-multicores
Conclusions
ETH scholarship
45
Cache-based architecture
[Diagram: CPU core with L1 cache (10 cycles access latency) and L2 cache (20 cycles access latency), connected through a bus controller and memory controller to RAM (200 cycles access latency)]
46
Multi-core multiprocessor
[Diagram: two processor packages, each with four cores; each core has a private L1 cache, pairs of cores share an L2 cache; each package has a bus controller and memory controller connected to RAM]
47
Experiment
Performance of a well-optimized program: soplex from SPEC CPU 2006
Multicore-multiprocessor systems are parallel: multiple programs run on the system simultaneously
Contender program: milc from SPEC CPU 2006
Examine 4 execution scenarios
48
Execution scenarios
[Diagram: two processors (Processor 0 and Processor 1), each with four cores, per-core L1 caches, shared L2 caches, bus and memory controllers, and RAM; soplex and milc are placed on cores]
49
Execution scenarios
[Diagram: the same two-processor system; soplex and milc placed on different core combinations]
50
Performance with sharing: soplex
[Chart: soplex execution time relative to solo execution (scale 0.0 to 2.0)]
51
Performance with sharing: soplex
[Chart: soplex execution time relative to solo execution (scale 0.0 to 2.0)]
52
Performance with sharing: soplex
[Chart: soplex execution time relative to solo execution in each of the 4 scenarios]
53
Resource sharing
Significant slowdowns due to resource sharing
Why is resource sharing so bad?
Example: cache sharing
54
Cache sharing
[Diagram: soplex and milc run on two cores that share a cache]
55
Cache sharing
[Diagram: soplex and milc compete for space in the shared cache]
56
Resource sharing
Does resource sharing affect all programs?
So far we considered the performance of soplex under contention
Let us consider a different program: namd
57
Performance with sharing
[Chart: execution time of soplex and namd relative to solo execution (scale 0.0 to 2.0)]
58
Performance with sharing
[Chart: execution time of soplex and namd relative to solo execution; namd slows down much less than soplex]
59
Resource sharing
Significant slowdown for some programs: soplex is affected significantly, namd is affected less
What do we do about it?
Scheduling can help
Example workload: 4 instances of soplex and 4 instances of namd
60
Execution scenarios
[Diagram: two-processor system; all four soplex instances on one processor and all four namd instances on the other]
61
Execution scenarios
[Diagram: two-processor system; soplex and namd instances mixed, two of each on every processor]
62
Challenges for a scheduler
Programs have different behaviors (soplex vs. namd)
Behavior not known ahead-of-time
Behavior changes over time
63
Single-phased program
64
Program with multiple phases
65
Outline
Introduction: performance optimizations
Cache-aware programming
Scheduling on multicore processors
Using run-time feedback
Data locality optimizations on NUMA-multicores
Conclusions
ETH scholarship
66
Hardware performance counters
Special registers:
Programmable to monitor a given hardware event (e.g., cache misses)
Low-level information about hardware-software interaction
Low overhead due to hardware implementation
In the past: undocumented feature
Since Intel Pentium: publicly available description
Debugging tools: Intel VTune, Intel PTU, AMD CodeAnalyst
67
Programming performance counters
Model-specific registers:
Access: RDMSR, WRMSR, and RDPMC instructions
Ring 0 instructions (available only in kernel mode)
perf_events interface:
Standard Linux interface since Linux 2.6.31
UNIX philosophy: performance counters are files
Simple API:
Set up counters: perf_event_open()
Read counters as files
68
Example: monitoring cache misses

int main() {
  int pid = fork();
  if (pid == 0) {
    exit(execl("./my_program", "./my_program", NULL));
  } else {
    int status; uint64_t value;
    int fd = perf_event_open(...);
    waitpid(pid, &status, 0);
    read(fd, &value, sizeof(uint64_t));
    printf("Cache misses: %" PRIu64 "\n", value);
  }
}
69
perf_event_open()
Looks simple:

int sys_perf_event_open(
  struct perf_event_attr *hw_event_uptr,
  pid_t pid,
  int cpu,
  int group_fd,
  unsigned long flags
);

struct perf_event_attr {
  __u32 type;
  __u32 size;
  __u64 config;
  union {
    __u64 sample_period;
    __u64 sample_freq;
  };
  __u64 sample_type;
  __u64 read_format;
  __u64 inherit;
  __u64 pinned;
  __u64 exclusive;
  __u64 exclude_user;
  __u64 exclude_kernel;
  __u64 exclude_hv;
  __u64 exclude_idle;
  __u64 mmap;
  ...
70
libpfm
Open-source helper library between the user program and perf_events:
(1) user program passes an event name to libpfm
(2) libpfm sets up perf_event_attr
(3) user program calls perf_event_open()
(4) user program reads the results
71
Example: measure cache misses for MMM
Determine microarchitecture
  Intel Xeon E5520: Nehalem microarchitecture
Look up event needed
  Source: Intel Architectures Software Developer's Manual
72
Software Developer’s Manual
73
Example: measure cache misses for MMM
Determine microarchitecture
  Intel Xeon E5520: Nehalem microarchitecture
Look up event needed
  Source: Intel Architectures Software Developer's Manual
  Event name: OFFCORE_RESPONSE_0:ANY_REQUEST:ANY_DRAM
74
MMM cache misses
[Chart: number of cache misses (log scale) vs. matrix size (512 to 8192) for ijk (cache-unfriendly) and ikj (cache-friendly); ijk incurs up to 30X more cache misses than ikj]
75
Single-phased program
Set up performance counters at the start; read performance counters at the end
76
Program with multiple phases
Set up performance counters, then repeatedly get samples as the program runs
77
Membus: multicore scheduler
1. Dynamically determine program behavior
   Measure the number of loads/stores that cause memory traffic
   Hardware performance counters in sampling mode
2. Determine optimal placement based on the measurements
78
Evaluation
Workload with 8 processes:
lbm, soplex, gromacs, hmmer from SPEC CPU 2006
Two instances of each program
Experimental results
79
Evaluation
[Chart: execution time relative to solo execution (0.0 to 3.0) for lbm, soplex, gromacs, hmmer, and the average, under default Linux and Membus]
80
Evaluation
[Chart: execution time relative to solo execution under default Linux and Membus]
![Page 81: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/81.jpg)
81
Evaluation
[Chart: execution time relative to solo execution (0.0–3.0) for lbm, soplex, gromacs, hmmer, and the average; Default Linux vs. Membus; annotated with improvements of 16% and 8%]
![Page 82: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/82.jpg)
82
Summary: multicore processors
Resource sharing critical for performance
Membus: a scheduler that reduces resource sharing
Question: why wasn’t Membus able to improve more?
![Page 83: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/83.jpg)
83
Memory controller sharing

[Diagram: two quad-core processors (Processor 0 and Processor 1); each core has a private L1 cache, pairs of cores share an L2 cache, and each processor connects through a bus controller and memory controller to RAM; two instances each of soplex and namd are spread across the cores]
![Page 84: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/84.jpg)
84
Non-uniform memory architecture

[Diagram: the same two quad-core processors, now with a separate RAM module attached to each]
![Page 85: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/85.jpg)
85
Non-uniform memory architecture

[Diagram: each processor now has an on-chip memory controller with its own local RAM; the two processors are connected by an interconnect]
![Page 86: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/86.jpg)
86
Outline
Introduction: performance optimizations
Cache-aware programming
Scheduling on multicore processors
Using run-time feedback
Data locality optimizations on NUMA-multicores
Conclusions
ETH scholarship
![Page 87: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/87.jpg)
87
Non-uniform memory architecture
[Diagram: Processor 0 (cores 0–3) and Processor 1 (cores 4–7), each with a memory controller (MC) and interconnect (IC) attached to local DRAM]
![Page 88: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/88.jpg)
88
Non-uniform memory architecture
Local memory accesses: bandwidth 10.1 GB/s, latency 190 cycles

[Diagram: thread T on Processor 0 accesses data held in its local DRAM]
All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO ’09], Molka [PACT ‘09])
![Page 89: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/89.jpg)
89
Non-uniform memory architecture
Local memory accesses: bandwidth 10.1 GB/s, latency 190 cycles

Remote memory accesses: bandwidth 6.3 GB/s, latency 310 cycles

[Diagram: thread T on Processor 0 accesses data held in Processor 1's DRAM via the interconnect]
Key to good performance: data locality
All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO ’09], Molka [PACT ‘09])
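A back-of-the-envelope model of why locality matters, using the latency numbers above (it ignores bandwidth saturation and overlap of outstanding misses):

```python
# Average memory-access latency as a function of the fraction of
# accesses that go to remote DRAM, using the Intel Xeon 5500
# numbers from the slide: 190 cycles local, 310 cycles remote.
LOCAL_LAT, REMOTE_LAT = 190, 310

def avg_latency(remote_fraction):
    """Weighted average of local and remote access latency."""
    return (1 - remote_fraction) * LOCAL_LAT + remote_fraction * REMOTE_LAT

for f in (0.0, 0.25, 0.5):
    print(f"{f:.0%} remote -> {avg_latency(f):.0f} cycles")
```

Even a 50% remote fraction already adds more than 30% to the average access latency, which matches the motivation for the data locality optimizations that follow.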
![Page 90: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/90.jpg)
90
Data locality in multithreaded programs
[Chart: remote memory references as a fraction of total memory references (0–60%) for the NAS Parallel Benchmarks cg.B, lu.C, ft.B, ep.C, bt.B, sp.B, is.B, and mg.C]
![Page 92: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/92.jpg)
92
First-touch page placement policy
[Diagram: threads T0 (Processor 0) and T1 (Processor 1); T0 touches page P0 first (R/W), so P0 is placed in Processor 0's DRAM]
![Page 93: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/93.jpg)
93
First-touch page placement policy
[Diagram: T1 touches page P1 first (R/W), so P1 is placed in Processor 1's DRAM; P0 remains in Processor 0's DRAM]
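The policy on this and the previous slide can be mimicked in a few lines (a sketch; `thread_to_node` is an assumed fixed thread-to-processor mapping):

```python
def first_touch(accesses, thread_to_node):
    """Place each page in the DRAM of the node whose thread touches
    it first. `accesses` is an ordered list of (thread, page) pairs;
    later touches do not move the page."""
    placement = {}
    for thread, page in accesses:
        placement.setdefault(page, thread_to_node[thread])
    return placement

# T0 runs on processor 0, T1 on processor 1.
# T0 touches P0 first and T1 touches P1 first; afterwards both
# threads access both pages.
order = [("T0", "P0"), ("T1", "P1"), ("T1", "P0"), ("T0", "P1")]
print(first_touch(order, {"T0": 0, "T1": 1}))
# {'P0': 0, 'P1': 1} -- P0 is remote for T1, P1 is remote for T0
```

The example shows the weakness discussed next: first touch fixes placement forever, even when most later accesses come from the other processor.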
![Page 94: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/94.jpg)
94
Automatic page placement
First-touch page placement: often a high number of remote accesses

Data address profiling: enables profile-based page placement, supported by hardware performance counters on many architectures
![Page 95: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/95.jpg)
95
Profile-based page placement (based on the work of Marathe et al. [JPDC 2010, PPoPP 2006])
[Diagram: the profile shows P0 accessed 1000 times by T0 and P1 accessed 3000 times by T1; each page is placed in the DRAM local to the thread that accesses it most]
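The placement decision can be sketched as follows, assuming the per-thread access counts from the profile above (a simplified model; the system of Marathe et al. works from hardware-counter samples):

```python
def profile_placement(profile, thread_to_node):
    """Place each page on the node that issued the most accesses to
    it, aggregating the per-thread counts of the profile per node."""
    placement = {}
    for page, counts in profile.items():
        per_node = {}
        for thread, n in counts.items():
            node = thread_to_node[thread]
            per_node[node] = per_node.get(node, 0) + n
        placement[page] = max(per_node, key=per_node.get)
    return placement

# Profile from the slide: P0 accessed 1000 times by T0,
# P1 accessed 3000 times by T1.
profile = {"P0": {"T0": 1000}, "P1": {"T1": 3000}}
print(profile_placement(profile, {"T0": 0, "T1": 1}))
# {'P0': 0, 'P1': 1} -- each page lands on its dominant accessor
```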
![Page 96: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/96.jpg)
96
Automatic page placement
Compare first-touch and profile-based page placement. Machine: 2-processor, 8-core Intel Xeon E5520. Subset of NAS PB: programs with a high fraction of remote accesses; 8 threads with fixed thread-to-core mapping
![Page 97: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/97.jpg)
97
Profile-based page placement
[Chart: performance improvement of profile-based placement over first-touch (0–25%) for cg.B, lu.C, bt.B, ft.B, sp.B]
![Page 99: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/99.jpg)
99
Profile-based page placement
Performance improvement over first-touch in some cases; no performance improvement in many cases
Why?
![Page 100: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/100.jpg)
100
Inter-processor data sharing
[Diagram: the profile shows P0 accessed 1000 times by T0, P1 accessed 3000 times by T1, and P2 accessed 4000 times by T0 and 5000 times by T1; P2 is inter-processor shared, so neither placement is local to all of its accesses]
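Classifying inter-processor shared pages from such a profile can be sketched as follows (the 25% minority threshold is an assumption for illustration, not a number from the talk):

```python
def classify_shared(profile, thread_to_node, threshold=0.25):
    """A page counts as inter-processor shared when more than one
    node accesses it and the minority node contributes more than
    `threshold` of the total accesses."""
    shared = set()
    for page, counts in profile.items():
        per_node = {}
        for thread, n in counts.items():
            node = thread_to_node[thread]
            per_node[node] = per_node.get(node, 0) + n
        total = sum(per_node.values())
        if len(per_node) > 1 and min(per_node.values()) / total > threshold:
            shared.add(page)
    return shared

# Profile from the slide: P2 is accessed heavily from both sides.
profile = {
    "P0": {"T0": 1000},
    "P1": {"T1": 3000},
    "P2": {"T0": 4000, "T1": 5000},
}
print(classify_shared(profile, {"T0": 0, "T1": 1}))  # {'P2'}
```

For a page like P2 no single placement helps, which is why profile-based placement alone stalls on programs with a large shared heap.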
![Page 102: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/102.jpg)
102
Inter-processor data sharing
[Chart: inter-processor shared heap relative to total heap (0–60%) for cg.B, lu.C, bt.B, ft.B, sp.B]
![Page 104: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/104.jpg)
104
Inter-processor data sharing
[Chart: inter-processor shared heap relative to total heap (0–60%, left axis) overlaid with performance improvement over first-touch (0–30%, right axis) for cg.B, lu.C, bt.B, ft.B, sp.B]
![Page 106: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/106.jpg)
106
Automatic page placement
Profile-based page placement is often ineffective; the reason is inter-processor data sharing

Inter-processor data sharing is a program property

We propose program transformations (no time for details now; see the results)
![Page 107: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/107.jpg)
107
Evaluation

[Chart: performance improvement over first-touch (0–25%) for cg.B, lu.C, bt.B, ft.B, sp.B; profile-based allocation vs. program transformations]
![Page 109: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/109.jpg)
109
Conclusions
Performance optimizations: good algorithm + hardware awareness; example: cache-aware matrix multiplication

Hardware awareness: resource sharing in multicore processors, data placement in non-uniform memory architectures
A lot remains to be done...
...and you can be part of it!
![Page 110: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/110.jpg)
110
ETH scholarship for master's students...

...to work on their master thesis in the Laboratory of Software Technology

Prof. Thomas R. Gross: PhD Stanford University, MIPS project, supervisor John L. Hennessy; Carnegie Mellon: Warp, iWarp, Fx projects

ETH offers you: a monthly scholarship of CHF 1500–1700 (EUR 1200–1400), assistance with finding housing, and a thesis topic
![Page 111: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/111.jpg)
111
Possible Topics
Michael Pradel: Automatic bug finding
Luca Della Toffola: Performance optimizations for Java
Me: Hardware-aware performance optimizations
![Page 112: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/112.jpg)
OO code positioning

[Diagram: call graph over methods A–E; in memory the methods are laid out in the order A B C D E; the cache holds A, B, C]
![Page 113: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/113.jpg)
Profiling the hot path

[Diagram: the same call graph over methods A–E, with the hot path identified by profiling]
![Page 114: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/114.jpg)
[Diagram: with memory layout A B C D E, the cache holds A, B, C; a call along the hot path misses in the cache]
![Page 115: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/115.jpg)
[Diagram: after reordering memory to A B E D C, the cache holds A, B, E and the hot-path call hits]

• JVM
• No profiling
• Constructors
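A minimal sketch of the positioning idea: lay the hot-path methods out contiguously so they share cache lines (method names from the slides; sizes and addresses abstracted away):

```python
def layout(methods, hot_path):
    """Emit a memory order that places the hot-path methods first
    and contiguously; the remaining (cold) methods follow in their
    original order."""
    hot = [m for m in hot_path if m in methods]
    cold = [m for m in methods if m not in hot_path]
    return hot + cold

# Methods A-E with profiled hot path A -> B -> E:
methods = ["A", "B", "C", "D", "E"]
print(layout(methods, hot_path=["A", "B", "E"]))
# ['A', 'B', 'E', 'C', 'D'] -- A, B, E are now adjacent in memory
```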
![Page 116: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/116.jpg)
• Linked list traversal
• Looking for the youngest/oldest person

[Diagram: a linked list of Person objects, each with fields next, name, surname, and age, terminated by null]
![Page 117: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/117.jpg)
Cache

[Diagram: cache lines filled with complete Person objects (next, name, surname, age)]
![Page 119: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/119.jpg)
Cache

[Diagram: cache lines now hold only the hot fields (next, age), so more list nodes fit in the cache]
![Page 120: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/120.jpg)
[Diagram: class A with fields a1–a5 and access counts a1: 10, a2: 100, a3: 1000, a4: 30, a5: 2000; profiling identifies a3 and a5 as hot fields; splitting yields class A (a3, a5) and class A$Cold (a1, a2, a4)]

Profiling → Splitting (based on # of field accesses)

• Jikes RVM
• Splitting strategies
• Garbage collection optimizations
• Allocation optimizations
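The payoff of such splitting can be estimated by counting objects per cache line (the 8-byte field sizes and 64-byte line are assumptions for illustration; the slides do not give them):

```python
LINE = 64  # bytes per cache line (typical on x86)

def objects_per_line(field_sizes):
    """How many objects of the given field layout fit in one line."""
    return LINE // sum(field_sizes)

# Full Person object: next, name, surname, age (8 bytes each)
full = objects_per_line([8, 8, 8, 8])
# After splitting, the traversal touches only the hot part: next, age
hot = objects_per_line([8, 8])
print(full, hot)  # 2 full objects vs. 4 hot parts per cache line
```

Under these assumptions, splitting doubles the number of list nodes a traversal can inspect per cache line fetched.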
![Page 121: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/121.jpg)
121
If interested and motivated

Apply @ Prof. Rodica Potolea until August 2012

Come to Zurich: start in February 2013, work 4–6 months on the thesis

If you have questions: send e-mail to me ([email protected]) or talk to Prof. Rodica Potolea
![Page 122: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/122.jpg)
122
Thank you for your attention!