
1

Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead

Zoltan Majo and Thomas R. Gross

Department of Computer Science

ETH Zurich

2

NUMA multicores

[Figure: a NUMA multicore with two processors (Processor 0, Processor 1), each with four cores, a shared cache, a memory controller (MC) attached to its local DRAM memory, and interconnect (IC) links to the other processor.]

3

NUMA multicores

Two problems:

• NUMA: interconnect overhead

[Figure: processes A and B access their memory (MA, MB) over the interconnect (IC) between Processor 0 and Processor 1.]

4

NUMA multicores

Two problems:

• NUMA: interconnect overhead

• multicore: cache contention

[Figure: processes A and B share a processor's cache while their memory (MA, MB) is reached over the interconnect (IC) between Processor 0 and Processor 1.]

5

Outline

• NUMA: experimental evaluation

• Scheduling

– N-MASS

– N-MASS evaluation

6

Multi-clone experiments

• Intel Xeon E5520

• 4 clones of soplex (SPEC CPU2006)

– local clone

– remote clone

• Memory behavior of unrelated programs

[Figure: the two-processor machine (cores 0-3 on Processor 0, cores 4-7 on Processor 1), each processor with a shared cache, memory controller (MC), local DRAM memory, and interconnect (IC); the clones (C) and their memory (M) are placed so that a local clone runs next to its memory and a remote clone runs on the other processor.]
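The slides do not say how the clones were started; a minimal sketch of one way to reproduce such a placement on Linux, assuming numactl is available and using "soplex" and "input.mps" as placeholder command names, looks like this:

    # Sketch: launch four soplex clones with explicit CPU and memory placement.
    # Assumes Linux with numactl installed; "soplex" and "input.mps" are
    # placeholders, not taken from the slides.
    import subprocess

    def launch_clone(core, mem_node, binary="soplex", args=("input.mps",)):
        """Run one clone pinned to a core, with its memory bound to one node."""
        cmd = ["numactl",
               f"--physcpubind={core}",   # execute only on this core
               f"--membind={mem_node}",   # allocate memory only on this node
               binary, *args]
        return subprocess.Popen(cmd)

    # Two local clones (core and memory on node 0) and two remote clones
    # (cores on node 1 while their memory stays on node 0); the exact core
    # choices here are illustrative.
    clones = [launch_clone(core=0, mem_node=0),
              launch_clone(core=1, mem_node=0),
              launch_clone(core=4, mem_node=0),
              launch_clone(core=5, mem_node=0)]
    for clone in clones:
        clone.wait()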

[Figure: five placements of the four clones and their memory across the two processors, with local bandwidth of 100%, 80%, 57%, 32%, and 0%.]

8

Performance of schedules

• Which is the best schedule?

• Baseline: single-program execution mode

[Figure: the baseline, a single clone (C) running alone with its memory (M).]

9

[Figure: slowdown relative to the single-program baseline (execution time, 1.0-2.4) as a function of local memory bandwidth (0%-100%), with curves for the local clones, the remote clones, and the average.]

10

Outline

• NUMA: experimental evaluation

• Scheduling

– N-MASS

– N-MASS evaluation

11

N-MASS (NUMA-Multicore-Aware Scheduling Scheme)

Two steps:

– Step 1: maximum-local mapping

– Step 2: cache-aware refinement

12

Step 1: Maximum-local mapping

[Figure: processes A, B, C, D with their memory MA, MB, MC, MD; each process is mapped to the processor whose DRAM holds its memory.]
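The deck gives no pseudocode for this step; a minimal sketch of maximum-local mapping, assuming each process's memory lives entirely on one node (as the implementation slide later states), could look like this:

    # Sketch of Step 1 (maximum-local mapping): place every process on the
    # processor that holds its memory, so all of its accesses stay local.
    # The example processes and node assignments are made up.
    from collections import defaultdict

    def maximum_local_mapping(memory_node):
        """memory_node maps each process to the NUMA node holding its memory.
        Returns a node -> [processes] mapping."""
        mapping = defaultdict(list)
        for process, node in memory_node.items():
            mapping[node].append(process)
        return dict(mapping)

    # Memory of A and B lives on node 0, memory of C and D on node 1.
    print(maximum_local_mapping({"A": 0, "B": 0, "C": 1, "D": 1}))
    # -> {0: ['A', 'B'], 1: ['C', 'D']}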

13

Default OS scheduling

[Figure: under default OS scheduling, A, B, C, D are spread over the cores without regard to where MA, MB, MC, MD are allocated.]

14

N-MASS (NUMA-Multicore-Aware Scheduling Scheme)

Two steps:

– Step 1: maximum-local mapping

– Step 2: cache-aware refinement

15

Step 2: Cache-aware refinement

In an SMP:

[Figure: a placement of A, B, C, D across the two processors' caches.]

16

Step 2: Cache-aware refinement

In an SMP:

[Figure: a refined placement of A, B, C, D across the two processors' caches.]

17

Step 2: Cache-aware refinement

In an SMP:

[Figure: A, B, C, D ranked by performance degradation (NUMA penalty); the ranking guides how the processes are spread over the two caches.]

18

Step 2: Cache-aware refinement

In a NUMA:

[Figure: a placement of A, B, C, D with their memory MA-MD on the two processors.]

19

Step 2: Cache-aware refinement

In a NUMA:

[Figure: a refined placement of A, B, C, D on the NUMA system.]

20

Step 2: Cache-aware refinement

In a NUMA:

[Figure: performance degradation (NUMA penalty) of A, B, C, D, with the NUMA allowance used by the cache-aware refinement step.]

21

Performance factors

Two factors cause performance degradation:

1. NUMA penalty: slowdown due to remote memory access

2. cache pressure:

– local processes: misses / KINST (MPKI)

– remote processes: MPKI × NUMA penalty

[Figure: NUMA penalty (1.0-1.5) of the SPEC programs.]
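The slide defines the two metrics but shows no code; a minimal sketch of the cache-pressure computation as stated above, with every MPKI and NUMA-penalty value invented for illustration, might be:

    # Sketch: cache pressure as defined on this slide. A local process
    # contributes its MPKI; a remote process contributes MPKI x NUMA penalty.
    # All numbers below are illustrative, not measurements.
    def cache_pressure(mpki, numa_penalty, runs_remote):
        return mpki * numa_penalty if runs_remote else mpki

    processes = {
        # name: (MPKI, NUMA penalty, runs on the remote node?)
        "A": (12.0, 1.30, False),
        "B": (3.5, 1.05, True),
        "C": (8.2, 1.20, False),
        "D": (0.4, 1.01, True),
    }

    per_process = {name: cache_pressure(*v) for name, v in processes.items()}
    print(per_process)
    print("total pressure on this cache:", round(sum(per_process.values()), 2))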

22

Implementation

• User-mode extension to the Linux scheduler

• Performance metrics

– hardware performance counter feedback

– NUMA penalty: perfect information from program traces, or an estimate based on MPKI

• All memory for a process allocated on one processor
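The extension's code is not shown in the deck; one way a user-mode tool could place an already-running process on the cores of one processor under Linux is sketched below. The core-to-node table follows the eight-core machine shown earlier and is otherwise an assumption.

    # Sketch: user-mode process placement, roughly what a user-level scheduler
    # extension could do. Core numbering (0-3 on node 0, 4-7 on node 1) is
    # assumed to match the Xeon E5520 figure from earlier slides.
    import os

    NODE_CORES = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}

    def place_on_node(pid, node):
        """Restrict the process to the cores of one NUMA node (Linux only)."""
        os.sched_setaffinity(pid, NODE_CORES[node])

    # Example with a hypothetical PID: keep the process on node 0, next to its
    # memory if all of that memory was allocated there.
    # place_on_node(1234, 0)

Keeping the memory side on one processor (the last bullet) can be enforced with a memory policy, for example by launching the process under numactl --membind.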

23

Outline

• NUMA: experimental evaluation

• Scheduling

– N-MASS

– N-MASS evaluation

24

Workloads

• SPEC CPU2006 subset

• 11 multi-program workloads (WL1-WL11)

– 4-program workloads (WL1-WL9)

– 8-program workloads (WL10, WL11)

[Figure: NUMA penalty (0.9-1.5) versus MPKI (log scale) for used and not-used SPEC programs, spanning CPU-bound to memory-bound behavior.]

25

Memory allocation setup

• Where the memory of each process is allocated influences performance

• Controlled setup: memory allocation maps

26

Memory allocation maps

[Figure: allocation map 0000: the memory of all four processes (MA, MB, MC, MD) is allocated in Processor 0's DRAM.]

27

Memory allocation maps

[Figure: allocation map 0000 (all four memories in Processor 0's DRAM) next to allocation map 0011 (two memories in Processor 0's DRAM, two in Processor 1's).]

28

Memory allocation maps

[Figure: allocation map 0000 (all memory on Processor 0) is unbalanced; allocation map 0011 (memory split evenly between the processors) is balanced.]
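The deck does not spell out the encoding; a small sketch, under the assumption that each digit of the map gives the processor holding the corresponding process's memory (A, B, C, D from left to right), can decode the maps shown above:

    # Sketch: decoding a memory allocation map. The digit-to-process
    # correspondence (leftmost digit = process A) is an assumption made here
    # for illustration.
    def decode_allocation_map(allocation_map, processes=("A", "B", "C", "D")):
        """Return {process: node holding its memory} and whether the map is balanced."""
        placement = {p: int(d) for p, d in zip(processes, allocation_map)}
        on_node0 = sum(1 for node in placement.values() if node == 0)
        balanced = on_node0 == len(placement) - on_node0
        return placement, balanced

    print(decode_allocation_map("0000"))  # ({'A': 0, 'B': 0, 'C': 0, 'D': 0}, False)
    print(decode_allocation_map("0011"))  # ({'A': 0, 'B': 0, 'C': 1, 'D': 1}, True)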

29

Evaluation

• Baseline: Linux average

– Linux scheduler non-deterministic

– average performance degradation in all possible cases

• N-MASS with perfect NUMA penalty information

30

[Figure: average slowdown of WL9 under the Linux average, relative to single-program mode (1.0-1.6), for allocation maps 0000, 1000, 0100, 0010, 0001, 1100, 1010, and 1001.]

31

[Figure: average slowdown of WL9 under N-MASS versus the Linux average, relative to single-program mode (1.0-1.6), for the same allocation maps.]

32

[Figure: average slowdown of WL1 under the Linux average and under N-MASS, relative to single-program mode (1.0-1.6), for the same allocation maps.]

33

N-MASS performance

• N-MASS reduces performance degradation by up to 22%

• Which factor more important: interconnect overhead or cache contention?

• Compare:

– maximum-local

– N-MASS (maximum-local + cache refinement step)

34

Data-locality vs. cache balancing (WL9)

[Figure: WL9 performance improvement relative to the Linux average (-10% to 25%) per allocation map, for maximum-local and for N-MASS (maximum-local + cache refinement step).]

35

Data-locality vs. cache balancing (WL1)

[Figure: WL1 performance improvement relative to the Linux average (-10% to 25%) per allocation map, for maximum-local and for N-MASS (maximum-local + cache refinement step).]

36

Data locality vs. cache balancing

• Data-locality more important than cache balancing

• Cache-balancing gives performance benefits mostly with unbalanced allocation maps

• What if information about NUMA penalty not available?

37

Estimating NUMA penalty

• NUMA penalty is not directly measurable

• Estimate: fit linear regression onto MPKI data

[Figure: NUMA penalty (1.0-1.5) versus MPKI (0-50) with the fitted line.]
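The fit itself is a single call with standard tooling; a minimal sketch with made-up (MPKI, penalty) pairs standing in for the measured SPEC data:

    # Sketch: estimating the NUMA penalty of a program from its MPKI with a
    # linear fit. The (MPKI, penalty) pairs below are placeholders, not the
    # measured values behind the slide's chart.
    import numpy as np

    mpki = np.array([0.5, 2.0, 8.0, 15.0, 30.0, 45.0])
    penalty = np.array([1.01, 1.04, 1.12, 1.20, 1.33, 1.45])

    slope, intercept = np.polyfit(mpki, penalty, deg=1)   # least-squares line

    def estimated_numa_penalty(m):
        """Predict the NUMA penalty from MPKI using the fitted line."""
        return slope * m + intercept

    print(f"penalty ~= {slope:.4f} * MPKI + {intercept:.3f}")
    print(f"estimate for MPKI = 20: {estimated_numa_penalty(20.0):.2f}")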

38

Estimate-based N-MASS: performance

[Figure: performance improvement relative to the Linux average (-2% to 8%) for WL1-WL11, comparing maximum-local, N-MASS, and estimate-based N-MASS.]

39

Conclusions

• N-MASS: NUMA-multicore-aware scheduler

• Data locality optimizations more beneficial than cache contention avoidance

• Better performance metrics needed for scheduling

40

Thank you! Questions?
