Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead


TRANSCRIPT

1

Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead

Zoltan Majo and Thomas R. Gross

Department of Computer Science, ETH Zurich

2

NUMA multicores

[Figure: a two-processor NUMA system. Each processor has cores that share a cache and a memory controller (MC) attached to local DRAM memory; the two processors are connected by an interconnect (IC).]

3

NUMA multicores

Two problems:

• NUMA: interconnect overhead

[Figure: processes A and B and their memories MA and MB on the two processors; memory accesses that cross the interconnect (IC) incur overhead.]

4

NUMA multicores

Two problems:

• NUMA: interconnect overhead

• multicore: cache contention

[Figure: processes A and B share a cache while accessing their memories MA and MB, so they contend for cache capacity in addition to crossing the interconnect.]
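To make the two-node layout in these figures concrete on a real machine, the short Python sketch below (an addition to the transcript, not part of the slides) prints the NUMA topology reported by numactl --hardware: the per-node CPUs and memory, and the node distance matrix whose off-diagonal entries reflect the cost of crossing the interconnect. It assumes a Linux system with the numactl package installed.

    import subprocess

    # Print NUMA node count, per-node CPUs/memory, and the distance matrix.
    # Larger off-diagonal distances indicate more expensive remote accesses.
    def show_numa_topology():
        out = subprocess.run(["numactl", "--hardware"],
                             capture_output=True, text=True, check=True)
        print(out.stdout)

    if __name__ == "__main__":
        show_numa_topology()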

5

Outline

• NUMA: experimental evaluation

• Scheduling

– N-MASS

– N-MASS evaluation

6

Multi-clone experiments

• Intel Xeon E5520

• 4 clones of soplex (SPEC CPU2006)

– local clone

– remote clone

[Figure: the test system: cores 0–3 share a cache and a memory controller (MC) with local DRAM on Processor 0, cores 4–7 likewise on Processor 1, and the two processors are linked by the interconnect (IC).]

• Memory behavior of unrelated programs

[Figure: the evaluated placements of the four clones and their memory across the two processors; depending on how many clones run on the processor that holds the memory, the share of locally served memory bandwidth is 100%, 80%, 57%, 32%, or 0%.]
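For readers who want to set up a similar experiment, the sketch below (my own illustration, not the authors' scripts) launches four clones pinned to cores of Processor 0 and binds each clone's memory to either the local node 0 or the remote node 1 with numactl; mixing local and remote clones produces the intermediate bandwidth shares above. The benchmark invocation ./soplex ref.mps is a placeholder.

    import subprocess

    # Hypothetical placement: four clones on cores 0-3 (Processor 0).
    # "local" selects whether a clone's memory is bound to node 0 (local)
    # or node 1 (remote).
    CLONES = [
        {"core": 0, "local": True},
        {"core": 1, "local": True},
        {"core": 2, "local": False},   # remote clone
        {"core": 3, "local": False},   # remote clone
    ]

    def launch(clone):
        node = 0 if clone["local"] else 1
        cmd = [
            "numactl",
            f"--physcpubind={clone['core']}",   # pin the clone to one core
            f"--membind={node}",                # allocate all of its memory on this node
            "./soplex", "ref.mps",              # placeholder benchmark invocation
        ]
        return subprocess.Popen(cmd)

    if __name__ == "__main__":
        procs = [launch(c) for c in CLONES]
        for p in procs:
            p.wait()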

8

Performance of schedules

• Which is the best schedule?

• Baseline: single-program execution mode

[Figure: the baseline: a single clone runs alone with its memory, in single-program execution mode.]

9

[Chart: execution time, as slowdown relative to the baseline (1.0–2.4), versus local memory bandwidth share (0%–100%), with curves for the local clones, the remote clones, and their average.]

10

Outline

• NUMA: experimental evaluation

• Scheduling

– N-MASS

– N-MASS evaluation

11

N-MASS (NUMA-Multicore-Aware Scheduling Scheme)

Two steps:

– Step 1: maximum-local mapping

– Step 2: cache-aware refinement

12

Step 1: Maximum-local mapping

[Figure: processes A, B, C, and D and their memories MA, MB, MC, and MD; maximum-local mapping places each process on the processor whose DRAM holds its memory.]
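The slides show step 1 only as a figure, so here is a minimal sketch of the idea as I read it (not the authors' code): every process is scheduled on the NUMA node that holds its memory.

    from collections import defaultdict

    def maximum_local_mapping(home_node):
        """home_node: dict mapping each process to the NUMA node of its memory.
        Returns a placement that runs every process local to its memory."""
        placement = defaultdict(list)          # node -> processes scheduled there
        for proc, node in home_node.items():
            placement[node].append(proc)
        return dict(placement)

    # Example: the memories of A and B live on node 0, those of C and D on node 1.
    print(maximum_local_mapping({"A": 0, "B": 0, "C": 1, "D": 1}))
    # {0: ['A', 'B'], 1: ['C', 'D']}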

13

Default OS scheduling

[Figure: the same processes under default OS scheduling, where the placement of A–D does not necessarily match the location of their memories MA–MD.]

14

N-MASS (NUMA-Multicore-Aware Scheduling Scheme)

Two steps:

– Step 1: maximum-local mapping

– Step 2: cache-aware refinement

15

Step 2: Cache-aware refinement

In an SMP:

[Figure: processes A–D mapped onto the two processors of an SMP, with their memories MA–MD.]

16

Step 2: Cache-aware refinement

In an SMP:

[Figure: one process is moved to the other processor's cache to reduce cache pressure; in an SMP the location of its memory does not change the cost of this move.]

17

Step 2: Cache-aware refinement

In an SMP:

[Figure: the SMP mapping of A–D, alongside a chart that relates the NUMA penalty of each process to its performance degradation.]

18

Step 2: Cache-aware refinement

In a NUMA:

[Figure: processes A–D and their memories MA–MD on the two processors of a NUMA system.]

19

Step 2: Cache-aware refinement

In a NUMA:

[Figure: after a process is moved to the other processor in the NUMA case, its memory accesses become remote.]

20

Step 2: Cache-aware refinement

In a NUMA:

[Figure: the NUMA case; the chart of NUMA penalty versus performance degradation for A–D is now annotated with the NUMA allowance, i.e., how much additional remote-access penalty the refinement step is allowed to introduce when it moves a process.]
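The slides do not spell the refinement out as an algorithm, so the following sketch is only one plausible reading (illustrative names and threshold, not the authors' implementation): starting from the maximum-local placement of step 1, processes with a low NUMA penalty are moved to the less loaded cache as long as that reduces the imbalance and their penalty stays within the NUMA allowance.

    def refine(placement, mpki, numa_penalty, allowance=1.10):
        """Cache-aware refinement sketch over a two-node placement.
        placement: dict node -> list of processes (e.g. from maximum_local_mapping).
        mpki, numa_penalty: per-process metrics. allowance: illustrative bound on
        the slowdown a moved process may suffer from remote accesses."""
        def pressure(node):
            # Simplified cache pressure: sum of MPKI of the co-located processes
            # (slide 21 scales the MPKI of remote processes by their NUMA penalty).
            return sum(mpki[p] for p in placement[node])

        a, b = list(placement)
        while True:
            hot, cold = sorted((a, b), key=pressure, reverse=True)
            gap = pressure(hot) - pressure(cold)
            if gap <= 0 or len(placement[hot]) <= 1:
                break
            # Candidate: the process that is cheapest to run away from its memory.
            candidate = min(placement[hot], key=lambda p: numa_penalty[p])
            if numa_penalty[candidate] > allowance or mpki[candidate] >= gap:
                break                          # moving it would hurt more than it helps
            placement[hot].remove(candidate)
            placement[cold].append(candidate)
        return placement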

21

Performance factors

Two factors cause performance degradation:

1. NUMA penalty: slowdown due to remote memory access

2. Cache pressure:

– local processes: misses per kilo-instruction (MPKI)

– remote processes: MPKI × NUMA penalty

[Chart: NUMA penalty (1.0–1.5) of the SPEC programs.]
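As a small worked example of how these two factors combine (my own illustration with invented numbers, not from the slides), the cache pressure on one processor can be computed by adding the MPKI of local processes and the MPKI of remote processes scaled by their NUMA penalty:

    # Combine the per-process metrics listed above into one cache-pressure value.
    def cache_pressure(procs):
        total = 0.0
        for p in procs:
            if p["local"]:
                total += p["mpki"]                      # local process: plain MPKI
            else:
                total += p["mpki"] * p["numa_penalty"]  # remote process: MPKI x NUMA penalty
        return total

    # Invented sample values: two local processes and one remote process.
    procs = [
        {"name": "A", "mpki": 5.0, "local": True,  "numa_penalty": 1.0},
        {"name": "B", "mpki": 2.0, "local": True,  "numa_penalty": 1.0},
        {"name": "C", "mpki": 8.0, "local": False, "numa_penalty": 1.3},
    ]
    print(cache_pressure(procs))   # 5.0 + 2.0 + 8.0 * 1.3 = 17.4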

22

Implementation

• User-mode extension to the Linux scheduler

• Performance metrics

– hardware performance counter feedback

– NUMA penalty

• perfect information from program traces

• estimate based on MPKI

• All memory for a process allocated on one processor
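As an illustration of the performance-counter feedback involved (my own sketch, assuming Linux perf is installed and accessible; the generic event names may need to be replaced by the exact last-level-cache events of the CPU), MPKI can be derived from cache-miss and instruction counts:

    import subprocess

    def measure_mpki(cmd):
        """Run cmd under `perf stat` and return misses per kilo-instruction."""
        result = subprocess.run(
            ["perf", "stat", "-x", ",", "-e", "cache-misses,instructions", "--"] + cmd,
            capture_output=True, text=True)
        counts = {}
        for line in result.stderr.splitlines():     # perf stat reports on stderr
            fields = line.split(",")
            if len(fields) > 2 and fields[0].strip().isdigit():
                counts[fields[2]] = int(fields[0])
        return 1000.0 * counts["cache-misses"] / counts["instructions"]

    # Example with a placeholder workload:
    # print(measure_mpki(["./soplex", "ref.mps"]))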

23

Outline

• NUMA: experimental evaluation

• Scheduling

– N-MASS

– N-MASS evaluation

24

Workloads

• SPEC CPU2006 subset

• 11 multi-program workloads (WL1–WL11)

– 4-program workloads (WL1–WL9)

– 8-program workloads (WL10, WL11)

[Chart: NUMA penalty (0.9–1.5) versus MPKI (log scale, 0.00001–100) for the SPEC programs, distinguishing the programs used in the workloads from those not used; the programs range from CPU-bound (low MPKI) to memory-bound (high MPKI).]

25

Memory allocation setup

• Where the memory of each process is allocated influences performance

• Controlled setup: memory allocation maps

26

Memory allocation maps

[Figure: processes A–D with their memories MA, MB, MC, and MD all allocated in Processor 0's DRAM; this configuration is denoted allocation map 0000.]

Memory allocation maps

[Figure: two configurations. With allocation map 0000, all of MA–MD are allocated on Processor 0; with allocation map 0011, MA and MB are on Processor 0 while MC and MD are on Processor 1.]

28

Memory allocation maps

[Figure: the same two configurations side by side: allocation map 0000 (all memory on Processor 0) is unbalanced, allocation map 0011 (memory split across the processors) is balanced.]
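The slides do not define the encoding formally; under the natural reading (the i-th digit of the map is the NUMA node that holds the memory of the i-th process), a controlled setup could be scripted as below. The numactl invocation and program names are assumptions for illustration.

    import subprocess

    def launch_with_map(allocation_map, commands):
        """Start each program with its memory bound to the node named by the
        corresponding digit of the allocation map (e.g. "0011")."""
        procs = []
        for digit, cmd in zip(allocation_map, commands):
            node = int(digit)
            procs.append(subprocess.Popen(["numactl", f"--membind={node}"] + cmd))
        return procs

    # Hypothetical 4-program workload: map 0000 is unbalanced (all memory on
    # node 0), map 0011 is balanced (two processes' memory on each node).
    workload = [["./progA"], ["./progB"], ["./progC"], ["./progD"]]
    # launch_with_map("0011", workload)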

29

Evaluation

• Baseline: Linux average

– Linux scheduler non-deterministic

– average performance degradation in all possible cases

• N-MASS with perfect NUMA penalty information

30

[Chart: WL9 under the Linux average: average slowdown relative to single-program mode (1.0–1.6) for allocation maps 0000, 1000, 0100, 0010, 0001, 1100, 1010, and 1001.]

31

[Chart: WL9 with N-MASS compared to the Linux average: average slowdown relative to single-program mode (1.0–1.6) for the same allocation maps.]

32

[Chart: WL1, Linux average and N-MASS: average slowdown relative to single-program mode (1.0–1.6) for the same allocation maps.]

33

N-MASS performance

• N-MASS reduces performance degradation by up to 22%

• Which factor more important: interconnect overhead or cache contention?

• Compare:

– maximum-local

– N-MASS (maximum-local + cache refinement step)

34

Data-locality vs. cache balancing (WL9)

[Chart: performance improvement relative to the Linux average (-10% to 25%) for maximum-local and for N-MASS (maximum-local + cache refinement step), across allocation maps 0000 through 1001, for WL9.]

35

Data-locality vs. cache balancing (WL1)

[Chart: the same comparison for WL1: performance improvement relative to the Linux average (-10% to 25%) for maximum-local and N-MASS across the allocation maps.]

36

Data locality vs. cache balancing

• Data-locality more important than cache balancing

• Cache-balancing gives performance benefits mostly with unbalanced allocation maps

• What if information about NUMA penalty not available?

37

Estimating NUMA penalty

• NUMA penalty is not directly measurable

• Estimate: fit linear regression onto MPKI data

[Chart: NUMA penalty (1.0–1.5) versus MPKI (0–50).]
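A minimal sketch of such an estimate (my own illustration with invented sample points, assuming NumPy is available): fit the penalty as a linear function of MPKI and use the fit to predict the penalty of a program from its MPKI alone.

    import numpy as np

    # Invented (MPKI, measured NUMA penalty) pairs for a handful of programs.
    mpki    = np.array([0.5,  2.0,  8.0, 15.0, 30.0, 45.0])
    penalty = np.array([1.00, 1.03, 1.10, 1.18, 1.32, 1.45])

    # Least-squares linear fit: penalty ~= slope * MPKI + intercept.
    slope, intercept = np.polyfit(mpki, penalty, deg=1)

    def estimate_numa_penalty(mpki_value):
        """Estimate a program's NUMA penalty from its MPKI alone."""
        return slope * mpki_value + intercept

    print(estimate_numa_penalty(20.0))   # estimated slowdown factor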

38

Estimate-based N-MASS: performance

[Chart: performance improvement relative to the Linux average (-2% to 8%) for maximum-local, N-MASS, and estimate-based N-MASS, across workloads WL1–WL11.]

39

Conclusions

• N-MASS: NUMA-multicore-aware scheduler

• Data locality optimizations more beneficial than cache contention avoidance

• Better performance metrics needed for scheduling

40

Thank you! Questions?
