
1

Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead

Zoltan Majo and Thomas R. Gross

Department of Computer Science

ETH Zurich

2

NUMA multicores

[Figure: a NUMA multicore with two processors (Processor 0, Processor 1), each with four cores, a shared cache, a memory controller (MC) attached to its local DRAM memory, and interconnect (IC) links to the other processor.]

3

NUMA multicores

Two problems:

• NUMA: interconnect overhead

[Figure: processes A and B access their memory (MA, MB) over the interconnect (IC) between Processor 0 and Processor 1.]

4

NUMA multicores

Two problems:

• NUMA: interconnect overhead

• multicore: cache contention

[Figure: processes A and B share a processor's cache while their memory (MA, MB) is reached over the interconnect (IC) between Processor 0 and Processor 1.]

5

Outline

• NUMA: experimental evaluation

• Scheduling

– N-MASS

– N-MASS evaluation

6

Multi-clone experiments

• Intel Xeon E5520

• 4 clones of soplex (SPEC CPU2006)

– local clone

– remote clone

• Memory behavior of unrelated programs

[Figure: the two-processor machine (cores 0-3 on Processor 0, cores 4-7 on Processor 1), each processor with a shared cache, memory controller (MC), local DRAM memory, and interconnect (IC); the clones (C) and their memory (M) are placed so that a local clone runs next to its memory and a remote clone runs on the other processor.]
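The slides do not say how the clones were started; a minimal sketch of one way to reproduce such a placement on Linux, assuming numactl is available and using "soplex" and "input.mps" as placeholder command names, looks like this:

    # Sketch: launch four soplex clones with explicit CPU and memory placement.
    # Assumes Linux with numactl installed; "soplex" and "input.mps" are
    # placeholders, not taken from the slides.
    import subprocess

    def launch_clone(core, mem_node, binary="soplex", args=("input.mps",)):
        """Run one clone pinned to a core, with its memory bound to one node."""
        cmd = ["numactl",
               f"--physcpubind={core}",   # execute only on this core
               f"--membind={mem_node}",   # allocate memory only on this node
               binary, *args]
        return subprocess.Popen(cmd)

    # Two local clones (core and memory on node 0) and two remote clones
    # (cores on node 1 while their memory stays on node 0); the exact core
    # choices here are illustrative.
    clones = [launch_clone(core=0, mem_node=0),
              launch_clone(core=1, mem_node=0),
              launch_clone(core=4, mem_node=0),
              launch_clone(core=5, mem_node=0)]
    for clone in clones:
        clone.wait()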

[Figure: five placements of the four clones and their memory across the two processors, with local bandwidth of 100%, 80%, 57%, 32%, and 0%.]

8

Performance of schedules

• Which is the best schedule?

• Baseline: single-program execution mode

[Figure: the baseline, a single clone (C) running alone with its memory (M).]

9

[Figure: slowdown relative to the single-program baseline (execution time, 1.0-2.4) as a function of local memory bandwidth (0%-100%), with curves for the local clones, the remote clones, and the average.]

10

Outline

• NUMA: experimental evaluation

• Scheduling

– N-MASS

– N-MASS evaluation

11

N-MASS (NUMA-Multicore-Aware Scheduling Scheme)

Two steps:

– Step 1: maximum-local mapping

– Step 2: cache-aware refinement

12

Step 1: Maximum-local mapping

[Figure: processes A, B, C, D with their memory MA, MB, MC, MD; each process is mapped to the processor whose DRAM holds its memory.]
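The deck gives no pseudocode for this step; a minimal sketch of maximum-local mapping, assuming each process's memory lives entirely on one node (as the implementation slide later states), could look like this:

    # Sketch of Step 1 (maximum-local mapping): place every process on the
    # processor that holds its memory, so all of its accesses stay local.
    # The example processes and node assignments are made up.
    from collections import defaultdict

    def maximum_local_mapping(memory_node):
        """memory_node maps each process to the NUMA node holding its memory.
        Returns a node -> [processes] mapping."""
        mapping = defaultdict(list)
        for process, node in memory_node.items():
            mapping[node].append(process)
        return dict(mapping)

    # Memory of A and B lives on node 0, memory of C and D on node 1.
    print(maximum_local_mapping({"A": 0, "B": 0, "C": 1, "D": 1}))
    # -> {0: ['A', 'B'], 1: ['C', 'D']}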

13

Default OS scheduling

[Figure: under default OS scheduling, A, B, C, D are spread over the cores without regard to where MA, MB, MC, MD are allocated.]

14

N-MASS (NUMA-Multicore-Aware Scheduling Scheme)

Two steps:

– Step 1: maximum-local mapping

– Step 2: cache-aware refinement

15

Step 2: Cache-aware refinement

In an SMP:

[Figure: a placement of A, B, C, D across the two processors' caches.]

16

Step 2: Cache-aware refinement

In an SMP:

[Figure: a refined placement of A, B, C, D across the two processors' caches.]

17

Step 2: Cache-aware refinement

In an SMP:

[Figure: A, B, C, D ranked by performance degradation (NUMA penalty); the ranking guides how the processes are spread over the two caches.]

18

Step 2: Cache-aware refinement

In a NUMA:

[Figure: a placement of A, B, C, D with their memory MA-MD on the two processors.]

19

Step 2: Cache-aware refinement

In a NUMA:

[Figure: a refined placement of A, B, C, D on the NUMA system.]

20

Step 2: Cache-aware refinement

In a NUMA:

[Figure: performance degradation (NUMA penalty) of A, B, C, D, with the NUMA allowance used by the cache-aware refinement step.]

21

Performance factors

Two factors cause performance degradation:

1. NUMA penalty: slowdown due to remote memory access

2. cache pressure:

– local processes: misses / KINST (MPKI)

– remote processes: MPKI × NUMA penalty

[Figure: NUMA penalty (1.0-1.5) of the SPEC programs.]
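The slide defines the two metrics but shows no code; a minimal sketch of the cache-pressure computation as stated above, with every MPKI and NUMA-penalty value invented for illustration, might be:

    # Sketch: cache pressure as defined on this slide. A local process
    # contributes its MPKI; a remote process contributes MPKI x NUMA penalty.
    # All numbers below are illustrative, not measurements.
    def cache_pressure(mpki, numa_penalty, runs_remote):
        return mpki * numa_penalty if runs_remote else mpki

    processes = {
        # name: (MPKI, NUMA penalty, runs on the remote node?)
        "A": (12.0, 1.30, False),
        "B": (3.5, 1.05, True),
        "C": (8.2, 1.20, False),
        "D": (0.4, 1.01, True),
    }

    per_process = {name: cache_pressure(*v) for name, v in processes.items()}
    print(per_process)
    print("total pressure on this cache:", round(sum(per_process.values()), 2))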

22

Implementation

• User-mode extension to the Linux scheduler

• Performance metrics

– hardware performance counter feedback

– NUMA penalty: perfect information from program traces, or an estimate based on MPKI

• All memory for a process allocated on one processor
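The extension's code is not shown in the deck; one way a user-mode tool could place an already-running process on the cores of one processor under Linux is sketched below. The core-to-node table follows the eight-core machine shown earlier and is otherwise an assumption.

    # Sketch: user-mode process placement, roughly what a user-level scheduler
    # extension could do. Core numbering (0-3 on node 0, 4-7 on node 1) is
    # assumed to match the Xeon E5520 figure from earlier slides.
    import os

    NODE_CORES = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}

    def place_on_node(pid, node):
        """Restrict the process to the cores of one NUMA node (Linux only)."""
        os.sched_setaffinity(pid, NODE_CORES[node])

    # Example with a hypothetical PID: keep the process on node 0, next to its
    # memory if all of that memory was allocated there.
    # place_on_node(1234, 0)

Keeping the memory side on one processor (the last bullet) can be enforced with a memory policy, for example by launching the process under numactl --membind.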

23

Outline

• NUMA: experimental evaluation

• Scheduling

– N-MASS

– N-MASS evaluation

24

Workloads

• SPEC CPU2006 subset

• 11 multi-program workloads (WL1-WL11)

– 4-program workloads (WL1-WL9)

– 8-program workloads (WL10, WL11)

[Figure: NUMA penalty (0.9-1.5) versus MPKI (log scale) for used and not-used SPEC programs, spanning CPU-bound to memory-bound behavior.]

25

Memory allocation setup

• Where the memory of each process is allocated influences performance

• Controlled setup: memory allocation maps

26

Memory allocation maps

[Figure: allocation map 0000: the memory of all four processes (MA, MB, MC, MD) is allocated in Processor 0's DRAM.]

27

Memory allocation maps

[Figure: allocation map 0000 (all four memories in Processor 0's DRAM) next to allocation map 0011 (two memories in Processor 0's DRAM, two in Processor 1's).]

28

Memory allocation maps

[Figure: allocation map 0000 (all memory on Processor 0) is unbalanced; allocation map 0011 (memory split evenly between the processors) is balanced.]
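The deck does not spell out the encoding; a small sketch, under the assumption that each digit of the map gives the processor holding the corresponding process's memory (A, B, C, D from left to right), can decode the maps shown above:

    # Sketch: decoding a memory allocation map. The digit-to-process
    # correspondence (leftmost digit = process A) is an assumption made here
    # for illustration.
    def decode_allocation_map(allocation_map, processes=("A", "B", "C", "D")):
        """Return {process: node holding its memory} and whether the map is balanced."""
        placement = {p: int(d) for p, d in zip(processes, allocation_map)}
        on_node0 = sum(1 for node in placement.values() if node == 0)
        balanced = on_node0 == len(placement) - on_node0
        return placement, balanced

    print(decode_allocation_map("0000"))  # ({'A': 0, 'B': 0, 'C': 0, 'D': 0}, False)
    print(decode_allocation_map("0011"))  # ({'A': 0, 'B': 0, 'C': 1, 'D': 1}, True)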

29

Evaluation

• Baseline: Linux average

– Linux scheduler non-deterministic

– average performance degradation in all possible cases

• N-MASS with perfect NUMA penalty information

30

[Figure: average slowdown of WL9 under the Linux average, relative to single-program mode (1.0-1.6), for allocation maps 0000, 1000, 0100, 0010, 0001, 1100, 1010, and 1001.]

31

[Figure: average slowdown of WL9 under N-MASS versus the Linux average, relative to single-program mode (1.0-1.6), for the same allocation maps.]

32

[Figure: average slowdown of WL1 under the Linux average and under N-MASS, relative to single-program mode (1.0-1.6), for the same allocation maps.]

33

N-MASS performance

• N-MASS reduces performance degradation by up to 22%

• Which factor more important: interconnect overhead or cache contention?

• Compare:

– maximum-local

– N-MASS (maximum-local + cache refinement step)

34

Data-locality vs. cache balancing (WL9)

[Figure: WL9 performance improvement relative to the Linux average (-10% to 25%) per allocation map, for maximum-local and for N-MASS (maximum-local + cache refinement step).]

35

Data-locality vs. cache balancing (WL1)

[Figure: WL1 performance improvement relative to the Linux average (-10% to 25%) per allocation map, for maximum-local and for N-MASS (maximum-local + cache refinement step).]

36

Data locality vs. cache balancing

• Data-locality more important than cache balancing

• Cache-balancing gives performance benefits mostly with unbalanced allocation maps

• What if information about NUMA penalty not available?

37

Estimating NUMA penalty

• NUMA penalty is not directly measurable

• Estimate: fit linear regression onto MPKI data

[Figure: NUMA penalty (1.0-1.5) versus MPKI (0-50) with the fitted line.]
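The fit itself is a single call with standard tooling; a minimal sketch with made-up (MPKI, penalty) pairs standing in for the measured SPEC data:

    # Sketch: estimating the NUMA penalty of a program from its MPKI with a
    # linear fit. The (MPKI, penalty) pairs below are placeholders, not the
    # measured values behind the slide's chart.
    import numpy as np

    mpki = np.array([0.5, 2.0, 8.0, 15.0, 30.0, 45.0])
    penalty = np.array([1.01, 1.04, 1.12, 1.20, 1.33, 1.45])

    slope, intercept = np.polyfit(mpki, penalty, deg=1)   # least-squares line

    def estimated_numa_penalty(m):
        """Predict the NUMA penalty from MPKI using the fitted line."""
        return slope * m + intercept

    print(f"penalty ~= {slope:.4f} * MPKI + {intercept:.3f}")
    print(f"estimate for MPKI = 20: {estimated_numa_penalty(20.0):.2f}")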

38

Estimate-based N-MASS: performance

[Figure: performance improvement relative to the Linux average (-2% to 8%) for WL1-WL11, comparing maximum-local, N-MASS, and estimate-based N-MASS.]

39

Conclusions

• N-MASS: NUMA-multicore-aware scheduler

• Data locality optimizations more beneficial than cache contention avoidance

• Better performance metrics needed for scheduling

40

Thank you! Questions?
