
EU H2020 Centre of Excellence (CoE), 1 October 2015 – 31 March 2018
Grant Agreement No 676553

Getting Performance from OpenMP Programs on NUMA Architectures
Christian Terboven, RWTH Aachen University
[email protected]

OpenMP and Performance

• Among the most obscure things that can negatively impact the performance of OpenMP programs are cc-NUMA effects
• These effects are not restricted to OpenMP
• But they mostly show up because you are using OpenMP
• In any case, they are important enough to be covered here

What is cc-NUMA?

• A set of processors is organized inside a locality domain with a locally connected memory
• The memory of all locality domains is accessible over a shared virtual address space
• Other locality domains are accessed over an interconnect; the local domain can be accessed very efficiently without resorting to a network of any kind

[Figure: two NUMA nodes, each with cores and on-chip caches attached to a local memory, connected by an interconnect; together they form one shared virtual address space]

Advantages & Disadvantages

• Advantages
  • Scalable in terms of memory bandwidth
  • "Arbitrarily" large numbers of processors: there exist systems with over 1024 processors
• Disadvantages
  • Efficient programming requires precautions with respect to local and remote memory, although all processors share one address space
  • Cache coherence is hard and expensive to implement
    • e.g., recent writes need invalidation and may consume a lot of the available bandwidth

Query your System

• You should have a basic understanding of the system topology. You could use one of the following options on a target machine:
• numactl, the tool to control the Linux NUMA policy
  • numactl --hardware
  • Delivers compact information about NUMA nodes and the associated processor IDs
• Intel MPI's cpuinfo tool
  • cpuinfo
  • Delivers information about the number of sockets (= packages) and the mapping of processor IDs to CPU cores used by the OS
• hwloc's hwloc-ls tool (comes with Open MPI)
  • hwloc-ls
  • Displays a (graphical) representation of the system topology, separated into NUMA nodes, along with the mapping of processor IDs to CPU cores used by the OS and additional information on caches
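Besides the command-line tools above, the topology can also be queried from inside a program. A minimal sketch using hwloc's C API (not shown in the original slides; assumes hwloc >= 2.0 and linking with -lhwloc):

#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;

    /* build a model of the machine we are running on */
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* count NUMA nodes, cores and hardware threads as hwloc-ls would report them */
    int numa_nodes = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE);
    int cores      = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    int pus        = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);

    printf("NUMA nodes: %d, cores: %d, hardware threads: %d\n",
           numa_nodes, cores, pus);

    hwloc_topology_destroy(topo);
    return 0;
}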

To be NUMA or not to be

• Stream example with and without NUMA-aware data placement
• 2-socket system with Xeon X5675 processors, 12 OpenMP threads

            copy       scale      add        triad
  ser_init  18.8 GB/s  18.5 GB/s  18.1 GB/s  18.2 GB/s
  par_init  41.3 GB/s  39.3 GB/s  40.3 GB/s  40.4 GB/s

[Figure: with ser_init, a[0,N-1], b[0,N-1], c[0,N-1] all reside in the memory of CPU 0 (threads T1-T6), and threads T7-T12 on CPU 1 access them remotely; with par_init, a[0,(N/2)-1], b[0,(N/2)-1], c[0,(N/2)-1] reside in the memory of CPU 0 and a[N/2,N-1], b[N/2,N-1], c[N/2,N-1] in the memory of CPU 1]
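The two variants behind the table can be sketched as follows. This is not the original STREAM code; the array length N, the initial values and the triad kernel are illustrative, but the key point is faithful to the slide: par_init lets every thread first-touch the chunk it will later compute on.

#include <stdlib.h>

#define N (64 * 1024 * 1024)   /* illustrative array length */

static double *a, *b, *c;

/* ser_init: the initial thread touches every page, so first touch
   places all of a, b and c on a single NUMA node. */
static void ser_init(void)
{
    for (int i = 0; i < N; i++) {
        a[i] = 1.0; b[i] = 2.0; c[i] = 0.0;
    }
}

/* par_init: each thread touches the iterations it will later compute
   on, so the pages are distributed across the NUMA nodes. */
static void par_init(void)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {
        a[i] = 1.0; b[i] = 2.0; c[i] = 0.0;
    }
}

int main(void)
{
    a = malloc(N * sizeof(double));
    b = malloc(N * sizeof(double));
    c = malloc(N * sizeof(double));

    par_init();   /* switch to ser_init() to see the bandwidth drop */

    /* triad kernel: the same static schedule as in par_init keeps
       every thread's accesses in its local memory */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];

    free(a); free(b); free(c);
    return 0;
}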

Data Placement?

double* A;
A = (double*)malloc(N * sizeof(double));

for (int i = 0; i < N; i++) {
    A[i] = 0.0;
}

How to distribute the data?

[Figure: two NUMA nodes (cores, on-chip caches, local memory) connected by an interconnect; on which node's memory do the pages of A end up?]

Serial Data Placement

double* A;
A = (double*)malloc(N * sizeof(double));

/* serial initialization: only the initial thread touches A */
for (int i = 0; i < N; i++) {
    A[i] = 0.0;
}

• Serial code: all array elements are allocated in the memory of the NUMA node closest to the core executing the initializer thread (first touch)

[Figure: the entire array A[0]…A[N-1] ends up in the memory of the NUMA node running the initializer thread; the other node's memory holds none of it]

First Touch Data Placement

double* A;
A = (double*)malloc(N * sizeof(double));

/* parallel initialization: each thread first-touches its own chunk */
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    A[i] = 0.0;
}

• Parallel code: each array element is allocated in the memory of the NUMA node closest to the core executing the thread that first touches it, so the array is distributed across the nodes (first touch)

[Figure: A[0]…A[(N/2)-1] ends up in the memory of the first NUMA node, A[N/2]…A[N-1] in the memory of the second]
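A point that is implicit in the slide: first touch only helps if the threads that later work on A use the same iteration-to-thread mapping as the initialization loop. A minimal sketch of that idea (the loop bodies are placeholders, not taken from the slides):

#include <stdlib.h>

void first_touch_example(int N)
{
    double* A = (double*)malloc(N * sizeof(double));

    /* initialization: static schedule, each thread first-touches its
       chunk, so those pages land on that thread's NUMA node */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {
        A[i] = 0.0;
    }

    /* computation: the same static schedule hands each thread the same
       chunk, so its accesses stay in local memory; a different schedule
       (e.g. dynamic) would largely destroy this locality */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {
        A[i] = A[i] + 1.0;
    }

    free(A);
}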

Decide for Binding Strategy

• Selecting the "right" binding strategy depends not only on the topology, but also on the characteristics of your application
• Putting threads far apart, i.e., on different sockets
  • May improve the aggregated memory bandwidth available to your application
  • May improve the combined cache size available to your application
  • May decrease the performance of synchronization constructs
• Putting threads close together, i.e., on two adjacent cores that possibly share some caches
  • May improve the performance of synchronization constructs
  • May decrease the available memory bandwidth and cache size
• If you are unsure, just try a few options and then select the best one.

OpenMP 4.0: Places + Policies

• Define OpenMP places
  • A place is a set of processors on which OpenMP threads run
  • Can be defined by the user, e.g., OMP_PLACES=cores
• Define a set of OpenMP thread affinity policies
  • SPREAD: spread OpenMP threads evenly among the places, partition the place list
  • CLOSE: pack OpenMP threads near the master thread
  • MASTER: collocate OpenMP threads with the master thread
• Goals
  • The user has a way to specify where to execute OpenMP threads, for locality between OpenMP threads / less false sharing / memory bandwidth

OMP_PLACES env. variable

• Assume the following machine:

  p0 p1 p2 p3 p4 p5 p6 p7

  • 2 sockets, 4 cores per socket, 4 hyper-threads per core

• Abstract names for OMP_PLACES:
  • threads: each place corresponds to a single hardware thread on the target machine
  • cores: each place corresponds to a single core (having one or more hardware threads) on the target machine
  • sockets: each place corresponds to a single socket (consisting of one or more cores) on the target machine
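To check what a given OMP_PLACES setting resolves to on a machine, the place list can be printed from within a program. A small sketch (note: these query routines were added in OpenMP 4.5, slightly newer than the 4.0 features discussed here):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    /* run e.g. with OMP_PLACES=threads, cores or sockets and compare */
    int nplaces = omp_get_num_places();
    printf("place list has %d places\n", nplaces);

    for (int p = 0; p < nplaces; p++) {
        int  nprocs = omp_get_place_num_procs(p);
        int* ids    = malloc(nprocs * sizeof(int));
        omp_get_place_proc_ids(p, ids);

        printf("place %d:", p);
        for (int i = 0; i < nprocs; i++)
            printf(" %d", ids[i]);
        printf("\n");

        free(ids);
    }
    return 0;
}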

OpenMP 4.0: Places + Policies

• Example's objective:
  • Separate cores for the outer loop and near cores for the inner loop
• Outer parallel region: proc_bind(spread), inner: proc_bind(close)
  • spread creates a partition, close binds threads within the respective partition

  OMP_PLACES="{0,1,2,3},{4,5,6,7},..." = "{0:4}:8:4" = cores

  #pragma omp parallel proc_bind(spread) num_threads(4)
  #pragma omp parallel proc_bind(close) num_threads(4)

• Example (place list p0 … p7)
  • initial: the initial thread runs somewhere on the full place list
  • spread 4: the 4 outer threads land on p0, p2, p4, p6, and the place list is partitioned into {p0,p1}, {p2,p3}, {p4,p5}, {p6,p7}
  • close 4: within each partition, the 4 inner threads are packed onto the places close to their team's master
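The two pragma lines above are only fragments; a runnable sketch of the nested spread/close combination could look like this (nested parallelism must be enabled; omp_get_place_num() is an OpenMP 4.5 routine used here only to make the binding visible):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* run e.g. with:
       OMP_PLACES=cores OMP_NESTED=true OMP_MAX_ACTIVE_LEVELS=2 ./a.out */

    /* outer region: spread 4 threads over the place list and
       partition it into 4 sub-lists */
    #pragma omp parallel proc_bind(spread) num_threads(4)
    {
        int outer = omp_get_thread_num();

        /* inner region: pack 4 threads close together within the
           partition inherited from the outer thread */
        #pragma omp parallel proc_bind(close) num_threads(4)
        {
            printf("outer %d, inner %d on place %d\n",
                   outer, omp_get_thread_num(), omp_get_place_num());
        }
    }
    return 0;
}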

NUMA Strategies: Overview

• First touch: modern operating systems (e.g., Linux >= 2.4) determine the physical location of a memory page during the first page fault, when the page is first "touched", and put it close to the CPU that causes the page fault
• Everything under control?
  • In principle yes, but only if
    • threads can be bound explicitly,
    • data can be placed well by first touch, or can be migrated
  • What if the data access pattern changes over time?
• Explicit migration: selected regions of memory (pages) are moved from one NUMA node to another via an explicit OS syscall
• Automatic migration: no support for this in current operating systems

User Control of Memory Affinity

• Explicit NUMA-aware memory allocation:
  • By carefully touching data by the thread which will later use it
  • By changing the default memory allocation strategy with numactl
  • By explicit migration of memory pages
• Linux: move_pages()
  • Include the <numaif.h> header, link with -lnuma

  long move_pages(int pid, unsigned long count, void **pages,
                  const int *nodes, int *status, int flags);

• Example: using numactl to distribute pages round-robin:
  • numactl --interleave=all ./a.out
• Example: run on node 0 with memory allocated on nodes 0 and 1:
  • numactl --cpubind=0 --membind=0,1 ./a.out
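A minimal sketch of explicit page migration with move_pages() (not from the slides; error handling is kept to a minimum, and the target node 1 is just an assumption, so check the available nodes with numactl --hardware first):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <numaif.h>      /* move_pages(), MPOL_MF_MOVE; link with -lnuma */

enum { NPAGES = 16 };

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);

    /* allocate page-aligned memory and touch it so the pages exist */
    char *buf = aligned_alloc(page_size, NPAGES * page_size);
    for (long i = 0; i < NPAGES * page_size; i++)
        buf[i] = 0;

    void *pages[NPAGES];
    int   nodes[NPAGES];
    int   status[NPAGES];
    for (int i = 0; i < NPAGES; i++) {
        pages[i] = buf + i * page_size;
        nodes[i] = 1;              /* assumed target: NUMA node 1 */
    }

    /* pid 0 = calling process; MPOL_MF_MOVE only moves pages that are
       used exclusively by this process */
    long ret = move_pages(0, NPAGES, pages, nodes, status, MPOL_MF_MOVE);
    if (ret < 0)
        perror("move_pages");
    else
        printf("page 0 is now on node %d\n", status[0]);

    free(buf);
    return 0;
}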

Contact:
https://www.pop-coe.eu
mailto:[email protected]

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 676553.

Performance Optimisation and Productivity
A Centre of Excellence in Computing Applications