Matching Memory Access Patterns and Data Placement for NUMA Systems

Zoltán Majó, Thomas R. Gross
Computer Science Department, ETH Zurich, Switzerland

2

Non-uniform memory architecture

[Diagram: two processors, each with four cores (Processor 0: cores 0-3, Processor 1: cores 4-7), a memory controller (MC), an interconnect (IC), and locally attached DRAM]
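
As an aside not on the slides: the topology sketched above can be inspected at run time with libnuma. A minimal sketch, using standard libnuma calls (cores numbered as the OS reports them):

#include <numa.h>     /* libnuma; link with -lnuma */
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }
    printf("NUMA nodes: 0..%d\n", numa_max_node());
    for (int cpu = 0; cpu < numa_num_configured_cpus(); cpu++)
        printf("core %d -> node %d\n", cpu, numa_node_of_cpu(cpu));
    return 0;
}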

3

Non-uniform memory architecture

Local memory accesses: bandwidth 10.1 GB/s, latency 190 cycles
Remote memory accesses: bandwidth 6.3 GB/s, latency 310 cycles

[Diagram: a thread T on Processor 0 accesses its data either in the local DRAM (through the local memory controller) or in Processor 1's DRAM (across the interconnect)]

Key to good performance: data locality

All data based on experimental evaluation of the Intel Xeon 5500 (Hackenberg [MICRO ’09], Molka [PACT ’09])
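
To make the locality numbers concrete, a minimal sketch (not from the deck) that places one buffer on the local node and one on a specific remote node with libnuma; the node number 1 is an assumption about the machine:

#include <numa.h>     /* libnuma; link with -lnuma */

#define N (1 << 20)

void allocate_buffers(void)
{
    /* Placed on the NUMA node of the calling thread: local accesses. */
    double *local_buf  = numa_alloc_local(N * sizeof(double));

    /* Explicitly placed on node 1: remote for threads running on node 0. */
    double *remote_buf = numa_alloc_onnode(N * sizeof(double), 1);

    /* ... touch the buffers, measure access cost, etc. ... */

    numa_free(local_buf,  N * sizeof(double));
    numa_free(remote_buf, N * sizeof(double));
}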

5

Data locality in multithreaded programs

[Chart: remote memory references / total memory references [%] for the NAS Parallel Benchmarks cg.B, lu.C, ft.B, ep.C, bt.B, sp.B, is.B, mg.C; y-axis 0-60%]

7

Outline

Automatic page placement
Memory access patterns of matrix-based computations
Matching memory access patterns and data placement
Evaluation
Conclusions

8

Automatic page placement

Current OS support for NUMA: first-touch page placement, which often results in a high number of remote accesses

Alternative: profile-based page placement, driven by data address profiling (supported in hardware on many architectures)

9

Profile-based page placement
Based on the work of Marathe et al. [JPDC 2010, PPoPP 2006]

[Diagram: thread T0 runs on Processor 0, thread T1 on Processor 1; the profile records page P0 accessed 1000 times by T0 and page P1 accessed 3000 times by T1, so P0 is placed in Processor 0's DRAM and P1 in Processor 1's DRAM]
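
A minimal sketch of how such a profile could drive page placement on Linux, using the move_pages(2) call from libnuma; the page_profile structure and the two-node layout are illustrative assumptions, not the authors' implementation:

#include <numaif.h>   /* move_pages, MPOL_MF_MOVE; link with -lnuma */
#include <stdio.h>

/* Hypothetical per-page profile: access counts per NUMA node. */
struct page_profile {
    void *addr;               /* page-aligned address           */
    long  count[2];           /* accesses from node 0 / node 1  */
};

/* Migrate each profiled page to the node that accesses it most. */
static void place_pages(struct page_profile *prof, int n)
{
    for (int i = 0; i < n; i++) {
        int target = prof[i].count[1] > prof[i].count[0] ? 1 : 0;
        int status = 0;
        /* Migrates the page containing prof[i].addr in the      */
        /* calling process (pid 0 = self).                       */
        if (move_pages(0, 1, &prof[i].addr, &target, &status,
                       MPOL_MF_MOVE) != 0)
            perror("move_pages");
    }
}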

10

Automatic page placement

Compare: first-touch and profile-based page placement
Machine: 2-processor, 8-core Intel Xeon E5520
Subset of NAS PB: programs with a high fraction of remote accesses
8 threads with a fixed thread-to-core mapping
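
A fixed thread-to-core mapping can be obtained with GNU OpenMP by pinning each thread at startup; a sketch under the assumption that OpenMP thread t should run on core t:

#define _GNU_SOURCE
#include <omp.h>
#include <pthread.h>
#include <sched.h>

/* Pin OpenMP thread t to core t, giving a fixed thread-to-core mapping   */
/* (the 1:1 thread-to-core numbering here is an assumption about the      */
/* machine, not taken from the slides).                                   */
static void pin_threads(void)
{
    #pragma omp parallel
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(omp_get_thread_num(), &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }
}

Alternatively, GNU OpenMP can pin threads without code changes through environment variables such as GOMP_CPU_AFFINITY or, in newer releases, OMP_PROC_BIND and OMP_PLACES.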

11

Profile-based page placement

[Chart: performance improvement over first-touch [%] for cg.B, lu.C, bt.B, ft.B, sp.B; y-axis 0-25%]

13

Inter-processor data sharing

[Diagram: thread T0 runs on Processor 0, thread T1 on Processor 1; the profile records page P0 accessed 1000 times by T0, page P1 accessed 3000 times by T1, and page P2 accessed 4000 times by T0 and 5000 times by T1]

P2 is inter-processor shared: profile-based placement puts it at the processor with the most accesses (here Processor 1, 5000 vs. 4000), so T0's accesses to P2 remain remote.
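
A sketch of how a profile could classify pages as inter-processor shared; the function name and the 10% threshold are illustrative assumptions:

/* Classify a page as inter-processor shared when threads on more than    */
/* one processor account for a noticeable share of its accesses; the      */
/* 10% threshold is an arbitrary choice for illustration.                 */
static int is_inter_processor_shared(const long count[2])
{
    long total    = count[0] + count[1];
    long minority = count[0] < count[1] ? count[0] : count[1];
    return total > 0 && minority * 10 > total;
}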

15

Inter-processor data sharing

[Chart: inter-processor shared heap relative to total heap (shared heap / total heap [%]) for cg.B, lu.C, bt.B, ft.B, sp.B; y-axis 0-60%]

17

Inter-processor data sharing

[Chart: for cg.B, lu.C, bt.B, ft.B, sp.B, inter-processor shared heap relative to total heap (left axis, shared heap / total heap [%], 0-60%) and performance improvement over first-touch (right axis, 0-30%)]

19

Automatic page placement

Profile-based page placement is often ineffective
Reason: inter-processor data sharing

Inter-processor data sharing is a program property
Detailed look: program memory access patterns

Loop-parallel programs with OpenMP-like parallelization: matrix processing, NAS BT

20

Matrix processing

Process m sequentially, m[NX][NY] (NX rows, NY columns):

for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

21

Matrix processing

Process m x-wise parallel:

#pragma omp parallel for
for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

[Diagram: m[NX][NY] divided row-wise into eight blocks of NX/8 rows, processed by threads T0-T7]

22

Thread scheduling

Remember: fixed thread-to-core mapping

[Diagram: threads T0-T3 are mapped to the cores of Processor 0 and threads T4-T7 to the cores of Processor 1; each processor has its own DRAM]

23

Matrix processing

Process m x-wise parallel:

#pragma omp parallel for
for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

[Diagram: m[NX][NY] with the rows processed by threads T0-T3 allocated at Processor 0 and the rows processed by T4-T7 allocated at Processor 1]
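
This placement falls out of the default first-touch policy if m is initialized with the same parallel structure that later processes it; a standard idiom (not shown in the deck), assuming double m[NX][NY] with NX and NY as compile-time constants:

/* Under first-touch, the thread that first writes a page determines its  */
/* placement; initializing m with the same x-wise parallel loop that      */
/* later processes it places each thread's rows in its local DRAM.        */
void init_first_touch(double (*m)[NY])
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < NX; i++)
        for (int j = 0; j < NY; j++)
            m[i][j] = 0.0;   /* first touch by the thread that owns row i */
}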

24

Matrix processing

Process m x-wise parallel:

#pragma omp parallel for
for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

Process m y-wise parallel:

for (i=0; i<NX; i++) {
  #pragma omp parallel for
  for (j=0; j<NY; j++)
    // access m[i][j]
}

[Diagram: with y-wise parallelization each thread T0-T7 processes a vertical strip of m[NX][NY], while half of the rows remain allocated at Processor 0 and half at Processor 1]

25

Example: NAS BT

Time-step iteration:

for (t=0; t<TMAX; t++) {
  x_wise();
  y_wise();
}

[Diagram: x_wise() processes m[NX][NY] in horizontal blocks, one block of rows per thread T0-T7; y_wise() processes it in vertical strips, one block of columns per thread]

Appropriate allocation is not possible: no single placement of m suits both access patterns.

Result:
Inter-processor shared heap: 35%
Remote accesses: 19%

27

Solution?

1. Adjust data placement: the high overhead of runtime data migration cancels the benefit
2. Adjust iteration scheduling: limited by data dependences
3. Adjust data placement and iteration scheduling together

28

API

Library for data placement: a set of common data distributions
Affinity-aware loop iteration scheduling: an extension to the GCC OpenMP implementation
Example use case: NAS BT
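
The deck does not show the library's interface; purely for readability of the following slides, the calls could be declared along these lines (hypothetical signatures, not the authors' actual API):

#include <stddef.h>

/* Hypothetical declarations for the data-placement calls used below. */
typedef struct distr distr_t;

/* Describe a blocked-exclusive distribution of the array at `base`    */
/* (`total_size` bytes, split into blocks of `block_size` bytes).      */
distr_t *block_exclusive_distr(void *base, size_t total_size,
                               size_t block_size);

/* Apply the distribution: place each block on the chosen NUMA node.   */
void distribute_to(distr_t *distr);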

29

Use-case: NAS BT

Remember: BT has two incompatible access patterns (repeated x-wise and y-wise access to the same data)

Idea: a data placement that accommodates both access patterns

Blocked-exclusive data placement

[Diagram: m[NX][NY] split into four NX/2 by NY/2 blocks; two blocks are allocated at Processor 0 and two at Processor 1]
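
One conceivable way to approximate such a placement with stock libnuma, shown only as an illustration (the block-to-node assignment and the use of numa_tonode_memory are assumptions, not the authors' library):

#include <numa.h>     /* libnuma; link with -lnuma */

/* Approximate blocked-exclusive placement of double m[NX][NY], with NX   */
/* and NY assumed to be compile-time constants. Placement is page-        */
/* granular, so half-rows smaller than a page cannot be placed exactly.   */
static void place_blocked_exclusive(double (*m)[NY])
{
    size_t half_row = (NY / 2) * sizeof(double);
    for (int i = 0; i < NX; i++) {
        int upper = (i < NX / 2);
        /* Left half-row and right half-row go to different nodes;        */
        /* the upper and lower halves use opposite assignments.           */
        numa_tonode_memory(&m[i][0],      half_row, upper ? 0 : 1);
        numa_tonode_memory(&m[i][NY / 2], half_row, upper ? 1 : 0);
    }
}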

30

Use-case: NAS BT

distr_t *distr;
distr = block_exclusive_distr(m, sizeof(m), sizeof(m[0])/2);
distribute_to(distr);

for (t=0; t<TMAX; t++) {
  x_wise();
  y_wise();
}

x_wise() is the x-wise parallel loop:

#pragma omp parallel for
for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

32

x_wise(): matrix processed in two steps

Step 1: left half (columns 0 .. NY/2-1), all accesses local
Step 2: right half (columns NY/2 .. NY-1), all accesses local

[Diagram: m[NX][NY] under the blocked-exclusive placement; in each step every thread T0-T7 works on rows whose half-row lies in the DRAM of its own processor]

33

Use-case: NAS BT

distr_t *distr;
distr = block_exclusive_distr(m, sizeof(m), sizeof(m[0])/2);
distribute_to(distr);

for (t=0; t<TMAX; t++) {
  x_wise();
  y_wise();
}

The single x-wise loop

#pragma omp parallel for
for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

is split into two half-traversals:

#pragma omp parallel for
for (i=0; i<NX; i++)
  for (j=0; j<NY/2; j++)
    // access m[i][j]

#pragma omp parallel for
for (i=0; i<NX; i++)
  for (j=NY/2; j<NY; j++)
    // access m[i][j]

34

Use-case: NAS BT

distr_t *distr;
distr = block_exclusive_distr(m, sizeof(m), sizeof(m[0])/2);
distribute_to(distr);

for (t=0; t<TMAX; t++) {
  x_wise();
  y_wise();
}

The two half-traversals of x_wise() are given different schedules, so that each thread only touches data allocated at its own processor:

#pragma omp parallel for schedule(static)
for (i=0; i<NX; i++)
  for (j=0; j<NY/2; j++)
    // access m[i][j]

#pragma omp parallel for schedule(static-inverse)
for (i=0; i<NX; i++)
  for (j=NY/2; j<NY; j++)
    // access m[i][j]

35

Matrix processing

Process m x-wise parallel:

#pragma omp parallel for schedule(static)
for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

[Diagram: m[NX][NY] divided into eight row blocks, m[0 .. NX/8 - 1][*] through m[7*NX/8 .. NX - 1][*], one per thread T0-T7]

37

static vs. static-inverse

#pragma omp parallel for schedule(static)
for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

With schedule(static), the row blocks are assigned to the threads in order:

T0: m[0 .. NX/8 - 1][*]
T1: m[NX/8 .. 2*NX/8 - 1][*]
T2: m[2*NX/8 .. 3*NX/8 - 1][*]
T3: m[3*NX/8 .. 4*NX/8 - 1][*]
T4: m[4*NX/8 .. 5*NX/8 - 1][*]
T5: m[5*NX/8 .. 6*NX/8 - 1][*]
T6: m[6*NX/8 .. 7*NX/8 - 1][*]
T7: m[7*NX/8 .. NX - 1][*]

#pragma omp parallel for schedule(static-inverse)
for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

With schedule(static-inverse), the same row blocks, m[0 .. NX/8 - 1][*] through m[7*NX/8 .. NX - 1][*], are assigned to threads T0-T7 in the inverse order, so that the iteration-to-thread mapping matches the data distribution.
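
The exact static-inverse mapping is part of the authors' GCC extension; assuming it simply reverses the block-to-thread assignment of schedule(static), the second half-traversal could be written with standard OpenMP as:

#include <omp.h>

/* Manual equivalent of the assumed static-inverse mapping: row block b   */
/* goes to thread T-1-b, the reverse of schedule(static). NX and NY are   */
/* assumed compile-time constants, and the loop body is a placeholder.    */
void x_wise_second_half(double (*m)[NY])
{
    #pragma omp parallel
    {
        int T  = omp_get_num_threads();
        int b  = T - 1 - omp_get_thread_num();  /* reversed block index */
        int lo = b * NX / T;
        int hi = (b + 1) * NX / T;
        for (int i = lo; i < hi; i++)
            for (int j = NY / 2; j < NY; j++)
                m[i][j] += 1.0;                 /* placeholder access */
    }
}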

38

y_wise(): matrix processed in two steps

Step 1: upper half (rows 0 .. NX/2-1), all accesses local
Step 2: lower half (rows NX/2 .. NX-1), all accesses local

[Diagram: in each step threads T0-T3 and T4-T7 are assigned the column blocks of the NX/2 rows that are allocated at their own processor]

39

Outline

Profile-based page placement
Memory access patterns
Matching data distribution and iteration scheduling
Evaluation
Conclusions

40

Evaluation

[Chart: performance improvement over first-touch [%] for cg.B, lu.C, bt.B, ft.B, sp.B, comparing profile-based allocation with the program transformations; y-axis 0-25%]

43

Scalability
Machine: 4-processor, 32-core Intel Xeon E7-4830

[Chart: performance improvement over first-touch [%] for cg.C, lu.C, bt.C, ft.C, sp.C; y-axis 0-250%]

45

Conclusions

Automatic data placement is (still) limited:
Alternating memory access patterns
Inter-processor data sharing

Match memory access patterns and data placement

Simple API: a practical solution that works today
Ample opportunities for further improvement

46

Thank you for your attention!
