Matching Memory Access Patterns and Data Placement for NUMA Systems

Zoltán Majó, Thomas R. Gross
Computer Science Department, ETH Zurich, Switzerland

2

Non-uniform memory architecture

[Diagram: two processors, each with four cores (Processor 0: cores 0-3, Processor 1: cores 4-7), a memory controller (MC), an interconnect (IC), and locally attached DRAM]
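
As an aside not on the slides: the topology sketched above can be inspected at run time with libnuma. A minimal sketch, using standard libnuma calls (cores numbered as the OS reports them):

#include <numa.h>     /* libnuma; link with -lnuma */
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }
    printf("NUMA nodes: 0..%d\n", numa_max_node());
    for (int cpu = 0; cpu < numa_num_configured_cpus(); cpu++)
        printf("core %d -> node %d\n", cpu, numa_node_of_cpu(cpu));
    return 0;
}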

3

Non-uniform memory architecture

Local memory accesses: bandwidth 10.1 GB/s, latency 190 cycles
Remote memory accesses: bandwidth 6.3 GB/s, latency 310 cycles

[Diagram: a thread T on Processor 0 accesses its data either in the local DRAM (through the local memory controller) or in Processor 1's DRAM (across the interconnect)]

Key to good performance: data locality

All data based on experimental evaluation of the Intel Xeon 5500 (Hackenberg [MICRO ’09], Molka [PACT ’09])
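
To make the locality numbers concrete, a minimal sketch (not from the deck) that places one buffer on the local node and one on a specific remote node with libnuma; the node number 1 is an assumption about the machine:

#include <numa.h>     /* libnuma; link with -lnuma */

#define N (1 << 20)

void allocate_buffers(void)
{
    /* Placed on the NUMA node of the calling thread: local accesses. */
    double *local_buf  = numa_alloc_local(N * sizeof(double));

    /* Explicitly placed on node 1: remote for threads running on node 0. */
    double *remote_buf = numa_alloc_onnode(N * sizeof(double), 1);

    /* ... touch the buffers, measure access cost, etc. ... */

    numa_free(local_buf,  N * sizeof(double));
    numa_free(remote_buf, N * sizeof(double));
}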

5

Data locality in multithreaded programs

[Chart: remote memory references / total memory references [%] for the NAS Parallel Benchmarks cg.B, lu.C, ft.B, ep.C, bt.B, sp.B, is.B, mg.C; y-axis 0-60%]

7

Outline

Automatic page placement
Memory access patterns of matrix-based computations
Matching memory access patterns and data placement
Evaluation
Conclusions

8

Automatic page placement

Current OS support for NUMA: first-touch page placement, which often results in a high number of remote accesses

Alternative: profile-based page placement, driven by data address profiling (supported in hardware on many architectures)

9

Profile-based page placement
Based on the work of Marathe et al. [JPDC 2010, PPoPP 2006]

[Diagram: thread T0 runs on Processor 0, thread T1 on Processor 1; the profile records page P0 accessed 1000 times by T0 and page P1 accessed 3000 times by T1, so P0 is placed in Processor 0's DRAM and P1 in Processor 1's DRAM]
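
A minimal sketch of how such a profile could drive page placement on Linux, using the move_pages(2) call from libnuma; the page_profile structure and the two-node layout are illustrative assumptions, not the authors' implementation:

#include <numaif.h>   /* move_pages, MPOL_MF_MOVE; link with -lnuma */
#include <stdio.h>

/* Hypothetical per-page profile: access counts per NUMA node. */
struct page_profile {
    void *addr;               /* page-aligned address           */
    long  count[2];           /* accesses from node 0 / node 1  */
};

/* Migrate each profiled page to the node that accesses it most. */
static void place_pages(struct page_profile *prof, int n)
{
    for (int i = 0; i < n; i++) {
        int target = prof[i].count[1] > prof[i].count[0] ? 1 : 0;
        int status = 0;
        /* Migrates the page containing prof[i].addr in the      */
        /* calling process (pid 0 = self).                       */
        if (move_pages(0, 1, &prof[i].addr, &target, &status,
                       MPOL_MF_MOVE) != 0)
            perror("move_pages");
    }
}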

10

Automatic page placement

Compare: first-touch and profile-based page placement
Machine: 2-processor, 8-core Intel Xeon E5520
Subset of NAS PB: programs with a high fraction of remote accesses
8 threads with a fixed thread-to-core mapping
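
A fixed thread-to-core mapping can be obtained with GNU OpenMP by pinning each thread at startup; a sketch under the assumption that OpenMP thread t should run on core t:

#define _GNU_SOURCE
#include <omp.h>
#include <pthread.h>
#include <sched.h>

/* Pin OpenMP thread t to core t, giving a fixed thread-to-core mapping   */
/* (the 1:1 thread-to-core numbering here is an assumption about the      */
/* machine, not taken from the slides).                                   */
static void pin_threads(void)
{
    #pragma omp parallel
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(omp_get_thread_num(), &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }
}

Alternatively, GNU OpenMP can pin threads without code changes through environment variables such as GOMP_CPU_AFFINITY or, in newer releases, OMP_PROC_BIND and OMP_PLACES.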

11

Profile-based page placement

[Chart: performance improvement over first-touch [%] for cg.B, lu.C, bt.B, ft.B, sp.B; y-axis 0-25%]

13

Inter-processor data sharing

[Diagram: thread T0 runs on Processor 0, thread T1 on Processor 1; the profile records page P0 accessed 1000 times by T0, page P1 accessed 3000 times by T1, and page P2 accessed 4000 times by T0 and 5000 times by T1]

P2 is inter-processor shared: profile-based placement puts it at the processor with the most accesses (here Processor 1, 5000 vs. 4000), so T0's accesses to P2 remain remote.
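
A sketch of how a profile could classify pages as inter-processor shared; the function name and the 10% threshold are illustrative assumptions:

/* Classify a page as inter-processor shared when threads on more than    */
/* one processor account for a noticeable share of its accesses; the      */
/* 10% threshold is an arbitrary choice for illustration.                 */
static int is_inter_processor_shared(const long count[2])
{
    long total    = count[0] + count[1];
    long minority = count[0] < count[1] ? count[0] : count[1];
    return total > 0 && minority * 10 > total;
}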

15

Inter-processor data sharing

[Chart: inter-processor shared heap relative to total heap (shared heap / total heap [%]) for cg.B, lu.C, bt.B, ft.B, sp.B; y-axis 0-60%]

17

Inter-processor data sharing

[Chart: for cg.B, lu.C, bt.B, ft.B, sp.B, inter-processor shared heap relative to total heap (left axis, shared heap / total heap [%], 0-60%) and performance improvement over first-touch (right axis, 0-30%)]

19

Automatic page placement

Profile-based page placement is often ineffective
Reason: inter-processor data sharing

Inter-processor data sharing is a program property
Detailed look: program memory access patterns

Loop-parallel programs with OpenMP-like parallelization: matrix processing, NAS BT

20

Matrix processing

Process m sequentially, m[NX][NY] (NX rows, NY columns):

for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

21

Matrix processing

Process m x-wise parallel:

#pragma omp parallel for
for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

[Diagram: m[NX][NY] divided row-wise into eight blocks of NX/8 rows, processed by threads T0-T7]

22

Thread scheduling

Remember: fixed thread-to-core mapping

[Diagram: threads T0-T3 are mapped to the cores of Processor 0 and threads T4-T7 to the cores of Processor 1; each processor has its own DRAM]

23

Matrix processing

Process m x-wise parallel:

#pragma omp parallel for
for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

[Diagram: m[NX][NY] with the rows processed by threads T0-T3 allocated at Processor 0 and the rows processed by T4-T7 allocated at Processor 1]
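
This placement falls out of the default first-touch policy if m is initialized with the same parallel structure that later processes it; a standard idiom (not shown in the deck), assuming double m[NX][NY] with NX and NY as compile-time constants:

/* Under first-touch, the thread that first writes a page determines its  */
/* placement; initializing m with the same x-wise parallel loop that      */
/* later processes it places each thread's rows in its local DRAM.        */
void init_first_touch(double (*m)[NY])
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < NX; i++)
        for (int j = 0; j < NY; j++)
            m[i][j] = 0.0;   /* first touch by the thread that owns row i */
}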

24

Matrix processing

Process m x-wise parallel:

#pragma omp parallel for
for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

Process m y-wise parallel:

for (i=0; i<NX; i++) {
  #pragma omp parallel for
  for (j=0; j<NY; j++)
    // access m[i][j]
}

[Diagram: with y-wise parallelization each thread T0-T7 processes a vertical strip of m[NX][NY], while half of the rows remain allocated at Processor 0 and half at Processor 1]

25

Example: NAS BT

Time-step iteration:

for (t=0; t<TMAX; t++) {
  x_wise();
  y_wise();
}

[Diagram: x_wise() processes m[NX][NY] in horizontal blocks, one block of rows per thread T0-T7; y_wise() processes it in vertical strips, one block of columns per thread]

Appropriate allocation is not possible: no single placement of m suits both access patterns.

Result:
Inter-processor shared heap: 35%
Remote accesses: 19%

27

Solution?

1. Adjust data placement: the high overhead of runtime data migration cancels the benefit
2. Adjust iteration scheduling: limited by data dependences
3. Adjust data placement and iteration scheduling together

28

API

Library for data placement: a set of common data distributions
Affinity-aware loop iteration scheduling: an extension to the GCC OpenMP implementation
Example use case: NAS BT
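
The deck does not show the library's interface; purely for readability of the following slides, the calls could be declared along these lines (hypothetical signatures, not the authors' actual API):

#include <stddef.h>

/* Hypothetical declarations for the data-placement calls used below. */
typedef struct distr distr_t;

/* Describe a blocked-exclusive distribution of the array at `base`    */
/* (`total_size` bytes, split into blocks of `block_size` bytes).      */
distr_t *block_exclusive_distr(void *base, size_t total_size,
                               size_t block_size);

/* Apply the distribution: place each block on the chosen NUMA node.   */
void distribute_to(distr_t *distr);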

29

Use-case: NAS BT

Remember: BT has two incompatible access patterns (repeated x-wise and y-wise access to the same data)

Idea: a data placement that accommodates both access patterns

Blocked-exclusive data placement

[Diagram: m[NX][NY] split into four NX/2 by NY/2 blocks; two blocks are allocated at Processor 0 and two at Processor 1]
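
One conceivable way to approximate such a placement with stock libnuma, shown only as an illustration (the block-to-node assignment and the use of numa_tonode_memory are assumptions, not the authors' library):

#include <numa.h>     /* libnuma; link with -lnuma */

/* Approximate blocked-exclusive placement of double m[NX][NY], with NX   */
/* and NY assumed to be compile-time constants. Placement is page-        */
/* granular, so half-rows smaller than a page cannot be placed exactly.   */
static void place_blocked_exclusive(double (*m)[NY])
{
    size_t half_row = (NY / 2) * sizeof(double);
    for (int i = 0; i < NX; i++) {
        int upper = (i < NX / 2);
        /* Left half-row and right half-row go to different nodes;        */
        /* the upper and lower halves use opposite assignments.           */
        numa_tonode_memory(&m[i][0],      half_row, upper ? 0 : 1);
        numa_tonode_memory(&m[i][NY / 2], half_row, upper ? 1 : 0);
    }
}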

30

Use-case: NAS BT

distr_t *distr;
distr = block_exclusive_distr(m, sizeof(m), sizeof(m[0])/2);
distribute_to(distr);

for (t=0; t<TMAX; t++) {
  x_wise();
  y_wise();
}

x_wise() is the x-wise parallel loop:

#pragma omp parallel for
for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

32

x_wise(): matrix processed in two steps

Step 1: left half (columns 0 .. NY/2-1), all accesses local
Step 2: right half (columns NY/2 .. NY-1), all accesses local

[Diagram: m[NX][NY] under the blocked-exclusive placement; in each step every thread T0-T7 works on rows whose half-row lies in the DRAM of its own processor]

33

Use-case: NAS BT

distr_t *distr;
distr = block_exclusive_distr(m, sizeof(m), sizeof(m[0])/2);
distribute_to(distr);

for (t=0; t<TMAX; t++) {
  x_wise();
  y_wise();
}

The single x-wise loop

#pragma omp parallel for
for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

is split into two half-traversals:

#pragma omp parallel for
for (i=0; i<NX; i++)
  for (j=0; j<NY/2; j++)
    // access m[i][j]

#pragma omp parallel for
for (i=0; i<NX; i++)
  for (j=NY/2; j<NY; j++)
    // access m[i][j]

34

Use-case: NAS BT

distr_t *distr;
distr = block_exclusive_distr(m, sizeof(m), sizeof(m[0])/2);
distribute_to(distr);

for (t=0; t<TMAX; t++) {
  x_wise();
  y_wise();
}

The two half-traversals of x_wise() are given different schedules, so that each thread only touches data allocated at its own processor:

#pragma omp parallel for schedule(static)
for (i=0; i<NX; i++)
  for (j=0; j<NY/2; j++)
    // access m[i][j]

#pragma omp parallel for schedule(static-inverse)
for (i=0; i<NX; i++)
  for (j=NY/2; j<NY; j++)
    // access m[i][j]

35

Matrix processing

Process m x-wise parallel:

#pragma omp parallel for schedule(static)
for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

[Diagram: m[NX][NY] divided into eight row blocks, m[0 .. NX/8 - 1][*] through m[7*NX/8 .. NX - 1][*], one per thread T0-T7]

37

static vs. static-inverse

#pragma omp parallel for schedule(static)
for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

With schedule(static), the row blocks are assigned to the threads in order:

T0: m[0 .. NX/8 - 1][*]
T1: m[NX/8 .. 2*NX/8 - 1][*]
T2: m[2*NX/8 .. 3*NX/8 - 1][*]
T3: m[3*NX/8 .. 4*NX/8 - 1][*]
T4: m[4*NX/8 .. 5*NX/8 - 1][*]
T5: m[5*NX/8 .. 6*NX/8 - 1][*]
T6: m[6*NX/8 .. 7*NX/8 - 1][*]
T7: m[7*NX/8 .. NX - 1][*]

#pragma omp parallel for schedule(static-inverse)
for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

With schedule(static-inverse), the same row blocks, m[0 .. NX/8 - 1][*] through m[7*NX/8 .. NX - 1][*], are assigned to threads T0-T7 in the inverse order, so that the iteration-to-thread mapping matches the data distribution.
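
The exact static-inverse mapping is part of the authors' GCC extension; assuming it simply reverses the block-to-thread assignment of schedule(static), the second half-traversal could be written with standard OpenMP as:

#include <omp.h>

/* Manual equivalent of the assumed static-inverse mapping: row block b   */
/* goes to thread T-1-b, the reverse of schedule(static). NX and NY are   */
/* assumed compile-time constants, and the loop body is a placeholder.    */
void x_wise_second_half(double (*m)[NY])
{
    #pragma omp parallel
    {
        int T  = omp_get_num_threads();
        int b  = T - 1 - omp_get_thread_num();  /* reversed block index */
        int lo = b * NX / T;
        int hi = (b + 1) * NX / T;
        for (int i = lo; i < hi; i++)
            for (int j = NY / 2; j < NY; j++)
                m[i][j] += 1.0;                 /* placeholder access */
    }
}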

38

y_wise(): matrix processed in two steps

Step 1: upper half (rows 0 .. NX/2-1), all accesses local
Step 2: lower half (rows NX/2 .. NX-1), all accesses local

[Diagram: in each step threads T0-T3 and T4-T7 are assigned the column blocks of the NX/2 rows that are allocated at their own processor]

39

Outline

Profile-based page placement
Memory access patterns
Matching data distribution and iteration scheduling
Evaluation
Conclusions

40

Evaluation

[Chart: performance improvement over first-touch [%] for cg.B, lu.C, bt.B, ft.B, sp.B, comparing profile-based allocation with the program transformations; y-axis 0-25%]

43

Scalability
Machine: 4-processor, 32-core Intel Xeon E7-4830

[Chart: performance improvement over first-touch [%] for cg.C, lu.C, bt.C, ft.C, sp.C; y-axis 0-250%]

45

Conclusions

Automatic data placement is (still) limited:
Alternating memory access patterns
Inter-processor data sharing

Match memory access patterns and data placement

Simple API: a practical solution that works today
Ample opportunities for further improvement

46

Thank you for your attention!
