
Page 1

Handling massive concurrency
Development of a programming model for GPU and CPU

Matthias Liedtke, SAP
April 08, 2019

PUBLIC

Page 2

PUBLIC © 2019 SAP SE or an SAP affiliate company. All rights reserved.

Our requirements and related concepts

The programming model

Results

Agenda

Page 3

Our requirements and related concepts

Page 4


Llang compiler
§ Just-in-time compiler in the existing server environment, using the LLVM backend
§ Llang → internal language with little performance overhead compared to C++

Context of the programming model

Page 5


§ Ease of use

§ Achieve comparable performance to CUDA

§ Write once

§ Supportability

Our requirements for the programming model

Page 6


OpenMP

§ Sequential program

§ Added pre-processor directives for parallelization

→ Limited expressiveness, as parallelization sits "on top" of the programming language

Existing GPU programming models

Page 7


CUDA
§ Strong support for hardware capabilities
§ Many libraries for special needs
§ C-style interface, little abstraction
→ Limited to Nvidia GPUs, no CPU execution possible

Existing GPU programming models

Page 8


OpenCL

§ Platform independent programming of highly parallel kernels

§ Hardware abstraction

§ Mature (but complex) interface, also in C++

→ Very close to what we need

→ No integration into our existing environment

Existing GPU programming models

Page 9

The programming model

Page 10


_acc_kernel Void multiply(acc::GridInfo& gi, ForeignArray<Int32>& data) {

Size index = gi.getThreadIdX() + gi.getBlockIdX() * gi.getThreadCountX();

data[index] = data[index] * Int32(index);

}

export Void testMain() {

Size blockCount = 32z;

Size threadCount = 32z;

ForeignArray<Int32> array = /*...*/;

acc::GridConfig config(blockCount, threadCount);

acc::gridInvoke(config, _bind(multiply, array));

}

Example usage of the programming model

Page 11


_acc_kernel Void multiply(acc::GridInfo& gi, ForeignArray<Int32>& data) { }

§ Keyword _acc_kernel
§ Function will be compiled for both the CPU and the GPU backend

acc::GridConfig config(blockCount, threadCount);

acc::gridInvoke(config, _bind(multiply, array));

§ GridConfig to set the number of threads and thread groups
§ Kernel function bound with its arguments

Programming model – Kernel invocation

Page 12


acc::gridInvoke(config, _bind(multiply, array));

Data transfer handled by invoke mechanism

Programming model – Data transfer

Page 13


_acc_kernel Void multiply(acc::GridInfo& gi, ForeignArray<Int32>& data) {

}

export Void testMain() {

Size blockCount = 32z;

Size threadCount = 32z;

ForeignArray<Int32> array = /*...*/;

acc::Stream stream;

{

acc::GridConfig config(blockCount, threadCount);

acc::Transfer arrayTransfer(array, stream);

acc::gridInvoke(config, _bind(multiply, array), stream);

acc::gridInvoke(config, _bind(multiply, array), stream);

} // end of lifetime of transfer object triggers transfer

}

Programming model – Explicit data transfers

Page 14


Aggregating multiple results into one, e.g. sum

_reduce(gridInfo, COMPLETE_GRID, partialResult, add, &result);

Programming model – Reduction

(Diagram: partial results 1, 2, 3, 4 reduced pairwise to the sum 10)

Page 15


Aim: Avoid self-defined locks and deadlocks

Concept: Have phases that are handled “sequentially”

_acc_kernel Void kernel(acc::GridInfo& gi, ForeignArray<Int32> in, ForeignArray<Int32>& out) {

_acc_shared ForeignArray<Int32> inShared;

_acc_shared ForeignArray<Int32> outShared;

_phased_execution "load" {

// load data from in to inShared

}

_phased_execution "process" {

// execute the operation, reading data from inShared and storing results in outShared

}

_phased_execution "aggregate" {

// aggregate results in outShared

}

}

Programming model – Execution phases

Page 16

Results

Page 17


Kernel runtime with 5,000 points and 10,000 edges
GPU: Nvidia Tesla P100
CPU: 4 x Intel Xeon E7-8880 v2 @ 2.5 GHz

For each point p:
§ Count intersections of a ray starting at p with the polygon
§ Even number: outside
§ Odd number: inside

Points-in-polygon

(Chart: kernel runtime in ms, comparing CUDA C++, Prototype GPU, and Prototype CPU)

Page 18


Main concepts of our programming model
§ Worker / kernel function as in CUDA / OpenCL
§ Context object for multiple kernel calls ("Stream"); comparable to a CUDA stream
§ Object for kernel invocation configuration
§ Object to handle an explicit GPU transfer for an existing variable on the CPU
§ Execution phases to avoid explicit locks
§ GPU and CPU backend, with GPU focus

Summary

Page 19

Contact information:

Matthias Liedtke
[email protected]

Thank you.