
Heterogeneous C++

Michael Wong, Codeplay Software

VP of Research and Development

Chair of SYCL Heterogeneous Programming Language

ISOCPP.org Director, VP: http://isocpp.org/wiki/faq/wg21#michael-wong

Head of Delegation for C++ Standard for Canada

Chair of Programming Languages for Standards Council of Canada

Chair of WG21 SG5 Transactional Memory

Chair of WG21 SG14 Games Dev/Low Latency/Financial Trading/Embedded

Editor: C++ SG5 Transactional Memory Technical Specification

Editor: C++ SG1 Concurrency Technical Specification

http://wongmichael.com/about

ADC++ 2018 – May 2018

© 2018 Codeplay Software Ltd.2

Acknowledgement Disclaimer

Numerous people, internal and external to the original C++/Khronos group, in industry and academia, have made contributions, influenced ideas, written parts of this presentation, and offered feedback that forms part of this talk. I even lifted this acknowledgement and disclaimer from some of them. But I claim all credit for errors and stupid mistakes. These are mine, all mine!

© 2018 Codeplay Software Ltd.3

Legal Disclaimer

This work represents the view of the author and does not necessarily represent the view of Codeplay.

Other company, product, and service names may be trademarks or service marks of others.

© 2018 Codeplay Software Ltd.4

Partners

Codeplay - Connecting AI to Silicon

Customers

C++ platform via the SYCL™ open standard, enabling vision & machine learning e.g. TensorFlow™

The heart of Codeplay's compute technology, enabling OpenCL™, SPIR™, HSA™ and Vulkan™

Products

Automotive (ISO 26262)

IoT, Smartphones & Tablets

High Performance Compute (HPC)

Medical & Industrial

Technologies: Vision Processing

Machine Learning

Artificial Intelligence

Big Data Compute

Addressable Markets

High-performance software solutions for custom heterogeneous systems

Enabling the toughest processor systems with tools and middleware based on open standards

Established 2002 in Scotland

~70 employees

Company

© 2018 Codeplay Software Ltd.5

Agenda

A tale of two cities: HPC and Commercial
Transforming from non-heterogeneous to Heterogeneous
Heterogeneous C++ status
Parallel STL on CPU and GPU today
Vector Loop in C++ and OpenMP
Collaborating across standards
Backup: Executors and Affinity in C++

© 2018 Codeplay Software Ltd.6

A tale of two cities

© 2018 Codeplay Software Ltd.7

Will the two galaxies ever join/collide?

© 2018 Codeplay Software Ltd.8

© 2018 Codeplay Software Ltd.9

A tale of two/three cities

© 2018 Codeplay Software Ltd.10

Programming GPU/Accelerators

• OpenGL
• DirectX
• CUDA
• OpenCL
• OpenMP
• OpenACC
• C++ AMP
• HPX

• HSA
• SYCL
• Vulkan
• Boost.Compute
• Halide
• Kokkos
• Raja
• UPC++
• HCC
• Charm++

© 2018 Codeplay Software Ltd.11

Several key workshops

• P3MA at ISC
• WACCPD at SC
• DHPCC++ at IWOCL
• Repara at EuroPar
• HeteroPar at EuroPar

DHPCC++: Distributed and Heterogeneous Programming in C/C++

• 2nd year attached to IWOCL
• Latest research in HPX, SYCL, Kokkos, Raja, …
• Latest standardization efforts
• Workloads in machine learning, vision, safety-critical

© 2018 Codeplay Software Ltd.12

How can we compile source code for sub-architectures?

Separate source

Single source

© 2018 Codeplay Software Ltd.13

Benefits of Single Source

• Device code is written in C++ in the same source file as the host CPU code

• Allows compile-time evaluation of device code

• Supports type safety across host CPU and device

• Supports generic programming

• Removes the need to distribute source code

© 2018 Codeplay Software Ltd.14

Agenda

A tale of two cities: HPC and Commercial
Transforming from non-heterogeneous to Heterogeneous
Heterogeneous C++ status
Parallel STL on CPU and GPU today
Vector Loop in C++ and OpenMP
Collaborating across standards
Backup: Executors and Affinity in C++

© 2018 Codeplay Software Ltd.15

C++ Directions Group: P0939

© 2018 Codeplay Software Ltd.16

What is C++

C++ is a language for defining and using lightweight abstractions

C++ supports building resource constrained applications and software infrastructure

C++ supports large-scale software development

© 2018 Codeplay Software Ltd.17

How do we want C++ to develop?

Improve support for large-scale dependable software
Improve support for high-level concurrency models
Simplify language use
Address major sources of dissatisfaction
Address major sources of error

© 2018 Codeplay Software Ltd.18

C++ rests on two pillars

A direct map to hardware (initially from C)
Zero-overhead abstraction in production code (initially from Simula, where it wasn't zero-overhead)

© 2018 Codeplay Software Ltd.19

4.3 Concrete Suggestions

• Pattern matching
• Exception and error returns
• Static reflection
• Modern networking
• Modern hardware: We need better support for modern hardware, such as executors/execution context, affinity support in C++ leading to heterogeneous/distributed computing support, SIMD/task blocks, more concurrency data structures, improved atomics/memory model/lock-free data structures support. The challenge is to turn this (incomplete) laundry list into a coherent set of facilities and to introduce them in a manner that leaves each new standard with a coherent subset of our ideal.
• Simple graphics and interaction
• Anything from the Priorities for C++20 that didn't make C++20

© 2018 Codeplay Software Ltd.20

Use the Proper Abstraction today with C++17

Abstraction | How is it supported
Cores | C++11/14/17 threads, async
HW threads | C++11/14/17 threads, async
Vectors | Parallelism TS2
Atomics, fences, lock-free, futures, counters, transactions | C++11/14/17 atomics, Concurrency TS1, Transactional Memory TS1
Parallel loops | async, TBB::parallel_invoke, Parallelism TS1, C++17 parallel algorithms
Heterogeneous offload, FPGA | OpenCL, SYCL, HSA, OpenMP/ACC, Kokkos, Raja
Distributed | HPX, MPI, UPC++
Caches | OpenMP affinity/places; not supported in C++
NUMA | OpenMP affinity/places; not supported in C++
TLS | C++11 TLS (thread_local)
Exception handling in a concurrent environment | Not supported

© 2018 Codeplay Software Ltd.21

Problems of Transforming from a non-Heterogeneous to a Heterogeneous Language

• The three data problems, even after you have exposed parallelism:
• Data Movement: the COST! But is it implicit or explicit?
• Data Layout: AoS vs SoA, coalesced memory accesses; how about data tiling?
• Data Addressing: partitioned addressing; how do you treat pointers; can you share arrays across CPUs/GPUs?

• Other general issues:
• What is a thread? Is it a system thread, a GPU thread, an OpenMP/OpenCL thread, a thread of execution, an execution agent?
• Memory model: is it flat, abstract? Or scoped based on the warp, team, …
• Thread-local storage: do GPUs convoy startups?
• Scheduling: dependencies …
• More specific hardware kinds: FPGAs, DSPs, streaming, cloud and Tensor Processing Units?
• Forward-progress guarantees and weak execution agents
• Dynamic online/offline device …
• How to fill a GPU team, block, warp

• More specific C++ issues:
• Separation of concerns: what, where, how, and when is it executed?
• Affinity: both CPU and memory affinity
• Exceptions and error model: how to have concurrent exceptions
• Templates: how do you support compile-time polymorphism, type erasure, AI
• Polymorphism: supporting everything on a GPU vs a CPU

© 2018 Codeplay Software Ltd.22

Agenda

A tale of two cities: HPC and Commercial
Transforming from non-heterogeneous to Heterogeneous
Heterogeneous C++ status
Parallel STL on CPU and GPU today
Vector Loop in C++ and OpenMP
Collaborating across standards
Backup: Executors and Affinity in C++

© 2018 Codeplay Software Ltd.23

C++ Std Timeline/status
https://isocpp.org/std/status


© 2018 Codeplay Software Ltd.24

Status after Mar JAX C++ Meeting

ISO number | Name | Status | What is it? | C++20?

ISO/IEC TS 19841:2015 | Transactional Memory TS | Published 2015-09-16 (ISO Store). Final draft: n4514 (2015-05-08) | Composable lock-free programming that scales | No. Already in GCC 6 release and waiting for subsequent usage experience.

ISO/IEC TS 19217:2015 | C++ Extensions for Concepts | Published 2015-11-13 (ISO Store). Final draft: n4553 (2015-10-02) | Constrained templates | Merged into C++20 without terse syntax. Already in GCC 6 release and waiting for subsequent usage experience.

© 2018 Codeplay Software Ltd.25

Status after Mar JAX C++ Meeting

ISO number | Name | Status | What is it? | C++20?

ISO/IEC TS 19571:2016 | C++ Extensions for Concurrency | Published 2016-01-19 (ISO Store). Final draft: p0159r0 (2015-10-22) | Improvements to future, latches and barriers, atomic smart pointers | Latches and atomic<shared_ptr<T>> headed into C++20. Already in the Visual Studio release and Anthony Williams' Just Threads!, and waiting for subsequent usage experience.

ISO/IEC TS 19568:2017 | C++ Extensions for Library Fundamentals, Version 2 | Published 2017-03-30 (ISO Store). Draft: n4617 (2016-11-28) | Source code information capture and various utilities | Published, with parts in C++17

ISO/IEC DTS 21425:2017 | Ranges TS | Published 2017-11. Draft: n4651 (2017-03-15) | Range-based algorithms and views | Published

ISO/IEC DTS 19216:xxxx | Networking TS | PDTS. Draft: n4656 (2017-03-17) | Sockets library based on Boost.ASIO | Publish soon, but must come with executors

ISO/IEC DTS 21544:xxxx | Modules | Proposed Draft n4689 (2017-07-31) out for ballot | A component system to supersede the textual header-file inclusion model | Voted for publication.

© 2018 Codeplay Software Ltd.26

Status after Mar JAX C++ Meeting

ISO number | Name | Status | What is it? | C++20?

| Numerics TS | Early development. Draft: p0101 (2015-09-27) | Various numerical facilities, including DSP rounding types | Under active development

ISO/IEC DTS 19571:xxxx | Concurrency TS 2 | Early development | Exploring lock-free, hazard pointers, RCU, atomic views, concurrent data structures | Under active development. Possible new clause

ISO/IEC DTS 19570:xxxx | Parallelism TS 2 | Early development. Draft: n4578 (2016-02-22) | Exploring task blocks, progress guarantees, SIMD<T> type, vec/no_vec loop-based execution policy | PDTS ballot now on the remaining part after pushing the seq policy into C++20

ISO/IEC DTS 19841:xxxx | Transactional Memory TS 2 | Early development | Exploring on_commit, in_transaction. Lambda-based executor model. | Under active development.

| Graphics TS | Early development. Draft: p0267r0 (2016-02-12) | 2D drawing API using the Cairo interface, adding a stateless interface | LEWG reviewed, but it has now become controversial

| Library Fundamentals V3 | Initial draft, early development | Maybe mdspan and expected<T> | Under development

© 2018 Codeplay Software Ltd.27

Status after Mar JAX C++ Meeting

ISO number | Name | Status | What is it? | C++20?

ISO/IEC DTS 22277:2017 | Coroutines TS | Published 2017-11. Draft: n4663 (2017-03-25) | Resumable functions, based on Microsoft's await design | Published

| Reflection TS | Early development. Draft: p0194r2 (2016-10-15) with rationale in p0385r2 (2017-02-06). Alternative: p0590r0 (2017-02-05) | Code introspection and (later) reification mechanisms | Introspection proposal passed core language design review; next stop is design review of the library components. Targeting a Reflection TS.

| Contracts TS | Unified proposal reviewed favourably. | Preconditions, postconditions, etc. | Proposal passed core language design review; next stop is design review of the library components. Targeting C++20.

| Executors TS | Separated from the Concurrency TS; now has a unified proposal. | Describes the how, where, and when of execution. Enables distributed and heterogeneous computing. | Weekly calls, but it has consensus in SG1 and has been reviewed in LEWG and LWG. Now considering how to get it into C++20.

| Heterogeneous Device TS | Affinity, execution context, EH in a concurrent environment, execution-agent local storage | Support heterogeneous devices | Weekly calls. Affinity and EH progressing

© 2018 Codeplay Software Ltd.28

Agenda

A tale of two cities: HPC and Commercial
Transforming from non-heterogeneous to Heterogeneous
Heterogeneous C++ status
Parallel STL on CPU and GPU today
Vector Loop in C++ and OpenMP
Collaborating across standards
Backup: Executors and Affinity in C++

© 2018 Codeplay Software Ltd.29

Parallelize and vectorize example


© 2018 Codeplay Software Ltd.30

C++17 Parallel STL: Democratizing Parallelism in C++What is Parallel STL?

Parallel STL greatly facilitates the usage of parallelism in C++ by exposing a parallel interface for the STL algorithms.

Why do I care?

Hardware architecture is becoming increasingly parallel. You cannot escape. See Herb Sutter's The Free Lunch Is Over, which is now more than 10 years old! A more recent update: Welcome to the Jungle.

What does it include?

It adds wording for parallel execution to the C++ standard, and Execution Policies to the STL interface that enable selecting the appropriate level of parallelism.

New parallel algorithms are also added to the interface.

What do I take from this talk?

You will understand what Parallel STL is and learn the basics of using it. You'll also be ready to use the SYCL Parallel STL on your accelerator.

C++ goes parallel!

© 2018 Codeplay Software Ltd.31

Sorting with the STL

Normal sequential sort algorithm:

std::vector<int> data = { 8, 9, 1, 4 };

std::sort(std::begin(data), std::end(data));

if (std::is_sorted(std::begin(data), std::end(data))) {
  std::cout << "Data is sorted!" << std::endl;
}

An extra parameter to the STL algorithm enables parallelism:

std::vector<int> data = { 8, 9, 1, 4 };

std::sort(std::execution::par, std::begin(data), std::end(data));

if (std::is_sorted(std::begin(data), std::end(data))) {
  std::cout << "Data is sorted!" << std::endl;
}

© 2018 Codeplay Software Ltd.32

The Execution Policy: standard policy classes

● Defined in the std::execution namespace
○ Sequenced policy
■ Never parallel; sequenced, in-order execution
■ constexpr sequenced_policy seq;
○ Parallel policy
■ Can use the caller thread but may spawn others (std::thread)
■ Invocations do not interleave on a single thread
■ constexpr parallel_policy par;
○ Parallel unsequenced policy
■ Can use the caller thread or others (e.g. std::thread)
■ Multiple invocations may be interleaved on a single thread
■ constexpr parallel_unsequenced_policy par_unseq;

© 2018 Codeplay Software Ltd.33

Many different existing implementations

Available today
● Microsoft: http://parallelstl.codeplex.com
● HPX: http://stellar-group.github.io/hpx/docs/html/hpx/manual/parallel.html
● HSA: http://www.hsafoundation.com/hsa-for-math-science
● Thibaut Lutz: http://github.com/t-lutz/ParallelSTL
● NVIDIA: https://thrust.github.io/doc/group__execution__policies.html
● Codeplay: http://github.com/KhronosGroup/SyclParallelSTL
● Clang: Not yet available

Expect major C++ compilers to implement it soon!

© 2018 Codeplay Software Ltd.34

Using execution policies

using namespace std::execution;

// May execute in parallel
std::sort(par, std::begin(data), std::end(data));
// May be parallelized and vectorized
std::sort(par_unseq, std::begin(data), std::end(data));
// Will not be parallelized/vectorized
std::sort(seq, std::begin(data), std::end(data));
// Vendor-specific policy, read their documentation!
std::sort(custom_vendor_policy, std::begin(data), std::end(data));

© 2018 Codeplay Software Ltd.35

Propagating the policy to the end user

using namespace std::execution;

template <typename Policy, typename Iterator>
void library_function(Policy p, Iterator begin, Iterator end) {
  std::sort(p, begin, end);
  std::for_each(p, begin, end, [](auto &e) { e++; });
  std::for_each(seq, begin, end, non_parallel_operation);
}

© 2018 Codeplay Software Ltd.36

Parallel overloads available

© 2018 Codeplay Software Ltd.37

New algorithms into the STL: Parallel For Each

template <class ExecutionPolicy, class InputIterator, class Function>
void for_each(ExecutionPolicy &&exec, InputIterator first, InputIterator last, Function f);

template <class ExecutionPolicy, class InputIterator, class Size, class Function>
InputIterator for_each_n(ExecutionPolicy &&exec, InputIterator first, Size n, Function f);

template <class InputIterator, class Size, class Function>
InputIterator for_each_n(InputIterator first, Size n, Function f);

● for_each: Applies f to elements in the range [first, last).
● for_each_n: Applies f to elements in [first, first + n).

© 2018 Codeplay Software Ltd.38

New algorithms into the STL

Numerical parallel algorithms

template <class InputIterator>
typename iterator_traits<InputIterator>::value_type
reduce(InputIterator first, InputIterator last);

template <class InputIterator, class T>
T reduce(InputIterator first, InputIterator last, T init);

template <class InputIterator, class T, class BinaryOperation>
T reduce(InputIterator first, InputIterator last, T init, BinaryOperation binary_op);

Implements a reduction operation (the order in which binary_op is applied is not specified). The sequential equivalent is accumulate.

© 2018 Codeplay Software Ltd.39

New algorithms into the STL (Serial Reduction pattern)

[Diagram: the elements 0…6 are added one after another, giving 21]

size_t nElems = 1000u;
std::vector<float> nums(nElems);

std::accumulate(std::begin(nums), std::end(nums), 0.0f);

Only one core is used for the different additions.

© 2018 Codeplay Software Ltd.40

New algorithms into the STL (Parallel Reduction pattern)

[Diagram: the elements 0…6 are added pairwise in a tree, giving 21]

size_t nElems = 1000u;
std::vector<float> nums(nElems);

std::reduce(std::execution::par, std::begin(nums), std::end(nums), 0.0f);

If the operation is commutative and associative, it can be run in parallel. The reduction uses all cores!

© 2018 Codeplay Software Ltd.41

Transform

● transform (a.k.a. map) applies a function to an input range and stores the result in the output range. The operation may complete out of order.

std::transform(std::execution::par,
               v1.begin(), v1.end(), v2.begin(), output.begin(),
               [=](int val1, int val2) { return val1 + val2 + 1; });

© 2018 Codeplay Software Ltd.42

Transform reduce

● transform_reduce applies a function to an input range and then applies a binary operation to reduce the resulting values

© 2018 Codeplay Software Ltd.43

Transform Reduce example

[Diagram: a multiplication is applied to each pair of elements of v, the products are then reduced by addition starting from the neutral element of the reduction, giving 1275.4866…]

© 2018 Codeplay Software Ltd.44

What can I do with a Parallel For Each?

Intel Core i7, 7th generation

size_t nElems = 1000u;
std::vector<float> nums(nElems);

std::fill_n(std::begin(nums), nElems, 1.0f);

std::for_each(std::begin(nums), std::end(nums),
              [](float &f) { f = f * f + f; });

A traditional for_each uses only one core; the rest of the die is unutilized!

[Diagram: 10000 elems, all on a single core]

© 2018 Codeplay Software Ltd.45

What can I do with a Parallel For Each?

Intel Core i7, 7th generation

size_t nElems = 1000u;
std::vector<float> nums(nElems);

std::fill_n(std::execution::par, std::begin(nums), nElems, 1.0f);

std::for_each(std::execution::par, std::begin(nums), std::end(nums),
              [](float &f) { f = f * f + f; });

The workload is distributed across cores!
(Mileage may vary; implementation-specific behaviour)

[Diagram: 2500 elems on each of four cores]

© 2018 Codeplay Software Ltd.46

What can I do with a Parallel For Each?

Intel Core i7, 7th generation

size_t nElems = 1000u;
std::vector<float> nums(nElems);

std::fill_n(std::execution::par, std::begin(nums), nElems, 1.0f);

std::for_each(std::execution::par, std::begin(nums), std::end(nums),
              [](float &f) { f = f * f + f; });

The workload is distributed across cores!
(Mileage may vary; implementation-specific behaviour)

[Diagram: 2500 elems on each of four cores. What about the GPU part of the die?]

© 2018 Codeplay Software Ltd.47

What can I do with a Parallel For Each?

Intel Core i7, 7th generation

size_t nElems = 1000u;
std::vector<float> nums(nElems);

std::fill_n(sycl_policy, std::begin(nums), nElems, 1.0f);

std::for_each(sycl_named_policy<class KernelName>,
              std::begin(nums), std::end(nums),
              [](float &f) { f = f * f + f; });

The workload is distributed on the GPU cores.
(Mileage may vary; implementation-specific behaviour)

[Diagram: 10000 elems, all on the GPU]

© 2018 Codeplay Software Ltd.48

What can I do with a Parallel For Each?

Intel Core i7, 7th generation

size_t nElems = 1000u;
std::vector<float> nums(nElems);

std::fill_n(sycl_heter_policy(cpu, gpu, 0.5), std::begin(nums), nElems, 1.0f);

std::for_each(sycl_heter_policy<class kName>(cpu, gpu, 0.5),
              std::begin(nums), std::end(nums),
              [](float &f) { f = f * f + f; });

Experimental! The workload is distributed across all cores.
(Mileage may vary; implementation-specific behaviour)

[Diagram: 5000 elems on the GPU, 1250 elems on each of four CPU cores]

© 2018 Codeplay Software Ltd.49

Parallel overloads available in SYCL Parallel STL

© 2018 Codeplay Software Ltd.50

SYCL for OpenCL

➢ Cross-platform, single-source, high-level C++ programming layer
➢ Built on top of OpenCL and based on standard C++14

© 2018 Codeplay Software Ltd.51

Enabling Machine Learning Frameworks on OpenCL and Modern C++ 11, 14, 17, 20, 23, ...

© 2018 Codeplay Software Ltd.52

The SYCL Ecosystem

[Diagram: a C++ Application sits on C++ Template Libraries, which sit on SYCL for OpenCL, which sits on OpenCL, targeting CPU, GPU, APU, FPGA, DSP and other accelerators]

© 2018 Codeplay Software Ltd.53

Example: Vector Add

© 2018 Codeplay Software Ltd.54

Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> &out) {

}

© 2018 Codeplay Software Ltd.55

Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> &out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
}

The buffers synchronise upon destruction.

© 2018 Codeplay Software Ltd.56

Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> &out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
  cl::sycl::queue defaultQueue;
}

© 2018 Codeplay Software Ltd.57

Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> &out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
  cl::sycl::queue defaultQueue;
  defaultQueue.submit([&] (cl::sycl::handler &cgh) {

  });
}

Create a command group to define an asynchronous task.

© 2018 Codeplay Software Ltd.58

Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> &out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
  cl::sycl::queue defaultQueue;
  defaultQueue.submit([&] (cl::sycl::handler &cgh) {
    auto inputAPtr = inputABuf.get_access<cl::sycl::access::mode::read>(cgh);
    auto inputBPtr = inputBBuf.get_access<cl::sycl::access::mode::read>(cgh);
    auto outputPtr = outputBuf.get_access<cl::sycl::access::mode::write>(cgh);
  });
}

© 2018 Codeplay Software Ltd.59

Example: Vector Add

#include <CL/sycl.hpp>

template <typename T> class kernel;

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> &out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
  cl::sycl::queue defaultQueue;
  defaultQueue.submit([&] (cl::sycl::handler &cgh) {
    auto inputAPtr = inputABuf.get_access<cl::sycl::access::mode::read>(cgh);
    auto inputBPtr = inputBBuf.get_access<cl::sycl::access::mode::read>(cgh);
    auto outputPtr = outputBuf.get_access<cl::sycl::access::mode::write>(cgh);
    cgh.parallel_for<kernel<T>>(cl::sycl::range<1>(out.size()),
        [=](cl::sycl::id<1> idx) {

    });
  });
}

You must provide a name for the lambda. Create a parallel_for to define the device code.

© 2018 Codeplay Software Ltd.60

Example: Vector Add

#include <CL/sycl.hpp>

template <typename T> class kernel;

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> &out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
  cl::sycl::queue defaultQueue;
  defaultQueue.submit([&] (cl::sycl::handler &cgh) {
    auto inputAPtr = inputABuf.get_access<cl::sycl::access::mode::read>(cgh);
    auto inputBPtr = inputBBuf.get_access<cl::sycl::access::mode::read>(cgh);
    auto outputPtr = outputBuf.get_access<cl::sycl::access::mode::write>(cgh);
    cgh.parallel_for<kernel<T>>(cl::sycl::range<1>(out.size()),
        [=](cl::sycl::id<1> idx) {
      outputPtr[idx] = inputAPtr[idx] + inputBPtr[idx];
    });
  });
}

© 2018 Codeplay Software Ltd.61

Example: Vector Add

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> &out);

int main() {
  std::vector<float> inputA = { /* input a */ };
  std::vector<float> inputB = { /* input b */ };
  std::vector<float> output = { /* output */ };

  parallel_add(inputA, inputB, output);
  ...
}

© 2018 Codeplay Software Ltd.62

Implementing Parallel STL with SYCL

/* sycl_execution_policy.
 * The sycl_execution_policy enables algorithms to be executed using
 * a SYCL implementation.
 */
template <class KernelName = DefaultKernelName>
class sycl_execution_policy {
  cl::sycl::queue m_q;

 public:
  // The kernel name when using lambdas
  using kernelName = KernelName;

  sycl_execution_policy() = default;
  // Creates a SYCL policy using an existing queue
  sycl_execution_policy(cl::sycl::queue q) : m_q(q){};
  sycl_execution_policy(const sycl_execution_policy&) = default;

  // Returns the name of the kernel as a string
  // (typeid information is only valid for debugging)
  std::string get_name() const { return typeid(kernelName).name(); };

  // Returns the queue, if any
  cl::sycl::queue get_queue() const { return m_q; }

© 2018 Codeplay Software Ltd.63

Implementing Parallel STL with SYCL

/* for_each */
template <class Iterator, class UnaryFunction>
void for_each(Iterator b, Iterator e, UnaryFunction f) {
  impl::for_each(*this, b, e, f);
}

The for_each member function on the policy forwards to the implementation.
The iterator can be any random-access iterator.
Functions can take C++ iterators or SYCL-specific iterators.

© 2018 Codeplay Software Ltd.64

template <class ExecutionPolicy, class Iterator, class UnaryFunction>
void for_each(ExecutionPolicy &sep, Iterator b, Iterator e, UnaryFunction op) {
  // Obtain the queue from the policy
  cl::sycl::queue q(sep.get_queue());
  // Obtain device parameters
  auto device = q.get_device();
  size_t localRange =
      device.get_info<cl::sycl::info::device::max_work_group_size>();
  // Prepare allocations on the device
  auto bufI = sycl::helpers::make_buffer(b, e);
  auto vectorSize = bufI.get_count();
  size_t globalRange = sep.calculateGlobalSize(vectorSize, localRange);

Continues...

© 2018 Codeplay Software Ltd.65

  // Device lambda
  auto f = [vectorSize, localRange, globalRange, &bufI, op](
               cl::sycl::handler &h) mutable {
    cl::sycl::nd_range<1> r{
        cl::sycl::range<1>{std::max(globalRange, localRange)},
        cl::sycl::range<1>{localRange}};
    auto aI = bufI.template get_access<cl::sycl::access::mode::read_write>(h);
    h.parallel_for<typename ExecutionPolicy::kernelName>(
        r, [aI, op, vectorSize](cl::sycl::nd_item<1> id) {
          // Apply the user functor
          if (id.get_global(0) < vectorSize) {
            op(aI[id.get_global(0)]);
          }
        });
  };
  // Submit for execution on the device
  q.submit(f);
}

© 2018 Codeplay Software Ltd.66

Demo Results - Running std::sort
(Running on Intel i7 6600 CPU & Intel HD Graphics 520)

size | 2^16 | 2^17 | 2^18 | 2^19
std::seq | 0.27031s | 0.620068s | 0.669628s | 1.48918s
std::par | 0.259486s | 0.478032s | 0.444422s | 1.83599s
std::unseq | 0.24258s | 0.413909s | 0.456224s | 1.01958s
sycl_execution_policy | 0.273724s | 0.269804s | 0.277747s | 0.399634s

© 2018 Codeplay Software Ltd.67

Agenda

A tale of two cities: HPC and Commercial
Transforming from non-heterogeneous to Heterogeneous
Heterogeneous C++ status
Parallel STL on CPU and GPU today
Vector Loop in C++ and OpenMP
Collaborating across standards
Backup: Executors and Affinity in C++

© 2018 Codeplay Software Ltd.68

The parallel and concurrency planets of C++ today

SG1 Par/Con TS

SG5 Transactional Memory TS

SG14 Low Latency

……

© 2018 Codeplay Software Ltd.69

C++1Y (1Y = 17/20/22) SG1/SG5/SG14 TS plan
(red = C++17, blue = C++20?, black = future?)

Parallelism
• Parallel algorithms
• Data-based parallelism (vector, SIMD, ...)
• Task-based parallelism (cilk, OpenMP, fork-join)
• Loop-based execution policy
• Execution agents
• Progress guarantees
• MapReduce
• Pipelines

Concurrency
• Future++ (then, wait_any, wait_all)
• Resumable functions, await (with futures)
• Lock-free techniques/transactions
• Synchronics
• Atomic views
• Counters/queues
• Concurrent vector/unordered associative containers
• Latches and barriers
• upgrade_lock
• Atomic smart pointers
• Executors
• Coroutines
• Networking

© 2018 Codeplay Software Ltd.70

Two "Possible" Universes

C++20 (barely possible)
• Executors Lite
• Networking

C++23
• Executors
• Networking
• More algorithm policies
• Futures
• Async
• Affinity
• EATLS
• EH in a concurrent environment

© 2018 Codeplay Software Ltd.71

Use the Proper Abstraction with C++20

Abstraction | How is it supported
Cores | C++11/14/17 threads, async
HW threads | C++11/14/17 threads, async
Vectors | Parallelism TS2 -> C++20
Atomics, fences, lock-free, futures, counters, transactions | C++11/14/17 atomics, Concurrency TS1 -> C++20, Transactional Memory TS1
Parallel loops | async, TBB::parallel_invoke, Parallelism TS1, C++17 parallel algorithms
Heterogeneous offload, FPGA | OpenCL, SYCL, HSA, OpenMP/ACC, Kokkos, Raja; P0796 on affinity
Distributed | HPX, MPI, UPC++; P0796 on affinity
Caches | OpenMP affinity/places; not supported in C++; executors, execution context, affinity; P0796 on affinity; P0443 -> Executor TS or IS20
NUMA | OpenMP affinity/places; not supported in C++; executors, execution context, affinity; P0796 on affinity; P0443 -> Executor TS or IS20
TLS | EATLS, P0772
Exception handling in a concurrent environment | EH reduction properties, P0797

© 2018 Codeplay Software Ltd.72

Agenda

A tale of two cities: HPC and Commercial
Transforming from non-heterogeneous to Heterogeneous
Heterogeneous C++ status
Parallel STL on CPU and GPU today
Vector Loop in C++ and OpenMP
Collaborating across standards
Backup: Executors and Affinity in C++

© 2018 Codeplay Software Ltd.73

Problems of Transforming from a non-Heterogeneous to a Heterogeneous Language

• The three data problems, even after you have exposed parallelism:
• Data Movement: the COST! But is it implicit or explicit?
• Data Layout: AoS vs SoA, coalesced memory accesses; how about data tiling?
• Data Addressing: partitioned addressing; how do you treat pointers; can you share arrays across CPUs/GPUs?

• Other general issues:
• What is a thread? Is it a system thread, a GPU thread, an OpenMP/OpenCL thread, a thread of execution, an execution agent?
• Memory model: is it flat, abstract? Or scoped based on the warp, team, …
• Thread-local storage: do GPUs convoy startups?
• Scheduling: dependencies …
• More specific hardware kinds: FPGAs, DSPs, streaming, cloud and Tensor Processing Units?
• Forward-progress guarantees and weak execution agents
• Dynamic online/offline device …
• How to fill a GPU team, block, warp

• More specific C++ issues:
• Separation of concerns: what, where, how, and when is it executed?
• Affinity: both CPU and memory affinity
• Exceptions and error model: how to have concurrent exceptions
• Templates: how do you support compile-time polymorphism, type erasure, AI
• Polymorphism: supporting everything on a GPU vs a CPU

© 2018 Codeplay Software Ltd.74

Unified interface for execution

SYCL / OpenCL / CUDA / HCC | OpenMP / MPI | C++ Thread Pool | Boost.Asio / Networking TS

future::then, async, invoke, post, defer, define_task_block, dispatch, strand<>, asynchronous operations, parallel algorithms

© 2018 Codeplay Software Ltd.75

Separation of concerns with executors

[Diagram: an execution resource underlies an execution context; the context provides an executor, which runs an execution function as an instruction stream across many lightweight execution agents]

● An execution context is responsible for managing an execution resource

● An execution context provides an executor for executing work on its managed execution resource

● An execution context manages a number of lightweight execution agents

© 2018 Codeplay Software Ltd.76

OpenMP without Pragmas

The OpenMP work-sharing loop construct is written as:

#pragma omp for [clause[[,] clause] ...] new-line
  for-loop

and is bound to a parallel region that looks as follows:

#pragma omp parallel [clause[[,] clause] ...] new-line
  structured-block

while both constructs can be combined into the following:

#pragma omp parallel for [clause[[,] clause] ...] new-line
  for-loop

With C++ attributes instead of pragmas, the loop construct could look like this:

for [[omp::for(clause, clause), …]] (loop-head)
  loop-body

The enclosing parallel region would look like this:

using [[omp::parallel(clause, clause), …]]
{ }

© 2018 Codeplay Software Ltd.77

Learning from each other

C++ from OpenMP:
• Affinity & NUMA model
• SIMD model
• Learning from the shared code drop to clang
• Leverage the same OpenMP clang fat binary
• Build uniform tools for debugging and performance analysis

OpenMP from C++:
• Separation of concerns
• Attributes over pragmas
• Futures
• Thread-of-execution model
• Error model, better C++ support
• Consumer/producer model
• Academic membership model
• Local national chapters

OpenCL from OpenMP

OpenMP from OpenCL

© 2018 Codeplay Software Ltd.78

Another possible future

• In time, we will move accelerator code and data segments into object layouts,
  such that OSes can load them directly onto various devices

Thank you for listening

@codeplaysoft | codeplay.com | info@codeplay.com
