
Heterogeneous C++

Michael Wong, Codeplay Software

VP of Research and Development

Chair of SYCL Heterogeneous Programming Language

ISOCPP.org Director, VP: http://isocpp.org/wiki/faq/wg21#michael-wong

Head of Delegation for C++ Standard for Canada

Chair of Programming Languages for Standards Council of Canada

Chair of WG21 SG5 Transactional Memory

Chair of WG21 SG14 Games Dev/Low Latency/Financial Trading/Embedded

Editor: C++ SG5 Transactional Memory Technical Specification

Editor: C++ SG1 Concurrency Technical Specification

http://wongmichael.com/about

ADC++ 2018 – May 2018

© 2018 Codeplay Software Ltd.2

Acknowledgement Disclaimer

Numerous people, internal and external to the original C++/Khronos group, in industry and academia, have made contributions, influenced ideas, written parts of this presentation, and offered feedback that forms part of this talk. I even lifted this acknowledgement and disclaimer from some of them. But I claim all credit for errors and stupid mistakes. These are mine, all mine!

© 2018 Codeplay Software Ltd.3

Legal Disclaimer

This work represents the view of the author and does not necessarily represent the view of Codeplay.

Other company, product, and service names may be trademarks or service marks of others.

© 2018 Codeplay Software Ltd.4

Partners

Codeplay - Connecting AI to Silicon

Customers

C++ platform via the SYCL™ open standard, enabling vision & machine learning e.g. TensorFlow™

The heart of Codeplay's compute technology, enabling OpenCL™, SPIR™, HSA™ and Vulkan™

Products

Automotive (ISO 26262)

IoT, Smartphones & Tablets

High Performance Compute (HPC)

Medical & Industrial

Technologies: Vision Processing

Machine Learning

Artificial Intelligence

Big Data Compute

Addressable Markets

High-performance software solutions for custom heterogeneous systems

Enabling the toughest processor systems with tools and middleware based on open standards

Established 2002 in Scotland

~70 employees

Company

© 2018 Codeplay Software Ltd.5

Agenda

A tale of two cities: HPC and Commercial
Transforming from non-heterogeneous to Heterogeneous
Heterogeneous C++ status
Parallel STL on CPU and GPU today
Vector Loop in C++ and OpenMP
Collaborating across standards
Backup: Executors and Affinity in C++

© 2018 Codeplay Software Ltd.6

A tale of two cities

© 2018 Codeplay Software Ltd.7

Will the two galaxies ever join/collide?

© 2018 Codeplay Software Ltd.8

© 2018 Codeplay Software Ltd.9

A tale of two/three cities

© 2018 Codeplay Software Ltd.10

Programming GPU/Accelerators

• OpenGL
• DirectX
• CUDA
• OpenCL
• OpenMP
• OpenACC
• C++ AMP
• HPX

• HSA
• SYCL
• Vulkan
• Boost.Compute
• Halide
• Kokkos
• Raja
• UPC++
• HCC
• Charm++

© 2018 Codeplay Software Ltd.11

Several key workshops

• P3MA at ISC
• WACCPD at SC
• DHPCC++ at IWOCL
• Repara at EuroPar
• HeteroPar at EuroPar

DHPCC++: Distributed and Heterogeneous Programming in C/C++

• 2nd year attached to IWOCL
• Latest research in HPX, SYCL, Kokkos, Raja, …
• Latest standardization efforts
• Workloads in machine learning, vision, safety-critical

© 2018 Codeplay Software Ltd.12

How can we compile source code for sub-architectures?

Separate source

Single source

© 2018 Codeplay Software Ltd.13

Benefits of Single Source

• Device code is written in C++ in the same source file as the host CPU code

• Allows compile-time evaluation of device code

• Supports type safety across host CPU and device

• Supports generic programming

• Removes the need to distribute source code

© 2018 Codeplay Software Ltd.14

Agenda

A tale of two cities: HPC and Commercial
Transforming from non-heterogeneous to Heterogeneous
Heterogeneous C++ status
Parallel STL on CPU and GPU today
Vector Loop in C++ and OpenMP
Collaborating across standards
Backup: Executors and Affinity in C++

© 2018 Codeplay Software Ltd.15

C++ Directions Group: P0939

© 2018 Codeplay Software Ltd.16

What is C++

C++ is a language for defining and using lightweight abstractions

C++ supports building resource constrained applications and software infrastructure

C++ supports large-scale software development

© 2018 Codeplay Software Ltd.17

How do we want C++ to develop?

Improve support for large-scale dependable software
Improve support for high-level concurrency models
Simplify language use
Address major sources of dissatisfaction
Address major sources of error

© 2018 Codeplay Software Ltd.18

C++ rests on two pillars

A direct map to hardware (initially from C)
Zero-overhead abstraction in production code (initially from Simula, where it wasn't zero-overhead)

© 2018 Codeplay Software Ltd.19

4.3 Concrete Suggestions

• Pattern matching
• Exception and error returns
• Static reflection
• Modern networking
• Modern hardware: We need better support for modern hardware, such as executors/execution context, affinity support in C++ leading to heterogeneous/distributed computing support, SIMD/task blocks, more concurrency data structures, improved atomics/memory model/lock-free data structures support. The challenge is to turn this (incomplete) laundry list into a coherent set of facilities and to introduce them in a manner that leaves each new standard with a coherent subset of our ideal.
• Simple graphics and interaction
• Anything from the Priorities for C++20 that didn't make C++20

© 2018 Codeplay Software Ltd.20

Use the Proper Abstraction today with C++17

Abstraction | How is it supported
Cores | C++11/14/17 threads, async
HW threads | C++11/14/17 threads, async
Vectors | Parallelism TS2
Atomics, fences, lock-free, futures, counters, transactions | C++11/14/17 atomics, Concurrency TS1, Transactional Memory TS1
Parallel loops | async, TBB::parallel_invoke, Parallelism TS1, C++17 parallel algorithms
Heterogeneous offload, FPGA | OpenCL, SYCL, HSA, OpenMP/ACC, Kokkos, Raja
Distributed | HPX, MPI, UPC++
Caches | OpenMP affinity/places; not supported in C++
NUMA | OpenMP affinity/places; not supported in C++
TLS | C++11 TLS (thread_local)
Exception handling in a concurrent environment | Not supported

© 2018 Codeplay Software Ltd.21

Problems of Transforming from a non-Heterogeneous to a Heterogeneous Language

• The three data problems, even after you have exposed parallelism:
• Data Movement: the COST! But is it implicit or explicit?
• Data Layout: AoS vs SoA, coalesced memory accesses; how about data tiling?
• Data Addressing: partitioned addressing; how do you treat pointers; can you share arrays across CPUs/GPUs?

• Other general issues:
• What is a thread? Is it a system thread, a GPU thread, an OpenMP/OpenCL thread, a thread of execution, an execution agent?
• Memory model: is it flat, abstract? Or scoped based on the warp, team, …
• Thread-local storage: do GPUs convoy startups?
• Scheduling: dependencies …
• More specific hardware kinds: FPGAs, DSPs, streaming, cloud and Tensor Processing Units?
• Forward-progress guarantees and weak execution agents
• Dynamic online/offline device …
• How to fill a GPU team, block, warp

• More specific C++ issues:
• Separation of concerns: what, where, how, and when is it executed?
• Affinity: both CPU and memory affinity
• Exceptions and error model: how to have concurrent exceptions
• Templates: how do you support compile-time polymorphism, type erasure, AI
• Polymorphism: supporting everything on a GPU vs a CPU

© 2018 Codeplay Software Ltd.22

Agenda

A tale of two cities: HPC and Commercial
Transforming from non-heterogeneous to Heterogeneous
Heterogeneous C++ status
Parallel STL on CPU and GPU today
Vector Loop in C++ and OpenMP
Collaborating across standards
Backup: Executors and Affinity in C++

© 2018 Codeplay Software Ltd.23

C++ Std Timeline/status
https://isocpp.org/std/status


© 2018 Codeplay Software Ltd.24

Status after Mar JAX C++ Meeting

ISO number | Name | Status | What is it? | C++20?

ISO/IEC TS 19841:2015 | Transactional Memory TS | Published 2015-09-16 (ISO Store). Final draft: n4514 (2015-05-08) | Composable lock-free programming that scales | No. Already in GCC 6 release and waiting for subsequent usage experience.

ISO/IEC TS 19217:2015 | C++ Extensions for Concepts | Published 2015-11-13 (ISO Store). Final draft: n4553 (2015-10-02) | Constrained templates | Merged into C++20 without terse syntax. Already in GCC 6 release and waiting for subsequent usage experience.

© 2018 Codeplay Software Ltd.25

Status after Mar JAX C++ Meeting

ISO number | Name | Status | What is it? | C++20?

ISO/IEC TS 19571:2016 | C++ Extensions for Concurrency | Published 2016-01-19 (ISO Store). Final draft: p0159r0 (2015-10-22) | Improvements to future, latches and barriers, atomic smart pointers | Latches and atomic<shared_ptr<T>> headed into C++20. Already in the Visual Studio release and Anthony Williams' Just Threads!, and waiting for subsequent usage experience.

ISO/IEC TS 19568:2017 | C++ Extensions for Library Fundamentals, Version 2 | Published 2017-03-30 (ISO Store). Draft: n4617 (2016-11-28) | Source code information capture and various utilities | Published, with parts in C++17

ISO/IEC DTS 21425:2017 | Ranges TS | Published 2017-11. Draft: n4651 (2017-03-15) | Range-based algorithms and views | Published

ISO/IEC DTS 19216:xxxx | Networking TS | PDTS. Draft: n4656 (2017-03-17) | Sockets library based on Boost.ASIO | Publish soon, but must come with executors

ISO/IEC DTS 21544:xxxx | Modules | Proposed Draft n4689 (2017-07-31) out for ballot | A component system to supersede the textual header-file inclusion model | Voted for publication.

© 2018 Codeplay Software Ltd.26

Status after Mar JAX C++ Meeting

ISO number | Name | Status | What is it? | C++20?

| Numerics TS | Early development. Draft: p0101 (2015-09-27) | Various numerical facilities, including DSP rounding types | Under active development

ISO/IEC DTS 19571:xxxx | Concurrency TS 2 | Early development | Exploring lock-free, hazard pointers, RCU, atomic views, concurrent data structures | Under active development. Possible new clause

ISO/IEC DTS 19570:xxxx | Parallelism TS 2 | Early development. Draft: n4578 (2016-02-22) | Exploring task blocks, progress guarantees, SIMD<T> type, vec/no_vec loop-based execution policy | PDTS ballot now on the remaining part after pushing the seq policy into C++20

ISO/IEC DTS 19841:xxxx | Transactional Memory TS 2 | Early development | Exploring on_commit, in_transaction. Lambda-based executor model. | Under active development.

| Graphics TS | Early development. Draft: p0267r0 (2016-02-12) | 2D drawing API using the Cairo interface, adding a stateless interface | LEWG reviewed, but it has now become controversial

| Library Fundamentals V3 | Initial draft, early development | Maybe mdspan and expected<T> | Under development

© 2018 Codeplay Software Ltd.27

Status after Mar JAX C++ Meeting

ISO number | Name | Status | What is it? | C++20?

ISO/IEC DTS 22277:2017 | Coroutines TS | Published 2017-11. Draft: n4663 (2017-03-25) | Resumable functions, based on Microsoft's await design | Published

| Reflection TS | Early development. Draft: p0194r2 (2016-10-15) with rationale in p0385r2 (2017-02-06). Alternative: p0590r0 (2017-02-05) | Code introspection and (later) reification mechanisms | Introspection proposal passed core language design review; next stop is design review of the library components. Targeting a Reflection TS.

| Contracts TS | Unified proposal reviewed favourably. | Preconditions, postconditions, etc. | Proposal passed core language design review; next stop is design review of the library components. Targeting C++20.

| Executors TS | Separated from the Concurrency TS; now has a unified proposal. | Describes the how, where, and when of execution. Enables distributed and heterogeneous computing. | Weekly calls, but it has consensus in SG1 and has been reviewed in LEWG and LWG. Now considering how to get it into C++20.

| Heterogeneous Device TS | Affinity, execution context, EH in a concurrent environment, execution-agent local storage | Support heterogeneous devices | Weekly calls. Affinity and EH progressing

© 2018 Codeplay Software Ltd.28

Agenda

A tale of two cities: HPC and Commercial
Transforming from non-heterogeneous to Heterogeneous
Heterogeneous C++ status
Parallel STL on CPU and GPU today
Vector Loop in C++ and OpenMP
Collaborating across standards
Backup: Executors and Affinity in C++

© 2018 Codeplay Software Ltd.29

Parallelize and vectorize example


© 2018 Codeplay Software Ltd.30

C++17 Parallel STL: Democratizing Parallelism in C++What is Parallel STL?

Parallel STL greatly facilitates the usage of parallelism in C++ by exposing a parallel interface for the STL algorithms.

Why do I care?

Hardware architecture is becoming increasingly parallel. You cannot escape. See Herb Sutter's The Free Lunch Is Over, which is now more than 10 years old! A more recent update: Welcome to the Jungle.

What does it include?

It adds wording for parallel execution to the C++ standard, and Execution Policies to the STL interface that enable selecting the appropriate level of parallelism.

New parallel algorithms are also added to the interface.

What do I take from this talk?

You will understand what Parallel STL is and learn the basics of using it. You'll also be ready to use the SYCL Parallel STL on your accelerator.

C++ goes parallel!

© 2018 Codeplay Software Ltd.31

Sorting with the STL

Normal sequential sort algorithm:

std::vector<int> data = { 8, 9, 1, 4 };

std::sort(std::begin(data), std::end(data));

if (std::is_sorted(std::begin(data), std::end(data))) {
  std::cout << "Data is sorted!" << std::endl;
}

An extra parameter to the STL algorithm enables parallelism:

std::vector<int> data = { 8, 9, 1, 4 };

std::sort(std::execution::par, std::begin(data), std::end(data));

if (std::is_sorted(std::begin(data), std::end(data))) {
  std::cout << "Data is sorted!" << std::endl;
}

© 2018 Codeplay Software Ltd.32

The Execution Policy: standard policy classes

● Defined in the std::execution namespace
○ Sequenced policy
■ Never parallel; sequenced, in-order execution
■ constexpr sequenced_policy seq;
○ Parallel policy
■ Can use the caller thread but may spawn others (std::thread)
■ Invocations do not interleave on a single thread
■ constexpr parallel_policy par;
○ Parallel unsequenced policy
■ Can use the caller thread or others (e.g. std::thread)
■ Multiple invocations may be interleaved on a single thread
■ constexpr parallel_unsequenced_policy par_unseq;

© 2018 Codeplay Software Ltd.33

Many different existing implementations

Available today
● Microsoft: http://parallelstl.codeplex.com
● HPX: http://stellar-group.github.io/hpx/docs/html/hpx/manual/parallel.html
● HSA: http://www.hsafoundation.com/hsa-for-math-science
● Thibaut Lutz: http://github.com/t-lutz/ParallelSTL
● NVIDIA: https://thrust.github.io/doc/group__execution__policies.html
● Codeplay: http://github.com/KhronosGroup/SyclParallelSTL
● Clang: Not yet available

Expect major C++ compilers to implement it soon!

© 2018 Codeplay Software Ltd.34

Using execution policies

using namespace std::execution;

// May execute in parallel
std::sort(par, std::begin(data), std::end(data));
// May be parallelized and vectorized
std::sort(par_unseq, std::begin(data), std::end(data));
// Will not be parallelized/vectorized
std::sort(seq, std::begin(data), std::end(data));
// Vendor-specific policy, read their documentation!
std::sort(custom_vendor_policy, std::begin(data), std::end(data));

© 2018 Codeplay Software Ltd.35

Propagating the policy to the end user

using namespace std::execution;

template <typename Policy, typename Iterator>
void library_function(Policy p, Iterator begin, Iterator end) {
  std::sort(p, begin, end);
  std::for_each(p, begin, end, [](auto &e) { e++; });
  std::for_each(seq, begin, end, non_parallel_operation);
}

© 2018 Codeplay Software Ltd.36

Parallel overloads available

© 2018 Codeplay Software Ltd.37

New algorithms into the STL: Parallel For Each

template <class ExecutionPolicy, class InputIterator, class Function>
void for_each(ExecutionPolicy &&exec, InputIterator first, InputIterator last, Function f);

template <class ExecutionPolicy, class InputIterator, class Size, class Function>
InputIterator for_each_n(ExecutionPolicy &&exec, InputIterator first, Size n, Function f);

template <class InputIterator, class Size, class Function>
InputIterator for_each_n(InputIterator first, Size n, Function f);

● for_each: Applies f to elements in the range [first, last).
● for_each_n: Applies f to elements in [first, first + n).

© 2018 Codeplay Software Ltd.38

New algorithms into the STL

Numerical parallel algorithms

template <class InputIterator>
typename iterator_traits<InputIterator>::value_type
reduce(InputIterator first, InputIterator last);

template <class InputIterator, class T>
T reduce(InputIterator first, InputIterator last, T init);

template <class InputIterator, class T, class BinaryOperation>
T reduce(InputIterator first, InputIterator last, T init, BinaryOperation binary_op);

Implements a reduction operation (the order in which binary_op is applied is not specified). The sequential equivalent is accumulate.

© 2018 Codeplay Software Ltd.39

New algorithms into the STL (Serial Reduction pattern)

[Diagram: the elements 0…6 are added one after another, giving 21]

size_t nElems = 1000u;
std::vector<float> nums(nElems);

std::accumulate(std::begin(nums), std::end(nums), 0.0f);

Only one core is used for the different additions.

© 2018 Codeplay Software Ltd.40

New algorithms into the STL (Parallel Reduction pattern)

[Diagram: the elements 0…6 are added pairwise in a tree, giving 21]

size_t nElems = 1000u;
std::vector<float> nums(nElems);

std::reduce(std::execution::par, std::begin(nums), std::end(nums), 0.0f);

If the operation is commutative and associative, it can be run in parallel. The reduction uses all cores!

© 2018 Codeplay Software Ltd.41

Transform

● transform (a.k.a. map) applies a function to an input range and stores the result in the output range. The operation may complete out of order.

std::transform(std::execution::par,
               v1.begin(), v1.end(), v2.begin(), output.begin(),
               [=](int val1, int val2) { return val1 + val2 + 1; });

© 2018 Codeplay Software Ltd.42

Transform reduce

● transform_reduce applies a function to an input range and then applies a binary operation to reduce the resulting values

© 2018 Codeplay Software Ltd.43

Transform Reduce example

[Diagram: a multiplication is applied to each pair of elements of v, the products are then reduced by addition starting from the neutral element of the reduction, giving 1275.4866…]

© 2018 Codeplay Software Ltd.44

What can I do with a Parallel For Each?

Intel Core i7, 7th generation

size_t nElems = 1000u;
std::vector<float> nums(nElems);

std::fill_n(std::begin(nums), nElems, 1.0f);

std::for_each(std::begin(nums), std::end(nums),
              [](float &f) { f = f * f + f; });

A traditional for_each uses only one core; the rest of the die is unutilized!

[Diagram: 10000 elems, all on a single core]

© 2018 Codeplay Software Ltd.45

What can I do with a Parallel For Each?

Intel Core i7, 7th generation

size_t nElems = 1000u;
std::vector<float> nums(nElems);

std::fill_n(std::execution::par, std::begin(nums), nElems, 1.0f);

std::for_each(std::execution::par, std::begin(nums), std::end(nums),
              [](float &f) { f = f * f + f; });

The workload is distributed across cores!
(Mileage may vary; implementation-specific behaviour)

[Diagram: 2500 elems on each of four cores]

© 2018 Codeplay Software Ltd.46

What can I do with a Parallel For Each?

Intel Core i7, 7th generation

size_t nElems = 1000u;
std::vector<float> nums(nElems);

std::fill_n(std::execution::par, std::begin(nums), nElems, 1.0f);

std::for_each(std::execution::par, std::begin(nums), std::end(nums),
              [](float &f) { f = f * f + f; });

The workload is distributed across cores!
(Mileage may vary; implementation-specific behaviour)

[Diagram: 2500 elems on each of four cores. What about the GPU part of the die?]

© 2018 Codeplay Software Ltd.47

What can I do with a Parallel For Each?

Intel Core i7, 7th generation

size_t nElems = 1000u;
std::vector<float> nums(nElems);

std::fill_n(sycl_policy, std::begin(nums), nElems, 1.0f);

std::for_each(sycl_named_policy<class KernelName>,
              std::begin(nums), std::end(nums),
              [](float &f) { f = f * f + f; });

The workload is distributed on the GPU cores.
(Mileage may vary; implementation-specific behaviour)

[Diagram: 10000 elems, all on the GPU]

© 2018 Codeplay Software Ltd.48

What can I do with a Parallel For Each?

Intel Core i7, 7th generation

size_t nElems = 1000u;
std::vector<float> nums(nElems);

std::fill_n(sycl_heter_policy(cpu, gpu, 0.5), std::begin(nums), nElems, 1.0f);

std::for_each(sycl_heter_policy<class kName>(cpu, gpu, 0.5),
              std::begin(nums), std::end(nums),
              [](float &f) { f = f * f + f; });

Experimental! The workload is distributed across all cores.
(Mileage may vary; implementation-specific behaviour)

[Diagram: 5000 elems on the GPU, 1250 elems on each of four CPU cores]

© 2018 Codeplay Software Ltd.49

Parallel overloads available in SYCL Parallel STL

© 2018 Codeplay Software Ltd.50

SYCL for OpenCL

➢ Cross-platform, single-source, high-level C++ programming layer
➢ Built on top of OpenCL and based on standard C++14

© 2018 Codeplay Software Ltd.51

Enabling Machine Learning Frameworks on OpenCL and Modern C++ 11, 14, 17, 20, 23, ...

© 2018 Codeplay Software Ltd.52

The SYCL Ecosystem

[Diagram: a C++ Application sits on C++ Template Libraries, which sit on SYCL for OpenCL, which sits on OpenCL, targeting CPU, GPU, APU, FPGA, DSP and other accelerators]

© 2018 Codeplay Software Ltd.53

Example: Vector Add

© 2018 Codeplay Software Ltd.54

Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> &out) {

}

© 2018 Codeplay Software Ltd.55

Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> &out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
}

The buffers synchronise upon destruction.

© 2018 Codeplay Software Ltd.56

Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> &out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
  cl::sycl::queue defaultQueue;
}

© 2018 Codeplay Software Ltd.57

Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> &out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
  cl::sycl::queue defaultQueue;
  defaultQueue.submit([&] (cl::sycl::handler &cgh) {

  });
}

Create a command group to define an asynchronous task.

© 2018 Codeplay Software Ltd.58

Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> &out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
  cl::sycl::queue defaultQueue;
  defaultQueue.submit([&] (cl::sycl::handler &cgh) {
    auto inputAPtr = inputABuf.get_access<cl::sycl::access::mode::read>(cgh);
    auto inputBPtr = inputBBuf.get_access<cl::sycl::access::mode::read>(cgh);
    auto outputPtr = outputBuf.get_access<cl::sycl::access::mode::write>(cgh);
  });
}

© 2018 Codeplay Software Ltd.59

Example: Vector Add

#include <CL/sycl.hpp>

template <typename T> class kernel;

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> &out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
  cl::sycl::queue defaultQueue;
  defaultQueue.submit([&] (cl::sycl::handler &cgh) {
    auto inputAPtr = inputABuf.get_access<cl::sycl::access::mode::read>(cgh);
    auto inputBPtr = inputBBuf.get_access<cl::sycl::access::mode::read>(cgh);
    auto outputPtr = outputBuf.get_access<cl::sycl::access::mode::write>(cgh);
    cgh.parallel_for<kernel<T>>(cl::sycl::range<1>(out.size()),
        [=](cl::sycl::id<1> idx) {

    });
  });
}

You must provide a name for the lambda. Create a parallel_for to define the device code.

© 2018 Codeplay Software Ltd.60

Example: Vector Add

#include <CL/sycl.hpp>

template <typename T> class kernel;

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> &out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
  cl::sycl::queue defaultQueue;
  defaultQueue.submit([&] (cl::sycl::handler &cgh) {
    auto inputAPtr = inputABuf.get_access<cl::sycl::access::mode::read>(cgh);
    auto inputBPtr = inputBBuf.get_access<cl::sycl::access::mode::read>(cgh);
    auto outputPtr = outputBuf.get_access<cl::sycl::access::mode::write>(cgh);
    cgh.parallel_for<kernel<T>>(cl::sycl::range<1>(out.size()),
        [=](cl::sycl::id<1> idx) {
      outputPtr[idx] = inputAPtr[idx] + inputBPtr[idx];
    });
  });
}

© 2018 Codeplay Software Ltd.61

Example: Vector Add

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> &out);

int main() {
  std::vector<float> inputA = { /* input a */ };
  std::vector<float> inputB = { /* input b */ };
  std::vector<float> output = { /* output */ };

  parallel_add(inputA, inputB, output);
  ...
}

© 2018 Codeplay Software Ltd.62

Implementing Parallel STL with SYCL

/* sycl_execution_policy.
 * The sycl_execution_policy enables algorithms to be executed using
 * a SYCL implementation.
 */
template <class KernelName = DefaultKernelName>
class sycl_execution_policy {
  cl::sycl::queue m_q;

 public:
  // The kernel name when using lambdas
  using kernelName = KernelName;

  sycl_execution_policy() = default;
  // Creates a SYCL policy using an existing queue
  sycl_execution_policy(cl::sycl::queue q) : m_q(q){};
  sycl_execution_policy(const sycl_execution_policy&) = default;

  // Returns the name of the kernel as a string
  // (typeid information is only valid for debugging)
  std::string get_name() const { return typeid(kernelName).name(); };

  // Returns the queue, if any
  cl::sycl::queue get_queue() const { return m_q; }

© 2018 Codeplay Software Ltd.63

Implementing Parallel STL with SYCL

/* for_each */
template <class Iterator, class UnaryFunction>
void for_each(Iterator b, Iterator e, UnaryFunction f) {
  impl::for_each(*this, b, e, f);
}

The for_each member function on the policy forwards to the implementation.
The iterator can be any random-access iterator.
Functions can take C++ iterators or SYCL-specific iterators.

© 2018 Codeplay Software Ltd.64

template <class ExecutionPolicy, class Iterator, class UnaryFunction>
void for_each(ExecutionPolicy &sep, Iterator b, Iterator e, UnaryFunction op) {
  // Obtain the queue from the policy
  cl::sycl::queue q(sep.get_queue());
  // Obtain device parameters
  auto device = q.get_device();
  size_t localRange =
      device.get_info<cl::sycl::info::device::max_work_group_size>();
  // Prepare allocations on the device
  auto bufI = sycl::helpers::make_buffer(b, e);
  auto vectorSize = bufI.get_count();
  size_t globalRange = sep.calculateGlobalSize(vectorSize, localRange);

Continues...

© 2018 Codeplay Software Ltd.65

  // Device lambda
  auto f = [vectorSize, localRange, globalRange, &bufI, op](
               cl::sycl::handler &h) mutable {
    cl::sycl::nd_range<1> r{
        cl::sycl::range<1>{std::max(globalRange, localRange)},
        cl::sycl::range<1>{localRange}};
    auto aI = bufI.template get_access<cl::sycl::access::mode::read_write>(h);
    h.parallel_for<typename ExecutionPolicy::kernelName>(
        r, [aI, op, vectorSize](cl::sycl::nd_item<1> id) {
          // Apply the user functor
          if (id.get_global(0) < vectorSize) {
            op(aI[id.get_global(0)]);
          }
        });
  };
  // Submit for execution on the device
  q.submit(f);
}

© 2018 Codeplay Software Ltd.66

Demo Results - Running std::sort
(Running on Intel i7 6600 CPU & Intel HD Graphics 520)

size | 2^16 | 2^17 | 2^18 | 2^19
std::seq | 0.27031s | 0.620068s | 0.669628s | 1.48918s
std::par | 0.259486s | 0.478032s | 0.444422s | 1.83599s
std::unseq | 0.24258s | 0.413909s | 0.456224s | 1.01958s
sycl_execution_policy | 0.273724s | 0.269804s | 0.277747s | 0.399634s

© 2018 Codeplay Software Ltd.67

Agenda

A tale of two cities: HPC and Commercial
Transforming from non-heterogeneous to Heterogeneous
Heterogeneous C++ status
Parallel STL on CPU and GPU today
Vector Loop in C++ and OpenMP
Collaborating across standards
Backup: Executors and Affinity in C++

© 2018 Codeplay Software Ltd.68

The parallel and concurrency planets of C++ today

SG1 Par/Con TS

SG5 Transactional Memory TS

SG14 Low Latency

……

© 2018 Codeplay Software Ltd.69

C++1Y (1Y = 17/20/22) SG1/SG5/SG14 TS plan
(red = C++17, blue = C++20?, black = future?)

Parallelism
• Parallel algorithms
• Data-based parallelism (vector, SIMD, ...)
• Task-based parallelism (cilk, OpenMP, fork-join)
• Loop-based execution policy
• Execution agents
• Progress guarantees
• MapReduce
• Pipelines

Concurrency
• Future++ (then, wait_any, wait_all)
• Resumable functions, await (with futures)
• Lock-free techniques/transactions
• Synchronics
• Atomic views
• Counters/queues
• Concurrent vector/unordered associative containers
• Latches and barriers
• upgrade_lock
• Atomic smart pointers
• Executors
• Coroutines
• Networking

© 2018 Codeplay Software Ltd.70

Two "Possible" Universes

C++20 (barely possible)
• Executors Lite
• Networking

C++23
• Executors
• Networking
• More algorithm policies
• Futures
• Async
• Affinity
• EATLS
• EH in a concurrent environment

© 2018 Codeplay Software Ltd.71

Use the Proper Abstraction with C++20

Abstraction | How is it supported
Cores | C++11/14/17 threads, async
HW threads | C++11/14/17 threads, async
Vectors | Parallelism TS2 -> C++20
Atomics, fences, lock-free, futures, counters, transactions | C++11/14/17 atomics, Concurrency TS1 -> C++20, Transactional Memory TS1
Parallel loops | async, TBB::parallel_invoke, Parallelism TS1, C++17 parallel algorithms
Heterogeneous offload, FPGA | OpenCL, SYCL, HSA, OpenMP/ACC, Kokkos, Raja; P0796 on affinity
Distributed | HPX, MPI, UPC++; P0796 on affinity
Caches | OpenMP affinity/places; not supported in C++; executors, execution context, affinity; P0796 on affinity; P0443 -> Executor TS or IS20
NUMA | OpenMP affinity/places; not supported in C++; executors, execution context, affinity; P0796 on affinity; P0443 -> Executor TS or IS20
TLS | EATLS, P0772
Exception handling in a concurrent environment | EH reduction properties, P0797

© 2018 Codeplay Software Ltd.72

Agenda

A tale of two cities: HPC and Commercial
Transforming from non-heterogeneous to Heterogeneous
Heterogeneous C++ status
Parallel STL on CPU and GPU today
Vector Loop in C++ and OpenMP
Collaborating across standards
Backup: Executors and Affinity in C++

© 2018 Codeplay Software Ltd.73

Problems of Transforming from a non-Heterogeneous to a Heterogeneous Language

• The three data problems, even after you have exposed parallelism:
• Data Movement: the COST! But is it implicit or explicit?
• Data Layout: AoS vs SoA, coalesced memory accesses; how about data tiling?
• Data Addressing: partitioned addressing; how do you treat pointers; can you share arrays across CPUs/GPUs?

• Other general issues:
• What is a thread? Is it a system thread, a GPU thread, an OpenMP/OpenCL thread, a thread of execution, an execution agent?
• Memory model: is it flat, abstract? Or scoped based on the warp, team, …
• Thread-local storage: do GPUs convoy startups?
• Scheduling: dependencies …
• More specific hardware kinds: FPGAs, DSPs, streaming, cloud and Tensor Processing Units?
• Forward-progress guarantees and weak execution agents
• Dynamic online/offline device …
• How to fill a GPU team, block, warp

• More specific C++ issues:
• Separation of concerns: what, where, how, and when is it executed?
• Affinity: both CPU and memory affinity
• Exceptions and error model: how to have concurrent exceptions
• Templates: how do you support compile-time polymorphism, type erasure, AI
• Polymorphism: supporting everything on a GPU vs a CPU

© 2018 Codeplay Software Ltd.74

Unified interface for execution

SYCL / OpenCL / CUDA / HCC | OpenMP / MPI | C++ Thread Pool | Boost.Asio / Networking TS

future::then, async, invoke, post, defer, define_task_block, dispatch, strand<>, asynchronous operations, parallel algorithms

© 2018 Codeplay Software Ltd.75

Separation of concerns with executors

[Diagram: an execution resource underlies an execution context; the context provides an executor, which runs an execution function as an instruction stream across many lightweight execution agents]

● An execution context is responsible for managing an execution resource

● An execution context provides an executor for executing work on its managed execution resource

● An execution context manages a number of lightweight execution agents

© 2018 Codeplay Software Ltd.76

OpenMP without Pragmas

The OpenMP work-sharing loop construct is written as:

#pragma omp for [clause[[,] clause] ...] new-line
  for-loop

and is bound to a parallel region that looks as follows:

#pragma omp parallel [clause[[,] clause] ...] new-line
  structured-block

while both constructs can be combined into the following:

#pragma omp parallel for [clause[[,] clause] ...] new-line
  for-loop

With C++ attributes instead of pragmas, the loop construct could look like this:

for [[omp::for(clause, clause), …]] (loop-head)
  loop-body

The enclosing parallel region would look like this:

using [[omp::parallel(clause, clause), …]]
{ }

© 2018 Codeplay Software Ltd.77

Learning from each other

C++ from OpenMP:
• Affinity & NUMA model
• SIMD model
• Learning from the shared code drop to clang
• Leverage the same OpenMP clang fat binary
• Build uniform tools for debugging and performance analysis

OpenMP from C++:
• Separation of concerns
• Attributes over pragmas
• Futures
• Thread-of-execution model
• Error model, better C++ support
• Consumer/producer model
• Academic membership model
• Local national chapters

OpenCL from OpenMP

OpenMP from OpenCL

© 2018 Codeplay Software Ltd.78

Another possible future

• In time, we will move accelerator code and data segments into object layouts,
  such that OSes can load them directly onto various devices

Thank you for listening

@codeplaysoft | codeplay.com | info@codeplay.com
