

HSA FOR THE COMMON MAN

Vinod Tipparaju, Heterogeneous System Software, AMD

Lee Howes, Heterogeneous System Software, AMD


THE HETEROGENEOUS SYSTEM ARCHITECTURE

Taking the platform to programmers


OPENCL™ AND HSA

HSA is an optimized platform architecture for OpenCL™

– Not an alternative to OpenCL™

OpenCL™ on HSA will benefit from

– Avoidance of wasteful copies

– Low latency dispatch

– Improved memory model

– Pointers shared between CPU and GPU

HSA also exposes a lower level programming interface, for those that want the ultimate in control and performance

– Optimized libraries may choose the lower level interface


HSA TAKING PLATFORM TO PROGRAMMERS

Balance between CPU and GPU for performance and power efficiency

Make GPUs accessible to wider audience of programmers

– Programming models close to today’s CPU programming models

– Enabling more advanced language features on GPU

– Shared virtual memory enables complex pointer-containing data structures (lists, trees, etc) and hence more applications on GPU

– Kernel can enqueue work to any other device in the system (e.g. GPU->GPU, GPU->CPU)

• Enabling task-graph style algorithms, Ray-Tracing, etc

Clearly defined HSA memory model enables effective reasoning for parallel programming

HSA provides a compatible architecture across a wide range of programming models and HW implementations.
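The shared-virtual-memory point above can be illustrated with a plain C++ sketch: a pointer-linked structure built by host code is traversed by a function standing in for a device kernel, with no marshalling. The types and function names here are hypothetical; on an HSA system the same node pointers would simply be valid on both the CPU and the GPU.

```cpp
#include <cstddef>

// A host-built, pointer-containing structure (the lists/trees mentioned above).
struct ListNode {
    int       value;
    ListNode* next;
};

// Stand-in for a device kernel: chases host-built pointers directly.
// With HSA shared virtual memory, the pointers need no translation or copy.
int sum_list(const ListNode* head) {
    int total = 0;
    for (const ListNode* n = head; n != nullptr; n = n->next)
        total += n->value;
    return total;
}

// Host-side helper that links nodes stored in caller-provided storage.
ListNode* make_list(const int* values, std::size_t count, ListNode* storage) {
    for (std::size_t i = 0; i < count; ++i) {
        storage[i].value = values[i];
        storage[i].next  = (i + 1 < count) ? &storage[i + 1] : nullptr;
    }
    return count ? &storage[0] : nullptr;
}
```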


HOW DO WE DELIVER THE HSA VALUE PROPOSITION?

Overall vision:

– Make the GPU easily accessible

Support mainstream languages

Expandable to domain-specific languages

– Make compute offload efficient

Direct path to GPU (avoid graphics overhead)

Eliminate memory copy

Low-latency dispatch

– Make it ubiquitous

Drive HSA as a standard through the HSA Foundation

Open source key components

HSA SOFTWARE STACK

[Diagram, top to bottom:]

– Applications

– Application and system languages, domain-specific languages, etc. (e.g. OpenCL™, C++ AMP, Python, R, JS)

– HSA Runtime

– LLVM IR / HSAIL

– HSA Hardware


HSA EXECUTION MODEL VIA HSA RUNTIME

HSA Runtime: user-mode work queues

– Uniform abstraction across devices, simple insertion mechanism

– Multi-level parallelism -- within a queue and across queues

Simple parallelism specifier

– Range/Grid, and group

– HW specifics have a simple abstraction

Analogous to programming based on cache-line size

– Implicit preemption – launch and execute multiple tasks simultaneously

[Diagram: several user-mode queues, each with the user writing packets and the device reading them.]
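The user-write/device-read queue can be sketched as a simple ring buffer: the producer stores packets and bumps a write index with no OS transition on the submit path, and the device drains at a read index. The packet layout and class names here are illustrative stand-ins, not the actual HSA queue format.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative packet; a real architected packet has a fixed, richer layout.
struct Packet {
    uint32_t kernel_id;
    uint64_t args;
};

class UserModeQueue {
public:
    explicit UserModeQueue(std::size_t capacity)
        : ring_(capacity), write_(0), read_(0) {}

    // "User write": a plain store into shared memory, no syscall needed.
    bool push(const Packet& p) {
        if (write_ - read_ == ring_.size()) return false;  // queue full
        ring_[write_ % ring_.size()] = p;
        ++write_;  // the device watches this index for new work
        return true;
    }

    // "Device read": the consumer drains packets at its own pace.
    bool pop(Packet& out) {
        if (read_ == write_) return false;  // queue empty
        out = ring_[read_ % ring_.size()];
        ++read_;
        return true;
    }

    std::size_t pending() const { return write_ - read_; }

private:
    std::vector<Packet> ring_;
    std::size_t write_, read_;
};
```

A single-threaded model, of course; the real mechanism relies on platform atomics for the indices.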


HSA MEMORY MODEL VIA HSA RUNTIME

Key concepts

– Simplified view of memory

– Sharing pointers across devices is possible

Makes it possible to run a task on any device

Possible to use pointers and data structures that require pointer chasing correctly across device boundaries

– Relaxed consistency memory model

Acquire/release

Barriers

HSA Runtime exposes allocation interfaces with control over memory attributes

– Types of memories can be mixed and matched based on usage needs

Simplified launches – dispatch(task, arg1, arg2, …)

– Run device tasks with stack memory
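The "dispatch(task, arg1, arg2, …)" launch style above can be mimicked with a variadic template. A real HSA runtime would marshal the arguments into a queue packet; this sketch simply binds and runs the task, including a task whose argument points into the caller's stack (which the shared address space makes legitimate). All names are hypothetical.

```cpp
#include <utility>

// Simplified dispatch: in a real runtime the bound call would become a
// packet on a user-mode queue; here it is invoked synchronously so the
// sketch stays self-contained.
template <typename Task, typename... Args>
auto dispatch(Task&& task, Args&&... args) {
    return std::forward<Task>(task)(std::forward<Args>(args)...);
}

// A "device task" that reads and writes through a pointer into the
// caller's stack memory, as the slide describes.
int add_into(int* accumulator, int v) {
    *accumulator += v;
    return *accumulator;
}
```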


ARCHITECTED QUEUES AND DEVICE-TO-DEVICE ENQUEUE

Kernel can enqueue work to any other queue in the system (e.g. GPU->GPU, GPU->CPU)

• Enabling task-graph style algorithms, Ray-Tracing, etc

• Queue is an architected feature

• Format of what represents a queue is architected

• Methods to enqueue follow

• Decoupled from HSAIL language

• A unique way of dynamically specifying where enqueues are directed

• Resolution at the time of execution permits many load-balancing solutions

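The idea that the queue format and enqueue method are architected, so any agent can enqueue to any queue it knows, can be sketched as follows. The queue registry, packet fields, and function names are hypothetical stand-ins for the architected definitions.

```cpp
#include <cstdint>
#include <deque>
#include <map>

// Illustrative fixed packet format; because the format is architected,
// every agent in the system can produce and consume it.
struct DispatchPacket {
    uint32_t kernel_id;
    uint32_t grid_size;
};

// Registry of queues by id: any agent that knows an id can enqueue to it.
std::map<int, std::deque<DispatchPacket>> queues;

// The architected enqueue step. The destination is resolved at execution
// time, which is what permits the load-balancing schemes mentioned above.
void enqueue(int target_queue, DispatchPacket p) {
    queues[target_queue].push_back(p);
}

// A "kernel" that enqueues follow-up work to another device's queue
// (GPU->GPU or GPU->CPU in the slide's terms).
void producer_kernel(int consumer_queue) {
    enqueue(consumer_queue, DispatchPacket{42, 256});
}
```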



ABSTRACTING ARCHITECTED FEATURES

Example operations:

– CreateQueue(ptr, size, …);

– for (i = 1, n) queue.dispatch(kernel, args, dep n+i, n-i);

– queue.dispatch(1minuteKernel, args)

– event.wait(); event.getExceptionDetails();

– Call CPU_FUNCTION_FROM_GPU; *fptr(…)

– queue.dispatch(kernel, iptr); *iptr = 2;

– queue.dispatch(kernel_set_i_value_1, iptr); while (i == 1);

– HSAAllocate(1); LDS/GDS as virtual memory

– Access any address from host/kernel

– Do atomics on the queue, in host and in kernel

Architected features:

– Channels

– User Mode Queuing

– Context Switching

– Process Reset (to avoid TDRs)

– HW Exceptions

– Function calls

– Virtual functions

– Memory Coherence

– Unpinned Memory Access (for DMA and Compute Shader)

– Flat Address Space

– Unaligned Addressing / Memory Access

– Platform Atomic Operations

– Memory Watchpoints


ENABLING DIFFERENT KINDS OF PROBLEM DOMAINS

Memory model

HSAIL Language

Execution Model

Architected Features

Utilize combination of characteristics per application requirements

[Diagram: programming models are composed from subsets of the architected features, layered over the common memory model, HSAIL language, and execution model — e.g. Model 1 uses Architected Features 1–3, Model 2 uses Features 1, 3, and 4, and Model 3 uses Features 1 and 4. The software stack from the earlier slide (Applications; application, system, and domain-specific languages such as OpenCL™, C++ AMP, Python, R, JS; HSA Runtime; LLVM IR / HSAIL; HSA Hardware) is shown alongside.]


EXPOSING DATAFLOW THROUGH DEVICE-SIDE ENQUEUE


CHANNELS - PERSISTENT CONTROL PROCESSOR THREADING MODEL

Add data-flow support to GPGPU

We are not primarily presenting this as producer/consumer kernel bodies

– That is, we are not promoting a method where one kernel loops producing values and another loops to consume them

– That has the negative behavior of promoting long-running kernels

– We have tried to avoid this elsewhere by basing in-kernel launches around continuations rather than waiting on children

Instead we assume that kernel entities produce/consume, but consumer work-items are launched on demand

An alternative to point-to-point dataflow using persistent threads, avoiding the uber-kernel


OPERATIONAL FLOW

[Diagram sequence, shown across several animation frames: a Channel, a Command Queue, a Kernel, and a CP (Control Processor) Scheduler. Running work items write values into the channel; when the work items complete and the channel's trigger condition holds, the CP scheduler triggers a dispatch onto the command queue; the consumer kernel is launched and consumes the channel contents; subsequent writes then begin the next cycle.]


CHANNEL EXAMPLE

std::function<bool (opp::Channel<int>*)> predicate =
    [] (opp::Channel<int>* c) -> bool __device(fql) {
        return c->size() % PACKET_SIZE == 0;
    };

opp::Channel<int> b(N);
b.executeWith(predicate, opp::Range<1>(CHANNEL_SIZE),
    [&sumB] (opp::Index<1>) __device(opp) { sumB++; });

opp::Channel<int> c(N);
c.executeWith(predicate, opp::Range<1>(CHANNEL_SIZE),
    [&sumC] (opp::Index<1>, const int v) __device(opp) { sumC += v; });

opp::parallelFor(opp::Range<1>(N),
    [a, &b, &c] (opp::Index<1> index) __device(opp) {
        unsigned int n = *(a + index.getX());
        if (n > 5) { b.write(n); } else { c.write(n); }
    });
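The channel mechanics in the example can be modeled with a small host-side sketch: writes accumulate in a buffer, and whenever the user-supplied predicate holds (e.g. the size reaching a multiple of a packet size), the "scheduler" launches the consumer over the buffered batch. This Channel class, its member names, and the synchronous trigger are simplified stand-ins for the opp:: API above, not its real implementation.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

template <typename T>
class Channel {
public:
    using Predicate = std::function<bool(const Channel*)>;
    using Consumer  = std::function<void(const std::vector<T>&)>;

    // Register the trigger condition and the consumer "kernel".
    void executeWith(Predicate pred, Consumer consume) {
        pred_ = std::move(pred);
        consume_ = std::move(consume);
    }

    std::size_t size() const { return buffer_.size(); }

    void write(T v) {
        buffer_.push_back(v);
        // Stand-in for the CP scheduler: after each write, check the
        // trigger condition and launch the consumer over the batch.
        if (pred_ && pred_(this)) {
            consume_(buffer_);
            buffer_.clear();
        }
    }

private:
    std::vector<T> buffer_;
    Predicate pred_;
    Consumer consume_;
};
```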


EXAMPLE PROBLEMS: Rigid body/cloth collision


CLOTH SIMULATION AND COLLISION DETECTION

Physics simulation has a range of properties

Rigid body simulation is often

– Not highly parallel

– Very dynamic

– Not necessarily a good match for wide SIMD architectures

Cloth simulation is

– Highly parallel

– While meshes are complicated, connectivity is largely static


EFFICIENT GPU CLOTH SIMULATION: TWO-LEVEL BATCHING

Offline static batching of the mesh

Create independent subsets of links through graph coloring.

Synchronize between batches

10 batches


EFFICIENT GPU CLOTH SIMULATION: BATCHING

Chunk mesh into larger groups of links

Batch those chunks

– 4 global dispatches

Iterate within the workgroups

– 8 secondary batches

4 batches 8 secondary batches
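The offline batching step can be sketched as a greedy edge coloring: assign each cloth link (an edge between two particles) to the first batch in which it shares no particle with an existing link, so each batch can be solved in parallel with synchronization between batches. Greedy coloring is one simple way to do this; the slides do not prescribe a specific algorithm, and the types here are illustrative.

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// A link joins two particles, identified by index.
using Link = std::pair<int, int>;

std::vector<std::vector<Link>> batchLinks(const std::vector<Link>& links) {
    std::vector<std::vector<Link>> batches;
    std::vector<std::set<int>> used;  // particles already touched, per batch
    for (const Link& l : links) {
        // Find the first batch whose links share no particle with this one.
        std::size_t b = 0;
        while (b < batches.size() &&
               (used[b].count(l.first) || used[b].count(l.second)))
            ++b;
        if (b == batches.size()) {  // no independent batch: open a new one
            batches.emplace_back();
            used.emplace_back();
        }
        batches[b].push_back(l);
        used[b].insert(l.first);
        used[b].insert(l.second);
    }
    return batches;
}
```

Within a batch, every link touches disjoint particles, so all its constraint solves are independent and can run as one parallel dispatch.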


COLLISION WITH RIGID BODY

Small set of rigid bodies

Rigid bodies best computed on the CPU

Cloth on GPU


OPTIONS

Either

– Small launches of rigid body/cloth collisions against cloth

– Process rigid body/cloth collisions on CPU

On GPU:

– Small launches suffer dispatch overhead

– Must update rigid body data structures from the GPU

On CPU:

– Must continuously move cloth mesh data to and from GPU

[Diagram: in both options, RB solve and cloth solve run in parallel, followed by cloth/RB collide — with the work split differently between CPU and GPU.]



WHY HSA?

Colliding rigid bodies are likely to be very sparse in memory

– Do not want to copy the rigid body array to the GPU “just in case”

– Do not even want to incur OS page lock overhead

– Accessing targeted virtual addresses as necessary reduces the overhead

HSA runtime exposes the architected feature and permits use of any memory either as an argument or otherwise in a task

Operations are in a tight loop

– Overhead from dispatch grows quickly

– User mode queuing reduces this significantly

Architected queues are exposed via simple API

Do not want to transform rigid body code

– It is a common problem to apply vast, confusing transformations to host code to enable wide vector processing

Shared pointer model enables access to those structures directly rather than restructuring them


HOW DO YOU TRIGGER CLOTH/RB COLLIDE?

[Diagram: RB solve and cloth solve in parallel, then cloth/RB collide, split across CPU and GPU.]

Host-driven (CPU waits):

CS = dispatch(cloth_solve, x, y, …)

CPU does RB_solve(p, q, …)

Wait for CS

dispatch(cloth_RB_collide, x, p, …)

With HSA dependencies:

RB = dispatch(RB_solve, p, q, …)

CS = dispatch(cloth_solve, x, y, …)

Collide = dispatch(cloth_RB_collide, x, p, …, RB & CS)


IN PSEUDOCODE

batch rigid bodies and RB/cloth pairs

dispatch cloth solver to GPU

dispatch cloth/rigid body collision solver to GPU, pending event

foreach rigid body batch:

    for rigid body pair in batch:

        compute force

        update position and velocity of rigid body

signal GPU event

foreach rigid body not involved in cloth collision:

    update positions

return to next iteration

Cloth solver:

for iteration towards convergence:

    foreach cloth link batch:

        foreach cloth link subbatch:

            update positions and velocities

Cloth/RB solver:

for each batch of RB/cloth pairs:

    read rigid body data directly from data structures used by the CPU

    test cloth against RB and update cloth

    read/write updates to global data (relies on memory visibility rules guaranteed by HSA)
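The dependency-driven launch in this pseudocode can be sketched with a toy scheduler: each dispatch names the events it depends on, and a task runs only once all of them have signaled, so the collide step starts as soon as both solvers finish with no host-side wait. The Scheduler type, event names, and the polling run loop are hypothetical stand-ins for the HSA dependency mechanism.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Scheduler {
    struct Task {
        std::function<void()> body;
        std::vector<std::string> deps;  // events this task waits on
    };
    std::map<std::string, Task> tasks;
    std::map<std::string, bool> done;

    // dispatch(name, body, deps): register a task and its dependencies.
    void dispatch(const std::string& name, std::function<void()> body,
                  std::vector<std::string> deps = {}) {
        tasks[name] = Task{std::move(body), std::move(deps)};
        done[name] = false;
    }

    // Repeatedly run every task whose dependencies have all signaled.
    void run() {
        bool progress = true;
        while (progress) {
            progress = false;
            for (auto& [name, task] : tasks) {
                if (done[name]) continue;
                bool ready = true;
                for (const auto& d : task.deps)
                    if (!done[d]) ready = false;
                if (ready) {
                    task.body();
                    done[name] = true;  // signal this task's event
                    progress = true;
                }
            }
        }
    }
};
```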


EXAMPLE PROBLEMS: Tree search


NESTED DATA PARALLELISM AND EFFICIENT EXECUTION OF UNSTRUCTURED DATA

Perfectly balanced trees are easy:

– If the tree is being regularly rebalanced and stored contiguously then data may be moved around where needed

Large, poorly balanced trees are harder:

– Layout is ambiguous so copying data is challenging

– Amount of parallelism is unpredictable

One approach to deal with this, on a single node:

– Fine-grained tasking

– Share memory infrastructure

– Picture breadth first search through FIFO queues

Example: UTS (Unbalanced Tree Search):

– Count the number of nodes in an implicitly constructed tree

– The tree is parameterized in shape, depth, size, and imbalance


SEARCHING A TREE

As we move through the tree:

– Unpredictable amount of parallelism

– Unpredictable dependence structure

Launch tasks as tree space is available:

– Perform BFS queuing into a buffer

– When the buffer reaches a certain size, launch processing code

– Slowly increase launch batch size to improve efficiency
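The buffered-launch strategy above can be sketched in a few lines: discovered children accumulate in a buffer, and each "launch" drains everything buffered so far. Because the tree widens level by level, later launches naturally cover larger batches, amortizing dispatch overhead. The tree type and the drain-a-full-level policy are illustrative simplifications of the threshold scheme described in the slides.

```cpp
#include <cstddef>
#include <vector>

struct TreeNode {
    std::vector<TreeNode> children;
};

struct SearchStats {
    std::size_t nodes = 0;     // nodes counted (the UTS objective)
    std::size_t launches = 0;  // how many "kernel launches" it took
};

SearchStats searchTree(const TreeNode& root) {
    SearchStats stats;
    std::vector<const TreeNode*> buffer{&root};
    while (!buffer.empty()) {
        // One "launch" processes everything buffered so far.
        std::vector<const TreeNode*> work;
        work.swap(buffer);
        ++stats.launches;
        for (const TreeNode* n : work) {
            ++stats.nodes;
            for (const TreeNode& c : n->children)
                buffer.push_back(&c);  // BFS: queue children for a later launch
        }
    }
    return stats;
}
```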


EXTENDING THE TREE

Many tree search algorithms expand the tree as time goes on

– A lot of overhead in the absence of shared memory

– With shared memory we can be searching parts of the tree while adding to others

Example: Multiresolution Analysis (MRA) is a mathematical technique for approximating a continuous function as a hierarchy of coefficients over a set of basis functions.

– Characterized by dynamic adaptivity to guarantee the accuracy of approximation. Challenges include:

The coefficient trees are unbalanced on account of the adaptive multiresolution properties, leading to different scales of information granularity

The tree structure may be refined in an uncoordinated fashion – different parts of the tree may be refined independently, and the intervals of such refinement are not preset


WHY HSA?

Dynamic adaptive nature means an unpredictable amount of parallelism and an unpredictable dependence structure

– Do not want to copy sections array to the GPU “just in case”

– Accessing targeted virtual addresses as necessary reduces the overhead

HSA runtime exposes the architected feature and permits use of any memory either as an argument or otherwise in a task

Tremendous nesting of parallelism is inherent in the problem

– Single search-node can lead to a very large number of additional searches

– Ability to efficiently do nesting is key and triggering searches based on grouping is important

HSA allows for device-to-device enqueue that permits nested parallelism

Significant load imbalance is possible

– Need to group searches and trigger them when they reach a certain size

Support for dataflow via channels makes this possible

– Need to balance what is already launched due to the unpredictable amount of parallelism

Queues are in user mode, balancing is enabled by the architected features that allow user-level access to a queue


HOW DO YOU BALANCE?

Several user mode queues

– The number of nodes at the start does not represent the real load (imbalance)


IN PSEUDOCODE

Grouping via a channel:

Control process:

– dispatch(unbalanced_tree_kernel, root, …)

GPU/CPU unbalanced_tree_kernel:

– For the next n-1 levels: count

– For each child at level n: insert child into parse channel

Recursive dispatch:

Control process:

– dispatch(unbalanced_tree_kernel, root, …)

– balance and terminate

GPU/CPU unbalanced_tree_kernel:

– For the next n-1 levels: count

– For each child at level n: dispatch(unbalanced_tree_kernel, child, …)
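The recursive-dispatch variant can be sketched as follows: each "kernel" counts its subtree for a fixed number of levels inline, then re-dispatches itself for every node at that frontier, mirroring device-to-device enqueue. The dispatch here is a plain recursive call and a counter stands in for queue submissions; on HSA each would be an enqueue to a device queue.

```cpp
#include <cstddef>
#include <vector>

struct Node {
    std::vector<Node> children;
};

// Counts the n-1 levels below (and including) this node inline, then
// "dispatches" a fresh kernel for each node at the frontier.
std::size_t unbalancedTreeKernel(const Node& root, std::size_t n,
                                 std::size_t* dispatches) {
    std::size_t count = 0;
    std::vector<const Node*> level{&root};
    // Count the next n-1 levels inline.
    for (std::size_t d = 0; d + 1 < n && !level.empty(); ++d) {
        std::vector<const Node*> next;
        for (const Node* node : level) {
            ++count;
            for (const Node& c : node->children) next.push_back(&c);
        }
        level.swap(next);
    }
    // For each node at the frontier: dispatch(unbalanced_tree_kernel, child, …).
    for (const Node* node : level) {
        ++*dispatches;
        count += unbalancedTreeKernel(*node, n, dispatches);
    }
    return count;
}
```

The choice of n trades inline work per kernel against the number of dispatches — the granularity knob the slides describe.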


CONCLUSIONS

HSA has several architected features that improve programmability, and an ecosystem that exposes these features to users effectively

The HSA runtime is how these features are exposed to higher-level programming models

– Composability is possible: a new higher-level model can be composed of multiple architected features

Channels are a unique technology made possible by HSA

– Channels enable many applications that need a dataflow model or dataflow features

Cloth simulation with collision detection is an example that shows how several HSA features both simplify the solution and avoid the unnecessary costs typically involved in using GPUs for this problem

Unbalanced tree search is a domain with an unpredictable amount of parallelism, a major load-balancing problem, and a need to adjust task granularity

– HSA features significantly simplify and allow a natural solution to this problem

– Channels address adjusting the granularity of launches by allowing dataflow patterns that launch a task when data-dependent criteria are met


Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.

© 2012 Advanced Micro Devices, Inc.