isca final presentation - applications

83
HSA APPLICATIONS WEN-MEI HWU, PROFESSOR, UNIVERSITY OF ILLINOIS WITH J.P. BORDES AND JUAN GOMEZ

Upload: hsa-foundation

Post on 19-Jun-2015

471 views

Category:

Technology


2 download

TRANSCRIPT

Page 1: ISCA Final Presentation - Applications

HSA APPLICATIONSWEN-MEI HWU, PROFESSOR, UNIVERSITY OF ILLINOIS

WITH J.P. BORDES AND JUAN GOMEZ

Page 2: ISCA Final Presentation - Applications

USE CASES SHOWING HSA ADVANTAGE

Programming Technique Use Case Description HSA Advantage

Pointer-based Data Structures

Binary tree searchesGPU performs parallel searches in a CPU created binary tree.

CPU and GPU have access to entire unified coherent memory. GPU can access existing data structures containing pointers.

Platform Atomics

Work-Group Dynamic Task ManagementGPU directly operate on a task pool managed by the CPU for algorithms with dynamic computation loads

Binary tree updatesCPU and GPU operating simultaneously on the tree, both doing modifications

CPU and GPU can synchronize using Platform AtomicsHigher performance through parallel operations reducing the need for data copying and reconciling.

Large Data SetsHierarchical data searchesApplications include object recognition, collision detection, global illumination, BVH

CPU and GPU have access to entire unified coherent memory. GPU can operate on huge models in place, reducing copy and kernel launch overhead.

CPU CallbacksMiddleware user-callbacksGPU processes work items, some of which require a call to a CPU function to fetch new data

GPU can invoke CPU functions from within a GPU kernelSimpler programming does not require “split kernels”Higher performance through parallel operations

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 3: ISCA Final Presentation - Applications

UNIFIED COHERENT MEMORY FOR POINTER-BASED DATA STRUCTURES

Page 4: ISCA Final Presentation - Applications

UNIFIED COHERENT MEMORYMORE EFFICIENT POINTER DATA STRUCTURES

Legacy

SYSTEM MEMORY

KERNEL

GPU

TREE RESULTBUFFER

L R

L R L R

GPU MEMORY

RESULT BUFFER

FLAT TREE

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 5: ISCA Final Presentation - Applications

L R

Legacy

SYSTEM MEMORY

KERNEL

GPU

TREE RESULTBUFFER

L R

L R L R

GPU MEMORY

RESULT BUFFER

FLAT TREE

UNIFIED COHERENT MEMORYMORE EFFICIENT POINTER DATA STRUCTURES

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 6: ISCA Final Presentation - Applications

UNIFIED COHERENT MEMORYMORE EFFICIENT POINTER DATA STRUCTURES

Legacy

SYSTEM MEMORY

KERNEL

GPU

TREE RESULTBUFFER

L R

L R L R

GPU MEMORY

RESULT BUFFER

FLAT TREE

L R

L R

L R

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 7: ISCA Final Presentation - Applications

UNIFIED COHERENT MEMORYMORE EFFICIENT POINTER DATA STRUCTURES

Legacy

SYSTEM MEMORY

KERNEL

GPU

TREE RESULTBUFFER

L R

L R L R

GPU MEMORY

RESULT BUFFER

FLAT TREE

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 8: ISCA Final Presentation - Applications

UNIFIED COHERENT MEMORYMORE EFFICIENT POINTER DATA STRUCTURES

Legacy

SYSTEM MEMORY

KERNEL

GPU

TREE RESULTBUFFER

L R

L R L R

GPU MEMORY

RESULT BUFFER

FLAT TREE

L R

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 9: ISCA Final Presentation - Applications

UNIFIED COHERENT MEMORYMORE EFFICIENT POINTER DATA STRUCTURES

Legacy

SYSTEM MEMORY

KERNEL

GPU

TREE RESULTBUFFER

L R

L R L R

GPU MEMORY

RESULT BUFFER

FLAT TREE

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 10: ISCA Final Presentation - Applications

UNIFIED COHERENT MEMORYMORE EFFICIENT POINTER DATA STRUCTURES

Legacy

SYSTEM MEMORY

KERNEL

GPU

TREE RESULTBUFFER

L R

L R L R

GPU MEMORY

RESULT BUFFER

FLAT TREE

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 11: ISCA Final Presentation - Applications

SYSTEM MEMORY

KERNEL

GPU

UNIFIED COHERENT MEMORYMORE EFFICIENT POINTER DATA STRUCTURES

HSA and full OpenCL 2.0

TREE RESULTBUFFER

L R

L R L R

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 12: ISCA Final Presentation - Applications

UNIFIED COHERENT MEMORYMORE EFFICIENT POINTER DATA STRUCTURES

HSA

SYSTEM MEMORY

KERNEL

GPU

TREE RESULTBUFFER

L R

L R L R

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 13: ISCA Final Presentation - Applications

UNIFIED COHERENT MEMORYMORE EFFICIENT POINTER DATA STRUCTURES

HSA

SYSTEM MEMORY

KERNEL

GPU

TREE RESULTBUFFER

L R

L R L R

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 14: ISCA Final Presentation - Applications

UNIFIED COHERENT MEMORYMORE EFFICIENT POINTER DATA STRUCTURES

HSA

SYSTEM MEMORY

KERNEL

GPU

TREE RESULTBUFFER

L R

L R L R

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 15: ISCA Final Presentation - Applications

UNIFIED COHERENT MEMORYMORE EFFICIENT POINTER DATA STRUCTURES

HSA

SYSTEM MEMORY

KERNEL

GPU

TREE RESULTBUFFER

L R

L R L R

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 16: ISCA Final Presentation - Applications

POINTER DATA STRUCTURES - CODE COMPLEXITY

HSA Legacy

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 17: ISCA Final Presentation - Applications

POINTER DATA STRUCTURES- PERFORMANCE

1M 5M 10M 25M0

10,000

20,000

30,000

40,000

50,000

60,000

Binary Tree Search

CPU (1 core)

CPU (4 core)

Legacy APU

HSA APU

Tree size ( # nodes )

Se

arc

h r

ate

(

no

de

s /

ms

)

Measured in AMD labs Jan 1-3 on system shown in back up slide

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 18: ISCA Final Presentation - Applications

PLATFORM ATOMICS FOR DYNAMIC TASK MANAGEMENT

Page 19: ISCA Final Presentation - Applications

PLATFORM ATOMICSENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy*

0

SYSTEM MEMORY

WORK-GROUP 1

GPU

NUM. WRITTEN

TASKS

GPU MEMORY

QUEUE 2QUEUE 1

TASKS POOL

0

0

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 2

0

NUM. WRITTEN

TASKS

0

0

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 3

WORK-GROUP 4

*Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 20: ISCA Final Presentation - Applications

0

SYSTEM MEMORY

WORK-GROUP 1

GPU

NUM. WRITTEN

TASKS

GPU MEMORY

QUEUE 2QUEUE 1

0

0

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 2

0

NUM. WRITTEN

TASKS

0

0

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 3

WORK-GROUP 4

TASKS POOL

PLATFORM ATOMICSENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy*

*Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010

Asynchronous transfer

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 21: ISCA Final Presentation - Applications

4

SYSTEM MEMORY

WORK-GROUP 1

GPU

NUM. WRITTEN

TASKS

GPU MEMORY

QUEUE 2QUEUE 1

0

0

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 2

0

NUM. WRITTEN

TASKS

0

0

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 3

WORK-GROUP 4

TASKS POOL

PLATFORM ATOMICSENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy*

*Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 22: ISCA Final Presentation - Applications

4

SYSTEM MEMORY

WORK-GROUP 1

GPU

NUM. WRITTEN

TASKS

GPU MEMORY

QUEUE 2QUEUE 1

0

0

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 2

4

NUM. WRITTEN

TASKS

0

0

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 3

WORK-GROUP 4

TASKS POOL

PLATFORM ATOMICSENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy*

*Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010

Asynchronous transfer

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 23: ISCA Final Presentation - Applications

4

SYSTEM MEMORY

WORK-GROUP 1

GPU

NUM. WRITTEN

TASKS

GPU MEMORY

QUEUE 2QUEUE 1

0

0

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 2

4

NUM. WRITTEN

TASKS

0

0

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 3

WORK-GROUP 4

TASKS POOL

PLATFORM ATOMICSENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy*

*Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 24: ISCA Final Presentation - Applications

4

SYSTEM MEMORY

WORK-GROUP 1

GPU

NUM. WRITTEN

TASKS

GPU MEMORY

QUEUE 2QUEUE 1

0

0

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 2

4

NUM. WRITTEN

TASKS

0

1

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 3

WORK-GROUP 4

TASKS POOL

PLATFORM ATOMICSENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy*

*Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010

Atomic add

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 25: ISCA Final Presentation - Applications

4

SYSTEM MEMORY

WORK-GROUP 1

GPU

NUM. WRITTEN

TASKS

GPU MEMORY

QUEUE 2QUEUE 1

0

0

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 2

4

NUM. WRITTEN

TASKS

0

1

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 3

WORK-GROUP 4

TASKS POOL

PLATFORM ATOMICSENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy*

*Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 26: ISCA Final Presentation - Applications

4

SYSTEM MEMORY

WORK-GROUP 1

GPU

NUM. WRITTEN

TASKS

GPU MEMORY

QUEUE 2QUEUE 1

0

0

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 2

4

NUM. WRITTEN

TASKS

0

2

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 3

WORK-GROUP 4

TASKS POOL

PLATFORM ATOMICSENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy*

*Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010

Atomic add

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 27: ISCA Final Presentation - Applications

4

SYSTEM MEMORY

WORK-GROUP 1

GPU

NUM. WRITTEN

TASKS

GPU MEMORY

QUEUE 2QUEUE 1

0

0

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 2

4

NUM. WRITTEN

TASKS

0

2

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 3

WORK-GROUP 4

TASKS POOL

PLATFORM ATOMICSENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy*

*Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 28: ISCA Final Presentation - Applications

4

SYSTEM MEMORY

WORK-GROUP 1

GPU

NUM. WRITTEN

TASKS

GPU MEMORY

QUEUE 2QUEUE 1

0

0

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 2

4

NUM. WRITTEN

TASKS

0

3

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 3

WORK-GROUP 4

TASKS POOL

PLATFORM ATOMICS

Legacy*

*Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010

Atomic add

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 29: ISCA Final Presentation - Applications

4

SYSTEM MEMORY

WORK-GROUP 1

GPU

NUM. WRITTEN

TASKS

GPU MEMORY

QUEUE 2QUEUE 1

0

0

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 2

4

NUM. WRITTEN

TASKS

0

3

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 3

WORK-GROUP 4

TASKS POOL

PLATFORM ATOMICSENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy*

*Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 30: ISCA Final Presentation - Applications

4

SYSTEM MEMORY

WORK-GROUP 1

GPU

NUM. WRITTEN

TASKS

GPU MEMORY

QUEUE 2QUEUE 1

0

0

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 2

4

NUM. WRITTEN

TASKS

0

4

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 3

WORK-GROUP 4

TASKS POOL

PLATFORM ATOMICSENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy*

*Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010

Atomic add

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 31: ISCA Final Presentation - Applications

4

SYSTEM MEMORY

WORK-GROUP 1

GPU

NUM. WRITTEN

TASKS

GPU MEMORY

QUEUE 2QUEUE 1

0

4

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 2

4

NUM. WRITTEN

TASKS

0

4

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 3

WORK-GROUP 4

TASKS POOL

PLATFORM ATOMICSENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy*

*Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010

Zero-copy

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 32: ISCA Final Presentation - Applications

PLATFORM ATOMICSENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

0

HOST COHERENT MEMORY

WORK-GROUP 1

GPU

NUM. WRITTEN

TASKS

QUEUE 2QUEUE 1

TASKS POOL

0

0

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 2

WORK-GROUP 3

WORK-GROUP 4

HSA and full OpenCL 2.0

GPU MEMORY

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 33: ISCA Final Presentation - Applications

PLATFORM ATOMICSENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

0

HOST COHERENT MEMORY

WORK-GROUP 1

GPU

NUM. WRITTEN

TASKS

QUEUE 2QUEUE 1

TASKS POOL

0

0

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 2

WORK-GROUP 3

WORK-GROUP 4

HSA and full OpenCL 2.0

GPU MEMORY

memcpy

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 34: ISCA Final Presentation - Applications

PLATFORM ATOMICSENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

4

HOST COHERENT MEMORY

WORK-GROUP 1

GPU

NUM. WRITTEN

TASKS

QUEUE 2QUEUE 1

TASKS POOL

0

0

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 2

WORK-GROUP 3

WORK-GROUP 4

HSA and full OpenCL 2.0

GPU MEMORY

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 35: ISCA Final Presentation - Applications

PLATFORM ATOMICSENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

4

HOST COHERENT MEMORY

WORK-GROUP 1

GPU

NUM. WRITTEN

TASKS

QUEUE 2QUEUE 1

TASKS POOL

0

0

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 2

WORK-GROUP 3

WORK-GROUP 4

HSA and full OpenCL 2.0

GPU MEMORY

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 36: ISCA Final Presentation - Applications

PLATFORM ATOMICSENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

4

HOST COHERENT MEMORY

WORK-GROUP 1

GPU

NUM. WRITTEN

TASKS

QUEUE 2QUEUE 1

TASKS POOL

0

1

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 2

WORK-GROUP 3

WORK-GROUP 4

HSA and full OpenCL 2.0

GPU MEMORY

Platform atomic add

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 37: ISCA Final Presentation - Applications

PLATFORM ATOMICSENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

4

HOST COHERENT MEMORY

WORK-GROUP 1

GPU

NUM. WRITTEN

TASKS

QUEUE 2QUEUE 1

TASKS POOL

0

1

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 2

WORK-GROUP 3

WORK-GROUP 4

HSA and full OpenCL 2.0

GPU MEMORY

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 38: ISCA Final Presentation - Applications

PLATFORM ATOMICSENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

4

HOST COHERENT MEMORY

WORK-GROUP 1

GPU

NUM. WRITTEN

TASKS

QUEUE 2QUEUE 1

TASKS POOL

0

2

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 2

WORK-GROUP 3

WORK-GROUP 4

HSA and full OpenCL 2.0

GPU MEMORY

Platform atomic add

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 39: ISCA Final Presentation - Applications

PLATFORM ATOMICSENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

4

HOST COHERENT MEMORY

WORK-GROUP 1

GPU

NUM. WRITTEN

TASKS

QUEUE 2QUEUE 1

TASKS POOL

0

2

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 2

WORK-GROUP 3

WORK-GROUP 4

HSA and full OpenCL 2.0

GPU MEMORY

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 40: ISCA Final Presentation - Applications

PLATFORM ATOMICSENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

4

HOST COHERENT MEMORY

WORK-GROUP 1

GPU

NUM. WRITTEN

TASKS

QUEUE 2QUEUE 1

TASKS POOL

0

3

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 2

WORK-GROUP 3

WORK-GROUP 4

HSA and full OpenCL 2.0

GPU MEMORY

Platform atomic add

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 41: ISCA Final Presentation - Applications

PLATFORM ATOMICSENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

4

HOST COHERENT MEMORY

WORK-GROUP 1

GPU

NUM. WRITTEN

TASKS

QUEUE 2QUEUE 1

TASKS POOL

0

3

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 2

WORK-GROUP 3

WORK-GROUP 4

HSA and full OpenCL 2.0

GPU MEMORY

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 42: ISCA Final Presentation - Applications

PLATFORM ATOMICSENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

4

HOST COHERENT MEMORY

WORK-GROUP 1

GPU

NUM. WRITTEN

TASKS

QUEUE 2QUEUE 1

TASKS POOL

0

4

NUM. CONSUMED

TASKS

0

QUEUE 1

QUEUE 2

WORK-GROUP 2

WORK-GROUP 3

WORK-GROUP 4

HSA and full OpenCL 2.0

GPU MEMORY

Platform atomic add

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 43: ISCA Final Presentation - Applications

PLATFORM ATOMICS – CODE COMPLEXITY

HSALegacy

Host enqueue function: 20 lines of code

Host enqueue function: 102 lines of code

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 44: ISCA Final Presentation - Applications

PLATFORM ATOMICS - PERFORMANCE

64 128 256 512 64 128 256 5124096 16384

0

100

200

300

400

500

600

700

Legacy implementation (ms)

HSA implementation (ms)

Tasks per insertionTasks pool size

Exe

cuti

on

tim

e (m

s)

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 45: ISCA Final Presentation - Applications

PLATFORM ATOMICS FOR CPU/GPU COLLABORATION

Page 46: ISCA Final Presentation - Applications

PLATFORM ATOMICSENABLING EFFICIENT GPU/CPU COLLABORATION

Legacy

Only GPU can work on input

arrayConcurre

nt processin

g not possible

TREEINPUTBUFFER

GPU

KERNEL

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 47: ISCA Final Presentation - Applications

PLATFORM ATOMICS

Legacy

Only GPU can work on input

arrayConcurre

nt processin

g not possible

TREEINPUTBUFFER

GPU

KERNEL

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 48: ISCA Final Presentation - Applications

PLATFORM ATOMICS

Legacy

Only GPU can work on input

arrayConcurre

nt processin

g not possible

TREEINPUTBUFFER

GPU

KERNEL

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 49: ISCA Final Presentation - Applications

GPU

KERNEL

PLATFORM ATOMICS

Both CPU+GPU

operating on same

data structure

concurrently

TREEINPUTBUFFER

CPU 0

CPU 1

HSA and full OpenCL 2.0

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 50: ISCA Final Presentation - Applications

GPU

KERNEL

PLATFORM ATOMICS

Both CPU+GPU

operating on same

data structure

concurrently

TREEINPUTBUFFER

CPU 0

CPU 1

HSA and full OpenCL 2.0

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 51: ISCA Final Presentation - Applications

UNIFIED COHERENT MEMORY FOR LARGEDATA SETS

Page 52: ISCA Final Presentation - Applications

PROCESSING LARGE DATA SETS

The CPU creates a large data structure in System Memory. Computations

using the data are offloaded to the GPU.

SYSTEM MEMORY

GPU

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 53: ISCA Final Presentation - Applications

SYSTEM MEMORY

Level 1

Level 2

Level 3

Level 4

Level 5

PROCESSING LARGE DATA SETS

Larg

e 3D

spa

tial d

ata

stru

ctur

e

GPU

The CPU creates a large data structure in System Memory. Computations

using the data are offloaded to the GPU.

Compare HSA and Legacy methods

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 54: ISCA Final Presentation - Applications

SYSTEM MEMORY

LEGACY ACCESS USING GPU MEMORY

Legacy

GPU Memory is smaller

Have to copy and process in chunks

GPU

GPU MEMORY

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 55: ISCA Final Presentation - Applications

SYSTEM MEMORY

Legacy

Level 1

Level 2

Level 3

Level 4

Level 5

LEGACY ACCESS TO LARGE STRUCTURES

Larg

e 3D

spa

tial d

ata

stru

ctur

e

GPU

GPU MEMORY

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 56: ISCA Final Presentation - Applications

SYSTEM MEMORY

COPY ONE CHUNK AT A TIME

Legacy

Level 1

Level 2

Level 3

Level 4

Level 5

GPU

KERNEL

Copy of top 2 levels of hierarchy

Larg

e 3D

spa

tial d

ata

stru

ctur

e

GPU MEMORY

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 57: ISCA Final Presentation - Applications

GPU

GPU MEMORY

SYSTEM MEMORY

PROCESS ONE CHUNK AT A TIME

Legacy

Level 1

Level 2

Level 3

Level 4

Level 5

FIRSTKERNEL

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 58: ISCA Final Presentation - Applications

SYSTEM MEMORY

PROCESS ONE CHUNK AT A TIME

Legacy

Level 1

Level 2

Level 3

Level 4

Level 5

GPU

GPU MEMORY

FIRSTKERNEL

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 59: ISCA Final Presentation - Applications

SYSTEM MEMORY

PROCESS ONE CHUNK AT A TIME

Legacy

Level 1

Level 2

Level 3

Level 4

Level 5

?

GPU

GPU MEMORY

FIRSTKERNEL

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 60: ISCA Final Presentation - Applications

SYSTEM MEMORY

COPY ONE CHUNK AT A TIME

Legacy

Level 1

Level 2

Level 3

Level 4

Level 5

GPU

GPU MEMORY

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 61: ISCA Final Presentation - Applications

SYSTEM MEMORY

COPY ONE CHUNK AT A TIME

Legacy

Level 1

Level 2

Level 3

Level 4

Level 5

GPU

KERNEL

Copy of bottom 3 levels of one branch of the hierarchy

GPU MEMORY

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 62: ISCA Final Presentation - Applications

SYSTEM MEMORY

PROCESS ONE CHUNK AT A TIME

Legacy

Level 1

Level 2

Level 3

Level 4

Level 5

GPU

KERNEL

GPU MEMORY

SECOND KERNEL

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 63: ISCA Final Presentation - Applications

SYSTEM MEMORY

PROCESS ONE CHUNK AT A TIME

Legacy

Level 1

Level 2

Level 3

Level 4

Level 5

GPU

KERNEL

GPU MEMORY

SECOND KERNEL

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 64: ISCA Final Presentation - Applications

SYSTEM MEMORY

PROCESS ONE CHUNK AT A TIME

Legacy

Level 1

Level 2

Level 3

Level 4

Level 5

GPU

KERNEL

GPU MEMORY

SECOND KERNEL

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 65: ISCA Final Presentation - Applications

SYSTEM MEMORY

COPY ONE CHUNK AT A TIME

Legacy

Level 1

Level 2

Level 3

Level 4

Level 5

GPU

GPU MEMORY

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 66: ISCA Final Presentation - Applications

SYSTEM MEMORY

COPY ONE CHUNK AT A TIME

Legacy

Level 1

Level 2

Level 3

Level 4

Level 5

GPU

GPU MEMORY

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 67: ISCA Final Presentation - Applications

SYSTEM MEMORY

COPY ONE CHUNK AT A TIME

Legacy

Level 1

Level 2

Level 3

Level 4

Level 5

GPU

Copy of bottom 3 levels of a different branch of the

hierarchy

GPU MEMORY

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 68: ISCA Final Presentation - Applications

SYSTEM MEMORY

PROCESS ONE CHUNK AT A TIME

Legacy

Level 1

Level 2

Level 3

Level 4

Level 5

GPU

KERNEL

GPU MEMORY

NthKERNEL

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 69: ISCA Final Presentation - Applications

SYSTEM MEMORY

PROCESS ONE CHUNK AT A TIME

Legacy

Level 1

Level 2

Level 3

Level 4

Level 5

GPU

KERNEL

GPU MEMORY

NthKERNEL

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 70: ISCA Final Presentation - Applications

SYSTEM MEMORY

PROCESS ONE CHUNK AT A TIME

Legacy

Level 1

Level 2

Level 3

Level 4

Level 5

GPU

KERNEL

GPU MEMORY

NthKERNEL

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 71: ISCA Final Presentation - Applications

LARGE SPATIAL DATA STRUCTURE

Level 1

Level 2

Level 3

Level 4

Level 5

Larg

e 3D

spa

tial d

ata

stru

ctur

eSYSTEM MEMORY

KERNEL

GPUHSA and full OpenCL 2.0

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 72: ISCA Final Presentation - Applications

SYSTEM MEMORY

GPU CAN TRAVERSE ENTIRE HIERARCHY

Level 1

Level 2

Level 3

Level 4

Level 5

HSA

KERNEL

GPU

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 73: ISCA Final Presentation - Applications

SYSTEM MEMORY

GPU CAN TRAVERSE ENTIRE HIERARCHY

Level 1

Level 2

Level 3

Level 4

Level 5

HSA

KERNEL

GPU

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 74: ISCA Final Presentation - Applications

SYSTEM MEMORY

GPU CAN TRAVERSE ENTIRE HIERARCHY

Level 1

Level 2

Level 3

Level 4

Level 5

HSA

KERNEL

GPU

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 75: ISCA Final Presentation - Applications

SYSTEM MEMORY

GPU CAN TRAVERSE ENTIRE HIERARCHY

Level 1

Level 2

Level 3

Level 4

Level 5

HSA

KERNEL

GPU

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 76: ISCA Final Presentation - Applications

SYSTEM MEMORY

GPU CAN TRAVERSE ENTIRE HIERARCHY

Level 1

Level 2

Level 3

Level 4

Level 5

KERNEL

HSAGPU

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 77: ISCA Final Presentation - Applications

CALLBACKS

Page 78: ISCA Final Presentation - Applications

CALLBACKS

Parallel processing algorithm with branches A seldom taken branch requires new data from the CPU

On legacy systems, the algorithm must be split: Process Kernel 1 on GPU Check for CPU callbacks and if any, process on CPU Process Kernel 2 on GPU

Example algorithm from Image Processing Perform a filter Calculate average LUMA in each tile Compare LUMA against threshold and call CPU callback if exceeded (rare) Perform special processing on tiles with callbackx\s

COMMON SITUATION IN HC

Input Image Output Image

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 79: ISCA Final Presentation - Applications

CALLBACKS

Legacy

1st KERNEL

END

STAR

T

GP

U T

HR

EA

DS

0

1

2

N

.

.

.

.

.

.

.

.

.

CPU callbacks

Early term

ination

due to need fo

r

callback

2nd KERNEL

END

START Continuation kernel

finishes up kernel works results in poor GPU utilization

TIME

TIME

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 80: ISCA Final Presentation - Applications

CALLBACKS

Input Image

1 Tile = 1 OpenCL Work Item

Output Image

GPU• Work items compute average RGB value

of all the pixels in a tile • Work items also compute average Luma

from the average RGB• If average Luma > threshold, workgroup

invokes CPU CALLBACK• In parallel with callback, continue compute

CPU • For selected tiles, update average Luma

value (set to RED)

GPU• Work items apply the Luma value to all

pixels in the tile

GPU to CPU callbacks use Shared Virtual Memory (SVM) Semaphores, implemented using Platform Atomic Compare-and-Swap.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 81: ISCA Final Presentation - Applications

CALLBACKS

A few kernel threads need CPU callback services but serviced immediately

KERNEL

END

STAR

T

GP

U T

HR

EA

DS

0

1

2

N

.

.

.

.

.

.

.

.

.

TIME

CPU callbacks

HSA and full OpenCL 2.0

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 82: ISCA Final Presentation - Applications

SUMMARY - HSA ADVANTAGE

Programming Technique Use Case Description HSA Advantage

Pointer-based Data Structures

Binary tree searchesGPU performs parallel searches in a CPU created binary tree.

CPU and GPU have access to entire unified coherent memory. GPU can access existing data structures containing pointers.

Platform Atomics

Work-Group Dynamic Task ManagementGPU directly operate on a task pool managed by the CPU for algorithms with dynamic computation loads

Binary tree updatesCPU and GPU operating simultaneously on the tree, both doing modifications

CPU and GPU can synchronize using Platform AtomicsHigher performance through parallel operations reducing the need for data copying and reconciling.

Large Data SetsHierarchical data searchesApplications include object recognition, collision detection, global illumination, BVH

CPU and GPU have access to entire unified coherent memory. GPU can operate on huge models in place, reducing copy and kernel launch overhead.

CPU CallbacksMiddleware user-callbacksGPU processes work items, some of which require a call to a CPU function to fetch new data

GPU can invoke CPU functions from within a GPU kernelSimpler programming does not require “split kernels”Higher performance through parallel operations

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 83: ISCA Final Presentation - Applications

QUESTIONS?