isca final presentation - intro
TRANSCRIPT
INTRODUCTION
PHIL ROGERS, AMD CORPORATE FELLOW & PRESIDENT OF HSA FOUNDATION
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA FOUNDATION
Founded in June 2012
Developing a new platform for heterogeneous systems: www.hsafoundation.com
Specifications under development in working groups to define the platform
Membership consists of 43 companies and 16 universities
Adding 1-2 new members each month
DIVERSE PARTNERS DRIVING FUTURE OF HETEROGENEOUS COMPUTING
Founders
Promoters
Supporters
Contributors
Academic
MEMBERSHIP TABLE
Founder (6): AMD, ARM, Imagination Technologies, MediaTek Inc., Qualcomm Inc., Samsung Electronics Co Ltd
Promoter (1): LG Electronics
Contributor (25): Analog Devices Inc., Apical, Broadcom, Canonical Limited, CEVA Inc., Digital Media Professionals, Electronics and Telecommunications Research Institute (ETRI), General Processor, Huawei, Industrial Technology Res. Institute, Marvell International Ltd., Mobica, Oracle, Sonics Inc., Sony Mobile Communications, Swarm 64 GmbH, Synopsys, Tensilica Inc., Texas Instruments Inc., Toshiba, VIA Technologies, Vivante Corporation
Supporter (13): Allinea Software Ltd, Arteris Inc., Codeplay Software, Fabric Engine, Kishonti, Lawrence Livermore National Laboratory, Linaro, MultiCoreWare, Oak Ridge National Laboratory, Sandia Corporation, StreamComputing, SUSE LLC, UChicago Argonne LLC (Operator of Argonne National Laboratory)
Academic (17): Institute for Computing Systems Architecture, Missouri University of Science & Technology, National Tsing Hua University, NMAM Institute of Technology, Northeastern University, Rice University, Seoul National University, System Software Lab, National Tsing Hua University, Tampere University of Technology, TEI of Crete, The University of Mississippi, University of North Texas, University of Bologna, University of Bristol Microelectronic Research Group, University of Edinburgh, University of Illinois at Urbana-Champaign Department of Computer Science
HETEROGENEOUS PROCESSORS HAVE PROLIFERATED — MAKE THEM BETTER
Heterogeneous SOCs have arrived and are a tremendous advance over previous platforms
SOCs combine CPU cores, GPU cores and other accelerators, with high bandwidth access to memory
How do we make them even better? Easier to program, easier to optimize, higher performance, lower power
HSA unites accelerators architecturally
Early focus on the GPU compute accelerator, but HSA will go well beyond the GPU
INFLECTIONS IN PROCESSOR DESIGN
[Figure: three eras of processor design, each showing a performance curve over time with a "we are here" marker]
Single-Core Era: single-thread performance vs. time. Enabled by: Moore's Law, voltage scaling. Constrained by: power, complexity.
Multi-Core Era: throughput performance vs. time (# of processors). Enabled by: Moore's Law, SMP architecture. Constrained by: power, parallel SW, scalability.
Heterogeneous Systems Era: modern application performance vs. time (data-parallel exploitation). Enabled by: abundant data parallelism, power-efficient GPUs. Temporarily constrained by: programming models, communication overhead.
Programming models by era: Assembly, C/C++, Java ...; pthreads, OpenMP/TBB ...; Shader, CUDA, OpenCL; and, for the heterogeneous era, C++ and Java.
LEGACY GPU COMPUTE
[Diagram: CPU cores with coherent system memory connected over PCIe™ to a discrete GPU (compute units) with its own non-coherent memory]
The limiters:
Multiple memory pools
Multiple address spaces
High overhead dispatch
Data copies across PCIe
New languages for programming
Dual source development
Proprietary environments
Expert programmers only
Need to fix all of this to unleash our programmers
EXISTING APUS AND SOCS
[Diagram: physical integration: CPU cores (CPU1..CPUN) and GPU compute units (CU1..CUM) on one die, with coherent system memory alongside separate non-coherent GPU memory]
Physical integration is a good first step
Some copies gone
Two memory pools remain
Still queue through the OS
Still requires expert programmers
Need to finish the job
AN HSA ENABLED SOC
Unified Coherent Memory enables data sharing across all processors
Processors architected to operate cooperatively
Designed to enable the application to run on different processors at different times
[Diagram: CPU cores (CPU1..CPUN) and GPU compute units (CU1..CUM) all sharing a single Unified Coherent Memory]
PILLARS OF HSA*
Unified addressing across all processors
Operation into pageable system memory
Full memory coherency
User mode dispatch
Architected queuing language
Scheduling and context switching
HSA Intermediate Language (HSAIL)
High level language support for GPU compute processors
* All features of HSA are subject to change, pending ratification of 1.0 Final specifications by the HSA Board of Directors
HSA SPECIFICATIONS
HSA System Architecture Specification: Version 1.0 Provisional, released April 2014. Defines discovery, memory model, queue management, atomics, etc.
HSA Programmers Reference Specification: Version 1.0 Provisional, released June 2014. Defines the HSAIL language and object format.
HSA Runtime Software Specification: Version 1.0 Provisional, expected to be released in July 2014. Defines the APIs through which an HSA application uses the platform.
All released specifications can be found at the HSA Foundation web site: www.hsafoundation.com/standards
HSA - AN OPEN PLATFORM
Open architecture, membership open to all
HSA Programmers Reference Manual, HSA System Architecture, and HSA Runtime delivered via royalty-free standards
Royalty-free IP, specifications and APIs
ISA agnostic for both CPU and GPU
Membership from all areas of computing: hardware companies, operating systems, tools and middleware, applications, universities
HSA INTERMEDIATE LAYER — HSAIL
HSAIL is a virtual ISA for parallel programs
Finalized to native ISA by a JIT compiler or "Finalizer"
ISA independent by design for CPU and GPU
Explicitly parallel; designed for data-parallel programming
Support for exceptions, virtual functions, and other high-level language features
Lower level than OpenCL SPIR; fits naturally in the OpenCL compilation stack
Suitable to support additional high-level languages and programming models: Java, C++, OpenMP, Python, etc.
HSA MEMORY MODEL
Defines visibility ordering between all threads in the HSA system
Designed to be compatible with C++11, Java, OpenCL and .NET memory models
Relaxed consistency memory model for parallel compute performance
Visibility controlled by: Load.Acquire, Store.Release, fences
HSA QUEUING MODEL
User mode queuing for low-latency dispatch: the application dispatches directly, with no OS or driver required in the dispatch path
Architected Queuing Layer: a single compute dispatch path for all hardware, no driver translation, direct to hardware
Allows dispatch to queue from any agent, CPU or GPU
GPU self-enqueue enables many solutions: recursion, tree traversal, wavefront reforming
HSA SOFTWARE
[Diagram: evolution from today's driver stack to the HSA software stack]
Driver stack: Apps → Domain Libraries → OpenCL™, DX Runtimes, User Mode Drivers → Graphics Kernel Mode Driver → Hardware (APUs, CPUs, GPUs)
HSA software stack: Apps → Task Queuing Libraries → HSA Domain Libraries, OpenCL™ 2.x Runtime → HSA Runtime, HSA JIT → HSA Kernel Mode Driver → Hardware (APUs, CPUs, GPUs)
Legend: user mode components, kernel mode components, and components contributed by third parties
EVOLUTION OF THE SOFTWARE STACK
OPENCL™ AND HSA
HSA is an optimized platform architecture for OpenCL, not an alternative to OpenCL
OpenCL on HSA will benefit from: avoidance of wasteful copies, low-latency dispatch, an improved memory model, pointers shared between CPU and GPU
OpenCL 2.0 leverages HSA features: shared virtual memory, platform atomics
ADDITIONAL LANGUAGES ON HSA (in development)
Language | Body | More Information
Java (Sumatra) | OpenJDK | http://openjdk.java.net/projects/sumatra/
LLVM | LLVM | Code generator for HSAIL
C++ AMP | MultiCoreWare | https://bitbucket.org/multicoreware/cppamp-driver-ng/wiki/Home
OpenMP / GCC | AMD, SUSE | https://gcc.gnu.org/viewcvs/gcc/branches/hsa/gcc/README.hsa?view=markup&pathrev=207425
SUMATRA PROJECT OVERVIEW
AMD/Oracle-sponsored open source (OpenJDK) project, targeted at Java 9 (2015 release)
Allows developers to efficiently represent data-parallel algorithms in Java
Sumatra 'repurposes' Java 8's multi-core Stream/Lambda APIs to enable both CPU and GPU computing
At runtime, a Sumatra-enabled Java Virtual Machine (JVM) will dispatch 'selected' constructs to available HSA-enabled devices
Developers of Java libraries are already refactoring their library code to use these same constructs, so developers using existing libraries should see GPU acceleration without any code changes
http://openjdk.java.net/projects/sumatra/
https://wikis.oracle.com/display/HotSpotInternals/Sumatra
http://mail.openjdk.java.net/pipermail/sumatra-dev/
[Diagram: Sumatra tool flow. Development: Application.java is compiled by the Java Compiler (against the Lambda/Stream API) into Application.class. Runtime: the Sumatra-enabled JVM executes the application, generating CPU ISA directly and GPU ISA through the HSA Finalizer.]
HSA OPEN SOURCE SOFTWARE
HSA will feature an open source Linux execution and compilation stack
Allows a single shared implementation for many components
Enables university research and collaboration in all areas
Because it's the right thing to do
Component Name | IHV or Common | Rationale
HSA Bolt Library | Common | Enable understanding and debug
HSAIL Code Generator | Common | Enable research
LLVM Contributions | Common | Industry and academic collaboration
HSAIL Assembler | Common | Enable understanding and debug
HSA Runtime | Common | Standardize on a single runtime
HSA Finalizer | IHV | Enable research and debug
HSA Kernel Driver | IHV | For inclusion in Linux distros
WORKLOAD EXAMPLE
SUFFIX ARRAY CONSTRUCTION: CLOUD SERVER WORKLOAD
SUFFIX ARRAYS
Suffix arrays are a fundamental data structure designed for efficient searching of a large text: quickly locate every occurrence of a substring S in a text T
Suffix arrays are used to accelerate in-memory cloud workloads: full-text index search, lossless data compression, bio-informatics
ACCELERATED SUFFIX ARRAY CONSTRUCTION ON HSA
M. Deo, "Parallel Suffix Array Construction and Least Common Prefix for the GPU", submitted to Principles and Practice of Parallel Programming (PPoPP'13), February 2013. AMD A10-4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685 MHz; 4 GB RAM
By offloading data parallel computations to GPU, HSA increases performance and reduces energy for Suffix Array Construction.
By efficiently sharing data between CPU and GPU, HSA lets us move compute to data without penalty of intermediate copies.
[Chart: skew algorithm for computing the suffix array on HSA: +5.8x increased performance, 5x decreased energy]
EASE OF PROGRAMMING
CODE COMPLEXITY VS. PERFORMANCE
LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS
AMD A10-5800K APU with Radeon™ HD Graphics: CPU: 4 cores, 3800 MHz (4200 MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800 MHz; 4 GB RAM. Software: Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta
[Chart: lines of code (0-350, broken down into init, compile, copy, launch, algorithm, and copy-back) and relative performance (0-35) for Serial CPU, TBB, Intrinsics+TBB, OpenCL™-C, OpenCL™-C++, C++ AMP, and HSA Bolt]
(Exemplary ISV “Hessian” Kernel)
THE HSA FUTURE
Architected heterogeneous processing on the SOC
Programming of accelerators becomes much easier
Accelerated software that runs across multiple hardware vendors
Scalability from smart phones to supercomputers on a common architecture
GPU acceleration of parallel processing is the initial target, with DSPs and other accelerators coming to the HSA system architecture model
Heterogeneous software ecosystem evolves at a much faster pace
Lower power, more capable devices in your hand, on the wall, in the cloud
JOIN US!
WWW.HSAFOUNDATION.COM