isca final presentation - intro
TRANSCRIPT
INTRODUCTION
PHIL ROGERS, AMD CORPORATE FELLOW & PRESIDENT OF HSA FOUNDATION
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA FOUNDATION
Founded in June 2012
Developing a new platform for heterogeneous systems: www.hsafoundation.com
Specifications under development in working groups to define the platform
Membership consists of 43 companies and 16 universities
Adding 1-2 new members each month
DIVERSE PARTNERS DRIVING FUTURE OF HETEROGENEOUS COMPUTING
Founders
Promoters
Supporters
Contributors
Academic
MEMBERSHIP TABLE
Founder (6): AMD, ARM, Imagination Technologies, MediaTek Inc., Qualcomm Inc., Samsung Electronics Co Ltd
Promoter (1): LG Electronics
Contributor (25): Analog Devices Inc., Apical, Broadcom, Canonical Limited, CEVA Inc., Digital Media Professionals, Electronics and Telecommunications Research Institute (ETRI), General Processor, Huawei, Industrial Technology Res. Institute, Marvell International Ltd., Mobica, Oracle, Sonics Inc., Sony Mobile Communications, Swarm 64 GmbH, Synopsys, Tensilica Inc., Texas Instruments Inc., Toshiba, VIA Technologies, Vivante Corporation
Supporter (13): Allinea Software Ltd, Arteris Inc., Codeplay Software, Fabric Engine, Kishonti, Lawrence Livermore National Laboratory, Linaro, MultiCoreWare, Oak Ridge National Laboratory, Sandia Corporation, StreamComputing, SUSE LLC, UChicago Argonne LLC (Operator of Argonne National Laboratory)
Academic (17): Institute for Computing Systems Architecture, Missouri University of Science & Technology, National Tsing Hua University, NMAM Institute of Technology, Northeastern University, Rice University, Seoul National University, System Software Lab, National Tsing Hua University, Tampere University of Technology, TEI of Crete, The University of Mississippi, University of North Texas, University of Bologna, University of Bristol Microelectronic Research Group, University of Edinburgh, University of Illinois at Urbana-Champaign Department of Computer Science
HETEROGENEOUS PROCESSORS HAVE PROLIFERATED — MAKE THEM BETTER
Heterogeneous SOCs have arrived and are a tremendous advance over previous platforms
SOCs combine CPU cores, GPU cores and other accelerators, with high bandwidth access to memory
How do we make them even better? Easier to program, easier to optimize, higher performance, lower power
HSA unites accelerators architecturally
Early focus on the GPU compute accelerator, but HSA will go well beyond the GPU
INFLECTIONS IN PROCESSOR DESIGN
[Figure: three eras of processor design, each showing a performance curve over time with a "we are here" marker]
Single-Core Era: single-thread performance vs. time. Enabled by: Moore's Law, voltage scaling. Constrained by: power, complexity.
Multi-Core Era: throughput performance vs. time (# of processors). Enabled by: Moore's Law, SMP architecture. Constrained by: power, parallel SW, scalability.
Heterogeneous Systems Era: modern application performance vs. time (data-parallel exploitation). Enabled by: abundant data parallelism, power-efficient GPUs. Temporarily constrained by: programming models, communication overhead.
Programming models by era: Assembly, C/C++, Java ...; pthreads, OpenMP/TBB ...; Shader, CUDA, OpenCL; and, for the heterogeneous era, C++ and Java.
LEGACY GPU COMPUTE
[Diagram: CPU cores with coherent system memory connected over PCIe™ to a discrete GPU (compute units) with its own non-coherent memory]
The limiters:
Multiple memory pools
Multiple address spaces
High overhead dispatch
Data copies across PCIe
New languages for programming
Dual source development
Proprietary environments
Expert programmers only
Need to fix all of this to unleash our programmers
EXISTING APUS AND SOCS
[Diagram: physical integration: CPU cores (CPU1..CPUN) and GPU compute units (CU1..CUM) on one die, with coherent system memory alongside separate non-coherent GPU memory]
Physical integration is a good first step
Some copies gone
Two memory pools remain
Still queue through the OS
Still requires expert programmers
Need to finish the job
AN HSA ENABLED SOC
Unified Coherent Memory enables data sharing across all processors
Processors architected to operate cooperatively
Designed to enable the application to run on different processors at different times
[Diagram: CPU cores (CPU1..CPUN) and GPU compute units (CU1..CUM) all sharing a single Unified Coherent Memory]
PILLARS OF HSA*
Unified addressing across all processors
Operation into pageable system memory
Full memory coherency
User mode dispatch
Architected queuing language
Scheduling and context switching
HSA Intermediate Language (HSAIL)
High level language support for GPU compute processors
* All features of HSA are subject to change, pending ratification of 1.0 Final specifications by the HSA Board of Directors
HSA SPECIFICATIONS
HSA System Architecture Specification: Version 1.0 Provisional, released April 2014. Defines discovery, memory model, queue management, atomics, etc.
HSA Programmers Reference Specification: Version 1.0 Provisional, released June 2014. Defines the HSAIL language and object format.
HSA Runtime Software Specification: Version 1.0 Provisional, expected to be released in July 2014. Defines the APIs through which an HSA application uses the platform.
All released specifications can be found at the HSA Foundation web site: www.hsafoundation.com/standards
HSA - AN OPEN PLATFORM
Open architecture, membership open to all
HSA Programmers Reference Manual, HSA System Architecture, and HSA Runtime delivered via royalty-free standards
Royalty-free IP, specifications and APIs
ISA agnostic for both CPU and GPU
Membership from all areas of computing: hardware companies, operating systems, tools and middleware, applications, universities
HSA INTERMEDIATE LAYER — HSAIL
HSAIL is a virtual ISA for parallel programs
Finalized to native ISA by a JIT compiler or "Finalizer"
ISA independent by design for CPU and GPU
Explicitly parallel; designed for data-parallel programming
Support for exceptions, virtual functions, and other high-level language features
Lower level than OpenCL SPIR; fits naturally in the OpenCL compilation stack
Suitable to support additional high-level languages and programming models: Java, C++, OpenMP, Python, etc.
HSA MEMORY MODEL
Defines visibility ordering between all threads in the HSA system
Designed to be compatible with C++11, Java, OpenCL and .NET memory models
Relaxed consistency memory model for parallel compute performance
Visibility controlled by: Load.Acquire, Store.Release, fences
HSA QUEUING MODEL
User mode queuing for low-latency dispatch: the application dispatches directly, with no OS or driver required in the dispatch path
Architected Queuing Layer: a single compute dispatch path for all hardware, no driver translation, direct to hardware
Allows dispatch to queue from any agent, CPU or GPU
GPU self-enqueue enables many solutions: recursion, tree traversal, wavefront reforming
HSA SOFTWARE
[Diagram: evolution from today's driver stack to the HSA software stack]
Driver stack: Apps → Domain Libraries → OpenCL™, DX Runtimes, User Mode Drivers → Graphics Kernel Mode Driver → Hardware (APUs, CPUs, GPUs)
HSA software stack: Apps → Task Queuing Libraries → HSA Domain Libraries, OpenCL™ 2.x Runtime → HSA Runtime, HSA JIT → HSA Kernel Mode Driver → Hardware (APUs, CPUs, GPUs)
Legend: user mode components, kernel mode components, and components contributed by third parties
EVOLUTION OF THE SOFTWARE STACK
OPENCL™ AND HSA
HSA is an optimized platform architecture for OpenCL, not an alternative to OpenCL
OpenCL on HSA will benefit from: avoidance of wasteful copies, low-latency dispatch, an improved memory model, pointers shared between CPU and GPU
OpenCL 2.0 leverages HSA features: shared virtual memory, platform atomics
ADDITIONAL LANGUAGES ON HSA (in development)
Language | Body | More Information
Java (Sumatra) | OpenJDK | http://openjdk.java.net/projects/sumatra/
LLVM | LLVM | Code generator for HSAIL
C++ AMP | MultiCoreWare | https://bitbucket.org/multicoreware/cppamp-driver-ng/wiki/Home
OpenMP / GCC | AMD, SUSE | https://gcc.gnu.org/viewcvs/gcc/branches/hsa/gcc/README.hsa?view=markup&pathrev=207425
SUMATRA PROJECT OVERVIEW
AMD/Oracle-sponsored open source (OpenJDK) project, targeted at Java 9 (2015 release)
Allows developers to efficiently represent data-parallel algorithms in Java
Sumatra 'repurposes' Java 8's multi-core Stream/Lambda APIs to enable both CPU and GPU computing
At runtime, a Sumatra-enabled Java Virtual Machine (JVM) will dispatch 'selected' constructs to available HSA-enabled devices
Developers of Java libraries are already refactoring their library code to use these same constructs, so developers using existing libraries should see GPU acceleration without any code changes
http://openjdk.java.net/projects/sumatra/
https://wikis.oracle.com/display/HotSpotInternals/Sumatra
http://mail.openjdk.java.net/pipermail/sumatra-dev/
[Diagram: Sumatra tool flow. Development: Application.java is compiled by the Java Compiler (against the Lambda/Stream API) into Application.class. Runtime: the Sumatra-enabled JVM executes the application, generating CPU ISA directly and GPU ISA through the HSA Finalizer.]
HSA OPEN SOURCE SOFTWARE
HSA will feature an open source Linux execution and compilation stack
Allows a single shared implementation for many components
Enables university research and collaboration in all areas
Because it's the right thing to do
Component Name | IHV or Common | Rationale
HSA Bolt Library | Common | Enable understanding and debug
HSAIL Code Generator | Common | Enable research
LLVM Contributions | Common | Industry and academic collaboration
HSAIL Assembler | Common | Enable understanding and debug
HSA Runtime | Common | Standardize on a single runtime
HSA Finalizer | IHV | Enable research and debug
HSA Kernel Driver | IHV | For inclusion in Linux distros
WORKLOAD EXAMPLE
SUFFIX ARRAY CONSTRUCTION: CLOUD SERVER WORKLOAD
SUFFIX ARRAYS
Suffix arrays are a fundamental data structure designed for efficient searching of a large text: quickly locate every occurrence of a substring S in a text T
Suffix arrays are used to accelerate in-memory cloud workloads: full-text index search, lossless data compression, bio-informatics
ACCELERATED SUFFIX ARRAY CONSTRUCTION ON HSA
M. Deo, "Parallel Suffix Array Construction and Least Common Prefix for the GPU", submitted to Principles and Practice of Parallel Programming (PPoPP'13), February 2013. AMD A10-4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685 MHz; 4 GB RAM
By offloading data parallel computations to GPU, HSA increases performance and reduces energy for Suffix Array Construction.
By efficiently sharing data between CPU and GPU, HSA lets us move compute to data without penalty of intermediate copies.
[Chart: skew algorithm for computing the suffix array on HSA: +5.8x increased performance, 5x decreased energy]
EASE OF PROGRAMMING
CODE COMPLEXITY VS. PERFORMANCE
LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS
AMD A10-5800K APU with Radeon™ HD Graphics: CPU: 4 cores, 3800 MHz (4200 MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800 MHz; 4 GB RAM. Software: Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta
[Chart: lines of code (0-350, broken down into init, compile, copy, launch, algorithm, and copy-back) and relative performance (0-35) for Serial CPU, TBB, Intrinsics+TBB, OpenCL™-C, OpenCL™-C++, C++ AMP, and HSA Bolt]
(Exemplary ISV “Hessian” Kernel)
THE HSA FUTURE
Architected heterogeneous processing on the SOC
Programming of accelerators becomes much easier
Accelerated software that runs across multiple hardware vendors
Scalability from smart phones to supercomputers on a common architecture
GPU acceleration of parallel processing is the initial target, with DSPs and other accelerators coming to the HSA system architecture model
Heterogeneous software ecosystem evolves at a much faster pace
Lower power, more capable devices in your hand, on the wall, in the cloud
JOIN US!
WWW.HSAFOUNDATION.COM