petapath hp cast 12 - programming for high performance accelerated systems

12/04/23 Copyright © 2009 Petapath Limited. All rights reserved.

1

Petapath

Dairsie Latimer and Michal Harasimiuk

Programming for High Performance Accelerated Systems




3

Petapath

Hardware

Software

Tools

Consulting

ClearSpeed

NVIDIA

AMD

Intel


4

Joint Petapath/HP PRACE WP8 Prototype system at SARA/NCF


5

Joint Petapath/HP PRACE WP8 Prototype system at SARA/NCF

6U

10 TFLOPS

7 kW


6

Petapath

• 20 racks, 1.125 PFLOPS, end of 2009

• 500 kW

• Alternative systems – 15x the size, 8x the power
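The headline figures above imply a power efficiency that is easy to work out. The sketch below is plain arithmetic on the quoted numbers; it assumes the "alternative systems" comparison is made at equal peak performance and that "size" scales with rack count, neither of which the slide states explicitly.

```python
# Figures quoted on the slide for the full system.
racks = 20
peak_pflops = 1.125
power_kw = 500.0

# Derived metrics (simple arithmetic, not measured values).
gflops_per_watt = (peak_pflops * 1e6) / (power_kw * 1e3)  # GFLOPS per watt
tflops_per_rack = (peak_pflops * 1e3) / racks

# Assumed reading of the comparison: an alternative system of equal peak
# performance needs roughly 15x the racks and 8x the power.
alt_racks = 15 * racks
alt_power_kw = 8 * power_kw

print(f"{gflops_per_watt:.2f} GFLOPS/W, {tflops_per_rack:.2f} TFLOPS/rack")
print(f"alternative: ~{alt_racks} racks, ~{alt_power_kw / 1e3:.1f} MW")
```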


7

Programming for High Performance Accelerated Systems

• Overview of the development environment at SARA/NCF

• Options for programming heterogeneous systems

• Moving software development flows from multi-core to heterogeneous systems

• Developing with OpenCL going forward



8

Petapath/HP PRACE Prototype system at SARA/NCF


9

ClearSpeed Software Development environment at SARA/NCF

• ClearSpeed SDK Version 3.1
  • Binary compatible across all ClearSpeed based products

• Cn Optimising Compiler
  • C with poly extensions for SIMD data types

• Debugger – a port of gdb
  • Runs on hardware

• Profiler – csprof
  • Allows system-wide visualization of an accelerated application’s performance while running on both a multi-core host and ClearSpeed accelerators

• Libraries (BLAS, RNG and FFT) & High level APIs (CSPX)


10

• Standard Eclipse graphical debug interface for CSX processors

• CSX processors provide full hardware debugging of running application code

• Provides seamless view of many processor cores in parallel with their associated state

• Allows full symbolic debug of the Cn language

• Enhanced views for CSX specific information

ClearSpeed graphical debug interface for the heterogeneous systems

Images used with permission of ClearSpeed Technology Plc


11

ClearSpeed profiler for heterogeneous and multi-processor systems

[Figure: host CPU(s)/cores and Advance™ Accelerator Boards, each board containing CSX 600 pipelines, annotated with the four profiler views below]

HOST/BOARD INTERACTION: View host/board interactions. Provides performance information for data transfer operations. Trace cluster node/board interaction. See overlap of host compute and board compute.

CSX PIPELINE: View detailed instruction issue information. Visualize overlap of executing instructions. Optimize code at the instruction level. View instruction level performance bottlenecks. Get accurate instruction timing.

CSX SYSTEM: View system level trace. Visually inspect the overlap of compute and I/O. Visualize cache utilization. View branch trace of code executing. Find and analyse performance bottlenecks. Get accurate event timing.

HOST CODE PROFILING: Visually inspect host code executing. Supports multiple threads and processes. Time specific code sections. See overlap of host threads executing. Platform and processor agnostic trace collection.


12

Programming for High Performance Accelerated Systems


13

Programming for High Performance Accelerated Systems

Introduction

• Heterogeneous systems are now increasingly common

• They are being adopted at the top (Top500) and the bottom (technical workstation) of the HPC market

• Acceleration can deliver significant performance and cost savings over traditional COTS HPC systems

• However, there are real barriers to adoption:
  • Software support and programming models
  • Host system requirements


14

• In order to take advantage of this new technology trend, what are the realistic options?

• Some important things to consider:

• Single or multi-use system?

• Where do the majority of the cycles go?

• ISV codes or Open Source/Custom Codes?

• Sufficient development resources?


15

• Starting with application source, what is the best way to target heterogeneous computing today?

• Proprietary development environments and hardware:
  • Advance™/Cn (ClearSpeed)
  • Tesla™/CUDA (NVIDIA)
  • Stream™/Brook+ (AMD)
  • FPGA based solutions

• Or via Third Party/Middleware:
  • RapidMind™ Platform
  • CAPS HMPP™
  • PGI’s x64+GPU Accelerate Model
  • e.g. Mitrion Development Platform for FPGAs


16

• These options can loosely be categorised into

• Language
  • Cn, CUDA, Brook+, Mitrion C, OpenCL

• Directive based or hybrid approaches
  • PGI x64+GPU, CAPS HMPP, RapidMind
  • Allow re-targetable support
  • Can potentially support multiple vendor development environments


17

• Library

• All the languages have a library component
  • Manages hardware resources and runtime interaction

• Can also provide higher level abstractions such as standard library support, e.g. BLAS or LAPACK

• Some libraries are available from third parties that are designed to transparently interface ISV applications to accelerator hardware

• Often the best implementations are available from the vendors themselves


18

• Industry will inevitably move towards available open standards

• We believe that the Khronos Group’s OpenCL™ (Open Computing Language) will be a key enabler in the wider adoption of heterogeneous systems

• Petapath are members of the Khronos Group and participants on the OpenCL working group

What comes next?


19

• OpenCL is the first open, royalty-free standard for general-purpose parallel programming of heterogeneous systems

• OpenCL provides a uniform programming environment for software developers

• Can write efficient, portable code for a range of high-performance systems and a diverse mix of multi-core and parallel processors

OpenCL


20

• OpenCL consists of:
  • An API for coordinating parallel computation
  • A programming language for describing those computations

• Specifically, the OpenCL standard defines:
  • Subset of the C99 language with extensions for parallelism
  • API for coordinating data and task-based parallel computations
  • Numerical requirements based on the IEEE 754 standard
  • Interoperability with other Khronos standards such as OpenGL
  • An abstraction layer for a diverse range of computational resources


21

• OpenCL also specifies:
  • A rich set of built-in functions
  • Online or offline compilation and build of compute kernel executables

• Platform Layer API
  • Query, select and initialize compute devices
  • Create compute contexts and work-queues

• Runtime API
  • Execute compute kernels
  • Manage scheduling, compute and memory resources
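To make the split between "kernel" and "runtime" concrete, here is a plain-Python emulation of a data-parallel launch. It is not the OpenCL API: `run_ndrange` and `vector_add` are hypothetical names used only for illustration, and the work-items run one after another on the host rather than concurrently on a device.

```python
# Plain-Python emulation of a data-parallel kernel launch.
# NOT the OpenCL API; run_ndrange and vector_add are hypothetical names.

def vector_add(gid, a, b, out):
    """Kernel body: one invocation handles one element (one work-item)."""
    out[gid] = a[gid] + b[gid]

def run_ndrange(kernel, global_size, *args):
    """Stand-in for a runtime executing a kernel over a 1-D index space.

    A real runtime would schedule work-items concurrently on the device;
    here they simply run sequentially on the host.
    """
    for gid in range(global_size):
        kernel(gid, *args)

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(a)
run_ndrange(vector_add, len(a), a, b, out)
print(out)
```

The point of the sketch is the shape of the code: the kernel describes the work for a single index, and the runtime decides how the index space is mapped onto the hardware.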


22

• Is OpenCL a silver bullet?
  • Possibly not, but it’s an excellent place to start

• It’s a well supported open standard
  • OpenCL has complete cross vendor support
  • Most vendors are motivated to increase their market share in the HPC and Technical computing market

• Write once, work on many platforms is attractive for ISVs
  • The lack of an open standard has certainly slowed adoption of support for heterogeneous systems outside of the academic community for compute intensive applications


23

• When will it be available?
  • Khronos™ Group ratified the OpenCL™ 1.0 specification at SIGGRAPH Asia, December 9th 2008
  • Conformant vendor implementations available in Q3 2009
  • One vendor already has a public beta program
  • Others will not be far behind

• What are the principal reasons that make OpenCL attractive?
  • No reliance on proprietary programming languages
  • Cross vendor compatibility and interoperability
  • Cross platform support (Linux, Windows and OS X)


24

• The incentive to support heterogeneous systems has to be a clear business win; so companies who differentiate on innovation are more likely to adopt early

• Many large ISVs have long development cycles and if their licensing model is core or socket based they will have to revise their charging structures

• Heterogeneous computing won’t really hit the mainstream, multi-application HPC market without ISV support

Observations


25

Software development flows on multi-core and heterogeneous systems


26

Host Software Development Practice (Single Core)

• Typical host development flow (Rinse, Profile, Repeat)

• Use a naïve implementation (e.g. the infamous triple loop)

• Compile (compiler choice can often be important)

• Profile/Benchmark (use % of peak GFLOPS as a guide)

• Throw some compiler switches

• Repeat

• Some developers don’t get very far into this optimisation process

• Time vs Reward (Does it run fast enough yet?)
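The flow above can be sketched with the triple loop itself. This is a minimal illustration in Python rather than a compiled language, so the absolute numbers are meaningless; `PEAK_GFLOPS` is an assumed placeholder that would be replaced by the theoretical peak of the core being measured.

```python
import time

def matmul_naive(a, b, n):
    """Naive O(n^3) matrix multiply on row-major nested lists."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c

n = 64
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity

start = time.perf_counter()
c = matmul_naive(a, b, n)
elapsed = time.perf_counter() - start

flops = 2.0 * n ** 3          # one multiply and one add per inner iteration
achieved_gflops = flops / elapsed / 1e9
PEAK_GFLOPS = 10.0            # assumed placeholder, not a real hardware figure
percent_of_peak = 100.0 * achieved_gflops / PEAK_GFLOPS
print(f"{achieved_gflops:.4f} GFLOPS ({percent_of_peak:.2f}% of assumed peak)")
```

Multiplying by the identity matrix gives a cheap correctness check (the result must equal `a`) before any time is spent on tuning.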


27

Host Software Development Practice (Multi-core)

• Look for more scalable implementations
  • In the multi-core era look for algorithmically scalable solutions
  • This usually means looking to leverage architectural features
  • e.g. Make sure you are cache friendly and take advantage of Vector/SIMD support

• Compile, Profile/Benchmark (use % of peak GFLOPS as a guide)

• Throw compiler switches but also use compiler directives e.g. OpenMP which can require some changes to code

• The parameter space for these optimisations can be large

• Challenging even for the experienced


28

Host Software Development Practice (Pitfalls)

• The ‘memory wall’ is probably the biggest hurdle

• With more cores sharing an already scarce resource in main memory bandwidth, cache hygiene is very important!

• Once you fall out of cache then it is sometimes possible that adding more cores can actually slow down your application

• Effective programming is about optimising bandwidth
  • Tools such as Acumem’s SlowSpotter™ are particularly useful!

• Deliberately skipping multi-node development as it’s a whole other subject and deserves its own track


29

Heterogeneous Systems Software Development Practice

• An implementation tuned for multi-core is a good starting point for porting to an accelerated system

• This is because available concurrency (via multi-threading and asynchronous operations) and data parallel operations will likely have been explicitly exposed

• In all but the most compute bound applications, effective implementations of data parallel problems are usually tuned to maximise cache bandwidth

• And to allow effective loop blocking and strip-mining transformations

• This set of optimisations provides a good template for developing an algorithm on an accelerated system
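Loop blocking and strip-mining can be illustrated on matrix multiplication. The sketch below shows only the loop structure: in Python the cache benefit will not be visible, and `block` is a tuning parameter whose best value depends on the target's cache or local memory size. The function names are chosen for this example.

```python
def matmul_blocked(a, b, n, block=16):
    """Matrix multiply with loop blocking (tiling) applied to all three loops.

    Each loop is strip-mined into an outer loop over tiles and an inner
    loop within a tile, so one tile of a, b and c is reused while it is
    still resident in fast memory.
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    row_c = c[i]
                    for k in range(kk, min(kk + block, n)):
                        aik = a[i][k]
                        row_b = b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += aik * row_b[j]
    return c

def matmul_reference(a, b, n):
    """Straightforward reference used to check the blocked version."""
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

n = 20  # deliberately not a multiple of the block size
a = [[float(i * n + j) for j in range(n)] for i in range(n)]
b = [[float((i + 2 * j) % 7) for j in range(n)] for i in range(n)]
c_blocked = matmul_blocked(a, b, n, block=8)
c_ref = matmul_reference(a, b, n)
print("blocked result matches reference:", c_blocked == c_ref)
```

The same tile structure carries over to an accelerator, where a tile is typically sized to fit the local memory shared by a group of processing elements.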


30

• First ascertain what the limiting factors on the host are:

• Bandwidth
  • Is your application bandwidth limited on the host?
  • In its most cache/memory friendly implementation does it scale and exhibit good cache behaviour?
  • GPU based accelerators have several times the bandwidth to their local memories compared with even the latest servers
  • However accelerators typically have less local memory than the host so large working sets will have to be streamed from the host

• Any significant and repeated data movement to and from the accelerator can often be a gating factor for overall application acceleration
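A back-of-envelope model shows how data movement gates the overall gain. This is a sketch under simplifying assumptions: transfers and compute do not overlap, and the host link sustains a fixed bandwidth. The function name and the numbers are hypothetical.

```python
def offload_speedup(host_time_s, kernel_speedup, bytes_moved, link_gb_per_s):
    """Estimate the end-to-end speedup of offloading one kernel.

    Assumes transfers and compute do not overlap and that the link
    sustains link_gb_per_s (both are simplifying assumptions).
    """
    accel_compute_s = host_time_s / kernel_speedup
    transfer_s = bytes_moved / (link_gb_per_s * 1e9)
    return host_time_s / (accel_compute_s + transfer_s)

# Hypothetical case: 1 s on the host, kernel 20x faster on the accelerator,
# 2 GB moved in total over a link sustaining 4 GB/s.
with_transfers = offload_speedup(1.0, 20.0, 2e9, 4.0)
without_transfers = offload_speedup(1.0, 20.0, 0.0, 4.0)
print(f"with transfers: {with_transfers:.2f}x, without: {without_transfers:.2f}x")
```

In this made-up case the transfer takes 0.5 s against 0.05 s of accelerated compute, so a 20x kernel delivers less than 2x overall. Keeping data resident on the accelerator, or overlapping transfers with compute, attacks exactly this term.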


31

• Compute
  • Is your application compute limited?

• Is it single precision?
  • Single precision is still where GPU based accelerators have the clear advantage

• Does your application require double precision?
  • GPU based accelerators have less of a delta over x86 hosts in terms of pure DP performance
  • Lower GFLOP/$ and GFLOP/W vs Implementation Complexity

• ClearSpeed has significant advantages in terms of GFLOP/W for applications needing double precision


32

• As for optimal multi-core development

• Make sure you are making the most of the architectural features

• Occupancy vs. Latency hiding

• Shared or local memory accesses

• Consider using other memories (constant, texture etc.)

• Make sure you are maximising external memory bandwidth
  • Correct alignment and granularity vital
  • Must use coalesced memory accesses
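Why coalescing matters can be seen by counting how many aligned memory segments a group of work-items touches. The model below is an illustration only: the 128-byte segment size and the group of 32 work-items are assumptions, and the real granularity and rules are device specific.

```python
def segments_touched(num_work_items, stride_elems, elem_bytes=4, segment_bytes=128):
    """Count distinct aligned memory segments touched by one group of work-items.

    Work-item i reads element i * stride_elems of an array that starts at an
    aligned address. The 128-byte segment size is an assumption for
    illustration; actual transaction sizes depend on the device.
    """
    segments = set()
    for i in range(num_work_items):
        byte_addr = i * stride_elems * elem_bytes
        segments.add(byte_addr // segment_bytes)
    return len(segments)

contiguous = segments_touched(32, 1)   # neighbouring work-items read neighbouring elements
strided = segments_touched(32, 32)     # each work-item jumps a whole segment ahead
print(f"contiguous: {contiguous} segment(s), strided: {strided} segment(s)")
```

Under these assumptions the contiguous pattern is served from a single segment, while the strided pattern needs one segment per work-item and wastes most of the bandwidth it consumes.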

General comments on using accelerators


33

Accelerator Software Development Pitfalls

• Pay attention to Amdahl’s Law

• Simply put it describes the limit of potential acceleration of an application due to parallelisation

• Applies equally to many multi-core implementations

• As you process the data parallel kernels faster, the data movement and other serial portions of the application start to dominate the actual runtime

• At this point the host interface to the accelerator can now be a bottleneck
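Amdahl's Law is short enough to state as code. With a fraction p of the runtime accelerated by a factor n, the overall speedup is 1 / ((1 - p) + p / n), which can never exceed 1 / (1 - p). The 95% figure below is a hypothetical example, not a measurement.

```python
def amdahl_speedup(parallel_fraction, n):
    """Amdahl's Law: overall speedup when a fraction p of the runtime is
    sped up by a factor n and the remainder (1 - p) is unchanged."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)

# Hypothetical application with 95% of its runtime in data parallel kernels.
for n in (2, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

However fast the kernels become, the remaining 5% (serial code and data movement) caps this example at 20x, which is why the host interface ends up as the bottleneck.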


34

The future - Developing with OpenCL


35

OpenCL in use

• The Khronos Group’s conformance requirements for OpenCL will endeavour to ensure correctness of implementation between vendors

• A real challenge for those using OpenCL could well be managing varying performance characteristics of different OpenCL capable platforms

• Even different products by the same vendor may vary

• What works well on a multi-core CPU and what runs efficiently on a massively parallel accelerator will likely differ


36

• How similar is the heterogeneous development environment to traditional host development?

• What tools are there to help the development process?

• Do they all support a similar debug interface?

• Do they all have similar profiling capabilities?

Will development methods and tools converge?


37

• Debug
  • Hardware gdb support?
  • ClearSpeed supports source level debug of Cn
  • NVIDIA in CUDA 2.1, CELL
  • Debug for Brook+ & pre-CUDA 2.1 was via host versions of kernels

• Profiling
  • gprof (supported by ClearSpeed in Cn)
  • Host API only support for gprof with NVIDIA

• Hardware profiling?
  • ClearSpeed has a very sophisticated profiling and debugging environment
  • Other profilers currently report a more limited set of information for kernels running on HW


38

• Will these debug and profile tools support OpenCL out of the gate?

• With an open development environment now available, it makes sense to develop cross-platform tools that support OpenCL natively and more importantly across multiple vendors and operating systems

• Not having to use vendor specific tools will increase the likelihood that developers will not spend too much time tuning for each platform

What will OpenCL have initially?


39

ClearSpeed CSX700

All Image Rights reserved by original copyright holders

Architectures targeted by OpenCL are similar, but different …


40

NVIDIA GT200

Image Rights reserved by original copyright holders


41

AMD RV770



42

INTEL LARRABEE



43

• Additional utilities and development tools available to the host based developer:

• Intel® Compilers, MKL, IPP, VTune, Thread Building Blocks, Thread Checker (and soon Parallel Studio)

• AMD Partner Compilers, CodeAnalyst, ACML

• Acumem SlowSpotter

• Allinea Tools

• And a myriad of other third party tools …

Can we look forward to …


44

Petapath Questions?
