petapath hp cast 12 - programming for high performance accelerated systems

DESCRIPTION

Presentation given at HP-CAST 12, Tutorial Session in Madrid, May 2009 on Software Environments for Accelerators.

TRANSCRIPT

Page 1: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

12/04/23 Copyright © 2009 Petapath Limited. All rights reserved.

Petapath

Dairsie Latimer and Michal Harasimiuk

Programming for High Performance Accelerated Systems

Page 2: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

Petapath

Page 3: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

Petapath

Hardware

Software

Tools

Consulting

ClearSpeed

NVIDIA

AMD

Intel

Page 4: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

Joint Petapath/HP PRACE WP8 Prototype system at SARA/NCF

Page 5: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

Joint Petapath/HP PRACE WP8 Prototype system at SARA/NCF

6U

10 TFLOPS

7 kW

Page 6: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

• 20 racks, 1.125 PFLOPS, end of 2009
• 500 kW
• Alternative systems – 15x the size, 8x the power
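The figures on this slide imply a per-rack density and an energy efficiency that are easy to work out. The short Python sketch below does the arithmetic. It assumes that "size" means rack count and that the alternative systems deliver the same peak performance; neither assumption is stated on the slide.

```python
# Figures quoted on the slide.
racks = 20
peak_pflops = 1.125
power_kw = 500

# Derived density and efficiency for the accelerated system.
tflops_per_rack = peak_pflops * 1000 / racks               # TFLOPS per rack
kw_per_rack = power_kw / racks                             # kW per rack
gflops_per_watt = (peak_pflops * 1e6) / (power_kw * 1e3)   # GFLOPS per watt

# "Alternative systems": 15x the size, 8x the power.
# Assumption: same peak performance, and "size" means number of racks.
alt_racks = racks * 15
alt_power_kw = power_kw * 8
alt_gflops_per_watt = (peak_pflops * 1e6) / (alt_power_kw * 1e3)

print(f"{tflops_per_rack} TFLOPS/rack, {kw_per_rack} kW/rack, "
      f"{gflops_per_watt} GFLOPS/W")
print(f"alternative: {alt_racks} racks, {alt_power_kw} kW, "
      f"{alt_gflops_per_watt} GFLOPS/W")
```

Under these assumptions the 8x power figure translates directly into an 8x difference in GFLOPS per watt.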

Page 7: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

Programming for High Performance Accelerated Systems

• Overview of the development environment at SARA/NCF

• Options for programming heterogeneous systems

• Moving software development flows from multi-core to heterogeneous systems

• Developing with OpenCL going forward

Page 8: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

Petapath/HP PRACE Prototype system at SARA/NCF

Page 9: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

ClearSpeed Software Development environment at SARA/NCF

• ClearSpeed SDK Version 3.1
  • Binary compatible across all ClearSpeed based products

• Cn Optimising Compiler
  • C with poly extensions for SIMD data types

• Debugger – a port of gdb
  • Runs on hardware

• Profiler – csprof
  • Allows system-wide visualization of an accelerated application’s performance while running on both a multi-core host and ClearSpeed accelerators

• Libraries (BLAS, RNG and FFT) & High level APIs (CSPX)

Page 10: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

ClearSpeed graphical debug interface for the heterogeneous systems

• Standard Eclipse graphical debug interface for CSX processors

• CSX processors provide full hardware debugging of running application code

• Provides seamless view of many processor cores in parallel with their associated state

• Allows full symbolic debug of the Cn language

• Enhanced views for CSX specific information

Images used with permission of ClearSpeed Technology Plc

Page 11: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

ClearSpeed profiler for heterogeneous and multi-processor systems

[Diagram: host CPU(s) and cores connected to Advance™ Accelerator Boards, each carrying CSX 600 processors and their pipelines]

HOST/BOARD INTERACTION: View host/board interactions. Provides performance information for data transfer operations. Trace cluster node/board interaction. See overlap of host compute and board compute.

CSX PIPELINE: View detailed instruction issue information. Visualize overlap of executing instructions. Optimize code at the instruction level. View instruction level performance bottlenecks. Get accurate instruction timing.

CSX SYSTEM: View system level trace. Visually inspect the overlap of compute and I/O. Visualize cache utilization. View branch trace of code executing. Find and analyse performance bottlenecks. Get accurate event timing.

HOST CODE PROFILING: Visually inspect host code executing. Supports multiple threads and processes. Time specific code sections. See overlap of host threads executing. Platform and processor agnostic trace collection.

Page 12: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

Programming for High Performance Accelerated Systems

Page 13: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

Programming for High Performance Accelerated Systems

Introduction

• Heterogeneous systems are now increasingly common

• They are being adopted at the top (Top500) and the bottom (technical workstation) of the HPC market

• Acceleration can deliver significant performance and cost savings over traditional COTS HPC systems

• However, there are real barriers to adoption:
  • Software support and programming models
  • Host system requirements

Page 14: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

• In order to take advantage of this new technology trend, what are the realistic options?

• Some important things to consider:

• Single or multi-use system?

• Where do the majority of the cycles go?

• ISV codes or Open Source/Custom Codes?

• Sufficient development resources?

Page 15: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

• Starting with application source, what is the best way to target heterogeneous computing today?

• Proprietary development environments and hardware:
  • Advance™/Cn (ClearSpeed)
  • Tesla™/CUDA (NVIDIA)
  • Stream™/Brook+ (AMD)
  • FPGA based solutions

• Or via Third Party/Middleware:
  • RapidMind™ Platform
  • CAPS HMPP™
  • PGI’s x64+GPU Accelerate Model
  • e.g. Mitrion Development Platform for FPGAs

Page 16: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

• These options can loosely be categorised into:

• Language
  • Cn, CUDA, Brook+, Mitrion C, OpenCL

• Directive based or hybrid approaches
  • PGI x64+GPU, CAPS HMPP, RapidMind
  • Allow re-targetable support
  • Can potentially support multiple vendor development environments

Page 17: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

• Library

• All the languages have a library component
  • Manages hardware resources and runtime interaction

• Can also provide higher level abstractions such as standard library support, e.g. BLAS or LAPACK

• Some libraries are available from third parties that are designed to transparently interface ISV applications to accelerator hardware

• Often the best implementations are available from the vendors themselves

Page 18: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

What comes next?

• Industry will inevitably move towards available open standards

• We believe that the Khronos Group’s OpenCL™ (Open Computing Language) will be a key enabler in the wider adoption of heterogeneous systems

• Petapath are members of the Khronos Group and participants on the OpenCL working group

Page 19: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

OpenCL

• OpenCL is the first open, royalty-free standard for general-purpose parallel programming of heterogeneous systems

• OpenCL provides a uniform programming environment for software developers

• Developers can write efficient, portable code for a range of high-performance systems and a diverse mix of multi-core and parallel processors

Page 20: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

• OpenCL consists of:
  • An API for coordinating parallel computation
  • A programming language for describing those computations

• Specifically, the OpenCL standard defines:
  • Subset of the C99 language with extensions for parallelism
  • API for coordinating data and task-based parallel computations
  • Numerical requirements based on the IEEE 754 standard
  • Interoperability with other Khronos standards such as OpenGL
  • An abstraction layer for a diverse range of computational resources
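The split between a kernel language and a coordinating API can be illustrated without any OpenCL installation. The Python sketch below only emulates the data-parallel model: one function body is applied once per index in a 1-D index space. The names `vector_add_kernel` and `enqueue_nd_range` are hypothetical stand-ins invented for this illustration and are not part of the OpenCL API; in real OpenCL the kernel would be written in the C99-based kernel language and would obtain its index with `get_global_id(0)`.

```python
def vector_add_kernel(gid, a, b, out):
    """Work done by a single work-item.

    gid plays the role that get_global_id(0) plays inside an OpenCL kernel.
    """
    out[gid] = a[gid] + b[gid]


def enqueue_nd_range(kernel, global_size, *args):
    """Hypothetical stand-in for launching a kernel over a 1-D index space.

    This emulation runs the work-items one after another; a real runtime
    is free to run them in parallel on the device.
    """
    for gid in range(global_size):
        kernel(gid, *args)


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(a)

# "Host" side: choose the index space and launch the kernel over it.
enqueue_nd_range(vector_add_kernel, len(a), a, b, out)
print(out)
```

The point of the sketch is the division of labour: the kernel describes the work for one element, while the host code decides how many elements there are and where the data lives.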

Page 21: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

• OpenCL also specifies:
  • A rich set of built-in functions
  • Online or offline compilation and build of compute kernel executables

• Platform Layer API
  • Query, select and initialize compute devices
  • Create compute contexts and work-queues

• Runtime API
  • Execute compute kernels
  • Manage scheduling, compute and memory resources

Page 22: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

• Is OpenCL a silver bullet?
  • Possibly not, but it’s an excellent place to start

• It’s a well supported open standard
  • OpenCL has complete cross vendor support
  • Most vendors are motivated to increase their market share in the HPC and technical computing market

• Write once, work on many platforms is attractive for ISVs
  • The lack of an open standard has certainly slowed adoption of support for heterogeneous systems outside of the academic community for compute intensive applications

Page 23: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

• When will it be available?
  • Khronos™ Group ratified the OpenCL™ 1.0 specification at Siggraph Asia, December 9th 2008
  • Conformant vendor implementations available in Q3 2009
  • One vendor already has a public beta program
  • Others will not be far behind

• What are the principal reasons that make OpenCL attractive?
  • No reliance on proprietary programming languages
  • Cross vendor compatibility and interoperability
  • Cross platform support (Linux, Windows and OS X)

Page 24: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

Observations

• The incentive to support heterogeneous systems has to be a clear business win, so companies who differentiate on innovation are more likely to adopt early

• Many large ISVs have long development cycles, and if their licensing model is core or socket based they will have to revise their charging structures

• Heterogeneous computing won’t really hit the mainstream, multi-application HPC market without ISV support

Page 25: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

Software development flows on multi-core and heterogeneous systems

Page 26: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

Host Software Development Practice (Single Core)

• Typical host development flow (Rinse, Profile, Repeat)

• Use a naïve implementation (e.g. the infamous triple loop)

• Compile (compiler choice can often be important)

• Profile/Benchmark (use % of peak GFLOPS as a guide)

• Throw some compiler switches

• Repeat

• Some developers don’t get very far into this optimisation process

• Time vs Reward (Does it run fast enough yet?)
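The "infamous triple loop" and the "% of peak GFLOPS" yardstick can be sketched as follows. This is an illustrative Python version, not code from the talk; the peak figure `ASSUMED_PEAK_GFLOPS` is a placeholder that has to be replaced with the real peak of the machine under test, and the flop count of 2n³ assumes a square n x n multiply.

```python
import time


def matmul_naive(a, b):
    """Naive triple-loop matrix multiply on lists of lists."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def percent_of_peak(n, seconds, peak_gflops):
    """Achieved GFLOPS for an n x n multiply as a percentage of peak.

    An n x n matrix multiply performs 2 * n**3 floating point operations.
    """
    achieved_gflops = 2.0 * n ** 3 / seconds / 1e9
    return 100.0 * achieved_gflops / peak_gflops


n = 32
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i - j) for j in range(n)] for i in range(n)]

start = time.perf_counter()
c = matmul_naive(a, b)
elapsed = time.perf_counter() - start

ASSUMED_PEAK_GFLOPS = 10.0  # placeholder, not a measured or quoted figure
print(f"{percent_of_peak(n, elapsed, ASSUMED_PEAK_GFLOPS):.4f}% of assumed peak")
```

Interpreted Python will land far below any hardware peak; the same measurement applied to compiled code is what makes the "does it run fast enough yet?" question concrete.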

Page 27: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

Host Software Development Practice (Multi-core)

• Look for more scalable implementations
  • In the multi-core era look for algorithmically scalable solutions
  • This usually means looking to leverage architectural features
  • e.g. Make sure you are cache friendly and take advantage of Vector/SIMD support

• Compile, Profile/Benchmark (use % of peak GFLOPS as a guide)

• Throw compiler switches but also use compiler directives e.g. OpenMP which can require some changes to code

• The parameter space for these optimisations can be large

• Challenging even for the experienced

Page 28: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

Host Software Development Practice (Pitfalls)

• The ‘memory wall’ is probably the biggest hurdle

• With more cores sharing an already scarce resource in main memory bandwidth, cache hygiene is very important!

• Once you fall out of cache, adding more cores can actually slow down your application

• Effective programming is about optimising bandwidth
  • Tools such as Acumem’s SlowSpotter™ are particularly useful!

• Deliberately skipping multi-node development as it’s a whole other subject and deserves its own track

Page 29: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

Heterogeneous Systems Software Development Practice

• An implementation tuned for multi-core is a good starting point for porting to an accelerated system

• This is because available concurrency (via multi-threading and asynchronous operations) and data parallel operations will likely have been explicitly exposed

• In all but the most compute bound applications, effective implementations of data parallel problems are usually tuned to maximise cache bandwidth

• And to allow effective loop blocking and strip-mining transformations

• This set of optimisations provides a good template for developing an algorithm on an accelerated system
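Loop blocking can be sketched on the matrix multiply mentioned earlier in the talk. The Python version below is an illustration only: it tiles all three loops so that each block of work touches a small, reusable part of each matrix. The block size of 2 is an arbitrary choice for the example; in practice it would be tuned to the cache (or to the accelerator's local memory).

```python
def matmul_blocked(a, b, block=2):
    """Matrix multiply with all three loops tiled into block x block tiles."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # Outer loops walk over tiles of the iteration space.
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for p0 in range(0, k, block):
                # Inner loops work within one tile; min() handles sizes
                # that are not a multiple of the block size.
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + block, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


print(matmul_blocked([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The tile structure is what carries over to an accelerator: each tile becomes a unit of work whose operands can be staged into fast local memory before the inner loops run.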

Page 30: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

• First ascertain what the limiting factors on the host are:

• Bandwidth
  • Is your application bandwidth limited on the host?
  • In its most cache/memory friendly implementation does it scale and exhibit good cache behaviour?
  • GPU based accelerators have several times the BW to their local memories of even the latest servers
  • However, accelerators typically have less local memory than the host, so large working sets will have to be streamed from the host

• Any significant and repeated data movement to and from the accelerator can often be a gating factor for overall application acceleration
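The effect of data movement on overall acceleration can be estimated with a very simple model. The Python sketch below assumes that transfers and kernel execution do not overlap and that the host link sustains a fixed bandwidth; the function name and all the numbers in the example are hypothetical.

```python
def offload_speedup(host_seconds, kernel_speedup, bytes_moved, link_gb_per_s):
    """Overall speedup from offloading a kernel, including transfer time.

    Assumes transfers and compute do not overlap (a pessimistic model).
    """
    transfer_seconds = bytes_moved / (link_gb_per_s * 1e9)
    accel_seconds = host_seconds / kernel_speedup + transfer_seconds
    return host_seconds / accel_seconds


# Hypothetical case: a 1 s host kernel that runs 10x faster on the
# accelerator but needs 2 GB moved over a 4 GB/s link.
print(offload_speedup(1.0, 10.0, 2e9, 4.0))
```

In this hypothetical case the 0.5 s spent on transfers dominates the 0.1 s of accelerated compute, which is why keeping data resident on the accelerator, or overlapping transfers with compute, matters so much.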

Page 31: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

• Compute
  • Is your application compute limited?

• Is it single precision?
  • Single precision is still the clear advantage for GPU based accelerators

• Does your application require double precision?
  • GPU based accelerators have less of a delta over x86 hosts in terms of pure DP performance
  • Lower GFLOP/$ and GFLOP/W vs Implementation Complexity

• ClearSpeed has significant advantages in terms of GFLOP/W for applications needing double precision

Page 32: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

General comments on using accelerators

• As for optimal multi-core development

• Make sure you are making the most of the architectural features
  • Occupancy vs. Latency hiding
  • Shared or local memory accesses

• Consider using other memories (constant, texture etc.)

• Make sure you are maximising external memory bandwidth
  • Correct alignment and granularity vital
  • Must use coalesced memory accesses
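Why coalescing and alignment matter can be shown with a toy model that counts how many memory segments a group of threads touches in one access. The Python sketch below is a deliberate simplification: the group size of 16 threads, the 4-byte elements and the 64-byte segment size are illustrative assumptions, not the parameters of any particular accelerator.

```python
def segments_touched(num_threads, stride_elems, elem_bytes=4, segment_bytes=64):
    """Count distinct memory segments hit when thread t reads element t * stride.

    Fewer segments means fewer memory transactions for the same useful data.
    """
    addresses = (t * stride_elems * elem_bytes for t in range(num_threads))
    return len({addr // segment_bytes for addr in addresses})


for stride in (1, 2, 16):
    print(f"stride {stride}: {segments_touched(16, stride)} segment(s)")
```

In this model unit-stride access is served from a single segment, while a stride of 16 elements makes every thread hit a different segment, so most of the bandwidth is spent fetching data nobody uses.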

Page 33: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

Accelerator Software Development Pitfalls

• Pay attention to Amdahl’s Law

• Simply put, it describes the limit on the potential acceleration of an application through parallelisation

• Applies equally to many multi-core implementations

• As you process the data parallel kernels faster, the data movement and other serial portions of the application start to dominate the actual runtime

• At this point the host interface to the accelerator can now be a bottleneck
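Amdahl's Law is short enough to state directly: if a fraction p of the runtime is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The Python sketch below evaluates it; the function name and the example fractions are chosen for illustration only.

```python
def amdahl_speedup(parallel_fraction, kernel_speedup):
    """Overall speedup when only parallel_fraction of the runtime is accelerated."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / kernel_speedup)


# Illustrative case: 90% of the runtime is in data parallel kernels.
for s in (2, 10, 100, 1000):
    print(f"kernel speedup {s:>4}x -> overall {amdahl_speedup(0.9, s):.2f}x")
```

With 90% of the runtime accelerated, the overall speedup can never exceed 10x however fast the kernels become, so the remaining serial work and data movement set the ceiling.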

Page 34: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

The future - Developing with OpenCL

Page 35: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

OpenCL in use

• The Khronos Group’s conformance requirements for OpenCL will endeavour to ensure correctness of implementation between vendors

• A real challenge for those using OpenCL could well be managing varying performance characteristics of different OpenCL capable platforms

• Even different products by the same vendor may vary

• What works well on a multi-core CPU and what works efficiently on a massively parallel accelerator will likely differ

Page 36: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

Will development methods and tools converge?

• How similar is the heterogeneous development environment to traditional host development?

• What tools are there to help the development process?
  • Do they all support a similar debug interface?
  • Do they all have similar profiling capabilities?

Page 37: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

• Debug
  • Hardware gdb support?
  • ClearSpeed supports source level debug of Cn
  • NVIDIA in CUDA 2.1, CELL
  • Debug for Brook+ & pre-CUDA 2.1 was via host versions of kernels

• Profiling
  • gprof (supported by ClearSpeed in Cn)
  • Host API only support for gprof with NVIDIA

• Hardware profiling?
  • ClearSpeed has a very sophisticated profiling and debugging environment
  • Other profilers currently report a more limited set of information for kernels running on HW

Page 38: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

What will OpenCL have initially?

• Will these debug and profile tools support OpenCL out of the gate?

• With an open development environment now available, it makes sense to develop cross-platform tools that support OpenCL natively and more importantly across multiple vendors and operating systems

• Not having to use vendor specific tools will increase the likelihood that developers will not spend too much time tuning for each platform

Page 39: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

Architectures targeted by OpenCL are similar, but different …

ClearSpeed CSX700

All Image Rights reserved by original copyright holders

Page 40: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

NVIDIA GT200

Image Rights reserved by original copyright holders

Page 41: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

AMD RV770

Image Rights reserved by original copyright holders

Page 42: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

INTEL LARRABEE

Image Rights reserved by original copyright holders

Page 43: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

Can we look forward to …

• Additional utilities and development tools available to the host based developer:

• Intel® Compilers, MKL, IPP, VTune, Threading Building Blocks, Thread Checker (and soon Parallel Studio)

• AMD Partner Compilers, CodeAnalyst, ACML

• Acumem SlowSpotter

• Allinea Tools

• And a myriad of other third party tools …

Page 44: Petapath HP Cast 12 - Programming for High Performance Accelerated Systems

Questions?