
Chap. 10 part 1

CIS*3090 Fall 2016

Purpose of chap. 10

Overview of HW technology

Doesn’t fit the usual molds of “HPC cluster”, multicore, or SMP system

New non-mainstream SW approaches

Theory and principles we studied still apply

The specific techniques { Pilot/MPI, pthreads, OpenMP } may not directly apply


Why we aren’t learning these platforms/languages

GPUs common

Inside every display adapter!

But difficult to exploit for general-purpose computing (GPGPU), and not standardized (yet)

Cells uncommon (except PlayStation 3)

IBM no longer developing new models

Painful to program, steep learning curve

Contrast multicores, clusters


Theme: Attached processors

Can be cost-effective to attach a special-purpose processor to GP CPU

Specially-designed logic optimizes certain operations with high speed/throughput

Aka co-processor, accelerator

Done for decades for floating point!

If you need the capability, buy it and plug it in

If not, save the cost, board space, power, heat

Still true today in embedded systems where unit cost is critical


Trend: Smart device adapters

Device adapters on the PC bus became “smarter” and more powerful, turning into computers in their own right

Display adapters took over the task of rendering text and 2D/3D graphics into raster scans

Offloads time-consuming tasks from the CPU program/OS

Pay for the power/memory you need/want


FPGAs as accelerators

Field-programmable gate arrays

Our research group + SOE have a lot of experience with this

Aka reconfigurable logic, “soft hardware”

HW advantage from custom design

Circuit performs calculation needed by your app that’s time-consuming for CPU but can be done rapidly by dedicated HW

Multiple instances of the same circuit → parallel computation!

FPGA in use

“Programming” the FPGA

Circuit description (configuration) downloaded into device as bitstream

CPU program sends args/data to device via system bus or parallel/serial port

Perhaps FPGA interrupts CPU when done

CPU reads results from FPGA and program continues
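To make that CPU–FPGA interaction concrete, here is a minimal host-side sketch in C. It assumes a hypothetical memory-mapped accelerator with ARG, START, STATUS, and RESULT registers; real boards expose this through vendor-specific drivers or APIs, so every name and offset below is illustrative only.

```c
#include <stdint.h>

/* Hypothetical register offsets into the FPGA's memory-mapped window. */
#define REG_ARG     0   /* operand the circuit will process        */
#define REG_START   1   /* writing 1 tells the circuit to begin    */
#define REG_STATUS  2   /* circuit sets this to 1 when finished    */
#define REG_RESULT  3   /* computed result appears here            */

/* 'fpga' is the base of the device's register window; obtaining it
 * (e.g., by mmap()'ing the device or calling a vendor API) is outside
 * this sketch. */
uint32_t fpga_compute(volatile uint32_t *fpga, uint32_t arg)
{
    fpga[REG_ARG]   = arg;        /* send argument over the system bus  */
    fpga[REG_START] = 1;          /* kick off the configured circuit    */

    /* Poll until the circuit signals completion (an interrupt-driven
     * design would sleep instead of spinning). */
    while (fpga[REG_STATUS] == 0)
        ;

    return fpga[REG_RESULT];      /* read result; CPU program continues */
}
```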


Pros and cons

PRO: very fast calculations

App-specific

Can reconfigure on-the-fly from a library of “personalities”, even “partial reconfiguration”

Cheap: costs less than a fast CPU or a custom IC

CON: difficult to design fast circuit well

Special knowledge/training needed

Time-consuming (minutes to hours) to convert HW description to bitstream


Designers turn to multicore

How to exploit Moore’s Law?

Transistor density still increasing

But power/heat limit single-CPU speed

“Mainstream” multicore chips (Intel, AMD)

Copies of existing instruction processor

Programming is easy: write in the same HLL or ASM, use the same compiler

Easy to configure as SMP, exploit with multithreaded OS, OpenMP
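As a reminder of how little code it takes to exploit a homogeneous multicore via OpenMP, here is a minimal C sketch (the array size and loop body are just stand-ins):

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];

    /* Same language, same compiler as the serial version; one pragma
     * spreads the loop iterations across all available cores. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("done; up to %d threads available\n", omp_get_max_threads());
    return 0;
}
```

Compile with an OpenMP-capable compiler, e.g. `gcc -fopenmp`.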


Must extra cores simply be copies of “main” CPU?

Heterogeneous (vs. homogeneous) multiprocessors (MP)

Driven by video gaming industry

Only a specific subset of calculations was needed

Single-precision floating point

No need for the full instruction set, nor a deeply pipelined CPU

Code segments small (e.g., shaders), so didn’t need (shared) access to huge RAM

IDEA: the lower speed of special-purpose “cores” would be made up for by their quantity!


Designs went in 2 directions

(1) Video adapter manufacturers (NVIDIA, AMD) populated cards with multiple Graphics Processing Units

Optimized for “graphics pipeline” (Fig 10.3)

Not intended as GP computing platform!

But programmers wanted to exploit them for non-video apps → GPGPU

In that sense, now used as co-processor:

Main CPU configures GPUs with app-specific program, uploads data, downloads results
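The chapter’s road map covers CUDA and OpenACC later; as a taste of the co-processor workflow just described (configure, upload, compute, download), here is a minimal C/OpenACC sketch. The directives and data clauses are standard OpenACC, but the vector-add loop is only a placeholder computation:

```c
#include <stdio.h>

#define N 1000000

int main(void)
{
    static float a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

    /* copyin: upload a,b to GPU memory; copyout: download c when done.
     * The compiler generates the app-specific GPU program from the loop. */
    #pragma acc parallel loop copyin(a[0:N], b[0:N]) copyout(c[0:N])
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}
```

Build with an OpenACC compiler, e.g. `nvc -acc` or `gcc -fopenacc`; without such a compiler the pragma is ignored and the loop runs on the CPU.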


Figure 10.3 The graphics pipeline.

GPU programming craze

Cheap, plentiful compute power

Just a step away on the system bus!

Awesome throughput (gigaflops = billions of floating point operations per second)

Growing much faster than CPUs (Fig 10.2)

Updated for 2013, http://michaelgalloy.com/2013/06/11/cpu-vs-gpu-performance.html

Based on higher-speed cores and more of them

Single precision much faster than double


Figure 10.2 GPU versus CPU performance over time. (Updated chart, 11-Jun-2013: CPU vs GPU performance.)

Cell Broadband Engine (BE)

(2) Self-contained heterogeneous MP

Sony, Toshiba & IBM teamed up to create a platform for high-end video games

PlayStation 3 targeted for entertainment

Potential for high-speed parallel programming

Unique architecture (Fig 10.5)

PPE: PowerPC core, GP computer

SPEs: 8-16 “Synergistic” cores

A mere 256 KB of local memory each!

No cache-coherent “global memory”


Figure 10.5 The floor plan of the Cell processor.

Cell programming

Needs two compilers

PPE intentionally low-performance

Runs Linux, but the application’s work should go on the SPEs

SPEs have SIMD instructions

E.g., multiply 4 pairs of numbers to give 4 results in one instruction

To access them usually requires ASM, or library calls (intrinsics) that invoke those instructions (see the sketch after this slide)

“Everyone said” difficult to program

And they’re right!
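To illustrate the “multiply 4 pairs in one instruction” idea without the Cell SDK, here is a generic C sketch using GCC/Clang vector extensions. Real SPE code would use SPU intrinsics or assembly, so take this only as an illustration of the SIMD concept, not as Cell code:

```c
#include <stdio.h>

/* A 16-byte vector type holding 4 single-precision floats
 * (GCC/Clang vector extension, not the Cell SDK). */
typedef float v4sf __attribute__((vector_size(16)));

int main(void)
{
    v4sf a = { 1.0f,  2.0f,  3.0f,  4.0f};
    v4sf b = {10.0f, 20.0f, 30.0f, 40.0f};

    /* One vector multiply: 4 pairs of operands, 4 results, ideally
     * a single SIMD instruction on the target hardware. */
    v4sf c = a * b;

    for (int i = 0; i < 4; i++)
        printf("c[%d] = %f\n", i, c[i]);
    return 0;
}
```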

Pilot to the rescue!

My grad student Natalie Girard

No one was using SHARCNET’s Cell cluster

Needed to combine pthreads, MPI, Cell SDK

Extended Pilot to Cell = CellPilot

Possible to treat SPEs as processes

Communicate with channels transparently

Within single Cell node

Across heterog. cluster of Cells + other computers

Uses the “best” available means: DMA, messages, mailboxes

From what we learned with CellPilot…

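For readers who haven’t seen Pilot’s channel style, here is a rough two-process C sketch written from memory of the Pilot API; function names and signatures should be checked against the Pilot documentation. CellPilot extends the same model so that SPEs can sit at either end of a channel:

```c
#include "pilot.h"

static PI_CHANNEL *to_worker, *from_worker;

/* Worker process: read a value from its channel, send back the square. */
static int worker(int index, void *arg)
{
    int x;
    PI_Read(to_worker, "%d", &x);
    PI_Write(from_worker, "%d", x * x);
    return 0;
}

int main(int argc, char *argv[])
{
    PI_Configure(&argc, &argv);                 /* configuration phase   */

    PI_PROCESS *w = PI_CreateProcess(worker, 0, NULL);
    to_worker   = PI_CreateChannel(PI_MAIN, w); /* main -> worker        */
    from_worker = PI_CreateChannel(w, PI_MAIN); /* worker -> main        */

    PI_StartAll();                              /* execution phase       */

    int result;
    PI_Write(to_worker, "%d", 7);
    PI_Read(from_worker, "%d", &result);        /* expect 49             */

    PI_StopMain(0);
    return 0;
}
```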

Another grad, Ben Kelly

Renesas IMAPCAR2: 128-core chip!

Embedded automotive image processing

Works in 128-way SIMD mode, or

Micro-cluster of CP + 32 threads in SPMD mode

How can 32-thread SMP work?!

Same as Cell, skip cache coherency HW

Dilemma: shared mem. but not coherent!

Pthreads programming becomes treacherous

Solution: AutoPilot for message-passing

Hides complex, easy-to-abuse low-level I/O
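Why is pthreads-style code treacherous without cache coherency? A classic flag-passing pattern, which on a coherent SMP can be made correct with proper synchronization, can silently read stale data when each core keeps its own non-coherent view of memory. A deliberately broken, simplified C illustration of the pattern (not real IMAPCAR2 code):

```c
#include <pthread.h>
#include <stdio.h>

int data  = 0;   /* payload                 */
int ready = 0;   /* flag: "data is valid"   */

/* Producer: writes the data, then raises the flag. */
void *producer(void *arg)
{
    data  = 42;
    ready = 1;            /* on non-coherent HW this store may never
                             reach the other core's copy of memory   */
    return NULL;
}

/* Consumer: waits for the flag, then uses the data. */
void *consumer(void *arg)
{
    while (!ready)        /* may spin forever on stale 'ready'...    */
        ;
    printf("%d\n", data); /* ...or print a stale 0 instead of 42     */
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```

Even on coherent hardware this code needs atomics or barriers to be correct; without coherence, no amount of such fixes helps, which is why explicit message passing (AutoPilot) is the safer model.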

Summarizing so far

Trend of attached co-processors

(1) GPU attached to conventional CPU

(2) CPU + heterogeneous cores on chip

Both feature explicit data movement

How do they do it?

KEY: skip the coherent shared memory!

Saves on digital logic space for more cores

Reduces overhead, improves mem. bandwidth

Downside: programming headache


Worth mentioning: SHARCNET clusters with accelerators

Tried each of above technologies

Cell BEs + conventional computers → Gone!

Cluster with a shared GPU on pairs of nodes

Cluster with a bank of FPGAs → Gone?

Cluster with 2 Intel Xeon Phi on one node

All lightly used!

Difficult to learn and program

Available for “exploring outside course” projects


Road map for chap 10

Topics

Decent introduction to GPU programming

Overview of HW, programmer model, CUDA

NVIDIA training video

OpenACC: new OpenMP-like alternative

Overview of Intel Xeon Phi

Intended to be “GPGPU killer” and win back market share

Best feature: easy to program in popular HLL (like “normal” CPU)


Last handful of “trendy” topics: read for yourself

Grid computing

Internet pulls together a virtual parallel computing utility you can buy time on, like a far-flung distributed parallel cluster

More challenging to utilize than conventional HPC cluster!

Lambda latencies vary by orders of magnitude depending on whether nodes are co-located or remote

Can say Grid evolved into Cloud


More trends

Transactional memory

Getting memory HW to do the work normally associated with SW locks

Locks become bottlenecks in par. programs

Similar to Peril-L full/empty variables!
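GCC offers an experimental transactional-memory syntax (compile with -fgnu-tm) that conveys the idea: the TM runtime or hardware, not an explicit lock, keeps the update atomic. A minimal sketch, assuming that GCC extension is available:

```c
#include <stdio.h>

static long balance = 0;

/* Atomic update without declaring or acquiring any lock: the TM
 * support detects conflicting accesses and re-executes the
 * transaction if necessary. */
void deposit(long amount)
{
    __transaction_atomic {
        balance += amount;
    }
}

int main(void)
{
    deposit(100);
    printf("balance = %ld\n", balance);
    return 0;
}
```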

MapReduce developed by Google

“A tool for searching huge data archives”


“Go” language

Publicized by Google in 2009

“Optimized for multicore processors, with many features automatically taking advantage of multithreading.”

“Go promotes writing systems and servers as sets of lightweight communicating processes.”

Inspired by CSP & pi-calculus, like Pilot

Turns out there was a pre-existing “Go!” language, and its author objected!
