Parallel Programming Course: Many-Core Processors, Software Patterns, and OpenCL (transcript)
-
11
Many core processors: Opportunities and challenges
(or an old supercomputing hacker rambles on about parallel computing)
Tim Mattson
Intel Labs
-
Disclosure
The views expressed in this talk are those of the speaker and not his employer.
I am in a research group and know nothing about Intel products, so anything I say about them is highly suspect; I might even lie!
This was a team effort, but if I say anything really stupid, it's all my fault; don't blame my collaborators.
-
Sources for the slides
Slides from my work at Intel.
Slides I created with UC Berkeley ParLab colleagues. Most of these come from courses I taught with Prof. Kurt Keutzer.
Slides I developed with members of the Khronos compute group (OpenCL).
-
Agenda
Our hardware future
Rising to the many-core software challenge
OpenCL: a common HW abstraction layer
-
Parallel hardware trends
[Figure: Top 500 total number of processors, 1993-2000; y-axis from 0 to 140,000.]
The early years: SIMD MPPs fade and clusters take over.
Source: the June lists from www.top500.org
-
Parallel hardware trends
[Figure: Top 500 total number of processors, 1993-2009; y-axis from 0 to 4,500,000.]
A disruptive technology trend!
Source: the June lists from www.top500.org
-
7
The era of many-core processors has arrived
[Figure: example many-core chips.] IBM Cell (1 CPU + 6 cores), NVIDIA Tesla C1060 (240 cores), ATI RV770 (160 cores), Intel SCC research chip (48 cores), and Intel's 80-core terascale research chip.
3rd party names are the property of their owners. Source: SC09 OpenCL tutorial
-
8
Why many core? It's all about power.
[Figure: power vs. scalar performance, both normalized to the i486 process technology. Data points: i486, Pentium, Pentium Pro, Pentium M, Pentium 4 (Wmt), Pentium 4 (Psc). Fit: power = perf^1.74.]
Growth in power is unsustainable.
Source: Ed Grochowski, Intel
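A quick worked consequence of that exponent (my arithmetic, not on the original slide): with $\text{power} \propto \text{perf}^{1.74}$, doubling scalar performance costs $2^{1.74} \approx 3.3\times$ the power, whereas adding a second core (next slides) aims for roughly $2\times$ the throughput at about $2\times$ the power or less.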
-
9
Design with power in mind
[Figure: the same power vs. scalar performance chart (power = perf^1.74), highlighting that mobile CPUs such as the Pentium M, with shallow pipelines, use far less power than the 31-stage Pentium 4 (Wmt/Psc) pipelines.]
-
10
Using multiple cores to reduce power
[Figure: one processor running at frequency f vs. two processors each running at f/2, producing the same output from the same input.]
Single core: capacitance C, voltage V, frequency f, power = C V^2 f.
Dual core at half frequency: capacitance 2.2C, voltage 0.6V, frequency 0.5f, power = 0.396 C V^2 f.
Chandrakasan, A.P.; Potkonjak, M.; Mehra, R.; Rabaey, J.; Brodersen, R.W., "Optimizing power using transformations," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,, vol.14, no.1, pp.12-31, Jan 1995
Source: K. Keutzer of UCB
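The 0.396 factor follows directly from the dynamic-power formula (worked out here for clarity; the numbers are the slide's):
$$P = C V^2 f \;\Rightarrow\; P_{dual} = (2.2C)(0.6V)^2(0.5f) = 2.2 \times 0.36 \times 0.5 \; C V^2 f = 0.396\, C V^2 f$$
So the dual-core design delivers comparable throughput at roughly 40% of the single-core power, because power depends quadratically on voltage but only linearly on capacitance and frequency.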
-
11
Specialized silicon is more power efficient
Peak single-precision GFLOPS per Watt at the peak thermal design point; all processors manufactured in a 65 nm process technology.
[Figure: GFLOPS/Watt, scale 0-16.] NVIDIA GTX 280 (236 W), Intel 80-core terascale research processor (97 W), Intel Core 2 Quad processor Q6700 (95 W).
Sources: vendor published data sheets and the 80-core paper from SC08 by Mattson, van der Wijngaart, and Frumkin.
-
12
The future of HW
[Figure: an SOC combining CPUs, a GPU, a graphics/memory controller, an I/O hub, and DRAM.]
GMCH = graphics memory control hub, ICH = input/output control hub, SOC = system on a chip
Moore's law marches on: more and more transistors on a single die (Intel's 22 nm fab, at a cost of 6 to 8 billion dollars, will open in 2013).
Heterogeneous, many-core SOCs will be the standard building block of computing.
How will we use those transistors?
Many heterogeneous cores.
GPU, CPU, I/O hubs, etc., all on one SOC die.
-
13
How will the cores be connected?
[An Intel exec's slide from IDF 2006.]
-
14
Challenging the sacred cows
[Figure: streamlined IA cores optimized for multithreading, each with a local cache, sharing a cache.] This assumes a cache-coherent shared address space!
Is that the right choice?
Most expert programmers do not fully understand the relaxed-consistency memory models required to make cache-coherent architectures work.
The only programming models proven to scale non-trivial apps to 100s to 1000s of cores are all based on distributed memory.
Coherence incurs additional architectural overhead.
-
15
The Coherency Wall
As you scale the number of cores in a cache-coherent (CC) system, the cost in time and memory grows to a point beyond which additional cores are not useful in a single parallel program. This is the coherency wall.
[Figure: cost (time and/or memory) vs. number of cores. CC: O(N) to O(N^2); 2D mesh: O(4) to O(N^{1/2}).]
For a scalable, directory-based scheme, CC incurs an N-body effect: cost scales at best linearly (fixed memory size as cores are added) and at worst quadratically (memory grows linearly with the number of cores).
HW distributed memory: cost scales at best as a fixed cost for the local neighborhood and at worst as the diameter of the network.
"... each directory entry will be 128 bytes long for a 1024 core processor supporting fully-mapped directory-based cache coherence. This may often be larger than the size of the cacheline that a directory entry is expected to track."*
Assumes an app whose performance is not bound by memory bandwidth.
* Rakesh Kumar, Timothy G. Mattson, Gilles Pokam, and Rob van der Wijngaart, "The case for Message Passing on Many-core Chips," University of Illinois Urbana-Champaign Technical Report UILU-ENG-10-2203 (CRHC 10-01), 2010.
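Where the 128 bytes comes from (a worked check of the quoted claim): a fully-mapped directory keeps one presence bit per core for every tracked cache line, so
$$1024 \text{ cores} \times 1 \text{ bit/core} = 1024 \text{ bits} = 128 \text{ bytes per directory entry},$$
which is already larger than a typical 64-byte cache line before any state or tag bits are added.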
-
16
Isn't shared-memory programming easier? Not necessarily.
[Figure: effort vs. time for the two approaches.]
Message passing: extra work up front, but easier optimization and debugging means less overall time to solution.
Multi-threading: the initial parallelization can be quite easy, but difficult debugging and optimization means the overall project takes longer.
Proving that a shared-address-space program using semaphores is race free is an NP-complete problem.*
* P. N. Klein, H. Lu, and R. H. B. Netzer, "Detecting Race Conditions in Parallel Programs that Use Semaphores," Algorithmica, vol. 35, pp. 321-345, 2003.
-
The many-core challenge
Parallel hardware is ubiquitous; parallel software is rare.
Result: a fundamental and dangerous mismatch.
Our challenge: make parallel software as routine as our parallel hardware.
We have arrived at many-core solutions not because of the success of our parallel software but because of our failure to keep increasing CPU frequency.*
* Tim Mattson, a famous-software-researcher wannabe
-
Agenda
Our hardware future
Rising to the many-core software challenge
OpenCL: a common HW abstraction layer
-
Agenda
Our hardware future
Rising to the many-core software challenge
PLPP: the original parallel pattern language
OPL: patterns for engineering parallel software
Example
OpenCL: a common HW abstraction layer
-
20
How about automatic parallelization?
[Figure: speedup (%) on SPECint 2000 benchmarks (bzip2, crafty, gap, gcc, gzip, mcf, parser, twolf, vortex, vpr, and the average), scale 0-30%, comparing basic speculative multithreading, software value prediction, and enabling optimizations.]
"A Cost-Driven Compilation Framework for Speculative Parallelization of Sequential Programs," Zhao-Hui Du, Chu-Cheow Lim, Xiao-Feng Li, Chen Yang, Qingyu Zhao, Tin-Fook Ngai (Intel Corporation), PLDI 2004.
Results are for a simulated dual-core platform configured as a main core plus a core for speculative execution.
Aggressive techniques such as speculative multithreading help, but they are not enough.
The average SPECint speedup of 8% will climb to an average of 15% once their system is fully enabled.
There are no indications that auto-parallelization will radically improve any time soon.
Hence, I do not believe auto-parallelization will solve our problems.
-
21
Parallel Software
Our only hope is to get programmers to write parallel software by hand.
But after 25+ years of research, we are no closer to solving the parallel programming problem.
Only a tiny fraction of programmers write parallel code.
Will the "if you build it, they will come" principle apply?
Many hope so, but...
that implies that people didn't really try hard enough over the last 25 years. Does that really make sense?
-
Solution: Find A Good parallel programming model, right?
ABCPL
ACE
ACT++
Active messages
Adl
Adsmith
ADDAP
AFAPI
ALWAN
AM
AMDC
AppLeS
Amoeba
ARTS
Athapascan-0b
Aurora
Automap
bb_threads
Blaze
BSP
BlockComm
C*.
"C* in C
C**
CarlOS
Cashmere
C4
CC++
Chu
Charlotte
Charm
Charm++
Cid
Cilk
CM-Fortran
Converse
Code
COOL
CORRELATE
CPS
CRL
CSP
Cthreads
CUMULVS
DAGGER
DAPPLE
Data Parallel C
DC++
DCE++
DDD
DICE.
DIPC
DOLIB
DOME
DOSMOS.
DRL
DSM-Threads
Ease .
ECO
Eiffel
Eilean
Emerald
EPL
Excalibur
Express
Falcon
Filaments
FM
FLASH
The FORCE
Fork
Fortran-M
FX
GA
GAMMA
Glenda
GLU
GUARD
HAsL.
Haskell
HPC++
JAVAR.
HORUS
HPC
IMPACT
ISIS.
JAVAR
JADE
Java RMI
javaPG
JavaSpace
JIDL
Joyce
Khoros
Karma
KOAN/Fortran-S
LAM
Lilac
Linda
JADA
WWWinda
ISETL-Linda
ParLin
Eilean
P4-Linda
POSYBL
Objective-Linda
LiPS
Locust
Lparx
Lucid
Maisie
Manifold
Mentat
Legion
Meta Chaos
Midway
Millipede
CparPar
Mirage
MpC
MOSIX
Modula-P
Modula-2*
Multipol
MPI
MPC++
Munin
Nano-Threads
NESL
NetClasses++
Nexus
Nimrod
NOW
Objective Linda
Occam
Omega
OpenMP
Orca
OOF90
P++
P3L
Pablo
PADE
PADRE
Panda
Papers
AFAPI.
Para++
Paradigm
Parafrase2
Paralation
Parallel-C++
Parallaxis
ParC
ParLib++
ParLin
Parmacs
Parti
pC
PCN
PCP:
PH
PEACE
PCU
PET
PENNY
Phosphorus
POET.
Polaris
POOMA
POOL-T
PRESTO
P-RIO
Prospero
Proteus
QPC++
PVM
PSI
PSDM
Quake
Quark
Quick Threads
Sage++
SCANDAL
SAM
pC++
SCHEDULE
SciTL
SDDA.
SHMEM
SIMPLE
Sina
SISAL.
distributed smalltalk
SMI.
SONiC
Split-C.
SR
Sthreads
Strand.
SUIF.
Synergy
Telegrphos
SuperPascal
TCGMSG.
Threads.h++.
TreadMarks
TRAPPER
uC++
UNITY
UC
V
ViC*
Visifold V-NUS
VPE
Win32 threads
WinPar
XENOOPS
XPC
Zounds
ZPL
Third party names are the property of their owners.
Models from the golden age of parallel programming (~1995)
-
The only thing sillier than creating too many models is using too many
[The same list of parallel programming models as on the previous slide; in the original figure, the models the speaker has personally worked with are highlighted.]
Third party names are the property of their owners.
Programming models I've worked with.
-
Choice overload: too many options can hurt you
The Draeger grocery store experiment on consumer choice:
Two jam displays with coupons for a purchase discount: one with 24 different jams, one with 6 different jams.
How many shoppers stopped to try samples at the display? 60% at the 24-jam display, 40% at the 6-jam display.
Of those who tried, how many bought jam? 3% at the 24-jam display, 30% at the 6-jam display.
"The findings from this study show that an extensive array of options can at first seem highly appealing to consumers, yet can reduce their subsequent motivation to purchase the product."
Iyengar, Sheena S., & Lepper, Mark (2000). When choice is demotivating: Can one desire too much of a good thing? Journal of Personality and Social Psychology, 76, 995-1006.
Programmers don't need a glut of options; just give us something that works OK on every platform we care about. Give us a decent standard and we'll do the rest.
-
My optimistic view from 2005
We've learned our lesson: we emphasize a small number of industry standards.
-
But we didn't learn our lesson. History is repeating itself!
Third party names are the property of their owners.
A small sampling of models from the NEW golden age of parallel programming (2010)
Cilk++
CnC
Ct
MYO
RCCE
OpenCL
TBB
Chapel
Charm++
ConcRT
CUDA
Erlang
F#
Fortress
Go
Hadoop
mpc
UPC
PPL
X10
PLINQ
We've lost our way and have slipped back into the "just create a new language" mentality.
-
27
If language obsession is not the solution, what is?
Consider an early (and successful) adopter of many-core technologies: the gaming industry.
Game development, gross generalizations:
Time and money: 1-3 years for 1-20 million dollars.
A blockbuster game has revenues that rival those of a major Hollywood movie.
Major games take teams of 50 to 100 people.
Only a quarter of those people are programmers.
-
28
Game Development
The key: enforce a separation of concerns. A small number of technology experts build the game engine and frameworks; the much larger group of game designers and domain programmers build the game on top of them.
-
What's happening at Microsoft?
[Figure: Microsoft's concurrency stack.] A great infrastructure for building composable software frameworks!
Source: "An Insider's View to Concurrency at Microsoft," Steven Toub, Univ. of Utah, Sept. 2010
-
30
Modular software, frameworks, and parallel computing
As parallelism goes mainstream and reaches extreme levels of scalability, we can't retrain all the world's programmers to handle disruptive scalable hardware.
Modular software development techniques and frameworks will save the day.
Framework: a software platform providing partial solutions to a class of problems that can be specialized to solve a specific problem.
This is not a new idea: Cactus, the Common Component Architecture, SIERRA, and many others; we just need to push it further and deeper.
We need a systematic way to build a useful collection of frameworks for scalable computing.
3rd party names are the property of their owners.
-
31
Systematic framework design with patterns
A pattern is a well-known solution to a recurring problem; it captures expert knowledge in a form that can be peer-reviewed, refined, and passed on to others.
An architecture defines the organization and structure of a software system; it can be defined by a hierarchical composition of patterns.
A framework is a software environment based on an architecture: a partial solution for a domain of problems that can be specialized to solve specific problems.
Patterns → Architectures → Frameworks
-
Agenda
Our hardware future
Rising to the many-core software challenge
PLPP: the original parallel pattern language
OPL: patterns for engineering parallel software
Example
OpenCL: a common HW abstraction layer
-
33
A Pattern Language for Parallel Programming (PLPP)
A pattern language for parallel algorithm design, with examples in MPI, OpenMP, and Java.
This is our hypothesis for how programmers think about parallel programming.
I urge you all to tear this apart, correct the errors, add patterns where they are missing, and help make this better.
-
34
PLPP's structure: four design spaces in parallel software development
[Figure: Original Problem → (Decomposition) → tasks, shared and local data → (Implementation & building blocks) → corresponding source code.]

Program SPMD_Emb_Par ()
{  TYPE *tmp, *func();
   global_array Data(TYPE);
   global_array Res(TYPE);
   int N = get_num_procs();
   int id = get_proc_id();
   if (id==0) setup_problem(N, Data);
   for (int I= 0; I<N; I++){      /* pseudocode: each UE works through its share of the tasks */
      tmp = func(I);
      Res.accumulate(tmp);
   }
}
-
35
Decomposition (Finding Concurrency)
Start with a specification that solves the original problem; finish with the problem decomposed into tasks, shared data, and a partial ordering.
[Figure: Start → Decomposition Analysis (Task Decomposition, Data Decomposition) → Dependency Analysis (Group Tasks, Order Groups, Data Sharing) → Design Evaluation.]
-
36
Algorithm Strategy (Algorithm Structure)
[Decision tree. Legend: decision points lead to design patterns.]
Start: how is the concurrency organized?
Organize by tasks: linear? → Task Parallelism; recursive? → Divide and Conquer.
Organize by data: linear? → Geometric Decomposition; recursive? → Recursive Data.
Organize by flow of data: regular? → Pipeline; irregular? → Event-Based Coordination.
-
37
Implementation strategy (Supporting Structures)
High-level constructs impacting the large-scale organization of the source code.
Program structure: SPMD, Master/Worker, Loop Parallelism, Fork/Join.
Data structures: Shared Data, Shared Queue, Distributed Array.
-
38
Parallel programming building blocks
Low-level constructs implementing specific mechanisms used in parallel computing. Examples in Java, OpenMP, and MPI.
These are not properly design patterns, but they are included to make the pattern language self-contained.
UE management (UE = unit of execution): process control, thread control.
Synchronization: memory sync/fences, barriers, mutual exclusion.
Communications: message passing, collective communication, other communication.
-
39
A simple example: the PI program (numerical integration)
Mathematically, we know that:
$$\int_0^1 \frac{4.0}{1+x^2}\,dx = \pi$$
We can approximate the integral as a sum of rectangles:
$$\pi \approx \sum_{i=0}^{N} F(x_i)\,\Delta x$$
where each rectangle has width $\Delta x$ and height $F(x_i)$ at the middle of interval $i$.
[Figure: the curve $F(x) = 4.0/(1+x^2)$ on $[0,1]$, y-axis from 0.0 to 4.0, approximated by rectangles.]
-
40
PI Program: the sequential program

static long num_steps = 100000;
double step;
void main ()
{  int i;  double x, pi, sum = 0.0;
   step = 1.0/(double) num_steps;
   for (i=1; i<= num_steps; i++){
      x = (i-0.5)*step;
      sum = sum + 4.0/(1.0+x*x);
   }
   pi = step * sum;
}
-
41
OpenMP PI Program: loop-level parallelism pattern

#include <omp.h>
static long num_steps = 100000; double step;
#define NUM_THREADS 2
void main ()
{  int i;  double x, pi, sum = 0.0;
   step = 1.0/(double) num_steps;
   omp_set_num_threads(NUM_THREADS);
   #pragma omp parallel for private(x) reduction(+:sum)
   for (i=0; i< num_steps; i++){
      x = (i+0.5)*step;
      sum += 4.0/(1.0+x*x);
   }
   pi = step * sum;
}
Loop-level parallelism: parallelism expressed solely by (1) exposing concurrency, (2) managing dependencies, and (3) splitting up loops.
-
42
MPI Pi program: SPMD pattern
SPMD programs: each process runs the same code, with the process ID (rank) selecting process-specific behavior.

#include <mpi.h>
static long num_steps = 100000;
void main (int argc, char *argv[])
{  int i, id, numprocs, step1, stepN;
   double x, pi, step, sum = 0.0;
   step = 1.0/(double) num_steps;
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &id);
   MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
   step1 = id * num_steps/numprocs;
   stepN = (id+1) * num_steps/numprocs;
   if (stepN > num_steps) stepN = num_steps;
   for (i=step1; i<stepN; i++){
      x = (i+0.5)*step;
      sum += 4.0/(1.0+x*x);
   }
   sum *= step;
   MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
   MPI_Finalize();
}
-
43
#include <windows.h>
#define NUM_THREADS 2
HANDLE thread_handles[NUM_THREADS];
CRITICAL_SECTION hUpdateMutex;
static long num_steps = 100000;
double step;
double global_sum = 0.0;

void Pi (void *arg)
{
   int i, start;
   double x, sum = 0.0;
   start = *(int *) arg;
   step = 1.0/(double) num_steps;
   for (i=start; i<num_steps; i=i+NUM_THREADS){
      x = (i+0.5)*step;
      sum += 4.0/(1.0+x*x);
   }
   EnterCriticalSection(&hUpdateMutex);
   global_sum += sum;
   LeaveCriticalSection(&hUpdateMutex);
}
-
44
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E. Rasmussen
PI Program: Cilk (divide and conquer implemented with fork-join)

static long num_steps = 1073741824;   // I'm lazy: make it a power of 2
double step = 1.0/(double) num_steps;
cilk double pi_comp(int istep, int nstep){
   double x, sum = 0.0, sum1, sum2;
   if (nstep < MIN_SIZE)                 // base case: serial sum over this chunk
      for (int i=istep; i<istep+nstep; i++){ x = (i+0.5)*step; sum += 4.0/(1.0+x*x); }
   else {                                // recursive case: fork the two halves, then join
      sum1 = spawn pi_comp(istep, nstep/2);
      sum2 = spawn pi_comp(istep+nstep/2, nstep/2);
      sync;
      sum = sum1 + sum2;
   }
   return sum;
}
-
PLPP limitations
We tried to make PLPP a language for general-purpose programming, but it reflects the scientific computing background of its authors:
It focuses on PDEs, solvers, N-body, and other scientific problems.
It assumes a monolithic SW architecture (i.e., no notion of architecture: just build an algorithm and make it fast).
To really solve the parallel programming problem, we need to address the full breadth of the software engineering problem.
PLPP is a pattern language for implementing parallel algorithms.
We need a pattern language for engineering parallel applications.
-
Agenda
Our hardware future
Rising to the many-core software challenge
PLPP: the original parallel pattern language
OPL: patterns for engineering parallel software
Example
OpenCL: a common HW abstraction layer
-
47
Working with Prof. Kurt Keutzer and his group at UC Berkeley, we've come up with a new and more expansive pattern language.
[Figure: PLPP, the Pattern Language of Parallel Programming, plus the Berkeley "13 dwarfs".]
-
48
The OPL/PLPP 2010 pattern language (the layers of the figure, top to bottom):
Applications
Structural Patterns: Pipe-and-Filter, Agent-and-Repository, Process-Control, Event-Based/Implicit-Invocation, Puppeteer, Model-View-Controller, Iterative-Refinement, Map-Reduce, Layered-Systems, Arbitrary-Static-Task-Graph
Computational Patterns: Graph-Algorithms, Dynamic-Programming, Dense-Linear-Algebra, Sparse-Linear-Algebra, Unstructured-Grids, Structured-Grids, Graphical-Models, Finite-State-Machines, Backtrack-Branch-and-Bound, N-Body-Methods, Circuits, Spectral-Methods, Monte-Carlo
Concurrent Algorithm Strategy Patterns: Task-Parallelism, Divide-and-Conquer, Data-Parallelism, Pipeline, Discrete-Event, Geometric-Decomposition, Speculation
Implementation Strategy Patterns: program structure (SPMD, Data-Par/index-space, Fork/Join, Actors, Loop-Par., Task-Queue, Task-Graph); data structure (Distributed-Array, Shared-Data, Shared-Queue, Shared-Map, Partitioned-Graph)
Parallel Execution Patterns: MIMD, SIMD, Thread-Pool, Transactions, Message-Passing, Collective-Comm., Point-To-Point-Sync. (mutual exclusion), Collective-Sync. (barrier), Memory sync/fence, Transactional memory
Concurrency Foundation constructs (not expressed as patterns): thread creation/destruction, process creation/destruction
-
49
[The same OPL/PLPP 2010 diagram, annotated with its sources: the Computational Patterns come from the Berkeley View "13 dwarfs," and the Structural Patterns draw on Garlan and Shaw's architectural styles.]
http://parlab.eecs.berkeley.edu/wiki/patterns/patterns
-
50
[The OPL/PLPP 2010 diagram repeated. http://parlab.eecs.berkeley.edu/wiki/patterns/patterns]
-
51
Identify the SW structure: structural patterns
Pipe-and-Filter, Agent-and-Repository, Event-based coordination, Iterative refinement, MapReduce, Process Control, Layered Systems
These define the structure of our software, but they do not describe what is computed.
-
52
Elements of a structural pattern
Components are where the computation happens.
Connectors are where the communication happens.
A configuration is a graph of components (vertices) and connectors (edges).
A structural pattern may be described as a family of graphs.
-
53
Pattern 1: Pipe and Filter
[Figure: a graph of filters (Filter 1 through Filter 7) connected by pipes.]
Filters embody computation; they only see their inputs and produce outputs.
Pipes embody communication.
The graph may have feedback.
-
54
Examples of pipe and filter
Almost every large software program has a pipe-and-filter structure at the highest level.
[Examples pictured: a compiler, an image retrieval system, a logic optimizer.]
-
55
Pattern 2: Iterative Refinement
[Figure: initialization condition → a variety of functions performed asynchronously → synchronize the results of the iteration → exit condition met? If no, iterate again; if yes, exit.]
-
56
Example of the Iterative Refinement pattern: training a classifier (SVM training)
[Figure: the iterative-refinement structural pattern instantiated for SVM training: each iteration identifies outliers and updates the decision surface, repeating until all points are within acceptable error.]
-
57
Pattern 3: MapReduce
To us, it means:
A map stage, where data is mapped onto independent computations.
A reduce stage, where the results of the map stage are summarized (i.e., reduced).
[Figure: two map stages feeding two reduce stages.]
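To make the two stages concrete, here is a minimal, self-contained sketch of the pattern in C with OpenMP (my illustration, not from the slides; it assumes an OpenMP 3.1 compiler for the max reduction). The map step scores each element independently and the reduce step keeps the maximum score.

#include <stdio.h>
#define N 1000000

/* map: an independent computation per element (a toy scoring function) */
static double score(int i) { double x = (double)i / N; return x * (1.0 - x); }

int main(void)
{
    double best = 0.0;
    int i;
    /* map each index to a score, then reduce with "max" to summarize the results */
    #pragma omp parallel for reduction(max:best)
    for (i = 0; i < N; i++) {
        double s = score(i);        /* map stage */
        if (s > best) best = s;     /* reduce stage (max) */
    }
    printf("best score = %f\n", best);
    return 0;
}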
-
58
Examples of MapReduce
General structure: map a computation across distributed data sets; reduce the results to find the best/worst or the maxima/minima.
Speech recognition: map the HMM computation to evaluate word matches; reduce to find the most likely word sequences.
Support-vector machines (ML): map to evaluate each point's distance from the frontier; reduce to find the greatest outlier from the frontier.
-
59
[The OPL/PLPP 2010 diagram repeated, as a lead-in to the computational patterns. http://parlab.eecs.berkeley.edu/wiki/patterns/patterns]
-
60
Identify the key computations: computational patterns
Computational patterns describe the key computations, but not how they are implemented.
-
61
Computational pattern example: the N-body problem
Consider a collection of particles that interact through a pair-wise force.
N-body problems show up in a wide range of applications:
Astrophysics and celestial mechanics
Plasma simulation
Molecular dynamics
Electron-beam lithography device simulation
Fluid dynamics (vortex methods)
Game physics (cloth simulation, blood spatter, etc.)
Graph partitioning
Some elliptic PDE solvers
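The computational heart of the pattern is the all-pairs force accumulation; a minimal C sketch (my illustration, using a made-up attractive inverse-square force in one dimension, not any particular application's physics) looks like this:

#include <stdio.h>
#define N 4

int main(void)
{
    double pos[N]   = {0.0, 1.0, 2.5, 4.0};   /* particle positions (toy data) */
    double force[N] = {0.0};
    int i, j;
    /* all-pairs interaction: O(N^2) work, each pair contributes to both particles */
    for (i = 0; i < N; i++) {
        for (j = i + 1; j < N; j++) {
            double r = pos[j] - pos[i];        /* positions are sorted, so r > 0 */
            double f = 1.0 / (r * r);          /* toy attractive force magnitude */
            force[i] += f;                     /* i is pulled toward j            */
            force[j] -= f;                     /* equal and opposite on j         */
        }
    }
    for (i = 0; i < N; i++) printf("force[%d] = %f\n", i, force[i]);
    return 0;
}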
-
62
Computational pattern example: spectral methods
A class of techniques for solving certain partial differential equations, such as Poisson's equation:
$$\left(\frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2}\right) f(x,y) = g(x,y)$$
The idea is to:
Apply a discrete transform to the PDE, turning differential operators into algebraic operators.
Solve the resulting system of algebraic or ordinary differential equations.
Inverse-transform the solution to return to the original domain.
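For example (a standard worked step, assuming periodic boundary conditions so a Fourier transform applies): in Fourier space each derivative becomes multiplication by a wavenumber, so Poisson's equation collapses to an algebraic relation per mode,
$$-(k_x^2 + k_y^2)\,\hat f(k_x,k_y) = \hat g(k_x,k_y) \quad\Longrightarrow\quad \hat f(k_x,k_y) = -\frac{\hat g(k_x,k_y)}{k_x^2 + k_y^2}, \qquad (k_x,k_y) \neq (0,0),$$
after which an inverse transform recovers $f(x,y)$.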
-
Computational pattern example: graphical models
The problem is cast as a network of random variables, where edges represent (potential) dependences.
Generally, we have observed variables (shaded) and hidden variables.
We need to reason about the hidden variables given our set of observations.
The graph can be directed or undirected.
[Figure: a Hidden Markov Model.]
-
64
Architecting parallel software
Identify the software structure (structural patterns): Pipe-and-Filter, Agent-and-Repository, Event-based, Bulk Synchronous, MapReduce, Layered Systems, Arbitrary Task Graphs.
Identify the key computations (computational patterns): Graph Algorithms, Dynamic Programming, Dense/Sparse Linear Algebra, (Un)Structured Grids, Graphical Models, Finite State Machines, Backtrack Branch-and-Bound, N-Body Methods, Circuits, Spectral Methods.
Then: decompose tasks and data, order tasks, and identify data sharing and access.
An architecture is a composition of design patterns.
-
Agenda
Our hardware future
Rising to the many-core software challenge
PLPP: the original parallel pattern language
OPL: patterns for engineering parallel software
Example
OpenCL: a common HW abstraction layer
-
Patterns/framework example: LVCSR software architecture
LVCSR = large vocabulary continuous speech recognition.
[Figure: voice input → speech feature extractor → speech features → inference engine → word sequence ("I think therefore I am"). The inference engine consumes a recognition network (acoustic model, pronunciation model, language model) and runs beam-search iterations with active-state computation steps.]
Pattern annotations on the figure: Pipe-and-Filter (overall structure), Iterative Refinement (beam-search iterations), Graphical Model and Dynamic Programming (the recognition network and inference), MapReduce and Task Graph (the active-state computation steps).
-
67
A speech framework with extension points
[Figure: a WFST-based inference engine framework split across CPU and GPU. Read files and initialize data structures; then, per iteration: Phase 1, compute observation probabilities; Phase 2, for each active arc compute the arc transition probability; copy results back to the CPU; collect backtrack info; prepare the active set; iteration control. Finally: backtrack and output results.]
Extension points: file input format, pruning strategy, observation probability computation, backtrack, result output format, static data structure optimizations, dynamic data structure optimizations.
-
68
A speech framework with extension points (with example plug-ins)
[The same WFST-based inference engine framework, showing concrete options at each extension point:]
File input format: HTK format, SRI format.
Pruning strategy: fixed beam width, adaptive beam width.
Observation probability computation: HMM SRI GPU ObsProb, HMM WSJ GPU ObsProb, CHMM SRI GPU ObsProb.
Backtrack: backtrack, backtrack + confidence metric.
Result output format: HResult format, SRI scoring format.
Data structure options: two-level WFST network, one-level WFST network; preload all, selective preload.
-
Patterns/framework example: the programmer's view
Jike Chong, Ekaterina Gonina, Kisun You, Kurt Keutzer, "Exploring Recognition Network Representations for Efficient Speech Inference on Highly Parallel Platforms," submitted to Interspeech 2010.
Dorothea Kolossa, Jike Chong, Steffen Zeiler, Kurt Keutzer, "Efficient Manycore CHMM Speech Recognition for Audiovisual and Multistream Data," submitted to Interspeech 2010.
-
70
Are frameworks enough? We also need:
Efficiency, to support the extremely low overheads required by Amdahl's law: collapse layers of abstraction; dynamic optimization.
A stable, richly supported, and highly portable programming environment, to make framework development practical.
A hardware abstraction layer for heterogeneous platforms: it creates the economic justification for the effort, since one framework then supports many platforms.
The process: analyze a key application domain; discover common paths through our pattern language; these paths define the framework design; implement the framework.
-
Agenda
Our hardware future
Rising to the many-core software challenge
OpenCL: a common HW abstraction layer
-
How do we program the heterogeneous platform? Let history be our guide: consider the origins of OpenMP.
[Figure: SGI and Cray merged and needed commonality across products. KAI, an ISV, needed a larger market. DOE's ASCI program was tired of recoding for SMPs and forced the vendors to standardize; it wrote a rough-draft straw-man SMP API. DEC, IBM, Intel, HP, and other vendors were invited to join. Result: OpenMP, 1997.]
Third party names are the property of their owners.
-
OpenCL: can history repeat itself?
[Figure: AMD and ATI merged and needed commonality across products. Nvidia, a GPU vendor, wants to steal market share from the CPU; Intel, a CPU vendor, wants to steal market share from the GPU. Apple was tired of recoding for many-core CPUs and GPUs, pushed the vendors to standardize, and wrote a rough-draft straw-man API. The Khronos Compute group formed, joined by Ericsson, Sony, Blizzard, Nokia, Freescale, TI, IBM, and many more. Result: OpenCL, Dec 2008.]
As ASCI did for OpenMP, Apple is doing for GPU/CPU programming with OpenCL.
Third party names are the property of their owners.
-
OpenCL Working Group: diverse industry participation
HW vendors (e.g., Apple), system OEMs, middleware vendors, and application developers.
OpenCL became an important standard on release by virtue of the market coverage of the companies behind it.
Third party names are the property of their owners.
-
The BIG idea behind OpenCL
OpenCL execution model: execute a kernel at each point in a problem domain.
E.g., process a 1024 x 1024 image with one kernel invocation per pixel, i.e. 1024 x 1024 = 1,048,576 kernel executions.
Traditional loop-based code:

void
trad_mul(int n,
         const float *a,
         const float *b,
         float *c)
{
  int i;
  for (i=0; i<n; i++)
    c[i] = a[i] * b[i];   /* the loop walks the problem domain explicitly */
}
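For contrast, the data-parallel OpenCL form of the same computation replaces the loop with an index-space lookup; this is essentially the dp_mul kernel that reappears on the summary slide later in the talk:

__kernel void
dp_mul(__global const float *a,
       __global const float *b,
       __global float *c)
{
  int id = get_global_id(0);   /* which point of the problem domain this work-item handles */
  c[id] = a[id] * b[id];
}                              /* execute over "n" work-items */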
-
An N-dimensional domain of work-items
Define an N-dimensional index space that is best for your algorithm:
Global dimensions: 1024 x 1024 (the whole problem space).
Local dimensions: 128 x 128 (a work-group, which executes together).
Synchronization between work-items is possible only within work-groups, using barriers and memory fences.
You cannot synchronize outside of a work-group.
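On the host side, the global and local sizes are passed to clEnqueueNDRangeKernel; a minimal sketch using the slide's numbers (a fragment: cmd_queue and kernel are assumed to exist, and real devices cap the work-group size, so 128 x 128 local dimensions are illustrative only):

size_t global[2] = {1024, 1024};   /* whole problem space                            */
size_t local[2]  = {128, 128};     /* one work-group; must not exceed the device cap */
err = clEnqueueNDRangeKernel(cmd_queue, kernel,
                             2,        /* work_dim: a 2D index space */
                             NULL,     /* no global offset           */
                             global, local,
                             0, NULL, NULL);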
-
To use OpenCL, you must
Define the platform
Execute code on the platform
Move data around in memory
Write (and build) programs
-
OpenCL Platform Model
One Host + one or more Compute Devices
Each Compute Device is composed of one or more Compute Units
Each Compute Unit is further divided into one or more Processing Elements
-
OpenCL Execution Model
An OpenCL application runs on a host, which submits work to the compute devices.
Work-item: the basic unit of work on an OpenCL device.
Kernel: the code for a work-item; basically a C function.
Program: a collection of kernels and other functions (analogous to a dynamic library).
Context: the environment within which work-items execute; it includes devices, their memories, and command queues.
Applications queue kernel execution instances:
Queued in order, one queue per device.
Executed in order or out of order.
[Figure: a context containing a GPU and a CPU, each with its own command queue.]
-
OpenCL Memory Model
Memory management is explicit: you must move data from host -> global -> local and back.
Private memory: per work-item.
Local memory: shared within a work-group.
Global/constant memory: visible to all work-groups on a compute device.
Host memory: on the CPU.
[Figure: host memory on the host; global/constant memory on the compute device; each work-group has its own local memory; each work-item has its own private memory.]
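As a small illustration of that hierarchy (my sketch, not from the slides): a kernel can stage data in __local memory and use a work-group barrier before reusing it. Here each work-group reverses its own block of the input.

__kernel void block_reverse(__global const float *in,
                            __global float *out,
                            __local  float *tmp)      /* local buffer, one per work-group */
{
  int gid = get_global_id(0);
  int lid = get_local_id(0);
  int lsz = get_local_size(0);
  tmp[lid] = in[gid];                       /* stage into fast local memory             */
  barrier(CLK_LOCAL_MEM_FENCE);             /* all work-items in the group must arrive  */
  out[gid] = tmp[lsz - 1 - lid];            /* read another work-item's staged element  */
}

The __local buffer is sized from the host with clSetKernelArg(kernel, 2, local_size*sizeof(float), NULL).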
-
Programming kernels: the OpenCL C Language
A subset of ISO C99
But without some C99 features such as standard C99 headers, function pointers, recursion, variable length arrays, and bit fields
A superset of ISO C99 with additions for:
Work-items and workgroups
Vector types
Synchronization
Address space qualifiers
Also includes a large set of built-in functions for image manipulation, work-item manipulation, specialized math routines, etc.
-
Programming kernels: data types
Scalar data types:
char, uchar, short, ushort, int, uint, long, ulong, float
bool, intptr_t, ptrdiff_t, size_t, uintptr_t, void, half (storage only)
Image types:
image2d_t, image3d_t, sampler_t
Vector data types:
Vector lengths of 2, 4, 8, and 16 (char2, ushort4, int8, float16, double2, ...)
Endian safe; aligned at the vector length; vector operations and built-in functions.

int4 vi0 = (int4) -7;                      // -7 -7 -7 -7
int4 vi1 = (int4)(0, 1, 2, 3);             //  0  1  2  3
vi0.lo = vi1.hi;                           //  2  3 -7 -7
int8 v8 = (int8)(vi0, vi1.s01, vi1.odd);   //  2  3 -7 -7  0  1  1  3

Double is an optional type in OpenCL 1.0.
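A tiny kernel using a vector type, just to show the syntax in context (my sketch, not from the slides): each work-item scales one float4, i.e. four floats at a time.

__kernel void scale4(__global float4 *v, const float s)
{
  int i = get_global_id(0);   /* one float4 (four floats) per work-item      */
  v[i] = v[i] * s;            /* component-wise vector multiply by a scalar  */
}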
-
Building program objects
The program object encapsulates:
A context
The program source or binary
A list of target devices and build options
The build process creates a program object with clCreateProgramWithSource() or clCreateProgramWithBinary(); the program is then compiled for each device (e.g., once for the GPU and once for the CPU).

Example kernel code:
kernel void
horizontal_reflect(read_only image2d_t src,
                   write_only image2d_t dst)
{
  int x = get_global_id(0); // x-coord
  int y = get_global_id(1); // y-coord
  int width = get_image_width(src);
  float4 src_val = read_imagef(src, sampler,
                               (int2)(width-1-x, y));
  write_imagef(dst, (int2)(x, y), src_val);
}
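When a build fails, the compiler diagnostics live on the program object; a common host-side idiom (a sketch using standard OpenCL 1.0 calls; program, devices, and err are assumed from the surrounding host code) is to fetch the build log for the device:

err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
if (err != CL_SUCCESS) {
    char log[4096];
    size_t log_size;
    /* ask the runtime for this device's compiler output */
    clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG,
                          sizeof(log), log, &log_size);
    printf("OpenCL build failed:\n%s\n", log);
}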
-
Vector Addition - Kernel
__kernel void vec_add (__global const float *a,
__global const float *b,
__global float *c)
{
int gid = get_global_id(0);
c[gid] = a[gid] + b[gid];
}
-
Vector Addition: Host Program
// create the OpenCL context on a GPU device
context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

// get the list of GPU devices associated with the context
clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
devices = malloc(cb);
clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

// create a command-queue
cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(cl_float)*n, srcA, NULL);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(cl_float)*n, srcB, NULL);
memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(cl_float)*n, NULL, NULL);

// create the program
program = clCreateProgramWithSource(context, 1, &program_source, NULL, NULL);

// build the program
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

// create the kernel
kernel = clCreateKernel(program, "vec_add", NULL);

// set the kernel argument values
err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &memobjs[0]);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *) &memobjs[1]);
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *) &memobjs[2]);

// set work-item dimensions
global_work_size[0] = n;

// execute the kernel
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);

// read the output array
err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0, n*sizeof(cl_float), dst, 0, NULL, NULL);
-
Vector Addition: Host Program (annotated)
[The same host program as on the previous slide, with its sections labeled:]
1. Define the platform and queues (create the context, get the device list, create a command queue).
2. Define the memory objects (the three buffers).
3. Create the program.
4. Build the program.
5. Create and set up the kernel (create the kernel object, set its arguments).
6. Execute the kernel (set the work-item dimensions, enqueue the NDRange).
7. Read the results back to the host.
It's complicated, but most of this is boilerplate and not as bad as it looks.
-
OpenCL synchronization: queues and events
Events can be used to synchronize kernel executions between queues.
Example: two queues on two devices (a GPU and a CPU).
Without an event dependency, Kernel 2 starts before the results from Kernel 1 are ready.
With an event dependency, Kernel 2 waits for an event from Kernel 1 and does not start until the results are ready.
[Figure: timelines of the two cases.]
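In host code, the dependency is expressed by passing the event returned from the first enqueue in the wait list of the second; a minimal sketch (a fragment assuming cmd_queue_gpu and cmd_queue_cpu were created on the two devices, and kernel1/kernel2 already set up):

cl_event k1_done;

/* Kernel 1 on the GPU queue; ask for an event that marks its completion */
err = clEnqueueNDRangeKernel(cmd_queue_gpu, kernel1, 1, NULL,
                             global_work_size, NULL, 0, NULL, &k1_done);

/* Kernel 2 on the CPU queue; it will not start until k1_done has fired */
err = clEnqueueNDRangeKernel(cmd_queue_cpu, kernel2, 1, NULL,
                             global_work_size, NULL, 1, &k1_done, NULL);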
-
OpenCL summary
[Figure: an OpenCL context spanning a CPU and a GPU compute device, each with in-order and out-of-order command queues. The dp_mul kernel source is compiled into a CPU program binary and a GPU program binary; kernels carry their bound argument values; memory objects (images and buffers) and command queues live in the context.]

__kernel void
dp_mul(global const float *a,
       global const float *b,
       global float *c)
{
  int id = get_global_id(0);
  c[id] = a[id] * b[id];
}
Third party names are the property of their owners.
-
89
Conclusions
Many-core processors mean all software must be parallel.
We can't turn everyone into parallel algorithm experts; we have to support a separation of concerns:
Hardcore experts build programming frameworks, emphasizing efficiency, using industry-standard languages.
Domain-expert programmers assemble applications within a framework, emphasizing productivity.
Design patterns are a tool to help us get this right.
"We believe software architectures can be built up from a manageable number of design patterns. These patterns define the building blocks of all software engineering and are fundamental to the practice of architecting parallel software. Hence, an effort to propose, argue about, and finally agree on what constitutes this set of patterns is the seminal intellectual challenge of our field."
K. Keutzer and T. Mattson, "A design pattern language for engineering (parallel) software," Intel Technology Journal, vol. 13, no. 4, 2010.
-
90
Patterns acknowledgements
Kurt Keutzer (UCB), Ralph Johnson (UIUC), and our community of pattern writers: Hugo Andrade, Chris Batten, Eric Battenberg, Hovig Bayandorian, Dai Bui, Bryan Catanzaro, Jike Chong, Enylton Coelho, Katya Gonina, Yunsup Lee, Mark Murphy, Heidi Pan, Kaushik Ravindran, Sayak Ray, Erich Strohmaier, Bor-yiing Su, Narayanan Sundaram, Guogiang Wang, Youngmin Yi, Jeff Anderson-Lee, Joel Jones, Terry Ligocki, and Sam Williams.
The development of our pattern language has also received a boost from ParLab faculty, particularly Krste Asanovic, Jim Demmel, and David Patterson.
My co-authors, colleagues, and friends who helped write the PLPP patterns book:
Beverly Sanders (University of Florida)
Berna Massingill (Trinity University)