Parallel Programming Course: Many-Core Processors, Software Patterns, and OpenCL (transcript)
-
11
Many core processors: Opportunities and challenges
(or an old supercomputing hacker rambles on about parallel computing)
Tim Mattson
Intel Labs
-
Disclosure
The views expressed in this talk are those of the speaker and not his employer.
I am in a research group and know nothing about Intel products, so anything I say about them is highly suspect; I might even lie!
This was a team effort, but if I say anything really stupid, it's all my fault; don't blame my collaborators.
-
Sources for the slides
Slides from my work at Intel.
Slides I created with UC Berkeley ParLab colleagues. Most of these come from courses I taught with Prof. Kurt Keutzer.
Slides I developed with members of the Khronos compute group (OpenCL).
-
Agenda
Our hardware future
Rising to the many-core software challenge
OpenCL: a common HW abstraction layer
-
Parallel hardware trends
[Figure: Top 500 total number of processors, 1993-2000; y-axis from 0 to 140,000.]
The early years: SIMD MPPs fade and clusters take over.
Source: the June lists from www.top500.org
-
Parallel hardware trends
[Figure: Top 500 total number of processors, 1993-2009; y-axis from 0 to 4,500,000.]
A disruptive technology trend!
Source: the June lists from www.top500.org
-
7
The era of many-core processors has arrived
[Figure: example many-core chips.] IBM Cell (1 CPU + 6 cores), NVIDIA Tesla C1060 (240 cores), ATI RV770 (160 cores), Intel SCC research chip (48 cores), and Intel's 80-core terascale research chip.
3rd party names are the property of their owners. Source: SC09 OpenCL tutorial
-
8
Why many core? It's all about power.
[Figure: power vs. scalar performance, both normalized to the i486 process technology. Data points: i486, Pentium, Pentium Pro, Pentium M, Pentium 4 (Wmt), Pentium 4 (Psc). Fit: power = perf^1.74.]
Growth in power is unsustainable.
Source: Ed Grochowski, Intel
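A quick worked consequence of that exponent (my arithmetic, not on the original slide): with $\text{power} \propto \text{perf}^{1.74}$, doubling scalar performance costs $2^{1.74} \approx 3.3\times$ the power, whereas adding a second core (next slides) aims for roughly $2\times$ the throughput at about $2\times$ the power or less.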
-
9
Design with power in mind
[Figure: the same power vs. scalar performance chart (power = perf^1.74), highlighting that mobile CPUs such as the Pentium M, with shallow pipelines, use far less power than the 31-stage Pentium 4 (Wmt/Psc) pipelines.]
-
10
Using multiple cores to reduce power
[Figure: one processor running at frequency f vs. two processors each running at f/2, producing the same output from the same input.]
Single core: capacitance C, voltage V, frequency f, power = C V^2 f.
Dual core at half frequency: capacitance 2.2C, voltage 0.6V, frequency 0.5f, power = 0.396 C V^2 f.
Chandrakasan, A.P.; Potkonjak, M.; Mehra, R.; Rabaey, J.; Brodersen, R.W., "Optimizing power using transformations," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,, vol.14, no.1, pp.12-31, Jan 1995
Source: K. Keutzer of UCB
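The 0.396 factor follows directly from the dynamic-power formula (worked out here for clarity; the numbers are the slide's):
$$P = C V^2 f \;\Rightarrow\; P_{dual} = (2.2C)(0.6V)^2(0.5f) = 2.2 \times 0.36 \times 0.5 \; C V^2 f = 0.396\, C V^2 f$$
So the dual-core design delivers comparable throughput at roughly 40% of the single-core power, because power depends quadratically on voltage but only linearly on capacitance and frequency.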
-
11
Specialized silicon is more power efficient
Peak single-precision GFLOPS per Watt at the peak thermal design point; all processors manufactured in a 65 nm process technology.
[Figure: GFLOPS/Watt, scale 0-16.] NVIDIA GTX 280 (236 W), Intel 80-core terascale research processor (97 W), Intel Core 2 Quad processor Q6700 (95 W).
Sources: vendor published data sheets and the 80-core paper from SC08 by Mattson, van der Wijngaart, and Frumkin.
-
12
The future of HW
[Figure: an SOC combining CPUs, a GPU, a graphics/memory controller, an I/O hub, and DRAM.]
GMCH = graphics memory control hub, ICH = input/output control hub, SOC = system on a chip
Moore's law marches on: more and more transistors on a single die (Intel's 22 nm fab, at a cost of 6 to 8 billion dollars, will open in 2013).
Heterogeneous, many-core SOCs will be the standard building block of computing.
How will we use those transistors?
Many heterogeneous cores.
GPU, CPU, I/O hubs, etc., all on one SOC die.
-
13
How will the cores be connected?
[An Intel exec's slide from IDF 2006.]
-
14
Challenging the sacred cows
[Figure: streamlined IA cores optimized for multithreading, each with a local cache, sharing a cache.] This assumes a cache-coherent shared address space!
Is that the right choice?
Most expert programmers do not fully understand the relaxed-consistency memory models required to make cache-coherent architectures work.
The only programming models proven to scale non-trivial apps to 100s to 1000s of cores are all based on distributed memory.
Coherence incurs additional architectural overhead.
-
15
The Coherency Wall
As you scale the number of cores in a cache-coherent (CC) system, the cost in time and memory grows to a point beyond which additional cores are not useful in a single parallel program. This is the coherency wall.
[Figure: cost (time and/or memory) vs. number of cores. CC: O(N) to O(N^2); 2D mesh: O(4) to O(N^{1/2}).]
For a scalable, directory-based scheme, CC incurs an N-body effect: cost scales at best linearly (fixed memory size as cores are added) and at worst quadratically (memory grows linearly with the number of cores).
HW distributed memory: cost scales at best as a fixed cost for the local neighborhood and at worst as the diameter of the network.
"... each directory entry will be 128 bytes long for a 1024 core processor supporting fully-mapped directory-based cache coherence. This may often be larger than the size of the cacheline that a directory entry is expected to track."*
Assumes an app whose performance is not bound by memory bandwidth.
* Rakesh Kumar, Timothy G. Mattson, Gilles Pokam, and Rob van der Wijngaart, "The case for Message Passing on Many-core Chips," University of Illinois Urbana-Champaign Technical Report UILU-ENG-10-2203 (CRHC 10-01), 2010.
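Where the 128 bytes comes from (a worked check of the quoted claim): a fully-mapped directory keeps one presence bit per core for every tracked cache line, so
$$1024 \text{ cores} \times 1 \text{ bit/core} = 1024 \text{ bits} = 128 \text{ bytes per directory entry},$$
which is already larger than a typical 64-byte cache line before any state or tag bits are added.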
-
16
Isn't shared-memory programming easier? Not necessarily.
[Figure: effort vs. time for the two approaches.]
Message passing: extra work up front, but easier optimization and debugging means less overall time to solution.
Multi-threading: the initial parallelization can be quite easy, but difficult debugging and optimization means the overall project takes longer.
Proving that a shared-address-space program using semaphores is race free is an NP-complete problem.*
* P. N. Klein, H. Lu, and R. H. B. Netzer, "Detecting Race Conditions in Parallel Programs that Use Semaphores," Algorithmica, vol. 35, pp. 321-345, 2003.
-
The many-core challenge
Parallel hardware is ubiquitous; parallel software is rare.
Result: a fundamental and dangerous mismatch.
Our challenge: make parallel software as routine as our parallel hardware.
We have arrived at many-core solutions not because of the success of our parallel software but because of our failure to keep increasing CPU frequency.*
* Tim Mattson, a famous-software-researcher wannabe
-
Agenda
Our hardware future
Rising to the many-core software challenge
OpenCL: a common HW abstraction layer
-
Agenda
Our hardware future
Rising to the many-core software challenge
PLPP: the original parallel pattern language
OPL: patterns for engineering parallel software
Example
OpenCL: a common HW abstraction layer
-
20
How about automatic parallelization?
[Figure: speedup (%) on SPECint 2000 benchmarks (bzip2, crafty, gap, gcc, gzip, mcf, parser, twolf, vortex, vpr, and the average), scale 0-30%, comparing basic speculative multithreading, software value prediction, and enabling optimizations.]
"A Cost-Driven Compilation Framework for Speculative Parallelization of Sequential Programs," Zhao-Hui Du, Chu-Cheow Lim, Xiao-Feng Li, Chen Yang, Qingyu Zhao, Tin-Fook Ngai (Intel Corporation), PLDI 2004.
Results are for a simulated dual-core platform configured as a main core plus a core for speculative execution.
Aggressive techniques such as speculative multithreading help, but they are not enough.
The average SPECint speedup of 8% will climb to an average of 15% once their system is fully enabled.
There are no indications that auto-parallelization will radically improve any time soon.
Hence, I do not believe auto-parallelization will solve our problems.
-
21
Parallel Software
Our only hope is to get programmers to write parallel software by hand.
But after 25+ years of research, we are no closer to solving the parallel programming problem.
Only a tiny fraction of programmers write parallel code.
Will the "if you build it, they will come" principle apply?
Many hope so, but...
that implies that people didn't really try hard enough over the last 25 years. Does that really make sense?
-
Solution: Find A Good parallel programming model, right?
ABCPL
ACE
ACT++
Active messages
Adl
Adsmith
ADDAP
AFAPI
ALWAN
AM
AMDC
AppLeS
Amoeba
ARTS
Athapascan-0b
Aurora
Automap
bb_threads
Blaze
BSP
BlockComm
C*.
"C* in C
C**
CarlOS
Cashmere
C4
CC++
Chu
Charlotte
Charm
Charm++
Cid
Cilk
CM-Fortran
Converse
Code
COOL
CORRELATE
CPS
CRL
CSP
Cthreads
CUMULVS
DAGGER
DAPPLE
Data Parallel C
DC++
DCE++
DDD
DICE.
DIPC
DOLIB
DOME
DOSMOS.
DRL
DSM-Threads
Ease .
ECO
Eiffel
Eilean
Emerald
EPL
Excalibur
Express
Falcon
Filaments
FM
FLASH
The FORCE
Fork
Fortran-M
FX
GA
GAMMA
Glenda
GLU
GUARD
HAsL.
Haskell
HPC++
JAVAR.
HORUS
HPC
IMPACT
ISIS.
JAVAR
JADE
Java RMI
javaPG
JavaSpace
JIDL
Joyce
Khoros
Karma
KOAN/Fortran-S
LAM
Lilac
Linda
JADA
WWWinda
ISETL-Linda
ParLin
Eilean
P4-Linda
POSYBL
Objective-Linda
LiPS
Locust
Lparx
Lucid
Maisie
Manifold
Mentat
Legion
Meta Chaos
Midway
Millipede
CparPar
Mirage
MpC
MOSIX
Modula-P
Modula-2*
Multipol
MPI
MPC++
Munin
Nano-Threads
NESL
NetClasses++
Nexus
Nimrod
NOW
Objective Linda
Occam
Omega
OpenMP
Orca
OOF90
P++
P3L
Pablo
PADE
PADRE
Panda
Papers
AFAPI.
Para++
Paradigm
Parafrase2
Paralation
Parallel-C++
Parallaxis
ParC
ParLib++
ParLin
Parmacs
Parti
pC
PCN
PCP:
PH
PEACE
PCU
PET
PENNY
Phosphorus
POET.
Polaris
POOMA
POOL-T
PRESTO
P-RIO
Prospero
Proteus
QPC++
PVM
PSI
PSDM
Quake
Quark
Quick Threads
Sage++
SCANDAL
SAM
pC++
SCHEDULE
SciTL
SDDA.
SHMEM
SIMPLE
Sina
SISAL.
distributed smalltalk
SMI.
SONiC
Split-C.
SR
Sthreads
Strand.
SUIF.
Synergy
Telegrphos
SuperPascal
TCGMSG.
Threads.h++.
TreadMarks
TRAPPER
uC++
UNITY
UC
V
ViC*
Visifold V-NUS
VPE
Win32 threads
WinPar
XENOOPS
XPC
Zounds
ZPL
Third party names are the property of their owners.
Models from the golden age of parallel programming (~1995)
-
The only thing sillier than creating too many models is using too many
[The same list of parallel programming models as on the previous slide; in the original figure, the models the speaker has personally worked with are highlighted.]
Third party names are the property of their owners.
Programming models I've worked with.
-
Choice overload: too many options can hurt you
The Draeger grocery store experiment on consumer choice:
Two jam displays with coupons for a purchase discount: one with 24 different jams, one with 6 different jams.
How many shoppers stopped to try samples at the display? 60% at the 24-jam display, 40% at the 6-jam display.
Of those who tried, how many bought jam? 3% at the 24-jam display, 30% at the 6-jam display.
"The findings from this study show that an extensive array of options can at first seem highly appealing to consumers, yet can reduce their subsequent motivation to purchase the product."
Iyengar, Sheena S., & Lepper, Mark (2000). When choice is demotivating: Can one desire too much of a good thing? Journal of Personality and Social Psychology, 76, 995-1006.
Programmers don't need a glut of options; just give us something that works OK on every platform we care about. Give us a decent standard and we'll do the rest.
-
My optimistic view from 2005
We've learned our lesson: we emphasize a small number of industry standards.
-
But we didn't learn our lesson. History is repeating itself!
Third party names are the property of their owners.
A small sampling of models from the NEW golden age of parallel programming (2010)
Cilk++
CnC
Ct
MYO
RCCE
OpenCL
TBB
Chapel
Charm++
ConcRT
CUDA
Erlang
F#
Fortress
Go
Hadoop
mpc
UPC
PPL
X10
PLINQ
We've lost our way and have slipped back into the "just create a new language" mentality.
-
27
If language obsession is not the solution, what is?
Consider an early (and successful) adopter of many-core technologies: the gaming industry.
Game development, gross generalizations:
Time and money: 1-3 years for 1-20 million dollars.
A blockbuster game has revenues that rival those of a major Hollywood movie.
Major games take teams of 50 to 100 people.
Only a quarter of those people are programmers.
-
28
Game Development
The key: enforce a separation of concerns. A small number of technology experts build the game engine and frameworks; the much larger group of game designers and domain programmers build the game on top of them.
-
What's happening at Microsoft?
[Figure: Microsoft's concurrency stack.] A great infrastructure for building composable software frameworks!
Source: "An Insider's View to Concurrency at Microsoft," Steven Toub, Univ. of Utah, Sept. 2010
-
30
Modular software, frameworks, and parallel computing
As parallelism goes mainstream and reaches extreme levels of scalability, we can't retrain all the world's programmers to handle disruptive scalable hardware.
Modular software development techniques and frameworks will save the day.
Framework: a software platform providing partial solutions to a class of problems that can be specialized to solve a specific problem.
This is not a new idea: Cactus, the Common Component Architecture, SIERRA, and many others; we just need to push it further and deeper.
We need a systematic way to build a useful collection of frameworks for scalable computing.
3rd party names are the property of their owners.
-
31
Systematic framework design with patterns
A pattern is a well-known solution to a recurring problem; it captures expert knowledge in a form that can be peer-reviewed, refined, and passed on to others.
An architecture defines the organization and structure of a software system; it can be defined by a hierarchical composition of patterns.
A framework is a software environment based on an architecture: a partial solution for a domain of problems that can be specialized to solve specific problems.
Patterns → Architectures → Frameworks
-
Agenda
Our hardware future
Rising to the many-core software challenge
PLPP: the original parallel pattern language
OPL: patterns for engineering parallel software
Example
OpenCL: a common HW abstraction layer
-
33
A Pattern Language for Parallel Programming (PLPP)
A pattern language for parallel algorithm design, with examples in MPI, OpenMP, and Java.
This is our hypothesis for how programmers think about parallel programming.
I urge you all to tear this apart, correct the errors, add patterns where they are missing, and help make this better.
-
34
PLPP's structure: four design spaces in parallel software development
[Figure: Original Problem → (Decomposition) → tasks, shared and local data → (Implementation & building blocks) → corresponding source code.]

Program SPMD_Emb_Par ()
{  TYPE *tmp, *func();
   global_array Data(TYPE);
   global_array Res(TYPE);
   int N = get_num_procs();
   int id = get_proc_id();
   if (id==0) setup_problem(N, Data);
   for (int I= 0; I<N; I++){      /* pseudocode: each UE works through its share of the tasks */
      tmp = func(I);
      Res.accumulate(tmp);
   }
}
-
35
Decomposition (Finding Concurrency)
Start with a specification that solves the original problem; finish with the problem decomposed into tasks, shared data, and a partial ordering.
[Figure: Start → Decomposition Analysis (Task Decomposition, Data Decomposition) → Dependency Analysis (Group Tasks, Order Groups, Data Sharing) → Design Evaluation.]
-
36
Algorithm Strategy (Algorithm Structure)
[Decision tree. Legend: decision points lead to design patterns.]
Start: how is the concurrency organized?
Organize by tasks: linear? → Task Parallelism; recursive? → Divide and Conquer.
Organize by data: linear? → Geometric Decomposition; recursive? → Recursive Data.
Organize by flow of data: regular? → Pipeline; irregular? → Event-Based Coordination.
-
37
Implementation strategy (Supporting Structures)
High-level constructs impacting the large-scale organization of the source code.
Program structure: SPMD, Master/Worker, Loop Parallelism, Fork/Join.
Data structures: Shared Data, Shared Queue, Distributed Array.
-
38
Parallel programming building blocks
Low-level constructs implementing specific mechanisms used in parallel computing. Examples in Java, OpenMP, and MPI.
These are not properly design patterns, but they are included to make the pattern language self-contained.
UE management (UE = unit of execution): process control, thread control.
Synchronization: memory sync/fences, barriers, mutual exclusion.
Communications: message passing, collective communication, other communication.
-
39
A simple example: the PI program (numerical integration)
Mathematically, we know that:
$$\int_0^1 \frac{4.0}{1+x^2}\,dx = \pi$$
We can approximate the integral as a sum of rectangles:
$$\pi \approx \sum_{i=0}^{N} F(x_i)\,\Delta x$$
where each rectangle has width $\Delta x$ and height $F(x_i)$ at the middle of interval $i$.
[Figure: the curve $F(x) = 4.0/(1+x^2)$ on $[0,1]$, y-axis from 0.0 to 4.0, approximated by rectangles.]
-
40
PI Program: the sequential program

static long num_steps = 100000;
double step;
void main ()
{  int i;  double x, pi, sum = 0.0;
   step = 1.0/(double) num_steps;
   for (i=1; i<= num_steps; i++){
      x = (i-0.5)*step;
      sum = sum + 4.0/(1.0+x*x);
   }
   pi = step * sum;
}
-
41
OpenMP PI Program: loop-level parallelism pattern

#include <omp.h>
static long num_steps = 100000; double step;
#define NUM_THREADS 2
void main ()
{  int i;  double x, pi, sum = 0.0;
   step = 1.0/(double) num_steps;
   omp_set_num_threads(NUM_THREADS);
   #pragma omp parallel for private(x) reduction(+:sum)
   for (i=0; i< num_steps; i++){
      x = (i+0.5)*step;
      sum += 4.0/(1.0+x*x);
   }
   pi = step * sum;
}
Loop-level parallelism: parallelism expressed solely by (1) exposing concurrency, (2) managing dependencies, and (3) splitting up loops.
-
42
MPI Pi program: SPMD pattern
SPMD programs: each process runs the same code, with the process ID (rank) selecting process-specific behavior.

#include <mpi.h>
static long num_steps = 100000;
void main (int argc, char *argv[])
{  int i, id, numprocs, step1, stepN;
   double x, pi, step, sum = 0.0;
   step = 1.0/(double) num_steps;
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &id);
   MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
   step1 = id * num_steps/numprocs;
   stepN = (id+1) * num_steps/numprocs;
   if (stepN > num_steps) stepN = num_steps;
   for (i=step1; i<stepN; i++){
      x = (i+0.5)*step;
      sum += 4.0/(1.0+x*x);
   }
   sum *= step;
   MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
   MPI_Finalize();
}
-
43
#include <windows.h>
#define NUM_THREADS 2
HANDLE thread_handles[NUM_THREADS];
CRITICAL_SECTION hUpdateMutex;
static long num_steps = 100000;
double step;
double global_sum = 0.0;

void Pi (void *arg)
{
   int i, start;
   double x, sum = 0.0;
   start = *(int *) arg;
   step = 1.0/(double) num_steps;
   for (i=start; i<num_steps; i=i+NUM_THREADS){
      x = (i+0.5)*step;
      sum += 4.0/(1.0+x*x);
   }
   EnterCriticalSection(&hUpdateMutex);
   global_sum += sum;
   LeaveCriticalSection(&hUpdateMutex);
}
-
44
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E. Rasmussen
PI Program: Cilk (divide and conquer implemented with fork-join)

static long num_steps = 1073741824;   // I'm lazy: make it a power of 2
double step = 1.0/(double) num_steps;
cilk double pi_comp(int istep, int nstep){
   double x, sum = 0.0, sum1, sum2;
   if (nstep < MIN_SIZE)                 // base case: serial sum over this chunk
      for (int i=istep; i<istep+nstep; i++){ x = (i+0.5)*step; sum += 4.0/(1.0+x*x); }
   else {                                // recursive case: fork the two halves, then join
      sum1 = spawn pi_comp(istep, nstep/2);
      sum2 = spawn pi_comp(istep+nstep/2, nstep/2);
      sync;
      sum = sum1 + sum2;
   }
   return sum;
}
-
PLPP limitations
We tried to make PLPP a language for general-purpose programming, but it reflects the scientific computing background of its authors:
It focuses on PDEs, solvers, N-body, and other scientific problems.
It assumes a monolithic SW architecture (i.e., no notion of architecture: just build an algorithm and make it fast).
To really solve the parallel programming problem, we need to address the full breadth of the software engineering problem.
PLPP is a pattern language for implementing parallel algorithms.
We need a pattern language for engineering parallel applications.
-
Agenda
Our hardware future
Rising to the many-core software challenge
PLPP: the original parallel pattern language
OPL: patterns for engineering parallel software
Example
OpenCL: a common HW abstraction layer
-
47
Working with Prof. Kurt Keutzer and his group at UC Berkeley, we've come up with a new and more expansive pattern language.
[Figure: PLPP, the Pattern Language of Parallel Programming, plus the Berkeley "13 dwarfs".]
-
48
The OPL/PLPP 2010 pattern language (the layers of the figure, top to bottom):
Applications
Structural Patterns: Pipe-and-Filter, Agent-and-Repository, Process-Control, Event-Based/Implicit-Invocation, Puppeteer, Model-View-Controller, Iterative-Refinement, Map-Reduce, Layered-Systems, Arbitrary-Static-Task-Graph
Computational Patterns: Graph-Algorithms, Dynamic-Programming, Dense-Linear-Algebra, Sparse-Linear-Algebra, Unstructured-Grids, Structured-Grids, Graphical-Models, Finite-State-Machines, Backtrack-Branch-and-Bound, N-Body-Methods, Circuits, Spectral-Methods, Monte-Carlo
Concurrent Algorithm Strategy Patterns: Task-Parallelism, Divide-and-Conquer, Data-Parallelism, Pipeline, Discrete-Event, Geometric-Decomposition, Speculation
Implementation Strategy Patterns: program structure (SPMD, Data-Par/index-space, Fork/Join, Actors, Loop-Par., Task-Queue, Task-Graph); data structure (Distributed-Array, Shared-Data, Shared-Queue, Shared-Map, Partitioned-Graph)
Parallel Execution Patterns: MIMD, SIMD, Thread-Pool, Transactions, Message-Passing, Collective-Comm., Point-To-Point-Sync. (mutual exclusion), Collective-Sync. (barrier), Memory sync/fence, Transactional memory
Concurrency Foundation constructs (not expressed as patterns): thread creation/destruction, process creation/destruction
-
49
[The same OPL/PLPP 2010 diagram, annotated with its sources: the Computational Patterns come from the Berkeley View "13 dwarfs," and the Structural Patterns draw on Garlan and Shaw's architectural styles.]
http://parlab.eecs.berkeley.edu/wiki/patterns/patterns
-
50
[The OPL/PLPP 2010 diagram repeated. http://parlab.eecs.berkeley.edu/wiki/patterns/patterns]
-
51
Identify the SW structure: structural patterns
Pipe-and-Filter, Agent-and-Repository, Event-based coordination, Iterative refinement, MapReduce, Process Control, Layered Systems
These define the structure of our software, but they do not describe what is computed.
-
52
Elements of a structural pattern
Components are where the computation happens.
Connectors are where the communication happens.
A configuration is a graph of components (vertices) and connectors (edges).
A structural pattern may be described as a family of graphs.
-
53
Pattern 1: Pipe and Filter
[Figure: a graph of filters (Filter 1 through Filter 7) connected by pipes.]
Filters embody computation; they only see their inputs and produce outputs.
Pipes embody communication.
The graph may have feedback.
-
54
Examples of pipe and filter
Almost every large software program has a pipe-and-filter structure at the highest level.
[Examples pictured: a compiler, an image retrieval system, a logic optimizer.]
-
55
Pattern 2: Iterative Refinement
[Figure: initialization condition → a variety of functions performed asynchronously → synchronize the results of the iteration → exit condition met? If no, iterate again; if yes, exit.]
-
56
Example of the Iterative Refinement pattern: training a classifier (SVM training)
[Figure: the iterative-refinement structural pattern instantiated for SVM training: each iteration identifies outliers and updates the decision surface, repeating until all points are within acceptable error.]
-
57
Pattern 3: MapReduce
To us, it means:
A map stage, where data is mapped onto independent computations.
A reduce stage, where the results of the map stage are summarized (i.e., reduced).
[Figure: two map stages feeding two reduce stages.]
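To make the two stages concrete, here is a minimal, self-contained sketch of the pattern in C with OpenMP (my illustration, not from the slides; it assumes an OpenMP 3.1 compiler for the max reduction). The map step scores each element independently and the reduce step keeps the maximum score.

#include <stdio.h>
#define N 1000000

/* map: an independent computation per element (a toy scoring function) */
static double score(int i) { double x = (double)i / N; return x * (1.0 - x); }

int main(void)
{
    double best = 0.0;
    int i;
    /* map each index to a score, then reduce with "max" to summarize the results */
    #pragma omp parallel for reduction(max:best)
    for (i = 0; i < N; i++) {
        double s = score(i);        /* map stage */
        if (s > best) best = s;     /* reduce stage (max) */
    }
    printf("best score = %f\n", best);
    return 0;
}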
-
58
Examples of MapReduce
General structure: map a computation across distributed data sets; reduce the results to find the best/worst or the maxima/minima.
Speech recognition: map the HMM computation to evaluate word matches; reduce to find the most likely word sequences.
Support-vector machines (ML): map to evaluate each point's distance from the frontier; reduce to find the greatest outlier from the frontier.
-
59
[The OPL/PLPP 2010 diagram repeated, as a lead-in to the computational patterns. http://parlab.eecs.berkeley.edu/wiki/patterns/patterns]
-
60
Identify the key computations: computational patterns
Computational patterns describe the key computations, but not how they are implemented.
-
61
Computational pattern example: the N-body problem
Consider a collection of particles that interact through a pair-wise force.
N-body problems show up in a wide range of applications:
Astrophysics and celestial mechanics
Plasma simulation
Molecular dynamics
Electron-beam lithography device simulation
Fluid dynamics (vortex methods)
Game physics (cloth simulation, blood spatter, etc.)
Graph partitioning
Some elliptic PDE solvers
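The computational heart of the pattern is the all-pairs force accumulation; a minimal C sketch (my illustration, using a made-up attractive inverse-square force in one dimension, not any particular application's physics) looks like this:

#include <stdio.h>
#define N 4

int main(void)
{
    double pos[N]   = {0.0, 1.0, 2.5, 4.0};   /* particle positions (toy data) */
    double force[N] = {0.0};
    int i, j;
    /* all-pairs interaction: O(N^2) work, each pair contributes to both particles */
    for (i = 0; i < N; i++) {
        for (j = i + 1; j < N; j++) {
            double r = pos[j] - pos[i];        /* positions are sorted, so r > 0 */
            double f = 1.0 / (r * r);          /* toy attractive force magnitude */
            force[i] += f;                     /* i is pulled toward j            */
            force[j] -= f;                     /* equal and opposite on j         */
        }
    }
    for (i = 0; i < N; i++) printf("force[%d] = %f\n", i, force[i]);
    return 0;
}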
-
62
Computational pattern example: spectral methods
A class of techniques for solving certain partial differential equations, such as Poisson's equation:
$$\left(\frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2}\right) f(x,y) = g(x,y)$$
The idea is to:
Apply a discrete transform to the PDE, turning differential operators into algebraic operators.
Solve the resulting system of algebraic or ordinary differential equations.
Inverse-transform the solution to return to the original domain.
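For example (a standard worked step, assuming periodic boundary conditions so a Fourier transform applies): in Fourier space each derivative becomes multiplication by a wavenumber, so Poisson's equation collapses to an algebraic relation per mode,
$$-(k_x^2 + k_y^2)\,\hat f(k_x,k_y) = \hat g(k_x,k_y) \quad\Longrightarrow\quad \hat f(k_x,k_y) = -\frac{\hat g(k_x,k_y)}{k_x^2 + k_y^2}, \qquad (k_x,k_y) \neq (0,0),$$
after which an inverse transform recovers $f(x,y)$.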
-
Computational pattern example: graphical models
The problem is cast as a network of random variables, where edges represent (potential) dependences.
Generally, we have observed variables (shaded) and hidden variables.
We need to reason about the hidden variables given our set of observations.
The graph can be directed or undirected.
[Figure: a Hidden Markov Model.]
-
64
Architecting parallel software
Identify the software structure (structural patterns): Pipe-and-Filter, Agent-and-Repository, Event-based, Bulk Synchronous, MapReduce, Layered Systems, Arbitrary Task Graphs.
Identify the key computations (computational patterns): Graph Algorithms, Dynamic Programming, Dense/Sparse Linear Algebra, (Un)Structured Grids, Graphical Models, Finite State Machines, Backtrack Branch-and-Bound, N-Body Methods, Circuits, Spectral Methods.
Then: decompose tasks and data, order tasks, and identify data sharing and access.
An architecture is a composition of design patterns.
-
Agenda
Our hardware future
Rising to the many-core software challenge
PLPP: the original parallel pattern language
OPL: patterns for engineering parallel software
Example
OpenCL: a common HW abstraction layer
-
Patterns/framework example: LVCSR software architecture
LVCSR = large vocabulary continuous speech recognition.
[Figure: voice input → speech feature extractor → speech features → inference engine → word sequence ("I think therefore I am"). The inference engine consumes a recognition network (acoustic model, pronunciation model, language model) and runs beam-search iterations with active-state computation steps.]
Pattern annotations on the figure: Pipe-and-Filter (overall structure), Iterative Refinement (beam-search iterations), Graphical Model and Dynamic Programming (the recognition network and inference), MapReduce and Task Graph (the active-state computation steps).
-
67
A speech framework with extension points
[Figure: a WFST-based inference engine framework split across CPU and GPU. Read files and initialize data structures; then, per iteration: Phase 1, compute observation probabilities; Phase 2, for each active arc compute the arc transition probability; copy results back to the CPU; collect backtrack info; prepare the active set; iteration control. Finally: backtrack and output results.]
Extension points: file input format, pruning strategy, observation probability computation, backtrack, result output format, static data structure optimizations, dynamic data structure optimizations.
-
68
A speech framework with extension points (with example plug-ins)
[The same WFST-based inference engine framework, showing concrete options at each extension point:]
File input format: HTK format, SRI format.
Pruning strategy: fixed beam width, adaptive beam width.
Observation probability computation: HMM SRI GPU ObsProb, HMM WSJ GPU ObsProb, CHMM SRI GPU ObsProb.
Backtrack: backtrack, backtrack + confidence metric.
Result output format: HResult format, SRI scoring format.
Data structure options: two-level WFST network, one-level WFST network; preload all, selective preload.
-
Patterns/framework example: the programmer's view
Jike Chong, Ekaterina Gonina, Kisun You, Kurt Keutzer, "Exploring Recognition Network Representations for Efficient Speech Inference on Highly Parallel Platforms," submitted to Interspeech 2010.
Dorothea Kolossa, Jike Chong, Steffen Zeiler, Kurt Keutzer, "Efficient Manycore CHMM Speech Recognition for Audiovisual and Multistream Data," submitted to Interspeech 2010.
-
70
Are frameworks enough? We also need:
Efficiency, to support the extremely low overheads required by Amdahl's law: collapse layers of abstraction; dynamic optimization.
A stable, richly supported, and highly portable programming environment, to make framework development practical.
A hardware abstraction layer for heterogeneous platforms: it creates the economic justification for the effort, since one framework then supports many platforms.
The process: analyze a key application domain; discover common paths through our pattern language; these paths define the framework design; implement the framework.
-
Agenda
Our hardware future
Rising to the many-core software challenge
OpenCL: a common HW abstraction layer
-
How do we program the heterogeneous platform? Let history be our guide: consider the origins of OpenMP.
[Figure: SGI and Cray merged and needed commonality across products. KAI, an ISV, needed a larger market. DOE's ASCI program was tired of recoding for SMPs and forced the vendors to standardize; it wrote a rough-draft straw-man SMP API. DEC, IBM, Intel, HP, and other vendors were invited to join. Result: OpenMP, 1997.]
Third party names are the property of their owners.
-
OpenCL: can history repeat itself?
[Figure: AMD and ATI merged and needed commonality across products. Nvidia, a GPU vendor, wants to steal market share from the CPU; Intel, a CPU vendor, wants to steal market share from the GPU. Apple was tired of recoding for many-core CPUs and GPUs, pushed the vendors to standardize, and wrote a rough-draft straw-man API. The Khronos Compute group formed, joined by Ericsson, Sony, Blizzard, Nokia, Freescale, TI, IBM, and many more. Result: OpenCL, Dec 2008.]
As ASCI did for OpenMP, Apple is doing for GPU/CPU programming with OpenCL.
Third party names are the property of their owners.
-
OpenCL Working Group: diverse industry participation
HW vendors (e.g., Apple), system OEMs, middleware vendors, and application developers.
OpenCL became an important standard on release by virtue of the market coverage of the companies behind it.
Third party names are the property of their owners.
-
The BIG idea behind OpenCL
OpenCL execution model: execute a kernel at each point in a problem domain.
E.g., process a 1024 x 1024 image with one kernel invocation per pixel, i.e. 1024 x 1024 = 1,048,576 kernel executions.
Traditional loop-based code:

void
trad_mul(int n,
         const float *a,
         const float *b,
         float *c)
{
  int i;
  for (i=0; i<n; i++)
    c[i] = a[i] * b[i];   /* the loop walks the problem domain explicitly */
}
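For contrast, the data-parallel OpenCL form of the same computation replaces the loop with an index-space lookup; this is essentially the dp_mul kernel that reappears on the summary slide later in the talk:

__kernel void
dp_mul(__global const float *a,
       __global const float *b,
       __global float *c)
{
  int id = get_global_id(0);   /* which point of the problem domain this work-item handles */
  c[id] = a[id] * b[id];
}                              /* execute over "n" work-items */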
-
An N-dimensional domain of work-items
Define an N-dimensional index space that is best for your algorithm:
Global dimensions: 1024 x 1024 (the whole problem space).
Local dimensions: 128 x 128 (a work-group, which executes together).
Synchronization between work-items is possible only within work-groups, using barriers and memory fences.
You cannot synchronize outside of a work-group.
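On the host side, the global and local sizes are passed to clEnqueueNDRangeKernel; a minimal sketch using the slide's numbers (a fragment: cmd_queue and kernel are assumed to exist, and real devices cap the work-group size, so 128 x 128 local dimensions are illustrative only):

size_t global[2] = {1024, 1024};   /* whole problem space                            */
size_t local[2]  = {128, 128};     /* one work-group; must not exceed the device cap */
err = clEnqueueNDRangeKernel(cmd_queue, kernel,
                             2,        /* work_dim: a 2D index space */
                             NULL,     /* no global offset           */
                             global, local,
                             0, NULL, NULL);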
-
To use OpenCL, you must
Define the platform
Execute code on the platform
Move data around in memory
Write (and build) programs
-
OpenCL Platform Model
One Host + one or more Compute Devices
Each Compute Device is composed of one or more Compute Units
Each Compute Unit is further divided into one or more Processing Elements
-
OpenCL Execution Model
An OpenCL application runs on a host, which submits work to the compute devices.
Work-item: the basic unit of work on an OpenCL device.
Kernel: the code for a work-item; basically a C function.
Program: a collection of kernels and other functions (analogous to a dynamic library).
Context: the environment within which work-items execute; it includes devices, their memories, and command queues.
Applications queue kernel execution instances:
Queued in order, one queue per device.
Executed in order or out of order.
[Figure: a context containing a GPU and a CPU, each with its own command queue.]
-
OpenCL Memory Model
Memory management is explicit: you must move data from host -> global -> local and back.
Private memory: per work-item.
Local memory: shared within a work-group.
Global/constant memory: visible to all work-groups on a compute device.
Host memory: on the CPU.
[Figure: host memory on the host; global/constant memory on the compute device; each work-group has its own local memory; each work-item has its own private memory.]
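As a small illustration of that hierarchy (my sketch, not from the slides): a kernel can stage data in __local memory and use a work-group barrier before reusing it. Here each work-group reverses its own block of the input.

__kernel void block_reverse(__global const float *in,
                            __global float *out,
                            __local  float *tmp)      /* local buffer, one per work-group */
{
  int gid = get_global_id(0);
  int lid = get_local_id(0);
  int lsz = get_local_size(0);
  tmp[lid] = in[gid];                       /* stage into fast local memory             */
  barrier(CLK_LOCAL_MEM_FENCE);             /* all work-items in the group must arrive  */
  out[gid] = tmp[lsz - 1 - lid];            /* read another work-item's staged element  */
}

The __local buffer is sized from the host with clSetKernelArg(kernel, 2, local_size*sizeof(float), NULL).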
-
Programming kernels: the OpenCL C Language
A subset of ISO C99
But without some C99 features such as standard C99 headers, function pointers, recursion, variable length arrays, and bit fields
A superset of ISO C99 with additions for:
Work-items and workgroups
Vector types
Synchronization
Address space qualifiers
Also includes a large set of built-in functions for image manipulation, work-item manipulation, specialized math routines, etc.
-
Programming kernels: data types
Scalar data types:
char, uchar, short, ushort, int, uint, long, ulong, float
bool, intptr_t, ptrdiff_t, size_t, uintptr_t, void, half (storage only)
Image types:
image2d_t, image3d_t, sampler_t
Vector data types:
Vector lengths of 2, 4, 8, and 16 (char2, ushort4, int8, float16, double2, ...)
Endian safe; aligned at the vector length; vector operations and built-in functions.

int4 vi0 = (int4) -7;                      // -7 -7 -7 -7
int4 vi1 = (int4)(0, 1, 2, 3);             //  0  1  2  3
vi0.lo = vi1.hi;                           //  2  3 -7 -7
int8 v8 = (int8)(vi0, vi1.s01, vi1.odd);   //  2  3 -7 -7  0  1  1  3

Double is an optional type in OpenCL 1.0.
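A tiny kernel using a vector type, just to show the syntax in context (my sketch, not from the slides): each work-item scales one float4, i.e. four floats at a time.

__kernel void scale4(__global float4 *v, const float s)
{
  int i = get_global_id(0);   /* one float4 (four floats) per work-item      */
  v[i] = v[i] * s;            /* component-wise vector multiply by a scalar  */
}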
-
Building program objects
The program object encapsulates:
A context
The program source or binary
A list of target devices and build options
The build process creates a program object with clCreateProgramWithSource() or clCreateProgramWithBinary(); the program is then compiled for each device (e.g., once for the GPU and once for the CPU).

Example kernel code:
kernel void
horizontal_reflect(read_only image2d_t src,
                   write_only image2d_t dst)
{
  int x = get_global_id(0); // x-coord
  int y = get_global_id(1); // y-coord
  int width = get_image_width(src);
  float4 src_val = read_imagef(src, sampler,
                               (int2)(width-1-x, y));
  write_imagef(dst, (int2)(x, y), src_val);
}
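When a build fails, the compiler diagnostics live on the program object; a common host-side idiom (a sketch using standard OpenCL 1.0 calls; program, devices, and err are assumed from the surrounding host code) is to fetch the build log for the device:

err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
if (err != CL_SUCCESS) {
    char log[4096];
    size_t log_size;
    /* ask the runtime for this device's compiler output */
    clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG,
                          sizeof(log), log, &log_size);
    printf("OpenCL build failed:\n%s\n", log);
}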
-
Vector Addition - Kernel
__kernel void vec_add (__global const float *a,
__global const float *b,
__global float *c)
{
int gid = get_global_id(0);
c[gid] = a[gid] + b[gid];
}
-
Vector Addition: Host Program
// create the OpenCL context on a GPU device
context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

// get the list of GPU devices associated with the context
clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
devices = malloc(cb);
clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

// create a command-queue
cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(cl_float)*n, srcA, NULL);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(cl_float)*n, srcB, NULL);
memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(cl_float)*n, NULL, NULL);

// create the program
program = clCreateProgramWithSource(context, 1, &program_source, NULL, NULL);

// build the program
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

// create the kernel
kernel = clCreateKernel(program, "vec_add", NULL);

// set the kernel argument values
err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &memobjs[0]);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *) &memobjs[1]);
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *) &memobjs[2]);

// set work-item dimensions
global_work_size[0] = n;

// execute the kernel
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);

// read the output array
err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0, n*sizeof(cl_float), dst, 0, NULL, NULL);
-
Vector Addition: Host Program (annotated)
[The same host program as on the previous slide, with its sections labeled:]
1. Define the platform and queues (create the context, get the device list, create a command queue).
2. Define the memory objects (the three buffers).
3. Create the program.
4. Build the program.
5. Create and set up the kernel (create the kernel object, set its arguments).
6. Execute the kernel (set the work-item dimensions, enqueue the NDRange).
7. Read the results back to the host.
It's complicated, but most of this is boilerplate and not as bad as it looks.
-
OpenCL synchronization: queues and events
Events can be used to synchronize kernel executions between queues.
Example: two queues on two devices (a GPU and a CPU).
Without an event dependency, Kernel 2 starts before the results from Kernel 1 are ready.
With an event dependency, Kernel 2 waits for an event from Kernel 1 and does not start until the results are ready.
[Figure: timelines of the two cases.]
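In host code, the dependency is expressed by passing the event returned from the first enqueue in the wait list of the second; a minimal sketch (a fragment assuming cmd_queue_gpu and cmd_queue_cpu were created on the two devices, and kernel1/kernel2 already set up):

cl_event k1_done;

/* Kernel 1 on the GPU queue; ask for an event that marks its completion */
err = clEnqueueNDRangeKernel(cmd_queue_gpu, kernel1, 1, NULL,
                             global_work_size, NULL, 0, NULL, &k1_done);

/* Kernel 2 on the CPU queue; it will not start until k1_done has fired */
err = clEnqueueNDRangeKernel(cmd_queue_cpu, kernel2, 1, NULL,
                             global_work_size, NULL, 1, &k1_done, NULL);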
-
OpenCL summary
[Figure: an OpenCL context spanning a CPU and a GPU compute device, each with in-order and out-of-order command queues. The dp_mul kernel source is compiled into a CPU program binary and a GPU program binary; kernels carry their bound argument values; memory objects (images and buffers) and command queues live in the context.]

__kernel void
dp_mul(global const float *a,
       global const float *b,
       global float *c)
{
  int id = get_global_id(0);
  c[id] = a[id] * b[id];
}
Third party names are the property of their owners.
-
89
Conclusions
Many-core processors mean all software must be parallel.
We can't turn everyone into parallel algorithm experts; we have to support a separation of concerns:
Hardcore experts build programming frameworks, emphasizing efficiency, using industry-standard languages.
Domain-expert programmers assemble applications within a framework, emphasizing productivity.
Design patterns are a tool to help us get this right.
"We believe software architectures can be built up from a manageable number of design patterns. These patterns define the building blocks of all software engineering and are fundamental to the practice of architecting parallel software. Hence, an effort to propose, argue about, and finally agree on what constitutes this set of patterns is the seminal intellectual challenge of our field."
K. Keutzer and T. Mattson, "A design pattern language for engineering (parallel) software," Intel Technology Journal, vol. 13, no. 4, 2010.
-
90
Patterns acknowledgements
Kurt Keutzer (UCB), Ralph Johnson (UIUC), and our community of pattern writers: Hugo Andrade, Chris Batten, Eric Battenberg, Hovig Bayandorian, Dai Bui, Bryan Catanzaro, Jike Chong, Enylton Coelho, Katya Gonina, Yunsup Lee, Mark Murphy, Heidi Pan, Kaushik Ravindran, Sayak Ray, Erich Strohmaier, Bor-yiing Su, Narayanan Sundaram, Guogiang Wang, Youngmin Yi, Jeff Anderson-Lee, Joel Jones, Terry Ligocki, and Sam Williams.
The development of our pattern language has also received a boost from ParLab faculty, particularly Krste Asanovic, Jim Demmel, and David Patterson.
My co-authors, colleagues, and friends who helped write the PLPP patterns book:
Beverly Sanders (University of Florida)
Berna Massingill (Trinity University)