Parallel Programming Course: Many-Core Software Patterns and OpenCL


  • 11

    Many core processors: Opportunities and challenges

    (or an old supercomputing hacker rambles on about parallel computing)

    Tim Mattson

    Intel Labs

    [email protected]

  • Disclosure

    The views expressed in this talk are those of the speaker and not his employer.

    I am in a research group and know nothing about Intel products, so anything I say about them is highly suspect. I might even lie!

    This was a team effort, but if I say anything really stupid, it's all my fault; don't blame my collaborators.

  • Sources for the slides

    Slides from my work at Intel.

    Slides I created with UC Berkeley ParLab colleagues. Most of these come from courses I taught with Prof. Kurt Keutzer.

    Slides I developed with members of the Khronos compute group (OpenCL).

  • Agenda: Our hardware future

    Rising to the many core software challenge

    OpenCL: A common HW abstraction layer

  • Parallel hardware trends. Top 500: total number of processors (1993-2000)

    [Chart: total processor count across the June Top 500 lists, 1993-2000, on a scale of 0 to 140,000.]

    The early years: SIMD/MPP fades and clusters take over.

    Source: the June lists from www.top500.org

  • Parallel hardware trends. Top 500: total number of processors (1993-2009)

    [Chart: total processor count across the June Top 500 lists, 1993-2009, on a scale of 0 to 4,500,000.]

    Disruptive technology trend!

    Source: the June lists from www.top500.org

  • 7

    The era of many-core processors has arrived

    [Slide shows example chips and their core counts: IBM Cell (1 CPU + 6 cores), NVIDIA Tesla C1060 (240 cores), the Intel 80-core terascale research chip, ATI RV770 (160 cores), and the Intel SCC research chip (48 cores).]

    3rd party names are the property of their owners. Source: SC09 OpenCL tutorial

  • 8

    Why many core? It's all about power.

    [Chart: power vs. scalar performance, normalized to the i486 process technology, for the i486, Pentium, Pentium Pro, Pentium M, Pentium 4 (Wmt), and Pentium 4 (Psc). The fit is power = perf ^ 1.74.]

    Growth in power is unsustainable.

    Source: Ed Grochowski, Intel

  • 9

    Design with power in mind

    [Same chart: power vs. scalar performance with power = perf ^ 1.74; the "31 pipeline stages" label marks the deep-pipeline Pentium 4 designs.]

    Mobile CPUs with shallow pipelines use less power.

  • 10

    Using multiple cores to reduce power

    One processor at frequency f versus two processors each at f/2, taking the same input to the same output:

    Single core: Capacitance = C, Voltage = V, Frequency = f, Power = C V^2 f

    Dual core: Capacitance = 2.2C, Voltage = 0.6V, Frequency = 0.5f, Power = 0.396 C V^2 f

    (A small arithmetic check of these numbers follows below.)

    Chandrakasan, A.P.; Potkonjak, M.; Mehra, R.; Rabaey, J.; Brodersen, R.W., "Optimizing power using transformations," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 14, no. 1, pp. 12-31, Jan 1995.

    Source: K. Keutzer of UCB
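
    A quick arithmetic check of the numbers above (my own sketch, not part of the original slide): plugging the quoted capacitance, voltage, and frequency scaling into the dynamic power model P = C V^2 f reproduces the 0.396 factor.

    #include <stdio.h>

    /* Dynamic power model used on the slide: P = C * V^2 * f.
     * Two cores at half frequency and 0.6x voltage, with 2.2x total
     * capacitance, land at roughly 0.4x the original power. */
    int main(void)
    {
        double C = 1.0, V = 1.0, f = 1.0;      /* normalized single-core values */
        double p_one  = C * V * V * f;         /* = 1.000                       */
        double p_dual = (2.2 * C) * (0.6 * V) * (0.6 * V) * (0.5 * f);
        printf("single core: %.3f   dual core: %.3f\n", p_one, p_dual);  /* 1.000, 0.396 */
        return 0;
    }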

  • 11

    Specialized silicon is more power efficient

    Peak single-precision GFLOPS per Watt at the peak thermal design point; all processors manufactured in a 65 nm process technology.

    [Chart: GFLOPS/Watt, scale 0 to 16, for the NVIDIA GTX 280 (236 W), the Intel 80-core terascale processor (97 W), and the Intel Core 2 Quad processor Q6700 (95 W).]

    Sources: vendor published data sheets and the 80-core paper from SC08 by Mattson, van der Wijngaart, and Frumkin.

  • 12

    The future of HW

    [Diagram: a heterogeneous SOC combining CPUs, a GPU, a GMCH, and an ICH, attached to DRAM. GMCH = graphics memory control hub, ICH = input/output control hub, SOC = system on a chip.]

    Moore's law marches on: more and more transistors on a single die (Intel's 22 nm fab, at a cost of 6 to 8 billion dollars, will open in 2013).

    Heterogeneous, many-core SOCs will be the standard building block of computing.

    How will we use those transistors? Many heterogeneous cores: GPU, CPU, I/O hubs, etc., all on one SOC die.

  • 13

    How will the cores be connected?

    An Intel exec's slide from IDF 2006

  • 14

    Challenging the sacred cows

    [Diagram: a tiled array of streamlined IA cores optimized for multithreading, each with a local cache, plus a shared cache.]

    Assumes a cache-coherent shared address space! Is that the right choice?

    Most expert programmers do not fully understand the relaxed-consistency memory models required to make cache-coherent architectures work.

    The only programming models proven to scale non-trivial apps to 100s to 1000s of cores are all based on distributed memory.

    Coherence incurs additional architectural overhead.

  • 15

    The Coherency Wall: as you scale the number of cores on a cache-coherent (CC) system, the cost in time and memory grows to a point beyond which the additional cores are not useful in a single parallel program. This is the coherency wall.

    "... each directory entry will be 128 bytes long for a 1024 core processor supporting fully-mapped directory-based cache coherence. This may often be larger than the size of the cacheline that a directory entry is expected to track."*

    [Figure: cost (time and/or memory) vs. number of cores. CC: O(N) to O(N^2); 2D mesh: O(4) to O(N^(1/2)).]

    For a scalable, directory-based scheme, CC incurs an N-body effect: cost scales at best linearly (fixed memory size as cores are added) and at worst quadratically (memory grows linearly with the number of cores).

    HW distributed memory: HW cost scales at best as a fixed cost for the local neighborhood and at worst as the diameter of the network.

    Assumes an app whose performance is not bound by memory bandwidth. (A worked example of the directory-entry arithmetic follows below.)

    * Rakesh Kumar, Timothy G. Mattson, Gilles Pokam, and Rob van der Wijngaart, "The case for Message Passing on Many-core Chips," University of Illinois Urbana-Champaign Technical Report UILU-ENG-10-2203 (CRHC 10-01), 2010.
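
    To make the quoted directory-entry number concrete, here is a small worked example (my own sketch, not from the slide): a fully-mapped directory keeps one presence bit per core for every tracked cache line, so 1024 cores need 1024 bits = 128 bytes per entry, which can exceed the 64-byte line it tracks.

    #include <stdio.h>

    /* Fully-mapped directory-based coherence: one presence bit per core
     * for every tracked cache line (extra state bits ignored here). */
    int main(void)
    {
        int cores = 1024;
        int entry_bytes = cores / 8;     /* 1024 bits -> 128 bytes        */
        int line_bytes  = 64;            /* a common cache-line size      */
        printf("directory entry: %d bytes per %d-byte cache line\n",
               entry_bytes, line_bytes);
        return 0;
    }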

  • 16

    Isn't shared memory programming easier? Not necessarily.

    [Figure: effort vs. time for the two approaches.]

    Message passing: extra work up front, but easier optimization and debugging mean, overall, less time to solution.

    Multi-threading: initial parallelization can be quite easy, but difficult debugging and optimization mean the overall project takes longer.

    Proving that a shared-address-space program using semaphores is race free is an NP-complete problem.*

    * P. N. Klein, H. Lu, and R. H. B. Netzer, "Detecting Race Conditions in Parallel Programs that Use Semaphores," Algorithmica, vol. 35, pp. 321-345, 2003.

  • The many-core challenge

    Result: a fundamental and dangerous mismatch. Parallel hardware is ubiquitous; parallel software is rare.

    Our challenge: make parallel software as routine as our parallel hardware.

    "We have arrived at many-core solutions not because of the success of our parallel software but because of our failure to keep increasing CPU frequency."*

    * Tim Mattson, a famous-software-researcher wannabe

  • Agenda: Our hardware future

    Rising to the many core software challenge

    OpenCL: A common HW abstraction layer

  • Agenda: Our hardware future

    Rising to the many core software challenge

    PLPP the original parallel pattern language

    OPL patterns for engineering parallel software

    Example

    OpenCL: A common HW abstraction layer

  • 20

    How about automatic parallelization?

    [Chart: speedup (%) on SPECint benchmarks (bzip2, crafty, gap, gcc, gzip, mcf, parser, twolf, vortex, vpr, and the average) for basic speculative multithreading, software value prediction, and enabling optimizations.]

    "A Cost-Driven Compilation Framework for Speculative Parallelization of Sequential Programs," Zhao-Hui Du, Chu-Cheow Lim, Xiao-Feng Li, Chen Yang, Qingyu Zhao, Tin-Fook Ngai (Intel Corporation), PLDI 2004.

    Aggressive techniques such as speculative multithreading help, but they are not enough.

    The average SPECint speedup of 8% will climb to an average of 15% once their system is fully enabled.

    There are no indications that auto-parallelization will radically improve any time soon. Hence, I do not believe auto-parallelization will solve our problems.

    Results are for a simulated dual-core platform configured as a main core and a core for speculative execution.

  • 21

    Parallel Software

    Our only hope is to get programmers to write parallel software by hand. But after 25+ years of research, we are no closer to solving the parallel programming problem: only a tiny fraction of programmers write parallel code.

    Will the "if you build it, they will come" principle apply? Many hope so, but that implies that people didn't really try hard enough over the last 25 years. Does that really make sense?

  • Solution: Find A Good parallel programming model, right?

    ABCPL, ACE, ACT++, Active messages, Adl, Adsmith, ADDAP, AFAPI, ALWAN, AM, AMDC, AppLeS, Amoeba, ARTS, Athapascan-0b, Aurora, Automap, bb_threads, Blaze, BSP, BlockComm, C*, "C* in C", C**, CarlOS, Cashmere, C4, CC++, Chu, Charlotte, Charm, Charm++, Cid, Cilk, CM-Fortran, Converse, Code, COOL, CORRELATE, CPS, CRL, CSP, Cthreads, CUMULVS, DAGGER, DAPPLE, Data Parallel C, DC++, DCE++, DDD, DICE, DIPC, DOLIB, DOME, DOSMOS, DRL, DSM-Threads, Ease, ECO, Eiffel, Eilean, Emerald, EPL, Excalibur, Express, Falcon, Filaments, FM, FLASH, The FORCE, Fork, Fortran-M, FX, GA, GAMMA, Glenda, GLU, GUARD, HAsL, Haskell, HPC++, JAVAR, HORUS, HPC, IMPACT, ISIS, JAVAR, JADE, Java RMI, javaPG, JavaSpace, JIDL, Joyce, Khoros, Karma, KOAN/Fortran-S, LAM, Lilac, Linda, JADA, WWWinda, ISETL-Linda, ParLin, Eilean, P4-Linda, POSYBL, Objective-Linda, LiPS, Locust, Lparx, Lucid, Maisie, Manifold, Mentat, Legion, Meta Chaos, Midway, Millipede, CparPar, Mirage, MpC, MOSIX, Modula-P, Modula-2*, Multipol, MPI, MPC++, Munin, Nano-Threads, NESL, NetClasses++, Nexus, Nimrod, NOW, Objective Linda, Occam, Omega, OpenMP, Orca, OOF90, P++, P3L, Pablo, PADE, PADRE, Panda, Papers, AFAPI, Para++, Paradigm, Parafrase2, Paralation, Parallel-C++, Parallaxis, ParC, ParLib++, ParLin, Parmacs, Parti, pC, PCN, PCP:, PH, PEACE, PCU, PET, PENNY, Phosphorus, POET, Polaris, POOMA, POOL-T, PRESTO, P-RIO, Prospero, Proteus, QPC++, PVM, PSI, PSDM, Quake, Quark, Quick Threads, Sage++, SCANDAL, SAM, pC++, SCHEDULE, SciTL, SDDA, SHMEM, SIMPLE, Sina, SISAL, distributed smalltalk, SMI, SONiC, Split-C, SR, Sthreads, Strand, SUIF, Synergy, Telegrphos, SuperPascal, TCGMSG, Threads.h++, TreadMarks, TRAPPER, uC++, UNITY, UC, V, ViC*, Visifold V-NUS, VPE, Win32 threads, WinPar, XENOOPS, XPC, Zounds, ZPL

    Third party names are the property of their owners.

    Models from the golden age of parallel programming (~1995)

  • The only thing sillier than creating too many models is using too many


    Programming models I've worked with (drawn from the same list as on the previous slide).

  • Choice overload: too many options can hurt you

    The Draeger grocery store experiment on consumer choice: two jam displays with coupons for a purchase discount, one with 24 different jams and one with 6 different jams.

    How many stopped by to try samples at the display? 60% at the 24-jam display, 40% at the 6-jam display.

    Of those who tried, how many bought jam? 3% at the 24-jam display, 30% at the 6-jam display.

    "The findings from this study show that an extensive array of options can at first seem highly appealing to consumers, yet can reduce their subsequent motivation to purchase the product."

    Iyengar, Sheena S., & Lepper, Mark (2000). When choice is demotivating: Can one desire too much of a good thing? Journal of Personality and Social Psychology, 76, 995-1006.

    Programmers don't need a glut of options: just give us something that works OK on every platform we care about. Give us a decent standard and we'll do the rest.

  • My optimistic view from 2005

    We've learned our lesson: we emphasize a small number of industry standards.

  • But we didn't learn our lesson. History is repeating itself!

    A small sampling of models from the NEW golden age of parallel programming (2010): Cilk++, CnC, Ct, MYO, RCCE, OpenCL, TBB, Chapel, Charm++, ConcRT, CUDA, Erlang, F#, Fortress, Go, Hadoop, mpc, UPC, PPL, X10, PLINQ.

    Third party names are the property of their owners.

    We've lost our way and have slipped back into the "just create a new language" mentality.

  • 27

    If language obsession is not the solution, what is?

    Consider an early (and successful) adopter of many-core technologies: the gaming industry.

    Game development gross generalizations:

    Time and money: 1-3 years for 1-20 million dollars.

    A blockbuster game has revenues that rival those of a major Hollywood movie.

    Major games take teams of 50 to 100 people.

    Only a quarter of the people are programmers.

  • 28

    Game Development

    The key: enforce a separation of concerns. A small number of expert programmers build the core engine and frameworks, while the much larger group of domain programmers (game designers, artists, and other content creators) build the actual game on top of them.

  • What's happening at Microsoft?

    Source: "An Insider's View to Concurrency at Microsoft," Steven Toub, Univ. of Utah, Sept. 2010

    A great infrastructure for building composable software frameworks!

  • 30

    Modular software, frameworks and parallel computing

    As parallelism goes mainstream and reaches extreme levels of scalability, we can't retrain all the world's programmers to handle disruptive scalable hardware.

    Modular software development techniques and frameworks will save the day. Framework: a software platform providing partial solutions to a class of problems that can be specialized to solve a specific problem.

    This is not a new idea: Cactus, the Common Component Architecture, SIERRA, and many others. We just need to push it further and deeper.

    We need a systematic way to build a useful collection of frameworks for scalable computing.

    3rd party names are the property of their owners.

  • 31

    Systematic framework design with patterns

    A pattern is a well-known solution to a recurring problem; it captures expert knowledge in a form that can be peer-reviewed, refined, and passed on to others.

    An architecture defines the organization and structure of a software system; it can be defined by a hierarchical composition of patterns.

    A framework is a software environment based on an architecture: a partial solution for a domain of problems that can be specialized to solve specific problems.

    Patterns -> Architectures -> Frameworks

  • Agenda: Our hardware future

    Rising to the many core software challenge

    PLPP the original parallel pattern language

    OPL patterns for engineering parallel software

    Example

    OpenCL: A common HW abstraction layer

  • 33

    A Pattern Language for Parallel Programming (PLPP)

    A pattern language for parallel algorithm design, with examples in MPI, OpenMP and Java.

    This is our hypothesis for how programmers think about parallel programming.

    I urge you all to tear this apart, correct the errors, add patterns where they are missing, and help make this better.

  • 34

    PLPP's structure: four design spaces in parallel software development

    [Figure: the original problem is decomposed into tasks plus shared and local data (Decomposition), which are then mapped through the implementation and building-block design spaces onto corresponding source code, sketched on the slide as a generic SPMD program (Program SPMD_Emb_Par, using get_num_procs() and get_proc_id() to divide the work).]

  • 35

    Decomposition (Finding Concurrency)

    Start with a specification that solves the original problem; finish with the problem decomposed into tasks, shared data, and a partial ordering.

    [Flow: Start -> Decomposition Analysis (Task Decomposition, Data Decomposition) -> Dependency Analysis (Group Tasks, Order Groups, Data Sharing) -> Design Evaluation.]

  • 36

    Algorithm Strategy (Algorithm Structure)

    [Decision tree, from Start: Organize by Tasks (Linear? -> Task Parallelism; Recursive? -> Divide and Conquer); Organize by Data (Linear? -> Geometric Decomposition; Recursive? -> Recursive Data); Organize by Flow of Data (Regular? -> Pipeline; Irregular? -> Event-Based Coordination). Key: design pattern vs. decision point.]

  • 37

    Implementation Strategy (Supporting Structures)

    High-level constructs impacting the large-scale organization of the source code.

    Program structure: SPMD, Master/Worker, Loop Parallelism, Fork/Join.

    Data structures: Shared Data, Shared Queue, Distributed Array.

  • 38

    Parallel programming building blocks

    Low-level constructs implementing specific mechanisms used in parallel computing, with examples in Java, OpenMP and MPI. These are not properly design patterns, but they are included to make the pattern language self-contained.

    UE* management: thread control, process control. (*UE = unit of execution.)

    Synchronization: memory sync/fences, barriers, mutual exclusion.

    Communications: message passing, collective communication, other communication.

  • 39

    A simple example: the PI program (numerical integration)

    Mathematically, we know that the integral from 0 to 1 of 4.0/(1+x^2) dx = pi.

    We can approximate the integral as a sum of rectangles: sum over i = 0..N of F(x_i) * delta_x, where each rectangle has width delta_x and height F(x_i) at the middle of interval i.

    [Figure: the curve F(x) = 4.0/(1+x^2) on [0, 1], with y from 0.0 to 4.0, approximated by rectangles.]

  • 40

    PI Program: the sequential program

    static long num_steps = 100000;
    double step;
    void main ()
    {   int i; double x, pi, sum = 0.0;

        step = 1.0/(double) num_steps;
        for (i=1; i<= num_steps; i++){      /* loop body reconstructed from the OpenMP   */
            x = (i-0.5)*step;               /* version on the next slide (midpoint rule) */
            sum = sum + 4.0/(1.0+x*x);
        }
        pi = step * sum;
    }

  • 41

    OpenMP PI Program: loop-level parallelism pattern

    #include <omp.h>
    static long num_steps = 100000; double step;
    #define NUM_THREADS 2
    void main ()
    {   int i; double x, pi, sum = 0.0;

        step = 1.0/(double) num_steps;
        omp_set_num_threads(NUM_THREADS);
        #pragma omp parallel for private(x) reduction(+:sum)
        for (i=0; i< num_steps; i++){
            x = (i+0.5)*step;
            sum += 4.0/(1.0+x*x);
        }
        pi = sum * step;
    }

    Loop-level parallelism: parallelism expressed solely by (1) exposing concurrency, (2) managing dependencies, and (3) splitting up loops.

  • 42

    MPI PI program: SPMD pattern

    SPMD programs: each thread runs the same code, with the thread ID selecting thread-specific behavior.

    #include <mpi.h>
    static long num_steps = 100000;
    void main (int argc, char *argv[])
    {
        int i, id, numprocs;
        double x, pi, step, sum = 0.0, step1, stepN;

        step = 1.0/(double) num_steps;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &id);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        step1 = id * num_steps/numprocs;
        stepN = (id+1) * num_steps/numprocs;
        if (stepN > num_steps) stepN = num_steps;
        for (i = step1; i < stepN; i++){               /* body as in the serial version   */
            x = (i+0.5)*step;
            sum += 4.0/(1.0+x*x);
        }
        sum *= step;
        /* remainder reconstructed (cut off in the source): combine partial sums on rank 0 */
        MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        MPI_Finalize();
    }

  • 43

    PI program with Win32 threads: SPMD with a shared sum protected by mutual exclusion

    #include <windows.h>
    #define NUM_THREADS 2
    HANDLE thread_handles[NUM_THREADS];
    CRITICAL_SECTION hUpdateMutex;
    static long num_steps = 100000;
    double step;
    double global_sum = 0.0;

    void Pi (void *arg)
    {
        int i, start;
        double x, sum = 0.0;

        start = *(int *) arg;
        step = 1.0/(double) num_steps;
        for (i = start; i <= num_steps; i = i + NUM_THREADS){  /* loop reconstructed (cut  */
            x = (i-0.5)*step;                                  /* off in the source): each */
            sum += 4.0/(1.0+x*x);                              /* thread takes a strided   */
        }                                                      /* share of the iterations  */
        EnterCriticalSection(&hUpdateMutex);                   /* protect the shared sum   */
        global_sum += sum;
        LeaveCriticalSection(&hUpdateMutex);
    }

  • 44   (c) 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen

    PI Program: Cilk (divide and conquer implemented with fork-join)

    static long num_steps = 1073741824;   // I'm lazy: make it a power of 2
    double step = 1.0/(double) num_steps;

    cilk double pi_comp(int istep, int nstep){
        double x, sum = 0.0;
        if (nstep < MIN_SIZE)
            for (int i = istep; i < istep + nstep; i++){   // small ranges: sum the
                x = (i+0.5)*step;                          // rectangles serially
                sum += 4.0/(1.0+x*x);
            }
        else {
            /* remainder reconstructed (cut off in the source): split the range,
             * spawn the two halves, sync, then combine. */
            double sum1, sum2;
            sum1 = spawn pi_comp(istep,             nstep/2);
            sum2 = spawn pi_comp(istep + nstep/2,   nstep/2);
            sync;
            sum = sum1 + sum2;
        }
        return sum;
    }

  • PLPP limitations

    We tried to make PLPP for general-purpose programming, but it reflects the scientific computing background of its authors: it focuses on PDEs, solvers, N-body, and other scientific problems.

    It assumes a monolithic SW architecture (i.e. no idea of architecture: just build an algorithm and make it fast).

    To really solve the parallel programming problem, we need to address the full breadth of the software engineering problem.

    PLPP is a pattern language for implementing parallel algorithms. We need a pattern language for engineering parallel applications.

  • Agenda: Our hardware future

    Rising to the many core software challenge

    PLPP the original parallel pattern language

    OPL patterns for engineering parallel software

    Example

    OpenCL: A common HW abstraction layer

  • 47

    Working with Prof. Kurt Keutzer and his group at UC Berkeley, we've come up with a new and more expansive pattern language, building on PLPP (the Pattern Language of Parallel Programming) and the Berkeley "13 dwarfs".

  • 48

    OPL/PLPP 2010: the pattern language at a glance

    Applications

    Structural Patterns: Pipe-and-Filter, Agent-and-Repository, Process-Control, Event-Based/Implicit-Invocation, Puppeteer, Model-View-Controller, Iterative-Refinement, Map-Reduce, Layered-Systems, Arbitrary-Static-Task-Graph

    Computational Patterns: Graph-Algorithms, Dynamic-Programming, Dense-Linear-Algebra, Sparse-Linear-Algebra, Unstructured-Grids, Structured-Grids, Graphical-Models, Finite-State-Machines, Backtrack-Branch-and-Bound, N-Body-Methods, Circuits, Spectral-Methods, Monte-Carlo

    Concurrent Algorithm Strategy Patterns: Task-Parallelism, Data-Parallelism, Divide-and-Conquer, Pipeline, Discrete-Event, Geometric-Decomposition, Speculation

    Implementation Strategy Patterns: program structure (SPMD, Data-Par/index-space, Fork/Join, Actors, Loop-Par., Task-Queue) and data structure (Shared-Queue, Shared-Map, Shared-Data, Distributed-Array, Partitioned Graph)

    Parallel Execution Patterns: MIMD, SIMD, Thread-Pool, Task-Graph, Transactions, Message-Passing, Collective-Comm., Transactional memory, Point-To-Point-Sync. (mutual exclusion), collective sync. (barrier), Memory sync/fence

    Concurrency foundation constructs (not expressed as patterns), e.g. thread creation/destruction and process creation/destruction

  • 49

    [The same OPL/PLPP 2010 pattern diagram as the previous slide, annotated with its sources: the computational patterns come from the Berkeley View "13 dwarfs" and the structural patterns from Garlan and Shaw's architectural styles.]

    http://parlab.eecs.berkeley.edu/wiki/patterns/patterns


  • 51

    Identify the SW structure: Structural Patterns

    Pipe-and-Filter, Agent-and-Repository, Event-based coordination, Iterative refinement, MapReduce, Process Control, Layered Systems

    These define the structure of our software, but they do not describe what is computed.

  • 52

    Elements of a structural pattern

    Components are where the computation happens.

    Connectors are where the communication happens.

    A configuration is a graph of components (vertices) and connectors (edges).

    A structural pattern may be described as a family of graphs.

  • 53

    Pattern 1: Pipe and Filter

    [Figure: a graph of filters (Filter 1 through Filter 7) connected by pipes.]

    Filters embody computation; they only see inputs and produce outputs.

    Pipes embody communication.

    The structure may have feedback. (A minimal code sketch of this pattern follows below.)
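
    A minimal serial sketch of the pipe-and-filter idea in C (my own illustration, not from the deck): each hypothetical filter_* function sees only its input and produces an output, and the "pipe" is simply the hand-off from one filter to the next.

    #include <stdio.h>

    /* Each filter sees only its input value and produces an output value. */
    static double filter_scale(double x)  { return 2.0 * x; }
    static double filter_offset(double x) { return x + 1.0; }
    static double filter_square(double x) { return x * x; }

    int main(void)
    {
        double data[] = {1.0, 2.0, 3.0};
        for (int i = 0; i < 3; i++) {
            /* The "pipe" is the hand-off of each output to the next filter. */
            double out = filter_square(filter_offset(filter_scale(data[i])));
            printf("%f\n", out);
        }
        return 0;
    }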

  • 54

    Examples of pipe and filter: almost every large software program has a pipe-and-filter structure at the highest level.

    Examples: a logic optimizer, an image retrieval system, a compiler.

  • 55

    Pattern 2: Iterative Refinement Pattern

    [Figure: start from an initialization condition; a variety of functions are performed asynchronously in each iteration; the results of the iteration are synchronized; if the exit condition is not met, iterate again, otherwise stop. A skeleton of this loop follows below.]
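
    A minimal sketch of the iterative-refinement skeleton in C (my own illustration; the Newton iteration for sqrt(2) merely stands in for the "variety of functions performed asynchronously").

    #include <math.h>
    #include <stdio.h>

    /* Iterative-refinement skeleton: start from an initial condition, do the
     * per-iteration work, synchronize/evaluate the result, and repeat until
     * the exit condition is met. */
    int main(void)
    {
        double x = 1.0;                      /* initialization condition        */
        double err;
        do {
            x = 0.5 * (x + 2.0 / x);         /* work performed each iteration   */
            err = fabs(x * x - 2.0);         /* synchronize results of iteration */
        } while (err > 1e-12);               /* exit condition met?             */
        printf("sqrt(2) ~= %.12f\n", x);
        return 0;
    }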

  • 56

    Example of the Iterative Refinement pattern: training a classifier (SVM training)

    [Figure: each iteration updates the decision surface and identifies outliers; iterate until all points are within acceptable error.]

  • 57

    Pattern 3: MapReduce. To us, it means:

    A map stage, where data is mapped onto independent computations.

    A reduce stage, where the results of the map stage are summarized (i.e. reduced).

    (A minimal serial sketch of the two stages follows below.)
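
    A minimal serial sketch of the two stages in C (my own illustration, not from the deck); in a real map-reduce the map iterations would run in parallel and the reduction would be a parallel combine.

    #include <stdio.h>

    #define N 8

    /* Map stage: an independent computation per element. */
    static double map_fn(double x) { return x * x; }

    int main(void)
    {
        double in[N] = {1, 2, 3, 4, 5, 6, 7, 8}, mapped[N], sum = 0.0;
        for (int i = 0; i < N; i++)          /* map: every iteration independent */
            mapped[i] = map_fn(in[i]);
        for (int i = 0; i < N; i++)          /* reduce: summarize the results    */
            sum += mapped[i];
        printf("sum of squares = %f\n", sum);
        return 0;
    }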

  • 58

    Examples of MapReduce

    General structure: map a computation across distributed data sets, then reduce the results to find the best/worst or maxima/minima.

    Speech recognition: map the HMM computation to evaluate word matches; reduce to find the most-likely word sequences.

    Support-vector machines (ML): map to evaluate distance from the frontier; reduce to find the greatest outlier from the frontier.


  • 60

    Identify the key computations: Computational Patterns

    Computational patterns describe the key computations, but not how they are implemented.

  • 61

    Computational pattern example: the N-body problem. Consider a collection of particles that interact through a pair-wise force.

    N-body problems show up in a wide range of applications:

    Astrophysics and Celestial Mechanics

    Plasma Simulation

    Molecular Dynamics

    Electron-Beam Lithography Device Simulation

    Fluid Dynamics (vortex method)

    Game physics (cloth simulation, blood spatter, etc.)

    Graph partitioning

    Some Elliptic PDE solvers

  • 62

    Computational pattern example: Spectral Methods

    A class of techniques to solve certain partial differential equations, such as Poisson's equation:

        (d^2/dx^2 + d^2/dy^2) f(x, y) = g(x, y)

    The idea is to:

    Apply a discrete transform to the PDE, turning differential operators into algebraic operators.

    Solve the resulting system of algebraic or ordinary differential equations.

    Inverse-transform the solution to return to the original domain.

  • Computational pattern example: Graphical models

    The problem is cast as a network of random variables, where edges represent (potential) dependences.

    Generally, we have observed variables (shaded) and hidden variables; we need to reason about the hidden variables given our set of observations.

    The graph can be directed or undirected.

    [Figure: a Hidden Markov Model.]

  • 64

    Architecting Parallel Software

    Identify the software structure (structural patterns): Pipe-and-Filter, Agent-and-Repository, Event-based, Bulk Synchronous, MapReduce, Layered Systems, Arbitrary Task Graphs.

    Identify the key computations (computational patterns): Graph Algorithms, Dynamic Programming, Dense/Sparse Linear Algebra, (Un)Structured Grids, Graphical Models, Finite State Machines, Backtrack Branch-and-Bound, N-Body Methods, Circuits, Spectral Methods.

    Decompose tasks/data, order tasks, identify data sharing and access.

    An architecture is a composition of design patterns.

  • Agenda: Our hardware future

    Rising to the many core software challenge

    PLPP the original parallel pattern language

    OPL patterns for engineering parallel software

    Example

    OpenCL: A common HW abstraction layer

  • Patterns/framework example: LVCSR software architecture (LVCSR = large-vocabulary continuous speech recognition)

    [Architecture diagram, annotated with patterns: Pipe-and-Filter for the overall flow from voice input, through the speech feature extractor, to the inference engine; the recognition network combines the acoustic model, pronunciation model and language model; the inference engine runs beam-search iterations (Iterative Refinement) over the active-state computation steps, which are annotated with Graphical Model, Dynamic Programming, Task Graph and MapReduce; the output is a word sequence such as "I think therefore I am".]

  • 67

    A Speech Framework with Extension Points: a WFST-based inference engine framework

    [Diagram of the fixed framework structure: Phase 0 reads files and initializes data structures; each iteration computes observation probabilities (Phase 1) and, for each active arc, the arc transition probability (Phase 2), copies results back to the CPU, collects backtrack info, prepares the active set, and runs iteration control; finally backtrack and output of results. Work is split between the CPU and the GPU.]

    Extension points: file input format, pruning strategy, observation probability computation, backtrack, result output format, static data structure optimizations, dynamic data structure optimizations.

  • 68

    A Speech Framework with Extension Points: a WFST-based inference engine framework

    [Same diagram as the previous slide, with concrete plug-ins listed for the extension points: file input format (HTK format, SRI format); pruning strategy (fixed beam width, adaptive beam width); observation probability computation (HMM SRI GPU ObsProb, HMM WSJ GPU ObsProb, CHMM SRI GPU ObsProb); backtrack (backtrack, backtrack + confidence metric); result output format (HResult format, SRI scoring format); static data structure optimizations (two-level WFST network, one-level WFST network); dynamic data structure optimizations (preload all, selective preload).]

  • The Programmer's View: patterns/framework example

    Jike Chong, Ekaterina Gonina, Kisun You, Kurt Keutzer, "Exploring Recognition Network Representations for Efficient Speech Inference on Highly Parallel Platforms," submitted to Interspeech 2010.

    Dorothea Kolossa, Jike Chong, Steffen Zeiler, Kurt Keutzer, "Efficient Manycore CHMM Speech Recognition for Audiovisual and Multistream Data," submitted to Interspeech 2010.

  • 70

    Are frameworks enough? We also need:

    Efficiency, to support the extremely low overheads required by Amdahl's law: collapse layers of abstraction; dynamic optimization.

    A stable, richly supported, and highly portable programming environment, to make framework development practical.

    A hardware abstraction layer for heterogeneous platforms: it creates the economic justification for the effort, since one framework supports many platforms.

    Analyze key application domains and discover common paths through our pattern language; these paths define the framework design; then implement the framework.

  • Agenda: Our hardware future

    Rising to the many core software challenge

    OpenCL: A common HW abstraction layer

  • How do we program the heterogeneous platform? Let history be our guide: consider the origins of OpenMP.

    [Diagram, 1997: SGI and Cray merged and needed commonality across products; KAI (an ISV) needed a larger market; together they wrote a rough-draft straw-man SMP API. ASCI was tired of recoding for SMPs and forced the vendors to standardize. DEC, IBM, Intel, HP and other vendors were invited to join.]

    Third party names are the property of their owners.

  • OpenCL: can history repeat itself?

    [Diagram, Dec 2008: AMD and ATI merged and needed commonality across products. Nvidia (a GPU vendor) wants to steal market share from the CPU; Intel (a CPU vendor) wants to steal market share from the GPU. Together they wrote a rough-draft straw-man API. Apple was tired of recoding for many-core chips and GPUs and pushed the vendors to standardize; as ASCI did for OpenMP, Apple is doing for GPU/CPU with OpenCL. The Khronos Compute group was formed, joined by Ericsson, Sony, Blizzard, Nokia, Freescale, TI, IBM and many more.]

    Third party names are the property of their owners.

  • OpenCL Working Group: diverse industry participation

    HW vendors (e.g. Apple), system OEMs, middleware vendors, application developers.

    OpenCL became an important standard on release by virtue of the market coverage of the companies behind it.

    Third party names are the property of their owners.

  • The BIG idea behind OpenCL

    The OpenCL execution model: execute a kernel at each point in a problem domain. E.g., process a 1024 x 1024 image with one kernel invocation per pixel, i.e. 1024 x 1024 = 1,048,576 kernel executions.

    A traditional loop-based element-wise multiply:

    void trad_mul(int n,
                  const float *a,
                  const float *b,
                  float *c)
    {
        int i;
        for (i=0; i<n; i++)
            c[i] = a[i] * b[i];   /* one loop iteration per element */
    }
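
    For contrast with the loop above, the data-parallel OpenCL form of the same element-wise multiply replaces the loop with one kernel instance per element; this is the dp_mul kernel that also appears on the OpenCL summary slide near the end of the deck.

    // Data-parallel: execute over "n" work-items, one per element.
    __kernel void dp_mul(__global const float *a,
                         __global const float *b,
                         __global float *c)
    {
        int id = get_global_id(0);   // each work-item handles one index
        c[id] = a[id] * b[id];
    }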

  • An N-dimensional domain of work-items

    Define an N-dimensional index space that is best for your algorithm.

    Global dimensions: 1024 x 1024 (the whole problem space).

    Local dimensions: 128 x 128 (a work-group, which executes together).

    Synchronization between work-items is possible only within work-groups, using barriers and memory fences; you cannot synchronize outside of a work-group. (A small kernel sketch using these IDs follows below.)
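
    A small kernel sketch (my own, not from the slides) showing how a work-item locates itself in the 2D index space described above; with a 1024 x 1024 global size and 128 x 128 work-groups, get_global_id() gives the pixel coordinate and get_group_id()/get_local_id() give the work-group and the position inside it. The kernel name and output buffer are illustrative.

    __kernel void where_am_i(__global int *out, int width)
    {
        int gx = get_global_id(0);                  // column in the 1024 x 1024 global range
        int gy = get_global_id(1);                  // row in the global range
        int group = get_group_id(1) * get_num_groups(0)
                  + get_group_id(0);                // which 128 x 128 work-group this is
        // get_local_id(0)/get_local_id(1) would give the position inside the work-group
        out[gy * width + gx] = group;               // record which work-group touched each pixel
    }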

  • To use OpenCL, you must

    Define the platform

    Execute code on the platform

    Move data around in memory

    Write (and build) programs

  • OpenCL Platform Model

    One Host + one or more Compute Devices

    Each Compute Device is composed of one or more Compute Units

    Each Compute Unit is further divided into one or more Processing Elements
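
    A host-side sketch (my own, not from the deck) that walks the platform model from the top down: enumerate a platform, its devices, and the number of compute units each device exposes. Error checking is omitted.

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id devices[8];
        cl_uint ndev;

        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &ndev);
        for (cl_uint i = 0; i < ndev; i++) {
            cl_uint cu;
            clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS,
                            sizeof(cu), &cu, NULL);
            printf("device %u: %u compute units\n", i, cu);
        }
        return 0;
    }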

  • OpenCL Execution Model

    An OpenCL application runs on a host, which submits work to the compute devices.

    Work-item: the basic unit of work on an OpenCL device.

    Kernel: the code for a work-item; basically a C function.

    Program: a collection of kernels and other functions (analogous to a dynamic library).

    Context: the environment within which work-items execute; it includes the devices, their memories, and command queues.

    Applications queue kernel execution instances: queued in order, one queue per device, executed in-order or out-of-order. (A small sketch of setting up a context and queues follows below.)

    [Diagram: one context containing a GPU and a CPU, each with its own queue.]
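
    A sketch (mine, not from the deck) of the execution-model objects named above, using OpenCL 1.0-era calls: one context holding a GPU and a CPU device, with one command queue per device; queues are in-order unless created with the out-of-order property.

    #include <CL/cl.h>

    /* 'gpu_dev' and 'cpu_dev' would come from clGetDeviceIDs(); error
     * handling is omitted to keep the shape of the code visible. */
    static void make_queues(cl_device_id gpu_dev, cl_device_id cpu_dev)
    {
        cl_device_id devs[2] = { gpu_dev, cpu_dev };
        cl_int err;
        cl_context ctx = clCreateContext(NULL, 2, devs, NULL, NULL, &err);

        /* One queue per device; queues are in-order unless asked otherwise. */
        cl_command_queue gpu_q = clCreateCommandQueue(ctx, gpu_dev, 0, &err);
        cl_command_queue cpu_q = clCreateCommandQueue(ctx, cpu_dev,
                                     CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
        (void)gpu_q; (void)cpu_q;
    }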

  • OpenCL Memory Model

    Memory management is explicit: you must move data from host -> global -> local and back.

    Private memory: per work-item.

    Local memory: shared within a work-group.

    Global/constant memory: visible to all work-groups.

    Host memory: on the CPU.

    [Diagram: the host with its host memory; a compute device with global/constant memory; work-groups with local memory; work-items with private memory. A kernel sketch touching each level follows below.]
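
    A kernel sketch (my own illustration) that touches each level of the hierarchy above: per-work-item private variables, a __local scratch buffer shared within the work-group, and __global data visible to all work-groups. The kernel name and the normalize-by-group-maximum computation are made up for illustration; the __local buffer would be sized from the host via clSetKernelArg.

    __kernel void scale_by_group_max(__global float *data,
                                     __local  float *scratch)  // shared within one work-group
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        int lsz = get_local_size(0);

        float v = data[gid];              // private copy, per work-item
        scratch[lid] = v;                 // stage it in local memory
        barrier(CLK_LOCAL_MEM_FENCE);     // work-group-wide sync on local memory

        float m = scratch[0];             // naive max over the work-group
        for (int i = 1; i < lsz; i++)
            m = fmax(m, scratch[i]);

        data[gid] = v / m;                // write back to global memory
    }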

  • Programming kernels: the OpenCL C Language

    A subset of ISO C99

    But without some C99 features such as standard C99 headers, function pointers, recursion, variable length arrays, and bit fields

    A superset of ISO C99 with additions for:

    Work-items and workgroups

    Vector types

    Synchronization

    Address space qualifiers

    Also includes a large set of built-in functions for image manipulation, work-item manipulation, specialized math routines, etc.

  • Programming Kernels: Data Types

    Scalar data types: char, uchar, short, ushort, int, uint, long, ulong, float; bool, intptr_t, ptrdiff_t, size_t, uintptr_t, void, half (storage).

    Image types: image2d_t, image3d_t, sampler_t.

    Vector data types:

    Vector lengths 2, 4, 8, and 16 (char2, ushort4, int8, float16, double2, ...).

    Endian safe; aligned at vector length; vector operations and built-in functions.

    int4 vi0 = (int4) -7;                      // -7 -7 -7 -7
    int4 vi1 = (int4)(0, 1, 2, 3);             //  0  1  2  3
    vi0.lo = vi1.hi;                           //  2  3 -7 -7
    int8 v8 = (int8)(vi0, vi1.s01, vi1.odd);   //  2  3 -7 -7  0  1  1  3

    double is an optional type in OpenCL 1.0.

  • Building Program Objects

    The program object encapsulates: a context, the program source/binary, and the list of target devices and build options.

    The build process to create a program object: clCreateProgramWithSource() or clCreateProgramWithBinary().

    Kernel code (compiled for the GPU and for the CPU from the same source):

    __kernel void horizontal_reflect(read_only  image2d_t src,
                                     write_only image2d_t dst)
    {
        int x = get_global_id(0);   // x-coord
        int y = get_global_id(1);   // y-coord
        int width = get_image_width(src);
        float4 src_val = read_imagef(src, sampler, (int2)(width-1-x, y));
        write_imagef(dst, (int2)(x, y), src_val);
    }

  • Vector Addition - Kernel

    __kernel void vec_add (__global const float *a,

    __global const float *b,

    __global float *c)

    {

    int gid = get_global_id(0);

    c[gid] = a[gid] + b[gid];

    }

  • Vector Addition: Host Program

    // create the OpenCL context on a GPU device
    context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

    // get the list of GPU devices associated with the context
    clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
    devices = malloc(cb);
    clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

    // create a command-queue
    cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

    // allocate the buffer memory objects
    memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                sizeof(cl_float)*n, srcA, NULL);
    memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                sizeof(cl_float)*n, srcB, NULL);
    memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                sizeof(cl_float)*n, NULL, NULL);

    // create the program
    program = clCreateProgramWithSource(context, 1, &program_source, NULL, NULL);

    // build the program
    err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    // create the kernel
    kernel = clCreateKernel(program, "vec_add", NULL);

    // set the argument values
    err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &memobjs[0]);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *) &memobjs[1]);
    err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *) &memobjs[2]);

    // set work-item dimensions
    global_work_size[0] = n;

    // execute the kernel
    err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);

    // read the output array
    err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0, n*sizeof(cl_float), dst, 0, NULL, NULL);

  • Vector Addition: Host Program (annotated)

    The same host program, broken into its phases: define the platform and queues; define the memory objects; create the program; build the program; create and set up the kernel; execute the kernel; read results on the host.

    It's complicated, but most of this is boilerplate and not as bad as it looks.

  • OpenCL Synchronization: Queues & Events

    Events can be used to synchronize kernel executions between queues.

    Example: two queues with two devices (a GPU and a CPU).

    [Timeline figures: without synchronization, Kernel 2 on the CPU queue starts before the results from Kernel 1 on the GPU queue are ready; with an event, Kernel 2 waits for the event from Kernel 1 and does not start until its results are ready.]

    (A short host-code sketch of this event chaining follows below.)
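
    A short host-code sketch (mine, not from the deck) of the event chaining described above: the first enqueue returns an event, and the second enqueue lists that event in its wait list, so kernel 2 cannot start until kernel 1 has completed.

    #include <CL/cl.h>

    /* Queues, kernels and global_size are assumed to be set up as in the
     * vector-addition host program shown earlier. */
    static void chain_kernels(cl_command_queue gpu_queue, cl_command_queue cpu_queue,
                              cl_kernel kernel1, cl_kernel kernel2, size_t *global_size)
    {
        cl_event k1_done;
        clEnqueueNDRangeKernel(gpu_queue, kernel1, 1, NULL,
                               global_size, NULL, 0, NULL, &k1_done);   /* signals k1_done   */
        clEnqueueNDRangeKernel(cpu_queue, kernel2, 1, NULL,
                               global_size, NULL, 1, &k1_done, NULL);   /* waits for k1_done */
    }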

  • OpenCL summary

    [Summary diagram: a Context contains the compute devices (CPU and GPU), Programs (with CPU and GPU program binaries), Kernels (e.g. dp_mul with its argument values), Memory Objects (images and buffers), and Command Queues (in-order and out-of-order).]

    __kernel void dp_mul(global const float *a,
                         global const float *b,
                         global float *c)
    {
        int id = get_global_id(0);
        c[id] = a[id] * b[id];
    }

    Third party names are the property of their owners.

  • 89

    Conclusions

    Many-core processors mean all software must be parallel.

    We can't turn everyone into parallel algorithm experts; we have to support a separation of concerns:

    Hardcore experts build programming frameworks, emphasizing efficiency, using industry-standard languages.

    Domain-expert programmers assemble applications within a framework, emphasizing productivity.

    Design patterns are a tool to help us get this right.

    "We believe software architectures can be built up from a manageable number of design patterns. These patterns define the building blocks of all software engineering and are fundamental to the practice of architecting parallel software. Hence, an effort to propose, argue about, and finally agree on what constitutes this set of patterns is the seminal intellectual challenge of our field."

    K. Keutzer and T. Mattson, "A design pattern language for engineering (parallel) software," Intel Technology Journal, vol. 13, no. 4, 2010.

  • 90

    Patterns acknowledgements

    Kurt Keutzer (UCB), Ralph Johnson (UIUC) and our community of patterns writers: Hugo Andrade, Chris Batten, Eric Battenberg, Hovig Bayandorian, Dai Bui, Bryan Catanzaro, Jike Chong, Enylton Coelho, Katya Gonina, Yunsup Lee, Mark Murphy, Heidi Pan, Kaushik Ravindran, Sayak Ray, Erich Strohmaier, Bor-yiing Su, Narayanan Sundaram, Guogiang Wang, Youngmin Yi, Jeff Anderson-Lee, Joel Jones, Terry Ligocki, and Sam Williams.

    The development of our pattern language has also received a boost from Par Lab faculty, particularly Krste Asanovic, Jim Demmel, and David Patterson.

    My co-authors, colleagues and friends who helped write the PLPP patterns book:

    Beverly Sanders (University of Florida)

    Berna Massingill (Trinity University)