
  • A “Hands-on” Introduction to OpenMP*

    Tim Mattson, Intel Corporation — [email protected]
    Michael Wrinn, Intel Corporation — [email protected]
    Mark Bull, EPCC, The University of Edinburgh — [email protected]

    * The name “OpenMP” is the property of the OpenMP Architecture Review Board.

  • Preliminaries: Part 1

    Disclosures
    – The views expressed in this tutorial are those of the people delivering the tutorial.
    – We are not speaking for our employers.
    – We are not speaking for the OpenMP ARB.

    We take these tutorials VERY seriously: Help us improve … tell us how you would make this tutorial better.

  • Preliminaries: Part 2 – Our plan for the day … Active learning!

    We will mix short lectures with short exercises.
    You will use your laptop for the exercises … that way you’ll have an OpenMP environment to take home so you can keep learning on your own.

    Please follow these simple rules:
    – Do the exercises we assign and then change things around and experiment.
    – Embrace active learning!
    – Don’t cheat: Do NOT look at the solutions before you complete an exercise … even if you get really frustrated.

  • Our Plan for the Day

    Topic                        Exercise                                      Concepts
    I.    OMP Intro              Install sw, hello_world                      Parallel regions
    II.   Creating threads       Pi_spmd_simple                               Parallel, default data environment, runtime library calls
    III.  Synchronization        Pi_spmd_final                                False sharing, critical, atomic
    (Break)
    IV.   Parallel loops         Pi_loop, Matmul                              For, schedule, reduction
    V.    Odds and ends          No exercise                                  Single, sections, master, runtime libraries, environment variables, synchronization, etc.
    VI.   Data environment       Molecular Dyn.                               Data environment details, software optimization
    (Lunch)
    VII.  OpenMP 3 and tasks     Linked list (tasks), Linked list (no tasks)  Tasks and other OpenMP 3 features
    VIII. Memory model           No exercise                                  The need for flush, but its actual use is left to an appendix
    (Break)
    IX.   Threadprivate          Pi_mc                                        Modular software

  • Outline

    Introduction to OpenMP
    Creating Threads
    Synchronization
    Parallel Loops
    Synchronize single masters and stuff
    Data environment
    OpenMP 3.0 and Tasks
    Memory model
    Threadprivate Data

  • OpenMP* Overview

    OpenMP: An API for Writing Multithreaded Applications

    – A set of compiler directives and library routines for parallel application programmers
    – Greatly simplifies writing multi-threaded (MT) programs in Fortran, C and C++
    – Standardizes the last 20 years of SMP practice

    A sampler of the directives and calls shown on the slide background:
      #pragma omp critical            #pragma omp parallel for private(A, B)
      C$OMP PARALLEL DO ORDERED PRIVATE (A, B, C)
      C$OMP PARALLEL REDUCTION (+: A, B)      C$OMP parallel do shared(a, b, c)
      CALL OMP_SET_NUM_THREADS(10)    call OMP_INIT_LOCK (ilok)    call omp_test_lock(jlok)
      omp_set_lock(lck)               Nthrds = OMP_GET_NUM_PROCS()
      setenv OMP_SCHEDULE “dynamic”   C$OMP SINGLE PRIVATE(X)      C$OMP THREADPRIVATE(/ABC/)
      C$OMP MASTER   C$OMP ATOMIC   C$OMP FLUSH   C$OMP ORDERED   C$OMP SECTIONS
      C$OMP DO lastprivate(XX)        C$OMP PARALLEL COPYIN(/blk/)   !$OMP BARRIER

    * The name “OpenMP” is the property of the OpenMP Architecture Review Board.

  • OpenMP Basic Defs: Solution Stack

    (Layered figure, top to bottom:)
      End User
      Application
      Directives, Compiler | OpenMP library | Environment variables
      OpenMP Runtime library
      OS/system support for shared memory and threading
      Proc1  Proc2  Proc3  …  ProcN
      Shared Address Space

  • OpenMP core syntax

    Most of the constructs in OpenMP are compiler directives:
      #pragma omp construct [clause [clause]…]
    Example:
      #pragma omp parallel num_threads(4)

    Function prototypes and types are in the file:
      #include <omp.h>

    Most OpenMP* constructs apply to a “structured block”.
    – Structured block: a block of one or more statements with one point of entry at the top and one point of exit at the bottom.
    – It’s OK to have an exit() within the structured block.
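    For concreteness, here is a minimal sketch (not from the original slides) in which the structured block following the parallel construct is a compound statement; the printf is just a stand-in for real work:

      #include <omp.h>
      #include <stdio.h>
      int main()
      {
         #pragma omp parallel num_threads(4)
         {                                    /* one point of entry at the top    */
            printf("hello from a thread\n");
         }                                    /* one point of exit at the bottom  */
         return 0;
      }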

  • Exercise 1, Part A: Hello world

    Verify that your environment works.
    Write a program that prints “hello world”.

      int main()
      {
         int ID = 0;
         printf(" hello(%d) ", ID);
         printf(" world(%d) \n", ID);
      }

  • Exercise 1, Part B: Hello world

    Verify that your OpenMP environment works.
    Write a multithreaded program that prints “hello world”.

      #include "omp.h"
      void main()
      {
         #pragma omp parallel
         {
            int ID = 0;
            printf(" hello(%d) ", ID);
            printf(" world(%d) \n", ID);
         }
      }

    Switches for compiling and linking:
      gcc -fopenmp        gcc
      pgcc -mp            pgi
      icl /Qopenmp        intel (windows)
      icc -openmp         intel (linux)

  • Exercise 1: Solution
    A multi-threaded “Hello world” program

    Write a multithreaded program where each thread prints “hello world”.

      #include "omp.h"                         // OpenMP include file
      void main()
      {
         #pragma omp parallel                  // Parallel region with default number of threads
         {
            int ID = omp_get_thread_num();     // Runtime library function to return a thread ID
            printf(" hello(%d) ", ID);
            printf(" world(%d) \n", ID);
         }                                     // End of the parallel region
      }

    Sample Output:
      hello(1) hello(0) world(1)
      world(0)
      hello(3) hello(2) world(3)
      world(2)

  • OpenMP Overview: How do threads interact?

    OpenMP is a multi-threading, shared address model.
    – Threads communicate by sharing variables.

    Unintended sharing of data causes race conditions:
    – race condition: when the program’s outcome changes as the threads are scheduled differently.

    To control race conditions:
    – Use synchronization to protect data conflicts.

    Synchronization is expensive, so:
    – Change how data is accessed to minimize the need for synchronization.
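    A minimal sketch (not from the slides) of the race described above: several threads update one shared counter, and the unprotected increment is a data race; the critical construct (covered in the Synchronization section) serializes the update:

      #include <omp.h>
      #include <stdio.h>
      int main()
      {
         int count = 0;                        /* shared by all threads                */
         #pragma omp parallel
         {
            for (int i = 0; i < 10000; i++) {
               #pragma omp critical            /* remove this line to observe the race */
               count++;
            }
         }
         printf("count = %d\n", count);        /* with critical: 10000 * num threads   */
         return 0;
      }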

  • Outline

    Introduction to OpenMP
    Creating Threads
    Synchronization
    Parallel Loops
    Synchronize single masters and stuff
    Data environment
    OpenMP 3.0 and Tasks
    Memory model
    Threadprivate Data

  • OpenMP Programming Model: Fork-Join Parallelism

    – The master thread spawns a team of threads as needed.
    – Parallelism is added incrementally until performance goals are met: i.e. the sequential program evolves into a parallel program.

    (Figure: a master thread executes the sequential parts; parallel regions fork teams of threads, shown in red, including a nested parallel region.)

  • Thread Creation: Parallel Regions

    You create threads in OpenMP* with the parallel construct.
    For example, to create a 4-thread parallel region:

      double A[1000];
      omp_set_num_threads(4);              // Runtime function to request a certain number of threads
      #pragma omp parallel
      {
         int ID = omp_get_thread_num();    // Runtime function returning a thread ID
         pooh(ID, A);
      }

    Each thread executes a copy of the code within the structured block.
    Each thread calls pooh(ID,A) for ID = 0 to 3.

    * The name “OpenMP” is the property of the OpenMP Architecture Review Board

  • Thread Creation: Parallel Regions

    You create threads in OpenMP* with the parallel construct.
    For example, to create a 4-thread parallel region:

      double A[1000];
      #pragma omp parallel num_threads(4)  // clause to request a certain number of threads
      {
         int ID = omp_get_thread_num();    // Runtime function returning a thread ID
         pooh(ID, A);
      }

    Each thread executes a copy of the code within the structured block.
    Each thread calls pooh(ID,A) for ID = 0 to 3.

    * The name “OpenMP” is the property of the OpenMP Architecture Review Board

  • Thread Creation: Parallel Regions example

      double A[1000];
      omp_set_num_threads(4);
      #pragma omp parallel
      {
         int ID = omp_get_thread_num();
         pooh(ID, A);
      }
      printf("all done\n");

    – Each thread executes the same code redundantly: pooh(0,A), pooh(1,A), pooh(2,A), pooh(3,A).
    – A single copy of A is shared between all threads.
    – Threads wait at the end of the parallel region for all threads to finish before proceeding (i.e. a barrier).

    * The name “OpenMP” is the property of the OpenMP Architecture Review Board

  • Exercises 2 to 4: Numerical Integration

    Mathematically, we know that:

      ∫ from 0 to 1 of 4.0/(1+x²) dx = π

    We can approximate the integral as a sum of rectangles:

      Σ (i = 0 to N) F(xᵢ)·Δx ≈ π

    where each rectangle has width Δx and height F(xᵢ) at the middle of interval i.

    (Figure: the curve F(x) = 4.0/(1+x²) on [0, 1], approximated by rectangles.)

  • Exercises 2 to 4: Serial PI Program

      static long num_steps = 100000;
      double step;
      void main ()
      {
         int i;
         double x, pi, sum = 0.0;
         step = 1.0/(double) num_steps;
         for (i=0; i< num_steps; i++){
            x = (i+0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
         }
         pi = step * sum;
      }

  • Exercise 2

    Create a parallel version of the pi program using a parallel construct.
    Pay close attention to shared versus private variables.
    In addition to a parallel construct, you will need the runtime library routines:
      int omp_get_num_threads();    // Number of threads in the team
      int omp_get_thread_num();     // Thread ID or rank
      double omp_get_wtime();       // Time in seconds since a fixed point in the past

  • Outline

    Introduction to OpenMP
    Creating Threads
    Synchronization
    Parallel Loops
    Synchronize single masters and stuff
    Data environment
    OpenMP 3.0 and Tasks
    Memory model
    Threadprivate Data

  • Synchronization

    Synchronization is used to impose order constraints and to protect access to shared data.

    High level synchronization:
    – critical
    – atomic
    – barrier
    – ordered

    Low level synchronization (discussed later):
    – flush
    – locks (both simple and nested)

  • Synchronization: critical

    Mutual exclusion: Only one thread at a time can enter a critical region.

      float res;
      #pragma omp parallel
      {
         float B; int i, id, nthrds;
         id = omp_get_thread_num();
         nthrds = omp_get_num_threads();
         for(i=id; i<niters; i+=nthrds){
            B = big_job(i);
            #pragma omp critical
               consume (B, res);        // Threads wait their turn – only one at a time calls consume()
         }
      }

  • Synchronization: Atomic

    Atomic provides mutual exclusion but only applies to the update of a memory location (the update of X in the following example).

      #pragma omp parallel
      {
         double tmp, B;
         B = DOIT();
         tmp = big_ugly(B);
         #pragma omp atomic
            X += tmp;                   // Atomic only protects the read/update of X
      }

  • Exercise 3

    In exercise 2, you probably used an array to create space for each thread to store its partial sum.
    If array elements happen to share a cache line, this leads to false sharing.
    – Non-shared data in the same cache line means each update invalidates the cache line … in essence “sloshing independent data” back and forth between threads.
    Modify your “pi program” from exercise 2 to avoid false sharing due to the sum array.

  • Outline

    Introduction to OpenMP
    Creating Threads
    Synchronization
    Parallel Loops
    Synchronize single masters and stuff
    Data environment
    OpenMP 3.0 and Tasks
    Memory model
    Threadprivate Data

  • SPMD vs. worksharing

    A parallel construct by itself creates an SPMD or “Single Program Multiple Data” program … i.e., each thread redundantly executes the same code.
    How do you split up pathways through the code between threads within a team? This is called worksharing:
    – Loop construct
    – Sections/section constructs      (discussed later)
    – Single construct                 (discussed later)
    – Task construct …. available in OpenMP 3.0

  • The loop worksharing constructs

    The loop worksharing construct splits up loop iterations among the threads in a team.

      #pragma omp parallel
      {
         #pragma omp for            // Loop construct
         for (I=0; I<N; I++){
            NEAT_STUFF(I);
         }
      }

    The loop control index I is made “private” to each thread by default.
    Threads wait at the end of the omp for until all threads are finished (there is an implied barrier).

  • Loop worksharing constructs: A motivating example

    Sequential code:
      for(i=0; i<N; i++) { a[i] = a[i] + b[i]; }

    OpenMP parallel region (SPMD style):
      #pragma omp parallel
      {
         int id, i, Nthrds, istart, iend;
         id = omp_get_thread_num();
         Nthrds = omp_get_num_threads();
         istart = id * N / Nthrds;
         iend = (id+1) * N / Nthrds;
         if (id == Nthrds-1) iend = N;
         for(i=istart; i<iend; i++) { a[i] = a[i] + b[i]; }
      }

    OpenMP parallel region and a worksharing for construct:
      #pragma omp parallel
      #pragma omp for
         for(i=0; i<N; i++) { a[i] = a[i] + b[i]; }

  • Loop worksharing constructs: The schedule clause

    The schedule clause affects how loop iterations are mapped onto threads.

    schedule(static [,chunk])
    – Deal out blocks of iterations of size “chunk” to each thread.

    schedule(dynamic[,chunk])
    – Each thread grabs “chunk” iterations off a queue until all iterations have been handled.

    schedule(guided[,chunk])
    – Threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down to size “chunk” as the calculation proceeds.

    schedule(runtime)
    – Schedule and chunk size taken from the OMP_SCHEDULE environment variable (or the runtime library … for OpenMP 3.0).
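    A small sketch (not part of the original slides) showing how the clause is spelled in practice; do_work() and N are assumed placeholders:

      #include <omp.h>
      void do_work(int i);                         /* hypothetical per-iteration work        */

      void run(int N)
      {
         int i;
         /* hand out fixed blocks of 8 iterations per thread, round robin                    */
         #pragma omp parallel for schedule(static, 8)
         for (i = 0; i < N; i++) do_work(i);

         /* threads grab 4 iterations at a time from a shared queue                          */
         #pragma omp parallel for schedule(dynamic, 4)
         for (i = 0; i < N; i++) do_work(i);

         /* schedule chosen at run time, e.g. via OMP_SCHEDULE="guided,4"                    */
         #pragma omp parallel for schedule(runtime)
         for (i = 0; i < N; i++) do_work(i);
      }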

  • Loop worksharing constructs: The schedule clause

    Schedule Clause   When To Use                                   Overhead
    STATIC            Pre-determined and predictable by the         Least work at runtime: scheduling done at compile-time
                      programmer
    DYNAMIC           Unpredictable, highly variable work           Most work at runtime: complex scheduling logic used at run-time
                      per iteration
    GUIDED            Special case of dynamic to reduce             Most work at runtime: complex scheduling logic used at run-time
                      scheduling overhead

  • Combined parallel/worksharing construct

    OpenMP shortcut: put the “parallel” and the worksharing directive on the same line.

      double res[MAX]; int i;
      #pragma omp parallel
      {
         #pragma omp for
         for (i=0; i< MAX; i++) {
            res[i] = huge();
         }
      }

    is equivalent to:

      double res[MAX]; int i;
      #pragma omp parallel for
      for (i=0; i< MAX; i++) {
         res[i] = huge();
      }

  • Working with loops

    Basic approach:
    – Find compute intensive loops.
    – Make the loop iterations independent … so they can safely execute in any order without loop-carried dependencies.
    – Place the appropriate OpenMP directive and test.

    Before (loop-carried dependence on j):
      int i, j, A[MAX];
      j = 5;
      for (i=0; i< MAX; i++) {
         j += 2;
         A[i] = big(j);
      }

    After (loop-carried dependence removed; note the loop index i is private by default):
      int i, A[MAX];
      #pragma omp parallel for
      for (i=0; i< MAX; i++) {
         int j = 5 + 2*(i+1);
         A[i] = big(j);
      }

  • Reduction

    How do we handle this case?

      double ave=0.0, A[MAX]; int i;
      for (i=0; i< MAX; i++) {
         ave += A[i];
      }
      ave = ave/MAX;

    We are combining values into a single accumulation variable (ave) … there is a true dependence between loop iterations that can’t be trivially removed.
    This is a very common situation … it is called a “reduction”.
    Support for reduction operations is included in most parallel programming environments.

  • Reduction

    OpenMP reduction clause:
      reduction (op : list)

    Inside a parallel or a work-sharing construct:
    – A local copy of each list variable is made and initialized depending on the “op” (e.g. 0 for “+”).
    – Updates occur on the local copy.
    – Local copies are reduced into a single value and combined with the original global value.

    The variables in “list” must be shared in the enclosing parallel region.

      double ave=0.0, A[MAX]; int i;
      #pragma omp parallel for reduction (+:ave)
      for (i=0; i< MAX; i++) {
         ave += A[i];
      }
      ave = ave/MAX;

  • OpenMP: Reduction operands/initial-values

    Many different associative operands can be used with reduction.
    Initial values are the ones that make sense mathematically.

    Operator    Initial value
    +           0
    *           1
    -           0

    C/C++ only:
    Operator    Initial value
    &           ~0
    |           0
    ^           0
    &&          1
    ||          0

    Fortran only:
    Operator    Initial value
    .AND.       .true.
    .OR.        .false.
    .NEQV.      .false.
    .IEOR.      0
    .IOR.       0
    .IAND.      All bits on
    .EQV.       .true.
    MIN*        Largest pos. number
    MAX*        Most neg. number

  • Exercise 4: Pi with loops

    Go back to the serial pi program and parallelize it with a loop construct.
    Your goal is to minimize the number of changes made to the serial program.

  • Exercise 5: Optimizing loops

    Parallelize the matrix multiplication program in the file matmul.c.
    Can you optimize the program by playing with how the loops are scheduled?

  • Outline

    Introduction to OpenMP
    Creating Threads
    Synchronization
    Parallel Loops
    Synchronize single masters and stuff
    Data environment
    OpenMP 3.0 and Tasks
    Memory model
    Threadprivate Data

  • Synchronization: Barrier

    Barrier: each thread waits until all threads arrive.

      #pragma omp parallel shared (A, B, C) private(id)
      {
         id = omp_get_thread_num();
         A[id] = big_calc1(id);
         #pragma omp barrier
         #pragma omp for
         for(i=0; i<N; i++) { C[i] = big_calc3(i,A); }    // implicit barrier at the end of a for worksharing construct
         #pragma omp for nowait
         for(i=0; i<N; i++) { B[i] = big_calc2(C, i); }   // no implicit barrier due to nowait
         A[id] = big_calc4(id);
      }                                                   // implicit barrier at the end of a parallel region

  • Master Construct

    The master construct denotes a structured block that is only executed by the master thread.
    The other threads just skip it (no synchronization is implied).

      #pragma omp parallel
      {
         do_many_things();
         #pragma omp master
         { exchange_boundaries(); }
         #pragma omp barrier
         do_many_other_things();
      }

  • Sections worksharing construct

    The sections worksharing construct gives a different structured block to each thread.

      #pragma omp parallel
      {
         #pragma omp sections
         {
            #pragma omp section
               X_calculation();
            #pragma omp section
               y_calculation();
            #pragma omp section
               z_calculation();
         }
      }

    By default, there is a barrier at the end of the “omp sections”. Use the “nowait” clause to turn off the barrier.

  • Single worksharing construct

    The single construct denotes a block of code that is executed by only one thread (not necessarily the master thread).
    A barrier is implied at the end of the single block (you can remove the barrier with a nowait clause).

      #pragma omp parallel
      {
         do_many_things();
         #pragma omp single
         { exchange_boundaries(); }
         do_many_other_things();
      }

  • Synchronization: ordered

    The ordered region executes in the sequential order.

      #pragma omp parallel private (tmp)
      #pragma omp for ordered reduction(+:res)
      for (I=0; I<N; I++){
         tmp = NEAT_STUFF(I);
         #pragma omp ordered
            res += consum(tmp);
      }

  • Synchronization: Lock routines

    Simple Lock routines:
    – A simple lock is available if it is unset.
    – omp_init_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock(), omp_destroy_lock()

    Nested Locks:
    – A nested lock is available if it is unset or if it is set but owned by the thread executing the nested lock function.
    – omp_init_nest_lock(), omp_set_nest_lock(), omp_unset_nest_lock(), omp_test_nest_lock(), omp_destroy_nest_lock()

    A lock implies a memory fence (a “flush”) of all thread visible variables.
    Note: a thread always accesses the most recent copy of the lock, so you don’t need to use a flush on the lock variable.

  • Synchronization: Simple Locks

    Protect resources with locks.

      omp_lock_t lck;
      omp_init_lock(&lck);
      #pragma omp parallel private (tmp, id)
      {
         id = omp_get_thread_num();
         tmp = do_lots_of_work(id);
         omp_set_lock(&lck);            // Wait here for your turn
         printf("%d %d", id, tmp);
         omp_unset_lock(&lck);          // Release the lock so the next thread gets a turn
      }
      omp_destroy_lock(&lck);           // Free up storage when done

  • Runtime Library routines

    Runtime environment routines:
    – Modify/check the number of threads
      – omp_set_num_threads(), omp_get_num_threads(), omp_get_thread_num(), omp_get_max_threads()
    – Are we in an active parallel region?
      – omp_in_parallel()
    – Do you want the system to dynamically vary the number of threads from one parallel construct to another?
      – omp_set_dynamic(), omp_get_dynamic()
    – How many processors in the system?
      – omp_get_num_procs()

    … plus a few less commonly used routines.

  • Runtime Library routines

    To use a known, fixed number of threads in a program: (1) tell the system that you don’t want dynamic adjustment of the number of threads, (2) set the number of threads, then (3) save the number you got.

      #include <omp.h>
      void main()
      {
         int num_threads;
         omp_set_dynamic( 0 );                          // Disable dynamic adjustment of the number of threads
         omp_set_num_threads( omp_get_num_procs() );    // Request as many threads as you have processors
         #pragma omp parallel
         {
            int id = omp_get_thread_num();
            #pragma omp single                          // Protect this op since memory stores are not atomic
               num_threads = omp_get_num_threads();
            do_lots_of_stuff(id);
         }
      }

    Even in this case, the system may give you fewer threads than requested. If the precise number of threads matters, test for it and respond accordingly.

  • Environment Variables

    Set the default number of threads to use:
    – OMP_NUM_THREADS int_literal

    Control how “omp for schedule(RUNTIME)” loop iterations are scheduled:
    – OMP_SCHEDULE “schedule[, chunk_size]”

    … plus several less commonly used environment variables.
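    As a quick sketch (not in the original slides), the effect of these variables can be observed from inside a program; run it, for example, with OMP_NUM_THREADS=4 and OMP_SCHEDULE="dynamic,4" set in the environment:

      #include <omp.h>
      #include <stdio.h>
      int main()
      {
         /* reflects OMP_NUM_THREADS (or the implementation default)        */
         printf("max threads = %d\n", omp_get_max_threads());

         /* iterations of a schedule(runtime) loop follow OMP_SCHEDULE      */
         #pragma omp parallel for schedule(runtime)
         for (int i = 0; i < 8; i++)
            printf("iteration %d on thread %d\n", i, omp_get_thread_num());
         return 0;
      }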

  • Outline

    Introduction to OpenMP
    Creating Threads
    Synchronization
    Parallel Loops
    Synchronize single masters and stuff
    Data environment
    OpenMP 3.0 and Tasks
    Memory model
    Threadprivate Data

  • Data environment: Default storage attributes

    Shared memory programming model:
    – Most variables are shared by default.

    Global variables are SHARED among threads:
    – Fortran: COMMON blocks, SAVE variables, MODULE variables
    – C: file scope variables, static
    – Both: dynamically allocated memory (ALLOCATE, malloc, new)

    But not everything is shared…
    – Stack variables in subprograms (Fortran) or functions (C) called from parallel regions are PRIVATE.
    – Automatic variables within a statement block are PRIVATE.

  • Data sharing: Examples

      double A[10];
      int main() {
         int index[10];
         #pragma omp parallel
            work(index);
         printf("%d\n", index[0]);
      }

      extern double A[10];
      void work(int *index) {
         double temp[10];
         static int count;
         ...
      }

    – A, index and count are shared by all threads.
    – temp is local to each thread.

    * Third party trademarks and names are the property of their respective owner.

  • Data sharing: Changing storage attributes

    One can selectively change storage attributes for constructs using the following clauses*:
    – SHARED
    – PRIVATE
    – FIRSTPRIVATE

    The final value of a private inside a parallel loop can be transmitted to the shared variable outside the loop with:
    – LASTPRIVATE

    The default attributes can be overridden with:
    – DEFAULT (PRIVATE | SHARED | NONE)        DEFAULT(PRIVATE) is Fortran only.

    * All the clauses on this page apply to the OpenMP construct, NOT to the entire region.
    All data clauses apply to parallel constructs and worksharing constructs except “shared”, which only applies to parallel constructs.
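    A small sketch (not in the original slides) putting several of these clauses on one construct; the array A, scale and last are assumed placeholders:

      #include <omp.h>
      #define N 1000

      void example(double scale)
      {
         double A[N];
         int i, last = 0;
         #pragma omp parallel for default(none) shared(A) \
                 firstprivate(scale) lastprivate(last)
         for (i = 0; i < N; i++) {
            A[i] = i * scale;     /* scale: each thread starts from the original value   */
            last = i;             /* last:  after the loop, holds the value from i = N-1 */
         }
      }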

  • Data Sharing: Private Clause

    private(var) creates a new local copy of var for each thread.
    – The value is uninitialized.
    – In OpenMP 2.5 the value of the shared variable is undefined after the region.

      void wrong() {
         int tmp = 0;
         #pragma omp for private(tmp)
         for (int j = 0; j < 1000; ++j)
            tmp += j;                     // tmp was not initialized
         printf("%d\n", tmp);             // tmp: 0 in 3.0, unspecified in 2.5
      }

  • Data Sharing: Private Clause – When is the original variable valid?

    The original variable’s value is unspecified in OpenMP 2.5.
    In OpenMP 3.0, if it is referenced outside of the construct:
    – Implementations may reference the original variable or a copy … a dangerous programming practice!

      int tmp;
      void danger() {
         tmp = 0;
         #pragma omp parallel private(tmp)
            work();
         printf("%d\n", tmp);             // tmp has unspecified value
      }

      extern int tmp;
      void work() {
         tmp = 5;                         // unspecified which copy of tmp
      }

  • Data Sharing: Firstprivate Clause

    Firstprivate is a special case of private.
    – Initializes each private copy with the corresponding value from the master thread.

      void useless() {
         int tmp = 0;
         #pragma omp for firstprivate(tmp)
         for (int j = 0; j < 1000; ++j)
            tmp += j;                     // Each thread gets its own tmp with an initial value of 0
         printf("%d\n", tmp);             // tmp: 0 in 3.0, unspecified in 2.5
      }

  • Data sharing: Lastprivate Clause

    Lastprivate passes the value of a private from the last iteration to a global variable.

      void closer() {
         int tmp = 0;
         #pragma omp parallel for firstprivate(tmp) \
                 lastprivate(tmp)
         for (int j = 0; j < 1000; ++j)
            tmp += j;                     // Each thread gets its own tmp with an initial value of 0
         printf("%d\n", tmp);             // tmp is defined as its value at the “last sequential” iteration (i.e., for j=999)
      }

  • Data Sharing: A data environment test

    Consider this example of PRIVATE and FIRSTPRIVATE:

      variables A, B, and C = 1
      #pragma omp parallel private(B) firstprivate(C)

    Are A, B, C local to each thread or shared inside the parallel region? What are their initial values inside, and values after, the parallel region?

    Inside this parallel region:
    – “A” is shared by all threads; equals 1.
    – “B” and “C” are local to each thread.
      – B’s initial value is undefined.
      – C’s initial value equals 1.

    Outside this parallel region:
    – The values of “B” and “C” are unspecified in OpenMP 2.5, and in OpenMP 3.0 if referenced in the region but outside the construct.

  • Data Sharing: Default Clause

    Note that the default storage attribute is DEFAULT(SHARED) (so no need to use it).
    – Exception: #pragma omp task

    To change the default: DEFAULT(PRIVATE)
    – each variable in the construct is made private as if specified in a private clause
    – mostly saves typing

    DEFAULT(NONE): no default for variables in static extent. Must list the storage attribute for each variable in static extent. Good programming practice!

    Only the Fortran API supports default(private).
    C/C++ only has default(shared) or default(none).

  • Data Sharing: Default Clause Example

    These two code fragments are equivalent:

      itotal = 1000
      C$OMP PARALLEL PRIVATE(np, each)
            np = omp_get_num_threads()
            each = itotal/np
            ………
      C$OMP END PARALLEL

      itotal = 1000
      C$OMP PARALLEL DEFAULT(PRIVATE) SHARED(itotal)
            np = omp_get_num_threads()
            each = itotal/np
            ………
      C$OMP END PARALLEL

  • Data Sharing: tasks (OpenMP 3.0)

    The default for tasks is usually firstprivate, because the task may not be executed until later (and variables may have gone out of scope).
    Variables that are shared in all constructs starting from the innermost enclosing parallel construct are shared, because the barrier guarantees task completion.

      #pragma omp parallel shared(A) private(B)
      {
         ...
         #pragma omp task
         {
            int C;
            compute(A, B, C);     // A is shared, B is firstprivate, C is private
         }
      }

  • Exercise 6: Molecular dynamics

    The code supplied is a simple molecular dynamics simulation of the melting of solid argon.
    Computation is dominated by the calculation of force pairs in subroutine forces (in forces.c).
    Parallelise this routine using a parallel for construct and atomics. Think carefully about which variables should be SHARED, PRIVATE or REDUCTION variables.
    Experiment with different schedule kinds.

  • Exercise 6 (cont.)

    Once you have a working version, move the parallel region out to encompass the iteration loop in main.c:
    – code other than the forces loop must be executed by a single thread (or workshared).
    – how does the data sharing change?

    The atomics are a bottleneck on most systems.
    – This can be avoided by introducing a temporary array for the force accumulation, with an extra dimension indexed by thread number.
    – Which thread(s) should do the final accumulation into f?

  • Outline

    Introduction to OpenMP
    Creating Threads
    Synchronization
    Parallel Loops
    Synchronize single masters and stuff
    Data environment
    OpenMP 3.0 and Tasks
    Memory model
    Threadprivate Data

  • OpenMP pre-history

    OpenMP is based upon the SMP directive standardization efforts of PCF and the aborted ANSI X3H5 – late 80’s.
    – Nobody fully implemented either standard.
    – Only a couple of partial implementations existed.

    Vendors considered proprietary APIs to be a competitive feature:
    – Every vendor had a proprietary directive set.
    – Even KAP, a “portable” multi-platform parallelization tool, used different directives on each platform.

    PCF – Parallel Computing Forum
    KAP – parallelization tool from KAI.

  • History of OpenMP

    (Figure: the events leading to OpenMP, 1997)
    – SGI and Cray merged and needed commonality across products.
    – KAI and other ISVs needed a larger market.
    – ASCI was tired of recoding for SMPs and urged vendors to standardize.
    – A rough-draft “straw man” SMP API was written; other vendors (DEC, HP, IBM, Intel) were invited to join.

  • OpenMP Release History

    1997   OpenMP Fortran 1.0
    1998   OpenMP C/C++ 1.0
    1999   OpenMP Fortran 1.1
    2000   OpenMP Fortran 2.0
    2002   OpenMP C/C++ 2.0
    2005   OpenMP 2.5 – a single specification for Fortran, C and C++
    2008   OpenMP 3.0 – tasking, other new features

  • Tasks (OpenMP 3.0)

    Adding tasking is the biggest addition for 3.0.
    Worked on by a separate subcommittee, led by Jay Hoeflinger at Intel.
    Re-examined the issue from the ground up (quite different from Intel taskq’s).

  • General task characteristics (OpenMP 3.0)

    A task has:
    – Code to execute
    – A data environment (it owns its data)
    – An assigned thread that executes the code and uses the data

    Two activities: packaging and execution
    – Each encountering thread packages a new instance of a task (code and data).
    – Some thread in the team executes the task at some later time.

  • Definitions (OpenMP 3.0)

    Task construct – task directive plus structured block
    Task – the package of code and instructions for allocating data created when a thread encounters a task construct
    Task region – the dynamic sequence of instructions produced by the execution of a task by a thread

  • Tasks and OpenMP (OpenMP 3.0)

    Tasks have been fully integrated into OpenMP.
    Key concept: OpenMP has always had tasks, we just never called them that.
    – A thread encountering a parallel construct packages up a set of implicit tasks, one per thread.
    – A team of threads is created.
    – Each thread in the team is assigned to one of the tasks (and tied to it).
    – A barrier holds the original master thread until all implicit tasks are finished.

    We have simply added a way to create a task explicitly for the team to execute.
    Every part of an OpenMP program is part of one task or another!

  • task Construct (OpenMP 3.0)

      #pragma omp task [clause[[,]clause] ...]
         structured-block

    where clause can be one of:
      if (expression)
      untied
      shared (list)
      private (list)
      firstprivate (list)
      default( shared | none )

  • The if clause (OpenMP 3.0)

    When the if clause argument is false:
    – The task is executed immediately by the encountering thread.
    – The data environment is still local to the new task…
    – … and it’s still a different task with respect to synchronization.

    It’s a user directed optimization:
    – when the cost of deferring the task is too great compared to the cost of executing the task code
    – to control cache and memory affinity

  • When/where are tasks complete? (OpenMP 3.0)

    At thread barriers, explicit or implicit:
    – applies to all tasks generated in the current parallel region up to the barrier
    – matches user expectation

    At task barriers:
    – i.e. wait until all tasks defined in the current task have completed:
      #pragma omp taskwait
    – Note: applies only to tasks generated in the current task, not to “descendants”.

  • Example – parallel pointer chasing using tasks (OpenMP 3.0)

      #pragma omp parallel
      {
         #pragma omp single private(p)
         {
            p = listhead ;
            while (p) {
               #pragma omp task
                  process (p);       // p is firstprivate inside this task
               p = next (p) ;
            }
         }
      }

  • Example – parallel pointer chasing on multiple lists using tasks (OpenMP 3.0)

      #pragma omp parallel
      {
         #pragma omp for private(p)
         for ( int i = 0; i < numlists; i++) {
            p = listheads[i];
            while (p) {
               #pragma omp task
                  process (p);
               p = next (p);
            }
         }
      }

  • Example: postorder tree traversal (OpenMP 3.0)

      void postorder(node *p) {
         if (p->left)
            #pragma omp task
               postorder(p->left);
         if (p->right)
            #pragma omp task
               postorder(p->right);
         #pragma omp taskwait         // wait for descendants
         process(p->data);
      }

    Task scheduling point: the parent task is suspended until its children tasks complete.

  • Task switching (OpenMP 3.0)

    Certain constructs have task scheduling points at defined locations within them.
    When a thread encounters a task scheduling point, it is allowed to suspend the current task and execute another (called task switching).
    It can then return to the original task and resume.

  • Task switching example (OpenMP 3.0)

      #pragma omp single
      {
         for (i=0; i<ONEZILLION; i++)
            #pragma omp task
               process(item[i]);
      }

    – Far too many tasks can be generated before any complete, so the generating task must suspend for a while.
    – With task switching, the thread creating the tasks can suspend at a task scheduling point, execute some of the already generated tasks, and then resume generating.

  • Thread switching (OpenMP 3.0)

      #pragma omp single
      {
         #pragma omp task untied
         for (i=0; i<ONEZILLION; i++)
            #pragma omp task
               process(item[i]);
      }

    – The untied clause allows a suspended task to be resumed by a different thread, so the task that generates the work is not stuck to the thread that first executed it.

  • Dealing with taskprivate data (OpenMP 3.0)

    The taskprivate directive was removed from OpenMP 3.0:
    – Too expensive to implement.

    Restrictions on task scheduling allow threadprivate data to be used:
    – The user can avoid thread switching with tied tasks.
    – Task scheduling points are well defined.

  • Conclusions on tasks (OpenMP 3.0)

    – Enormous amount of work by many people.
    – Tightly integrated into the 3.0 spec.
    – Flexible model for irregular parallelism.
    – Provides a balanced solution despite often conflicting goals.
    – Appears that performance can be reasonable.

  • Nested parallelism (OpenMP 3.0)

    Better support for nested parallelism.
    Per-thread internal control variables:
    – Allows, for example, calling omp_set_num_threads() inside a parallel region.
    – Controls the team sizes for the next level of parallelism.

    Library routines to determine the depth of nesting, IDs of parent/grandparent etc. threads, and team sizes of parent/grandparent etc. teams:
      omp_get_active_level()
      omp_get_ancestor_thread_num(level)
      omp_get_team_size(level)
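    A tiny sketch (not from the slides) of a two-level nested region exercising these routines; the team sizes 2 and 3 are arbitrary choices:

      #include <omp.h>
      #include <stdio.h>
      int main()
      {
         omp_set_nested(1);                     /* enable nested parallelism            */
         #pragma omp parallel num_threads(2)
         {
            omp_set_num_threads(3);             /* per-thread control variable: sets    */
            #pragma omp parallel                /* the inner team size for this thread  */
            {
               printf("level %d: ancestor %d, inner thread %d of %d\n",
                      omp_get_active_level(),
                      omp_get_ancestor_thread_num(1),
                      omp_get_thread_num(),
                      omp_get_team_size(2));
            }
         }
         return 0;
      }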

  • Parallel loops (OpenMP 3.0)

    Guarantee that this works … i.e. that the same schedule is used in the two loops:

      !$omp do schedule(static)
      do i=1,n
         a(i) = ....
      end do
      !$omp end do nowait
      !$omp do schedule(static)
      do i=1,n
         .... = a(i)
      end do

  • Loops (cont.) (OpenMP 3.0)

    Allow collapsing of perfectly nested loops:

      !$omp parallel do collapse(2)
      do i=1,n
         do j=1,n
            .....
         end do
      end do

    Will form a single loop and then parallelize that.

  • Loops (cont.) (OpenMP 3.0)

    Made schedule(runtime) more useful:
    – can get/set it with library routines
        omp_set_schedule()
        omp_get_schedule()
    – allow implementations to implement their own schedule kinds

    Added a new schedule kind AUTO which gives full freedom to the runtime to determine the scheduling of iterations to threads.

    Allowed C++ random access iterators as loop control variables in parallel loops.
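    A short sketch (not in the slides) of the 3.0 get/set routines; note the loop itself must use schedule(runtime) for the setting to take effect:

      #include <omp.h>
      void run(double *a, int n)
      {
         omp_sched_t kind; int chunk;

         omp_set_schedule(omp_sched_dynamic, 4);    /* overrides OMP_SCHEDULE           */
         omp_get_schedule(&kind, &chunk);           /* kind == omp_sched_dynamic, chunk == 4 */

         #pragma omp parallel for schedule(runtime)
         for (int i = 0; i < n; i++)
            a[i] = 2.0 * a[i];
      }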

  • Portable control of threads (OpenMP 3.0)

    Added an environment variable to control the size of child threads’ stack:
      OMP_STACKSIZE

    Added an environment variable to hint to the runtime how to treat idle threads:
      OMP_WAIT_POLICY
      – ACTIVE: keep threads alive at barriers/locks
      – PASSIVE: try to release the processor at barriers/locks

  • Control program execution (OpenMP 3.0)

    Added an environment variable and runtime routines to get/set the maximum number of active levels of nested parallelism:
      OMP_MAX_ACTIVE_LEVELS
      omp_set_max_active_levels()
      omp_get_max_active_levels()

    Added an environment variable to set the maximum number of threads in use:
      OMP_THREAD_LIMIT
      omp_get_thread_limit()

  • Odds and ends (OpenMP 3.0)

    – Allow unsigned ints in parallel for loops
    – Disallow use of the original variable as the master thread’s private variable
    – Make it clearer where/how private objects are constructed/destructed
    – Relax some restrictions on allocatable arrays and Fortran pointers
    – Plug some minor gaps in the memory model
    – Allow C++ static class members to be threadprivate
    – Improve C/C++ grammar
    – Minor fixes and clarifications to 2.5

  • Exercise 7: tasks in OpenMP

    Consider the program linked.c:
    – Traverses a linked list computing a sequence of Fibonacci numbers at each node.
    Parallelize this program using tasks.
    Compare your solution’s complexity to an approach without tasks.

  • Exercise 8: linked lists the hard way

    Consider the program linked.c:
    – Traverses a linked list computing a sequence of Fibonacci numbers at each node.
    Parallelize this program using constructs defined in OpenMP 2.5 (loop worksharing constructs … i.e. don’t use OpenMP 3.0 tasks).
    Once you have a correct program, optimize it.

  • Conclusion

    OpenMP 3.0 is a major upgrade … it expands the range of algorithms accessible from OpenMP.

  • Outline

    Introduction to OpenMP
    Creating Threads
    Synchronization
    Parallel Loops
    Synchronize single masters and stuff
    Data environment
    OpenMP 3.0 and Tasks
    Memory model
    Threadprivate Data

  • OpenMP memory model

    OpenMP supports a shared memory model.
    All threads share an address space, but it can get complicated:

    (Figure: proc1 … procN, each with its own cache (cache1 … cacheN), all connected to a shared memory.)

    Multiple copies of data may be present in various levels of cache, or in registers.

  • OpenMP and Relaxed Consistency

    OpenMP supports a relaxed-consistency shared memory model:
    – Threads can maintain a temporary view of shared memory which is not consistent with that of other threads.
    – These temporary views are made consistent only at certain points in the program.
    – The operation which enforces consistency is called the flush operation.

  • Flush operation

    Defines a sequence point at which a thread is guaranteed to see a consistent view of memory:
    – All previous reads/writes by this thread have completed and are visible to other threads.
    – No subsequent reads/writes by this thread have occurred.

    A flush operation is analogous to a fence in other shared memory APIs.

  • Flush and synchronization

    A flush operation is implied by OpenMP synchronizations, e.g.
    – at entry/exit of parallel regions
    – at implicit and explicit barriers
    – at entry/exit of critical regions
    – whenever a lock is set or unset
    – ….
    (but not at entry to worksharing regions or entry/exit of master regions)

  • Example: producer-consumer pattern

      Thread 0              Thread 1
      a = foo();            while (!flag);
      flag = 1;             b = a;

    This is incorrect code:
    – The compiler and/or hardware may re-order the reads/writes to a and flag, or flag may be held in a register.

    OpenMP has a #pragma omp flush directive which specifies an explicit flush operation:
    – it can be used to make the above example work
    – … but its use is difficult and prone to subtle bugs …
    – To learn more about flush and its use, see the appendix.
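    For reference, here is a commonly shown repair of that pattern (a sketch in the spirit of the appendix, not copied from it), with the flushes that make the write to a visible before flag is seen as set; foo() is an assumed producer:

      double foo(void);                          /* assumed producer of the value     */
      void pipe(void)
      {
         double a, b; int flag = 0;
         #pragma omp parallel sections shared(a, b, flag)
         {
            #pragma omp section                  /* plays Thread 0 (producer)         */
            {
               a = foo();
               #pragma omp flush                 /* a is globally visible ...         */
               flag = 1;
               #pragma omp flush                 /* ... before flag is seen as set    */
            }
            #pragma omp section                  /* plays Thread 1 (consumer)         */
            {
               while (1) {
                  #pragma omp flush              /* re-read flag (and a) from memory  */
                  if (flag) break;
               }
               b = a;
            }
         }
      }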

  • Outline

    Introduction to OpenMP
    Creating Threads
    Synchronization
    Parallel Loops
    Synchronize single masters and stuff
    Data environment
    OpenMP 3.0 and Tasks
    Memory model
    Threadprivate Data

  • Data sharing: Threadprivate

    Makes global data private to a thread:
    – Fortran: COMMON blocks
    – C: file scope and static variables, static class members

    Different from making them PRIVATE:
    – with PRIVATE, global variables are masked.
    – THREADPRIVATE preserves global scope within each thread.

    Threadprivate variables can be initialized using COPYIN or at time of definition (using language-defined initialization capabilities).

  • A threadprivate example (C)

    Use threadprivate to create a counter for each thread.

      int counter = 0;
      #pragma omp threadprivate(counter)

      int increment_counter()
      {
         counter++;
         return (counter);
      }

  • Data Copying: Copyin

    You initialize threadprivate data using a copyin clause.

            parameter (N=1000)
            common/buf/A(N)
      !$OMP THREADPRIVATE(/buf/)

      C Initialize the A array
            call init_data(N,A)

      !$OMP PARALLEL COPYIN(A)

      … Now each thread sees threadprivate array A initialized
      … to the global value set in the subroutine init_data()

      !$OMP END PARALLEL

            end

  • Data Copying: Copyprivate

    Used with a single region to broadcast values of privates from one member of a team to the rest of the team.

      #include <omp.h>
      void input_parameters (int, int);    // fetch values of input parameters
      void do_work(int, int);

      void main()
      {
         int Nsize, choice;
         #pragma omp parallel private (Nsize, choice)
         {
            #pragma omp single copyprivate (Nsize, choice)
               input_parameters (Nsize, choice);
            do_work(Nsize, choice);
         }
      }

  • Exercise 9: Monte Carlo Calculations
    Using random numbers to solve tough problems

    Sample a problem domain to estimate areas, compute probabilities, find optimal values, etc.
    Example: computing π with a digital dart board:
    – Throw darts at a circle of radius r inscribed in a square of side 2*r.
    – The chance of falling in the circle is proportional to the ratio of areas:
        Ac = r² * π
        As = (2*r) * (2*r) = 4 * r²
        P  = Ac/As = π/4
    – Compute π by randomly choosing points, count the fraction that falls in the circle, compute pi.

      N = 10      π = 2.8
      N = 100     π = 3.16
      N = 1000    π = 3.148
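    A serial sketch of that estimator (not the tutorial's pi_mc.c; drand48 stands in for the tutorial's random.c generator), which is the kind of starting point the exercise then asks you to parallelize:

      #include <stdio.h>
      #include <stdlib.h>

      int main()
      {
         long i, N = 1000000, count = 0;
         double x, y;
         for (i = 0; i < N; i++) {
            x = 2.0 * drand48() - 1.0;       /* random point in the square [-1,1] x [-1,1] */
            y = 2.0 * drand48() - 1.0;
            if (x*x + y*y <= 1.0) count++;   /* dart landed inside the unit circle         */
         }
         printf("pi ~= %f\n", 4.0 * (double)count / (double)N);
         return 0;
      }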

  • Exercise 9

    We provide three files for this exercise:
    – pi_mc.c: the Monte Carlo method pi program
    – random.c: a simple random number generator
    – random.h: include file for the random number generator

    Create a parallel version of this program without changing the interfaces to functions in random.c:
    – This is an exercise in modular software … why should a user of your parallel random number generator have to know any details of the generator or make any changes to how the generator is called?
    – The random number generator must be threadsafe.

    Extra Credit:
    – Make your random number generator numerically correct (non-overlapping sequences of pseudo-random numbers).

  • Conclusion

    We have now covered the full sweep of the OpenMP specification.
    – We’ve left off some minor details, but we’ve covered all the major topics … remaining content you can pick up on your own.

    Download the spec to learn more … the spec is filled with examples to support your continuing education.
    – www.openmp.org

    Get involved:
    – get your organization to join the OpenMP ARB.
    – Work with us through cOMPunity.

  • Appendices

    Sources for additional information
    Solutions to exercises:
    – Exercise 1: hello world
    – Exercise 2: Simple SPMD Pi program
    – Exercise 3: SPMD Pi without false sharing
    – Exercise 4: Loop level Pi
    – Exercise 5: Matrix multiplication
    – Exercise 6: Molecular dynamics
    – Exercise 7: linked lists with tasks
    – Exercise 8: linked lists without tasks
    – Exercise 9: Monte Carlo Pi and random numbers
    Flush and the OpenMP memory model: the producer-consumer pattern
    Fortran and OpenMP
    Compiler Notes

  • OpenMP Organizations

    OpenMP Architecture Review Board URL, the “owner” of the OpenMP specification:
    – www.openmp.org

    OpenMP User’s Group (cOMPunity) URL:
    – www.compunity.org

    Get involved, join cOMPunity and help define the future of OpenMP.

  • Books about OpenMP

    – A book about OpenMP 2.5 by a team of authors at the forefront of OpenMP’s evolution.
    – A new book about how to “think parallel”, with examples in OpenMP, MPI and Java.

  • OpenMP Papers

    – Sosa CP, Scalmani C, Gomperts R, Frisch MJ. Ab initio quantum chemistry on a ccNUMA architecture using OpenMP. III. Parallel Computing, vol.26, no.7-8, July 2000, pp.843-56. Publisher: Elsevier, Netherlands.
    – Couturier R, Chipot C. Parallel molecular dynamics using OPENMP on a shared memory machine. Computer Physics Communications, vol.124, no.1, Jan. 2000, pp.49-59. Publisher: Elsevier, Netherlands.
    – Bentz J., Kendall R., “Parallelization of General Matrix Multiply Routines Using OpenMP”, Shared Memory Parallel Programming with OpenMP, Lecture Notes in Computer Science, Vol. 3349, P. 1, 2005.
    – Bova SW, Breshears CP, Cuicchi CE, Demirbilek Z, Gabb HA. Dual-level parallel analysis of harbor wave response using MPI and OpenMP. International Journal of High Performance Computing Applications, vol.14, no.1, Spring 2000, pp.49-64. Publisher: Sage Science Press, USA.
    – Ayguade E, Martorell X, Labarta J, Gonzalez M, Navarro N. Exploiting multiple levels of parallelism in OpenMP: a case study. Proceedings of the 1999 International Conference on Parallel Processing. IEEE Comput. Soc. 1999, pp.172-80. Los Alamitos, CA, USA.
    – Bova SW, Breshears CP, Cuicchi C, Demirbilek Z, Gabb H. Nesting OpenMP in an MPI application. Proceedings of the ISCA 12th International Conference, Parallel and Distributed Systems. ISCA. 1999, pp.566-71. Cary, NC, USA.

  • OpenMP Papers (continued)

    – Jost G., Labarta J., Gimenez J., What Multilevel Parallel Programs do when you are not watching: a Performance analysis case study comparing MPI/OpenMP, MLP, and Nested OpenMP, Shared Memory Parallel Programming with OpenMP, Lecture Notes in Computer Science, Vol. 3349, P. 29, 2005.
    – Gonzalez M, Serra A, Martorell X, Oliver J, Ayguade E, Labarta J, Navarro N. Applying interposition techniques for performance analysis of OPENMP parallel applications. Proceedings 14th International Parallel and Distributed Processing Symposium, IPDPS 2000. IEEE Comput. Soc. 2000, pp.235-40.
    – Chapman B, Mehrotra P, Zima H. Enhancing OpenMP with features for locality control. Proceedings of Eighth ECMWF Workshop on the Use of Parallel Processors in Meteorology. Towards Teracomputing. World Scientific Publishing. 1999, pp.301-13. Singapore.
    – Steve W. Bova, Clay P. Breshears, Henry Gabb, Rudolf Eigenmann, Greg Gaertner, Bob Kuhn, Bill Magro, Stefano Salvini. Parallel Programming with Message Passing and Directives; SIAM News, Volume 32, No 9, Nov. 1999.
    – Cappello F, Richard O, Etiemble D. Performance of the NAS benchmarks on a cluster of SMP PCs using a parallelization of the MPI programs with OpenMP. Lecture Notes in Computer Science Vol.1662. Springer-Verlag. 1999, pp.339-50.
    – Liu Z., Huang L., Chapman B., Weng T., Efficient Implementation of OpenMP for Clusters with Implicit Data Distribution, Shared Memory Parallel Programming with OpenMP, Lecture Notes in Computer Science, Vol. 3349, P. 121, 2005.

  • OpenMP Papers (continued)

    B. Chapman, F. Bregier, A. Patil, A. Prabhakar, "Achieving performance under OpenMP on ccNUMA and software distributed shared memory systems", Concurrency and Computation: Practice and Experience, 14(8-9): 713-739, 2002.

    J. M. Bull and M. E. Kambites. JOMP: an OpenMP-like interface for Java. Proceedings of the ACM 2000 Conference on Java Grande, 2000, pp.44-53.

    L. Adhianto and B. Chapman, "Performance modeling of communication and computation in hybrid MPI and OpenMP applications", Simulation Modelling Practice and Theory, vol.15, pp.481-491, 2007.

    Shah S, Haab G, Petersen P, Throop J. Flexible control structures for parallelism in OpenMP. Concurrency: Practice and Experience, 2000; 12:1219-1239. Publisher: John Wiley & Sons, Ltd.

    Mattson, T.G., How Good is OpenMP? Scientific Programming, Vol. 11, Number 2, pp.81-93, 2003.

    Duran A., Silvera R., Corbalan J., Labarta J., "Runtime Adjustment of Parallel Nested Loops", Shared Memory Parallel Programming with OpenMP, Lecture Notes in Computer Science, Vol. 3349, P. 137, 2005.

  • Appendices

    Sources for additional information
    Solutions to exercises
      Exercise 1: hello world
      Exercise 2: Simple SPMD Pi program
      Exercise 3: SPMD Pi without false sharing
      Exercise 4: Loop level Pi
      Exercise 5: Matrix multiplication
      Exercise 6: Molecular dynamics
      Exercise 7: linked lists with tasks
      Exercise 8: linked lists without tasks
      Exercise 9: Monte Carlo Pi and random numbers
    Flush and the OpenMP memory model: the producer/consumer pattern
    Fortran and OpenMP Compiler Notes

  • Exercise 1: Solution
    A multi-threaded "Hello world" program

    Write a multithreaded program where each thread prints "hello world".

    #include <stdio.h>
    #include "omp.h"                           // OpenMP include file

    void main ()
    {
        #pragma omp parallel                   // Parallel region with default number of threads
        {
            int ID = omp_get_thread_num();     // Runtime library function to return a thread ID
            printf(" hello(%d) ", ID);
            printf(" world(%d) \n", ID);
        }                                      // End of the parallel region
    }

    Sample output:
        hello(1) hello(0) world(1)
        world(0)
        hello(3) hello(2) world(3)
        world(2)


  • The SPMD pattern

    The most common approach for parallel algorithms is the SPMD, or Single Program Multiple Data, pattern.

    Each thread runs the same program (Single Program), but using the thread ID, each thread operates on different data (Multiple Data) or takes a slightly different path through the code.

    In OpenMP this means:
      A parallel region "near the top of the code".
      Pick up the thread ID and the number of threads.
      Use them to split up loops and select different blocks of data to work on (a block-splitting sketch follows; Exercise 2 below uses a cyclic distribution instead).
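    A minimal sketch, not taken from the exercises, of one way to use the thread ID and thread count to split a loop into contiguous blocks. The array data, its size N and the helper do_work() are illustrative names:

    #include <omp.h>
    #define N 1000

    double data[N];

    static void do_work(int i) { data[i] *= 2.0; }    // placeholder for real per-element work

    void process_in_blocks()
    {
        #pragma omp parallel
        {
            int id     = omp_get_thread_num();
            int nthrds = omp_get_num_threads();
            int chunk  = (N + nthrds - 1) / nthrds;   // ceiling division
            int start  = id * chunk;
            int end    = (start + chunk < N) ? (start + chunk) : N;
            for (int i = start; i < end; i++)         // each thread owns one contiguous block
                do_work(i);
        }
    }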

  • Exercise 2: A simple SPMD pi program

    #include <omp.h>
    static long num_steps = 100000;
    double step;
    #define NUM_THREADS 2

    void main ()
    {
        int i, nthreads; double pi, sum[NUM_THREADS];  // Promote the scalar to an array dimensioned by the
                                                       // number of threads to avoid a race condition.
        step = 1.0/(double) num_steps;
        omp_set_num_threads(NUM_THREADS);
        #pragma omp parallel
        {
            int i, id, nthrds;
            double x;
            id = omp_get_thread_num();
            nthrds = omp_get_num_threads();
            if (id == 0) nthreads = nthrds;            // Only one thread copies the number of threads to the
                                                       // global value, so multiple threads writing to the same
                                                       // address don't conflict.
            for (i=id, sum[id]=0.0; i < num_steps; i = i + nthrds) {   // A common trick in SPMD programs:
                x = (i+0.5)*step;                                      // a cyclic distribution of loop iterations.
                sum[id] += 4.0/(1.0+x*x);
            }
        }
        for (i=0, pi=0.0; i < nthreads; i++) pi += sum[i] * step;
    }


  • False sharing

    If independent data elements happen to sit on the same cache line, each update will cause the cache lines to "slosh back and forth" between threads. This is called "false sharing".

    If you promote scalars to an array to support creation of an SPMD program, the array elements are contiguous in memory and hence share cache lines. Result … poor scalability.

    Solution:
      When updates to an item are frequent, work with local copies of data instead of an array indexed by the thread ID.
      Pad arrays so the elements you use are on distinct cache lines (a padding sketch follows).
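    A minimal sketch, not the exercise's official solution, of the padding approach applied to the pi example. PAD = 8 assumes 64-byte cache lines and 8-byte doubles:

    #include <omp.h>
    #define NUM_THREADS 4
    #define PAD 8                          // assume 64-byte cache lines and 8-byte doubles

    static long num_steps = 100000;
    double sum[NUM_THREADS][PAD];          // each thread's sum[id][0] sits on its own cache line

    double pi_padded()
    {
        double pi = 0.0, step = 1.0/(double) num_steps;
        omp_set_num_threads(NUM_THREADS);
        #pragma omp parallel
        {
            int id = omp_get_thread_num(), nthrds = omp_get_num_threads();
            sum[id][0] = 0.0;
            for (long i = id; i < num_steps; i += nthrds) {
                double x = (i + 0.5) * step;
                sum[id][0] += 4.0/(1.0 + x*x);   // frequent updates, but no cache line is shared
            }
        }
        for (int i = 0; i < NUM_THREADS; i++) pi += sum[i][0] * step;
        return pi;
    }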

  • Exercise 3: SPMD Pi without false sharing

    #include <omp.h>
    static long num_steps = 100000;
    double step;
    #define NUM_THREADS 2

    void main ()
    {
        int nthreads; double pi = 0.0;
        step = 1.0/(double) num_steps;
        omp_set_num_threads(NUM_THREADS);
        #pragma omp parallel
        {
            int i, id, nthrds;
            double x, sum;                       // Create a scalar local to each thread to accumulate
                                                 // partial sums: no array, so no false sharing.
            id = omp_get_thread_num();
            nthrds = omp_get_num_threads();
            if (id == 0) nthreads = nthrds;
            for (i=id, sum=0.0; i < num_steps; i = i + nthrds) {
                x = (i+0.5)*step;
                sum += 4.0/(1.0+x*x);
            }
            #pragma omp critical                 // sum goes "out of scope" beyond the parallel region, so you
                pi += sum * step;                // must sum it in here. Protect the summation into pi with a
        }                                        // critical region so updates don't conflict.
    }


  • Exercise 4: solution

    #include <omp.h>
    static long num_steps = 100000;
    double step;
    #define NUM_THREADS 2

    void main ()
    {
        int i; double x, pi, sum = 0.0;
        step = 1.0/(double) num_steps;
        omp_set_num_threads(NUM_THREADS);                         // For good OpenMP implementations,
        #pragma omp parallel for private(x) reduction(+:sum)      // reduction is more scalable than critical.
        for (i=0; i < num_steps; i++) {                           // i is private by default.
            x = (i+0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
        }
        pi = step * sum;
    }

    Note: we created a parallel program without changing any of the serial code, just by adding 4 simple lines!


  • Matrix multiplication (Exercise 5)

    Parallelize only the outermost loop (over i):

    #pragma omp parallel for private(tmp, i, j, k)

    A sketch of the full loop nest follows.
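    A minimal sketch of the loop nest the directive above applies to, assuming square N-by-N matrices stored in the illustrative arrays A, B and C (the exercise's actual dimensions and array names may differ):

    #define N 512
    double A[N][N], B[N][N], C[N][N];

    void matmul()
    {
        int i, j, k;
        double tmp;
        #pragma omp parallel for private(tmp, i, j, k)
        for (i = 0; i < N; i++) {                 // iterations of the outer loop are divided among threads
            for (j = 0; j < N; j++) {
                tmp = 0.0;
                for (k = 0; k < N; k++)
                    tmp += A[i][k] * B[k][j];     // dot product of row i of A and column j of B
                C[i][j] = tmp;
            }
        }
    }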


  • Exercise 6 solution

    The compiler will warn you if you have missed some variables.

    #pragma omp parallel for default (none) \
            shared(x,f,npart,rcoff,side) \
            reduction(+:epot,vir) \
            schedule (static,32)
    for (int i=0; i < npart*3; i += 3) {

  • Exercise 6 solution (continued)

        ........
        #pragma omp atomic
            f[j]   -= forcex;
        #pragma omp atomic
            f[j+1] -= forcey;
        #pragma omp atomic
            f[j+2] -= forcez;           // All updates to f must be atomic.
          }
        }
        #pragma omp atomic
            f[i]   += fxi;
        #pragma omp atomic
            f[i+1] += fyi;
        #pragma omp atomic
            f[i+2] += fzi;
      }
    }

  • Exercise 6 with orphaning

    #pragma omp single
    {
        vir  = 0.0;
        epot = 0.0;
    }                                        // Implicit barrier needed to avoid a race condition with the
                                             // update of the reduction variables at the end of the for construct.
    #pragma omp for reduction(+:epot,vir) \
            schedule (static,32)
    for (int i=0; i < npart*3; i += 3) {

  • Exercise 6 with array reduction

        ftemp[myid][j]   -= forcex;          // Replace the atomics with accumulation into an array
        ftemp[myid][j+1] -= forcey;          // with an extra dimension (one row per thread).
        ftemp[myid][j+2] -= forcez;
        }
        ftemp[myid][i]   += fxi;
        ftemp[myid][i+1] += fyi;
        ftemp[myid][i+2] += fzi;
    }

  • Exercise 6 with array reduction (continued)

    ….
    The reduction of the per-thread rows of ftemp into f can itself be done in parallel:

    #pragma omp for
    for (int i=0; i < npart*3; i++) {
        …
    }

    A sketch of this final reduction loop follows.
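    A minimal sketch of that final reduction, with NPART, MAX_THREADS and the wrapper function invented for illustration; in the exercise the loop sits inside the existing parallel region and uses the real npart and nthreads:

    #define NPART 1000
    #define MAX_THREADS 8

    double f[NPART*3];
    double ftemp[MAX_THREADS][NPART*3];      // one row of partial forces per thread
    int    nthreads;                         // set earlier, e.g. inside an omp single

    void reduce_forces()
    {
        #pragma omp parallel
        {
            #pragma omp for
            for (int i = 0; i < NPART*3; i++)        // iterations shared among the threads
                for (int t = 0; t < nthreads; t++)   // sum the per-thread partial forces into f
                    f[i] += ftemp[t][i];
        }
    }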


  • Linked lists with tasks (Intel taskq)

    See the file Linked_intel_taskq.c

    #pragma omp parallel
    {
        #pragma intel omp taskq
        {
            while (p != NULL) {
                #pragma intel omp task captureprivate(p)
                    processwork(p);
                p = p->next;
            }
        }
    }

                    Array, Static, 1    Intel taskq
    One Thread      45 seconds          48 seconds
    Two Threads     28 seconds          30 seconds

    Results on an Intel dual core 1.83 GHz CPU, Intel IA-32 compiler 10.1 build 2.

  • Linked lists with tasks (OpenMP 3)

    See the file Linked_omp3_tasks.c

    #pragma omp parallel
    {
        #pragma omp single
        {
            p = head;
            while (p) {
                #pragma omp task firstprivate(p)   // Creates a task with its own copy of "p",
                    processwork(p);                // initialized to the value of "p" when the
                p = p->next;                       // task is defined.
            }
        }
    }


  • Linked lists without tasks

    See the file Linked_omp25.c

    while (p != NULL) {               // Count the number of items in the linked list.
        p = p->next;
        count++;
    }
    p = head;
    for (i=0; i < count; i++) {       // Copy a pointer to each node into an array
        parr[i] = p;                  // ("parr" is an illustrative name; the array name is
        p = p->next;                  // not visible in this copy of the slide).
    }
    #pragma omp parallel
    {
        #pragma omp for schedule(static,1)
        for (i=0; i < count; i++)     // Process the nodes in parallel with a for loop.
            processwork(parr[i]);
    }

  • Linked lists without tasks: C++ STL

    See the file Linked_cpp.cpp

    std::vector<node *> nodelist;              // element type assumed to be node *
    for (p = head; p != NULL; p = p->next)
        nodelist.push_back(p);                 // Copy a pointer to each node into an array.

    int j = (int)nodelist.size();              // Count the number of items in the linked list.
    #pragma omp parallel for schedule(static,1)
    for (int i = 0; i < j; ++i)
        processwork(nodelist[i]);              // Process the nodes in parallel with a for loop.

                    C++, default sched.   C++, (static,1)   C, (static,1)
    One Thread      37 seconds            49 seconds        45 seconds
    Two Threads     47 seconds            32 seconds        28 seconds

    Results on an Intel dual core 1.83 GHz CPU, Intel IA-32 compiler 10.1 build 2.


  • Computers and random numbers

    We use "dice" to make random numbers:
      Given previous values, you cannot predict the next value.
      There are no patterns in the series … and it goes on forever.

    Computers are deterministic machines … set an initial state, run a sequence of predefined instructions, and you get a deterministic answer. By design, computers are not random and cannot produce random numbers.

    However, with some very clever programming, we can make "pseudo random" numbers that are as random as you need them to be … but only if you are very careful.

    Why do I care? Random numbers drive statistical methods used in countless applications: sample a large space of alternatives to find statistically good answers (Monte Carlo methods).

  • Monte Carlo Calculations: using random numbers to solve tough problems

    Sample a problem domain to estimate areas, compute probabilities, find optimal values, etc.

    Example: computing π with a digital dart board of side 2*r:
      Throw darts at the circle/square.
      The chance of falling in the circle is proportional to the ratio of areas:
          Ac = r*r*π
          As = (2*r)*(2*r) = 4*r*r
          P  = Ac/As = π/4
      Compute π by randomly choosing points, counting the fraction P that falls in the circle, and computing pi = 4*P.

    N = 10       π = 2.8
    N = 100      π = 3.16
    N = 1000     π = 3.148

  • Parallel Programmers love Monte Carlo algorithms

    Embarrassingly parallel: the parallelism is so easy it's embarrassing. Add two lines and you have a parallel program.

    #include "omp.h"
    static long num_trials = 10000;

    int main ()
    {
        long i; long Ncirc = 0;
        double pi, x, y;
        double r = 1.0;        // radius of circle. Side of square is 2*r
        seed(0, -r, r);        // The circle and square are centered at the origin
        #pragma omp parallel for private (x, y) reduction (+:Ncirc)
        for (i=0; i < num_trials; i++) {
            x = random();  y = random();              // throw a dart
            if ( x*x + y*y <= r*r ) Ncirc++;          // count hits inside the circle
        }
        pi = 4.0 * ((double)Ncirc / (double)num_trials);
        return 0;
    }

  • Linear Congruential Generator (LCG)

    LCG: easy to write, cheap to compute, portable, OK quality.

    random_next = (MULTIPLIER * random_last + ADDEND) % PMOD;
    random_last = random_next;

    If you pick the multiplier and addend correctly, the LCG has a period of PMOD.

    Picking good LCG parameters is complicated, so look it up (Numerical Recipes is a good source). I used the following:
        MULTIPLIER = 1366
        ADDEND     = 150889
        PMOD       = 714025

  • LCG code

    static long MULTIPLIER = 1366;
    static long ADDEND     = 150889;
    static long PMOD       = 714025;
    long random_last = 0;             // Seed the pseudo random sequence by setting random_last.

    double random ()
    {
        long random_next;
        random_next = (MULTIPLIER * random_last + ADDEND) % PMOD;
        random_last = random_next;
        return ((double)random_next / (double)PMOD);
    }

  • Running the PI_MC program with the LCG generator

    [Plot: Log10 relative error vs. Log10 number of samples, for LCG with one thread and LCG with 4 threads (three trials).]

    Run the same program the same way and get different answers! That is not acceptable!

    Issue: my LCG generator is not threadsafe.

    Program written using the Intel C/C++ compiler (10.0.659.2005) in Microsoft Visual Studio 2005 (8.0.50727.42), running on a dual-core laptop (Intel T2400 @ 1.83 GHz with 2 GB RAM) under Microsoft Windows XP.

  • LCG code: threadsafe version

    static long MULTIPLIER = 1366;
    static long ADDEND     = 150889;
    static long PMOD       = 714025;
    long random_last = 0;                      // random_last carries state between random number computations.
    #pragma omp threadprivate(random_last)     // To make the generator threadsafe, make random_last
                                               // threadprivate so each thread has its own copy.
    double random ()
    {
        long random_next;
        random_next = (MULTIPLIER * random_last + ADDEND) % PMOD;
        random_last = random_next;
        return ((double)random_next / (double)PMOD);
    }

  • Thread safe random number generators

    [Plot: Log10 relative error vs. Log10 number of samples, for LCG with one thread, LCG with 4 threads (three trials), and the threadsafe LCG with 4 threads.]

    The thread safe version gives the same answer each time you run the program.

    But for a large number of samples, its quality is lower than the one-thread result! Why?

  • Pseudo Random Sequences

    Random number generators (RNGs) define a sequence of pseudo-random numbers of length equal to the period of the RNG.

    In a typical problem, you grab a subsequence of the RNG range; the seed determines the starting point.

    Grab arbitrary seeds and you may generate overlapping sequences, e.g. three threads' sequences where the last one wraps at the end of the RNG period.

    Overlapping sequences = over-sampling and bad statistics … lower quality or even wrong answers!

  • Parallel random number generators

    Multiple threads cooperate to generate and use random numbers. Solutions:
      Replicate and pray.
      Give each thread a separate, independent generator.
      Have one thread generate all the numbers.
      Leapfrog … deal out sequence values "round robin", as if dealing a deck of cards.
      Block method … pick your seed so each thread gets a distinct contiguous block (see the sketch after this list).

    If done right, the last two can generate the same sequence regardless of the number of threads … nice for debugging, but not really needed scientifically.

    Other than "replicate and pray", these are difficult to implement. Be smart … buy a math library that does it right. Intel's Math Kernel Library supports all of these methods.
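    A minimal sketch of the block method, not taken from the tutorial's exercises: each thread jumps its LCG seed forward to the start of its own contiguous block of the sequence. The block size and the fast-forward helper lcg_skip are illustrative.

    #include <omp.h>

    static long MULTIPLIER = 1366;
    static long ADDEND     = 150889;
    static long PMOD       = 714025;

    long random_last = 0;
    #pragma omp threadprivate(random_last)

    // Advance an LCG state by n steps, one step at a time (fine for modest n;
    // a real library would fast-forward with modular exponentiation).
    static long lcg_skip(long seed, long n)
    {
        for (long k = 0; k < n; k++)
            seed = (MULTIPLIER * seed + ADDEND) % PMOD;
        return seed;
    }

    void seed_blocks(long samples_per_thread)
    {
        #pragma omp parallel
        {
            int id = omp_get_thread_num();
            // Thread id starts at the beginning of its own contiguous block of the sequence.
            random_last = lcg_skip(0, (long)id * samples_per_thread);
        }
    }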

  • MKL random number generators (RNG)

    MKL includes several families of RNGs in its vector statistics library, specialized to efficiently generate vectors of random numbers.

    #define BLOCK 100
    double buff[BLOCK];
    VSLStreamStatePtr stream;

    // Initialize a stream of pseudo random numbers: select the type of RNG and set the seed.
    vslNewStream(&stream, VSL_BRNG_WH, (int)seed_val);

    // Fill buff with BLOCK pseudo random numbers, uniformly distributed between low and hi.
    vdRngUniform (VSL_METHOD_DUNIFORM_STD, stream, BLOCK, buff, low, hi);

    // Delete the stream when you are done.
    vslDeleteStream( &stream );

  • Wichmann-Hill generators (WH)

    WH is a family of 273 parameter sets, each defining a non-overlapping and independent RNG.

    Easy to use: just make each stream threadprivate and initiate the RNG stream so each thread gets a unique WH RNG (a usage sketch follows).

    VSLStreamStatePtr stream;
    #pragma omp threadprivate(stream)

    vslNewStream(&stream, VSL_BRNG_WH+Thrd_ID, (int)seed);
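    A minimal sketch of how the threadprivate stream might be set up and used inside a parallel region, reusing the (older) MKL names from the slide above; the buffer size, seed and the wrapper fill_buffers are illustrative and not from the tutorial:

    #include <omp.h>
    #include <mkl_vsl.h>

    #define BLOCK 100

    VSLStreamStatePtr stream;
    #pragma omp threadprivate(stream)

    void fill_buffers(double lo, double hi, unsigned int seed)
    {
        #pragma omp parallel
        {
            double buff[BLOCK];
            int id = omp_get_thread_num();
            // Each thread gets its own Wichmann-Hill parameter set, hence an independent stream.
            vslNewStream(&stream, VSL_BRNG_WH + id, (int)seed);
            vdRngUniform(VSL_METHOD_DUNIFORM_STD, stream, BLOCK, buff, lo, hi);
            // ... use buff ...
            vslDeleteStream(&stream);
        }
    }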

  • Independent generator for each thread

    [Plot: Log10 relative error vs. Log10 number of samples, for WH with one thread, WH with 2 threads, and WH with 4 threads.]

    Notice that once you get beyond the high-error, small-sample-count range, adding threads doesn't decrease the quality of the random sampling.

  • Leap Frog method

    Interleave samples in the sequence of pseudo random numbers:
      Thread i starts at the ith number in the sequence.
      Stride through the sequence with stride length = number of threads.
      Result … the same sequence of values regardless of the number of threads.

    #pragma omp single
    {
        nthreads = omp_get_num_threads();
        iseed = PMOD/MULTIPLIER;                // just pick a seed
        pseed[0] = iseed;
        mult_n = MULTIPLIER;                    // One thread computes the offsets and the strided multiplier.
        for (i = 1; i < nthreads; ++i)          // LCG with Addend = 0, just to keep things simple.
        {
            iseed = (unsigned long long)((MULTIPLIER * iseed) % PMOD);
            pseed[i] = iseed;
            mult_n = (mult_n * MULTIPLIER) % PMOD;
        }
    }
    random_last = (unsigned long long) pseed[id];   // Each thread stores its offset starting point into its
                                                    // threadprivate "last random" value.

    A sketch of the per-call leapfrog update follows.
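    A minimal sketch, not shown on the slide, of the per-call update this setup enables: with ADDEND = 0, stepping the LCG forward by nthreads positions is a single multiply by the precomputed strided multiplier mult_n.

    static unsigned long long MULTIPLIER = 1366;
    static unsigned long long PMOD       = 714025;
    static unsigned long long mult_n;              // (MULTIPLIER ^ nthreads) % PMOD, set up as on the slide

    unsigned long long random_last = 0;
    #pragma omp threadprivate(random_last)

    // Return the next value of this thread's leapfrogged subsequence.
    double leapfrog_random()
    {
        random_last = (mult_n * random_last) % PMOD;   // jump nthreads positions in one step
        return (double)random_last / (double)PMOD;
    }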

  • Same sequence with many threads

    We can use the leapfrog method to generate the same answer for any number of threads:

    Steps        One thread    2 threads     4 threads
    1000         3.156         3.156         3.156
    10000        3.1168        3.1168        3.1168
    100000       3.13964       3.13964       3.13964
    1000000      3.140348      3.140348      3.140348
    10000000     3.141658      3.141658      3.141658

    Used the MKL library with two generator streams per computation: one for the x values (WH) and one for the y values (WH+1). Also used the leapfrog method to deal out iterations among threads.


  • OpenMP memory model

    OpenMP supports a shared memory model: all threads share an address space, but it can get complicated.

    [Diagram: proc1 … procN, each with its own cache (cache1 … cacheN), all connected to a shared memory holding a variable "a".]

    A memory model is defined in terms of:
      Coherence: behavior of the memory system when a single address is accessed by multiple threads.
      Consistency: orderings of reads, writes, or synchronizations (RWS) with various addresses and by multiple threads.

  • OpenMP Memory Model: Basic Terms

    [Diagram: source code in program order (Wa Wb Ra Rb …) is translated by the compiler into executable code in code order (Wb Rb Wa Ra …), i.e. reads and writes in any semantically equivalent order; each thread has a private view of memory (including threadprivate copies of a and b), and values reach shared memory in a commit order.]

  • Consistency: Memory Access Re-ordering

    Re-ordering:
      The compiler re-orders program order into the code order.
      The machine re-orders code order into the memory commit order.

    At a given point in time, the "private view" seen by a thread may be different from the view in shared memory.

    Consistency models define constraints on the orders of Reads (R), Writes (W) and Synchronizations (S) … i.e. how the values "seen" by a thread change as you change how ops follow (→) other ops. Possibilities include:
      R→R, W→W, R→W, R→S, S→S, W→S

  • Consistency

    Sequential consistency:
      In a multi-processor, ops (R, W, S) are sequentially consistent if:
        – They remain in program order for each processor.
        – They are seen to be in the same overall order by each of the other processors.
      Program order = code order = commit order.

    Relaxed consistency:
      Remove some of the ordering constraints for memory ops (R, W, S).

  • OpenMP and Relaxed Consistency

    OpenMP defines consistency as a variant of weak consistency (a producer/consumer flush sketch follows).
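    As the outline notes, the full flush discussion and the producer/consumer pattern are covered in an appendix. The following is only a minimal sketch of pairwise synchronization with flush, assuming at least two threads so the two sections run concurrently; it mirrors the general pattern rather than the appendix's exact code.

    #include <stdio.h>

    int main ()
    {
        double data = 0.0;
        int flag = 0;

        #pragma omp parallel sections shared(data, flag)
        {
            #pragma omp section                    /* producer */
            {
                data = 42.0;                       // produce the value
                #pragma omp flush                  // make data visible before raising the flag
                flag = 1;
                #pragma omp flush
            }
            #pragma omp section                    /* consumer */
            {
                int ready = 0;
                while (!ready) {                   // spin until the producer raises the flag
                    #pragma omp flush
                    ready = flag;
                }
                #pragma omp flush                  // re-read memory before using data
                printf("consumed %f\n", data);
            }
        }
        return 0;
    }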