me964 high performance computing for engineering applications

ME964High Performance Computing for Engineering Applications

Parallel Programming PatternsDiscussion of Midterm Project

Oct. 21, 2008

Before we get started…

Last Time Parallel Programming patterns

The “Finding Concurrency Design Space” The “Algorithm Structure Design Space”

Today Parallel Programming patterns

Discuss the “Supporting Structures Design Space” Discussion of the Midterm Project

Assigned today, due on 11/18 or 11/09 (based on level of difficulty)

Other issues: I will be in Milwaukee on Th, Mik will be the lecturer - tracing the

evolution of the hardware support for the graphics pipeline

2

Key Parallel Programming Steps

1) Find the concurrency in the problem

2) Structure the algorithm so that concurrency can be exploited

3) Implement the algorithm in a suitable programming environment

4) Execute and tune the performance of the code on a parallel system

3

What’s Comes Next?

Finding Concurrency

Implementation Mechanisms

Supporting Structures

Algorithm Structure

Focus on thisfor a while

4

Before We Dive In…

You hover in this design space when you ponder whether an MPI or an SPMD or a event driven simulation environment is what it takes to optimally design your algorithm

We are not going to focus on this process too much since it’s clear that we are stuck with SPMD All that we’ll be doing is to compare SPMD with other parallel programming

paradigms to understand when SPMD is good and when it is bad

Also, we will not cover the “Implementation Mechanisms” design space since we don’t need to wrestle with how to spawn and join threads, how to do synchronization, etc. This would be Chapter 6 of the book

5

Supporting Structures ~Program and Data Models~

Fork/Join

Master/Worker

SPMD

Program Models

Loop ParallelismDistributed Array

Shared Queue

Shared Data

Data Models

These are not necessarily mutually

exclusive. 6

Program Structuring Patterns

SPMD (Single Program, Multiple Data) All PE’s (Processor Elements) execute the same program in

parallel Each PE has its own data Each PE uses a unique ID to access its portion of data Different PE can follow different paths through the same code This is essentially the CUDA Grid model

Master/Worker

Loop Parallelism

Fork/Join7

Program Structuring Patterns(the rest of the pack)

Master/Worker A Master thread sets up a pool of worker threads and a bag of tasks Workers execute concurrently, removing tasks until done

Loop Parallelism Loop iterations execute in parallel FORTRAN do-all (truly parallel), do-across (with dependence) Very common in OpenMP

Fork/Join Most general way of creation of threads (the POSIX standard) Can be regarded as a very low level approach in which you use the OS

to manage parallelism

8

Review: Algorithm Structure

Algorithm Design

Organize by Task

Organize by Data

Organize by Data Flow

Linear Recursive Linear Recursive

TaskParallelism

Divide andConquer

GeometricDecomposition

RecursiveData

Regular Irregular

Pipeline Event Driven

9

Algorithm Structures vs.

Supporting Structures

Task Parallel. Divide/Conquer Geometric Decomp.

Recursive Data Pipeline Event-based

SPMD ☺☺☺☺ ☺☺☺ ☺☺☺☺ ☺☺ ☺☺☺ ☺☺

Loop Parallel ☺☺☺☺ ☺☺ ☺☺☺

Master/Worker ☺☺☺☺ ☺☺ ☺ ☺ ☺ ☺

Fork/Join ☺☺ ☺☺☺☺ ☺☺ ☺☺☺☺ ☺☺☺☺

Four smilies is the best (Source: Mattson, et al.) 10

Supporting Structures vs. Program Models

OpenMP MPI CUDA

SPMD ☺☺☺ ☺☺☺☺ ☺☺☺☺Loop

Parallel ☺☺☺☺ ☺Master/Slave

☺☺ ☺☺☺

Fork/Join ☺☺☺

11

More on SPMD

Dominant coding style of scalable parallel computing MPI code is mostly developed in SPMD style Many OpenMP code is also in SPMD (next to loop parallelism) Particularly suitable for algorithms based on data parallelism,

geometric decomposition, divide and conquer.

Main advantage Tasks and their interactions visible in one piece of source code,

no need to correlated multiple sources

SPMD is by far the most commonly used pattern for structuring parallel programs.

12

Typical SPMD Program Phases

Initialize Establish localized data structure and communication channels

Obtain a unique identifier Each thread acquires a unique identifier, typically in the range from 0 to

N-1, where N is the number of threads. Both OpenMP and CUDA have built-in support for this.

Distribute Data Decompose global data into chunks and localize them, or Sharing/replicating major data structure using thread ID to associate

subset of the data to threads

Run the core computation More details in next slide…

Finalize Reconcile global data structure, prepare for the next major iteration 13

Core Computation Phase

Thread IDs are used to differentiate behavior of threads

Use thread ID in loop index calculations to split loop iterations among threads

Use conditions based on thread ID to branch to their specific actions

Both can have very different performance results and code complexity depending on the

way they are done. 14

A Simple Example

Assume The computation being parallelized has 1,000,000 iterations.

Sequential code:

Num_steps = 1000000;

for (i=0; i< num_steps, i++) {…}

15

SPMD Code Version 1

Assign a chunk of iterations to each thread The last thread also finishes up the remainder iterations

num_steps = 1000000;

i_start = my_id * (num_steps/num_threads);i_end = i_start + (num_steps/num_threads);if (my_id == (num_threads-1)) i_end = num_steps;

for (i = i_start; i < i_end; i++) {….}Reconciliation of results across threads if necessary.

16

Problems with Version 1

The last thread executes more iterations than others

The number of extra iterations is up to the total number of threads – 1 This is not a big problem when the number of threads is

small When there are thousands of threads, this can create

serious load imbalance problems.

Also, the extra if statement is a typical source of “branch divergence” in CUDA programs.

17

SPMD Code Version 2 Assign one more iterations to some of the threads

int rem = num_steps % num_threads;

i_start = my_id * (num_steps/num_threads);i_end = i_start + (num_steps/num_threads);

if (rem != 0) { if (my_id < rem) {

i_start += my_id;i_end += (my_id +1);

} else {

i_start += rem;i_end += rem;

}.

Less load imbalance

More branch divergence.

18

SPMD Code Version 3

Use cyclic distribution of iteration

num_steps = 1000000;

for (i = my_id; i < num_steps; i+= num_threads) {….}

Less load imbalance

Loop branch divergence in the last Warp

Data padding further eliminates divergence.19

Comparing the Three Versions

ID=0 ID=1 ID=3ID=2

ID=0 ID=1 ID=3ID=2

ID=0 ID=1 ID=3ID=2

Version 1

Version 2

Version 3

Padded version1 1 may be bestfor some data access patterns.20

Data Styles

Shared Data All threads share a major data structure This is what CUDA supports

Shared Queue All threads see a “thread safe” queue that maintains ordering of

data communication Very relevant in conjunction with the Master/Worker scenarios

Distributed Array Decomposed and distributed among threads Limited support in CUDA Shared Memory Typically an issue in NUMA layouts

21

22

End: Parallel Programming Patterns

Begin: Overview Reminder of Semester

The Reminder of the Semester There will be one more lecture on CUDA (in November) There will be three lectures on MPI There will be about four to five guest lectures

On massively parallel hardware design On grid computing (the Condor project) On the N-body problem

Midterm exam on 11/20

Midterm Project (assigned today, due on 11/17 or sooner) I’ll explain why “sooner” shortly

Final Project (due on last Friday, Dec. 19, at 11:59 PM)23

Midterm Project The topic: Produce a Collision Detection “engine”

Deal with spheres or ellipsoids

There will be two levels of difficulty: The entry level

You implement a simple “brute force” approach Deal only with spheres of various radii Due Date: November 9, 11:59 PM

The advanced level Apply a sophisticated design to produce a parallel algorithm that is very efficient Deal with ellipsoids

Note that spheres are a particular case of ellipsoids Due Date: November 17, 11:59 PM Most likely make this engine a part of your Final Project

Independent of the level embraced, you’ll be presenting your work on November 18, during regular class hours You’ll have 10 minutes to make a PowerPoint presentation. The presentations are

part of the project (to be submitted for grading) Please observe the Nov. 17 deadline for presentations for both levels

24

Final Project

Work on something related to your research

I can assist you with selecting a topic

I will help you with the design

There is a “default” Final Project, in case you can’t choose

The Final Project: Further Comments

There is a connection between the Midterm and Final Projects:

It is up to you to decide what you want to work on for the Final Project

However, if you don’t know what to work on, I’ll assign you a Granular Dynamics problem (this is the “default” Final Project)

An important part of this problem is the collision detection stage

26 Granular Dynamics Problem

The Final Project: Concluding Remarks

To conclude, the default Final Project will look like this:

Final Project = Midterm Project + Some Extra Headache Granular Material Dynamics

27

Keep in mind: If you plan to do the default final project, think ahead and

implement a collision detection engine that is suitable to a scenario where the bodies slightly change their location each time you have to do the contact detection This is because in granular flow the particles move in space Maybe you can run a “cheaper” collision detection after an initial

“exhaustive” one You’ll need to look at problems with millions of colliding bodies

28

End: Overview Reminder of Semester

Begin: Midterm Project

The Collision Problem(The Midterm Project)

Problem Statement: A large set of spheres or ellipsoids is given to you Provide collision state information for each pair of bodies that are colliding

The bodies that you are dealing with are of spherical shape Spheres are of arbitrary radius R, where Rlow < R < Rhigh

The bound values Rlow and Rhigh are given to you Their location is given to you NOTE: The ellipsoid case is similar (skipped here the discussion)

You will have to validate your results against “Bullet”, which is an open source game development tool.

29

Proposed Workflow

Generate an initial distribution of the spheres (part of Project 1)

Load data (part of Project 2)

Run collision detection on the device (part of Project 2)

Save results in a file (part of Project 2)

Run collision detection using the Bullet library (part of Project 3)

The utility will tell you if you “passed” or “failed” (part of Project 3)

30

The proposed workflow below will be captured in a Visual Studio Solution (with three Projects) that will be provided to you (zipped file).You are responsible for the part in red font below.

Project template Solution contains three projects

First project is used to generate the input data (distribution of the spheres along with their radius)

Second project loads input data and checks for collisions This is the project you modify The main driver calls a function that invokes the collision detection

kernel Pretty much like you’ve had to do in your HW

Third project does the Bullet part followed up by the validation

Uses ‘C’ function notation Main file uses C++ classes and objects Allows C++ and C code to work together 31

Spheres: Possible Contact Cases

Convention: Color code for object A: bluish Color code for object B: greenish

Sphere contacts sphere One contact point Normal - from center of B to center of A

Sphere interpenetrates sphere Two contact points

Points on objects furthest in Normal - from center of B to center of A

B

A

A

B

32

Possible Contact Cases (Contd.)

Sphere exactly on top of sphere Two points returned

Points on surface that intersect the x-axis Normal is on +x axis

Sphere inside sphere Two points returned

Points on surface that intersect the x-axis Normal is on +x axis

Spheres are not in contact Nothing to report, ignore

B

A

BA

33BA

Bullet Physics Library

Bullet is a cross platform open source physics library.

Can simulate many types of objects Rigid bodies (concave, convex, triangular mesh, etc.) Cloth, soft bodies

Supports joints and constraints

Geared towards game development. Speed vs. accuracy: comes on the side of speed

Bullet’s collision detection will be used to verify your results

34

Screenshot, Bullet

35

Relevant VariablesAll variables retuned per contact by Bullet

btVector3 localPointA btVector3 localPointB btVector3 positionWorldOnB btVector3 positionWorldOnA btVector3 normalWorldOnB btScalar distance1btVector3 lateralFrictionDir1 btVector3 lateralFrictionDir2int partId0 int partId1 int index0 int index1 btScalar combinedFriction btScalar combinedRestitution void * userPersistentData btScalar appliedImpulse bool lateralFrictionInitialized int lifeTime btScalar appliedImpulseLateral1 btScalar appliedImpulseLateral2

Variables you need to return and calculatebtVector3 positionWorldOnAbtVector3 positionWorldOnB btVector3 normalWorldOnB int objectIdA (note: this is the Id associated with A)int objectIdB (note: this is the Id associated with B)

Additional parameters for Final Project*:

localPointAlocalPointBbtVector3 lateralFrictionDir1 btVector3 lateralFrictionDir2distance1

36

Variable DescriptionpositionWorldOnB (float3) Absolute Collision location on object ApositionWorldOnA (float3) Absolute Collision location on object BnormalWorldOnB (float3) Collision Normal Vector from B to AobjectIdA (int) Object A IDobjectIdB (int) Object B ID (Id starts from 0)

Other parameters that need to be implemented for Final Project (if you take this route):

localPointA (float3) Collision point w/ respect to center of object AlocalPointB (float3) Collision point w/ respect to center of object Bdistance1 (float) Distance between collision pointslateralFrictionDir1 (float3) Vector orthogonal to normal and Dir2lateralFrictionDir2 (float3) Vector orthogonal to normal and Dir1

37

The Passed/Failed Issue The comparison is going to be made based on

The contact location expressed in global reference frame There will be an epsilon value that controls this test

The contact normal Your normal and Bullet’s normal should be within 10±

For comparison purposes, your contacts should be ordered such that

I still owe you two things: How are body A and body B chosen? What to do with this collision case:

38

Common Sense Advice

Donald Knuth (1974): Premature optimization is the root of all evil

Get it to run first, add the bells and whistles later on…

Embrace a Crawl – Walk – Run approach

39

Collision Detection ~ Some Possible Approaches ~

Brute force – simplest and slowest Ok to implement, unless you want to make this your final project, in

which case you’ll have to implement one of the other three methods below

It relies on an exhaustive search: N(N-1)/2 tests (given N bodies)

Baraff – sorting based (the method is sometimes called “culling”

Harada – rely on voxel and texture, not recommended

Scott Le Grand – spatial subdivision method

40

References The methods of Scott Le Grand and Harada are explained in

great detail in GPU Gems 3 (you can borrow it from me)

I also have the following references (found them on the web, contact me if you need them):

Naga, K. G., R. Stephane, C. L. Ming, and M. Dinesh, 2003: CULLIDE: interactive collision detection between complex models in large environments using graphics hardware. Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, Eurographics Association.

Baraff, D., 1997: An Introduction to Physically Based Modeling: Rigid Body Simulation II—Nonpenetration Constraints. SIGGRAPH Course Notes.

41

Brute Force

Iterate over all combinations of object pairs and perform distance checks on them

Store pairs that are colliding into list and compute required information

Iterate over each pair and calculate additional contact data.

42

Baraff – Sorting axis

Every time an object p begins on the axis add p to stack Every time an object p ends on the axis remove p from stack Each time an object is added to stack compare it to those still

on stack p1 check with p3 and p6

p5 check with p3

An object is colliding with another if they overlap on all three axes.

Multi GPU relevant: yes, it can speed up the problem by simultaneously culling based on more axes

p3

p6 p5

p1 p2

p4

b3 b6 e6 e1 b5 b2 e3 b4 e5 e4 e2b1

43

Example: Why this is tricky at times

44

(The issue of false positives)

Harada , GPU Gems 3 Mostly meant to deal with spheres, and of the same radii

“Voxel” collision method Voxel ´ 3d pixel

Spheres are used to represent other objects Fill object’s space using spheres

Use many 2D textures to store data. Textures make up 3d grid Optimize by using texture memory Easier to coalesce reads and writes

45

Harada (Contd.)

Each voxel in texture contains up to 4 object id’s Stored in RGBA values Objects and cells need to be sized so

that this is true Constant radius is preferred but some

variation can exist Too much variation could cause

instances with >4 objects in one cell’s space

Test each cell with its 26 neighbors

Iterate over all cells in this way

46

Scott Le Grand, GPU Gems 3 Spatial subdivision method

Divide space into three dimensional grid

Collision state of spheres cannot be modified by different threads Cells numbered by processing order (1, 2,…)

First all #1 cells are processed, then #2 cells and so on Insures that two threads do not overwrite collision data for one

sphere

Spheres get assigned a home cell id and overlapping “phantom” cell id’s Array gets sorted by cell number and type

47

Scott Le Grand (Contd.)

From this list determine which cells have multiple objects inside These cells will be checked for collisions

Task is simplified because cell id list is already sorted by cell number By checking for a change in cell id the number of objects in that cell

can be determined. If cell only has phantom objects it will not be checked

Phantom objects - objects whose centroid is not in that cell but still intersects it Cell id must have at least one home id and >1 objects to be checked

Multi GPU relevant: yes, it can increase the size of the problem being addressed 48

me964 high performance computing for engineering applications

Documents