CHAPTER 2: PARALLEL PROGRAMMING BACKGROUND
“By the end of this chapter, you should have obtained a basic understanding of how modern processors execute parallel programs & understand some rules of thumb for scaling performance of parallel applications.”
TRADITIONAL PARALLEL MODELS
Serial Model: SISD
Parallel Models: SIMD, MIMD, MISD*
S = Single, M = Multiple, I = Instruction, D = Data
VOCABULARY & NOTATION (2.1)
Task vs. Data: tasks are instructions that operate on data, modifying it or creating new data
Parallel computation: multiple tasks that must be coordinated and managed
Dependencies
Data: a task requires data from another task
Control: events/steps must be ordered (e.g., I/O)
TASK MANAGEMENT – FORK-JOIN
Fork: split control flow, creating a new control flow
Join: control flows are synchronized & merged
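The fork-join pattern above can be sketched with Python's standard library (a minimal illustration, not the book's notation; the function names are made up for this example):

```python
# A minimal fork-join sketch using Python's standard library.
# Fork: submit work to separate control flows; Join: wait for and merge results.
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

def fork_join_sum_of_squares(values):
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(square, v) for v in values]  # fork
        return sum(f.result() for f in futures)             # join

print(fork_join_sum_of_squares([1, 2, 3, 4]))  # 30
```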
GRAPHICAL NOTATION – FIG. 2.1
Symbols: Task, Data, Fork, Join, Dependency
STRATEGIES (2.2)
Data Parallelism: the best strategy for scalable parallelism
Parallelism that grows as the data set/problem size grows
Split the data set over a set of processors, with a task processing each subset
More Data → More Tasks
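A minimal data-parallelism sketch (illustrative only; the chunking scheme and function names are this example's assumptions): the data set is split into chunks and each worker task applies the same operation to its own chunk, so more data means more chunks and more tasks.

```python
# Data parallelism sketch: split the data set into chunks and give each
# worker (task) one chunk; every task runs the same operation on its portion.
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # each task applies the same operation to its own portion of the data
    return sum(x * 2 for x in chunk)

def parallel_double_sum(data, n_workers=4):
    chunks = [data[i::n_workers] for i in range(n_workers)]  # split the data
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(process_chunk, chunks))

print(parallel_double_sum(list(range(10))))  # 90
```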
STRATEGIES
Control Parallelism (or Functional Decomposition)
Different program functions run in parallel
Not scalable: the best speedup is a constant factor
As data grows, the parallelism doesn't
May have less overhead, or none
REGULAR VS. IRREGULAR PARALLELISM
Regular: tasks are similar, with predictable dependencies (e.g., matrix multiplication)
Irregular: tasks differ in ways that create unpredictable dependencies (e.g., a chess program)
Many problems contain combinations of both
HARDWARE MECHANISMS (2.3)
The two most important:
Thread Parallelism: implemented in HW using a separate flow of control for each worker – supports regular & irregular parallelism and functional decomposition
Vector Parallelism: implemented in HW with one flow of control operating on multiple data elements – supports regular and some irregular parallelism
BRANCH STATEMENTS – DETRIMENTAL TO PARALLELISM
• Locality
• Pipelining
• HOW?
MASKING – ALL CONTROL PATHS ARE EXECUTED, BUT UNWANTED RESULTS ARE MASKED OUT (NOT USED)
if (a & 1)
    a = 3*a + 1;
else
    a = a / 2;
The if/else contains a branch. With masking, both parts are executed in parallel and only one result is kept:
p = (a & 1)
t = 3*a + 1
if (p) a = t
t = a / 2
if (!p) a = t
No branches – a single flow of control. Masking works as if the code were written this way.
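The same masking idea can be traced in plain Python (an illustrative sketch; the function name is made up): both control paths are computed for every element, and the predicate selects which result to keep.

```python
# The branch-free masking idea from the slide, in Python: compute both
# control paths for every element, then use the predicate to keep one result.
def collatz_step_masked(values):
    out = []
    for a in values:
        p = a & 1          # predicate: is a odd?
        t_odd = 3 * a + 1  # "then" path, always computed
        t_even = a // 2    # "else" path, always computed
        out.append(t_odd if p else t_even)  # mask: keep only one result
    return out

print(collatz_step_masked([1, 2, 3, 4]))  # [4, 1, 10, 2]
```

Vector hardware does this per lane in a single flow of control, which is why some irregular code can still be vectorized.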
MACHINE MODELS (2.4)
Core
Functional Units
Registers
Cache memory – multiple levels
CACHE MEMORY
Blocks (cache lines): the amount fetched at once
Bandwidth: the amount transferred concurrently
Latency: the time to complete a transfer
Cache Coherence: consistency among copies
VIRTUAL MEMORY
Memory system = disk storage + chip memory
Allows programs larger than physical memory to run
Allows multiprocessing
Swaps pages
HW maps logical to physical addresses
Data locality is important to efficiency
Page Fault, Thrashing
PARALLEL MEMORY ACCESS
Cache (multiple levels)
NUMA – Non-Uniform Memory Access
PRAM – Parallel Random Access Machine model
A theoretical model; assumes uniform memory access times
PERFORMANCE ISSUES (2.4.2)
Data Locality
Choose code segments that fit in cache
Design to use data in close proximity
Align data with cache lines (blocks)
Dynamic grain size is a good strategy
PERFORMANCE ISSUES
Arithmetic Intensity: a large number of on-chip compute operations for every off-chip memory access
Otherwise, communication overhead is high
Related: grain size
FLYNN’S CATEGORIES
Serial Model: SISD
Parallel Models:
SIMD – array processors, vector processors
MIMD – heterogeneous computers, clusters
MISD* – not useful
CLASSIFICATION BASED ON MEMORY
Shared Memory: each processor accesses a common memory
Access issues; no message passing; each processor usually has a small local memory
Distributed Memory: each processor has its own local memory
Explicit messages are sent between processors
EVOLUTION (2.4.4)
GPU – graphics accelerators, now general purpose
Offload – running computations on an accelerator (GPU or co-processor) rather than the regular CPUs
Heterogeneous – different kinds of hardware working together
Host Processor – handles distribution, I/O, etc.
PERFORMANCE (2.5)
Various interpretations of performance:
Reduce the total time for a computation → Latency
Increase the rate at which a series of results is computed → Throughput
Reduce power consumption
*Each is a possible performance target
LATENCY & THROUGHPUT (2.5.1)
Latency: the time to complete a task
Throughput: the rate at which tasks are completed
Units per time (e.g., jobs per hour)
OMIT SECTION 2.5.3 – POWER
SPEEDUP & EFFICIENCY (2.5.2)
Sp = T1 / Tp
T1: time to complete on 1 processor
Tp: time to complete on P processors
REMEMBER: “time” means number of instructions
E = Sp / P = T1 / (P * Tp)
E = 1 is “perfect” efficiency
Linear Speedup – occurs when the algorithm runs P times faster on P processors
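The two formulas translate directly into code (a small sketch; the example timings are made-up numbers):

```python
def speedup(t1, tp):
    """Sp = T1 / Tp"""
    return t1 / tp

def efficiency(t1, tp, p):
    """E = Sp / P = T1 / (P * Tp)"""
    return t1 / (p * tp)

# A hypothetical 100-unit serial job finishing in 30 units on 4 processors:
print(speedup(100, 30))        # ~3.33
print(efficiency(100, 30, 4))  # ~0.83  (E = 1 would be "perfect")
```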
SUPERLINEAR SPEEDUP (P.57)
Efficiency > 1 – very rare
Often due to HW variations (e.g., more total cache available in parallel)
Working in parallel may eliminate some work that must be done serially
AMDAHL & GUSTAFSON-BARSIS (2.5.4, 2.5.5)
Amdahl: speedup is limited by the amount of serial work required
G-B: as problem size grows, parallel work grows faster than serial work, so speedup increases
See examples
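The contrast between the two laws can be sketched with their standard formulas, where s is the serial fraction of the work (a minimal illustration, not one of the book's examples):

```python
def amdahl_speedup(s, p):
    # Amdahl: Sp = 1 / (s + (1 - s)/p); capped near 1/s no matter how big p gets
    return 1.0 / (s + (1.0 - s) / p)

def gustafson_speedup(s, p):
    # Gustafson-Barsis scaled speedup: Sp = p - s*(p - 1); keeps growing with p
    # because the parallel part of the problem grows with the problem size
    return p - s * (p - 1)

print(amdahl_speedup(0.1, 10))     # ~5.26, and never more than 1/0.1 = 10
print(gustafson_speedup(0.1, 10))  # 9.1, and still growing as p grows
```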
WORK
Work = total operations (time) for a task
T1 = Work
Ideally, P * Tp = Work, so T1 = P * Tp — why is this rare in practice?
WORK-SPAN MODEL (2.5.6)
Describes dependencies among tasks & allows estimated times
Represents tasks as a DAG (Figure 2.8)
Critical Path – the longest path through the DAG
Span – the minimum possible time; the time of the critical path
Assumes greedy task scheduling – no wasted resources or time
Parallel Slack – excess parallelism; more tasks than can be scheduled at once
WORK-SPAN MODEL
Speedup <= Work/Span
Upper bound: speedup can be no more than Work / Span
ASYMPTOTIC COMPLEXITY (2.5.7)
Comparing algorithms!
Time Complexity: growth of execution time in terms of input size
Space Complexity: growth of memory requirements in terms of input size
Ignores constants; machine independent
BIG OH NOTATION (P.66)
Big Oh of F(n) – upper bound:
O(F(n)) = { G(n) | there exist positive constants c and n0 such that |G(n)| ≤ c·F(n) for all n ≥ n0 }
*Memorize
BIG OMEGA & BIG THETA
Big Omega – Functions that define Lower Bound
Big Theta – Functions that define a Tight Bound – Both Upper & Lower Bounds
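The three bounds can be stated together in the same style as the Big Oh definition on the previous slide (a standard formulation, with n0 as the threshold):

```latex
\begin{aligned}
O(F(n))      &= \{\, G(n) \mid \exists\, c, n_0 > 0 :\ |G(n)| \le c\,F(n)\ \text{for all } n \ge n_0 \,\} \\
\Omega(F(n)) &= \{\, G(n) \mid \exists\, c, n_0 > 0 :\ |G(n)| \ge c\,F(n)\ \text{for all } n \ge n_0 \,\} \\
\Theta(F(n)) &= O(F(n)) \cap \Omega(F(n))
\end{aligned}
```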
CONCURRENCY VS. PARALLELISM
Parallel: work actually occurring at the same time; limited by the number of processors
Concurrent: tasks in progress at the same time but not necessarily executing simultaneously; “unlimited”
Omit 2.5.8 & most of 2.5.9
PITFALLS OF PARALLEL PROGRAMMING (2.6)
Pitfalls = issues that can cause problems
Synchronization is often required:
Too little → non-determinism
Too much → reduced scaling, increased time & possible deadlock
RACE CONDITIONS (2.6.1)
A situation in which the final result depends on the order in which tasks complete their work
Occurs when concurrent tasks share a memory location & at least one of them writes to it
Unpredictable – races don’t always cause errors
Interleaving: instructions from 2 or more tasks are executed in an alternating manner
RACE CONDITIONS ~ EXAMPLE 2.2
Task A:        Task B:
A = X          B = X
A += 1         B += 2
X = A          X = B
Assume X is initially 0.
What are the possible results?
So, Tasks A & B are not REALLY independent!
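One way to see every possible result is to enumerate all interleavings of the two tasks mechanically (an illustrative sketch; the state encoding is this example's choice): the final value of X can be 1, 2, or 3 depending on execution order.

```python
# Enumerate every interleaving of Example 2.2's two tasks; the final X
# depends on the order, which is exactly what makes it a race.
def interleavings(a, b):
    if not a: yield b; return
    if not b: yield a; return
    for rest in interleavings(a[1:], b): yield [a[0]] + rest
    for rest in interleavings(a, b[1:]): yield [b[0]] + rest

# Each step is a pure function from state -> state; state = (X, A, B).
task_a = [lambda s: (s[0], s[0], s[2]),       # A = X
          lambda s: (s[0], s[1] + 1, s[2]),   # A += 1
          lambda s: (s[1], s[1], s[2])]       # X = A
task_b = [lambda s: (s[0], s[1], s[0]),       # B = X
          lambda s: (s[0], s[1], s[2] + 2),   # B += 2
          lambda s: (s[2], s[1], s[2])]       # X = B

finals = set()
for order in interleavings(task_a, task_b):
    state = (0, 0, 0)  # X is initially 0
    for step in order:
        state = step(state)
    finals.add(state[0])
print(sorted(finals))  # [1, 2, 3] -- three different final values for X
```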
RACE CONDITIONS ~ EXAMPLE 2.3
Task A:        Task B:
X = 1          Y = 1
A = Y          B = X
Assume X & Y are initially 0.
What are the possible results?
SOLUTIONS TO RACE CONDITIONS (2.6.2)
Mutual exclusion, locks, semaphores, atomic operations
Mechanisms that prevent concurrent access to a memory location – one task completes its access before the other is allowed to start
Does not always solve the problem – the result may still depend on which task executes first
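A minimal sketch of mutual exclusion with a lock (illustrative; the variable names match Example 2.2, not any code from the text): the lock makes each read-modify-write atomic, so the lost-update outcomes disappear.

```python
# Mutual exclusion sketch: a lock makes the read-modify-write of Example 2.2
# atomic, so one task completes its update before the other starts.
import threading

X = 0
x_lock = threading.Lock()

def add(amount):
    global X
    with x_lock:        # only one task may hold the lock at a time
        tmp = X         # read
        tmp += amount   # modify
        X = tmp         # write
    # without the lock, the other task could interleave between read and write

threads = [threading.Thread(target=add, args=(n,)) for n in (1, 2)]
for t in threads: t.start()
for t in threads: t.join()
print(X)  # always 3 with the lock (1 or 2 are also possible without it)
```

Note the slide's caveat still holds: the lock only serializes the updates; if the two updates did not commute, the result would still depend on which task ran first.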
DEADLOCK (2.6.3)
A situation in which 2 or more processes cannot proceed because each is waiting on another – everything STOPS
Recommendations for avoidance:
Avoid mutual exclusion
Hold at most 1 lock at a time
Acquire locks in the same order
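The "acquire locks in the same order" rule can be sketched with the classic two-account transfer (an illustrative example; the Account class and ordering-by-id are this sketch's assumptions): because both transfer directions lock the accounts in one global order, two opposite transfers can never each hold one lock while waiting for the other.

```python
# Deadlock avoidance by consistent lock ordering: both transfer directions
# acquire the two account locks in a single global order (here, by object id).
import threading

class Account:
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    first, second = sorted((src, dst), key=id)  # consistent global order
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount

a, b = Account(100), Account(100)
t1 = threading.Thread(target=transfer, args=(a, b, 30))
t2 = threading.Thread(target=transfer, args=(b, a, 10))
t1.start(); t2.start(); t1.join(); t2.join()
print(a.balance, b.balance)  # 80 120
```

If each thread instead locked `src` then `dst` directly, the two opposite transfers could each grab one lock and wait forever on the other — a circular wait.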
DEADLOCK – NECESSARY & SUFFICIENT CONDITIONS
1. Mutual Exclusion: the resources involved are non-shareable. At least one resource must be held in a non-shareable mode; that is, only one process at a time claims exclusive control of the resource. If another process requests that resource, the requesting process must be delayed until the resource has been released.
2. Hold and Wait: a process holds resources already allocated to it while waiting for additional resources. There must exist a process that is holding at least one resource while waiting for resources currently held by other processes.
3. No Preemption: resources already allocated to a process cannot be forcibly taken away. A resource can only be released voluntarily by the process holding it, after that process has finished using it.
4. Circular Wait: the processes form a circular chain in which each process is waiting for a resource held by the next process in the chain.
STRANGLED SCALING (2.6.4)
Fine-Grain Locking – use many locks on small sections rather than 1 lock on a large section
Notes:
1 large lock is faster to manage but blocks other processes
Consider the time cost of setting/releasing many locks
Example: lock a row of a matrix, not the entire matrix
LACK OF LOCALITY (2.6.5)
Two assumptions behind good locality:
Temporal Locality – the same location will be accessed again soon
Spatial Locality – nearby locations will be accessed soon
Reminder: a cache line is the block that is retrieved on a miss
Currently, a cache miss costs roughly 100 cycles
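Spatial locality can be demonstrated with a toy cache-line model (a deliberately simplified sketch: one cached line, hypothetical line size of 8 elements): traversing a row-major matrix row by row touches each line once, while traversing it column by column misses on every access.

```python
# Toy cache-line model showing spatial locality: an 8x8 row-major matrix
# traversed row by row reuses each "cache line"; column by column does not.
LINE = 8   # elements per cache line (hypothetical size)
N = 8      # matrix is N x N, stored row-major

def count_misses(addresses):
    cached, misses = None, 0
    for addr in addresses:
        line = addr // LINE
        if line != cached:      # single-line "cache" for simplicity
            misses += 1
            cached = line
    return misses

row_major = [i * N + j for i in range(N) for j in range(N)]
col_major = [i * N + j for j in range(N) for i in range(N)]
print(count_misses(row_major))  # 8  (one miss per cache line)
print(count_misses(col_major))  # 64 (every access misses)
```

At ~100 cycles per miss, that 8x difference in misses dominates the cost of the traversal.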
LOAD IMBALANCE (2.6.6)
Uneven distribution of work over processors
Related to the decomposition of the problem
Few vs. many tasks – what are the implications?
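The effect of imbalance can be sketched numerically (an illustrative example with made-up task costs): the same total work assigned two different ways gives very different makespans, and the slowest worker is what limits speedup.

```python
# Load imbalance sketch: the same total work distributed two ways over
# 4 workers; the makespan (slowest worker's total) limits the speedup.
def makespan(task_costs, assignment, n_workers=4):
    totals = [0] * n_workers
    for cost, worker in zip(task_costs, assignment):
        totals[worker] += cost
    return max(totals)

costs   = [8, 7, 6, 5, 4, 3, 2, 1]
blocked = [0, 0, 1, 1, 2, 2, 3, 3]   # big tasks piled onto worker 0
paired  = [0, 1, 2, 3, 3, 2, 1, 0]   # pair each large task with a small one

print(makespan(costs, blocked))  # 15 -> imbalanced: worker 0 is the bottleneck
print(makespan(costs, paired))   # 9  -> balanced: total work 36 / 4 workers
```

This is one reason decomposing into many small tasks helps: a scheduler has more freedom to even out the load.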
OVERHEAD (2.6.7)
Always present in parallel processing: launching tasks, synchronization
Small vs. larger numbers of processors – what are the implications?
~the end of chapter 2~