CHAPTER 2: PARALLEL PROGRAMMING BACKGROUND
“By the end of this chapter, you should have obtained a basic understanding of how modern processors execute parallel programs & understand some rules of thumb for scaling performance of parallel applications.”
TRADITIONAL PARALLEL MODELS
Serial Model: SISD
Parallel Models: SIMD, MIMD, MISD*
S = Single, M = Multiple, I = Instruction, D = Data
VOCABULARY & NOTATION (2.1)
Task vs. Data: tasks are instructions that operate on data, modifying it or creating new data
Parallel computation: multiple tasks that must be coordinated and managed
Dependencies
Data: a task requires data from another task
Control: events/steps must be ordered (e.g., I/O)
TASK MANAGEMENT – FORK-JOIN
Fork: split control flow, creating a new control flow
Join: control flows are synchronized & merged
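The fork-join pattern above can be sketched with Python's standard library (a minimal illustration, not the book's notation; the function names are made up for this example):

```python
# A minimal fork-join sketch using Python's standard library.
# Fork: submit work to separate control flows; Join: wait for and merge results.
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

def fork_join_sum_of_squares(values):
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(square, v) for v in values]  # fork
        return sum(f.result() for f in futures)             # join

print(fork_join_sum_of_squares([1, 2, 3, 4]))  # 30
```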
GRAPHICAL NOTATION – FIG. 2.1
Symbols: Task, Data, Fork, Join, Dependency
STRATEGIES (2.2)
Data Parallelism: the best strategy for scalable parallelism
Parallelism that grows as the data set/problem size grows
Split the data set over a set of processors, with a task processing each subset
More Data → More Tasks
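A minimal data-parallelism sketch (illustrative only; the chunking scheme and function names are this example's assumptions): the data set is split into chunks and each worker task applies the same operation to its own chunk, so more data means more chunks and more tasks.

```python
# Data parallelism sketch: split the data set into chunks and give each
# worker (task) one chunk; every task runs the same operation on its portion.
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # each task applies the same operation to its own portion of the data
    return sum(x * 2 for x in chunk)

def parallel_double_sum(data, n_workers=4):
    chunks = [data[i::n_workers] for i in range(n_workers)]  # split the data
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(process_chunk, chunks))

print(parallel_double_sum(list(range(10))))  # 90
```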
STRATEGIES
Control Parallelism (or Functional Decomposition)
Different program functions run in parallel
Not scalable: the best speedup is a constant factor
As data grows, the parallelism doesn't
May have less overhead, or none
REGULAR VS. IRREGULAR PARALLELISM
Regular: tasks are similar, with predictable dependencies (e.g., matrix multiplication)
Irregular: tasks differ in ways that create unpredictable dependencies (e.g., a chess program)
Many problems contain combinations of both
HARDWARE MECHANISMS (2.3)
The two most important:
Thread Parallelism: implemented in HW using a separate flow of control for each worker – supports regular & irregular parallelism and functional decomposition
Vector Parallelism: implemented in HW with one flow of control operating on multiple data elements – supports regular and some irregular parallelism
BRANCH STATEMENTS – DETRIMENTAL TO PARALLELISM
• Locality
• Pipelining
• HOW?
MASKING – ALL CONTROL PATHS ARE EXECUTED, BUT UNWANTED RESULTS ARE MASKED OUT (NOT USED)
if (a & 1)
    a = 3*a + 1;
else
    a = a / 2;
The if/else contains a branch. With masking, both parts are executed in parallel and only one result is kept:
p = (a & 1)
t = 3*a + 1
if (p) a = t
t = a / 2
if (!p) a = t
No branches – a single flow of control. Masking works as if the code were written this way.
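The same masking idea can be traced in plain Python (an illustrative sketch; the function name is made up): both control paths are computed for every element, and the predicate selects which result to keep.

```python
# The branch-free masking idea from the slide, in Python: compute both
# control paths for every element, then use the predicate to keep one result.
def collatz_step_masked(values):
    out = []
    for a in values:
        p = a & 1          # predicate: is a odd?
        t_odd = 3 * a + 1  # "then" path, always computed
        t_even = a // 2    # "else" path, always computed
        out.append(t_odd if p else t_even)  # mask: keep only one result
    return out

print(collatz_step_masked([1, 2, 3, 4]))  # [4, 1, 10, 2]
```

Vector hardware does this per lane in a single flow of control, which is why some irregular code can still be vectorized.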
MACHINE MODELS (2.4)
Core
Functional Units
Registers
Cache memory – multiple levels
CACHE MEMORY
Blocks (cache lines): the amount fetched at once
Bandwidth: the amount transferred concurrently
Latency: the time to complete a transfer
Cache Coherence: consistency among copies
VIRTUAL MEMORY
Memory system = disk storage + chip memory
Allows programs larger than physical memory to run
Allows multiprocessing
Swaps pages
HW maps logical to physical addresses
Data locality is important to efficiency
Page Fault, Thrashing
PARALLEL MEMORY ACCESS
Cache (multiple levels)
NUMA – Non-Uniform Memory Access
PRAM – Parallel Random Access Machine model
A theoretical model; assumes uniform memory access times
PERFORMANCE ISSUES (2.4.2)
Data Locality
Choose code segments that fit in cache
Design to use data in close proximity
Align data with cache lines (blocks)
Dynamic grain size is a good strategy
PERFORMANCE ISSUES
Arithmetic Intensity: a large number of on-chip compute operations for every off-chip memory access
Otherwise, communication overhead is high
Related: grain size
FLYNN’S CATEGORIES
Serial Model: SISD
Parallel Models:
SIMD – array processors, vector processors
MIMD – heterogeneous computers, clusters
MISD* – not useful
CLASSIFICATION BASED ON MEMORY
Shared Memory: each processor accesses a common memory
Access issues; no message passing; each processor usually has a small local memory
Distributed Memory: each processor has its own local memory
Explicit messages are sent between processors
EVOLUTION (2.4.4)
GPU – graphics accelerators, now general purpose
Offload – running computations on an accelerator (GPU or co-processor) rather than the regular CPUs
Heterogeneous – different kinds of hardware working together
Host Processor – handles distribution, I/O, etc.
PERFORMANCE (2.5)
Various interpretations of performance:
Reduce the total time for a computation → Latency
Increase the rate at which a series of results is computed → Throughput
Reduce power consumption
*Each is a possible performance target
LATENCY & THROUGHPUT (2.5.1)
Latency: the time to complete a task
Throughput: the rate at which tasks are completed
Units per time (e.g., jobs per hour)
OMIT SECTION 2.5.3 – POWER
SPEEDUP & EFFICIENCY (2.5.2)
Sp = T1 / Tp
T1: time to complete on 1 processor
Tp: time to complete on P processors
REMEMBER: “time” means number of instructions
E = Sp / P = T1 / (P * Tp)
E = 1 is “perfect” efficiency
Linear Speedup – occurs when the algorithm runs P times faster on P processors
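The two formulas translate directly into code (a small sketch; the example timings are made-up numbers):

```python
def speedup(t1, tp):
    """Sp = T1 / Tp"""
    return t1 / tp

def efficiency(t1, tp, p):
    """E = Sp / P = T1 / (P * Tp)"""
    return t1 / (p * tp)

# A hypothetical 100-unit serial job finishing in 30 units on 4 processors:
print(speedup(100, 30))        # ~3.33
print(efficiency(100, 30, 4))  # ~0.83  (E = 1 would be "perfect")
```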
SUPERLINEAR SPEEDUP (P.57)
Efficiency > 1 – very rare
Often due to HW variations (e.g., more total cache available in parallel)
Working in parallel may eliminate some work that must be done serially
AMDAHL & GUSTAFSON-BARSIS (2.5.4, 2.5.5)
Amdahl: speedup is limited by the amount of serial work required
G-B: as problem size grows, parallel work grows faster than serial work, so speedup increases
See examples
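The contrast between the two laws can be sketched with their standard formulas, where s is the serial fraction of the work (a minimal illustration, not one of the book's examples):

```python
def amdahl_speedup(s, p):
    # Amdahl: Sp = 1 / (s + (1 - s)/p); capped near 1/s no matter how big p gets
    return 1.0 / (s + (1.0 - s) / p)

def gustafson_speedup(s, p):
    # Gustafson-Barsis scaled speedup: Sp = p - s*(p - 1); keeps growing with p
    # because the parallel part of the problem grows with the problem size
    return p - s * (p - 1)

print(amdahl_speedup(0.1, 10))     # ~5.26, and never more than 1/0.1 = 10
print(gustafson_speedup(0.1, 10))  # 9.1, and still growing as p grows
```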
WORK
Work = total operations (time) for a task
T1 = Work
Ideally, P * Tp = Work, so T1 = P * Tp — why is this rare in practice?
WORK-SPAN MODEL (2.5.6)
Describes dependencies among tasks & allows estimated times
Represents tasks as a DAG (Figure 2.8)
Critical Path – the longest path through the DAG
Span – the minimum possible time; the time of the critical path
Assumes greedy task scheduling – no wasted resources or time
Parallel Slack – excess parallelism; more tasks than can be scheduled at once
WORK-SPAN MODEL
Speedup <= Work/Span
Upper bound: speedup can be no more than Work / Span
ASYMPTOTIC COMPLEXITY (2.5.7)
Comparing algorithms!
Time Complexity: growth of execution time in terms of input size
Space Complexity: growth of memory requirements in terms of input size
Ignores constants; machine independent
BIG OH NOTATION (P.66)
Big Oh of F(n) – upper bound:
O(F(n)) = { G(n) | there exist positive constants c and n0 such that |G(n)| ≤ c·F(n) for all n ≥ n0 }
*Memorize
BIG OMEGA & BIG THETA
Big Omega – Functions that define Lower Bound
Big Theta – Functions that define a Tight Bound – Both Upper & Lower Bounds
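The three bounds can be stated together in the same style as the Big Oh definition on the previous slide (a standard formulation, with n0 as the threshold):

```latex
\begin{aligned}
O(F(n))      &= \{\, G(n) \mid \exists\, c, n_0 > 0 :\ |G(n)| \le c\,F(n)\ \text{for all } n \ge n_0 \,\} \\
\Omega(F(n)) &= \{\, G(n) \mid \exists\, c, n_0 > 0 :\ |G(n)| \ge c\,F(n)\ \text{for all } n \ge n_0 \,\} \\
\Theta(F(n)) &= O(F(n)) \cap \Omega(F(n))
\end{aligned}
```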
CONCURRENCY VS. PARALLELISM
Parallel: work actually occurring at the same time; limited by the number of processors
Concurrent: tasks in progress at the same time but not necessarily executing simultaneously; “unlimited”
Omit 2.5.8 & most of 2.5.9
PITFALLS OF PARALLEL PROGRAMMING (2.6)
Pitfalls = issues that can cause problems
Synchronization is often required:
Too little → non-determinism
Too much → reduced scaling, increased time & possible deadlock
RACE CONDITIONS (2.6.1)
A situation in which the final result depends on the order in which tasks complete their work
Occurs when concurrent tasks share a memory location & at least one of them writes to it
Unpredictable – races don’t always cause errors
Interleaving: instructions from 2 or more tasks are executed in an alternating manner
RACE CONDITIONS ~ EXAMPLE 2.2
Task A:        Task B:
A = X          B = X
A += 1         B += 2
X = A          X = B
Assume X is initially 0.
What are the possible results?
So, Tasks A & B are not REALLY independent!
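One way to see every possible result is to enumerate all interleavings of the two tasks mechanically (an illustrative sketch; the state encoding is this example's choice): the final value of X can be 1, 2, or 3 depending on execution order.

```python
# Enumerate every interleaving of Example 2.2's two tasks; the final X
# depends on the order, which is exactly what makes it a race.
def interleavings(a, b):
    if not a: yield b; return
    if not b: yield a; return
    for rest in interleavings(a[1:], b): yield [a[0]] + rest
    for rest in interleavings(a, b[1:]): yield [b[0]] + rest

# Each step is a pure function from state -> state; state = (X, A, B).
task_a = [lambda s: (s[0], s[0], s[2]),       # A = X
          lambda s: (s[0], s[1] + 1, s[2]),   # A += 1
          lambda s: (s[1], s[1], s[2])]       # X = A
task_b = [lambda s: (s[0], s[1], s[0]),       # B = X
          lambda s: (s[0], s[1], s[2] + 2),   # B += 2
          lambda s: (s[2], s[1], s[2])]       # X = B

finals = set()
for order in interleavings(task_a, task_b):
    state = (0, 0, 0)  # X is initially 0
    for step in order:
        state = step(state)
    finals.add(state[0])
print(sorted(finals))  # [1, 2, 3] -- three different final values for X
```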
RACE CONDITIONS ~ EXAMPLE 2.3
Task A:        Task B:
X = 1          Y = 1
A = Y          B = X
Assume X & Y are initially 0.
What are the possible results?
SOLUTIONS TO RACE CONDITIONS (2.6.2)
Mutual exclusion, locks, semaphores, atomic operations
Mechanisms that prevent concurrent access to a memory location – one task completes its access before the other is allowed to start
Does not always solve the problem – the result may still depend on which task executes first
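A minimal sketch of mutual exclusion with a lock (illustrative; the variable names match Example 2.2, not any code from the text): the lock makes each read-modify-write atomic, so the lost-update outcomes disappear.

```python
# Mutual exclusion sketch: a lock makes the read-modify-write of Example 2.2
# atomic, so one task completes its update before the other starts.
import threading

X = 0
x_lock = threading.Lock()

def add(amount):
    global X
    with x_lock:        # only one task may hold the lock at a time
        tmp = X         # read
        tmp += amount   # modify
        X = tmp         # write
    # without the lock, the other task could interleave between read and write

threads = [threading.Thread(target=add, args=(n,)) for n in (1, 2)]
for t in threads: t.start()
for t in threads: t.join()
print(X)  # always 3 with the lock (1 or 2 are also possible without it)
```

Note the slide's caveat still holds: the lock only serializes the updates; if the two updates did not commute, the result would still depend on which task ran first.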
DEADLOCK (2.6.3)
A situation in which 2 or more processes cannot proceed because each is waiting on another – everything STOPS
Recommendations for avoidance:
Avoid mutual exclusion
Hold at most 1 lock at a time
Acquire locks in the same order
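The "acquire locks in the same order" rule can be sketched with the classic two-account transfer (an illustrative example; the Account class and ordering-by-id are this sketch's assumptions): because both transfer directions lock the accounts in one global order, two opposite transfers can never each hold one lock while waiting for the other.

```python
# Deadlock avoidance by consistent lock ordering: both transfer directions
# acquire the two account locks in a single global order (here, by object id).
import threading

class Account:
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    first, second = sorted((src, dst), key=id)  # consistent global order
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount

a, b = Account(100), Account(100)
t1 = threading.Thread(target=transfer, args=(a, b, 30))
t2 = threading.Thread(target=transfer, args=(b, a, 10))
t1.start(); t2.start(); t1.join(); t2.join()
print(a.balance, b.balance)  # 80 120
```

If each thread instead locked `src` then `dst` directly, the two opposite transfers could each grab one lock and wait forever on the other — a circular wait.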
DEADLOCK – NECESSARY & SUFFICIENT CONDITIONS
1. Mutual Exclusion: the resources involved are non-shareable. At least one resource must be held in a non-shareable mode; that is, only one process at a time claims exclusive control of the resource. If another process requests that resource, the requesting process must be delayed until the resource has been released.
2. Hold and Wait: a process holds resources already allocated to it while waiting for additional resources. There must exist a process that is holding at least one resource while waiting for resources currently held by other processes.
3. No Preemption: resources already allocated to a process cannot be forcibly taken away. A resource can only be released voluntarily by the process holding it, after that process has finished using it.
4. Circular Wait: the processes form a circular chain in which each process is waiting for a resource held by the next process in the chain.
STRANGLED SCALING (2.6.4)
Fine-Grain Locking – use many locks on small sections rather than 1 lock on a large section
Notes:
1 large lock is faster to manage but blocks other processes
Consider the time cost of setting/releasing many locks
Example: lock a row of a matrix, not the entire matrix
LACK OF LOCALITY (2.6.5)
Two assumptions behind good locality:
Temporal Locality – the same location will be accessed again soon
Spatial Locality – nearby locations will be accessed soon
Reminder: a cache line is the block that is retrieved on a miss
Currently, a cache miss costs roughly 100 cycles
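Spatial locality can be demonstrated with a toy cache-line model (a deliberately simplified sketch: one cached line, hypothetical line size of 8 elements): traversing a row-major matrix row by row touches each line once, while traversing it column by column misses on every access.

```python
# Toy cache-line model showing spatial locality: an 8x8 row-major matrix
# traversed row by row reuses each "cache line"; column by column does not.
LINE = 8   # elements per cache line (hypothetical size)
N = 8      # matrix is N x N, stored row-major

def count_misses(addresses):
    cached, misses = None, 0
    for addr in addresses:
        line = addr // LINE
        if line != cached:      # single-line "cache" for simplicity
            misses += 1
            cached = line
    return misses

row_major = [i * N + j for i in range(N) for j in range(N)]
col_major = [i * N + j for j in range(N) for i in range(N)]
print(count_misses(row_major))  # 8  (one miss per cache line)
print(count_misses(col_major))  # 64 (every access misses)
```

At ~100 cycles per miss, that 8x difference in misses dominates the cost of the traversal.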
LOAD IMBALANCE (2.6.6)
Uneven distribution of work over processors
Related to the decomposition of the problem
Few vs. many tasks – what are the implications?
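The effect of imbalance can be sketched numerically (an illustrative example with made-up task costs): the same total work assigned two different ways gives very different makespans, and the slowest worker is what limits speedup.

```python
# Load imbalance sketch: the same total work distributed two ways over
# 4 workers; the makespan (slowest worker's total) limits the speedup.
def makespan(task_costs, assignment, n_workers=4):
    totals = [0] * n_workers
    for cost, worker in zip(task_costs, assignment):
        totals[worker] += cost
    return max(totals)

costs   = [8, 7, 6, 5, 4, 3, 2, 1]
blocked = [0, 0, 1, 1, 2, 2, 3, 3]   # big tasks piled onto worker 0
paired  = [0, 1, 2, 3, 3, 2, 1, 0]   # pair each large task with a small one

print(makespan(costs, blocked))  # 15 -> imbalanced: worker 0 is the bottleneck
print(makespan(costs, paired))   # 9  -> balanced: total work 36 / 4 workers
```

This is one reason decomposing into many small tasks helps: a scheduler has more freedom to even out the load.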
OVERHEAD (2.6.7)
Always present in parallel processing: launching tasks, synchronization
Small vs. larger numbers of processors – what are the implications?
~the end of chapter 2~