An Introduction to Parallel Programming, Ing. Andrea Marongiu (a.marongiu@unibo.it)
Post on 19-Dec-2015
![Page 1: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/1.jpg)
An Introduction To PARALLEL PROGRAMMING
Ing. Andrea Marongiu (a.marongiu@unibo.it)
![Page 2: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/2.jpg)
The Multicore Revolution is Here!
- More instruction-level parallelism is hard to find
  - Very complex designs needed for small gain
  - Thread-level parallelism appears alive and well
- Clock frequency scaling is slowing drastically
  - Too much power and heat when pushing the envelope
- Cannot communicate across the chip fast enough
  - Better to design small local units with short paths
- Effective use of billions of transistors
  - Easier to reuse a basic unit many times
- Potential for very easy scaling
  - Just keep adding processors/cores for higher (peak) performance
![Page 3: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/3.jpg)
Vocabulary in the Multi Era
AMP, Asymmetric MP: Each processor has local memory, tasks statically allocated to one processor
SMP, Shared-Memory MP: Processors share memory, tasks dynamically scheduled to any processor
![Page 4: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/4.jpg)
Vocabulary in the Multi Era
Heterogeneous: Specialization among processors. Often different instruction sets. Usually AMP design.
Homogeneous: all processors have the same instruction set, can run any task, usually SMP design.
![Page 5: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/5.jpg)
Future Embedded Systems
![Page 6: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/6.jpg)
The First Software Crisis
- 60’s and 70’s
- PROBLEM: Assembly language programming
  - Need to get abstraction and portability without losing performance
- SOLUTION: High-level languages (Fortran and C)
  - Provided a “common machine language” for uniprocessors
![Page 7: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/7.jpg)
The Second Software Crisis
- 80’s and 90’s
- PROBLEM: Inability to build and maintain complex and robust applications requiring multi-million lines of code developed by hundreds of programmers
  - Need for composability, malleability and maintainability
- SOLUTION: Object-oriented programming (C++ and Java)
  - Better tools and software engineering methodology (design patterns, specification, testing)
![Page 8: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/8.jpg)
The Third Software Crisis
- Today
- PROBLEM: Solid boundary between hardware and software
  - High-level languages abstract away the hardware
  - Sequential performance is left behind by Moore’s Law
- SOLUTION: What’s under the hood?
  - Language features for architectural awareness
![Page 9: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/9.jpg)
The Software becomes the Problem, AGAIN
- Parallelism is required to gain performance
  - Parallel hardware is “easy” to design
  - Parallel software is (very) hard to write
- Fundamentally hard to grasp true concurrency
  - Especially in complex software environments
- Existing software assumes a single processor
  - Might break in new and interesting ways
  - Software written for multitasking is not guaranteed to run correctly on a multiprocessor
![Page 10: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/10.jpg)
Parallel Programming Principles
- Coverage (Amdahl’s Law)
- Communication/Synchronization
- Granularity
- Load Balance
- Locality
![Page 11: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/11.jpg)
Coverage
More cores, each less powerful (and less power-hungry), to achieve the same performance?
![Page 12: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/12.jpg)
Coverage
Amdahl's Law: The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
Speedup = old running time / new running time = 100 seconds / 60 seconds = 1.67
![Page 13: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/13.jpg)
Amdahl’s Law
p = fraction of work that can be parallelized
n = the number of processors
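The formula on this slide did not survive extraction. The usual statement of Amdahl's law, consistent with the definitions of p and n above, is speedup = 1 / ((1 - p) + p / n). A minimal sketch in C follows; the function names `amdahl_speedup` and `amdahl_limit` are illustrative and not taken from the slides.

```c
/* Amdahl's law: p is the parallelizable fraction of the work (0..1),
   n is the number of processors. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Upper bound on the speedup as n grows without limit (requires p < 1). */
double amdahl_limit(double p)
{
    return 1.0 / (1.0 - p);
}
```

For instance, p = 0.8 and n = 2 give 1 / (0.2 + 0.4), roughly 1.67, the same ratio as a 100-second run shrinking to 60 seconds.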
![Page 14: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/14.jpg)
Implications of Amdahl’s Law
Speedup tends to 1/(1-p) as number of processors tends to infinity
Parallel programming is worthwhile when programs have a lot of work that is parallel in nature
![Page 15: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/15.jpg)
Overhead of Parallelism
Given enough parallel work, this is the biggest barrier to getting desired speedup
Parallelism overheads include:
- cost of starting a thread or process
- cost of communicating shared data
- cost of synchronizing
- extra (redundant) computation
Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e. large granularity), but not so large that there is not enough parallel work
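To make the effect of overhead concrete, here is a toy model that is an assumption of this sketch rather than something stated on the slides: every processor beyond the first adds a fixed cost `overhead`, expressed as a fraction of the sequential run time. The function name is hypothetical.

```c
/* Toy model (an assumption, not from the slides): normalized run time on
   n processors is (1 - p) + p / n + overhead * (n - 1), where `overhead`
   lumps together thread start-up, communication and synchronization. */
double speedup_with_overhead(double p, int n, double overhead)
{
    double time = (1.0 - p) + p / (double)n + overhead * (double)(n - 1);
    return 1.0 / time;
}
```

With p = 0.9 and overhead = 0.01, the modelled speedup is about 3.57 on 10 processors but only about 2.38 on 30: beyond some point, adding processors costs more than it gains.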
![Page 17: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/17.jpg)
Communication/Synchronization
- Only a few programs are “embarrassingly” parallel
- Programs have sequential parts and parallel parts
- Need to orchestrate parallel execution among processors
  - Synchronize threads to make sure dependencies in the program are preserved
  - Communicate results among threads to ensure a consistent view of data being processed
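As a sketch of synchronization preserving a dependency, the two loops below must run in order: the second reads values the first produces. The example assumes an OpenMP-capable C compiler (for example GCC with `-fopenmp`); OpenMP is not mentioned on the slides, and the function name `two_phase` is illustrative. Compiled without OpenMP, the pragmas are ignored and the code runs sequentially with the same result.

```c
/* Phase 1 doubles every input element; phase 2 adds each doubled element
   to its right-hand neighbour (wrapping around at the end). */
void two_phase(const int *in, int *tmp, int *out, int n)
{
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; i++)
            tmp[i] = 2 * in[i];
        /* Implicit barrier here: no thread starts phase 2 until every
           element of tmp has been written. */

        #pragma omp for
        for (int i = 0; i < n; i++)
            out[i] = tmp[i] + tmp[(i + 1) % n];
    }
}
```

Without the barrier between the two loops, a thread could read `tmp[(i + 1) % n]` before another thread had written it.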
![Page 18: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/18.jpg)
Communication/Synchronization
- Shared memory
  - Communication is implicit: one copy of data shared among many threads
  - Atomicity, locking and synchronization essential for correctness
  - Synchronization is typically in the form of a global barrier
- Distributed memory
  - Communication is explicit through messages
  - Cores access local memory
  - Data distribution and communication orchestration is essential for performance
  - Synchronization is implicit in messages
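The shared-memory case can be sketched with a histogram: all threads update one copy of `bins`, so each increment has to be atomic. This again assumes an OpenMP-capable compiler (an assumption, not from the slides); the function name `histogram` is illustrative.

```c
/* Count how many values fall into each bin.
   Every value is assumed to lie in the range [0, nbins). */
void histogram(const int *values, int n, int *bins, int nbins)
{
    for (int b = 0; b < nbins; b++)
        bins[b] = 0;

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        /* Several threads may hit the same bin at once, so the
           read-modify-write must be performed atomically. */
        #pragma omp atomic
        bins[values[i]]++;
    }
}
```

Dropping the `atomic` directive would leave the communication implicit but the result incorrect whenever two threads increment the same bin simultaneously.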
![Page 20: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/20.jpg)
Granularity
Granularity is a qualitative measure of the ratio of computation to communication
Computation stages are typically separated from periods of communication by synchronization events
![Page 21: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/21.jpg)
Granularity
- Fine-grain parallelism
  - Low computation to communication ratio
  - Small amounts of computational work between communication stages
  - Less opportunity for performance enhancement
  - High communication overhead
- Coarse-grain parallelism
  - High computation to communication ratio
  - Large amounts of computational work between communication events
  - More opportunity for performance increase
  - Harder to load balance efficiently
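One way to experiment with grain size is to make it a parameter. In the sketch below, which assumes an OpenMP-capable compiler (not named on the slides), `chunk` is the number of loop iterations a thread takes each time it asks for work; the function name `sum_squares` is illustrative.

```c
/* Sum of squares of v[0..n-1]. `chunk` (must be >= 1) sets the grain:
   a small chunk means many scheduling events with little work each (fine
   grain), a large chunk means few events with much work each (coarse). */
long sum_squares(const int *v, int n, int chunk)
{
    long total = 0;
    #pragma omp parallel for schedule(dynamic, chunk) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += (long)v[i] * v[i];
    return total;
}
```

The result is the same for any chunk size; only the balance between scheduling overhead and available parallel work changes.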
![Page 23: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/23.jpg)
The Load Balancing Problem
- Processors that finish early have to wait for the processor with the largest amount of work to complete
  - Leads to idle time, lowers utilization
- Particularly urgent with barrier synchronization
[Figure: Core 1 to Core 4 with BALANCED workloads, next to Core 1 to Core 4 with UNBALANCED workloads]
Slowest core dictates overall execution time
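The two quantities on this slide can be written down directly. The sketch below is illustrative: the function names and the definition of utilization as busy time divided by available core-time are assumptions of this example, not taken from the slides.

```c
/* Execution time of a phase that ends in a barrier: the slowest core decides. */
double makespan(const double *core_time, int ncores)
{
    double worst = core_time[0];
    for (int c = 1; c < ncores; c++)
        if (core_time[c] > worst)
            worst = core_time[c];
    return worst;
}

/* Fraction of the available core-time spent on useful work (1.0 = no idling). */
double utilization(const double *core_time, int ncores)
{
    double busy = 0.0;
    for (int c = 0; c < ncores; c++)
        busy += core_time[c];
    return busy / (makespan(core_time, ncores) * ncores);
}
```

Four cores busy for 5 time units each finish at time 5 with utilization 1.0; cores busy for 2, 3, 5 and 10 units finish at time 10 with utilization 0.5.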
![Page 24: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/24.jpg)
Static Load Balancing
- The programmer makes the decisions and assigns a fixed amount of work to each processing core a priori
- Works well for homogeneous multicores
  - All cores are the same
  - Each core has an equal amount of work
- Not so well for heterogeneous multicores
  - Some cores may be faster than others
  - Work distribution is uneven
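A common form of static assignment is to split a loop of n iterations into one contiguous block per core before execution starts. The sketch below shows one such split; the type `range_t` and the function `static_block` are illustrative names, not from the slides.

```c
/* Half-open iteration range [begin, end) owned by one core. */
typedef struct {
    int begin;
    int end;
} range_t;

/* Split n iterations into ncores contiguous blocks whose sizes differ by
   at most one. `core` is the index of the core, from 0 to ncores - 1. */
range_t static_block(int n, int ncores, int core)
{
    int base = n / ncores;    /* iterations every core receives */
    int extra = n % ncores;   /* the first `extra` cores receive one more */
    range_t r;
    r.begin = core * base + (core < extra ? core : extra);
    r.end = r.begin + base + (core < extra ? 1 : 0);
    return r;
}
```

With n = 10 and 4 cores, the blocks are [0, 3), [3, 6), [6, 8) and [8, 10). The split is fixed regardless of how long each iteration actually takes, which is why it suits identical cores with even work.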
![Page 25: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/25.jpg)
Dynamic Load Balancing
- The workload is partitioned into small tasks. Tasks available for processing are pushed onto a work queue
- When a core finishes its allocated task, it takes further work from the queue. The process continues until all tasks have been assigned to some core for processing.
- Ideal for codes where work is uneven, and for heterogeneous multicores
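The benefit of a work queue can be estimated with a small sequential simulation. This is a sketch under simplifying assumptions that are not from the slides: identical cores, no queue overhead, tasks taken in order, and a round-robin assignment as the static baseline. All names are illustrative.

```c
#define MAX_CORES 64

/* Finish time when each task in the queue goes to whichever core
   becomes free first (dynamic assignment). */
double dynamic_makespan(const double *task_time, int ntasks, int ncores)
{
    double free_at[MAX_CORES] = { 0.0 };   /* when each core goes idle */
    double finish = 0.0;
    if (ncores < 1 || ncores > MAX_CORES)
        return -1.0;                       /* unsupported core count */
    for (int t = 0; t < ntasks; t++) {
        int next = 0;                      /* core that becomes free first */
        for (int c = 1; c < ncores; c++)
            if (free_at[c] < free_at[next])
                next = c;
        free_at[next] += task_time[t];     /* that core pulls the next task */
    }
    for (int c = 0; c < ncores; c++)
        if (free_at[c] > finish)
            finish = free_at[c];
    return finish;
}

/* Finish time when task t is pre-assigned to core t % ncores (static). */
double static_makespan(const double *task_time, int ntasks, int ncores)
{
    double busy[MAX_CORES] = { 0.0 };
    double finish = 0.0;
    if (ncores < 1 || ncores > MAX_CORES)
        return -1.0;
    for (int t = 0; t < ntasks; t++)
        busy[t % ncores] += task_time[t];
    for (int c = 0; c < ncores; c++)
        if (busy[c] > finish)
            finish = busy[c];
    return finish;
}
```

For task times {4, 1, 1, 1, 4, 1, 1, 1} on 4 cores, the round-robin assignment gives both long tasks to the same core and finishes at time 8, while the queue-based assignment finishes at time 5.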
![Page 27: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/27.jpg)
Memory Access Latency
- Uniform Memory Access (UMA): shared memory
  - Centrally located shared memory
  - All processors are equidistant (same access times)
- Non-Uniform Memory Access (NUMA)
  - Shared memory: processors have the same address space
    - Data is directly accessible by all, but cost depends on the distance
    - Placement of data affects performance
  - Distributed memory: processors have private address spaces
    - Data access is local, but cost of messages depends on the distance
    - Communication must be efficiently architected
![Page 28: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/28.jpg)
Locality of Memory Accesses (UMA Shared Memory)
Parallel computation is serialized due to memory contention and lack of bandwidth
![Page 29: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/29.jpg)
Locality of Memory Accesses (UMA Shared Memory)
Distribute data to relieve contention and increase effective bandwidth
![Page 30: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/30.jpg)
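Locality of Memory Accesses (NUMA Shared Memory)
```c
#include <stdlib.h>
#define N 64                        /* illustrative size; the slide leaves n undefined */
static int A[N][N], B[N];
static int foo(void) { return 1; }  /* placeholder; foo is not defined on the slide */
static int goo(void) { return 2; }  /* placeholder; goo is not defined on the slide */

int main(void)
{
    /* Task 1: one write per row of A, at a random column */
    for (int i = 0; i < N; i++)
        A[i][rand() % N] = foo();
    /* Task 2: sequential writes to B */
    for (int j = 0; j < N; j++)
        B[j] = goo();
    return 0;
}
```
[Figure: four CPUs, each with a local SPM, connected through an INTERCONNECT to a SHARED MEMORY]
Once parallel tasks have been assigned to different processors...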
![Page 31: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/31.jpg)
Locality of Memory Accesses (NUMA Shared Memory)
(Same two-task code as on the previous slide.)
[Figure: the same four-CPU system with SPMs, INTERCONNECT and SHARED MEMORY, with arrays A and B marked]
...physical placement of data can have a great impact on performance!
Memory reference cost = Bus latency + Off-chip memory latency (100 cycles)
![Page 32: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/32.jpg)
Locality of Memory Accesses (NUMA Shared Memory)
(Same two-task code as on the previous slide.)
[Figure: the same four-CPU system with SPMs, INTERCONNECT and SHARED MEMORY]
Memory reference cost = Bus latency + On-chip memory latency (2-20 cycles)
![Page 33: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/33.jpg)
Locality of Memory Accesses (NUMA Shared Memory)
(Same two-task code as on the previous slide.)
[Figure: the same four-CPU system with SPMs, INTERCONNECT and SHARED MEMORY]
Memory reference cost = Local memory latency (1 cycle)
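The three cost formulas from the preceding slides can be compared in a few lines of C. In this sketch the bus latency and the on-chip latency are parameters, because the slides give no value for the former and only a 2-20 cycle range for the latter; the enum and function names are illustrative.

```c
/* Where the referenced data lives, following the three cases on the slides. */
typedef enum { OFF_CHIP, ON_CHIP, LOCAL } placement_t;

/* Cycles for one memory reference. bus_latency and on_chip_latency are
   assumptions supplied by the caller; 100 and 1 are the slide's figures. */
unsigned reference_cost(placement_t where, unsigned bus_latency,
                        unsigned on_chip_latency)
{
    switch (where) {
    case OFF_CHIP: return bus_latency + 100u;
    case ON_CHIP:  return bus_latency + on_chip_latency;
    case LOCAL:    return 1u;
    }
    return 0u;   /* not reached for valid placements */
}

/* Cycles spent on memory by a loop that makes n references to one placement. */
unsigned long loop_cost(placement_t where, unsigned long n,
                        unsigned bus_latency, unsigned on_chip_latency)
{
    return n * reference_cost(where, bus_latency, on_chip_latency);
}
```

Assuming, purely for illustration, a bus latency of 10 cycles and an on-chip latency of 5 cycles, a loop with 1000 references costs 110000 cycles off-chip, 15000 cycles on-chip and 1000 cycles from local memory.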
![Page 34: An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu (a.marongiu@unibo.it)](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d2e5503460f94a05ef8/html5/thumbnails/34.jpg)
Locality in Communication (Message Passing)