PARALLEL PROCESSOR ORGANIZATIONS

Jehan-François Pâris
[email protected]
Chapter Organization
• Overview
• Writing parallel programs
• Multiprocessor organizations
• Hardware multithreading
• Alphabet soup (SISD, SIMD, MIMD, …)
• Roofline performance model
OVERVIEW
The hardware side
• Many parallel processing solutions
  – Multiprocessor architectures
    • Two or more microprocessor chips
    • Multiple architectures
  – Multicore architectures
    • Several processors on a single chip
The software side
• Two ways for software to exploit the parallel processing capabilities of hardware
  – Job-level parallelism
    • Several sequential processes run in parallel
    • Easy to implement (the OS does the job!)
  – Process-level parallelism
    • A single program runs on several processors at the same time
WRITING PARALLEL PROGRAMS
Overview
• Some problems are embarrassingly parallel
  – Many computer graphics tasks
  – Brute-force searches in cryptography or password guessing
• Parallelization is much more difficult for other applications
  – Communication overhead among sub-tasks
  – Amdahl's law
  – Balancing the load
Amdahl's Law
• Assume a sequential process takes
  – tp seconds to perform operations that could be performed in parallel
  – ts seconds to perform purely sequential operations
• The maximum speedup will be

  (tp + ts)/ts
Balancing the load
• Must ensure that the workload is equally divided among all the processors
• The worst case is when one processor does much more work than all the others
Example (I)
• Computation partitioned among n processors
• One of them does 1/m of the work, with m < n
  – That processor becomes a bottleneck
• Maximum expected speedup: n
• Actual maximum speedup: m
Example (II)
• Computation partitioned among 64 processors
• One of them does 1/8 of the work
• Maximum expected speedup: 64
• Actual maximum speedup: 8
A last issue
• Humans like to address issues one after the other
  – We have meeting agendas
  – We do not like to be interrupted
  – We write sequential programs
René Descartes

• Seventeenth-century French philosopher
• Invented
  – Cartesian coordinates
  – Methodical doubt
    • [To] never accept anything for true which I did not clearly know to be such
• Proposed a scientific method based on four precepts
Method's third rule
• The third, to conduct my thoughts in such order that, by commencing with objects the simplest and easiest to know, I might ascend by little and little, and, as it were, step by step, to the knowledge of the more complex; assigning in thought a certain order even to those objects which in their own nature do not stand in a relation of antecedence and sequence.
MULTIPROCESSOR ORGANIZATIONS
Shared memory multiprocessors
[Diagram: several processing units (PU), each with its own cache, connected through an interconnection network to shared RAM and I/O]
Shared memory multiprocessor
• Can offer
  – Uniform memory access to all processors (UMA)
    • Easiest to program
  – Non-uniform memory access to all processors (NUMA)
    • Can scale up to larger sizes
    • Offers faster access to nearby memory
Computer clusters
[Diagram: several computers, each with its own PU, cache, and RAM, connected through an interconnection network]
Computer clusters
• Very easy to assemble
• Can take advantage of high-speed LANs
  – Gigabit Ethernet, Myrinet, …
• Data exchanges must be done through message passing
Message passing (I)
• If processor P wants to access data in the main memory of processor Q, it must
  – Send a request to Q
  – Wait for a reply
• For this to work, processor Q must have a thread
  – Waiting for messages from other processors
  – Sending them replies
Message passing (II)
• In a shared memory architecture, each processor can directly access all data
• A proposed solution
  – Distributed shared memory offers the users of a cluster the illusion of a single address space for their shared data
  – Still has performance issues
When things do not add up
• Memory capacity is very important for big computing applications
  – If the data can fit into main memory, the computation will run much faster
A problem
• A company replaced
  – A single shared-memory computer with 32 GB of RAM
  – With four "clustered" computers with 8 GB each
• The result was more I/O than ever
• What happened?
The explanation
• Assume the OS occupies one GB of RAM
  – The old shared-memory computer still had 31 GB of free RAM
  – Each of the clustered computers has only 7 GB of free RAM
• The total RAM available to the program went down from 31 GB to 4×7 = 28 GB!
Grid computing
• The computers are distributed over a very large network
  – Sometimes computer time is donated
    • Volunteer computing
    • SETI@home
  – Works well with embarrassingly parallel workloads
    • Searches in an n-dimensional space
HARDWARE MULTITHREADING
General idea
• Let the processor switch to another thread of computation while the current one is stalled
• Motivation:
  – The increased cost of cache misses
Implementation
• Entirely controlled by the hardware
  – Unlike multiprogramming
• Requires a processor capable of
  – Keeping track of the state of each thread
    • One set of registers (including the PC) for each concurrent thread
  – Quickly switching among concurrent threads
Approaches
• Fine-grained multithreading:
  – Switches between threads at each instruction
  – Provides the highest throughput
  – Slows down the execution of individual threads
Approaches
• Coarse-grained multithreading:
  – Switches between threads whenever a long stall is detected
  – Easier to implement
  – Cannot eliminate all stalls
Approaches
• Simultaneous multithreading:
  – Takes advantage of the ability of modern hardware to execute instructions from different threads in parallel
  – The best solution
ALPHABET SOUP
Overview
• Used to describe processor organizations where
  – The same instructions can be applied to
  – Multiple data instances
• Encountered in
  – Vector processors in the past
  – Graphics processing units (GPUs)
  – The x86 multimedia extensions
Classification
• SISD:
  – Single instruction, single data
  – Conventional uniprocessor architecture
• MIMD:
  – Multiple instructions, multiple data
  – Conventional multiprocessor architecture
Classification
• SIMD:
  – Single instruction, multiple data
  – Performs the same operations on a set of similar data
• Think of adding two vectors:

  for (i = 0; i < VECSIZE; i++)
      sum[i] = a[i] + b[i];
Vector computing
• A kind of SIMD architecture
  – Used by Cray computers
• Pipelines multiple executions of a single instruction with different data ("vectors") through the ALU
• Requires
  – Vector registers able to store multiple values
  – Special vector instructions: say lv, addv, …
Benchmarking
• Two factors to consider
  – Memory bandwidth
    • Depends on the interconnection network
  – Floating-point performance
• The best-known benchmark is LINPACK
Roofline model
• Takes into account
  – Memory bandwidth
  – Floating-point performance
• Introduces arithmetic intensity
  – The total number of floating-point operations in a program divided by the total number of bytes transferred to main memory
  – Measured in FLOPs/byte
Roofline model
• Attainable GFLOPs/s =
  min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
Roofline model
[Figure: the roofline plot. Attainable performance rises with arithmetic intensity along the memory-bandwidth slope, then flattens at the peak floating-point performance roof; to the left of the ridge point, floating-point performance is limited by memory bandwidth]