lec22 parallel
TRANSCRIPT
INTRODUCTION TO PARALLEL PROGRAMMING AND MULTICORES
Turbo Majumder
Not one, but many
• Multicores, multiprocessors: collaborating processing elements (aka cores)
• Off-chip: clusters, supercomputers
  1. Tianhe-2: 3,120,000 cores
  3. Sequoia: 1,572,864 cores
  44. iDataPlex DX360M4, IBM: 38,016 cores, located at IITM, Pune
Source: www.top500.org November 2013 list
Parallelism
• In software:
  • Instruction-level
  • Thread-level
  • Task/job/process-level
• In hardware:
  • Multiple equivalent functional units
  • Out-of-order processing
  • Simultaneous multithreading / Hyperthreading®
  • And obviously, multiple cores
Exploiting parallelism
• Parallel software needs to catch up
• Nvidia has CUDA, but others?
• Difficulties
• Problem partitioning
• Coordination amongst cores
• Communication overhead
• Intel Xeon Phi: 64 cores on a chip
• Nvidia GTX Titan: 2,688 stream PEs
• Tilera TILE-Gx72: 72 cores on a chip
Images courtesy: www.anandtech.com; www.tilera.com
Revisiting Amdahl’s Law
• Sequential, i.e., non-parallelised or non-parallelisable part is the speedup bottleneck
• Example: with 100 processors, how large must the parallelisable fraction be for a 90× speedup?
• Tnew = Tparallelisable/100 + Tsequential
• Speedup = 1 / ((1 − Fparallelisable) + Fparallelisable/100) = 90
• Solving: Fparallelisable ≈ 0.999
• Need sequential part to be very, very small
  • Very difficult to achieve
• Parallelisation efficiency = speedup/(ideal speedup) = 90/100 = 0.9 or 90%
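A quick numeric check of this result, as a minimal C sketch (the function and variable names are illustrative, not from the lecture):

    #include <stdio.h>

    /* Amdahl's Law: speedup = 1 / ((1 - f) + f/p),
       where f is the parallelisable fraction and p the processor count. */
    static double speedup(double f, int p) {
        return 1.0 / ((1.0 - f) + f / p);
    }

    int main(void) {
        int p = 100;
        /* Solve 1/((1 - f) + f/100) = 90 for f:
           f = (1 - 1/90) / (1 - 1/100) */
        double f = (1.0 - 1.0 / 90.0) / (1.0 - 1.0 / 100.0);
        printf("required fraction f = %.4f\n", f);             /* ~0.9989 */
        printf("resulting speedup   = %.1f\n", speedup(f, p)); /* 90.0 */
        return 0;
    }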
Problem scaling
• Workload: sum of 10 scalars and a 10 × 10 matrix sum
• Assumptions: scalar addition is sequential; each addition cannot be further parallelised
• Speedup from 10 to 100 processors
• Single processor: Time = (10 + 100) × tadd
• 10 processors
  • Time = 10 × tadd + 100/10 × tadd = 20 × tadd
  • Speedup = 110/20 = 5.5 (Parallelisation efficiency = 55%)
• 100 processors
  • Time = 10 × tadd + 100/100 × tadd = 11 × tadd
  • Speedup = 110/11 = 10 (Parallelisation efficiency = 10%)
• Assumes load can be balanced across processors
Problem scaling (continued)
• What if matrix size is increased to 100 × 100?
• Single processor: Time = (10 + 10000) × tadd
• 10 processors
• Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
• Speedup = 10010/1010 = 9.9 (Parallelisation efficiency = 99%)
• 100 processors
• Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
• Speedup = 10010/110 = 91 (Parallelisation efficiency = 91%)
• Assuming load balanced
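A small C sketch that reproduces these numbers under the slide's assumptions (the workload sizes and helper names are illustrative):

    #include <stdio.h>

    /* Time (in units of t_add) for summing `scalars` scalars sequentially
       plus a matrix sum of `matrix` additions spread over p processors,
       assuming a perfectly balanced load. */
    static double run_time(int scalars, int matrix, int p) {
        return scalars + (double)matrix / p;
    }

    int main(void) {
        int sizes[] = {10, 100};           /* 10x10 and 100x100 matrices */
        int procs[] = {10, 100};
        for (int s = 0; s < 2; s++) {
            int matrix = sizes[s] * sizes[s];
            double t1 = run_time(10, matrix, 1);   /* single processor */
            for (int i = 0; i < 2; i++) {
                int p = procs[i];
                double tp = run_time(10, matrix, p);
                printf("%3dx%-3d matrix, %3d procs: speedup %.1f, efficiency %.0f%%\n",
                       sizes[s], sizes[s], p, t1 / tp, 100.0 * t1 / tp / p);
            }
        }
        return 0;
    }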
Parallel processing paradigms: Shared memory
• SMP: shared memory multiprocessor
• Hardware provides single physical address space for all processors
• Could also be physically distributed (e.g. shared L2 cache)
• Synchronise shared variables using locks (see the sketch after this list)
• Memory access time: UMA (uniform) vs. NUMA (non-uniform)
• Cache coherence
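As a concrete illustration of protecting a shared variable with a lock, here is a minimal pthreads sketch (the counter, thread count and iteration count are illustrative, not from the lecture); compile with e.g. gcc -pthread:

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static long counter = 0;                          /* shared variable */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);                /* enter critical section */
            counter++;                                /* safe: one thread at a time */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);           /* 400000 with the lock */
        return 0;
    }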
Distributed shared memory (DSM)
• Memory distributed among processors
• Non-uniform memory access/latency (NUMA)
  • If L2 is the lowest-level private cache, L2 miss latency will depend on the exact block address
• Processors connected via direct (switched) and non-direct (multi-hop) interconnection networks
Example: Computing sums with SMP
• Sum 128,000 numbers on a 128-processor UMA machine
• Each processor has an ID: 0 ≤ Pn ≤ 127
• Partition: 1000 numbers per processor
• Initial summation on each processor:
  sum[Pn] = 0;
  for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
      sum[Pn] = sum[Pn] + A[i];
• Now need to add these partial sums
  • Reduction: divide and conquer
  • Half the processors add pairs, then a quarter, …
  • Need to synchronise between reduction steps
• Also see MapReduce
Example (continued)
half = 128;                      /* number of partial sums still live */
do {
    synch();                     /* barrier: wait for this step to finish */
    if (half % 2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           Processor0 gets the missing element */
    half = half / 2;             /* dividing line on who sums */
    if (Pn < half)
        sum[Pn] = sum[Pn] + sum[Pn+half];
} while (half != 1);
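For comparison, the same partition-and-reduce can be written with OpenMP, which handles the partitioning, synchronisation and reduction tree automatically. A minimal sketch (the array size and initialisation are illustrative), compiled with e.g. gcc -fopenmp:

    #include <stdio.h>
    #include <omp.h>

    #define N 128000

    int main(void) {
        static double A[N];
        for (int i = 0; i < N; i++)
            A[i] = 1.0;                       /* illustrative data */

        double sum = 0.0;
        /* Each thread sums a chunk of A into a private copy of sum;
           OpenMP then combines the partial sums (the reduction step). */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += A[i];

        printf("sum = %.0f using up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }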
Parallel processing paradigms: Message passing
• Each processor has a private physical address space
• Hardware sends/receives messages between processors
Message passing architectures
• Cluster: network of independent computers
  • Each has private memory and OS
  • Connected using the I/O system, e.g., Ethernet/switch
• Suitable for applications with independent tasks
  • Web servers, databases, simulations, …
• High availability, scalable, affordable
• Problems
  • Administration cost (prefer virtual machines)
  • Low interconnect bandwidth (compare with processor/memory bandwidth on an SMP)
  • Router congestion
Example: Computing sums with MP
• Sum 100,000 numbers on 100 processors
• First distribute a subset of 1000 numbers to each processor
• Each processor computes its partial sum over its local array AN:
  sum = 0;
  for (i = 0; i < 1000; i = i + 1)
      sum = sum + AN[i];
• Reduction
  • Half the processors send, the other half receive and add
  • Then a quarter send, a quarter receive and add, …
• With shared memory, access times were implicit; here distribution and reduction are explicit communication operations (taking time)
Example (continued)
• Given send() and receive() operations
limit = 100;                     /* 100 processors */
half = 100;
do {
    half = (half + 1) / 2;       /* send vs. receive dividing line */
    if (Pn >= half && Pn < limit)
        send(Pn - half, sum);
    if (Pn < (limit / 2))
        sum = sum + receive();
    limit = half;                /* upper limit of senders */
} while (half != 1);             /* exit with the final sum on P0 */
• Send/receive also provide synchronisation
• Assumes send/receive take time similar to an addition
  • Will not be true in most cases
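In practice this explicit distribute-and-reduce pattern is what MPI's collective operations provide. A minimal sketch of the same sum using MPI_Reduce (the local data values are illustrative, not from the lecture), launched with e.g. mpirun -np 100:

    #include <stdio.h>
    #include <mpi.h>

    #define LOCAL_N 1000                      /* numbers held by each processor */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double AN[LOCAL_N];
        for (int i = 0; i < LOCAL_N; i++)
            AN[i] = 1.0;                      /* illustrative local data */

        /* Local partial sum, as in the per-processor loop above */
        double sum = 0.0;
        for (int i = 0; i < LOCAL_N; i++)
            sum = sum + AN[i];

        /* Reduction: MPI combines the partial sums (internally a tree of
           sends/receives) and leaves the total on rank 0 */
        double total = 0.0;
        MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("total = %.0f\n", total);

        MPI_Finalize();
        return 0;
    }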
Larger parallel systems
• Grid
  • LSF: qsub, bsub
• Batch job submissions to a cluster; scheduling management done centrally
• e.g., HPC CUDA cluster at IIT Delhi
• Cloud
• Large interlinked data centres
• Infrastructure as a service
  • Private cloud: e.g. Baadal
• Public cloud: e.g. Amazon Elastic Compute Cloud (Amazon EC2)
• Pricing policies
Taxonomy of parallel computing
• (Michael) Flynn’s taxonomy
• Based on the instruction stream and data stream between memory (cache) and CPU
• Instruction stream is unidirectional
• Data stream is bidirectional
• Single Instruction Single Data (SISD)
• Classic uniprocessors
• One PE (i.e. one control unit); could have multiple functional units
• Example: Most small microprocessors, CDC6600
Taxonomy of parallel computing
• Single Instruction Multiple Data (SIMD)
• Same instruction, single control unit, single decode
• But multiple data streams being operated on
• Examples:
  • Vector processors such as Cray X-MP
• Processor arrays such as Connection Machine (CM) (hypercube)
• Most GPUs (e.g. nVIDIA GeForce 6800 GT)
• Partially in Intel’s AVX, AVX2 architectures
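To make SIMD concrete, here is a minimal sketch using Intel AVX intrinsics, in which a single instruction adds eight single-precision values at once (the array contents are illustrative); compile with e.g. gcc -mavx:

    #include <stdio.h>
    #include <immintrin.h>                    /* AVX intrinsics */

    int main(void) {
        float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
        float c[8];

        __m256 va = _mm256_loadu_ps(a);       /* load 8 floats */
        __m256 vb = _mm256_loadu_ps(b);       /* load 8 more */
        __m256 vc = _mm256_add_ps(va, vb);    /* one instruction, 8 additions */
        _mm256_storeu_ps(c, vc);

        for (int i = 0; i < 8; i++)
            printf("%.0f ", c[i]);            /* 11 22 33 44 55 66 77 88 */
        printf("\n");
        return 0;
    }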
• Multiple Instruction Single Data (MISD)
• Multiple processing elements (control units) operating on a single data stream
• Fault tolerant systems; N-version programming
• Examples: very few ever built
  • C.mmp, systolic arrays (both could also be classified as MIMD), …
Taxonomy of parallel computing
• Multiple Instruction Multiple Data (MIMD)
• Several instruction and data streams between processing elements and memory
• Truly parallel systems
• Challenges:
  • Load balancing
• Coordination/synchronisation
• Communication delay / interconnection bandwidth
• Examples: most modern systems, from multicore general-purpose processors to supercomputers
Structural classification
• Mostly for MIMDs
• Shared memory systems: tightly coupled systems
  • Processor-Memory Interconnection Network
  • I/O-Processor Interconnection Network
  • Interrupt Signal Interconnection Network
• Distributed memory systems: loosely coupled systems
  • Message passing interconnection network