lec22 parallel
TRANSCRIPT
INTRODUCTION TO PARALLEL PROGRAMMING AND MULTICORES
Turbo Majumder
Not one, but many
• Multicores, multiprocessors: collaborating processing elements (aka cores)
• Off-chip: clusters, supercomputers
  1. Tianhe-2: 3,120,000 cores
  3. Sequoia: 1,572,864 cores
  44. iDataPlex DX360M4, IBM: 38,016 cores, located at IITM, Pune
Source: www.top500.org November 2013 list
Parallelism
• In software:
  • Instruction-level
  • Thread-level
  • Task/job/process-level
• In hardware:
  • Multiple equivalent functional units
  • Out-of-order processing
  • Simultaneous multithreading / Hyperthreading®
  • And obviously, multiple cores
Exploiting parallelism
• Parallel software needs to catch up
• Nvidia has CUDA, but others?
• Difficulties
• Problem partitioning
• Coordination amongst cores
• Communication overhead
• Intel Xeon Phi: 64 cores on a chip
• Nvidia GTX Titan: 2,688 stream PEs
• Tilera TILE-Gx72: 72 cores on a chip
Images courtesy: www.anandtech.com; www.tilera.com
Revisiting Amdahl’s Law
• Sequential, i.e., non-parallelised or non-parallelisable part is the speedup bottleneck
• Example: with 100 processors, how large must the parallelisable fraction be for a 90× speedup?
• Tnew = Tparallelisable/100 + Tsequential
• Speedup = 1 / ((1 − Fparallelisable) + Fparallelisable/100) = 90
• Solving: Fparallelisable ≈ 0.999
• Need sequential part to be very, very small
  • Very difficult to achieve
• Parallelisation efficiency = speedup/(ideal speedup) = 90/100 = 0.9 or 90%
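A quick numeric check of this result, as a minimal C sketch (the function and variable names are illustrative, not from the lecture):

    #include <stdio.h>

    /* Amdahl's Law: speedup = 1 / ((1 - f) + f/p),
       where f is the parallelisable fraction and p the processor count. */
    static double speedup(double f, int p) {
        return 1.0 / ((1.0 - f) + f / p);
    }

    int main(void) {
        int p = 100;
        /* Solve 1/((1 - f) + f/100) = 90 for f:
           f = (1 - 1/90) / (1 - 1/100) */
        double f = (1.0 - 1.0 / 90.0) / (1.0 - 1.0 / 100.0);
        printf("required fraction f = %.4f\n", f);             /* ~0.9989 */
        printf("resulting speedup   = %.1f\n", speedup(f, p)); /* 90.0 */
        return 0;
    }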
Problem scaling
• Workload: sum of 10 scalars and a 10 × 10 matrix sum
• Assumptions: scalar addition is sequential; each addition cannot be further parallelised
• Speedup from 10 to 100 processors
• Single processor: Time = (10 + 100) × tadd
• 10 processors
  • Time = 10 × tadd + 100/10 × tadd = 20 × tadd
  • Speedup = 110/20 = 5.5 (Parallelisation efficiency = 55%)
• 100 processors
  • Time = 10 × tadd + 100/100 × tadd = 11 × tadd
  • Speedup = 110/11 = 10 (Parallelisation efficiency = 10%)
• Assumes load can be balanced across processors
Problem scaling (continued)
• What if matrix size is increased to 100 × 100?
• Single processor: Time = (10 + 10000) × tadd
• 10 processors
• Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
• Speedup = 10010/1010 = 9.9 (Parallelisation efficiency = 99%)
• 100 processors
• Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
• Speedup = 10010/110 = 91 (Parallelisation efficiency = 91%)
• Assuming load balanced
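A small C sketch that reproduces these numbers under the slide's assumptions (the workload sizes and helper names are illustrative):

    #include <stdio.h>

    /* Time (in units of t_add) for summing `scalars` scalars sequentially
       plus a matrix sum of `matrix` additions spread over p processors,
       assuming a perfectly balanced load. */
    static double run_time(int scalars, int matrix, int p) {
        return scalars + (double)matrix / p;
    }

    int main(void) {
        int sizes[] = {10, 100};           /* 10x10 and 100x100 matrices */
        int procs[] = {10, 100};
        for (int s = 0; s < 2; s++) {
            int matrix = sizes[s] * sizes[s];
            double t1 = run_time(10, matrix, 1);   /* single processor */
            for (int i = 0; i < 2; i++) {
                int p = procs[i];
                double tp = run_time(10, matrix, p);
                printf("%3dx%-3d matrix, %3d procs: speedup %.1f, efficiency %.0f%%\n",
                       sizes[s], sizes[s], p, t1 / tp, 100.0 * t1 / tp / p);
            }
        }
        return 0;
    }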
Parallel processing paradigms: Shared memory
• SMP: shared memory multiprocessor
• Hardware provides single physical address space for all processors
• Could also be physically distributed (e.g. shared L2 cache)
• Synchronise shared variables using locks (see the sketch after this list)
• Memory access time: UMA (uniform) vs. NUMA (non-uniform)
• Cache coherence
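As a concrete illustration of protecting a shared variable with a lock, here is a minimal pthreads sketch (the counter, thread count and iteration count are illustrative, not from the lecture); compile with e.g. gcc -pthread:

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static long counter = 0;                          /* shared variable */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);                /* enter critical section */
            counter++;                                /* safe: one thread at a time */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);           /* 400000 with the lock */
        return 0;
    }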
Distributed shared memory (DSM)
• Memory distributed among processors
• Non-uniform memory access/latency (NUMA)
  • If L2 is the lowest-level private cache, L2 miss latency will depend on the exact block address
• Processors connected via direct (switched) and non-direct (multi-hop) interconnection networks
Example: Computing sums with SMP
• Sum 128,000 numbers on a 128-processor UMA machine
• Each processor has an ID: 0 ≤ Pn ≤ 127
• Partition: 1000 numbers per processor
• Initial summation on each processor:
  sum[Pn] = 0;
  for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
      sum[Pn] = sum[Pn] + A[i];
• Now need to add these partial sums
  • Reduction: divide and conquer
  • Half the processors add pairs, then a quarter, …
  • Need to synchronise between reduction steps
• Also see MapReduce
Example (continued)
half = 128;                      /* number of partial sums still live */
do {
    synch();                     /* barrier: wait for this step to finish */
    if (half % 2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           Processor0 gets the missing element */
    half = half / 2;             /* dividing line on who sums */
    if (Pn < half)
        sum[Pn] = sum[Pn] + sum[Pn+half];
} while (half != 1);
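For comparison, the same partition-and-reduce can be written with OpenMP, which handles the partitioning, synchronisation and reduction tree automatically. A minimal sketch (the array size and initialisation are illustrative), compiled with e.g. gcc -fopenmp:

    #include <stdio.h>
    #include <omp.h>

    #define N 128000

    int main(void) {
        static double A[N];
        for (int i = 0; i < N; i++)
            A[i] = 1.0;                       /* illustrative data */

        double sum = 0.0;
        /* Each thread sums a chunk of A into a private copy of sum;
           OpenMP then combines the partial sums (the reduction step). */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += A[i];

        printf("sum = %.0f using up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }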
Parallel processing paradigms: Message passing
• Each processor has a private physical address space
• Hardware sends/receives messages between processors
Message passing architectures
• Cluster: network of independent computers
  • Each has private memory and OS
  • Connected using the I/O system, e.g., Ethernet/switch
• Suitable for applications with independent tasks
  • Web servers, databases, simulations, …
• High availability, scalable, affordable
• Problems
  • Administration cost (prefer virtual machines)
  • Low interconnect bandwidth (compare with processor/memory bandwidth on an SMP)
  • Router congestion
Example: Computing sums with MP
• Sum 100,000 numbers on 100 processors
• First distribute a subset of 1000 numbers to each processor
• Each processor computes its partial sum over its local array AN:
  sum = 0;
  for (i = 0; i < 1000; i = i + 1)
      sum = sum + AN[i];
• Reduction
  • Half the processors send, the other half receive and add
  • Then a quarter send, a quarter receive and add, …
• With shared memory, access times were implicit; here distribution and reduction are explicit communication operations (taking time)
Example (continued)
• Given send() and receive() operations
limit = 100;                     /* 100 processors */
half = 100;
do {
    half = (half + 1) / 2;       /* send vs. receive dividing line */
    if (Pn >= half && Pn < limit)
        send(Pn - half, sum);
    if (Pn < (limit / 2))
        sum = sum + receive();
    limit = half;                /* upper limit of senders */
} while (half != 1);             /* exit with the final sum on P0 */
• Send/receive also provide synchronisation
• Assumes send/receive take time similar to an addition
  • Will not be true in most cases
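In practice this explicit distribute-and-reduce pattern is what MPI's collective operations provide. A minimal sketch of the same sum using MPI_Reduce (the local data values are illustrative, not from the lecture), launched with e.g. mpirun -np 100:

    #include <stdio.h>
    #include <mpi.h>

    #define LOCAL_N 1000                      /* numbers held by each processor */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double AN[LOCAL_N];
        for (int i = 0; i < LOCAL_N; i++)
            AN[i] = 1.0;                      /* illustrative local data */

        /* Local partial sum, as in the per-processor loop above */
        double sum = 0.0;
        for (int i = 0; i < LOCAL_N; i++)
            sum = sum + AN[i];

        /* Reduction: MPI combines the partial sums (internally a tree of
           sends/receives) and leaves the total on rank 0 */
        double total = 0.0;
        MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("total = %.0f\n", total);

        MPI_Finalize();
        return 0;
    }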
Larger parallel systems
• Grid
  • LSF: qsub, bsub
• Batch job submissions to a cluster; scheduling management done centrally
• e.g., HPC CUDA cluster at IIT Delhi
• Cloud
• Large interlinked data centres
• Infrastructure as a service
  • Private cloud: e.g. Baadal
• Public cloud: e.g. Amazon Elastic Compute Cloud (Amazon EC2)
• Pricing policies
Taxonomy of parallel computing
• (Michael) Flynn’s taxonomy
• Based on the instruction stream and data stream between memory (cache) and CPU
• Instruction stream is unidirectional
• Data stream is bidirectional
• Single Instruction Single Data (SISD)
• Classic uniprocessors
• One PE (i.e. one control unit); could have multiple functional units
• Example: Most small microprocessors, CDC6600
Taxonomy of parallel computing
• Single Instruction Multiple Data (SIMD)
• Same instruction, single control unit, single decode
• But multiple data streams being operated on
• Examples:
  • Vector processors such as Cray X-MP
• Processor arrays such as Connection Machine (CM) (hypercube)
• Most GPUs (e.g. nVIDIA GeForce 6800 GT)
• Partially in Intel’s AVX, AVX2 architectures
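To make SIMD concrete, here is a minimal sketch using Intel AVX intrinsics, in which a single instruction adds eight single-precision values at once (the array contents are illustrative); compile with e.g. gcc -mavx:

    #include <stdio.h>
    #include <immintrin.h>                    /* AVX intrinsics */

    int main(void) {
        float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
        float c[8];

        __m256 va = _mm256_loadu_ps(a);       /* load 8 floats */
        __m256 vb = _mm256_loadu_ps(b);       /* load 8 more */
        __m256 vc = _mm256_add_ps(va, vb);    /* one instruction, 8 additions */
        _mm256_storeu_ps(c, vc);

        for (int i = 0; i < 8; i++)
            printf("%.0f ", c[i]);            /* 11 22 33 44 55 66 77 88 */
        printf("\n");
        return 0;
    }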
• Multiple Instruction Single Data (MISD)
• Multiple processing elements (control units) operating on a single data stream
• Fault tolerant systems; N-version programming
• Examples: very few ever built
  • C.mmp, systolic arrays (both could also be classified as MIMD), …
Taxonomy of parallel computing
• Multiple Instruction Multiple Data (MIMD)
• Several instruction and data streams between processing elements and memory
• Truly parallel systems
• Challenges:
  • Load balancing
• Coordination/synchronisation
• Communication delay / interconnection bandwidth
• Examples: most modern systems, from multicore general-purpose processors to supercomputers
Structural classification
• Mostly for MIMDs
• Shared memory systems: tightly coupled systems
  • Processor-Memory Interconnection Network
  • I/O-Processor Interconnection Network
  • Interrupt Signal Interconnection Network
• Distributed memory systems: loosely coupled systems
  • Message passing interconnection network