Lecture 5: Intro to parallel machines and models + Locality and parallelism in simulations I
David Bindel, 6 Feb 2014


Page 1

Lecture 5: Intro to parallel machines and models
+ Locality and parallelism in simulations I

David Bindel

6 Feb 2014

Page 2

Logistics

- HW 1 in teams of 2–4. Due next Friday!
- CMS entry for team formation – enter teams by Monday
- Please start early!
- I will be out next Thurs, Feb 13
  - Guest lecture: Prof. Ken Birman
  - I will miss next Thursday office hours
  - I probably won't respond to email for about a week after

Page 3

Why clusters?

- Clusters of SMPs are everywhere
- Commodity hardware – economics! Even supercomputers now use commodity CPUs (though specialized interconnects).
- Relatively simple to set up and administer (?)
- But still costs room, power, ...
- Economy of scale ⇒ clouds?
  - Amazon now has HPC instances on EC2
  - StarCluster project lets you launch your own EC2 cluster
  - Lots of interesting challenges here

Page 4

Cluster structure

Consider:
- Each core has vector parallelism
- Each chip has four cores, shares memory with others
- Each box has two chips, shares memory
- Five instructional nodes, communicate via Ethernet

How did we get here? Why this type of structure? And how does the programming model match the hardware?

Page 5

Parallel computer hardware

Physical machine has processors, memory, interconnect.
- Where is memory physically?
- Is it attached to processors?
- What is the network connectivity?

Page 6

Parallel programming model

Programming model through languages, libraries.
- Control
  - How is parallelism created?
  - What ordering is there between operations?
- Data
  - What data is private or shared?
  - How is data logically shared or communicated?
- Synchronization
  - What operations are used to coordinate?
  - What operations are atomic?
- Cost: how do we reason about each of the above?

Programming model ≠ hardware organization!

Page 7

Simple example

Consider the dot product of x and y.
- Where do arrays x and y live? One CPU? Partitioned?
- Who does what work?
- How do we combine to get a single final result?
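For reference, a minimal serial version (an illustrative sketch in C; the function name dot is made up here, not from the slides):

    /* Serial dot product: x and y live in one memory, one CPU does all the work. */
    double dot(int n, const double* x, const double* y) {
        double s = 0.0;
        for (int i = 0; i < n; ++i)
            s += x[i] * y[i];
        return s;
    }

The parallel variants on the following slides answer the questions above in different ways.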

Page 8

Shared memory programming model

[Figure: threads sharing memory, each with its own thread-local storage]

Program consists of threads of control.
- Can be created dynamically
- Each has private variables (e.g. local)
- Each has shared variables (e.g. heap)
- Communication through shared variables
- Coordinate by synchronizing on variables
- Examples: OpenMP, pthreads (see the sketch below)
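A minimal OpenMP sketch of these ideas (illustrative only; the shared counter count is a made-up example):

    #include <stdio.h>
    #include <omp.h>

    int count = 0;                            /* shared variable */

    int main(void) {
        #pragma omp parallel                  /* create a team of threads */
        {
            int tid = omp_get_thread_num();   /* private: each thread has its own copy */
            #pragma omp critical              /* coordinate via synchronization */
            count += 1;                       /* communicate through the shared variable */
            printf("hello from thread %d\n", tid);
        }
        printf("%d threads updated the shared counter\n", count);
        return 0;
    }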

Page 9

Shared memory dot product

Dot product of two n-vectors on p ≪ n processors:
1. Each CPU evaluates a partial sum (n/p elements, local)
2. Everyone tallies the partial sums (naive sketch below)

Can we go home now?
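A sketch of this plan with OpenMP (hypothetical code; the final tally is deliberately left unprotected, which is exactly the problem the next slides examine):

    /* Each thread sums its share of the elements, then adds into the shared s. */
    double dot(int n, const double* x, const double* y) {
        double s = 0.0;                       /* shared result */
        #pragma omp parallel
        {
            double partial = 0.0;             /* private partial sum */
            #pragma omp for
            for (int i = 0; i < n; ++i)
                partial += x[i] * y[i];
            s += partial;                     /* unsynchronized update of shared s: race! */
        }
        return s;
    }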

Page 10

Race condition

A race condition:
- Two threads access the same variable, at least one write.
- Accesses are concurrent – no ordering guarantees
  - Could happen simultaneously!

Need synchronization via lock or barrier.

Page 11

Race to the dot

Consider S += partial_sum on 2 CPUs:
- P1: Load S
- P1: Add partial_sum
- P2: Load S
- P1: Store new S
- P2: Add partial_sum
- P2: Store new S

P2 loaded S before P1 stored its result, so the final S reflects only P2's partial sum; P1's update is lost.

Page 12

Shared memory dot with locks

Solution: consider S += partial_sum a critical section
- Only one CPU at a time allowed in the critical section
- Can violate invariants locally
- Enforce via a lock or mutex (mutual exclusion variable)

Dot product with mutex (see the pthreads sketch below):
1. Create global mutex l
2. Compute partial_sum
3. Lock l
4. S += partial_sum
5. Unlock l
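A sketch of these steps with a pthreads mutex (illustrative; the names S and l follow the slide's notation, but the tally function itself is made up):

    #include <pthread.h>

    double S = 0.0;                                   /* shared accumulator       */
    pthread_mutex_t l = PTHREAD_MUTEX_INITIALIZER;    /* 1. create global mutex l */

    void tally(double partial_sum) {                  /* 2. partial_sum computed by caller */
        pthread_mutex_lock(&l);                       /* 3. lock l                */
        S += partial_sum;                             /* 4. critical section      */
        pthread_mutex_unlock(&l);                     /* 5. unlock l              */
    }

In OpenMP the same effect comes from wrapping the update in #pragma omp critical (or, for this particular pattern, using a reduction clause).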

Page 13

Shared memory with barriers

- Many codes have phases (e.g. time steps)
- Communication only needed at end of phases
- Idea: synchronize on end of phase with barrier (sketch below)
  - More restrictive (less efficient?) than small locks
  - Easier to think through! (e.g. less chance of deadlocks)
- Sometimes called bulk synchronous programming
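A toy bulk-synchronous sketch in OpenMP (everything here is hypothetical; the "work" is a placeholder update so the example stays self-contained):

    #include <stdio.h>
    #include <omp.h>

    #define NTHREADS 4
    #define NSTEPS   3

    int main(void) {
        double val[NTHREADS] = {0};
        #pragma omp parallel num_threads(NTHREADS)
        {
            int tid = omp_get_thread_num();
            for (int step = 0; step < NSTEPS; ++step) {
                val[tid] += tid + 1;                          /* phase: purely local work   */
                #pragma omp barrier                           /* end of phase: wait for all */
                double neighbor = val[(tid + 1) % NTHREADS];  /* now safe to read others    */
                (void)neighbor;
                #pragma omp barrier                           /* don't overwrite val early  */
            }
        }
        printf("val[0] after %d steps: %g\n", NSTEPS, val[0]);
        return 0;
    }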

Page 14

Shared memory machine model

- Processors and memories talk through a bus
- Symmetric Multiprocessor (SMP)
- Hard to scale to lots of processors (think ≤ 32)
  - Bus becomes bottleneck
  - Cache coherence is a pain
- Example: Quad-core chips on cluster

Page 15

Multithreaded processor machine

- May have more threads than processors!
- Can switch threads on long latency ops
  - Cray MTA was an extreme example
- Similar to hyperthreading
  - But hyperthreading doesn't switch – just schedules multiple threads onto the same CPU functional units

Page 16

Distributed shared memory

- Non-Uniform Memory Access (NUMA)
- Can logically share memory while physically distributing
- Any processor can access any address
- Cache coherence is still a pain
- Example: SGI Origin (or multiprocessor nodes on cluster)

Page 17

Message-passing programming model

- Collection of named processes
- Data is partitioned
- Communication by send/receive of explicit messages
- Lingua franca: MPI (Message Passing Interface) – see the minimal sketch below
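A minimal MPI sketch of named processes exchanging an explicit message (illustrative only):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char** argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's name (0..size-1)   */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes are in the job */

        double msg = 0.0;
        if (rank == 0 && size > 1) {
            msg = 3.14;
            MPI_Send(&msg, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);       /* explicit send    */
        } else if (rank == 1) {
            MPI_Recv(&msg, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                               /* explicit receive */
            printf("rank 1 received %g from rank 0\n", msg);
        }
        MPI_Finalize();
        return 0;
    }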

Page 18

Message passing dot product: v1

Processor 1:
1. Partial sum s1
2. Send s1 to P2
3. Receive s2 from P2
4. s = s1 + s2

Processor 2:
1. Partial sum s2
2. Send s2 to P1
3. Receive s1 from P1
4. s = s1 + s2

What could go wrong? Think of phones vs letters... (If a send blocks until a matching receive, like a phone call, both processors wait forever at step 2 and we deadlock; a buffered send, like mailing a letter, may hide the problem.)

Page 19

Message passing dot product: v2

Processor 1:
1. Partial sum s1
2. Send s1 to P2
3. Receive s2 from P2
4. s = s1 + s2

Processor 2:
1. Partial sum s2
2. Receive s1 from P1
3. Send s2 to P1
4. s = s1 + s2

Better, but what if more than two processors?
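For general p, the standard answer is a collective reduction. A sketch using MPI_Allreduce (the names n_local, x_local, y_local for this rank's slice of the data are hypothetical):

    #include <mpi.h>

    /* Each rank holds n_local entries of x and y; MPI_Allreduce sums the
       partial results and delivers the total to every rank, for any p. */
    double parallel_dot(int n_local, const double* x_local, const double* y_local) {
        double partial = 0.0, s = 0.0;
        for (int i = 0; i < n_local; ++i)
            partial += x_local[i] * y_local[i];
        MPI_Allreduce(&partial, &s, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        return s;
    }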

Page 20

MPI: the de facto standard

- Pro: Portability
- Con: least-common-denominator for mid 80s

The "assembly language" (or C?) of parallelism...
... but, alas, assembly language can be high performance.

Page 21

Distributed memory machines

- Each node has local memory
  - ... and no direct access to memory on other nodes
- Nodes communicate via network interface
- Example: our cluster!
- Other examples: IBM SP, Cray T3E

Page 22

The story so far

- Even serial performance is a complicated function of the underlying architecture and memory system. We need to understand these effects in order to design data structures and algorithms that are fast on modern machines. Good serial performance is the basis for good parallel performance.

- Parallel performance is additionally complicated by communication and synchronization overheads, and by how much parallel work is available. If a small fraction of the work is completely serial, Amdahl's law bounds the speedup, independent of the number of processors (formula after this list).

- We have discussed serial architecture and some of the basics of parallel machine models and programming models.

- Now we want to describe how to think about the shape of parallel algorithms for some scientific applications.
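For reference, the bound mentioned above: if a fraction s of the work is inherently serial and the remaining 1 − s parallelizes perfectly over p processors, then

    speedup(p) = 1 / (s + (1 − s)/p) ≤ 1/s

so no number of processors can push the speedup past 1/s.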

Page 23

Reminder: what do we want?

- High-level: solve big problems fast
- Start with good serial performance
- Given p processors, could then ask for
  - Good speedup: run in 1/p times the serial time (see below)
  - Good scaled speedup: do p times the work in the same time
- Easiest to get good speedup from cruddy serial code!
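In symbols (an illustrative restatement, with T_1 the serial time and T_p the time on p processors):

    speedup(p) = T_1 / T_p        (ideal: T_p = T_1/p, i.e. speedup of p)

Scaled speedup instead fixes the wall-clock time and asks that p processors complete p times the work of the serial run.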

Page 24

Parallelism and locality

- Real world exhibits parallelism and locality
  - Particles, people, etc. function independently
  - Nearby objects interact more strongly than distant ones
  - Can often simplify dependence on distant objects
- Can get more parallelism / locality through the model
  - Limited range of dependency between adjacent time steps
  - Can neglect or approximate far-field effects
- Often get parallelism at multiple levels
  - Hierarchical circuit simulation
  - Interacting models for climate
  - Parallelizing individual experiments in MC or optimization

Page 25

Basic styles of simulation

- Discrete event systems (continuous or discrete time)
  - Game of life, logic-level circuit simulation
  - Network simulation
- Particle systems
  - Billiards, electrons, galaxies, ...
  - Ants, cars, ...?
- Lumped parameter models (ODEs)
  - Circuits (SPICE), structures, chemical kinetics
- Distributed parameter models (PDEs / integral equations)
  - Heat, elasticity, electrostatics, ...

Often more than one type of simulation is appropriate. Sometimes more than one at a time!