Parallel Programming Models in the Era of Multi-core Processors

Laxmikant Kale
Parallel Programming Laboratory, Department of Computer Science
University of Illinois at Urbana-Champaign
http://charm.cs.uiuc.edu


TRANSCRIPT

Page 1: Parallel Programming models in the era of multi-core processors:

Parallel Programming Models in the Era of Multi-core Processors

Laxmikant Kale
http://charm.cs.uiuc.edu
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign

UPCRC seminar, June 20, 2008

Page 2: Parallel Programming models in the era of multi-core processors:

Requirements

• Composability
• Respect for locality
• Dealing with heterogeneity
• Dealing with the memory wall
• Dealing with dynamic resource variation
  – A machine running 2 parallel apps on 64 cores needs to run a third one
  – Shrink and expand the sets of cores assigned to a job
• Dealing with static resource variation: forward scaling
  – i.e., a parallel app should run unchanged on the next-generation manycore with twice as many cores
• Above all: simplicity

Page 3: Parallel Programming models in the era of multi-core processors:

Guidelines

• A guideline that appeals to me:
  – Bottom-up, application-driven development of abstractions
• Aim at a good division of labor between the programmer and the system
  – Automate what the system can do well
  – Allow programmers to do what they can do best

Page 4: Parallel Programming models in the era of multi-core processors:

Foundation: Adaptive Runtime System

Based on migratable objects:
• Programmer: [over]decomposition into virtual processors (VPs) – the user's view
• Runtime: assigns VPs to processors – the system implementation
• Enables adaptive runtime strategies
• Implementations: Charm++, AMPI

Benefits:
• Software engineering
  – Number of VPs chosen to match application logic (not physical cores)
  – Separate VPs for different modules
• Message-driven execution
  – Predictability
  – Asynchronous reductions
• Dynamic mapping
  – Heterogeneity: vacate, adjust to speed, share
  – Change the set of processors used
  – Dynamic load balancing

Page 5: Parallel Programming models in the era of multi-core processors:


What is the cost of Processor Virtualization?

Page 6: Parallel Programming models in the era of multi-core processors:

[Chart: time per iteration (seconds) vs. number of chunks per processor, from 1 to 2048]

“Overhead” of Virtualization

• Fragmentation cost?
  – Cache performance improves
  – Adaptive overlap improves
  – Difficult to see a cost...

• Fixable problems:
  – Memory overhead (larger ghost areas)
  – Fine-grained messaging, whose cost can be modeled roughly as:

      T_1 = T (1 + v/g)
      T_p = max( (T/p) (1 + v/g), g (1 + v) )

    where
      v   = overhead per message
      T_p = p-processor completion time
      g   = grainsize (computation per message)
      T   = total sequential computation time
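For intuition, a quick back-of-the-envelope instance of this model (hypothetical numbers, not from the slides): with a per-message overhead of v = 1 microsecond and a grainsize of g = 1 millisecond of computation per message,

  T_1 = T (1 + v/g) = T (1 + 0.001) ≈ 1.001 T

that is, about 0.1% added cost on one processor. The virtualization overhead only becomes noticeable when the grainsize shrinks toward the per-message overhead.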

Page 7: Parallel Programming models in the era of multi-core processors:

Modularity and Concurrent Composition

Page 8: Parallel Programming models in the era of multi-core processors:

Message Driven Execution

[Diagram: two processors, each with its own scheduler and message queue]

Virtualization leads to message-driven execution, which in turn leads to automatic, adaptive overlap of computation and communication.
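To make that concrete, here is a minimal Charm++-flavored sketch (class and method names are hypothetical, not from the talk): an object sends its boundary data asynchronously and immediately returns to the scheduler, which is then free to run entry methods of any other object on the same core while the message is in flight.

// Sketch only: assumes a block.ci interface file (omitted) declaring a 1D
// chare array "Block" with entry methods step() and recvGhost(); the
// generated headers below provide CBase_Block and the proxy types.
#include "block.decl.h"

class Block : public CBase_Block {
  double cell[128];        // this object's piece of the data
public:
  Block() {}

  void step() {
    // Asynchronously send our right boundary to the neighbor object
    // (boundary handling omitted). The call returns immediately; there is
    // no blocking receive anywhere.
    thisProxy[thisIndex + 1].recvGhost(cell[127]);
    // Control now returns to the Charm++ scheduler, which can execute
    // entry methods of *other* chares on this core while the message is
    // in flight: this is the adaptive overlap of computation and
    // communication.
  }

  void recvGhost(double ghost) {
    // Invoked by the scheduler when the neighbor's message arrives;
    // compute using the ghost value here.
  }
};

#include "block.def.h"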

Page 9: Parallel Programming models in the era of multi-core processors:


Adaptive overlap and modules

SPMD and message-driven modules (from A. Gursoy, Simplified Expression of Message-Driven Programs and Quantification of Their Impact on Performance, Ph.D. thesis, April 1994)

Modularity, Reuse, and Efficiency with Message-Driven Libraries, Proc. of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, 1995

Page 10: Parallel Programming models in the era of multi-core processors:

NAMD: A Production MD Program

• Fully featured program
• NIH-funded development
• Installed at NSF centers
• Large published simulations
• We were able to demonstrate the utility of adaptive overlap, and shared the Gordon Bell award in 2002

Collaboration with K. Schulten, R. Skeel, and coworkers

Page 11: Parallel Programming models in the era of multi-core processors:

[Diagram: NAMD modules: integration, electrostatics, PME/3D FFT]

Page 12: Parallel Programming models in the era of multi-core processors:

Modularization

• Logical units decoupled from the "number of processors"
  – e.g., octree nodes for particle data
  – No artificial restriction on the number of processors (such as a cube of a power of 2)
• Modularity:
  – Software engineering: cohesion and coupling
  – MPI's "are on the same processor" is a bad coupling principle
  – Objects liberate you from that
    • e.g., solid and fluid modules in a rocket simulation

Page 13: Parallel Programming models in the era of multi-core processors:

Rocket Simulation

• Large collaboration headed by Mike Heath
  – DOE-supported ASCI center
• Challenge:
  – Multi-component code, with modules from independent researchers
  – MPI was the common base
• AMPI: new wine in an old bottle
  – Easier to convert
  – Can still run the original codes on MPI, unchanged

Page 14: Parallel Programming models in the era of multi-core processors:


AMPI:

7 MPI processes

Page 15: Parallel Programming models in the era of multi-core processors:

AMPI:

7 MPI "processes", implemented as virtual processors (user-level migratable threads), mapped onto real processors

Page 16: Parallel Programming models in the era of multi-core processors:

Rocket Simulation Components in AMPI

[Diagram: Rocflo, Rocface, and Rocsolid component instances running as AMPI virtual processors]

Page 17: Parallel Programming models in the era of multi-core processors:

AMPI and Roc* Communications

[Diagram: communication among Rocflo, Rocface, and Rocsolid virtual processors]

Page 18: Parallel Programming models in the era of multi-core processors:

[Diagram: the Migratable Objects model and the research areas around it: automatic adaptive runtime optimizations; new parallel languages and enhancements (MSA, Charisma, ...); applications, especially dynamic, irregular, and difficult-to-parallelize ones; how to build better parallel machines; communication support (SW/HW); OS support; memory management; BigSim; resource management on computational grids]

Page 19: Parallel Programming models in the era of multi-core processors:

Charm++/AMPI are mature systems

• Available on all parallel machines we know of
  – Clusters; vendor-supported: IBM, SGI, HP (Q), BlueGene/L, ...
• Tools:
  – Performance analysis/visualization
  – Debuggers
  – Live visualization
  – Libraries and frameworks
• Used by many applications
  – 17,000+ installations
  – NAMD, rocket simulation, quantum chemistry, space-time meshes, animation graphics, astronomy, ...
• It is C++, with message (event) driven execution
  – So, a familiar model for desktop programmers (see the sketch below)
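As a flavor of what that looks like, a minimal sketch in the spirit of the standard Charm++ hello-world (hypothetical module and method names, not code from the talk): a main chare creates another chare and invokes an entry method on it asynchronously; the method runs when its message is delivered.

// Assumes a matching hello.ci interface file (omitted) declaring Main and
// Hello with these entry methods; the generated headers provide CBase_* and
// CProxy_* used below.
#include "hello.decl.h"

class Main : public CBase_Main {
public:
  Main(CkArgMsg* msg) {
    delete msg;
    // Create a Hello chare somewhere in the machine and send it a message.
    CProxy_Hello h = CProxy_Hello::ckNew();
    h.greet(42);               // asynchronous entry-method invocation
  }
};

class Hello : public CBase_Hello {
public:
  Hello() {}
  void greet(int value) {      // runs when the message is delivered
    CkPrintf("hello, value = %d\n", value);
    CkExit();                  // shut the parallel run down
  }
};

#include "hello.def.h"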

Page 20: Parallel Programming models in the era of multi-core processors:

Parallel objects and the adaptive runtime system, with libraries and tools on top

The enabling CS technology of parallel objects and intelligent runtime systems has led to several collaborative applications in CSE. We develop abstractions in the context of full-scale applications:

• NAMD: molecular dynamics
• Crack propagation
• Space-time meshes
• Computational cosmology
• Rocket simulation
• Protein folding
• Dendritic growth
• Quantum chemistry (LeanCP)
• STM virus simulation

Page 21: Parallel Programming models in the era of multi-core processors:

CSE to ManyCore

• The Charm++ model has succeeded in CSE/HPC
  – 15% of cycles at NCSA and 20% at PSC were used by Charm++ apps over a one-year period
• Because: resource management, ...
• In spite of: being based on C++ (not Fortran), a message-driven model, ...
• But it is an even better fit for desktop programmers
  – C++, event-driven execution
  – Predictability of data/code accesses

Page 22: Parallel Programming models in the era of multi-core processors:

Why is it suitable for multi-cores?

• Objects connote and promote locality
• Message-driven execution
  – A strong principle of prediction for data and code use
  – Much stronger than the principle of locality
  – Can be used to scale the memory wall: prefetching of needed data, into scratchpad memories, for example

[Diagram: two cores, each with its own scheduler and message queue]

Page 23: Parallel Programming models in the era of multi-core processors:

Why Charm++ & Cell?

• Data encapsulation / locality
  – Each message is associated with...
    • Code: entry method
    • Data: message & chare data
  – Entry methods tend to access data local to the chare and the message
• Virtualization (many chares per processor)
  – Provides the opportunity to overlap SPE computation with DMA transactions
  – Helps ensure there is always useful work to do
• Message queue peek-ahead / predictability
  – Peek ahead in the message queue to determine future work
  – Fetch code and data before execution of the entry method

Page 24: Parallel Programming models in the era of multi-core processors:

System View on Cell

Work by David Kunzman (with Gengbin Zheng, Eric Bohm, ...)

Page 25: Parallel Programming models in the era of multi-core processors:


Charm++ on Cell Roadmap

Page 26: Parallel Programming models in the era of multi-core processors:

SMP Implementation of Charm++

• Has existed since the beginning
  – An original aim (circa 1988) was to be portable between shared-memory and distributed-memory machines
  – Multimax, Sequent, hypercubes
• SMP-nodes version
  – Exploits shared memory within a node
• Performance issues...

Page 27: Parallel Programming models in the era of multi-core processors:


So, I expect Charm++ to be a strong contender for manycore models

BUT: What about the quest for Simplicity?

Charm++ is powerful, but not much simpler than, say, MPI

Page 28: Parallel Programming models in the era of multi-core processors:

How to Get to Simple Parallel Programming Models?

• Parallel programming is much too complex
  – In part because of resource management issues
    • Handled by adaptive runtime systems
  – In larger part, because of unintended non-determinacy
    • Race conditions
• Clearly, we need simple models
  – But what are we willing to give up? (No free lunch)
  – Give up "completeness"!?!
  – Maybe one can design a language that is simple to use, but not expressive enough to capture all needs

Page 29: Parallel Programming models in the era of multi-core processors:

Simplicity?

• A collection of "incomplete" languages, backed by a (few) complete ones, will do the trick
  – As long as they are interoperable
• Where does simplicity come from?
  – Outlaw non-determinacy!
  – Deterministic, simple, parallel programming models
    • With Marc Snir, Vikram Adve, ...
  – Are there examples of such paradigms?
    • Multiphase Shared Arrays [LCPC '04]
    • Charisma++ [LCR '04]

Page 30: Parallel Programming models in the era of multi-core processors:

Shared memory or not

• Smart people on both sides: thesis, antithesis
  – Clearly, this needs a "synthesis"
• "Shared memory is easy to program" has
  – Only a grain of truth
  – But that grain of truth does exist
• We, as a community, need to have this debate
  – Put some armor on, drink a friendship potion, but debate the issue threadbare
  – What do we mean by the SAS model, and what do we like and dislike about it?

Page 31: Parallel Programming models in the era of multi-core processors:

Multiphase Shared Arrays

• Observations:
  – The general shared-address-space abstraction is complex
  – Certain special cases are simple, and cover most uses
• Each array is in one mode at a time
  – But its mode may change from phase to phase
• Modes:
  – Write-once
  – Read-only
  – Accumulate
  – Owner-computes
• All workers sync at the end of each phase

Page 32: Parallel Programming models in the era of multi-core processors:

MSA: the simple model

• A program consists of
  – A collection of Charm threads, and
  – Multiple collections of data arrays
    • Partitioned into pages (user-specified)
• Execution begins in a "main"
  – Then all threads are fired in parallel
• More complex model:
  – Multiple collections of threads
  – ...

Page 33: Parallel Programming models in the era of multi-core processors:

MSA: Plimpton MD

for timestep = 0 to Tmax {
  // Phase I: Force computation (for a section of the interaction matrix)
  for i = i_start to i_end
    for j = j_start to j_end
      if (nbrlist[i][j]) {            // nbrlist enters ReadOnly mode
        force = calculateForce(coords[i], atominfo[i], coords[j], atominfo[j]);
        forces[i] += force;           // Accumulate mode
        forces[j] += -force;
      }
  nbrlist.sync(); forces.sync(); coords.sync(); atominfo.sync();

  // Phase II: Integration
  for k = myAtomsBegin to myAtomsEnd
    coords[k] = integrate(atominfo[k], forces[k]);   // WriteOnly mode
  coords.sync(); atominfo.sync(); forces.sync();

  // Phase III: Update neighbor list every 8 steps
  if (timestep % 8 == 0) {
    for i = i_start to i_end
      for j = j_start to j_end
        nbrList[i][j] = distance(coords[i], coords[j]) < CUTOFF;
    nbrList.sync(); coords.sync();
  }
}

Page 34: Parallel Programming models in the era of multi-core processors:

Extensions

• Need a check for each access: is the page here?
  – Prefetching, and known-local accesses
• A twist on Accumulate
  – Each array element can be a set
  – Set union is a valid accumulate operation
  – Example: appending a list of (x, y) points (see the sketch below)
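To illustrate that last point, here is a small, hedged C++ sketch of the idea only (the element type and operators are mine, not the real MSA API): an element whose value is a set can use union as its accumulate operator, so "+=" contributions from different workers commute and can be merged in any order before the phase's sync.

#include <set>
#include <utility>

// Hypothetical element type for an accumulate-mode array: its value is a set
// of (x, y) points, and operator+= performs insertion / set union. Union is
// commutative and associative, so contributions can be merged in any order.
struct PointSet {
  std::set<std::pair<int, int>> pts;

  PointSet& operator+=(const std::pair<int, int>& p) {   // append one point
    pts.insert(p);
    return *this;
  }
  PointSet& operator+=(const PointSet& other) {          // set union
    pts.insert(other.pts.begin(), other.pts.end());
    return *this;
  }
};

int main() {
  PointSet cell;                          // conceptually, one array element
  cell += std::make_pair(1, 2);           // contribution from worker A
  cell += std::make_pair(3, 4);           // contribution from worker B
  PointSet fromWorkerC;
  fromWorkerC += std::make_pair(1, 2);    // duplicate point: union keeps one copy
  cell += fromWorkerC;
  return 0;                               // cell.pts now holds {(1,2), (3,4)}
}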

Page 35: Parallel Programming models in the era of multi-core processors:

MSA: Graph Partition

// Phase I: EtoN: ReadOnly, NtoE: Accumulate
for i = 1 to EtoN.length()
  for j = 1 to EtoN[i].length() {
    n = EtoN[i][j];
    NtoE[n] += i;                  // Accumulate
  }
EtoN.sync(); NtoE.sync();

// Phase II: NtoE: ReadOnly, EtoE: Accumulate
for j = my section of j            // for each pair (e1, e2) in NtoE[j]
  for i1 = 1 to NtoE[j].length()
    for i2 = i1 + 1 to NtoE[j].length() {
      e1 = NtoE[j][i1];
      e2 = NtoE[j][i2];
      EtoE[e1] += e2;              // Accumulate
      EtoE[e2] += e1;
    }
EtoN.sync(); NtoE.sync();

Page 36: Parallel Programming models in the era of multi-core processors:

Charisma

• Static data flow
  – Suffices for a number of applications: molecular dynamics, FEM, PDEs, etc.
• Global data and control flow are explicit
  – Unlike Charm++

Page 37: Parallel Programming models in the era of multi-core processors:

The Charisma Programming Model

• Arrays of objects
• Global parameter space (PS)
  – Objects read from and write into the PS
• Clean division between
  – Parallel (orchestration) code
  – Sequential methods

[Diagram: worker objects communicating through buffers in the parameter space]

Page 38: Parallel Programming models in the era of multi-core processors:

Example: Stencil Computation

foreach x,y in workers
  (lb[x,y], rb[x,y], ub[x,y], db[x,y]) <- workers[x,y].produceBorders();
end-foreach

foreach x,y in workers
  (+err) <- workers[x,y].compute(lb[x+1,y], rb[x-1,y], ub[x,y+1], db[x,y-1]);
end-foreach

while (err > epsilon)
  foreach x,y in workers
    (lb[x,y], rb[x,y], ub[x,y], db[x,y]) <- workers[x,y].produceBorders();
    (+err) <- workers[x,y].compute(lb[x+1,y], rb[x-1,y], ub[x,y-1], db[x,y+1]);
  end-foreach
end-while

(+err) denotes a reduction on the variable err.

Page 39: Parallel Programming models in the era of multi-core processors:

Language Features

• Communication patterns: point-to-point, multicast, scatter, gather
• Determinism: methods are invoked on objects in program order
• Support for libraries: use external ones, or create your own

Point-to-point:
foreach w in workers
  (p[w]) <- workers[w].produce();
  workers[w].consume(p[w-1]);
end-foreach

Multicast:
foreach r in workers
  (p[r]) <- workers[r].produce();
end-foreach
foreach r,c in consumers
  consumers[r,c].consume(p[r]);
end-foreach

Scatter:
foreach r in workers
  (p[r,*]) <- workers[r].produce();
end-foreach
foreach r,c in consumers
  consumers[r,c].consume(p[r,c]);
end-foreach

Page 40: Parallel Programming models in the era of multi-core processors:

Charisma: Motivation

• Rocket simulation example under traditional MPI vs. the Charm++/AMPI framework
  – Benefit: load balance, communication optimizations, modularity
  – Problem: flow of control buried in asynchronous method invocations

[Diagram: solid and fluid modules, either paired one-to-one on P processors (MPI) or decomposed into n solid and m fluid virtual processors (Charm++/AMPI)]

Page 41: Parallel Programming models in the era of multi-core processors:


Motivation: Car-Parrinello Ab Initio Molecular Dynamics (CPMD)

Charisma presentation

Page 42: Parallel Programming models in the era of multi-core processors:

[Diagram: a multimodule application built on Charisma, MSA, AMPI, and other abstractions (GA, CAF, UPC, PPL1), layered over TCharm and Charm++]

A set of "incomplete" but elegant/simple languages, backed by a low-level complete one

Page 43: Parallel Programming models in the era of multi-core processors:

Interoperable Multi-paradigm programming


Page 44: Parallel Programming models in the era of multi-core processors:

Let's play together

• Multiple programming models need to be investigated
  – "Survival of the fittest" doesn't lead to a single species; it leads to an ecosystem
• Different ones may be good for different algorithms/domains/...
• Allow them to interoperate in a multi-paradigm environment

Page 45: Parallel Programming models in the era of multi-core processors:

Summary

• It is necessary to raise the level of abstraction
  – Foundation: an adaptive runtime system, based on migratable objects
    • Automate resource management
    • Composability
    • Interoperability
  – Design new models that avoid data races and promote locality
  – Incorporate the good aspects of the shared-memory model

More info on my group's work: http://charm.cs.uiuc.edu