
CSC 447: Parallel Programming for Multi-Core and Cluster Systems
Why Parallel Computing?

Haidar M. Harmanani
Spring 2017

Definitions
§ What is parallel?
– Webster: "An arrangement or state that permits several operations or tasks to be performed simultaneously rather than consecutively"
§ Parallel computing
– Using a parallel computer to solve single problems faster
§ Parallel computer
– A multiple-processor system that supports parallel programming
§ Parallel programming
– Programming in a language that supports concurrency explicitly

Parallel Processing – What is it?
§ A parallel computer is a computer system that uses multiple processing elements simultaneously, in a cooperative manner, to solve a computational problem
§ Parallel processing includes the techniques and technologies that make it possible to compute in parallel
– Hardware, networks, operating systems, parallel libraries, languages, compilers, algorithms, tools, …
§ Parallel computing is an evolution of serial computing
– Parallelism is natural
– Computing problems differ in the level and type of parallelism they expose
§ Parallelism is all about performance! Really?

Concurrency
§ Consider multiple tasks to be executed in a computer
§ Tasks are concurrent with respect to each other if
– They can execute at the same time (concurrent execution)
– This implies that there are no dependencies between the tasks
§ Dependencies
– If a task requires results produced by other tasks in order to execute correctly, the task's execution is dependent
– If two tasks are dependent, they are not concurrent
– Some form of synchronization must be used to enforce (satisfy) dependencies (see the sketch below)
§ Concurrency is fundamental to computer science
– Operating systems, databases, networking, …
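To make dependencies and synchronization concrete, here is a minimal pthreads sketch (an illustration, not from the slides; the names are made up): one task depends on a result produced by another, and a join is the synchronization that satisfies the dependency.

#include <pthread.h>
#include <stdio.h>

static double result;                    /* produced by task A, needed by task B */

static void *task_a(void *arg) {
    (void)arg;
    result = 42.0;                       /* stand-in for real work */
    return NULL;
}

int main(void) {
    pthread_t a;
    pthread_create(&a, NULL, task_a, NULL);
    /* Task B depends on task A, so the two are not concurrent:
       the join enforces the dependency before B reads the result. */
    pthread_join(a, NULL);
    printf("task B consumes %f\n", result);
    return 0;
}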

Concurrency and Parallelism
§ Concurrent is not the same as parallel! Why?
§ Parallel execution
– Concurrent tasks actually execute at the same time
– Multiple (processing) resources have to be available
§ Parallelism = concurrency + "parallel" hardware
– Both are required
– Find concurrent execution opportunities
– Develop the application to execute in parallel
– Run the application on parallel hardware
§ Is a parallel application a concurrent application?
§ Is a parallel application run with one processor parallel? Why or why not?

Concurrency vs. Parallelism
– Concurrency: two or more threads are in progress at the same time
– Parallelism: two or more threads are executing at the same time
[Figure: Thread 1 and Thread 2 interleaved on one processor (concurrency) vs. Thread 1 and Thread 2 running simultaneously (parallelism)]

Why Parallel Computing Now?
§ Researchers have been using parallel computing for decades
– Mostly in computational science and engineering
– For problems too large to solve on one computer; use 100s or 1000s
[Figure: the classical scientific method (nature, observation, theory, physical experimentation) extended by the modern scientific method to include numerical simulation]

Why Parallel Computing Now?
§ Many companies in the 80s/90s "bet" on parallel computing and failed
– Computers got faster too quickly for there to be a large market
§ Why an undergraduate course on parallel programming?
– Because the entire computing industry has bet on parallelism
– There is a desperate need for parallel programmers
§ There are 3 ways to improve performance:
– Work harder
– Work smarter
– Get help
§ The computer analogy:
– Work harder: use faster hardware
– Work smarter: use optimized algorithms and techniques
– Get help: use multiple computers to solve a particular task

Technology Trends: Microprocessor Capacity
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.

Technology Trends: Microprocessor Capacity
Microprocessors have become smaller, denser, and more powerful.
[Figure: speed (log scale) vs. time for supercomputers, mainframes, minis, and micros]

50 Years of Speed Increases
– ENIAC: 350 flops
– Today: > 1 trillion flops

CPUs 1 Million Times Faster
§ Faster clock speeds
§ Greater system concurrency
– Multiple functional units
– Concurrent instruction execution
– Speculative instruction execution

Systems 1 Billion Times Faster
§ Processors are 1 million times faster
§ Combine thousands of processors
§ Parallel computer
– Multiple processors
– Supports parallel programming
§ Parallel computing = using a parallel computer to execute a program faster

Microprocessor Transistors and Clock Rate
Why bother with parallel programming? Just wait a year or two…

[Figure: transistor counts per chip, 1970–2005 (log scale, 1,000 to 100,000,000), for the i4004, i8080, i8086, i80286, i80386, R2000, R3000, R10000, and Pentium; and clock rates, 1970–2000, rising from 0.1 MHz to about 1000 MHz]

Limit #1: Power density
[Figure: power density (W/cm²) vs. year, 1970–2010 (log scale), for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, and P6, heading toward the levels of a hot plate, a nuclear reactor, a rocket nozzle, and the sun's surface]
Scaling clock speed (business as usual) will not work.
Intel VP Patrick Gelsinger (ISSCC 2001): "If scaling continues at present pace, by 2005 high-speed processors would have the power density of a nuclear reactor, by 2010 a rocket nozzle, and by 2015 the surface of the sun."

Device scaling
[Figure: a very idealistic NMOS transistor drawn to scale, with the channel doping increased by a factor of S]
Increasing the channel doping density decreases the depletion width
⇒ improves isolation between source and drain during the OFF state
⇒ permits the distance between the source and drain regions to be scaled (scaled down by L)

Parallelism Saves Power
§ Exploit explicit parallelism to reduce power

Power = C × V² × F        Performance = Cores × F
(C = capacitance, V = voltage, F = frequency)

• Using additional cores
– Increase density (= more transistors = more capacitance)
– Can increase cores (2×) and performance (2×)
– Or increase cores (2×) but decrease frequency (½): the same performance at ¼ the power, since halving F also allows V to be halved (the short program below checks these numbers)

Power = 2C × V² × F            Performance = 2 Cores × F
Power = 2C × (V/2)² × (F/2)    Performance = 2 Cores × (F/2)
      = (C × V² × F)/4         = Cores × F

• Additional benefits
– Small/simple cores → more predictable performance
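A tiny numeric check of the slide's claim (illustrative only; it assumes, as the slide does, that halving the frequency permits halving the supply voltage):

#include <stdio.h>

static double power(double c, double v, double f) { return c * v * v * f; }
static double perf(double cores, double f)        { return cores * f; }

int main(void) {
    const double C = 1.0, V = 1.0, F = 1.0;   /* normalized baseline */
    printf("1 core:            power = %.2f, perf = %.2f\n",
           power(C, V, F), perf(1, F));
    printf("2 cores, V/2, F/2: power = %.2f, perf = %.2f\n",
           power(2 * C, V / 2, F / 2), perf(2, F / 2));   /* prints 0.25 and 1.00 */
    return 0;
}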

Limit #2: Hidden Parallelism Tapped Out
§ Application performance was increasing by 52% per year, as measured by the SPECint benchmarks in the figure below
[Figure: SPECint performance, 1978–2006 (log scale), growing at 25%/year in the VAX era (1978–1986) and 52%/year in the RISC + x86 era (1986–2002), then ??%/year]
• ½ due to transistor density
• ½ due to architecture changes, e.g., Instruction Level Parallelism (ILP)
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006

Limit #2: Hidden Parallelism Tapped Out
§ Superscalar (SS) designs were the state of the art; many forms of parallelism were not visible to the programmer
– Multiple instruction issue
– Dynamic scheduling: hardware discovers parallelism between instructions
– Speculative execution: look past predicted branches
– Non-blocking caches: multiple outstanding memory operations
§ You may have heard of these in CSC 320, but you haven't needed to know about them to write software
§ Unfortunately, these sources of performance have been used up

Uniprocessor Performance (SPECint) Today
[Figure: performance relative to the VAX-11/780, 1978–2006 (log scale); growth since 2002 is perhaps 2× every 5 years, leaving roughly a 3× gap versus the earlier trend]
• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006
⇒ Sea change in chip design: multiple "cores" or processors per chip

Limit #3: Chip Yield
• Moore's (Rock's) 2nd law: fabrication costs go up
• Yield (% usable chips) drops
• Parallelism can help
– Smaller, simpler processors are easier to design and validate
– Partially working chips can still be used; e.g., the Cell processor (PS3) is sold with 7 of its 8 cores "on" to improve yield
Manufacturing costs and yield problems limit the use of density.

Limit #4: Speed of Light (Fundamental)
§ Consider a 1 Tflop/s sequential machine:
– Data must travel some distance, r, to get from memory to the CPU
– To get 1 data element per cycle, data must make that trip 10^12 times per second at the speed of light, c = 3×10^8 m/s; thus r < c/10^12 = 0.3 mm (the arithmetic is written out below)
§ Now put 1 TByte of storage in a 0.3 mm × 0.3 mm area:
– Each bit occupies about 1 square Angstrom, the size of a small atom
§ No choice but parallelism
[Figure: a 1 Tflop/s, 1 TByte sequential machine fits in r = 0.3 mm]
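The 0.3 mm bound, written out:

\[
r < \frac{c}{10^{12}\ \mathrm{s^{-1}}}
  = \frac{3 \times 10^{8}\ \mathrm{m/s}}{10^{12}\ \mathrm{s^{-1}}}
  = 3 \times 10^{-4}\ \mathrm{m}
  = 0.3\ \mathrm{mm}.
\]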


Revolution is Happening Now
§ Chip density continues to increase ~2× every 2 years
– Clock speed is not
– The number of processor cores may double instead
§ There is little or no hidden parallelism (ILP) left to be found
§ Parallelism must be exposed to and managed by software


Tunnel Vision by Experts
§ "On several recent occasions, I have been asked whether parallel computing will soon be relegated to the trash heap reserved for promising technologies that never quite make it."
o Ken Kennedy, CRPC Director, 1994
§ "640K [of memory] ought to be enough for anybody."
o Bill Gates, chairman of Microsoft, 1981
§ "There is no reason for any individual to have a computer in their home."
o Ken Olsen, president and founder of Digital Equipment Corporation, 1977
§ "I think there is a world market for maybe five computers."
o Thomas Watson, chairman of IBM, 1943

Why Parallelism (2016)?
§ All major processor vendors are producing multicore chips
– Every machine will soon be a parallel machine
– All programmers will be parallel programmers???
§ New software model
– Want a new feature? Hide the "cost" by speeding up the code first
– All programmers will be performance programmers???
§ Some parallelism may eventually be hidden in libraries, compilers, and high-level languages
– But a lot of work is needed to get there
§ Big open questions:
– What will be the killer apps for multicore machines?
– How should the chips be designed, and how will they be programmed?

Phases of Supercomputing (Parallel) Architecture
§ Phase 1 (1950s): sequential instruction execution
§ Phase 2 (1960s): sequential instruction issue
– Pipelined execution, reservation stations
– Instruction Level Parallelism (ILP)
§ Phase 3 (1970s): vector processors
– Pipelined arithmetic units
– Registers, multi-bank (parallel) memory systems
§ Phase 4 (1980s): SIMD and SMPs
§ Phase 5 (1990s): MPPs and clusters
– Communicating sequential processors
§ Phase 6 (>2000): many cores, accelerators, scale, …


Parallel Programming Work Flow
§ Identify the compute-intensive parts of an application
§ Adopt scalable algorithms
§ Optimize data arrangements to maximize locality
§ Tune performance
§ Pay attention to code portability and maintainability

So what is the problem?

Writing (fast) parallel programs is hard!

Principles of Parallel Computing
§ Finding enough parallelism (Amdahl's Law)
§ Granularity
§ Locality
§ Load balance
§ Coordination and synchronization
§ Performance modeling

All of these things make parallel programming even harder than sequential programming.

Finding Enough Parallelism (Amdahl's Law)
§ Suppose only part of an application seems parallel
§ Amdahl's law
– Let s be the fraction of work done sequentially, so (1 - s) is the fraction that is parallelizable
– Let P be the number of processors
§ Even if the parallel part speeds up perfectly, performance is limited by the sequential part:

Speedup(P) = Time(1) / Time(P)
          <= 1 / (s + (1 - s)/P)
          <= 1/s
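A few numbers make the bound concrete. A small C sketch (illustrative, not from the slides) that evaluates the bound for an assumed serial fraction s = 0.05:

#include <stdio.h>

/* Amdahl's bound: speedup(P) <= 1 / (s + (1 - s)/P) */
static double amdahl(double s, int p) {
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void) {
    const double s = 0.05;                 /* assumed serial fraction: 5% */
    const int procs[] = { 1, 2, 8, 64, 1024 };
    for (int i = 0; i < 5; i++)
        printf("P = %4d   speedup <= %5.2f\n", procs[i], amdahl(s, procs[i]));
    printf("limit as P grows: 1/s = %.1f\n", 1.0 / s);
    return 0;
}

Even with 1024 processors the speedup stays below 1/s = 20.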

Finding Enough Parallelism (Amdahl's Law)
[Figure: speedup as a function of P, bounded above by 1/s]

Granularity
§ Given enough parallel work, this is the biggest barrier to getting the desired speedup
§ Parallelism overheads include:
– The cost of starting a thread or process
– The cost of communicating shared data
– The cost of synchronizing
– Extra (redundant) computation
§ Each of these can be in the range of milliseconds (= millions of flops) on some systems
§ Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work (see the sketch below)
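As an illustration (not from the slides), OpenMP exposes granularity directly through its schedule clause; the kernel and the chunk size of 4096 below are assumptions for the sketch, not recommendations. Compile with -fopenmp.

/* Granularity as a tunable knob: the chunk size trades scheduling
   overhead against load balance. */
void saxpy(int n, float a, const float *x, float *y) {
    /* Each scheduling decision hands a thread 4096 iterations: large
       enough to amortize the overhead of grabbing work, small enough
       that all threads stay busy until the end of the loop. */
    #pragma omp parallel for schedule(dynamic, 4096)
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}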


Locality and Parallelism
§ Large memories are slow; fast memories are small
§ Storage hierarchies are large and fast on average
§ Parallel processors, collectively, have large, fast cache
◦ The slow accesses to "remote" data we call "communication"
§ An algorithm should do most of its work on local data (see the sketch after the figure)
[Figure: conventional storage hierarchy, with three processors, each with its own cache, L2 cache, L3 cache, and memory, joined by potential interconnects]
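A minimal C sketch of the "work on local data" rule (an illustration, not from the slides): the same traversal is fast or slow depending on whether the inner loop matches C's row-major layout.

#include <stdio.h>

#define N 1024
static double a[N][N];

int main(void) {
    double sum = 0.0;

    /* Local/cache-friendly: the inner loop walks contiguous memory,
       matching C's row-major layout. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Cache-hostile: the inner loop strides N doubles between
       accesses, defeating the storage hierarchy. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%f\n", sum);
    return 0;
}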

Load Balance/Imbalance
§ Load imbalance is the time that some processors in the system are idle due to
– Insufficient parallelism (during that phase)
– Unequal size tasks
§ Examples of the latter
– Adapting to the "interesting parts of a domain"
– Tree-structured computations
– Fundamentally unstructured problems
§ The algorithm needs to balance the load

Load Balance
[Figure: an even task distribution across processors (good) vs. an uneven one (bad!)]

Synchronization
§ Need to manage the sequence of work and the tasks performing it
§ Often requires "serialization" of segments of the program
§ Various types of synchronization may be involved (as in the sketch below)
– Locks/semaphores
– Barriers
– Synchronous communication operations
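A minimal pthreads sketch of lock-based synchronization (illustrative; the counter and thread count are made up): a mutex serializes the increments that the threads would otherwise race on.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *work(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);     /* serialize this segment */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, work, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);  /* always 4000000 with the lock;
                                            unpredictable without it */
    return 0;
}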


Performance Modeling
§ Analyzing and tuning parallel program performance is more challenging than for serial programs
§ There is a need for parallel program performance analysis and tuning

So how do we do parallel computing?

Strategy 1: Extend Compilers
§ Focus on making sequential programs parallel
§ Parallelizing compiler
– Detects parallelism in a sequential program
– Produces a parallel executable program
§ Advantages
– Can leverage millions of lines of existing serial programs
– Saves time and labor
– Requires no retraining of programmers
– Sequential programming is easier than parallel programming
§ Disadvantages
– Parallelism may be irretrievably lost when programs are written in sequential languages (see the loop sketch below)
– The performance of parallelizing compilers on a broad range of applications is still up in the air
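A sketch of what such a compiler faces (illustrative; the function names are made up): the first loop's iterations are independent and can be parallelized automatically, while the second carries a dependence from one iteration to the next.

/* Independent iterations: a parallelizing compiler can split this loop
   across processors, since no iteration reads what another writes. */
void scale(double *a, const double *b, int n) {
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * b[i];
}

/* Loop-carried dependence: iteration i needs the result of iteration
   i-1, so the compiler cannot simply run the iterations in parallel
   (a human might recast this as a parallel prefix scan). */
void running_sum(double *a, int n) {
    for (int i = 1; i < n; i++)
        a[i] += a[i - 1];
}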

Strategy 2: Extend Language
§ Add functions to a sequential language to
– Create and terminate processes
– Synchronize processes
– Allow processes to communicate
§ Advantages
– Easiest, quickest, and least expensive approach
– Allows existing compiler technology to be leveraged
– New libraries can be ready soon after new parallel computers are available
§ Disadvantages
– Lack of compiler support to catch errors
– Easy to write programs that are difficult to debug
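MPI, a message-passing library for C and Fortran, is a well-known instance of this strategy. A minimal sketch, assuming an MPI implementation is installed:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);                  /* start up the parallel processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which process am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many processes are there? */
    printf("hello from process %d of %d\n", rank, size);
    MPI_Finalize();                          /* terminate cleanly */
    return 0;
}

Run with, e.g., mpirun -np 4 ./hello to launch four communicating processes.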

Strategy 3: Add a Parallel Programming Layer
§ Lower layer
– The core of the computation
– Each process manipulates its portion of the data to produce its portion of the result
§ Upper layer
– Creation and synchronization of processes
– Partitioning of data among processes
§ A few research prototypes have been built based on these principles

Strategy 4: Create a Parallel Language
§ Develop a parallel language "from scratch"
– occam is an example
§ Or add parallel constructs to an existing language
– Fortran 90
– High Performance Fortran
– C*
§ Advantages
– Allows the programmer to communicate parallelism to the compiler
– Improves the probability that the executable will achieve high performance
§ Disadvantages
– Requires the development of new compilers
– New languages may not become standards
– Programmer resistance