Welcome to COMP4300/8300 – Parallel Systems · 2020-02-21


Welcome to COMP4300/8300 – Parallel Systems

UltraSPARC T2 (Niagara-2) multicore chip layout

(courtesy of T. Okazaki, Flickr)

COMP4300/8300 L1: Introduction to Parallel Systems 2020

Page 2: Welcome to COMP4300/8300 – Parallel Systems · 2020-02-21 · Welcome to COMP4300/8300 – Parallel Systems UltraSPARC T2 (Niagara-2) multicore chip layout (courtesy of T. Okazaki,

Lecture Overview

- parallel computing concepts and scales
- sample application areas
- parallel programming's rise, decline and rebirth:
  - the role of Moore's Law and Dennard Scaling
  - the multicore revolution and a new design era
- the Top 500 supercomputers and challenges for 'petascale computing'
- why parallel programming is hard
- course contact
- assumed knowledge and assessment


Parallel Computing: Concept and Rationale

The idea:

- split your computation into bits that can be executed simultaneously

Motivation:

- Speed, Speed, Speed ··· at a cost-effective price
  - if we didn't want it to go faster, we wouldn't bother with the hassles of parallel programming!
- reduce the time to solution to acceptable levels
  - no point in taking 1 week to predict tomorrow's weather!
  - simulations that take months are NOT useful in a design environment

Parallelism is when the different components of a computation execute together; it is a subset of concurrency, in which the components may execute in any order.


Parallelization

Split the program up and run the parts simultaneously on different processors.

- on p processors, the time to solution should (ideally!) be reduced by a factor of p
- parallel programming: the art of writing the parallel code
- parallel computer: the hardware on which we run our parallel code

This course will discuss both!

Beyond raw compute power, other motivations may include:

- enabling more accurate simulations in the same time (finer grids)
- providing access to huge aggregate memories (e.g. an atmosphere model requiring ≥ 8 Intel nodes to run on)
- providing more and/or better input/output capacity
- hiding latency (although this is, strictly speaking, concurrency)


Scales of Parallelism

- within a CPU/core: pipelined instruction execution, multiple instruction issue (superscalar), other forms of instruction-level parallelism, SIMD units*
- within a chip: multiple cores*, hardware multithreading*, accelerator units* (with multiple cores), transactional memory*
- within a node: multiple sockets* (CPU chips), interleaved memory access (multiple DRAM chips), disk block striping / RAID (multiple disks)
- within a SAN (system area network): multiple nodes* (clusters, typical supercomputers), parallel filesystems
- within the internet: grid computing*, distributed workflows*

*requires significant parallel programming effort

What programming paradigms are typically applied to each feature?


Sample Application Areas: Grand Challenge Problems

- fluid flow problems
  - weather prediction and climate change, ocean flow
  - aerodynamic modelling for cars, planes, rockets etc
- structural mechanics
  - building, bridge, car etc strength analysis
  - car crash simulation
- speech and character recognition, image processing
- visualization, virtual reality
- semiconductor design, simulation of new chips
- structural biology, design of drugs
- human genome mapping
- financial market analysis and simulation
- data mining, machine learning, games


Example: World Climate Modeling

- the atmosphere is divided into 3D regions, or cells
- complex mathematical equations describe the conditions in each cell, e.g. pressure, temperature, wind speed, etc
  - conditions in each cell change according to the conditions in neighbouring cells
  - updates are repeated many times to model the passage of time
  - cells are affected by more distant cells the longer the forecast runs
- assume:
  - cells are 1×1×1 mile, to a height of 10 miles ⇒ 5×10^8 cells
  - 200 floating-point operations (FLOPs) to update each cell ⇒ 10^11 FLOPs per timestep
  - a timestep represents 10 minutes and we want a 10-day forecast ⇒ 1440 timesteps, ≈1.4×10^14 FLOPs in total
- on a 100 MFLOP/s computer this would require ≈1.4×10^6 seconds – over two weeks
- on a 1 TFLOP/s computer it would take ≈144 seconds – under 3 minutes


The (Rocky) Rise of Parallel Computing

- early ideas – 1946: cellular automata (John von Neumann); 1958: SOLOMON (1024 1-bit processors, Daniel Slotnick); 1962: Burroughs D825 4-CPU SMP
- 1967: Gene Amdahl proposes Amdahl's Law, debates with Slotnick at the AFIPS Conf.
- 1970s: vector processors become the mainstream supercomputers (e.g. Cray-1); a few 'true' parallel computers are built
- 1980s: small-scale parallel vector processors dominant (e.g. Cray X-MP)

"When a farmer needs more ox power to plow his field, he doesn't get a bigger ox, he gets another one. Then he puts the oxen in parallel." – Grace Hopper, 1985-7

"If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?" – Seymour Cray, 1986

- 1988: Reevaluating Amdahl's Law, John Gustafson
- late 80s+: large-scale parallel computers begin to emerge: Computing Surface (Meiko), QCD machine (Columbia Uni), iPSC/860 hypercube (Intel)
- 90s–00s: shared and distributed memory parallel computers used for servers (small-scale) and supercomputers (large-scale)


Moore’s Law & Dennard Scaling

Two "laws" underpin the exponential increase in the performance of (serial) processors from the 1970s.


Moore's Law and Dennard Scaling Undermine Parallel Computing

- parallel computing looked promising in the 90s, but many companies failed due to the 'free lunch' from the combination of Moore's Law and Dennard Scaling
  - why parallelize my codes? Just wait two years, and the processors will be 4 times faster!

"On several recent occasions, I have been asked whether parallel computing will soon be relegated to the trash heap ..." – Ken Kennedy, CRPC Director, 1994

- demography of parallel computing (mid 90s, origin unknown):

[Diagram: the plane of degree of parallelism (P) versus processor speed (R); the directions of expected progress (±dP, ±dR) label the camps: Luddites, Heretics, Agnostics, Luke-warm Believers, True Believers and Fanatics.]


Chip Size and Clock Speed

The 'free lunch' continued into the early 2000s, allowing the construction of faster and more complex (serial) processors – systems so fast that a signal could not cross the chip in a single clock cycle!


End of Dennard Scaling and Other Free Lunch Ingredients

- even with Dennard Scaling, we saw an exponential increase in the power density of chips from 1985 to 2000
  - a 2000 Intel chip was equivalent to a hotplate, and would have reached a rocket nozzle by 2010!
- then Dennard Scaling ceased around 2005!!
- instruction-level parallelism (ILP) also reached its limits


The Multicore Revolution

- vendors began putting multiple CPUs (cores) on a chip, and stopped increasing the clock frequency
  - 2004: Sun releases the dual-core Sparc IV, heralding the start of the multicore era
- the (dynamic) power of a chip is given by P = Q·C·V²·f, where V is the voltage, Q is the number of transistors, C is a transistor's capacitance and f is the clock frequency
  - but on a given chip, f ∝ V, so P ∝ Q·f³!
- the (ideal) parallel performance for p cores is given by R = p·f, with Q ∝ p
  - double p ⇒ double R, but also double P
  - double p and halve f ⇒ maintain R, but quarter P!
- doubling the number of cores is better than doubling the clock speed!
- Moore's Law (increasing Q at constant Q·C) is expected to continue (till ???), so we can gain in R at constant P
- cores can be much simpler (hence exploit less ILP), but there can be many of them
- chip design and testing costs are significantly reduced
- parallelism must now be exposed by software: The Free Lunch Is Over!


A New Chip Development Era

  1960–2010                     2010–?
  few transistors               no shortage of transistors
  no power limitations          severe power limitations
  maximize transistor utility   minimize energy
  generalize                    customize

We are now seeing:

- (customized) accelerators, generally manycore with low clock frequencies
  - e.g. Graphics Processing Units (GPUs), customized for fast numerical calculations
- 'dark silicon': the need to turn off parts of the chip to reduce power
- hardware–software codesign: speed via specialization


The Top 500 Most Powerful Computers: June 2018

The Top 500 list provides an interesting window onto these hardware trends and issues.

(http://www.top500.org/resources/presentations/ (51st TOP500))


Top500: Performance Trends

(http://www.top500.org/resources/presentations/ (51st TOP500))


The Top500: Multicore Emergence

(http://www.top500.org/blog/slides-highlights-of-the-45th-top500-list/)


Petascale and Beyond: Challenges and Opportunities

Level: the machine as a whole
  Characteristic: sheer number of nodes – Tianhe-2 has the equivalent of 5M cores, TaihuLight 16M!
  Challenges/opportunities: programming language/environment; fault tolerance

Level: within a node
  Characteristic: heterogeneity – Summit and Tianhe-2 use both CPUs and GPUs
  Challenges/opportunities: what to use when; co-location of data with the unit processing it

Level: within a chip
  Characteristic: energy minimization – processors already have frequency and voltage scaling
  Challenges/opportunities: minimize data size and movement, including the use of just enough precision; specialized cores

In RSCS we are working in all these areas.


Software: Why Parallel Programming is Hard

- writing (correct and efficient) parallel programs is hard!
  - hard to expose enough parallelism; hard to debug!
- getting (close to ideal) speedup is hard! Overheads include:
  - communicating shared data (e.g. cache line invalidations and the resulting reloads)
  - synchronization (barriers and locks)
  - the need for redundant computations; balancing the load evenly
- also, not all of the application may be parallelizable:

  Amdahl's Law: given a fraction f of 'fast' computation at rate R_f, with the remaining computation at the 'slow' rate R_s, the overall rate is R = ((1−f)/R_s + f/R_f)^(−1)

  - interpreted for parallel execution with p processors: f is the fraction of non-serial computation, which (ideally) executes at the rate R_f = p·R_s
  - e.g. with f = 0.9, R = 10·R_s at p = ∞!
  - counterargument (Gustafson's Law): 1−f is not fixed, but decreases with the data size N, e.g. 1−f ∝ N^(−1/2)


Health Warning!

- the course is normally run every other year
- it's a 4000/8000-level course; it's supposed to:
  - be more challenging than a 3000-level course!
  - expose you to 'bleeding edge' technologies
  - be less well structured
  - have a greater expectation on you for self-directed learning
  - have more student participation
  - be fun!
- it assumes you have some background in concurrency (e.g. COMP2310); 2000-level mathematics is not really needed
- it will require strong programming skills – in C!
- Nathan Robertson, 2002 honours student: "Parallel systems and thread safety at Medicare: 2/16 understood it – the other guy was a $70/hr contractor"
- attendance at lectures and pracs is strongly recommended when possible (even though not assessed)


Health Warning – Online Participation!

We will make reasonable efforts to support students impacted by travel restrictions.

- requirements (I): satisfactory (latency and bandwidth) http(s) access to:
  - the course website (lecture notes etc) and gitlab (assignment submission)
  - wattle (lecture recordings and individual/group chat via Connect)
  - NCI documentation and user accounts
  - StReAMS (marks) and piazza (forum)
- requirements (II): a good SSH client (e.g. Cygwin), with outgoing SSH not blocked
  - most crucially to gadi.nci.org.au (if it gets as far as asking for a username/password, you should be OK)
  - later, you will need access to a GPU server (e.g. stugpu2.anu.edu.au; we should be able to work out another solution in time)
- access to all of the above from external countries is believed possible – if not, contact the course convenor ASAP!
- we propose to hold 2×1 hour weekly chat sessions via Connect/wattle for remote practicals


Course Contact

- course web site: http://cs.anu.edu.au/courses/comp4300 (we will use wattle only for lecture recordings and assignment solutions)
- course coordinator & lecturer/tutor: Peter Strazdins, CSIT N217, 6125-5140, comp4300 (@anu.edu.au)
  - please make your subject line meaningful! (e.g. "ass1: trouble meeting deadline")
- discussion forum: accessible via Piazza
- course schedule
  - note: practicals start in week 3! Register now via StReAMS


Proposed Assessment Scheme and Texts

- see the assessment web page: 2 assignments worth 20% each, mid-semester exam 15%, final exam 45% (note the provisions for students impacted by travel restrictions)
- some reading will be essential for the course. Recommended texts (2 copies ordered for Hancock Short-Term Loan):
  - Introduction to Parallel Computing, 2nd Ed., Grama et al. Available online (free for ANU students!)
  - Principles of Parallel Programming, Lin & Snyder
  - Introduction to High Performance Computing for Scientists and Engineers, Hager & Wellein. Available online.
- other references: see the references web page

Announcement: seminar by Dr Jeffrey Vetter (Oak Ridge National Laboratory), "Preparing for Extreme Heterogeneity in High Performance Computing", 11am Tuesday Feb 25, Hanna Neumann Bldg, room 1.33
