
ECMWF Slide 1

Introduction to Parallel Computing

George Mozdzynski

March 2004


ECMWF Slide 2

Outline

What is parallel computing?

Why do we need it?

Types of computer

Parallel Computing today

Parallel Programming Languages

OpenMP and Message Passing

Terminology


ECMWF Slide 3

What is Parallel Computing?

The simultaneous use of more than one processor or computer to solve a problem


ECMWF Slide 4

Why do we need Parallel Computing?

Serial computing is too slow

Need for large amounts of memory not

accessible by a single processor


ECMWF Slide 5

An operational IFS TL511L60 forecast model takes about one hour of wall time for a 10-day forecast using 288 CPUs of our IBM Cluster 1600 1.3 GHz system (1920 CPUs in total).

How long would this model take using a fast PC with sufficient memory, e.g. a 3.2 GHz Pentium 4?


ECMWF Slide 6

Answer: about 8 days.

This PC would need about 25 Gbytes of memory.

8 days is too long for a 10-day forecast!

2-3 hours is too long …


ECMWF Slide 7

IFS Forecast Model (TL511L60)

  CPUs   Wall time (secs)
    64        11355
   128         5932
   192         4230
   256         3375
   320         2806
   384         2338
   448         2054
   512         1842

Amdahl's Law: Wall Time = S + P/NCPUS

Fitting this to the measurements above (using Excel's LINEST function) gives:
  Serial   S = 574 secs
  Parallel P = 690930 secs

(Named after Gene Amdahl) If F is the fraction of a calculation that is sequential, and (1-F) is the fraction that can be parallelised, then the maximum speedup that can be achieved by using N processors is 1/(F+(1-F)/N).
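As a quick check of the fit, here is a minimal Fortran sketch (assuming only the fitted values S = 574 secs and P = 690930 secs quoted above) that evaluates Wall Time = S + P/NCPUS for the CPU counts in the table:

   ! Sketch: evaluate the Amdahl fit  Wall Time = S + P/NCPUS
   ! using the fitted constants quoted on the slide.
   program amdahl_fit
     implicit none
     real :: s = 574.0      ! fitted serial seconds
     real :: p = 690930.0   ! fitted parallelisable seconds
     integer :: ncpus

     do ncpus = 64, 512, 64
        print '(A,I4,A,F8.0,A)', 'CPUs=', ncpus, &
             '  predicted wall time=', s + p/real(ncpus), ' secs'
     end do
   end program amdahl_fit

At 512 CPUs the fit predicts about 1923 secs, close to the measured 1842 secs.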


ECMWF Slide 8

IFS Forecast Model (TL511L60)

  CPUs   Wall time (secs)   SpeedUp   Efficiency (%)
     1        691504            1        100.0
    64         11355           61         95.2
   128          5932          117         91.1
   192          4230          163         85.1
   256          3375          205         80.0
   320          2806          246         77.0
   384          2338          296         77.0
   448          2054          337         75.1
   512          1842          375         73.3
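For reference, the SpeedUp and Efficiency columns follow directly from the single-CPU wall time: SpeedUp(N) = T(1)/T(N), and Efficiency(N) = SpeedUp(N)/N expressed as a percentage. For example, at 256 CPUs: 691504/3375 ≈ 205, and 205/256 ≈ 80%.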


ECMWF Slide 9

IFS Forecast Model, TL511L60

[Chart: SpeedUp vs number of IBM Power 4 CPUs (64 to 512), showing Observed, Estimated and Ideal curves]


ECMWF Slide 10

Extrapolating Performance

[Chart: SpeedUp vs number of IBM Power 4 CPUs extrapolated to 2048, showing Estimated vs Ideal curves]

The IFS model would be inefficient on large numbers of CPUs, but is OK up to 512.
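For example, extrapolating the fitted Amdahl curve to 2048 CPUs gives roughly 574 + 690930/2048 ≈ 911 secs, a speed-up of only about 760 out of an ideal 2048 (efficiency ≈ 37%).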


ECMWF Slide 11

Types of Parallel Computer

P = Processor, M = Memory, S = Switch

[Diagram: Shared Memory - several processors (P ... P) attached to a single memory (M); Distributed Memory - processor/memory pairs (P + M) connected through a switch (S)]


ECMWF Slide 12

IBM Cluster 1600 (at ECMWF)

P = Processor, M = Memory, S = Switch

[Diagram: several nodes, each a shared-memory group of processors (P ... P) with its own memory (M), with the nodes connected to each other through a switch (S)]


ECMWF Slide 13

IBM Cluster 1600s at ECMWF (hpca + hpcb)


ECMWF Slide 14

ECMWF supercomputers

1979  CRAY-1A                             Vector
      CRAY XMP-2, XMP-4, YMP-8, C90-16    Vector + Shared Memory Parallel
      Fujitsu VPP700, VPP5000             Vector + MPI Parallel
2002  IBM p690                            Scalar + MPI + Shared Memory Parallel


ECMWF Slide 15

ECMWF’s first Supercomputer

CRAY-1A

1979


ECMWF Slide 16

Where have 25 years gone?


ECMWF Slide 17

Types of Processor

DO J=1,1000
  A(J)=B(J) + C
ENDDO

SCALAR PROCESSOR (a single instruction processes one element):
  LOAD B(J)
  FADD C
  STORE A(J)
  INCR J
  TEST

VECTOR PROCESSOR (a single instruction processes many elements):
  LOADV B -> V1
  FADDV V1,C -> V2
  STOREV V2 -> A


ECMWF Slide 18

Parallel Computing Today

Vector Systems
- NEC SX6
- CRAY X-1
- Fujitsu VPP5000

Scalar Systems
- IBM Cluster 1600
- Fujitsu PRIMEPOWER HPC2500
- HP Integrity rx2600 Itanium2

Cluster Systems (typically installed by an integrator)
- Virginia Tech, Apple G5 / Infiniband
- NCSA, Dell PowerEdge 1750, P4 Xeon / Myrinet
- LLNL, MCR Linux Cluster Xeon / Quadrics
- LANL, Linux Networx AMD Opteron / Myrinet


ECMWF Slide 19

The TOP500 project

Started in 1993

The top 500 sites are reported

Report produced twice a year:
- in EUROPE in JUNE
- in the USA in NOVEMBER

Performance based on the LINPACK benchmark

http://www.top500.org/


ECMWF Slide 20

Top 500 Supercomputers


ECMWF Slide 21

Where is ECMWF in the Top 500?

[Chart: ECMWF systems in the Top 500 list, showing Rmax and Rpeak]

Rmax  - Gflop/sec achieved on the Linpack benchmark
Rpeak - peak hardware Gflop/sec (that will never be reached!)


ECMWF Slide 22

What performance do Meteorological Applications achieve?

Vector computers
- about 30 to 50 percent of peak performance
- relatively more expensive
- also have front-end scalar nodes

Scalar computers
- about 5 to 10 percent of peak performance
- relatively less expensive

Both vector and scalar computers are being used in Met Centres around the world.

Is it harder to parallelize than vectorize?
- Vectorization is mainly a compiler responsibility
- Parallelization is mainly the user's responsibility


ECMWF Slide 23

Overview of Recent Supercomputers

Aad J. van der Steen and Jack J. Dongarra

http://www.top500.org/ORSC/2003/


ECMWF Slide 24


ECMWF Slide 25


ECMWF Slide 26

Parallel Programming Languages?

• High Performance Fortran (HPF)
  - directive based extension to Fortran
  - works on both shared and distributed memory systems
  - not widely used (more popular in Japan?)
  - not suited to applications using irregular grids
  - http://www.crpc.rice.edu/HPFF/home.html

• OpenMP
  - directive based
  - support for Fortran 90/95 and C/C++
  - shared memory programming only
  - http://www.openmp.org
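To illustrate the directive-based style, here is a minimal OpenMP sketch in Fortran (the array names and loop bounds are illustrative, echoing the loop from the earlier processor slide, and are not taken from any real application):

   ! Minimal OpenMP sketch: the !$OMP directives split the loop
   ! iterations across the threads of one shared-memory node.
   ! A compiler without OpenMP support treats them as comments.
   program openmp_example
     implicit none
     integer, parameter :: n = 1000
     real :: a(n), b(n), c
     integer :: j

     b = 1.0
     c = 2.0

   !$OMP PARALLEL DO PRIVATE(J)
     do j = 1, n
        a(j) = b(j) + c
     end do
   !$OMP END PARALLEL DO

     print *, 'a(1) =', a(1)
   end program openmp_example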


ECMWF Slide 27

Most Parallel Programmers use…

Fortran 90/95, C/C++ with MPI for communicating between tasks (processes)

- works for applications running on shared and distributed memory systems

Fortran 90/95, C/C++ with OpenMP

- for applications whose performance requirements can be satisfied by a single (shared-memory) node

Hybrid combination of MPI/OpenMP

- ECMWF’s IFS uses this approach
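For comparison, a minimal MPI sketch in Fortran (not IFS code; just the initialise / query rank and size / finalise pattern that every MPI program shares):

   ! Minimal MPI sketch: each task starts MPI, finds out its rank and
   ! the total number of tasks, and prints a message. Real applications
   ! would exchange data with calls such as MPI_Send/MPI_Recv.
   program mpi_example
     use mpi                  ! on older systems: include 'mpif.h'
     implicit none
     integer :: ierr, rank, ntasks

     call MPI_Init(ierr)
     call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
     call MPI_Comm_size(MPI_COMM_WORLD, ntasks, ierr)

     print *, 'Hello from task', rank, 'of', ntasks

     call MPI_Finalize(ierr)
   end program mpi_example

In the hybrid MPI/OpenMP approach used by IFS, OpenMP directives parallelise the loops inside each MPI task.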


ECMWF Slide 28

The myth of automatic parallelization (2 common versions)

Compilers can do anything (but we may have to wait a while)

- Automatic parallelization makes it possible (or will soon make it possible) to port any application to a parallel machine and see wonderful speedups without any modifications to the source

Compilers can't do anything (now or ever)

- Automatic parallelization is useless. It'll never work on real code. If you want to port an application to a parallel machine, you have to restructure it extensively. This is a fundamental limitation that will never be overcome.


ECMWF Slide 29

Terminology

Cache, Cache line

NUMA

false sharing

Data decomposition

Halo, halo exchange

FLOP

Load imbalance

Synchronization


ECMWF Slide 30

THANK YOU


ECMWF Slide 31

Cache

P = Processor, C = Cache, M = Memory

[Diagram: left - a single processor (P) with one cache (C) between it and memory (M); right - two processors (P), each with its own level-1 cache (C1), sharing a level-2 cache (C2) in front of memory (M)]


ECMWF Slide 32

IBM node = 8 CPUs + 3 levels of cache ($)

[Diagram: 8 processors (P) arranged in 4 pairs; each processor has its own level-1 cache (C1), each pair shares a level-2 cache (C2), and all pairs share a level-3 cache (C3) in front of memory]


ECMWF Slide 33

Cache is …

Small and fast memory

Cache line typically 128 bytes

Cache line has state (copy, exclusive owner)

Coherency protocol

Mapping, sets, ways

Replacement strategy

Write-through or not

Important for performance:
- single stride access is always the best!!!
- try to avoid writes to the same cache line from different CPUs

But don't lose sleep over this
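As a small illustration of the stride advice above (the array name and bounds are illustrative): Fortran stores arrays column-major, so the inner loop should run over the first index.

   ! Stride sketch: the first loop nest walks the array in memory order
   ! (unit stride); the second jumps by n elements every iteration and
   ! touches a new cache line almost every time.
   program stride_example
     implicit none
     integer, parameter :: n = 1000
     real :: a(n,n)
     integer :: i, j

     do j = 1, n          ! cache-friendly: unit stride over i
        do i = 1, n
           a(i,j) = real(i + j)
        end do
     end do

     do i = 1, n          ! cache-unfriendly: stride of n over j
        do j = 1, n
           a(i,j) = real(i + j)
        end do
     end do

     print *, a(n,n)
   end program stride_example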


ECMWF Slide 34

IFS blocking in grid space (IBM p690 / TL159L60)

Trade-off: optimal use of cache vs subroutine call overhead
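A hedged sketch of the blocking idea, borrowing the IFS-style names NPROMA (block length) and NGPTOT (total grid points) but with a purely illustrative loop body: small blocks fit in cache, large blocks amortise the subroutine call overhead, and the optimum lies in between.

   ! Blocking sketch: process NGPTOT grid points in blocks of NPROMA.
   program blocking_example
     implicit none
     integer, parameter :: ngptot = 35718   ! illustrative grid-point count
     integer, parameter :: nproma = 64      ! illustrative block length
     integer :: jkglo, jlen
     real :: work(nproma)

     do jkglo = 1, ngptot, nproma
        jlen = min(nproma, ngptot - jkglo + 1)
        call compute_block(jlen, work)      ! one call per block of points
     end do

   contains

     subroutine compute_block(klen, pwork)
       integer, intent(in)    :: klen
       real,    intent(inout) :: pwork(:)
       pwork(1:klen) = 0.0                  ! placeholder computation
     end subroutine compute_block

   end program blocking_example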