
ECMWF Slide 1

Introduction to Parallel Computing

George Mozdzynski

March 2004


ECMWF Slide 2

Outline

What is parallel computing?

Why do we need it?

Types of computer

Parallel Computing today

Parallel Programming Languages

OpenMP and Message Passing

Terminology


ECMWF Slide 3

What is Parallel Computing?

The simultaneous use of more than one processor or computer to solve a problem


ECMWF Slide 4

Why do we need Parallel Computing?

Serial computing is too slow

Need for large amounts of memory not

accessible by a single processor


ECMWF Slide 5

An operational IFS TL511L60 forecast model takes about one hour of wall time for a 10-day forecast using 288 CPUs of our IBM Cluster 1600 1.3 GHz system (1920 CPUs in total).

How long would this model take using a fast PC with sufficient memory, e.g. a 3.2 GHz Pentium 4?


ECMWF Slide 6

Answer: about 8 days.

This PC would need about 25 Gbytes of memory.

8 days is too long for a 10-day forecast!

2-3 hours is too long …


ECMWF Slide 7

IFS Forecast Model (TL511L60)

  CPUs   Wall time (secs)
    64        11355
   128         5932
   192         4230
   256         3375
   320         2806
   384         2338
   448         2054
   512         1842

Amdahl's Law: Wall Time = S + P/NCPUS

Fitting this to the measurements above (using Excel's LINEST function) gives:
  Serial   S = 574 secs
  Parallel P = 690930 secs

(Named after Gene Amdahl) If F is the fraction of a calculation that is sequential, and (1-F) is the fraction that can be parallelised, then the maximum speedup that can be achieved by using N processors is 1/(F+(1-F)/N).
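As a quick check of the fit, here is a minimal Fortran sketch (assuming only the fitted values S = 574 secs and P = 690930 secs quoted above) that evaluates Wall Time = S + P/NCPUS for the CPU counts in the table:

   ! Sketch: evaluate the Amdahl fit  Wall Time = S + P/NCPUS
   ! using the fitted constants quoted on the slide.
   program amdahl_fit
     implicit none
     real :: s = 574.0      ! fitted serial seconds
     real :: p = 690930.0   ! fitted parallelisable seconds
     integer :: ncpus

     do ncpus = 64, 512, 64
        print '(A,I4,A,F8.0,A)', 'CPUs=', ncpus, &
             '  predicted wall time=', s + p/real(ncpus), ' secs'
     end do
   end program amdahl_fit

At 512 CPUs the fit predicts about 1923 secs, close to the measured 1842 secs.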


ECMWF Slide 8

IFS Forecast Model (TL511L60)

  CPUs   Wall time (secs)   SpeedUp   Efficiency (%)
     1        691504            1        100.0
    64         11355           61         95.2
   128          5932          117         91.1
   192          4230          163         85.1
   256          3375          205         80.0
   320          2806          246         77.0
   384          2338          296         77.0
   448          2054          337         75.1
   512          1842          375         73.3
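For reference, the SpeedUp and Efficiency columns follow directly from the single-CPU wall time: SpeedUp(N) = T(1)/T(N), and Efficiency(N) = SpeedUp(N)/N expressed as a percentage. For example, at 256 CPUs: 691504/3375 ≈ 205, and 205/256 ≈ 80%.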


ECMWF Slide 9

IFS Forecast Model, TL511L60

[Chart: SpeedUp vs number of IBM Power 4 CPUs (64 to 512), showing Observed, Estimated and Ideal curves]


ECMWF Slide 10

Extrapolating Performance

[Chart: SpeedUp vs number of IBM Power 4 CPUs extrapolated to 2048, showing Estimated vs Ideal curves]

The IFS model would be inefficient on large numbers of CPUs, but is OK up to 512.
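For example, extrapolating the fitted Amdahl curve to 2048 CPUs gives roughly 574 + 690930/2048 ≈ 911 secs, a speed-up of only about 760 out of an ideal 2048 (efficiency ≈ 37%).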


ECMWF Slide 11

Types of Parallel Computer

P = Processor, M = Memory, S = Switch

[Diagram: Shared Memory - several processors (P ... P) attached to a single memory (M); Distributed Memory - processor/memory pairs (P + M) connected through a switch (S)]


ECMWF Slide 12

IBM Cluster 1600 (at ECMWF)

P = Processor, M = Memory, S = Switch

[Diagram: several nodes, each a shared-memory group of processors (P ... P) with its own memory (M), with the nodes connected to each other through a switch (S)]


ECMWF Slide 13

IBM Cluster 1600s at ECMWF (hpca + hpcb)


ECMWF Slide 14

ECMWF supercomputers

1979  CRAY-1A                             Vector
      CRAY XMP-2, XMP-4, YMP-8, C90-16    Vector + Shared Memory Parallel
      Fujitsu VPP700, VPP5000             Vector + MPI Parallel
2002  IBM p690                            Scalar + MPI + Shared Memory Parallel


ECMWF Slide 15

ECMWF’s first Supercomputer

CRAY-1A

1979


ECMWF Slide 16

Where have 25 years gone?


ECMWF Slide 17

Types of Processor

DO J=1,1000
  A(J)=B(J) + C
ENDDO

SCALAR PROCESSOR (a single instruction processes one element):
  LOAD B(J)
  FADD C
  STORE A(J)
  INCR J
  TEST

VECTOR PROCESSOR (a single instruction processes many elements):
  LOADV B -> V1
  FADDV V1,C -> V2
  STOREV V2 -> A


ECMWF Slide 18

Parallel Computing Today

Vector Systems
- NEC SX6
- CRAY X-1
- Fujitsu VPP5000

Scalar Systems
- IBM Cluster 1600
- Fujitsu PRIMEPOWER HPC2500
- HP Integrity rx2600 Itanium2

Cluster Systems (typically installed by an integrator)
- Virginia Tech, Apple G5 / Infiniband
- NCSA, Dell PowerEdge 1750, P4 Xeon / Myrinet
- LLNL, MCR Linux Cluster Xeon / Quadrics
- LANL, Linux Networx AMD Opteron / Myrinet


ECMWF Slide 19

The TOP500 project

Started in 1993

The top 500 sites are reported

Report produced twice a year:
- in EUROPE in JUNE
- in the USA in NOVEMBER

Performance based on the LINPACK benchmark

http://www.top500.org/


ECMWF Slide 20

Top 500 Supercomputers


ECMWF Slide 21

Where is ECMWF in the Top 500?

[Chart: ECMWF systems in the Top 500 list, showing Rmax and Rpeak]

Rmax  - Gflop/sec achieved on the Linpack benchmark
Rpeak - peak hardware Gflop/sec (that will never be reached!)


ECMWF Slide 22

What performance do Meteorological Applications achieve?

Vector computers
- about 30 to 50 percent of peak performance
- relatively more expensive
- also have front-end scalar nodes

Scalar computers
- about 5 to 10 percent of peak performance
- relatively less expensive

Both vector and scalar computers are being used in Met Centres around the world.

Is it harder to parallelize than vectorize?
- Vectorization is mainly a compiler responsibility
- Parallelization is mainly the user's responsibility


ECMWF Slide 23

Overview of Recent Supercomputers

Aad J. van der Steen and Jack J. Dongarra

http://www.top500.org/ORSC/2003/


ECMWF Slide 24


ECMWF Slide 25


ECMWF Slide 26

Parallel Programming Languages?

• High Performance Fortran (HPF)
  - directive based extension to Fortran
  - works on both shared and distributed memory systems
  - not widely used (more popular in Japan?)
  - not suited to applications using irregular grids
  - http://www.crpc.rice.edu/HPFF/home.html

• OpenMP
  - directive based
  - support for Fortran 90/95 and C/C++
  - shared memory programming only
  - http://www.openmp.org
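To illustrate the directive-based style, here is a minimal OpenMP sketch in Fortran (the array names and loop bounds are illustrative, echoing the loop from the earlier processor slide, and are not taken from any real application):

   ! Minimal OpenMP sketch: the !$OMP directives split the loop
   ! iterations across the threads of one shared-memory node.
   ! A compiler without OpenMP support treats them as comments.
   program openmp_example
     implicit none
     integer, parameter :: n = 1000
     real :: a(n), b(n), c
     integer :: j

     b = 1.0
     c = 2.0

   !$OMP PARALLEL DO PRIVATE(J)
     do j = 1, n
        a(j) = b(j) + c
     end do
   !$OMP END PARALLEL DO

     print *, 'a(1) =', a(1)
   end program openmp_example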


ECMWF Slide 27

Most Parallel Programmers use…

Fortran 90/95, C/C++ with MPI for communicating between tasks (processes)

- works for applications running on shared and distributed memory systems

Fortran 90/95, C/C++ with OpenMP

- for applications whose performance requirements can be satisfied by a single (shared-memory) node

Hybrid combination of MPI/OpenMP

- ECMWF’s IFS uses this approach
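For comparison, a minimal MPI sketch in Fortran (not IFS code; just the initialise / query rank and size / finalise pattern that every MPI program shares):

   ! Minimal MPI sketch: each task starts MPI, finds out its rank and
   ! the total number of tasks, and prints a message. Real applications
   ! would exchange data with calls such as MPI_Send/MPI_Recv.
   program mpi_example
     use mpi                  ! on older systems: include 'mpif.h'
     implicit none
     integer :: ierr, rank, ntasks

     call MPI_Init(ierr)
     call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
     call MPI_Comm_size(MPI_COMM_WORLD, ntasks, ierr)

     print *, 'Hello from task', rank, 'of', ntasks

     call MPI_Finalize(ierr)
   end program mpi_example

In the hybrid MPI/OpenMP approach used by IFS, OpenMP directives parallelise the loops inside each MPI task.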


ECMWF Slide 28

The myth of automatic parallelization (2 common versions)

Compilers can do anything (but we may have to wait a while)

- Automatic parallelization makes it possible (or will soon make it possible) to port any application to a parallel machine and see wonderful speedups without any modifications to the source

Compilers can't do anything (now or ever)

- Automatic parallelization is useless. It'll never work on real code. If you want to port an application to a parallel machine, you have to restructure it extensively. This is a fundamental limitation that will never be overcome.


ECMWF Slide 29

Terminology

Cache, Cache line

NUMA

false sharing

Data decomposition

Halo, halo exchange

FLOP

Load imbalance

Synchronization


ECMWF Slide 30

THANK YOU


ECMWF Slide 31

Cache

P = Processor, C = Cache, M = Memory

[Diagram: left - a single processor (P) with one cache (C) between it and memory (M); right - two processors (P), each with its own level-1 cache (C1), sharing a level-2 cache (C2) in front of memory (M)]


ECMWF Slide 32

IBM node = 8 CPUs + 3 levels of cache ($)

[Diagram: 8 processors (P) arranged in 4 pairs; each processor has its own level-1 cache (C1), each pair shares a level-2 cache (C2), and all pairs share a level-3 cache (C3) in front of memory]


ECMWF Slide 33

Cache is …

Small and fast memory

Cache line typically 128 bytes

Cache line has state (copy, exclusive owner)

Coherency protocol

Mapping, sets, ways

Replacement strategy

Write-through or not

Important for performance:
- single stride access is always the best!!!
- try to avoid writes to the same cache line from different CPUs

But don't lose sleep over this
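As a small illustration of the stride advice above (the array name and bounds are illustrative): Fortran stores arrays column-major, so the inner loop should run over the first index.

   ! Stride sketch: the first loop nest walks the array in memory order
   ! (unit stride); the second jumps by n elements every iteration and
   ! touches a new cache line almost every time.
   program stride_example
     implicit none
     integer, parameter :: n = 1000
     real :: a(n,n)
     integer :: i, j

     do j = 1, n          ! cache-friendly: unit stride over i
        do i = 1, n
           a(i,j) = real(i + j)
        end do
     end do

     do i = 1, n          ! cache-unfriendly: stride of n over j
        do j = 1, n
           a(i,j) = real(i + j)
        end do
     end do

     print *, a(n,n)
   end program stride_example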


ECMWF Slide 34

IFS blocking in grid space (IBM p690 / TL159L60)

Trade-off: optimal use of cache vs subroutine call overhead
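A hedged sketch of the blocking idea, borrowing the IFS-style names NPROMA (block length) and NGPTOT (total grid points) but with a purely illustrative loop body: small blocks fit in cache, large blocks amortise the subroutine call overhead, and the optimum lies in between.

   ! Blocking sketch: process NGPTOT grid points in blocks of NPROMA.
   program blocking_example
     implicit none
     integer, parameter :: ngptot = 35718   ! illustrative grid-point count
     integer, parameter :: nproma = 64      ! illustrative block length
     integer :: jkglo, jlen
     real :: work(nproma)

     do jkglo = 1, ngptot, nproma
        jlen = min(nproma, ngptot - jkglo + 1)
        call compute_block(jlen, work)      ! one call per block of points
     end do

   contains

     subroutine compute_block(klen, pwork)
       integer, intent(in)    :: klen
       real,    intent(inout) :: pwork(:)
       pwork(1:klen) = 0.0                  ! placeholder computation
     end subroutine compute_block

   end program blocking_example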