An Overview of High Performance Computing


Jack Dongarra
University of Tennessee and Oak Ridge National Laboratory
November 29, 2005

Overview

♦ Look at the fastest computers, from the Top500
♦ Some of the changes that face us: hardware, software, algorithms


Technology Trends: Microprocessor Chip Capacity

"Moore's Law": 2X transistors per chip every 1.5 years. Gordon Moore (co-founder of Intel) observed in Electronics Magazine, 1965, that the number of devices per chip doubles roughly every 18 months.

Microprocessors have become smaller, denser, and more powerful. And it is not just processors: bandwidth, storage, etc. follow the same trend, with roughly 2X the memory and processor speed and ½ the size, cost, and power every 18 months.
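(A quick sanity check on what that rate implies, editorial arithmetic rather than part of the slide: doubling every 18 months compounds to a factor of 2^(10/1.5), roughly 100x per decade.)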

H. Meuer, H. Simon, E. Strohmaier, & JD

- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP (Ax=b, dense problem); a small sketch of the rate calculation follows below
- Updated twice a year: at SC'xy in the States in November, and at the meeting in Germany in June
- All data available from www.top500.org

(Diagram: TPP performance; rate as a function of problem size.)
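To make the yardstick concrete, here is a minimal sketch (not HPL itself) of how a LINPACK rate is computed: the benchmark credits the solver with 2/3·n^3 + 2·n^2 floating-point operations for a dense n-by-n system, and Rmax is simply that count divided by the wall-clock time of the run. The n = 100,000 and 1,000-second figures below are made-up numbers for illustration.

    #include <stdio.h>

    /* Nominal LINPACK operation count for solving a dense n-by-n system
       Ax = b: 2/3*n^3 + 2*n^2 flops.  The reported rate is flops / time. */
    double linpack_rate_gflops(double n, double seconds)
    {
        double flops = (2.0 / 3.0) * n * n * n + 2.0 * n * n;
        return flops / seconds / 1.0e9;   /* Gflop/s */
    }

    int main(void)
    {
        /* Hypothetical run: n = 100,000 solved in 1,000 seconds. */
        printf("%.1f Gflop/s\n", linpack_rate_gflops(1.0e5, 1000.0));
        return 0;
    }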

Performance Development

(Chart: Top500 performance, 1993-2005, on a log scale from 100 Mflop/s to 1 Pflop/s. The SUM of all 500 systems grew from 1.167 TF/s to 2.3 PF/s, the #1 system (N=1) from 59.7 GF/s to 280.6 TF/s, and the #500 entry (N=500) from 0.4 GF/s to 1.646 TF/s. Milestone #1 systems are labeled: the Fujitsu 'NWT' at NAL, Intel ASCI Red at Sandia, IBM ASCI White at LLNL, the NEC Earth Simulator, and IBM BlueGene/L; "My Laptop" is marked for comparison.)

Architecture/Systems Continuum

♦ Custom processor with custom interconnect (tightly coupled)
  Cray X1, NEC SX-8, IBM Regatta, IBM Blue Gene/L
  Best processor performance for codes that are not "cache friendly"; good communication performance; simpler programming model; most expensive

♦ Commodity processor with custom interconnect
  SGI Altix (Intel Itanium 2); Cray XT3, XD1 (AMD Opteron)
  Good communication performance; good scalability

♦ Commodity processor with commodity interconnect (loosely coupled)
  Clusters of Pentium, Itanium, Opteron, or Alpha with GigE, Infiniband, Myrinet, or Quadrics; NEC TX7; IBM eServer; Dawning
  Best price/performance (for codes that work well with caches and are latency tolerant); more complex programming model

(Chart: share of Top500 systems in the Custom, Commodity, and Hybrid categories, 0-100%, June 1993 to June 2004.)

Commodity Processors

♦ Intel Pentium Nocona: 3.6 GHz, peak = 7.2 Gflop/s; Linpack 100 = 1.8 Gflop/s; Linpack 1000 = 4.2 Gflop/s
♦ Intel Itanium 2: 1.6 GHz, peak = 6.4 Gflop/s; Linpack 100 = 1.7 Gflop/s; Linpack 1000 = 5.7 Gflop/s
♦ AMD Opteron: 2.6 GHz, peak = 5.2 Gflop/s; Linpack 100 = 1.6 Gflop/s; Linpack 1000 = 3.9 Gflop/s
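(To connect these numbers, an editorial note rather than part of the slide: peak = clock rate x double-precision flops per cycle, so the quoted peaks are consistent with 3.6 GHz x 2 = 7.2 Gflop/s for the Nocona, 1.6 GHz x 4 = 6.4 Gflop/s for the Itanium 2, and 2.6 GHz x 2 = 5.2 Gflop/s for the Opteron. The Linpack 100 and Linpack 1000 figures then show how much of that peak is actually reachable: roughly 25-30% of peak on the small problem and roughly 58-89% on the large one.)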

Architectures / Systems

(Chart: number of Top500 systems by architecture class, 1993-2005: SIMD, Single Processor, Cluster, Constellations, SMP, MPP. Clusters account for 360 systems on the November 2005 list.)

Cluster: commodity processors & commodity interconnect.
Constellation: the number of processors per node is greater than the number of nodes in the system.

Interconnects / Systems

(Chart: number of Top500 systems by interconnect, 1993-2005: Others, Cray Interconnect, SP Switch, Crossbar, Quadrics, Infiniband, Myrinet, Gigabit Ethernet, N/A. On the November 2005 list, Gigabit Ethernet (249 systems) and Myrinet (101 systems) are the most common.)

Processor Types

(Chart: number of Top500 systems by processor family, 1993-2005: SIMD, Sparc, Vector, MIPS, Alpha, HP, AMD, IBM Power, Intel.)

Processors Used in Each of the 500 Systems

Intel IA-32: 41%
Intel EM64T: 16%
IBM Power: 15%
AMD x86_64: 11%
Intel IA-64: 9%
HP PA-RISC: 3%
Cray: 2%
HP Alpha: 1%
NEC: 1%
Sun Sparc: 1%
Hitachi SR8000: 0%

91% of the list comes from three vendors: 66% Intel, 15% IBM, 11% AMD.

26th List: The TOP10

 #  Manufacturer  Computer                             Rmax [TF/s]  Installation Site                Country      Year  #Proc
 1  IBM           BlueGene/L, eServer Blue Gene        280.6        DOE/NNSA/LLNL                    USA          2005  131,072
 2  IBM           BGW, eServer Blue Gene                91.29       IBM Thomas Watson                USA          2005   40,960
 3  IBM           ASC Purple, Power5 p575               63.39       DOE/NNSA/LLNL                    USA          2005   10,240
 4  SGI           Columbia, Altix, Itanium/Infiniband   51.87       NASA Ames                        USA          2004   10,160
 5  Dell          Thunderbird, Pentium/Infiniband       38.27       Sandia                           USA          2005    8,000
 6  Cray          Red Storm, Cray XT3 AMD               36.19       Sandia                           USA          2005   10,880
 7  NEC           Earth-Simulator, SX-5                 35.86       Earth Simulator Center           Japan        2002    5,120
 8  IBM           MareNostrum, PPC 970/Myrinet          27.91       Barcelona Supercomputer Center   Spain        2005    4,800
 9  IBM           eServer Blue Gene                     27.45       ASTRON / University Groningen    Netherlands  2005   12,288
10  Cray          Jaguar, Cray XT3 AMD                  20.53       Oak Ridge National Lab           USA          2005    5,200

Customer Segments / Performance

(Chart: share of Top500 performance by customer segment, 1993-2005: Research, Industry, Academic, Vendor, Classified, Government.)

Countries / Performance

(Chart: share of Top500 performance by country, 1993-2005: US, Japan, UK, Germany, China, France, Others. On the November 2005 list the US accounts for 68%; the other labeled shares are 6.0%, 5.4%, 3.0%, 2.6%, and 1.0%.)

Asian Countries / Systems

(Chart: number of Top500 systems in Asian countries, 1993-2005: Japan, China, Korea, India, Others. The 2005 labels read Japan 21, with the remaining counts 17, 7, 4, and 17.)

Concurrency Levels of the Top500

(Chart: number of Top500 systems by processor count, 1993-2005, binned: 1, 2, 3-4, 5-8, 9-16, 17-32, 33-64, 65-128, 129-256, 257-512, 513-1024, 1025-2048, 2049-4096, 4k-8k, 8k-16k, 16k-32k, 32k-64k, 64k-128k.)

Concurrency Levels of the Top500

(Chart: minimum, average, and maximum number of processors per system, June 1993 to June 2005, log scale from 1 to 1,000,000. The 2005 labels read 50, 1,462, and 131K processors.)

IBM BlueGene/L: 131,072 Processors (#1 at 64K and #2 at 40K)

Packaging hierarchy (peak performance, memory):
- Chip (2 processors): 2.8/5.6 GF/s, 4 MB cache
- Compute Card (2 chips, 2x1x1): 4 processors, 5.6/11.2 GF/s, 1 GB DDR
- Node Board (32 chips, 4x4x2): 16 compute cards, 64 processors, 90/180 GF/s, 16 GB DDR
- Rack (32 node boards, 8x8x16): 2,048 processors, 2.9/5.7 TF/s, 0.5 TB DDR
- System (64 racks, 64x32x32): 131,072 processors, 180/360 TF/s, 32 TB DDR

"Fastest Computer": BG/L at 700 MHz with 131K processors in 64 racks. Peak: 367 Tflop/s; Linpack: 281 Tflop/s, 77% of peak.

The BlueGene/L compute ASIC includes all networking and processor functionality. Each compute ASIC has two 32-bit superscalar PowerPC 440 embedded cores (note that L1 cache coherence is not maintained between these cores). The Linpack run used n = 1.8M and took about 13K seconds, roughly 3.6 hours. The full system draws 1.6 MWatts (about 1,600 homes) and provides about 43,000 ops/s per person on the planet.
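(Two quick checks on those numbers, editorial arithmetic rather than part of the slide: 281/367 is about 0.77, the quoted 77% of peak; and the nominal LINPACK operation count (2/3)·n^3 at n = 1.8 million is about 3.9 x 10^18 flops, which at 281 Tflop/s takes roughly 1.4 x 10^4 seconds, consistent with the quoted 13K-second, 3.6-hour run. The 43,000 ops/s-per-person figure is simply 280.6 Tflop/s divided by a world population of about 6.5 billion.)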

Performance Projection

(Chart: extrapolation of the Top500 SUM, N=1, and N=500 trend lines from 1993 out to 2015, on a log scale from 100 Mflop/s to 1 Eflop/s, with a DARPA HPCS marker in the Pflop/s range and "My Laptop" shown for comparison.)

A PetaFlop Computer by the End of the Decade

♦ 10 companies are working on building a Petaflop system by the end of the decade:
  Cray, IBM, Sun
  Dawning, Galactic, Lenovo (Chinese companies)
  Hitachi, NEC, Fujitsu (the Japanese "Life Simulator", 10 Pflop/s)
  Bull

Flops per Gross Domestic Product
Based on the November 2005 Top500

(Bar chart: Top500 flops per unit of GDP by country, 0 to 200. Countries shown: Israel, USA, New Zealand, Switzerland, Russia, UK, Netherlands, Australia, South Korea, Saudi Arabia, China, Japan, Ireland, Spain, Germany, Taiwan, Canada, Brazil.)

KFlop/s per Capita (Flops/Pop)
Based on the November 2005 Top500

(Bar chart: Top500 KFlop/s per capita by country, 0 to 6,000. Countries shown: United States, Switzerland, Israel, New Zealand, United Kingdom, Netherlands, Australia, Japan, Germany, Spain, Canada, South Korea, Sweden, Saudi Arabia, France, Taiwan, Italy, Brazil, Mexico, China, Russia, India.)

WETA Digital (Lord of the Rings). Hint: Peter Jackson had something to do with New Zealand's showing; it has nothing to do with the 47.2 million sheep in NZ.

Fuel Efficiency: GFlops/Watt

Top 20 systems, based on processor power rating only (3, >100, >800).

(Bar chart: GFlops/Watt, 0 to 0.9, for the top 20 systems: BlueGene/L, Blue Gene, ASC Purple (p5 1.9 GHz), Columbia (SGI Altix 1.5 GHz), Thunderbird (Pentium 3.6 GHz), Red Storm (Cray XT3 2.0 GHz), Earth-Simulator, MareNostrum (PPC 970 2.2 GHz), Blue Gene, Jaguar (Cray XT3 2.4 GHz), Thunder (Intel Itanium2 1.4 GHz), Blue Gene, Blue Gene, Cray XT3 2.6 GHz, Apple XServe 2.0 GHz, Cray X1E (4 GB), Cray X1E (2 GB), ASCI Q (Alpha 1.25 GHz), IBM p5 575 1.9 GHz, System X (Apple XServe 2.3 GHz), SGI Altix 3700 1.6 GHz.)

(Chart: power density in W/cm², on a log scale from 1 to 10,000, for Intel processors from the 4004, 8008, 8080, 8085, 8086, 286, 386, and 486 through the Pentium, plotted over the decades '70 to '10 against reference lines for a hot plate, a nuclear reactor, a rocket nozzle, and the Sun's surface. Source: Intel Developer Forum, Spring 2004, Pat Gelsinger; the Pentium is at about 90 W.)

There is a cube relationship between the cycle time and power.
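(The "cube relationship" follows from the usual first-order CMOS model, an editorial gloss rather than part of the slide: dynamic power is roughly P = C·V²·f, and since the clock rate f a circuit can sustain scales roughly linearly with the supply voltage V, holding the design fixed gives P proportional to f³. Halving the clock and doubling the core count therefore delivers similar throughput at roughly a quarter of the power, which is the argument behind the multi-core slides that follow.)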

Lower Voltage, Increase Clock Rate & Transistor Density

We have seen an increasing number of gates on a chip and increasing clock speed. Heat is becoming an unmanageable problem; Intel processors now exceed 100 Watts. We will not see dramatic increases in clock speed in the future; however, the number of gates on a chip will continue to increase. Intel's Yonah will double the processing power on a per-watt basis.

(Chart: operations per second over time, contrasting two futures. The "free lunch for traditional software" path, a 3 GHz single core followed by hypothetical 6, 12, and 24 GHz single cores, would have let serial code simply run twice as fast every 18 months with no change to the code. The actual path, 3 GHz parts with 2, 4, and then 8 cores, offers additional operations per second only if the code can take advantage of concurrency. No free lunch for traditional software: without highly concurrent software it won't get any faster. A small illustration follows below.)
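As a minimal illustration of what "highly concurrent software" means in practice (an editorial example, not from the talk; it assumes an OpenMP-capable compiler, e.g. building with -fopenmp), the same loop written serially is pinned to one core, while the annotated version can use every core the chip provides:

    #include <stdio.h>
    #include <stdlib.h>

    #define N 10000000L

    int main(void)
    {
        double *x = malloc(N * sizeof *x);
        for (long i = 0; i < N; i++) x[i] = 1.0 / (double)(i + 1);

        double s = 0.0;
        /* Without the pragma this loop runs on one core and its speed is set
           by that core's clock and instructions per cycle.  With the pragma
           (and -fopenmp) the iterations are divided among all cores, which is
           where future speedups have to come from. */
        #pragma omp parallel for reduction(+:s)
        for (long i = 0; i < N; i++) s += x[i] * x[i];

        printf("sum of squares = %.6f\n", s);
        free(x);
        return 0;
    }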

CPU Desktop Trends 2004-2010

(Chart: projected cores per processor chip and hardware threads per chip, 2004-2010, 0 to 300.)

♦ Relative processing power will continue to double every 18 months
♦ 256 logical processors per chip in late 2010

Commodity Processor Trends
Bandwidth/Latency is the Critical Issue, not FLOPS

                                   Annual increase   Typical value in 2005           Typical value in 2010            Typical value in 2020
Single-chip floating-point perf.   59%               4 GFLOP/s                       32 GFLOP/s                       3300 GFLOP/s
Front-side bus bandwidth           23%               1 GWord/s = 0.25 word/flop      3.5 GWord/s = 0.11 word/flop     27 GWord/s = 0.008 word/flop
DRAM latency                       (5.5%)            70 ns = 280 FP ops = 70 loads   50 ns = 1600 FP ops = 170 loads  28 ns = 94,000 FP ops = 780 loads

Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 222 pages, 2004, National Academies Press, Washington DC, ISBN 0-309-09502-6.

Got Bandwidth?
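(To read the table, editorial arithmetic: in 2005 a 4 GFLOP/s chip fed by 1 GWord/s of bus bandwidth gets 0.25 words per flop, so a kernel that needs even one operand from memory per flop already runs at a quarter of peak; by the 2020 projections the ratio drops to 0.008 word/flop, about 30 times worse. Latency tells the same story: at 70 ns and 4 GFLOP/s one DRAM access costs about 280 floating-point operations' worth of time, and the projected 28 ns at 3300 GFLOP/s costs roughly 94,000.)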

Fault Tolerance: Motivation

♦ Trends in HPC: high-end systems with thousands of processors
♦ Increased probability of a node failure, even though most systems nowadays are robust
♦ MPI is widely accepted in scientific computing, but process faults are not tolerated in the MPI model

There is a mismatch between the hardware and the (non fault-tolerant) programming paradigm of MPI.
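To make the mismatch concrete, a minimal MPI-2-style sketch (an editorial example, not from the talk): the default error handler on MPI_COMM_WORLD is MPI_ERRORS_ARE_FATAL, so a failed process normally takes the whole job down; we can ask for error codes instead, but the standard of this era gives no portable way to repair the communicator and continue.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Ask for error codes instead of an immediate abort.  Even so, the
           MPI standard says nothing about how to recover a communicator after
           a process fault -- the root of the mismatch noted above. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double x = (double)rank, sum = 0.0;
        int rc = MPI_Allreduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        if (rc != MPI_SUCCESS)
            fprintf(stderr, "rank %d: collective failed; no standard recovery path\n", rank);

        MPI_Finalize();
        return 0;
    }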

Reliability of Leading-Edge HPC Systems

System               CPUs    Reliability
LLNL ASCI White      8,192   MTBF: 5.0 hours ('01) and 40 hours ('03). Leading outage sources: storage, CPU, 3rd-party HW.
LANL ASCI Q          8,192   MTBI: 6.5 hours. Leading outage sources: storage, CPU, memory.
Pittsburgh Lemieux   3,016   MTBI: 9.7 hours.

MTBI: mean time between interrupts = wall clock hours / # downtime periods. MTBF: mean time between failures (measured).

♦ 100K processor systems are here; we have fundamental challenges in dealing with machines of this size … and little in the way of programming support.

Future Challenge: Developing the Ecosystem for HPC

From the NRC Report on "The Future of Supercomputing":
♦ Hardware, software, algorithms, tools, networks, institutions, applications, and the people who solve supercomputing applications can be thought of collectively as an ecosystem.
♦ Research investment in HPC should be informed by the ecosystem point of view: progress must come on a broad front of interrelated technologies, rather than in the form of individual breakthroughs.

A supercomputer ecosystem is a continuum of computing platforms, system software, algorithms, tools, networks, and the people who know how to exploit them to solve computational science applications.

Real Crisis With HPC Is With The Software

♦ Our ability to configure a hardware system capable of 1 PetaFlop (10^15 ops/s) is without question just a matter of time and $$.
♦ A supercomputer's applications and software are usually much longer-lived than its hardware.
  Hardware life is typically five years at most; applications last 20-30 years. Fortran and C are still the main programming models.
♦ The REAL CHALLENGE is software.
  Programming hasn't changed since the 70's, and it takes a HUGE manpower investment.
  MPI … is that all there is? It often requires HERO programming.
  Investment in the entire software stack is required (OS, libraries, etc.).
♦ Software is a major cost component of modern technologies.
  The tradition in HPC system procurement is to assume that the software is free … but SOFTWARE COSTS (over and over).

Summary of Current Unmet Needs

♦ Performance / Portability
♦ Fault tolerance
♦ Memory bandwidth / Latency
♦ Adaptability: some degree of autonomy to self-optimize, test, or monitor; able to change mode of operation (static or dynamic)
♦ Better programming models: global shared address space, visible locality
♦ Maybe coming soon (incremental, yet offering real benefits): Global Address Space (GAS) languages such as UPC, Co-Array Fortran, Titanium, and Chapel.
  "Minor" extensions to existing languages; more convenient than MPI; performance transparency via explicit remote memory references

Collaborators / Support

♦ Top500 Team: Erich Strohmaier, NERSC; Hans Meuer, Mannheim; Horst Simon, NERSC

http://www.top500.org/


Next Steps

♦ Software to determine the checkpointing interval and the number of checkpoint processors from the machine characteristics.
  Perhaps use historical information. Monitoring; migration of a task if there is a potential problem.
♦ Local checkpoint and restart algorithm.
  Coordination of local checkpoints; processors hold backups of neighbors.
♦ Have the checkpoint processes participate in the computation and do data rearrangement when a failure occurs.
  Use p processors for the computation and have k of them hold the checkpoint (a sketch of the simplest parity-based case follows below).
♦ Generalize the ideas to provide a library of routines to do diskless checkpointing.
♦ Look at "real applications" and investigate "lossy" algorithms.
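A minimal sketch of the parity idea behind diskless checkpointing (an editorial illustration, not the library interface the slide has in mind): the last rank keeps a bitwise-XOR checkpoint of every compute rank's state, so the state of any single failed process can be rebuilt from the parity and the survivors, with no disk involved. On the interval question, one classical rule of thumb (again editorial, usually attributed to Young) is that the optimal time between checkpoints is about sqrt(2 x checkpoint cost x MTBF).

    #include <mpi.h>

    #define WORDS 1024   /* size of each process's local state, in 32-bit words */

    /* Ranks 0..p-2 are compute processes; the last rank holds an XOR parity
       checkpoint of all of their states.  To recover a single failed compute
       rank, XOR the parity with the states of the surviving compute ranks. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        unsigned int state[WORDS], parity[WORDS];
        for (int i = 0; i < WORDS; i++)   /* the checkpoint rank contributes zeros */
            state[i] = (rank == size - 1) ? 0u : (unsigned int)(rank * 1000 + i);

        /* Take a checkpoint: the bitwise XOR of every compute state lands on
           the checkpoint process. */
        MPI_Reduce(state, parity, WORDS, MPI_UNSIGNED, MPI_BXOR,
                   size - 1, MPI_COMM_WORLD);

        /* Recovery (not shown): a replacement for a failed rank XORs the
           parity with the other surviving compute ranks' states. */

        MPI_Finalize();
        return 0;
    }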

Operating Systems / Systems

(Chart: number of Top500 systems by operating system, 1998-2005: Others, FreeBSD, N/A, BSD Based, Linux, Unix.)

Clusters / Systems

(Chart: number of Top500 cluster systems, 1997-2005, 0 to 350, by family: Others, Intel Itanium, Sun Fire, Alpha Cluster, Sun Cluster, Dell Cluster, AMD Cluster, HP AlphaServer, Intel Pentium, HP Cluster, IBM Cluster.)


Japanese: Tightly-Coupled Heterogeneous System

♦ Would like to get to 10 PetaFlop/s by 2011
♦ Scalable, fits any computer center (size, cost, ratio of components)
♦ Easy and low-cost to develop a new component
♦ Scale merit of components

(Diagram: the present system has vector, scalar, and MD nodes, each with its own faster interconnect, joined through a switch by a slower connection; the future system adds an FPGA node and ties the vector, scalar, MD, and FPGA nodes together with a faster interconnect.)

How Big Is Big?

♦ Every 10X brings new challenges
  64 processors was once considered large; it hasn't been "large" for quite a while.
  1024 processors is today's "medium" size, and 8096 processors is today's "large"; we're struggling even here.
♦ 100K processor systems are in construction; we have fundamental challenges in dealing with machines of this size … and little in the way of programming support.

Interconnects / Systems

(Chart: number of Top500 systems by interconnect, 1993-2005: Cray Interconnect, SP Switch, Crossbar, Others, Infiniband, Quadrics, Gigabit Ethernet, Myrinet, N/A. Labels of 28% and 24% mark the two leaders: Myrinet and GigE together hold more than 50% of the market.)

Real Crisis With HPC Is With The Software

♦ Programming is stuck; arguably it hasn't changed since the 60's.
♦ It's time for a change: complexity is rising dramatically
  highly parallel and distributed systems, from 10 to 100 to 1,000 to 10,000 to 100,000 processors
  multidisciplinary applications
♦ A supercomputer's applications and software are usually much longer-lived than its hardware.
  Hardware life is typically five years at most. Fortran and C are the main programming models.
♦ Software is a major cost component of modern technologies.
  The tradition in HPC system procurement is to assume that the software is free.
♦ We have too few ideas about how to solve this problem.

Processor Types

(Chart: number of Top500 systems by processor type, 1993-2004: SIMD, Sparc, MIPS, Intel, HP PA-RISC, HP Alpha, IBM Power, Other scalar, Vector.)

Today's Processors

♦ pipelining (superscalar, OOO, VLIW, branch prediction, predication)
♦ simultaneous multithreading (SMT, Hyper-Threading, multi-core)
♦ SIMD vector instructions (VIS, MMX/SSE, AltiVec); a tiny example follows below
♦ caches and the memory hierarchy
♦ Intel added 36 instructions per year to IA-32, or 3 instructions per month!
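As a concrete glimpse of the SIMD vector instructions mentioned above (an editorial example; in practice compilers and tuned libraries generate this for you), one SSE instruction operates on four single-precision values at once:

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics (the MMX/SSE family above) */

    int main(void)
    {
        float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
        float c[8];

        /* Each _mm_add_ps is a single SSE instruction adding four floats. */
        for (int i = 0; i < 8; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
        }

        for (int i = 0; i < 8; i++)
            printf("%.0f ", c[i]);   /* prints eight 9s */
        printf("\n");
        return 0;
    }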

KFlop/s per Capita (Flops/Pop)
Based on the November 2004 Top500 only

(Bar chart: Top500 KFlop/s per capita by country, 0 to 1,600. Countries shown: China, Brazil, Italy, Mexico, France, South Korea, Saudi Arabia, Germany, Belarus, Switzerland, Canada, Spain, Japan, Israel, United Kingdom, New Zealand, United States.)

WETA Digital (Lord of the Rings). Hint: Peter Jackson had something to do with New Zealand's showing; it has nothing to do with the 47.2 million sheep in NZ.

Today's CPU Architecture
Moore's Law for Power Consumption

Heat is becoming an unmanageable problem.

Self Adapting Numerical Software

♦ The process of arriving at an efficient solution involves many decisions by an expert:
  algorithm decisions, data decisions, management of the computing environment, and processor-specific tuning.

There is a complex set of interactions between the user's application, the algorithm, the programming language, the compiler, the machine instructions, and the hardware: many layers of translation separate the application from the hardware, and they change with each generation of hardware.

See the Proceedings of the IEEE, Vol. 93, No. 2, Feb. 2005, special issue on Program Generation, Optimization, and Platform Adaptation.
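A toy version of the self-adapting idea (an editorial sketch in the spirit of ATLAS-style empirical tuning, not the interface of any particular package): time a few candidate block sizes for a blocked matrix multiply on the machine at hand and keep the fastest, automating one of the processor-specific tuning decisions listed above.

    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    #define N 512
    static double A[N][N], B[N][N], C[N][N];

    /* Blocked matrix multiply; the best block size depends on the cache
       hierarchy of the particular processor. */
    static void matmul_blocked(int nb)
    {
        memset(C, 0, sizeof C);
        for (int ii = 0; ii < N; ii += nb)
            for (int kk = 0; kk < N; kk += nb)
                for (int jj = 0; jj < N; jj += nb)
                    for (int i = ii; i < ii + nb; i++)
                        for (int k = kk; k < kk + nb; k++)
                            for (int j = jj; j < jj + nb; j++)
                                C[i][j] += A[i][k] * B[k][j];
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                A[i][j] = (double)(i + j);
                B[i][j] = (double)(i - j);
            }

        int candidates[] = {16, 32, 64, 128};   /* all divide N evenly */
        int best = candidates[0];
        double best_time = 1e30;

        for (int c = 0; c < 4; c++) {
            clock_t t0 = clock();
            matmul_blocked(candidates[c]);
            double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("block size %3d: %.3f s\n", candidates[c], t);
            if (t < best_time) { best_time = t; best = candidates[c]; }
        }
        printf("selected block size: %d\n", best);
        return 0;
    }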