
Page 1

École “Programmation Hybride”: une étape vers le many-cœurs (“Hybrid Programming” school: a step towards many-cores)

L’ÉSCANDILLE – AUTRANS, FRANCE

Parallel Codes and High Performance Computing: Massive Parallelism and Multi-GPU

Luigi Genovese

L_Sim – CEA Grenoble

October 10, 2012

Laboratoire de Simulation Atomistique, http://inac.cea.fr/L_Sim

Page 2

A basis for nanosciences: the BigDFT project

STREP European project: BigDFT (2005-2008)
Four partners, 15 contributors:
CEA-INAC Grenoble (T. Deutsch), U. Basel (S. Goedecker), U. Louvain-la-Neuve (X. Gonze), U. Kiel (R. Schneider)

Aim: to develop an ab initio DFT code based on Daubechies wavelets, to be integrated in ABINIT.
BigDFT 1.0 → January 2008
...not only a DFT adventure.

In this presentation
Present HPC scenario

Developers’ and users’ challenges

Outcomes and general considerations

Page 3

Ab initio calculations with DFT

Several advantages
✓ Ab initio: no adjustable parameters
✓ DFT: quantum-mechanical (fundamental) treatment

Main limitations
✗ Approximated approach
✗ Requires high computing power; limited to a few hundred atoms in most cases

Wide range of applications: nanoscience, biology, materials

Page 4

Outline

1 Parallel computing and architectures
  From past to present: software
  HPC nowadays
  Memory bottleneck

2 (DFT) Developer point of view
  Future Scenarios
  Present Situation
  Optimization

3 User viewpoint
  Frequent mistakes
  Performance evaluation
  An (old) example: the S_GPU library

4 Performances
  Recent situation: evaluating GPU gain
  Practical cases

5 Conclusion and Messages

Page 5

What is Parallel Computing?

Easy to say...
Simultaneous use of multiple compute resources to solve a computational problem

...but not so easy to implement
A problem is broken into multiple parts which can be solved concurrently
Each part is associated with a series of instructions
Instructions from each part are executed simultaneously on different Compute Processing Units

Page 6

The Compute Processing Unit(s)

A computing machine (node) is made of:
Control Unit

Arithmetic Logic Unit

Memory Unit

They may exist in different ratios on different architectures

After all, they are transistors

What does technology offer us?

Page 7

Moore’s law

40 years of improvements
Transistor counts double every two years...

. . . but how?

Power is the limiting factor (around 100 W nowadays)

Power ∝ Frequency³
⇒ Clock rate is limited
Multiple slower devices are preferable to one superfast device
⇒ More performance with less power → a software problem?

Page 8

Why a software problem?

The power cost of frequency

             Cores   Hz      Flop/s   W      Flop/s/W
Superscalar  1       1.5×    1.5×     3.3×   0.45
Multicore    2       0.75×   1.5×     0.8×   1.88
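These figures are just the cubic power law at work; a quick check of the table's arithmetic (ratios relative to the baseline single-core device, our computation, not from the slides):

$$\text{Superscalar: } 1.5^3 \approx 3.3\times \text{ W for } 1.5\times \text{ Flop/s} \;\Rightarrow\; \tfrac{1.5}{3.3} \approx 0.45; \qquad \text{Multicore: } 2 \times 0.75 = 1.5\times \text{ Flop/s},\; 2 \times 0.75^3 \approx 0.8\times \text{ W} \;\Rightarrow\; \tfrac{1.5}{0.8} \approx 1.88$$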

Exercise:
Take a given computational problem.
Write a code at a time t0. Solve the problem on a computer.
Freeze your code and wait some time t1 − t0.
Take a new computer at time t1. Solve the same problem again.
What happens to your performance?

Page 9

HPC thumb-rules have changed

Frequency-dominated era
✓ Parallelism is not improved by the architecture
✓ Frequency increases → No. of Flop/s increases
⇒ Code runs faster

Manycore era
✓ Parallelism is dramatically changed in the architecture
✓ Frequency decreases
⇒ Code runs slower
✗ The code should be changed

The parallel behaviour of a code (oversimplification)
Capacity computing: many independent jobs
Capability computing: single job, parallel intensive

Page 10

How to parallelize your data?

Distributed Memory
Private memory
Processors operate independently
Data transfer must be programmed explicitly (MPI)
Relies (also) on network performance

Shared Memory
Memory is common to all processors
Threads operate concurrently on data
Relies on bandwidth performance

Memory operations are crucial for parallelism
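To make the two models concrete, a minimal hybrid sketch (our illustration, not BigDFT code; all names are ours): each MPI process owns a private slice of the data, OpenMP threads share that slice, and the only exchange between processes is an explicit MPI call.

   program hybrid_sketch
     use mpi
     implicit none
     integer :: ierr, rank, nproc, i
     integer, parameter :: nloc = 1000000
     real(kind=8), allocatable :: x(:)
     real(kind=8) :: local_sum, global_sum

     call MPI_Init(ierr)
     call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
     call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)

     allocate(x(nloc))              ! private (distributed) data of this process
     x = real(rank+1, kind=8)

     local_sum = 0.d0
     !$omp parallel do reduction(+:local_sum)   ! threads share x within the node
     do i = 1, nloc
        local_sum = local_sum + x(i)
     end do
     !$omp end parallel do

     ! explicit transfer between the private memories (network-bound)
     call MPI_Allreduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
          MPI_SUM, MPI_COMM_WORLD, ierr)

     if (rank == 0) print *, 'global sum =', global_sum
     deallocate(x)
     call MPI_Finalize(ierr)
   end program hybrid_sketch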

Page 11

The cost of the memory transfer

1 W × 1 year = 1 $ (neglecting cooling and storage)

Some facts about memory transfer: memory bandwidth
40 GB/s (CPU); 20 GB/s (RAM); 3.5 GB/s (interconnect)

Bandwidth evolves much more slowly than computational power:
✓ ~'90s (math co-processor): 1 Flop per 4 Bytes transferred
✗ Nowadays: 62 Flops per Byte transferred

The cost in energy of data movement
Computation: an FMA costs 100 pJ today (10 pJ in the future)
Moving data in RAM costs 4.8 nJ (1.92 nJ)
Communicating data (MPI) costs 7.5 nJ (2.5 nJ)
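Taking these figures at face value (per datum moved, an assumption on the slide's units), the imbalance is stark, and the projected numbers make it worse:

$$\frac{E_{\mathrm{RAM}}}{E_{\mathrm{FMA}}} = \frac{4.8\,\mathrm{nJ}}{100\,\mathrm{pJ}} = 48 \quad\longrightarrow\quad \frac{1.92\,\mathrm{nJ}}{10\,\mathrm{pJ}} = 192 \ \text{(projected)}$$

One RAM access already costs the energy of roughly 48 FMAs.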

⇒ Moore's law revisited: the number of concurrently executing threads will double each year

A complicated scenario for HPC (with ab initio) codes

Page 12

Top500

World Top Supercomputer Ranking

Some considerations
Faster than Moore's law (doubles every 14 months)
⇒ In 8 years, today's No. 1 drops off the list
Hybrid (CPU/GPU) architectures are emerging

Page 13

Distributing the data on a hybrid supercomputer

How can a code be executed on hybrid CPU-GPU architectures?

[Diagram: several nodes connected by a network; each node hosts CPUs with on-board GPUs]

Data transfer is still MPI-based
Only on-board communication between GPU and CPU

Data distribution should depend on the presence of GPUs on the nodes → multilevel parallelization is required

Page 14

Hybrid Supercomputing nowadays

GPGPU on Supercomputers
Traditional architectures are somehow saturating:
more cores per node, memories (slightly) larger but not faster

Architectures of supercomputers are becoming hybrid:
3 out of the top 5 supercomputers are hybrid machines

Extrapolation: in 2015 the No. 500 machine will be petafloppic
Likely it will be a hybrid machine

Codes should be conceived differently
The number of MPI processes is limited for a fixed problem size
Performance increases only by enhancing parallelism
Further parallelisation levels should be added (OpenMP, GPU)

Are (electronic structure calculation) codes suitable?

Page 15

Future scenarios for the supercomputing

Exascale is foreseen for 2018: we cannot wait
Constraints: limited money (200 M$) and power (20 MW)
How can you get to exascale (1000 times more powerful)?
100 times more memory
100× the bandwidth
Interrupt time 10 times smaller (one per day)

The Blue Gene-like scenario
100 thousand nodes with 1000 cores each

The GPU-like scenario
10 thousand nodes with 10 thousand "cores" each

First observations:
Huge thread concurrency
Fault resilience may become crucial

Page 16

Future problems for supercomputing

Architectures change way faster than codes
The number of available CPU hours is increasing
⇒ Don't be scared of asking for Mhours

Which scientific codes are exempted?
Might we ignore this? What is the price to pay?
Produce new science by preserving system size (possible only for new scientific domains)
Stop coding → parallelism is not altered
"Easy" things have already been done → life is hard

This approach cannot last: a (yet) new challenge
Architectures for HPC are market-driven
Low power is now dominating (smartphones)
BigDFT on ARM architecture: 1/30 of the power, 10 times slower!

Page 17

How far is the petaflop (for DFT)?

At present, with traditional architectures
Routinely used DFT calculations are:
a few dozen (hundred) processors
parallel-intensive operations (blocking communications, 60-70 percent efficiency)
not freshly optimised (legacy codes, monster codes)

⇒ Optimistic estimate: 5 GFlop/s per core × 2000 cores × 0.9 = 9 TFlop/s = 200 times less than Top500's No. 3!

By analogy:
Distance Earth-Moon = 384 Mm
Distance Earth-Mars = 78.4 Gm = 200 times more
The Moon is reached... can we go to Mars? (...in 2015?)

Page 18

Using GPUs in a given (DFT) code

Developer and user dilemmas
Does my code fit well? For which systems?
How much does porting cost?
Should I always use GPUs?
How can I interpret the results?

Evaluating GPU convenience
Three levels of evaluation (a worked form follows below):
1 Bare speedups: GPU kernels vs. CPU routines
  Are the operations suitable for a GPU?
2 Full-code speedup on one process
  Amdahl's law: are there hot-spot operations?
3 Speedup in a (massively?) parallel environment
  The MPI layer adds an extra level of complexity
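The second level is just Amdahl's law with a hot-spot acceleration (standard reasoning; the numbers are ours, for illustration): if a fraction f of the runtime sits in accelerated routines that gain a factor s,

$$S(f,s) = \frac{1}{(1-f) + f/s}, \qquad \text{e.g. } f=0.9,\ s=20:\ S = \frac{1}{0.1 + 0.045} \approx 6.9$$

so even a 20× kernel speedup yields less than 7× on the full code.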

Page 19

Case study: 1D convolutions (BigDFT code)
Initially, naive routines (FORTRAN):

$$y(j, I) = \sum_{\ell=L}^{U} h_\ell \, x(I+\ell, j)$$

Easy to write and debug
Defines reference results

   do j=1,ndat
      do i=0,n1
         tt=0.d0
         do l=lowfil,lupfil
            tt=tt+x(i+l,j)*h(l)
         enddo
         y(j,i)=tt
      enddo
   enddo

Optimisation can then start (e.g. Xeon X5550, 2.67 GHz)

Method             GFlop/s   % of peak   SpeedUp
Naive (FORTRAN)    0.54      5.1         1/(6.25)
Current (FORTRAN)  3.3       31          1
Best (C, SSE)      7.8       73          2.3
OpenCL (Fermi)     97        20          29 (12.4)

Page 20

How to optimize?

A trade-off between benefit and effort

FORTRAN-based
✓ Relatively accessible (loop unrolling; see the sketch below)
✓ Moderate optimisation can be achieved relatively fast
✗ Compilers fail to use the vector engine efficiently
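For instance, one typical unrolling of the naive loop from the previous slide (a sketch of the general technique, not BigDFT's actual optimised kernel): four lines of ndat are processed per iteration, so each filter coefficient h(l) is loaded once and reused four times.

   do j=1,ndat-3,4
      do i=0,n1
         tt0=0.d0
         tt1=0.d0
         tt2=0.d0
         tt3=0.d0
         do l=lowfil,lupfil
            tt0=tt0+x(i+l,j  )*h(l)
            tt1=tt1+x(i+l,j+1)*h(l)
            tt2=tt2+x(i+l,j+2)*h(l)
            tt3=tt3+x(i+l,j+3)*h(l)
         enddo
         y(j  ,i)=tt0
         y(j+1,i)=tt1
         y(j+2,i)=tt2
         y(j+3,i)=tt3
      enddo
   enddo
   ! a remainder loop handles ndat not divisible by 4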

Push optimisation to its limit
About 20 different patterns have been studied for one 1D convolution
Tedious work, huge code → maintainability?
⇒ Automatic code generation?

Consider new programming paradigms
New coding approaches are most welcome
→ Khronos' OpenCL standard

Page 21

GPU-ported operations in BigDFT (double precision)

Convolution kernels ((re)written in OpenCL)
✓ Fully functional (all boundary conditions)
✓ Based on the former CUDA version
✓ A 10 to 60× speedup

[Figure: performance ratio to CPU for the Magicfilter, Magicfilter_reverse, Magicfilter_grow, Magicfilter_shrink, Kinetic, Kinetic_k, Analysis, Synthesis, Synthesis_grow, Analysis_shrink, Uncompress and Compress kernels: CPU vs. NVIDIA vs. AMD]
[Figure: GPU speedup (double precision) vs. wavefunction size (MB) for the locden, locham and precond sections]

GPU BigDFT sections
GPU speedups between 5 and 20, depending on:
✓ Wavefunction size
✓ CPU-GPU architecture

Page 22

Interpretation of HPC behaviour

Evaluating the code behaviour is not so easy a task (especially nowadays). Frequent mistakes:
Parallel efficiency is not walltime
Scalability is not only communication-driven
Performance evaluation is a multi-criterion process:
Best scalability (machine point of view)
Best acceleration efficiency (vendor point of view)
Best walltime (user point of view)
But also robustness, fault tolerance, best Flop/W ratio

Anticipated messages
Far from a trivial situation:
No golden rule
Optimal HPC strategies must be interpreted

Page 23

Amdahl’s law

A basic concept
The speedup with N cores depends on the parallel fraction P of the code:

$$\text{speedup} = \frac{1}{\frac{P}{N} + (1-P)}$$

It represents the limit to the scalability of a given code.

An important definition

$$\text{Parallel Efficiency} = \frac{\mathrm{Time}(N_{\mathrm{ref}})}{\mathrm{Time}(N)}\,\frac{N_{\mathrm{ref}}}{N}$$

Often used as a benchmark of a code in a parallel environment.
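A numerical illustration (our numbers): with a parallel fraction P = 0.95,

$$\text{speedup}(64) = \frac{1}{0.95/64 + 0.05} \approx 15.4, \qquad \lim_{N \to \infty} \text{speedup} = \frac{1}{1-P} = 20$$

so beyond a few hundred cores the serial 5% completely dominates.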

Lots of factors involved
Scalability of the problem
Communication performance
Computational cost of operations

Page 24

Intranode bandwidth problem

[Figure: B80 cage, free BC, 120 orbitals, CCRT Titane (2 × 4-core Intel Xeon X5570): time share of Comms, LinAlg, Conv, NL PSP, PSolver, XC and Other, plus speedup and efficiency (%), for MPI × OpenMP combinations from 1×1 to 1×8]

Scalability does not depend only on communication
Amdahl's law is an upper limit!

Page 25

Task repartition for a small system (ZnO, 128 atoms)

[Figure: ZnO 128 atoms, 1 OpenMP thread per core: time share of Comms, LinAlg, Conv, CPU and Other, plus walltime (log scale) and efficiency (%), from 16 to 576 cores]

What are the ideal conditions for acceleration (e.g. GPU)?
The routines to be accelerated should take the majority of the time
What happens to parallel efficiency?

Page 26

Parallelisation and architectures

Same code, same runs. Which is the best?

[Figure: the same run profile (Comms, LinAlg, Conv, CPU, Other; walltime and efficiency vs. number of cores, 16 to 576, 1 OpenMP thread per core) on two machines: CCRT Titane (Nehalem, Infiniband) and CSCS Rosa (Opteron, Cray XT5)]

Titane is 2.3 to 1.6 times faster than Rosa!

Degradation of parallel performance: why?
1 Computing power has increased more than networking
2 Better libraries (MKL)
⇒ Walltime is reduced, but parallel efficiency is lower

This will always happen when using GPUs!

Page 27

Architectures, libraries, networking

Same runs, same sources; different user conditions

[Figure: CPU hours vs. number of cores (up to 600) for Titane MpiBull2 no MKL, Jade MPT, Titane MpiBull2 + MKL, Titane OpenMPI + MKL, and Rosa Cray istanbul]
Differences up to a factor of 3!

A case-by-case study
Considerations are often system-dependent; a universal rule of thumb does not exist.
⇒ Know your code!

Page 28

An (even more) frequent mistake

Example: two DFT codes running on the same system.
Naive question: which one is faster?

The running conditions
Machine generation (CPU, cores, cache, ...)
Parallel environment (MPI procs, OMP threads, GPUs)
Binaries (libraries, compiler, ...)
Network vs. computation performance

The code conditions (DFT example)
Basis set (formalism, cut-off, ...)
Self-consistency (input guess, minimization scheme)

How should this question be posed?
What is the lowest time-to-solution possible for this system on a given machine?
Which is the fastest machine for this system?

Page 29

Data repartition on a node

Non-hybrid case
GPU not used → homogeneous repartition
[Diagram: MPI 0-3 each hold an equal share of DATA; the GPU is idle]

Page 30

Data repartition on a node

"Naive" repartition
All the cores use the GPU at the same time
[Diagram: MPI 0-3 each hold an equal share of DATA; all four access the GPU simultaneously]

Page 31

Data repartition on a node
Inhomogeneous repartition
Only one MPI process uses the GPU, holding more data
[Diagram: MPI 0 holds a larger share of DATA and drives the GPU; MPI 1-3 hold the rest]

Page 32

Data repartition on a node

The S_GPU approach
The S_GPU library manages the GPU resource within the node
[Diagram: MPI 0-3 each hold an equal share of DATA; S_GPU arbitrates their access to the GPU]

Page 33

S_GPU library (M. Ospici, LIG / Bull / CEA)

De-synchronisation of operations
Two semaphores are activated for each card on the node:
Data transfer (CPU → GPU and GPU → CPU)
Calculation on the GPU
Each operation (e.g. the convolution of a wavefunction) is associated with a stream.

Operation overlap
Calculation and data transfer of different streams may overlap
Operations are scheduled on a first-come, first-served basis

Several advantages (a schematic sketch follows)
The time for memory transfers is saved
Heavy calculations can be passed to the card one by one, avoiding scheduling problems
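To fix ideas, a toy model of the two-semaphore design (our sketch, using OpenMP locks in place of the semaphores; this is not the actual S_GPU API): the card's transfer engine and compute engine are each serialised, but a stream that is computing never blocks another stream's transfer.

   program sgpu_sketch
     use omp_lib
     implicit none
     integer(kind=omp_lock_kind) :: transfer_lock, compute_lock
     integer :: istream

     call omp_init_lock(transfer_lock)
     call omp_init_lock(compute_lock)

     !$omp parallel do
     do istream = 1, 4                      ! e.g. one stream per MPI process on the node
        call omp_set_lock(transfer_lock)    ! CPU -> GPU copy (transfer engine is exclusive)
        print *, 'stream', istream, ': copy in'
        call omp_unset_lock(transfer_lock)

        call omp_set_lock(compute_lock)     ! GPU kernel (compute engine is exclusive)
        print *, 'stream', istream, ': compute'
        call omp_unset_lock(compute_lock)

        call omp_set_lock(transfer_lock)    ! GPU -> CPU copy
        print *, 'stream', istream, ': copy out'
        call omp_unset_lock(transfer_lock)
     end do
     !$omp end parallel do

     call omp_destroy_lock(transfer_lock)
     call omp_destroy_lock(compute_lock)
   end program sgpu_sketch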

Page 34

Example of a time chart

The GPU can be viewed as a shared co-processor

[Time chart: for MPI 0-3, the CPU → GPU transfer, GPU calculation, and GPU → CPU transfer phases of the different processes interleave over time on the shared card]

Page 35

Convenience of the S_GPU approach (end of 2009)

Different tests thanks to BigDFT's flexibility
We have performed many tests with different GPU/CPU ratios on the same node

Speedup on the full code (examples)
S_GPU is the best compromise between speedup and easiness. Examples:

CPU - GPU              8-1    8-2    4-2    2-2
S_GPU                  1.96   3.69   3.73   5.09
Inhomogeneous (best)   2.08   2.64   2.32   2.40

Full code tested on multi-GPU platforms
CINES - Iblis: 48 GPUs, prototype calculations
CCRT - Titane: up to 196 GPUs (Grand Challenge 2009)

Page 36

Case study: BigDFT in hybrid codes

Acceleration of the full BigDFT code
Considerable gain may be achieved for suitable systems
Amdahl's law should always be considered
Resources can be used concurrently (OpenCL queues):
more MPI processes may share the same card!

[Figure: Badiane, X5550 + Fermi S2070, ZnO 64 atoms, CPU vs. hybrid: time share of Comms, LinAlg, Conv, CPU and Other, and overall speedup, for the CPU-mkl, CPU-mkl-mpi, CUDA, CUDA-mkl, OCL-cublas, OCL-mkl, CUDA-mpi, CUDA-mkl-mpi, OCL-cublas-mpi and OCL-mkl-mpi configurations]

(Lower time-to-solution for a given architecture)

Page 37

The time-to-solution problem I: Efficiency

Good example: 4 C atoms, surface BC, 113 k-points
Parallel efficiency of 98%; convolutions largely dominate.

Node: 2 × Fermi + 8 × Westmere, 8 MPI processes

# GPUs added     2     4     8
SpeedUp (SU)     5.3   9.8   11.6
# MPI equiv.     44    80    96
Acceler. Eff.    1     .94   .56

[Figure: speedup vs. number of MPI processes: ideal, CPU+MPI, and GPU curves]

Page 38

The time-to-solution problem II: Robustness

Not-so-good example: a system that is too small

[Figure: Titane, ZnO 64 atoms, CPU vs. hybrid: time share (Comms, LinAlg, Conv, CPU, Other), efficiency (%), and GPU speedup, from 1 to 144 MPI processes; hybrid code (rel.) vs. CPU code]

✗ CPU efficiency is poor (the calculation is too fast)
✗ Amdahl's law is not favorable (5× SU at most)
✓ GPU SU is almost independent of the size
✓ The hybrid code always runs faster

Page 39

A look into the near future: science with HPC codes

A concerted set of actions
Improve code functionalities for present-day and next-generation supercomputers
Test and develop new formalisms
⇒ Transform challenges into opportunities (needs work!)

The Mars mission
Is petaflop performance possible?
Multilevel parallelization → one order of magnitude
Bigger systems, heavier methods → (more than) one order of magnitude bigger

Two challenges come from HPC
Conceive unprecedented things on new machines
Preserve and maintain up-to-date functionalities on future machines

Page 40

A rapidly evolving situation

Architecture evolutions
Manycore era (multilevel parallelisation)

Memory traffic as the limiting factor

Software evolutions
Superposition of parallelization layers

Optimization issues: maintainability vs. robustness

Users' ability
Architecture dimensioning: adapt the runs to the system

Performance evaluation approach

And it is not getting better:
New set of architectures (GPU, MIC, BG/Q, ...)
New development paradigms (MPI, OpenMP, OpenCL, ...)
HPC codes must follow (HPC projects, user how-tos, ...)

Page 41

General considerations

What is desirable? (Does it open new directions?)
Performance should lead to improvements

Optimisation effort
Know the code's behaviour and features
Careful performance study of the complete algorithm
Identify critical sections and make them modular
Fundamental for maintainability and architecture evolution
Optimisation cost: consider end-user running conditions
Robustness is more important than best performance

Performance evaluation know-how
No general rule of thumb: what does High Performance mean?
A multi-criterion evaluation process

Multi-level parallelisation must always be used
Your code will not become faster via hardware (anymore)
