École “Programmation Hybride”: a step towards many-cores
L’ÉSCANDILLE – AUTRANS, FRANCE
Parallel Codes and High Performance Computing: Massive Parallelism and Multi-GPU
Luigi Genovese
L_Sim – CEA Grenoble
October 10, 2012
Laboratoire de Simulation Atomistique, http://inac.cea.fr/L_Sim
A basis for nanosciences: the BigDFT project
STREP European project: BigDFT (2005–2008)
Four partners, 15 contributors: CEA-INAC Grenoble (T. Deutsch), U. Basel (S. Goedecker), U. Louvain-la-Neuve (X. Gonze), U. Kiel (R. Schneider)

Aim: to develop an ab initio DFT code based on Daubechies wavelets, to be integrated in ABINIT.
BigDFT 1.0 → January 2008 . . . not only a DFT adventure.

In this presentation
Present HPC scenario
Developers’ and users’ challenges
Outcomes and general considerations
Ab initio calculations with DFT
Several advantages
✓ Ab initio: no adjustable parameters
✓ DFT: quantum-mechanical (fundamental) treatment

Main limitations
✗ Approximated approach
✗ Requires high computing power; limited to a few hundred atoms in most cases
Wide range of applications: nanoscience, biology, materials
Outline
1 Parallel computing and architectures: from past to present (software), HPC nowadays, the memory bottleneck
2 (DFT) Developer point of view: future scenarios, present situation, optimization
3 User viewpoint: frequent mistakes, performance evaluation, an (old) example: the S_GPU library
4 Performances: recent situation (evaluating the GPU gain), practical cases
5 Conclusion and messages
What is Parallel Computing?
Easy to say . . . : the simultaneous use of multiple compute resources to solve a computational problem.

. . . but not so easy to implement:
A problem is broken into multiple parts which can be solved concurrently.
Each part is associated with a series of instructions.
Instructions from each part are executed simultaneously on different Compute Processing Units.
The Compute Processing Unit(s)
A computing machine (node) is made of:
a Control Unit,
an Arithmetic Logic Unit,
a Memory Unit.
They may exist in different ratios on different architectures; after all, they are all transistors.
What does technology offer us?
Moore’s law
40 years of improvements: transistor counts double every two years . . .

. . . but how? Power is the limiting factor (around 100 W nowadays).
Power ∝ Frequency³
→ The clock rate is limited.
→ Multiple slower devices are preferable to one superfast device.
→ More performance with less power → a software problem?
Why a software problem?
The power cost of frequency
                Cores   Hz      Flop/s   W      Flop/s/W
Superscalar     1       1.5×    1.5×     3.3×   0.45
Multicore       2       0.75×   1.5×     0.8×   1.88
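These numbers follow from the cubic law on the previous slide; a quick check in relative units: one core at 1.5× the frequency gives 1.5× the Flop/s for 1.5³ ≈ 3.3× the power, i.e. 1.5/3.3 ≈ 0.45 Flop/s/W, while two cores at 0.75× the frequency give 2 × 0.75 = 1.5× the Flop/s for only 2 × 0.75³ ≈ 0.8× the power, i.e. 1.5/0.8 ≈ 1.88 Flop/s/W.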
Exercise
Take a given computational problem.
Write a code at a time t0 and solve the problem on a computer.
Freeze your code and wait some time t1 − t0.
Take a new computer at time t1 and solve the same problem again.
What happens to your performance?
HPC thumb-rules have changed
Frequency-dominated era
✓ Parallelism is not improved by the architecture
✓ Frequency increases → the number of Flop/s increases
→ The code runs faster

Manycore era
✓ Parallelism is dramatically changed in the architecture
✓ Frequency decreases
→ The code runs slower
✗ The code must be changed

The parallel behaviour of a code (an oversimplification)
Capacity computing: many independent jobs
Capability computing: a single, parallel-intensive job
How to parallelize your data?
Distributed memory
Memory is private to each process.
Processors operate independently.
Data transfers must be programmed explicitly (MPI).
Relies (also) on network performance.

Shared memory
Memory is common to all processors.
Threads operate concurrently on the data.
Relies on bandwidth performance.

Memory operations are crucial for parallelism.
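As an illustration (a minimal sketch in C, not code from the talk), the same array update in the two models: in shared memory the threads simply divide the loop over the common array, while in distributed memory every exchange, here a one-element halo swap, must be written explicitly.

#include <mpi.h>

/* Shared memory: all threads see the same array; no copy is needed,
   OpenMP splits the iterations among the threads of the node. */
void scale_shared(double *x, int n, double a)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        x[i] *= a;
}

/* Distributed memory: each process owns a private slice xloc; the
   neighbour's boundary value must be fetched explicitly over MPI. */
void swap_halo(double *xloc, double *halo, int nloc, MPI_Comm comm)
{
    int rank, nproc;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nproc);
    int right = (rank + 1 < nproc) ? rank + 1 : MPI_PROC_NULL;
    int left  = (rank > 0)         ? rank - 1 : MPI_PROC_NULL;
    /* send my last owned value right; receive the left neighbour's
       into the halo cell (network performance enters here) */
    MPI_Sendrecv(&xloc[nloc - 1], 1, MPI_DOUBLE, right, 0,
                 halo,            1, MPI_DOUBLE, left,  0,
                 comm, MPI_STATUS_IGNORE);
}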
The cost of the memory transfer
1 W × 1 year ≈ 1 $ (neglecting cooling and storage)

Some facts about memory bandwidth:
40 GB/s (CPU); 20 GB/s (RAM); 3.5 GB/s (interconnect)

Bandwidth evolves more slowly than computational power:
✓ ~’90s (math co-processor): 1 Flop/s per 4 Bytes transferred
✗ Nowadays: 62 Flop/s per Byte transferred

The cost in energy of data movement
Computation: an FMA now costs 100 pJ (10 pJ in the future)
Moving data in RAM costs 4.8 nJ (1.92 nJ)
Communicating data (MPI) costs 7.5 nJ (2.5 nJ)
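In other words, with the figures above one RAM access costs the energy of 4.8 nJ / 100 pJ = 48 FMAs, and one MPI transfer costs 7.5 nJ / 100 pJ = 75 FMAs; arithmetic is nearly free compared to data movement, and the gap widens in the projected numbers (1.92 nJ / 10 pJ = 192).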
→ Moore's law revisited: the number of executing threads will double each year.
A complicated scenario for HPC (with ab initio) codes
Top500
World Top Supercomputer Ranking
Some considerations
Growth is faster than Moore's law (performance doubles every 14 months).
→ Within 8 years, the No. 1 machine drops off the list.
Hybrid (CPU/GPU) architectures are emerging.
Distributing the data on a hybrid supercomputer

How can a code be executed on hybrid CPU–GPU architectures?
[Diagram: several nodes connected by a network, each node hosting CPUs with attached GPUs.]
Data transfer is still MPI-based.
Only on-board communication occurs between GPU and CPU.
The data distribution should depend on the presence of GPUs on the nodes → multilevel parallelization is required.
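A hypothetical sketch of what this implies at node level: each MPI process discovers the cards of its node and binds to one. The CUDA runtime is used purely for illustration; local_rank (the rank of the process within its node) is assumed to be supplied by the launcher or derived from a node-local communicator.

#include <cuda_runtime.h>

void bind_rank_to_gpu(int local_rank)
{
    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);
    if (ngpu > 0)
        cudaSetDevice(local_rank % ngpu);  /* several ranks may share a card */
    /* ranks on GPU-less nodes keep the CPU-only code path, so the data
       distribution must account for the resulting imbalance */
}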
Hybrid Supercomputing nowadays
GPGPU on supercomputers
Traditional architectures are somehow saturating: more cores per node, memories (slightly) larger but not faster.
Supercomputer architectures are becoming hybrid: 3 of the 5 top supercomputers are hybrid machines.
Extrapolation: in 2015 the No. 500 machine will be petafloppic, and likely a hybrid one.

Codes should be conceived differently
The number of MPI processes is limited for a fixed problem size.
Performance increases only by enhancing parallelism.
Further parallelisation levels should be added (OpenMP, GPU).
Are (electronic structure calculation) codes suitable for this?
Future scenarios for supercomputing

Exascale is foreseen for 2018: we cannot wait.
Constraints: limited money (200 M$) and power (20 MW). How can you get to exascale (1000 times more powerful) with only:
100 times more memory,
100× the bandwidth,
an interrupt time 10 times smaller (one each day)?

The Blue Gene-like scenario: 100 thousand nodes with 1000 cores each.
The GPU-like scenario: 10 thousand nodes with 10 thousand “cores” each.

First observations:
Huge thread concurrency.
Fault resilience might become crucial.
Future problems from supercomputing

Architectures change far faster than codes, and the number of available CPU hours keeps increasing.
→ Do not be afraid of asking for megahours.

Which scientific codes are exempted? Can we ignore this, and at what price?
Produce new science while keeping the system size (possible only for new scientific domains).
Stop coding → the parallelism is not altered.
“Easy” things have already been done → life is hard.

This approach cannot last: a (yet) new challenge
Architectures for HPC are market-driven, and low power now dominates (smartphones).
BigDFT on an ARM architecture: 1/30 of the power, 10 times slower!
How far is petaflop (for DFT)?
At present, with traditional architectures, routinely used DFT calculations involve:
a few dozen (hundred) processors;
parallel-intensive operations (blocking communications, 60–70 percent efficiency);
code that is not freshly optimised (legacy codes, monster codes).

→ Optimistic estimate: 5 GFlop/s per core × 2000 cores × 0.9 = 9 TFlop/s, i.e. 200 times less than Top 500's #3!

By analogy: the Earth–Moon distance is 384 Mm; the Earth–Mars distance is 78.4 Gm, 200 times more.
The Moon is reached . . . can we go to Mars? (. . . in 2015?)
Using GPUs in a given (DFT) code
Developer and user dilemmas
Does my code fit well? For which systems?
How much does the porting cost?
Should I always use GPUs?
How can I interpret the results?

Evaluating GPU convenience: three levels of evaluation
1 Bare speedups (GPU kernels vs. CPU routines): are the operations suitable for a GPU?
2 Full-code speedup on one process (Amdahl's law): are there hot-spot operations?
3 Speedup in a (massively?) parallel environment: the MPI layer adds an extra level of complexity.
Case study: 1D convolutions (BigDFT code)
Initially, naive routines (FORTRAN):
y(j, I) = ∑_{ℓ=L}^{U} hℓ · x(I + ℓ, j)
Easy to write and debug.
Defines the reference results.
do j=1,ndat              ! loop over the transposed dimension
   do i=0,n1             ! loop over the convolved dimension
      tt=0.d0
      do l=lowfil,lupfil           ! filter support
         tt=tt+x(i+l,j)*h(l)
      enddo
      y(j,i)=tt          ! the result is stored transposed
   enddo
enddo
Optimisation can then start (example: Xeon X5550, 2.67 GHz):
Method              GFlop/s   % of peak   SpeedUp
Naive (FORTRAN)     0.54      5.1         1/(6.25)
Current (FORTRAN)   3.3       31          1
Best (C, SSE)       7.8       73          2.3
OpenCL (Fermi)      97        20          29 (12.4)
How to optimize?
A trade-off between benefit and effort

FORTRAN-based
✓ Relatively accessible (loop unrolling)
✓ Moderate optimisation can be achieved relatively fast
✗ Compilers fail to use the vector engine efficiently

Pushing optimisation to the limit
About 20 different patterns have been studied for a single 1D convolution.
Tedious work, huge code → maintainability?
→ Automatic code generation?

Consider new programming paradigms
New coding approaches are most welcome → the Khronos OpenCL standard.
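For illustration only, the 1D convolution above could look as follows in OpenCL C. This is a hypothetical minimal version (one work-item per output element, no local-memory blocking or unrolling, periodic boundaries, |lowfil| ≤ n1 assumed), not BigDFT's tuned kernel:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void conv1d_transpose(__global const double *x,   /* input, (n1, ndat) */
                               __global double *y,         /* output, transposed */
                               __constant double *h,       /* filter, lowfil..lupfil */
                               int n1, int ndat,
                               int lowfil, int lupfil)
{
    int i = get_global_id(0);              /* convolved dimension */
    int j = get_global_id(1);              /* transposed dimension */
    if (i >= n1 || j >= ndat) return;

    double tt = 0.0;
    for (int l = lowfil; l <= lupfil; l++) {
        int k = (i + l + n1) % n1;         /* periodic wrap */
        tt += x[k + n1 * j] * h[l - lowfil];
    }
    y[j + ndat * i] = tt;                  /* store transposed, as in FORTRAN */
}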
GPU-ported operations in BigDFT (double precision)
Convolution kernels ((re)written in OpenCL)
✓ Fully functional (all boundary conditions)
✓ Based on the former CUDA version
✓ A 10× to 60× speedup
[Bar chart: speedup ratio to CPU (up to ~70×) for the OpenCL-ported kernels: Magicfilter, Magicfilter_reverse, Magicfilter_grow, Magicfilter_shrink, Kinetic, Kinetic_k, Analysis, Synthesis, Synthesis_grow, Analysis_shrink, Uncompress, Compress.]

Performances of CPU vs. NVIDIA vs. AMD
[Plot: GPU speedup (double precision, roughly 2–20×) vs. wavefunction size (MB) for the locden, locham and precond sections, on NVIDIA and AMD cards.]
GPU sections of BigDFT: GPU speedups between 5 and 20, depending on:
✓ the wavefunction size
✓ the CPU–GPU architecture
Interpretation of HPC behaviour:
Evaluating the code behaviour is not so easy a task (especially nowadays). Frequent mistakes:
Parallel efficiency is not walltime.
Scalability is not only communication-driven.
Performance evaluation is a multi-criterion process:
best scalability (the machine's point of view);
best acceleration efficiency (the vendor's point of view);
best walltime (the user's point of view);
but also robustness, fault tolerance, and the best Flop/W ratio.

Anticipated messages
The situation is far from trivial: there is no golden rule, and optimal HPC strategies must be interpreted.
Amdahl’s law
A basic concept
The speedup with N cores depends on the parallel fraction P of the code:

   speedup = 1 / (P/N + (1 − P))

It represents the limit to the scalability of a given code.

An important definition

   Parallel Efficiency = [Time(Nref) / Time(N)] × (Nref / N)

Often used as a benchmark of a code in a parallel environment. Lots of factors are involved:
scalability of the problem;
communication performance;
computational cost of the operations.
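A toy illustration of both formulas in C (not from the talk; the function names are ours):

#include <stdio.h>

/* Amdahl speedup for a parallel fraction P of the code on N cores */
double amdahl(double P, int N)
{
    return 1.0 / (P / N + (1.0 - P));
}

/* parallel efficiency from two measured walltimes */
double efficiency(double t_ref, int n_ref, double t, int n)
{
    return (t_ref * n_ref) / (t * n);
}

int main(void)
{
    /* even a 95%-parallel code saturates at a speedup of 20 */
    for (int n = 1; n <= 1024; n *= 4)
        printf("N = %4d   speedup = %6.2f\n", n, amdahl(0.95, n));
    /* the same formula bounds acceleration: a fraction P of the runtime
       accelerated s times gives an overall speedup of amdahl(P, s) */
    return 0;
}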
Intranode bandwidth problem
B80 cage, free BC, 120 orbitals; CCRT Titane: 2 × 4-core Intel Xeon X5570
[Stacked-bar plot: percentage of walltime in Comms, LinAlg, Conv, NL PSP, PSolver, XC and Other, with speedup and efficiency (%) curves, for MPI × OMP combinations from 1×1 to 1×8.]
Scalability does not depend only on communication: Amdahl's law is an upper limit!
Task repartition for a small system (ZnO, 128 atoms)
1 OMP thread per core
[Stacked-bar plot: percentage of walltime in Comms, LinAlg, Conv, CPU and Other, with walltime (seconds, log scale) and efficiency (%) curves, vs. number of cores from 16 to 576.]
What are the ideal conditions for acceleration (e.g. with GPUs)? The routines to be accelerated should take the majority of the time.
But then, what happens to parallel efficiency?
Parallelisation and architectures
Same code, same runs. Which is the best?
[Two stacked-bar plots of the same kind as above (Comms, LinAlg, Conv, CPU, Other; walltime and efficiency vs. number of cores, 16 to 576, 1 OMP thread per core), one per machine:]
CCRT Titane (Nehalem, Infiniband) vs. CSCS Rosa (Opteron, Cray XT5)
Titane is 2.3 to 1.6 times faster than Rosa!

Degradation of parallel performance: why?
1 Computing power has increased more than the network.
2 Better libraries (MKL).
→ The walltime is reduced, but the parallel efficiency is lower.
This will always happen when using GPUs!
Architectures, libraries, networking
Same runs, same sources; different user conditions
[Plot: CPU hours vs. number of cores (up to 600) for the same runs under different conditions: Titane MpiBull2 no MKL; Jade MPT; Titane MpiBull2, MKL; Titane OpenMPI, MKL; Rosa Cray, Istanbul.]
Differences up to a factor of 3!
A case-by-case study
Considerations are often system-dependent; a rule of thumb does not always exist.
→ Know your code!
An (even more) frequent mistake
Example: two DFT codes running on the same system. Naive question: which one is faster?

The running conditions
Machine generation (CPU, cores, cache, . . . )
Parallel environment (MPI processes, OMP threads, GPUs)
Binaries (libraries, compiler, . . . )
Network vs. computation performance

The code conditions (DFT example)
Basis set (formalism, cut-off, . . . )
Self-consistency (input guess, minimization scheme)

How should this question be posed?
What is the lowest possible time-to-solution for this system on a given machine?
Which is the fastest machine for this system?
Data repartition on a node
Non-hybrid case: the GPU is not used → homogeneous repartition.
[Diagram: MPI 0, MPI 1, MPI 2 and MPI 3, each with an equal share of DATA; the GPU stays idle.]
Data repartition on a node
“Naive” repartition: all the cores use the GPU at the same time.
[Diagram: MPI 0–3, each with an equal share of DATA, all accessing the single GPU simultaneously.]
Data repartition on a node

Inhomogeneous repartition: only one process uses the GPU and holds more data.
[Diagram: MPI 0–2 with smaller shares of DATA; MPI 3 with a larger share and exclusive access to the GPU.]
Data repartition on a node
The S_GPU approach: the S_GPU library manages the GPU resource within the node.
[Diagram: MPI 0–3, each with an equal share of DATA, all reaching the GPU through the S_GPU layer.]
S_GPU library (M. Ospici, LIG / Bull / CEA)
De-synchronisation of operations
Two semaphores are activated for each card on the node:
data transfer (CPU → GPU and GPU → CPU);
calculation on the GPU.
Each operation (e.g. the convolution of a wavefunction) is associated with a stream.

Operation overlap
Calculations and data transfers of different streams may overlap.
Operations are scheduled on a first-come, first-served basis.

Several advantages
The time for memory transfers is saved.
Heavy calculations can be passed to the card one by one, avoiding scheduling problems.
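The S_GPU API itself is not shown in the talk; the following C sketch only illustrates the overlap idea with plain CUDA streams. launch_convolution stands for a hypothetical asynchronous kernel launch, and the host buffers are assumed pinned (cudaMallocHost) so that the copies really overlap with computation:

#include <cuda_runtime.h>

void process_orbitals(double **h_psi, double **d_psi, size_t nbytes,
                      int norb, cudaStream_t *stream)
{
    for (int i = 0; i < norb; i++) {
        /* stage in, compute, stage out: all asynchronous on stream i,
           so the transfers of orbital i+1 overlap the work on orbital i */
        cudaMemcpyAsync(d_psi[i], h_psi[i], nbytes,
                        cudaMemcpyHostToDevice, stream[i]);
        /* launch_convolution(d_psi[i], stream[i]);   hypothetical kernel */
        cudaMemcpyAsync(h_psi[i], d_psi[i], nbytes,
                        cudaMemcpyDeviceToHost, stream[i]);
    }
    for (int i = 0; i < norb; i++)
        cudaStreamSynchronize(stream[i]);   /* first come, first served */
}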
Example of a time chart
The GPU can be viewed as a shared co-processor
[Time chart: for MPI 0–3, CPU → GPU transfer, GPU calculation and GPU → CPU transfer segments interleave, so the card serves the four processes in turn while their transfers and computations overlap.]
Convenience of the S_GPU approach (end of 2009)

Different tests thanks to BigDFT's flexibility
We performed many tests, with different GPU/CPU ratios on the same node.

Speedup on the full code (examples)
S_GPU is the best compromise between speedup and ease of use:

CPU – GPU              8 – 1   8 – 2   4 – 2   2 – 2
S_GPU                  1.96    3.69    3.73    5.09
Inhomogeneous (best)   2.08    2.64    2.32    2.40

Full code tested on multi-GPU platforms
CINES Iblis: 48 GPUs, prototype calculations
CCRT Titane: up to 196 GPUs (Grand Challenge 2009)
Case study: BigDFT in hybrid codes
Acceleration of the full BigDFT code
A considerable gain may be achieved for suitable systems; Amdahl's law should always be considered.
Resources can be used concurrently (OpenCL queues): several MPI processes may share the same card!
Badiane, X5550 + Fermi S2070, ZnO 64 atoms: CPU vs. hybrid
[Stacked-bar plot: percentage of walltime in Comms, LinAlg, Conv, CPU and Other, with the overall speedup (up to ~10×), for the CPU-mkl, CPU-mkl-mpi, CUDA, CUDA-mkl, OCL-cublas, OCL-mkl, CUDA-mpi, CUDA-mkl-mpi, OCL-cublas-mpi and OCL-mkl-mpi variants.]
(Lower time-to-solution for a given architecture)
The time-to-solution problem I: Efficiency
Good example: 4 C atoms, surface BC, 113 k-points.
Parallel efficiency of 98%; convolutions largely dominate.
Node: 2× Fermi + 8× Westmere; 8 MPI processes.

# GPUs added     2      4      8
SpeedUp (SU)     5.3    9.8    11.6
# MPI equiv.     44     80     96
Acceler. Eff.    1      .94    .56
[Plot: speedup vs. number of MPI processes (up to ~100): ideal, CPU+MPI, and GPU curves.]
The time-to-solution problem II: Robustness

Not-so-good example: a too-small system
Titane, ZnO 64 atoms: CPU vs. hybrid
[Stacked-bar plot: percentage of walltime in Comms, LinAlg, Conv, CPU and Other, with efficiency (%) of the CPU code and relative speedup of the hybrid code (up to ~7×), vs. number of MPI processes from 1 to 144.]
✗ The CPU efficiency is poor (the calculation is too fast).
✗ Amdahl's law is not favorable (a 5× SU at most).
✓ The GPU SU is almost independent of the size.
✓ The hybrid code always runs faster.
A look into the near future: science with HPC codes

A concerted set of actions
Improve code functionality for present-day and next-generation supercomputers.
Test and develop new formalisms.
→ Transform challenges into opportunities (this needs work!).

The Mars mission: is petaflop performance possible?
Multilevel parallelization → one order of magnitude.
Bigger systems, heavier methods → (more than) one order of magnitude more.

Two challenges come from HPC
Conceive unprecedented things on new machines.
Preserve and maintain up-to-date functionality on future machines.
A rapidly evolving situation
Architecture evolutions
The manycore era (multilevel parallelisation).
Memory traffic as the limiting factor.

Software evolutions
Superposition of parallelization layers.
Optimization issues: maintainability vs. robustness.

Users' ability
Architecture dimensioning: adapt the runs to the system.
A performance-evaluation approach.

And it is not getting better:
new sets of architectures (GPU, MIC, BG/Q, . . . );
new development paradigms (MPI, OpenMP, OpenCL, . . . );
HPC codes must follow (HPC projects, user how-tos, . . . ).
General considerations
What is desirable? (Does it open new directions?)
Performance should lead to improvements.

Optimisation effort
Know the code's behaviour and features; carefully study the performance of the complete algorithm.
Identify the critical sections and make them modular: fundamental for maintainability and for architecture evolution.
Optimisation cost: consider the end-user's running conditions; robustness is more important than peak performance.

Performance-evaluation know-how
There is no general rule of thumb: what does High Performance mean? It is a multi-criterion evaluation process.

Multi-level parallelisation must always be used
Your code will not become faster via hardware (anymore).