TRANSCRIPT
-
Member of the Helmholtz Association
NVIDIA Application
Lab at Jülich
Dirk Pleiter | Jülich Supercomputing Centre (JSC)
-
14.11.2012 Dirk Pleiter | NVIDIA Application Lab at Jülich 2
Forschungszentrum Jülich at a Glance (status 2010)
Budget: 450 million Euro
Staff: 4,800 (of which 1,630 scientists)
Visiting scientists: 900 per year
Trainees: 90
Publications: 1,800
Protective rights and licences: 14,800
Research fields: health, energy and environment, and information technology; key technologies for tomorrow
-
Jülich Supercomputing Centre
Supercomputer operation for:
  Centre: FZJ
  Regional: JARA
  Helmholtz & national: NIC, GCS
  Europe: PRACE, EU projects
Application support: user support; coordination with SimLabs
Scientific visualization
Peer review support and coordination
R&D work: algorithms, performance analysis and tools
Community data management service
Computer architectures, Exascale Laboratories: EIC, ECL, NVIDIA
Education and training
-
Supercomputer Systems: Dual Track Approach (2004-2014)

General-purpose track:
  IBM Power 4+ (JUMP), 9 TFlop/s (2004)
  IBM Power 6 (JUMP), 9 TFlop/s
  JUROPA, 200 TFlop/s, and HPC-FF, 100 TFlop/s
  JUDGE, 240 TFlop/s
  JUROPA++ (planned 2014): cluster, 1-2 PFlop/s, plus Booster

Highly-scalable track:
  IBM Blue Gene/L (JUBL), 45 TFlop/s
  IBM Blue Gene/P (JUGENE), 1 PFlop/s
  IBM Blue Gene/Q (JUQUEEN), 5.7 PFlop/s (target, 2012)

Shared infrastructure: file server (GPFS, Lustre)
-
JUDGE Cluster
System:
  206 IBM iDataPlex nodes
  2 Tesla M2050 or M2070 per node
  InfiniBand QDR network
  Peak performance: 239 TFlop/s
Users: Institute for Advanced Simulation
  Molecular dynamics and mechanics, micro-magnetism simulations, medical image reconstruction
  JuBrain partition
  Milky Way partition
-
NVIDIA Application Lab at Jülich
Collaboration between JSC and NVIDIA since July 2012:
  Enable scientific applications for GPU-based architectures
  Provide support for their optimization
  Investigate performance and scaling
Work focus:
  Application requirements analysis
  Kepler and CUDA feature analysis
  Parallelization on many GPUs
  Collaboration with performance tools developers
  Training
-
Pilot Application: JuBrain
Application developed at the Institute of Neuroscience and Medicine (INM-1) at Forschungszentrum Jülich:
Katrin Amunts, Markus Axer, Marcel Huysegoms
Research goal Accurate, highly detailed computer model of the human brain
-
Brain Section Images
Blockface pictures: created while cutting the brain into sections
Histological images: polarized light images
Low resolution vs. high resolution:
  100 μm → 3 μm pixel size
  30 MByte → 40 GByte of data
Challenge: the 3D reconstruction exceeds GPU memory capacity
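The size jump follows from the resolution change: going from 100 μm to 3 μm pixels scales the data by roughly (100/3)² ≈ 1100×, turning a ~30 MByte section into tens of GByte, more than any single GPU of that era could hold. One common way around this, shown here as a generic sketch (sizes, names, and the 6 GByte device budget are illustrative assumptions, not the lab's actual code), is to split the dataset into tiles that fit device memory and stream them one at a time:

```c
/* Streaming a dataset that exceeds GPU memory, tile by tile.
   All sizes are illustrative. */

#define IMAGE_BYTES  (40ULL * 1024 * 1024 * 1024)  /* one 3 um section, ~40 GByte     */
#define DEVICE_BYTES ( 6ULL * 1024 * 1024 * 1024)  /* hypothetical GPU memory budget  */

/* Number of tiles needed so that every tile fits into the device budget
   (ceiling division). */
unsigned long long num_tiles(unsigned long long image, unsigned long long budget)
{
    return (image + budget - 1) / budget;
}

/* Host-side driver skeleton: each iteration would copy one tile to the GPU,
   run the kernel on it, and copy the result back. */
void process_in_tiles(void)
{
    unsigned long long tiles = num_tiles(IMAGE_BYTES, DEVICE_BYTES);
    for (unsigned long long t = 0; t < tiles; t++) {
        /* upload(t); run_kernel(t); download(t); */
    }
}
```

With these numbers, 40 GByte over a 6 GByte budget needs 7 tiles; overlapping the tile transfers with kernel execution (e.g. via CUDA streams) would hide much of the copy cost.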
-
3D Reconstruction
Registration algorithms:
  Rigid registration → 3 parameters
  Affine registration → 6 parameters
  Elastic registration → O(100) parameters
Registration pipeline: the moving image is mapped onto the fixed image via a transformation and an interpolator; a metric scores the match and an optimizer updates the transformation parameters
O(30) speedup on GPU
-
Fluid Dynamics on Fermi and Kepler
Lattice Boltzmann method, D2Q37 model
Application developed at U Rome Tor Vergata/INFN, U Ferrara/INFN, TU Eindhoven
Reproduces the dynamics of a fluid by simulating virtual particles which collide and propagate
Simulation of large systems requires double-precision computation on many GPUs
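The D2Q37 model carries 37 particle populations per lattice site, which is far too large for a short example. As a much smaller stand-in, here is a D1Q3 BGK lattice in plain C with the same collide/propagate split the talk describes; all constants and names are illustrative, not the actual application code.

```c
#define NX   64      /* lattice sites (periodic boundary)   */
#define NPOP 3       /* D1Q3: velocities 0, +1, -1          */
#define TAU  0.8     /* BGK relaxation time                 */

static const int    c[NPOP] = { 0, 1, -1 };
static const double w[NPOP] = { 2.0/3.0, 1.0/6.0, 1.0/6.0 };

static double f[NPOP][NX], ftmp[NPOP][NX];

/* Collide: relax each population towards the local equilibrium distribution.
   This step is arithmetic-bound, like the talk's collide kernel. */
void collide(void)
{
    for (int x = 0; x < NX; x++) {
        double rho = 0.0, mom = 0.0;
        for (int i = 0; i < NPOP; i++) {
            rho += f[i][x];
            mom += c[i] * f[i][x];
        }
        double u = mom / rho;
        for (int i = 0; i < NPOP; i++) {
            double cu  = c[i] * u;
            double feq = w[i] * rho * (1.0 + 3.0*cu + 4.5*cu*cu - 1.5*u*u);
            f[i][x] += (feq - f[i][x]) / TAU;
        }
    }
}

/* Propagate: each population moves one site along its velocity (periodic).
   This step is pure data movement, like the talk's propagate kernel. */
void propagate(void)
{
    for (int i = 0; i < NPOP; i++)
        for (int x = 0; x < NX; x++)
            ftmp[i][(x + c[i] + NX) % NX] = f[i][x];
    for (int i = 0; i < NPOP; i++)
        for (int x = 0; x < NX; x++)
            f[i][x] = ftmp[i][x];
}
```

The split mirrors why the two GPU kernels behave so differently in the following slides: collide is dominated by floating-point arithmetic, while propagate is dominated by memory traffic.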
-
Collide Kernel on Fermi
Kernel dominated by arithmetic operations
[Figure: floating-point performance [GFlop/s] as a function of the number of threads/block]
Implementation: F. Schifano (U Ferrara/INFN)
Excellent performance on Fermi
-
Kepler Performance Tuning
Performance analysis observations:
  Significant increase of L1 cache misses: 17% (Tesla M2090) → 67% (Tesla K20)
  SM performance increased, but L1 cache capacity remained unchanged
Problem mitigated by a simple code change: enforce loop unrolling to eliminate indirect memory accesses

Before:
    for (i = 0; i < NPOP-1; i++) {
        lPop = p_prv[i*NX*NY + idx];
        u = u + param_cx[i] * lPop;
        v = v + param_cy[i] * lPop;
    }

After:
    #pragma unroll
    for (i = 0; i < NPOP-1; i++) {
        lPop = p_prv[i*NX*NY + idx];
        u = u + param_cx[i] * lPop;
        v = v + param_cy[i] * lPop;
    }

Implementation: J. Kraus (NVIDIA Application Lab)
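The `#pragma unroll` helps because the trip count is a compile-time constant: the compiler replicates the loop body, so every `param_cx[i]` and `param_cy[i]` index becomes a constant and the coefficients can be kept in registers or fetched through the constant cache instead of being loaded with a runtime index on each iteration, relieving pressure on L1. This standalone C sketch (NPOP, the coefficient values, and the helper names are made up for illustration) shows that the transformation is behavior-preserving: the fully unrolled form accumulates exactly the same result as the loop.

```c
#define NPOP 4   /* illustrative; the D2Q37 kernel has 37 populations */

static const double param_cx[NPOP] = { 1.0, -1.0, 2.0, 0.5 };

/* Runtime-indexed loop, analogous to the original kernel body. */
double u_looped(const double pop[NPOP])
{
    double u = 0.0;
    for (int i = 0; i < NPOP; i++)
        u = u + param_cx[i] * pop[i];
    return u;
}

/* What full unrolling produces: every index is a compile-time constant,
   so param_cx[0..3] can live in registers or the constant cache. */
double u_unrolled(const double pop[NPOP])
{
    return param_cx[0] * pop[0]
         + param_cx[1] * pop[1]
         + param_cx[2] * pop[2]
         + param_cx[3] * pop[3];
}
```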
-
Collide Kernel on Kepler GK110
Comparison Fermi vs. Kepler; grid size considered here: 252 × 16384
[Figure: floating-point performance as a function of the number of threads/block]
Performance improvement: 1.7×
-
Propagate Kernel
Kernel dominated by memory accesses; grid size considered here: 252 × 16384
[Figure: memory bandwidth [GByte/s] as a function of the number of threads/block]
Performance improvement: 1.4×
-
Summary
NVIDIA Application Lab at Jülich: a new and fruitful model for collaboration; we are just at the beginning ...
Application requirements analysis: JuBrain, a project aiming for a realistic model of the human brain
Kepler feature analysis: initial performance results for the Lattice Boltzmann application on GK110; the very high performance level reached on Fermi can be sustained