TRANSCRIPT
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA
Jeremy Sweezy, Scientist
Monte Carlo Methods, Codes and Applications Group
3/28/2018
LA-UR-18-XXXX
Breaking Through the Barriers to GPU Accelerated Monte Carlo Particle Transport
GTC 2018
What is Monte Carlo Particle Transport?
3/23/18 | 2Los Alamos National Laboratory
– Follows the path of individual particles through a system
– Uses pseudo-random numbers to sample processes
– Randomly samples physical and non-physical processes
– Attributed to Stanislaw Ulam and Enrico Fermi
– Named because Ulam had an uncle who would borrow money from relatives because he “just had to go to Monte Carlo”
FERMIAC
Porting to Specialized Hardware is Prohibitively Expensive
– The world’s production Monte Carlo codes have decades of development
– LANL’s MCNP code has been in development since 1977
– Equally extensive amount of V&V effort
– Codes have to run on desktop machines and supercomputers
– DOE HPC platforms have been in a state of flux for the last 10 years
  • Cell Broadband Engine
  • Intel Xeon Phi (MIC)
  • GPUs
  • ARM???
Barrier #1: Limited Resources (Money, People, Time)
Monte Carlo Random Walk on GPU Hardware has reached a Performance Wall
• At least 6 different research groups have ported the Monte Carlo random walk to GPU hardware for neutron transport
• All report results against different numbers of CPUs
• All get the same results!
• Almost all are extremely simplified
• Production codes will likely have worse performance.
• What are the limitations?
– Conditional branching
– Random data access
– No small, computationally intensive kernel to accelerate
Barrier #2: Performance of random walk on GPUs
[Figure annotations: 4.5x, 3.0x]
How do You Define Performance?
• A computer scientist might measure performance as an increase in speed.
$$P = \frac{T_{CPU}}{T_{GPU}}$$
• A Monte Carlo specialist would measure performance as a balance between speed and statistical variance, using a Figure-of-Merit (FOM).
$$FOM = \frac{\sigma_{CPU}^{2} \, T_{CPU}}{\sigma_{GPU}^{2} \, T_{GPU}}$$
Example:
$$FOM = \frac{0.1^{2} \times 1\,\text{min}}{0.05^{2} \times 2\,\text{min}} = 2$$
To date, almost all GPU implementations of Monte Carlo particle transport have focused on increasing speed.
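The two performance measures on this slide can be checked numerically. The following is a minimal sketch in Python; the function names `speedup` and `figure_of_merit` are illustrative and do not come from the talk. It reproduces the slide's example, where the GPU run takes twice as long but halves the relative uncertainty.

```python
def speedup(t_cpu, t_gpu):
    """Speed ratio P = T_CPU / T_GPU."""
    if t_gpu <= 0.0:
        raise ValueError("t_gpu must be positive")
    return t_cpu / t_gpu


def figure_of_merit(sigma_cpu, t_cpu, sigma_gpu, t_gpu):
    """Relative Figure-of-Merit: (sigma_CPU^2 * T_CPU) / (sigma_GPU^2 * T_GPU)."""
    if sigma_gpu <= 0.0 or t_gpu <= 0.0:
        raise ValueError("sigma_gpu and t_gpu must be positive")
    return (sigma_cpu ** 2 * t_cpu) / (sigma_gpu ** 2 * t_gpu)


# Slide example: CPU run has sigma = 0.1 after 1 min, GPU run has sigma = 0.05 after 2 min.
p = speedup(1.0, 2.0)                        # the GPU run is slower in wall-clock time
fom = figure_of_merit(0.1, 1.0, 0.05, 2.0)   # but its Figure-of-Merit is about 2
print(p, round(fom, 6))
```

In this example the speed ratio alone would call the GPU run a slowdown, while the Figure-of-Merit shows it delivering about twice the statistical value per unit time.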
Next Event Estimator
• The next-event estimator calculates the probability that a particle from a source or collision event reaches a point without interaction
• Typically used for image tallies
[Figure: next-event estimator geometry. Labels: A, Cell 1, Cell 2, μ, Image Plane, B]
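$$S(R, E) = \frac{w}{2\pi R^{2}} \times \sum_{i=1}^{N} \frac{\sigma_i(R, E)}{\sigma_T} \, p_i(\mu, E \to E') \, \exp\left(-\int_{0}^{R} \Sigma_T(s, E') \, ds\right)$$
Ray-cast: the attenuation integral inside the exponential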
One to two orders of magnitude faster on GPU hardware
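The next-event score can be sketched in a few lines of Python. This is an illustrative sketch, not MonteRay code. It makes three assumptions that are not in the slide: the ray from the event to the detector point is already split into per-cell segments, each cell has a constant total cross-section, and all reaction channels share one outgoing energy so the attenuation factor is common to them. The names `optical_depth` and `next_event_score` are hypothetical.

```python
import math


def optical_depth(segments):
    """Ray-cast step: sum Sigma_T * length over the cells the ray crosses.

    segments: list of (sigma_t, length) pairs, one per cell along the ray.
    """
    return sum(sigma_t * length for sigma_t, length in segments)


def next_event_score(weight, channels, segments):
    """Score w / (2*pi*R^2) * sum_i (sigma_i/sigma_T) * p_i(mu) * exp(-tau).

    channels: list of (sigma_i / sigma_T, p_i(mu)) pairs, one per reaction.
    Assumes every channel shares the same outgoing energy, so the
    attenuation exp(-tau) is computed once and factored out of the sum.
    """
    distance = sum(length for _, length in segments)  # R, event to detector
    if distance <= 0.0:
        raise ValueError("detector must be at a positive distance")
    angular = sum(fraction * pdf for fraction, pdf in channels)
    tau = optical_depth(segments)
    return weight * angular * math.exp(-tau) / (2.0 * math.pi * distance ** 2)


# One channel with p(mu) = 0.5, ray crossing two cells (1 cm and 2 cm thick).
score = next_event_score(1.0, [(1.0, 0.5)], [(0.5, 1.0), (0.25, 2.0)])
print(score)
```

The only step that needs the geometry is `optical_depth`, which is the part the slide labels as the ray-cast.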
Traditional Track-Length Estimator
• The standard Monte Carlo fluence estimator
• Uses the sampled distance in each cell as the fluence estimator
• Only contributes to cells through which the particle passes
• Easy to compute
• Nothing to accelerate on GPU
[Figure: track-length estimator geometry. Labels: Cell 1, B, Cell 2, Cell 3]
Computing has changed, we need to change our algorithms too!
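For comparison, a track-length tally is only a few arithmetic operations per cell crossing, which is why there is little to accelerate. The sketch below uses the textbook form of the estimator (weight times distance divided by cell volume). The division by volume and the function name `track_length_tally` are assumptions, since the slide only states that the sampled distance in each cell is used.

```python
def track_length_tally(tally, weight, path, volumes):
    """Add one particle flight to a track-length fluence tally.

    tally:   dict mapping cell id -> accumulated fluence estimate
    path:    list of (cell id, distance travelled in that cell)
    volumes: dict mapping cell id -> cell volume
    Only cells the particle actually passes through receive a score.
    """
    for cell, distance in path:
        tally[cell] = tally.get(cell, 0.0) + weight * distance / volumes[cell]
    return tally


# A particle of weight 2 travels 3 cm in cell 1 and 1 cm in cell 2.
tally = track_length_tally({}, 2.0, [(1, 3.0), (2, 1.0)], {1: 6.0, 2: 4.0, 3: 5.0})
print(tally)  # cell 3 is never crossed, so it gets no entry
```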
Volumetric-Ray-Casting Estimator
• For use in place of the traditional track-length estimator on GPU
• Multiple pseudo-rays are generated at each source and collision event
• Computationally intensive estimator with lower variance
[Figure: volumetric-ray-casting estimator geometry. Labels: Cell 1, B, Cell 2, Cell 3]
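$$F(i, E') = w \, \frac{1 - \exp\left(-\Sigma_{T,i}(E') \, l_i\right)}{N \, \Sigma_{T,i}(E')} \, \exp\left(-\int_{0}^{r' - r} \Sigma_T(r + \Omega' s', E') \, ds'\right)$$
Ray-cast: the attenuation integral inside the second exponential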
A neutron dance for a neutron fan. P.M. Dawn
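A minimal Python sketch of the volumetric-ray-casting score for a single pseudo-ray follows. It is not MonteRay's implementation. It assumes the ray has already been split into per-cell segments with constant total cross-sections, and it reads the attenuation integral as running from the event point to the entry point of cell i. The name `vrc_scores` and the void-cell handling are hypothetical.

```python
import math


def vrc_scores(weight, n_rays, segments):
    """Score every cell crossed by one pseudo-ray.

    segments: list of (cell id, sigma_t, length) in the order the ray
              crosses them, starting at the source or collision point.
    Each cell gets w / N * (1 - exp(-Sigma * l)) / Sigma, attenuated by
    exp(-tau), where tau is the optical depth from the event to the cell.
    """
    scores = {}
    tau_to_entry = 0.0  # optical depth accumulated before entering the cell
    for cell, sigma_t, length in segments:
        if sigma_t > 0.0:
            expected_track = (1.0 - math.exp(-sigma_t * length)) / sigma_t
        else:
            expected_track = length  # limit of the expression as Sigma -> 0
        contribution = weight * expected_track * math.exp(-tau_to_entry) / n_rays
        scores[cell] = scores.get(cell, 0.0) + contribution
        tau_to_entry += sigma_t * length
    return scores


# One ray crossing cell 1 (Sigma = 0.5, 2 cm) and then cell 2 (Sigma = 1.0, 1 cm).
print(vrc_scores(1.0, 1, [(1, 0.5, 2.0), (2, 1.0, 1.0)]))
```

Unlike the track-length estimator, every cell along the ray receives a score at every event, which is consistent with the slide's description of a more expensive estimator with lower variance.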
MonteRay - Accelerating Monte Carlo Transport with GPU Ray Tracing
• MonteRay – A library for accelerating Monte Carlo tallies with GPUs
• Random walk is maintained on CPU
• Ray casting based tallies are calculated on the GPU
– Next-Event estimator
– Volumetric-Ray-Casting estimator, a new estimator designed for GPUs
– Supports neutron and photon tallies
• Can be incorporated into new and legacy Monte Carlo codes
• Uses continuous energy cross-section data
• Single precision ray casting
• Single precision attenuation cross-sections
• Double precision tallies
Reduces cost of accelerating an existing Monte Carlo code with GPUs
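The slide does not say why tallies are kept in double precision while ray casting uses single precision. One common motivation in Monte Carlo codes is that a tally accumulates very many small scores, and a single-precision accumulator can silently drop them. The sketch below illustrates that effect in Python by emulating 32-bit rounding with the standard `struct` module; the helper names are hypothetical.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(increment, count, start=1.0, single_precision=False):
    """Add `increment` to `start` `count` times, optionally rounding the
    running total to single precision after every addition."""
    total = to_float32(start) if single_precision else start
    for _ in range(count):
        total += increment
        if single_precision:
            total = to_float32(total)
    return total


# 1000 scores of 1e-8 added to a tally that already holds 1.0.
print(accumulate(1e-8, 1000, single_precision=True))   # the small scores are lost
print(accumulate(1e-8, 1000, single_precision=False))  # the small scores are kept
```

Each increment is below half the spacing between adjacent 32-bit floats near 1.0, so the single-precision total never moves, while the double-precision total grows as expected.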
MonteRay - Testing
• Tests use:
– GeForce GTX TitanX GPU with NVIDIA Maxwell architecture
– 2 CPUs (Intel Haswell E5-2660 v3 at 2.60 GHz), with 10 cores each
• MonteRay linked with LANL’s C++ Monte Carlo code MCATK
• MCATK uses MPI parallelism, building shared ray buffers using MPI-3 shared memory
• 3-D Cartesian Structured Mesh Geometry
• 2 tests measured the performance of the Next-event estimator
• 4 tests measured the performance of the Volumetric-ray-casting estimator
• Volumetric-ray-casting estimator performance on GPU compared to the Track-length estimator performance on the CPU
• Base performance measured as compared to 8 CPU cores
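The division of labor described here (random walk on the CPU, rays collected into buffers, ray-cast tallies on the GPU) follows a familiar producer and consumer pattern. The sketch below shows only that pattern in plain Python. `RayBuffer`, its methods, and the callback are hypothetical names, and nothing here reflects the actual MonteRay or MCATK interfaces or their MPI-3 shared-memory details.

```python
class RayBuffer:
    """Collect rays from the CPU random walk and hand them off in batches.

    process_batch stands in for whatever launches the ray-cast tally work
    (a GPU kernel in the real system); here it is any callable taking a list.
    """

    def __init__(self, capacity, process_batch):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._process_batch = process_batch
        self._rays = []

    def add(self, ray):
        """Store one ray; hand off the batch as soon as the buffer is full."""
        self._rays.append(ray)
        if len(self._rays) >= self.capacity:
            self.flush()

    def flush(self):
        """Hand off whatever is buffered, for example at the end of a cycle."""
        if self._rays:
            batch, self._rays = self._rays, []
            self._process_batch(batch)


# Seven rays with a capacity of three give two full batches and one partial batch.
batches = []
buffer = RayBuffer(3, batches.append)
for ray_id in range(7):
    buffer.add(ray_id)
buffer.flush()
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Batching lets the random walk keep running on the CPU while each full buffer is processed elsewhere, which is the general idea behind keeping the two stages separate.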
Testing the Next-Event Estimator on GPU Hardware: Two Radiography Tests
MonteRay – Medical X-Ray Imaging Simulation
• 50-keV X-ray beam
• 0.12 mm spot size
• Radiograph used Next-Event Estimator
• Simulation useful for designing a collimator to minimize scattered contribution
MonteRay – Medical X-Ray Imaging Simulation
• Source and Collided contribution calculated separately
• Source contribution relatively easy to calculate
• Collided contribution important for collimator design
• Collided performance 15-18x
[Figure annotations: 14.5x, 15.3x]
MonteRay – Industrial Radiography
• Simulated a physical test object used at Los Alamos’ Dual Axis Radiographic Hydrodynamic Test Facility
• Used 4-MeV mono-energetic X-ray beam
• 100 x 100 image grid (10,000 estimators) to simulate image detector
• Calculation of scatter component needed to design collimators and experiment, but too computationally expensive
I'm a peeping-tom techie with x-ray eyes – Patrick Lee MacDonald
MonteRay – Industrial Radiography
[Figure: GPU Performance vs Number of CPU Cores. y-axis: Relative Performance (ticks at 10 and 100); x-axis: Number of CPU Cores / GPU (0 to 20); series: Source, Collided; annotated values 28.5x and 24.2x]
Collided calculation performance 15-32x!
Volumetric-Ray-Casting Estimator on GPU Hardware vs Track-Length Estimator on CPU Hardware
Cancer Treatment Simulation
• 2-MeV photon beam (peak of 6-MV medical accelerator photon spectrum)
• 1-cm beam radius
What is the dose to healthy tissue?
[Figure: GPU Performance vs 8 CPU Cores. Labels: Tumor, 2-MeV Photon Beam]
14x performance improvement in healthy tissue
Cancer Treatment Simulation
[Figure: GPU Performance vs Number of CPU Cores in Healthy Tissue; annotated values 14.3x and 10.2x]
Performance is 14x vs 8 CPU cores or 10x vs 12 CPU cores
Pressurized Water Reactor Assembly Simulation
• 16x16 Fuel Assembly
• Performance 7.5x in the control rods, 5x in the fuel, and 4.5x in the coolant
[Figure: GPU Performance vs 8 CPU Cores. Labels: Control Rod, Fuel Pin]
Pressurized Water Reactor Assembly Simulation
[Figure: GPU Performance vs Number of CPU Cores; annotated values 7.2x, 5.4x, 6.0x, and 4.4x]
Compared to 8 CPU cores, performance is 7.2x in the control rod and 6.0x in the fuel
Criticality Accident Simulation
• Critical uranium sphere in the corner of a concrete room
• Concrete floor, walls, ceiling, and 4 concrete pillars
[Figure: GPU Performance vs 8 CPU Cores. Label: Uranium Sphere]
Performance increase of 14-16x in the center of the room
Criticality Accident Simulation – Smoother Fluence Estimate
[Figure panels: Track-Length Estimator; Volumetric-Ray-Casting Estimator]
Criticality Accident Simulation
[Figure: GPU Performance vs Number of CPU Cores; annotated values 15x and 10.5x]
Things are going great, and they’re only getting better – Patrick Lee MacDonald
Reflected Godiva Criticality Experiment Simulation
• U-235 sphere reflected by water
• Performance improvement
– 2.5x in the core
– 1.0x in the water
[Figure: GPU Performance vs 8 CPU Cores]
Reflected Godiva Criticality Experiment Simulation
• Variance of the Volumetric-Ray-Casting estimator approaches that of the Track-Length estimator in strongly scattering material.
[Figure: Variance Ratio vs Num. Collisions. y-axis: Variance Ratio ($\sigma_{TL}^{2} / \sigma_{VRC}^{2}$), 1 to 4.5; x-axis: Number of Samples per Collision (N), 1 to 20]
[Figure: GPU Performance vs. Num. CPU Cores; annotated values 2.2x and 2.2x]
Performance is limited by the estimator variance, not the GPU speed
Conclusions
• MonteRay provides a low-cost method of providing GPU accelerated Monte Carlo particle transport
– Can be incorporated into legacy codes at low cost.
– Works with standard variance reduction methods
• Performance improvements of MonteRay are significant:
– Up to 32 times for the Next-event estimator as compared to 8 CPU cores
– Up to 14 times for the Volumetric-ray-casting estimator as compared to the Track-Length estimator on 8 CPU cores
MonteRay provides a method of breaking through the barriers of limited resources and limited performance
Extra
Uncertainty - Pressurized Water Reactor Assembly Simulation
[Figure panels]
Volumetric-Ray-Casting Estimator: 600 sec., 8 CPU Cores and 1 GPU; 93 cycles, 40000 Particles/Cycle; 8 rays/collision
Track-Length Estimator: 600 sec., 8 CPU Cores; 124 cycles, 40000 Particles/Cycle