portable performance for monte carlo simulation of photon … · 2016. 3. 31. · monte carlo...
TRANSCRIPT
![Page 1: Portable Performance for Monte Carlo Simulation of photon … · 2016. 3. 31. · Monte Carlo Extreme (MCX) in CUDA •Estimates the 3D light (fluence) distribution by simulating](https://reader034.vdocuments.us/reader034/viewer/2022052003/60166d793828566b7337074f/html5/thumbnails/1.jpg)
PORTABLE PERFORMANCE FOR MONTE
CARLO SIMULATION OF PHOTON
MIGRATION IN 3D TURBID MEDIA FOR
SINGLE AND MULTIPLE GPUS
Fanny Nina-Paravecino
Leiming Yu Qianqian Fang* David
Kaeli
Department of Electrical and Computer Engineering
Department of Bioengineering*
Northeastern University
Boston, MA
![Page 2: Portable Performance for Monte Carlo Simulation of photon … · 2016. 3. 31. · Monte Carlo Extreme (MCX) in CUDA •Estimates the 3D light (fluence) distribution by simulating](https://reader034.vdocuments.us/reader034/viewer/2022052003/60166d793828566b7337074f/html5/thumbnails/2.jpg)
Outline
• Portable Performance Monte Carlo Extreme
(MCX)
• MCX in CUDA
• Persistent Threads in MCX
• Portable Performance MCX
• MCX on multiple GPUs
• Linear Performance
• Linear Programming Model
• Performance Results
![Page 3: Portable Performance for Monte Carlo Simulation of photon … · 2016. 3. 31. · Monte Carlo Extreme (MCX) in CUDA •Estimates the 3D light (fluence) distribution by simulating](https://reader034.vdocuments.us/reader034/viewer/2022052003/60166d793828566b7337074f/html5/thumbnails/3.jpg)
PORTABLE PERFORMANCE MCX
Photons initialization
3D
voxelated
media
![Page 4: Portable Performance for Monte Carlo Simulation of photon … · 2016. 3. 31. · Monte Carlo Extreme (MCX) in CUDA •Estimates the 3D light (fluence) distribution by simulating](https://reader034.vdocuments.us/reader034/viewer/2022052003/60166d793828566b7337074f/html5/thumbnails/4.jpg)
Monte Carlo Extreme (MCX) in CUDA
• Estimates the 3D light (fluence) distribution by
simulating a large number of independent photons
• Most accurate algorithm for a wide ranges of optical
properties, including low-scattering/voids, high
absorption and short source-detector separation
• Computationally intensive, so a great target for GPU
acceleration
• Widely adopted for bio-optical imaging applications:
• Optical brain functional imaging
• Fluorescence imaging of small animals for drug development
• Gold stand for validating new optical imaging instrumentation
designs and algorithms
![Page 5: Portable Performance for Monte Carlo Simulation of photon … · 2016. 3. 31. · Monte Carlo Extreme (MCX) in CUDA •Estimates the 3D light (fluence) distribution by simulating](https://reader034.vdocuments.us/reader034/viewer/2022052003/60166d793828566b7337074f/html5/thumbnails/5.jpg)
MCX in CUDA
Simulation of photon
transport inside
human brain
Imaging of bone
marrow in the tibia
Imaging of a complex mouse model
using Monte Carlo simulations
![Page 6: Portable Performance for Monte Carlo Simulation of photon … · 2016. 3. 31. · Monte Carlo Extreme (MCX) in CUDA •Estimates the 3D light (fluence) distribution by simulating](https://reader034.vdocuments.us/reader034/viewer/2022052003/60166d793828566b7337074f/html5/thumbnails/6.jpg)
MCX in CUDA [1]
Thread i Thread i+1
…
Launch a photon
Compute the
scattering length
Move photo one
voxel
Compute attenuation
based on absorption
Accumu. Probability
to the volume
Scattering
ends ?
Total move
or photon #
reached?
Terminate
thread
Exceeds
time gate?
Compute a
scattering
direction
vector
Global
Memor
y
y
y
n
n
y n
Seed GPU RNG
with CPU RNG
Repetition
complete?
Retrieve solution
Normalize & save
solution
CPU GPU
Start
End of
simulatio
n
Loop of repetitions
[1] Q. Fang and D. A. Boas. "Monte Carlo simulation of photon migration in 3D turbid media accelerated by graphics processing units." Optics exp ress
17.22 (2009): 20178-20190.
![Page 7: Portable Performance for Monte Carlo Simulation of photon … · 2016. 3. 31. · Monte Carlo Extreme (MCX) in CUDA •Estimates the 3D light (fluence) distribution by simulating](https://reader034.vdocuments.us/reader034/viewer/2022052003/60166d793828566b7337074f/html5/thumbnails/7.jpg)
Persistent Threads (PT) in MCX
• PT kernels alter the notion of a virtual thread lifetime,
treating those threads as physical hardware threads
• PT kernels provide a view that threads are active for
the entire duration of the kernel
• We schedule only as many threads as the GPU SMs can
concurrently run
• The threads remain active until end of kernel execution Thread
Block
Grid
…
CUDA Grid
Structure
![Page 8: Portable Performance for Monte Carlo Simulation of photon … · 2016. 3. 31. · Monte Carlo Extreme (MCX) in CUDA •Estimates the 3D light (fluence) distribution by simulating](https://reader034.vdocuments.us/reader034/viewer/2022052003/60166d793828566b7337074f/html5/thumbnails/8.jpg)
Persistent Threads (PT) in MCX
• A PT kernel bypasses the hardware scheduler,
relying on a work queue to schedule blocks
• A PT kernel checks the queue for more work and
continues doing so until no work is left
• PT MCX works on a FIFO blocking queue
Enqueue
Back Front
Queue
Shared
Multiprocessor
Blo
cks
![Page 9: Portable Performance for Monte Carlo Simulation of photon … · 2016. 3. 31. · Monte Carlo Extreme (MCX) in CUDA •Estimates the 3D light (fluence) distribution by simulating](https://reader034.vdocuments.us/reader034/viewer/2022052003/60166d793828566b7337074f/html5/thumbnails/9.jpg)
Portable Performance for MCX •
Fermi Kepler Maxwell
MaxThreadBlocks
/Multiprocessor 8 16 32
MaxThreads/Multi
processor 1536 2048 2058
Multiprocessors
(MP) 16 14 22
CUDA cores / MP 32 192 128
# threadsPerBlock =
(MaxThread/MP)/(MaxThreadBlocks/MP)
# blocks = # threadsPerBlock * (MaxThreadBlocks/MP) *
MP
![Page 10: Portable Performance for Monte Carlo Simulation of photon … · 2016. 3. 31. · Monte Carlo Extreme (MCX) in CUDA •Estimates the 3D light (fluence) distribution by simulating](https://reader034.vdocuments.us/reader034/viewer/2022052003/60166d793828566b7337074f/html5/thumbnails/10.jpg)
Portable Performance MCX - Results
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
1.1
1.2
1.3
Kepler GK110 Maxwell 980Ti
Spee
dup
Baseline Code Improved
Kepler GK110 Baseline Improved Code
ThreadsPerBlock 32 128
# Total Threads 86,016 28,672
# Blocks 2688 224
Performance
(Photons/ms) 2383 2887
Speedup 1.0 1.21
Maxwell 980Ti Baseline Improved Code
ThreadsPerBlock 32 128
# Total Threads 90,112 45,056
# Blocks 2816 352
Performance
(Photons/ms) 13,369 15,015
Speedup 1.0 1.12
![Page 11: Portable Performance for Monte Carlo Simulation of photon … · 2016. 3. 31. · Monte Carlo Extreme (MCX) in CUDA •Estimates the 3D light (fluence) distribution by simulating](https://reader034.vdocuments.us/reader034/viewer/2022052003/60166d793828566b7337074f/html5/thumbnails/11.jpg)
MCX ON MULTIPLE GPUS
![Page 12: Portable Performance for Monte Carlo Simulation of photon … · 2016. 3. 31. · Monte Carlo Extreme (MCX) in CUDA •Estimates the 3D light (fluence) distribution by simulating](https://reader034.vdocuments.us/reader034/viewer/2022052003/60166d793828566b7337074f/html5/thumbnails/12.jpg)
Linear Programming Model
• Given n devices: D1, D2, … Dn
• Given linear performance for each device
• Given the performance for 10 Million photons and 100
Millions for each device
• We can obtain the linear equation for each device as
follow:
y1 = b1 + (x1 -1)a1 +C1Device 1 f1:
y2 = b2 + (x2 -1)a2 +C2Device 2 f2:
yn = bn + (xn -1)an +CnDevice n f3:
.
.
.
. .
.
![Page 13: Portable Performance for Monte Carlo Simulation of photon … · 2016. 3. 31. · Monte Carlo Extreme (MCX) in CUDA •Estimates the 3D light (fluence) distribution by simulating](https://reader034.vdocuments.us/reader034/viewer/2022052003/60166d793828566b7337074f/html5/thumbnails/13.jpg)
Performance Results
• We evaluated our Linear Programming on Linear
Model (LPLM) scheme for two different
configurations of NVIDIA devices
• The resulting partition of the workload achieves an
average 8% speedup over the baseline
1750
1800
1850
1900
1950
2000
2050
2100
10M 50M 100M 10M 50M 100M
GTX980+GT730 GTX980+GT730+GTX580
Ph
oto
s/m
s
# Photons
Baseline LPLM
![Page 14: Portable Performance for Monte Carlo Simulation of photon … · 2016. 3. 31. · Monte Carlo Extreme (MCX) in CUDA •Estimates the 3D light (fluence) distribution by simulating](https://reader034.vdocuments.us/reader034/viewer/2022052003/60166d793828566b7337074f/html5/thumbnails/14.jpg)
Summary
• We have improved the performance of MCX across a
range of NVIDIA GPU architectures
• We have showed how to exploit Persistent Thread
kernel to automatically tune MCX kernel
• We developed a linear programming model to find the
best partition to run MCX on multiple GPUs
• We improved performance of MCX run on multiple
NVIDIA GPUs, including Kepler and Maxwell
• We obtained an 8% speedup when using automatic
partitioning
![Page 15: Portable Performance for Monte Carlo Simulation of photon … · 2016. 3. 31. · Monte Carlo Extreme (MCX) in CUDA •Estimates the 3D light (fluence) distribution by simulating](https://reader034.vdocuments.us/reader034/viewer/2022052003/60166d793828566b7337074f/html5/thumbnails/15.jpg)
Future Work
• PT MCX
• The queue of blocks can either can be static (know at compile time)
or dynamic (generated at runtime), and can be used to control the
order, location, and the timing of each block
• Instrumentation of MCX
• Leverage SASSI to instrument MCX and better characterize the
behavior of a kernel to guide auto-tuning
• MCX on Multiple GPUs
• Evaluate our partitioning optimization for multiple devices
![Page 16: Portable Performance for Monte Carlo Simulation of photon … · 2016. 3. 31. · Monte Carlo Extreme (MCX) in CUDA •Estimates the 3D light (fluence) distribution by simulating](https://reader034.vdocuments.us/reader034/viewer/2022052003/60166d793828566b7337074f/html5/thumbnails/16.jpg)
THANK YOU!
Questions?