IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hanspeter Pfister, Harvard)
DESCRIPTION
See http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009
TRANSCRIPT
High-Throughput Scientific Computing
Hanspeter Pfister
Themes
• How is the brain wired?
• How did the Universe start?
How is the brain wired?
The Connectome Project
Connectome Team
• Harvard Center for Brain Science
  – Jeff Lichtman & Clay Reid
• Microsoft Research / UW
  – Michael Cohen
• Kitware Inc.
  – Will Schroeder, Charles Law, Rusty Blue
• VRVis Vienna
  – Markus Hadwiger, Johanna Beyer
• IIC
  – Amelio Vazquez, Eric Miller (Tufts)
  – Won-Ki Jeong, Hanspeter Pfister
The Scientific Challenge
composite from Roe et al. 1989, Sutton and Brunso-Bechtold 1991
Confocal Microscopy: Brainbow
Adapted from OlympusConfocal.com
Electron Microscopy: ATLUM
Serial Sectioning
[Figure: sections i = 1, …, N stacked along z, each imaged in the x-y plane. Adapted from http://parasol.tamu.edu, Texas A&M University]
40,000 x 40,000 pixels, 1.6 GB
120 x 120 µm (3 nm/pixel)
Here shown 40x undersampled
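The section numbers on this slide are self-consistent, which is easy to check. This short Python sketch assumes 1 byte (8-bit grayscale) per pixel, which the slide does not state, and uses decimal gigabytes.

```python
# Back-of-the-envelope check of one EM section.
# Assumption (not on the slide): 1 byte (8-bit grayscale) per pixel.
PIXELS_PER_SIDE = 40_000
PIXEL_SIZE_NM = 3
BYTES_PER_PIXEL = 1


def section_stats(pixels_per_side, pixel_size_nm, bytes_per_pixel):
    """Return (edge length in micrometers, size in decimal gigabytes) of one square section."""
    edge_um = pixels_per_side * pixel_size_nm / 1_000        # nm -> um
    size_gb = pixels_per_side ** 2 * bytes_per_pixel / 1e9   # bytes -> GB
    return edge_um, size_gb


edge_um, size_gb = section_stats(PIXELS_PER_SIDE, PIXEL_SIZE_NM, BYTES_PER_PIXEL)
print(f"{edge_um:.0f} x {edge_um:.0f} um, {size_gb:.1f} GB")  # 120 x 120 um, 1.6 GB
```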
[Figure: zoom series into the EM section, with fields of view of 15 µm, 8 µm, 3 µm, 1 µm, and 300 nm]
The Data Challenge
• 1 mm³ ≈ mouse thalamus ≈ 1 petabyte
• 1 cm³ ≈ mouse brain ≈ 1 exabyte
• 1000 cm³ ≈ human brain ≈ 1 zettabyte
All of Google’s worldwide storage today (2009) ≈ 1 exabyte
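Scaling the imaging parameters up to whole volumes reproduces these orders of magnitude. This Python sketch keeps the 3 nm pixel size from the section slide and assumes a hypothetical 30 nm section thickness and 1 byte per voxel; neither assumption comes from the slides, so only the order of magnitude is meaningful.

```python
# Order-of-magnitude estimate only. Assumptions (not from the slides):
# 3 nm x 3 nm pixels as on the section slide, a hypothetical 30 nm section
# thickness, and 1 byte per voxel.
PIXEL_NM = 3.0
SECTION_NM = 30.0    # hypothetical section thickness
BYTES_PER_VOXEL = 1
NM3_PER_MM3 = 1e18   # (1 mm = 1e6 nm) cubed


def em_bytes(volume_mm3):
    """Raw bytes needed to image `volume_mm3` cubic millimeters."""
    voxel_nm3 = PIXEL_NM * PIXEL_NM * SECTION_NM
    return volume_mm3 * NM3_PER_MM3 / voxel_nm3 * BYTES_PER_VOXEL


for label, mm3 in [("1 mm^3", 1.0), ("1 cm^3", 1e3), ("1000 cm^3", 1e6)]:
    print(f"{label:>9}: {em_bytes(mm3):.1e} bytes")
```

Under these assumptions 1 mm³ comes out at a few times 10^15 bytes, and each factor of 1000 in volume adds three orders of magnitude (petabyte, exabyte, zettabyte).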
Addressing the Data Challenge
• Multi-Scale Imaging
• Hierarchical Data Representation
• Distributed Heterogeneous Computing
• Visualization
• Segmentation
• Analysis
Addressing the Data Challenge: Visualization
MARKUS HADWIGER, VRVIS RESEARCH CENTER, VIENNA, AUSTRIA
Direct Volume Rendering
Ray Casting
• Image-order ray shooting
  – Interpolate
  – Assign color & opacity
  – Composite
• Simple to implement
• Very flexible (adaptive sampling, …)
• Correct perspective
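The "Composite" step is commonly implemented as front-to-back alpha compositing, which is also what makes early ray termination possible. Writing $C_{s,i}$ and $\alpha_{s,i}$ for the color and opacity that the transfer function assigns to the $i$-th sample along a ray, counted from the eye, one standard form of the recurrence is shown here; the slides do not say which compositing formula the implementation uses.

```latex
\begin{aligned}
C_{i}      &= C_{i-1} + (1 - \alpha_{i-1})\,\alpha_{s,i}\,C_{s,i},\\
\alpha_{i} &= \alpha_{i-1} + (1 - \alpha_{i-1})\,\alpha_{s,i},
\end{aligned}
\qquad C_{0} = 0,\quad \alpha_{0} = 0 .
```

After the last sample $n$, $C_n$ is the pixel color; because $\alpha_i$ only grows, the loop can stop as soon as $\alpha_i$ is close to 1.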
Transfer Functions
• Mapping of density to optical properties
• Simplest: color table with opacity over density
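A 1D transfer function can be as simple as a few control points that are linearly interpolated into a color-and-opacity table. In this Python sketch the control points are made-up example values and densities are assumed to be normalized to [0, 1].

```python
from bisect import bisect_right

# Hypothetical control points: (density, (r, g, b, alpha)), sorted by density.
# Low densities are fully transparent; opacity ramps up with density.
CONTROL_POINTS = [
    (0.0, (0.0, 0.0, 0.0, 0.0)),
    (0.3, (0.0, 0.0, 0.0, 0.0)),
    (0.5, (0.8, 0.6, 0.2, 0.1)),
    (1.0, (1.0, 1.0, 1.0, 0.9)),
]


def transfer_function(density):
    """Map a density to (r, g, b, alpha) by piecewise-linear table lookup."""
    xs = [x for x, _ in CONTROL_POINTS]
    d = min(max(density, xs[0]), xs[-1])        # clamp to the table range
    j = min(bisect_right(xs, d), len(xs) - 1)   # first control point right of d
    x0, c0 = CONTROL_POINTS[j - 1]
    x1, c1 = CONTROL_POINTS[j]
    t = 0.0 if x1 == x0 else (d - x0) / (x1 - x0)
    return tuple(a + t * (b - a) for a, b in zip(c0, c1))
```

On a GPU the same table would typically be baked into a small 1D texture so that the hardware does the interpolation.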
Connectome: EM Data
Single-Pass Ray Casting
• Enabled by conditional loops
• Substitute multiple passes with a single loop and early loop exit
• Volume rendering example in the NVIDIA CUDA SDK (procedural ray setup)
Basic Ray Setup / Termination
• Two main approaches:
  – Procedural ray/box intersection [Röttger et al., 2003], [Green, 2004]
  – Rasterize bounding box [Krüger and Westermann, 2003]
Procedural Ray Setup / Termination
• Procedural ray/box intersection
• Everything handled in the fragment shader
• Ray given by camera position and volume entry position
• Exit criterion needed
• Pro: simple and self-contained
• Con: full load on the fragment shader
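Procedural ray setup needs a ray/box intersection to find where each ray enters and leaves the volume. A common choice is the slab method, sketched here in Python for an axis-aligned box; in a shader or CUDA kernel the same arithmetic would run once per pixel. This is the textbook method, not necessarily the exact code in the SDK sample.

```python
import math


def intersect_box(origin, direction, box_min, box_max):
    """Slab-method ray/axis-aligned-box intersection.

    Returns (t_near, t_far) along origin + t * direction, or None if the ray
    misses the box. t_near is clamped to 0 so a camera inside the volume
    starts sampling at the eye.
    """
    t_near, t_far = -math.inf, math.inf
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            # Ray parallel to this pair of slabs: it must start between them.
            if o < lo or o > hi:
                return None
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
    if t_near > t_far or t_far < 0.0:
        return None  # missed, or the box lies entirely behind the origin
    return max(t_near, 0.0), t_far
```

The interval [t_near, t_far] gives both the entry point and the exit criterion the slide asks for.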
"Image-Based" Ray Setup / Termination
• Rasterize bounding box front faces and back faces
• Ray start positions: front faces
• Direction vectors: back faces − front faces
• Independent of projection (orthogonal/perspective)
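With image-based setup, the two rasterization passes leave a front-face position and a back-face position per pixel, and each ray follows from simple vector arithmetic. In this Python sketch, nested lists stand in for the two position images (an assumption for illustration), with None where the bounding box does not cover a pixel.

```python
import math


def setup_rays(front_image, back_image):
    """Per-pixel ray setup from rasterized bounding-box faces.

    front_image[y][x] / back_image[y][x] hold the volume-space (x, y, z)
    position on the box's front / back face for that pixel, or None if the
    box does not cover it. Returns (start, unit_direction, length) per pixel,
    or None for uncovered pixels.
    """
    rays = []
    for front_row, back_row in zip(front_image, back_image):
        row = []
        for front, back in zip(front_row, back_row):
            if front is None or back is None:
                row.append(None)          # pixel misses the volume
                continue
            delta = [b - f for f, b in zip(front, back)]   # back - front
            length = math.sqrt(sum(c * c for c in delta))
            if length == 0.0:
                row.append(None)          # grazing hit: nothing to sample
                continue
            direction = tuple(c / length for c in delta)
            row.append((tuple(front), direction, length))
        rays.append(row)
    return rays
```

Because starts and directions come from positions rather than from camera parameters, the same code serves orthogonal and perspective projections.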
Kernel
• Image-based ray setup
  – Ray start image
  – Direction image
• Ray-cast loop
  – Sample volume
  – Accumulate color and opacity
• Terminate
• Store output
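The kernel structure maps onto a per-pixel loop. This Python sketch runs on the CPU, where a CUDA version would launch one thread per pixel. `sample_density`, `transfer_function`, the step size, and the 0.95 opacity threshold are hypothetical caller-supplied choices, each ray is assumed to arrive as a (start, unit direction, length) tuple, and per-sample opacity is assumed to be already scaled for the step size. The loop also shows early ray termination.

```python
def cast_ray(start, direction, length, sample_density, transfer_function,
             step=0.01, opacity_threshold=0.95):
    """March one ray front to back and return its composited (r, g, b, alpha)."""
    color = [0.0, 0.0, 0.0]
    alpha = 0.0
    t = 0.0
    while t <= length:
        # Sample the volume at the current position along the ray.
        p = tuple(s + t * d for s, d in zip(start, direction))
        r, g, b, a = transfer_function(sample_density(p))
        # Accumulate color and opacity (front-to-back compositing).
        weight = (1.0 - alpha) * a
        color[0] += weight * r
        color[1] += weight * g
        color[2] += weight * b
        alpha += weight
        # Terminate early once the ray is (nearly) opaque.
        if alpha >= opacity_threshold:
            break
        t += step
    return (color[0], color[1], color[2], alpha)


def render(rays, sample_density, transfer_function, **kwargs):
    """Store one output pixel per ray; pixels that miss the volume stay transparent black."""
    return [[(0.0, 0.0, 0.0, 0.0) if ray is None
             else cast_ray(*ray, sample_density, transfer_function, **kwargs)
             for ray in row]
            for row in rays]
```

With a constant sample opacity of 0.5, accumulated opacity goes 0.5, 0.75, 0.875, 0.9375, 0.96875, so the loop stops after five samples instead of marching the whole ray.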
Standard Ray Casting Optimizations (1)
Early ray termination
• Isosurfaces: stop when the surface is hit
• Direct volume rendering: stop when opacity >= threshold
• Several possibilities
• Current GPUs: early loop exit works well
Standard Ray Casting Optimizations (2)
Empty space skipping
• Skip transparent samples
• Depends on transfer function
• Start casting close to first hit
• Several possibilities
  – Per-sample check of opacity (expensive)
  – Hierarchical data store (e.g., octree with stack-less traversal [Gobbetti et al., 2008])
• These are image-order: what about object-order?
Object-Order Empty Space Skipping (1)
• Modify initial rasterization step: instead of rasterizing the bounding box, rasterize "tight" bounding geometry
Object-Order Empty Space Skipping (2)
• Store min-max values of volume blocks
• Cull blocks against transfer function or isovalue
• Rasterize front and back faces of active blocks
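Block culling only needs each block's density range. This Python sketch assumes integer densities (e.g., 8-bit values) that index directly into an opacity table; the block size, table, and function names are illustrative. A prefix sum over the opacity table makes the transfer-function test constant-time per block, a common trick that the slides do not specifically mention.

```python
def block_min_max(volume, block):
    """Min/max density of each block x block x block brick of volume[z][y][x]."""
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    ranges = {}
    for bz in range(0, nz, block):
        for by in range(0, ny, block):
            for bx in range(0, nx, block):
                values = [volume[z][y][x]
                          for z in range(bz, min(bz + block, nz))
                          for y in range(by, min(by + block, ny))
                          for x in range(bx, min(bx + block, nx))]
                ranges[(bz // block, by // block, bx // block)] = (min(values), max(values))
    return ranges


def active_blocks(ranges, opacity_table):
    """Keep blocks whose [min, max] density range maps to some nonzero opacity."""
    # prefix[k] = sum of opacity_table[0:k]; opacities are non-negative, so
    # some opacity in [lo, hi] is nonzero iff prefix[hi + 1] - prefix[lo] > 0.
    prefix = [0.0]
    for a in opacity_table:
        prefix.append(prefix[-1] + a)
    return sorted(key for key, (lo, hi) in ranges.items()
                  if prefix[hi + 1] - prefix[lo] > 0.0)


def blocks_containing_isovalue(ranges, iso):
    """For isosurfaces: keep blocks whose range straddles the isovalue."""
    return sorted(key for key, (lo, hi) in ranges.items() if lo <= iso <= hi)
```

The min-max table is computed once per volume; only the cheap culling step reruns when the transfer function or isovalue changes. The surviving block indices are the ones whose front and back faces would be rasterized.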
Connectome: Fluorescence Data
Connectome: Implicit Surfaces
Addressing the Data Challenge: Segmentation
Active Ribbons
Active Ribbon: a set of two non-intersecting, coupled active contours
Active Contour: a deformable closed curve that can be used to segment objects in an image
[Figure: an inner active contour and an outer active contour forming an active ribbon]
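The slides do not give the Active Ribbon equations. For orientation only, here is a generic discrete active contour (snake) update in Python. Each vertex of a closed polygon moves under an elastic internal force (a discrete Laplacian) plus an optional external image force. The coupling that keeps a ribbon's inner and outer contours apart is not shown, and `alpha` and `external_force` are illustrative parameters.

```python
import math


def snake_step(points, alpha=0.2, external_force=None):
    """One explicit update of a discrete closed active contour.

    points: list of (x, y) vertices of a closed curve.
    alpha: weight of the internal (elastic) force; keep alpha <= 0.5 for a
           stable explicit step.
    external_force: optional callable (x, y) -> (fx, fy), e.g. derived from
           image gradients (hypothetical; not specified by the slides).
    """
    n = len(points)
    new_points = []
    for i, (x, y) in enumerate(points):
        px, py = points[i - 1]           # previous vertex (wraps around)
        nx, ny = points[(i + 1) % n]     # next vertex (wraps around)
        # Discrete Laplacian pulls each vertex toward its neighbors' midpoint.
        fx = alpha * (px - 2 * x + nx)
        fy = alpha * (py - 2 * y + ny)
        if external_force is not None:
            ex, ey = external_force(x, y)
            fx, fy = fx + ex, fy + ey
        new_points.append((x + fx, y + fy))
    return new_points


def perimeter(points):
    """Length of the closed polygon."""
    return sum(math.dist(points[i - 1], points[i]) for i in range(len(points)))
```

Without an external force the contour only shrinks and smooths; in a real segmentation the image term is what stops it on object boundaries.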
Results (Matlab)
Axon Segmentation
Interactive Analysis
How did the Universe start?
The MWA Project
Kevin Dale, Richard Edgar, Daniel Mitchell, Randall Wayth, Lincoln Greenhill, and Hanspeter Pfister
MWA CfA / IIC Team
• Harvard Center for Astrophysics / Smithsonian Astrophysical Observatory
  – Lincoln Greenhill
  – Daniel Mitchell
  – Randall Wayth
  – Stephen Ord
• IIC / SEAS
  – Richard Edgar
  – Kevin Dale, Hanspeter Pfister
The Scientific Goals
• Epoch of Reionisation (EoR)
• Heliospheric and Ionospheric
• Transient detection
• Pulsars, Surveys, Interstellar Medium, Galactic Magnetic Field, …
[Figure: cosmic timeline from ionized, to neutral hydrogen (H), to ionized again, with an interval marked "The Gap"]
The Murchison Widefield Array (MWA)
• Located in the remote Australian outback
• Extremely wide fields of view for radio astronomy in the 80-300 MHz band
• 512 tiles, each a 4x4 array of dipoles, scattered over ~ 1.5 km
• Data center for real-time processing co-located with the array
http://www.haystack.mit.edu/ast/arrays/mwa/index.html
© Murchison Wide-field Array Project (MIT/Harvard/Smithsonian/ANU/Curtin U./U.Melb./UWA/CSIRO)
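In an interferometer such as the MWA, the correlator combines the signals of every pair of tiles; each pair is a baseline, and its correlated output is a visibility. Assuming every tile pair is correlated, the number of baselines grows quadratically with the number of tiles, which is one reason the raw data rate is so high:

```python
def num_baselines(n_tiles):
    """Distinct unordered tile pairs, assuming every pair is correlated."""
    return n_tiles * (n_tiles - 1) // 2


print(num_baselines(512))  # 130816
```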
The Data Rate Challenge
[Figure: real-time processing pipeline. Input arrives at 16 GB/s on a 0.5 s cadence and is averaged (!) down to O(1) GB/s on an 8 s cadence. A very parallel imaging stage (gridding of ungridded visibilities with bright sources peeled, FFT) and an entangled calibration loop (ionospheric offsets, vector rotation) feed mapping and science.]
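One reading of the diagram is that averaging 16 consecutive 0.5 s dumps into one 8 s dump is what brings 16 GB/s down to about 1 GB/s, since 8 / 0.5 = 16. This Python sketch shows plain time averaging of complex visibilities under that reading; the data layout (one list of visibilities per dump) is a hypothetical choice.

```python
INPUT_CADENCE_S = 0.5    # input dump interval (from the diagram)
OUTPUT_CADENCE_S = 8.0   # averaged dump interval (from the diagram)
FACTOR = round(OUTPUT_CADENCE_S / INPUT_CADENCE_S)   # dumps averaged together


def average_visibilities(dumps, factor):
    """Average consecutive dumps in groups of `factor`.

    dumps: list of equal-length lists of complex visibilities, one list per
    dump (hypothetical layout). A trailing partial group is dropped.
    """
    averaged = []
    for start in range(0, len(dumps) - factor + 1, factor):
        group = dumps[start:start + factor]
        # zip(*group) walks the same visibility across all dumps in the group.
        averaged.append([sum(values) / factor for values in zip(*group)])
    return averaged


print(FACTOR, 16.0 / FACTOR)  # 16 1.0  (reduction factor, GB/s after averaging)
```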
Implementation
• Hardware
  – 2.4 GHz dual-core AMD Opteron, 4 GB RAM
  – NVIDIA Quadro FX 5600
• Software
  – AMD Core Math Library (ACML)
  – NVIDIA CUDA (CUBLAS, CUFFT)
  – OpenGL
Single-GPU Speedup
[Figure: bar chart of GPU speedup over the CPU (axis 0 to 70x) per pipeline stage. Calibration loop: RotateAndAccumulateVisibilities, MeasureIonosphericOffset, MeasureTileResponse, ReRotateVisibilities, PeelTileResponse, UnpeelTileResponse. Image formation: Gridding*, Imaging. *Mostly OpenGL.]
Example Results
[Figure: GPU result vs. reference result]
• Noisy images from test data
Scaling to a Cluster
• 1000 frequency channels, 65 sources every 8 seconds, and a 1600² output image
• 20-40 frequencies per GPU
• 32-64 GPUs, i.e., up to 16 Tesla S1070s (4 GPUs each)
• Need MPI for internal data transfer
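The plan assigns 20-40 frequency channels to each GPU, so a simple static split of channels across GPUs is a natural starting point. This Python sketch divides the 1000 channels into contiguous, near-equal chunks and counts Tesla S1070 units at four GPUs each; how the actual system assigned channels, and the MPI transfers between nodes, are not shown.

```python
import math

GPUS_PER_S1070 = 4   # a Tesla S1070 1U system contains four GPUs
N_CHANNELS = 1000    # from the slide


def partition_channels(n_channels, n_gpus):
    """Split channels 0..n_channels-1 into n_gpus contiguous, near-equal ranges."""
    base, extra = divmod(n_channels, n_gpus)
    chunks, start = [], 0
    for gpu in range(n_gpus):
        size = base + (1 if gpu < extra else 0)   # first `extra` GPUs take one more
        chunks.append(range(start, start + size))
        start += size
    return chunks


for per_gpu in (20, 40):   # the slide's 20-40 channels per GPU
    n_gpus = math.ceil(N_CHANNELS / per_gpu)
    chunks = partition_channels(N_CHANNELS, n_gpus)
    units = math.ceil(n_gpus / GPUS_PER_S1070)
    print(f"{per_gpu} channels/GPU: {n_gpus} GPUs in {units} S1070 units, "
          f"largest chunk {max(len(c) for c in chunks)}")
```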
Conclusions
• GPUs enable high-throughput scientific computing
• Performance gains of 10-100x
• CUDA makes life easier (but not perfect)
• Rasterization / OpenGL still useful
• Need CUDA MPI for clusters