experiences accelerating matlab systems biology applications heart wall tracking lukasz szafaryn,...
TRANSCRIPT
Experiences Accelerating MATLAB Systems Biology
Applications
Heart Wall Tracking
Lukasz Szafaryn, Kevin SkadronUniversity of Virginia
2
Outline
• MATLAB• Optimizations to MATLAB• GPU Acceleration with CUDA• Applications (Heart Wall Tracking and Myocyte Simulation)
– Problem– Algorithm– Optimization and performance– Lessons
• Conclusions • Future Research
MATLAB
• Convenient but inefficient programming language of choice for scientists- Interpreted language- Most of the existing code and libraries are single-threaded
• MATLAB Parallel Toolbox - understanding of parallel programming
• Jacket and GPUmat - large parallelism to justify overhead
3
MATLAB contd.
• Interpreted language optimized by JIT compiler – 2x slower than C
• MATLAB Embedded Compiler has limited support – 1.2-1.4x slower than C
• MEX Interface to link C code- translating to C - many functions written from scratch- no support for convenient OpenMP standard, need to use
thread libraries
4
5
Acceleration
1. Translation: - convert MATLAB to C
2. Parallelization:– C for multi-core CPU– CUDA for GPU
Experimental Setup
– CPU: 3.2 GHz quad-core Intel Core 2 Extreme– GPU: NVIDIA GeForce GTX 280 (PCIe 2.0)– MS Windows, MS C Compiler
6
Allocate GPU memory
Transfer inputs
Launch kernel
Return to CPU
Transfer results
Free GPU memory
C Program
CUDA Kernel
CPU GPU
Acceleration with GPU (CUDA)
Entire Heart Wall Application
read first frame from input file, display
image
[0.533 s] [0.28 %]
crop image, display image
[0.089 s] [0.05 %]
SRAD, display image
[3.224 s] [1.72 %]
detect edges, display image
[0.448 s] [0.24 %]
morphological transformation, display image
[0.275 s] [0.15 %]
dilate image, display image
[0.285 s] [0.15 %]
inner and outer ellipse parameter
setup
[0.001 s] [0.00 %]
create ellipse sample points
[0.726 s] [0.39 %]
track movement of sample points in all
frames
[129.276 s] [68.99 %]
display movement of sample points
through frames
[36.63 s] [19.55 %]
save outputs into file
[0.035 s] [0.02 %]
Hough Search, display images
[15.872 s] [8.47 %]
8
Heart Wall TrackingDescription
• Speed and shape of contractions provides important information about body’s response to stimulus
• Measured by tracking inner and outer heart walls through multiple frames
Input OutputTracking
9
Heart Wall TrackingAlgorithm
• Processing 20 inner and 30 outer heart wall points, total 50 points (TLP)
• Processing of each point - sequence of operations on the surrounding area and template (DLP)
Update templates
Read next frame
Track inner point
Track outer point
Save point locations
20
30
10 # of frames / 10
…
time
task-level parallelism (TLP)
1 2 3 4 50
data-level parallelism (DLP)
10
Heart Wall TrackingPerformance
• Times reported for processing of 300 frames (10s of ultrasound recording)
1.21x
2.71x
10.86x
2.37x
3.39x
5.94x 6.29x
33.20x 36.89x 41.23x
Porting Tradeoffs
Convenience Performance
Developing modular code
• replacing each MATLAB function with equivalent parallelized GPU function
• common routines can be possibly reused in other applications
Hiding specific aspects of GPU programming inside each module
• each module sets up its own GPU execution parameters
• each module performs its own I/O with GPU transparently
Restructuring and combining code
•writing code based on algorithm tasks rather than MATLAB statements
•overlap parallel (often unrelated) tasks by executing them in the same kernel call
Exposing specific aspects of GPU programming
•doing more work inside each GPU kernel call to fully exploit parallelism and avoid overhead
•performing I/O with GPU manually to eliminate redundant data transfers
multiple
add,sub, mul, div
---------- SYNC ----------
Reduction
---------- SYNC ----------
multiple
add,sub, mul, div
---------- SYNC ----------
convolution
multiple
add,sub, mul, div
---------- SYNC ----------
Reduction
---------- SYNC ----------
multiple
add,sub, mul, div
---------- SYNC ----------
convolution
multiple
add,sub, mul, div
---------- SYNC ----------
Reduction
---------- SYNC ----------
multiple
add,sub, mul, div
---------- SYNC ----------
convolution
multiple
add,sub, mul, div
---------- SYNC ----------
Reduction
---------- SYNC ----------
multiple
add,sub, mul, div
---------- SYNC ----------
convolution
multiple
add,sub, mul, div
---------- SYNC ----------
Reduction
---------- SYNC ----------
multiple
add,sub, mul, div
---------- SYNC ----------
convolution
multiple
add,sub, mul, div
---------- SYNC ----------
Reduction
---------- SYNC ----------
multiple
add,sub, mul, div
---------- SYNC ----------
convolution
multiple
add,sub, mul, div
---------- SYNC ----------
Reduction
---------- SYNC ----------
multiple
add,sub, mul, div
---------- SYNC ----------
convolution
tim eSMs
outer heart points
(30)
inner heart points
(20)
…
…
1 2 3 30…
global SYNC
Frame 1
Frame 2
…
13
Heart Wall TrackingLessons
• Typical MATLAB code written by a scientist has room for optimization – 1.3x
• Conversion to C requires significant coding effort• Selective offloading results in multiple CPU-GPU data
transfer overheads• Iterative codes require merging kernels and reusing
variables to avoid overhead• CUDA libraries cannot be used as a part of GPU code• Good performance - significant changes to the structure of
code, difficult for a scientist to understand
14
Conclusions
• Limited availability of C libraries necessitates time consuming coding
• Many systems biology applications (even those with limited parallelism) benefit from GPU
• GPU overheads are significant (should be eliminated in new CPU-GPU architectures)
• Real-time processing feasible in near future• Ultimately, acceleration of applications should be
automated!
15
Future Research
• Automatic acceleration with the use of compiler - via use of architecture-specific libraries- via compiling for target architecture
• Merging of workloads- based on resource needs- based on dependency
• Acceleration with alternative architectures- well suited for fine-grained parallelism
- esp. FPGA
16
Acknowledgements
• Funding provided by:– NSF grant IIS-0612049– SRC grant 1607.001
• Equipment donated by NVIDIA
17
Software
Source code will be soon available at:http://lava.cs.virginia.edu/wiki/rodinia
Questions
18
Backup
19