Image Reconstruction on Multicore Processors
Graduate Students: Eric Fontaine and Viraj Paropkari
Faculty Members: Ada Gavrilovska and Hsien-Hsin S. Lee
Agenda
• Background
• FDK algorithm
  – Overview
  – Parallelization Method
  – Current Results
• Katsevich Algorithm
  – Overview
  – Parallelization Method
  – Current Results
• Future Plans
Background
• Use 3-D CT scan to identify tumors and other defects inside the body.
• Two common methods
  – MRI
    • Complex math and physics
    • Main function: simple IFFT
  – Filtered back-projection
• Two common filtered back-projection algorithms
  – FDK
    • Approximation, fast
    • Uses projections taken on a circular path surrounding the object
    • More accurate on the plane containing the circle
  – Katsevich
    • More accurate, but also more compute-intensive
    • Uses projections taken on a helical path surrounding the object
    • Can reconstruct long objects, unlike the original FDK
• Both offer a large amount of data parallelism
FDK Algorithm Overview
• Cone beam image reconstruction with source on a helix for a flat detector
• Reconstruction for 3-D volume
• Initialize the helix source parameters
• Compute/load cone beam data
• Length correction weighting
• 1-D horizontal filtering
• Linear pre-interpolation
• Back projection
• Compare results with standard phantom
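The length correction weighting step can be illustrated with the cosine weight commonly used in FDK-type methods for a flat detector. This is a minimal sketch, not the authors' code. It assumes a source-to-detector distance `source_to_detector` and detector coordinates `(u, v)` measured from the detector center. The function names and parameters are illustrative and do not come from the slides.

```python
import math

def length_correction_weights(n_rows, n_cols, pixel_size, source_to_detector):
    """Return the weight D / sqrt(D^2 + u^2 + v^2) for every detector pixel."""
    d = source_to_detector
    weights = []
    for r in range(n_rows):
        # v: vertical offset of this row from the detector center (assumed geometry)
        v = (r - (n_rows - 1) / 2.0) * pixel_size
        row = []
        for c in range(n_cols):
            # u: horizontal offset of this column from the detector center
            u = (c - (n_cols - 1) / 2.0) * pixel_size
            row.append(d / math.sqrt(d * d + u * u + v * v))
        weights.append(row)
    return weights

def apply_weights(projection, weights):
    """Multiply a 2-D projection (list of rows) element-wise by the weights."""
    return [[p * w for p, w in zip(prow, wrow)]
            for prow, wrow in zip(projection, weights)]
```

The weights depend only on the detector geometry, so under these assumptions they can be computed once and reused for every projection.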
Parallelization Strategy
• Based on FDK algorithm for general scanning paths like helix.*
  – Each thread is assigned a subset of the total number of projections, and performs length correction weighting, filtering and back-projection of its assigned projections.
  – After all threads are done, an implicit barrier synchronizes them. Then each thread is assigned a subset of the total volume to reconstruct.
  – We use OpenMP
    • Reconstruct subsets of the total volume in parallel (to fit into individual cache)
    • Piece the image together at the end (reduced inter-core communication)
[Figure: flow diagram. Assign Projections; Length correction weighting, filtering, back-projection (one box per thread); barrier; Reconstructed Image]
*Ge Wang, Tein-Hsiang Lin, Ping-chin Cheng, and Douglas M. Shinozaki. A general cone-beam reconstruction algorithm. IEEE Trans. on Medical Imaging, 12(3):486-496, September 1993.
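The two-phase structure described above (split the projections, synchronize, split the volume, then piece the result together) can be sketched in Python. The slides use OpenMP. Here `concurrent.futures` stands in for it purely to show the structure, and CPython threads will not give the reported speedups on CPU-bound work. `preprocess` and `reconstruct_slab` are hypothetical placeholders for the real weighting, filtering and back-projection code.

```python
from concurrent.futures import ThreadPoolExecutor

def split_range(n_items, n_workers):
    """Split range(n_items) into n_workers contiguous, nearly equal chunks."""
    base, extra = divmod(n_items, n_workers)
    chunks, start = [], 0
    for w in range(n_workers):
        stop = start + base + (1 if w < extra else 0)
        chunks.append(range(start, stop))
        start = stop
    return chunks

def preprocess(projection):
    # Placeholder for weighting and filtering of one projection.
    return [2.0 * sample for sample in projection]

def reconstruct_slab(z_indices, filtered):
    # Placeholder back-projection: each slice accumulates one sample
    # from every filtered projection.
    return [sum(p[z % len(p)] for p in filtered) for z in z_indices]

def run(projections, n_slices, n_workers=2):
    proj_chunks = split_range(len(projections), n_workers)
    slab_chunks = split_range(n_slices, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Phase 1: each worker handles its own subset of the projections.
        parts = pool.map(
            lambda idx: [preprocess(projections[i]) for i in idx], proj_chunks)
        # Consuming the results waits for every worker: this acts as the barrier.
        filtered = [p for part in parts for p in part]
        # Phase 2: each worker reconstructs its own slab of the volume.
        slabs = pool.map(lambda idx: reconstruct_slab(idx, filtered), slab_chunks)
        # Piece the volume together at the end.
        return [value for slab in slabs for value in slab]
```

Because each worker writes only to its own slab in the second phase, no locking is needed while the volume is being reconstructed.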
Single and Dual-Thread Performance

Performance (Seconds)
Volume           16^3    32^3    64^3    128^3   256^3
Single Thread    13      17      49      311     2416
Dual Thread      31      33      50      184     1263

Speedup of dual-thread OpenMP code
Volume           16^3    32^3    64^3    128^3   256^3
Speedup          0.419   0.515   0.98    1.69    1.912

• Slowdown (speedup below 1) at 16^3, 32^3 and 64^3
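The speedup figures follow directly from the timings: speedup is single-thread time divided by dual-thread time. The short sketch below recomputes them from the numbers on this slide. The parallel efficiency (speedup divided by thread count) is an extra metric added here for illustration and does not appear in the slides.

```python
# Timings in seconds, copied from the slide; keys are the volume width N for N^3.
single = {16: 13, 32: 17, 64: 49, 128: 311, 256: 2416}
dual = {16: 31, 32: 33, 64: 50, 128: 184, 256: 1263}

def speedup(t_single, t_dual):
    """Speedup of the dual-thread run over the single-thread run."""
    return t_single / t_dual

def efficiency(t_single, t_dual, n_threads=2):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup(t_single, t_dual) / n_threads

for width in sorted(single):
    s = speedup(single[width], dual[width])
    e = efficiency(single[width], dual[width])
    print(f"{width}^3: speedup {s:.3f}, efficiency {e:.3f}")
```

Values below 1 mean the dual-thread run was slower than the single-thread run, which is what the slide labels "Slowdown" for the small volumes.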
FDK Analysis for Memory Behavior

Statistics of Single Thread
Volume        16^3    32^3    64^3    128^3   256^3
L1 Miss%      23.65   38.62   36.93   34.08   21.5
L2 Miss%      0.64    41.74   23.75   18.6    14.39
DTLB Miss%    6.67    9.95    12.93   12.5    14.79

Statistics of Two Threads
Volume        16^3    32^3    64^3    128^3   256^3
L1 Miss%      81.8    90.55   91.45   91.58   96.58
L2 Miss%      1.64    1.76    1.78    1.87    6.13
DTLB Miss%    10.27   12.05   11.67   12.86   11.71
Katsevich Algorithm Overview
• Reconstructs a 3-D cylindrical volume exactly from 2-D projections. [1]
  – The inputs are projections (b) taken from a helical path surrounding the volume of interest (a).
• Implemented the Noo method [2]:
  – These projections are differentiated and weighted appropriately (c).
  – These undergo a 1-D Hilbert transform along the κ-lines.
    • First undergo remapping to κ-line coordinates (d).
    • Perform 1-D convolution with filter kernel (e).
    • Return to projection coordinates by remapping (f).
  – To reconstruct the 3-D volume (g), each voxel's coordinates are back-projected onto the source projections.
    • The cumulative sum is taken for all projections belonging to the PI-interval containing that voxel.
• Used similar parallelization strategy to FDK
  – Each thread processes a subset of the projections.
  – After synchronization, each thread reconstructs a subset of the total volume.
[Figure: panels (a) through (g) illustrating the steps above]
[1] Alexander Katsevich, "Theoretically exact FBP-type inversion algorithm for spiral CT," SIAM Journal on Applied Mathematics, 62:2012-2026, 2002.
[2] F. Noo, J. Pack, and D. Heuscher, "Exact helical reconstruction using native cone-beam geometries," Physics in Medicine and Biology, vol. 48, pp. 3787-3818, 2003.
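Step (e), the 1-D convolution with a filter kernel, can be illustrated with the textbook discrete Hilbert kernel, h[n] = 2/(πn) for odd n and 0 for even n. This is only a sketch under that assumption. The slides do not give the kernel, sampling or any windowing actually used, and a real implementation would more likely use an FFT-based convolution. The function names are hypothetical.

```python
import math

def hilbert_kernel(half_width):
    """Truncated discrete Hilbert kernel: 2/(pi*n) for odd n, 0 for even n."""
    return [0.0 if n % 2 == 0 else 2.0 / (math.pi * n)
            for n in range(-half_width, half_width + 1)]

def convolve_same(signal, kernel):
    """Direct 1-D convolution with zero padding; output has len(signal) samples."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(kernel):
            j = i - (k - half)  # input sample paired with kernel tap k
            if 0 <= j < len(signal):
                acc += h * signal[j]
        out.append(acc)
    return out

# One row of (already remapped) data would be filtered like this:
row = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]
filtered_row = convolve_same(row, hilbert_kernel(3))
```

The kernel is antisymmetric, so a constant region far from the row ends filters to approximately zero, while edges in the data produce a strong response.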
Results
Speedup using 2 Threads (single-precision)
Width of Reconstructed Volume    64      128     256     512
Speedup over 1 thread            1.993   1.974   1.965   1.940

Reconstruction Time (single-precision), time in seconds
Width of Reconstructed Volume    64      128     256     512
1 Thread                         0.5     6.1     90.5    1454.6
2 Threads                        0.2     3.1     45.8    740.1

• Using Intel Core2 Duo @ 2.66 GHz.
• Close to 2x speedup

Reconstruction Time (double precision), time in seconds
Width of Reconstructed Volume    64      128     256     512
1 thread                         0.6     8.8     136.0   2133.0
2 threads                        0.3     4.4     68.9    1087.0

Speedup using 2 Threads (double precision)
Width of Reconstructed Volume    64      128     256     512
Speedup over 1 thread            1.984   1.980   1.974   1.962
Image Quality
[Figure: 512^3 reconstruction (512 projections per turn, 512x64 size projections) shown with the 512^3 original phantom]
Benchmark
• Compared against the published timing results in [3], which used 64-bit AMD Opteron processors.
• Unable to determine exact parameters used by author of [3], so the comparison may be questionable.

Comparison to Published Results in [3] (single-threaded, double precision), time in seconds
Width of Reconstructed Volume    128     256     512
my results                       9       136     2133
uiowa results                    226     975     6013

[3] Deng, J., Yu, H., Ni, J., He, T., Zhao, S., Wang, L., and Wang, G. 2006. A Parallel Implementation of the Katsevich Algorithm for 3-D CT Image Reconstruction. J. Supercomput. 38, 1 (Oct. 2006), 35-47.
Optimizations Used
• Majority of time spent during backprojection and determining the PI-intervals.
  – PI-intervals are constant for a particular helix.
• PI-intervals are precomputed and saved to a file.
  – Only necessary to precompute PI-intervals for one horizontal slice.
• PI-intervals for different horizontal slices can be determined by rotation.
  – Easy ~25% speedup
[Figure: Time Breakdown for Different Stages (initial version). Segments: Differentiation and Filtering; Determine PI-intervals; Perform Backprojection; Other]
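Reusing one slice's PI-intervals for every other slice works because a helix maps onto itself under a combined rotation and vertical shift. The sketch below assumes the helix a(s) = (R cos s, R sin s, P·s/2π) with pitch P. The slides do not state the parameterization, and the function name is hypothetical. For a voxel at height z, it returns the in-plane position to look up in the reference slice and the angular offset to add to both endpoints of the stored PI-interval.

```python
import math

def to_reference_slice(x, y, z, pitch, z_ref=0.0):
    """Map a voxel to the reference slice using the helix screw symmetry.

    Returns (x_ref, y_ref, delta_s): look up the PI-interval stored for
    (x_ref, y_ref) in slice z_ref, then add delta_s to both endpoints.
    """
    # Source angle that corresponds to the height difference (assumed helix).
    delta_s = 2.0 * math.pi * (z - z_ref) / pitch
    c, s = math.cos(delta_s), math.sin(delta_s)
    # Rotate (x, y) by -delta_s about the helix axis.
    x_ref = c * x + s * y
    y_ref = -s * x + c * y
    return x_ref, y_ref, delta_s
```

In practice the rotated position rarely lands exactly on a grid point of the stored slice, so some interpolation or nearest-neighbor lookup would be needed. The slides do not say how this was handled.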
Optimizations Used
• Next focused on backprojection inner loop.
  – Removed trivial lookup tables to save cache space.
    • ~10% speedup.
  – Used sin, cos lookup tables
    • ~15% speedup.
  – Moved if statements for smoothing the ends of the PI-interval outside the loop.
    • Duplicated inner loop code.
    • ~10% speedup.
  – Removed if statements required for bounds testing the backprojected coordinates.
    • Needed to add extra row and column slack to projection data.
    • ~3% speedup.
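The last optimization, adding slack rows and columns so the inner loop needs no bounds tests, can be sketched as follows. This is an illustration, not the authors' code. It assumes bilinear interpolation of the projection data and a zero-filled border, neither of which is stated in the slides, and the function names are hypothetical.

```python
import math

def pad_projection(projection, slack=1):
    """Surround a 2-D projection (list of rows) with `slack` rows and columns of zeros."""
    width = len(projection[0]) + 2 * slack
    padded = [[0.0] * width for _ in range(slack)]
    for row in projection:
        padded.append([0.0] * slack + list(row) + [0.0] * slack)
    padded.extend([0.0] * width for _ in range(slack))
    return padded

def sample_bilinear(padded, u, v, slack=1):
    """Bilinear sample at column u, row v (unpadded coordinates), with no bounds tests.

    The caller must guarantee -slack <= u < n_cols + slack - 1 and the same for v
    with n_rows; the padding makes every index inside that range valid.
    """
    u0, v0 = math.floor(u), math.floor(v)
    fu, fv = u - u0, v - v0
    r, c = v0 + slack, u0 + slack
    top = (1.0 - fu) * padded[r][c] + fu * padded[r][c + 1]
    bottom = (1.0 - fu) * padded[r + 1][c] + fu * padded[r + 1][c + 1]
    return (1.0 - fv) * top + fv * bottom
```

Coordinates that fall just outside the detector blend with the zero border and need no special case, which is the branch the padding removes from the inner loop.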
Future work
• Explore memory layout to reduce cache misses and page faults.
• Implement the same algorithms on the Cell processor for competitive analysis.