Image Reconstruction on Multicore Processors

Graduate Students: Eric Fontaine and Viraj Paropkari

Faculty Members: Ada Gavrilovska and Hsien-Hsin S. Lee

Agenda

• Background
• FDK Algorithm
  – Overview
  – Parallelization Method
  – Current Results
• Katsevich Algorithm
  – Overview
  – Parallelization Method
  – Current Results
• Future Plans

Background

• Use 3-D CT scan to identify tumors and other defects inside the body.
• Two common methods
  – MRI
    • Complex math and physics
    • Main function: simple IFFT
  – Filtered back-projection
• Two common filtered back-projection algorithms
  – FDK
    • Approximation, fast
    • Uses projections taken on a circular path surrounding the object
    • More accurate on the plane containing the circle
  – Katsevich
    • More accurate, but also more compute-intensive
    • Uses projections taken on a helical path surrounding the object
    • It can reconstruct long objects, unlike the original FDK.
• Both contain large data parallelism

FDK Algorithm Overview

• Cone beam image reconstruction with source on a helix for a flat detector
• Reconstruction for 3-D volume
• Initialize the helix source parameters
• Compute/load cone beam data
• Length correction weighting
• 1-D horizontal filtering
• Linear pre-interpolation
• Back projection
• Compare results with standard phantom
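The weighting and filtering steps listed above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the usual cosine weight D / sqrt(D^2 + u^2 + v^2) for a flat detector and the textbook Ram-Lak ramp kernel. The names `length_correction_weight`, `ram_lak_kernel`, `filter_row`, and `weight_and_filter`, and the parameters `D` (source-to-detector distance) and `tau` (detector sample spacing), are hypothetical.

```python
import math

def length_correction_weight(u, v, D):
    """Cosine weight for detector point (u, v); D is an assumed source-to-detector distance."""
    return D / math.sqrt(D * D + u * u + v * v)

def ram_lak_kernel(half_width, tau=1.0):
    """Textbook discrete ramp (Ram-Lak) kernel for indices -half_width..half_width."""
    kernel = []
    for n in range(-half_width, half_width + 1):
        if n == 0:
            kernel.append(1.0 / (4.0 * tau * tau))
        elif n % 2 == 0:
            kernel.append(0.0)
        else:
            kernel.append(-1.0 / (math.pi ** 2 * n * n * tau * tau))
    return kernel

def filter_row(row, kernel, tau=1.0):
    """1-D horizontal filtering: same-size convolution of one detector row, zero-padded."""
    half = len(kernel) // 2
    out = []
    for i in range(len(row)):
        acc = 0.0
        for k, h in enumerate(kernel):
            j = i - (k - half)          # input sample paired with kernel tap k
            if 0 <= j < len(row):
                acc += row[j] * h
        out.append(acc * tau)
    return out

def weight_and_filter(projection, us, vs, D, kernel, tau=1.0):
    """Apply the weight to every sample, then filter each detector row independently."""
    out = []
    for v, row in zip(vs, projection):
        weighted = [length_correction_weight(u, v, D) * p for u, p in zip(us, row)]
        out.append(filter_row(weighted, kernel, tau))
    return out
```

Each detector row is filtered independently of the others, which is one reason the per-projection work parallelizes well.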

Parallelization Strategy

• Based on the FDK algorithm for general scanning paths such as a helix.*
  – Each thread is assigned a subset of the projections and performs length correction weighting, filtering, and back-projection on them.
  – An implicit barrier synchronizes the threads once all of them finish. Each thread is then assigned a subset of the total volume to reconstruct.
  – We use OpenMP.
    • Reconstruct subsets of the total volume in parallel (so each subset fits into an individual cache).
    • Piece the image together at the end (reduces inter-core communication).

[Figure: flow diagram. Assign Projections; each thread runs length correction weighting, filtering, and back-projection; barrier; Reconstructed Image]

*Ge Wang, Tein-Hsiang Lin, Ping-chin Cheng, and Douglas M. Shinozaki. A general cone-beam reconstruction algorithm. IEEE Trans. on Medical Imaging, 12(3):486-496, September 1993.
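The two-phase structure described on this slide (projections split across threads, a barrier, then the volume split across threads) can be sketched as follows. The slides use OpenMP; this Python version only illustrates the control flow. In CPython the global interpreter lock prevents pure-Python compute from running in parallel, so no speedup should be expected from it. The function names and the `preprocess` and `backproject_slice` callables are hypothetical, and placing weighting and filtering in phase one and voxel accumulation in phase two is an assumption about how the work divides.

```python
from concurrent.futures import ThreadPoolExecutor

def split(n_items, n_workers):
    """Split range(n_items) into n_workers contiguous, near-equal chunks."""
    base, extra = divmod(n_items, n_workers)
    chunks, start = [], 0
    for w in range(n_workers):
        stop = start + base + (1 if w < extra else 0)
        chunks.append(range(start, stop))
        start = stop
    return chunks

def reconstruct(projections, n_slices, preprocess, backproject_slice, n_workers=2):
    """Phase 1: per-projection work. Barrier. Phase 2: per-slice reconstruction."""
    filtered = [None] * len(projections)
    volume = [None] * n_slices

    def phase1(indices):
        for i in indices:                 # each worker writes only its own entries
            filtered[i] = preprocess(projections[i])

    def phase2(indices):
        for z in indices:                 # each worker owns a disjoint set of slices
            volume[z] = backproject_slice(z, filtered)

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # list() waits for every chunk to finish, which plays the role of the barrier.
        list(pool.map(phase1, split(len(projections), n_workers)))
        list(pool.map(phase2, split(n_slices, n_workers)))
    return volume
```

Because each worker writes to a disjoint part of the output in both phases, no locking is needed, and the only synchronization point is the barrier between the phases.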

Single and Dual-Thread Performance

Performance (seconds)

Volume         16^3   32^3   64^3   128^3  256^3
Single Thread  13     17     49     311    2416
Dual Thread    31     33     50     184    1263

Speedup of dual-thread OpenMP code (values below 1 are a slowdown)

Volume         16^3   32^3   64^3   128^3  256^3
Speedup        0.419  0.515  0.98   1.69   1.912
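The speedup figures follow directly from the timings: speedup is single-thread time divided by dual-thread time, and dividing again by the thread count gives parallel efficiency. The short sketch below recomputes them from the numbers on this slide; the function names are illustrative, and efficiency is an added derived quantity that the slide does not report.

```python
def speedup(t_single, t_parallel):
    """Speedup of the parallel run relative to the single-thread run."""
    return t_single / t_parallel

def efficiency(t_single, t_parallel, n_threads):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup(t_single, t_parallel) / n_threads

sizes = ["16^3", "32^3", "64^3", "128^3", "256^3"]
single = [13, 17, 49, 311, 2416]     # seconds, from the slide
dual = [31, 33, 50, 184, 1263]       # seconds, from the slide

for size, t1, t2 in zip(sizes, single, dual):
    print(f"{size:>6}  speedup {speedup(t1, t2):.3f}  efficiency {efficiency(t1, t2, 2):.3f}")
```

For the small volumes the dual-thread run is slower than the single-thread run, so the speedup falls below 1.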

FDK Analysis for Memory Behavior

Statistics of Single Thread

Volume      16^3   32^3   64^3   128^3  256^3
L1 Miss%    23.65  38.62  36.93  34.08  21.5
L2 Miss%    0.64   41.74  23.75  18.6   14.39
DTLB Miss%  6.67   9.95   12.93  12.5   14.79

Statistics of Two Threads

Volume      16^3   32^3   64^3   128^3  256^3
L1 Miss%    81.8   90.55  91.45  91.58  96.58
L2 Miss%    1.64   1.76   1.78   1.87   6.13
DTLB Miss%  10.27  12.05  11.67  12.86  11.71

Katsevich Algorithm Overview

• Reconstructs a 3-D cylindrical volume exactly from 2-D projections. [1]
  – The inputs are projections (b) taken from a helical path surrounding the volume of interest (a).
• Implemented the Noo method [2]:
  – These projections are differentiated and weighted appropriately (c).
  – These undergo a 1-D Hilbert transform along the κ-lines.
    • First undergo remapping to κ-line coordinates (d).
    • Perform 1-D convolution with the filter kernel (e).
    • Return to projection coordinates by remapping (f).
  – To reconstruct the 3-D volume (g), each voxel’s coordinates are back-projected onto the source projections.
    • The cumulative sum is taken for all projections belonging to the PI-interval containing that voxel.
• Used similar parallelization strategy to FDK
  – Each thread processes a subset of the projections.
  – After synchronization, each thread reconstructs a subset of the total volume.

[1] Alexander Katsevich, "Theoretically exact FBP-type inversion algorithm for spiral CT," SIAM Journal on Applied Mathematics, 62:2012-2026, 2002.

[2] F. Noo, J. Pack, and D. Heuscher, "Exact helical reconstruction using native cone-beam geometries," Physics in Medicine and Biology, vol. 48, pp. 3787–3818, 2003.

[Figure: panels (a) through (g) illustrating the steps above. http://cfi.lbl.gov/3D-2001/abstracts/01-1.pdf]
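The 1-D convolution in the Hilbert transform step (e) can be sketched as below. This is an illustration under stated assumptions, not the kernel used in the slides or in [2]: it uses the textbook ideal discrete Hilbert kernel, h[n] = 2/(πn) for odd n and 0 for even n, truncated to a finite width. The names `hilbert_kernel` and `hilbert_filter` are hypothetical, and the remapping to and from κ-line coordinates is not shown.

```python
import math

def hilbert_kernel(half_width):
    """Truncated ideal discrete Hilbert kernel for n = -half_width..half_width."""
    return [0.0 if n % 2 == 0 else 2.0 / (math.pi * n)
            for n in range(-half_width, half_width + 1)]

def hilbert_filter(signal, kernel):
    """Same-size 1-D convolution out[i] = sum_n h[n] * x[i - n], zero-padded at the ends."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(kernel):
            j = i - (k - half)
            if 0 <= j < len(signal):
                acc += h * signal[j]
        out.append(acc)
    return out
```

The kernel is antisymmetric, so a constant input maps to zero away from the ends of the signal, which makes a quick sanity check for an implementation.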

Results

• Using Intel Core2 Duo @ 2.66 GHz.
• Close to 2x speedup

Reconstruction Time (single precision), time in seconds

Width of Reconstructed Volume  64    128   256    512
1 Thread                       0.5   6.1   90.5   1454.6
2 Threads                      0.2   3.1   45.8   740.1

[Chart: Speedup using 2 Threads (single precision). x-axis: Width of Reconstructed Volume (64, 128, 256, 512); y-axis: Speedup over 1 thread (1 to 2). Data labels: 1.993, 1.974, 1.965, 1.940]

Reconstruction Time (double precision), time in seconds

Width of Reconstructed Volume  64    128   256    512
1 Thread                       0.6   8.8   136.0  2133.0
2 Threads                      0.3   4.4   68.9   1087.0

[Chart: Speedup using 2 Threads (double precision). x-axis: Width of Reconstructed Volume (64, 128, 256, 512); y-axis: Speedup over 1 thread (1 to 2). Data labels: 1.984, 1.980, 1.974, 1.962]

Image Quality

[Figure: 512^3 Reconstruction (512 projections per turn, 512x64 projections)]

[Figure: 512^3 original Phantom]

Benchmark

• Compared against the published timing results in [3], which used 64-bit AMD Opteron processors.

• Unable to determine exact parameters used by author of [3], so the comparison may be questionable.

Comparison to Published Results in [3] (single-threaded, double precision), time in seconds

Width of Reconstructed Volume  128   256   512
my results                     9     136   2133
uiowa results                  226   975   6013

[3] Deng, J., Yu, H., Ni, J., He, T., Zhao, S., Wang, L., and Wang, G. 2006. A Parallel Implementation of the Katsevich Algorithm for 3-D CT Image Reconstruction. J. Supercomput. 38, 1 (Oct. 2006), 35-47.

Optimizations Used

• Majority of time spent during backprojection and determining the PI-intervals.
  – PI-intervals are constant for a particular helix.
    • PI-intervals are precomputed and saved to a file.
  – Only necessary to precompute PI-intervals for one horizontal slice.
    • PI-intervals for different horizontal slices can be determined by rotation.
  – Easy ~25% speedup

[Pie chart: Time Breakdown for Different Stages (initial version). Segments: Differentiation and Filtering; Determine PI-intervals; Perform Backprojection; Other]
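The remark that PI-intervals for other horizontal slices follow by rotation can be motivated by the screw symmetry of the helix. The derivation below is a sketch under assumed notation that is not taken from the slides: helix radius R, pitch h, source angle s, and PI-interval endpoints s_b and s_t.

```latex
% Helix with radius R and pitch h (symbols assumed for illustration)
a(s) = \left( R\cos s,\; R\sin s,\; \frac{h\,s}{2\pi} \right)

% Screw motion: rotate by \alpha about the z-axis and shift by h\alpha/(2\pi) along it
T_\alpha(x, y, z) = \left( x\cos\alpha - y\sin\alpha,\; x\sin\alpha + y\cos\alpha,\; z + \frac{h\,\alpha}{2\pi} \right),
\qquad T_\alpha\, a(s) = a(s + \alpha)

% T_\alpha is a rigid motion, so it maps the PI-line of a point p to the PI-line of T_\alpha p
s_b(T_\alpha p) = s_b(p) + \alpha, \qquad s_t(T_\alpha p) = s_t(p) + \alpha

% Slice at height z_0 + \Delta z from the stored slice at z_0, with \alpha = 2\pi\,\Delta z / h
s_{b,t}(x, y, z_0 + \Delta z) = s_{b,t}\big( x\cos\alpha + y\sin\alpha,\; -x\sin\alpha + y\cos\alpha,\; z_0 \big) + \alpha
```

Under these assumptions, a table computed for one slice is enough: every other slice reads the same table at rotated in-plane coordinates and adds a constant offset to both endpoints.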

Optimizations Used

• Next focused on backprojection inner loop.
  – Removed trivial lookup tables to save cache space.
    • ~10% speedup.
  – Used sin, cos lookup tables.
    • ~15% speedup.
  – Moved if statements for smoothing the ends of the PI-interval outside the loop.
    • Duplicated inner loop code.
    • ~10% speedup.
  – Removed if statements required for bounds testing the backprojected coordinates.
    • Needed to add extra row and column slack to projection data.
    • ~3% speedup.
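Two of these inner-loop optimizations are easy to show in miniature: precomputing sin and cos once per source angle, and padding the projection data so interpolation needs no bounds tests. The sketch below is illustrative only and is not the authors' code. The function names are hypothetical, bilinear interpolation is an assumed sampling scheme, and padding is added only on the high side, so coordinates must be non-negative.

```python
import math

def build_trig_tables(n_angles):
    """Precompute cos and sin for n_angles equally spaced source angles over one turn."""
    step = 2.0 * math.pi / n_angles
    cos_table = [math.cos(i * step) for i in range(n_angles)]
    sin_table = [math.sin(i * step) for i in range(n_angles)]
    return cos_table, sin_table

def pad_projection(rows, slack=1):
    """Append `slack` zero columns to every row and `slack` zero rows at the bottom."""
    width = len(rows[0])
    padded = [list(r) + [0.0] * slack for r in rows]
    padded += [[0.0] * (width + slack) for _ in range(slack)]
    return padded

def bilinear_unchecked(padded, u, v):
    """Bilinear sample with no bounds tests.

    Valid for 0 <= u <= width - 1 and 0 <= v <= height - 1 of the unpadded data;
    the slack row and column absorb the +1 neighbour reads at the upper edges.
    """
    iu, iv = int(u), int(v)
    fu, fv = u - iu, v - iv
    top = padded[iv][iu] * (1.0 - fu) + padded[iv][iu + 1] * fu
    bottom = padded[iv + 1][iu] * (1.0 - fu) + padded[iv + 1][iu + 1] * fu
    return top * (1.0 - fv) + bottom * fv
```

The padding trades a small amount of memory for a branch-free inner loop, which matches the modest gain reported on the slide.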

Future Work

• Explore memory layout to reduce cache misses and page faults.
• Implement the same algorithms on Cell processor for competitive analysis.
