On the efficiency of iterative ordered subset reconstruction algorithms for acceleration on GPUs

Computer Methods and Programs in Biomedicine 98 (2010) 261–270
Journal homepage: www.intl.elsevierhealth.com/journals/cmpb
doi:10.1016/j.cmpb.2009.09.003
© 2009 Elsevier Ireland Ltd. All rights reserved.

Fang Xu a, Wei Xu b, Mel Jones b, Bettina Keszthelyi b, John Sedat b, David Agard b, Klaus Mueller a,*

a Center for Visual Computing, Computer Science Department, Stony Brook University, NY 11794-4400, United States
b Howard Hughes Medical Institute and the Keck Advanced Microscopy Center, Department of Biochemistry & Biophysics, University of California at San Francisco, CA, United States
* Corresponding author. Tel.: +1 631 632 1524. E-mail address: [email protected] (K. Mueller).

Article history:
Received 25 February 2009
Received in revised form 1 September 2009
Accepted 3 September 2009

Keywords:
Iterative reconstruction
Computed tomography
Commodity graphics hardware
GPU

Abstract

Expectation Maximization (EM) and the Simultaneous Iterative Reconstruction Technique (SIRT) are two iterative computed tomography reconstruction algorithms often used when the data contain a high amount of statistical noise, have been acquired from a limited angular range, or have a limited number of views. A popular mechanism to increase the rate of convergence of these types of algorithms has been to perform the correctional updates within subsets of the projection data. This has given rise to the method of Ordered Subsets EM (OS-EM) and the Simultaneous Algebraic Reconstruction Technique (SART). Commodity graphics hardware (GPUs) has shown great promise to combat the high computational demands incurred by iterative reconstruction algorithms. However, we find that the special architecture and programming model of GPUs add extra constraints on the real-time performance of ordered subsets algorithms, counteracting the speedup benefits of smaller subsets observed on CPUs. This gives rise to new relationships governing the optimal number of subsets as well as relaxation factor settings for obtaining the smallest wall-clock time for reconstruction, a factor that is likely application-dependent. In this paper we study the generalization of SIRT into Ordered Subsets SIRT and show that this allows one to optimize the computational performance of GPU-accelerated iterative algebraic reconstruction methods.

1. Introduction

The rapid growth in speed and capabilities of programmable commodity graphics hardware boards (GPUs) has propelled high performance computing to the desktop, spawning applications far beyond those used in interactive computer games. High-end graphics boards, such as the NVIDIA 8800 GTX and their successors, featuring 500 GFlops and more, are now available for less than $500, and their performance is consistently growing at a triple of Moore's law that governs the growth of CPUs. Speedups of 1–2 orders of magnitude have been reported by many researchers when mapping CPU-based algorithms onto the GPU, in a wide variety of domains [10,18], including medical imaging [8,11,13,16]. These impressive gains originate in the highly parallel Same Instruction Multiple Data (SIMD) architecture of the GPU and its high-bandwidth memory access.



For example, the NVIDIA 8800 GTX has 128 such SIMD pipelines, while the most recent NVIDIA card, the GTX 295, has 2 × 240 processors to yield a peak performance of 1.7 TFlops.

It is important to note, however, that the high speedup rates facilitated by GPUs do not come easy. They require one to carefully map the target algorithm from the single-threaded programming model of the CPU to the multi-threaded SIMD programming model of the GPU, where each such thread is dedicated to computing one element of the (final or intermediate) result vector. Here, special attention must be paid to keep all of these pipelines busy. While there are 100s of SIMD processors on the GPU, many more threads need to be created to hide data fetch latencies. It is important to avoid both thread congestion (too many threads waiting for execution) and thread starvation (not enough threads available to hide latencies). These conditions are in addition to avoiding possible contingencies in local registers and caches that will limit the overall number of threads permitted to run simultaneously. For example, in [13], it was shown that a careful mapping of Feldkamp's filtered backprojection algorithm to the GPU yielded a 20× speedup over an optimized CPU implementation, enabling cone-beam reconstructions of 512³ volumes from 360 projections at a rate of 52 projections/s, greatly exceeding the data production rates of modern flat-panel X-ray scanners that have become popular in fully 3D medical imaging.

The compute-intensive nature of iterative reconstruction algorithms motivated their acceleration via commodity graphics hardware early on, first using graphics workstations [9] and later GPU boards [14]. We now refine these works by analyzing the acceleration of iterative reconstruction algorithms more closely in terms of the underlying GPU programming model and architecture. Iterative algorithms are different from analytical algorithms in that they require frequent synchronization, which interrupts the stream of data, requires context switches, and limits the number of threads available for thread management. Iterative algorithms, such as Expectation Maximization (EM) [12] or the Simultaneous Iterative Reconstruction Technique (SIRT) [5], consist of three phases, executed in an iterative fashion: (1) projection of the object estimate, (2) correction factor computation (the updates), and (3) backprojection of the object estimate updates. Each phase requires a separate pass. Flexibility comes from the concept of ordered subsets, which were originally devised mostly because they accelerated convergence. The projection data is divided into groups, the subsets, and the data within each of these groups undergo each of the three phases simultaneously. Here, it was found that the larger the number of subsets (that is, the smaller the groups), the faster is typically the convergence, but adversely also the higher the noise, since there is more potential for over-correction. In EM, the method of Ordered Subsets (OS-EM) has become widely popular. OS-EM conceptually allows for any number of subsets, but the limit with respect to noise has been noted already in the original work by Hudson and Larkin [7]. For the algebraic scheme, embodied by SIRT, the Simultaneous Algebraic Reconstruction Technique (SART) [2] is also an OS scheme, but with each set only consisting of a single projection. In SART, the over-correction noise is kept low by scaling the updates by a relaxation factor λ < 1. Block-iterative schemes for algebraic techniques have been proposed as well [3]. In fact, the original ART [6] is the algorithm with the smallest subset size possible: a single data point (that is, ray or projection pixel).
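To make the mechanics of ordered subsets concrete, the following Python sketch outlines the three-phase update loop described above. The callables forward and backproject, and all parameter names, are hypothetical placeholders used only for illustration, not the implementation developed in this paper.

import numpy as np

def os_iteration(volume, projections, subsets, forward, backproject, lam):
    """One pass over all ordered subsets: project, correct, backproject.

    `forward` and `backproject` stand in for the GPU passes described in the
    text; here they are plain Python callables.  `projections` is a NumPy
    array indexed by view, `subsets` is a list of lists of view indices.
    """
    for subset in subsets:                                   # each subset = group of views
        simulated = forward(volume, subset)                  # phase 1: forward projection
        correction = projections[subset] - simulated         # phase 2: corrective updates
        volume = volume + lam * backproject(correction, subset)  # phase 3: backprojection
    return volume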

It is well known that SART converges much faster than SIRT, and a well-chosen λ can overcome the problems with streak artifacts and reconstruction noise, allowing it to produce good reconstruction results [1]. On the CPU, a faster rate of convergence is directly related to faster time performance. But, as we previously demonstrated [15], when it comes to acceleration on a streaming architecture such as the GPU, SART is not the fastest algorithm in terms of time performance. In fact, the time performance is inversely related to the number of subsets, making SIRT the faster scheme. This is due to the overhead incurred by the frequent context switching when repeatedly moving through the three iterative phases: projection, correction, and backprojection. In the same work, we also demonstrated that specific subset sizes can optimize both reconstruction quality and performance. However, the influence of the relaxation factor λ had not been considered in these experiments.

In the research reported here we now complete the study of [15] and also investigate the role of λ on GPU reconstruction speed performance, in relation to subset size. Here we find that an optimal choice of λ can have a great impact. Despite the fact that SART is slower than SIRT per iteration, an optimized λ-setting can reduce the number of required iterations for convergence to a greater extent than the extra per-iteration cost incurred by GPU data streaming and context switching. This reinstates SART and the lower-subset versions of SIRT as the fastest iterative CT reconstruction schemes also on GPUs.

We shall note, however, that the optimal setting is likely application-dependent, which is not unique to the GPU setting alone. Here we make the reasonable assumption that a certain application will always incur similar types of data, and thus an optimal parameter setting, once found, will likely be close to optimal for all data within that application setting. In that sense, our aim for this paper is not to provide optimal subset and relaxation factor settings for all types of data, but rather to raise awareness to this phenomenon and offer an explanation.

Our paper is structured as follows. First, in Section 2, we discuss iterative algorithms in the context of ordered subsets, present a generalization of SIRT to OS-SIRT, and describe their acceleration on the GPU. Then, in Section 3, we study the impacts of both subset size and relaxation factor on GPU reconstruction performance and present the results of our studies. Finally, Section 4 ends with conclusions.

2. Iterative reconstruction and its acceleration on GPUs

In the following discussion, we have only considered algebraic reconstruction algorithms (SART, SIRT), but our arguments and conclusions readily extend to Expectation Maximization (EM) algorithms as well, since they are very similar with respect to their mapping to GPUs [14].


2.1. Iterative algebraic reconstruction: decomposition into subsets

Most iterative CT techniques use a projection operator to model the underlying image generation process at a certain viewing configuration (angle) ϕ. The result of this projection simulation is then compared to the acquired image obtained at the same viewing configuration. If scattering or diffraction effects are ignored, the modeling consists of tracing a straight ray r_i from each image element (pixel) and summing the contributions of the (reconstruction) volume elements (voxels) v_j. Here, the weight factor w_ij determines the contribution of a v_j to r_i and is given by the interpolation kernel used for sampling the volume. The projection operator is given as

r_i = \sum_{j=1}^{N} v_j w_{ij}, \qquad i = 1, 2, \ldots, M        (1)

where M and N are the number of rays (one per pixel) and voxels, respectively. Since GPUs are heavily optimized for computing and less for memory bandwidth, computing the w_ij on the fly, via bilinear interpolation, is by far more efficient than storing the weights in memory. The correction update for projection-based algebraic methods is computed with the following equation:

v_j^{(k+1)} = v_j^{(k)} + \lambda \, \frac{\sum_{p_i \in OS_s} \frac{p_i - r_i}{\sum_{l=1}^{N} w_{il}} \, w_{ij}}{\sum_{p_i \in OS_s} w_{ij}}, \qquad r_i = \sum_{l=1}^{N} w_{il} \, v_l^{(k)}        (2)

For the purpose of this paper, we have written this equation as a generalization of the original SART and SIRT equations to support any number of subsets. Here, the p_i are the pixels in the M/S acquired images that form a specific (ordered) subset OS_s, where 1 ≤ s ≤ S and S is the number of subsets. The factor λ is the relaxation factor, as mentioned above, which will be subject to optimization. The factor k is the iteration count, where k is incremented each time all M projections have been processed. In essence, all voxels v_j on the path of a ray r_i are updated (corrected) by the difference of the projection ray r_i and the acquired pixel p_i, where this correction factor is first normalized by the sum of weights encountered by the (backprojection) ray r_i. Since a number of backprojection rays will update a given v_j, these corrections also need to be normalized by the sum of (correction) weights. For SIRT, these normalization weights are trivial.
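A minimal NumPy sketch of this update for a single subset is given below. It follows Eq. (2) as reconstructed above, but uses an explicit weight matrix W purely for clarity, whereas the paper evaluates the weights on the fly on the GPU; all names are illustrative.

import numpy as np

def os_sirt_update(v, W, p, subset_rows, lam=1.0, eps=1e-12):
    """One OS-SIRT correction for a single subset OS_s.

    v           : current volume estimate, shape (N,)
    W           : weight matrix, shape (M, N) -- dense here only for clarity
    p           : acquired projection pixels, shape (M,)
    subset_rows : indices of the rays p_i belonging to the subset
    """
    Ws = W[subset_rows]                       # weight rows of this subset
    r = Ws @ v                                # forward projection r_i = sum_l w_il v_l
    ray_norm = Ws.sum(axis=1) + eps           # sum_l w_il per ray
    resid = (p[subset_rows] - r) / ray_norm   # normalized correction per ray
    num = Ws.T @ resid                        # spread corrections back to the voxels
    den = Ws.sum(axis=0) + eps                # sum of correction weights per voxel
    return v + lam * num / den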

2.2. GPU-accelerated reconstruction: threads and passes

As mentioned, the NVIDIA 8800 GTX board has 128 generalized SIMD processors. Up to very recently, the only way to interface with GPU hardware was via a suitable graphics API, such as OpenGL or DirectX, and using Cg [4] (or GLSL or HLSL) for coding shader programs to be loaded and run on the SIMD fragment processors. With the introduction of a new API, the Compute Unified Device Architecture (CUDA) [19], the GPU can now directly be perceived as a multi-processor. With CUDA, the Cg fragments become the CUDA (SIMD) computing threads and the shader programs become the computing kernels. With the CUDA specifications, much more information about the overall GPU architecture is now available, which helps programmers to fine-tune thread and memory management to optimize performance, viewing the GPU as the multi-processor architecture it really is. Reflecting this General Programming on GPUs (GPGPU) trend, new GPU platforms have now become available that do not even have graphics display capabilities, such as the NVIDIA Tesla board. Although we used GLSL shaders to obtain the results presented in this paper, similar symptoms also occur in CUDA, where synchronization operations have to be formally called to finish execution of all threads within a thread block before the pipeline resumes. After all, the underlying hardware and its architecture remain the same; just the API is different.

The GPU memory model differs significantly from its CPU counterpart, posing greater restrictions on memory access operations in order to reduce latency and increase bandwidth. Here, not only are registers and local memory reduced or even completely eliminated; the global memory also allows only read instructions during the computation. Further, the write operation can be executed only at the end of a computation, when the thread (or fragment) is released from the pipeline, to be blended with the target. Therefore, in general-purpose computing using GPUs, computations are triggered by initializing a "pass". A pass includes setting up the computation region and attaching a kernel program that simultaneously applies specific operations to every thread generated. The data is then streamed into the pipeline, where the modification can be done only at the end of the pass. Cycles and loops within a program can be implemented either inside the kernel or by running multiple passes. The former is generally faster, since evoking a rendering pass and storing intermediate results in memory are costly, but there exists a register count limit in the current hardware which prevents unconstrained use of loops in the kernel.

2.3. GPU-accelerated iterative reconstruction: implementation details

To illustrate the significance of our generalization of SART and SIRT into OS-SIRT, in the context of GPU-accelerated computing, we shall now describe our specific GPU implementation in closer detail. As mentioned, iterative algorithms consist of three passes which are repeatedly executed until convergence: forward projection, corrective update computation, and backprojection. Of these, only the first and third pass are computationally challenging, as they involve the reconstruction volume. We note that in SART each such pass involves only one projection, while in SIRT it involves the entire set of projections. In OS-SIRT the number of projections involved is determined by the size of the subset.

The high performance of GPUs stems from the parallel execution of a sufficiently large number of SIMD computing threads. Each such thread must have a target which eventually receives the outcome of the thread's computation. In the forward projection the targets are the pixels on the projection images receiving the outcome of the ray integrations (Eq. (1)), while in the backprojection the targets are the reconstructed volume voxels receiving the corrective updates (Eq. (2)).
Fig. 1 – Time per iteration (in s) as a function of number of subsets.

Thus, the forward projection is inherently pixel-driven, while the backprojection is voxel-driven. These two mechanisms (or pipelines) are mainly distinguished by their interpolation scheme: the former interpolates (ray) samples in reconstruction volume space, while the latter interpolates (correction update) samples in projection image space. This constitutes an unmatched projector/backprojector pair, which has been successfully used in the past in other contexts [17].

In the work considered here, the 3D object estimate is represented as a three-dimensional texture contained in a bounding box of six polygons. The resolution of the texture is that of the volume to be reconstructed. The forward projector casts rays across this texture, and the backprojector updates the texture. Both observe the viewing geometry of the detector that is associated with the acquired data. We shall now describe the GPU implementations of these two pipelines more closely. For further detail the reader is referred to a number of available papers (see, for example, [13,14]).

Forward projector: given the viewing geometry and position of the detector associated with the data, we first require for each of its rays the 3D coordinates of their entry and exit points into the volume texture's bounding box. These can be efficiently obtained with the GPU via OpenGL commands. We first transform the bounding box by the viewing geometry's transformation matrix and then render the bounding box polygons into the depth buffer. Here the viewport is chosen such that its resolution matches that of the projection data image. The entry points are computed by projecting the bounding box polygons and setting the GL depth buffer test to GL_LESS, which will only retain the depth coordinates of the front-facing polygons, and the exit points are computed by setting the GL depth buffer test to GL_GREATER, which will only retain the depth coordinates of the back-facing polygons. The ray direction vector for each ray (pixel) is then calculated by subtracting the ray's entry point 3D coordinates from the exit point 3D coordinates and normalizing this vector. Given a certain pre-defined ray step size (we use 1.0), each thread can then compute the number of sample points along its ray to initialize a volume traversal loop. Beginning at the volume entry point, it interpolates the volume texture (using the GPU's hardwired trilinear interpolation facility), adds this value to its ray integral (initialized to zero), and then advances the ray by adding the direction vector scaled by the step size to its current volume position. All rays advance in lock step and in parallel, each on a different parallel GPU processor. The result of this forward projection procedure is a texture of the same resolution as the input data.
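The sketch below mimics this pixel-driven scheme on the CPU for a 2D parallel-beam geometry, with one loop iteration standing in for one GPU thread. The geometry conventions (detector centering, marching length, step size) are assumptions made for illustration, not the paper's GLSL implementation.

import numpy as np

def forward_project_2d(image, angle_deg, n_det, step=1.0):
    """Ray-driven forward projection of a 2D image for one parallel-beam view:
    each detector pixel marches a ray across the image with a fixed step size
    and bilinear interpolation, accumulating the ray integral."""
    h, w = image.shape
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    ray_dir = np.array([np.cos(theta), np.sin(theta)])    # marching direction
    det_dir = np.array([-np.sin(theta), np.cos(theta)])   # detector axis
    half_len = np.hypot(w, h) / 2.0                       # long enough to cover the image
    n_steps = int(2 * half_len / step)
    sino_row = np.zeros(n_det)
    for d in range(n_det):                                # one "thread" per detector bin
        offset = d - (n_det - 1) / 2.0
        start = np.array([cx, cy]) + offset * det_dir - half_len * ray_dir
        total = 0.0
        for s in range(n_steps):
            x, y = start + s * step * ray_dir
            ix, iy = int(np.floor(x)), int(np.floor(y))
            if 0 <= ix < w - 1 and 0 <= iy < h - 1:
                fx, fy = x - ix, y - iy                   # bilinear interpolation weights
                total += ((1 - fx) * (1 - fy) * image[iy, ix]
                          + fx * (1 - fy) * image[iy, ix + 1]
                          + (1 - fx) * fy * image[iy + 1, ix]
                          + fx * fy * image[iy + 1, ix + 1]) * step
        sino_row[d] = total
    return sino_row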

Corrective update computation: this simple thread subtracts the texture obtained in the forward projection step from the input data texture and normalizes the result by a texture storing the sum of weights for each ray (computed in a preprocessing step with the volume texture set to unity). The result is stored in another 2D texture we call the corrective update texture.

Backprojector: now the 3D volume texture becomes the rendering target. A thread is spawned for each voxel, and the mapping from the voxel's 3D coordinate to the detector is determined via projective geometry calculations. Then the GPU's bilinear interpolation facility is called to compute the voxel update from the corrective update texture. The resulting value is then added to the 3D texture at that voxel position.
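A matching voxel-driven sketch, again for the 2D parallel-beam analogue and with assumed geometry conventions, is given below; each loop iteration corresponds to one backprojection thread reading the corrective update "texture".

import numpy as np

def backproject_2d(correction, angle_deg, shape):
    """Voxel-driven backprojection of a 1D corrective update (one view) into a
    2D image: each pixel projects its coordinate onto the detector axis and
    reads the update with linear interpolation."""
    h, w = shape
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    det_dir = np.array([-np.sin(theta), np.cos(theta)])   # detector axis (as above)
    n_det = correction.shape[0]
    update = np.zeros(shape)
    for iy in range(h):                                    # one "thread" per voxel/pixel
        for ix in range(w):
            # signed distance from the central ray gives the detector coordinate
            t = (ix - cx) * det_dir[0] + (iy - cy) * det_dir[1] + (n_det - 1) / 2.0
            i0 = int(np.floor(t))
            if 0 <= i0 < n_det - 1:
                f = t - i0                                 # linear interpolation
                update[iy, ix] = (1 - f) * correction[i0] + f * correction[i0 + 1]
    return update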

We observe that since the forward projector uses a fairly good interpolation filter (trilinear) and a ray step size independent of the viewing angle, it produces a ray integral estimate that is a reasonably close approximation to an analytical integration (assuming ideal ray physics). The corrective updates of the backprojector, on the other hand, are not spread into the voxel grid via a trilinear function at regular intervals along the ray. Rather, the ray interval varies between 1 and √2, and the spreading only occurs in-slice via bilinear interpolation. Since the spreading (scattering) of thread results into multiple targets (the voxels in the trilinear neighborhood) is not supported on GPUs (and can also lead to hazards in parallel computation), this restriction is difficult to overcome. Fortunately, having a high-quality forward projector is of much higher importance in CT reconstruction, since its integration result is compared to the integration result of the corresponding acquisition ray, which then determines the corrective update.

Fig. 2 – Reconstructions obtained with the linear λ-selection schedule for various subset sizes for a fixed CC = 0.95 and 180 projections in an angular range of 180°. We observe that OS-SIRT with 10 subsets of 18 projections each reaches this fixed CC value 12% faster than SIRT and 50% faster than SART.

Fig. 3 – Optimal relaxation factor λ as a function of number of subsets for the optimal λ-selection scheme.

Fig. 4 – Reconstructions with the optimized λ-selection scheme obtained with various subset sizes for a fixed CC = 0.95 and 180 projections in an angular range of 180°. We observe that now the number of subsets is directly related to wall-clock computation time, with SART being the fastest.

Fig. 5 – CC vs. wall-clock time for the linear λ-selection scheme. We observe that OS-SIRT achieves the reconstruction in the smallest amount of time, within our GPU-accelerated framework.

Fig. 6 – CC vs. wall-clock time for the optimized λ-selection scheme. We observe that there is now a clear ordering in terms of reconstruction time from small subsets to larger ones.

2.4. GPU-accelerated iterative reconstruction: extension to ordered subsets

From what has been discussed above we observe that ordered subsets (with the extreme case being SIRT) allow better control over the number of threads per kernel invocation. The more projections are grouped into a subset, the more pixel rays (and therefore threads) are spawned. This allows a better hiding of memory latencies, since the threads waiting for memory can be (temporarily) replaced by other non-waiting threads not yet processed for the current program operation. An inherent limitation on the number of threads that can be managed this way is the available GPU (shared) memory. On the other hand, in the backprojection there are typically a sufficient number of threads, since the number of voxels is an order of magnitude greater than the number of pixels in the projection images. However, backprojection threads are typically short, which makes them less efficient. Since ordered subsets map each voxel to a set of projections, this sequence of mappings increases the length of the kernel program and improves efficiency. A final advantage of ordered subsets is that they reduce the number of passes and the context switches that occur between the three repetitive phases of the iterative algorithm.

In the following we shall present our OS-SIRT algorithm in further detail. Eq. (2) above described the generalization of algebraic reconstruction into the OS configuration. What is left to define is how the subsets OS_s are composed and how λ is chosen for a given number of subsets S. As specified above, OS_s is the set of projections contained in each subset, to be used in a pair of simultaneous forward projections and simultaneous backward projections. In our application, each subset has the same number of projections, that is |OS_s| = |OS|, which is typical. Thus, the total number of projections M = |OS|S. The traditional way of filling a certain subset OS_s is to select projections whose indices m (1 ≤ m ≤ M) satisfy m mod S = s. This is what has been adopted in OS-EM. In contrast, we use a randomized approach to fill the subsets, which we find yields better results than the regular subset population approach. For this, we simply generate a projection index list in random order and sequentially divide this list into S subsets.
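The two subset-composition strategies can be summarized as follows (0-based indices; names are illustrative):

import numpy as np

def regular_subsets(M, S):
    """Traditional OS-EM-style composition: projection m goes to subset m mod S."""
    return [[m for m in range(M) if m % S == s] for s in range(S)]

def randomized_subsets(M, S, seed=0):
    """Randomized composition described in the text: shuffle the projection
    index list and cut it into S (nearly) equally sized subsets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(M)
    return [list(chunk) for chunk in np.array_split(order, S)]

# Example: 180 projections split into 10 subsets of 18 projections each.
subsets = randomized_subsets(180, 10)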

Fig. 7 – Line profiles of the image background for SART, SIRT, and the original Barbara image.

In [15] we proposed the following linear equation for the relaxation factor λ to be used for an arbitrary S, setting λ = 1 for SIRT and λ = λ_SART = 0.1 for SART:

\lambda = (\lambda_{SART} - 1)\left(\frac{S - 1}{N - 1}\right) + 1, \qquad 1 \le S \le N        (3)

This scheme sought to balance the smoothing effect achieved by the application of a larger set of simultaneous projections (employed in SIRT) with that obtained by a lower relaxation factor (when a smaller set of projections is applied with SART or SIRT with smaller subsets). That is, the fewer projections in a subset, the lower the λ.
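In code, the linear schedule of Eq. (3) is simply:

def linear_lambda(S, N, lam_sart=0.1):
    """Linear relaxation-factor schedule of Eq. (3): lambda = 1 for SIRT (S = 1)
    and lambda = lam_sart for SART (S = N)."""
    return (lam_sart - 1.0) * (S - 1) / (N - 1) + 1.0

# e.g. with N = 180: linear_lambda(1, 180) == 1.0 (SIRT),
#                    linear_lambda(180, 180) == 0.1 (SART)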

In our current work, we collected results for all possible integer-subset configurations, each for a representative set of λ-levels, and used them to produce a more accurate λ(S) function than the linear function in Eq. (3). In the following, we shall call the first scheme the linear λ-selection scheme, and the second method the optimal λ-selection scheme.

Fig. 8 – Reconstructed baby head using high-quality simulated projection data of the volume labeled 'Original'. Results were obtained with the linear λ-selection scheme with various subset sizes for a fixed R-factor = 0.007 and 180 projections in an angular range of 180°. OS-SIRT with 10 subsets of 18 projections each reaches this set R-factor value 26% faster than SIRT and 86% faster than SART.

The arguments presented thus far, as well as our prior observations in [15], explain the lower speed obtained with single iterations of SART or OS-SIRT with a large number of subsets, compared to single iterations with basic SIRT or OS-SIRT with a small number of subsets. The results we shall present below will show, however, that these extra costs incurred by SART and OS-SIRT with a large number of subsets are more than compensated by the faster convergence rates they can achieve, given the proper λ-settings obtained with a more informed λ-selection scheme.

3. Experiments and results

All our experiments were conducted on an NVIDIA 8800 GTX GPU, programmed with GLSL. For the first set of experiments we used the 2D Barbara test image to evaluate the performance of the different reconstruction schemes described above. We used this image, popular in the image processing literature, since it has several coherent regions with high-frequency detail, which present a well observable test for the fidelity of a given reconstruction scheme. The target 2D image is obtained by cropping the original image to an area of 256 × 256 pixels resolution. We obtained 180 projections at uniform angular spacing of [−90°, +90°] in a parallel projection viewing geometry. We also simulated a limited-angle scenario, where iterative algorithms are often employed. Here, we produced 140 projections in the interval [−70°, +70°]. All reconstructions used linear interpolation.

Fig. 9 – Reconstruction results of the baby head obtained with the optimized λ-selection scheme with various subset sizes for a fixed R-factor = 0.007 and 180 projections in an angular range of 180°. Using the λ-selections shown above, OS-SIRT with 20 subsets of 9 projections each is the fastest in this case. It reaches this set R-factor value 91% faster than SIRT and 72% faster than SART. We also note that the time is about 7 times faster than with the linear selection scheme since the optimal λ-factor is higher.

We will use the cross-correlation coefficient (CC) as the metric to measure the degree of similarity between the original image and its reconstruction:

CC = \frac{\sum_j (r_j - \mu_r)(o_j - \mu_o)}{\sqrt{\sum_j (r_j - \mu_r)^2 \sum_j (o_j - \mu_o)^2}}        (4)

where j counts the number of image pixels, r_j and o_j are pixels in the reconstruction and original image, respectively, and the μ factors are their mean values.
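A direct NumPy implementation of Eq. (4):

import numpy as np

def cross_correlation(recon, original):
    """Cross-correlation coefficient of Eq. (4) between a reconstruction and
    the known original image (equally sized arrays)."""
    r = recon.ravel().astype(float)
    o = original.ravel().astype(float)
    r -= r.mean()
    o -= o.mean()
    return float(np.sum(r * o) / np.sqrt(np.sum(r * r) * np.sum(o * o)))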

We shall now explore if there is an optimum in terms of the number of subsets. Such an optimal subset size could then be used to generate the best reconstruction in the smallest amount of (wall-clock) time. First, we evaluated the time per iteration for each configuration. Fig. 1 plots the time per iteration for each subset configuration, for the full-angle case of 180 projections. We observe a roughly linear relationship between the number of subsets and the time per iteration, with SIRT requiring the smallest and SART the longest time (about 5 times more than SIRT, which is significant). This was to be expected given the arguments provided above.

However, in practice we are not interested in time per iteration but in time for convergence.

Next, we test the two strategies presented (the linear and the optimal λ-selection schemes) to choose the relaxation factor λ for each possible integer-subset configuration.

Fig. 10 – Reconstruction of the baby head from a limited set of 64 projections in an angular range of 180° and the optimized λ-selection scheme shown above with various subset sizes for a fixed R-factor = 0.007. We observe that OS-SIRT with 16 subsets of 4 projections each is the fastest in this case. It reaches this set R-factor value 84% faster than SIRT and 56% faster than SART.

Fig. 2 shows the reconstruction results obtained with the linear λ-selection scheme, for a fixed CC (comparing the reconstruction with the original), which means that all reconstructed images are nearly identical to each other (in terms of statistical error). We observe that the smaller the number of subsets, the greater the number of iterations that are required to reach the set convergence threshold. We measured that with the linear λ-selection scheme SART on the GPU takes nearly twice as long as SIRT, and using 10 subsets (5 for the limited-angle case with fewer projections) achieves the best timing performance compared to the other subset configurations.

For a CPU implementation, where the wall-clock time is directly related to the number of iterations, SIRT would be roughly 87/6 ≈ 14 times slower than SART. However, due to the mentioned overhead involved in the GPU-based framework, different wall-clock times are produced with a GPU implementation and the ratio is not that severe.

We next evaluate the optimal λ-selection scheme. Fig. 3 shows the plot of the optimal λ found for each subset configuration for a CC = 0.95. We observe that the relationship is far from linear, with λ being relatively stable at around 0.95 until OS = 45 and then falling off to around 0.6 for SART. Thus, the linear curve underestimates λ most of the time, leading to unnecessarily small corrective updates. Fig. 4 shows the corresponding reconstruction results for the Barbara dataset. We see that SART now is the fastest algorithm, roughly 20 times faster than SIRT.

Figs. 5 and 6 compare the reconstruction quality vs. wall-clock time for both λ-schedules, respectively.

We clearly observe that SIRT behaves very similarly in both cases, due to the similar λ-factor chosen, while the other subsets converge much more expediently for the optimal selection scheme, with SART reaching convergence after just one iteration, closely followed by OS-SIRT 60.

The tendency of SART to produce reconstructions noisier than the original and that of SIRT to produce reconstructions smoother than the original is also demonstrated in Fig. 7, where we show the renditions of a line profile across another area of the Barbara image (only for the original image and reconstructions with SART and SIRT).

We also studied a medical dataset, a slice of a baby head of size 256². To assess the error we used the R-factor:

R = \frac{\sum_j \left(|SO_j| - |SC_j|\right)}{\sum_j |SO_j|}        (5)

where the SO and SC are the acquired and simulated sinograms, respectively. The R-factor is more practical for real reconstruction scenarios because it measures convergence based on the acquired data and not the (typically unavailable) object. All reconstructions were stopped once an R-factor of 0.007 was reached.
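A direct implementation of Eq. (5) as printed (some formulations additionally place an absolute value around the per-element difference):

import numpy as np

def r_factor(acquired_sino, simulated_sino):
    """R-factor of Eq. (5) between the acquired sinogram SO and the sinogram SC
    simulated from the current reconstruction."""
    so = np.abs(acquired_sino).ravel()
    sc = np.abs(simulated_sino).ravel()
    return float(np.sum(so - sc) / np.sum(so))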

Fig. 8 shows the results obtained with the linear λ-selection framework. We observe that OS-SIRT with 10 subsets of 18 projections each reaches the preset R-factor 26% faster than SIRT and 86% faster than SART. Fig. 9 shows the reconstruction results of the baby head obtained with the optimized λ-selection scheme (shown above the table) from 180 projections in an angular range of 180°. We see that OS-SIRT with 20 subsets is fastest in this case. We also note that the time is about 7 times faster than with the linear selection scheme since the optimal λ-factor is higher. Finally, Fig. 10 presents reconstructions of the baby head from a limited set of 64 projections in an angular range of 180°, again using the optimized λ-selection scheme. We observe that OS-SIRT with 16 subsets is the fastest in this case. The reconstruction time is close to a mere 10th of a second, which demonstrates the capability of GPUs to also facilitate iterative reconstructions in interactive modes (10 frames/s).

4. Conclusions

We have shown that iterative reconstruction methods used in medical imaging, such as EM or SIRT, have special properties when it comes to their acceleration on GPUs. While splitting the data used within each iterative update into a larger number of smaller subsets has long been known to offer greater convergence and computation speed on the CPU, it can be vastly slower on the GPU. This is a direct consequence of the thread fill rate in the projection and backprojection phase. Larger subsets spawn a greater number of threads, which keeps the pipelines busier and also reduces the latencies incurred by a greater number of passes and context switches. This is different from the behavior on CPUs, where this decomposition is less relevant in terms of computation overhead per iteration.

We have also shown that these effects can be mitigated by optimizing the relaxation factor λ, restoring the theoretical advantage of data decompositions into smaller ordered subsets.

In this paper we were mainly concerned with demonstrating that the poor per-iteration GPU performance for iterative CT with small subsets (in the limit SART) can be compensated for by choosing an appropriate λ-factor for the reconstruction. This leads to fast convergence of the small-subset methods and thus provides a good amortization of the high per-iteration cost. Doing so in many cases allows even iterative reconstruction procedures to be run at near-interactive speeds, making them more applicable for mainstream CT reconstruction.

The question now remains how one chooses such an optimal λ-value in a practical setting. While this was not the topic of this paper, preliminary results have shown that the optimal number of subsets seems to vary depending on the application domain, the general reconstruction scenario, and also the level of noise present in the data. Thus, in order to identify the optimal subset number, as well as λ, for a new application setting and noise level, to be used later for repeated reconstructions within these scenarios, one may simply run a series of experiments with different numbers of subsets and λ-settings, and then use the setting combination with the shortest wall-clock time required for the desired reconstruction quality. In fact, such strategies are typical for GPU-accelerated general-purpose computing (GPGPU) applications. For example, the GPUBench suite [20] was designed to run a vast set of benchmarks to determine the capabilities of the tested hardware.
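Such a calibration sweep could look like the following sketch, where reconstruct is a hypothetical user-supplied routine that runs the GPU reconstruction until the desired quality is reached:

import itertools
import time

def sweep_settings(reconstruct, target_quality, subset_counts, lambdas):
    """Brute-force search over (number of subsets, lambda), as suggested in the
    text: run each combination, record the wall-clock time needed to reach the
    desired reconstruction quality, and keep the fastest combination."""
    best = None
    for S, lam in itertools.product(subset_counts, lambdas):
        t0 = time.perf_counter()
        reconstruct(num_subsets=S, relaxation=lam, stop_at=target_quality)
        elapsed = time.perf_counter() - t0
        if best is None or elapsed < best[0]:
            best = (elapsed, S, lam)
    return {"time_s": best[0], "subsets": best[1], "lambda": best[2]}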

We also believe that our findings with SIRT readily extend to EM, since the two methods have, as far as the computations are concerned, similar operations and overhead, and we intend to study these dependencies more closely in future work. Finally, although we have used GLSL shaders to obtain the results presented in this paper, similar symptoms also occur in CUDA, where synchronization operations have to be formally called to finish execution of all threads within a thread block before the pipeline resumes. In general, the underlying hardware, its architecture, and the overall thread management remain the same; just the API is different, enabling a tighter control over the threads and also memory. In future work, we plan to study the reported effects to a more detailed extent in CUDA, to determine if a shift in the performance-optimal subset configuration occurs. But in fact, this is likely to happen for every new generation of the hardware.

Conflict of interest

None declared.

Acknowledgements

This work was partially funded by NSF grant CCR-0306438, NIH grant R21 EB004099-01 and GM31627 (DAA), the Keck Foundation, and the Howard Hughes Medical Institute (DAA).

References

[1] A. Andersen, Algebraic reconstruction in CT from limited views, IEEE Transactions on Medical Imaging 8 (1989) 50–55.


[2] A. Andersen, A. Kak, Simultaneous Algebraic Reconstruction Technique (SART): a superior implementation of the ART algorithm, Ultrasonic Imaging 6 (1984) 81–94.

[3] Y. Censor, On block-iterative maximization, Journal of Information and Optimization Science 8 (1987) 275–291.

[4] R. Fernando, M. Kilgard, The Cg Tutorial: The Definitive Guide to Programmable Real-time Graphics, Addison-Wesley Professional, 2003.

[5] P. Gilbert, Iterative methods for the 3D reconstruction of an object from projections, Journal of Theoretical Biology 76 (1972) 105–117.

[6] R. Gordon, R. Bender, G. Herman, Algebraic Reconstruction Techniques (ART) for three-dimensional electron microscopy and X-ray photography, Journal of Theoretical Biology 29 (1970) 471–481.

[7] H. Hudson, R. Larkin, Accelerated image reconstruction using ordered subsets of projection data, IEEE Transactions on Medical Imaging 13 (1994) 601–609.

[8] J. Kole, F. Beekman, Evaluation of accelerated iterative X-ray CT image reconstruction using floating point graphics hardware, Physics in Medicine and Biology 51 (2006) 875–889.

[9] K. Mueller, R. Yagel, Rapid 3D cone-beam reconstruction with the Algebraic Reconstruction Technique (ART) by using texture mapping hardware, IEEE Transactions on Medical Imaging 19 (12) (2000) 1227–1237.

[10] J. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A.E. Lefohn, T. Purcell, A survey of general-purpose computation on graphics hardware, Eurographics (2005) 21–51.

[11] T. Schiwietz, T. Chang, P. Speier, R. Westermann, MR image reconstruction using the GPU, Proceedings of SPIE 6142 (2006) 1279–1290.

[12] L. Shepp, Y. Vardi, Maximum likelihood reconstruction for emission tomography, IEEE Transactions on Medical Imaging 1 (1982) 113–122.

[13] F. Xu, K. Mueller, Real-time 3D computed tomographic reconstruction using commodity graphics hardware, Physics in Medicine and Biology 52 (2007) 3405–3419.

[14] F. Xu, K. Mueller, Accelerating popular tomographic reconstruction algorithms on commodity PC graphics hardware, IEEE Transactions on Nuclear Science 52 (2005) 654–663.

[15] F. Xu, K. Mueller, M. Jones, B. Keszthelyi, J. Sedat, D. Agard, On the efficiency of iterative ordered subset reconstruction algorithms for acceleration on GPUs, in: MICCAI (Workshop on High-performance Medical Image Computing & Computer Aided Intervention), New York, September, 2008.

[16] X. Xue, A. Cheryauka, D. Tubbs, Acceleration of fluoro-CT reconstruction for a mobile C-Arm on GPU and FPGA hardware: a simulation study, Proceedings of SPIE 6142 (2006) 1494–1501.

[17] G. Zeng, G. Gullberg, Unmatched projector/backprojector pairs in an iterative reconstruction algorithm, IEEE Transactions on Medical Imaging 19 (5) (2000) 548–555.

[18] http://www.gpgpu.org.

[19] http://www.nvidia.com/object/cuda_develop.html.

[20] http://graphics.stanford.edu/projects/gpubench.