
University of California

Los Angeles

An FPGA Architecture for

Real-Time 3-D Tomographic Reconstruction

A thesis submitted in partial satisfaction

of the requirements for the degree

Master of Science in Electrical Engineering

by

Henry I-Ming Chen

2012

© Copyright by

Henry I-Ming Chen

2012

The thesis of Henry I-Ming Chen is approved.

John Villasenor

William J. Kaiser

Dejan Markovic, Committee Chair

University of California, Los Angeles

2012


For all of those who have been my teachers.


Table of Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.1 Computed Tomography . . . . . . . . . . . . . . . . . . . . . . . . 3

2.1.1 Radon Transform . . . . . . . . . . . . . . . . . . . . . . . 3

2.1.2 Filtered Backprojection . . . . . . . . . . . . . . . . . . . . 6

2.2 Cone-Beam Single-Slice Rebinning . . . . . . . . . . . . . . . . . . 8

3 FPGA Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.1 Prior Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.2 Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.2.1 FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.2.2 Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.2.3 Communications Interfaces . . . . . . . . . . . . . . . . . . 17

3.2.4 Design Tools and Libraries . . . . . . . . . . . . . . . . . . 19

3.3 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.3.1 Rebinning . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.3.2 Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.3.3 Backprojection . . . . . . . . . . . . . . . . . . . . . . . . 28

3.3.4 Runtime Testing . . . . . . . . . . . . . . . . . . . . . . . 32

4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34


4.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.1.1 Hardware Implementation . . . . . . . . . . . . . . . . . . 34

4.1.2 Software Reference . . . . . . . . . . . . . . . . . . . . . . 36

4.2 Hardware Resource Utilization . . . . . . . . . . . . . . . . . . . . 37

4.3 Full-Scale System . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4.3.1 Single-Engine Scaling . . . . . . . . . . . . . . . . . . . . . 39

4.3.2 Multi-Engine Scaling . . . . . . . . . . . . . . . . . . . . . 41

5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

5.1 Summary of Research Contributions . . . . . . . . . . . . . . . . 46

5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48


List of Figures

2.1 Geometry of a set of projection line integrals . . . . . . . . . . . . 4

2.2 Sinogram of a delta function . . . . . . . . . . . . . . . . . . . . . 5

2.3 Example of an image and its corresponding sinogram . . . . . . . 5

2.4 Fan-beam projection scanning . . . . . . . . . . . . . . . . . . . . 6

2.5 Imaging a 3-D object using stack of 2-D image slices . . . . . . . 8

2.6 Helical scan trajectory . . . . . . . . . . . . . . . . . . . . . . . . 9

2.7 Phantom sinogram using helical cone-beam scanning . . . . . . . 10

3.1 ROACH system block diagram . . . . . . . . . . . . . . . . . . . . 15

3.2 ROACH shared memory interfaces . . . . . . . . . . . . . . . . . 18

3.3 ROACH bus-attached external memory . . . . . . . . . . . . . . . 19

3.4 Designing in the System Generator environment . . . . . . . . . . 20

3.5 FPGA reconstruction flow . . . . . . . . . . . . . . . . . . . . . . 20

3.6 Rebinning architecture . . . . . . . . . . . . . . . . . . . . . . . . 21

3.7 Rebinning module implementation block diagram . . . . . . . . . 22

3.8 Rebinning factor buffering and readout . . . . . . . . . . . . . . . 24

3.9 Filtering architecture . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.10 Backprojection architecture . . . . . . . . . . . . . . . . . . . . . 29

3.11 Sinogram projection in-memory zero-padding . . . . . . . . . . . . 30

3.12 Reconstruction engine runtime test architecture . . . . . . . . . . 33

4.1 Filtering trade-off in parallelized computation times . . . . . . . . 44


4.2 Rebinning trade-off in parallelized computation times . . . . . . . 45


List of Tables

4.1 Image Throughput Comparison . . . . . . . . . . . . . . . . . . . 37

4.2 FPGA Hardware Resource Utilization Summary . . . . . . . . . . 38

4.3 External Memory Utilization Summary . . . . . . . . . . . . . . . 38

4.4 Full-Scale Image Throughput Comparison . . . . . . . . . . . . . 40

4.5 Full-Scale External Memory Requirements . . . . . . . . . . . . . 40

4.6 Expected Real-Time Performance Gap . . . . . . . . . . . . . . . 41


Acknowledgments

I would like to thank, first and foremost, my advisor, Professor Dejan Markovic.

Without his continued help and support, I could not have completed this work.

I would also like to thank Professors John Villasenor and William J. Kaiser for

their encouragement and keen insights throughout this process.

My gratitude to Andy Kotowski, Tim Coker, and Dan Oberg of Rapiscan

Systems for enabling us to engage this topic. I also owe a great deal to Dr. Marta

Betcke, University College London, without whose gracious help the black magic

of the reconstruction algorithm would remain an unsolved mystery before me.

The assistance of Dr. Jianwen Chen has also been invaluable.

To all of my colleagues of DMGroup—Dr. Chia-Hsiang Yang, Dr. Victoria
Wang, Dr. Chaitali Biswas, Tsung-Han Yu, Rashmi Nanda, Sarah Gibson,
Chengcheng Wang, Vaibhav Karkare, Fengbo Ren, Fang-Li Yuan, Richard
Dorrance, Yuta Toriyama, Kevin Dwan, Qian Wang, and Hari Chandrakumar—

thank you all for the coffee runs, the useful discussions, the mentoring, and the

friendship. A great thank-you, too, to Kyle Jung, who always made sure we had

everything we needed.

Thank you to Dan Werthimer and the rest of the CASPER collaboration for

helping me develop the skills needed to complete this project, not to mention the

platform to implement it on!

I sincerely appreciate all that my parents and sister have done for me in order

to get me to this point. Without them, I would be nowhere. And, finally, I am

grateful for my dear Helen, who helped me to realize that I could achieve more.


Abstract of the Thesis

An FPGA Architecture for

Real-Time 3-D Tomographic Reconstruction

by

Henry I-Ming Chen

Master of Science in Electrical Engineering

University of California, Los Angeles, 2012

Professor Dejan Markovic, Chair

Tomographic imaging is a cross-sectional technique that has broad uses in

medical, industrial, and security applications. However, it requires filtered back-

projection, an O(n³) algorithm, in order to reconstruct images from the scanning

system. Modern helical cone-beam scanning systems have the ability to scan at

even faster rates, further stressing the need for accelerated reconstruction meth-

ods. Moreover, these new scanners have unique geometries that require additional

processing. A scalable hardware architecture based on FPGAs is presented to deal

with the computational complexity of this kind of image reconstruction, demon-

strating a 5× throughput improvement over a reference software implementation.

The system is also taken as the starting point for a proposed architecture that can

extend performance by another 600×, making it suitable for high-speed real-time

scanner systems.


CHAPTER 1

Introduction

Tomography is a cross-sectional imaging technique in which an object is illumi-

nated from many angles using penetrating radiation. Since its development in

the 1980s, tomographic imaging has found many uses in medical, industrial, and

security applications.

Computed tomography (CT), one of the primary tomographic imaging tech-

niques, is a very computationally intensive undertaking, needing an O(n³) recon-

struction algorithm. Because of this, CT imaging tends to be a very slow process,

putting great constraints on its ability to be used in functional or real-time imag-

ing settings. Accelerating CT reconstruction is an ongoing effort.

As general-purpose CPUs have begun to slow in their scaling-derived perfor-

mance increases, alternative approaches are being sought in order to continue

meeting the computational requirements of CT reconstruction. By the nature

of the algorithms involved—highly iterative and regular operations—CT recon-

structions are ideal candidates for acceleration by field programmable gate arrays

(FPGAs). FPGAs are reconfigurable hardware, offering a middle road between

a custom application-specific integrated circuit (ASIC), which offers high perfor-

mance but fixed functionality, and CPUs, which are highly programmable but

not very efficient [1]. Using an FPGA, performance can be accurately and de-

terministically characterized, a highly desirable trait in the design of real-time

systems. FPGAs also allow for greater customization at the system level, being


able to add different types of memory and communications interfaces.

In this work, a scalable FPGA-based design is presented that targets these

computational requirements. Special consideration is paid to making this solution

a framework that can be used with high-speed real-time CT scanning systems. An

overview of the algorithms involved in such a system is first covered in Chapter 2.

With the basic requirements in mind, the hardware platform will be introduced,

along with a description of the system implemented on that platform (Chapter 3).

An analysis of the system performance results follows in Chapter 4.


CHAPTER 2

Algorithm

2.1 Computed Tomography

A conventional CT system is constructed such that the object of interest is placed

at the center of a scanning system, about which a source and detector pair rotate.

As the source projects a radiation ray, the detector measures the absorption due

to all parts of the object along the projection line. By making multiple projections

from all angles around the object, a cross-sectional reconstruction of the object

can be made.

2.1.1 Radon Transform

The tomographic imaging process can be discussed in terms of the Radon trans-

form. The Radon transform describes an object using a series of line integrals

of the object. For example, if the two-dimensional object shown in Figure 2.1

is defined as f(x, y), then a line integral P along a line with source angle θ and

lateral offset t is given by

Pθ,t = ∫_−∞^∞ ∫_−∞^∞ f(x, y) δ(x cos θ + y sin θ − t) dx dy   (2.1)

and the set of all such line integrals over θ and t, P(θ, t), is the Radon transform

of f(x, y) [2]. The visual representation of the transform of an object is also often

called its sinogram, as the Radon transform of a delta function is a sinusoid. This
can be seen in Figure 2.2.

Figure 2.1: Geometry of a set of projection line integrals
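To make Equation 2.1 concrete, the short Python sketch below (an illustration added here, not code from this work) approximates the Radon transform of a small image by rotating it and summing along one axis; the use of scipy.ndimage.rotate and the function name radon_sketch are assumptions made only for the example.

# Minimal sketch: discrete approximation of the Radon transform (Eq. 2.1).
# Rotate-and-sum stands in for the line integrals P(theta, t).
import numpy as np
from scipy.ndimage import rotate

def radon_sketch(image, angles_deg):
    """Return a sinogram with one column per projection angle (square image assumed)."""
    sino = np.zeros((image.shape[0], len(angles_deg)))
    for k, theta in enumerate(angles_deg):
        # Rotating the object by -theta and summing along rows approximates
        # integrating f(x, y) along lines at source angle theta.
        rotated = rotate(image, -theta, reshape=False, order=1)
        sino[:, k] = rotated.sum(axis=0)
    return sino

if __name__ == "__main__":
    img = np.zeros((64, 64))
    img[40, 20] = 1.0                      # a delta function off-center
    sinogram = radon_sketch(img, np.arange(0, 360, 2))
    print(sinogram.shape)                  # (64, 180): detector offset t vs. theta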

Figure 2.3 shows the Radon transform applied to a more complex geome-

try. The Shepp-Logan phantom, Figure 2.3a, is frequently used as a standard

benchmark for the development of tomographic reconstruction algorithms. The

phantom’s sinogram, generated from its Radon transform over 180◦, is shown in

Figure 2.3b.

When applied to tomographic imaging, the 2-D function f(x, y) can be used

to describe a cross-sectional slice of a 3-D body. A ray of penetrating radiation

(typically an X-ray for computed tomography) travels through this 2-D slice along

the line described by θ and t. As this ray passes through the physical object,

it is absorbed or scattered by the matter it encounters. If f(x, y) describes the

absorption or scattering of the radiation ray due to the object at each spatial

location (x, y), then the line integral sums up the total attenuation due to the

object as the X-ray passes through. In this way, one can “calculate” a line

integral by measuring the total attenuation of the X-ray. The Radon transform

of a physical object can therefore be taken by measuring the attenuations of
X-rays projected from multiple angles θ with different offsets t.

Figure 2.2: Sinogram of a delta function. (a) Delta function image; (b) sinogram of the delta function taken over 360◦ (detector vs. θ).

Figure 2.3: Example of an image and its corresponding sinogram. (a) Shepp-Logan phantom; (b) sinogram of the phantom taken over 180◦ (detector vs. θ).

Figure 2.4: Fan-beam projection scanning

When considering the physical implementation of an imaging system, taking

the Radon transform using a series of offset line integrals can be inefficient. Such a

scanning geometry would require a series of linear scans for every angular rotation

in order to cover the object of interest. A simpler scanning geometry uses a single

point source to project rays in a fan-beam (Figure 2.4). If received by a bank of

detectors, then full coverage of the object can be achieved more rapidly.

2.1.2 Filtered Backprojection

Inverting the Radon transform comprises the heart of computed tomography; by

undoing the Radon transform of the scanning process, an image of the original

object can be recovered from projection data. It can be seen as “unsmearing”


the image data back along each projection line to re-form the original image.

By using the Fourier slice (or projection-slice) theorem, it is possible to derive

a reconstruction function for f(x, y) given as

f(x, y) = ∫_0^π Qθ(x cos θ + y sin θ) dθ   (2.2)

where Qθ(t) is

Qθ(t) = ∫_−∞^∞ Sθ(w) |w| e^(2πjwt) dw   (2.3)

and Sθ(w) is the Fourier transform of a line projection along a projection angle

θ. Qθ thus represents a filtering operation on Sθ using a filter with frequency

response |w| [2]. Therefore, the reconstruction process is generally referred to as

“filtered backprojection” (FBP), where backprojection alludes to the reversal of

the “forward projection” of the line integral.

This expression is readily discretized for real-world applications as

img(x, y) = Σ_θ=0^K Qθ(x cos θ + y sin θ)   (2.4)

More importantly, this gives the basic form for the computed tomography

filtered backprojection algorithm:

foreach θ do
    filter sino(θ, ∗);
    foreach x do
        foreach y do
            n = x cos θ + y sin θ;
            img(x, y) = sino(θ, n) + img(x, y);
        end
    end
end
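The pseudocode translates almost line-for-line into software. The following Python sketch (an illustration, not the reference implementation used in this work) applies a simple ramp filter in the frequency domain and then accumulates according to Equation 2.4, using nearest-neighbor indexing; the array layout sino[t, θ] is an assumption made for the example.

# Minimal sketch of filtered backprojection (Eqs. 2.3-2.4), assuming the
# sinogram is arranged as sino[t, theta] with t centered on the detector array.
import numpy as np

def fbp_sketch(sino, thetas_rad, n_pix):
    n_t = sino.shape[0]
    # Ramp filter |w| applied per projection angle in the frequency domain.
    freqs = np.fft.fftfreq(n_t)
    filtered = np.real(np.fft.ifft(np.fft.fft(sino, axis=0) * np.abs(freqs)[:, None], axis=0))

    img = np.zeros((n_pix, n_pix))
    # Pixel coordinates measured from the image center.
    xs = np.arange(n_pix) - n_pix / 2.0
    ys = np.arange(n_pix) - n_pix / 2.0
    X, Y = np.meshgrid(xs, ys, indexing="ij")
    for k, theta in enumerate(thetas_rad):
        # n = x*cos(theta) + y*sin(theta), shifted to a non-negative index
        n = np.round(X * np.cos(theta) + Y * np.sin(theta) + n_t / 2.0).astype(int)
        n = np.clip(n, 0, n_t - 1)        # guard against indexing out of bounds
        img += filtered[n, k]             # accumulate: img(x, y) += sino(theta, n)
    return img

if __name__ == "__main__":
    n_pix, n_theta = 64, 90
    sino = np.random.rand(128, n_theta)   # placeholder sinogram
    image = fbp_sketch(sino, np.linspace(0, np.pi, n_theta, endpoint=False), n_pix)
    print(image.shape)                     # (64, 64)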


Figure 2.5: Imaging a 3-D object using stack of 2-D image slices

This parallel-beam filtered backprojection can also be applied to fan-beam

scanners if the fan-beam images are first converted to parallel-beam sinograms [3].

In this way, the same backprojection technique can be applied to different scanner

geometries.

2.2 Cone-Beam Single-Slice Rebinning

Because CT scanning is based on cross-sectional slices of an object, it is an

inherently two-dimensional process. In order to image all parts of an object, or to

create a three-dimensional reconstruction, a stack of 2-D image slices is needed

(Figure 2.5). The basic approach for this is a step-and-repeat methodology,

in which a scanner system moves linearly along an object, taking projections at

small distance intervals. However, this solution presents several major drawbacks,

primarily that it can be extremely slow, and that it greatly increases the amount

of exposure to ionizing X-ray radiation.

Several different scanning geometries and accompanying reconstruction tech-

niques have been proposed in order to overcome these downsides. The approach
presented here is a proprietary adaptation [4] of the cone-beam single-slice
rebinning method discussed by Noo et al. [5].

Figure 2.6: Helical scan trajectory

To summarize, the cone-beam single-slice rebinning (CB-SSRB) technique

uses helical cone-beam scanning that traverses a continuous helical path around

the object of interest (Figure 2.6). Total coverage of the scanned volume is

achieved by using a volumetric cone-beam projection. Because the cone-beam

projects along the z -axis as well as the x - and y-axes, it can account for the gaps

along the helical scan path.

The different geometric setup of helical cone-beam scanning clearly produces

scan images different from the 2-D Radon transform, as seen in Figure 2.7. There-

fore, in order to recover images using conventional FBP techniques, additional

pre-processing of helical cone-beam scan images must be performed. This pro-

cedure is termed “rebinning,” with the end result being a conversion of helical

cone-beam scan images into a sequence of 2-D projection images.

Details on the mathematics of rebinning are found in the literature [2,5], but

an intuition of its mechanism can be achieved by thinking of it as a form of

vector projection. As a cone beam sweeps over a volume, the desired result is
to form image slices from within that volume. To the first order, recovering an
image slice on a rebinning center λ is the projection of the scan projection vector
onto the λ-plane. As such, it is performed by weighting the cone-beam scan
detector value with a geometry-dependent scaling factor. Because the original
scan is performed with a conical beam, slices taken from within that volume
will be fan-beam images. These can be reconstructed natively using fan-beam
backprojection techniques, or converted into parallel-beam images.

Figure 2.7: Phantom sinogram using helical cone-beam scanning
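As a rough software picture of this weighting step (a sketch under stated assumptions, not the proprietary algorithm of [4]), rebinning can be viewed as a gather of scan-buffer samples by pre-computed index, each scaled by a pre-computed weight; the arrays idx and wts below stand in for the rebinning factor tables described in Chapter 3.

# Sketch of rebinning as weighted gather: each output pixel of a rebinned
# slice is one cone-beam sample scaled by a geometry-dependent factor.
# The index/weight tables are assumed to be pre-computed offline.
import numpy as np

def rebin_sketch(scan_buffer, indices, weights):
    """scan_buffer: flat array of cone-beam detector samples (3 revolutions).
    indices, weights: one entry per rebinned output pixel."""
    return weights * scan_buffer[indices]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scan = rng.random(3 * 8 * 192 * 288)              # 3 revolutions of scan data
    idx = rng.integers(0, scan.size, size=200 * 192)  # hypothetical factor tables
    wts = rng.random(200 * 192)
    slice_pixels = rebin_sketch(scan, idx, wts)
    print(slice_pixels.shape)                          # (38400,) = n_pix * n_src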

For high-speed imaging applications, traditional rotating scanners are being

replaced with new geometries [6]. In a traditional scanner, where a source/detector

pair rotate, scanning speeds are limited by the G-forces acting on the rotation

mechanism. The new geometries use scanner rings studded with numerous sta-

tionary sources and detectors. In such a system, rotational scanning is simulated


by firing the sources in sequence around the ring. Cone-beam helical scanning

can be achieved by extending these rings in the lateral direction. In this way, the

cone beam can be received over an area, and the firing sequence can emulate a

helical path.


CHAPTER 3

FPGA Implementation

CT reconstruction has always been a computationally demanding process, due

to backprojection being an O(n³) algorithm. Despite its high time complexity,

the calculation requirements are actually relatively low. Backprojection itself is

highly repetitive, but uses very simple arithmetic operations. If the sin/cos

functions are pre-computed and stored as look-up tables, then each iteration

only requires two multiplications and two additions. These characteristics make

the algorithm a prime candidate for FPGA acceleration. The remainder of this

chapter details the implementation of an FPGA reconstruction engine using an

algorithm [4] suitable for the helical cone-beam CT system described by Morton

et al. in 2009 [6].

The reference system uses 768 projector sources and 1152 detectors per ring,

with 8 rings in a row to simulate helical rotation. Each revolution around the

rings is used to rebin 32 image slices, from which 800×800-pixel images are gener-

ated. Because of limited hardware resources, the reconstruction engine presented

here processes a reduced image size. The design is for a helical cone-beam scan-

ner having 192 sources and 288 detectors per ring, but maintains 8 rings per

revolution. Using this projection data, 200×200-pixel images are reconstructed.


3.1 Prior Work

Methods of accelerating filtered backprojection for CT imaging are fairly well-

studied, owing to its extensive use in medical and industrial applications. In

particular, the benefits of FPGA acceleration have been explored. In an early

work using Xilinx’s first-generation Virtex FPGAs [7], CT reconstruction was

performed on moderately-sized images (512×512). Backprojection was acceler-

ated using the FPGA using a 4×-parallel architecture that was capable of up

to a 20× speedup versus software backprojection. In addition to demonstrating

a parallel backprojection architecture, this work also showed that a fixed-point

implementation using moderate bitwidths can have as little as 0.015% error relative to the

software floating-point computation.

In two fairly recent works [8,9], an Altera Stratix-II FPGA [10] was used as a

computational co-processor by connecting it to the system’s CPU via the Hyper-

Transport interconnect protocol [11]. A multi-threaded software implementation

was used as the reference, against which up to 103× speedup was achieved by

using a 128×-parallel architecture.

While these works have clearly demonstrated the capabilities of FPGAs for

accelerating backprojection, their architectures are not suited for real-time appli-

cations. In particular, the systems presented do not provide a complete solution

for filtered backprojection; filtering is performed as a pre-processing step before

offloading to the FPGA for accelerating backprojection, and its computational

needs are not taken into account. As will be examined in further detail in Chap-

ter 4, filtering represents a non-negligible portion of computation, and one that

benefits greatly from FPGA acceleration. More importantly, the systems are op-

timized for offline acceleration in which all data is readily available. Projection

data must be loaded into memory buffers accessible to the FPGA on a sinogram-


by-sinogram basis using the CPU of the host system. These software-based archi-

tectures do not properly address the needs of a real-time implementation suitable

for scanner integration. Furthermore, these works do not demonstrate the feasi-

bility of implementing the rebinning function in hardware to allow use with helical

cone-beam scanners, and thus can only reconstruct parallel-beam sinograms.

The reference system provides a complete real-time solution that includes the

necessary rebinning and filtering operations, implemented on four Cell Broad-

band Engines (CBEs) [12]. While Cell processors are capable of providing an

accelerated solution suitable for real-time applications, the result is still inherently a

software implementation, subject to the non-deterministic runtime behavior of a

software stack. For example, the real-time requirements can only be guaranteed

by over-provisioning the system’s capabilities and allowing for more than a 10%

safety margin in performance. More importantly, the Cell processor has a unique

architecture that poses significant challenges for customizing applications that

can properly extract its full performance capabilities. This has led to limited

adoption and uncertainty about the longevity of the microprocessor architecture.

This work, therefore, seeks to present a solution that can bridge the two bodies of

work, offering a complete hardware solution for the required algorithm suitable

for integrating into a real-time scanner system.

3.2 Platform

The entirety of this reconstruction algorithm was demonstrated on the open-

source academic research platform “ROACH” (Reconfigurable Open Architecture

Computing Hardware) [13]. ROACH was developed as part of an international

collaboration of primarily radio astronomy research institutions as a common

high-performance computing platform for real-time digital signal processing ap-


plications [14]. A block diagram of the system is shown in Figure 3.1. While
much of the information about the ROACH is found among CASPER documen-
tation, the memory and communications interfaces figure heavily into the system
design, and are thus highlighted here.

Figure 3.1: ROACH system block diagram (Xilinx Virtex-5 SX95T FPGA, PowerPC 440EPx, two QDR-II+ SRAMs, two DDR2 DRAMs, GPIO, 1 Gb and 10 Gb Ethernet, OPB bus)

3.2.1 FPGA

ROACHs are equipped with FPGAs from the Xilinx Virtex-5 family; this partic-

ular instance had a Virtex-5 SX95T component optimized for DSP applications

with a large number of embedded multipliers (“DSP48E”) and dual-ported SRAM

blocks (“BlockRAM” or “BRAM”). In addition to its 14,720 logic SLICEs¹, an

SX95T has 640 DSP48Es and 488 18-kbit BRAMs [15]. A Virtex-5’s DSP48E

is a “hard macro”; that is, it is a circuit-level implementation of a 25×18-bit

multiplier-accumulator, and consumes no reconfigurable logic resources in order

to create a multiplier.

¹A Virtex-5 SLICE is four 6-input lookup tables (LUTs) and four flip-flops (FFs).

3.2.2 Memory

To aid in real-time DSP applications, ROACH boards are designed with high-

bandwidth memory interfaces connected directly to the FPGA. Each board can

be outfitted with a commercial-standard DDR2 SDRAM DIMM of up to 1 GB and

two QDR-II+ SRAM chips of up to 72 Mb.

Having both SDRAM and QDR SRAM helps cover different usage models

for memory requirements. As a DRAM technology, DDR2 provides high storage

capacity, at the cost of increased access complexity. DRAM carries very high

random-access latencies due to its structure and internal architecture, and as

such the highest bandwidth is achieved when accessed in large sequential bursts.

DDR2 DIMMs (dual inline memory modules) run a 64-bit dual data rate (DDR)

interface, meaning that data is transferred on both the rising and falling edges of

the clock. Data must be accessed in small bursts of consecutive memory locations.

ROACH uses ECC (error-correcting code) memory modules which have an

extra parity bit per byte, giving a 72-bit interface. The FPGA’s DDR2 controller

provides a 144-bit, two-word-burst interface to the FPGA fabric. Because of the

way DDR2 memory is implemented at the physical level, the ROACH can only

support operating it at a limited number of fixed frequencies: 150, 200, 266,

300, or 333 MHz. Read and write requests are executed in the order of issue,

but due to low-level actions taken by the controller such as row switching or

refreshing, their execution latencies are non-deterministic. A FIFO buffer sits at

the interface boundary, providing some elasticity to absorb the non-determinism.

In contrast, the QDR (quad data rate) SRAM comes in significantly lower


densities, but provides much more flexibility in its interface. Because it is an

SRAM technology, there is no penalty for fully-random access patterns. QDR

achieves its quadrupled data rate by providing independent read and write ports,

sharing address and control signals, operating at DDR. The highest bandwidth

is achieved by properly interleaving read and write accesses so that, for a given

clock cycle, two words are written concurrently with two words being read.

Like DRAM, the QDR interface specification also requires bursted accesses,

so that switching on the control lines is reduced and that requests are spaced out

so reads and writes can be interleaved. The QDR memories on ROACH have

18-bit data buses with a 4-word DDR burst of consecutive memory locations.

As with the DRAM used, the non-multiple-of-eight data width is nominally for

ECC parity bits; in both interfaces 9 bits is considered to be a single byte. The

FPGA’s QDR controller abstracts this physical layer and presents an interface

with a two-word burst of 36 bits each. The SRAM structure allows the controller

to have a fixed 10-cycle read latency and 1-cycle write latency. The QDR chips

have the flexibility to run at any clock frequency between 150 and 400 MHz.

3.2.3 Communications Interfaces

In addition to the computational and memory resources provided by the FPGA,

the ROACH platform was also selected for its multitude of interface capabilities.

Data can be transferred to the FPGA at three levels: as raw electrical signals,

network packets, or data files.

Included on the ROACH board is a PowerPC 440EPx embedded processor

with 512 MB of dedicated DDR2 DRAM, non-volatile Flash memory, and a

Gigabit Ethernet interface. This enables the PowerPC to boot a full Linux op-

erating system, providing higher-level communications protocols like telnet and


SSH/SCP. The FPGA is in turn connected directly to the processor's External Bus
Controller (EBC), allowing some memory spaces on the FPGA to be mapped onto
the On-chip Peripheral Bus (OPB). This ability, coupled with the full Linux
programming environment, provides a good balance between standalone operation
of the FPGA and ease of use and communication. Including the local software
stack, the PowerPC has roughly 7 MB/s of bandwidth to the FPGA [16].

Figure 3.2: ROACH shared memory interfaces

A variety of 32-bit devices can be mapped onto the PowerPC bus from the

FPGA. The simplest is a unidirectional single-word software register, which can

be used for sending control settings to the FPGA, or for reporting statuses with

low rates of change. BRAMs can also be configured as Shared BRAMs, which

are bidirectional; the dual-ported memories are connected to both the PowerPC

bus and the FPGA fabric, allowing block data transfer between the two domains.

The shared memory interfaces are diagrammed in Figure 3.2. Finally, the DRAM

and QDR controllers can be attached to bus interfaces as well (Figure 3.3), for

data sizes beyond the capacity of the on-FPGA BRAMs. Because of the 32-bit

bus interface, the parity bits of the DRAM and QDR memories are discarded

when they are used in this way.


Bu

s B

rid

ge

UserLogic

PowerPCOPB Bus

DRAMController

QDRController

DDR2DRAM

QDR-II+SRAM

FPGA

Figure 3.3: ROACH bus-attached external memory

3.2.4 Design Tools and Libraries

In addition to the board itself, a part of the ROACH platform is a set of design

tools and libraries. Rather than code in an HDL like Verilog or VHDL, as would

be the case for most FPGA-based designs, most ROACH development is done

using System Generator, a Xilinx toolbox for the MATLAB Simulink graphical

modeling environment [17, 18]. Figure 3.4 shows an example of designing in the

Simulink/System Generator environment.

System Generator provides bit-/cycle-accurate simulation, as well as auto-

matic data type inference for arithmetic precision and bit growth. The afore-

mentioned ROACH-specific memory controllers and interfaces have been ported

to this environment, along with a large library of commonly-used DSP modules

such as FIRs and FFTs [19]. This allowed development to focus on the system

architecture, instead of low-level computational and interface blocks.


Figure 3.4: Designing in the System Generator environment

Figure 3.5: FPGA reconstruction flow (Rebin → Filter → Backproject)

3.3 System Architecture

The processing flow necessary for reconstructing an image from a helical cone-

beam scanner is broken up into modules that can be assembled sequentially. This

kind of flow-through architecture lends itself well to a streaming implementation.

Scanner data is passed along through the rebinning, filtering, and backprojection

modules (Figure 3.5).

3.3.1 Rebinning

As discussed in Section 2.2 and shown in Figure 3.6, the rebinning function can

be realized as a coefficient weighting of selectively-accessed data points from scan-

ner input. Two weighted samples are then weighted again and added together


in order to perform the fan-to-parallel conversion. This can be seen in the more
detailed block diagram of the rebinning module, Figure 3.7. Because both
coefficients are applied linearly to the data samples, they can be combined into a
single rebinning/fan-to-parallel weight. Doing so reduces the number of hardware
multiplications, memory capacity, and memory bandwidth.

Figure 3.6: Rebinning architecture

Rebinning indices and weights are determined by the particular rebinning

function to be performed and the specific parameters of the scanner geometry.

For a given setup, they are considered to be static, and can be loaded into the

system once at initialization. Because these rebinning factors have a one-to-one

relationship with the rebinning output and are pre-computed offline, they can be

phrased in a sequential manner so as to be suitable for accessing from DRAM.

On the other hand, projection data must be accessed according to the rebinning

indices. As such, the random-access capabilities of SRAM are preferable.


Figure 3.7: Rebinning module implementation block diagram

The rebinning function uses data from three consecutive scan revolutions to

reconstruct a stack of rebinned images on their rebinning center λ. Each pixel

in the rebinned pseudo-fan-beam image² is reconstructed by weighting one point

from the scan data, and two fan-beam images are needed to synthesize a single

parallel-beam sinogram. Therefore, for each revolution forming nλ image slices

of npix pixels using nsrc sources, rebinning needs

2× nλ × nsrc × npix (3.1)

rebinning indices and

2× nλ × nsrc × npix (3.2)

rebinning weights. This particular implementation uses the same rebinning func-

tion for all revolutions of data, so only one set of rebinning factors is needed.

Each stack of nλ images needs to draw repeatedly from three revolutions of

scan data. As each revolution is composed of nring rings of nsrc projection sources

and ndet detectors,

3 × nring × nsrc × ndet   (3.3)

scan projection words must be buffered. This imposes a minimum bitwidth of
⌈log2(3 × nring × nsrc × ndet)⌉ on the index words.

²Termed pseudo because the extra scaling does not mathematically synthesize a fan-beam image, but a scaled version of one.

In the particular scanner parameters being designed for, nλ = 32, nsrc =

192, ndet = 288, npix = 200. Thus 1,327,104 projection points need to be buffered

to construct 32 rebinned images. 2,457,600 21-bit indices are needed to select data

from the scan buffer to be paired with 2,457,600 rebinning weights. So far, all

quantities have been expressed in terms of words; actual memory

requirements are determined by bitwidth selection.
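These counts can be checked directly from the geometry (a worked example using only the figures quoted above):

# Worked example: scan-buffer and rebinning-factor counts for the reduced geometry.
import math

n_lambda, n_src, n_det, n_pix, n_ring = 32, 192, 288, 200, 8

scan_words = 3 * n_ring * n_src * n_det        # Eq. 3.3: three revolutions buffered
factor_pairs = 2 * n_lambda * n_src * n_pix    # Eqs. 3.1/3.2: indices and weights
index_bits = math.ceil(math.log2(scan_words))  # minimum index bitwidth

print(scan_words)     # 1,327,104 projection points
print(factor_pairs)   # 2,457,600 index/weight pairs
print(index_bits)     # 21 bits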

That the index bitwidth, even in this case of a very reduced geometry, is

21 bits sets a very firm bound on determining the bitwidths for the rebinning

factors. Two additional considerations are taken into account: that rebinning

indices and weights are used in pairs, and that both must be loaded into DRAM.

Because of this, powers-of-two bitwidths are preferred, as they are more easily

packed together when using a 32-bit interface (the ECC bits are dropped). Since

the 21-bit index precludes a smaller encoding, both indices and weights are stored

as 32-bit words when written into DRAM. Each DRAM transaction made from

the FPGA thus gives four rebinning index/weight pairs. Because data is loaded

into the FPGA via the PowerPC using a telnet protocol, a TCP/IP API from

the MATLAB Instrument Control Toolbox can be used. The MATLAB

environment makes it straightforward to read the indices and weights from files,

then work with the data as needed to load it via the API.

When the rebinning factors are read from DRAM into the FPGA, they are

first buffered in a pair of BRAMs using a ping-pong scheme. This helps to further

mask the non-deterministic accesses and smooth out data bursts to help ensure

a controlled, continuous data flow from DRAM during rebinning. A buffer depth
of 256 index/weight pairs is used, as that corresponds to fitting in exactly a
single 18-kbit BlockRAM each. The Xilinx BlockRAMs are dual-ported and can
support different memory aspect ratios on the two ports—allowing the memory
to be filled up as 128 128-bit words, but read out as 256 64-bit words. In this
way, a burst of 128-bit words, each composed of two index/weight pairs, is read
sequentially from DRAM and written into a buffer. The buffers are read out as
64-bit words, which are bit-sliced into the 32-bit indices and weights. Figure 3.8
diagrams this flow. Whenever a buffer has been emptied, DRAM accesses are
immediately initiated to refill it while data is drawn from the other buffer.

Figure 3.8: Rebinning factor buffering and readout
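A behavioral model of the unpacking is sketched below (an illustration of the aspect-ratio change described above; the placement of the index in the upper 32 bits and the weight in the lower 32 bits of each pair is an assumption made for the example):

# Sketch: unpack a burst of 128-bit DRAM words into 32-bit index/weight pairs,
# mirroring the 128-in/64-out aspect-ratio change of the ping-pong BRAMs.
MASK32 = (1 << 32) - 1

def unpack_dram_burst(words_128bit):
    pairs = []
    for w in words_128bit:
        for half in ((w >> 64) & ((1 << 64) - 1), w & ((1 << 64) - 1)):
            index = (half >> 32) & MASK32   # assumed: upper 32 bits hold the index
            weight = half & MASK32          # assumed: lower 32 bits hold the weight
            pairs.append((index, weight))
    return pairs

if __name__ == "__main__":
    # One 128-bit word holding the pairs (1, 10) and (2, 20).
    word = ((1 << 32 | 10) << 64) | (2 << 32 | 20)
    print(unpack_dram_burst([word]))  # [(1, 10), (2, 20)]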

As previously discussed, three consecutive revolutions of scan data are stored

in QDR to implement the rebinning function. As each stack of nλ images is

processed, the buffer is updated with a new revolution of scan data on a first-in,


first-out basis. For each revolution, nring × nsrc × ndet words must be replaced,

which is too large a block of data to be practically done using a true FIFO.

Therefore, the QDR buffer is accessed as a circular buffer, where the revolution

to be discarded is overwritten with new data. To account for the circular buffer

addressing, the rebinning indices coming from DRAM are passed through a block

to map the logical rebinning indices n to a physical QDR address D:

D = (n+N) mod 3N (3.4)

where N is the number of words per revolution. As shown in Figure 3.7, rebinning

weights must be delayed to match the latency of this remapping.
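The remapping can be modeled as follows (a sketch that assumes the base of the circular buffer advances by one revolution of N words whenever a new revolution is written; Equation 3.4 corresponds to one snapshot of this mapping):

# Sketch of circular-buffer index remapping for the three-revolution scan buffer.
# N is the number of words per revolution; the base advances as revolutions arrive.
class CircularScanBuffer:
    def __init__(self, words_per_rev):
        self.N = words_per_rev
        self.base = 0                      # physical address of the oldest revolution

    def map_index(self, n):
        # Logical rebinning index n -> physical QDR address D (cf. Eq. 3.4).
        return (n + self.base) % (3 * self.N)

    def advance_revolution(self):
        # The oldest revolution is overwritten, so the base moves forward by N.
        self.base = (self.base + self.N) % (3 * self.N)

if __name__ == "__main__":
    buf = CircularScanBuffer(words_per_rev=8 * 192 * 288)
    buf.advance_revolution()               # after one new revolution has been written
    print(buf.map_index(0))                # 442368, i.e. (0 + N) mod 3N as in Eq. 3.4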

While QDR does not incur latency penalties for random accesses as DRAM

does, its four-word-burst architecture limits access granularity. Physically the

memory array is read and written as a 4M×18-bit array, but the effective interface

is that of a 1M×72-bit array which is accessed every other cycle. This interface

requires that multiple words of 16-bit projection data be packed and unpacked

when transferring in and out of QDR. This can be made transparent to the

rebinning by packing in powers-of-two: each 72-bit QDR word can fit four 16-bit

projection points, allowing the upper ⌈log2(3 × nring × nsrc × ndet)⌉ − 2 bits of

the index to be used as the QDR address and lower 2 bits to select between data

sliced from within the QDR word.
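The address/word-slice split can be sketched as below (an illustration of the packing just described; the ordering of the four 16-bit samples within a 72-bit word is an assumption made for the example):

# Sketch: split a rebinning index into a QDR address (upper bits) and a
# 2-bit selector choosing one of four 16-bit samples packed per 72-bit word.
import math

SCAN_WORDS = 3 * 8 * 192 * 288                     # buffered projection points
INDEX_BITS = math.ceil(math.log2(SCAN_WORDS))      # 21 bits for this geometry

def qdr_lookup(index, qdr):
    address = index >> 2                 # upper INDEX_BITS - 2 bits: QDR word address
    select = index & 0b11                # lower 2 bits: which 16-bit sample in the word
    word = qdr[address]                  # one 72-bit word holds four 16-bit samples
    return (word >> (16 * select)) & 0xFFFF

if __name__ == "__main__":
    qdr = {0: (4 << 48) | (3 << 32) | (2 << 16) | 1}   # samples 1,2,3,4 packed in word 0
    print([qdr_lookup(i, qdr) for i in range(4)])       # [1, 2, 3, 4]
    print(INDEX_BITS)                                    # 21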

After a projection sample is selected by this indexing method, it is multiplied

with its latency-matched weight using an embedded multiplier. In a nominal

rebinning design each weighted sample would form one pixel in a synthesized

fan-beam sinogram, two of which would be added together with a fan-to-parallel

scaling factor to synthesize one pixel in a parallel-beam sinogram. However, this

implementation outputs a pseudo-fan-beam pixel at this point, and corresponding

pairs must be added together to form the desired parallel-beam image. This can


be done simply by interleaving the rebinning factors for the two pseudo-fan-

beams when loading into DRAM so that pixel pairs are selected and weighted

consecutively.

When combined with the interleaving of two fan-beam pixels to form one

parallel-beam pixel, this reduces the output rate to one rebinned,

parallel-beam image pixel every four clock cycles. Though it adds to processing

latency, this hardware limitation does not impact overall system throughput,

which is bounded by the backprojection stage.

3.3.2 Filtering

An additional optimization step is performed after rebinning to reduce the com-

putational complexity of backprojection by exploiting the symmetry in scanning

geometry. As a scanner rotates around an object, it is easy to visualize that the

projections taken from 180◦ to 359◦ scan along the same lines as those taken from

0◦ to 179◦. The resulting sinogram symmetry, which can be seen in Figure 2.2b,

allows a 360◦ scan to be folded and overlaid upon itself to reduce the sinogram by

half; the number of projection angles in the sinogram nθ therefore becomes nsrc/2.

This eliminates half of the computations required to perform backprojection, and

is easily implemented inline by following a specific read and write scheme using

a dual-port BRAM.

In a 2-D image or matrix-based implementation of the algorithm, this op-

eration would be achieved by splitting the image into two halves on the θ-axis

θ = [0 : nsrc/2 − 1] and θ = [nsrc/2 : nsrc − 1]. One half would be flipped up/down

around the θ-axis, then the two half-matrices added to each other. The hardware

implementation notes that the matrix is being processed as a long 1-D vector,

appending the matrix columns consecutively. Flipping a matrix in the up/down

direction means reversing the order of the vector within the column boundaries.
It is also noted that this flip-add operation can also be performed by flipping the
first half of the matrix onto the second half; this simply results in a flipped result
that can be corrected later.

Figure 3.9: Filtering architecture

As the rebinned, parallel-beam sinogram data is being streamed from the

rebinning module, the first (npix × nsrc)/2 samples are written into a BRAM of depth
2⌈(npix × nsrc)/2⌉. As the second half of the vector is output, the BRAM is read out in an

inverted-column manner: a down-counter continuously counts down from npix−1

to 0, addressing each row within a column. Every npix reads, npix is added to the

address to increment to the next column. The sample output from the BRAM

using this read address scheme is added to the sample coming from rebinning. In

this way, the flip-add operation can be performed within the same time window


of rebinning output. The whole sinogram is then written into another BRAM

buffer in order to feed the filter module; the same inverted-column scheme is used

at this point to undo the aforementioned image flip.
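In matrix terms, the flip-add can be sketched as follows (a NumPy illustration of the fold, not the BRAM addressing itself, assuming detector samples run along the rows and projection angles along the columns):

# Sketch of the 360-degree -> 180-degree sinogram fold: flip one angular half
# in the detector (up/down) direction and add it to the other half.
import numpy as np

def fold_sinogram(sino):
    """sino: (n_pix, n_src) array, rows = detector offset t, columns = angle."""
    n_src = sino.shape[1]
    first, second = sino[:, :n_src // 2], sino[:, n_src // 2:]
    # Flipping the first half onto the second (as in the hardware) yields a
    # flipped result that can be corrected later; here we flip the second half.
    return first + np.flipud(second)

if __name__ == "__main__":
    sino = np.arange(200 * 192, dtype=float).reshape(200, 192)
    folded = fold_sinogram(sino)
    print(folded.shape)   # (200, 96): n_theta becomes n_src / 2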

Once the sinogram has been folded in this way, images are reconstructed

using a serial implementation of standard filtered backprojection (FBP). High-

pass filtering is accomplished in the frequency domain using the prototypal FFT-

multiply-IFFT technique, as shown in Figure 3.9. A streaming, radix-2 complex

FFT block is used from the ROACH DSP design library [20]. To account for the

fact that real data is being used for a complex FFT, data from each projection

angle is zero-padded to a length of 2^⌈log2(2 × npix)⌉ samples. Rather than expend

memory resources, the data stream is zero-padded by muxing in a constant zero

for the appropriate samples. For this image size 512-point FFT blocks are used for

both the FFT and IFFT operations. In both transforms, data is cast into 18-bit

numbers, as that offers the best efficiency in the FPGA’s embedded multipliers.

This demonstrator uses a simple ramp filter, with pre-computed filter coeffi-

cients being supplied from an on-FPGA ROM. Some FBP implementations use

more sophisticated high-pass filters to achieve different results in image quality.

Building the filtering stage in the frequency domain allows for flexibility in this

module, as the filtering function can be changed simply by changing the coef-

ficients of the filter function with no structural changes. This can even be done at

runtime, if necessary, by replacing the ROM shown in Figure 3.9 with a Shared

BRAM.
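The FFT-multiply-IFFT structure of Figure 3.9 can be sketched in a few lines of software (an illustration; the simple ramp coefficients here stand in for whatever table is loaded into the ROM):

# Sketch of frequency-domain ramp filtering of one zero-padded projection.
import numpy as np

def ramp_filter_projection(proj, n_pix):
    n_fft = 1 << int(np.ceil(np.log2(2 * n_pix)))   # e.g. 512 for n_pix = 200
    padded = np.zeros(n_fft)
    padded[:len(proj)] = proj                       # zero-pad by muxing in zeros
    coeffs = np.abs(np.fft.fftfreq(n_fft))          # |w| ramp, pre-computable as a table
    return np.real(np.fft.ifft(np.fft.fft(padded) * coeffs))[:len(proj)]

if __name__ == "__main__":
    proj = np.random.rand(200)
    print(ramp_filter_projection(proj, 200).shape)  # (200,)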

3.3.3 Backprojection

Following the algorithm detailed in Section 2.1.2, each angle of sinogram data

coming out of the filter is used in backprojection. The basic backprojection al-


gorithm, while iterative, is highly sequential in that there are no dependencies
within the loop. This means that the loops can be re-ordered in any way for
a serial implementation. Taking results directly from the filter output makes
a by-θs reconstruction approach the most natural, in which the outermost loop
increments on θ. Figure 3.10 illustrates the architecture of this type of
backprojection.

Figure 3.10: Backprojection architecture

Though data is output from the filter at a deterministic rate, the filtered

vectors are buffered in two BRAMs in a ping-pong scheme, as with the rebinning

factors discussed in Section 3.3.1. The ping-pong or double-buffered approach

alternates using each BRAM for reading and writing, such that when one buffer

is being read from, the other is being written into. While there is no need to

absorb non-deterministic latencies in this case, double-buffering here allows the

backprojector to run without interruption. As soon as the loop for one θ has
been completed, the vector of the next has already been written to BRAM and
its loop can begin without pausing.

Figure 3.11: Sinogram projection in-memory zero-padding

For each filtered vector of npix samples buffered in the BRAM, the appropriate

pixels can be selected using the index

n = x cos θ + y sin θ (3.5)

A restriction on this addressing is that it must access a vector at least as long
as the output image diagonal—npix√2. In this case a projection of length npix
is being used to reconstruct an image npix pixels on a side, so the vector must
be zero-padded to avoid indexing out-of-bounds. This is done inline by writing
the samples into the zero-initialized BRAM with an address offset of npix√2/2. In this
way, the samples are centered in a zero-vector of length npix√2 (Figure 3.11).

A lookup table, indexed by an incrementing θ, is used to store pre-computed

sine and cosine values for the buffer address calculation. As the implementation


uses a straightforward loop ordering, simple counters can be used for generating

the x, y, and sin θ/cos θ values.
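A behavioral model of the address generation is sketched below (assuming, for illustration, that x and y are measured from the image center and that the npix√2/2 offset shifts the signed index into the zero-padded buffer; this is not the generated hardware):

# Sketch: compute the projection-buffer read address for pixel (x, y) at angle
# theta, using pre-computed sin/cos lookup tables as in the hardware.
import math

N_PIX = 200
N_THETA = 96
OFFSET = int(round(N_PIX * math.sqrt(2) / 2))       # centers the samples (Fig. 3.11)

SIN_LUT = [math.sin(math.pi * k / N_THETA) for k in range(N_THETA)]
COS_LUT = [math.cos(math.pi * k / N_THETA) for k in range(N_THETA)]

def buffer_address(x, y, theta_idx):
    # x, y measured from the image center, as assumed for this illustration.
    n = x * COS_LUT[theta_idx] + y * SIN_LUT[theta_idx]    # Eq. 3.5
    return int(round(n)) + OFFSET

if __name__ == "__main__":
    print(buffer_address(-100, -100, 0))   # smallest x, y at theta = 0
    print(buffer_address(99, 99, 24))      # near the opposite corner, theta = 45 deg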

The accumulation function of the backprojection process is carried out us-

ing the second QDR SRAM on the ROACH, and is greatly facilitated by the

memory’s dual-ported, fixed-latency architecture. The pixel matrix of the image

is mapped onto a single long array in the SRAM, where a pixel location (x, y)

corresponds to the word at memory location (x−1)npix+y. The full image needs

to be stored in the QDR, requiring that

n²pix   (3.6)

pixels be stored.

The serial, nearest-neighbor implementation of the backprojection loops means

that there are no dependencies or conflicts on a given pixel within the same θ

loop—giving an access spacing of n²pix − 1 clock cycles. This allows the back-

projection to continuously step through the QDR memory using a single address

counter.

In the steady-state, one counter leads the other and reads out pixels from the

accumulating image. Samples are concurrently selected from the projection buffer

according to the indexing given in Equation 3.5. After latency-matching to the

image pixels being read from QDR, the projection sample is added to the pixel

and written back into the accumulation buffer. Addressing for the writing phase

is simply a delayed version of the read address—in this instance 10 cycles for the

QDR latency, plus an additional 3 cycles for the accumulation adder latency.

The system’s streaming dataflow architecture means that the backprojector

runs continuously with no gaps. Because the system’s throughput is currently lim-

ited by backprojection, each backprojected image must be immediately followed


by a new one to achieve maximum performance. The read and write accesses to

the accumulation memory can be run in the same way at all times if the data

flow is handled specially at the image boundaries. Since every pixel location is

continuously read out during accumulation, the completed backprojection

will automatically be output from the memory at the (nθ + 1)th read iteration.

This set of data is simply flagged as valid, with no other control structures nec-

essary. At this point, the accumulator must be cleared to begin backprojecting

the next image. The memory can be cleared of the previous image by muxing in

a zero to the accumulation adder for the subsequent n²pix writes, with no changes

to the QDR access pattern.

3.3.4 Runtime Testing

One of the key features of this implementation is that it is designed to be directly

integrated into a scanner system for real-time image reconstruction, rather than

be an offline post-processing accelerator. However, due to the lack of a suitable

system for such integration, two minor enhancements allow testing the design

without a CT scanner. Because the platform is standalone in terms of the re-

sources it needs to carry out the reconstruction function, the only missing pieces

are interfaces to get raw scan data into and reconstructed image data out of the

design.

This modification is greatly simplified by the streaming architecture of the

reconstruction engine. Because the data input interface assumes nothing more

than data samples accompanied by a valid flag, alternate test data can easily be

multiplexed into the input stream. Likewise, the output interface is a data/valid

pair, which can be fanned out into an alternate endpoint. Shared BRAMs can

be used in both cases: at the input side one is loaded with test projection data

from a server, then streamed out into the reconstruction engine; at the output
reconstructed images are written into another memory, then read back to the
server (Figure 3.12).

Figure 3.12: Reconstruction engine runtime test architecture


CHAPTER 4

Analysis

4.1 Performance

4.1.1 Hardware Implementation

The fully-pipelined streaming architecture of the reconstruction engine allows us

to consider its performance primarily from a throughput perspective. The engine

is structured in such a way so as to provide the maximum possible single-pipeline

performance. After some initial latency to completely fill the pipeline, the design

is able to execute one iteration of the backprojection loop every FPGA clock cycle,

and can sustainably stream out a reconstructed image every

(n²pix × nθ) / fclk   (4.1)

(4.1)

seconds. The implemented system demonstrates a 200×200-pixel image backpro-

jected from a 96-angle sinogram running on an FPGA clocked at 200 MHz. This

results in a sustained backprojection throughput of 52.08 images per second—one

image every 0.0192 seconds.

The reconstruction steps preceding the O(n³) backprojection phase—rebinning,

weighting, folding, and filtering—all work on sinograms, and so tend to be O(n²)

calculations. They can therefore be masked by the processing time of backprojection.

In order to facilitate the streaming architecture, upstream modules are throttled

to provide single images of data for processing on a by-θs basis.


For each reconstructed image, the rebinning module must be able to supply

one sinogram. As discussed in Section 3.3.1, the combination of QDR burst-

access restrictions and fan-to-parallel sample interleaving reduces output from the

rebinning function to one pixel every four clock cycles. The rebinning module
can thus produce one sinogram every

(4 × npix × nsrc) / fclk   (4.2)

seconds, for a maximum of 1302 images per second. However, once the rebinned

sinogram is complete, the module idles in order to match the backprojection

throughput. This leaves the rebinning module with a 4% duty cycle: it performs

the rebinning function in a burst of 4 × npix × nsrc cycles, then waits for the

remaining 96% of the image reconstruction operation. The current design is able

to increase its output, allowing a single rebinning module to feed accelerated or

multiple backprojectors for up to a 24× increase in throughput.

Folding the sinogram to account for projection symmetry halves the effective

data rate in the pipeline, but is still bound by the 4× npix× nsrc performance of

rebinning to get all of the necessary data points. As such, its nominal throughput

is considered to be the same as that of the rebinning module.

The bridge between the O(n²) and O(n³) domains lies in the buffer that
precedes the filter for filtered backprojection. Unlike the preceding modules, this
buffer has a more distributed active/idle duty cycle. It forms the outermost loop
of the filtered backprojection flow, incrementing on θ and dispensing bursts of
2^⌈log2(2 × npix)⌉ (zero-padded from npix) data samples to the FBP modules every n²pix
cycles. The filtering operation also works in this intermediate domain, producing
2^⌈log2(2 × npix)⌉ samples per n²pix cycles as detailed in Section 3.3.2. The filter can
process

fclk / (nθ × 2^⌈log2(2 × npix)⌉)   (4.3)


sinograms per second—4069 per second for the given geometries. From here, each

filtered dataset moves to the backprojection module that processes the inner

two loops of the algorithm. This final module, as previously described, is the

primary throughput bottleneck in the reconstruction pipeline, at only 52 images

per second.
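Substituting the demonstrator parameters into Equations 4.1 through 4.3 reproduces the quoted rates (a worked check using only numbers stated above):

# Worked check of the throughput figures for the implemented (reduced) geometry.
import math

f_clk = 200e6                     # FPGA clock, Hz
n_pix, n_src, n_theta = 200, 192, 96

backproject_rate = f_clk / (n_pix ** 2 * n_theta)                  # Eq. 4.1
rebin_rate = f_clk / (4 * n_pix * n_src)                           # Eq. 4.2
n_fft = 2 ** math.ceil(math.log2(2 * n_pix))                       # 512-point FFT
filter_rate = f_clk / (n_theta * n_fft)                            # Eq. 4.3

print(round(backproject_rate, 2))   # 52.08 images/s (0.0192 s per image)
print(round(rebin_rate))            # 1302 sinograms/s
print(round(filter_rate))           # 4069 sinograms/s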

4.1.2 Software Reference

Because of the developmental nature of the particular CB-SSRB reconstruction

algorithm used, the available software reference is a MATLAB implementation.

Backprojection, however, is available as a library function via the inverse Radon

transform. The software can compute backprojection using either the built-in

MATLAB iradon function or a compiled C executable. All data and rebinning

tables are accessed from MATLAB MAT-files stored on local disk. The software

implementation was evaluated on an eight-core (dual-socket, quad-core) server

with Intel Xeon E5420 CPUs and 16 GB of RAM. Each processor has 12 MB of

cache and runs at 2.5 GHz.

MATLAB’s matrix representation for all data structures makes it a natural

environment for working with images, which are treated as matrices of pixels.

Beyond the ability to substitute a C program to perform backprojection, no fur-

ther optimizations were made to the MATLAB implementation. Software perfor-

mance is summarized in Table 4.1. Even with a single-pipeline implementation,

the FPGA provides about a 9.5× speedup versus the native MATLAB iradon

function for backprojection, and almost a 5× speedup versus the C implementation.

Rebinning sees a similar speedup with the FPGA, while the FPGA filter can

provide 18× increased throughput.


Function                         Software (img/s)   FPGA (img/s)   Speedup
Rebinning                        251.3              1302           5.181×
Filtering                        221.1              4069           18.53×
Backprojection (MATLAB)          5.435              52.08          9.583×
Backprojection (C-accelerated)   10.45              52.08          4.984×

Table 4.1: Image Throughput Comparison

4.2 Hardware Resource Utilization

One of the primary downsides of FPGAs, in comparison to software, is the finite

number of resources per device. The resource utilization on a Xilinx Virtex-5

SX95T device is summarized in Table 4.2. While 70% of the SLICEs are occu-

pied, logic utilization (lookup tables and registers) is at about 40%. The num-

bers presented account for all overhead associated with the fully-running design,

including memory and communications controllers for the devices described in

Section 3.2. In particular, BRAM usage is very high (95%) due to the runtime

testing framework (Section 3.3.4): The Shared BRAMs used for loading test data

and for capturing final output data need 128 BlockRAMs each. An additional

32 are used for capturing intermediate data for debugging and analysis. In a

deployment system these BRAMs would not be necessary, and the number of

BRAMs used would be reduced to about 37% of the total available.

Total off-chip memory usage is shown in Table 4.3. As seen, raw memory

usage for all functions totals in excess of 20 MB, far outstripping the 8,784 kb

(≈ 1.07 MB) available from the on-chip BRAMs. From an FPGA standpoint,

reconstruction is very clearly a memory-bound application requiring less than

half of its on-chip logic resources, but depending heavily on external memory.

Resource         Used     Available   % Used
Look-up Tables   20,850   58,880      35%
Registers        24,813   58,880      42%
SLICEs           10,332   14,720      70%
BlockRAMs        468      488         95%
  (deployment)   (180)    (488)       (37%)
DSP48Es          61       640         9%

Table 4.2: FPGA Hardware Resource Utilization Summary

Module           Resource   Available   Used        % Used
Rebinning        DRAM       1 GB        18.75 MB    1.8%
                 QDR SRAM   8 MB        2.53 MB     31.6%
Backprojection   QDR SRAM   8 MB        0.1526 MB   1.97%

Table 4.3: External Memory Utilization Summary

However, though the memory requirements are large relative to BRAM capacities,

utilization of each external memory device is quite low. The only exception is

QDR SRAM available for the rebinning buffer. At 31.6% utilization, this was

the limiting factor for scaling the system up to the full size of reconstructing

800×800-pixel images from 768 sources and 1152 detectors—an operation that

would have required 16× more memory in each buffer.


4.3 Full-Scale System

The implemented system demonstrates the feasibility of an FPGA-based solution

to real-time 3-D tomographic reconstruction. However, the reduced geometries

would not be particularly useful for practical applications. The architecture pre-

sented would need to be scaled to match the specifications of the reference sys-

tem [6] to find real-world use. The performance and resource results discussed

in the preceding sections serve as a baseline for this scaling. The scanner geome-

tries would be increased four-fold in each dimension. Rather than reconstruct a

200×200-pixel image from 192 sources and 288 detectors, an 800×800-pixel image

would be constructed from 8 rings of 768 sources and 1152 detectors. Each revolu-

tion would still be used to synthesize 32 rebinning centers. In addition, the real-time

reconstruction target imposes minimum performance requirements. The system

would need to be able to reconstruct 480 images per second at those geometries.

4.3.1 Single-Engine Scaling

One of the primary benefits of using FPGAs for real-time applications is their de-

terministic behavior. This is especially true for dataflow-type applications, such

as this one, in which computations are carried out in a simple sequence of opera-

tions with no dynamic control or branching. The deterministic behavior further

allows scaling performance characteristics for different reconstruction sizes, based

on Eqs. 4.1, 4.2, and 4.3.

Given that the design is fully parameterized for any $n_{src}$, $n_{det}$, $n_{\lambda}$, $n_{pix}$, and

$n_{\theta}$, these can simply be increased to match the desired geometry. The presented

implementation architecture would have similar speedup factors, resulting in a

system that would have performance as summarized in Table 4.4.
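As a sanity check, the FPGA column of Table 4.4 can be reproduced from the cycle-count expressions given later in Section 4.3.2 (Eqs. 4.5–4.7 with $n_\parallel$ = 1) and the 200 MHz design clock. The sketch below assumes the full-scale geometry and applies the per-projection filter expression over all $n_\theta$ projections; that aggregation is an interpretation, not a statement from the text.

    % Expected full-scale single-engine throughput at the 200 MHz design clock.
    f_clk   = 200e6;           % FPGA clock frequency (Hz)
    n_pix   = 800;             % reconstructed image is n_pix x n_pix
    n_theta = 384;             % projection angles per image (800 x 384 sinogram)
    n_src   = 768;             % sources per ring

    rebin_cycles  = 4 * n_pix * n_src;                 % Eq. 4.7
    filter_cycles = n_theta * 2^ceil(log2(2*n_pix));   % Eq. 4.6 summed over all projections (assumed)
    bp_cycles     = n_pix^2 * n_theta;                 % Eq. 4.5 with n_par = 1

    fprintf('Rebinning:      %6.2f img/sec\n', f_clk/rebin_cycles);   % ~81.4
    fprintf('Filtering:      %6.2f img/sec\n', f_clk/filter_cycles);  % ~254.3
    fprintf('Backprojection: %6.4f img/sec\n', f_clk/bp_cycles);      % ~0.814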


Function                         Software (img/sec)   FPGA (img/sec)   Speedup
Rebinning                        9.351                81.38            8.703×
Filtering                        14.01                254.3            18.15×
Backprojection (MATLAB)          0.05295              0.8138           15.34×
Backprojection (C-accelerated)   0.1676               0.8138           4.856×

Table 4.4: Full-Scale Image Throughput Comparison

Module           Resource   Required    Available   % Available
Rebinning        DRAM       300 MB      1 GB        29.3%
                 SRAM       40.5 MB     8 MB        506.3%
Backprojection   SRAM       2.441 MB    8 MB        30.52%

Table 4.5: Full-Scale External Memory Requirements

Such an engine design would have memory requirements as shown in Table 4.5.

As previously mentioned, the rebinning buffer dominates memory requirements

for the full-scale system, and is the only external memory subsystem for which the

current platform is insufficient. This can be remedied by taking advantage of the

technology scaling progress since the ROACH platform was developed; 144 Mbit

QDR SRAMs have since become available [21]. With that advancement, a six-

fold increase in memory capacity becomes feasible by replacing the current

8 MB device with three 16 MB devices; for this purpose only additional capacity

is needed, not bandwidth, so the three devices can be bussed on a single bank

so that only three additional I/O pins are needed from the FPGA to cover the

increased address space.
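A quick capacity check for this substitution is given below; it is a rough sketch that treats each 144-Mbit-class part as a 16 MB device, as in the text, and glosses over device organization details.

    % Rebinning buffer capacity check for the proposed SRAM substitution.
    required_mb = 40.5;          % full-scale rebinning buffer (Table 4.5)
    current_mb  = 8;             % single QDR SRAM on the ROACH platform
    upgraded_mb = 3 * 16;        % three newer parts treated as 16 MB each (assumed)
    fprintf('Current:  %4.1f MB (%.0f%% of requirement)\n', current_mb, 100*current_mb/required_mb);
    fprintf('Upgraded: %4.1f MB (%.0f%% of requirement)\n', upgraded_mb, 100*upgraded_mb/required_mb);
    % 48 MB comfortably exceeds the 40.5 MB requirement, a six-fold capacity increase.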


Function         Expected (img/sec)   Required (img/sec)   Gap
Rebinning        81.38                480                  5.898×
Filtering        254.3                480                  1.888×
Backprojection   0.8138               480                  589.8×

Table 4.6: Expected Real-Time Performance Gap

If sufficient memory were available to scale up to the full-sized image using this

implementation scheme, the engine would have a sustained throughput of about

0.8 images per second. From this starting point, an engine would require a nearly

600× increase in order to meet the real-time performance target (Table 4.6).
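The gap factors in Table 4.6 are simply the ratio of the required real-time rate to the expected single-engine rate:

    % Real-time performance gap per function (values from Tables 4.4 and 4.6).
    required = 480;                          % images per second
    expected = [81.38, 254.3, 0.8138];       % rebinning, filtering, backprojection
    gap = required ./ expected;
    fprintf('Rebinning gap:      %7.3fx\n', gap(1));   % ~5.9x
    fprintf('Filtering gap:      %7.3fx\n', gap(2));   % ~1.9x
    fprintf('Backprojection gap: %7.1fx\n', gap(3));   % ~590x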

4.3.2 Multi-Engine Scaling

As expected, it is clear that backprojection performance must be accelerated

in order to meet the real-time throughput requirements. As has been shown,

parallelization is an effective way of achieving this [8, 9, 22]. The existing by-

θs architecture, in particular, can be adapted to use parallel reconstruction

engines for each θ.

It can be observed from the backprojection algorithm that there is no θ-

dependency in the sinogram indexing operation. In other words, each θ-slice of

the sinogram can be operated on independently of any other. Potential hazards

only arise when the sinogram pixels need to be accumulated into the reconstructed

image. The advantages of sequential loop ordering and the pipelined, streaming

reconstruction architecture can be combined to avoid this hazard. If all sinogram

indexing were to be performed in parallel, then it is easy to see that x and

y will increment simultaneously. If this is the case, then accumulating the image


pixels can be achieved using an adder tree, rather than an accumulation memory.

If the operation cannot be made sufficiently parallel, a hybrid solution using

both an adder tree and an accumulation memory can still be implemented. The

accumulation is still safe from data hazards due to the sequential memory access

pattern described in Section 3.3.3.
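To make the idea concrete, a behavioral MATLAB sketch of the parallel-θ accumulation is shown below. The sinogram contents, the index function t_idx, and the small problem sizes are placeholders standing in for the engine's index tables; in hardware the inner sum over θ would be realized as a pipelined adder tree rather than a loop.

    % Behavioral sketch of parallel-theta backprojection accumulation (illustrative only).
    n_pix   = 64;  n_theta = 32;                       % small placeholder sizes
    depth   = 2^ceil(log2(n_pix * sqrt(2)));           % buffer depth per Eq. 4.4
    filt_sino = rand(depth, n_theta);                  % placeholder filtered sinogram
    t_idx = @(x, y, k) mod(x + 2*y + 3*k, depth) + 1;  % placeholder index function
    img = zeros(n_pix, n_pix);
    for x = 1:n_pix
      for y = 1:n_pix
        s = 0;
        for k = 1:n_theta                              % every theta slice indexed with the same (x, y)
          s = s + filt_sino(t_idx(x, y, k), k);        % in hardware, this sum is an adder tree
        end
        img(x, y) = s;                                 % one write per pixel: no accumulation hazard
      end
    end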

The only caveat for this architecture is that all parallel θs must be indexed

with the same x and y, meaning the engine would need to wait until all are

available: for an N× parallel implementation the first filtered projection would

need to be buffered until the Nth projection is output from the filter. However,

the current implementation already buffers an entire sinogram before the filtering

stage in order to match the throughput of the backprojector. This buffer could

be broken up and pushed down the pipeline to buffer filtered projections, rather

than unfiltered projections.

Considering backprojection alone, the limiting factor in parallelization will be

the amount of memory available for these filtered sinogram buffers. For a fully-

parallelized implementation, $n_\theta$ buffers of depth

$$2^{\lceil \log_2 (n_{pix} \sqrt{2}) \rceil} \qquad (4.4)$$

would be needed; an 800×384 sinogram would therefore need 384 2048-word

memories. If filtered sinograms remain at 18 bits to take advantage of the DSP48E

architecture, then each buffer would be exactly 36 kb. While 384 36-kb BRAMs

are not available on the SX95T being used, the larger SX240T has 516 such

memories available [15]. In the newer Virtex-6 family, all but 2 of the 13 devices

have more than 384 36-kb BlockRAMs [23].
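The buffer sizing follows directly from Eq. 4.4 and the 18-bit sample width:

    % Filtered-sinogram buffer sizing for the full-scale 800 x 384 sinogram.
    n_pix   = 800;
    n_theta = 384;
    depth   = 2^ceil(log2(n_pix * sqrt(2)));   % Eq. 4.4 -> 2048 words
    bits    = depth * 18;                      % 18-bit samples to suit the DSP48E datapath
    fprintf('%d buffers of %d words = %d kb each\n', n_theta, depth, bits/1024);
    % Each buffer is exactly 36 kb, i.e. one full 36-kb BlockRAM per theta slice.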

As the parallelization factor $n_\parallel$ increases the backprojection throughput, the idle factor of the upstream modules erodes. An $n_\parallel\times$ parallelized backprojector needs

$$\frac{n_{pix}^{2} \times n_{\theta}}{n_{\parallel}} \qquad (4.5)$$

cycles to complete a backprojection, whereas the filter needs

$$n_{\parallel} \times 2^{\lceil \log_2 (2 \times n_{pix}) \rceil} \qquad (4.6)$$

cycles to supply the necessary portion of the filtered sinogram. As $n_\parallel$ increases,

there is a crossover point at which the filter cannot keep up with the backprojector

(Figure 4.1). At a 192× parallelization factor, the filter has sufficient cycles

to filter all of the necessary projection θs. At 384× parallelization, increased

filter throughput will be needed. Fortunately, the biplex architecture of the

FFT engines used provides the ability to process two fully independent streams

simultaneously without using additional DSP resources [20]. Therefore, a second

filter can be implemented for only the cost of an extra filter coefficient multiplier.

Further upstream, rebinning is performed in the $O(n^2)$ domain, rather than

the $O(n^3)$ domain. Because it rebins an entire sinogram in order to perform the

flip/add operation for symmetry folding, it requires a constant

$$4 \times n_{pix} \times n_{src} \qquad (4.7)$$

cycles independent of the parallelization factor. As Figure 4.2 shows, the crossover

due to backprojection parallelization occurs much sooner than for filtering, due

largely to the factor of four. If the rebinning throughput can be doubled, then the sys-

tem will have sufficient throughput for a 192×-parallel backprojector; removing

the factor of four altogether will provide sufficient throughput for a 384×-parallel

backprojector.
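The crossover points in Figures 4.1 and 4.2 can be estimated directly from Eqs. 4.5–4.7. The sketch below assumes the full-scale geometry and simply tabulates the cycle counts; it reproduces the qualitative behavior described above (filtering overtaken between 192× and 384×, baseline rebinning overtaken near 100×).

    % Crossover estimates for backprojection parallelization (Eqs. 4.5-4.7).
    n_pix = 800;  n_theta = 384;  n_src = 768;
    n_par = [1 2 4 8 16 32 64 96 128 192 384];

    bp    = (n_pix^2 * n_theta) ./ n_par;            % Eq. 4.5: backprojection cycles
    filt  = n_par * 2^ceil(log2(2*n_pix));           % Eq. 4.6: filter cycles
    rebin = 4 * n_pix * n_src * ones(size(n_par));   % Eq. 4.7: constant rebinning cycles

    % Columns: n_par, backprojection, filtering, rebinning (cycles per image).
    disp([n_par.' bp.' filt.' rebin.']);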

[Figure 4.1: Filtering trade-off in parallelized computation times. Plot of computation time (cycles) versus parallelization factor $n_\parallel$ for backprojection and for filtering.]

The first factor of two is a limitation of the QDR-II+ SRAM four-word-burst architecture. If the SRAM used for the rebinning buffer were changed to one using a two-word-burst (18-bit) architecture, then rebinning indexing could be performed with fully random access at the full SRAM bandwidth. Such memories

are currently available [24,25]. Removing the other factor of two involves changing

the interleaving of adjacent fan-to-parallel samples. This is a more expensive

change, requiring two independent buffers. The same projection data would

be written in simultaneously, but they would be read using different rebinning

indices. This change also doubles the amount of DRAM bandwidth required to

load rebinning factors, as each 128-bit word read from DRAM would be used in

a single cycle.

[Figure 4.2: Rebinning trade-off in parallelized computation times. Plot of computation time (cycles) versus parallelization factor $n_\parallel$ for backprojection and for rebinning at 1×, 2×, and 4× throughput.]

The 384× speedup afforded by parallelization goes a long way towards closing the performance gap. The remaining factor of 1.536 can be made up by simply increasing the clock frequency of the FPGA. Upclocking from 200 MHz to 333 MHz will net a 1.665× throughput improvement, and will be feasible on the

newer, faster Virtex-6 devices. Because the algorithm has already been mapped

to a flexible hardware architecture, further speed increases can be attained by

directly porting the FPGA design to a suitable ASIC implementation [26,27].
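For reference, the closing arithmetic is sketched below; it assumes the single-engine full-scale backprojection rate of 0.8138 img/sec scales linearly with both the parallelization factor and the clock frequency.

    % Closing the remaining gap with parallelism plus a clock increase.
    single_engine = 0.8138;                   % full-scale backprojection, img/sec
    with_parallel = single_engine * 384;      % 384x-parallel backprojector -> ~312.5 img/sec
    remaining_gap = 480 / with_parallel;      % ~1.536x still required
    clock_gain    = 333 / 200;                % 200 MHz -> 333 MHz gives ~1.665x
    fprintf('Remaining gap: %.3fx, clock gain: %.3fx -> %.0f img/sec\n', ...
            remaining_gap, clock_gain, with_parallel * clock_gain);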


CHAPTER 5

Conclusion

5.1 Summary of Research Contributions

This thesis demonstrates the feasibility of a hardware system architecture capable

of handling the full image reconstruction flow for processing projection data from

a helical cone-beam scanner. It can handle the rebinning, filtering, and backpro-

jection steps necessary for such a reconstruction. The system is designed with a

pipelined, streaming architecture targeting integration for real-time applications,

as opposed to offline post-processing acceleration. The design is implemented in

hardware on a Xilinx SX95T FPGA using the academic ROACH reconfigurable

instrumentation platform.

The hardware implementation is capable of reconstructing more than 52 im-

ages per second. Despite being a single-pipeline design intended to establish a

baseline performance, the FPGA achieved a roughly 5× performance improve-

ment versus a software reference for a reduced system geometry. It is expected

to maintain a similar speedup for the full-sized system, at 0.8 images per second.

The engine architecture is also seen to be flexible and scalable.


5.2 Future Work

The architecture laid out in this thesis can be used as the basis for a full-scale

implementation that can meet the real-time performance requirement of 480 im-

ages per second for 800×800-pixel images. This will require the design of a

new hardware platform with increased SRAM memory capacity and bandwidth.

Enhancements to engine throughput can be achieved through the flexibility for

increased parallelism enabled by the streaming architecture. The hardware-based

design scheme is also amenable to an ASIC port of the engine.


References

[1] C. Chang, J. Wawrzynek, and R. W. Brodersen, "BEE2: A High-End Reconfigurable Computing System," IEEE Design & Test of Computers, pp. 114–125, March 2005.

[2] A. C. Kak and M. Slaney, Principles of Computerized Tomographic Imaging. IEEE Press, New York, 1987.

[3] P. Dreike and D. P. Boyd, "Convolution Reconstruction of Fan Beam Projections," Computer Graphics and Image Processing, vol. 5, no. 4, pp. 459–469, 1976.

[4] M. M. Betcke and B. R. Lionheart, "Optimal Surface Rebinning for Image Reconstruction from Asymmetrically Truncated Cone Beam Projections," in preparation.

[5] F. Noo, M. Defrise, and R. Clackdoyle, "Single-slice rebinning method for helical cone-beam CT," Physics in Medicine and Biology, vol. 44, no. 2, p. 561, 1999. [Online]. Available: http://stacks.iop.org/0031-9155/44/i=2/a=019

[6] E. Morton, K. Mann, A. Berman, M. Knaup, and M. Kachelrieß, "Ultrafast 3D Reconstruction for X-Ray Real-Time Tomography (RTT)," in Nuclear Science Symposium Conference Record (NSS/MIC), 2009 IEEE, Oct.–Nov. 2009, pp. 4077–4080.

[7] "Xilinx Virtex 2.5 V Field Programmable Gate Arrays." [Online]. Available: http://www.xilinx.com/support/documentation/data_sheets/ds003.pdf

[8] J. Xu, "A FPGA Hardware Solution for Accelerating Tomographic Reconstruction," Master's thesis, Electrical Engineering Department, University of Washington, 2009.

[9] N. Subramanian, "A C-to-FPGA Solution for Accelerating Tomographic Reconstruction," Master's thesis, Electrical Engineering Department, University of Washington, 2009.

[10] "Altera Stratix-II Device Handbook." [Online]. Available: http://www.altera.com/literature/hb/stx2/stratix2_handbook.pdf

[11] "HyperTransport Consortium." [Online]. Available: http://www.hypertransport.org

[12] "Wikipedia: Cell (microprocessor)." [Online]. Available: http://en.wikipedia.org/wiki/Cell_(microprocessor)

[13] "ROACH." [Online]. Available: https://casper.berkeley.edu/wiki/ROACH

[14] "Collaboration for Astronomy Signal Processing and Engineering Research (CASPER)." [Online]. Available: http://casper.berkeley.edu

[15] "Virtex-5 Family Overview." [Online]. Available: http://www.xilinx.com/support/documentation/data_sheets/ds100.pdf

[16] "CASPER Mailing List Archive MSG #01418." [Online]. Available: http://www.mail-archive.com/[email protected]/msg01418.html

[17] "System Generator for DSP." [Online]. Available: http://www.xilinx.com/tools/sysgen.htm

[18] "Simulink - Simulation and Model-Based Design." [Online]. Available: http://www.mathworks.com/products/simulink/

[19] "CASPER Block Documentation." [Online]. Available: https://casper.berkeley.edu/wiki/Block_Documentation

[20] "CASPER FFT Library Block Documentation." [Online]. Available: https://casper.berkeley.edu/wiki/Fft

[21] "Cypress CY7C1643KV18-400BZC." [Online]. Available: http://www.cypress.com/?mpn=CY7C1643KV18-400BZC

[22] S. Coric, M. Leeser, E. Miller, and M. Trepanier, "Parallel-Beam Backprojection: An FPGA Implementation Optimized for Medical Imaging," in Proceedings of the 2002 ACM/SIGDA Tenth International Symposium on Field-Programmable Gate Arrays, ser. FPGA '02. New York, NY, USA: ACM, 2002, pp. 217–226. [Online]. Available: http://doi.acm.org/10.1145/503048.503080

[23] "Virtex-6 Family Overview." [Online]. Available: http://www.xilinx.com/support/documentation/data_sheets/ds150.pdf

[24] "Cypress CY7C1612KV18-333BZXC." [Online]. Available: http://www.cypress.com/?mpn=CY7C1612KV18-333BZXC

[25] "Cypress CY7C1618KV18-333BZXC." [Online]. Available: http://www.cypress.com/?mpn=CY7C1618KV18-333BZXC

[26] D. Markovic, C. Chang, B. Richards, H. So, B. Nikolic, and R. W. Brodersen, "ASIC Design and Verification in an FPGA Environment," in Custom Integrated Circuits Conference, 2007. CICC '07. IEEE, Sept. 2007, pp. 737–740.

[27] R. Nanda, "DSP Architecture Optimization in MATLAB/Simulink Environment," Master's thesis, Electrical Engineering Department, University of California, Los Angeles, 2008.
