University of California
Los Angeles
An FPGA Architecture for
Real-Time 3-D Tomographic Reconstruction
A thesis submitted in partial satisfaction
of the requirements for the degree
Master of Science in Electrical Engineering
by
Henry I-Ming Chen
2012
The thesis of Henry I-Ming Chen is approved.
John Villasenor
William J. Kaiser
Dejan Markovic, Committee Chair
University of California, Los Angeles
2012
Table of Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Computed Tomography . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 Radon Transform . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.2 Filtered Backprojection . . . . . . . . . . . . . . . . . . . . 6
2.2 Cone-Beam Single-Slice Rebinning . . . . . . . . . . . . . . . . . . 8
3 FPGA Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1 Prior Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.1 FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2.2 Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.3 Communications Interfaces . . . . . . . . . . . . . . . . . . 17
3.2.4 Design Tools and Libraries . . . . . . . . . . . . . . . . . . 19
3.3 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3.1 Rebinning . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3.2 Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.3 Backprojection . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.4 Runtime Testing . . . . . . . . . . . . . . . . . . . . . . . 32
4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.1.1 Hardware Implementation . . . . . . . . . . . . . . . . . . 34
4.1.2 Software Reference . . . . . . . . . . . . . . . . . . . . . . 36
4.2 Hardware Resource Utilization . . . . . . . . . . . . . . . . . . . . 37
4.3 Full-Scale System . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3.1 Single-Engine Scaling . . . . . . . . . . . . . . . . . . . . . 39
4.3.2 Multi-Engine Scaling . . . . . . . . . . . . . . . . . . . . . 41
5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.1 Summary of Research Contributions . . . . . . . . . . . . . . . . 46
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
List of Figures
2.1 Geometry of a set of projection line integrals . . . . . . . . . . . . 4
2.2 Sinogram of a delta function . . . . . . . . . . . . . . . . . . . . . 5
2.3 Example of an image and its corresponding sinogram . . . . . . . 5
2.4 Fan-beam projection scanning . . . . . . . . . . . . . . . . . . . . 6
2.5 Imaging a 3-D object using a stack of 2-D image slices . . . . . 8
2.6 Helical scan trajectory . . . . . . . . . . . . . . . . . . . . . . . . 9
2.7 Phantom sinogram using helical cone-beam scanning . . . . . . . 10
3.1 ROACH system block diagram . . . . . . . . . . . . . . . . . . . . 15
3.2 ROACH shared memory interfaces . . . . . . . . . . . . . . . . . 18
3.3 ROACH bus-attached external memory . . . . . . . . . . . . . . . 19
3.4 Designing in the System Generator environment . . . . . . . . . . 20
3.5 FPGA reconstruction flow . . . . . . . . . . . . . . . . . . . . . . 20
3.6 Rebinning architecture . . . . . . . . . . . . . . . . . . . . . . . . 21
3.7 Rebinning module implementation block diagram . . . . . . . . . 22
3.8 Rebinning factor buffering and readout . . . . . . . . . . . . . . . 24
3.9 Filtering architecture . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.10 Backprojection architecture . . . . . . . . . . . . . . . . . . . . . 29
3.11 Sinogram projection in-memory zero-padding . . . . . . . . . . . . 30
3.12 Reconstruction engine runtime test architecture . . . . . . . . . . 33
4.1 Filtering trade-off in parallelized computation times . . . . . . . . 44
List of Tables
4.1 Image Throughput Comparison . . . . . . . . . . . . . . . . . . . 37
4.2 FPGA Hardware Resource Utilization Summary . . . . . . . . . . 38
4.3 External Memory Utilization Summary . . . . . . . . . . . . . . . 38
4.4 Full-Scale Image Throughput Comparison . . . . . . . . . . . . . 40
4.5 Full-Scale External Memory Requirements . . . . . . . . . . . . . 40
4.6 Expected Real-Time Performance Gap . . . . . . . . . . . . . . . 41
Acknowledgments
I would like to thank, first and foremost, my advisor, Professor Dejan Markovic.
Without his continued help and support, I could not have completed this work.
I would also like to thank Professors John Villasenor and William J. Kaiser for
their encouragement and keen insights throughout this process.
My gratitude to Andy Kotowski, Tim Coker, and Dan Oberg of Rapiscan
Systems for enabling us to engage this topic. I also owe a great deal to Dr. Marta
Betcke, University College London, without whose gracious help the black magic
of the reconstruction algorithm would remain an unsolved mystery before me.
The assistance of Dr. Jianwen Chen has also been invaluable.
To all of my colleagues of DMGroup—Dr. Chia-Hsiang Yang, Dr. Victo-
ria Wang, Dr. Chaitali Biswas, Tsung-Han Yu, Rashmi Nanda, Sarah Gibson,
Chengcheng Wang, Vaibhav Karkare, Fengbo Ren, Fang-Li Yuan, Richard Dor-
rance, Yuta Toriyama, Kevin Dwan, Qian Wang, and Hari Chandrakumar—
thank you all for the coffee runs, the useful discussions, the mentoring, and the
friendship. A great thank-you, too, to Kyle Jung, who always made sure we had
everything we needed.
Thank you to Dan Werthimer and the rest of the CASPER collaboration for
helping me develop the skills needed to complete this project, not to mention the
platform to implement it on!
I sincerely appreciate all that my parents and sister have done for me in order
to get me to this point. Without them, I would be nowhere. And, finally, I am
grateful for my dear Helen, who helped me to realize that I could achieve more.
Abstract of the Thesis
An FPGA Architecture for
Real-Time 3-D Tomographic Reconstruction
by
Henry I-Ming Chen
Master of Science in Electrical Engineering
University of California, Los Angeles, 2012
Professor Dejan Markovic, Chair
Tomographic imaging is a cross-sectional technique that has broad uses in
medical, industrial, and security applications. However, it requires filtered back-
projection, an O(n³) algorithm, in order to reconstruct images from the scanning
system. Modern helical cone-beam scanning systems have the ability to scan at
even faster rates, further stressing the need for accelerated reconstruction meth-
ods. Moreover, these new scanners have unique geometries that require additional
processing. A scalable hardware architecture based on FPGAs is presented to deal
with the computational complexity of this kind of image reconstruction, demon-
strating a 5× throughput improvement over a reference software implementation.
The system is also taken as the starting point for a proposed architecture that can
extend performance by another 600×, making it suitable for high-speed real-time
scanner systems.
CHAPTER 1
Introduction
Tomography is a cross-sectional imaging technique in which an object is
illuminated from many angles using penetrating radiation. Since its development
in the 1970s, tomographic imaging has found many uses in medical, industrial,
and security applications.
Computed tomography (CT), one of the primary tomographic imaging techniques,
is a computationally intensive undertaking, requiring an O(n³) reconstruction
algorithm. Because of this, CT imaging tends to be a slow process, greatly
constraining its use in functional or real-time imaging settings. Accelerating
CT reconstruction is therefore an ongoing effort.
As general-purpose CPUs have begun to slow in their scaling-derived perfor-
mance increases, alternative approaches are being sought in order to continue
meeting the computational requirements of CT reconstruction. By the nature
of the algorithms involved—highly iterative and regular operations—CT recon-
structions are ideal candidates for acceleration by field programmable gate arrays
(FPGAs). FPGAs are reconfigurable hardware, offering a middle road between
a custom application-specific integrated circuit (ASIC), which offers high perfor-
mance but fixed functionality, and CPUs, which are highly programmable but
not very efficient [1]. Using an FPGA, performance can be accurately and
deterministically characterized, a highly desirable trait in the design of real-time
systems. FPGAs also allow for greater customization at the system level, being
able to add different types of memory and communications interfaces.
In this work, a scalable FPGA-based design is presented that targets these
computational requirements. Special consideration is paid to making this solution
a framework that can be used with high-speed real-time CT scanning systems. An
overview of the algorithms involved in such a system is first covered in Chapter 2.
With the basic requirements in mind, the hardware platform will be introduced,
along with a description of the system implemented on that platform (Chapter 3).
An analysis of the system performance results follows in Chapter 4.
CHAPTER 2
Algorithm
2.1 Computed Tomography
A conventional CT system is constructed such that the object of interest is placed
at the center of a scanning system, about which a source and detector pair rotate.
As the source projects a radiation ray, the detector measures the absorption due
to all parts of the object along the projection line. By making multiple projections
from all angles around the object, a cross-sectional reconstruction of the object
can be made.
2.1.1 Radon Transform
The tomographic imaging process can be discussed in terms of the Radon trans-
form. The Radon transform describes an object using a series of line integrals
of the object. For example, if the two-dimensional object shown in Figure 2.1
is defined as f(x, y), then a line integral P along a line with source angle θ and
lateral offset t is given by
Pθ,t = ∫_{−∞}^{∞} f(x, y) δ(x cos θ + y sin θ − t) dx dy    (2.1)
and the set of all such line integrals over θ and t, P(θ, t), is the Radon transform
of f(x, y) [2]. The visual representation of the transform of an object is also often
called its sinogram, as the Radon transform of a delta function is a sinusoid. This
can be seen in Figure 2.2.

Figure 2.1: Geometry of a set of projection line integrals
Figure 2.3 shows the Radon transform applied to a more complex geome-
try. The Shepp-Logan phantom, Figure 2.3a, is frequently used as a standard
benchmark for the development of tomographic reconstruction algorithms. The
phantom’s sinogram, generated from its Radon transform over 180◦, is shown in
Figure 2.3b.
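The sinusoid form of a delta function's sinogram follows directly from Eq. 2.1: a point at (x0, y0) only contributes where t = x0 cos θ + y0 sin θ. A short numerical sketch (the point coordinates here are arbitrary, not from the thesis) confirms this trace is a single sinusoid:

```python
import math

def delta_trace(x0, y0, theta):
    """Detector offset t at which a point at (x0, y0) projects for
    source angle theta (Eq. 2.1 with f a delta function)."""
    return x0 * math.cos(theta) + y0 * math.sin(theta)

# The trace equals a single sinusoid t(theta) = R*sin(theta + phi)
x0, y0 = 3.0, 4.0          # illustrative point, not from the thesis
R = math.hypot(x0, y0)     # amplitude: distance from rotation center
phi = math.atan2(x0, y0)   # phase set by the point's angular position

for k in range(8):
    theta = k * math.pi / 4
    assert abs(delta_trace(x0, y0, theta) - R * math.sin(theta + phi)) < 1e-9
```

The amplitude of the sinusoid is the point's distance from the center of rotation, which is why the delta trace in Figure 2.2b sweeps symmetrically across the detector bank.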
When applied to tomographic imaging, the 2-D function f(x, y) can be used
to describe a cross-sectional slice of a 3-D body. A ray of penetrating radiation
(typically an X-ray for computed tomography) travels through this 2-D slice along
the line described by θ and t. As this ray passes through the physical object,
it is absorbed or scattered by the matter it encounters. If f(x, y) describes the
absorption or scattering of the radiation ray due to the object at each spatial
location (x, y), then the line integral sums up the total attenuation due to the
object as the X-ray passes through. In this way, one can “calculate” a line
integral by measuring the total attenuation of the X-ray. The Radon transform
Figure 2.2: Sinogram of a delta function. (a) Delta function image; (b) sinogram of the delta taken over 360°.
Figure 2.3: Example of an image and its corresponding sinogram. (a) Shepp-Logan phantom; (b) sinogram of the phantom taken over 180°.
Figure 2.4: Fan-beam projection scanning
of a physical object can therefore be taken by measuring the attenuations of
X-rays projected from multiple angles θ with different offsets t.
When considering the physical implementation of an imaging system, taking
the Radon transform using a series of offset line integrals can be inefficient. Such a
scanning geometry would require a series of linear scans for every angular rotation
in order to cover the object of interest. A simpler scanning geometry uses a single
point source to project rays in a fan-beam (Figure 2.4). If the rays are received
by a bank of detectors, full coverage of the object can be achieved more rapidly.
2.1.2 Filtered Backprojection
Inverting the Radon transform lies at the heart of computed tomography; by
undoing the Radon transform of the scanning process, an image of the original
object can be recovered from projection data. It can be seen as “unsmearing”
the image data back along each projection line to re-form the original image.
By using the Fourier slice (or projection-slice) theorem, it is possible to derive
a reconstruction function for f(x, y) given as
f(x, y) = ∫_{0}^{π} Qθ(x cos θ + y sin θ) dθ    (2.2)
where Qθ(t) is
Qθ(t) = ∫_{−∞}^{∞} Sθ(w) |w| e^{2πjwt} dw    (2.3)
and Sθ(w) is the Fourier transform of a line projection along a projection angle
θ. Qθ thus represents a filtering operation on Sθ using a filter with frequency
response |w| [2]. Therefore, the reconstruction process is generally referred to as
“filtered backprojection” (FBP), where backprojection alludes to the reversal of
the “forward projection” of the line integral.
This expression is readily discretized for real-world applications as
img(x, y) = Σ_{θ=0}^{K} Qθ(x cos θ + y sin θ)    (2.4)
More importantly, this gives the basic form for the computed tomography
filtered backprojection algorithm:
foreach θ do
    filter sino(θ, ∗);
    foreach x do
        foreach y do
            n = x cos θ + y sin θ;
            img(x, y) = sino(θ, n) + img(x, y);
        end
    end
end
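The pseudocode above maps directly onto software. The following Python sketch implements only the backprojection accumulation of Eq. 2.4, with nearest-neighbor interpolation and centered coordinates; the |w| filter is assumed to have already been applied to the sinogram, so this is illustrative rather than a complete FBP implementation:

```python
import math

def backproject(sino, thetas, size):
    """Accumulate filtered projections into a size x size image (Eq. 2.4).
    sino[i][n] holds the (already filtered) projection at angle thetas[i]
    and detector bin n; image coordinates are centered on the rotation axis."""
    img = [[0.0] * size for _ in range(size)]
    half = size // 2
    nbins = len(sino[0])
    for i, theta in enumerate(thetas):
        c, s = math.cos(theta), math.sin(theta)   # look-up table values in hardware
        for yi in range(size):
            for xi in range(size):
                # n = x*cos(theta) + y*sin(theta), shifted to the detector center
                n = int(round((xi - half) * c + (yi - half) * s)) + nbins // 2
                if 0 <= n < nbins:
                    img[yi][xi] += sino[i][n]
    return img
```

With the trigonometric terms pre-computed, the inner statement reduces to two multiplications and two additions per pixel, which is the property exploited by the hardware architecture in Chapter 3.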
Figure 2.5: Imaging a 3-D object using a stack of 2-D image slices
This parallel-beam filtered backprojection can also be applied to fan-beam
scanners if the fan-beam images are first converted to parallel-beam sinograms [3].
In this way, the same backprojection technique can be applied to different scanner
geometries.
2.2 Cone-Beam Single-Slice Rebinning
Because CT scanning is based on cross-sectional slices of an object, it is an
inherently two-dimensional process. In order to image all parts of an object, or to
create a three-dimensional reconstruction, a stack of 2-D image slices is needed
(Figure 2.5). The basic approach for this is a step-and-repeat methodology,
in which a scanner system moves linearly along an object, taking projections at
small distance intervals. However, this solution presents several major drawbacks,
primarily that it can be extremely slow, and that it greatly increases the amount
of exposure to ionizing X-ray radiation.
Several different scanning geometries and accompanying reconstruction tech-
niques have been proposed in order to overcome these downsides. The approach
Figure 2.6: Helical scan trajectory
presented here is a proprietary adaptation [4] of the cone-beam single-slice rebin-
ning method discussed by Noo et al. [5].
To summarize, the cone-beam single-slice rebinning (CB-SSRB) technique
uses helical cone-beam scanning that traverses a continuous helical path around
the object of interest (Figure 2.6). Total coverage of the scanned volume is
achieved by using a volumetric cone-beam projection. Because the cone-beam
projects along the z -axis as well as the x - and y-axes, it can account for the gaps
along the helical scan path.
The different geometric setup of helical cone-beam scanning clearly produces
scan images different from the 2-D Radon transform, as seen in Figure 2.7. There-
fore, in order to recover images using conventional FBP techniques, additional
pre-processing of helical cone-beam scan images must be performed. This pro-
cedure is termed “rebinning,” with the end result being a conversion of helical
cone-beam scan images into a sequence of 2-D projection images.
Details on the mathematics of rebinning are found in the literature [2,5], but
an intuition of its mechanism can be achieved by thinking of it as a form of
Figure 2.7: Phantom sinogram using helical cone-beam scanning
vector projection. As a cone beam sweeps over a volume, the desired result is
to form image slices from within that volume. To the first order, recovering an
image slice on a rebinning center λ is the projection of the scan projection vector
onto the λ-plane. As such, it is performed by weighting the cone-beam scan
detector value with a geometry-dependent scaling factor. Because the original
scan is performed with a conical beam, slices taken from within that volume
will be fan-beam images. These can be reconstructed natively using fan-beam
backprojection techniques, or converted into parallel-beam images.
For high-speed imaging applications, traditional rotating scanners are being
replaced with new geometries [6]. In a traditional scanner, where a source/detector
pair rotates, scanning speeds are limited by the G-forces acting on the rotation
mechanism. The new geometries use scanner rings studded with numerous
stationary sources and detectors. In such a system, rotational scanning is simulated
by firing the sources in sequence around the ring. Cone-beam helical scanning
can be achieved by extending these rings in the lateral direction. In this way, the
cone beam can be received over an area, and the firing sequence can emulate a
helical path.
CHAPTER 3
FPGA Implementation
CT reconstruction has always been a computationally demanding process, due
to backprojection being an O(n³) algorithm. Despite its high time complexity,
its arithmetic requirements are relatively modest: backprojection is highly
repetitive but uses very simple operations. If the sin/cos values are pre-computed
and stored in look-up tables, each iteration requires only two multiplications
and two additions. These characteristics make
the algorithm a prime candidate for FPGA acceleration. The remainder of this
chapter details the implementation of an FPGA reconstruction engine using an
algorithm [4] suitable for the helical cone-beam CT system described by Morton
et al. in 2009 [6].
The reference system uses 768 projector sources and 1152 detectors per ring,
with 8 rings in a row to simulate helical rotation. Each revolution around the
rings is used to rebin 32 image slices, from which 800×800-pixel images are gener-
ated. Because of limited hardware resources, the reconstruction engine presented
here processes a reduced image size. The design is for a helical cone-beam scan-
ner having 192 sources and 288 detectors per ring, but maintains 8 rings per
revolution. Using this projection data, 200×200-pixel images are reconstructed.
3.1 Prior Work
Methods of accelerating filtered backprojection for CT imaging are fairly well
studied, owing to its extensive use in medical and industrial applications. In
particular, the benefits of FPGA acceleration have been explored. In an early
work using Xilinx’s first-generation Virtex FPGAs [7], CT reconstruction was
performed on moderately-sized images (512×512). Backprojection was accelerated
on the FPGA with a 4×-parallel architecture capable of up to a 20× speedup
versus software backprojection. In addition to demonstrating a parallel
backprojection architecture, this work also showed that a fixed-point
implementation using moderate bitwidths can achieve as little as 0.015% error
relative to the software floating-point computation.
In two fairly recent works [8,9], an Altera Stratix-II FPGA [10] was used as a
computational co-processor by connecting it to the system’s CPU via the Hyper-
Transport interconnect protocol [11]. A multi-threaded software implementation
was used as the reference, against which up to 103× speedup was achieved by
using a 128×-parallel architecture.
While these works have clearly demonstrated the capabilities of FPGAs for
accelerating backprojection, their architectures are not suited for real-time appli-
cations. In particular, the systems presented do not provide a complete solution
for filtered backprojection; filtering is performed as a pre-processing step before
offloading to the FPGA for accelerating backprojection, and its computational
needs are not taken into account. As will be examined in further detail in Chap-
ter 4, filtering represents a non-negligible portion of computation, and one that
benefits greatly from FPGA acceleration. More importantly, the systems are op-
timized for offline acceleration in which all data is readily available. Projection
data must be loaded into memory buffers accessible to the FPGA on a
sinogram-by-sinogram basis using the CPU of the host system. These
software-based architectures do not properly address the needs of a real-time
implementation suitable
for scanner integration. Furthermore, these works do not demonstrate the feasi-
bility of implementing the rebinning function in hardware to allow use with helical
cone-beam scanners, and thus can only reconstruct parallel-beam sinograms.
The reference system provides a complete real-time solution that includes the
necessary rebinning and filtering operations, implemented on four Cell Broad-
band Engines (CBEs) [12]. While Cell processors are capable of providing an
accelerated solution suitable for real-time applications, it is still inherently a
software implementation, subject to the non-deterministic runtime behavior of a
software stack. For example, the real-time requirements can only be guaranteed
by over-provisioning the system’s capabilities and allowing for more than a 10%
safety margin in performance. More importantly, the Cell processor has a unique
architecture that poses significant challenges for customizing applications that
can properly extract its full performance capabilities. This has led to limited
adoption and uncertainty about the longevity of the microprocessor architecture.
This work, therefore, seeks to present a solution that can bridge the two bodies of
work, offering a complete hardware solution for the required algorithm suitable
for integrating into a real-time scanner system.
3.2 Platform
The entirety of this reconstruction algorithm was demonstrated on the open-
source academic research platform “ROACH” (Reconfigurable Open Architecture
Computing Hardware) [13]. ROACH was developed as part of an international
collaboration of primarily radio astronomy research institutions as a common
high-performance computing platform for real-time digital signal processing
applications [14].

Figure 3.1: ROACH system block diagram

A block diagram of the system is shown in Figure 3.1. While
much of the information about the ROACH is found among CASPER documen-
tation, the memory and communications interfaces figure heavily into the system
design, and are thus highlighted here.
3.2.1 FPGA
ROACHs are equipped with FPGAs from the Xilinx Virtex-5 family; this partic-
ular instance had a Virtex-5 SX95T component optimized for DSP applications
with a large number of embedded multipliers (“DSP48E”) and dual-ported SRAM
blocks (“BlockRAM” or “BRAM”). In addition to its 14,720 logic SLICEs¹, an
SX95T has 640 DSP48Es and 488 18-kbit BRAMs [15]. A Virtex-5’s DSP48E
is a “hard macro”; that is, it is a circuit-level implementation of a 25×18-bit
multiplier-accumulator, and consumes no reconfigurable logic resources in order
to create a multiplier.

¹A Virtex-5 SLICE is four 6-input lookup tables (LUTs) and four flip-flops (FFs).
3.2.2 Memory
To aid in real-time DSP applications, ROACH boards are designed with high-
bandwidth memory interfaces connected directly to its FPGA. Each board can
be outfitted with a commercial-standard DDR2 SDRAM DIMM of up to 1 GB and
two QDR-II+ SRAM chips of up to 72 Mb.
Having both SDRAM and QDR SRAM helps cover different usage models
for memory requirements. As a DRAM technology, DDR2 provides high storage
capacity, at the cost of increased access complexity. DRAM carries very high
random-access latencies due to its structure and internal architecture, and as
such the highest bandwidth is achieved when accessed in large sequential bursts.
DDR2 DIMMs (dual inline memory modules) run a 64-bit dual data rate (DDR)
interface, meaning that data is transferred on both the rising and falling edges of
the clock. Data must be accessed in small bursts of consecutive memory locations.
ROACH uses ECC (error-correcting code) memory modules which have an
extra parity bit per byte, giving a 72-bit interface. The FPGA’s DDR2 controller
provides a 144-bit, two-word-burst interface to the FPGA fabric. Because of the
way DDR2 memory is implemented at the physical level, the ROACH can only
support operating it at a limited number of fixed frequencies: 150, 200, 266,
300, or 333 MHz. Read and write requests are executed in the order of issue,
but due to low-level actions taken by the controller such as row switching or
refreshing, their execution latencies are non-deterministic. A FIFO buffer sits at
the interface boundary, providing some elasticity to absorb the non-determinism.
In contrast, the QDR (quad data rate) SRAM comes in significantly lower
densities, but provides much more flexibility in its interface. Because it is an
SRAM technology, there is no penalty for fully-random access patterns. QDR
achieves its quadrupled data rate by providing independent read and write ports,
sharing address and control signals, operating at DDR. The highest bandwidth
is achieved by properly interleaving read and write accesses so that, for a given
clock cycle, two words are written concurrently with two words being read.
Like DRAM, the QDR interface specification also requires bursted accesses,
so that switching on the control lines is reduced and that requests are spaced out
so reads and writes can be interleaved. The QDR memories on ROACH have
18-bit data buses with a 4-word DDR burst of consecutive memory locations.
As with the DRAM used, the non-multiple-of-eight data width is nominally for
ECC parity bits; in both interfaces 9 bits is considered to be a single byte. The
FPGA’s QDR controller abstracts this physical layer and presents an interface
with a two-word burst of 36 bits each. The SRAM structure allows the controller
to have a fixed 10-cycle read latency and 1-cycle write latency. The QDR chips
have the flexibility to run at any clock frequency between 150 and 400 MHz.
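To put the two interfaces in perspective, their raw peak bandwidths can be estimated from the figures above. This is an illustrative calculation only; it ignores refresh, bank turnaround, and other protocol overheads that reduce the DRAM figure in practice:

```python
def peak_gbps(bus_bits, mhz, transfers_per_clock):
    """Raw interface bandwidth in Gb/s, ignoring protocol overheads
    such as refresh, bank turnaround, and idle bus cycles."""
    return bus_bits * transfers_per_clock * mhz * 1e6 / 1e9

# DDR2: 72-bit DIMM interface, two transfers per clock (DDR),
# at the 266 MHz operating point listed in the text
ddr2 = peak_gbps(72, 266, 2)         # ~38.3 Gb/s peak

# QDR-II+: 18-bit data bus, DDR, with independent read and write
# ports, at the 400 MHz maximum
qdr_per_dir = peak_gbps(18, 400, 2)  # ~14.4 Gb/s each direction
qdr_total = 2 * qdr_per_dir          # ~28.8 Gb/s with reads and writes interleaved
```

The comparison illustrates the trade: the DRAM offers more raw bandwidth per interface but only under sequential-burst access, while the QDR SRAM sustains its rate even under fully random, interleaved read/write traffic.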
3.2.3 Communications Interfaces
In addition to the computational and memory resources provided by the FPGA,
the ROACH platform was also selected for its multitude of interface capabilities.
Data can be transferred to the FPGA at three levels: as raw electrical signals,
network packets, or data files.
Included on the ROACH board is a PowerPC 440EPx embedded processor
with 512 MB of dedicated DDR2 DRAM, non-volatile Flash memory, and a
Gigabit Ethernet interface. This enables the PowerPC to boot a full Linux op-
erating system, providing higher-level communications protocols like telnet and
SSH/SCP.

Figure 3.2: ROACH shared memory interfaces

The FPGA is in turn connected directly to the processor’s External Bus
Controller (EBC), allowing some memory spaces on the FPGA to be mapped onto
the On-chip Peripheral Bus (OPB). This ability, coupled with the full Linux pro-
gramming environment, provides a good balance between standalone operation
of the FPGA and ease of use and communication. Including the local software
stack, the PowerPC has roughly 7 MB/s of bandwidth to the FPGA [16].
A variety of 32-bit devices can be mapped onto the PowerPC bus from the
FPGA. The simplest is a unidirectional single-word software register, which can
be used for sending control settings to the FPGA, or for reporting statuses with
low rates of change. BRAMs can also be configured as Shared BRAMs, which
are bidirectional; the dual-ported memories are connected to both the PowerPC
bus and the FPGA fabric, allowing block data transfer between the two domains.
The shared memory interfaces are diagrammed in Figure 3.2. Finally, the DRAM
and QDR controllers can be attached to bus interfaces as well (Figure 3.3), for
data sizes beyond the capacity of the on-FPGA BRAMs. Because of the 32-bit
bus interface, the parity bits of the DRAM and QDR memories are discarded
when they are used in this way.
Figure 3.3: ROACH bus-attached external memory
3.2.4 Design Tools and Libraries
In addition to the board itself, a part of the ROACH platform is a set of design
tools and libraries. Rather than code in an HDL like Verilog or VHDL, as would
be the case for most FPGA-based designs, most ROACH development is done
using System Generator, a Xilinx toolbox for the MATLAB Simulink graphical
modeling environment [17, 18]. Figure 3.4 shows an example of designing in the
Simulink/System Generator environment.
System Generator provides bit-/cycle-accurate simulation, as well as auto-
matic data type inference for arithmetic precision and bit growth. The afore-
mentioned ROACH-specific memory controllers and interfaces have been ported
to this environment, along with a large library of commonly-used DSP modules
such as FIRs and FFTs [19]. This allowed development to focus on the system
architecture, instead of low-level computational and interface blocks.
Figure 3.4: Designing in the System Generator environment
Figure 3.5: FPGA reconstruction flow (Rebin → Filter → Backproject)
3.3 System Architecture
The processing flow necessary for reconstructing an image from a helical cone-
beam scanner is broken up into modules that can be assembled sequentially. This
kind of flow-through architecture lends itself well to a streaming implementation.
Scanner data is passed along through the rebinning, filtering, and backprojection
modules (Figure 3.5).
3.3.1 Rebinning
As discussed in Section 2.2 and shown in Figure 3.6, the rebinning function can
be realized as a coefficient weighting of selectively-accessed data points from scan-
ner input. Two weighted samples are then weighted again and added together
in order to perform the fan-to-parallel conversion. This can be seen in the more
detailed block diagram of the rebinning module, Figure 3.7.

Figure 3.6: Rebinning architecture

Because both coefficients are applied linearly to the data samples, they can be combined into a
single rebinning/fan-to-parallel weight. Doing so reduces the number of hardware
multiplications, memory capacity, and memory bandwidth.
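The weight-combination argument can be sanity-checked in a few lines (the sample values below are arbitrary): applying the rebinning weight and the fan-to-parallel weight in separate stages gives the same result as one pre-multiplied coefficient per sample, at half the multiplier and coefficient-storage cost.

```python
def two_stage(d, i1, i2, wr1, wr2, wf1, wf2):
    """Rebinning weight applied first, fan-to-parallel weight second."""
    return wf1 * (wr1 * d[i1]) + wf2 * (wr2 * d[i2])

def combined(d, i1, i2, w1, w2):
    """One pre-multiplied weight per sample: half the multiplications
    and half the coefficient memory capacity and bandwidth."""
    return w1 * d[i1] + w2 * d[i2]

d = [2.0, 0.0, 0.0, 8.0]                       # illustrative scan samples
a = two_stage(d, 0, 3, 0.25, 0.5, 0.5, 0.25)   # weight, then weight again and add
b = combined(d, 0, 3, 0.25 * 0.5, 0.5 * 0.25)  # same weights, pre-multiplied offline
assert abs(a - b) < 1e-12
```

Because the pre-multiplication happens offline when the rebinning factors are generated, the hardware never pays for the second multiply.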
Rebinning indices and weights are determined by the particular rebinning
function to be performed and the specific parameters of the scanner geometry.
For a given setup, they are considered to be static, and can be loaded into the
system once at initialization. Because these rebinning factors have a one-to-one
relationship with the rebinning output and are pre-computed offline, they can be
arranged in a sequential manner suitable for streaming access from DRAM.
On the other hand, projection data must be accessed according to the rebinning
indices; for this, the random-access capabilities of SRAM are preferable.
Figure 3.7: Rebinning module implementation block diagram
The rebinning function uses data from three consecutive scan revolutions to
reconstruct a stack of rebinned images on their rebinning center λ. Each pixel
in the rebinned pseudo-fan-beam image² is reconstructed by weighting one point
from the scan data, and two fan-beam images are needed to synthesize a single
parallel-beam sinogram. Therefore, for each revolution forming nλ image slices
of npix pixels using nsrc sources, rebinning needs
2× nλ × nsrc × npix (3.1)
rebinning indices and
2× nλ × nsrc × npix (3.2)
rebinning weights. This particular implementation uses the same rebinning func-
tion for all revolutions of data, so only one set of rebinning factors is needed.
Each stack of nλ images needs to draw repeatedly from three revolutions of
scan data. As each revolution is composed of nring rings of nsrc projection sources
²Termed pseudo because the extra scaling does not mathematically synthesize a fan-beam image, but a scaled version of one
and ndet detectors,
3× nring × nsrc × ndet (3.3)
scan projection words must be buffered. This imposes a minimum bitwidth of
⌈log₂(3 × nring × nsrc × ndet)⌉ on the index words.
For the particular scanner parameters targeted here, nλ = 32, nsrc =
192, ndet = 288, and npix = 200. Thus 1,327,104 projection points need to be buffered
to construct 32 rebinned images, and 2,457,600 21-bit indices are needed to select data
from the scan buffer, each paired with one of 2,457,600 rebinning weights. So far, all
quantities have been expressed in terms of words; actual memory
requirements are determined by bitwidth selection.
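These word counts and the index bitwidth are easy to check numerically. The following is a minimal Python sketch of the sizing arithmetic; nring = 8 is an assumption here, implied by the 1,327,104-word total and stated explicitly for the full-scale system in Section 4.3:

```python
import math

# Reduced scanner geometry from the text; n_ring = 8 is an assumption
# implied by the stated 1,327,104-word buffer total.
n_lambda, n_src, n_det, n_pix, n_ring = 32, 192, 288, 200, 8

# Three revolutions of projection data must be buffered (Eq. 3.3).
scan_words = 3 * n_ring * n_src * n_det

# Minimum index bitwidth needed to address the whole scan buffer.
index_bits = math.ceil(math.log2(scan_words))

# Rebinning indices (Eq. 3.1); the weight count (Eq. 3.2) is identical.
n_indices = 2 * n_lambda * n_src * n_pix

print(scan_words, index_bits, n_indices)  # 1327104 21 2457600
```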
The fact that the index bitwidth is 21 bits, even in this case of a very reduced
geometry, sets a firm lower bound when choosing the bitwidths for the rebinning
factors. Two additional considerations are taken into account: that rebinning
indices and weights are used in pairs, and that both must be loaded into DRAM.
Because of this, powers-of-two bitwidths are preferred, as they are more easily
packed together when using a 32-bit interface (the ECC bits are dropped). Since
the 21-bit index precludes a smaller encoding, both indices and weights are stored
as 32-bit words when written into DRAM. Each DRAM transaction made from
the FPGA thus gives four rebinning index/weight pairs. Because data is loaded
into the FPGA via the PowerPC using a telnet protocol, a TCP/IP API from
the MATLAB Instrumentation Control Toolbox can be used. The MATLAB
environment makes it straightforward to read the indices and weights from files,
then work with the data as needed to load it via the API.
When the rebinning factors are read from DRAM into the FPGA, they are
first buffered in a pair of BRAMs using a ping-pong scheme. This helps to further
mask the non-deterministic accesses and smooth out data bursts to help ensure
Figure 3.8: Rebinning factor buffering and readout
a controlled, continuous data flow from DRAM during rebinning. A buffer depth
of 256 index/weight pairs is used, as that corresponds to fitting in exactly a
single 18-kbit BlockRAM each. The Xilinx BlockRAMs are dual-ported and can
support different memory aspect ratios on the two ports—allowing the memory
to be filled up as 128 128-bit words, but read out as 256 64-bit words. In this
way, a burst of 128-bit words, each composed of two index/weight pairs, is read
sequentially from DRAM and written into a buffer. The buffers are read out as
64-bit words, which are bit-sliced into the 32-bit indices and weights. Figure 3.8
diagrams this flow. Whenever a buffer has been emptied, DRAM accesses are
immediately initiated to refill it while data is drawn from the other buffer.
As previously discussed, three consecutive revolutions of scan data are stored
in QDR to implement the rebinning function. As each stack of nλ images is
processed, the buffer is updated with a new revolution of scan data on a first-in,
first-out basis. For each revolution, nring × nsrc × ndet words must be replaced,
which is too large a block of data to be practically done using a true FIFO.
Therefore, the QDR buffer is accessed as a circular buffer, where the revolution
to be discarded is overwritten with new data. To account for the circular buffer
addressing, the rebinning indices coming from DRAM are passed through a block
to map the logical rebinning indices n to a physical QDR address D:
D = (n+N) mod 3N (3.4)
where N is the number of words per revolution. As shown in Figure 3.7, rebinning
weights must be delayed to match the latency of this remapping.
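A behavioral sketch of this address remapping follows. Eq. 3.4 gives the offset-of-one-revolution case; the generalization to an advancing offset is an assumption made here for illustration:

```python
def map_index(n, N, rev_offset=1):
    # Map a logical rebinning index n into the three-revolution
    # circular buffer. N is the number of words per revolution;
    # rev_offset is assumed to advance by one each time a new
    # revolution overwrites the oldest (Eq. 3.4 shows rev_offset = 1).
    return (n + rev_offset * N) % (3 * N)

N = 8 * 192 * 288             # words per revolution, reduced geometry
print(map_index(0, N))        # first logical word lands one revolution in
print(map_index(3 * N - 1, N))
```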
While QDR does not incur latency penalties for random accesses as DRAM
does, its four-word-burst architecture limits access granularity. Physically the
memory array is read and written as a 4M×18-bit array, but the effective interface
is that of a 1M×72-bit array which is accessed every other cycle. This interface
requires that multiple words of 16-bit projection data be packed and unpacked
when transferring in and out of QDR. This can be made transparent to the
rebinning by packing in powers-of-two: each 72-bit QDR word can fit four 16-bit
projection points, allowing the upper ⌈log₂(3 × nring × nsrc × ndet)⌉ − 2 bits of
the index to be used as the QDR address and the lower 2 bits to select between data
sliced from within the QDR word.
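The address/slice split described above amounts to simple bit-slicing of the index. A sketch follows; `qdr_locate` is a hypothetical helper name, not part of the design:

```python
def qdr_locate(index):
    # Four 16-bit projection samples are packed per 72-bit QDR word:
    # the upper bits of the rebinning index select the QDR word, and
    # the lower two bits select the 16-bit slice within it.
    qdr_address = index >> 2
    word_select = index & 0x3
    return qdr_address, word_select

print(qdr_locate(1_327_103))  # last word of the buffer: (331775, 3)
```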
After a projection sample is selected by this indexing method, it is multiplied
with its latency-matched weight using an embedded multiplier. In a nominal
rebinning design each weighted sample would form one pixel in a synthesized
fan-beam sinogram, two of which would be added together with a fan-to-parallel
scaling factor to synthesize one pixel in a parallel-beam sinogram. However, this
implementation outputs a pseudo-fan-beam pixel at this point, and corresponding
pairs must be added together to form the desired parallel-beam image. This can
be done simply by interleaving the rebinning factors for the two pseudo-fan-
beams when loading into DRAM so that pixel pairs are selected and weighted
consecutively.
When combined with the interleaving of two fan-beam pixels to form one
parallel-beam pixel, this results in a reduced reconstruction rate of one rebinned,
parallel-beam image pixel every four clock cycles. Though it adds to processing
latency, this hardware limitation does not impact overall system throughput,
which is bounded by the backprojection stage.
3.3.2 Filtering
An additional optimization step is performed after rebinning to reduce the com-
putational complexity of backprojection by exploiting the symmetry in scanning
geometry. As a scanner rotates around an object, it is easy to visualize that the
projections taken from 180° to 359° scan along the same lines as those taken from
0° to 179°. The resulting sinogram symmetry, which can be seen in Figure 2.2b,
allows a 360° scan to be folded and overlaid upon itself, halving the sinogram;
the number of projection angles in the sinogram therefore becomes nθ = nsrc/2.
This eliminates half of the computations required to perform backprojection, and
is easily implemented inline by following a specific read and write scheme using
a dual-port BRAM.
In a 2-D image or matrix-based implementation of the algorithm, this op-
eration would be achieved by splitting the image into two halves on the θ-axis
θ = [0 : nsrc/2 − 1] and θ = [nsrc/2 : nsrc − 1]. One half would be flipped up/down
around the θ-axis, then the two half-matrices added to each other. The hardware
implementation notes that the matrix is being processed as a long 1-D vector,
appending the matrix columns consecutively. Flipping a matrix in the up/down
Figure 3.9: Filtering architecture
direction means reversing the order of the vector within the column boundaries.
It is also noted that this flip-add operation can be performed by flipping the
first half of the matrix onto the second half; this simply results in a flipped result
that can be corrected later.
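At the matrix level, the flip-add fold described above can be sketched as follows. This is a minimal NumPy model of the operation, not the hardware implementation:

```python
import numpy as np

def fold_sinogram(sino):
    # Rows are the n_pix radial samples, columns the n_src projection
    # angles. Flip the second half of the angles up/down (reverse the
    # rows) and add it onto the first half, halving the angle count.
    n_pix, n_src = sino.shape
    half = n_src // 2
    return sino[:, :half] + sino[::-1, half:]

sino = np.arange(12.0).reshape(3, 4)   # toy 3-sample, 4-angle sinogram
folded = fold_sinogram(sino)           # shape (3, 2)
```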
As the rebinned, parallel-beam sinogram data is being streamed from the
rebinning module, the first (npix × nsrc)/2 samples are written into a BRAM of
depth 2⌈(npix × nsrc)/2⌉. As the second half of the vector is output, the BRAM is read out in an
inverted-column manner: a down-counter continuously counts down from npix−1
to 0, addressing each row within a column. Every npix reads, npix is added to the
address to increment to the next column. The sample output from the BRAM
using this read address scheme is added to the sample coming from rebinning. In
this way, the flip-add operation can be performed within the same time window
of rebinning output. The whole sinogram is then written into another BRAM
buffer in order to feed the filter module; the same inverted-column scheme is used
at this point to undo the aforementioned image flip.
Once the sinogram has been folded in this way, images are reconstructed
using a serial implementation of standard filtered backprojection (FBP). High-
pass filtering is accomplished in the frequency domain using the prototypal FFT-
multiply-IFFT technique, as shown in Figure 3.9. A streaming, radix-2 complex
FFT block is used from the ROACH DSP design library [20]. To account for the
fact that real data is being used for a complex FFT, data from each projection
angle is zero-padded to a length of 2^⌈log₂(2 × npix)⌉ samples. Rather than expend
memory resources, the data stream is zero-padded by muxing in a constant zero
for the appropriate samples. For this image size 512-point FFT blocks are used for
both the FFT and IFFT operations. In both transforms, data is cast into 18-bit
numbers, as that offers the best efficiency in the FPGA’s embedded multipliers.
This demonstrator uses a simple ramp filter, with pre-computed filter coeffi-
cients being supplied from an on-FPGA ROM. Some FBP implementations use
more sophisticated high-pass filters to achieve different results in image quality.
Building the filtering stage in the frequency domain allows for flexibility in this
module, as the filtering function can be changed simply by changing the filter
coefficients, with no structural changes. This can even be done at
runtime, if necessary, by replacing the ROM shown in Figure 3.9 with a Shared
BRAM.
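The FFT-multiply-IFFT flow can be modeled in a few lines. This is a sketch: the plain |f| ramp below stands in for the precomputed ROM coefficients, and NumPy's double-precision FFT stands in for the 18-bit streaming blocks:

```python
import numpy as np

n_pix = 200
n_fft = 512                     # 2^ceil(log2(2 * n_pix))
projection = np.random.default_rng(0).normal(size=n_pix)

padded = np.zeros(n_fft)
padded[:n_pix] = projection     # zero-pad by muxing in constant zeros

ramp = np.abs(np.fft.fftfreq(n_fft))               # ramp filter |f|
spectrum = np.fft.fft(padded) * ramp               # FFT-multiply
filtered = np.real(np.fft.ifft(spectrum))[:n_pix]  # -IFFT, trim the pad
```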
3.3.3 Backprojection
Following the algorithm detailed in Section 2.1.2, each angle of sinogram data
coming out of the filter is used in backprojection.
Figure 3.10: Backprojection architecture
The basic backprojection algorithm, while iterative, is highly sequential in that
there are no dependencies
within the loop. This means that the loops can be re-ordered in any way for
a serial implementation. Taking results directly from the filter output makes
a by-θs reconstruction approach the most natural, in which the outermost loop
increments on θ. Figure 3.10 illustrates the architecture of this type of backpro-
jection.
Though data is output from the filter at a deterministic rate, the filtered
vectors are buffered in two BRAMs in a ping-pong scheme, as with the rebinning
factors discussed in Section 3.3.1. The ping-pong or double-buffered approach
alternates using each BRAM for reading and writing, such that when one buffer
is being read from, the other is being written into. While there is no need to
absorb non-deterministic latencies in this case, double-buffering here allows the
Figure 3.11: Sinogram projection in-memory zero-padding
backprojector to run without interruption. As soon as the loop for one θ has
been completed, the vector of the next has already been written to BRAM and
its loop can begin without pausing.
For each filtered vector of npix samples buffered in the BRAM, the appropriate
pixels can be selected using the index
n = x cos θ + y sin θ (3.5)
A restriction on this addressing is that it must access a vector at least as long
as the output image diagonal, npix√2. In this case a projection of length npix
is being used to reconstruct an image npix pixels on a side, so the vector must
be zero-padded to avoid indexing out-of-bounds. This is done inline by writing
the samples into the zero-initialized BRAM with an address offset of
npix(√2 − 1)/2, so that the samples are centered in a zero-vector of length
npix√2 (Figure 3.11).
A lookup table, indexed by an incrementing θ, is used to store pre-computed
sine and cosine values for the buffer address calculation. As the implementation
uses a straightforward loop ordering, simple counters can be used for generating
the x, y, and sin θ/cos θ values.
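A behavioral model of these loops follows: nearest-neighbor sample selection per Eq. 3.5, centered coordinates, and the zero-padded projection buffer. This is a sketch of the loop structure, not a cycle-accurate model of the hardware:

```python
import math
import numpy as np

def backproject(sinogram, n_pix):
    # Outer loop on theta; inner loops sweep x and y; nearest-neighbor
    # sample selection at n = x*cos(theta) + y*sin(theta) (Eq. 3.5).
    n_theta = sinogram.shape[1]
    pad = int(math.ceil(n_pix * math.sqrt(2)))   # zero-padded buffer length
    image = np.zeros((n_pix, n_pix))
    c = (n_pix - 1) / 2.0                        # center the coordinates
    for t in range(n_theta):
        theta = t * math.pi / n_theta
        buf = np.zeros(pad)
        off = (pad - n_pix) // 2                 # center samples in buffer
        buf[off:off + n_pix] = sinogram[:, t]
        for x in range(n_pix):
            for y in range(n_pix):
                n = (x - c) * math.cos(theta) + (y - c) * math.sin(theta)
                image[x, y] += buf[int(round(n)) + pad // 2]
    return image

img = backproject(np.ones((16, 8)), 16)          # toy 16-pixel, 8-angle case
```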
The accumulation function of the backprojection process is carried out us-
ing the second QDR SRAM on the ROACH, and is greatly facilitated by the
memory’s dual-ported, fixed-latency architecture. The pixel matrix of the image
is mapped onto a single long array in the SRAM, where a pixel location (x, y)
corresponds to the word at memory location (x−1)npix+y. The full image needs
to be stored in the QDR, requiring that
n²pix (3.6)
pixels be stored.
The serial, nearest-neighbor implementation of the backprojection loops means
that there are no dependencies or conflicts on a given pixel within the same θ
loop—giving an access spacing of n²pix − 1 clock cycles. This allows the backprojection
to continuously step through the QDR memory using a single address
counter.
In the steady-state, one counter leads the other and reads out pixels from the
accumulating image. Samples are concurrently selected from the projection buffer
according to the indexing given in Equation 3.5. After latency-matching to the
image pixels being read from QDR, the projection sample is added to the pixel
and written back into the accumulation buffer. Addressing for the writing phase
is simply a delayed version of the read address—in this instance 10 cycles for the
QDR latency, plus an additional 3 cycles for the accumulation adder latency.
The system’s streaming dataflow architecture means that the backprojector
runs continuously with no gaps. Because the system’s throughput is currently lim-
ited by backprojection, each backprojected image must be immediately followed
by a new one to achieve maximum performance. The read and write accesses to
the accumulation memory can be run in the same way at all times if the data
flow is handled specially at the image boundaries. Since every pixel location is
continuously read out for accumulation, the completed backprojection
will automatically be output from the memory at the (nθ + 1)th read iteration.
This set of data is simply flagged as valid, with no other control structures nec-
essary. At this point, the accumulator must be cleared to begin backprojecting
the next image. The memory can be cleared of the previous image by muxing in
a zero to the accumulation adder for the subsequent n²pix writes, with no changes
to the QDR access pattern.
3.3.4 Runtime Testing
One of the key features of this implementation is that it is designed to be directly
integrated into a scanner system for real-time image reconstruction, rather than
be an offline post-processing accelerator. However, due to the lack of a suitable
system for such integration, two minor enhancements allow testing the design
without a CT scanner. Because the platform is standalone in terms of the re-
sources it needs to carry out the reconstruction function, the only missing pieces
are interfaces to get raw scan data into and reconstructed image data out of the
design.
This modification is greatly simplified by the streaming architecture of the
reconstruction engine. Because the data input interface assumes nothing more
than data samples accompanied by a valid flag, alternate test data can easily be
multiplexed into the input stream. Likewise, the output interface is a data/valid
pair, which can be fanned out into an alternate endpoint. Shared BRAMs can
be used in both cases: at the input side one is loaded with test projection data
Figure 3.12: Reconstruction engine runtime test architecture
from a server, then streamed out into the reconstruction engine; at the output
reconstructed images are written into another memory, then read back to the
server (Figure 3.12).
CHAPTER 4
Analysis
4.1 Performance
4.1.1 Hardware Implementation
The fully-pipelined streaming architecture of the reconstruction engine allows us
to consider its performance primarily from a throughput perspective. The engine
is structured in such a way so as to provide the maximum possible single-pipeline
performance. After some initial latency to completely fill the pipeline, the design
is able to execute one iteration of the backprojection loop every FPGA clock cycle,
and can sustainably stream out a reconstructed image every
(n²pix × nθ) / fclk (4.1)
(4.1)
seconds. The implemented system demonstrates a 200×200-pixel image backpro-
jected from a 96-angle sinogram running on an FPGA clocked at 200 MHz. This
results in a sustained backprojection throughput of 52.08 images per second—one
image every 0.0192 seconds.
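Eq. 4.1 and the quoted figures are easy to verify numerically. A sketch using the parameters stated above:

```python
# Parameters of the implemented design.
n_pix, n_theta, f_clk = 200, 96, 200e6

seconds_per_image = n_pix**2 * n_theta / f_clk   # Eq. 4.1
images_per_second = 1 / seconds_per_image

print(seconds_per_image, images_per_second)      # 0.0192, ~52.08
```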
The reconstruction steps preceding the O(n³) backprojection phase—rebinning,
weighting, folding, and filtering—all work on sinograms, and so tend to be O(n²)
calculations. They can therefore be masked by the processing time of backprojection.
In order to facilitate the streaming architecture, upstream modules are throttled
to provide single images of data for processing on a by-θs basis.
For each reconstructed image, the rebinning module must be able to supply
one sinogram. As discussed in Section 3.3.1, the combination of QDR burst-
access restrictions and fan-to-parallel sample interleaving reduces output from the
rebinning function to one pixel every four clock cycles. The rebinning modules
can thus produce one sinogram every
(4 × npix × nsrc) / fclk (4.2)
seconds, for a maximum of 1302 images per second. However, once the rebinned
sinogram is complete, the module idles in order to match the backprojection
throughput. This leaves the rebinning module with a 4% duty cycle: it performs
the rebinning function in a burst of 4 × npix × nsrc cycles, then waits for the
remaining 96% of the image reconstruction operation. The current design thus has
the headroom for a single rebinning module to feed accelerated or multiple
backprojectors, up to a factor of 24× increased throughput.
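The rebinning throughput and duty cycle quoted above follow directly. A numeric sketch:

```python
n_pix, n_src, f_clk = 200, 192, 200e6

rebin_cycles = 4 * n_pix * n_src                 # cycles per sinogram
rebin_images_per_second = f_clk / rebin_cycles   # Eq. 4.2 inverted
duty_cycle = rebin_cycles / (n_pix**2 * 96)      # vs. backprojection time

print(rebin_images_per_second, duty_cycle)       # ~1302, 0.04
```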
Folding the sinogram to account for projection symmetry halves the effective
data rate in the pipeline, but is still bound by the 4 × npix × nsrc performance of
rebinning to get all of the necessary data points. As such, its nominal throughput
is considered to be the same as that of the rebinning module.
The bridge between the O(n²) and O(n³) domains lies in the buffer that
precedes the filter for filtered backprojection. Unlike the preceding modules, this
buffer has a more distributed active/idle duty cycle. It forms the outermost loop
of the filtered backprojection flow, incrementing on θ and dispensing bursts of
2^⌈log₂(2 × npix)⌉ (zero-padded from npix) data samples to the FBP modules every n²pix
cycles. The filtering operation also works in this intermediate domain, producing
2^⌈log₂(2 × npix)⌉ samples per n²pix cycles as detailed in Section 3.3.2. The filter can
process
fclk / (nθ × 2^⌈log₂(2 × npix)⌉) (4.3)
sinograms per second—4069 per second for the given geometries. From here, each
filtered dataset moves to the backprojection module that processes the inner
two loops of the algorithm. This final module, as previously described, is the
primary throughput bottleneck in the reconstruction pipeline, at only 52 images
per second.
4.1.2 Software Reference
Because of the developmental nature of the particular CB-SSRB reconstruction
algorithm used, the available software reference is a MATLAB implementation.
Backprojection, however, is available as a library function via the inverse Radon
transform. The software can compute backprojection using either the built-in
MATLAB iradon function or a compiled C executable. All data and rebinning
tables are accessed from MATLAB MAT-files stored on local disk. The software
implementation was evaluated on an eight-core (dual-socket, quad-core) server
with Intel Xeon E5420 CPUs and 16 GB of RAM. Each processor has 12 MB of
cache and runs at 2.5 GHz.
MATLAB’s matrix representation for all data structures makes it a natural
environment for working with images, which are treated as matrices of pixels.
Beyond the ability to substitute a C program to perform backprojection, no fur-
ther optimizations were made to the MATLAB implementation. Software perfor-
mance is summarized in Table 4.1. Even with a single-pipeline implementation,
the FPGA provides about a 9.5× speedup versus the native MATLAB iradon
function for backprojection, and almost a 5× speedup versus the C implementation.
Rebinning sees a similar speedup with the FPGA, while the FPGA filter can
provide 18× increased throughput.
Function                        Software (img/sec)  FPGA (img/sec)  Speedup
Rebinning                       251.3               1302            5.181×
Filtering                       221.1               4069            18.53×
Backprojection (MATLAB)         5.435               52.08           9.583×
Backprojection (C-accelerated)  10.45               52.08           4.984×
Table 4.1: Image Throughput Comparison
4.2 Hardware Resource Utilization
One of the primary downsides of FPGAs, in comparison to software, is the finite
number of resources per device. The resource utilization on a Xilinx Virtex-5
SX95T device is summarized in Table 4.2. While 70% of the SLICEs are occu-
pied, logic utilization (lookup tables and registers) is at about 40%. The num-
bers presented account for all overhead associated with the fully-running design,
including memory and communications controllers for the devices described in
Section 3.2. In particular, BRAM usage is very high (95%) due to the runtime
testing framework (Section 3.3.4): The Shared BRAMs used for loading test data
and for capturing final output data need 128 BlockRAMs each. An additional
32 are used for capturing intermediate data for debugging and analysis. In a
deployment system these BRAMs would not be necessary, and the number of
BRAMs used would be reduced to about 37% of the total available.
Total off-chip memory usage is shown in Table 4.3. As seen, raw memory
usage for all functions totals in excess of 20 MB, far outstripping the 8,784 kb
(≈ 1.07 MB) available from the on-chip BRAMs. From an FPGA standpoint,
reconstruction is very clearly a memory-bound application requiring less than
Resource        Used    Available  % Used
Look-up Tables  20,850  58,880     35%
Registers       24,813  58,880     42%
SLICEs          10,332  14,720     70%
BlockRAMs       468     488        95%
  (deployment)  180     488        37%
DSP48Es         61      640        9%
Table 4.2: FPGA Hardware Resource Utilization Summary
Module          Resource  Available  Used       % Used
Rebinning       DRAM      1 GB       18.75 MB   1.8%
                QDR SRAM  8 MB       2.53 MB    31.6%
Backprojection  QDR SRAM  8 MB       0.1526 MB  1.97%
Table 4.3: External Memory Utilization Summary
half of its on-chip logic resources, but depending heavily on external memory.
However, though the memory requirements are large relative to BRAM capacities,
utilization of each external memory device is quite low. The only exception is
QDR SRAM available for the rebinning buffer. At 31.6% utilization, this was
the limiting factor for scaling the system up to the full size of reconstructing
800×800-pixel images from 768 sources and 1152 detectors—an operation that
would have required 16× more memory in each buffer.
4.3 Full-Scale System
The implemented system demonstrates the feasibility of an FPGA-based solution
to real-time 3-D tomographic reconstruction. However, the reduced geometries
would not be particularly useful for practical applications. The architecture pre-
sented would need to be scaled to match the specifications of the reference sys-
tem [6] to find real-world use. The performance and resource results discussed
in the preceding sections serve as a baseline for this scaling. The scanner geometries
would be increased four-fold in each dimension. Rather than reconstruct a
200×200-pixel image from 192 sources and 288 detectors, an 800×800-pixel image
would be constructed from 8 rings of 768 sources and 1152 detectors. Each revolution
would still be used to synthesize 32 rebinning centers. In addition, the real-time
reconstruction target imposes minimum performance requirements. The system
would need to be able to reconstruct 480 images per second at those geometries.
4.3.1 Single-Engine Scaling
One of the primary benefits of using FPGAs for real-time applications is their
deterministic behavior. This is especially true for dataflow-type applications, such
as this one, in which computations are carried out in a simple sequence of opera-
tions with no dynamic control or branching. The deterministic behavior further
allows scaling performance characteristics for different reconstruction sizes, based
on Eqs. 4.1, 4.2, and 4.3.
Given that the design is fully parameterized for any nsrc, ndet, nλ, npix, and
nθ, these can simply be increased to match the desired geometry. The presented
implementation architecture would have similar speedup factors, resulting in a
system that would have performance as summarized in Table 4.4.
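Scaling Eqs. 4.1 through 4.3 to the full geometry reproduces the FPGA column of Table 4.4. A numeric sketch, where the 384 angles follow from folding the 768 sources:

```python
import math

# Full-scale geometry: 800x800 images, 768 sources, 200 MHz clock.
n_pix, n_src, f_clk = 800, 768, 200e6
n_theta = n_src // 2                                    # 384 after folding
n_fft = 2 ** math.ceil(math.log2(2 * n_pix))            # 2048

backproject_ips = f_clk / (n_pix**2 * n_theta)          # ~0.81 img/sec
rebin_ips = f_clk / (4 * n_pix * n_src)                 # ~81.4 img/sec
filter_ips = f_clk / (n_theta * n_fft)                  # ~254 img/sec
```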
Function                        Software (img/sec)  FPGA (img/sec)  Speedup
Rebinning                       9.351               81.38           8.703×
Filtering                       14.01               254.3           18.15×
Backprojection (MATLAB)         0.05295             0.8138          15.34×
Backprojection (C-accelerated)  0.1676              0.8138          4.856×
Table 4.4: Full-Scale Image Throughput Comparison
Module          Resource  Required  Available  % Available
Rebinning       DRAM      300 MB    1 GB       29.3%
                SRAM      40.5 MB   8 MB       506.3%
Backprojection  SRAM      2.441 MB  8 MB       30.52%
Table 4.5: Full-Scale External Memory Requirements
Such an engine design would have memory requirements as shown in Table 4.5.
As previously mentioned the rebinning buffer dominates memory requirements
for the full-scale system, and is the only external memory subsystem for which the
current platform is insufficient. This can be remedied by taking advantage of the
technology scaling progress since the ROACH platform was developed; 144 Mbit
QDR SRAMs have since become available [21]. With that advancement, a six-fold
increase in memory capacity would be feasible by replacing the current
8 MB device with three 16 MB devices; for this purpose only additional capacity
is needed, not bandwidth, so the three devices can be bussed on a single bank
so that only three additional I/O pins are needed from the FPGA to cover the
increased address space.
Function        Expected (img/sec)  Required (img/sec)  Gap
Rebinning       81.38               480                 5.898×
Filtering       254.3               480                 1.888×
Backprojection  0.8138              480                 589.8×
Table 4.6: Expected Real-Time Performance Gap
If sufficient memory were available to scale up to the full-sized image using this
implementation scheme, the engine would have a sustained throughput of about
0.8 images per second. From this starting point, an engine would require a nearly
600× increase in order to meet the real-time performance target (Table 4.6).
4.3.2 Multi-Engine Scaling
As expected, it is clear that backprojection performance must be accelerated
in order to meet the real-time throughput requirements. As has been shown,
parallelization is an effective way of achieving this [8, 9, 22]. The existing by-
θs architecture, in particular, can be adapted to using parallel reconstruction
engines for each θ.
It can be observed from the backprojection algorithm that there is no
θ-dependency in the sinogram indexing operation. In other words, each θ-slice of
the sinogram can be operated on independently of any other. Potential hazards
only arise when the sinogram pixels need to be accumulated into the reconstructed
image. The advantages of sequential loop ordering and the pipelined, streaming
reconstruction architecture can be combined to avoid this hazard. If all sinogram
indexing were to be performed in parallel, then it is easy to see that x and
y will increment simultaneously. If this is the case, then accumulating the image
pixels can be achieved using an adder tree, rather than an accumulation memory.
If the operation cannot be made sufficiently parallel, a hybrid solution using
both an adder tree and an accumulation memory can still be implemented. The
accumulation is still safe from data hazards due to the sequential memory access
pattern described in Section 3.3.3.
The only caveat for this architecture is that all parallel θs must be indexed
with the same x and y, meaning the engine would need to wait until all are
available: for an N× parallel implementation the first filtered projection would
need to be buffered until the Nth projection is output from the filter. However,
the current implementation already buffers an entire sinogram before the filtering
stage in order to match the throughput of the backprojector. This buffer could
be broken up and pushed down the pipeline to buffer filtered projections, rather
than unfiltered projections.
For backprojection itself, the limiting factor in parallelization will be
the amount of memory available for these filtered sinogram buffers. For a fully-
parallelized implementation, nθ buffers of depth
2^⌈log₂(npix√2)⌉ (4.4)
would be needed; an 800×384 sinogram would therefore need 384 2048-word
memories. If filtered sinograms remain at 18 bits to take advantage of the DSP48E
architecture, then each buffer would be exactly 36 kb. While 384 36-kb BRAMs
are not available on the SX95T being used, the larger SX240T has 516 such
memories available [15]. In the newer Virtex-6 family all but 2 of the 13 devices
have more than 384 36-kb BlockRAMs [23].
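The buffer sizing above reduces to a quick check of Eq. 4.4 and the 36-kb figure:

```python
import math

# Filtered-projection buffer sizing for the full-scale 800x384
# sinogram, at 18 bits per sample.
n_pix = 800

depth = 2 ** math.ceil(math.log2(n_pix * math.sqrt(2)))   # 2048 words
bits_per_buffer = depth * 18                               # 36 kb

print(depth, bits_per_buffer)   # 2048 36864
```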
As the parallelization factor n‖ increases the backprojection throughput, the
idle factor of upstream modules erodes. An n‖× parallelized backprojector needs
(n²pix × nθ) / n‖ (4.5)
cycles to complete a backprojection, whereas the filter needs
n‖ × 2^⌈log₂(2 × npix)⌉ (4.6)
cycles to supply the necessary portion of the filtered sinogram. As n‖ increases,
there is a crossover point at which the filter cannot keep up with the backprojector
(Figure 4.1). At a 192× parallelization factor, the filter has sufficient cycles
to filter all of the necessary projection θs. At 384× parallelization, increased
filter throughput will be needed. Fortunately, the biplex architecture of the
FFT engines used provides the ability to process two fully independent streams
simultaneously without using additional DSP resources [20]. Therefore, a second
filter can be implemented for only the cost of an extra filter coefficient multiplier.
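The crossover can be checked directly from Eqs. 4.5 and 4.6. A numeric sketch at the full-scale geometry:

```python
n_pix, n_theta, n_fft = 800, 384, 2048

def backproject_cycles(n_par):
    # Eq. 4.5: cycles per image for an n_par-way parallel backprojector.
    return n_pix**2 * n_theta // n_par

def filter_cycles(n_par):
    # Eq. 4.6: cycles for the filter to supply n_par projections.
    return n_par * n_fft

# At 192x the filter still keeps up; at 384x it no longer does.
print(filter_cycles(192), backproject_cycles(192))   # 393216 1280000
print(filter_cycles(384), backproject_cycles(384))   # 786432 640000
```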
Further upstream, rebinning is performed in the O(n2) domain, rather than
the O(n3) domain. Because it rebins an entire sinogram in order to perform the
flip/add operation for symmetry folding, it requires a constant
4× npix × nsrc (4.7)
cycles independent of the parallelization factor. As Figure 4.2 shows, the crossover
due to backprojection parallelization occurs much sooner than filtering, due
largely to the factor of four. If the throughput can be doubled, then the sys-
tem will have sufficient throughput for a 192×-parallel backprojector; removing
the factor of four altogether will provide sufficient throughput for a 384×-parallel
backprojector.
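The rebinning crossover follows from comparing Eq. 4.7 against Eq. 4.5. A numeric sketch:

```python
# Rebinning needs a fixed 4 * n_pix * n_src cycles per sinogram
# (Eq. 4.7), regardless of backprojector parallelization. Halving or
# quartering that cost covers the 192x and 384x cases respectively.
n_pix, n_src, n_theta = 800, 768, 384

rebin = 4 * n_pix * n_src                      # 2,457,600 cycles

def backproject_cycles(n_par):
    return n_pix**2 * n_theta // n_par         # Eq. 4.5

print(rebin, backproject_cycles(192), backproject_cycles(384))
```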
The first factor of two is a limitation of the QDR-II+ SRAM four-word-burst
architecture. If the SRAM used for the rebinning buffer were changed to one
Figure 4.1: Filtering trade-off in parallelized computation times
using a two-word burst (at 18 bits) architecture, then rebinning indexing could be
performed with fully-random access at the full SRAM bandwidth. Such memories
are currently available [24,25]. Removing the other factor of two involves changing
the interleaving of adjacent fan-to-parallel samples. This is a more expensive
change, requiring two independent buffers. The same projection data would
be written in simultaneously, but they would be read using different rebinning
indices. This change also doubles the amount of DRAM bandwidth required to
load rebinning factors, as each 128-bit word read from DRAM would be used in
a single cycle.
The 384× speedup afforded by parallelization goes a long way towards closing
the performance gap. The remaining factor of 1.536 can be made up simply by
increasing the clock frequency of the FPGA. Upclocking from 200 MHz to
[Figure: computation time (cycles) versus parallelization factor n‖, with curves for Backprojection, Rebinning, Rebinning (2x), and Rebinning (4x)]
Figure 4.2: Rebinning trade-off in parallelized computation times
333 MHz will net a 1.665× throughput improvement, and will be feasible on the
newer, faster Virtex-6 devices. Because the algorithm has already been mapped
to a flexible hardware architecture, further speed increases can be attained by
directly porting the FPGA design to a suitable ASIC implementation [26,27].
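As a rough back-of-the-envelope check (treating the single-pipeline rate of about 0.8 images per second for the full geometry, reported in the summary, as the baseline; all figures approximate):

```python
base_rate = 0.8        # images/s, single-pipeline baseline, full geometry
parallel = 384         # backprojector parallelization factor
clock_up = 333 / 200   # 200 MHz -> 333 MHz, a 1.665x gain

projected = base_rate * parallel * clock_up
print(f"projected rate: {projected:.0f} images/s")  # comfortably above 480
```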
CHAPTER 5
Conclusion
5.1 Summary of Research Contributions
This thesis demonstrates the feasibility of a hardware system architecture capable
of handling the full image reconstruction flow for processing projection data from
a helical cone-beam scanner. It can handle the rebinning, filtering, and backpro-
jection steps necessary for such a reconstruction. The system is designed with a
pipelined, streaming architecture targeting integration for real-time applications,
as opposed to offline post-processing acceleration. The design is implemented in
hardware on a Xilinx SX95T FPGA using the academic ROACH reconfigurable
instrumentation platform.
The hardware implementation is capable of reconstructing more than 52 im-
ages per second. Despite being a single-pipeline design intended to establish a
baseline performance, the FPGA achieved a roughly 5× performance improve-
ment versus a software reference for a reduced system geometry. It is expected
to maintain a similar speedup for the full-sized system, at 0.8 images per second.
The engine architecture has also been shown to be flexible and scalable.
5.2 Future Work
The architecture laid out in this thesis can be used as the basis for a full-scale
implementation that can meet the real-time performance requirement of 480 im-
ages per second for 800×800-pixel images. This will require the design of a
new hardware platform with increased SRAM memory capacity and bandwidth.
Enhancements to engine throughput can be achieved through the flexibility for
increased parallelism enabled by the streaming architecture. The hardware-based
design scheme is also amenable to an ASIC port of the engine.
References
[1] C. Chang, J. Wawrzynek, and R. W. Brodersen, "BEE2: A High-End Reconfigurable Computing System," IEEE Design & Test of Computers, pp. 114–125, March 2005.
[2] A. C. Kak and M. Slaney, Principles of Computerized Tomographic Imaging. IEEE Press, New York, 1987.
[3] P. Dreike and D. P. Boyd, "Convolution Reconstruction of Fan Beam Projections," Computer Graphics and Image Processing, vol. 5, no. 4, pp. 459–469, 1976.
[4] M. M. Betcke and B. R. Lionheart, "Optimal Surface Rebinning for Image Reconstruction from Asymmetrically Truncated Cone Beam Projections," in preparation.
[5] F. Noo, M. Defrise, and R. Clackdoyle, "Single-slice rebinning method for helical cone-beam CT," Physics in Medicine and Biology, vol. 44, no. 2, p. 561, 1999. [Online]. Available: http://stacks.iop.org/0031-9155/44/i=2/a=019
[6] E. Morton, K. Mann, A. Berman, M. Knaup, and M. Kachelrieß, "Ultrafast 3D Reconstruction for X-Ray Real-Time Tomography (RTT)," in Nuclear Science Symposium Conference Record (NSS/MIC), 2009 IEEE, Oct. 24–Nov. 1, 2009, pp. 4077–4080.
[7] "Xilinx Virtex 2.5 V Field Programmable Gate Arrays." [Online]. Available: http://www.xilinx.com/support/documentation/data_sheets/ds003.pdf
[8] J. Xu, "A FPGA Hardware Solution for Accelerating Tomographic Reconstruction," Master's thesis, Electrical Engineering Department, University of Washington, 2009.
[9] N. Subramanian, "A C-to-FPGA Solution for Accelerating Tomographic Reconstruction," Master's thesis, Electrical Engineering Department, University of Washington, 2009.
[10] "Altera Stratix-II Device Handbook." [Online]. Available: http://www.altera.com/literature/hb/stx2/stratix2_handbook.pdf
[11] "HyperTransport Consortium." [Online]. Available: http://www.hypertransport.org
[12] "Wikipedia: Cell (microprocessor)." [Online]. Available: http://en.wikipedia.org/wiki/Cell_(microprocessor)
[13] "ROACH." [Online]. Available: https://casper.berkeley.edu/wiki/ROACH
[14] "Collaboration for Astronomy Signal Processing and Electronics Research (CASPER)." [Online]. Available: http://casper.berkeley.edu
[15] "Virtex-5 Family Overview." [Online]. Available: http://www.xilinx.com/support/documentation/data_sheets/ds100.pdf
[16] "CASPER Mailing List Archive MSG #01418." [Online]. Available: http://www.mail-archive.com/[email protected]/msg01418.html
[17] "System Generator for DSP." [Online]. Available: http://www.xilinx.com/tools/sysgen.htm
[18] "Simulink - Simulation and Model-Based Design." [Online]. Available: http://www.mathworks.com/products/simulink/
[19] "CASPER Block Documentation." [Online]. Available: https://casper.berkeley.edu/wiki/Block_Documentation
[20] "CASPER FFT Library Block Documentation." [Online]. Available: https://casper.berkeley.edu/wiki/Fft
[21] "Cypress CY7C1643KV18-400BZC." [Online]. Available: http://www.cypress.com/?mpn=CY7C1643KV18-400BZC
[22] S. Coric, M. Leeser, E. Miller, and M. Trepanier, "Parallel-Beam Backprojection: An FPGA Implementation Optimized for Medical Imaging," in Proceedings of the 2002 ACM/SIGDA Tenth International Symposium on Field-Programmable Gate Arrays, ser. FPGA '02. New York, NY, USA: ACM, 2002, pp. 217–226. [Online]. Available: http://doi.acm.org/10.1145/503048.503080
[23] "Virtex-6 Family Overview." [Online]. Available: http://www.xilinx.com/support/documentation/data_sheets/ds150.pdf
[24] "Cypress CY7C1612KV18-333BZXC." [Online]. Available: http://www.cypress.com/?mpn=CY7C1612KV18-333BZXC
[25] "Cypress CY7C1618KV18-333BZXC." [Online]. Available: http://www.cypress.com/?mpn=CY7C1618KV18-333BZXC
[26] D. Markovic, C. Chang, B. Richards, H. So, B. Nikolic, and R. W. Brodersen, "ASIC Design and Verification in an FPGA Environment," in Custom Integrated Circuits Conference, 2007. CICC '07. IEEE, Sept. 2007, pp. 737–740.
[27] R. Nanda, "DSP Architecture Optimization in MATLAB/Simulink Environment," Master's thesis, Electrical Engineering Department, University of California, Los Angeles, 2008.