An Automated Flow to Generate Hardware
Computing Nodes from C for an FPGA-Based
MPI Computing Network
by
D.Y. Wang
A THESIS SUBMITTED IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
BACHELOR OF APPLIED SCIENCE
DIVISION OF ENGINEERING SCIENCE
FACULTY OF APPLIED SCIENCE AND ENGINEERING
UNIVERSITY OF TORONTO
Supervisor: Paul Chow
April 2008
Abstract
Recently there have been initiatives from both industry and academia to explore
the use of FPGA-based application-specific hardware acceleration in high-performance
computing platforms, as traditional supercomputers based on clusters of generic CPUs
fail to scale to meet the growing demands of computation-intensive applications due
to limitations in power consumption and cost. Research has shown that a heterogeneous
system built exclusively on FPGAs, using a combination of different types
of computing nodes such as embedded processors and application-specific hardware
accelerators, is a scalable way to use FPGAs for high-performance computing. An
example of such a system is the TMD [11], which also uses a message-passing network
to connect the computing nodes. However, the difficulty of efficiently designing
high-speed hardware modules from software descriptions is preventing FPGA-based
systems from being widely adopted by software developers. In this project, an automated
tool flow is proposed to fill this gap. The AUTO flow automatically generates
a hardware computing node from a C program that can be used
directly in the TMD system. As an example application, a Jacobi heat-equation solver
is implemented in a TMD system where a soft processor is replaced by a hardware
computing node generated using the AUTO flow. The AUTO-generated hardware
module shows equivalent functionality and some improvement in performance over
the soft processor. The AUTO flow demonstrates the feasibility of incorporating
automatic hardware generation into the design flow of FPGA-based systems so that
such systems can become more accessible to software developers.
Acknowledgment
I thank Synfora and Xilinx for hardware, tools and technical support, and my
supervisor, Professor Paul Chow, for his guidance, patience, and insights, all of which
were invaluable to the completion of this project. Thanks to Chris Madill and Arun
Patel for their help in setting up the development environment, and to Manuel Saldaña
for help with the MPE network and scripts, and for patiently answering all my questions
during the many unscheduled drop-by visits. Many thanks also to Henry Wong for
discussions, suggestions and debugging tips, and to Ryan Fung for proofreading the final
report. Finally, I would like to thank my mother for her love and support, as always.
Contents
1 Introduction 1
2 Related Work 4
2.1 FPGA-Based Computing . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 The TMD-MPI Approach . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Behavioral Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3 System Setup 9
3.1 TMD Platform Architecture . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3 C-to-HDL Using PICO . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4 Implementation of the Tool Flow 15
4.1 Flow Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2 MPI Library Implementation . . . . . . . . . . . . . . . . . . . . . . 16
4.3 Control Block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.4 Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.4.1 Preprocessing Script . . . . . . . . . . . . . . . . . . . . . . . 20
4.4.2 Packaging Script . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.5 Test Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.6 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.6.1 Floating-Point Support . . . . . . . . . . . . . . . . . . . . . . 26
4.6.2 Looping Structure . . . . . . . . . . . . . . . . . . . . . . . . 26
4.6.3 Pointer Support . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.6.4 Division Support . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.6.5 Performance Specification . . . . . . . . . . . . . . . . . . . . 27
4.6.6 Hardware Debugging . . . . . . . . . . . . . . . . . . . . . . . 28
4.6.7 Exploitable Parallelism . . . . . . . . . . . . . . . . . . . . . . 28
5 The Heat-Equation Application 30
5.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.2 Experiment Methodology . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6 Conclusion and Future Work 36
Appendix 37
A Hardware Controller for PICO PPA 37
A.1 Control FSM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
A.2 Stream Interface Translation . . . . . . . . . . . . . . . . . . . . . . . 39
B Using PICO: Tips and Workarounds 43
B.1 Stream Ordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
B.2 Improving Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Bibliography 47
Glossary
The glossary contains the acronyms that are used in this report.
• CAD – Computer Aided Design
• CPE – Cycles Per Element
• DCM – Digital Clock Manager
• FSL – Fast Simplex Link. Xilinx’s FIFO stream IP block.
• FSM – Finite State Machine
• HDL – Hardware Description Language
• HPC – High-Performance Computing
• IP – Internet Protocol
• IP – Intellectual Property
• MHS – Microprocessor Hardware Specification
• MSS – Microprocessor Software Specification
• MPI – Message Passing Interface
• MPE – Message Passing Engine. Provides MPI functionality to hardware
accelerators in a TMD system.
• NetIf – Network Interface used in the TMD network
• PICO – Program-In Chip-Out. An algorithmic synthesis tool from Synfora,
Inc.
• PPA – Pipeline of Processing Arrays. The top-level hardware block generated
from a function by the PICO flow.
• TCAB – Tightly Coupled Accelerator Block. A hardware module generated
by PICO from a C procedure that can be used as a black box when generating
a higher-level hardware block.
• TMD – Originally the Toronto Molecular Dynamics machine; now refers to the
exclusively FPGA-based HPC platform developed at the University of Toronto.
• VLSI – Very Large Scale Integrated Circuit
• XPS – Xilinx Platform Studio. Xilinx’s embedded processor system design
tool.
• XST – Xilinx Synthesis Technology. Xilinx’s synthesis tool.
• XUP – Xilinx University Program
List of Figures
3.1 TMD platform architecture ([13]) . . . . . . . . . . . . . . . . . . . . 10
3.2 Network configuration for different node types ([13]) . . . . . . . . . . 11
3.3 TMD design flow ([13]) . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.4 PICO design flow ([15], p.5) . . . . . . . . . . . . . . . . . . . . . . . 14
4.1 The AUTO tool flow . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.2 Stream operations required to implement MPI behaviour . . . . . . . 17
4.3 TMD system testbed . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.1 A simple two-node TMD implementation of a Jacobi heat-equation solver 32
5.2 Main loop execution time per element with different iteration lengths 34
A.1 Design of the control block . . . . . . . . . . . . . . . . . . . . . . . 38
A.2 State transition diagram of the control block FSM . . . . . . . . . . 40
A.3 The PICO stream interface . . . . . . . . . . . . . . . . . . . . . . . . 41
A.4 The FSL bus interface . . . . . . . . . . . . . . . . . . . . . . . . . . 41
List of Tables
4.1 Implemented families of MPI functions . . . . . . . . . . . . . . . . . 18
5.1 Normalized computing power of the reference and test systems . . . . 35
A.1 I/O ports exported by the control block . . . . . . . . . . . . . . . . 38
A.2 Raw control ports on the PPA module . . . . . . . . . . . . . . . . . 39
Chapter 1
Introduction
Much of today’s scientific research relies heavily on numerical computations and
demands high performance. Computational fluid dynamics, molecular simulation,
finite-element structural analysis and financial trading algorithms are examples of
computation-intensive applications that would not have been possible without the
advances in computing infrastructure. Since the 1960s, generations of supercom-
puters have been built to address the growing needs of the scientific community for
more computing power. With the improved performance and availability of micro-
processors, clusters of conventional CPUs connected in a network using commercially
available interconnects became the dominant architecture used to build modern su-
percomputers. As of November 2007, 409 of the top 500 supercomputers were cluster
based [18].
However, as computing throughput requirements of new applications continue
to increase, supercomputers based on clusters of generic CPUs become increasingly
limited by power budgets and escalating costs and cannot scale further to keep up with
the demand. As a result, specialized hardware accelerators have become popular. In recent
years there has been significant development in both GPU-based and FPGA-based
computing models. While GPUs have demonstrated remarkable performance improvements
in highly data-parallel stream-based applications [1], FPGAs, with the flexibility they
offer, are good candidates for specialized hardware acceleration systems.
In order to leverage FPGAs in high-performance computing systems, hardware
accelerators need to be built from software specifications. The primary challenge is
that hardware design is intricate and software developers typically do not have
the expertise to design high-performance hardware. On the other hand, having
both software and hardware designers working on the same project is costly and
inefficient. As a result, hardware acceleration has not been adopted widely
among software developers. A tool flow that allows software designers to easily
harness the power of hardware acceleration is hence essential to make hardware
acceleration feasible in non-high-end applications.
To address this need, we show in this project an automated tool flow, AUTO,
that generates a hardware accelerator from a C program directly. This work builds on
previous work on TMD, which is a scalable multi-FPGA high-performance computing
system [11] that consists of a collection of computing nodes, where each node can be a
soft processor or a hardware engine. The TMD uses the TMD-MPI message-passing
programming model [12] for inter-node communication. The AUTO flow takes an
MPI program written in C as input and produces a hardware computing node that
can be used directly in the TMD system.
As a proof-of-concept prototype, the main objective of this project is to explore
the use of algorithmic synthesis to target an FPGA-based system, with a
focus on the feasibility of an automated tool flow. A Jacobi heat-equation solver is
implemented on TMD as an example application to demonstrate the functionality
of the AUTO flow. With little designer intervention, we are able to automatically
generate a functional hardware block that performs better than the soft processor
node it replaces. Our eventual goal is to completely automate the design flow that
generates a system of hardware accelerators from a parallel program as opposed to a
single hardware computing node at a time.
The rest of the report is organized as follows. Chapter 2 reviews existing research
work in FPGA-based computing and algorithmic synthesis, which provides context
to our work. Chapter 3 describes the TMD platform, the TMD-MPI design flow
and AUTO’s role in it. Chapter 4 explains the implementation of the AUTO flow.
The limitations of the implementation are also outlined. In Chapter 5, a sample
application is presented with some performance results. Finally, in Chapter 6, we
summarize our findings and give suggestions for future work.
Chapter 2
Related Work
Recent research has shown that FPGA-based high-performance computing models
have the potential to speed up certain computing tasks significantly using application-
specific hardware acceleration. The disadvantage is that this sacrifices the generality
offered by CPUs. This is remedied by the reconfigurability of FPGAs, which allows
them to be reprogrammed for different computing tasks. Consequently, the success
of FPGA-based systems hinges on an efficient underlying computing infrastructure
and a flexible design flow, the latter of which the AUTO flow addresses. This section
presents an overview of existing research in the areas of FPGA-based computing and
algorithmic synthesis, providing context for our work on the AUTO flow.
2.1 FPGA-Based Computing
With power consumption becoming an increasingly critical design constraint for high-
performance computing systems, many vendors of traditional cluster-based systems
have started to incorporate hardware acceleration using FPGAs. Examples include
the Cray XD1 [5] and the HP ProLiant DL145 servers using Celoxica’s RCHTX FPGA
acceleration boards [3]. These systems use FPGAs as coprocessors to exploit fine-
grained parallelism in algorithms to improve the overall performance. The fork-join
control flow is most natural to these systems due to the master-slave relationship
between the processor and the FPGA. The master processor is responsible for the
coordination and synchronization among the computing slaves. It executes instructions
sequentially, and opportunistically farms out computation-intensive tasks to hard-
ware accelerators on the FPGA coprocessors. Since the inherently parallel hardware
structures on the FPGA are controlled by the sequential processor, maximizing ef-
ficiency to amortize the overhead of transferring data and synchronization requires
significant effort from both hardware and software designers.
Starting from the 90 nm process node, FPGAs have been built with high enough
density and speed to make them possible contenders for high-performance computing
platforms. The BEE/BEE2 system [4], the TMD [11] and the system presented in [2] are
examples of high-performance computing platforms built exclusively on FPGA-based
technologies. Some of these systems use only application-specific hardware modules;
others use soft processors that are embedded in the FPGA fabric and provide low-
latency communication channels between the embedded processors and the hardware
computing nodes. The latter heterogeneous architecture enables more efficient use
of on-chip resources to improve the throughput-to-area ratio under a given power
constraint for specific applications. The advantage of using FPGAs to build high-
performance computing platforms is that the application-specific portions can be
easily reconfigured to suit the needs of a variety of applications.
2.2 The TMD-MPI Approach
The Toronto Molecular Dynamics (TMD) system is a scalable multi-FPGA high-
performance computing platform developed at the University of Toronto. The original
motivation for the system was to address the increasing demand for computing power
in molecular dynamics simulation. However, as the platform has developed, it is no
longer limited to molecular dynamics. Today it is a testbed for FPGA-based high-
performance computing systems and design flows for such systems.
As mentioned earlier, the TMD is a heterogeneous system that consists of computing
nodes, which can be soft processors or application-specific hardware modules
(hardware engines). A TMD system uses a distributed-memory architecture where
each computing node has its own memory and address space. This architecture is sim-
ple and scalable for highly parallel systems since it does not need to consider memory
coherency issues or memory bus congestion, which would be critical for a shared-
memory system. For distributed-memory systems, message passing has been proven
to be an efficient programming model. The de facto message passing API used in the
high-performance computing community is the Message Passing Interface (MPI) [10].
The MPI API provides a generic platform-independent interface by specifying only
the functionality and syntax of the interface. The actual implementation depends
completely on the host platform. MPICH is a popular C implementation of MPI for
computer clusters using Linux or Windows [6]. The TMD-MPI is a lightweight subset
of MPI designed for embedded systems on the TMD. It contains two components: a
software library for use with soft processors, and the Message-Passing Engine (MPE),
which is a hardware implementation of the MPI API that can be used with hardware
engines [12]. The software component does not require an operating system and has
a very small memory footprint. With TMD-MPI, C programs written using MPI can
be ported to embedded processors in a TMD system with minimal modification [13].
TMD-MPI provides an abstraction of inter-node communication so that software
developers do not need to be aware of the details of the communication infrastructure.
For the TMD, the underlying network is realized using point-to-point unidirectional
links implemented as FIFOs. Each hardware node is connected to a dedicated MPE,
which in turn connects to the rest of the network. Soft processors can connect directly
to the network or through an MPE; both are supported in the TMD-MPI implemen-
tation. When the MPE is used, the type of the computing node behind the MPE
is hidden from the rest of the system. The benefit of such a modular network setup
is twofold: multi-threaded programs developed for a computer cluster can be easily
ported to the TMD by instantiating a soft processor for each thread of the cluster,
and soft processors hosting computation-intensive programs can then be identified
and replaced by hardware engines without the rest of the system noticing. If the
generation of the hardware engines can be automated, this TMD architecture will enable
software developers to easily leverage hardware acceleration. This is the motivation
behind the AUTO flow.
2.3 Behavioral Synthesis
Designing a hardware accelerator from a software description often requires hardware
designers to work closely with software designers in order to arrive at an efficient
design. The engineering effort required often deters software developers from using
hardware acceleration. In order to make FPGA-based high-performance computing
systems more accessible to software developers, the conversion of software into hard-
ware needs to be automated. This is the field of behavioral synthesis, sometimes
also known as high-level synthesis. It refers to the generation of a logic circuit, often
in the form of a Hardware Description Language (HDL), such as Verilog or VHDL,
from high-level functional descriptions of the desired system. Behavioral synthesis is
not a new problem. Exponential growth of the number of transistors in integrated
circuits has led to increased complexity of VLSI systems and the engineering effort
required to design them. As a result, a great deal of research effort has been spent
in the past three decades on developing high-performing and robust CAD tools that
will create the logic circuits based on a description of the desired functionality of the
circuit, which is often specified in a high-level software language such as C/C++ or
Matlab.
The challenge in behavioral synthesis comes from the inherent difference in the
software and hardware design paradigms. A software developer is more familiar with
a data-centric view, where a program is seen as a sequence of tasks performed on
a set of data. On the other hand, the hardware designer uses a time-centric view
and thinks about the hardware resources used in each clock cycle [16]. The data-
centric view does not easily expose the parallelism contained in the algorithm. A
behavioral synthesis tool needs to understand the data-centric view described by the
software, and then schedule the operations, allocate hardware resources, and generate
the necessary control logic to provide the functionality of the software, while exploiting
concurrency in the algorithm. This process involves many optimization decisions and
trade-offs that cannot be easily automated.
Recently there have been some commercial realizations of behavioral synthesis
tools for applications in specific domains. Impulse C [7], Handel C [3], Catapult C
[9] and PICO [17] are a few examples. Impulse C is a subset of ANSI C that can be
given to an Impulse C compiler to generate HDL output. It allows software to be
partitioned into software processes and hardware processes. A C-compatible library
is supplied to support a parallel, stream-based programming model. The library
functions are used to facilitate stream operations, such as open, close, read, and write,
as well as communication of control messages between the software and hardware
processes. The compiler generates hardware from Impulse C library functions and
other C statements. Handel C is very similar to Impulse C. It also has library functions
to support floating-point operations. Catapult C is based on ANSI C++ instead of
C. All three perform best with data-oriented stream-based parallel applications.
By making use of non-standard extensions to specify parallelism, the compiler can
be better guided to produce more efficient hardware. However, the disadvantage is
that programs that are ported to one of these languages can no longer be compiled
or debugged using standard C compilers or debuggers.
The Synfora PICO (Program-in Chip-out) is slightly different from the three other
tools mentioned above. Instead of non-standard extensions, it uses pragmas to specify
parameters related to hardware generation. Because a standard C compiler ignores
these pragmas, it can still compile a C program prepared for PICO synthesis. The minimal
deviation between the original C version and the PICO-compliant version of the
software makes PICO a suitable tool for this project.
Chapter 3
System Setup
The AUTO flow has been developed to address the need for an automated CAD flow to
convert a soft-processor computing node in a TMD system into an equivalent hardware
computing node. This section describes the architecture of the TMD testbed used
for this project and the design flow, which incorporates AUTO.
3.1 TMD Platform Architecture
The TMD system can contain multiple FPGAs connected in a 3-tier network as
shown in Figure 3.1. Tier 1 is the on-chip network used for intra-FPGA commu-
nication among computing nodes located on a single FPGA. This network is based
on unidirectional point-to-point links, which are implemented using Xilinx FSLs [21].
Each FSL is a 32-bit wide, 16-word deep FIFO. Network interface blocks (NetIf)
are used in the tier-1 network to route packets from source to destination through
a multi-hop path based on a unique ID of each node, called the rank. The routing
table is contained in the NetIf. Each FPGA also contains several gateway nodes, one
for each neighbouring FPGA. The NetIfs forward packets whose destination ranks
reside outside of the current FPGA to the appropriate gateway node. The tier-2 network
consists of several FPGAs in a cluster placed on the same printed circuit board.
High-speed serial I/O links, such as the RocketIO MGT (Multi-Gigabit Transceiver)
are used for inter-FPGA communication. There are no network routing components
Figure 3.1: TMD platform architecture ([13])
in this tier. The tier-3 network facilitates communication between FPGA clusters
and can be implemented using standardized high-speed switches such as InfiniBand
switches [8]. This 3-tier system masks the implementation details of the network and
provides a uniform view to individual computing nodes. It also allows the TMD to
be easily scaled up to meet the demand of the application.
Because of the network abstraction provided by the 3-tier system, a simple system
with only the tier 1 network can be used as the testbed for the development of
the AUTO flow without loss of generality. A Xilinx University Program (XUP)
Development Board [20] with a single Virtex-II Pro XCV2P30 is used to implement
the TMD testbed for this project.
There are two types of computing nodes that are of interest to the AUTO flow:
processor nodes, which are implemented using the Xilinx MicroBlaze soft processor
core, and hardware engine nodes, which can be either hand designed or generated by
the AUTO flow. The network configuration of each type is shown in Figure 3.2. The
TMD-MPI library implementation allows a MicroBlaze to be connected to the NetIf
directly, or through an MPE. A hardware engine always uses an MPE. When an MPE
is used, two sets of FSLs connect the computing node to the MPE, where one set
carries the MPE command traffic and the other carries the MPE data traffic. There
are two individual FSLs within each set; one for the inbound traffic and the other
Figure 3.2: Network configuration for different node types ([13])
for the outbound traffic. The NetIfs are connected to each other in a partial-mesh
topology, where two FSLs, one for each direction, are used for each pair of NetIfs that
are connected.
With this setup, an MPI program written in C can be ported to a soft processor
node using the TMD-MPI library. The network abstraction allows this node to be re-
placed by a functionally equivalent hardware engine to improve performance without
any impact on the rest of the system. The objective of the AUTO flow is to automate
this conversion process. The AUTO flow analyzes the C program on the MicroBlaze
to prepare it for C-to-HDL conversion using Synfora’s PICO algorithmic conversion
tool. When the hardware is generated, the AUTO flow packages it into a peripheral
that can be used as a hardware engine node directly in the TMD system.
3.2 Design Flow
A four-stage design flow is proposed for the TMD system in [13]. This is illustrated in
Figure 3.3. In stage 1, the user prototypes the application in C on a workstation. In
stage 2, the application is parallelized using a well-known MPI distribution, such as
MPICH, and tested on a cluster of workstations. In stage 3, a TMD system is created
by mapping each MPI process in the parallel version of the application from stage 2
to a soft processor node. The TMD-MPI library is used instead of MPICH starting
Figure 3.3: TMD design flow ([13])
from this step. Because both TMD-MPI and MPICH implement the same API, the
porting process requires minimal changes to the original C source code. In stage 4,
the soft processor nodes that are executing the most computation-intensive portion
of the application are identified for hardware acceleration and subsequently replaced
by faster hardware blocks. The decision of which soft processors to replace requires
detailed profiling of the system performance and an understanding of the tradeoffs
between different system resources. Hence it will remain the job of the system designer
for now. However, the laborious process of designing hardware from a functional
description in software will be automated by AUTO.
The TMD system in stages 3 and 4 is designed using the Xilinx EDK/ISE 9.1 suite
of tools. Design entry can be done manually using the EDK GUI for simple systems.
However, for large systems, this manual process is error-prone. An automated tool
such as the System Generator [14] is recommended. All hardware engines, including
the MPE and the NetIf, are available as custom peripherals that can be imported into
EDK. The hardware components of the system are specified by the Microprocessor
Hardware Specification (MHS) file while the software components are specified by the
Microprocessor Software Specification (MSS) file. The system is completely defined
by the MHS and MSS, and is implemented using the Xilinx tool flow. After this,
a custom script compiles the software using a specified version of TMD-MPI and
initializes the routing tables in the NetIfs. The output is the final bitstream that is
used to program the FPGA.
Synfora’s PICO Express design suite is used to generate hardware from C pro-
grams. The PICO flow is explained in the next section.
3.3 C-to-HDL Using PICO
The PICO C-to-HDL flow is illustrated in Figure 3.4. The input to PICO Express is
a C program that contains the module to be converted to hardware, which is App.c in
the figure, and driver code that calls the target procedure. Both App.c and the driver
code can be written in ANSI C, with some restrictions; for example, no floating-point
numbers or pointers are supported. A complete list can be found in [16].
At the start of the flow, PICO performs the Golden Simulation, where the input C
source code is compiled with the driver code and executed on the workstation on which
PICO is running. The outputs from the golden simulation are checked against reference
outputs provided by the user. This step is useful if the original C program was modified
to make it PICO-compliant: checking the modified program against the user-supplied
reference inputs and outputs guards against bugs introduced in that process.
If the golden simulation results agree with the reference, the golden results are saved
for future reference. After the golden simulation, PICO proceeds through several
steps, transforming the source code into the final output HDL. Simulation is run at
each step and results are checked against the golden results to ensure correctness.
In order to incorporate PICO into the TMD design flow, several additional pro-
cessing steps and components are needed. These are provided in the AUTO flow, and
described in the next chapter.
Chapter 4
Implementation of the Tool Flow
This section describes the implementation of the AUTO flow. The objective of the
AUTO flow is to take a soft processor in the TMD system and generate a functionally
equivalent hardware engine. Efficient conversion from software to hardware is difficult
and has traditionally been done manually. The AUTO flow automates this process
by providing a set of tools, a library, and hardware components.
4.1 Flow Overview
The AUTO flow contains three components: an MPI implementation that is compli-
ant with PICO (PICO MPI), a lightweight hardware control block, and the scripts
that perform the operations involved in the flow. The actual flow consists of three
steps as illustrated in Figure 4.1.
Step 1 is the preprocessing of the user source file, which contains the top-level
function to be converted to hardware. This should be the main function from the
soft processor that is to be replaced by the hardware engine. The output of step 1
includes a PICO source file, the driver code, and a flow script. The PICO source
file is created based on the user input file by adding some pragma settings and the
PICO MPI library implementation. The driver code is the testing code. It contains
functions that redirect the stream I/O to file I/O so that the module in the PICO
source file can be tested as a standard C program.
Figure 4.1: The AUTO tool flow
In step 2, PICO Express uses the flow script and files from step 1 to generate the
PPA core hardware. An iterative
method may be used in this step to explore the design space for best performance.
Verilog files that describe the generated hardware core are produced at the end of
step 2. In step 3, the control block is custom fitted to the generated core. The
resultant Verilog files are then packaged into a custom peripheral core that can be
used in the EDK flow. Since step 2 is not guaranteed to succeed on the first pass,
the user may need to manually adjust performance parameters when running step 2
iteratively. As a result, these three steps are not integrated into a single push-button
flow. However, all transformations done in step 1 and step 3 are encapsulated into
two scripts, which are provided as part of the AUTO flow. The next few sections
describe each component of the AUTO flow.
4.2 MPI Library Implementation
As described in Section 3.1, the MPI API is supported in the TMD system through
the TMD-MPI library and the MPE, where the underlying communication network
is built on FSLs. PICO does not support FSL directly as a native interface for the
hardware it generates. Instead, it provides a generic FIFO stream interface. Given
an MPI program that runs on a MicroBlaze node, we need a way to tell PICO to
16
interpret calls to MPI interface functions in the source code as a set of operations on
the streams. Therefore, a PICO-compliant MPI implementation is needed. Currently,
only MPI Send and MPI Recv are implemented in this PICO MPI library. However,
all other MPI functions build on these two functions and can be easily added to
the PICO MPI implementation. Figure 4.2 shows the stream operations involved
in the MPI Recv and MPI Send operations, where cmd in is the input command
stream, cmd out is the output command stream, data in is the input data stream,
and data out is the output data stream. The numbers in the parentheses indicate the
ordering of the stream operations.
(a) MPI Send (b) MPI Recv
Figure 4.2: Stream operations required to implement MPI behaviour
Due to the restrictions on input C code imposed by PICO, two modifications to
the MPI API had to be made. First, the concept of a family of MPI functions is
introduced. A family of MPI functions refers to a collection of MPI functions that
provide the same functionality, but operate on different data arrangements. This is
needed because PICO does not support pointers. In a C implementation of the MPI
API, the function prototypes for MPI Recv and MPI Send look like the following:
int MPI_Send (void *buffer, int count, MPI_Datatype type, int dest, ...);
int MPI_Recv (void *buffer, int count, MPI_Datatype type, int source, ...);
The generic pointer buffer specifies the starting location in memory. Together
with count, they specify a vector. In general, any memory within the address space
17
of the calling program is acceptable. For example, buffer may point to the middle
of a 2D array. However, PICO does not support pointer access to memory. Hence
alternatives are needed.
There are two approaches to accommodate the usage of MPI functions on different
data arrangements. One is to introduce another wrapper layer on top of the basic
PICO MPI implementation. For operation on 1D arrays, the default PICO MPI
implementation is used. For other data arrangements, the operation is performed
on a temporary buffer, and the values copied between the temporary buffer and the
actual data location. Clearly this is slow as every MPI operation on a buffer of size
N would require 2*N memory accesses. Alternatively, a different “flavour” can be
designed to target a particular data arrangement for each MPI function. All variants
of the same MPI function form a family. The three most common data arrangements
used in software are scalars, 1-dimensional vectors, and 2-dimensional arrays. The
PICO MPI implementation includes variants of MPI functions that target each of
these data arrangements, as shown in Table 4.1.
Table 4.1: Implemented families of MPI functions
MPI Function Family OperandMPI OP (int buf[], int count, ...) 1D vectorMPI OP Scalar (int *buf, int count, ...) A single scalar variableMPI OP2D (int buf[][maxc], int row, int count, ...) A row in a 2D array* OP can be Send or Recv.
The second modification to the MPI API is particular to MPI Recv. In the API
specification, MPI Recv takes in an MPI Status data structure as an argument and
populates it with information related to the receive operation. In C distributions
of MPI, MPI Status is implemented as a struct, which is not supported in PICO.
Since the most often queried fields of MPI Status are the source rank and tag, as a
workaround, the function prototypes of the MPI Recv family of functions are modified
to take in two int references instead of an MPI Status to pass back the source rank
and tag information.
Even though the PICO MPI implementation is designed for hardware synthesis,
18
it still follows the ANSI C standard. The AUTO flow generates the driver code that
redirects the stream I/O to file I/O when the target program is compiled and executed
as a standard C program. The user is still responsible for providing the input file
and verifying the correctness of the output file. The generation of test stimuli and
reference output is described in Section 4.5.
4.3 Control Block
After PICO generates the hardware core (referred to hereafter as the PPA, which
stands for Pipeline of Processing Arrays and is the PICO lingo for the hardware
core), a control block is needed to initialize and start the PPA on system startup,
and to translate the PICO stream interface to the FSL interface used to communicate
with the MPE. A simple-finite-state-machine-based hardware controller is designed
to provide such functionality. The control block module is a wrapper around the
PPA hardware. It operates on the system clock and reset, and provides four FSL
bus interfaces that can be connected to the corresponding FSL bus interfaces on the
MPE. On system startup, the rank 0 node in the TMD system sends rank information
to all hardware nodes in the system. The control block receives the rank from the
network, initializes the PPA, and enables the translation between the FSL interfaces
and the PICO stream interfaces. All activities on the FSL interfaces are translated
into the corresponding control and data signals on the stream interface of the PPA
automatically. The control block sits idle until the PPA completes its current task.
Then it restarts the PPA to process the next task. A more detailed discussion on the
operations of the control block can be found in Appendix A.
4.4 Scripts
Two scripts are used to encapsulate the operations in the AUTO flow. They are
described in this section.
19
4.4.1 Preprocessing Script
The preprocessing script (auto pico.pl) processes the user C program and generates
the files that are needed for the PICO Express flow. The preprocessing script is
invoked using the following command:
auto_pico.pl <src_C_file> <mpi_rank> <mpi_size> <transcript_file>
[Optional: <mem_option_file>]
Here src C file is the user C program. The mpi rank and mpi size parameters are
the rank of the hardware node to be generated and the size of the TMD systems.
These two parameters are for simulation purposes only. When the hardware is im-
plemented in the TMD system, they will be supplied as part of system initialization
procedure. The transcript file contains the expected input and output on the FSL
interfaces of the hardware block for a test application. This is used by the AUTO
flow to generate input stimuli and reference outputs. Section 4.5 describes how this
transcript can be obtained. The mem option file provides optional information about
how the arrays in the program should be mapped to hardware. It is optional. More
details on this are given later in this section.
There are five tasks involved in the preprocessing step. They are listed below.
The first two steps result in a new C file that is based on the user input file, with
added pragma settings and the PICO MPI implementations.
1. Parse through the user program and add #pragma settings
2. Attach PICO MPI implementations
3. Generate the driver code, input and reference output files
4. Generate Makefile
5. Generate run script for PICO
The input to the preprocessing script is a C program that contains one top level
function to be converted into PPA hardware, and any number of helper functions, all
20
contained in a single C file. The input program has to follow the constraints imposed
by PICO. The preprocessing script does not check for these. However, if there is a
violation, the PICO Express flow will error out when it is run in step 2. The user
can fix the source program and restart from this step. The first task in this step is to
parse out all arrays declarations and add the following pragma settings immediately
after the line in which the array is declared:
#pragma bitsize <size> <array_name>
#pragma host_access <array_name> none
The first line indicates the word size when the array is mapped to a memory. The
second line tells PICO not to expose any access ports for this memory as external
ports on the PPA because all communication with other modules in the system for a
TMD hardware engine is through the stream interfaces.
Another setting associated with arrays is the type of memory that the array maps
to. PICO supports three types of memories: internal fast, internal block RAM,
and user supplied. Internal fast memories are registers-based. Internal block RAM,
as its name suggests, is a block RAM instantiated within the PPA hardware. User
supplied memories are placed outside the PPA. PICO will generate a standard SRAM
interface to interact with the external memory. The user is responsible for supplying
the external SRAM. To explicitly map an array to a particular type of memory, the
following pragma setting can be used:
#pragma (internal_fast|internal_blockram|user_supplied_fpag) <array_name>
When not specified, PICO chooses the memory type based on the size of the array.
However, the user can supplied a memory option file to tell the preprocessing script
to set the type explicitly. A memory option file may contain lines that look like the
following, where a # at the beginning of a line indicates a comment:
# Specifying the memory type for the following arrays
array_1 blockram
array_2 fast
array_3 user_supplied
21
After the annotation of the array, the PICO MPI library implementations are
attached to the user code. Recall from Section 4.2 that each MPI function has a
different variant that targets a particular data arrangement. In the C program, all
MPI calls use the standard prototype. It is the preprocessing script’s job to replace
each call by the appropriate variant according to the type of the operand. This is
illustrated below. The same type of transformation will be done for MPI Send.
/* Original Program */
void ppa_function (void)
{
int buf[BUFSIZE];
int buf2[NUM_ROWS][NUM_COLS];
int buf3;
...
// (1) Operand is a 1D vector
MPI_Recv (buf, count, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
// (2) Operand is a 2D array
MPI_Recv (buf2[i], count, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
// (3) Operand is a scalar
MPI_Recv (&buf3, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
...
}
/* PICO-Ready Program */
void ppa_function (void)
{
int buf[BUFSIZE];
int buf2[maxr][maxc];
int buf3;
...
// (1) 1D variant of MPI_Recv
MPI_Recv (buf, count, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
// (2) 2D variant of MPI_Recv; row = i
MPI_Recv2D (buf2, i, count, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
// (3) Scalar variant of MPI_Recv
MPI_Recv_Scalar (&buf3, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
...
22
}
With all MPI calls updated, the PPA program is ready to be used with the PICO
Express flow. The preprocessing script then generates the driver code that is used
by PICO for simulation and testing purposes during various stages of the PICO flow.
The driver code is generic and the same template can be used for any PPA function.
A Makefile is also generated so that the driver and the PPA program can be compiled
as a standard C application and tested prior to submitting to the PICO Express flow.
The final item generated by the preprocessing script is a flow script (run.tcl) that
contains the commands to set up the PICO project and invoke the flow. The flow
scripts tells PICO to generate the PPA core in a hierarchical fashion. Each MPI
function is generated as a sub module and instantiated in the PPA module. This
hierarchical approach provides better modularity and area efficiency, as opposed to
inlining all the modules.
Finally, with all the files generated, the user can start step 2 using the following
command:
prompt> pico_extreme_fpga run.tcl
4.4.2 Packaging Script
The packaging script (pack pico.pl) takes the Verilog files generated in the PICO
Express flow and produces a custom peripheral core that can be directly imported
into Xilinx’s EDK. The packaging script is invoked using the following command:
pack_pico.pl <pico_experiment_directory> <name_of_peripheral>
Here pico experiment directory specifies where to retrieve the Verilog generated by
PICO. There are two steps involved in the packaging script: generating the control
logic and packaging the HDL into an EDK-importable peripheral. The control logic
is generated from a pre-designed template by setting a few parameters, including
the peripheral name and the PPA module name. Making a custom peripheral core
involves putting the source Verilog files in a specific directory structure, as shown
below:
23
mpi_jacobi_v1_00_a/
data/
mpi_jacobi_v2_1_0.mpd
mpi_jacobi_v2_1_0.pao
hdl/
verilog/
mpi_jacobi.v (Control logic)
mpi_jacobi_ppa.v (PPA top-level wrapper)
... (Other PPA files)
The packaging script creates the mpi jacobi v1 00 a/ directory for version 1.00.a
of a peripheral named mpi jacobi. This directory can be placed in any custom IP
repository to make it available use with the EDK tools. All source code files are
collected under the hdl/verilog/ sub-directory. The data/ sub-directory contains the
Microprocessor Peripheral Description file (.mpd), and the Peripheral Analysis Order
file (.pao). The MPD describes the interface exported by the peripheral as well as
the parameters that can be set in the MHS file of an embedded system. The PAO
contains the list of all source files required to build this peripheral. Both files are
generated automatically by the packaging script.
4.5 Test Generation
As mentioned earlier, in order to run the PICO flow, the user needs to provide input
stimuli for the input streams and reference output for the output streams. These can
be generated from the TMD system directly. Figure 4.3 shows a TMD prototyping
system where one of the two MicroBlaze processors is to be replaced by a hardware
engine. Since the replacement of the MicroBlaze by a hardware engine is transparent
to the rest of the system, the node behind the MPE can be viewed as a black box that
can be implemented either as a soft processor node or a hardware engine. Since all
communication with the black box is through the FSL interfaces, by snooping the FSL
interfaces and capturing the activities, both the input stimuli and the corresponding
reference output can be obtained. To do this, the TMD-MPI library is instrumented
24
Figure 4.3: TMD system testbed
so that it prints out the raw data sent to and received from the FSL interfaces when a
compiler flag is turned on. A transcript obtained this way is then given to the AUTO
flow to produce the input and reference output files that are used by PICO during
hardware generation.
4.6 Limitations
The AUTO flow was developed as a proof-of-concept prototype to demonstrate the
feasibility of automatically generating a hardware accelerator from software descrip-
tions. Consequently, there are still some limitations of the current version. Most of
the issues described in this section are due to limitations of the various tools inte-
grated into the AUTO flow. Many of them are likely to be resolved in the future.
The others are inherent to the shift in design strategies from a software paradigm to
a hardware paradigm. Hopefully, more sophisticated automation tool flows will help
bridge this gap.
25
4.6.1 Floating-Point Support
A major limitation of the current version of the PICO tool is the lack of support
for floating-point numbers. Currently, only fixed-point numbers (char, int, long) are
accepted. Floating points are more complicated to support, hence hardware-based
applications have traditionally used fixed-point-based approaches. On the other hand,
all modern microprocessors have floating-point support. It is rare to see software,
especially for scientific-computing purposes, that does not use floating-point numbers.
As a result of this limitation, the range of applications that can be converted to
hardware using the PICO flow is severely limited. In order to use the PICO flow as it
is now, the designer will have to rewrite the original software application using fixed-
point numbers. There algorithmic synthesis tools that are commercially available now
that support floating-point operations. An example is Handel C [3]. Perhaps in the
future PICO will also incorporate this feature.
4.6.2 Looping Structure
Using more than one loops inside another loop is not allowed. This limits the com-
plexity of the software that can be put through PICO. An alternative to this is to
turn each internal loop-nest into a Tightly Coupled Accelerator Block (TCAB). This
seems to be a cleaner way when working with large designs due to the nice hardware
design hierarchy provided by its usage; although programs with complicated looping
structures should probably be avoided because it makes timing performance analysis
harder.
4.6.3 Pointer Support
The pointer in C is a powerful concept that allows flexible access to memories. Many
C programs make extensive use of pointers. However, general pointer accesses are
currently not supported in PICO, possibly to avoid complex pointer-aliasing analyses.
Consequently, programs that use pointers need to be manually inspected to replace
those instances of pointer usage that are illegal in PICO with equivalent alternatives.
26
This process may involve rewriting portions of the code. Due to the dynamic nature
of such analysis, AUTO does not attempt to do this automatically. It is the user’s
responsibility to provide AUTO a source program that follow PICO’s guidelines in
terms of pointer usage.
4.6.4 Division Support
Division with an arbitrary divisor is not supported when using the Xilinx XST syn-
thesis tool. XST can only synthesize a divider when the operand is a power of 2.
However, PICO instantiates a general-purpose divider even when the divisor can be
determined to be a power of 2 at compile time. PICO recommends Synplify Pro to
be used to synthesize generic dividers. However, we do not have access to a license
of Synplify Pro, hence we cannot verify this recommendation. As a result of this
limitation, applications with general division is not supported by the AUTO flow.
A workaround exists for fixed point division when the divisor is a power of 2. This
is done by replacing the division with a bitwise right shift (>>) by the appropriate
number of bits.
4.6.5 Performance Specification
The AUTO flow currently does not optimize for performance. It is up to the user
to decide what the appropriate performance parameters should be. One reason is
that the user should be more familiar with the required performance constraints for
the target hardware. Secondly, for complex software, the PICO flow may need to be
invoked iteratively to find a good set of performance constraints that deliver the best
area-delay-power tradeoff. Performance constraint can be specified in terms of MITI
(Minimum Inter-Task Interval) to control the amount of task overlap, or II (Initiation
Interval) for a loop to affect how tightly loop iterations can be scheduled. When a
constraint is not specified, PICO tries aggressively to obtain a compact schedule.
Sometimes this schedule may be impossible to meet during synthesis. Therefore, an
iterative approach might be needed to manually find the sweet spot.
27
Another parameter that has shown to improve performance is the trip count for
loops, which is the expected number of iterations in the loop. When this is specified,
PICO can better optimize for both area and speed. However, the trip count cannot
be determined from static analysis of the code in general. Hence, the AUTO flow does
not set this option for loops. It is up to the developer to tune this option manually
if desired.
4.6.6 Hardware Debugging
In the current flow, once the software is converted to hardware, it is very difficult to
debug if the hardware is nonfunctional. There are two main causes of a malfunctioning
hardware module: bugs in the user C program and bugs in the PICO flow. For bugs
in the PICO flow, the user needs to use standard hardware debugging tools to track
down the problem. This requires knowledge of Verilog and hardware design and
debugging in general, which an average software developer may not possess. On the
other hand, if the bug is in the user program, standard C debuggers can be used. This
is likely to be the case in the long run, when the PICO flow is expected to be relatively
stable. In this case, it would be helpful if the live input and output of the hardware
module in its working environment can be captured and used to generate test stimuli
for the C program. This way, the software developer can test the program with real
hardware data in a C debugging environment, which he will be familiar with.
4.6.7 Exploitable Parallelism
The amount of parallelism that can be exploited by PICO is limited to that exposed
in the software. Badly designed software will result in inefficient hardware. Software
optimization, and automatic parallelism extraction is a separate research field. The
AUTO flow only prepares a piece of software for use in the PICO C-to-HDL flow.
It does not optimize the source code. The software designer hence should use dis-
cretion when writing the program. PICO provides a number of options to automate
some basic optimizations such as full loop unrolling and multi-buffering of memory
28
through the use of pragmas [16]. However, these automated options have limited ap-
plicability, and cannot replace intelligent code design by the programmer. That said,
not all software optimization techniques apply when the final goal is hardware. For
example, any cache-related optimizations such as loop-reordering are unlikely to see
performance upside in the final hardware. Therefore, the designer should also have
some high-level understanding of the hardware architecture that PICO produces in
order to get the best result.
29
Chapter 5
The Heat-Equation Application
This section presents an example application that is built to demonstrate the function-
ality of the AUTO flow. The application chosen is a heat-equation solver. The heat
equation is a partial differential equation that describes the temperature variation in
a given region over time, given the initial temperature distribution and boundary con-
ditions. The thermal distribution is determined by the Laplace equation 5(x, y) = 0.
The solution to this equation can be found by the Jacobi iterations method [22],
which is a numerical method to solve a system of linear equations. This application
was chosen because a TMD system has been previously implemented for this method
[13], so a working MPI program already exists and can be used in the AUTO flow
directly.
The main objective of this project is not to produce a high-performance hardware
accelerator for the heat-equation application. Rather, the goal is to demonstrate the
feasibility of an automated flow. Therefore, it is expected that the resultant hard-
ware may not deliver the best performance in comparison to hand-designed hardware
modules.
5.1 Implementation
Because floating point is not supported in PICO, in this implementation of the heat-
equation solver, the temperatures are represented as fixed-point numbers. This is
30
acceptable because the precision of the computation is not important for this project,
as the objective is not to build a high-performance heat-equation solver. We will
accept the hardware as long as it produces the same result as the software running
on a soft processor.
In the original TMD implementation of the heat-equation application, nine com-
puting nodes were used in the system. For the simplicity, only two computing nodes
are used for the test application in this project. Rank 0 is a MicroBlaze and is the
root node. It generates the initial temperature map and sends the working data to
rank 1. Rank 1 is a computing node. The two nodes solve the heat equation together.
In the reference system, rank 1 is a MicroBlaze running the software implementation
of the Jacobi iterative solver. Rank 1 in the test system is the hardware generated
from the software version using the AUTO flow.
The original software implementation of the Jacobi iterative solver contains a sin-
gle program that is run on all soft-processor computing nodes. Depending on the
rank of the computing node, which is defined during compile time through a compiler
directive, sections of the program are selectively exercised. All non-root computing
nodes perform essentially the same operations. In addition to the computation loop
performed by all computing nodes, the root node is also responsible for data initial-
ization and finalization, as well as synchronization at the end of each Jacobi iteration.
The goal for the example application is to generate a hardware implementation for
a non-root node. The first step in the implementation is to extract the parts of the
program that pertain only to non-root nodes. The resultant program is a concise
version of that running on the non-root nodes originally. This program is then passed
through the three stages of the AUTO flow, as described in Section 4.4.
After obtaining a peripheral from the AUTO flow, the reference system and test
system are implemented using the design flow described in Section 3.2. The two sys-
tems are implemented on a Xilinx Virtex II Pro FPGA. The performance is measured,
and the results are documented in the next section.
31
5.2 Experiment Methodology
A few simple experiments are conducted to compare the performance of the Jacobi
hardware engine produced by the AUTO flow to that of the reference software imple-
mentation. The setup of the reference and test systems are shown in Figure 5.1. The
two-node implementation is chosen for simplicity purposes. It can be easily scaled
up to include more computing nodes by duplicating the rank 1 MicroBlaze in the
reference system or the Jacobi hardware engine in the test system. Since each node
only communicates with two of its neighbours the individual behaviour of each node
is not affected with increased system size.
(a) Reference System
(b) Test System
Figure 5.1: A simple two-node TMD implementation of a Jacobi heat-equation solver
Two different programs are run on the rank 0 MicroBlaze to conduct two tests.
The first test verifies the correctness of the test system against the reference system.
The original TMD C implementation of the heat-equation solver is used on rank 0.
Recall that the program on the reference system rank 1, which is used to generate
the hardware, is originally extracted from the C implementation. In this setup, the
root node collects the results from the computing element after the system converges
and prints out the results to the UART, which is then captured on a PC connected
to the FPGA board through a serial link. A problem size of 40x40 is used. Each
node hence operates on a section of 20x40 elements. The output from both systems
are identical.
The second test measures the performance of the Jacobi hardware engine com-
32
pared against the reference system. In this case, rank 0 only performs data initializa-
tion, finalization and synchronization between the computing nodes. It does not run
the computation loop. In each iteration, rank 0 exchanges the boundary rows with
rank 1, and proceeds directly to the synchronization barrier and waits for rank 1 to
finish its computation loop. The number of cycles from the end of the row exchange
to the end of the iteration synchronization is measured for each iteration and accu-
mulated over the entire program. The cycles-per-element (CPE) is the total number
of cycles divided by the problem size (20 × 40 × Niterations). Normally, the Jacobi
iteration algorithm stops when the results converge. In this test, in order to emulate
a different problem size (i.e. total number of elements to process), the program on
rank 0 is changed slightly so the total number of Jacobi iterations to run before stop-
ping the computing can be controlled. Using this hack, the CPE of the reference and
test systems are measured for a few cases with different number of Jacobi iterations
executed in each. The results are documented in the next section.
5.3 Results
Both the test and reference system are implemented using Xilinx XPS 9.1. The
reference system uses a 100MHz clock, the maximum clock frequency provided on
the Xilinx University Program Development Board [20]. The test system uses a
50MHz clock. This is because the highest speed that could be achieved during the
hardware generation in the PICO flow was 80MHz, and the Digital Clock Manager
(DCM) block in the embedded system does not allow fractional division. So the
only available clock frequencies below 80MHz are 50MHz and 66MHz, and 50MHz is
chosen for convenience. The designs are tested on a Xilinx Virtex-II Pro XC2VP30
FPGA and the performance is measured.
A fair comparison between the hardware Jacobi solver and the MicroBlaze solu-
tion is the number of elements computed per second per LUT. The area cost of the
hardware Jacobi solver was obtained by synthesizing it alone using ISE 9.1. This
includes a a small overhead of about 9 LUTs introduced by the control block, which
33
is negligible for most designs. Since the hardware solver provides equivalent function-
ality to a complete MicroBlaze system, which includes a MicroBlaze soft processor,
2 Local Memory Busses (LMB), 2 LMB BRAM controllers, and 1 BRAM block, the
equivalent area cost is that of such a system. This is obtained by implementing a
single MicroBlaze system containing the above components using the EDK flow. The
total LUT count in each case is obtained from the Xilinx Mapping Report (.mrp).
Figure 5.2 shows the raw CPE measurements for the reference and test systems.
The high CPE observed in the reference system for short-running experiments that
perform a small number of iterations may be due to the initialization overhead in
the MicroBlaze that is not fully amortized. The performance of the two systems are
compared using the asymptotic CPE when the number of iterations is large. The
normalized computing power in terms of number of elements processed per second
per LUT is shown in Table 5.1.
Figure 5.2: Main loop execution time per element with different iteration lengths
The hardware Jacobi solver is actually 22% slower than the soft processor im-
plementation when running at 50MHz. However, theoretically the hardware Jacobi
solver can operate at 80MHz. This is shown as the third column in Table 5.1, in
which case we get 1.25x runtime improvement over the reference system. It is clear
that speedup like this is not enough for a hardware acceleration system. However, we
have achieved the objective of developing a working flow. Going forward, more effort
34
Table 5.1: Normalized computing power of the reference and test systems
MicroBlaze HW Jacobi HW Jacobi (Speculated)Clock Frequency 100 MHz 50 MHz 80 MHzAsymptotic CPE 41 17 17Total LUTs 2771 4267 4267Nelements per sec per LUT 880 689 1103Speed-Up 1x 0.78x 1.25x
will be spent in the optimization of the AUTO flow to provide more performance
improvement.
35
Chapter 6
Conclusion and Future Work
In this project, we have presented an automated flow that generates hardware com-
puting nodes from C directly that can be used in a TMD system directly. As a
result, soft processor nodes in a TMD system can be easily converted and replaced by
functionally equivalent hardware engines to achieve better performance. A working
hardware Jacobi heat equation solver is produced as an example to demonstrate the
feasibility of the AUTO flow. With little designer intervention, the hardware Jacobi
solver is generated automatically and shows some performance improvement over the
software implementation on a soft processor. The AUTO flow shows that FPGA-
based high-performance computing platforms such as the TMD can be made more
accessible to software developers who are unfamiliar with hardware design. We are
now one step closer towards the ultimate goal of a completely automated flow that
converts a parallel program into a complete TMD system.
Nevertheless, more work still needs to be done to address the limitations of the
current AUTO flow, which include the lack of support for the true MPI API and
the requirement for running the iterative flow through PICO manually. Hence, the
next step is to complete these features and focus on robustness and performance
optimization. The AUTO flow will be used with a wider range of applications to test
its robustness. Further optimization of the flow will be done to provide more performance
advantages for the hardware engines.
Appendix A
Hardware Controller for PICO
PPA
To integrate the hardware module generated by PICO into the TMD system, a
lightweight hardware control block is used. The control block has two responsibilities:
it generates the control signals that initialize and start the PPA module upon
system startup, using a finite-state machine (FSM), and it translates the PICO
stream interface to the FSL interface through combinational logic.
Figure A.1 shows the design of the control block. Table A.1 lists the I/O ports
exported by the control block, which include the clock, the reset, and four sets of FSL
ports for the four FSL bus interfaces. These ports are visible to the rest of the TMD
system.
A.1 Control FSM
A finite state machine (FSM) generates the control signals for the PPA module.
Table A.2 lists the raw control ports of the PPA that must be driven by the
control block, which supplies the input ports with the appropriate values to
initialize and start the PPA.
Figure A.1: Design of the control block

Table A.1: I/O ports exported by the control block

    Signal                Direction (I/O)  Bus Width  Bus Interface
    clk                   I
    rst                   I
    To_mpe_cmd_data       O                [31:0]     FSL: To_mpe_cmd
    To_mpe_cmd_ctrl       O
    To_mpe_cmd_write      O
    To_mpe_cmd_full       I
    From_mpe_cmd_data     I                [31:0]     FSL: From_mpe_cmd
    From_mpe_cmd_ctrl     I
    From_mpe_cmd_exists   I
    From_mpe_cmd_read     O
    To_mpe_data_data      O                [31:0]     FSL: To_mpe_data
    To_mpe_data_ctrl      O
    To_mpe_data_write     O
    To_mpe_data_full      I
    From_mpe_data_data    I                [31:0]     FSL: From_mpe_data
    From_mpe_data_ctrl    I
    From_mpe_data_exists  I
    From_mpe_data_read    O

Table A.2: Raw control ports on the PPA module

    Signal                 Direction (I/O)  Bus Width
    clk                    I
    reset                  I
    enable                 I
    start_task_init        I
    start_task_final       I
    clear_init_done        I
    clear_task_done        I
    psw_task_done          O
    rawdatain_self_rank_0  I/O              [0:7]

Upon system reset, the FSM receives the rank of the current hardware node
from the MPE and stores it in an 8-bit word, which is connected to the
rawdatain_self_rank_0 port on the PPA. This port is generated when the original C
function declares and uses the global variable self_rank. The PPA is enabled after
the rank is obtained. The necessary sequence for each control signal is described in
Appendix C.3 of [15]. Figure A.2 shows the state transition diagram of the FSM.
When the FSM reaches the PPA_READY state, the translation between the PICO
stream interface and the FSL interface is enabled. The FSM remains idle in this state
until psw_task_done is raised high by the PPA, at which point the PPA is restarted
to wait for the next task.
A.2 Stream Interface Translation
The second task of the control block is to translate the PICO stream interface to the
FSL interface. A detailed description of the PICO stream interface can be found in
Appendix C.5 of [15]. Information on the FSL interface can be found in [19]. Figure
A.3 shows a simple illustration of the PICO input and output stream interfaces. In
both cases, the req signal is raised high by the PPA when it wants to interact with the
stream. The ready signal is an input to the PPA; a high ready signal indicates that the
other end of the stream is ready to receive or send data.

(a) Input stream (b) Output stream
Figure A.3: The PICO stream interface

The FSL interface is shown in Figure A.4. To read from an FSL, the data consumer
raises FSL_S_Read. If there is data in the FSL, the FSL_S_Exists signal will also be
high. In this case, the data word and control bit can be read through FSL_S_Data
and FSL_S_Control in the next clock cycle. Similarly, to write to an FSL, the data
producer raises FSL_M_Write and puts the data word and control bit on FSL_M_Data
and FSL_M_Control respectively. In the first clock cycle after FSL_M_Full becomes
low, the data is pushed into the FIFO.

(a) Read from an FSL (b) Write to an FSL
Figure A.4: The FSL bus interface
Comparing the operations of the PICO stream interface and the FSL bus
interface, it is not difficult to see the similarity in functionality between instream_req
and FSL_S_Read, instream_ready and FSL_S_Exists, outstream_req and FSL_M_Write,
and outstream_ready and the complement of FSL_M_Full. Since the PICO stream interface
does not contain a control bit, a 33-bit data bus is used, where the highest bit is
interpreted as the control bit in the FSL interface and the lower 32 bits as the
data word. The translation between the PICO stream interface and the FSL bus
interface is summarized in the Verilog excerpt shown in Program 1. This translation
relationship is used for all four PICO streams (cmd_in, cmd_out, data_in, data_out)
and the corresponding FSL interfaces.
Program 1 Code excerpt to illustrate the stream interface translations

    /* For outbound FSL */
    FSL_M_Data  = outstream_data[31:0];
    FSL_M_Ctrl  = outstream_data[32];
    FSL_M_Write = outstream_req & ~FSL_M_Full;
    outstream_ready = ~FSL_M_Full;

    /* For inbound FSL */
    instream_data  = {FSL_S_Control, FSL_S_Data};
    FSL_S_Read     = instream_req & FSL_S_Exists;
    instream_ready = FSL_S_Exists;
Appendix B
Using PICO: Tips and
Workarounds
This chapter describes some of the PICO-related workarounds and performance-
improving tips that were discovered during the development of the AUTO flow. Refer
to [16] for details on the PICO options mentioned in this chapter.
B.1 Stream Ordering
As described in Section 3.1, MPI operations are realized in hardware as a series of
stream operations. The order of these stream operations is defined by the MPE
protocol, and hence must be enforced for correct interaction with the MPE block.
However, stream ordering is tricky in PICO. Because PICO always tries to produce
a schedule that is as compact as possible, it will schedule two operations in the same
phase as long as it does not detect data dependency between the two operations. As
there is no direct data dependency between the streams, it is hard to make PICO
understand the sequential ordering constraint of the stream operations as imposed
by the semantics of the MPE protocol. A workaround that allows stream operations
to be ordered is to wrap each stream operation in a function and generate the function
as a Tightly Coupled Accelerator Block (TCAB). Since each TCAB is generated as an
independent sub-module, dummy data dependencies can be introduced between the
TCABs to force a sequential schedule. Program 2 shows an excerpt of the code for
the MPI_Recv function that illustrates the techniques used to establish ordering in
the stream operations.
In the code excerpt shown in Program 2, the write operation on the output command
stream is encapsulated in a TCAB called recv_cmd_out. The return value of
the TCAB is used in conditional statements to force the instructions enclosed in the
conditional statement to be scheduled at least one time unit after the execution of
recv_cmd_out. Program 2 also illustrates the technique of loop-sinking, where sequential
instructions before a loop are sunk into the loop to provide hints about the
ordering of the operations. Writing the code this way tells PICO clearly that a
single instruction is to be executed every iteration, so an instruction in a later iteration
will not be scheduled before one in an earlier iteration.
B.2 Improving Performance
With PICO the performance of the generated hardware can sometimes be improved
by turning on certain options manually. This section describes a few of them. More
details regarding the usage of these options can be found in [16].
Memory Type
There are three types of memories supported by PICO: internal fast register-based
memory, block RAM and user-supplied external memory, such as DDR. The memory
type can be specified using the following command in the C source file after the array
declaration:
#pragma (internal_fast | internal_blockram | user_supplied_fpga) <mem_name>
When the memory type is not explicitly specified, PICO chooses the type based
on the size of the array. However, some automatic optimizations are only available
for the register-based memories. Also, block RAMs can only have a maximum of 2
read/write ports, which may limit the amount of parallelism that can be exploited.
Program 2 Code excerpt to illustrate stream ordering

    unsigned char recv_cmd_out(unsigned long long cmd_out)
    {
    #pragma bitsize recv_cmd_out 1
    #pragma bitsize cmd_out 33
        pico_stream_output_mpe_cmd_out(cmd_out);
        return 1;
    }

    int MPI_Recv(int buffer[], int count, MPI_Datatype datatype,
                 unsigned int source, unsigned int tag, MPI_Comm comm,
                 unsigned int *in_tag_ptr, unsigned int *in_src_ptr)
    {
        unsigned char done = 0;
    #pragma bitsize done 1
        unsigned int i;

    #pragma num_iterations (3, , )
        for (i = 0; i < count + 3; i++) {
            if (i == 0) {
                unsigned long long cmd_out = MPE_RECV_OPCODE | (source << 22) |
                                             count | FSL_CTRL_MASK;
    #pragma bitsize cmd_out 33
                recv_cmd_out(cmd_out);
            }
            else if (i == 1) {
                done = recv_cmd_out(tag & FSL_DATA_MASK);
                if (done) {
                    *(in_src_ptr) = (pico_stream_input_mpe_cmd_in() &
                                     NET_SOURCE_MASK) >> 24;
                }
            }
            else if (i == 2) {
                *(in_tag_ptr) = pico_stream_input_mpe_cmd_in() & FSL_DATA_MASK;
            }
            else {
                /* Standard receive: data from stream */
                if (done) {
                    buffer[i - 3] = pico_stream_input_mpe_data_in() &
                                    FSL_DATA_MASK;
                }
            }
        }
        return MPI_SUCCESS;
    }
Loop Trip Counts
Specifying the trip counts for loops saves area and improves speed, because
PICO can optimize better when it knows the expected number of iterations of a loop. The loop
trip count can be specified using the following pragma immediately before the start
of the loop:
#pragma num_iterations (<min>, <expected>, <max>)
Hand Perfectization
Perfectization, in the PICO lingo, is the process of transforming the original source code
into a sequence of loop nests with no sequential code between the loops. This is
done automatically by PICO during the scheduling phase. In certain cases, the user
can do a better job of perfectizing the code than PICO can. For example, in Program
2, hand perfectization is used to sink the initial sequential setup code into the loop.
The resulting hardware can start a new iteration at each cycle since it is clear that
only one instruction is executed per iteration. Without hand perfectization, however,
PICO could choose to sink all initial sequential instructions into iteration 1, and as a
result the scheduler would not be able to schedule one iteration per clock cycle.
Memory Porting Arbitration
Sometimes PICO will produce a schedule that results in the need for a memory with
more than 2 ports. Such memories are not synthesizable in FPGAs. A remedy for
this is to specify the following pragma on the array in question and re-synthesize the
design:
#pragma forward_boundary_register <array_name> always
This results in additional logic that arbitrates accesses to the memory to
reduce the actual port requirement. However, it does not always work: sometimes
arbitration results in a miscomparison with the golden results in post-synthesis C
simulation or the final RTL simulation, indicating erroneous hardware. Therefore, this
option should be used with caution, and both simulations are strongly recommended
when it is enabled.
Bibliography
[1] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Han-
rahan. Brook for GPUs: stream computing on graphics hardware. ACM Trans-
actions on Graphics (TOG), 23(3):777–786, 2004.
[2] C. Cathey, J. Bakos, and D. Buell. A Reconfigurable Distributed Computing
Fabric Exploiting Multilevel Parallelism. Proceedings of the 14th Annual IEEE
Symposium on Field-Programmable Custom Computing Machines (FCCM06).
[3] Celoxica, low-latency and accelerated computing solutions for capital markets,
Curr. April 2008. http://www.celoxica.com.
[4] C. Chang, J. Wawrzynek, and R. Brodersen. BEE2: a high-end reconfigurable
computing system. Design & Test of Computers, IEEE, 22(2):114–125, 2005.
[5] Cray XD1 Supercomputer for Reconfigurable Computing, Technical report, Cray,
Inc. 2005. http://www.cray.com/downloads/FPGADatasheet.pdf.
[6] W. Gropp and E. Lusk. User’s Guide for MPICH, a Portable Implementation of
MPI. Argonne National Laboratory, 1994.
[7] Impulse Accelerated Technologies. Software Tools for an Accelerated World,
Curr. April 2008. http://www.impulsec.com.
[8] InfiniBand Trade Association. The InfiniBand Architecture Specification R1.2,
Technical Report, October 2004. http://www.infinibandta.org.
[9] Mentor Graphics. The EDA Technology Leader, Curr. April 2008.
http://www.mentor.com.
[10] The Message Passing Interface (MPI) Standard, Curr. April 2008. http://www-
unix.mcs.anl.gov/mpi/.
[11] A. Patel, C. Madill, M. Saldana, C. Comis, R. Pomes, and P. Chow. A Scalable
FPGA-based Multiprocessor. Proceedings of the 13th Annual IEEE Symposium
on Field-Programmable Custom Computing Machines, pages 71–72, 2006.
[12] M. Saldana and P. Chow. TMD-MPI: An MPI implementation for multiple
processors across multiple FPGAs. IEEE 16th International Conference on Field
Programmable Logic and Applications, 2006.
[13] M. Saldana, D. Nunes, E. Ramalho, and P. Chow. Configuration and Program-
ming of Heterogeneous Multiprocessors on a Multi-FPGA System Using TMD-
MPI. In the Proceedings of the 3rd International Conference on Reconfigurable
Computing and FPGAs, September 2006.
[14] L. Shannon and P. Chow. Maximizing system performance: using reconfigurabil-
ity to monitor system communications. Field-Programmable Technology, 2004.
Proceedings. 2004 IEEE International Conference on, pages 231–238, 2004.
[15] Synfora, Inc. PICO Express FPGA - PICO RTL: Synthesis, Verification and
Integration Guide, 8.01 edition, 2008.
[16] Synfora, Inc. PICO Express FPGA - Writing C Applications: Developer’s Guide,
8.01 edition, 2008.
[17] Synfora, Inc., Curr. April 2008. http://www.synfora.com.
[18] Top 500 Supercomputer Sites, November 2007. http://www.top500.org.
[19] Xilinx, Inc. Fast Simplex Link (FSL) Bus (v2.00a) Product Specification, ds449
edition, December 2005.
[20] Xilinx, Inc. Xilinx University Program Virtex-II Pro Development System Hard-
ware Reference Manual, ug069 (v1.0) edition, March 2005.