
An Automated Flow to Generate Hardware

Computing Nodes from C for an FPGA-Based

MPI Computing Network

by

D.Y. Wang

A THESIS SUBMITTED IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF

BACHELOR OF APPLIED SCIENCE

DIVISION OF ENGINEERING SCIENCE

FACULTY OF APPLIED SCIENCE AND ENGINEERING

UNIVERSITY OF TORONTO

Supervisor: Paul Chow

April 2008

Abstract

Recently there have been initiatives from both industry and academia to explore the use of FPGA-based application-specific hardware acceleration in high-performance computing platforms, as traditional supercomputers based on clusters of generic CPUs fail to scale to meet the growing demands of computation-intensive applications due to limitations in power consumption and cost. Research has shown that a heterogeneous system built exclusively on FPGAs, using a combination of different types of computing nodes including embedded processors and application-specific hardware accelerators, is a scalable way to use FPGAs for high-performance computing. An example of such a system is the TMD [11], which also uses a message-passing network to connect the computing nodes. However, the difficulty of efficiently designing high-speed hardware modules from software descriptions is preventing FPGA-based systems from being widely adopted by software developers. In this project, an automated tool flow is proposed to fill this gap. The AUTO flow automatically generates a hardware computing node from a C program that can be used directly in the TMD system. As an example application, a Jacobi heat-equation solver is implemented in a TMD system where a soft processor is replaced by a hardware computing node generated using the AUTO flow. The AUTO-generated hardware module shows equivalent functionality and some improvement in performance over the soft processor. The AUTO flow demonstrates the feasibility of incorporating automatic hardware generation into the design flow of FPGA-based systems so that such systems can become more accessible to software developers.

Acknowledgment

I acknowledge Synfora and Xilinx for hardware, tools and technical support, and my

supervisor, Professor Paul Chow, for his guidance, patience, and insights, all of which

are very valuable for the completion of this project. Thanks to Chris Madill and Arun

Patel for their help in setting up the development environment, and Manuel Saldana

for help with the MPE network and scripts, and patiently answering all my questions

during the many unscheduled drop-by visits. Also many thanks to Henry Wong for

discussions, suggestions and debugging tips, and Ryan Fung for proofreading the final

report. Finally, I would like to thank my mother for her love and support as always.


Contents

1 Introduction
2 Related Work
   2.1 FPGA-Based Computing
   2.2 The TMD-MPI Approach
   2.3 Behavioral Synthesis
3 System Setup
   3.1 TMD Platform Architecture
   3.2 Design Flow
   3.3 C-to-HDL Using PICO
4 Implementation of the Tool Flow
   4.1 Flow Overview
   4.2 MPI Library Implementation
   4.3 Control Block
   4.4 Scripts
      4.4.1 Preprocessing Script
      4.4.2 Packaging Script
   4.5 Test Generation
   4.6 Limitations
      4.6.1 Floating-Point Support
      4.6.2 Looping Structure
      4.6.3 Pointer Support
      4.6.4 Division Support
      4.6.5 Performance Specification
      4.6.6 Hardware Debugging
      4.6.7 Exploitable Parallelism
5 The Heat-Equation Application
   5.1 Implementation
   5.2 Experiment Methodology
   5.3 Results
6 Conclusion and Future Work
Appendix
A Hardware Controller for PICO PPA
   A.1 Control FSM
   A.2 Stream Interface Translation
B Using PICO: Tips and Workarounds
   B.1 Stream Ordering
   B.2 Improving Performance
Bibliography

Glossary

The glossary contains the acronyms that are used in this report.

• CAD – Computer Aided Design

• CPE – Cycles Per Element

• DCM – Digital Clock Manager

• FSL – Fast Simplex Link. Xilinx’s FIFO stream IP block.

• FSM – Finite State Machine

• HDL – Hardware Description Language

• HPC – High-Performance Computing

• IP – Internet Protocol

• IP – Intellectual Property

• MHS – Microprocessor Hardware Specification

• MSS – Microprocessor Software Specification

• MPI – Message Passing Interface

• MPE – Message Passing Engine. Provides MPI functionality to hardware accelerators in a TMD system.

• NetIf – Network Interface used in the TMD network

• PICO – Program-In Chip-Out. An algorithmic synthesis tool from Synfora, Inc.

• PPA – Pipeline of Processing Arrays. The top-level hardware block generated from a function by the PICO flow.

• TCAB – Tightly Coupled Accelerator Block. A hardware module generated by PICO from a C procedure that can be used as a black box when generating a higher-level hardware block.

• TMD – Originally the Toronto Molecular Dynamics machine; now refers to the exclusively FPGA-based HPC platform developed at the University of Toronto.

• VLSI – Very Large Scale Integrated Circuit

• XPS – Xilinx Platform Studio. Xilinx’s embedded processor system design

tool.

• XST – Xilinx Synthesis Technology. Xilinx’s synthesis tool.

• XUP – Xilinx University Program


List of Figures

3.1 TMD platform architecture ([13])
3.2 Network configuration for different node types ([13])
3.3 TMD design flow ([13])
3.4 PICO design flow ([15], p.5)
4.1 The AUTO tool flow
4.2 Stream operations required to implement MPI behaviour
4.3 TMD system testbed
5.1 A simple two-node TMD implementation of a Jacobi heat-equation solver
5.2 Main loop execution time per element with different iteration lengths
A.1 Design of the control block
A.2 State transition diagram of the control block FSM
A.3 The PICO stream interface
A.4 The FSL bus interface

List of Tables

4.1 Implemented families of MPI functions
5.1 Normalized computing power of the reference and test systems
A.1 I/O ports exported by the control block
A.2 Raw control ports on the PPA module

Chapter 1

Introduction

Much of today’s scientific research relies heavily on numerical computations and

demands high performance. Computational fluid dynamics, molecular simulation,

finite-element structural analysis and financial trading algorithms are examples of

computation-intensive applications that would not have been possible without the

advances in computing infrastructure. Since the 1960s, generations of supercomputers have been built to address the growing needs of the scientific community for more computing power. With the improved performance and availability of microprocessors, clusters of conventional CPUs connected in a network using commercially available interconnects became the dominant architecture used to build modern supercomputers. As of November 2007, 409 of the top 500 supercomputers were cluster based [18].

However, as computing throughput requirements of new applications continue

to increase, supercomputers based on clusters of generic CPUs become increasingly

limited by power budgets and escalating costs and cannot scale further to keep up with

the demand. As a result, specialized hardware accelerators became popular. In recent

years there have been significant developments in both GPU-based and FPGA-based

computing models. While GPUs demonstrated remarkable performance improvement

in highly data-parallel stream-based applications [1], FPGAs, with the flexibility they

offer, are good candidates for specialized hardware acceleration systems.

In order to leverage FPGAs in high-performance computing systems, hardware


accelerators need to be built from software specifications. The primary challenge in

this is that hardware design is intricate and software developers typically do not have

the expertise to design high-performance hardware. On the other hand, having to

have both software and hardware designers working on the same project is costly and

inefficient. As a result, hardware acceleration has not been adopted more widely among software developers. A tool flow that allows software designers to easily harness the power of hardware acceleration is hence essential to make hardware

acceleration feasible in non-high-end applications.

To address this need, we show in this project an automated tool flow, AUTO,

that generates a hardware accelerator from a C program directly. This work builds on

previous work on TMD, which is a scalable multi-FPGA high-performance computing

system [11] that consists of a collection of computing nodes, where each node can be a

soft processor or a hardware engine. The TMD uses the TMD-MPI message-passing

programming model [12] for inter-node communication. The AUTO flow takes in an

MPI program written in C as input and produces a hardware computing node that

can be used directly in the TMD system.

As a proof-of-concept prototype, the main objective of this project is to explore

the possibility of algorithmic synthesis to target an FPGA-based system, with a

focus on the feasibility of an automated tool flow. A Jacobi heat equation solver is

implemented on TMD as an example application to demonstrate the functionality

of the AUTO flow. With little designer intervention, we are able to automatically

generate a functional hardware block that performs better than the soft processor

node it replaces. Our eventual goal is to completely automate the design flow that

generates a system of hardware accelerators from a parallel program as opposed to a

single hardware computing node at a time.

The rest of the report is organized as follows. Chapter 2 reviews existing research

work in FPGA-based computing and algorithmic synthesis, which provides context

to our work. Chapter 3 describes the TMD platform, the TMD-MPI design flow

and AUTO’s role in it. Chapter 4 explains the implementation of the AUTO flow.

The limitations of the implementation are also outlined. In Chapter 5, a sample


application is presented with some performance results. Finally, in Chapter 6, we

summarize our findings and give suggestions for future work.


Chapter 2

Related Work

Recent research has shown that FPGA-based high-performance computing models

have the potential to speed up certain computing tasks significantly using application-specific hardware acceleration. The disadvantage is that it sacrifices the generality

offered by CPUs. This is remedied by the reconfigurability of FPGAs, which allows

them to be reprogrammed for different computing tasks. Consequently, the success

of FPGA-based systems hinges on an efficient underlying computing infrastructure

and a flexible design flow, which the AUTO flow tries to address. This section

presents an overview of existing research in the areas of FPGA-based computing and

algorithmic synthesis. It provides context for our work on the AUTO flow.

2.1 FPGA-Based Computing

With power consumption becoming an increasingly critical design constraint for high-performance computing systems, many vendors of traditional cluster-based systems have started to incorporate hardware acceleration using FPGAs. Examples include the Cray XD1 [5] and the HP ProLiant DL145 servers using Celoxica's RCHTX FPGA acceleration boards [3]. These systems use FPGAs as coprocessors to exploit fine-grained parallelism in algorithms to improve the overall performance. The fork-join control flow is most natural to these systems due to the master-slave relationship between the processor and the FPGA. The master processor is responsible for the coordination and synchronization among the computing slaves. It executes instructions sequentially, and opportunistically farms out computation-intensive tasks to hardware accelerators on the FPGA coprocessors. Since the inherently parallel hardware structures on the FPGA are controlled by the sequential processor, maximizing efficiency to amortize the overhead of data transfer and synchronization requires significant effort from both hardware and software designers.

Starting from the 90nm process node, FPGAs have been built with high enough density and speed to make them possible contenders for high-performance computing platforms. The BEE/BEE2 system [4], TMD [11] and the one presented in [2] are examples of high-performance computing platforms built exclusively on FPGA-based technologies. Some of these systems use only application-specific hardware modules, while others use soft processors that are embedded in the FPGA fabric and provide low-latency communication channels between the embedded processors and the hardware computing nodes. The latter heterogeneous architecture enables more efficient use of on-chip resources to improve the throughput vs. area ratio under a given power constraint for specific applications. The advantage of using FPGAs to build high-performance computing platforms is that the application-specific portions can be easily reconfigured to suit the needs of a variety of applications.

2.2 The TMD-MPI Approach

The Toronto Molecular Dynamics (TMD) system is a scalable multi-FPGA high-performance computing platform developed at the University of Toronto. The original motivation for the system was to address the increasing demand for computing power in molecular dynamics simulation. However, as the platform has developed, it is no longer limited to molecular dynamics. Today it is a testbed for FPGA-based high-performance computing systems and design flows for such systems.

As mentioned earlier, the TMD is a heterogeneous system that consists of computing nodes, which can be soft processors or application-specific hardware modules

(hardware engines). A TMD system uses a distributed-memory architecture where


each computing node has its own memory and address space. This architecture is simple and scalable for highly parallel systems since it does not need to consider memory

coherency issues or memory bus congestion, which would be critical for a shared-memory system. For distributed-memory systems, message passing has been proven

to be an efficient programming model. The de facto message passing API used in the

high-performance computing community is the Message Passing Interface (MPI) [10].

The MPI API provides a generic platform-independent interface by specifying only

the functionality and syntax of the interface. The actual implementation depends

completely on the host platform. MPICH is a popular C implementation of MPI for

computer clusters using Linux or Windows [6]. The TMD-MPI is a lightweight subset

of MPI designed for embedded systems on the TMD. It contains two components: a

software library for use with soft processors, and the Message-Passing Engine (MPE),

which is a hardware implementation of the MPI API that can be used with hardware

engines [12]. The software component does not require an operating system and has

a very small memory footprint. With TMD-MPI, C programs written using MPI can

be ported to embedded processors in a TMD system with minimal modification [13].
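To make this concrete, the following minimal MPI program is the kind of code that can run unchanged on a workstation cluster under MPICH and on MicroBlaze nodes under TMD-MPI; the buffer size, ranks, and tag are illustrative and not taken from the thesis.

#include <mpi.h>   /* under TMD-MPI, the corresponding lightweight header is used instead */

#define COUNT 16

int main(int argc, char *argv[])
{
    int i, rank, data[COUNT];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (i = 0; i < COUNT; i++)
            data[i] = i;                                   /* produce some work */
        MPI_Send(data, COUNT, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(data, COUNT, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        /* ... compute on data ... */
    }

    MPI_Finalize();
    return 0;
}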

TMD-MPI provides an abstraction of inter-node communication so that software

developers do not need to be aware of the details of the communication infrastructure.

For the TMD, the underlying network is realized using point-to-point unidirectional

links implemented as FIFOs. Each hardware node is connected to a dedicated MPE,

which in turn connects to the rest of the network. Soft processors can connect directly

to the network or through an MPE; both are supported in the TMD-MPI implementation. When the MPE is used, the type of the computing node behind the MPE

is hidden from the rest of the system. The benefit of such a modular network setup

is twofold: multi-threaded programs developed for a computer cluster can be easily

ported to the TMD by instantiating a soft processor for each thread of the cluster,

and soft processors hosting computation-intensive programs can then be identified

and replaced by hardware engines without the rest of the system noticing. If the generation of the hardware engines can be automated, this TMD architecture will enable

software developers to easily leverage hardware acceleration. This is the motivation


behind the AUTO flow.

2.3 Behavioral Synthesis

Designing a hardware accelerator from a software description often requires hardware

designers to work closely with software designers in order to arrive at an efficient

design. The engineering effort required often deters software developers from using

hardware acceleration. In order to make FPGA-based high-performance computing

systems more accessible to software developers, the conversion of software into hardware needs to be automated. This is the field of behavioral synthesis, sometimes

also known as high-level synthesis. It refers to the generation of a logic circuit, often

in the form of a Hardware Description Language (HDL), such as Verilog or VHDL,

from high-level functional descriptions of the desired system. Behavioral synthesis is

not a new problem. Exponential growth of the number of transistors in integrated

circuits has led to increased complexity of VLSI systems and the engineering effort

required to design them. As a result, a great deal of research effort has been spent

in the past three decades on developing high-performing and robust CAD tools that

will create the logic circuits based on a description of the desired functionality of the

circuit, which is often specified in a high-level software language such as C/C++, or

Matlab.

The challenge in behavioral synthesis comes from the inherent difference in the

software and hardware design paradigms. A software developer is more familiar with

a data-centric view, where a program is seen as a sequence of tasks performed on

a set of data. On the other hand, the hardware designer uses a time-centric view

and thinks about the hardware resources used in each clock cycle [16]. The data-centric view does not easily expose the parallelism contained in the algorithm. A

behavioral synthesis tool needs to understand the data-centric view described by the

software, and then schedule the operations, allocate hardware resources, and generate

the necessary control logic to provide the functionality of the software, while exploiting

concurrency in the algorithm. This process involves many optimization decisions and


trade-offs that cannot be easily automated.

Recently there have been some commercial realizations of behavioral synthesis

tools for applications in specific domains. Impulse C [7], Handel C [3], Catapult C

[9] and PICO [17] are a few examples. Impulse C is a subset of ANSI C that can be

given to an Impulse C compiler to generate HDL output. It allows software to be

partitioned into software processes and hardware processes. A C-compatible library

is supplied to support a parallel, stream-based programming model. The library

functions are used to facilitate stream operations, such as open, close, read, and write,

as well as communication of control messages between the software and hardware

processes. The compiler generates hardware from Impulse C library functions and

other C statements. Handel C is very similar to Impulse C. It also has library functions

to support floating point operations. Catapult C is based on ANSI C++ instead of

C. All three perform best with data-oriented stream-based parallel applications.

By making use of non-standard extensions to specify parallelism, the compiler can

be better guided to produce more efficient hardware. However, the disadvantage is

that programs that are ported to one of these languages can no longer be compiled

or debugged using standard C compilers or debuggers.

The Synfora PICO (Program-in Chip-out) is slightly different from the three other

tools mentioned above. Instead of non-standard extensions, it uses pragmas to specify

parameters related to hardware generation. Because it ignores these pragmas, a

C compiler can compile a C program prepared for PICO synthesis. The minimal

deviation between the original C version and the PICO-compliant version of the

software makes PICO a suitable tool for this project.


Chapter 3

System Setup

The AUTO flow has been developed to address the need for an automated CAD flow to

convert soft-processor computing nodes in a TMD system to an equivalent hardware

computing node. This section describes the architecture of the TMD testbed used

for this project and the design flow, which incorporates AUTO.

3.1 TMD Platform Architecture

The TMD system can contain multiple FPGAs connected in a 3-tier network as

shown in Figure 3.1. Tier 1 is the on-chip network used for intra-FPGA communication among computing nodes located on a single FPGA. This network is based

on unidirectional point-to-point links, which are implemented using Xilinx FSLs [21].

Each FSL is a 32-bit wide, 16-word deep FIFO. Network interface blocks (NetIf)

are used in the tier-1 network to route packets from source to destination through

a multi-hop path based on a unique ID of each node, called the rank. The routing

table is contained in the NetIf. Each FPGA also contains several gateway nodes, one

for each neighbouring FPGA. The NetIfs forward packets whose destination ranks

reside outside of the current FPGA to the appropriate gateway node. The tier 2 network consists of several FPGAs in a cluster placed on the same printed circuit board.

High-speed serial I/O links, such as the RocketIO MGT (Multi-Gigabit Transceiver)

are used for inter-FPGA communication. There are no network routing components


Figure 3.1: TMD platform architecture ([13])

in this tier. The tier 3 network facilitates communication between FPGA clusters

and can be implemented using standardized high-speed switches such as InfiniBand

switches [8]. This 3-tier system masks the implementation details of the network and

provides a uniform view to individual computing nodes. It also allows the TMD to

be easily scaled up to meet the demand of the application.

Because of the network abstraction provided by the 3-tier system, a simple system

with only the tier 1 network can be used as the testbed for the development of

the AUTO flow without loss of generality. A Xilinx University Program (XUP)

Development Board [20] with a single Virtex-II Pro XC2VP30 FPGA is used to implement

the TMD testbed for this project.

There are two types of computing nodes that are of interest to the AUTO flow:

processor nodes, which are implemented using the Xilinx MicroBlaze soft processor

core, and hardware engine nodes, which can be either hand designed or generated by

the AUTO flow. The network configuration of each type is shown in Figure 3.2. The

TMD-MPI library implementation allows a MicroBlaze to be connected to the NetIf

directly, or through an MPE. A hardware engine always uses an MPE. When an MPE

is used, two sets of FSLs connect the computing node to the MPE, where one set

carries the MPE command traffic and the other carries the MPE data traffic. There

are two individual FSLs within each set; one for the inbound traffic and the other


Figure 3.2: Network configuration for different node types ([13])

for the outbound traffic. The NetIfs are connected to each other in a partial-mesh

topology, where two FSLs, one for each direction, are used for each pair of NetIfs that

are connected.

With this setup, an MPI program written in C can be ported to a soft processor

node using the TMD-MPI library. The network abstraction allows this node to be replaced by a functionally equivalent hardware engine to improve performance without

any impact on the rest of the system. The objective of the AUTO flow is to automate

this conversion process. The AUTO flow analyzes the C program on the MicroBlaze

to prepare it for C-to-HDL conversion using Synfora’s PICO algorithmic conversion

tool. When the hardware is generated, the AUTO flow packages it into a peripheral

that can be used as a hardware engine node directly in the TMD system.

3.2 Design Flow

A four-stage design flow is proposed for the TMD system in [13]. This is illustrated in

Figure 3.3. In stage 1, the user prototypes the application in C on a workstation. In

stage 2, the application is parallelized using a well-known MPI distribution, such as

MPICH, and tested on a cluster of workstations. In stage 3, a TMD system is created

by mapping each MPI process in the parallel version of the application from stage 2

to a soft processor node. The TMD-MPI library is used instead of MPICH starting


Figure 3.3: TMD design flow ([13])

from this step. Because both TMD-MPI and MPICH implement the same API, the

porting process requires minimal changes to the original C source code. In stage 4,

the soft processor nodes that are executing the most computation-intensive portion

of the application are identified for hardware acceleration and subsequently replaced

by faster hardware blocks. The decision of which soft processors to replace requires

detailed profiling of the system performance and an understanding of the tradeoffs

between different system resources. Hence it will remain a job of the system designer

for now. However, the laborious process of designing hardware from a functional

description in software will be automated by AUTO.

The TMD system in stages 3 and 4 is designed using the Xilinx EDK/ISE 9.1 suite

of tools. Design entry can be done manually using the EDK GUI for simple systems.

However, for large systems, this manual process is error-prone. An automated tool

such as the System Generator [14] is recommended. All hardware engines, including

the MPE and the NetIf, are available as custom peripherals that can be imported into

EDK. The hardware components of the system are specified by the Microprocessor


Hardware Specification (MHS) file while the software components are specified by the

Microprocessor Software Specification (MSS) file. The system is completely defined

by the MHS and MSS, and is implemented using the Xilinx tool flow. After this,

a custom script compiles the software using a specified version of TMD-MPI and

initializes the routing tables in the NetIfs. The output is the final bitstream that is

used to program the FPGA.

Synfora’s PICO Express design suite is used to generate hardware from C programs. The PICO flow is explained in the next section.

3.3 C-to-HDL Using PICO

The PICO C-to-HDL flow is illustrated in Figure 3.4. The input to PICO Express is

a C program that contains the module to be converted to hardware, which is App.c in

the figure, and driver code the calls the target procedure. Both App.c and the driver

code can be written in ANSI C, with some restrictions; for example, no floating-point

numbers or pointers are supported. A complete list can be found in [16].

At the start of the flow, PICO performs the Golden Simulation, where the input C

source code is compiled with the driver code and executed on the workstation on which

PICO is running. The output from the golden simulation is checked against the reference output provided by the user. This step is useful if the original C program was modified

to make it PICO-compliant. To guard against bugs introduced in this process, the

user supplies reference input and output that are used to verify the golden simulation.

If the golden simulation results agree with the reference, the golden results are saved

for future reference. After the golden simulation, PICO proceeds through several

steps, transforming the source code into the final output HDL. Simulation is run at

each step and results are checked against the golden results to ensure correctness.

In order to incorporate PICO into the TMD design flow, several additional processing steps and components are needed. These are provided in the AUTO flow, and

described in the next chapter.


Figure 3.4: PICO design flow ([15], p.5)


Chapter 4

Implementation of the Tool Flow

This section describes the implementation of the AUTO flow. The objective of the

AUTO flow is to take a soft processor in the TMD system and generate a functionally

equivalent hardware engine. Efficient conversion from software to hardware is tricky

and has traditionally been done manually. The AUTO flow automates this process

by providing a set of tools, a library and hardware components.

4.1 Flow Overview

The AUTO flow contains three components: an MPI implementation that is compliant with PICO (PICO MPI), a lightweight hardware control block, and the scripts

that perform the operations involved in the flow. The actual flow consists of three

steps as illustrated in Figure 4.1.

Step 1 is the preprocessing of the user source file, which contains the top-level

function to be converted to hardware. This should be the main function from the

soft processor that is to be replaced by the hardware engine. The output of step 1

includes a PICO source file, the driver code, and a flow script. The PICO source

file is created based on the user input file by adding some pragma settings and the

PICO MPI library implementation. The driver code is the testing code. It contains

functions that redirect the stream I/O to file I/O so that the module in the PICO

source file can be tested as a standard C program. In step 2, PICO Express uses the


Figure 4.1: The AUTO tool flow

flow script and files from step 1 to generate the PPA core hardware. An iterative

method may be used in this step to explore the design space for best performance.

Verilog files that describe the generated hardware core are produced at the end of

step 2. In step 3, the control block is custom fitted to the generated core. The

resultant Verilog files are then packaged into a custom peripheral core that can be

used in the EDK flow. Since step 2 is not guaranteed to be successful in the first pass,

the user may need to manually adjust performance parameters when running step 2

iteratively. As a result, these three steps are not integrated into a single push-button

flow. However, all transformations done in step 1 and step 3 are encapsulated into

two scripts, which are provided as part of the AUTO flow. The next few sections

describe each component of the AUTO flow.

4.2 MPI Library Implementation

As described in Section 3.1, the MPI API is supported in the TMD system through

the TMD-MPI library and the MPE, where the underlying communication network

is built on FSLs. PICO does not support FSL directly as a native interface for the

hardware it generates. Instead, it provides a generic FIFO stream interface. Given

an MPI program that runs on a MicroBlaze node, we need a way to tell PICO to


interpret calls to MPI interface functions in the source code as a set of operations on

the streams. Therefore, a PICO-compliant MPI implementation is needed. Currently,

only MPI_Send and MPI_Recv are implemented in this PICO MPI library. However,

all other MPI functions build on these two functions and can be easily added to

the PICO MPI implementation. Figure 4.2 shows the stream operations involved in the MPI_Recv and MPI_Send operations, where cmd_in is the input command stream, cmd_out is the output command stream, data_in is the input data stream, and data_out is the output data stream. The numbers in the parentheses indicate the ordering of the stream operations.

Figure 4.2: Stream operations required to implement MPI behaviour: (a) MPI_Send, (b) MPI_Recv
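To give a sense of what the numbered stream operations in Figure 4.2 look like in C, the two functions might be structured along the following lines. The stream helpers and command encoding below are stand-ins invented for illustration; they are not the actual PICO stream primitives or the MPE command format.

/* Stand-ins for PICO's stream accesses and the MPE command encoding (assumptions). */
extern void stream_write_cmd_out(int word);
extern int  stream_read_cmd_in(void);
extern void stream_write_data_out(int word);
extern int  stream_read_data_in(void);
extern int  make_send_cmd(int dest, int tag, int count);
extern int  make_recv_cmd(int source, int tag, int count);
extern int  envelope_source(int envelope);
extern int  envelope_tag(int envelope);

static void mpi_send_sketch(int buf[], int count, int dest, int tag)
{
    int i;
    stream_write_cmd_out(make_send_cmd(dest, tag, count));   /* (1) describe the message    */
    for (i = 0; i < count; i++)
        stream_write_data_out(buf[i]);                       /* (2) stream out the payload  */
}

static void mpi_recv_sketch(int buf[], int count, int source, int tag,
                            int *src_rank, int *recv_tag)
{
    int i, envelope;
    stream_write_cmd_out(make_recv_cmd(source, tag, count)); /* (1) post the receive request */
    envelope = stream_read_cmd_in();                         /* (2) envelope: source and tag */
    *src_rank = envelope_source(envelope);
    *recv_tag = envelope_tag(envelope);
    for (i = 0; i < count; i++)
        buf[i] = stream_read_data_in();                      /* (3) stream in the payload    */
}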

Due to the restrictions on input C code imposed by PICO, two modifications to

the MPI API had to be made. First, the concept of a family of MPI functions is

introduced. A family of MPI functions refers to a collection of MPI functions that

provide the same functionality, but operate on different data arrangements. This is

needed because PICO does not support pointers. In a C implementation of the MPI

API, the function prototypes for MPI_Recv and MPI_Send look like the following:

int MPI_Send (void *buffer, int count, MPI_Datatype type, int dest, ...);

int MPI_Recv (void *buffer, int count, MPI_Datatype type, int source, ...);

The generic pointer buffer specifies the starting location in memory. Together

with count, they specify a vector. In general, any memory within the address space


of the calling program is acceptable. For example, buffer may point to the middle

of a 2D array. However, PICO does not support pointer access to memory. Hence

alternatives are needed.
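For example, a call such as the following is legal MPI usage, but its buffer argument is a pointer into the middle of a 2D array, which PICO cannot accept (the array name and dimensions are illustrative):

int grid[64][64];
...
/* Send one row of a 2D array: the buffer argument points into the middle of grid's storage. */
MPI_Send(&grid[5][0], 64, MPI_INT, 1, 0, MPI_COMM_WORLD);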

There are two approaches to accommodate the usage of MPI functions on different

data arrangements. One is to introduce another wrapper layer on top of the basic

PICO MPI implementation. For operation on 1D arrays, the default PICO MPI

implementation is used. For other data arrangements, the operation is performed

on a temporary buffer, and the values copied between the temporary buffer and the

actual data location. Clearly this is slow as every MPI operation on a buffer of size

N would require 2*N memory accesses. Alternatively, a different “flavour” can be

designed to target a particular data arrangement for each MPI function. All variants

of the same MPI function form a family. The three most common data arrangements

used in software are scalars, 1-dimensional vectors, and 2-dimensional arrays. The

PICO MPI implementation includes variants of MPI functions that target each of

these data arrangements, as shown in Table 4.1.

Table 4.1: Implemented families of MPI functions

  MPI Function Family                                        Operand
  MPI_OP        (int buf[], int count, ...)                  1D vector
  MPI_OP_Scalar (int *buf, int count, ...)                   A single scalar variable
  MPI_OP2D      (int buf[][maxc], int row, int count, ...)   A row in a 2D array

  * OP can be Send or Recv.

The second modification to the MPI API is particular to MPI_Recv. In the API specification, MPI_Recv takes in an MPI_Status data structure as an argument and populates it with information related to the receive operation. In C distributions of MPI, MPI_Status is implemented as a struct, which is not supported in PICO. Since the most often queried fields of MPI_Status are the source rank and tag, as a workaround, the function prototypes of the MPI_Recv family of functions are modified to take in two int references instead of an MPI_Status to pass back the source rank and tag information.
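Combining the two modifications, the 1D variant of the receive call therefore takes a prototype along the following lines (the trailing parameter names and ordering are an assumption, since the thesis elides them):

/* MPI_Status is replaced by two int references returning the source rank and tag. */
int MPI_Recv (int buffer[], int count, MPI_Datatype type, int source, int tag,
              MPI_Comm comm, int *src_rank, int *recv_tag);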

Even though the PICO MPI implementation is designed for hardware synthesis,


it still follows the ANSI C standard. The AUTO flow generates the driver code that

redirects the stream I/O to file I/O when the target program is compiled and executed

as a standard C program. The user is still responsible for providing the input file

and verifying the correctness of the output file. The generation of test stimuli and

reference output is described in Section 4.5.
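Conceptually, the generated driver builds the module as ordinary software and backs the streams with files. The following is a heavily simplified sketch of the idea, with illustrative file names and helper names; it is not the generated driver itself.

#include <stdio.h>

static FILE *data_in_file, *data_out_file;

/* Software stand-ins for the stream accesses: words come from and go to files. */
int stream_read_data_in(void)
{
    int word = 0;
    fscanf(data_in_file, "%d", &word);     /* stimuli captured from the TMD system */
    return word;
}

void stream_write_data_out(int word)
{
    fprintf(data_out_file, "%d\n", word);  /* later compared against the reference output */
}

extern void ppa_function(void);            /* the top-level function under test */

int main(void)
{
    data_in_file  = fopen("data_in.txt", "r");
    data_out_file = fopen("data_out.txt", "w");
    ppa_function();                        /* run the module as a plain C program */
    fclose(data_in_file);
    fclose(data_out_file);
    return 0;
}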

4.3 Control Block

After PICO generates the hardware core (referred to hereafter as the PPA, which

stands for Pipeline of Processing Arrays and is the PICO lingo for the hardware

core), a control block is needed to initialize and start the PPA on system startup,

and to translate the PICO stream interface to the FSL interface used to communicate

with the MPE. A simple finite-state-machine-based hardware controller is designed

to provide such functionality. The control block module is a wrapper around the

PPA hardware. It operates on the system clock and reset, and provides four FSL

bus interfaces that can be connected to the corresponding FSL bus interfaces on the

MPE. On system startup, the rank 0 node in the TMD system sends rank information

to all hardware nodes in the system. The control block receives the rank from the

network, initializes the PPA, and enables the translation between the FSL interfaces

and the PICO stream interfaces. All activities on the FSL interfaces are translated

into the corresponding control and data signals on the stream interface of the PPA

automatically. The control block sits idle until the PPA completes its current task.

Then it restarts the PPA to process the next task. A more detailed discussion on the

operations of the control block can be found in Appendix A.

4.4 Scripts

Two scripts are used to encapsulate the operations in the AUTO flow. They are

described in this section.


4.4.1 Preprocessing Script

The preprocessing script (auto_pico.pl) processes the user C program and generates

the files that are needed for the PICO Express flow. The preprocessing script is

invoked using the following command:

auto_pico.pl <src_C_file> <mpi_rank> <mpi_size> <transcript_file>

[Optional: <mem_option_file>]

Here src_C_file is the user C program. The mpi_rank and mpi_size parameters are the rank of the hardware node to be generated and the size of the TMD system. These two parameters are for simulation purposes only. When the hardware is implemented in the TMD system, they will be supplied as part of the system initialization procedure. The transcript_file contains the expected input and output on the FSL interfaces of the hardware block for a test application. This is used by the AUTO flow to generate input stimuli and reference outputs. Section 4.5 describes how this transcript can be obtained. The optional mem_option_file provides information about how the arrays in the program should be mapped to hardware. More details on this are given later in this section.

There are five tasks involved in the preprocessing step. They are listed below.

The first two tasks result in a new C file that is based on the user input file, with

added pragma settings and the PICO MPI implementations.

1. Parse through the user program and add #pragma settings

2. Attach PICO MPI implementations

3. Generate the driver code, input and reference output files

4. Generate Makefile

5. Generate run script for PICO

The input to the preprocessing script is a C program that contains one top level

function to be converted into PPA hardware, and any number of helper functions, all


contained in a single C file. The input program has to follow the constraints imposed

by PICO. The preprocessing script does not check for these. However, if there is a

violation, the PICO Express flow will error out when it is run in step 2. The user

can fix the source program and restart from this step. The first task in this step is to

parse out all array declarations and add the following pragma settings immediately

after the line in which the array is declared:

#pragma bitsize <size> <array_name>

#pragma host_access <array_name> none

The first line indicates the word size when the array is mapped to a memory. The

second line tells PICO not to expose any access ports for this memory as external

ports on the PPA because all communication with other modules in the system for a

TMD hardware engine is through the stream interfaces.
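For instance, if the user program declares an array (the name and sizes here are illustrative), the preprocessing script emits the annotated form:

int buf[256];
#pragma bitsize 32 buf         /* word size used when buf is mapped to a memory        */
#pragma host_access buf none   /* no external access ports for this memory on the PPA  */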

Another setting associated with arrays is the type of memory that the array maps

to. PICO supports three types of memories: internal fast, internal block RAM,

and user supplied. Internal fast memories are register-based. Internal block RAM,

as its name suggests, is a block RAM instantiated within the PPA hardware. User

supplied memories are placed outside the PPA. PICO will generate a standard SRAM

interface to interact with the external memory. The user is responsible for supplying

the external SRAM. To explicitly map an array to a particular type of memory, the

following pragma setting can be used:

#pragma (internal_fast|internal_blockram|user_supplied_fpga) <array_name>

When not specified, PICO chooses the memory type based on the size of the array.

However, the user can supply a memory option file to tell the preprocessing script

to set the type explicitly. A memory option file may contain lines that look like the

following, where a # at the beginning of a line indicates a comment:

# Specifying the memory type for the following arrays

array_1 blockram

array_2 fast

array_3 user_supplied


After the annotation of the arrays, the PICO MPI library implementations are

attached to the user code. Recall from Section 4.2 that each MPI function has a

different variant that targets a particular data arrangement. In the C program, all

MPI calls use the standard prototype. It is the preprocessing script’s job to replace

each call by the appropriate variant according to the type of the operand. This is

illustrated below. The same type of transformation will be done for MPI_Send.

/* Original Program */

void ppa_function (void)

{

int buf[BUFSIZE];

int buf2[NUM_ROWS][NUM_COLS];

int buf3;

...

// (1) Operand is a 1D vector

MPI_Recv (buf, count, MPI_INT, source, tag, MPI_COMM_WORLD, &status);

// (2) Operand is a 2D array

MPI_Recv (buf2[i], count, MPI_INT, source, tag, MPI_COMM_WORLD, &status);

// (3) Operand is a scalar

MPI_Recv (&buf3, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &status);

...

}

/* PICO-Ready Program */

void ppa_function (void)

{

int buf[BUFSIZE];

int buf2[maxr][maxc];

int buf3;

...

// (1) 1D variant of MPI_Recv

MPI_Recv (buf, count, MPI_INT, source, tag, MPI_COMM_WORLD, &status);

// (2) 2D variant of MPI_Recv; row = i

MPI_Recv2D (buf2, i, count, MPI_INT, source, tag, MPI_COMM_WORLD, &status);

// (3) Scalar variant of MPI_Recv

MPI_Recv_Scalar (&buf3, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &status);

...


}

With all MPI calls updated, the PPA program is ready to be used with the PICO

Express flow. The preprocessing script then generates the driver code that is used

by PICO for simulation and testing purposes during various stages of the PICO flow.

The driver code is generic and the same template can be used for any PPA function.

A Makefile is also generated so that the driver and the PPA program can be compiled

as a standard C application and tested prior to submitting to the PICO Express flow.

The final item generated by the preprocessing script is a flow script (run.tcl) that

contains the commands to set up the PICO project and invoke the flow. The flow

script tells PICO to generate the PPA core in a hierarchical fashion. Each MPI

function is generated as a sub module and instantiated in the PPA module. This

hierarchical approach provides better modularity and area efficiency, as opposed to

inlining all the modules.

Finally, with all the files generated, the user can start step 2 using the following

command:

prompt> pico_extreme_fpga run.tcl

4.4.2 Packaging Script

The packaging script (pack_pico.pl) takes the Verilog files generated in the PICO

Express flow and produces a custom peripheral core that can be directly imported

into Xilinx’s EDK. The packaging script is invoked using the following command:

pack_pico.pl <pico_experiment_directory> <name_of_peripheral>

Here pico_experiment_directory specifies where to retrieve the Verilog generated by

PICO. There are two steps involved in the packaging script: generating the control

logic and packaging the HDL into an EDK-importable peripheral. The control logic

is generated from a pre-designed template by setting a few parameters, including

the peripheral name and the PPA module name. Making a custom peripheral core

involves putting the source Verilog files in a specific directory structure, as shown

below:


mpi_jacobi_v1_00_a/

data/

mpi_jacobi_v2_1_0.mpd

mpi_jacobi_v2_1_0.pao

hdl/

verilog/

mpi_jacobi.v (Control logic)

mpi_jacobi_ppa.v (PPA top-level wrapper)

... (Other PPA files)

The packaging script creates the mpi_jacobi_v1_00_a/ directory for version 1.00.a of a peripheral named mpi_jacobi. This directory can be placed in any custom IP repository to make it available for use with the EDK tools. All source code files are

collected under the hdl/verilog/ sub-directory. The data/ sub-directory contains the

Microprocessor Peripheral Description file (.mpd), and the Peripheral Analysis Order

file (.pao). The MPD describes the interface exported by the peripheral as well as

the parameters that can be set in the MHS file of an embedded system. The PAO

contains the list of all source files required to build this peripheral. Both files are

generated automatically by the packaging script.

4.5 Test Generation

As mentioned earlier, in order to run the PICO flow, the user needs to provide input

stimuli for the input streams and reference output for the output streams. These can

be generated from the TMD system directly. Figure 4.3 shows a TMD prototyping

system where one of the two MicroBlaze processors is to be replaced by a hardware

engine. Since the replacement of the MicroBlaze by a hardware engine is transparent

to the rest of the system, the node behind the MPE can be viewed as a black box that

can be implemented either as a soft processor node or a hardware engine. Since all

communication with the black box is through the FSL interfaces, by snooping the FSL

interfaces and capturing the activities, both the input stimuli and the corresponding

reference output can be obtained. To do this, the TMD-MPI library is instrumented


Figure 4.3: TMD system testbed

so that it prints out the raw data sent to and received from the FSL interfaces when a

compiler flag is turned on. A transcript obtained this way is then given to the AUTO

flow to produce the input and reference output files that are used by PICO during

hardware generation.
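The instrumentation itself amounts to logging every word that crosses an FSL when a compile-time flag is defined. The following sketch conveys the idea; the flag name, output format, and the wrapper function are illustrative rather than the actual TMD-MPI code.

#include <stdio.h>

/* Inside the TMD-MPI library: a wrapper around the low-level FSL write.
 * When TMD_MPI_TRACE is defined, each outgoing word is echoed to form the transcript. */
static void fsl_write_word(int fsl_id, int word)
{
#ifdef TMD_MPI_TRACE
    printf("FSL %d <- 0x%08x\n", fsl_id, (unsigned int)word);
#endif
    /* ... actual write to the FSL hardware ... */
}

A matching wrapper around the FSL read produces the reference-output side of the transcript.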

4.6 Limitations

The AUTO flow was developed as a proof-of-concept prototype to demonstrate the

feasibility of automatically generating a hardware accelerator from software descrip-

tions. Consequently, there are still some limitations of the current version. Most of

the issues described in this section are due to limitations of the various tools inte-

grated into the AUTO flow. Many of them are likely to be resolved in the future.

The others are inherent to the shift in design strategies from a software paradigm to

a hardware paradigm. Hopefully, more sophisticated automation tool flows will help

bridge this gap.


4.6.1 Floating-Point Support

A major limitation of the current version of the PICO tool is the lack of support

for floating-point numbers. Currently, only fixed-point numbers (char, int, long) are

accepted. Floating-point arithmetic is more complicated to support, hence hardware-based

applications have traditionally used fixed-point-based approaches. On the other hand,

all modern microprocessors have floating-point support. It is rare to see software,

especially for scientific-computing purposes, that does not use floating-point numbers.

As a result of this limitation, the range of applications that can be converted to

hardware using the PICO flow is severely limited. In order to use the PICO flow as it

is now, the designer will have to rewrite the original software application using fixed-

point numbers. There are algorithmic synthesis tools commercially available now

that support floating-point operations. An example is Handel C [3]. Perhaps in the

future PICO will also incorporate this feature.

4.6.2 Looping Structure

Using more than one loop inside another loop is not allowed. This limits the complexity of the software that can be put through PICO. An alternative to this is to

turn each internal loop-nest into a Tightly Coupled Accelerator Block (TCAB). This

seems to be a cleaner way when working with large designs due to the nice hardware

design hierarchy provided by its usage; although programs with complicated looping

structures should probably be avoided because they make timing performance analysis

harder.

4.6.3 Pointer Support

The pointer in C is a powerful concept that allows flexible access to memories. Many

C programs make extensive use of pointers. However, general pointer accesses are

currently not supported in PICO, possibly to avoid complex pointer-aliasing analyses.

Consequently, programs that use pointers need to be manually inspected to replace

those instances of pointer usage that are illegal in PICO with equivalent alternatives.


This process may involve rewriting portions of the code. Due to the dynamic nature

of such analysis, AUTO does not attempt to do this automatically. It is the user’s

responsibility to provide AUTO a source program that follows PICO’s guidelines in

terms of pointer usage.
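As a simple illustration of the kind of rewrite this involves (not an example taken from the thesis), a loop that walks a buffer with pointer arithmetic can usually be re-expressed with explicit array indexing:

/* Pointer-arithmetic version of the kind disallowed by PICO's guidelines. */
int sum_ptr(int *p, int n)
{
    int s = 0;
    while (n-- > 0)
        s += *p++;
    return s;
}

/* Equivalent version using explicit array indexing. */
int sum_array(int a[], int n)
{
    int i, s = 0;
    for (i = 0; i < n; i++)
        s += a[i];
    return s;
}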

4.6.4 Division Support

Division with an arbitrary divisor is not supported when using the Xilinx XST synthesis tool. XST can only synthesize a divider when the divisor is a power of 2.

However, PICO instantiates a general-purpose divider even when the divisor can be

determined to be a power of 2 at compile time. PICO recommends Synplify Pro to

be used to synthesize generic dividers. However, we do not have access to a license

of Synplify Pro, hence we cannot verify this recommendation. As a result of this

limitation, applications with general division are not supported by the AUTO flow.

A workaround exists for fixed point division when the divisor is a power of 2. This

is done by replacing the division with a bitwise right shift (>>) by the appropriate

number of bits.
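For example, dividing an accumulated value by four can be expressed without a divider (valid when the dividend is non-negative):

unsigned int divide_by_4(unsigned int sum)
{
    /* return sum / 4;  -- would instantiate a general-purpose divider */
    return sum >> 2;     /* right shift by log2(4) = 2 bits instead    */
}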

4.6.5 Performance Specification

The AUTO flow currently does not optimize for performance. It is up to the user

to decide what the appropriate performance parameters should be. One reason is

that the user should be more familiar with the required performance constraints for

the target hardware. Secondly, for complex software, the PICO flow may need to be

invoked iteratively to find a good set of performance constraints that deliver the best

area-delay-power tradeoff. Performance constraints can be specified in terms of MITI

(Minimum Inter-Task Interval) to control the amount of task overlap, or II (Initiation

Interval) for a loop to affect how tightly loop iterations can be scheduled. When a

constraint is not specified, PICO tries aggressively to obtain a compact schedule.

Sometimes this schedule may be impossible to meet during synthesis. Therefore, an

iterative approach might be needed to manually find the sweet spot.


Another parameter that has been shown to improve performance is the trip count for

loops, which is the expected number of iterations in the loop. When this is specified,

PICO can better optimize for both area and speed. However, the trip count cannot

be determined from static analysis of the code in general. Hence, the AUTO flow does

not set this option for loops. It is up to the developer to tune this option manually

if desired.

4.6.6 Hardware Debugging

In the current flow, once the software is converted to hardware, it is very difficult to

debug if the hardware is nonfunctional. There are two main causes of a malfunctioning

hardware module: bugs in the user C program and bugs in the PICO flow. For bugs

in the PICO flow, the user needs to use standard hardware debugging tools to track

down the problem. This requires knowledge of Verilog and hardware design and

debugging in general, which an average software developer may not possess. On the

other hand, if the bug is in the user program, standard C debuggers can be used. This

is likely to be the case in the long run, when the PICO flow is expected to be relatively

stable. In this case, it would be helpful if the live input and output of the hardware

module in its working environment can be captured and used to generate test stimuli

for the C program. This way, the software developer can test the program with real

hardware data in a familiar C debugging environment.

4.6.7 Exploitable Parallelism

The amount of parallelism that can be exploited by PICO is limited to that exposed

in the software. Badly designed software will result in inefficient hardware. Software

optimization and automatic parallelism extraction are a separate research field. The

AUTO flow only prepares a piece of software for use in the PICO C-to-HDL flow.

It does not optimize the source code. The software designer hence should use discretion when writing the program. PICO provides a number of options to automate

some basic optimizations such as full loop unrolling and multi-buffering of memory


through the use of pragmas [16]. However, these automated options have limited applicability, and cannot replace intelligent code design by the programmer. That said,

not all software optimization techniques apply when the final goal is hardware. For

example, any cache-related optimizations such as loop-reordering are unlikely to see

performance upside in the final hardware. Therefore, the designer should also have

some high-level understanding of the hardware architecture that PICO produces in

order to get the best result.


Chapter 5

The Heat-Equation Application

This section presents an example application that is built to demonstrate the functionality of the AUTO flow. The application chosen is a heat-equation solver. The heat

equation is a partial differential equation that describes the temperature variation in

a given region over time, given the initial temperature distribution and boundary conditions. The thermal distribution is determined by the Laplace equation ∇²φ(x, y) = 0.

The solution to this equation can be found by the Jacobi iteration method [22],

which is a numerical method to solve a system of linear equations. This application

was chosen because a TMD system has been previously implemented for this method

[13], so a working MPI program already exists and can be used in the AUTO flow

directly.

The main objective of this project is not to produce a high-performance hardware

accelerator for the heat-equation application. Rather, the goal is to demonstrate the

feasibility of an automated flow. Therefore, it is expected that the resultant hardware may not deliver the best performance in comparison to hand-designed hardware

modules.

5.1 Implementation

Because floating point is not supported in PICO, in this implementation of the heat-

equation solver, the temperatures are represented as fixed-point numbers. This is


acceptable because the precision of the computation is not important for this project,

as the objective is not to build a high-performance heat-equation solver. We will

accept the hardware as long as it produces the same result as the software running

on a soft processor.
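Purely as an illustration (this code is not taken from the thesis), a single fixed-point Jacobi update over a node's 20x40 local section might look like the following; the 16.16 format, the array sizes, and the name jacobi_step are assumptions:

#include <stdint.h>

#define ROWS 20   /* local section height (one node's share of the 40x40 grid) */
#define COLS 40   /* local section width                                       */

/* Temperatures are assumed to be stored as 16.16 fixed-point values, small
   enough that the four-way sum below cannot overflow 32 bits. */
void jacobi_step(const int32_t cur[ROWS][COLS], int32_t next[ROWS][COLS])
{
    /* Update interior points only; boundary rows and columns hold the fixed
       boundary conditions or the rows exchanged with the neighbouring node. */
    for (int i = 1; i < ROWS - 1; i++) {
        for (int j = 1; j < COLS - 1; j++) {
            /* Average of the four neighbours; dividing by 4 is a shift,
               so no floating-point support is needed. */
            next[i][j] = (cur[i - 1][j] + cur[i + 1][j] +
                          cur[i][j - 1] + cur[i][j + 1]) >> 2;
        }
    }
}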

In the original TMD implementation of the heat-equation application, nine com-

puting nodes were used in the system. For simplicity, only two computing nodes

are used for the test application in this project. Rank 0 is a MicroBlaze and is the

root node. It generates the initial temperature map and sends the working data to

rank 1. Rank 1 is a computing node. The two nodes solve the heat equation together.

In the reference system, rank 1 is a MicroBlaze running the software implementation

of the Jacobi iterative solver. Rank 1 in the test system is the hardware generated

from the software version using the AUTO flow.

The original software implementation of the Jacobi iterative solver contains a sin-

gle program that is run on all soft-processor computing nodes. Depending on the

rank of the computing node, which is defined at compile time through a compiler

directive, sections of the program are selectively exercised. All non-root computing

nodes perform essentially the same operations. In addition to the computation loop

performed by all computing nodes, the root node is also responsible for data initial-

ization and finalization, as well as synchronization at the end of each Jacobi iteration.

The goal for the example application is to generate a hardware implementation for

a non-root node. The first step in the implementation is to extract the parts of the

program that pertain only to non-root nodes. The resultant program is a concise version of the code originally run on the non-root nodes. This program is then passed

through the three stages of the AUTO flow, as described in Section 4.4.

After obtaining a peripheral from the AUTO flow, the reference system and test

system are implemented using the design flow described in Section 3.2. The two sys-

tems are implemented on a Xilinx Virtex II Pro FPGA. The performance is measured,

and the results are documented in the next section.


5.2 Experiment Methodology

A few simple experiments are conducted to compare the performance of the Jacobi

hardware engine produced by the AUTO flow to that of the reference software imple-

mentation. The setups of the reference and test systems are shown in Figure 5.1. The two-node implementation is chosen for simplicity. It can easily be scaled up to include more computing nodes by duplicating the rank 1 MicroBlaze in the reference system or the Jacobi hardware engine in the test system. Since each node only communicates with two of its neighbours, the individual behaviour of each node is not affected as the system size increases.

Figure 5.1: A simple two-node TMD implementation of a Jacobi heat-equation solver: (a) reference system; (b) test system

Two different programs are run on the rank 0 MicroBlaze to conduct two tests.

The first test verifies the correctness of the test system against the reference system.

The original TMD C implementation of the heat-equation solver is used on rank 0.

Recall that the program on the reference system rank 1, which is used to generate

the hardware, is originally extracted from the C implementation. In this setup, the

root node collects the results from the computing element after the system converges

and prints out the results to the UART, which is then captured on a PC connected

to the FPGA board through a serial link. A problem size of 40x40 is used. Each

node hence operates on a section of 20x40 elements. The outputs from both systems are identical.

The second test measures the performance of the Jacobi hardware engine com-


pared against the reference system. In this case, rank 0 only performs data initializa-

tion, finalization and synchronization between the computing nodes. It does not run

the computation loop. In each iteration, rank 0 exchanges the boundary rows with

rank 1, and proceeds directly to the synchronization barrier and waits for rank 1 to

finish its computation loop. The number of cycles from the end of the row exchange

to the end of the iteration synchronization is measured for each iteration and accu-

mulated over the entire program. The cycles-per-element (CPE) is the total number

of cycles divided by the problem size (20 × 40 × N_iterations). Normally, the Jacobi

iteration algorithm stops when the results converge. In this test, in order to emulate

a different problem size (i.e. total number of elements to process), the program on

rank 0 is changed slightly so the total number of Jacobi iterations to run before stop-

ping the computation can be controlled. Using this mechanism, the CPE of the reference and test systems is measured for several cases with different numbers of Jacobi iterations executed in each. The results are documented in the next section.
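A hedged sketch of this rank 0 measurement loop is shown below. Standard MPI calls are used for illustration (TMD-MPI implements a subset of the same API), and read_cycle_counter() stands in for whatever cycle counter the MicroBlaze build actually uses; both, along with the buffer names, are assumptions rather than the thesis code.

#include <stdint.h>
#include <mpi.h>

#define COLS 40
#define LOCAL_ROWS 20

extern uint32_t read_cycle_counter(void);   /* hypothetical cycle counter */

/* Run a fixed number of Jacobi iterations' worth of row exchange and barrier
   on rank 0 and accumulate the cycles used to compute the CPE. */
uint64_t measure(int n_iterations, int32_t boundary[COLS], int32_t ghost[COLS])
{
    uint64_t total_cycles = 0;
    MPI_Status status;

    for (int it = 0; it < n_iterations; it++) {
        /* Exchange boundary rows with rank 1. */
        MPI_Send(boundary, COLS, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(ghost, COLS, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);

        /* Time from the end of the row exchange to the end of the iteration
           synchronization, while rank 1 runs its computation loop. */
        uint32_t start = read_cycle_counter();
        MPI_Barrier(MPI_COMM_WORLD);
        total_cycles += (uint32_t)(read_cycle_counter() - start);
    }

    /* CPE = total_cycles / (LOCAL_ROWS * COLS * n_iterations) */
    return total_cycles;
}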

5.3 Results

Both the test and reference systems are implemented using Xilinx XPS 9.1. The

reference system uses a 100MHz clock, the maximum clock frequency provided on

the Xilinx University Program Development Board [20]. The test system uses a

50MHz clock. This is because the highest speed that could be achieved during the

hardware generation in the PICO flow was 80MHz, and the Digital Clock Manager

(DCM) block in the embedded system does not allow fractional division. So the

only available clock frequencies below 80MHz are 50MHz and 66MHz, and 50MHz is

chosen for convenience. The designs are tested on a Xilinx Virtex-II Pro XC2VP30

FPGA and the performance is measured.

A fair metric for comparing the hardware Jacobi solver and the MicroBlaze solution is the number of elements computed per second per LUT. The area cost of the hardware Jacobi solver was obtained by synthesizing it alone using ISE 9.1. This includes a small overhead of about 9 LUTs introduced by the control block, which


is negligible for most designs. Since the hardware solver provides equivalent function-

ality to a complete MicroBlaze system, which includes a MicroBlaze soft processor,

2 Local Memory Busses (LMB), 2 LMB BRAM controllers, and 1 BRAM block, the

equivalent area cost is that of such a system. This is obtained by implementing a

single MicroBlaze system containing the above components using the EDK flow. The

total LUT count in each case is obtained from the Xilinx Mapping Report (.mrp).

Figure 5.2 shows the raw CPE measurements for the reference and test systems.

The high CPE observed in the reference system for short-running experiments that

perform a small number of iterations may be due to the initialization overhead in

the MicroBlaze that is not fully amortized. The performance of the two systems is compared using the asymptotic CPE when the number of iterations is large. The

normalized computing power in terms of number of elements processed per second

per LUT is shown in Table 5.1.

Figure 5.2: Main loop execution time per element with different iteration lengths

The hardware Jacobi solver is actually 22% slower than the soft processor im-

plementation when running at 50MHz. However, theoretically the hardware Jacobi

solver can operate at 80MHz. This case is shown in the third column of Table 5.1, which yields a 1.25x runtime improvement over the reference system. Clearly, such a modest speedup is not sufficient for a hardware acceleration system. However, we have achieved the objective of developing a working flow. Going forward, more effort will be spent optimizing the AUTO flow to deliver greater performance improvements.


Table 5.1: Normalized computing power of the reference and test systems

                              MicroBlaze    HW Jacobi    HW Jacobi (Speculated)
Clock Frequency               100 MHz       50 MHz       80 MHz
Asymptotic CPE                41            17           17
Total LUTs                    2771          4267         4267
N_elements per sec per LUT    880           689          1103
Speed-Up                      1x            0.78x        1.25x
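The normalized figures in the last two rows appear to follow from dividing the clock frequency by the product of the asymptotic CPE and the LUT count; for example, for the MicroBlaze column:

\text{elements/s/LUT} = \frac{f_{\mathrm{clk}}}{\mathrm{CPE} \times \mathrm{LUTs}}
                      = \frac{100 \times 10^{6}}{41 \times 2771} \approx 880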



Chapter 6

Conclusion and Future Work

In this project, we have presented an automated flow that generates hardware computing nodes directly from C for use in a TMD system. As a

result, soft processor nodes in a TMD system can be easily converted and replaced by

functionally equivalent hardware engines to achieve better performance. A working

hardware Jacobi heat equation solver is produced as an example to demonstrate the

feasibility of the AUTO flow. With little designer intervention, the hardware Jacobi

solver is generated automatically and shows some performance improvement over the

software implementation on a soft processor. The AUTO flow shows that FPGA-

based high-performance computing platforms such as the TMD can be made more

accessible to software developers who are unfamiliar with hardware design. We are

now one step closer to the ultimate goal of a completely automated flow that

converts a parallel program into a complete TMD system.

Nevertheless, more work still needs to be done to address the limitations of the

current AUTO flow, which include the lack of support for the true MPI API and

the requirement for running the iterative flow through PICO manually. Hence, the

next step is to complete these features and focus on robustness and performance

optimization. The AUTO flow will be used with a wider range of applications to test

its robustness. Further optimization of the flow will be done to give the generated hardware engines a greater performance advantage.


Appendix A

Hardware Controller for PICO

PPA

To integrate the hardware module generated by PICO into the TMD system, a

lightweight hardware control block is used. The control block has two responsibil-

ities. It generates the control signals to initialize and start the PPA module upon

system startup. This is done using a finite-state-machine (FSM). It also translates

the PICO stream interface to the FSL interface, which is done through combinational

logic. Figure A.1 shows the design of the control block. Table A.1 lists the I/O ports

exported by the control block, which include the clock, the reset, and four sets of FSL

ports for the four FSL bus interfaces. These ports are visible to the rest of the TMD

system.

A.1 Control FSM

A finite state machine (FSM) is used to generate the control signals for the PPA

module. Table A.2 lists the raw control ports of the PPA that need to be controlled

by the control block. The control block drives the input ports with the appropriate

values to initialize and start the PPA.

Upon system reset, the FSM receives the rank of the current hardware node

from the MPE, and stores it in an 8-bit word, which is connected to the rawdatain_self_rank_0 port on the PPA.


Figure A.1: Design of the control block

Table A.1: I/O ports exported by the control block

Signal                 Direction (I/O)   Bus Width   Bus Interface
clk                    I
rst                    I
To_mpe_cmd_data        O                 [31:0]      FSL (To_mpe_cmd)
To_mpe_cmd_ctrl        O
To_mpe_cmd_write       O
To_mpe_cmd_full        I
From_mpe_cmd_data      I                 [31:0]      FSL (From_mpe_cmd)
From_mpe_cmd_ctrl      I
From_mpe_cmd_exists    I
From_mpe_cmd_read      O
To_mpe_data_data       O                 [31:0]      FSL (To_mpe_data)
To_mpe_data_ctrl       O
To_mpe_data_write      O
To_mpe_data_full       I
From_mpe_data_data     I                 [31:0]      FSL (From_mpe_data)
From_mpe_data_ctrl     I
From_mpe_data_exists   I
From_mpe_data_read     O


Table A.2: Raw control ports on the PPA module

Signal                  Direction (I/O)   Bus Width
clk                     I
reset                   I
enable                  I
start_task_init         I
start_task_final        I
clear_init_done         I
clear_task_done         I
psw_task_done           O
rawdatain_self_rank_0   I                 [0:7]

This port is generated when the original C function declares and uses the global variable self_rank. The PPA is enabled after the rank is obtained. The required sequence for each control signal is described in Appendix C.3 of [15]. Figure A.2 shows the state transition diagram of the FSM.

When the FSM reaches the PPA_READY state, the translation between the PICO stream interface and the FSL interface is enabled. The FSM remains idle in this state until psw_task_done is raised high by the PPA, at which point the PPA is restarted

to wait for the next task.

A.2 Stream Interface Translation

The second task of the control block is to translate the PICO stream interface to the

FSL interface. A detailed description of the PICO stream interface can be found in

Appendix C.5 of [15]. Information on the FSL interface can be found in [19]. Figure

A.3 shows a simple illustration of the PICO input and output stream interface. In

both cases, the req signal is raised high by the PPA when it wants to interact with the

stream. The ready signal is an input to the PPA. A high ready signal indicates that the

other end of the stream is ready to receive or send data. The FSL interface is shown

in Figure A.4. To read from an FSL, the data consumer raises FSL_S_Read. If there is data in the FSL, the FSL_S_Exists signal will also be high. In this case, the data word and control bit can be read through FSL_S_Data and FSL_S_Control in the next clock cycle.


Figure A.2: State transition diagram of the control block FSM


Figure A.3: The PICO stream interface: (a) input stream; (b) output stream

Figure A.4: The FSL bus interface: (a) read from an FSL; (b) write to an FSL

Similarly, to write to an FSL, the data producer raises FSL_M_Write and puts the data word and control bit on FSL_M_Data and FSL_M_Control respectively. In the first clock cycle after FSL_M_Full becomes low, the data is pushed into the

FIFO.

Comparing the operations between the PICO stream interface and the FSL bus

interface, it is not difficult to see the functional correspondence between instream_req and FSL_S_Read, instream_ready and FSL_S_Exists, outstream_req and FSL_M_Write, and outstream_ready and the complement of FSL_M_Full. Since the PICO stream interface

does not contain a control bit, a 33-bit data bus is used, where the highest bit is

interpreted as the control bit in the FSL interface, and the lower 32 bits as the

data word. The translation between the PICO stream interface and the FSL bus

interface is summarized in the Verilog excerpt shown in Program 1. This translation

relationship is used for all four PICO streams (cmd_in, cmd_out, data_in, data_out)

and the corresponding FSL interfaces.


Program 1 Code excerpt to illustrate the stream interface translations

/* For outbound FSL */
FSL_M_Data  = outstream_data[31:0];
FSL_M_Ctrl  = outstream_data[32];
FSL_M_Write = outstream_req & ~FSL_M_Full;
outstream_ready = ~FSL_M_Full;

/* For inbound FSL */
instream_data  = {FSL_S_Control, FSL_S_Data};
FSL_S_Read     = instream_req & FSL_S_Exists;
instream_ready = FSL_S_Exists;


Appendix B

Using PICO: Tips and

Workarounds

This chapter describes some of the PICO-related workarounds and performance-

improving tips that were discovered during the development of the AUTO flow. Refer

to [16] for details on the PICO options mentioned in this chapter.

B.1 Stream Ordering

As described in Section 3.1, MPI operations are realized in hardware as a series of

stream operations. The order of these stream operations is defined by the MPE protocol, and hence must be enforced for correct interaction with the MPE block.

However, stream ordering is tricky in PICO. Because PICO always tries to produce

a schedule that is as compact as possible, it will schedule two operations in the same

phase as long as it does not detect a data dependency between the two operations. As

there is no direct data dependency between the streams, it is hard to make PICO

understand the sequential ordering constraint of the stream operations as imposed

by the semantics of the MPE protocol. A workaround that allows stream operations

to be ordered is to wrap each stream operation in a function and to generate the function as a Tightly Coupled Accelerator Block (TCAB). Since each TCAB is generated as an independent submodule, dummy data dependencies can be introduced between the


TCABs to force a sequential schedule. Program 2 shows an excerpt of the code for

the MPI Recv function that illustrates the techniques used to establish ordering in

the stream operations.

In the code excerpt shown in Program 2, the write operation on the output com-

mand stream is encapsulated in a TCAB called recv_cmd_out. The return value of

the TCAB is used in conditional statements to force the instructions enclosed in the

conditional statement to be scheduled at least one time unit after the execution of

recv_cmd_out. Program 2 also illustrates the technique of loop-sinking, where sequential instructions before a loop are sunk into the loop to provide hints about the ordering of the operations. Writing the code this way tells PICO clearly that a single instruction is to be executed every iteration, so an instruction in a later iteration

will not be scheduled before one in an earlier iteration.

B.2 Improving Performance

With PICO the performance of the generated hardware can sometimes be improved

by turning on certain options manually. This section describes a few of them. More

details regarding the usage of these options can be found in [16].

Memory Type

There are three types of memories supported by PICO: internal fast register-based

memory, block RAM and user-supplied external memory, such as DDR. The memory

type can be specified using the following command in the C source file after the array

declaration:

#pragma (internal_fast | internal_blockram | user_supplied_fpga) <mem_name>

When the memory type is not explicitly specified, PICO chooses the type based

on the size of the array. However, some automatic optimizations are only available

for the register-based memories. Also, block RAMs can only have a maximum of 2

read/write ports, which may limit the amount of parallelism that can be exploited.
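For example, a 20x40 working array could be forced into block RAM as shown below; the array name is chosen purely for illustration, and only the pragma form above comes from the PICO documentation:

int temperature[20][40];
#pragma internal_blockram temperature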


Program 2 Code excerpt to illustrate stream ordering

unsigned char recv_cmd_out(unsigned long long cmd_out)
{
#pragma bitsize recv_cmd_out 1
#pragma bitsize cmd_out 33
    pico_stream_output_mpe_cmd_out(cmd_out);
    return 1;
}

int MPI_Recv(int buffer[], int count, MPI_Datatype datatype,
             unsigned int source, unsigned int tag, MPI_Comm comm,
             unsigned int *in_tag_ptr, unsigned int *in_src_ptr)
{
    unsigned char done = 0;
#pragma bitsize done 1
    unsigned int i;

    /* The sequential setup code is sunk into the loop (hand perfectization)
       so that exactly one stream operation is issued per iteration. */
#pragma num_iterations (3, , )
    for (i = 0; i < count + 3; i++) {
        if (i == 0) {
            /* Send the receive command header to the MPE. */
            unsigned long long cmd_out = MPE_RECV_OPCODE | (source << 22) |
                                         count | FSL_CTRL_MASK;
#pragma bitsize cmd_out 33
            recv_cmd_out(cmd_out);
        }
        else if (i == 1) {
            /* The TCAB's return value creates a dummy dependency that forces
               the following reads to be scheduled after the command write. */
            done = recv_cmd_out(tag & FSL_DATA_MASK);
            if (done) {
                *(in_src_ptr) = (pico_stream_input_mpe_cmd_in() &
                                 NET_SOURCE_MASK) >> 24;
            }
        }
        else if (i == 2) {
            *(in_tag_ptr) = pico_stream_input_mpe_cmd_in() & FSL_DATA_MASK;
        }
        else {
            /* Standard receive of data from the stream. */
            if (done) {
                buffer[i - 3] = pico_stream_input_mpe_data_in() & FSL_DATA_MASK;
            }
        }
    }
    return MPI_SUCCESS;
}


Loop Trip Counts

Specifying the trip counts for loops saves area and improves speed. This is because

PICO can optimize better when it knows the expected iterations of a loop. The loop

trip count can be specified using the following pragma immediately before the start

of the loop:

#pragma num_iterations (<min>, <expected>, <max>)
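For instance, a loop known to always execute 20 times (the values are illustrative) could be annotated as follows:

unsigned int i, acc = 0;
#pragma num_iterations (20, 20, 20)
for (i = 0; i < 20; i++)
    acc += i;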

Hand Perfectization

Perfectization in the PICO lingo is the process of transforming the original source code

into a sequence of loop nests, without any sequential code between the loops. This is

done automatically by PICO during the scheduling phase. In certain cases, the user

can do a better job perfectizing the code than PICO could. For example, in Program

2, hand perfectization is used to sink the initial sequential setup code into the loop.

The resultant hardware can start a new iteration at each cycle since it is clear that

only one instruction is executed per iteration. However, without hand perfectization,

PICO could choose to sink all initial sequential instructions to iteration 1 and as a

result, the scheduler will not be able to schedule an iteration per clock cycle.

Memory Porting Arbitration

Sometimes PICO will produce a schedule that results in the need for a memory with

more than 2 ports. Such memories are not synthesizable in FPGAs. A remedy for

this is to specify the following pragma on the array in question and re-synthesize the

design:

#pragma forward_boundary_register <array_name> always

This results in additional logic that can arbitrate the accesses to the memory to

reduce the actual port requirement. However, it does not always work. Sometimes

arbitration could result in mismatches against the golden results in the post-synthesis C simulation or the final RTL simulation, indicating erroneous hardware. Therefore, this option should be used with caution, and both simulations are strongly recommended when it is enabled.


Bibliography

[1] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Han-

rahan. Brook for GPUs: stream computing on graphics hardware. ACM Trans-

actions on Graphics (TOG), 23(3):777–786, 2004.

[2] C. Cathey, J. Bakos, and D. Buell. A Reconfigurable Distributed Computing

Fabric Exploiting Multilevel Parallelism. Proceedings of the 14th Annual IEEE

Symposium on Field-Programmable Custom Computing Machines (FCCM06).

[3] Celoxica, low-latency and accelerated computing solutions for capital markets,

Curr. April 2008. http://www.celoxica.com.

[4] C. Chang, J. Wawrzynek, and R. Brodersen. BEE2: a high-end reconfigurable

computing system. Design & Test of Computers, IEEE, 22(2):114–125, 2005.

[5] Cray XD1 Supercomputer for Reconfigurable Computing, Technical report, Cray,

Inc. 2005. http://www.cray.com/downloads/FPGADatasheet.pdf.

[6] W. Gropp and E. Lusk. User’s Guide for MPICH, a Portable Implementation of

MPI. Argonne National Laboratory, 1994.

[7] Impulse Accelerated Technologies. Software Tools for an Accelerated World,

Curr. April 2008. http://www.impulsec.com.

[8] InfiniBand Trade Association. The InfiniBand Architecture Specification R1.2,

Technical Report, October 2004. http://www.infinibandta.org.

[9] Mentor Graphics. The EDA Technology Leader, Curr. April 2008.

http://www.mentor.com.


[10] The Message Passing Interface (MPI) Standard, Curr. April 2008. http://www-

unix.mcs.anl.gov/mpi/.

[11] A. Patel, C. Madill, M. Saldana, C. Comis, R. Pomes, and P. Chow. A Scalable

FPGA-based Multiprocessor. Proceedings of the 13th Annual IEEE Symposium

on Field-Programmable Custom Computing Machines, 71:72, 2006.

[12] M. Saldana and P. Chow. TMD-MPI: An MPI implementation for multiple

processors across multiple FPGAs. IEEE 16th International Conference on Field

Programmable Logic and Applications, 2006.

[13] M. Saldana, D. Nunes, E. Ramalho, and P. Chow. Configuration and Program-

ming of Heterogeneous Multiprocessors on a Multi-FPGA System Using TMD-

MPI. In the Proceedings of the 3rd International Conference on Reconfigurable

Computing and FPGAs, September 2006.

[14] L. Shannon and P. Chow. Maximizing system performance: using reconfigurabil-

ity to monitor system communications. Field-Programmable Technology, 2004.

Proceedings. 2004 IEEE International Conference on, pages 231–238, 2004.

[15] Synfora, Inc. PICO Express FPGA - PICO RTL: Synthesis, Verification and

Integration Guide, 8.01 edition, 2008.

[16] Synfora, Inc. PICO Express FPGA - Writing C Applications: Developer’s Guide,

8.01 edition, 2008.

[17] Synfora, Inc., Curr. April 2008. http://www.synfora.com.

[18] Top 500 Supercomputer Sites, November 2007. http://www.top500.org.

[19] Xilinx, Inc. Fast Simplex Link (FSL) Bus (v2.00a) Product Specification, DS449

edition, December 2005.

[20] Xilinx, Inc. Xilinx University Program Virtex-II Pro Development System Hard-

ware Reference Manual, ug069 (v1.0) edition, March 2005.


[21] Xilinx, Inc., Curr. April 2008. http://www.xilinx.com.

[22] J. Zhu. Solving Partial Differential Equations on Parallel Computers. World

Scientific, 1994.
