
Virtualization of Heterogeneous Machines
Hardware Description in a Synthesizable Object-Oriented Language

Joshua Auerbach, David F. Bacon, Perry Cheng, Rodric Rabbah, Sunil Shukla
IBM Research

{josh,dfb,perry,rabbah,skshukla}@us.ibm.com

Abstract

Lime is a new Java-compatible and object-oriented language designed to make programming of reconfigurable hardware significantly more accessible to skilled software developers. Lime programs may run either in software (via Java bytecodes) or in hardware (via behavioral and logic synthesis). This paper illustrates the salient synthesis-oriented features of the language using a photo-mosaic algorithm with inherent bit, pipeline, and data parallelism. The result is a virtual machine abstraction that extends across a heterogeneous architecture comprising a CPU, FPGA, and other computational structures.

Categories and Subject Descriptors B.6.3 [Design Aids]: Hardware Description Languages; D.3.3 [Programming Languages]: Language Constructs and Features; D.1.3 [Programming Techniques]: Concurrent Programming, D.1.5 Object-oriented Programming

General Terms Design, Languages

Keywords object oriented, value type, streaming, functional programming, reconfigurable architecture, FPGA, high level synthesis

1. Introduction

For the past few years, the microprocessor industry has shifted toward multicore architectures to overcome the prohibitive physical realities associated with clock frequency scaling. This shift has also led to a number of hybrid computing architectures which promise to deliver performance through architecture specialization. For example, there is a proliferation of graphics processing units (GPUs) that are now tightly coupled to general purpose processors. As another example, there are a number of vendors that offer computing appliances that are specialized FPGA-based implementations.

Spurred by the ubiquity of GPUs, there are now several widely used programming standards that enable general purpose computing on graphics processors. For example, OpenCL [5] is supported by both AMD and NVIDIA for many of their GPUs. FPGAs on the other hand, despite their promise of high performance and low power, have largely remained a niche market with much of the innovation taking place at small companies, start-ups, and in academia. While the general practice of programming FPGAs involves a hardware description language (HDL) such as Verilog or VHDL, there are a number of C-to-gates compilers [3] that can compile kernels written in C into HDL. In addition, new languages such as Bluespec [7] offer the intriguing possibility of new programming paradigms for FPGAs. Despite these efforts, however, FPGAs remain largely beyond the reach of skilled software programmers. This is due in part to the fact that writing efficient software is very different from writing efficient hardware descriptions. For FPGA programmers, the algorithm becomes the architecture, whereas software programmers are afforded many abstractions and programming luxuries that create a very wide semantic gap between software and hardware.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DAC’11, June 5-10, 2011, San Diego, California, USA
Copyright © 2011 ACM 978-1-4503-0636-2/11/06...$10.00

Lime is a new language [1] developed at IBM Research to address these challenges. It is a Java-compatible, object-oriented, and synthesizable language with functional-programming features that facilitate the expression of bit, data, and pipeline parallelism. The Lime compiler can generate Java bytecode so that the program can run on any platform that supports a Java virtual machine. In addition, the compiler may perform behavioral synthesis and generate hardware designs for FPGAs. Backends are also available for GPU and manycore (IBM PowerEN) systems.

Lime provides a small number of additional abstractions which are designed to virtualize the computational structures of diverse underlying compute substrates. These include data-parallel operators and streaming computation. Both of these paradigms provide a very high-level abstraction of data movement, so that the programmer does not explicitly code low-level details of transfer via FIFO, message, DMA, or other interfaces.

Streaming and data-parallelism are also recursively compositional and apply down to the level of single bits. This allows convenient and natural expression of multi-level parallelism.

In this paper, we provide a walk-through of a Lime program, and describe how the features of the language facilitate FPGA synthesis. A full description and specification of the language and its design can be found in [1, 2]. We have implemented a number of kernels and applications in Lime. They include digital signal processing and encryption algorithms, an N-body simulator, a network intrusion detector [8], and a photo-mosaic application which is the subject of this paper. We have implemented a native implementation of the photo-mosaic algorithm in Verilog, to use as a reference against which we benchmark the efficacy of the Lime compiler.

[Figure 1 image: a reference image, a tile library, and the resulting image mosaic.]

Figure 1. Photo-mosaic out of flowers of Albert Einstein.

1 for each reference tile rt
2   for each library tile lt
3     score = 0
4     for each pixel i in tile row
5       for each pixel j in tile column
6         score += distance(rt[i][j], lt[i][j])
7     score = score / 64
8     if score is best for rt then
9       replace rt with lt

Figure 2. Mosaic pseudo-code.

2. Photo-Mosaic in Lime

The photo-mosaic application is illustrated in Figure 1: a reference image is partitioned into a set of non-overlapping tiles, and each tile from the reference image is replaced with a tile from an image library. A tile is, in our example, an 8×8 square partition comprising 64 pixels, with each pixel described using the RGB color space. The mosaic algorithm compares each reference tile to the images in the tile library to find the best match using a scoring function. The end result is a mosaic that resembles the reference image but is constituted entirely from library images.

The scoring and tile selection algorithm can be described as shown in Figure 2, where the distance is a Euclidean metric comparing the RGB values in the reference tile to those in the library tile.
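The pseudo-code of Figure 2 can be modeled in software along the following lines. This is a minimal Python sketch and not the authors' implementation: tiles are assumed to be 8×8 nested lists of (r, g, b) tuples, the helper names `best_match` and `solid_tile` are hypothetical, and the "best" score is assumed to be the lowest average distance.

```python
import math

TILE_SIZE = 8  # tiles are 8x8 pixels, as in the text


def distance(rp, lp):
    """Euclidean distance between two RGB pixels given as (r, g, b) tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(rp, lp)))


def score(rt, lt):
    """Average pixel distance between a reference tile and a library tile."""
    total = 0.0
    for i in range(TILE_SIZE):
        for j in range(TILE_SIZE):
            total += distance(rt[i][j], lt[i][j])
    return total / (TILE_SIZE * TILE_SIZE)


def best_match(rt, library):
    """Index of the library tile with the lowest score (assumed to mean 'best')."""
    return min(range(len(library)), key=lambda k: score(rt, library[k]))


def solid_tile(rgb):
    """Hypothetical helper for the example: a tile filled with one color."""
    return [[rgb] * TILE_SIZE for _ in range(TILE_SIZE)]


library = [solid_tile((0, 0, 0)), solid_tile((200, 30, 30)), solid_tile((250, 250, 250))]
reference = solid_tile((190, 40, 40))
chosen = best_match(reference, library)  # index of the reddish library tile
```

The nested loops in `score` correspond to lines 3-7 of Figure 2, and `best_match` to lines 8-9 for a single reference tile.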

We consider three simple implementations of the mosaic algorithm. The first is a simple pipelined design (Section 2.1). The second (Section 2.2) improves on this design using point-to-point messaging to reduce bandwidth requirements. The third (Section 2.3) introduces coarse-grained data parallelism as a way of increasing overall throughput.

2.1 Describing Top Level Structure

The first implementation of the mosaic algorithm that we consider is a pipelined design consisting of three modules: a source that produces a stream of tile pairs, one each for the reference and library tiles (implementing lines 1-2 of Figure 2); a scoring module that reads the tiles and computes the distance between tiles (lines 3-7); and a sink that performs the tile selection and replacement (lines 8-9).

Realizing this implementation in an FPGA poses some challenges, however. The size of each tile is large enough that it cannot be exchanged over wires in a single transaction. Instead, the pixels in a tile have to be communicated between the source and scoring modules at finer granularity. It is also possible that the source generates a stream of indices for the reference and library tiles and the scoring module reads the pixels from memory. In Verilog HDL, the top level module corresponding to this design may be written as follows:

module top;
  // 32-bit buses for the reference tile data, library tile data, and score
  wire [1:32] rt, lt, score;

  source source(clk, reset, rt, lt);
  scorer scorer(clk, reset, rt, lt, score);
  mosaic mosaic(clk, reset, rt, lt, score);
endmodule

A Lime programmer may describe the same implementation as shown in Figure 3. The type of the variables on lines 2-4 is Task, a Lime type that represents an autonomous dataflow actor¹. Lime performs local type inference as a convenience to the programmer, and so the explicit types are eschewed and replaced by the keyword var.

1 public static void main(...) {
2   var source = task Source(...).generate;
3   var scorer = task Tile.score;
4   var mosaic = task Mosaic(...).tileSelect;
5   var top = source => scorer => mosaic;
6   top.finish();
7 }

Figure 3. Lime version of the top level mosaic module.

In Lime, the task operator creates an instance object of the specified type and binds the method specified after the dot as the actor method. For example, the mosaic task is an instance of the class Mosaic and its actor method is tileSelect, which is defined by the class. The scorer task (line 3 of Figure 3), unlike the source and mosaic tasks, is created from a static method defined by the Tile class.

The Lime connect operator (=>) connects the actors with FIFOs so that the output of one task is streamed and becomes the input to the next task (line 5). Each actor operates autonomously, writing data into its output FIFO and reading data from its input FIFO. The constructed task graph top is executed using the finish method. This starts the task graph and runs it until terminated, which is usually indicated by throwing a StreamUnderflow exception. The finish method blocks until the task graph terminates.
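The behavior of a connected task graph can be approximated in software with threads and bounded queues. The following Python sketch is an illustration only and does not reflect how the Lime runtime or compiler is implemented: `run_stage`, `run_pipeline`, and the `_END` sentinel are hypothetical names, and the sentinel stands in for the termination condition that Lime signals with a StreamUnderflow exception.

```python
import queue
import threading

_END = object()  # sentinel marking end of stream (assumed termination signal)


def run_stage(fn, inbox, outbox):
    """Actor loop: read from the input FIFO, apply fn, write to the output FIFO."""
    while True:
        item = inbox.get()
        if item is _END:
            outbox.put(_END)  # propagate termination downstream
            return
        outbox.put(fn(item))


def run_pipeline(source_items, stages):
    """Connect stages with FIFOs, run each in its own thread, block until done."""
    fifos = [queue.Queue(maxsize=4) for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(fn, fifos[k], fifos[k + 1]))
        for k, fn in enumerate(stages)
    ]
    results = []

    def drain():
        # Sink: collect everything that leaves the last FIFO.
        while True:
            item = fifos[-1].get()
            if item is _END:
                return
            results.append(item)

    threads.append(threading.Thread(target=drain))
    for t in threads:
        t.start()
    for item in source_items:  # the source writes into the first FIFO
        fifos[0].put(item)
    fifos[0].put(_END)
    for t in threads:
        t.join()  # analogous to finish(): block until the graph terminates
    return results


# Toy stand-ins for the scorer and mosaic stages.
pairs = [(3, 10), (7, 2), (5, 5)]
out = run_pipeline(pairs, [lambda p: abs(p[0] - p[1]), lambda s: s * 2])
```

Each stage only sees its own input and output queue, which mirrors the autonomy of the actors: no stage calls another directly.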

Lime tasks may be connected as long as the types are compatible. Namely, for two connected tasks A => B, the return type of the actor method in A has to match the type of the arguments for the actor method in B. This property is enforced by the compiler. In contrast, the primary types in hardware description languages are usually aggregates of bits, and programmers resort to strict coding conventions to avoid semantically erroneous module assembly. The availability of rich compound types (e.g., tuples, classes and multi-dimensional arrays) and strong typing in Lime allows task graphs to be safely constructed. The richer types in the language afford a Lime programmer considerable expressive power that does not overly burden behavioral synthesis.

The Source class is shown in Figure 4. The generate method returns a tuple type consisting of two tiles (lines 10-14). Tuples allow actor methods to return multiple arguments in a single structure that is transmitted between tasks. The tuple is streamed from the source to the score method in the Tile class.

The Lime code for the Tile class is shown in Figure 6. It shows that a Tile is essentially an 8 × 8 bounded value array of RGB pixels (line 4). An RGB pixel is itself a tuple consisting of three 8-bit data values. In Lime, bit is an enumerated type (with the values zero and one) and arrays may be accessed using ranges as in Verilog and VHDL.

2.1.1 Sequential tasks

Lime tasks may be stateful or stateless. The former correspond to sequential logic and the latter are equivalent to combinatorial logic. The source task is an example of a stateful task: every invocation of the generate method depends on the previous invocation since the method updates its refCount and libCount fields.

Lime provides strong isolation guarantees for tasks, comparable to module isolation in HDLs. In Lime, this is achieved in part by designating methods as local (Figure 4 line 10). As such, the method asserts that it does not access any global data, or call any non-local methods. Localness is a statically checkable property and provides strong isolation guarantees that the compiler may rely on for optimization. Lime permits tasks that are not isolated (and can perform I/O) as long as they are sources and sinks in the task graphs. These are comparable to peripheral I/O modules in hardware.

 1 public class Source {
 2   private int refCount = 0;
 3   private int libCount = 0;
 4   final Tile[[]] image;
 5   final Tile[[]] library;

 6   public local Source(Tile[[]] image, Tile[[]] library) {
 7     this.image = image;
 8     this.library = library;
 9   }

10   public local `(Tile, Tile) generate() {
11     var pair = `(image[refCount], library[libCount]);
12     ... // update refCount and libCount
13     return pair;
14   }
15 }

Figure 4. Lime code for Source class.

¹ User tasks in Lime operate using a synchronous dataflow model [6].

The Source class contains the reference image and tile library. They are represented as arrays that imply memory storage requirements. The class constructor accepts the image and tile library as value arrays of Tiles. A value array is indicated by the double brackets and is deeply immutable, and thus assignments to any of its elements are illegal. Knowing that a data-type is immutable affords the compiler flexibility when synthesizing Lime code into hardware: the arrays may be safely allocated to the FPGA block RAM for fast access instead of off-chip memory where access to data tends to be prohibitively slow for FPGA computation.

The constructor code in Lime generally corresponds to code that is executed on a reset signal in HDL as illustrated below using Verilog:

always @(posedge clk) begin
  if (initialize) begin
    // load image and library tiles
    // to local memory
  end
end

The compiler infers the reset logic automatically. Further, it analyzes the fields used in the class, and infers latches and flip-flops for the stateful tasks as needed.

2.1.2 Combinatorial tasks

Lime tasks created from pure functions are stateless and may be realized as combinatorial logic. The scorer task is a stateless task. Its actor method is shown on lines 5-10 in Figure 6.

The scoring method uses a loop with a statically known iteration count (i.e., eight rows as defined on line 2) and performs a map operation, applying the distance method to each element in rt[i] and lt[i] using the Lime collective operator (@). The result of the collective operation is an array of pixel scores for the corresponding row, which is in turn reduced to a scalar value using the reduction operator (!); here it performs a binary addition (+!) over the elements of the array. These operators, when combined with the valueness and boundedness properties, allow the compiler to easily explore a range of implementations for the scoring function. For example, the loop (lines 7-8) may be unrolled to expose instruction-level parallelism; the collective operation (line 8) expresses fine-grained data-level parallelism; the reduction operation (line 8) allows for critical path reduction of the scoring circuit. These optimizations are illustrated in Figure 5 and the implementation space may be readily explored by the Lime behavioral synthesis compiler, thus freeing the programmer from the tedious burden of manually tuning and pipelining the equivalent HDL.

[Figure 5 diagram: for each Row i, the RGB pixels of reference tile row rt[i] (from rt[0]..rt[7]) and library tile row lt[i] (from lt[0]..lt[7]) feed distance(@ rt[i], lt[i]); a +! binary reduction tree of adders produces score.]

Figure 5. Collection and reduction operators permit datapath widening and critical path reduction of circuit.

 1 public value class Tile {
 2   typedef Row = enum<8>;
 3   typedef RGB = `(bit[[8]], bit[[8]], bit[[8]]);
 4   RGB[[8][8]] pixels;

 5   static local int score(Tile rt, Tile lt) {
 6     int score = 0;
 7     for (Row i : rt)
 8       score += +! distance(@ rt[i], lt[i]);
 9     return score / 64;
10   }

11   static local int distance(RGB rp, RGB lp) {
12     int score = 0;
13     ... // compute Euclidean score
14     return score;
15   }
16 }

Figure 6. Lime code for Tile class.
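The map and reduction on line 8 of Figure 6 can be illustrated with a small software model. The Python sketch below is an assumption-laden illustration, not compiler output: `tree_reduce` and `row_score` are hypothetical names, the squared RGB distance is a stand-in for the elided Euclidean score, and the pairwise reduction is one possible realization of the adder tree suggested by Figure 5.

```python
def distance(rp, lp):
    """Squared RGB distance between two pixels (an assumed stand-in metric)."""
    return sum((a - b) ** 2 for a, b in zip(rp, lp))


def tree_reduce(op, values):
    """Reduce a non-empty sequence pairwise; returns (result, number of levels).

    The number of levels is ceil(log2(n)), versus n - 1 dependent steps
    for a purely sequential reduction.
    """
    values = list(values)
    depth = 0
    while len(values) > 1:
        paired = [op(values[k], values[k + 1]) for k in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])  # odd element passes through to the next level
        values = paired
        depth += 1
    return values[0], depth


def row_score(rt_row, lt_row):
    """Model of `+! distance(@ rt[i], lt[i])` for a single row."""
    mapped = [distance(rp, lp) for rp, lp in zip(rt_row, lt_row)]  # the @ map
    return tree_reduce(lambda a, b: a + b, mapped)  # the +! reduction


rt_row = [(k, k, k) for k in range(8)]
lt_row = [(0, 0, 0)] * 8
total, depth = row_score(rt_row, lt_row)  # eight pixels reduce in three levels
```

Because the eight `distance` calls are independent, they could all be evaluated at once (datapath widening), and the three-level tree shortens the chain of dependent additions (critical path reduction).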


2.2 Out-of-band communication between tasks

The mosaic implementation described earlier transmits a pair of tiles between the source and scoring tasks. While convenient, this is inefficient since there is redundancy between successive pairs: the same reference tile appears in a sequence of pairs so that it can be compared to different library tiles. An improvement to the implementation can therefore cache the reference tile in the scoring task so that it is only transmitted once. In Verilog, the module may include the following logic:

always @(posedge clk) begin
  if (newTile) begin
    // cache new image tile
  end else begin
    // compute score
  end
end

Communication between modules that is considered infrequent is expressed in Lime using messaging. An example is shown in Figure 7. A task created from a class that implements a message interface establishes an incoming input signal for receiving the message. The message method is invoked whenever a message is received by the task, before the task executes its actor method.

The top level task graph for the new implementation of the scoring algorithm can now be written by replacing line 3 in Figure 3 with var scorer = task CachingScorer().score. Messaging does not disturb the task graph construction and is minimally invasive for the programmer.

The task can receive messages from other tasks that send messages by invoking the newTile method declared by the corresponding message interface (Figure 7 line 13). A method explicitly indicates that it sends messages to inform the compiler of the necessary routing between tasks. To this end, the generate method shown in Figure 4 may be modified as follows:

public local Tile generate() sends UpdateTile {
  if (...) {
    newTile(image[refCount++]);
    libCount = 0;
  }
  var libTile = library[libCount++];
  return libTile;
}

Messaging in general allows point-to-point communication between tasks regardless of how the top-level task graph is assembled. The only requirement is that there exists a path in the graph between the senders and receivers.

 1 public class CachingScorer
 2     extends Tile
 3     implements UpdateTile {
 4   Tile rt;

 5   public local void newTile(Tile rt) {
 6     this.rt = rt;
 7   }

 8   public local int score(Tile lt) {
 9     // compute score for rt and lt
10   }
11 }

12 public message interface UpdateTile {
13   public void newTile(Tile newTile);
14 }

Figure 7. Example of messaging in Lime.

[Figure 8 diagram: the source streams library tiles to a splitter, which distributes them round-robin to N Tile Scorer tasks (1, 2, 3, ..., N); reference tiles are sent to the scorers via messages; a joiner gathers the tile scores round-robin and delivers N tile scores for each reference tile to mosaic.]

Figure 8. Alternative implementation of scoring algorithm using coarse-grained task parallelism.
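The effect of caching the reference tile through a message can be sketched in software as follows. This Python model is illustrative only: the method names mirror Figure 7, but tiles are simplified to flat lists of intensities, the absolute-difference score is a made-up placeholder, and the `messages_received` counter is a hypothetical addition used to show how rarely the reference tile is sent.

```python
class CachingScorer:
    """Scorer that caches the reference tile delivered by a newTile-style message."""

    def __init__(self):
        self.rt = None
        self.messages_received = 0

    def new_tile(self, rt):
        # Message handler: runs before the next actor invocation.
        self.rt = rt
        self.messages_received += 1

    def score(self, lt):
        # Actor method: only the library tile arrives on the stream.
        return sum(abs(a - b) for a, b in zip(self.rt, lt)) / len(lt)


def generate(image, library, scorer):
    """Yield library tiles; send each reference tile once, out of band."""
    for rt in image:
        scorer.new_tile(rt)  # one message per reference tile
        for lt in library:
            yield lt  # the streamed payload is a single tile, not a pair


image = [[10, 10, 10, 10], [0, 0, 0, 0]]  # two flattened toy tiles
library = [[0, 0, 0, 0], [8, 8, 8, 8], [12, 12, 12, 12]]
scorer = CachingScorer()
scores = [scorer.score(lt) for lt in generate(image, library, scorer)]
```

In this toy run the stream carries six library tiles while only two messages are sent, instead of six (reference, library) pairs as in the first design.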

2.3 Coarse Grained Task- and Data-Parallelism

The pipelined implementations described so far do not exploit the potential data parallelism between scoring modules. Since each scoring module computes a score for a given pair of reference and library tiles, it is possible and desirable to instantiate multiple modules to operate in parallel. Each scoring module can compute the score for a given reference tile and a distinct subset of the tile library. This implementation is illustrated in Figure 8 and in Verilog would be described using the generate statement.

Lime provides a parallel task composition operator task [...] (line 9 in Figure 9) to facilitate the expression of coarse grained task and data parallelism. The operator produces a multi-input to multi-output compound task that is used to construct a split-join structure (lines 8-10). A split-join consists of a splitter that transposes an input stream into a tuple of streams which are connected in order to the streams in the compound task, and a joiner which transposes the output streams from the compound task into a single stream.

A splitter task is created using the task split operator (line 1). The splitter transposes a stream of tiles into N streams of tiles. Similarly, the joiner (line 5) is created using the task join operator. It aggregates scores from each of the N scoring tasks (lines 2-4) into a single stream of scores. Since the incoming and outgoing types of the split-join do not exactly match the types of the source and mosaic tasks (see Figure 3), a match operator (#) is used to aggregate or disaggregate streams. The matcher on line 7 aggregates N tiles into a single array which can be split into N individual streams of tiles. The matcher on line 11 disaggregates the result of the joiner (int[[N]]) into a stream of individual scores (of type int). Now the types match and the top level task graph can be assembled as shown on lines 6-12. The matcher provides a convenient mechanism for describing buffering logic.

The width of the split-join (N) is statically parametrized and may be easily tuned by the programmer to scale the throughput of the scoring modules as desired. The splitters and joiners reduce the programmer effort in creating wires and routing logic between modules, and ensure that Lime task graphs remain structured and compose hierarchically.


 1 var splitter = task split Tile[[N]];
 2 var scorers = new Task[N];
 3 for (N i)
 4   scorers[i] = task CachingScorer().score;
 5 var joiner = task join int[[N]];

 6 var top = source
 7   => #
 8   => splitter
 9   => task [ scorers ]
10   => joiner
11   => #
12   => mosaic;
13 ...

Figure 9. Example of coarse-grained parallelism in Lime.
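The split-join structure of Figure 9 can be mimicked in software as a round-robin distribution followed by an order-preserving gather. The Python sketch below is only a structural model under stated assumptions: `split`, `join`, and `parallel_scores` are hypothetical names, tiles are reduced to plain integers, the width `N = 4` is arbitrary, and Python threads model the concurrency pattern rather than real hardware parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

N = 4  # width of the split-join (a tunable, hypothetical value)


def split(items, n):
    """Round-robin splitter: item k goes to branch k % n."""
    return [items[b::n] for b in range(n)]


def join(branches):
    """Round-robin joiner: interleave branch outputs back into stream order."""
    out = []
    longest = max(len(b) for b in branches)
    for k in range(longest):
        for b in branches:
            if k < len(b):
                out.append(b[k])
    return out


def score(rt, lt):
    """Toy scoring function standing in for the tile scorer."""
    return abs(rt - lt)


def parallel_scores(rt, library, n=N):
    """Score every library tile against rt using n concurrent scorer branches."""
    branches = split(library, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        # Each branch plays the role of one scorer task in `task [ scorers ]`.
        scored = list(pool.map(lambda br: [score(rt, lt) for lt in br], branches))
    return join(scored)


library = [5, 9, 1, 7, 3, 8, 2, 6, 4, 10]
scores = parallel_scores(6, library)
```

Changing `n` scales the number of scorer branches without touching the rest of the code, which is the software analogue of tuning the statically parametrized width of the split-join.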

3. On-going and Future Work

We have implemented several versions of the mosaic algorithm in Lime including the pipelined, messaging, and task-parallel versions described in this paper. These versions currently run entirely in software (as Java bytecode). We also implemented native C and Verilog versions of the scoring algorithm for future performance comparisons. The Verilog HDL implementation was synthesized for a Xilinx ML555 (Virtex-5) FPGA using Xilinx ISE 12.1 and achieved a clock frequency of 100 MHz and an end-to-end throughput of 50 M tiles/second. Our goal is to match or surpass the native HDL implementation starting with the Lime code. This is part of our on-going research.

4. Related Work

There are many “C-to-gates” efforts in academia and industry to compile C programs to hardware [3]. These methodologies tend to restrict the C dialect to a synthesizable subset that eschews pointer-aliasing, dynamic allocation, and recursion. Although the technology continues to mature, there remains a substantial programming burden on developers to expose the kinds of parallelism that Lime attempts to unify into a single semantic domain. Further, the introduction of value and bounded types in Lime provides more programmer convenience and greater expressive power.

Bluespec [7] is a hardware description language with Verilog-like syntax. Bluespec uses a rule-based model of computation to describe hardware in a way that is amenable to formal analysis and synthesis. Bluespec offers some productivity advantages compared to programming in Verilog but its programming model is unfamiliar to object-oriented programmers.

Kiwi [4] is an object-oriented hardware programming approach using C# and .NET concurrency mechanisms. That work aims to define hardware semantics for existing parallel programming constructs including monitors, events, message passing, and asynchronous threads. We believe that the task-based execution model in Lime is simpler and more desirable for FPGA synthesis compared to the Kiwi concurrency model.

5. Conclusion

This paper presents the synthesis-oriented features of Lime, a new Java-compatible language for programming hardware and software. Using a photo-mosaic algorithm as an example, the paper showed how a Lime programmer might implement the algorithm using pipeline and data parallelism to enable efficient hardware synthesis. The Lime compiler can compile programs into Java bytecode for execution on platforms that support a Java virtual machine. In addition, the compiler can perform behavioral synthesis to generate hardware descriptions that are synthesized into FPGA designs. The behavioral synthesis compiler is a work in progress but our goal is to generate efficient hardware that is competitive with natively implemented designs. To this end, we have implemented a number of benchmarks in both Lime and Verilog to enable future comparison and benchmarking.

References

[1] AUERBACH, J., BACON, D. F., CHENG, P., AND RABBAH, R. Lime: a Java-Compatible and Synthesizable Language for Heterogeneous Architectures. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (Oct. 2010).

[2] AUERBACH, J., BACON, D. F., CHENG, P., AND RABBAH, R. Lime language manual (version 2.0). Tech. Rep. RC-25004, IBM Research, Oct 2010.

[3] CARDOSO, J., AND DINIZ, P. Compilation Techniques for Reconfigurable Architectures. Springer-Verlag, 2008.

[4] GREAVES, D., AND SINGH, S. Kiwi: Synthesis of FPGA circuits from parallel programs. In IEEE Symposium on Field-Programmable Custom Computing Machines (2008).

[5] KHRONOS OPENCL WORKING GROUP. The OpenCL Specification. A. Munshi, Ed.

[6] LEE, E. A., AND MESSERSCHMITT, D. G. Static scheduling of synchronous data flow programs for digital signal processing. IEEE Trans. on Computers 36, 1 (January 1987), 24–35.

[7] NIKHIL, R. Bluespec System Verilog: efficient, correct RTL from high level specifications. In Proceedings of the Second ACM and IEEE International Conference on Formal Methods and Models for Co-Design (2004), pp. 69–70.

[8] PELLAUER, M., AGARWAL, A., KHAN, A., NG, M. C., VIJAYARAGHAVAN, M., BREWER, F., AND EMER, J. Design Contest Overview: Combined Architecture for Network Stream Categorization and Intrusion Detection (CANSCID). In Proceedings of the ACM/IEEE International Conference on Formal Methods and Models for Codesign (MEMOCODE 2010) (Grenoble, France, July 2010).
