Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Satnam Singh, Microsoft Research Cambridge, UK

David Greaves, Computer Lab, Cambridge University, UK

XD2000i FPGA in-socket accelerator for Intel FSB

XD2000F FPGA in-socket accelerator for AMD socket F

XD1000 FPGA co-processor module for socket 940

The Future is Heterogeneous

Example Speedup: DNA Sequence Matching

Why are regular computers not fast enough?

FPGAs are the Lego of Hardware

LUT4 (OR)

LUT4 (AND)

opportunity

scientific computing, data mining, search, image processing, financial analytics

challenge

The Accidental Semi-colon

;

Kiwi Thesis

• Parallel programs are a good representation for circuit designs. (?)
• Separated at birth?

Objectives

• A system for software engineers.
• Model synchronous digital circuits in C# etc.
– Software models offer greater productivity than models in VHDL or Verilog.
• Transform circuit models automatically into circuit implementations.
• Transform programs with dynamic memory allocation into their array equivalents.
• Exploit existing concurrent software verification tools.

Previous Work

• Starts with sequential C-style programs.
• Uses various heuristics to discover opportunities for parallelism, esp. in nested loops.
• Good for certain idioms that can be recognized.
• However:
– many parallelization opportunities are not discovered
– lack of control
– no support for dynamic memory allocation

Kiwi

[Figure: Kiwi in context. Labels: structural (gate-level VHDL/Verilog); imperative (C, C-to-gates); parallel imperative (Kiwi); a sequential program jpeg.c; threads 1, 2 and 3.]

Key Points

• We focus on compiling parallel C# programs into parallel hardware.
• Important because future processors will be heterogeneous and we need to find ways to model and program multi-core CPUs, GPUs, FPGAs etc.
• Previous work has had some success with compiling sequential programs into hardware.
• Our hypothesis: it’s much better to try and produce parallel hardware from parallel programs.
• Our approach involves compiling .NET concurrency constructs into gates.

Self Inflicted Constraints

• Use a standard programming language with no special extensions (C#).
• Use the standard mechanism for concurrency (System.Threading).
• Use concurrency to model circuit structure.

I2C Bus Control in VHDL

Ports and Clocks

public static class I2C
{
    [OutputBitPort("scl")]
    static bool scl;

    [InputBitPort("sda_in")]
    static bool sda_in;

    [OutputBitPort("sda_out")]
    static bool sda_out;

    [OutputBitPort("rw")]
    static bool rw;

    // The class body continues on the following slides.

Circuit ports are identified by custom attributes.

I2C Control

private static void SendDeviceID()
{
    Console.WriteLine("Sending device ID");
    // Send out 7-bit device ID 0x76
    int deviceID = 0x76;

    for (int i = 7; i > 0; i--)
    {
        scl = false;
        sda_out = (deviceID & 64) != 0;
        Kiwi.Pause(); // Set the i-th bit of the device ID
        scl = true;
        Kiwi.Pause(); // Pulse SCL
        scl = false;
        deviceID = deviceID << 1;
        Kiwi.Pause();
    }
}

Generated Verilog

module i2c_demo(clk, reset, I2CTest_I2C_scl, I2CTest_I2C_sda);
  input clk;
  input reset;
  reg i2c_demo_CS$4$0000;
  reg I2CTest_I2C_SendDeviceID_CS$4$0000;
  reg I2CTest_I2C_SendDeviceID_second_CS$4$0000;
  reg I2CTest_I2C_ProcessACK_ack1;
  reg I2CTest_I2C_ProcessACK_fourth_ack1;
  reg I2CTest_I2C_ProcessACK_second_ack1;
  reg I2CTest_I2C_ProcessACK_third_ack1;
  integer I2CTest_I2C_SendDeviceID_deviceID;
  integer I2CTest_I2C_SendDeviceID_second_deviceID;
  integer I2CTest_I2C_SendDeviceID_i;
  integer i2c_demo_i;
  integer I2CTest_I2C_SendDeviceID_second_i;
  integer i2c_demo_inBit;
  integer i2c_demo_registerID;
  output I2CTest_I2C_scl;
  output I2CTest_I2C_sda;

System Composition

• We need a way to separately develop components and then compose them together.
• Don’t invent new language constructs: reuse existing concurrency machinery.
• Adopt single-place channels for the composition of components.
• Model channels with regular concurrency constructs (monitors).

Writing to a Channel

using System.Threading;

public class Channel<T>
{
    T datum;
    bool empty = true;

    public void Write(T v)
    {
        lock (this)
        {
            // Block while the single slot is still occupied.
            while (!empty)
                Monitor.Wait(this);
            datum = v;
            empty = false;
            Monitor.PulseAll(this);
        }
    }

Reading from a Channel

    public T Read()
    {
        T r;
        lock (this)
        {
            // Block until a writer has filled the slot.
            while (empty)
                Monitor.Wait(this);
            empty = true;
            r = datum;
            Monitor.PulseAll(this);
        }
        return r;
    }
} // end of Channel<T>
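The one-place channel above can be exercised as ordinary .NET software before any hardware is generated. The sketch below is a minimal, software-only illustration: it repeats the Channel<T> class from the slides so that it compiles on its own, and the names ChannelDemo and Run are hypothetical helpers introduced for this example, not part of the Kiwi library.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// One-place channel as on the slides: Write blocks while full, Read blocks while empty.
public class Channel<T>
{
    T datum;
    bool empty = true;

    public void Write(T v)
    {
        lock (this)
        {
            while (!empty) Monitor.Wait(this);
            datum = v;
            empty = false;
            Monitor.PulseAll(this);
        }
    }

    public T Read()
    {
        T r;
        lock (this)
        {
            while (empty) Monitor.Wait(this);
            empty = true;
            r = datum;
            Monitor.PulseAll(this);
        }
        return r;
    }
}

// Hypothetical demo harness (not from the slides).
public static class ChannelDemo
{
    // Pipeline: producer -> doubler -> caller. Returns the values read from the second channel.
    public static List<int> Run(int count)
    {
        var chan1 = new Channel<int>();
        var chan2 = new Channel<int>();

        var producer = new Thread(() => { for (int i = 0; i < count; i++) chan1.Write(i); });
        var doubler = new Thread(() => { for (int i = 0; i < count; i++) chan2.Write(2 * chan1.Read()); });
        producer.Start();
        doubler.Start();

        var results = new List<int>();
        for (int i = 0; i < count; i++) results.Add(chan2.Read());

        producer.Join();
        doubler.Join();
        return results;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Run(5))); // prints "0, 2, 4, 6, 8"
    }
}
```

Because each channel has a single writer and a single reader and holds at most one value, the values arrive in the order they were written, whatever the thread scheduling.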

Our Implementation

• Use regular Visual Studio technology to generate a .NET IL assembly language file.
• Our system then processes this file to produce a circuit:
– The .NET stack is analyzed and removed.
– The control structure of the code is analyzed and broken into basic blocks which are then composed.
– The concurrency constructs used in the program are used to control the concurrency / clocking of the generated circuit.

[Figure: layers of concurrency abstraction. Labels: system-level concurrency constructs (threads, events, monitors, condition variables); rendezvous; join patterns; transactional memory; data parallelism; user applications; domain-specific languages.]

Higher Level Concurrency Constructs

• By providing hardware semantics for the system-level concurrency abstractions we hope to then automatically deal with other higher-level concurrency constructs:
– Join patterns (C-Omega, CCR, .NET Joins Library)
– Rendezvous
– Data parallel operations

[Figure: Kiwi tool flow. Labels: Kiwi library (Kiwi.cs); circuit model (JPEG.cs); Visual Studio (multi-thread simulation, debugging, verification); Kiwi synthesis; circuit implementation (JPEG.v).]

[Figure: compiling a parallel C# program. Labels: threads 1 to 3, each passed through C-to-gates to produce a circuit; the circuits are combined into Verilog for the system.]

public static int max2(int a, int b)
{
    int result;
    if (a > b)
        result = a;
    else
        result = b;
    return result;
}

.method public hidebysig static int32 max2(int32 a, int32 b) cil managed
{
  // Code size 12 (0xc)
  .maxstack 2
  .locals init ([0] int32 result)
  IL_0000: ldarg.0
  IL_0001: ldarg.1
  IL_0002: ble.s IL_0008

  IL_0004: ldarg.0
  IL_0005: stloc.0
  IL_0006: br.s IL_000a

  IL_0008: ldarg.1
  IL_0009: stloc.0
  IL_000a: ldloc.0
  IL_000b: ret
}

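[Figure: step-by-step evaluation of max2(3, 7), showing the contents of the stack and of local memory; the values 3 and 7 are pushed and 7 ends up as the result.]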
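To see what the stack analysis has to reason about, the max2 IL listing can be replayed on a tiny stack machine. The sketch below is only an illustration of how the listed opcodes use the stack and the local slot; it is not how the Kiwi tool processes IL. The Instr type and the string opcode names are hypothetical, and branch targets are rewritten as instruction indices.

```csharp
using System;
using System.Collections.Generic;

public static class Max2StackMachine
{
    // Hypothetical mini-instruction: an opcode name plus one integer operand.
    private struct Instr
    {
        public string Op;
        public int Arg;
        public Instr(string op, int arg = 0) { Op = op; Arg = arg; }
    }

    // The max2 IL from the slide; branch targets are instruction indices.
    private static readonly Instr[] Code =
    {
        new Instr("ldarg", 0), // 0: IL_0000
        new Instr("ldarg", 1), // 1: IL_0001
        new Instr("ble", 6),   // 2: IL_0002, jump to IL_0008 if a <= b
        new Instr("ldarg", 0), // 3: IL_0004
        new Instr("stloc", 0), // 4: IL_0005
        new Instr("br", 8),    // 5: IL_0006, jump to IL_000a
        new Instr("ldarg", 1), // 6: IL_0008
        new Instr("stloc", 0), // 7: IL_0009
        new Instr("ldloc", 0), // 8: IL_000a
        new Instr("ret"),      // 9: IL_000b
    };

    public static int Run(int a, int b)
    {
        int[] args = { a, b };
        int[] locals = new int[1];   // [0] int32 result
        var stack = new Stack<int>();
        int pc = 0;

        while (true)
        {
            Instr ins = Code[pc++];
            switch (ins.Op)
            {
                case "ldarg": stack.Push(args[ins.Arg]); break;
                case "ldloc": stack.Push(locals[ins.Arg]); break;
                case "stloc": locals[ins.Arg] = stack.Pop(); break;
                case "br": pc = ins.Arg; break;
                case "ble":
                {
                    int second = stack.Pop(); // pushed last
                    int first = stack.Pop();  // pushed first
                    if (first <= second) pc = ins.Arg;
                    break;
                }
                case "ret": return stack.Pop();
                default: throw new InvalidOperationException(ins.Op);
            }
        }
    }

    public static void Main()
    {
        Console.WriteLine(Run(3, 7)); // prints "7"
    }
}
```

The stack never holds more than two values, matching the `.maxstack 2` directive, which is what makes it feasible to replace the stack with a fixed set of registers or wires.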

System.Threading

• We have decided to target hardware synthesis for a subset of the concurrency features in the .NET library System.Threading:
– Monitors (synchronization)
– Thread creation (circuit structure)

Kiwi Concurrency Library

• A conventional concurrency library, Kiwi, is exposed to the user. It has two implementations:
– A software implementation which is defined purely in terms of the supported .NET concurrency mechanisms (events, monitors, threads).
– A corresponding hardware semantics which is used to drive the .NET IL to Verilog flow to generate circuits.
• A Kiwi program should always be a sensible concurrent program but it may also be a sensible parallel circuit.

using System.Threading;

class FIFO2
{
    [Kiwi.OutputWordPort("result", 31, 0)]
    public static int result;

    static Kiwi.Channel<int> chan1 = new Kiwi.Channel<int>();
    static Kiwi.Channel<int> chan2 = new Kiwi.Channel<int>();

    public static void Consumer()
    {
        while (true)
        {
            int i = chan1.Read();
            chan2.Write(2 * i);
            Kiwi.Pause();
        }
    }

    public static void Producer()
    {
        for (int i = 0; i < 10; i++)
        {
            chan1.Write(i);
            Kiwi.Pause();
        }
    }

    public static void Behaviour()
    {
        Thread ProducerThread = new Thread(new ThreadStart(Producer));
        ProducerThread.Start();

        Thread ConsumerThread = new Thread(new ThreadStart(Consumer));
        ConsumerThread.Start();
        // The remainder of Behaviour is not shown on the slide.
    }
}

[Figure annotations: two clock ticks per result; handshaking protocol.]

Filter Example

[Figure legend: thread; one-place channel.]

// size is the number of filter taps, defined elsewhere in the class.
public static int[] SequentialFIRFunction(int[] weights, int[] input)
{
    int[] window = new int[size];
    int[] result = new int[input.Length];

    // Clear the window of x values to all zero.
    for (int w = 0; w < size; w++)
        window[w] = 0;

    // For each sample...
    for (int i = 0; i < input.Length; i++)
    {
        // Shift in the new x value
        for (int j = size - 1; j > 0; j--)
            window[j] = window[j - 1];
        window[0] = input[i];

        // Compute the result value
        int sum = 0;
        for (int z = 0; z < size; z++)
            sum += weights[z] * window[z];
        result[i] = sum;
    }
    return result;
}
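A quick way to check the sequential filter is to feed it a unit impulse: the output should then reproduce the weights one by one. The sketch below restates the function so that it compiles on its own. The names FirDemo and SequentialFir are hypothetical, and taking size from weights.Length is an assumption, since the slide does not show where size is defined.

```csharp
using System;

public static class FirDemo
{
    // Same algorithm as on the slide; size is assumed to equal weights.Length.
    public static int[] SequentialFir(int[] weights, int[] input)
    {
        int size = weights.Length;
        int[] window = new int[size];          // zero-initialised by the runtime
        int[] result = new int[input.Length];

        for (int i = 0; i < input.Length; i++)
        {
            // Shift the window by one place and insert the new sample.
            for (int j = size - 1; j > 0; j--)
                window[j] = window[j - 1];
            window[0] = input[i];

            // Weighted sum over the window.
            int sum = 0;
            for (int z = 0; z < size; z++)
                sum += weights[z] * window[z];
            result[i] = sum;
        }
        return result;
    }

    public static void Main()
    {
        int[] y = SequentialFir(new[] { 1, 2, 3 }, new[] { 1, 0, 0, 0 });
        Console.WriteLine(string.Join(", ", y)); // prints "1, 2, 3, 0"
    }
}
```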

Transposed Filter

static void Tap(int i, byte w,
                Kiwi.Channel<byte> xIn,
                Kiwi.Channel<int> yIn,
                Kiwi.Channel<int> yout)
{
    byte x;
    int y;
    while (true)
    {
        y = yIn.Read();
        x = xIn.Read();
        yout.Write(x * w + y);
    }
}

Inter-thread Communication and Synchronization

// Create the channels to link together the taps
for (int c = 0; c < size; c++)
{
    Xchannels[c] = new Kiwi.Channel<byte>();
    Ychannels[c] = new Kiwi.Channel<int>();
    Ychannels[c].Write(0); // Pre-populate y-channel registers with zeros
}

// Connect up the taps for a transposed filter
// (the last tap writes to Ychannels[size], which the loop above does not create)
for (int i = 0; i < size; i++)
{
    int j = i; // Quiz: why do we need the local j?
    Thread tapThread = new Thread(delegate() {
        Tap(j, weights[j], Xchannels[j], Ychannels[j], Ychannels[j+1]);
    });
    tapThread.Start();
}
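One answer to the quiz about the local j: a C# for loop has a single loop variable shared by all iterations, so every delegate that captures i sees whatever value i has when the delegate eventually runs, whereas j is a fresh variable per iteration. The sketch below shows the effect without threads so that the outcome is deterministic; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class ClosureCapture
{
    // Each delegate captures the shared loop variable i.
    public static List<int> CaptureLoopVariable(int size)
    {
        var readers = new List<Func<int>>();
        for (int i = 0; i < size; i++)
            readers.Add(() => i);

        var seen = new List<int>();
        foreach (var read in readers)
            seen.Add(read());       // i has already reached size here
        return seen;
    }

    // Each delegate captures its own per-iteration copy j.
    public static List<int> CaptureLocalCopy(int size)
    {
        var readers = new List<Func<int>>();
        for (int i = 0; i < size; i++)
        {
            int j = i;
            readers.Add(() => j);
        }

        var seen = new List<int>();
        foreach (var read in readers)
            seen.Add(read());
        return seen;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", CaptureLoopVariable(3))); // prints "3, 3, 3"
        Console.WriteLine(string.Join(", ", CaptureLocalCopy(3)));    // prints "0, 1, 2"
    }
}
```

With threads the failure is less predictable: a tap thread that captured i directly could start after the loop has advanced, and so pick up the wrong weight and channels or index past the end of the arrays.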

Performance

• Software
– Dual-core Pentium 2.67GHz, 3GB
– 6,562,500 pixels per second
• BEE3 FPGA Performance
– Xilinx XC5VLX110T FPGA, 100MHz
– DDR2 memory, 2 DIMMs per channel, 288 bits per read
– 4 cycles per pixel
– 429,000,000 pixels per second
• Hand optimized core
– Xilinx CoreGenerator: 400MHz
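Taking the two quoted throughput figures at face value, the FPGA-to-software speedup they imply can be worked out directly. This is simple arithmetic on the slide's numbers, not an additional measurement:

```latex
\[
\text{speedup} \;=\; \frac{429{,}000{,}000\ \text{pixels/s}}{6{,}562{,}500\ \text{pixels/s}} \;\approx\; 65.4
\]
```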

Current Limitations

• Only integer arithmetic and string handling.
• Floating point could be added easily.
• Generation of statically allocated code:
– Arrays must be dimensioned at compile time.
– Number of objects on the heap is determined at compile time.
– Recursive function calling must bottom out at compile time (so depth cannot be run-time dependent).

Next Steps

• Consider a series of concurrency constructs and their meaning in hardware:
– Transactional memory
– Rendezvous
– Join patterns / chords
– Data parallel descriptions
• Optimize away handshaking protocol.
• Allow non-trivial dynamic memory allocation.
• Solve impedance mismatch with back-end tools to improve performance.

Smith-Waterman Recurrence

SW Diagonal Dependencies

• Can perform all operations on an anti-diagonal in parallel.
• Can pass query and database data along channels between cells.
• However, each operation needs a scoring matrix read.

for (int qpos = 0; qpos < height; qpos++)
{
    short score = (dbval < 0 || seq[qpos] < 0) ? (short)0 : pam250[dbval, seq[qpos]];
    int left = prev[qpos];
    int above = (qpos == 0) ? aboveScore : here[qpos - 1];
    int diag = (qpos == 0) ? prevAbove : prev[qpos - 1];
    int nv = Math.Max(0, Math.Max(left - 10, Math.Max(above - 10, diag + score)));
    if (nv > (int)max) max = (short)nv;
    here[qpos] = (short)nv;
    if (qpos == height - 1) below_score.Write((short)nv);
}
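The recurrence that this loop evaluates can be read off the code. The formula below is a reconstruction from the loop body, not a copy of the slide's own equation, and the symbols are introduced here for illustration: H_{q,d} is the score at query position q (qpos) and database position d, and s(q,d) is the substitution score, pam250[dbval, seq[qpos]], or 0 when either symbol is negative. The constant 10 is the gap penalty used in the code.

```latex
\[
H_{q,d} \;=\; \max\bigl(\,0,\;\; H_{q,d-1} - 10,\;\; H_{q-1,d} - 10,\;\; H_{q-1,d-1} + s(q,d)\,\bigr)
\]
```

In the code, H_{q,d-1} is left (prev[qpos]), H_{q-1,d} is above (here[qpos-1]) and H_{q-1,d-1} is diag (prev[qpos-1]). For q = 0 the above and diag values come from aboveScore and prevAbove instead.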

[Figure: a data-parallel description of FFT-style operations mapped to several targets. Labels: multi-core; bytecode; FPGA hardware (VHDL); GPU code (CUDA); SMP C#.]

Summary

• Circuits can be modelled as regular parallel programs.
• Automatically transform parallel circuit models into digital circuit implementations.
• Exploit shared memory and message passing idioms for co-design.
• We don’t need to invent a new language:
– Exploit rich existing knowledge of concurrent programming.
• Apply recent innovations in shape analysis and region types to allow us to compile programs with lists and trees.
• Is there an application for this work at Sanger/EBI?
• More information about Kiwi synthesis at http://research.microsoft.com/~satnams

Synplify Pro FPGA Implementation:

First, preliminary result:
• Device: Virtex 5x110T-2
• Static timing: 20 logic layers, Fmax = 78 MHz (12.7 ns)
• Utilisation: 3120 Virtex-5 slices, 17% of 17500
• Clock cycles per streaming base: 10

Future parameter exploration:
• QSL (search string query limit): increase to 256 or 512
• N (search parallelism, number of units): 32 or 64
• Clocks per cell: reduce to 4 or 2 (channel overheads then dominate)
• Extend Kiwi channels between the four chips on the BEE3 board
