Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs
Satnam Singh, Microsoft Research Cambridge, UK
David Greaves, Computer Lab, Cambridge University, UK
XD2000i FPGA in-socket accelerator for Intel FSB
XD2000F FPGA in-socket accelerator for AMD socket F
XD1000 FPGA co-processor module for socket 940
The Future is Heterogeneous
Example Speedup: DNA Sequence Matching
Why are regular computers not fast enough?
FPGAs are the Lego of Hardware
LUT4 (OR)
LUT4 (AND)
opportunity
scientific computing, data mining, search, image processing, financial analytics
challenge
The Accidental Semi-colon
;
Kiwi Thesis
• Parallel programs are a good representation for circuit designs. (?)
• Separated at birth?
Objectives
• A system for software engineers.
• Model synchronous digital circuits in C# etc.
  – Software models offer greater productivity than models in VHDL or Verilog.
• Transform circuit models automatically into circuit implementations.
• Transform programs with dynamic memory allocation into their array equivalents.
• Exploit existing concurrent software verification tools.
Previous Work
• Starts with sequential C-style programs.
• Uses various heuristics to discover opportunities for parallelism, especially in nested loops.
• Good for certain idioms that can be recognized.
• However:
  – many parallelization opportunities are not discovered
  – lack of control
  – no support for dynamic memory allocation
Kiwi
[Figure: three design styles side by side: structural (gate-level VHDL/Verilog), imperative C (C-to-gates, e.g. jpeg.c), and parallel imperative (Kiwi, threads 1 to 3).]
Key Points
• We focus on compiling parallel C# programs into parallel hardware.
• Important because future processors will be heterogeneous and we need to find ways to model and program multi-core CPUs, GPUs, FPGAs etc.
• Previous work has had some success with compiling sequential programs into hardware.
• Our hypothesis: it is much better to try to produce parallel hardware from parallel programs.
• Our approach involves compiling .NET concurrency constructs into gates.
Self-Inflicted Constraints
• Use a standard programming language with no special extensions (C#).
• Use the standard mechanism for concurrency (System.Threading).
• Use concurrency to model circuit structure.
I2C Bus Control in VHDL
Ports and Clocks
public static class I2C
{
    // Circuit ports are identified by custom attributes.
    [OutputBitPort("scl")]
    static bool scl;

    [InputBitPort("sda_in")]
    static bool sda_in;

    [OutputBitPort("sda_out")]
    static bool sda_out;

    [OutputBitPort("rw")]
    static bool rw;
I2C Control
    private static void SendDeviceID()
    {
        Console.WriteLine("Sending device ID");
        // Send out 7-bit device ID 0x76, most significant bit first
        int deviceID = 0x76;
        for (int i = 7; i > 0; i--)
        {
            scl = false;
            sda_out = (deviceID & 64) != 0;  // Set the i-th bit of the device ID
            Kiwi.Pause();
            scl = true;                      // Pulse SCL
            Kiwi.Pause();
            scl = false;
            deviceID = deviceID << 1;
            Kiwi.Pause();
        }
    }
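The bit order produced by SendDeviceID can be checked in plain software. The following is a minimal sketch, not part of the Kiwi library: the class name I2CSoftwareModel and the local Pause() stand-in for Kiwi.Pause() are assumptions made for illustration. The stand-in records sda_out whenever SCL rises, which is when an I2C receiver would sample the data line.

```csharp
using System;
using System.Text;

// Hypothetical software-only model of the SendDeviceID slide code.
public static class I2CSoftwareModel
{
    static bool scl;
    static bool sdaOut;
    static bool previousScl;
    static readonly StringBuilder sampled = new StringBuilder();

    // Assumed stand-in for Kiwi.Pause(): one clock tick.
    // Records the data line on each rising edge of SCL.
    static void Pause()
    {
        if (scl && !previousScl)
        {
            sampled.Append(sdaOut ? '1' : '0');
        }
        previousScl = scl;
    }

    // Same loop structure as the slide; returns the bits seen on the bus.
    public static string SendDeviceID(int deviceID)
    {
        sampled.Clear();
        scl = false;
        previousScl = false;
        for (int i = 7; i > 0; i--)
        {
            scl = false;
            sdaOut = (deviceID & 64) != 0;  // bit 6 is the MSB of a 7-bit ID
            Pause();
            scl = true;                     // rising edge: receiver samples
            Pause();
            scl = false;
            deviceID = deviceID << 1;       // move the next bit into bit 6
            Pause();
        }
        return sampled.ToString();
    }
}
```

Calling I2CSoftwareModel.SendDeviceID(0x76) returns "1110110", the 7-bit ID sent most significant bit first, with three clock ticks spent per bit.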
Generated Verilog
module i2c_demo(clk, reset, I2CTest_I2C_scl, I2CTest_I2C_sda);
  input clk;
  input reset;
  reg i2c_demo_CS$4$0000;
  reg I2CTest_I2C_SendDeviceID_CS$4$0000;
  reg I2CTest_I2C_SendDeviceID_second_CS$4$0000;
  reg I2CTest_I2C_ProcessACK_ack1;
  reg I2CTest_I2C_ProcessACK_fourth_ack1;
  reg I2CTest_I2C_ProcessACK_second_ack1;
  reg I2CTest_I2C_ProcessACK_third_ack1;
  integer I2CTest_I2C_SendDeviceID_deviceID;
  integer I2CTest_I2C_SendDeviceID_second_deviceID;
  integer I2CTest_I2C_SendDeviceID_i;
  integer i2c_demo_i;
  integer I2CTest_I2C_SendDeviceID_second_i;
  integer i2c_demo_inBit;
  integer i2c_demo_registerID;
  output I2CTest_I2C_scl;
  output I2CTest_I2C_sda;
System Composition
• We need a way to separately develop components and then compose them together.
• Don’t invent new language constructs: reuse existing concurrency machinery.
• Adopt single-place channels for the composition of components.
• Model channels with regular concurrency constructs (monitors).
Writing to a Channel
public class Channel<T>
{
    T datum;
    bool empty = true;

    public void Write(T v)
    {
        lock (this)
        {
            // Block until the single slot is free.
            while (!empty)
                Monitor.Wait(this);
            datum = v;
            empty = false;
            Monitor.PulseAll(this);
        }
    }

Reading from a Channel
    public T Read()
    {
        T r;
        lock (this)
        {
            // Block until a value has been written.
            while (empty)
                Monitor.Wait(this);
            empty = true;
            r = datum;
            Monitor.PulseAll(this);
        }
        return r;
    }
}
Our Implementation
• Use regular Visual Studio technology to generate a .NET IL assembly language file.
• Our system then processes this file to produce a circuit:
  – The .NET stack is analyzed and removed.
  – The control structure of the code is analyzed and broken into basic blocks which are then composed.
  – The concurrency constructs used in the program are used to control the concurrency / clocking of the generated circuit.
[Figure: layered diagram with the layers: systems-level concurrency constructs (threads, events, monitors, condition variables); rendezvous, join patterns, transactional memory, data parallelism; domain-specific languages; user applications.]
Higher Level Concurrency Constructs
• By providing hardware semantics for the system-level concurrency abstractions we hope to then automatically deal with other higher-level concurrency constructs:
  – Join patterns (C-Omega, CCR, .NET Joins Library)
  – Rendezvous
  – Data parallel operations
[Figure: tool flow. The Kiwi library (Kiwi.cs) and the circuit model (JPEG.cs) are used in Visual Studio for multi-thread simulation, debugging and verification; Kiwi Synthesis turns the same model into the circuit implementation (JPEG.v).]
[Figure: a parallel C# program is split into its threads; each thread passes through C-to-gates to give a circuit, and the circuits are combined into Verilog for the system.]
public static int max2(int a, int b)
{
    int result;
    if (a > b)
        result = a;
    else
        result = b;
    return result;
}

.method public hidebysig static int32 max2(int32 a, int32 b) cil managed
{
  // Code size 12 (0xc)
  .maxstack 2
  .locals init ([0] int32 result)
  IL_0000: ldarg.0
  IL_0001: ldarg.1
  IL_0002: ble.s IL_0008
  IL_0004: ldarg.0
  IL_0005: stloc.0
  IL_0006: br.s IL_000a
  IL_0008: ldarg.1
  IL_0009: stloc.0
  IL_000a: ldloc.0
  IL_000b: ret
}
[Figure: contents of the evaluation stack and local memory while executing max2(3, 7), which returns 7.]
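To make the stack behaviour of the IL listing concrete, here is a minimal sketch that steps through the ten instructions of max2 by hand, with an explicit evaluation stack and one local slot. It is an illustration only, not the Kiwi front end: the class name Max2StackMachine and the switch-on-offset structure are assumptions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical hand-written interpreter for the max2 IL listing.
public static class Max2StackMachine
{
    public static int Run(int a, int b)
    {
        var stack = new Stack<int>();   // the .NET evaluation stack
        int[] locals = new int[1];      // [0] int32 result
        int pc = 0x0000;                // IL offset of the next instruction

        while (true)
        {
            switch (pc)
            {
                case 0x0000: stack.Push(a); pc = 0x0001; break;            // ldarg.0
                case 0x0001: stack.Push(b); pc = 0x0002; break;            // ldarg.1
                case 0x0002:                                               // ble.s IL_0008
                {
                    int value2 = stack.Pop();
                    int value1 = stack.Pop();
                    pc = (value1 <= value2) ? 0x0008 : 0x0004;
                    break;
                }
                case 0x0004: stack.Push(a); pc = 0x0005; break;            // ldarg.0
                case 0x0005: locals[0] = stack.Pop(); pc = 0x0006; break;  // stloc.0
                case 0x0006: pc = 0x000a; break;                           // br.s IL_000a
                case 0x0008: stack.Push(b); pc = 0x0009; break;            // ldarg.1
                case 0x0009: locals[0] = stack.Pop(); pc = 0x000a; break;  // stloc.0
                case 0x000a: stack.Push(locals[0]); pc = 0x000b; break;    // ldloc.0
                case 0x000b: return stack.Pop();                           // ret
                default: throw new InvalidOperationException("bad IL offset");
            }
        }
    }
}
```

Max2StackMachine.Run(3, 7) returns 7. The stack never holds more than two values, matching .maxstack 2, and it is empty at every branch target, which is the kind of property that lets a compiler replace the stack with fixed registers or wires.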
System.Threading
• We have decided to target hardware synthesis for a subset of the concurrency features in the .NET library System.Threading:
  – Monitors (synchronization)
  – Thread creation (circuit structure)
Kiwi Concurrency Library
• A conventional concurrency library, Kiwi, is exposed to the user. It has two implementations:
  – A software implementation which is defined purely in terms of the supported .NET concurrency mechanisms (events, monitors, threads).
  – A corresponding hardware semantics which is used to drive the .NET IL to Verilog flow to generate circuits.
• A Kiwi program should always be a sensible concurrent program but it may also be a sensible parallel circuit.
class FIFO2
{
    [Kiwi.OutputWordPort("result", 31, 0)]
    public static int result;

    static Kiwi.Channel<int> chan1 = new Kiwi.Channel<int>();
    static Kiwi.Channel<int> chan2 = new Kiwi.Channel<int>();

    // Reads a value from chan1 and forwards twice that value to chan2.
    public static void Consumer()
    {
        while (true)
        {
            int i = chan1.Read();
            chan2.Write(2 * i);
            Kiwi.Pause();
        }
    }

    // Writes the values 0..9 to chan1, one per clock tick.
    public static void Producer()
    {
        for (int i = 0; i < 10; i++)
        {
            chan1.Write(i);
            Kiwi.Pause();
        }
    }

    // Each thread becomes a separate circuit.
    public static void Behaviour()
    {
        Thread ProducerThread = new Thread(new ThreadStart(Producer));
        ProducerThread.Start();

        Thread ConsumerThread = new Thread(new ThreadStart(Consumer));
        ConsumerThread.Start();
    }
}
[Figure annotations: two clock ticks per result; handshaking protocol.]
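The FIFO2 design can also be run as an ordinary multi-threaded program. The sketch below is a software-only approximation under stated assumptions: it repeats the monitor-based Channel<T> from the earlier slides, drops Kiwi.Pause() and the port attribute (neither is needed in software), bounds the consumer loop so that the threads terminate, and uses the hypothetical name Fifo2Software.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Single-place channel, as on the channel slides.
public class Channel<T>
{
    T datum;
    bool empty = true;

    public void Write(T v)
    {
        lock (this)
        {
            while (!empty) Monitor.Wait(this);
            datum = v;
            empty = false;
            Monitor.PulseAll(this);
        }
    }

    public T Read()
    {
        T r;
        lock (this)
        {
            while (empty) Monitor.Wait(this);
            empty = true;
            r = datum;
            Monitor.PulseAll(this);
        }
        return r;
    }
}

// Hypothetical software harness for the producer/consumer pair.
public static class Fifo2Software
{
    public static List<int> Run(int count)
    {
        var chan1 = new Channel<int>();
        var chan2 = new Channel<int>();
        var results = new List<int>();

        var producer = new Thread(() =>
        {
            for (int i = 0; i < count; i++) chan1.Write(i);
        });
        var consumer = new Thread(() =>
        {
            // Bounded here (unlike the slide) so the thread can finish.
            for (int n = 0; n < count; n++) chan2.Write(2 * chan1.Read());
        });
        producer.Start();
        consumer.Start();

        // The calling thread drains chan2.
        for (int n = 0; n < count; n++) results.Add(chan2.Read());
        producer.Join();
        consumer.Join();
        return results;
    }
}
```

Fifo2Software.Run(10) returns 0, 2, 4, ..., 18 in order. Because each channel holds a single value, every Write must be matched by a Read before the next Write can proceed; that handshake is what the generated hardware has to implement.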
Filter Example
[Figure legend: thread; one-place channel.]
// size is the number of filter taps, defined elsewhere in the class.
public static int[] SequentialFIRFunction(int[] weights, int[] input)
{
    int[] window = new int[size];
    int[] result = new int[input.Length];
    // Clear the window of x values to all zero.
    for (int w = 0; w < size; w++)
        window[w] = 0;
    // For each sample...
    for (int i = 0; i < input.Length; i++)
    {
        // Shift in the new x value
        for (int j = size - 1; j > 0; j--)
            window[j] = window[j - 1];
        window[0] = input[i];
        // Compute the result value
        int sum = 0;
        for (int z = 0; z < size; z++)
            sum += weights[z] * window[z];
        result[i] = sum;
    }
    return result;
}
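A quick way to sanity-check the sequential filter is to feed it a unit impulse, which should reproduce the weights. The sketch below is a self-contained variant under one assumption: size is taken from weights.Length rather than from a field, and the class name FirSketch is hypothetical.

```csharp
using System;

// Hypothetical self-contained variant of the sequential FIR slide code.
public static class FirSketch
{
    public static int[] SequentialFIR(int[] weights, int[] input)
    {
        int size = weights.Length;            // assumption: one weight per tap
        int[] window = new int[size];         // C# zero-initializes the window
        int[] result = new int[input.Length];

        for (int i = 0; i < input.Length; i++)
        {
            // Shift in the new x value.
            for (int j = size - 1; j > 0; j--)
                window[j] = window[j - 1];
            window[0] = input[i];

            // Dot product of weights and the current window.
            int sum = 0;
            for (int z = 0; z < size; z++)
                sum += weights[z] * window[z];
            result[i] = sum;
        }
        return result;
    }
}
```

With weights {1, 2, 3} and input {1, 0, 0, 0}, the result is {1, 2, 3, 0}: the impulse walks through the window and picks out each weight in turn.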
Transposed Filter
static void Tap(int i, byte w,
                Kiwi.Channel<byte> xIn,
                Kiwi.Channel<int> yIn,
                Kiwi.Channel<int> yout)
{
    byte x;
    int y;
    while (true)
    {
        y = yIn.Read();
        x = xIn.Read();
        yout.Write(x * w + y);
    }
}
Inter-thread Communication and Synchronization
// Create the channels to link together the taps
for (int c = 0; c < size; c++)
{
    Xchannels[c] = new Kiwi.Channel<byte>();
    Ychannels[c] = new Kiwi.Channel<int>();
    Ychannels[c].Write(0); // Pre-populate y-channel registers with zeros
}

// Connect up the taps for a transposed filter
for (int i = 0; i < size; i++)
{
    int j = i; // Quiz: why do we need the local j?
    Thread tapThread = new Thread(delegate()
    {
        Tap(j, weights[j], Xchannels[j], Ychannels[j], Ychannels[j+1]);
    });
    tapThread.Start();
}
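The quiz about the local j comes down to how C# closures capture variables: an anonymous delegate captures the variable itself, not its value at creation time, and a for loop has a single loop variable shared by all iterations. The sketch below shows the effect without threads; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demonstration of closure capture in a for loop.
public static class ClosureCapture
{
    // Every delegate captures the same loop variable i.
    public static int[] CaptureLoopVariable(int size)
    {
        var readers = new List<Func<int>>();
        for (int i = 0; i < size; i++)
            readers.Add(() => i);
        return InvokeAll(readers);   // runs after the loop, when i == size
    }

    // Each iteration declares a fresh j, so each delegate gets its own copy.
    public static int[] CaptureLocalCopy(int size)
    {
        var readers = new List<Func<int>>();
        for (int i = 0; i < size; i++)
        {
            int j = i;
            readers.Add(() => j);
        }
        return InvokeAll(readers);
    }

    static int[] InvokeAll(List<Func<int>> readers)
    {
        var values = new int[readers.Count];
        for (int k = 0; k < readers.Count; k++)
            values[k] = readers[k]();
        return values;
    }
}
```

CaptureLoopVariable(3) yields {3, 3, 3} while CaptureLocalCopy(3) yields {0, 1, 2}. In the filter code the delegate body runs on a new thread at some later time, so without j a tap could read an index that has already moved on.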
Performance
• Software
  – Dual-core Pentium 2.67GHz, 3GB
  – 6,562,500 pixels per second
• BEE3 FPGA Performance
  – Xilinx XC5VLX110T FPGA, 100MHz
  – DDR2 memory, 2 DIMMs per channel, 288 bits per read
  – 4 cycles per pixel
  – 429,000,000 pixels per second
• Hand-optimized core
  – Xilinx CoreGenerator: 400MHz
Current Limitations
• Only integer arithmetic and string handling.
• Floating point could be added easily.
• Generation of statically allocated code:
  – Arrays must be dimensioned at compile time.
  – Number of objects on the heap is determined at compile time.
  – Recursive function calling must bottom out at compile time (so depth cannot be run-time dependent).
Next Steps
• Consider a series of concurrency constructs and their meaning in hardware:
  – Transactional memory
  – Rendezvous
  – Join patterns / chords
  – Data parallel descriptions
• Optimize away the handshaking protocol.
• Allow non-trivial dynamic memory allocation.
• Solve the impedance mismatch with back-end tools to improve performance.
Smith-Waterman Recurrence
SW Diagonal Dependencies
• Can perform all operations on an anti-diagonal in parallel.
• Can pass query and database data along channels between cells.
• However, each operation needs a scoring matrix read.
for (int qpos = 0; qpos < height; qpos++)
{
    short score = (dbval < 0 || seq[qpos] < 0) ? (short)0 : pam250[dbval, seq[qpos]];
    int left = prev[qpos];
    int above = (qpos == 0) ? aboveScore : here[qpos - 1];
    int diag = (qpos == 0) ? prevAbove : prev[qpos - 1];
    int nv = Math.Max(0, Math.Max(left - 10, Math.Max(above - 10, diag + score)));
    if (nv > (int)max) max = (short)nv;
    here[qpos] = (short)nv;
    if (qpos == height - 1) below_score.Write((short)nv);
}
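The loop above is one column of the Smith-Waterman recurrence inside a single cell. As a rough software illustration, the sketch below runs the same recurrence over a whole database string. Several things here are assumptions rather than the authors' code: a simple match/mismatch score (+5 / -4) replaces the pam250 table, the boundary values that the slide receives from a neighbouring cell (aboveScore, prevAbove) are fixed at zero because there is only one cell, and the name SmithWatermanSketch is hypothetical. The gap penalty of 10 is taken from the slide.

```csharp
using System;

// Hypothetical single-cell software version of the slide's recurrence.
public static class SmithWatermanSketch
{
    const int GapPenalty = 10;   // from the slide: left - 10, above - 10

    // Assumed scoring function standing in for the pam250 lookup.
    static int Score(char a, char b)
    {
        return a == b ? 5 : -4;
    }

    public static int BestLocalScore(string query, string database)
    {
        int height = query.Length;
        int[] prev = new int[height];   // previous database column
        int[] here = new int[height];   // column being computed
        int max = 0;

        foreach (char dbval in database)
        {
            for (int qpos = 0; qpos < height; qpos++)
            {
                int left = prev[qpos];
                int above = (qpos == 0) ? 0 : here[qpos - 1];
                int diag = (qpos == 0) ? 0 : prev[qpos - 1];
                int nv = Math.Max(0,
                         Math.Max(left - GapPenalty,
                         Math.Max(above - GapPenalty,
                                  diag + Score(dbval, query[qpos]))));
                if (nv > max) max = nv;
                here[qpos] = nv;
            }
            // The finished column becomes the previous column.
            int[] swap = prev;
            prev = here;
            here = swap;
        }
        return max;
    }
}
```

BestLocalScore("ACGT", "ACGT") is 20 (four matches along the diagonal) and BestLocalScore("AAAA", "TTTT") is 0. Each new value depends only on its left, above and diagonal neighbours, which is why all entries on one anti-diagonal can be computed in parallel.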
[Figure: one data-parallel description of FFT-style operations targets three back ends: multi-core SMP (C# bytecode), GPU code (CUDA), and FPGA hardware (VHDL).]
Summary
• Circuits can be modelled as regular parallel programs.
• Automatically transform parallel circuit models into digital circuit implementations.
• Exploit shared memory and message passing idioms for co-design.
• We don’t need to invent a new language:
  – Exploit rich existing knowledge of concurrent programming.
• Apply recent innovations in shape analysis and region types to allow us to compile programs with lists and trees.
• Is there an application for this work at Sanger/EBI?
• More information about Kiwi synthesis at http://research.microsoft.com/~satnams
Synplify Pro FPGA Implementation
First, preliminary result:
• Device: Virtex 5x110T-2
• Static timing: 20 logic layers, Fmax = 78MHz (12.7 ns)
• Utilisation: 3120 Virtex-5 slices, 17% of 17500
• Clock cycles per streaming base: 10
Future parameter exploration:
• QSL (search string query limit): increase to 256 or 512.
• N (search parallelism, number of units): 32 or 64.
• Clocks per cell: reduce to 4 or 2 (channel overheads then dominate).
• Extend Kiwi channels between the four chips on the BEE3 board.