1 of 21
C is for Circuits: Capturing FPGA Circuits as Sequential Code for Portability
Scott Sirowy*, Greg Stitt‡, Frank Vahid*†
This work was supported in part by the National Science Foundation and the Semiconductor Research Corporation
*Department of Computer Science and Engineering
University of California, Riverside
{ssirowy,vahid}@cs.ucr.edu
†Also with the Center for Embedded Computer Systems at UC Irvine
‡Department of Electrical and Computer Engineering
University of Florida
2 of 21
“C is for Circuits” vs. High-Level Synthesis
Designer captures spatial algorithm as custom circuit
[Figure: merge-sort circuit. N unsorted inputs pass through Split and Merge stages producing sorted runs of 1, 2, 4, …]
Designer captures application with temporal algorithm
quicksort(array, left, right) {
    if right > left:
        pivot = array[left]
        newpivot = partition(array, left, right, pivot)
        quicksort(array, left, newpivot - 1)
        quicksort(array, newpivot + 1, right)
}
Synthesis
?
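The temporal quicksort pseudocode on this slide can be made concrete as the following C sketch. The slide does not define partition, so the version here (a simple scan that uses array[left] as the pivot) and the swap helper are assumptions made only so the example compiles.

```c
#include <assert.h>

/* Swap two array elements. */
static void swap(int *a, int *b) {
    int t = *a;
    *a = *b;
    *b = t;
}

/* Hypothetical partition helper (not defined on the slide): places the pivot
   value array[left] at its final sorted position and returns that index. */
static int partition(int array[], int left, int right, int pivot) {
    assert(left <= right);
    int store = left;
    for (int k = left + 1; k <= right; k++) {
        if (array[k] < pivot) {
            store++;
            swap(&array[store], &array[k]);
        }
    }
    swap(&array[left], &array[store]);
    return store;
}

/* Temporal algorithm: recursive, data-dependent control flow. */
void quicksort(int array[], int left, int right) {
    if (right > left) {
        int pivot = array[left];
        int newpivot = partition(array, left, right, pivot);
        quicksort(array, left, newpivot - 1);
        quicksort(array, newpivot + 1, right);
    }
}
```

Calling quicksort(v, 0, n - 1) sorts an n-element array v in place. The recursion depth and the branch pattern depend on the data, which is what makes this form "temporal" rather than a fixed arrangement of Split and Merge units.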
3 of 21
“C is for Circuits” vs. High-Level Synthesis
Queue 1_1, 1_2, 2_1, 2_2, 4_s, 4_us;
Split(16_u.dequeue, 16_u.dequeue, 1_1, 1_2);
stage1 = Merge(1_1.dequeue, 1_2.dequeue);
Split(16_u.dequeue, 16_u.dequeue);
stage1 += Merge(1_1.dequeue, 1_2.dequeue);
Split(stage1, 2_1, 2_2);
stage2 = Merge(2_1, 2_2);
Split(16_u.dequeue, 16_u.dequeue);
stage1 = Merge(1_1.dequeue, 1_2.dequeue);
Split(16_u.dequeue, 16_u.dequeue);
stage1 += Merge(1_1.dequeue, 1_2.dequeue);
Split(stage1);
stage2 += Merge(2_1, 2_2);
Split(stage2, 4_1, 4_2);
…
Capture in temporal language
Synthesis
4 of 21
Goal: Portable Circuit Distribution Format
Current circuit distribution method: bitstreams
  Tightly coupled to a specific device
[Figure: application conceptualized and captured as circuit (16 unsorted inputs pass through Split and Merge stages producing sorted runs of 1, 2, 4, and 8, then 16 sorted), shown as bitstreams next to target platforms with differing mixes of FPGA fabric, processors (Proc.), and memory (MEM)]
5 of 21
Goal: Portable Circuit Distribution Format
Current circuit distribution method: RTL
  Good across multiple FPGA devices
  But requires resynthesis/mapping
  May not use FPGA resources most effectively
    Loop unrolling, memory mapping, hard-core use, …
Entity Circuit port( … );
Architecture of … Begin … End arch;
[Figure: application conceptualized and captured as circuit (16 unsorted inputs pass through Split and Merge stages producing sorted runs of 1, 2, 4, and 8, then 16 sorted), shown as RTL next to target platforms with differing mixes of FPGA fabric, processors (Proc.), and memory (MEM)]
6 of 21
#include <foo.h>
int main() {
    float pi = 3.141;
    while (1) { … }
}
Goal: Portable Circuit Distribution Format
Higher abstraction: C code (or any sequential language)
  Can yield more effective resource usage
  Could even run on platforms with no FPGA
  But also requires resynthesis/mapping
[Figure: application conceptualized and captured as circuit (16 unsorted inputs pass through Split and Merge stages producing sorted runs of 1, 2, 4, and 8, then 16 sorted), shown next to target platforms with differing mixes of FPGA fabric, processors (Proc.), and memory (MEM), including processor-only platforms]
7 of 21
Problem: Many FPGA Applications Captured “Spatially” as Circuits, not C
Designer captures spatial algorithm as custom circuit for max performance
[Figure: merge-sort circuit. N unsorted inputs pass through Split and Merge stages producing sorted runs of 1, 2, 4, …]
Circuits in FCCM | Year
3D Vector Normalization | 2001
Regular Expression | 2001
RC4 | 2002
Gaussian Noise Gen. | 2003
Molecular Dynamics | 2004
Particle Graphics | 2005
Shortest Path | 2006
70 custom circuits in FCCM’01-’06 alone
8 of 21
Queue 1_1, 1_2, 2_1, 2_2, 4_s, 4_us;
Split(16_u.dequeue, 16_u.dequeue, 1_1, 1_2);
stage1 = Merge(1_1.dequeue, 1_2.dequeue);
Split(16_u.dequeue, 16_u.dequeue);
stage1 += Merge(1_1.dequeue, 1_2.dequeue);
Split(stage1, 2_1, 2_2);
stage2 = Merge(2_1, 2_2);
Split(16_u.dequeue, 16_u.dequeue);
stage1 = Merge(1_1.dequeue, 1_2.dequeue);
Split(16_u.dequeue, 16_u.dequeue);
stage1 += Merge(1_1.dequeue, 1_2.dequeue);
Split(stage1);
stage2 += Merge(2_1, 2_2);
Split(stage2, 4_1, 4_2);
…
Capturing Circuit-Level Designs in C
[Figure: merge-sort circuit. N unsorted inputs pass through Split and Merge stages producing sorted runs of 1, 2, 4, …]
Can designers’ circuits be reverse-engineered to some form of C code, from which “standard” synthesis tools will synthesize the original circuit?
Synthesis
Designer captures spatial algorithm as custom circuit for max performance
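One way to picture "spatial" C for the Split/Merge sorter is a bottom-up merge sort in which each pass of the outer loop corresponds to one stage of the circuit (runs of 1, then 2, 4, 8, 16). This is only a sketch under assumptions: the names merge_runs and merge_network_sort, the fixed size N = 16, and the use of a scratch buffer are all illustrative and do not come from the slides.

```c
#include <assert.h>
#include <string.h>

#define N 16  /* matches the 16-input example; must be a power of two */

/* Merge adjacent sorted runs src[lo..mid) and src[mid..hi) into dst[lo..hi). */
static void merge_runs(const int *src, int *dst, int lo, int mid, int hi) {
    int a = lo, b = mid, out = lo;
    while (a < mid && b < hi)
        dst[out++] = (src[b] < src[a]) ? src[b++] : src[a++];
    while (a < mid)
        dst[out++] = src[a++];
    while (b < hi)
        dst[out++] = src[b++];
}

/* One outer iteration per circuit stage: run width 1 -> 2 -> 4 -> 8 -> 16. */
void merge_network_sort(int data[N]) {
    int buf[N];
    assert(data != NULL);
    for (int width = 1; width < N; width *= 2) {
        /* Every pair of runs at this width is independent of the others,
           so a synthesis tool could build them as parallel merge units. */
        for (int lo = 0; lo < N; lo += 2 * width)
            merge_runs(data, buf, lo, lo + width, lo + 2 * width);
        memcpy(data, buf, sizeof buf);
    }
}
```

Unlike the recursive quicksort, the loop bounds here are fixed at compile time, so the structure of the computation does not depend on the data being sorted.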
9 of 21
Previous Work
Convert existing sequential algorithms to circuits
  Diniz, Eles, Frigo, Henkel, Najjar, Srinivasan, Stitt, etc.
Coding guidelines for synthesis
  Stitt CODES/ISSS 2006
Reverse engineering techniques
  Doom, Hanson et al.
Languages that encapsulate spatial and temporal concepts
  SystemC, StreamsC, etc.
10 of 21
Study Methodology
Chose pseudo-random subset of all applicable FPGA circuit designs from past six years of FCCM (Field Programmable Custom Computing Machines)
Attempted to capture circuit with high-level C such that a “standard” synthesis tool would output the original circuit
11 of 21
Study Methodology
[Figure: study flow. 1., 2. Capture circuit in C code? 3. “Standard” synthesis, with stages labeled CDFG creation, CDFG analysis, optimizations/scheduling, resource allocation, VHDL creation]
“Standard” HLS tool
  Manually performed
  Optimizations applied in same order for every application
    1. Function Inlining
    2. Loop Unrolling
    3. Predication
    4. Constant Propagation
    5. Dead Code Elimination
    6. Code Hoisting
    7. Pipeline Analysis
Each circuit either
  Re-derivable from C
  Not re-derivable from C
Re-derivable
  Temporal C (the “natural” algorithm)
  Spatial C (reflecting the circuit)
Not re-derivable
  Might still be possible
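As a rough illustration of what the first three optimizations do when applied by hand, the sketch below shows a small kernel before and after function inlining, loop unrolling, and predication. The kernel itself (clamping four samples and summing them) and all function names are hypothetical; they are not taken from any of the studied circuits.

```c
#include <assert.h>

/* Hypothetical helper: clamp negative samples to zero. */
static int clamp_nonneg(int x) {
    if (x < 0)
        return 0;
    return x;
}

/* Natural form: a loop that calls a helper containing a branch. */
int sum_clamped(const int in[4]) {
    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += clamp_nonneg(in[i]);
    return sum;
}

/* Same computation after inlining the helper, fully unrolling the loop,
   and replacing each branch with a select (predication). */
int sum_clamped_flat(const int in[4]) {
    int c0 = (in[0] < 0) ? 0 : in[0];
    int c1 = (in[1] < 0) ? 0 : in[1];
    int c2 = (in[2] < 0) ? 0 : in[2];
    int c3 = (in[3] < 0) ? 0 : in[3];
    return (c0 + c1) + (c2 + c3);
}

/* Both forms must agree on every input. */
void check_equivalence(const int in[4]) {
    assert(sum_clamped(in) == sum_clamped_flat(in));
}
```

The flattened form maps naturally onto hardware: four comparators with multiplexers feeding a two-level adder tree, with no loop counter or call overhead left to schedule.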
12 of 21
Gaussian Noise Generator (FCCM 2003, Lee et al.)
[Figure: original circuit with Stage1 through Stage4: linear feedback shift registers producing u1 and u2, function units f(u1), g1(u2), g2(u2), two multipliers producing x1 and x2, and two adders]
13 of 21
Gaussian Noise Generator (FCCM 2003, Lee et al.)
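#include <math.h>
#include <stdlib.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif
#define NUM_SAMPLES 1024   /* assumed buffer size; not given on the slide */

/* Per-stage result types (assumed; the slide does not show them) */
typedef struct { float u1, u2; } Stage1;
typedef struct { float x1, x2; } Stage2;
typedef struct { float x1, x2; } Stage3;

float noise[NUM_SAMPLES];   /* output buffer */

/* Uniform sample in [0,1); stands in for the circuit's linear feedback shift registers */
static inline float rand0_1(void) { return rand() / ((float) RAND_MAX + 1); }

/* Stage 1: draw two uniform samples */
static inline Stage1 doStage1(void) {
    Stage1 result;
    result.u1 = rand0_1(); result.u2 = rand0_1();
    return result;
}

/* Stage 2: evaluate f(u1), g1(u2), g2(u2) and multiply */
static inline Stage2 doStage2(float u1, float u2) {
    Stage2 result;
    float f_u1, g1_u2, g2_u2;
    f_u1 = sqrt(-log(u1)); g1_u2 = sin(2 * M_PI * u2); g2_u2 = cos(2 * M_PI * u2);
    result.x1 = f_u1 * g1_u2; result.x2 = f_u1 * g2_u2;
    return result;
}

/* Stage 3: add each sample to the previous one, held in a static register */
static inline Stage3 doStage3(float x1, float x2) {
    static float acc1 = 0.0, acc2 = 0.0;
    Stage3 result;
    result.x1 = acc1 + x1; result.x2 = acc2 + x2;
    acc1 = x1; acc2 = x2;
    return result;
}

/* Stage 4: store the two samples in the output buffer */
static inline void doStage4(int i, int j, float x1, float x2) {
    noise[i] = x1; noise[j] = x2;
}

int main() {
    Stage1 stage1; Stage2 stage2; Stage3 stage3;
    unsigned int i = 0;
    while (1) {   /* runs forever, like the circuit */
        stage1 = doStage1();
        stage2 = doStage2(stage1.u1, stage1.u2);
        stage3 = doStage3(stage2.x1, stage2.x2);
        doStage4(i, (i + 1) % NUM_SAMPLES, stage3.x1, stage3.x2);
        i = (i + 2) % NUM_SAMPLES;
    }
    return 1;
}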
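[Figure: original circuit with Stage1 through Stage4: linear feedback shift registers producing u1 and u2, function units f(u1), g1(u2), g2(u2), two multipliers producing x1 and x2, and two adders]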
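The C capture uses rand() where the circuit uses linear feedback shift registers. For readers unfamiliar with LFSRs, here is a minimal sketch of a 16-bit Fibonacci LFSR in C. The width, the tap positions (16, 14, 13, 11, a commonly cited maximal-length choice), and the function names are assumptions for illustration; the slides do not give the configuration used in the original design.

```c
#include <assert.h>
#include <stdint.h>

/* Advance a 16-bit Fibonacci LFSR by one step.
   Taps at bit positions 16, 14, 13, 11 (assumed; not from the slides). */
uint16_t lfsr16_step(uint16_t state) {
    uint16_t bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1u;
    return (uint16_t)((state >> 1) | (bit << 15));
}

/* Step the LFSR and map the new state to a float strictly between 0 and 1.
   The state must be non-zero: an all-zero LFSR never leaves zero. */
float lfsr16_uniform(uint16_t *state) {
    assert(*state != 0);
    *state = lfsr16_step(*state);
    return *state / 65536.0f;
}
```

Because the state never reaches zero, lfsr16_uniform never returns 0, which also keeps log(u1) finite if it is used in place of rand0_1() in a stage like doStage1().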
14 of 21
Gaussian Noise Generator (FCCM 2003, Lee et al.)
[Figure: CDFG creation/analysis. Data-flow graphs for doStage1() (two rand() calls producing u1, u2), doStage2() (f(u1), g1(u2), g2(u2) and two multipliers producing x1, x2), doStage3() (two adders with acc1, acc2), and doStage4() (x1, x2 written to noise[i], noise[j])]
[Figure: scheduling/resource allocation. doStage1() as an LFSR, doStage2() with f(u1), g1(u2), g2(u2) and two multipliers, doStage3() with two adders and acc1, acc2, doStage4() writing noise[] through sel]
15 of 21
Gaussian Noise Generator (FCCM 2003, Lee et al.)
[Figure: circuit from “standard” synthesis. main() containing doStage1() through doStage4(): LFSR, f(u1), g1(u2), g2(u2), two multipliers, two adders with acc1 and acc2, and sel]
16 of 21
Gaussian Noise Generator (FCCM 2003, Lee et al.)
If (nearly) the same: “re-derivable from C”
[Figure: the original circuit (Stage1 through Stage4: linear feedback shift registers, f(u1), g1(u2), g2(u2), multipliers, adders) shown beside the circuit from “standard” synthesis (main() containing doStage1() through doStage4(): LFSR, f(u1), g1(u2), g2(u2), multipliers, adders with acc1 and acc2, sel)]
17 of 21
Results
Year of Publication | Design | Re-derivable from C | Method/Reason
2001 | 3D Vec. Normalization | Yes | Spatial, if online algorithms can be specified
2001 | Efficient CAM | No | Uses dynamic FPGA routing
2001 | Automated Sensor | Yes | Temporal, floating point -> fixed point
2001 | Regular Expression | Yes | Spatial, creative connections of one-bit flip flops
2002 | Hyperspectral Image | Yes | Spatial, data reordering
2002 | Machine Vision | Yes | Spatial, custom pipelining
2002 | RC4 | Yes | Temporal, straightforward implementation
2002 | Set Covering | Yes | Spatial, data structures for easy hw implementation
2002 | Template Matching | Yes | Spatial, heavy modifications to original algorithm
2002 | Triangle Mesh | Yes | Spatial, custom encoding scheme
2003 | Congruential Sieves | Yes | Temporal, straightforward translation
2003 | Content Scanning | Yes | Temporal
2003 | F.P and Square Root | Yes | Spatial
2003 | Gaussian Noise | Yes | Spatial, requires the use of spatial C constructs
2003 | TRNG | No | Requires sampling a high frequency clock for noise
2004 | 3D FDTD Method | Yes | Spatial
2004 | Deep Packet Filter | No | Requires knowledge of underlying FPGA
2004 | Online Floating Point | No | Online algorithm, variable length buffers
2004 | Molecular Dynamics | Yes | Spatial
2004 | Pattern Matching | Yes | Spatial
2004 | Seismic Migration | Yes | Spatial
2004 | Software Deceleration | No | Use a uP for its cache
2004 | V.M Window | No | Specific timing schemes implemented
2005 | Data Mining | Yes | Spatial
2005 | Cell Automata | Yes | Temporal
2005 | Particle Graphics | Yes | Spatial
2005 | Radiosity | Yes | Temporal
2005 | Transient Waves | Yes | Spatial
2005 | Road Traffic | Yes | Temporal
2006 | All Pairs Shortest Path | Yes | Spatial
2006 | Apriori Data Mining | Yes | Spatial
2006 | Molecular Dynamics | Yes | Spatial, define separate memories, custom pipeline
2006 | Gaussian Elimination | Yes | Spatial
2006 | Radiation Dose | Yes | Temporal
2006 | Random Variates | Yes | Spatial
82% of the circuit designs were re-derivable from C
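The headline figure can be checked against the rows of the results table: 35 designs are listed and 6 are marked "No", leaving 29 re-derivable designs, and 29/35 is about 82.9%, which the slide reports as 82%. The small C sketch below just reproduces that arithmetic; the constant and function names are made up for the example.

```c
#include <assert.h>

/* Counts read off the results table: 35 designs, 6 marked "No". */
enum { TOTAL_DESIGNS = 35, NOT_REDERIVABLE = 6 };

/* Number of designs marked "Yes". */
int rederivable_count(void) {
    return TOTAL_DESIGNS - NOT_REDERIVABLE;
}

/* Whole-number percentage, truncated toward zero. */
int percent_truncated(int part, int whole) {
    assert(whole > 0);
    return (part * 100) / whole;
}
```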
18 of 21
Results: Performance Comparison
[Figure: bar chart of execution time (axis 0 to 5), custom vs. synthesized circuits, with bars labeled Float, MD, CLA-E C, Noise, MD`, Traffic, Elimination, and Average, grouped as re-derivable from C and not re-derivable from C]
Chart annotations:
  We couldn’t describe in C to re-derive same circuit
  Used separate on-board memories
  Similar or identical performance
19 of 21
Results: Area Comparison
[Figure: bar chart of area (axis 0 to 2), custom vs. synthesized circuits]
Extra area due to added multiplexors or registers, none of which significantly altered behavior of the circuit
20 of 21
Conclusion
Designers continue to conceptualize/capture some FPGA applications “spatially” as circuits
  Despite the increasing availability of C-based synthesis tools
For 35 FCCM circuits studied, 82% were re-derivable from some form of C
Distributing a circuit using C code expands the range of target platforms and the longevity of an application
  Compared to a netlist or RTL distribution
Future work
  Using C as part of a standard binary for FPGA
21 of 21
Sponsors
This presentation brought to you by the letters
And viewers like you…
SFN