1
Synthesizing Datapath Circuits for FPGAs With Emphasis on
Area Minimization
Andy Ye, David Lewis, Jonathan Rose
Department of Electrical and Computer Engineering, University of Toronto
{yeandy, lewis, jayar}@eecg.utoronto.ca
2
Motivation: Datapath Regularity
• Larger FPGAs
– Larger applications on FPGAs
– More datapath logic in larger applications
– Datapath logic is highly regular
• Utilize regularity to improve logic density
3
Utilizing Datapath Regularity
• A new datapath-oriented FPGA
• New CAD tools supporting the new FPGA
– Synthesis
– Packing
– Placement
– Routing
• This talk focuses on synthesis
4
Background: Datapath-oriented FPGA
• Architected to utilize datapath regularity
• Architectural features
– Capture regularity using special logic blocks
– Increase logic density by coarse grain routing
5
Background: FPGA Overview
[Figure: array of logic clusters (L) and switch boxes (S) joined by routing channels; each channel contains coarse grain routing tracks and fine grain routing tracks]
6
Background: Logic Cluster
[Figure: a logic cluster containing four subclusters (Subcluster 1 to Subcluster 4); a subcluster containing BLEs and a local routing network; a Basic Logic Element (BLE) containing a LUT, a DFF, and a MUX]
7
Background: FPGA Overview
[Figure: array of logic clusters (L) and switch boxes (S) joined by routing channels; each channel contains coarse grain routing tracks and fine grain routing tracks]
8
Background: Coarse Grain Routing Tracks
[Figure: a logic cluster with four sub-clusters next to a switch box; blocks labeled M connect the cluster to the coarse grain routing and the fine grain routing]
9
Datapath Synthesis
• Synthesis
– The first step in a fully automated CAD flow
– Transforms high level descriptions into logic
• Conventional synthesis (flat synthesis)
– Minimizes area and delay metrics
– Destroys datapath regularity
• Datapath synthesis
– Preserves datapath regularity
– Supports downstream CAD tools
10
Datapath Representation
• Datapath circuits are represented by netlists of datapath components (VHDL or Verilog)
• Datapath component library
– Multiplexers
– Adders/subtracters
– Shifters
– Comparators
– Registers
• Each component consists of identical bit-slices
11
Hard Boundary Hierarchical Synthesis
• Optimize within the boundaries of bit-slices
• Keep identical bit-slices identical
• Optimized 15 datapath circuits from the Pico-java processor using Synopsys [sun]
– Good regularity
– Poor area: 38% area inflation
• The FPGA architecture aims to increase logic density
– Needs a better synthesis tool
12
Causes of Area Inflation
• Examined circuits to determine the causes
• Constraint of preserving bit-slice boundaries
– Common sub-expressions exist across bit-slices
– Harder to discover in datapath synthesis
• Constraint of preserving datapath regularity
– Identical bit-slices have different external connections
– Some bit-slices have more optimization opportunities
– Optimization opportunities are missed if all bit-slices must be kept identical
13
Enhanced Module Compaction
[Figure: flow chart. Netlist of Datapath Components -> Word-level Optimization -> Module Compaction -> Bit-slice I/O Optimization -> Flat Synthesis & Optimization Within Bit-slice Boundaries -> Netlist of Synthesized Bit-slices. Legend: Manual Operation]
14
Word-level Optimization
• Currently done manually; will be automated
• Optimizes across bit-slice boundaries
• Uses the functionality of each datapath component to create optimization opportunities
• Two optimizations are performed
– Multiplexer tree collapsing
– Operation reordering
• More in the future
15
Multiplexer Tree Collapsing
• Datapath circuits contain multiplexers in a tree topology
• Collapses several multiplexers in a multiplexer tree into a single multiplexer
• Collapsing operation creates common sub-expressions
• Extracts common expressions out of multiple bit-slices to save area
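The collapsing idea can be illustrated with a small Python sketch. It is not the authors' tool: the three-input tree shape, the one-hot enable encoding, and all function names are assumptions made for illustration. It shows that a two-level tree of 2:1 multiplexers matches a single collapsed multiplexer whose enable signals depend only on the select lines, so that decode logic is a common expression that can be computed once and shared by every bit-slice.

```python
from itertools import product


def mux2(sel, x0, x1):
    """2:1 multiplexer: returns x1 when sel is set, otherwise x0."""
    return x1 if sel else x0


def mux_tree(s1, s2, a, b, c):
    """Two-level tree (assumed shape): the inner mux picks a or b, the outer picks that or c."""
    return mux2(s2, mux2(s1, a, b), c)


def decode_selects(s1, s2):
    """Select decode of the collapsed mux; independent of the data bits."""
    en_a = (not s2) and (not s1)
    en_b = (not s2) and bool(s1)
    en_c = bool(s2)
    return en_a, en_b, en_c


def collapsed_mux(enables, a, b, c):
    """Single 3-input mux driven by one-hot enables (AND-OR form)."""
    en_a, en_b, en_c = enables
    return (a if en_a else 0) | (b if en_b else 0) | (c if en_c else 0)


def check_equivalence(width=2):
    """Exhaustively compare the tree and the collapsed mux on width-bit words."""
    values = range(1 << width)
    for s1, s2 in product((0, 1), repeat=2):
        enables = decode_selects(s1, s2)  # computed once per word, not per bit
        for a, b, c in product(values, repeat=3):
            if mux_tree(s1, s2, a, b, c) != collapsed_mux(enables, a, b, c):
                return False
    return True
```

Calling `check_equivalence()` compares both forms over all select and data combinations for the chosen word width.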
16
An Example
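[Figure: multiplexer tree collapsing example with multiplexers mux1 and mux2, select signals S1 and S2, signals A and R, flip-flops (FF), and random logic (rl)]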
17
Operation Reordering
• Transforms result selection into operand selection
• Accepts the transformation if it results in a smaller area
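A minimal Python sketch of the transformation, assuming an adder as the operation and a select value of 1 choosing the second operand pair; the function names are hypothetical. Result selection computes both sums and then selects one, while operand selection selects the operands first and adds once. The check confirms that the two forms agree for every input.

```python
from itertools import product


def result_selection(s, a, b, c, d, width):
    """Two adders followed by one mux: e = s ? (c + d) : (a + b)."""
    mask = (1 << width) - 1
    sum_ab = (a + b) & mask
    sum_cd = (c + d) & mask
    return sum_cd if s else sum_ab


def operand_selection(s, a, b, c, d, width):
    """Two muxes followed by one adder: e = (s ? c : a) + (s ? d : b)."""
    mask = (1 << width) - 1
    x = c if s else a
    y = d if s else b
    return (x + y) & mask


def equivalent(width=3):
    """Exhaustively compare both forms on width-bit operands."""
    values = range(1 << width)
    return all(
        result_selection(s, a, b, c, d, width)
        == operand_selection(s, a, b, c, d, width)
        for s in (0, 1)
        for a, b, c, d in product(values, repeat=4)
    )
```

The reordered form trades one adder for one extra multiplexer. Whether that saves LUTs depends on how the logic maps after synthesis, which is why the transformation is accepted only when the resulting area is smaller.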
18
An Example
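[Figure: operation reordering example. Before: two adders compute a + b and c + d, and a multiplexer controlled by s selects the result e. After: multiplexers controlled by s select between a and c and between b and d, and a single adder produces e. Bit-slice views show the sum and carry logic of bit 0 (a0, b0, c0, d0, s0, e0, cin0, cout0) for both versions]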
19
Module Compaction
• Merges bit-slices into larger bit-slices
• Based on connectivity between datapath components
• Larger bit-slices have more optimization opportunities for flat synthesis
• Avoids merging based on carry chains
• Similar to the algorithm proposed by Koch
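The merging step can be sketched as follows. This is a simplified illustration, not Koch's algorithm or the authors' implementation: the edge labels (`"data"`, `"carry"`), the function name, and the union-find formulation are all assumptions. Bit-slices joined by ordinary connections are merged into one larger bit-slice, and connections marked as carry chains are skipped so that the slices of an adder are not chained into one block.

```python
def compact(slices, edges):
    """Group bit-slices connected by non-carry edges.

    slices: list of bit-slice names
    edges:  list of (source, sink, kind) tuples, kind is "data" or "carry"
    Returns a sorted list of merged groups (each group is a sorted list).
    """
    parent = {s: s for s in slices}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for src, dst, kind in edges:
        if kind == "carry":
            continue  # never merge along a carry chain
        parent[find(src)] = find(dst)

    groups = {}
    for s in slices:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical netlist: four mux slices feeding five full-adder slices.
SLICES = [f"mux{i}" for i in range(4)] + [f"FA{i}" for i in range(5)]
EDGES = [(f"mux{i}", f"FA{i}", "data") for i in range(4)]
EDGES += [(f"FA{i}", f"FA{i + 1}", "carry") for i in range(4)]
```

With this assumed netlist, `compact(SLICES, EDGES)` pairs each mux slice with the adder slice it drives and leaves the last adder slice on its own.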
20
An Example
[Figure: module compaction example with bit-slices mux0 to mux3 and FA0 to FA4]
21
Bit-slice I/O Optimization
• Granularity of bit-slice I/O optimization, m
• Breaks datapath components into m-bit wide chunks
• The m bit-slices in each chunk are kept identical to each other
• Allows some bit-slices in a datapath component to be optimized more than others
22
Bit-slice I/O Optimization
• Converts bit-slice I/O signals into internal signals if all m bit-slices meet an optimization criterion
• More optimization opportunities for flat synthesis
• Four types of I/O optimizations
– Constant absorption
– Feedback absorption
– Duplicated input absorption
– Unused output absorption
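As a rough sketch of how the granularity m interacts with one of these optimizations, the Python below checks constant absorption for a single input port. The criterion used here (every bit-slice in an m-bit chunk sees the same constant on that input) is an assumption for illustration, as are the data layout and names; the slides do not spell out the exact test.

```python
CONSTANTS = {"0", "1"}


def absorbable_chunks(drivers, m):
    """Return one flag per m-bit chunk of a datapath component.

    drivers: per-bit-slice driver of one input port, either a constant
             ("0" or "1") or a net name.
    A chunk is flagged True when all of its bit-slices see the same constant,
    so the input can become internal while the slices stay identical.
    """
    flags = []
    for start in range(0, len(drivers), m):
        chunk = drivers[start:start + m]
        same_constant = chunk[0] in CONSTANTS and all(d == chunk[0] for d in chunk)
        flags.append(same_constant)
    return flags


# Hypothetical 8-bit component: bit 4 is driven by a net, the rest by constant 0.
DRIVERS = ["0", "0", "0", "0", "n4", "0", "0", "0"]
```

For this example, m = 4 lets only the lower chunk absorb the constant, m = 1 lets seven of the eight slices absorb it, and m = 8 allows none, which mirrors the trade-off between regularity and area.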
23
Experimental Results
• Fifteen benchmark circuits
– From the Pico-java processor
– Synthesized into 4-LUTs and DFFs
• Experiments
– Area
– Regularity
– Area against m (the granularity of bit-slice I/O optimization)
24
Area
• m (granularity of bit-slice I/O optimization) = 4
• Compare datapath synthesis with flat synthesis
25
Post-synthesis Area (LUT Count)
Circuit        Flat Synthesis Area   Datapath Synthesis Area   Area Inflation
icu_dpath      3120                  3235                      3.7%
ex_dpath       2530                  2553                      0.91%
multmod_dp     1558                  1634                      4.9%
ucode_dat      1243                  1304                      4.9%
imdr_dpath     1182                  1219                      3.1%
dcu_dpath      960                   966                       0.63%
mantissa_dp    846                   878                       3.8%
incmod_dp      779                   865                       11%
smu_dpath      490                   493                       0.61%
exponent_dp    477                   501                       5.0%
pipe_dpath     443                   471                       6.3%
prils_dp       377                   388                       2.9%
rsadd_dp       346                   305                       -12%
code_seq_dp    218                   223                       2.3%
ucode_reg      78                    82                        5.1%
Total Area     14647                 15117                     3.2%
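The inflation column can be reproduced from the two LUT counts. The short Python sketch below assumes area inflation is the relative increase of the datapath synthesis area over the flat synthesis area, which matches the reported figures; the dictionary and function names are illustrative only.

```python
# (flat synthesis LUTs, datapath synthesis LUTs) per circuit, from the table.
AREA = {
    "icu_dpath": (3120, 3235), "ex_dpath": (2530, 2553),
    "multmod_dp": (1558, 1634), "ucode_dat": (1243, 1304),
    "imdr_dpath": (1182, 1219), "dcu_dpath": (960, 966),
    "mantissa_dp": (846, 878), "incmod_dp": (779, 865),
    "smu_dpath": (490, 493), "exponent_dp": (477, 501),
    "pipe_dpath": (443, 471), "prils_dp": (377, 388),
    "rsadd_dp": (346, 305), "code_seq_dp": (218, 223),
    "ucode_reg": (78, 82),
}


def inflation_percent(flat, datapath):
    """Relative area increase of datapath synthesis over flat synthesis, in percent."""
    return 100.0 * (datapath - flat) / flat


def total_inflation(area):
    """Inflation computed on the summed LUT counts of all circuits."""
    flat_total = sum(flat for flat, _ in area.values())
    datapath_total = sum(dp for _, dp in area.values())
    return inflation_percent(flat_total, datapath_total)
```

Note that the total is computed on the summed areas rather than as an average of the per-circuit percentages, so large circuits such as icu_dpath weigh more heavily.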
26
Regularity
• m (granularity of bit-slice I/O optimization) = 4
• Two terminal connections captured by
– 4-bit wide buses
– 4-bit wide control groups
27
Regularity
[Figure: bit-slices S1 to S4 connected by a 4-bit wide bus and by a 4-bit wide control group]
28
Regularity Results
Circuit        Two Terminal Connections   4-bit Wide Buses   4-bit Wide Control Groups
dcu_dpath      2232                       49%                43%
ex_dpath       6547                       52%                39%
icu_dpath      8047                       47%                36%
imdr_dpath     3100                       50%                36%
pipe_dpath     1049                       48%                42%
smu_dpath      1167                       48%                25%
ucode_data     3143                       52%                41%
ucode_reg      194                        72%                21%
code_seq_dp    799                        58%                18%
exponent_dp    1362                       32%                23%
incmod_dp      2013                       42%                33%
mantissa_dp    2533                       47%                36%
multmod_dp     3380                       39%                25%
prils_dp       864                        41%                32%
rsadd_dp       722                        52%                27%
Total          37152                      48%                35%
• 94% of LUTs remain in regular datapath components
29
Granularity (m) Vs. Area
• Higher m (the granularity of bit-slice I/O optimization)
– Keeps more bit-slices identical
– Preserves more regularity
– Incurs higher area cost
30
Granularity Vs. Area Inflation
[Figure: plot of area inflation (%, axis from 0 to 8) against granularity m (1, 4, 8, 12, 16, 20, 24, 28, 32)]
31
Conclusion
• Presented a datapath-oriented FPGA architecture
• Presented an enhanced module compaction algorithm
• Empirically demonstrated the area efficiency of the algorithm
– 3%-8% area inflation
• Good regularity
– 48% of two terminal connections are in 4-bit wide buses
– 35% of two terminal connections are in 4-bit wide control groups