Post on 19-Dec-2015
Optimizing high speed arithmetic circuits using three-term extraction
Anup Hosangadi, Ryan Kastner (ECE Department, University of California, Santa Barbara)
Farzan Fallah (Fujitsu Laboratories of America)
2
Outline
• Carry Save Arithmetic
• Related Work
• Problem formulation
• Algebraic methods
• Delay aware optimization
• Experimental results
3
Carry Save Arithmetic
• Multi-Operand addition
• F = A + B + C + D + E + F
• Carry propagation major bottleneck
• Fast adders: Carry Lookahead Adder (CLA), Carry Select Adders, not fast enough
• Solution: Eliminate Carry propagation to the final step
• Generate Sums and Carries separately
• Treat them as separate numbers
• Keep adding till only two numbers remain
• Add the numbers using fast adder (CLA)
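The steps above can be sketched in a few lines of Python. This is a minimal illustration on non-negative integers, not the circuit itself; the function names `csa` and `multi_operand_add` are made up for this example, and the built-in `sum` stands in for the final CLA.

```python
def csa(a, b, c):
    """3:2 carry-save adder: returns (s, carry) with a + b + c == s + carry."""
    s = a ^ b ^ c                                # per-bit sum, no carry propagation
    carry = ((a & b) | (b & c) | (a & c)) << 1   # per-bit majority, moved one position left
    return s, carry


def multi_operand_add(operands):
    """Reduce the operands with CSAs until two numbers remain, then add them."""
    nums = list(operands)
    while len(nums) > 2:
        a, b, c = nums[:3]
        nums = nums[3:] + list(csa(a, b, c))     # three numbers in, two numbers out
    return sum(nums)                             # stands in for the final carry-propagate adder


print(multi_operand_add([3, 5, 7, 11, 13, 17]))
```

Every CSA step turns three numbers into two without propagating a carry, so the only carry-propagate addition is the last one.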
4
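Carry Save Arithmetic
[Figure: CSA tree adding six operands A, B, C, D, E, F. Four CSAs arranged in three levels produce a sum (S) and a carry (C) vector, which a final CLA adds to give F]
Delay = 3 + log2(M + 3)
3 = height of CSA tree
M = bitwidth of operands
Tree height = log1.5(N/2), for N operands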
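To see where the tree height comes from, the sketch below counts CSA levels directly and compares the count with the log1.5(N/2) expression. It assumes each level uses as many CSAs as possible and passes leftover operands through unchanged; `csa_levels` and `height_estimate` are names invented for this illustration.

```python
import math


def csa_levels(n):
    """Count CSA levels needed to reduce n operands to 2."""
    levels = 0
    while n > 2:
        n = 2 * (n // 3) + n % 3   # each group of 3 becomes 2; leftovers pass through
        levels += 1
    return levels


def height_estimate(n):
    """Closed-form estimate log1.5(n / 2) from the slide."""
    return math.log(n / 2, 1.5)


for n in (3, 4, 6):
    print(n, csa_levels(n), round(height_estimate(n), 2))
```

For the six-operand example the count is 3 levels, matching the figure. The closed form is an estimate, so for some operand counts the rounded value and the exact count can differ.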
5
Carry Save arithmetic
Using Ripple carry adders (RCAs)
[Figure: chain of five RCAs adding six operands; adder widths grow from (M+1) to (M+5) bits]
Delay = (M+5) + 4
Delay thru CSA network = 3 + log1.5(M + 3)
[Plot: Delay comparison. x-axis: # of operands (2 to 50); y-axis: Delay (full adder delays), 0 to 120; series: RCA, CSA]
[Plot: Area comparison. x-axis: # Operands; y-axis: Area (full adder units), 0 to 2000; series: RCA, CSA]
6
Related Work
• Kim et al., “Arithmetic optimization using Carry Save Adders”, DAC’98
[Figure: an expression F built from a chain of adders over operands A, B, C, D, E, next to its implementation using CSAs followed by adders]
7
Related Work
• Kim et al., “Optimal allocation of CSAs”, ICCAD’99
• Delay aware CSA allocation
• Kim et al., “High performance, low power synthesis”, DAC’2000
• Synopsys™ Behavioral optimization for arithmetic (BOA)
• A. Verma and P. Ienne, “Improved use of the carry save representation for the synthesis of complex arithmetic circuits”, ICCAD’2004
Arithmetic Optimizer?
8
Problem formulation
• No methodology for detecting redundancy in CSA computations
• Can reduce the number of CSAs
• Can reduce the number of wires
• Common subexpression elimination
• Standard compiler technique
• Applied to 2-term arithmetic operations
– Polynomial expressions (ICCAD’04, VLSI’05)
– Constant multiplications (ASAP’04, ASPDAC’05)
• CSA expressions (Common 3-term subexpressions)
9
Problem formulation
Y1 = X1 + X1<<2 + X2 + X2<<1 + X2<<2
Y2 = X1<<2 + X2<<2 + X2<<3
D1 = X1 + X2 + X2<<1
(D1^S and D1^C are the sum and carry outputs of the CSA that computes D1)
Y1 = (D1^S + D1^C) + X1<<2 + X2<<2
Y2 = (D1^S + D1^C)<<2
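A quick numeric check of this example, written as a sketch on non-negative integers. The helper names are invented here. Note that Y2 is D1 shifted left by two, so the shared sum has to be shifted before it is used for Y2.

```python
def csa(a, b, c):
    """3:2 carry-save adder: a + b + c == s + carry."""
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1


def direct(x1, x2):
    """Y1 and Y2 computed straight from their definitions."""
    y1 = x1 + (x1 << 2) + x2 + (x2 << 1) + (x2 << 2)
    y2 = (x1 << 2) + (x2 << 2) + (x2 << 3)
    return y1, y2


def shared(x1, x2):
    """Y1 and Y2 computed from the shared divisor D1 = X1 + X2 + X2<<1."""
    d1_s, d1_c = csa(x1, x2, x2 << 1)
    y1 = (d1_s + d1_c) + (x1 << 2) + (x2 << 2)
    y2 = (d1_s + d1_c) << 2          # Y2 = D1 << 2
    return y1, y2


print(direct(5, 9), shared(5, 9))
```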
10
Algebraic methods
• Polynomial transformation
• X<<i = X·L^i
• Detects shifted common subexpressions and also extends to multiple variables
C × X = Σ_i (±X × L^i)
(14)10 × X = (1110)2 × X = X<<3 + X<<2 + X<<1 = XL^3 + XL^2 + XL^1
= (1 0 0 -1 0)CSD × X = XL^4 - XL^1
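The constant-multiplication example can be reproduced with a short sketch. Each term is represented as a `(sign, shift)` pair, so `(1, 3)` means +XL^3; this representation and the function names are assumptions made for the illustration. The CSD digits are derived with the standard non-adjacent-form recurrence.

```python
def binary_terms(c):
    """Shift terms of c*X taken from the plain binary digits of c."""
    return [(1, i) for i in range(c.bit_length()) if (c >> i) & 1]


def csd_terms(c):
    """Shift terms of c*X taken from the canonical signed digit form of c > 0."""
    terms, i = [], 0
    while c:
        if c & 1:
            d = 2 - (c % 4)      # +1 when c % 4 == 1, -1 when c % 4 == 3
            terms.append((d, i))
            c -= d
        c >>= 1
        i += 1
    return terms


def evaluate(terms, x):
    """Evaluate the sum of signed shifted copies of x."""
    return sum(sign * (x << shift) for sign, shift in terms)


print(binary_terms(14))   # terms for XL^1 + XL^2 + XL^3
print(csd_terms(14))      # terms for -XL^1 + XL^4
```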
11
Algebraic methods
• 3-term divisors = All potential common subexpressions
• Divisor generation
• One for every combination of 3 terms
• e.g. F1 = X1 + X1L^2 + X2 + X2L + X2L^2
• d1 = X1L^2 + X2L + X2L^2
• MinL = L
• Divisor D1 = d1/L = X1L + X2 + X2L
• # of divisors = C(N, 3), i.e. N choose 3
• Theorem: There exists a 3-term common subexpression iff there exists a non-overlapping intersection among the set of 3-term divisors
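Divisor generation as described here can be sketched as follows. A term such as X2L^2 is represented as the pair `("X2", 2)`; that encoding and the name `divisors` are assumptions for this example.

```python
from itertools import combinations


def divisors(terms):
    """Return one normalized 3-term divisor per combination of 3 terms."""
    result = []
    for combo in combinations(terms, 3):
        min_shift = min(shift for _, shift in combo)   # MinL for this combination
        result.append(tuple((var, shift - min_shift) for var, shift in combo))
    return result


# F1 = X1 + X1L^2 + X2 + X2L + X2L^2
F1 = [("X1", 0), ("X1", 2), ("X2", 0), ("X2", 1), ("X2", 2)]

for d in divisors(F1):
    print(d)
```

With five terms this yields C(5, 3) = 10 divisors, one of which is X1L + X2 + X2L, obtained from d1 after dividing by L.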
12
Algebraic methods
• Greedy Iterative algorithm
• Extracts the “best” 3-term divisor
• Rewrites the expressions containing it
• Terminates when there are no more common subexpressions
F1 = a + b + c + d + e
F2 = a + b + c + d + f
>> D1 = a + b + c
F1 = D1^S + D1^C + d + e
F2 = D1^S + D1^C + d + f
>> D2 = D1^S + D1^C + d
F1 = D2^S + D2^C + e
F2 = D2^S + D2^C + f
13
Algebraic methods
• Algorithm details
Optimize ({Pi})
{
    {Pi} = Set of expressions in polynomial form;
    {D} = Set of divisors = φ;
    // Step 1. Creating divisors and their frequency statistics
    for each expression Pi in {Pi}
    {
        {Dnew} = Divisors(Pi);
        Update frequency statistics of divisors in {D};
        {D} = {D} ∪ {Dnew};
    }
    // Step 2. Iterative selection and elimination of best divisor
    while (1)
    {
        Find d = divisor in {D} with most number of non-overlapping intersections;
        if (d == NULL) break;
        Rewrite affected expressions in {Pi} using d;
        Remove divisors in {D} that have become invalid;
        Update frequency statistics of affected divisors;
        {Dnew} = Set of new divisors from new terms added by division;
        {D} = {D} ∪ {Dnew};
    }
}
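A much simplified Python sketch of the greedy loop follows. It is not the authors' implementation: it ignores shifts, treats each expression as a set of distinct terms, recomputes all divisors on every iteration instead of updating them incrementally, and breaks ties alphabetically. The name `extract_three_term` and the `D1S`/`D1C` naming of sum and carry outputs are assumptions for this example.

```python
from collections import Counter
from itertools import combinations


def extract_three_term(exprs):
    """Greedily extract 3-term divisors shared by at least two expressions."""
    exprs = {name: set(terms) for name, terms in exprs.items()}
    found = {}
    while True:
        counts = Counter()
        for terms in exprs.values():
            for combo in combinations(sorted(terms), 3):
                counts[combo] += 1
        shared = [(n, combo) for combo, n in counts.items() if n >= 2]
        if not shared:
            break                                   # no common subexpression left
        best_n = max(n for n, _ in shared)
        best = min(combo for n, combo in shared if n == best_n)
        name = f"D{len(found) + 1}"
        found[name] = best
        for terms in exprs.values():
            if terms.issuperset(best):
                terms.difference_update(best)       # remove the three extracted terms
                terms.update({name + "S", name + "C"})  # add the CSA sum and carry
    return found, exprs


found, result = extract_three_term({"F1": "abcde", "F2": "abcdf"})
print(found)
print(result)
```

On the five-term example from the greedy algorithm slide, this extracts a + b + c first and then the divisor formed by its sum, its carry and d, leaving three terms in each expression.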
14
Algebraic methods
• Algorithm complexity
• M expressions, each with N terms
• Divisor generation = M × C(N, 3) = O(MN^3)
• Iterative algorithm, worst case
– N terms reduced to 2 terms = (N - 2) steps
– M expressions = O(MN) steps
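Written out, the two bounds follow from the binomial coefficient and from the fact that each extraction replaces three terms by a sum and a carry, so an expression shrinks by one term per step:

```latex
\[
  M\binom{N}{3} \;=\; M\,\frac{N(N-1)(N-2)}{6} \;=\; O(MN^{3}),
  \qquad
  \underbrace{(N-2)}_{\text{steps per expression}} \times M \;=\; O(MN).
\]
```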
15
Delay aware optimization
• Sharing subexpressions can increase the total delay
• Traditional high level synthesis approach: Reduce delay by Tree Height Reduction (THR)
• Our solution: Control delay during optimization itself
• Optimal delay CSA allocation (T. Kim, J. Um, “Timing driven synthesis”, ASPDAC’2000)
– Use this to get minimum possible delay
F1 = a(2) + b(0) + c(0) + d(0) + e(0)
F2 = a(2) + b(0) + c(0) + d(0) + f(0)
(the number in parentheses is the arrival time of each signal)
16
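Delay aware optimization
• Optimal allocation
• Delay ignorant extraction
[Figure: CSA trees for F1 and F2. In each tree the first CSA adds b, c, d (all arriving at time 0) and its outputs are ready at time 1; the second CSA adds those outputs and e (f for F2), with outputs ready at time 2; the third CSA adds those outputs and a (arriving at time 2), with outputs ready at time 3; a final adder produces the result]
Delay(F1) = Delay(F2) = 3 + D(Add)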
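The delay of 3 CSA levels can be reproduced with a small sketch. It assumes a greedy allocation that always feeds the three earliest-arriving signals into the next CSA and charges one unit of delay per CSA; this is an assumption chosen because it reproduces the numbers on the slide, not a transcription of the cited allocation algorithm. `csa_delay` is a made-up name.

```python
import heapq


def csa_delay(arrivals):
    """Time at which the last two signals are ready, excluding the final adder."""
    heap = list(arrivals)
    heapq.heapify(heap)
    while len(heap) > 2:
        # Combine the three earliest signals; outputs appear one unit after the latest input.
        latest = max(heapq.heappop(heap) for _ in range(3))
        heapq.heappush(heap, latest + 1)
        heapq.heappush(heap, latest + 1)
    return max(heap)


# F1 = a(2) + b(0) + c(0) + d(0) + e(0)
print(csa_delay([2, 0, 0, 0, 0]))
```

The late signal a is consumed only by the last CSA, which is why it does not lengthen the tree.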
17
Delay aware extraction
• Control delay during optimization
• Evaluate each candidate divisor for delay
• Only consider those divisors that do not increase the delay
F1 = a(2) + b(0) + c(0) + d(0) + e(0)
F2 = a(2) + b(0) + c(0) + d(0) + f(0)
>> D1(3) = a(2) + b(0) + c(0)
F1 = D1^S(3) + D1^C(3) + d(0) + e(0)    Delay = 5 + D(Add)
F2 = D1^S(3) + D1^C(3) + d(0) + f(0)    Delay = 5 + D(Add)
18
Delay aware extraction
• Control delay during optimization
• Evaluate each candidate divisor for delay
• Only consider those divisors that do not increase the delay
F1 = a(2) + b(0) + c(0) + d(0) + e(0)
F2 = a(2) + b(0) + c(0) + d(0) + f(0)
>> D2(1) = b(0) + c(0) + d(0)
F1 = D2^S(1) + D2^C(1) + e(0) + a(2)    Delay = 3 + D(Add)
F2 = D2^S(1) + D2^C(1) + f(0) + a(2)    Delay = 3 + D(Add)
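The filtering rule can be sketched for this two-expression example. The code below enumerates every 3-term divisor common to F1 and F2, estimates the resulting delay with an earliest-first CSA allocation (an assumption, one delay unit per CSA), and keeps only the divisors that do not exceed the delay without sharing. All function names are invented for this illustration.

```python
import heapq
from itertools import combinations


def csa_delay(arrivals):
    """CSA tree delay when the three earliest signals are always combined first."""
    heap = list(arrivals)
    heapq.heapify(heap)
    while len(heap) > 2:
        latest = max(heapq.heappop(heap) for _ in range(3))
        heapq.heappush(heap, latest + 1)
        heapq.heappush(heap, latest + 1)
    return max(heap)


def admissible_divisors(exprs, arrival):
    """Common 3-term divisors that do not increase the worst-case delay."""
    baseline = max(csa_delay([arrival[t] for t in terms]) for terms in exprs)
    common = set.intersection(*(set(terms) for terms in exprs))
    keep = []
    for div in combinations(sorted(common), 3):
        ready = max(arrival[t] for t in div) + 1   # when the divisor's sum and carry are ready
        worst = max(
            csa_delay([ready, ready] + [arrival[t] for t in terms if t not in div])
            for terms in exprs
        )
        if worst <= baseline:
            keep.append(div)
    return keep


arrival = {"a": 2, "b": 0, "c": 0, "d": 0, "e": 0, "f": 0}
F1, F2 = ["a", "b", "c", "d", "e"], ["a", "b", "c", "d", "f"]
print(admissible_divisors([F1, F2], arrival))
```

Every divisor containing the late signal a pushes the delay to 5, so only b + c + d survives the filter.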
19
Experimental results
• Comparing # of CSAs
[Bar chart: Comparing # of CSAs. x-axis: Example; y-axis: # CSAs, 0 to 250; series: Original, Optimized]
Average 38.4% reduction
20
Experimental results
• Synthesis for Standard Cell Designs
• Synopsys™ Design Compiler
• 0.25 micron library
• Synthesized for minimum delay
[Bar chart: Area results. x-axis: Example; y-axis: Area, 0 to 1800; series labelled Series1 and Series2]
Avg 32.7% Area reduction
Avg 3.7% increase in delay
21
Experimental results
• FPGA synthesis
• Virtex II FPGAs
• Synthesized designs and performed place & route
[Bar chart: Reduction in LUTs and slices. x-axis: Examples (H.264, DCT8, IDCT8, 6 tap FIR, 20 tap FIR, 41 tap FIR, Average); y-axis: % Reduction, 0 to 40; series: LUTs, Slices]
Avg 14.1% reduction in # Slices and Avg 12.9% reduction in # LUTs
Avg 5.7% increase in the delay
22
Experimental results
• Evaluate Delay aware extraction algorithm
• Consider different arrival times of the signals
• Assume delay dominated by gate delay (FA delay)
• Only consider best case delay

Example     # of CSAs                        Delay (FA units)
            Delay ignorant   Delay aware     Delay ignorant   Delay aware
H.264       78               79              9                8
DCT8        222              232             14               13
IDCT8       195              201             14               13
FIR6tap     11               15              5                4
FIR20tap    34               45              6                5
FIR41tap    79               91              6                5
Average     103.2            110.5           9                8

Best delay with 15.5% increase in #CSAs (average of the per-example increases)
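The summary figures can be recomputed from the table rows. The sketch below uses only the numbers listed above; it shows that 15.5% corresponds to the mean of the per-example increases in CSA count, whereas the ratio of the two column averages would give a smaller figure of about 7%.

```python
# (CSAs delay-ignorant, CSAs delay-aware, delay ignorant, delay aware)
rows = {
    "H.264":    (78, 79, 9, 8),
    "DCT8":     (222, 232, 14, 13),
    "IDCT8":    (195, 201, 14, 13),
    "FIR6tap":  (11, 15, 5, 4),
    "FIR20tap": (34, 45, 6, 5),
    "FIR41tap": (79, 91, 6, 5),
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


avg_ignorant = mean(r[0] for r in rows.values())
avg_aware = mean(r[1] for r in rows.values())
avg_increase = mean((r[1] - r[0]) / r[0] * 100 for r in rows.values())
ratio_increase = (avg_aware - avg_ignorant) / avg_ignorant * 100

print(round(avg_ignorant, 1), round(avg_aware, 1))
print(round(avg_increase, 1), round(ratio_increase, 1))
```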
23
Conclusions
• First methodology for common subexpression elimination for Carry Save Arithmetic
• Significant area/power reduction
• Delay aware optimization algorithm also developed
• Can be combined with CSA tree extraction methods for actual application improvement
24
Thank you!!
• Questions?