optimizing high speed arithmetic circuits using three-term extraction anup hosangadi ryan kastner...

Optimizing high speed arithmetic circuits using three-term extraction

Anup Hosangadi, Ryan Kastner
ECE Department, University of California, Santa Barbara

Farzan Fallah
Fujitsu Laboratories of America

2

Outline

• Carry Save Arithmetic

• Related Work

• Problem formulation

• Algebraic methods

• Delay aware optimization

• Experimental results

3

Carry Save Arithmetic

• Multi-Operand addition
  • F = A + B + C + D + E + F
  • Carry propagation is the major bottleneck
  • Fast adders (Carry Lookahead Adder (CLA), Carry Select Adders) are not fast enough
• Solution: Defer carry propagation to the final step
  • Generate Sums and Carries separately
  • Treat them as separate numbers
  • Keep adding till only two numbers remain
  • Add the two numbers using a fast adder (CLA)
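The reduction described on this slide can be sketched in a few lines of Python. This is a minimal illustration on non-negative integers, not the authors' implementation; the function names `csa` and `multi_operand_add` are hypothetical, and the order in which operands are grouped is an arbitrary choice.

```python
def csa(a, b, c):
    """3:2 carry-save adder on non-negative ints: returns (sum, carry)."""
    s = a ^ b ^ c                                # bitwise sum, no carry propagation
    carry = ((a & b) | (b & c) | (a & c)) << 1   # majority bits, weighted one position up
    return s, carry


def multi_operand_add(operands):
    """Reduce the operands with CSAs until two numbers remain, then add them.

    Returns (total, number_of_csas_used).
    """
    nums = list(operands)
    used = 0
    while len(nums) > 2:
        a, b, c = nums[:3]
        nums = nums[3:] + list(csa(a, b, c))     # three numbers become two
        used += 1
    # Python's integer addition stands in for the final fast adder (CLA).
    return sum(nums), used


total, used = multi_operand_add([3, 5, 7, 9, 11, 13])
print(total, used)
```

Each CSA removes exactly one number from the list, so N operands need N - 2 CSAs before the final carry-propagating addition.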

4

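Carry Save Arithmetic

[Figure: CSA tree summing six operands A to F. Four CSAs, arranged in three levels, pass their sum (S) and carry (C) outputs downward; a final CLA adds the last two numbers to produce F.]

Tree height = log1.5(N/2), where N = number of operands

Delay = 3 + log2(M + 3)
  3 = height of CSA tree
  M = bitwidth of operands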

5

Carry Save arithmetic

Using Ripple carry adders (RCAs)

[Figure: chain of five RCAs adding the six operands; the adder widths grow from (M+1) to (M+5) bits.]

Delay = (M+5) + 4

Delay thru CSA network = 3 + log2(M + 3)

[Chart: Delay comparison. x-axis: # of operands (2 to 50); y-axis: Delay (full adder delays, 0 to 120); series: RCA, CSA.]

[Chart: Area comparison. x-axis: # Operands; y-axis: Area (full adder units, 0 to 2000); series: RCA, CSA.]
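The two delay expressions can be compared with a rough model. The sketch below generalizes the six-operand figures from the slides to N operands, and that generalization is an assumption: the RCA chain is taken to end in an (M + N - 1)-bit adder after N - 2 earlier stages, and the final CLA is taken to have a delay of log2 of its width, with the width growing by ceil(log2 N) bits. The function names are hypothetical. The tree height is obtained by counting levels directly rather than from the log1.5(N/2) approximation.

```python
import math


def csa_tree_height(n):
    """Levels of CSAs needed to reduce n numbers to 2 (each CSA maps 3 -> 2)."""
    levels = 0
    while n > 2:
        n -= n // 3          # every full group of three shrinks by one number
        levels += 1
    return levels


def rca_chain_delay(m, n):
    """Assumed model: (M + N - 1)-bit final ripple adder plus N - 2 earlier stages."""
    return (m + n - 1) + (n - 2)


def csa_network_delay(m, n):
    """Assumed model: CSA tree height plus log2 of the final CLA width."""
    growth = (n - 1).bit_length()        # ceil(log2(n)) extra result bits
    return csa_tree_height(n) + math.log2(m + growth)


for n in (6, 18, 50):
    print(n, rca_chain_delay(16, n), round(csa_network_delay(16, n), 2))
```

For six operands the model reproduces the slide's values: a tree height of 3, an RCA delay of (M+5) + 4, and a CSA network delay of 3 + log2(M + 3).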

6

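Related Work

• Kim et al., “Arithmetic optimization using Carry Save Adders”, DAC’98

[Figure: a network of two-operand adders (+) computing F from inputs A, B, C, D, E, shown next to an equivalent network in which adders are replaced by CSAs.]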

7

Related Work

• Kim et al., “Optimal allocation of CSAs”, ICCAD’99
  • Delay aware CSA allocation
• Kim et al., “High performance, low power synthesis”, DAC’2000
• Synopsys™ Behavioral optimization for arithmetic (BOA)
• A. Verma and P. Ienne, “Improved use of the carry save representation for the synthesis of complex arithmetic circuits”, ICCAD’2004

Arithmetic Optimizer?

8

Problem formulation

• No methodology for detecting redundancy in CSA computations
  • Can reduce the number of CSAs
  • Can reduce the number of wires
• Common subexpression elimination
  • Standard compiler technique
  • Applied to 2-term arithmetic operations
    – Polynomial expressions (ICCAD’04, VLSI’05)
    – Constant multiplications (ASAP’04, ASPDAC’05)
• CSA expressions (Common 3-term subexpressions)

9

Problem formulation

Y1 = X1 + X1<<2 + X2 + X2<<1 + X2<<2
Y2 = X1<<2 + X2<<2 + X2<<3

D1 = X1 + X2 + X2<<1

Y1 = (D1^S + D1^C) + X1<<2 + X2<<2
Y2 = (D1^S + D1^C)<<2

D1^S and D1^C are the sum and carry outputs of the CSA that computes D1.
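The rewrite in this example can be checked numerically. The sketch below models the shared CSA on non-negative integers and confirms that Y1 equals D1 plus the two remaining terms and that Y2 equals D1 shifted left by 2. The function names are hypothetical and the integer model is an assumption made for illustration.

```python
def csa(a, b, c):
    """3:2 carry-save adder on non-negative ints: returns (sum, carry)."""
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1


def original(x1, x2):
    y1 = x1 + (x1 << 2) + x2 + (x2 << 1) + (x2 << 2)
    y2 = (x1 << 2) + (x2 << 2) + (x2 << 3)
    return y1, y2


def with_shared_divisor(x1, x2):
    # D1 = X1 + X2 + X2<<1, kept in carry-save form as a (sum, carry) pair
    d1_s, d1_c = csa(x1, x2, x2 << 1)
    y1 = (d1_s + d1_c) + (x1 << 2) + (x2 << 2)
    y2 = (d1_s + d1_c) << 2          # Y2 is D1 shifted left by two positions
    return y1, y2


print(original(5, 9), with_shared_divisor(5, 9))
```

Sharing D1 means the CSA that adds X1, X2 and X2<<1 is built once and its outputs are reused, shifted, for Y2.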

10

Algebraic methods

• Polynomial transformation
  • X<<i = XL^i
  • Detects shifted common subexpressions and also extends to multiple variables

C × X = Σ_i (±X × L^i)

(14)_10 × X = (1110)_2 × X = X<<3 + X<<2 + X<<1 = XL^3 + XL^2 + XL
            = (1 0 0 -1 0)_CSD × X = XL^4 - XL
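The constant-multiplication example can be reproduced with a small encoder. The sketch below converts a positive constant into signed shift terms using the standard non-adjacent-form recurrence, which yields the canonical signed digit (CSD) representation. The names `csd_terms` and `apply_terms` are hypothetical, and each (sign, shift) pair stands for one ±X·L^shift term.

```python
def csd_terms(constant):
    """Return [(sign, shift), ...] with constant == sum(sign * 2**shift).

    Uses the non-adjacent form, so no two nonzero digits are adjacent.
    Assumes constant is a positive integer.
    """
    terms = []
    shift = 0
    n = constant
    while n:
        if n & 1:
            digit = 2 - (n % 4)      # +1 when n % 4 == 1, -1 when n % 4 == 3
            terms.append((digit, shift))
            n -= digit               # n is now divisible by 4
        n >>= 1
        shift += 1
    return terms


def apply_terms(terms, x):
    """Evaluate the sum of +/- (x << shift) terms."""
    return sum(sign * (x << shift) for sign, shift in terms)


print(csd_terms(14))                 # terms for X*L^4 - X*L
print(apply_terms(csd_terms(14), 3))
```

For 14 this gives two terms instead of the three in the plain binary expansion, which is why the CSD form is a useful starting point before looking for common subexpressions.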

11

Algebraic methods

• 3-term divisors = All potential common subexpressions
• Divisor generation
  • One for every combination of 3 terms
  • e.g. F1 = X1 + X1L^2 + X2 + X2L + X2L^2
  • d1 = X1L^2 + X2L + X2L^2
  • MinL = L
  • Divisor D1 = d1/L = X1L + X2 + X2L
  • # of divisors = C(N, 3) (N choose 3)
• Theorem:
  • There exists a 3-term common subexpression iff there exists a non-overlapping intersection among the set of 3-term divisors
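Divisor generation as described here can be sketched as follows. Terms are modeled as (variable, shift) pairs, so X1L^2 becomes ("X1", 2); this representation and the function name `three_term_divisors` are assumptions made for illustration. Each combination of three terms is divided by its minimum power of L, which is what lets shifted copies of the same subexpression match.

```python
from itertools import combinations


def three_term_divisors(terms):
    """One normalized divisor per combination of three (variable, shift) terms."""
    divisors = []
    for combo in combinations(terms, 3):
        min_shift = min(shift for _, shift in combo)      # MinL for this combination
        divisors.append(tuple((var, shift - min_shift) for var, shift in combo))
    return divisors


# F1 = X1 + X1L^2 + X2 + X2L + X2L^2
F1 = [("X1", 0), ("X1", 2), ("X2", 0), ("X2", 1), ("X2", 2)]
divs = three_term_divisors(F1)
print(len(divs))
print((("X1", 1), ("X2", 0), ("X2", 1)) in divs)   # D1 = X1L + X2 + X2L
```

With N = 5 terms the function produces C(5, 3) = 10 divisors, matching the count given on the slide.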

12

Algebraic methods

• Greedy Iterative algorithm
  • Extracts the “best” 3-term divisor
  • Rewrites the expressions containing it
  • Terminates when there are no more common subexpressions

F1 = a + b + c + d + e
F2 = a + b + c + d + f

>> D1 = a + b + c

F1 = D1^S + D1^C + d + e
F2 = D1^S + D1^C + d + f

>> D2 = D1^S + D1^C + d

F1 = D2^S + D2^C + e
F2 = D2^S + D2^C + f

13

Algebraic methods

• Algorithm details

Optimize({Pi})
{
    {Pi} = Set of expressions in polynomial form;
    {D} = Set of divisors = φ;

    // Step 1. Creating divisors and their frequency statistics
    for each expression Pi in {Pi}
    {
        {Dnew} = Divisors(Pi);
        Update frequency statistics of divisors in {D};
        {D} = {D} ∪ {Dnew};
    }

    // Step 2. Iterative selection and elimination of best divisor
    while (1)
    {
        Find d = divisor in {D} with most number of non-overlapping intersections;
        if (d == NULL) break;
        Rewrite affected expressions in {Pi} using d;
        Remove divisors in {D} that have become invalid;
        Update frequency statistics of affected divisors;
        {Dnew} = Set of new divisors from new terms added by division;
        {D} = {D} ∪ {Dnew};
    }
}
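A simplified version of this loop can be sketched in Python. It is not the authors' implementation: it ignores shifts, assumes the terms within one expression are distinct, recomputes all divisor frequencies on every iteration instead of updating them incrementally, and breaks ties by picking the lexicographically smallest divisor. The function name and the `D1_S` / `D1_C` naming of sum and carry outputs are hypothetical.

```python
from collections import Counter
from itertools import combinations


def extract_three_term(exprs):
    """Greedily extract 3-term divisors shared by at least two expressions.

    exprs: list of lists of term names. Returns (divisors, rewritten_exprs).
    """
    exprs = [list(e) for e in exprs]
    divisors = {}
    while True:
        # Step 1: count in how many expressions each 3-term combination occurs.
        freq = Counter()
        for terms in exprs:
            for combo in combinations(sorted(set(terms)), 3):
                freq[combo] += 1
        if not freq:
            break
        # Step 2: pick the most frequent divisor (ties: smallest tuple).
        best, count = min(freq.items(), key=lambda kv: (-kv[1], kv[0]))
        if count < 2:
            break                     # nothing is shared any more
        name = f"D{len(divisors) + 1}"
        divisors[name] = best
        # Rewrite every expression containing the divisor: 3 terms become 2.
        for terms in exprs:
            if set(best) <= set(terms):
                for t in best:
                    terms.remove(t)
                terms += [f"{name}_S", f"{name}_C"]
    return divisors, exprs


divs, out = extract_three_term([list("abcde"), list("abcdf")])
print(divs)
print(out)
```

On the two five-term expressions from the earlier slide, the sketch extracts D1 = a + b + c and then a second divisor made of D1's sum and carry outputs plus d, leaving three terms in each expression.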

14

Algebraic methods

• Algorithm complexity
  • M expressions, each with N terms
  • Divisor generation = M × C(N, 3) = O(MN^3)
  • Iterative algorithm, worst case
    – N terms reduced to 2 terms = (N - 2) steps
    – M expressions = O(MN) steps

15

Delay aware optimization

• Sharing subexpressions can increase the total delay
  • Traditional high level synthesis approach: Reduce delay by Tree Height Reduction (THR)
  • Our solution: Control delay during optimization itself
• Optimal delay CSA allocation (T. Kim, J. Um, “Timing driven synthesis”, ASPDAC’2000)
  – Use this to get minimum possible delay

Example (arrival time of each signal in parentheses):

F1 = a(2) + b(0) + c(0) + d(0) + e(0)
F2 = a(2) + b(0) + c(0) + d(0) + f(0)

16

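Delay aware optimization

• Optimal allocation
• Delay ignorant extraction

[Figure: CSA allocation for F1 and F2. In each, a first CSA adds b, c, d (arrival times 0, 0, 0; outputs at 1), a second CSA adds those outputs and e (for F1) or f (for F2) (outputs at 2), and a third CSA adds those outputs and a (arrival time 2; outputs at 3), followed by the final adder (+).]

Delay(F1) = Delay(F2) = 3 + D(Add)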

17

Delay aware extraction

• Control delay during optimization
  • Evaluate each candidate divisor for delay
  • Only consider those divisors that do not increase the delay

F1 = a(2) + b(0) + c(0) + d(0) + e(0)
F2 = a(2) + b(0) + c(0) + d(0) + f(0)

>> D1(3) = a(2) + b(0) + c(0)

F1 = D1^S(3) + D1^C(3) + d(0) + e(0)    Delay = 5 + D(Add)
F2 = D1^S(3) + D1^C(3) + d(0) + f(0)    Delay = 5 + D(Add)

18

Delay aware extraction

• Control delay during optimization
  • Evaluate each candidate divisor for delay
  • Only consider those divisors that do not increase the delay

F1 = a(2) + b(0) + c(0) + d(0) + e(0)
F2 = a(2) + b(0) + c(0) + d(0) + f(0)

>> D2(1) = b(0) + c(0) + d(0)

F1 = D2^S(1) + D2^C(1) + e(0) + a(2)    Delay = 3 + D(Add)
F2 = D2^S(1) + D2^C(1) + f(0) + a(2)    Delay = 3 + D(Add)
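The delay figures for the two candidate divisors can be reproduced with a small model. The sketch assumes a CSA's outputs arrive one unit after its latest input and that the remaining terms are allocated greedily, always feeding the three earliest-arriving signals into the next CSA. Whether this greedy rule matches the cited allocation method in every case is an assumption, and the function names are hypothetical. The final adder delay D(Add) is left out, so the returned value is the CSA part only.

```python
import heapq


def csa_delay(arrivals):
    """Arrival time of the last two signals after greedy CSA allocation."""
    heap = list(arrivals)
    heapq.heapify(heap)
    while len(heap) > 2:
        inputs = [heapq.heappop(heap) for _ in range(3)]   # three earliest signals
        out = max(inputs) + 1                              # one FA delay per CSA
        heapq.heappush(heap, out)                          # sum output
        heapq.heappush(heap, out)                          # carry output
    return max(heap)


def delay_after_extraction(divisor_arrivals, rest_arrivals):
    """Delay of an expression once a fixed 3-term divisor has been extracted."""
    out = max(divisor_arrivals) + 1
    return csa_delay(list(rest_arrivals) + [out, out])


print(csa_delay([2, 0, 0, 0, 0]))                    # no sharing: a(2), b, c, d, e
print(delay_after_extraction([2, 0, 0], [0, 0]))     # divisor a + b + c
print(delay_after_extraction([0, 0, 0], [0, 2]))     # divisor b + c + d
```

A delay aware extractor would run a check like this for each candidate divisor and discard those whose result exceeds the delay obtained without sharing.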

19

Experimental results

• Comparing # of CSAs

[Chart: Comparing # of CSAs. x-axis: Example; y-axis: # CSAs (0 to 250); series: Original, Optimized.]

Average 38.4% reduction

20

Experimental results

• Synthesis for Standard Cell Designs
  • Synopsys™ Design Compiler
  • 0.25 micron library
  • Synthesized for minimum delay

[Chart: Area results. x-axis: Example; y-axis: Area (0 to 1800); series: Series1, Series2.]

Avg 32.7% Area reduction
Avg 3.7% increase in delay

21

Experimental results

• FPGA synthesis
  • Virtex II FPGAs
  • Synthesized designs and performed place & route

[Chart: Reduction in LUTs and slices. x-axis: Examples (H.264, DCT8, IDCT8, 6 tap FIR, 20 tap FIR, 41 tap FIR, Average); y-axis: % Reduction (0 to 40); series: LUTs, Slices.]

Avg 14.1% reduction in # Slices and Avg 12.9% reduction in # LUTs
Avg 5.7% increase in the delay

22

Experimental results

• Evaluate Delay aware extraction algorithm
  • Consider different arrival times of the signals
  • Assume delay dominated by gate delay (FA delay)
  • Only consider best case delay

            # of CSAs                     Delay (FA units)
Example     Delay ignorant  Delay aware   Delay ignorant  Delay aware
H.264       78              79            9               8
DCT8        222             232           14              13
IDCT8       195             201           14              13
FIR6tap     11              15            5               4
FIR20tap    34              45            6               5
FIR41tap    79              91            6               5
Average     103.2           110.5         9               8

Best delay with 15.5% increase in # CSAs (averaged over the per-example increases)

23

Conclusions

• First methodology for common subexpression elimination for Carry Save Arithmetic
• Significant area/power reduction
• Delay aware optimization algorithm also developed
• Can be combined with CSA tree extraction methods for actual application improvement

24

Thank you!!

• Questions?
