
CSE 567 - Autumn 1998 - Misc. Topics - 1

Multiplication

Example

multiplicand:   1 1 0 0   (12)
multiplier:     0 1 0 1   (5)

      1 1 0 0
    0 0 0 0
  1 1 0 0
0 0 0 0
-------------
0 1 1 1 1 0 0   (60)

4 partial products

repeat n times: compute partial product; shift; add

note: each bit of the partial products is just an AND operation
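The 12 x 5 example can be checked with a short Python sketch. The function name `partial_products` is illustrative only (not from the slides); it forms each partial-product bit with an AND, shifts row i left by i positions, and leaves the addition to the caller.

```python
def partial_products(multiplicand, multiplier, n):
    """Return the n shifted partial products of two unsigned n-bit integers."""
    products = []
    for i in range(n):              # one row per multiplier bit
        row = 0
        for j in range(n):          # one AND gate per bit of the row
            bit = ((multiplicand >> j) & 1) & ((multiplier >> i) & 1)
            row |= bit << j
        products.append(row << i)   # row i is shifted left by i positions
    return products


pp = partial_products(0b1100, 0b0101, 4)
product = sum(pp)                   # 12 * 5 = 60
```

Summing the four rows reproduces the result on the slide.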


Sequential Multiplier

one bit of multiplier applied each cycle

2n bit adder

[Figure: datapath with multiplier register x, multiplicand register y, result register z (initialized to 0), and the adder]

z = 0;
repeat n
    if (x[0]) z = z + y;
    x = x >> 1; y = y << 1;
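A minimal Python sketch of this loop, assuming unsigned n-bit operands (x is the multiplier, y the multiplicand, z the result). Python integers are unbounded, so the 2n-bit register width is implicit rather than modeled.

```python
def sequential_multiply(x, y, n):
    """Shift-and-add: one multiplier bit per iteration, multiplicand shifts left."""
    z = 0
    for _ in range(n):
        if x & 1:        # x[0]
            z = z + y    # this addition needs the full 2n-bit adder
        x >>= 1          # expose the next multiplier bit
        y <<= 1          # align the multiplicand with that bit
    return z


result = sequential_multiply(0b0101, 0b1100, 4)   # 5 * 12
```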


Sequential Multiplier (cont’d)

one bit of multiplier applied each cycle

n-bit adder

[Figure: datapath with multiplier register x, multiplicand register y, result register z, and the adder]

z = 0;
repeat n
    if (x[0]) z = z + y * 2^n;
    x = x >> 1; z = z >> 1;
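The n-bit-adder variant can be sketched the same way. Here y stays in place and is added into the upper half of z (the term y * 2^n), after which z shifts right. The sketch assumes unsigned operands and uses a plain integer add in place of the n-bit adder plus carry-out.

```python
def sequential_multiply_nbit_adder(x, y, n):
    """Add y into the upper half of z, then shift z right; repeat n times."""
    z = 0
    for _ in range(n):
        if x & 1:            # x[0]
            z += y << n      # y * 2^n: only the upper n bits of z are touched
        x >>= 1
        z >>= 1              # the result moves toward the low half
    return z


result = sequential_multiply_nbit_adder(0b0101, 0b1100, 4)   # 5 * 12
```

After n iterations, the contribution of multiplier bit i has been shifted right n - i times, so it carries weight 2^i as required.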


Parallelism in hardware

Fine-grained - bit level
    e.g., carry-select, carry-lookahead adder

Pipelining
    same number of functional units
    different latency, but increased throughput
    less work per clock cycle

Coarse-grained - data-path level
    e.g., multiple arithmetic units
    multi-port register files (read/write from different sources/destinations)

Processor level
    difficult to take advantage of many levels of parallelism in fixed general-purpose processors
    much easier when the processors are special-purpose, e.g., systolic computations


Bit level parallelism

Exploit ability to do necessary bit-level computations directly
    exploit redundant logic
    goal - keep all circuits busy, reduce critical path

Examples
    carry-lookahead adder
    carry-select adder
    multipliers


Combinational Multipliers

Use AND gates to generate all partial products in parallel

[Figure: AND-gate array for a 3-bit example with multiplicand 1 1 0; partial-product rows 1 1 0, 1 1 0, 0 0 0; LSBs of multiplier and multiplicand marked]


Combinational Multipliers (cont'd)

Skew array to send partial products along diagonal and make it square

[Figure: the same 3-bit example (multiplicand 1 1 0; partial-product rows 1 1 0, 1 1 0, 0 0 0) drawn as a skewed, square array; LSBs marked]


Ripple-carry adder in each row (carries ripple right to left)

Sums ripple down (shifted one to right)

worst-case delay is 3n

[Figure: array of full adders (inputs A, B, Cin; outputs Cout, S); LSBs marked]


Using Carry-Save

Forward carries to next row of adders

CLA at the end to add last partial product and forwarded carries

no need to optimize carry more than sum

using CLA for final stage makes this faster than previous multiplier (worst-case is 2n)

[Figure: array of full adders (inputs A, B, Cin; outputs Cout, S) with carries forwarded to the next row and a CLA as the final stage; LSBs marked]
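A word-level Python sketch of the carry-save idea, assuming unsigned operands. The names `carry_save_add` and `carry_save_multiply` are illustrative. Each row produces a sum word and a carry word with no carry propagation inside the row; a single carry-propagating add at the end stands in for the CLA.

```python
def carry_save_add(a, b, c):
    """One row of full adders: a + b + c == sum_word + carry_word."""
    sum_word = a ^ b ^ c                           # per-bit sum output S
    carry_word = ((a & b) | (a & c) | (b & c)) << 1  # per-bit Cout, forwarded one column left
    return sum_word, carry_word


def carry_save_multiply(x, y, n):
    s, c = 0, 0
    for i in range(n):
        pp = (y << i) if (x >> i) & 1 else 0       # partial product i
        s, c = carry_save_add(s, c, pp)            # carries go to the next row, not along this one
    return s + c                                   # final carry-propagate add (the CLA in hardware)


result = carry_save_multiply(0b0101, 0b1100, 4)    # 5 * 12
```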


Wallace Tree Multiplier

Use tree structure to reduce number of additions in critical path to O(log n) rather than O(n)

Carry-save adder is a 3-2 adder: three input operands are reduced to two outputs (a sum word and a carry word)

Difficult structure to lay out and integrate with partial product crossbar

Wiring constraints make it unattractive in many technologies

[Figure: tree of carry-save adders (+) reducing partial products PP0 through PP8, followed by a CLA that produces the result; a second drawing shows the partial products with inputs labeled x1 and x2]
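The logarithmic depth can be seen in a small Python sketch. It assumes one simple grouping strategy (take operands three at a time at each level), which is not necessarily the grouping drawn on the slide; `wallace_reduce` is an illustrative name.

```python
def csa(a, b, c):
    """3-2 adder: a + b + c == s + carry."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def wallace_reduce(operands):
    """Reduce a list of operands to at most two; return (operands, levels used)."""
    ops = list(operands)
    levels = 0
    while len(ops) > 2:
        full = len(ops) - len(ops) % 3          # operands that fit into groups of three
        nxt = []
        for i in range(0, full, 3):
            nxt.extend(csa(ops[i], ops[i + 1], ops[i + 2]))
        nxt.extend(ops[full:])                  # leftovers pass through to the next level
        ops = nxt
        levels += 1
    return ops, levels


x, y = 0b101101101, 0b110010011                 # two 9-bit operands, nine partial products
pps = [(y << i) if (x >> i) & 1 else 0 for i in range(9)]
two_operands, levels = wallace_reduce(pps)
product = sum(two_operands)                     # final add would be the CLA
```

With nine partial products the operand count goes 9, 6, 4, 3, 2, so four adder levels sit in front of the final CLA instead of one level per partial product.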


Binary Tree Multipliers

Problem with Wallace tree is 3:2 column reduction
    need 2:1 reduction for binary tree

One solution: signed-digit binary trees
    represent digits as 0, 1, -1
    similar to Booth's encoding

Signed-digit addition (x, y: the two digits in the next lower position):

 digits     carry  sum
  1 + -1      0     0
  1 +  1      1     0
  0 +  0      0     0
 -1 + -1     -1     0
  1 +  0      1    -1    if x >= 0 and y >= 0
              0     1    otherwise
 -1 +  0      0    -1    if x >= 0 and y >= 0
             -1     1    otherwise
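A minimal Python sketch of signed-digit addition with digits 0, 1, -1. It assumes the usual reading of the rule: for each position a carry and an interim sum are chosen from the two digits, and for the cases 1 + 0 and -1 + 0 the choice depends on whether both digits in the next lower position are non-negative. The function names and the LSB-first list representation are illustrative.

```python
def signed_digit_add(a, b):
    """Add two equal-length signed-digit numbers (digits in {-1, 0, 1}, LSB first)."""
    n = len(a)
    carry = [0] * (n + 1)       # carry[i] is the carry INTO position i
    interim = [0] * n
    for i in range(n):
        s = a[i] + b[i]
        # Position 0 has no lower digits; treat them as 0 (non-negative).
        lower_nonneg = i == 0 or (a[i - 1] >= 0 and b[i - 1] >= 0)
        if s == 2:
            carry[i + 1], interim[i] = 1, 0
        elif s == -2:
            carry[i + 1], interim[i] = -1, 0
        elif s == 1:
            carry[i + 1], interim[i] = (1, -1) if lower_nonneg else (0, 1)
        elif s == -1:
            carry[i + 1], interim[i] = (0, -1) if lower_nonneg else (-1, 1)
        # s == 0: carry 0, interim sum 0
    # Each result digit depends only on its own position and the one below it.
    return [interim[i] + carry[i] for i in range(n)] + [carry[n]]


def value(digits):
    return sum(d * 2 ** i for i, d in enumerate(digits))


z = signed_digit_add([1, 0, 1], [1, 1, -1])     # 5 + (-1)
```

Because the carry choice guarantees the incoming carry never pushes a digit outside {-1, 0, 1}, no carry chain forms, which is what allows a 2:1 reduction at every level of a binary tree.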


Booth's Algorithm

Take care of (retire) more than one bit per shift operation

Example: shift two bits at a time

multiplicand                 0 0 1 1 0 1      (13)
multiplier                   1 1 1 0 1 0      (–6)

Booth recoding steps
    one bit at a time        0 0 –1 1 –1 0
    two bits at a time       0 –1 –2

1 1 1 1 1 1 1 0 0 1 1 0      (–2 × 13)
1 1 1 1 1 1 0 0 1 1          (–1 × 13, shifted left two bits)
0 0 0 0 0 0 0 0              ( 0 × 13, shifted left four bits)
-----------------------
1 1 1 1 1 0 1 1 0 0 1 0      (–78)

must be able to add multiplicand (M) times 0, –1, –2, 1, and 2

Booth recoding table

i+1   i   i-1    add
 0    0    0     0*M
 0    0    1     1*M
 0    1    0     1*M
 0    1    1     2*M
 1    0    0    –2*M
 1    0    1    –1*M
 1    1    0    –1*M
 1    1    1     0*M
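A Python sketch of two-bits-at-a-time Booth recoding, assuming an n-bit two's-complement multiplier with n even and an implicit 0 to the right of the LSB. The names `BOOTH_TABLE`, `booth_digits` and `booth_multiply` are illustrative; the table entries are indexed by bits (i+1, i, i-1).

```python
# (bit i+1, bit i, bit i-1) -> multiple of the multiplicand M to add
BOOTH_TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_digits(multiplier, n):
    """Recode an n-bit two's-complement multiplier into n/2 digits, least significant first."""
    bits = [(multiplier >> i) & 1 for i in range(n)]
    digits = []
    prev = 0                                    # implicit bit to the right of the LSB
    for i in range(0, n, 2):
        digits.append(BOOTH_TABLE[(bits[i + 1], bits[i], prev)])
        prev = bits[i + 1]
    return digits


def booth_multiply(multiplicand, multiplier, n):
    product = 0
    for k, d in enumerate(booth_digits(multiplier, n)):
        product += (d * multiplicand) << (2 * k)   # retire two multiplier bits per step
    return product


digits = booth_digits(-6, 6)          # multiplier 1 1 1 0 1 0
product = booth_multiply(13, -6, 6)
```

For the slide's example the digits come out as –2, –1, 0 (least significant first), and the three scaled multiples of 13 sum to –78.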


Register Transfer

Registers have input and output
    output can be fanned out to many destinations
    input can come from many sources
    multiplexer needed on input to select which

[Figure: register with input and output; the output goes to other registers; inputs from other registers pass through a multiplexer; control signals choose the input source]


Connecting Registers

Multiplexers: lots of control signals but full parallelism of transfers

Busses


Pipelining

Adding registers along a path
    split combinational logic into multiple cycles
    each cycle smaller than previously
    T_old C_old > T_new C_new
    increase throughput


Pipelining

Delay, d, of slowest combinational stage determines performance

Throughput = 1/d – rate at which outputs are produced

Latency = n•d – number of stages * clock period

Pipelining increases circuit utilization

Registers slow down data, synchronize data paths

Wave-pipelining
    no pipeline registers - waves of data flow through circuit
    relies on equal-delay circuit paths - no short paths
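The throughput and latency formulas can be tried on made-up numbers. In this Python sketch the stage delays and the `register_overhead` parameter are hypothetical, chosen only to illustrate that the slowest stage sets the clock period d.

```python
def pipeline_metrics(stage_delays, register_overhead=0.0):
    """Clock period, throughput (1/d) and latency (n*d) of a registered pipeline."""
    d = max(stage_delays) + register_overhead   # slowest stage determines performance
    return {
        "clock_period": d,
        "throughput": 1.0 / d,                  # outputs per unit time
        "latency": len(stage_delays) * d,       # number of stages * clock period
    }


unpipelined = pipeline_metrics([32])            # one block of combinational logic
pipelined = pipeline_metrics([10, 14, 8])       # same logic split unevenly into 3 stages
```

With these numbers the clock period drops from 32 to 14 while the latency grows from 32 to 42, because every stage is stretched to the length of the slowest one.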


When and How to Pipeline?

Where is the best place to add registers?
    splitting combinational logic
    overhead of registers (propagation delay and setup time requirements)

What about cycles in data path?

Example: 16-bit adder, add 8-bits in each of two cycles
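One way to read the 16-bit adder example, sketched in Python under the assumption that the low bytes are added in the first cycle and the high bytes in the second, with the carry held in a register between them. The register names are illustrative.

```python
def add16_two_cycles(a, b):
    """Add two 16-bit values with an 8-bit adder used in two consecutive cycles."""
    mask8 = 0xFF
    # Cycle 1: low bytes; the carry-out is latched for the next cycle.
    low = (a & mask8) + (b & mask8)
    low_reg = low & mask8
    carry_reg = low >> 8
    # Cycle 2: high bytes plus the latched carry.
    high = ((a >> 8) & mask8) + ((b >> 8) & mask8) + carry_reg
    total = ((high & mask8) << 8) | low_reg
    carry_out = high >> 8
    return total, carry_out


total, carry_out = add16_two_cycles(0x12FF, 0x0001)
```

The carry register is a dependency from one cycle to the next: the second half cannot start until the first half has produced its carry.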


Retiming

Process of optimally distributing registers throughout a circuit
    minimize the clock period
    minimize the number of registers


Retiming (cont’d)

Fast optimal algorithm (Leiserson & Saxe 1983)

Retiming rules:
    remove one register from each input and add one to each output
    remove one register from each output and add one to each input


Optimal Pipelining

Add registers - use retiming to find optimal location

[Figure: circuit drawn twice, with blocks labeled 8, 7, 13, 10 and 5, 6]


Example - Digital Correlator

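y_t = δ(x_t, a_0) + δ(x_{t-1}, a_1) + δ(x_{t-2}, a_2) + δ(x_{t-3}, a_3)

δ(x_t, a_0) = 0 if x ≠ a, 1 otherwise (and passes x along to the right)

[Figure: host feeds x_t into a row of comparators holding a0 a1 a2 a3; a chain of adders (+) returns y_t to the host]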
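A behavioral Python sketch of the correlator output, where each term is 1 when the sample matches the pattern element and 0 otherwise. It assumes that samples before t = 0 count as non-matching, which the slides do not specify; `correlate` is an illustrative name.

```python
def correlate(xs, a):
    """y_t = number of k for which x_(t-k) equals a_k."""
    ys = []
    for t in range(len(xs)):
        y = 0
        for k, a_k in enumerate(a):
            # Assumption: samples before the start of the stream never match.
            if t - k >= 0 and xs[t - k] == a_k:
                y += 1
        ys.append(y)
    return ys


ys = correlate([1, 0, 1, 1], [1, 1, 0, 1])
```

The last output is 4 because at t = 3 all four of the most recent samples line up with a0 through a3.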


Example - Digital Correlator (cont’d)

Delays: adder, 7; comparator, 3; host, 0

[Figure: correlator before retiming, cycle time = 24]

[Figure: correlator after retiming, cycle time = 13]
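The two cycle times can be reproduced with a small graph model in Python. The circuit topology and the retiming labels below are assumptions (the figure did not survive): they follow the correlator example in Leiserson and Saxe's retiming paper, with four comparators in a row and a chain of three adders. Each edge carries a register count; the clock period is the longest path through edges with zero registers.

```python
DELAY = {"host": 0, "c0": 3, "c1": 3, "c2": 3, "c3": 3,
         "add1": 7, "add2": 7, "add3": 7}

# (source, destination, registers on the edge) -- assumed topology
EDGES = [
    ("host", "c0", 1), ("c0", "c1", 1), ("c1", "c2", 1), ("c2", "c3", 1),
    ("c3", "add3", 0), ("c2", "add3", 0), ("add3", "add2", 0),
    ("c1", "add2", 0), ("add2", "add1", 0), ("c0", "add1", 0),
    ("add1", "host", 0),
]

# Assumed retiming labels: registers moved across each node (negative = toward its outputs)
LAG = {"host": 0, "c0": -1, "c1": -1, "c2": -2, "c3": -2,
       "add3": -2, "add2": -1, "add1": 0}


def retime(edges, lag):
    """New register count on edge u -> v is w + lag[v] - lag[u]."""
    return [(u, v, w + lag[v] - lag[u]) for u, v, w in edges]


def clock_period(edges, delay):
    """Longest delay along register-free paths (assumes no register-free cycle)."""
    zero = {}
    for u, v, w in edges:
        if w == 0:
            zero.setdefault(u, []).append(v)
    memo = {}

    def longest_from(u):
        if u not in memo:
            memo[u] = delay[u] + max((longest_from(v) for v in zero.get(u, [])), default=0)
        return memo[u]

    return max(longest_from(u) for u in delay)


before = clock_period(EDGES, DELAY)
retimed_edges = retime(EDGES, LAG)
after = clock_period(retimed_edges, DELAY)
```

Under these assumptions the original critical path is one comparator followed by three adders (3 + 7 + 7 + 7 = 24), and after retiming it is two comparators and one adder (3 + 3 + 7 = 13).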


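Pipelined Multipliers

Pipelining can be applied to any of the combinational multipliers

FF at every intersection of pipe stage and wire

[Figure: two multiplier structures, an adder array and an adder tree (+), each ending in a CLA, with pipeline stage boundaries drawn across them]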


Example - Sorting

Comparator

[Figure: comparator with inputs A, B and outputs H, L]

Parallel Sorter

[Figure: parallel sorter built from comparators]


Example - Sorting (cont’d)

Pipelined


Pipelined Sorter (cont’d)


Better Sorter


Sequential Sorter


Systolic Arrays

Set of identical processing elements
    specialized or programmable

Efficient nearest-neighbor interconnections (in 1-D, 2-D, other)

SIMD-like

Multiple data flows, converging to engage in computation

Analogy: data flowing through the system in a rhythmic fashion, from main memory through a series of processing elements and back to main memory


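Example - Convolution

y_j = x_j w_1 + x_{j+1} w_2 + . . . + x_{j+n-1} w_n

y_1 = x_1 w_1 + x_2 w_2 + x_3 w_3 + x_4 w_4

y_2 = x_2 w_1 + x_3 w_2 + x_4 w_3 + x_5 w_4

y_3 = x_3 w_1 + x_4 w_2 + x_5 w_3 + x_6 w_4

. . . .

[Figure: linear array of cells holding w4 w3 w2 w1; the x stream (- x3 - x2 - x1) and the y stream (- - - y1 - y2 - y3 -) pass through the cells]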


Example - Convolution (cont’d)

[Figure: successive snapshots of the x stream (x6 – x5 – x4 – x3 – x2 – x1) and the y stream (– – – y1 – y2 – y3) advancing one step per frame through the cells w4 w3 w2 w1]
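A cycle-by-cycle Python sketch of one systolic design for this convolution. It assumes the weights stay in the cells (stored left to right as w_n ... w_1), the x values enter from the left and move right, the y values enter from the right and move left, and consecutive items in each stream are one empty slot apart. These directions and timings are assumptions read off the stream patterns, not stated on the slides.

```python
def systolic_convolution(x, w):
    """Return y_j = sum_k w[k] * x[j + k] (0-indexed) using a linear systolic array."""
    n = len(w)
    m = len(x) - n + 1                  # number of complete outputs
    if m <= 0:
        return []
    cell_w = w[::-1]                    # leftmost cell holds the last weight
    x_reg = [None] * n                  # None marks an empty slot (bubble)
    y_reg = [None] * n
    out = []
    t = 0
    while len(out) < m:
        if y_reg[0] is not None:        # a finished y leaves the leftmost cell
            out.append(y_reg[0])
        # New inputs: x on even cycles; y on every second cycle starting at cycle n - 1.
        x_in = x[t // 2] if t % 2 == 0 and t // 2 < len(x) else None
        j, odd = divmod(t - (n - 1), 2)
        y_in = 0 if t >= n - 1 and odd == 0 and j < m else None
        x_reg = [x_in] + x_reg[:-1]     # x stream shifts right
        y_reg = y_reg[1:] + [y_in]      # y stream shifts left
        for c in range(n):              # every cell works in the same cycle
            if x_reg[c] is not None and y_reg[c] is not None:
                y_reg[c] += cell_w[c] * x_reg[c]
        t += 1
    return out


ys = systolic_convolution([1, 2, 3, 4, 5, 6], [1, 2, 3, 4])
```

Each y accumulates one product per cell as it meets successive x values, so a cell only ever talks to its two neighbors.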


Example: Matrix Multiplication

C = A B        c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}

c11 c12 c13 c14

c21 c22 c23 c24

c31 c32 c33 c34

c41 c42 c43 c44

Example: Matrix Multiplication (cont’d)

                              |   |   |   b44
                              |   |   b43 b34
                              |   b42 b33 b24
                              b41 b32 b23 b14
                              b31 b22 b13 |
                              b21 b12 |   |
                              b11 |   |   |

–   –   –   a14 a13 a12 a11   c11 c12 c13 c14
–   –   a24 a23 a22 a21 –     c21 c22 c23 c24
–   a34 a33 a32 a31 –   –     c31 c32 c33 c34
a44 a43 a42 a41 –   –   –     c41 c42 c43 c44
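A Python sketch of the skewed data flow, assuming rows of A enter from the left (row i delayed by i cycles), columns of B enter from the top (column j delayed by j cycles), and each cell keeps its own c_ij. Bubbles are modeled as zeros; `systolic_matmul` is an illustrative name.

```python
def systolic_matmul(a, b):
    """Multiply two n x n matrices on an n x n array of multiply-accumulate cells."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    a_reg = [[0] * n for _ in range(n)]       # a values currently held in each cell
    b_reg = [[0] * n for _ in range(n)]       # b values currently held in each cell
    for t in range(3 * n - 2):                # last product reaches cell (n-1, n-1) at t = 3n - 3
        for i in range(n):                    # a stream: shift right, inject at column 0
            k = t - i
            a_in = a[i][k] if 0 <= k < n else 0
            a_reg[i] = [a_in] + a_reg[i][:-1]
        # b stream: shift down, inject at row 0
        b_in = [b[t - j][j] if 0 <= t - j < n else 0 for j in range(n)]
        b_reg = [b_in] + b_reg[:-1]
        for i in range(n):                    # all cells accumulate in the same cycle
            for j in range(n):
                c[i][j] += a_reg[i][j] * b_reg[i][j]
    return c


c = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The skew makes a_ik and b_kj arrive at cell (i, j) in the same cycle, t = i + j + k, so no cell needs more than its left and top neighbors.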


Systolic Computers

Warp (CMU) - 1987
    linear array of 10 or more processing cells
    optimized inter-cell communication for low-latency
    pipelined cells and communication
    conditional execution
    compiler partitions problem into cells and generates microcode

i-Warp (Intel) - 1990
    successor to Warp
    two-dimensional array
    time-multiplexing of physical busses between cells
    32x32 array has 20 Gflops peak performance
