CSE 567 - Autumn 1998 - Misc. Topics - 1
Multiplication
Example
    multiplicand:        1 1 0 0   (12)
    multiplier:          0 1 0 1   (5)
                         -------
                         1 1 0 0
                       0 0 0 0
                     1 1 0 0
                   0 0 0 0
                   -------------
                   0 1 1 1 1 0 0   (60)
4 partial products
repeat n times: compute partial product; shift; add
note: each bit of the partial products is just an AND operation
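The note that each partial-product bit is just an AND can be sketched in Python. This is only an illustration: the function name `partial_products` and the width parameter `n` are assumptions, not names from the slides.

```python
def partial_products(multiplicand, multiplier, n):
    """Return the n shifted partial products of two n-bit unsigned values."""
    rows = []
    for i in range(n):
        m_bit = (multiplier >> i) & 1
        # Each bit of the row is multiplicand bit j ANDed with multiplier bit i.
        row_bits = [((multiplicand >> j) & 1) & m_bit for j in range(n)]
        row_value = sum(bit << j for j, bit in enumerate(row_bits))
        # Row i is shifted left by i positions before it is added.
        rows.append(row_value << i)
    return rows


rows = partial_products(0b1100, 0b0101, 4)
product = sum(rows)
```

For the 12 x 5 example, the four rows are 12, 0, 48, and 0, which sum to 60.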
Sequential Multiplier
one bit of multiplier applied each cycle
2n-bit adder
[Figure: datapath with multiplier register x, multiplicand register y, adder, and result register z, initially 0]
    z = 0;
    repeat n
        if (x[0]) z = z + y;
        x = x >> 1; y = y << 1;
Sequential Multiplier (cont'd)
one bit of multiplier applied each cycle
n-bit adder
[Figure: datapath with multiplier register x, multiplicand register y, adder, and result register z]
    z = 0;
    repeat n
        if (x[0]) z = z + y * 2^n;
        x = x >> 1; z = z >> 1;
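The two sequential multiplier loops can be compared side by side in a short Python sketch. The function names are illustrative assumptions; Python integers are unbounded, so register widths are not modeled here.

```python
def multiply_shift_multiplicand(x, y, n):
    """First version: shift the multiplicand left (needs a 2n-bit adder)."""
    z = 0
    for _ in range(n):
        if x & 1:          # x[0], the current multiplier bit
            z = z + y
        x >>= 1
        y <<= 1
    return z


def multiply_shift_result(x, y, n):
    """Second version: add y into the upper half of z, then shift z right."""
    z = 0
    for _ in range(n):
        if x & 1:
            z = z + (y << n)   # y * 2^n: only the upper n bits of z are touched
        x >>= 1
        z >>= 1
    return z
```

Both loops return the same product for n-bit unsigned operands. The second one only ever adds into the upper n bits of z, which is why an n-bit adder is enough.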
Parallelism in hardware
Fine-grained - bit level
  e.g., carry-select, carry-lookahead adder
Pipelining
  same number of functional units
  different latency, but increased throughput
  less work per clock cycle
Coarse-grained - data-path level
  e.g., multiple arithmetic units
  multi-port register files (read/write from different sources/destinations)
Processor level
  difficult to take advantage of many levels of parallelism in fixed general-purpose processors
  much easier when the processors are special-purpose, e.g., systolic computations
Bit level parallelism
Exploit ability to do necessary bit-level computations directly
  exploit redundant logic
  goal - keep all circuits busy, reduce critical path
Examples
  carry-lookahead adder
  carry-select adder
  multipliers
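As one illustration of redundant logic, a carry-select adder computes each block's sum for both possible carry-in values and then picks one. The sketch below is a behavioral model only; the 16-bit width, the 4-bit block size, and the function names are assumptions made for the example.

```python
def ripple_add(a, b, carry_in, width):
    """Add two width-bit values; return (sum, carry_out)."""
    total = a + b + carry_in
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a, b, width=16, block=4):
    """Behavioral carry-select adder; width must be a multiple of block."""
    mask = (1 << block) - 1
    result, carry = 0, 0
    for lo in range(0, width, block):
        a_blk = (a >> lo) & mask
        b_blk = (b >> lo) & mask
        # Redundant logic: both candidates are computed "in parallel".
        sum0, cout0 = ripple_add(a_blk, b_blk, 0, block)
        sum1, cout1 = ripple_add(a_blk, b_blk, 1, block)
        # The incoming carry only drives a multiplexer, not a ripple chain.
        s, carry = (sum1, cout1) if carry else (sum0, cout0)
        result |= s << lo
    return result, carry
```

In hardware the critical path is one block's ripple delay plus one multiplexer per block, rather than a ripple through all 16 bits.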
Combinational Multipliers
Use AND gates to generate all partial products in parallel
[Figure: array of AND gates forming the partial products of a 3-bit example, with the multiplier and multiplicand LSBs marked]
Combinational Multipliers (cont'd)
Skew array to send partial products along diagonal and make it square
[Figure: the partial-product array of the 3-bit example, skewed into a square, with LSBs marked]
Combinational Multipliers (cont'd)
Ripple-carry adder in each row (carries ripple right to left)
Sums ripple down (shifted one to right)
worst-case delay is 3n
[Figure: array of full adders (inputs A, B, Cin; outputs Cout, S) with 0 inputs along the edges and LSBs marked]
Using Carry-Save
Forward carries to next row of adders
CLA at the end to add last partial product and forwarded carries
no need to optimize carry more than sum
using CLA for final stage makes this faster than previous multiplier (worst-case is 2n)
[Figure: array of full adders (inputs A, B, Cin; outputs Cout, S) with carries forwarded to the next row and a CLA in the final row; LSBs marked]
Combinational Multipliers (cont'd)
Carry-save adder is a 3-2 adder:
[Figure: rows of carry-save adder cells with outputs labeled x1 and x2, combining the partial products]
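A 3-2 (carry-save) adder can be sketched on whole words in Python: three inputs go in, and a sum word plus a carry word come out with no carry propagation between bit positions. The function names are illustrative assumptions, and the final `+` stands in for the CLA at the end of the array.

```python
def carry_save_add(a, b, c):
    """3-2 adder: reduce three operands to a (sum, carry) pair."""
    sum_bits = a ^ b ^ c                              # per-bit sum, weight 1
    carry_bits = ((a & b) | (a & c) | (b & c)) << 1   # per-bit carry, weight 2
    return sum_bits, carry_bits


def multiply_carry_save(multiplicand, multiplier, n):
    """Accumulate partial products in carry-save form, then do one real add."""
    sum_bits, carry_bits = 0, 0
    for i in range(n):
        pp = (multiplicand << i) if (multiplier >> i) & 1 else 0
        # Carries are forwarded to the next row instead of rippling.
        sum_bits, carry_bits = carry_save_add(sum_bits, carry_bits, pp)
    # Final carry-propagating addition (the CLA stage in the slides).
    return sum_bits + carry_bits
```

The invariant is that `sum_bits + carry_bits` always equals the total of the operands fed in so far, so only the last step needs a carry-propagating adder.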
Wallace Tree Multiplier
Use tree structure to reduce number of additions in critical path to O(log n) rather than O(n)
Difficult structure to lay out and integrate with partial product crossbar
Wiring constraints make it unattractive in many technologies
[Figure: tree of adders reducing partial products PP0 through PP8, followed by a CLA that produces the result]
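The logarithmic depth can be checked with a small Python sketch that groups operands three at a time into 3-2 adders, level by level, until two remain. The grouping order and the function names are assumptions; a real Wallace tree works column by column and its wiring is not modeled here.

```python
def carry_save_add(a, b, c):
    """3-2 adder: reduce three operands to a (sum, carry) pair."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def wallace_reduce(operands):
    """Reduce operands to two with levels of 3-2 adders.

    Returns (total, levels), where total is the final two-operand sum.
    """
    operands = list(operands)
    levels = 0
    while len(operands) > 2:
        full = len(operands) - len(operands) % 3
        next_level = []
        for i in range(0, full, 3):
            next_level.extend(carry_save_add(*operands[i:i + 3]))
        next_level.extend(operands[full:])   # leftovers pass straight through
        operands = next_level
        levels += 1
    return sum(operands), levels


total, levels = wallace_reduce(range(1, 10))   # nine stand-in operands
```

With nine operands the count goes 9, 6, 4, 3, 2, so four adder levels are needed. A linear chain of 3-2 adders over the same nine operands would need seven.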
Binary Tree Multipliers
Problem with Wallace tree is 3:2 column reduction
  need 2:1 reduction for binary tree
One solution: signed-digit binary trees
  represent digits as 0, 1, -1
  similar to Booth's encoding
Signed-digit addition (each result is a carry digit followed by a sum digit):
   1 + -1  ->   0  0
   1 +  1  ->   1  0
   0 +  0  ->   0  0
  -1 + -1  ->  -1  0
   1 +  0  ->   1 -1  if x >= 0 and y >= 0;   0  1  otherwise
  -1 +  0  ->   0 -1  if x >= 0 and y >= 0;  -1  1  otherwise
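The addition table can be exercised with a Python sketch. It assumes that x and y in the table are the two operand digits in the next lower position, which is the usual formulation of signed-digit addition but is not stated in the surviving slide text. Digit lists are least-significant digit first, and the function names are illustrative.

```python
def sd_value(digits):
    """Value of a signed-digit number, least-significant digit first."""
    return sum(d << i for i, d in enumerate(digits))


def sd_add(a, b):
    """Add two signed-digit numbers (digits in {-1, 0, 1}) without carry ripple."""
    n = max(len(a), len(b))
    a = list(a) + [0] * (n - len(a))
    b = list(b) + [0] * (n - len(b))
    result, carry_in = [], 0
    for i in range(n):
        # Assumption: x, y are the operand digits one position lower.
        x = a[i - 1] if i > 0 else 0
        y = b[i - 1] if i > 0 else 0
        lower_nonneg = x >= 0 and y >= 0
        t = a[i] + b[i]
        if t == 2:
            carry, s = 1, 0
        elif t == -2:
            carry, s = -1, 0
        elif t == 0:
            carry, s = 0, 0
        elif t == 1:
            carry, s = (1, -1) if lower_nonneg else (0, 1)
        else:  # t == -1
            carry, s = (0, -1) if lower_nonneg else (-1, 1)
        # The incoming carry can never push s outside {-1, 0, 1}.
        result.append(s + carry_in)
        carry_in = carry
    result.append(carry_in)
    return result
```

Each output digit depends only on two adjacent digit positions, so the addition takes constant time regardless of word length. That is what makes a 2:1 reduction tree practical.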
Booth's Algorithm
Take care of (retire) more than one bit per shift operation
Example: shift two bits at a time
Booth recoding table:
  i+1  i  i-1   add
   0   0   0    0*M
   0   0   1    1*M
   0   1   0    1*M
   0   1   1    2*M
   1   0   0   -2*M
   1   0   1   -1*M
   1   1   0   -1*M
   1   1   1    0*M
must be able to add multiplicand times 0, -1, -2, 1, and 2
Booth recoding steps:
  0  0  1  1  0  1        13
  1  1  1  0  1  0        -6
  0  0 -1  1 -1  0
     0    -1    -2
  1 1 1 1 1 1 1 0 0 1 1 0     (-2 * 13)
  1 1 1 1 1 1 0 0 1 1         (-1 * 13, shifted left 2)
  0 0 0 0 0 0 0 0             (0 * 13, shifted left 4)
  1 1 1 1 1 0 1 1 0 0 1 0     -78
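The recoding table translates directly into a lookup. The Python sketch below recodes an n-bit two's-complement multiplier two bits at a time and sums the selected multiples of the multiplicand. The names `BOOTH_TABLE`, `booth_digits`, and `booth_multiply` are illustrative assumptions, and n is assumed to be even.

```python
# (bit i+1, bit i, bit i-1) -> multiple of the multiplicand M to add
BOOTH_TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_digits(multiplier, n):
    """Recode an n-bit two's-complement multiplier; low-order digit first."""
    bits = [(multiplier >> i) & 1 for i in range(n)]
    digits, prev = [], 0          # bit "-1" is an implicit 0
    for i in range(0, n, 2):
        digits.append(BOOTH_TABLE[(bits[i + 1], bits[i], prev)])
        prev = bits[i + 1]
    return digits


def booth_multiply(multiplicand, multiplier, n):
    """Sum the recoded multiples, each shifted two more bits than the last."""
    return sum(d * multiplicand * 4 ** k
               for k, d in enumerate(booth_digits(multiplier, n)))
```

For the slide's example, `booth_digits(-6, 6)` gives -2, -1, 0 (low-order digit first) and `booth_multiply(13, -6, 6)` gives -78, using three additions instead of six.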
Register Transfer
Registers have input and output
  output can be fanned out to many destinations
  input can come from many sources
    multiplexer needed on input to select which
[Figure: register with a multiplexer on its input; inputs from other registers, outputs to other registers, control signals to choose input source]
Connecting Registers
Multiplexers: lots of control signals but full parallelism of transfers
Busses
Pipelining
Adding registers along a path
  split combinational logic into multiple cycles
  each cycle smaller than previously
  increase throughput
[Figure labels: Told, Cold > Tnew, Cnew]
Pipelining
Delay, d, of slowest combinational stage determines performance
Throughput = 1/d – rate at which outputs are produced
Latency = n•d – number of stages * clock period
Pipelining increases circuit utilization
Registers slow down data, synchronize data paths
Wave-pipelining
  no pipeline registers - waves of data flow through circuit
  relies on equal-delay circuit paths - no short paths
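The throughput and latency formulas for a registered pipeline can be tried on concrete numbers. In this Python sketch the stage delays are made-up values and the function name is an assumption; register overhead is ignored.

```python
def pipeline_metrics(stage_delays):
    """Return (clock_period, throughput, latency) for a registered pipeline."""
    d = max(stage_delays)        # slowest stage sets the clock period
    n = len(stage_delays)
    return d, 1 / d, n * d


# Hypothetical logic with 24 units of total delay, unpipelined and then
# split unevenly into three stages.
unpipelined = pipeline_metrics([24])
pipelined = pipeline_metrics([8, 10, 6])
```

The three-stage split raises throughput from 1/24 to 1/10, but latency grows from 24 to 30 because every stage is padded out to the slowest one.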
When and How to Pipeline?
Where is the best place to add registers?
  splitting combinational logic
  overhead of registers (propagation delay and setup time requirements)
What about cycles in data path?
Example: 16-bit adder, add 8-bits in each of two cycles
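One way to read the 16-bit adder example is that a single 8-bit adder is reused in both cycles, with the carry-out of the first cycle held in a register for the second. The Python sketch below models that reading. The register names and the function name are assumptions.

```python
def add16_in_two_cycles(a, b):
    """Add two 16-bit values using an 8-bit add in each of two cycles."""
    # Cycle 1: add the low bytes; the carry-out is saved in a register.
    low = (a & 0xFF) + (b & 0xFF)
    carry_reg = low >> 8
    low_reg = low & 0xFF
    # Cycle 2: add the high bytes plus the saved carry.
    high = ((a >> 8) & 0xFF) + ((b >> 8) & 0xFF) + carry_reg
    # The carry out of bit 15 is dropped, as in a plain 16-bit adder.
    return ((high & 0xFF) << 8) | low_reg
```

The saved carry is what creates a cycle in the data path: the second addition cannot start until the first has produced it.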
Retiming
Process of optimally distributing registers throughout a circuit
  minimize the clock period
  minimize the number of registers
Retiming (cont’d)
Fast optimal algorithm (Leiserson & Saxe 1983)
Retiming rules:
  remove one register from each input and add one to each output
  remove one register from each output and add one to each input
Optimal Pipelining
Add registers - use retiming to find optimal location
[Figure: two versions of an example circuit annotated with node delays]
Example - Digital Correlator
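y_t = δ(x_t, a_0) + δ(x_{t-1}, a_1) + δ(x_{t-2}, a_2) + δ(x_{t-3}, a_3)
δ(x_t, a_0) = 0 if x ≠ a, 1 otherwise (and passes x along to the right)
[Figure: host feeding x_t into a row of comparators holding a0, a1, a2, a3, with adders combining their outputs into y_t for the host]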
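The correlator's input/output behavior can be written as a short Python function. This is a behavioral sketch only, not a model of the circuit timing. It assumes that samples before the start of the stream never match, and the function name is illustrative.

```python
def correlate(xs, a):
    """y_t = number of positions i where x_{t-i} equals a_i."""
    ys = []
    for t in range(len(xs)):
        y = 0
        for i, a_i in enumerate(a):
            # Assumption: samples before t = 0 count as mismatches.
            if t - i >= 0 and xs[t - i] == a_i:
                y += 1
        ys.append(y)
    return ys


ys = correlate([1, 0, 1, 1], [1, 1, 0, 1])
```

Here the output reaches 4 at t = 3, when the last four samples line up with a0 through a3.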
Example - Digital Correlator (cont’d)
Delays: adder, 7; comparator, 3; host, 0
[Figure: two versions of the correlator circuit (host, comparators, adders), one with cycle time = 24 and one with cycle time = 13]
Pipelined Multipliers
Pipelining can be applied to any of the combinational multipliers
FF at every intersection of pipe stage and wire
[Figure: adder tree with a final CLA, shown twice]
Example - Sorting
[Figure: comparator with inputs A, B and outputs H, L; parallel sorter built from comparators]
Example - Sorting (cont’d)
Pipelined
Pipelined Sorter (cont’d)
Better Sorter
Sequential Sorter
Systolic Arrays
Set of identical processing elements
  specialized or programmable
Efficient nearest-neighbor interconnections (in 1-D, 2-D, other)
SIMD-like
Multiple data flows, converging to engage in computation
Analogy: data flowing through the system in a rhythmic fashion, from main memory through a series of processing elements and back to main memory
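Example - Convolution
y_j = x_j w_1 + x_{j+1} w_2 + . . . + x_{j+n-1} w_n
y_1 = x_1 w_1 + x_2 w_2 + x_3 w_3 + x_4 w_4
y_2 = x_2 w_1 + x_3 w_2 + x_4 w_3 + x_5 w_4
y_3 = x_3 w_1 + x_4 w_2 + x_5 w_3 + x_6 w_4
. . . .
[Figure: linear array of cells holding w4 w3 w2 w1; the x stream (- x3 - x2 - x1) and the y stream (- - - y1 - y2 - y3 -) pass through the cells with gaps between successive values]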
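A register-level simulation shows how the streams and stationary weights can produce these sums. The Python sketch below assumes one particular arrangement consistent with the stream layout on the slides: weights stay in place, x values enter from the left on every other step, y values enter from the right on every other step, and each cell adds weight times x into the passing y. The function name and the timing details are assumptions.

```python
def systolic_convolve(xs, ws):
    """Simulate a linear systolic array; returns y values in order (0-indexed)."""
    n = len(ws)
    num_out = len(xs) - n + 1
    cell_w = list(reversed(ws))      # cells hold w_n ... w_1, left to right
    x_reg = [None] * n               # x values move left to right
    y_reg = [None] * n               # y values move right to left
    out, t = [], 0
    while len(out) < num_out:
        # x enters on even steps, so successive values are one bubble apart.
        new_x = xs[t // 2] if t % 2 == 0 and t // 2 < len(xs) else None
        # y_j enters at step (n - 1) + 2j, just as the first x reaches the
        # rightmost cell.
        j, odd = divmod(t - (n - 1), 2)
        new_y = 0 if t >= n - 1 and odd == 0 and j < num_out else None
        x_reg = [new_x] + x_reg[:-1]
        leaving = y_reg[0]
        y_reg = y_reg[1:] + [new_y]
        if leaving is not None:
            out.append(leaving)
        # Every cell that holds both an x and a y does one multiply-add.
        for p in range(n):
            if x_reg[p] is not None and y_reg[p] is not None:
                y_reg[p] += cell_w[p] * x_reg[p]
        t += 1
    return out
```

The one-step bubbles are needed because the two streams move in opposite directions and so approach each other at two cells per step. Without the gaps, a given y would skip every other x.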
Example - Convolution (cont'd)
[Figure: successive snapshots of the array holding w4 w3 w2 w1 as the x stream (x6 - x5 - x4 - x3 - x2 - x1) and the y stream (- - - y1 - y2 - y3) advance one position per step]
Example: Matrix Multiplication
C = A B,  c_{ij} = Σ_{k=1}^{n} a_{ik} b_{kj}
    c11 c12 c13 c14
    c21 c22 c23 c24
    c31 c32 c33 c34
    c41 c42 c43 c44
Example: Matrix Multiplication
                              |   |   |   b44
                              |   |   b43 b34
                              |   b42 b33 b24
                              b41 b32 b23 b14
                              b31 b22 b13 |
                              b21 b12 |   |
                              b11 |   |   |
-   -   -   a14 a13 a12 a11   c11 c12 c13 c14
-   -   a24 a23 a22 a21 -     c21 c22 c23 c24
-   a34 a33 a32 a31 -   -     c31 c32 c33 c34
a44 a43 a42 a41 -   -   -     c41 c42 c43 c44
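The skewed input streams can be simulated in a few lines of Python. The sketch assumes that each cell keeps its own c value, passes a values to the right and b values downward, and treats the gaps in the streams as zeros. The function and register names are illustrative.

```python
def systolic_matmul(A, B):
    """Simulate an n x n systolic array computing C = A B."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    a_reg = [[0] * n for _ in range(n)]   # a value currently in cell (i, j)
    b_reg = [[0] * n for _ in range(n)]   # b value currently in cell (i, j)
    for t in range(3 * n - 2):
        for i in range(n):
            # Shift row i of a one cell to the right; row i is skewed by i steps.
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0
        for j in range(n):
            # Shift column j of b one cell down; column j is skewed by j steps.
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0
        # Every cell multiplies the pair passing through it and accumulates.
        for i in range(n):
            for j in range(n):
                C[i][j] += a_reg[i][j] * b_reg[i][j]
    return C
```

Because of the skew, cell (i, j) sees a_ik and b_kj on the same step for every k. The whole product finishes in 3n - 2 steps using only nearest-neighbor transfers.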
Systolic Computers
Warp (CMU) - 1987
  linear array of 10 or more processing cells
  optimized inter-cell communication for low-latency
  pipelined cells and communication
  conditional execution
  compiler partitions problem into cells and generates microcode
i-Warp (Intel) - 1990
  successor to Warp
  two-dimensional array
  time-multiplexing of physical busses between cells
  32x32 array has 20 Gflops peak performance