BRIDGE FLOATING-POINT FUSED MULTIPLY-ADD DESIGN
BY ERIC QUINNELL, EARL E SWARTZLANDER JR., AND CARL LEMONDS
Joseph Schneider, February 23, 2010
FUSED MULTIPLY-ADD
Fused Multiply-Add (FMA) is a unit designed to perform (A x B) + C as a single instruction
Faster and more precise than two consecutive instructions on a standard multiplier and adder
Can also perform standard addition and multiplication with appropriate constants (B = 1 gives A + C; C = 0 gives A x B)
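The precision claim and the use of constants can be checked in software. The sketch below is an illustration and not part of the presentation: `fma_exact` is a hypothetical helper that emulates a fused multiply-add for finite doubles by forming the exact rational result and rounding it once. The input values are chosen only to make the difference visible.

```python
from fractions import Fraction


def fma_exact(a: float, b: float, c: float) -> float:
    """Emulate (a * b) + c with a single rounding (finite inputs only).

    Hypothetical helper for illustration: Fraction holds the exact
    product and sum, and float() rounds that exact value once.
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# Inputs chosen so the exact product 1 - 2**-60 is not representable.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

two_step = a * b + c          # product rounded, then sum rounded
fused = fma_exact(a, b, c)    # exact product and sum, rounded once

print(two_step)               # 0.0: the 2**-60 term was rounded away
print(fused == -(2.0**-60))   # True: the single rounding keeps it

# Standard operations via constants: B = 1 gives A + C, C = 0 gives A x B.
print(fma_exact(a, 1.0, c) == a + c)   # True
print(fma_exact(a, b, 0.0) == a * b)   # True
```

The emulation only models the arithmetic result; it says nothing about the latency of a hardware unit.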
FUSED MULTIPLY-ADD
Drawback: performing a standalone addition or multiplication on an FMA unit incurs greater latency than on a dedicated adder or multiplier
When an FMA unit replaces the adder and multiplier, an addition and a multiplication can no longer be performed in parallel
FUSED MULTIPLY-ADD
Goal: design a bridge architecture that sits between the FADD and FMUL units
Reuse components to minimize area and power consumption
Allow both standard operations and the FMA functionality
FORMAT
Floating-point units all assume double-precision (64-bit) IEEE-754 standard format
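For reference, a double-precision value packs a 1-bit sign, an 11-bit biased exponent (bias 1023), and a 52-bit fraction into 64 bits. The following is a minimal sketch for inspecting those fields; `decode_double` is a name made up for this illustration.

```python
import struct


def decode_double(x: float) -> tuple[int, int, int]:
    """Split a double into (sign, biased exponent, fraction) bit fields."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                    # 1 bit
    exponent = (bits >> 52) & 0x7FF      # 11 bits, bias 1023
    fraction = bits & ((1 << 52) - 1)    # 52 bits; normal numbers have an implicit leading 1
    return sign, exponent, fraction


print(decode_double(1.0))    # (0, 1023, 0)
print(decode_double(-2.5))   # (1, 1024, 1125899906842624), fraction = 2**50
```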
BASIS OF COMPARISON
Compare adder standalone, multiplier standalone, FMA standalone, and the FMA bridge
Compared on basis of latency, area, and power
FMA ARCHITECTURE
Computes (A x B) + C
A and B are multiplied while C is aligned based on the exponent difference
Carry-save adder implemented
Result is rounded only once, as opposed to the two roundings necessary when performing the equation in two operations
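The carry-save step can be illustrated on plain integers. This is a generic sketch of 3:2 carry-save addition, not the wiring used in the paper; `carry_save_add` and the operand names are hypothetical, with small integers standing in for two partial-product rows and the aligned addend C.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """3:2 compression: reduce three operands to a sum word and a carry word.

    Each bit position is handled independently, so no carry ripples
    across the word; one conventional addition finishes the job.
    """
    partial_sum = x ^ y ^ z                       # per-bit sum without carries
    carry = ((x & y) | (x & z) | (y & z)) << 1    # per-bit majority, shifted left
    return partial_sum, carry


# Hypothetical operands: two partial-product rows and the aligned addend.
row0, row1, aligned_c = 0b1011, 0b0110, 0b1101
s, cy = carry_save_add(row0, row1, aligned_c)
print(s + cy == row0 + row1 + aligned_c)   # True
```

Deferring carry propagation to a single final addition is what lets the addend be folded in without a second full adder delay.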
BRIDGE FMA
Follows the same architecture as the FMA, only reusing parts from FADD and FMUL as appropriate
From FMUL, uses the multiplier array
From FADD, uses the rounding unit
In this method, FADD and FMUL can be used individually or in parallel, while the FMA is used only when needed
Clock-gating is used to ensure the bridge is only powered when needed
FMUL
Same as a standard FMUL unit, only with additional outputs from the multiplier array leading to the FMA bridge
Round element shut down via clock-gating when performing an FMA operation
FADD
Uses the Farmwald dual-path FADD design: one of two paths is taken based on the exponent difference of the inputs
The multiplexer that selects between the paths for the rounding unit now includes an option for the FMA input
In this manner, the FMA uses FADD’s rounding unit
BRIDGE FMA
End result: the Bridge FMA hardware is essentially the original FMA hardware, only without the multiplier array and rounding unit.
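The sharing scheme can be summarized as a table of which blocks are clocked for each instruction. The sketch below is a behavioral illustration only: the block names are hypothetical labels for the components named on the slides, and the slides do not say whether the FADD data paths are gated during an FMA, so the sketch assumes they are idle.

```python
from enum import Enum


class Op(Enum):
    FADD = "fadd"
    FMUL = "fmul"
    FMA = "fma"


# Block names are hypothetical labels, not signal names from the design.
ACTIVE_BLOCKS = {
    Op.FADD: {"fadd_paths", "fadd_round"},
    Op.FMUL: {"fmul_array", "fmul_round"},
    # FMA borrows the multiplier array and the FADD rounding unit;
    # the FMUL round element is gated off and the bridge is clocked.
    # Assumption: the FADD data paths are idle during an FMA.
    Op.FMA: {"fmul_array", "bridge", "fadd_round"},
}


def clock_enables(op: Op) -> dict[str, bool]:
    """Return which blocks receive a clock for the given instruction."""
    all_blocks = set().union(*ACTIVE_BLOCKS.values())
    return {name: name in ACTIVE_BLOCKS[op] for name in sorted(all_blocks)}


for op in Op:
    print(op.name, clock_enables(op))
```

In this model the bridge is enabled only for `Op.FMA`, which mirrors the clock-gating described for the design.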
RESULTS
FMUL, FADD, FMA, and Bridge FMA all implemented in Verilog
Uses AMD 65-nm silicon-on-insulator design set
RESULTS
Bridge architecture 30%-70% faster than FMA architecture when performing FADD or FMUL instructions with significant savings in power consumption
Also allows for an FADD and FMUL instruction in parallel, further improving speed
12% performance gain when executing an FMA instruction compared with consecutive operations on the individual FADD and FMUL units
RESULTS
Takes 40% more area to include the Bridge FMA with the FADD and FMUL units
60% increase in power for an FMA instruction over consecutive FADD and FMUL instructions in worst-case conditions
Increased latency and power compared with a standalone FMA unit