Architectural Exploration: 802.11a Transmitter
Arvind, Nirav Dave, Steve Gerding, Mike Pellauer
Computer Science & Artificial Intelligence Laboratory
Massachusetts Institute of Technology
MIT-Nokia Architecture Group, Helsinki, June 5, 2006

Why architectural exploration
Architects are clever people and can think of a variety of designs
But often cannot determine which design is best for a given metric (e.g., power)
Too short of time and manpower to go far enough with several designs for proper evaluation
Guesswork instead of architectural exploration
New design tools can change all that

This talk
Architectural exploration of 802.11a transmitter
The goal is to show that it is easy and economical to do so in Bluespec
You don't have to know 802.11a or Bluespec to understand the talk

802.11a Transmitter Overview
[Figure: block diagram. headers, data -> Controller -> Scrambler -> Encoder -> Interleaver -> Mapper -> IFFT -> Cyclic Extend. Labels: 24 uncoded bits; one OFDM symbol (64 complex numbers).]
IFFT transforms 64 (frequency domain) complex numbers into 64 (time domain) complex numbers; it accounts for > 95% of the area
Must produce one OFDM symbol every 4 μsec
Depending upon the transmission rate, consumes 1, 2 or 4 tokens to produce one OFDM symbol

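The rates implied by this slide can be checked with a few lines of arithmetic. This is only a sketch: it assumes that a "token" is one group of 24 uncoded bits, which the slide suggests but does not state outright, and all names are illustrative.

```python
# Back-of-the-envelope rates implied by the overview slide.
SYMBOL_PERIOD_US = 4       # one OFDM symbol every 4 microseconds
BITS_PER_TOKEN = 24        # assumption: a token is one group of 24 uncoded bits
SAMPLES_PER_SYMBOL = 64    # complex numbers per OFDM symbol


def data_rate_mbps(tokens_per_symbol: int) -> float:
    """Uncoded bits per microsecond, which is the same as Mbit/s."""
    return tokens_per_symbol * BITS_PER_TOKEN / SYMBOL_PERIOD_US


rates = {n: data_rate_mbps(n) for n in (1, 2, 4)}
samples_per_us = SAMPLES_PER_SYMBOL / SYMBOL_PERIOD_US

print(rates)            # {1: 6.0, 2: 12.0, 4: 24.0}
print(samples_per_us)   # 16.0 complex numbers per microsecond out of the IFFT
```

Whatever the input rate, the IFFT must always deliver 64 complex numbers per 4 μsec, which is why it dominates the design.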
Combinational IFFT
[Figure: in0 ... in63 feed three stages; each stage is 16 Radix-4 nodes followed by a permutation (Permute_1, Permute_2, Permute_3); the last stage produces out0 ... out63]
All numbers are complex and represented as two sixteen-bit quantities. Fixed-point arithmetic is used to reduce area, power, ...
[Figure: Radix-4 node (inputs t0 ... t3): four multipliers followed by two layers of adders and subtractors, with one multiplication by j]

Design Tradeoffs
1. We can decrease the area by multiplexing some circuits
It may be a win if the throughput requirements can be met without increasing the frequency
2. Power can be lowered by lowering the frequency, which in turn allows the voltage to be lowered
power ∝ (voltage)²

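A small numeric sketch of the second tradeoff. It assumes the usual first-order model for dynamic power in CMOS, P ∝ C · V² · f, and ignores leakage; the function name is illustrative.

```python
def relative_dynamic_power(freq_scale: float, voltage_scale: float) -> float:
    """Dynamic power relative to a baseline, using P ~ C * V^2 * f.

    freq_scale and voltage_scale are ratios to the baseline design
    (1.0 means unchanged). Leakage power is ignored.
    """
    return freq_scale * voltage_scale ** 2


# Halving the clock alone halves the power ...
print(relative_dynamic_power(0.5, 1.0))   # 0.5
# ... but if the slower clock also permits half the voltage, power drops to 1/8.
print(relative_dynamic_power(0.5, 0.5))   # 0.125
```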
Combinational IFFT: Opportunity for reuse
[Figure: the combinational IFFT: three stages of 16 Radix-4 nodes, each followed by a permutation (Permute_1, Permute_2, Permute_3)]
Reuse the same circuit three times

Circular pipeline: Reusing the Pipeline Stage
[Figure: in0 ... in63 and out0 ... out63 around a single stage of 16 Radix-4 nodes; Permute_1, Permute_2 and Permute_3 are selected by a stage counter through 64 4-way muxes]
16 Radix 4s can be shared but not the three permutations. Hence the need for muxes

Superfolded circular pipeline: Just one Radix-4 node!
[Figure: in0 ... in63 and out0 ... out63 around a single Radix-4 node. Stage counter: 0 to 2. Index counter: 0 to 15. Components: 64 4-way muxes; 4 16-way muxes; 4 16-way demuxes; Permute_1, Permute_2, Permute_3]
Designs with 2, 4, and 8 Radix-4 modules make sense too!

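The folding options can be compared with an idealized cycle count. The sketch below assumes only what the figures show (3 stages of 16 Radix-4 nodes, so 48 node evaluations per symbol) and counts compute cycles alone; any cycles for input, output or control are ignored, so real designs may need more.

```python
RADIX4_PER_STAGE = 16
STAGES = 3
TOTAL_RADIX4_OPS = RADIX4_PER_STAGE * STAGES   # 48 node evaluations per symbol


def compute_cycles(radix4_units: int) -> int:
    """Idealized clock cycles per symbol when radix4_units nodes work in parallel."""
    if radix4_units <= 0 or TOTAL_RADIX4_OPS % radix4_units != 0:
        raise ValueError("number of Radix-4 units must divide 48")
    return TOTAL_RADIX4_OPS // radix4_units


for units in (1, 2, 4, 8, 16, 48):
    print(units, compute_cycles(units))
# 1 48
# 2 24
# 4 12
# 8 6
# 16 3
# 48 1
```

Fewer Radix-4 units means less area but more cycles per symbol, so the clock must run faster to meet the same 4 μsec deadline.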
Which design consumes the least energy to transmit a symbol?
Can we quickly code up all the alternatives, from a single source with parameters?
Not practical in traditional hardware description languages like Verilog/VHDL

Expressing the designs in Bluespec
Bluespec code: Radix-4 Node
function Vector#(4,Complex) radix4(Vector#(4,Complex) t, Vector#(4,Complex) k);
   Vector#(4,Complex) m = newVector(), y = newVector(), z = newVector();

   // element-wise products of the two input vectors
   m[0] = k[0] * t[0]; m[1] = k[1] * t[1];
   m[2] = k[2] * t[2]; m[3] = k[3] * t[3];

   // first layer of adders; i is the imaginary unit (*j in the figure)
   y[0] = m[0] + m[2]; y[1] = m[0] - m[2];
   y[2] = m[1] + m[3]; y[3] = i*(m[1] - m[3]);

   // second layer of adders
   z[0] = y[0] + y[2]; z[1] = y[1] + y[3];
   z[2] = y[0] - y[2]; z[3] = y[1] - y[3];

   return(z);
endfunction
Polymorphic code: works on any type of numbers for which *, + and - have been defined
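For readers who want to check the arithmetic, here is a Python transcription of the same node. It is a software model only, not the hardware description; Python's duck typing is used as a loose stand-in for Bluespec's polymorphism, and idft4 is a helper introduced here purely for the comparison.

```python
def radix4(t, k):
    """Python model of the Radix-4 node: same three steps as the Bluespec code."""
    m = [k[n] * t[n] for n in range(4)]
    y = [m[0] + m[2], m[0] - m[2], m[1] + m[3], 1j * (m[1] - m[3])]
    return [y[0] + y[2], y[1] + y[3], y[0] - y[2], y[1] - y[3]]


POWERS_OF_J = [1, 1j, -1, -1j]   # j**0 .. j**3


def idft4(x):
    """Unnormalized 4-point inverse DFT, computed directly from its definition."""
    return [sum(x[m] * POWERS_OF_J[(m * n) % 4] for m in range(4))
            for n in range(4)]


ones = [1, 1, 1, 1]
x = [1 + 2j, 3 - 1j, -2 + 0.5j, 4j]

# With all multipliers set to 1, the node is exactly a 4-point inverse DFT.
print(radix4(ones, x) == idft4(x))   # True
```

The sample values are chosen so that every intermediate result is exactly representable in floating point, which is why an exact equality check is safe here.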
[Figure: Radix-4 node dataflow: four multipliers, then two layers of adders and subtractors, with one multiplication by j]

Combinational IFFT: Can be used as a reference
[Figure: the combinational IFFT: three stages of 16 Radix-4 nodes, each followed by a permutation; one stage is the stage_f function; repeat it three times]

Bluespec Code for Combinational IFFT
// Stage function
function SVector#(64, Complex) stage_f(Bit#(2) stage, SVector#(64, Complex) stage_in);
   SVector#(64, Complex) stage_temp = newSVector;
   SVector#(64, Complex) stage_out = newSVector;
   for (Integer i = 0; i < 16; i = i + 1)
   begin
      Integer idx = i * 4;
      let twid = getTwiddle(stage, fromInteger(i));
      let y = radix4(twid, stage_in[idx:idx+3]);   // the four inputs idx .. idx+3
      stage_temp[idx]     = y[0];
      stage_temp[idx + 1] = y[1];
      stage_temp[idx + 2] = y[2];
      stage_temp[idx + 3] = y[3];
   end
   // Permutation
   for (Integer i = 0; i < 64; i = i + 1)
      stage_out[i] = stage_temp[permute[i]];
   return(stage_out);
endfunction

function SVector#(64, Complex) ifft (SVector#(64, Complex) in_data);
   // Declare vectors
   SVector#(4, SVector#(64, Complex)) stage_data = replicate(newSVector);
   stage_data[0] = in_data;
   for (Integer stage = 0; stage < 3; stage = stage + 1)
      stage_data[stage+1] = stage_f(fromInteger(stage), stage_data[stage]);
   return(stage_data[3]);
endfunction
The code is unfolded to generate a combinational circuit
Synchronous pipeline
rule sync_pipeline (True);
   inQ.deq();
   sReg1 <= f1(inQ.first());
   sReg2 <= f2(sReg1);
   outQ.enq(f3(sReg2));
endrule
[Figure: inQ -> f1 -> sReg1 -> f2 -> sReg2 -> f3 -> outQ]
This is real IFFT code; just replace f1, f2 and f3 with stage_f code

Folded pipeline
rule folded_pipeline (True);
   if (stage==1)
      begin inQ.deq(); sxIn = inQ.first(); end
   else
      sxIn = sReg;
   sxOut = f(stage, sxIn);
   if (stage==3) outQ.enq(sxOut);
   else sReg <= sxOut;
   stage <= (stage==3) ? 1 : stage+1;
endrule

function f (stage, sx);
   case (stage)
      1: return f1(sx);
      2: return f2(sx);
      3: return f3(sx);
   endcase
endfunction
[Figure: inQ -> f (f1, f2 or f3, selected by stage) -> sReg (fed back to f) or outQ]
This is real IFFT code too ...
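A cycle-by-cycle Python model of the folded rule shows that it computes the same results as applying f1, f2 and f3 in sequence, at a cost of three cycles per item. It is a behavioral sketch only: the queues are plain Python containers, f1, f2 and f3 are arbitrary stand-in functions, and run_folded is a name introduced here.

```python
from collections import deque


def run_folded(inputs, f1, f2, f3):
    """Model of the folded-pipeline rule: one register, one shared f.

    Each loop iteration corresponds to one firing of the rule.
    Returns the output queue contents and the number of cycles used.
    """
    f = {1: f1, 2: f2, 3: f3}
    in_q, out_q = deque(inputs), []
    stage, s_reg, cycles = 1, None, 0
    while in_q or stage != 1:
        # stage 1 takes a fresh input; later stages read the feedback register
        sx_in = in_q.popleft() if stage == 1 else s_reg
        sx_out = f[stage](sx_in)
        if stage == 3:
            out_q.append(sx_out)
        else:
            s_reg = sx_out
        stage = 1 if stage == 3 else stage + 1
        cycles += 1
    return out_q, cycles


# Stand-in stage functions, chosen only to make the result easy to check.
def f1(v): return v + 1
def f2(v): return v * 2
def f3(v): return v - 3


data = [1, 2, 3]
outputs, cycles = run_folded(data, f1, f2, f3)
reference = [f3(f2(f1(v))) for v in data]

print(outputs, cycles)        # [1, 3, 5] 9
print(outputs == reference)   # True
```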

Expressing these designs in Bluespec is easy
All these designs were done in less than one day!
Area and power estimates?
Combinational
Pipelined
Folded (16 Radices)
Super-Folded (8 Radices)
Super-Folded (4 Radices)
Super-Folded (2 Radices)
Super-Folded (1 Radix)
How long will it take to write these designs in Verilog? VHDL? SystemC?

Bluespec Tool flow
[Figure: tool flow. Bluespec SystemVerilog source -> Bluespec Compiler -> Verilog 95 RTL and C. C -> Bluespec C sim (cycle accurate). Verilog 95 RTL -> Verilog sim -> VCD output -> Debussy visualization. Verilog 95 RTL -> RTL synthesis -> gates -> FPGA, power estimation tool (Sequence Design PowerTheater)]

802.11a Transmitter Synthesis results for various IFFT designs
IFFT Design               Area (mm²)  Min. CLK Period (ns)  Latency (clks/Sym)  ns/output (req. 4000)
Combinational             15.15       33.0                  10                  132
Pipelined                 15.50       12.2                  12                  49
Folded (16 Radices)       6.26        13.0                  12                  52
Super-Folded (8 Radices)  4.02        13.1                  15                  79
SF (4 Radices)            2.86        13.1                  21                  157
SF (2 Radices)            2.33        13.2                  33                  317
SF (1 Radix)              2.00        13.2                  48                  634
TSMC .18 micron; numbers reported are before place and route. Some areas will be larger after layout.

Algorithmic Improvements
[Figure: the combinational IFFT: three stages of 16 Radix-4 nodes, each followed by a permutation (Permute_1, Permute_2, Permute_3)]
1. All three permutations can be made identical: more savings in area
2. One multiplication can be removed from Radix-4

802.11a Transmitter Synthesis results: old vs. new IFFT designs
IFFT Design               Old Area (mm²)  New Area (mm²)
Combinational             15.15           5.91
Pipelined                 15.50           6.26
Folded (16 Radices)       6.26            4.61
Super-Folded (8 Radices)  4.02            3.57
SF (4 Radices)            2.86            2.75
SF (2 Radices)            2.33            2.21
SF (1 Radix)              2.00            1.67
TSMC .18 micron; numbers reported are before place and route.
[Slide annotations on the table: "???" and "expected"]

802.11a Transmitter Synthesis results with new IFFT designs
IFFT Design               Area (mm²)  Min. CLK Period (ns)  Latency (clks/Symbol)  Min. ns/output  Permitted Clock scaling
Combinational             5.91        33.0                  10                     132             30
Pipelined                 6.26        12.0                  12                     49              83
Folded (16 Radices)       4.61        13.0                  12                     52              77
Super-Folded (8 Radices)  3.57        13.1                  15                     79              51
SF (4 Radices)            2.75        13.1                  21                     157             25
SF (2 Radices)            2.21        13.1                  33                     314             13
SF (1 Radix)              1.67        13.1                  57                     629             6
TSMC .18 micron; numbers reported are before place and route.
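The "Permitted Clock scaling" column appears to be the 4000 ns symbol budget divided by the minimum ns/output. That reading is an assumption, not something the slide states, but the sketch below shows it reproduces the column to within rounding. The numbers are copied from the table; the variable names are illustrative.

```python
REQUIRED_NS_PER_SYMBOL = 4000   # one OFDM symbol every 4 microseconds

# (design, min ns/output, permitted clock scaling as printed on the slide)
rows = [
    ("Combinational",            132, 30),
    ("Pipelined",                 49, 83),
    ("Folded (16 Radices)",       52, 77),
    ("Super-Folded (8 Radices)",  79, 51),
    ("SF (4 Radices)",           157, 25),
    ("SF (2 Radices)",           314, 13),
    ("SF (1 Radix)",             629,  6),
]


def permitted_scaling(min_ns_per_output: float) -> float:
    """How many times the clock could be slowed and still meet the deadline."""
    return REQUIRED_NS_PER_SYMBOL / min_ns_per_output


for name, ns, table_value in rows:
    print(f"{name:26s} computed {permitted_scaling(ns):5.1f}   table {table_value}")
```

Small differences from the table are expected, since the ns/output column is itself rounded.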

802.11a Transmitter with new IFFT designs: Power Estimates
IFFT Design (c1)   Area (mm²) (c2)  Min Freq. (c3)  Power (mW) @ 100 MHz (c4)  Power (mW) @ Min Freq. (c5)  Energy/Symbol (nJ) (c6)
Combinational      5.91             1 MHz           398.6                      0.399                        1.594
Pipeline (48 R-4)  6.26             1 MHz           438.6                      0.439                        1.754
Folded (16 R-4)    4.61             1 MHz           475.6                      0.476                        1.902
SF (8 R-4)         3.57             1.5 MHz         299.7                      0.446                        1.798
SF (4 R-4)         2.75             3 MHz           166.2                      0.499                        1.994
SF (2 R-4)         2.21             6 MHz           98.7                       0.592                        2.369
SF (1 R-4)         1.67             12 MHz          66.2                       0.794                        3.178
c3 = 1 / (min clock period × scaling factor); c4 is raw data collected by the Sequence Design PowerTheater; c5 = c4 × c3 / 100 MHz / voltage scaling (= 10); c6 = c5 × 4 μsec
Work in progress

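The energy column can be recomputed from the footnote's formulas. The sketch below takes c3 and c4 from the table and applies c5 = c4 × c3 / 100 MHz / 10 and c6 = c5 × 4 μsec; since mW × μsec = nJ, no unit conversion is needed. The function and variable names are introduced here for illustration.

```python
VOLTAGE_SCALING = 10     # divisor given in the slide's footnote
SYMBOL_PERIOD_US = 4     # one OFDM symbol every 4 microseconds

# (design, c3 min freq in MHz, c4 power in mW at 100 MHz, c6 energy in nJ from the slide)
rows = [
    ("Combinational",     1.0, 398.6, 1.594),
    ("Pipeline (48 R-4)", 1.0, 438.6, 1.754),
    ("Folded (16 R-4)",   1.0, 475.6, 1.902),
    ("SF (8 R-4)",        1.5, 299.7, 1.798),
    ("SF (4 R-4)",        3.0, 166.2, 1.994),
    ("SF (2 R-4)",        6.0,  98.7, 2.369),
    ("SF (1 R-4)",       12.0,  66.2, 3.178),
]


def energy_per_symbol_nj(min_freq_mhz: float, power_mw_at_100mhz: float) -> float:
    """c6: scale power to the minimum frequency and voltage, then multiply by 4 us."""
    c5 = power_mw_at_100mhz * min_freq_mhz / 100.0 / VOLTAGE_SCALING
    return c5 * SYMBOL_PERIOD_US


for name, c3, c4, c6_table in rows:
    print(f"{name:18s} computed {energy_per_symbol_nj(c3, c4):.3f} nJ   table {c6_table} nJ")
```

On these figures the combinational design uses the least energy per symbol and the single-node design the most, even though the latter is the smallest in area.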
Summary
It is essential to do architectural exploration for better (area, power, performance, ...) designs.
It is possible to do so with new design tools and methodologies.
Better and faster tools for estimating area, timing and power would dramatically increase our capability to do architectural exploration.
Thanks