optimizing a high performance 32-bit processor for
TRANSCRIPT
![Page 1: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/1.jpg)
© 2004 Altera Corporation
Optimizing a High Performance 32-bit Processor for Programmable Logic
Paul Metzgen, 16th November 2004
![Page 2: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/2.jpg)
2 © 2004 Altera Confidential ®
Agenda
System Design on FPGAs
– Brief Overview of Altera’s SOPC Tools
Architecting Designs for FPGAs
– Different Design Trade-offs
Case Study: The Design of Nios II
– Implementing Multiplexers in FPGAs
– Optimizing Multiplexers in Nios II
![Page 3: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/3.jpg)
System Design on FPGAs
Overview of Altera’s SOPC Toolflow
![Page 4: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/4.jpg)
Altera’s SOPC Builder
Peripheral Set
Can also add your own, e.g.:
– custom peripherals
– accelerators
![Page 5: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/5.jpg)
Altera’s SOPC Builder
Can specify system connectivity
RAM PIO
I-master D-master
![Page 6: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/6.jpg)
Altera’s SOPC Builder
Automatic Logic & Bus Generation
![Page 7: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/7.jpg)
Altera’s SOPC Builder
Automatic Device Driver Generation
![Page 8: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/8.jpg)
Nios II IDE
Terminal window
File Viewer Window
![Page 9: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/9.jpg)
SOPC Toolflow: Summary
![Page 10: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/10.jpg)
SOPC Toolflow: Summary
![Page 11: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/11.jpg)
SOPC Toolflow: Summary
![Page 12: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/12.jpg)
Nios II Family of Processors:

| | Fast | Standard | Economy |
|---|---|---|---|
| Pipeline | 6-stage | 5-stage | 5-cycle |
| Br. Prediction | Dynamic | Static | no |
| I$ - Cache | yes | yes | no |
| D$ - Cache | yes | no | no |
| Performance | 7.5x | 4.7x | 1.0x |
| Size (LEs) | 1800 | 1400 | 700 |
![Page 13: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/13.jpg)
Processor Cost vs. Performance
[Figure: Performance (DMIPS, 0 to 300) plotted against Cost of CPU Logic ($0.00 to $5.00) for the e, s, and f cores on Stratix, Cyclone, Stratix II, and HardCopy® Stratix II.]
![Page 14: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/14.jpg)
![Page 15: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/15.jpg)
Architecting Designs for FPGAs
Different Design Trade-offs
![Page 16: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/16.jpg)
Making the most of the Available Resources
Logic ‘Elements’ (LUT + REG): x180,000
DSP Blocks: 142 GMac/s
Memories: 5.1 Tbyte/s
– More Bits For Larger Memory Buffering
– More Data Ports for Greater Memory Bandwidth
[Figure: DSP block with Input Register Unit, Optional Pipelining, adder/subtractor/accumulator stages (+ - Σ), Output Multiplexer, and Output Register Unit; bus widths 36, 37, 38, and 144.]
![Page 17: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/17.jpg)
Relative Area Costs

| Area Cost | ASIC | FPGA |
|---|---|---|
| Registers | Medium | |
| Adders (+) | Medium | |
| Multipliers (*) | High | |
| Memory (D$) | High | |
| Multiplexers (4:1) | Low | |
![Page 18: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/18.jpg)
Relative Area Costs

| Area Cost | ASIC | FPGA |
|---|---|---|
| Registers | Medium | Low |
| Adders (+) | Medium | Low |
| Multipliers (*) | High | |
| Memory (D$) | High | |
| Multiplexers (4:1) | Low | |

Free Register with every Lookup Table (independently accessible)
![Page 19: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/19.jpg)
Relative Area Costs

| Area Cost | ASIC | FPGA |
|---|---|---|
| Registers | Medium | Low |
| Adders (+) | Medium | Low |
| Multipliers (*) | High | Medium |
| Memory (D$) | High | Medium |
| Multiplexers (4:1) | Low | |

Free Register with every Lookup Table (independently accessible)
Multipliers and Memory: ‘Hard’ Optimized ASIC Blocks
![Page 20: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/20.jpg)
Relative Area Costs

| Area Cost | ASIC | FPGA |
|---|---|---|
| Registers | Medium | Low |
| Adders (+) | Medium | Low |
| Multipliers (*) | High | Medium |
| Memory (D$) | High | Medium |
| Multiplexers (4:1) | Low | High |

Free Register with every Lookup Table (independently accessible)
Multipliers and Memory: ‘Hard’ Optimized ASIC Blocks
Multiplexers: Implemented in Lookup Tables
![Page 21: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/21.jpg)
Relative Area Costs

| Area Cost | ASIC | FPGA |
|---|---|---|
| Registers | Medium | Low |
| Adders (+) | Medium | Low |
| Multipliers (*) | High | Medium |
| Memory (D$) | High | Medium |
| Multiplexers (4:1) | Low | High |

Free Register with every Lookup Table
Multipliers and Memory: ‘Hard’ Optimized ASIC Blocks
Multiplexers: Implemented in Lookup Tables
“The Key to Optimizing Designs for an FPGA … is to Optimize the Multiplexers”
![Page 22: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/22.jpg)
Architecting Designs for FPGAs
Barrel-Shifts using Multipliers
![Page 23: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/23.jpg)
A Barrel-Shifter Using Multiplexers
[Figure: 8-bit barrel shifter with inputs A to H and outputs Z0 to Z7.]
Cost: N log2 N LEs
160 LEs for a 32-bit Barrel Shifter
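The N log2 N figure can be reproduced with a small model. This is a minimal sketch, assuming the shifter is built as log2 N stages of N 2:1 multiplexers and that each 2:1 multiplexer occupies one LE; the function names are hypothetical and the model rotates rather than shifts, which may differ from the slide's exact circuit.

```python
def barrel_rotate_left(bits, shift):
    """Rotate a list of bits left using log2(N) stages of 2:1 multiplexers.

    bits[0] is the least significant bit. Stage k either passes every bit
    through or takes the bit 2**k positions lower, depending on bit k of
    the shift amount, so each stage is one row of N 2:1 multiplexers.
    """
    n = len(bits)
    stages = n.bit_length() - 1
    if 1 << stages != n:
        raise ValueError("width must be a power of two")
    for k in range(stages):
        if (shift >> k) & 1:
            step = 1 << k
            bits = [bits[(i - step) % n] for i in range(n)]
    return bits


def barrel_shifter_les(n):
    """LE estimate N * log2(N): one LE per 2:1 multiplexer (assumption)."""
    return n * (n.bit_length() - 1)


print(barrel_shifter_les(32))  # 32 bits x 5 stages
```

With N = 32 the model has 5 stages of 32 multiplexers, which matches the 160 LEs quoted on the slide.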
![Page 24: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/24.jpg)
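Barrel Shifter Using Multipliers
[Figure: the multiplexer barrel shifter (inputs A to H, outputs Z0 to Z7) beside a multiplier: the operand W X Y Z is multiplied by the one-hot constant 0000000100000 (set by N), and the product is shown as 00000 Sign W X Y Z.]

| Area Cost | ASIC | FPGA |
|---|---|---|
| Multipliers (*) | High | Medium |
| Multiplexers (4:1) | Low | High |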
![Page 25: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/25.jpg)
Shifters using Multipliers
[Figure: W X Y Z multiplied by the one-hot constant 0000000100000 (set by N), with a "Signed?" control; the product 00000 Sign W X Y Z yields SHL (N).]
![Page 26: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/26.jpg)
Shifters using Multipliers
[Figure: W X Y Z multiplied by the one-hot constant 0000000100000 (set by N), with a "Signed?" control; the product 00000 Sign W X Y Z yields SHL (N) and ASR (32-N).]
![Page 27: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/27.jpg)
Shifters using Multipliers
[Figure: Unsigned case. W X Y Z multiplied by the one-hot constant 0000000100000 (set by N); the product 0000000000000 W X Y Z yields SHL (N) and SHR (32-N).]
![Page 28: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/28.jpg)
Shifters using Multipliers
[Figure: Unsigned case. W X Y Z multiplied by the one-hot constant 0000000100000 (set by N); the product 0000000000000 W X Y Z yields ROT (N).]
![Page 29: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/29.jpg)
Shifters using Multipliers
[Figure: W X Y Z multiplied by the one-hot constant 0000000100000 (set by N), with a "Signed?" control. Outputs MULLOW and MULHIGH provide SHL (N), ASR (32-N), SHR (32-N), and ROT (N), followed by a 3:1 multiplexer.]
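The shift-by-multiply idea can be checked numerically. The following is a minimal sketch, assuming a 32 x 32 multiplier with a 64-bit product whose halves play the roles of MULLOW and MULHIGH; the function names are hypothetical, right shifts are modelled only for amounts 1 to 31, and forming ROT by OR-ing the two halves is my reading of the figure rather than something the slide states.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mul64(value, n, signed):
    """64-bit product of a 32-bit operand and the one-hot constant 2**n."""
    operand = to_signed32(value) if signed else value & MASK32
    return (operand * (1 << n)) & MASK64


def shl(value, n):
    """SHL (N): low half of the product (MULLOW)."""
    return mul64(value, n, False) & MASK32


def shr(value, m):
    """Logical right shift by m (1..31): high half of value * 2**(32-m)."""
    return mul64(value, 32 - m, False) >> 32


def asr(value, m):
    """Arithmetic right shift by m (1..31): high half of the signed product."""
    return (mul64(value, 32 - m, True) >> 32) & MASK32


def rot(value, n):
    """Rotate left by n: OR of the low and high halves of the unsigned product."""
    product = mul64(value, n, False)
    return (product & MASK32) | (product >> 32)


print(hex(shl(0x80000001, 4)), hex(shr(0x80000000, 4)),
      hex(asr(0x80000000, 4)), hex(rot(0x80000001, 1)))
```

A right shift by m reuses the same multiplier by multiplying with 2^(32-m) and reading the upper word, which is why the slides label the right shifts with 32-N.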
![Page 30: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/30.jpg)
Case Study: The Design of Nios II
The ALU
![Page 31: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/31.jpg)
Case Study: The NIOS II Pipeline
[Figure: pipeline datapath. Labels: I$, RFa, RFb, Instruction Immediate, ALU, Alu Result, External Memory Read. Multiplexers: 2:1, 2:1.]
![Page 32: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/32.jpg)
The NIOS II Pipeline
[Figure: pipeline datapath. Labels: I$, RFa, RFb, Instruction Immediate, ALU, multiplier (*), Alu Result, External Memory Read. Multiplexers: 3:1, 2:1.]
![Page 33: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/33.jpg)
The NIOS II Pipeline
[Figure: pipeline datapath. Labels: I$, RFa, RFb, Instruction Immediate, ALU, multiplier (*), Alu Result, External Memory Read. Multiplexers: 3:1, 3:1, 2:1.]
Multiplier is used for Barrel-Shifts as well as Multiplication
![Page 34: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/34.jpg)
The NIOS II Pipeline
[Figure: pipeline datapath. Labels: I$, D$, RFa, RFb, Instruction Immediate, ALU, multiplier (*), Alu Result, External Memory Read, Data Cache Read. Multiplexers: 2:1, 3:1, 3:1, 2:1.]
![Page 35: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/35.jpg)
The Logic Unit
[Figure: pipeline datapath. Labels: I$, D$, RFa, RFb, Instruction Immediate, multiplier (*), +/-, logic, Alu Result, External Memory Read, Data Cache Read. Multiplexers: 2:1, 3:1, 3:1, 2:1, 2:1, 4:1. Annotation: 4 LUT.]
![Page 36: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/36.jpg)
The Arithmetic Unit
[Figure: pipeline datapath. Labels: I$, D$, RFa, RFb, Instruction Immediate, multiplier (*), +/-, logic, Alu Result, External Memory Read, Data Cache Read. Multiplexers: 2:1, 3:1, 3:1, 2:1, 2:1.]
![Page 37: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/37.jpg)
The Comparator Unit
[Figure: pipeline datapath. Labels: I$, D$, RFa, RFb, Instruction Immediate, multiplier (*), +/-, logic, >/=, Alu Result, External Memory Read, Data Cache Read. Multiplexers: 2:1, 3:1, 3:1, 2:1, 3:1.]
CMP.op r3, r2, r1
IF (r2 op r1) THEN R3 = 0x00000001 ELSE R3 = 0x00000000
Nios II has no explicit Flags
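The flag-free compare can be illustrated with a few lines of code. This is a minimal sketch of the CMP.op semantics stated on the slide; the operation names in `CMP_OPS`, the dictionary register file, and the function name are assumptions made for illustration, not the actual Nios II encoding.

```python
import operator

# Hypothetical subset of compare operations, keyed by an illustrative name.
CMP_OPS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "lt": operator.lt,
    "ge": operator.ge,
}


def cmp_op(op, regs, dest, src_a, src_b):
    """Write 0x00000001 to dest if (src_a op src_b) holds, else 0x00000000."""
    regs[dest] = 0x00000001 if CMP_OPS[op](regs[src_a], regs[src_b]) else 0x00000000
    return regs[dest]


regs = {"r1": 5, "r2": 7, "r3": 0xFFFF}
cmp_op("ge", regs, "r3", "r2", "r1")  # r2 >= r1, so r3 becomes 1
print(regs["r3"])
```

Because the result lands in an ordinary register, a later branch can test that register and no separate flags state has to be carried through the pipeline.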
![Page 38: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/38.jpg)
Return Address Save
[Figure: pipeline datapath. Labels: I$, D$, RFa, RFb, Instruction Immediate, multiplier (*), +/-, logic, >/=, ReturnAddress, Alu Result, External Memory Read, Data Cache Read. Multiplexers: 2:1, 3:1, 3:1, 2:1, 4:1.]
Return Address is saved in a Link Register
Instructions that save Return Address: CALL, TRAP, INTERRUPT, BREAK
![Page 39: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/39.jpg)
Case Study: The Design of Nios II
Increasing the Clock Rate
![Page 40: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/40.jpg)
The NIOS II Pipeline
[Figure: pipeline datapath. Labels: I$, D$, RFa, RFb, Instruction Immediate, multiplier (*), +/-, logic, >/=, ReturnAddress, Alu Result, External Memory Read, Data Cache Read. Multiplexers: 2:1, 3:1, 3:1, 4:1, 2:1.]
Pipeline to achieve a high Clock Rate (fmax)
![Page 41: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/41.jpg)
Forwarding Logic
[Figure: pipeline datapath. Labels: I$, D$, RFa, RFb, Instruction Immediate, multiplier (*), +/-, logic, >/=, ReturnAddress, Alu Result, External Memory Read, Data Cache Read, new. Multiplexers: 5:1, 6:1, 2:1, 3:1, 3:1, 4:1.]
ADD R2, R1, R0
MUL R4, R3, R2
Forwarding needed to update out-of-date values in the pipeline
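The ADD/MUL pair shows why the operand multiplexers grow. Below is a minimal sketch of operand forwarding, assuming results that are computed but not yet written back are kept in a list ordered newest first; `read_operand` and the data layout are hypothetical and do not describe the actual Nios II forwarding network.

```python
def read_operand(reg, regfile, in_flight):
    """Return the newest value of reg.

    in_flight holds (dest_reg, value) pairs for results that exist in the
    pipeline but have not reached the register file, newest first. Each
    extra source is one more input on the operand multiplexer.
    """
    for dest, value in in_flight:
        if dest == reg:
            return value
    return regfile[reg]


# ADD R2, R1, R0 followed immediately by MUL R4, R3, R2
regfile = {"R0": 2, "R1": 3, "R2": 100, "R3": 4, "R4": 0}  # R2 is stale
add_result = regfile["R1"] + regfile["R0"]
in_flight = [("R2", add_result)]  # ADD has not written back yet

forwarded = read_operand("R3", regfile, in_flight) * read_operand("R2", regfile, in_flight)
stale = regfile["R3"] * regfile["R2"]
print(forwarded, stale)
```

Without the forwarded value the MUL would read the stale R2 from the register file, so every pipeline stage that can hold a pending result adds an input to the operand multiplexers.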
![Page 42: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/42.jpg)
Case Study: The Design of Nios II
The Cost of Multiplexers
![Page 43: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/43.jpg)
NIOS II Multiplexers
[Figure: Nios II pipeline datapath with its multiplexers (5:1, 6:1, 2:1, 3:1, 3:1, 4:1).]
![Page 44: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/44.jpg)
What is the Cost of a Multiplexer…?
Multiplexer sizes compared: 2:1, 3:1, 4:1, 5:1, 6:1
Binary (2:1)
Natural Implementation Choice for an ASIC
![Page 45: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/45.jpg)
NIOS II Multiplexers
[Figure: Nios II pipeline datapath with its multiplexers (5:1, 6:1, 2:1, 3:1, 3:1, 4:1), annotated with per-bit costs 4, 2, 2, 1, 3, 5.]
544 LEs (17 x 32 bits)
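The 17 x 32 total can be reproduced from the multiplexer sizes in the datapath. This is a minimal sketch, assuming each multiplexer is a tree of 2:1 multiplexers, that one 2:1 multiplexer costs one LE per bit, and that the datapath holds the six multiplexers labelled 5:1, 6:1, 2:1, 3:1, 3:1, and 4:1; the names are hypothetical.

```python
def binary_tree_cost(inputs):
    """LEs per bit for an N:1 multiplexer built from 2:1 multiplexers (N - 1)."""
    return inputs - 1


PIPELINE_MUXES = [5, 6, 2, 3, 3, 4]  # multiplexer sizes read off the datapath figure
WIDTH = 32                            # datapath width in bits

per_bit = sum(binary_tree_cost(n) for n in PIPELINE_MUXES)
total_les = per_bit * WIDTH
print(per_bit, total_les)
```

The per-bit costs come out as 4, 5, 1, 2, 2, and 3, which are the same six values annotated on the figure.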
![Page 46: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/46.jpg)
NIOS II Multiplexers
[Figure: Nios II pipeline datapath with its multiplexers (5:1, 6:1, 2:1, 3:1, 3:1, 4:1).]
Multiplexers: 544 LEs (17 x 32 bits)
+/-, logic, and >/= units: 78 LEs (per-bit cost labels: 1, 1, <1)
Multiplexer Cost is Dominant
![Page 47: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/47.jpg)
Area Usage in 100 Customer Designs

| Category | Share of area |
|---|---|
| Muxes | 26% |
| Arithmetic (+, <, =) | 11% |
| Wide-AND | 11% |
| Wide-XOR | 3% |
| Lonely-Reg | 18% |
| Other | 31% |

Many Designs contain lots of Multiplexers!
![Page 48: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/48.jpg)
Multiplexers in FPGA
Low-Cost Multiplexers
![Page 49: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/49.jpg)
Efficient 4:1 Mux on Stratix
[Figure: data inputs A, B, C, D; select inputs S0, S1.]
Uses just 2 LEs.
![Page 50: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/50.jpg)
Efficient 4:1 Mux on Stratix: How it works
[Figure: the two-LE 4:1 multiplexer drawn twice with data inputs A, B, C, D, intermediate labels "C/D 0" and "A/B 1", and 0/1 signal-value annotations.]
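One way to fit a 4:1 multiplexer into two 4-input lookup tables can be written out explicitly. This is a sketch of a commonly used mapping, not necessarily the exact encoding drawn on the slide: the first LUT either selects between C and D or passes S0 through, and the second LUT either selects between A and B or passes the first LUT's output. All function names are hypothetical.

```python
def lut1(s0, s1, c, d):
    """First 4-input LUT: when s1 = 1 pick C/D by s0, otherwise forward s0."""
    if s1:
        return d if s0 else c
    return s0


def lut2(a, b, s1, x):
    """Second 4-input LUT: when s1 = 1 forward x, otherwise pick A/B by x."""
    if s1:
        return x
    return b if x else a


def mux4_two_luts(a, b, c, d, s0, s1):
    """4:1 multiplexer using two 4-input functions, i.e. two LEs per bit."""
    return lut2(a, b, s1, lut1(s0, s1, c, d))


def mux4_reference(a, b, c, d, s0, s1):
    """Plain 4:1 multiplexer for comparison: select index is 2*s1 + s0."""
    return (a, b, c, d)[2 * s1 + s0]


print(mux4_two_luts(0, 1, 0, 1, s0=1, s1=1))  # selects D
```

The trick is that the wire between the two LUTs carries either data or the select bit S0, so neither function needs more than four inputs.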
![Page 51: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/51.jpg)
The Improved Cost of Binary Multiplexers
Multiplexer sizes compared: 2:1, 3:1, 4:1, 5:1, 6:1
Implementations: Binary (4:1), Selector
[Figure: multiplexers built from 4:1 blocks.]
![Page 52: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/52.jpg)
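The Improved Cost of Binary Multiplexers
Multiplexer sizes compared: 2:1, 3:1, 4:1, 5:1, 6:1
Implementations: Binary (4:1), Selector
[Figure: multiplexers built from 4:1 blocks.]
Cost rows as extracted (column order not recoverable): 1, 2, 3, 4, 3 and 1, 2, 3, 4, 2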
![Page 53: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/53.jpg)
Efficient Multiplexers
[Figure: Nios II pipeline datapath with its multiplexers (5:1, 6:1, 2:1, 3:1, 3:1, 4:1), annotated with per-bit costs 3, 2, 2, 1, 2, 4.]
448 LEs (14 x 32 bits), down from 544 LEs (17 x 32 bits): -18%
![Page 54: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/54.jpg)
Multiplexers in FPGA
Registered Multiplexers
![Page 55: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/55.jpg)
Efficient Multiplexers
[Figure: Nios II pipeline datapath with its multiplexers (5:1, 6:1, 2:1, 3:1, 3:1, 4:1), annotated with per-bit costs 3, 2, 2, 1, 2, and 4 reduced to 3.]
416 LEs (13 x 32 bits), down from 544 LEs (17 x 32 bits): -24%
Multiplexer costs can be reduced using a register!
![Page 56: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/56.jpg)
The Stratix LE
![Page 57: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/57.jpg)
The Stratix LE
[Figure: LE register control signals enable, sload, sclear.]
Additional LAB-wide signals (shared between 8 LEs)
![Page 58: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/58.jpg)
2:1 Mux in 1 LE
[Figure: data inputs d0, d1; select sel.]
![Page 59: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/59.jpg)
3:1 Mux in 1 LE
[Figure: data inputs d0, d1, d2; select sel; Sync-load.]
Register Needed (for sload)
![Page 60: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/60.jpg)
4:1 Mux in 1 LE
[Figure: data inputs d0, d1, d2 and a constant 0; select sel; sload; sclear.]
Register Needed (for sload / sclear)
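The way the register controls widen a multiplexer can be modelled as a next-state function. This is a minimal sketch, assuming the LUT implements the 2:1 choice between d0 and d1, sload substitutes d2, sclear forces zero, and sclear takes priority over sload; the priority order and all names are assumptions, not taken from the slide.

```python
def le_register_next(d0, d1, d2, sel, sload, sclear):
    """Value captured by the LE register on the next clock edge."""
    if sclear:          # synchronous clear supplies the constant-0 input
        return 0
    if sload:           # synchronous load supplies the third data input
        return d2
    return d1 if sel else d0  # LUT output: ordinary 2:1 multiplexer


def mux4_registered(d0, d1, d2, select):
    """Registered 4:1 multiplexer over (d0, d1, d2, 0) in a single LE."""
    return le_register_next(
        d0, d1, d2,
        sel=(select == 1),
        sload=(select == 2),
        sclear=(select == 3),
    )


print([mux4_registered(1, 1, 1, s) for s in range(4)])
```

The fourth input only comes for free when it is a constant zero, and the result is available one clock later because it is taken from the register.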
![Page 61: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/61.jpg)
The Cost of Multiplexers

| LEs per bit | 2:1 | 3:1 | 4:1 | 5:1 | 6:1 |
|---|---|---|---|---|---|
| Asynchronous | 1 | 2 | 2 | 3 | 4 |
| Registered | 1 | 1 | 1-2 | 3 | 3 |
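The asynchronous and registered costs can be turned into a small datapath estimator. This is a sketch under stated assumptions: the column assignment of the extracted cost values (asynchronous 1, 2, 2, 3, 4 and registered 1, 1, 1-2, 3, 3 for 2:1 through 6:1) is my reading of the slide, the registered 4:1 is taken at its upper bound of 2, and treating the 6:1 as the registered multiplexer is inferred from the totals quoted elsewhere in the deck.

```python
ASYNC_COST = {2: 1, 3: 2, 4: 2, 5: 3, 6: 4}       # LEs per bit (assumed mapping)
REGISTERED_COST = {2: 1, 3: 1, 4: 2, 5: 3, 6: 3}  # 4:1 taken as the upper bound of "1-2"


def datapath_les(muxes, width=32):
    """Total LEs for a list of (inputs, registered) multiplexer descriptions."""
    per_bit = sum((REGISTERED_COST if reg else ASYNC_COST)[n] for n, reg in muxes)
    return per_bit * width


SIZES = [5, 6, 2, 3, 3, 4]
all_async = [(n, False) for n in SIZES]
six_registered = [(n, n == 6) for n in SIZES]

print(datapath_les(all_async), datapath_les(six_registered))
```

Under these assumptions the two totals are 14 x 32 and 13 x 32 LEs, which is consistent with the 448 and 416 LE figures given for the efficient and registered variants.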
![Page 62: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/62.jpg)
The Most Cost Effective Multiplexers

| LEs per bit | 2:1 | 3:1 | 4:1 | 5:1 | 6:1 |
|---|---|---|---|---|---|
| Asynchronous | 1 | 2 | 2 | 3 | 4 |
| Registered | 1 | 1 | 1-2 | 3 | 3 |

Highlighted: 6:1 at 3 LEs per bit
![Page 63: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/63.jpg)
The Most Cost Effective Multiplexers

| LEs per bit | 2:1 | 3:1 | 4:1 | 5:1 | 6:1 |
|---|---|---|---|---|---|
| Asynchronous | 1 | 2 | 2 | 3 | 4 |
| Registered | 1 | 1 | 1-2 | 3 | 3 |

Highlighted: 6:1 at 3 LEs per bit
![Page 64: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/64.jpg)
Recap:
[Figure: Nios II pipeline datapath with its multiplexers (5:1, 6:1, 2:1, 3:1, 3:1, 4:1), annotated with per-bit costs 3, 2, 2, 1, 2, and 4 reduced to 3.]
416 LEs (13 x 32 bits), down from 544 LEs (17 x 32 bits): -24%
Multiplexer costs were reduced using a register!
![Page 65: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/65.jpg)
Optimizing Multiplexers in Nios II
Restructuring Techniques
![Page 66: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/66.jpg)
Underutilized Muxes
[Figure: Nios II pipeline datapath with its multiplexers (5:1, 6:1, 2:1, 3:1, 3:1, 4:1), annotated with per-bit costs 3, 2, 2, 1, 2, 3.]
Registered: 2:1 costs 1, 3:1 costs 1
Can extend 2:1 to be a 3:1 at no extra cost!
![Page 67: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/67.jpg)
Input Balancing:
[Figure: Nios II pipeline datapath with its multiplexers (5:1, 6:1, 2:1, 3:1, 3:1, 4:1), annotated with per-bit costs 2 and 1.]
Registered: 2:1 costs 1, 3:1 costs 1
Async: 2:1 costs 1, 3:1 costs 2
![Page 68: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/68.jpg)
NIOS II Multiplexers
[Figure: Nios II pipeline datapath with its multiplexers (5:1, 6:1, 3:1, 2:1, 3:1, 4:1), annotated with per-bit costs 1, 2, and 1.]
384 LEs (12 x 32 bits), down from 416 LEs (13 x 32 bits): -8%
![Page 69: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/69.jpg)
Related Inputs:
[Figure: Nios II pipeline datapath with its multiplexers (5:1, 6:1, 3:1, 2:1, 3:1, 4:1); the multiplier (*) inputs are annotated with costs 1 and 2, two 2:1 multiplexers, and the labels 5-LUT and 4-LUT.]
352 LEs (11 x 32 bits)
![Page 70: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/70.jpg)
Design Trade-offs
[Figure: Nios II pipeline datapath; the 4:1 multiplexer that takes the ReturnAddress is marked for reduction to a 3:1.]
CALL, TRAP, INTR, BREAK: 3 cycles each
No need to Forward Return Address Early
![Page 71: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/71.jpg)
Design Trade-offs
[Figure: Nios II pipeline datapath with multiplexers 5:1, 6:1, 3:1, 2:1, 3:1, 3:1.]
CALL, TRAP, INTR, BREAK: 3 cycles each
No need to Forward Return Address Early
![Page 72: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/72.jpg)
Forwarding Zero…
[Figure: Nios II pipeline datapath with multiplexers 5:1, 6:1, 3:1, 2:1, 3:1, 3:1; the comparator (>/=) result is marked "Mostly 0’s".]
CMP.op r3, r2, r1
IF (r2 op r1) THEN R3 = 0x00000001 ELSE R3 = 0x00000000
Can use Synchronous Reset instead of multiplexer input.
![Page 73: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/73.jpg)
Forwarding Zero…
[Figure: Nios II pipeline datapath with multiplexers 5:1, 6:1, 3:1, 2:1, 3:1, 2:1, annotated with per-bit costs 2 and 1; the comparator (>/=) result is marked "Mostly 0’s".]
CMP.op r3, r2, r1
IF (r2 op r1) THEN R3 = 0x00000001 ELSE R3 = 0x00000000
Can use Synchronous Reset instead of multiplexer input.
![Page 74: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/74.jpg)
Optimizing Multiplexers in FPGA
Summary
![Page 75: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/75.jpg)
Summary: Restructure to 4:1 or 3:1(reg)

| LEs per bit | 2:1 | 3:1 | 4:1 | 5:1 | 6:1 |
|---|---|---|---|---|---|
| Asynchronous | 1 | 2 | 2 | 3 | 4 |
| Registered | 1 | 1 | 1-2 | 3 | 3 |

Optimal Multiplexer Densities
![Page 76: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/76.jpg)
Summary
[Figure: optimized Nios II pipeline datapath with multiplexers 3:1, 2:1, 3:1, 2:1, 5:1, 6:1.]
320 LEs (10 x 32 bits), down from 544 LEs (17 x 32 bits): -42%
![Page 77: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/77.jpg)
Techniques Extend to Real Designs…

| Design | Original Size | Original Speed | Optimized Size | Optimized Speed |
|---|---|---|---|---|
| A | 2,400 | 40 MHz | -50% | 2.5x |
| B | 7,373 | 77 MHz | -77% | 2.0x |
| C | 13,500 | 50 MHz | 1c12 fit | 1.5x |
| D | 13,472 | 67 MHz | -60% | unchanged |
| E | 1,925 | 75 MHz | -27% | unchanged |
| Others | … | … | … | … |
![Page 78: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/78.jpg)
Optimizing Multiplexers in FPGA
Support in Quartus Synthesis
![Page 79: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/79.jpg)
New Multiplexer Report:
(Table is always produced after Analysis & Synthesis, even if optimizations are disabled)
![Page 80: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/80.jpg)
New Multiplexer Report:
(Table is always produced after Analysis & Synthesis, even if optimizations are disabled)
– Number of Unique (or Constant) Inputs
– Number of busses with identical structure
![Page 81: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/81.jpg)
New Multiplexer Report:
(Table is always produced after Analysis & Synthesis, even if optimizations are disabled)
– Estimate of Area Inefficiency
![Page 82: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/82.jpg)
New Synthesis Option:
![Page 83: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/83.jpg)
Results: (Stratix I: Logic Reduction)
Stratix I QOR Set, LEs Post Synthesis
[Figure: bar chart of percentage reduction (axis from -10% to 25%) for each design in the benchmark set.]
Mean = 4.2% (geo) (preliminary)
Over 20% Area Reduction in Benchmark Set!
![Page 84: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/84.jpg)
Summary
![Page 85: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/85.jpg)
Summary
System Design on FPGAs
– Low cost easy-to-use tools with Time-to-Market advantage
Architecting Designs for FPGAs
– Multiplexer Costs can dominate in FPGAs
  • 25% of the area on average
  • Significant in Processor / Busses
– FPGA Multiplexer Costs do not scale linearly
  • best to map to 4:1 or 3:1(reg)
  • Registers can reduce multiplexer costs!
– The Cheapest Multiplexers are those not implemented in Logic!
  • E.g.: By using a multiplier
Synthesis Tools assist in Optimization Process
– But the Designer still has a huge influence on QoR
![Page 86: Optimizing a High Performance 32-bit Processor for](https://reader030.vdocuments.us/reader030/viewer/2022040803/624d46e075f4e14e3b0954a7/html5/thumbnails/86.jpg)
The End.
Questions?