Optimizing a High Performance 32-bit Processor for Programmable Logic

Paul Metzgen
16th November 2004

© 2004 Altera Corporation



Page 2: Optimizing a High Performance 32-bit Processor for

2 © 2004 Altera Confidential ®

Agenda

System Design on FPGAs
– Brief Overview of Altera’s SOPC Tools

Architecting Designs for FPGAs
– Different Design Trade-offs

Case Study: The Design of Nios II
– Implementing Multiplexers in FPGAs
– Optimizing Multiplexers in Nios II

Page 3: Optimizing a High Performance 32-bit Processor for

System Design on FPGAs

Overview of Altera’s SOPC Toolflow

Page 4: Optimizing a High Performance 32-bit Processor for

Altera’s SOPC Builder

Peripheral Set
– Can also add your own (e.g. custom peripherals, accelerators)

Page 5: Optimizing a High Performance 32-bit Processor for

Altera’s SOPC Builder

Can specify system connectivity

[Diagram labels: RAM, PIO, I-master, D-master.]

Page 6: Optimizing a High Performance 32-bit Processor for


Altera’s SOPC Builder

Automatic Logic & Bus Generation

Page 7: Optimizing a High Performance 32-bit Processor for


Altera’s SOPC Builder

Automatic Device Driver Generation

Page 8: Optimizing a High Performance 32-bit Processor for

Nios II IDE

[Screenshot labels: Terminal window, File Viewer window.]

Page 9: Optimizing a High Performance 32-bit Processor for

SOPC Toolflow: Summary

Page 12: Optimizing a High Performance 32-bit Processor for

Nios II Family of Processors:

                 Fast      Standard  Economy
Pipeline         6-stage   5-stage   5-cycle
Br. Prediction   Dynamic   Static    no
I$ - Cache       yes       yes       no
D$ - Cache       yes       no        no
Performance      7.5x      4.7x      1.0x
Size (LEs)       1800      1400      700

Page 13: Optimizing a High Performance 32-bit Processor for

Processor Cost vs. Performance

[Chart: Performance (DMIPS, axis 0 to 300) against Cost of CPU Logic (axis $0.00 to $5.00) for Stratix, Cyclone, Stratix II and HardCopy® Stratix II, with points labeled e, s and f for each device family.]


Page 15: Optimizing a High Performance 32-bit Processor for

Architecting Designs for FPGAs

Different Design Trade-offs

Page 16: Optimizing a High Performance 32-bit Processor for

Making the most of the Available Resources

Logic ‘Elements’ (LUT + REG): x180,000

DSP Blocks: 142 GMac/s
[Diagram labels: Input Register Unit, multipliers, Optional Pipelining, + - Σ, Output Multiplexer, Output Register Unit.]

Memories: 5.1 Tbyte/s
– More Bits For Larger Memory Buffering
– More Data Ports for Greater Memory Bandwidth

Page 17: Optimizing a High Performance 32-bit Processor for

Relative Area Costs

Area Cost       ASIC     FPGA
Registers       Medium   Low
Adders          Medium   Low
Multipliers     High     Medium
Memory          High     Medium
Multiplexers    Low      High

Notes on the FPGA column:
– Registers: Free Register with every Lookup Table (independently accessible)
– Multipliers, Memory: ‘Hard’ Optimized ASIC Blocks
– Multiplexers: Implemented in Lookup Tables

“The Key to Optimizing Designs for an FPGA … is to Optimize the Multiplexers”

Page 22: Optimizing a High Performance 32-bit Processor for

Architecting Designs for FPGAs

Barrel-Shifts using Multipliers

Page 23: Optimizing a High Performance 32-bit Processor for

A Barrel-Shifter Using Multiplexers

[Diagram: inputs A B C D E F G H, outputs Z0 Z1 Z2 Z3 Z4 Z5 Z6 Z7.]

N log2 N LEs

160 LEs for a 32-bit Barrel Shifter
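The N log2 N figure can be checked with a small model. The sketch below assumes the shifter is built as log2 N stages of N two-input multiplexers, at one LE per multiplexer bit. The function names are illustrative, and the staging is one common arrangement rather than necessarily the one drawn on the slide.

```python
import math


def barrel_shifter_les(n_bits):
    """LE estimate: log2(N) stages, each a row of N 2:1 muxes at 1 LE apiece."""
    return n_bits * int(math.log2(n_bits))


def rotate_left_staged(value, shift, n_bits=32):
    """Rotate left through log2(N) stages; stage k rotates by 2**k when bit k of shift is set."""
    mask = (1 << n_bits) - 1
    value &= mask
    for k in range(int(math.log2(n_bits))):
        if (shift >> k) & 1:
            amount = 1 << k
            value = ((value << amount) | (value >> (n_bits - amount))) & mask
    return value
```

For N = 32 the estimate is 32 x 5 = 160 LEs, matching the slide.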

Page 24: Optimizing a High Performance 32-bit Processor for

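Barrel Shifter Using Multipliers

[Diagram: the N-bit input W X Y Z is multiplied (*) by a one-hot constant (0000000100000). The product holds W X Y Z shifted into place, padded with zeros (00000) and sign bits (Sign). Outputs Z0 Z1 Z2 Z3 Z4 Z5 Z6 Z7 from inputs A B C D E F G H.]

Area Cost       ASIC     FPGA
Multipliers     High     Medium
Multiplexers    Low      High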

Page 25: Optimizing a High Performance 32-bit Processor for

Shifters using Multipliers

[Diagram: W X Y Z * 0000000100000 (one-hot, set by N), Signed? Product: 00000 Sign W X Y Z. The N-bit low half of the product is SHL (N).]

Page 26: Optimizing a High Performance 32-bit Processor for

Shifters using Multipliers

[Diagram: W X Y Z * 0000000100000 (one-hot, set by N), Signed? Product: 00000 Sign W X Y Z. The low half is SHL (N); the high half is ASR (32-N).]

Page 27: Optimizing a High Performance 32-bit Processor for

Shifters using Multipliers

[Diagram: W X Y Z * 0000000100000 (one-hot, set by N), Unsigned. Product: 0000000000000 W X Y Z. The low half is SHL (N); the high half is SHR (32-N).]

Page 28: Optimizing a High Performance 32-bit Processor for

Shifters using Multipliers

[Diagram: W X Y Z * 0000000100000 (one-hot, set by N), Unsigned. Product: 0000000000000 W X Y Z. Combining the two halves of the product gives ROT (N).]

Page 29: Optimizing a High Performance 32-bit Processor for

Shifters using Multipliers

[Diagram: W X Y Z * 0000000100000 (one-hot, set by N), Signed? A 3:1 multiplexer selects the result from the product.]

Low half: SHL (N), MULLOW
High half: ASR (32-N), SHR (32-N), MULHIGH
Both halves: ROT (N)
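The shift-by-multiply idea can be sketched in a few lines. This is a behavioural model under stated assumptions, not the Nios II implementation: the function names are illustrative, a right shift by m is modelled as a multiply by 2**(32-m), a shift of zero is special-cased because 2**32 does not fit a 32-bit one-hot operand, and rotate is formed by OR-ing the two halves of the unsigned product.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def product_halves(x, n, signed=False):
    """Multiply a 32-bit value by the one-hot constant 2**n (0 <= n <= 31).

    Returns (low, high): the low and high 32 bits of the 64-bit product.
    """
    x &= MASK32
    if signed and x & 0x80000000:
        x -= 1 << 32  # reinterpret the bit pattern as a negative number
    product = (x * (1 << n)) & MASK64
    return product & MASK32, product >> 32


def shl(x, n):
    """Shift left by n: low half of the product."""
    return product_halves(x, n)[0]


def shr(x, m):
    """Logical shift right by m (0..32): high half of the unsigned product by 2**(32-m)."""
    return x & MASK32 if m == 0 else product_halves(x, 32 - m)[1]


def asr(x, m):
    """Arithmetic shift right by m (0..32): high half of the signed product."""
    return x & MASK32 if m == 0 else product_halves(x, 32 - m, signed=True)[1]


def rot(x, n):
    """Rotate left by n: OR of both halves of the unsigned product."""
    low, high = product_halves(x, n)
    return low | high
```

A single multiplier plus a small result multiplexer therefore covers all four operations, which is the trade the slide is pointing at.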

Page 30: Optimizing a High Performance 32-bit Processor for

Case Study: The Design of Nios II

The ALU

Page 31: Optimizing a High Performance 32-bit Processor for

Case Study: The Nios II Pipeline

[Pipeline diagram. Blocks: I$, register file (RFa, RFb), ALU. Multiplexers: 2:1, 2:1. Signals: Instruction Immediate, External Memory Read, Alu Result.]

Page 32: Optimizing a High Performance 32-bit Processor for

The Nios II Pipeline

[Pipeline diagram. Blocks: I$, register file (RFa, RFb), ALU, multiplier (*). Multiplexers: 3:1, 2:1. Signals: Instruction Immediate, External Memory Read, Alu Result.]

Page 33: Optimizing a High Performance 32-bit Processor for

The Nios II Pipeline

[Pipeline diagram. Blocks: I$, register file (RFa, RFb), ALU, multiplier (*). Multiplexers: 3:1, 3:1, 2:1. Signals: Instruction Immediate, External Memory Read, Alu Result.]

Multiplier is used for Barrel-Shifts as well as Multiplication

Page 34: Optimizing a High Performance 32-bit Processor for

The Nios II Pipeline

[Pipeline diagram. Blocks: I$, register file (RFa, RFb), ALU, multiplier (*), D$. Multiplexers: 2:1, 3:1, 3:1, 2:1. Signals: Instruction Immediate, External Memory Read, Alu Result, Data Cache Read.]

Page 35: Optimizing a High Performance 32-bit Processor for

The Logic Unit

[Pipeline diagram. Blocks: I$, register file (RFa, RFb), +/-, logic, multiplier (*), D$. Multiplexers: 2:1, 3:1, 3:1, 2:1, 2:1, 4:1. Annotation on the logic unit: 4 LUT. Signals: Instruction Immediate, External Memory Read, Alu Result, Data Cache Read.]

Page 36: Optimizing a High Performance 32-bit Processor for

The Arithmetic Unit

[Pipeline diagram. Blocks: I$, register file (RFa, RFb), +/-, logic, multiplier (*), D$. Multiplexers: 2:1, 3:1, 3:1, 2:1, 2:1. Signals: Instruction Immediate, External Memory Read, Alu Result, Data Cache Read.]

Page 37: Optimizing a High Performance 32-bit Processor for

The Comparator Unit

[Pipeline diagram. Blocks: I$, register file (RFa, RFb), +/-, logic, >/=, multiplier (*), D$. Multiplexers: 2:1, 3:1, 3:1, 2:1, 3:1. Signals: Instruction Immediate, External Memory Read, Alu Result, Data Cache Read.]

CMP.op r3, r2, r1
IF (r2 op r1) THEN R3 = 0x00000001
ELSE R3 = 0x00000000

Nios II has no explicit Flags
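The flag-free compare can be modelled directly from the pseudocode above. In this sketch the operation names (`eq`, `ne`, `lt`, `ge`) and the function name are placeholders chosen for illustration, not a list of actual Nios II mnemonics, and operands are treated as plain Python integers.

```python
import operator

# Hypothetical operation table for illustration only.
CMP_OPS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "lt": operator.lt,
    "ge": operator.ge,
}


def cmp_op(op, r2, r1):
    """Return the value written to r3: 1 if (r2 op r1) holds, else 0."""
    return 0x00000001 if CMP_OPS[op](r2, r1) else 0x00000000
```

Because the result lands in an ordinary register, a later branch can test it like any other value, so no separate flags register is needed.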

Page 38: Optimizing a High Performance 32-bit Processor for

Return Address Save

[Pipeline diagram. Blocks: I$, register file (RFa, RFb), +/-, logic, >/=, multiplier (*), D$. Multiplexers: 2:1, 3:1, 3:1, 2:1, 4:1. Signals: Instruction Immediate, External Memory Read, Alu Result, Data Cache Read, Return Address.]

Instructions that save Return Address: CALL, TRAP, INTERRUPT, BREAK

Return Address is saved in a Link Register

Page 39: Optimizing a High Performance 32-bit Processor for

Case Study: The Design of Nios II

Increasing the Clock Rate

Page 40: Optimizing a High Performance 32-bit Processor for

The Nios II Pipeline

[Pipeline diagram. Blocks: I$, register file (RFa, RFb), +/-, logic, >/=, multiplier (*), D$. Multiplexers: 2:1, 3:1, 3:1, 4:1, 2:1. Signals: Instruction Immediate, Return Address, External Memory Read, Alu Result, Data Cache Read.]

Pipeline to achieve a high Clock Rate (fmax)

Page 41: Optimizing a High Performance 32-bit Processor for

Forwarding Logic

[Pipeline diagram. Blocks: I$, register file (RFa, RFb), +/-, logic, >/=, multiplier (*), D$. Multiplexers: 5:1, 6:1, 2:1, 3:1, 3:1, 4:1; the 5:1 and 6:1 are marked as new. Signals: Instruction Immediate, Return Address, External Memory Read, Alu Result, Data Cache Read.]

ADD R2, R1, R0
MUL R4, R3, R2

Forwarding is needed to update out-of-date values in the pipeline
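The hazard in the ADD/MUL pair can be illustrated with a toy operand-read model. Everything here is an assumption made for illustration: the register contents, the `in_flight` list of results that have been computed but not yet written back, and the function names. It shows only why an operand multiplexer needs extra inputs, not how Nios II implements forwarding.

```python
def read_operand(reg, regfile, in_flight):
    """Return the newest value of reg.

    in_flight is a list of (dest_reg, value) pairs, newest first, for results
    still in the pipeline. Each entry is one more input on the operand mux.
    """
    for dest, value in in_flight:
        if dest == reg:
            return value  # forwarded from a later pipeline stage
    return regfile[reg]  # no newer copy: the register file value is current


def run_example():
    """Model ADD R2, R1, R0 followed immediately by MUL R4, R3, R2."""
    regfile = {"R0": 2, "R1": 3, "R2": 99, "R3": 4}  # R2 holds a stale value
    add_result = regfile["R1"] + regfile["R0"]
    in_flight = [("R2", add_result)]  # ADD result not yet written back
    with_fwd = read_operand("R3", regfile, in_flight) * read_operand("R2", regfile, in_flight)
    without_fwd = regfile["R3"] * regfile["R2"]
    return with_fwd, without_fwd
```

With forwarding the MUL sees R2 = 5; without it, the stale register-file value is used.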

Page 42: Optimizing a High Performance 32-bit Processor for

Case Study: The Design of Nios II

The Cost of Multiplexers

Page 43: Optimizing a High Performance 32-bit Processor for

Nios II Multiplexers

[Pipeline diagram. Blocks: I$, register file (RFa, RFb), +/-, logic, >/=, multiplier (*), D$. Multiplexers: 5:1, 6:1, 2:1, 3:1, 3:1, 4:1. Signals: Instruction Immediate, Return Address, External Memory Read, Alu Result, Data Cache Read.]

Page 44: Optimizing a High Performance 32-bit Processor for

What is the Cost of a Multiplexer…?

Multiplexer sizes: 2:1, 3:1, 4:1, 5:1, 6:1

Binary (2:1): Natural Implementation Choice for an ASIC

Page 45: Optimizing a High Performance 32-bit Processor for

Nios II Multiplexers

[Pipeline diagram. Each multiplexer is annotated with its cost in LEs per bit: 5:1 = 4, 6:1 = 5, 2:1 = 1, 3:1 = 2, 3:1 = 2, 4:1 = 3.]

544 LEs (17 x 32 bits)
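The 544 LE total can be reproduced from the multiplexer sizes on the diagram. The sketch assumes each n:1 multiplexer is built as a tree of 2:1 multiplexers (n - 1 of them per bit, one LE each) on a 32-bit datapath; the list of sizes is read off the slide and the names are illustrative.

```python
DATAPATH_WIDTH = 32
NIOS2_MUX_SIZES = [5, 6, 2, 3, 3, 4]  # inputs per multiplexer, as read from the diagram


def binary_tree_les_per_bit(n_inputs):
    """An n:1 mux built from 2:1 muxes needs n - 1 of them per bit."""
    return n_inputs - 1


def total_mux_les(sizes, width=DATAPATH_WIDTH):
    """Total LEs for a list of mux sizes across the full datapath width."""
    return width * sum(binary_tree_les_per_bit(n) for n in sizes)
```

The per-bit sum is 4 + 5 + 1 + 2 + 2 + 3 = 17, and 17 x 32 gives the 544 LEs quoted.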

Page 46: Optimizing a High Performance 32-bit Processor for

Nios II Multiplexers

[Pipeline diagram. Multiplexers: 5:1, 6:1, 2:1, 3:1, 3:1, 4:1. Functional units +/-, logic and >/= are annotated with per-bit costs of 1, 1 and <1.]

Multiplexers: 544 LEs (17 x 32 bits)
Functional units (+/-, logic, >/=): 78 LEs

Multiplexer Cost is Dominant

Page 47: Optimizing a High Performance 32-bit Processor for

Area Usage in 100 Customer Designs

Muxes                 26%
Arithmetic (+,<,=)    11%
Wide-AND              11%
Wide-XOR               3%
Lonely-Reg            18%
Other                 31%

Many Designs contain lots of Multiplexers!

Page 48: Optimizing a High Performance 32-bit Processor for

Multiplexers in FPGA

Low-Cost Multiplexers

Page 49: Optimizing a High Performance 32-bit Processor for

Efficient 4:1 Mux on Stratix

[Diagram: data inputs A, B, C, D; select inputs S0, S1.]

Uses just 2 LEs.

Page 50: Optimizing a High Performance 32-bit Processor for

Efficient 4:1 Mux on Stratix: How it works

[Diagram: two lookup tables with data inputs A, B, C, D. Intermediate labels: C/D and A/B, with the select patterns (1 0 / 0 1) for each case.]
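One way a 4:1 multiplexer fits in two 4-input lookup tables is sketched below. This decomposition is an assumption: it is a known arrangement that satisfies the 2 LE budget, but the exact encoding drawn on the slide cannot be recovered from the transcript. The first LUT either picks C or D or passes the low select bit through; the second LUT either passes that result on or uses it to pick A or B.

```python
def lut1(s1, s0, c, d):
    """First 4-input LUT: when s1 is set pick C/D by s0, otherwise pass s0 through."""
    return (d if s0 else c) if s1 else s0


def lut2(s1, x, a, b):
    """Second 4-input LUT: when s1 is set pass x through, otherwise pick A/B using x as the select."""
    return x if s1 else (b if x else a)


def mux4(s1, s0, a, b, c, d):
    """4:1 mux from two cascaded 4-input LUTs: select value 0, 1, 2, 3 picks A, B, C, D."""
    return lut2(s1, lut1(s1, s0, c, d), a, b)
```

Each function reads exactly four single-bit signals, so each maps to one LE.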

Page 51: Optimizing a High Performance 32-bit Processor for

The Improved Cost of Binary Multiplexers

[Diagram: larger multiplexers built from 4:1 blocks.]

LEs per bit     2:1   3:1   4:1   5:1   6:1
Selector         1     2     3     3     4
Binary (4:1)     1     2     2     3     4

Page 53: Optimizing a High Performance 32-bit Processor for

Efficient Multiplexers

[Pipeline diagram. Multiplexer costs in LEs per bit: 5:1 = 3, 6:1 = 4, 2:1 = 1, 3:1 = 2, 3:1 = 2, 4:1 = 2.]

448 LEs (14 x 32 bits), down from 544 LEs (17 x 32 bits): -18%

Page 54: Optimizing a High Performance 32-bit Processor for

Multiplexers in FPGA

Registered Multiplexers

Page 55: Optimizing a High Performance 32-bit Processor for

Efficient Multiplexers

[Pipeline diagram. Multiplexer costs in LEs per bit: 5:1 = 3, 6:1 = 3 (was 4), 2:1 = 1, 3:1 = 2, 3:1 = 2, 4:1 = 2.]

416 LEs (13 x 32 bits), down from 544 LEs (17 x 32 bits): -24%

Multiplexer costs can be reduced using a register!

Page 56: Optimizing a High Performance 32-bit Processor for


The Stratix LE

Page 57: Optimizing a High Performance 32-bit Processor for

The Stratix LE

Additional LAB-wide signals (shared between 8 LEs): enable, sload, sclear

Page 58: Optimizing a High Performance 32-bit Processor for

2:1 Mux in 1 LE

[Diagram: LE with data inputs d0, d1 and select input sel.]

Page 59: Optimizing a High Performance 32-bit Processor for

3:1 Mux in 1 LE

[Diagram: LE with data inputs d0, d1, d2, select input sel and Sync-load.]

Register Needed (for sload)

Page 60: Optimizing a High Performance 32-bit Processor for

4:1 Mux in 1 LE

[Diagram: LE with data inputs d0, d1, d2 and a constant 0, select input sel, and the sload and sclear signals.]

Register Needed (for sload / sclear)
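The registered multiplexers can be modelled as a lookup table feeding a register that has synchronous load and clear controls. The sketch below is one reading of the slides: the LUT selects between d0 and d1, sload brings in d2, and sclear supplies the constant 0. The select encoding, the priority of sclear over sload, and all names are assumptions for illustration.

```python
def le_register_next(lut_out, data_in, sload, sclear):
    """Value the LE register captures on the next clock edge.

    Assumed priority: sclear, then sload, then the LUT output.
    """
    if sclear:
        return 0
    if sload:
        return data_in
    return lut_out


def registered_mux4(sel, d0, d1, d2):
    """4:1 registered mux in one LE per bit; the fourth input is the constant 0.

    Assumed encoding: sel 0 -> d0, 1 -> d1, 2 -> d2 (via sload), 3 -> 0 (via sclear).
    """
    lut_out = d1 if sel & 1 else d0  # the LUT itself is only a 2:1 mux
    sload = sel == 2                 # decoded once and shared across the LAB
    sclear = sel == 3
    return le_register_next(lut_out, d2, sload, sclear)
```

Since sload and sclear are decoded once per LAB rather than per bit, the extra inputs cost no additional LUTs on a wide bus, provided the multiplexer output is registered anyway.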

Page 61: Optimizing a High Performance 32-bit Processor for

The Cost of Multiplexers

LEs per bit     2:1   3:1   4:1   5:1   6:1
Asynchronous     1     2     2     3     4
Registered       1     1    1-2    3     3
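Reading the slide's two cost rows as LEs per bit for 2:1 through 6:1 multiplexers, the datapath totals quoted elsewhere in the deck can be recomputed. In this sketch the 4:1 registered cost, listed as "1-2", is taken at its upper bound, and the choice of the 6:1 as the multiplexer that feeds a register is an assumption that happens to reproduce the 416 LE figure.

```python
ASYNC_LES = {2: 1, 3: 2, 4: 2, 5: 3, 6: 4}
REGISTERED_LES = {2: 1, 3: 1, 4: 2, 5: 3, 6: 3}  # 4:1 is listed as "1-2"; upper bound used


def datapath_les(muxes, width=32):
    """muxes is a list of (n_inputs, is_registered) pairs; returns total LEs."""
    per_bit = sum((REGISTERED_LES if reg else ASYNC_LES)[n] for n, reg in muxes)
    return width * per_bit


# The six Nios II datapath multiplexers, all asynchronous.
ALL_ASYNC = [(5, False), (6, False), (2, False), (3, False), (3, False), (4, False)]
# Same set with the 6:1 treated as registered (assumed).
SIX_REGISTERED = [(5, False), (6, True), (2, False), (3, False), (3, False), (4, False)]
```

`datapath_les(ALL_ASYNC)` gives 14 x 32 = 448 and `datapath_les(SIX_REGISTERED)` gives 13 x 32 = 416.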

Page 62: Optimizing a High Performance 32-bit Processor for

The Most Cost Effective Multiplexers

LEs per bit     2:1   3:1   4:1   5:1   6:1
Asynchronous     1     2     2     3     4
Registered       1     1    1-2    3     3


Page 64: Optimizing a High Performance 32-bit Processor for

Recap:

[Pipeline diagram. Multiplexer costs in LEs per bit: 5:1 = 3, 6:1 = 3 (was 4), 2:1 = 1, 3:1 = 2, 3:1 = 2, 4:1 = 2.]

416 LEs (13 x 32 bits), down from 544 LEs (17 x 32 bits): -24%

Multiplexer costs were reduced using a register!

Page 65: Optimizing a High Performance 32-bit Processor for

Optimizing Multiplexers in Nios II

Restructuring Techniques

Page 66: Optimizing a High Performance 32-bit Processor for

Underutilized Muxes

[Pipeline diagram. Multiplexer costs in LEs per bit: 5:1 = 3, 6:1 = 3, 2:1 = 1, 3:1 = 2, 3:1 = 2, 4:1 = 2.]

Registered: 2:1 = 1 LE, 3:1 = 1 LE

Can extend 2:1 to be a 3:1 at no extra cost!

Page 67: Optimizing a High Performance 32-bit Processor for

Input Balancing:

[Pipeline diagram. Multiplexers: 5:1, 6:1, 2:1, 3:1, 3:1, 4:1, with costs 2 and 1 marked.]

Registered: 2:1 = 1 LE, 3:1 = 1 LE
Async: 2:1 = 1 LE, 3:1 = 2 LEs

Page 68: Optimizing a High Performance 32-bit Processor for

Nios II Multiplexers

[Pipeline diagram. Multiplexers: 5:1, 6:1, 3:1, 2:1, 3:1, 4:1, with costs 1, 2 and 1 marked.]

384 LEs (12 x 32 bits), down from 416 LEs (13 x 32 bits): -8%

Page 69: Optimizing a High Performance 32-bit Processor for

Related Inputs:

[Pipeline diagram. Multiplexers: 5:1, 6:1, 3:1, 2:1, 3:1, 4:1. Inset: the multiplier (*) with 2:1 multiplexers, labeled 5-LUT and 4-LUT, with costs 1 and 2 marked.]

352 LEs (11 x 32 bits)

Page 70: Optimizing a High Performance 32-bit Processor for

Design Trade-offs

[Pipeline diagram. Multiplexers: 5:1, 6:1, 3:1, 2:1, 3:1, and a 4:1 that becomes a 3:1.]

CALL, TRAP, INTR, BREAK: 3 cycles each

No need to Forward Return Address Early

Page 72: Optimizing a High Performance 32-bit Processor for

Forwarding Zero…

[Pipeline diagram. Multiplexers: 5:1, 6:1, 3:1, 2:1, 3:1, and a 3:1 that becomes a 2:1, with costs 2 and 1 marked.]

CMP.op r3, r2, r1
IF (r2 op r1) THEN R3 = 0x00000001
ELSE R3 = 0x00000000

The comparison result is Mostly 0’s.

Can use Synchronous Reset instead of multiplexer input.

Page 74: Optimizing a High Performance 32-bit Processor for

Optimizing Multiplexers in FPGA

Summary

Page 75: Optimizing a High Performance 32-bit Processor for

Summary: Restructure to 4:1 or 3:1(reg)

LEs per bit     2:1   3:1   4:1   5:1   6:1
Asynchronous     1     2     2     3     4
Registered       1     1    1-2    3     3

Optimal Multiplexer Densities

Page 76: Optimizing a High Performance 32-bit Processor for

Summary

[Pipeline diagram. Blocks: I$, register file (RFa, RFb), +/-, logic, >/=, multiplier (*), D$. Multiplexers: 3:1, 2:1, 3:1, 2:1, 5:1, 6:1. Signals: Instruction Immediate, Return Address, External Memory Read, Alu Result, Data Cache Read.]

320 LEs (10 x 32 bits), down from 544 LEs (17 x 32 bits): - 42%
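The progression of totals across the deck can be tabulated in one place. The per-bit figures below are the ones quoted on the slides (17, 14, 13, 12, 11, 10 LEs per bit on a 32-bit datapath); the step labels are shorthand chosen here, and every percentage is computed against the 544 LE baseline, whereas the slide's -8% is relative to the preceding 416 LE step.

```python
WIDTH = 32

# (step label, multiplexer LEs per bit); labels are illustrative shorthand.
STEPS = [
    ("binary 2:1 trees", 17),
    ("efficient 4:1 muxes", 14),
    ("registered muxes", 13),
    ("input balancing", 12),
    ("related inputs", 11),
    ("trade-offs and forwarding zero", 10),
]


def summarize(steps, width=WIDTH):
    """Return (label, total LEs, percent saved versus the first step) per step."""
    baseline = steps[0][1] * width
    rows = []
    for label, per_bit in steps:
        les = per_bit * width
        rows.append((label, les, round(100 * (baseline - les) / baseline, 1)))
    return rows
```

Computed this way the final step saves 224 of 544 LEs, about 41%, slightly under the 42% printed on the slide.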

Page 77: Optimizing a High Performance 32-bit Processor for

Techniques Extend to Real Designs…

          Original             Optimized
Design    Size      Speed      Size        Speed
A         2,400     40 MHz     -50%        2.5x
B         7,373     77 MHz     -77%        2.0x
C         13,500    50 MHz     1c12 fit    1.5x
D         13,472    67 MHz     -60%        unchanged
E         1,925     75 MHz     -27%        unchanged
Others    …         …          …           …

Page 78: Optimizing a High Performance 32-bit Processor for

Optimizing Multiplexers in FPGA

Support in Quartus Synthesis

Page 79: Optimizing a High Performance 32-bit Processor for

New Multiplexer Report:

(Table is always produced after Analysis & Synthesis, even if optimizations are disabled)

– Number of Unique (or Constant) Inputs
– Number of busses with identical structure
– Estimate of Area Inefficiency

Page 82: Optimizing a High Performance 32-bit Processor for


New Synthesis Option:

Page 83: Optimizing a High Performance 32-bit Processor for

Results: (Stratix I: Logic Reduction)

[Bar chart: Stratix I QOR Set, LEs Post Synthesis. Percentage reduction (axis -10% to 25%) for each design in the benchmark set.]

Mean = 4.2% (geo) (preliminary)

Over 20% Area Reduction in Benchmark Set!

Page 84: Optimizing a High Performance 32-bit Processor for

Summary

Page 85: Optimizing a High Performance 32-bit Processor for

Summary

System Design on FPGAs
– Low cost easy-to-use tools with Time-to-Market advantage

Architecting Designs for FPGAs
– Multiplexer Costs can dominate in FPGAs
  • 25% of the area on average
  • Significant in Processor / Busses
– FPGA Multiplexer Costs do not scale linearly
  • best to map to 4:1 or 3:1(reg)
  • Registers can reduce multiplexer costs!
– The Cheapest Multiplexers are those not implemented in Logic!
  • E.g. by using a multiplier

Synthesis Tools assist in Optimization Process
– But the Designer still has a huge influence on QoR

Page 86: Optimizing a High Performance 32-bit Processor for

The End.

Questions?