Architecture-Specific Packing for Virtex-5 FPGAs Taneem Ahmed, Paul Kundarewich, Jason Anderson, Brad Taylor, Rajat Aggarwal February 25th, 2008



Overview

• Virtex-5 6-LUT Packing
• Virtex-5 DSP and Block RAM Packing
• Results
• Summary


Simplified FPGA Logic Element

[Figure: a 4-LUT with inputs A1 to A4 and output O4, paired with a flip-flop (FF)]


Simplified FPGA Logic Block

[Figure: a logic block containing four 4-LUT/FF pairs, each connected to the general interconnect]


Virtex-5 Logic Block

[Figure: a CLB containing two SLICEs; each SLICE holds four 6-LUT/FF pairs and connects to the general interconnect]


Dual-Output 6-LUT

[Figure: a 6-LUT with inputs A1 to A6 and two outputs, O6 and O5]


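Dual-Output 6-LUT Usage

[Figure: the 6-LUT drawn as two 5-LUTs that share inputs A1 to A5; A6 selects between them to drive O6, and one 5-LUT output is also brought out as O5]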


Dual-Output Packing

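[Figure: two implementations of the same logic, LogicX (inputs x, y; output X) and LogicY (inputs a, b; output Y).
Unpacked: LogicX and LogicY each occupy their own 6-LUT. Number of 6-LUTs used: 2
Packed: LogicX and LogicY occupy the two 5-LUTs of a single dual-output 6-LUT, with A6 tied to VCC and X and Y taken from the two LUT outputs. Number of 6-LUTs used: 1!]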
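The packing condition illustrated on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tool code: it assumes that two LUT functions can share one dual-output 6-LUT when together they use at most five distinct input signals (the pins A1 to A5 shared by both 5-LUTs, with A6 tied to VCC). The function names are hypothetical.

```python
from typing import Iterable

# A1..A5 are shared by both 5-LUTs; A6 is tied to VCC in dual-output mode.
MAX_SHARED_INPUTS = 5


def can_share_6lut(inputs_x: Iterable[str], inputs_y: Iterable[str]) -> bool:
    """Return True if two LUT functions fit in one dual-output 6-LUT.

    Both functions see the same five pins, so the union of their input
    signals may contain at most five distinct nets.
    """
    return len(set(inputs_x) | set(inputs_y)) <= MAX_SHARED_INPUTS


def luts_used(inputs_x: Iterable[str], inputs_y: Iterable[str]) -> int:
    """Number of 6-LUTs needed for the two functions (1 if packed, else 2)."""
    return 1 if can_share_6lut(inputs_x, inputs_y) else 2


# The slide's example: LogicX uses x, y and LogicY uses a, b.
print(luts_used(["x", "y"], ["a", "b"]))
```

For the slide's example the two functions use four distinct signals in total, so they fit in a single 6-LUT. Two functions with six distinct signals between them would still need two.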


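Virtex-5 LUT/FF Pair

[Figure: a Virtex-5 LUT/FF pair: a 6-LUT with O5 and O6 outputs, carry logic (CY, XOR, CIN), an F7 mux, the AX input, a flip-flop (FF) with output AQ, and the A and AMUX outputs]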


Dual-Output Packing Tradeoff

[Figure: a LUT/FF pair (6-LUT, F7, AX input, FF) showing the paths taken by the O5 and O6 outputs]


Dual-Output Packing in Placer

• Goal: To reduce area without performance hit
  – Can be done pre-placement
    • Will be sub-optimal without delay estimates
  – Use delay estimates available during placement to make good decisions on when to merge two LUTs
• Approach:
  – Allow second 5-LUT to be used, when performance impact is small
  – Incorporate LUT packing in placer’s cost function


Placer Cost Function

• Previous cost function:
  – Cost = a * W + b * T
  – W: wirelength cost, T: timing performance cost
• Extend cost function with two new terms
  – One based on 6-LUT utilization (L)
  – One based on SLICE utilization (S)
  – Cost = a * W + b * T + c * L + d * S
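The extended cost function can be written out directly. The sketch below is illustrative only: the weight values, the `Weights` container, and the idea of comparing costs before and after a candidate move are assumptions for the example; the slides give only the weighted-sum form.

```python
from dataclasses import dataclass


@dataclass
class Weights:
    """Hypothetical weighting factors a, b, c, d from the cost formula."""
    a: float  # wirelength weight
    b: float  # timing weight
    c: float  # 6-LUT utilization weight
    d: float  # SLICE utilization weight


def placement_cost(W: float, T: float, L: float, S: float, w: Weights) -> float:
    """Cost = a * W + b * T + c * L + d * S."""
    return w.a * W + w.b * T + w.c * L + w.d * S


def move_delta(before: tuple, after: tuple, w: Weights) -> float:
    """Cost change of a candidate move; negative means the move lowers cost."""
    return placement_cost(*after, w) - placement_cost(*before, w)


# Example with made-up numbers: merging two LUTs lowers L and S
# but slightly increases wirelength and timing cost.
w = Weights(a=1.0, b=1.0, c=2.0, d=2.0)
print(move_delta((10.0, 5.0, 4.0, 2.0), (11.0, 5.5, 3.0, 1.0), w))
```

With these made-up weights the merge lowers the total cost, so a cost-driven placer would favor it. Raising b relative to c and d makes the placer less willing to trade timing for area.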


6-LUT Utilization Term

• L is computed based on all the used 6-LUT slots

• Where: [the formula for L and its definitions are missing from this transcript]


SLICE Utilization Term

• S is computed based on all the available SLICEs
• Let:
  – N_i = Number of used 5-LUTs in SLICE i (at most 8)

$S = \sum_{i=0}^{m} S_i$
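The summation over SLICEs can be sketched as follows. The per-SLICE term S_i as a function of N_i is not preserved in this transcript, so the code takes it as a caller-supplied function; the `occupied` function used in the example is purely a placeholder assumption, not the authors' definition. The limit of 8 follows from four 6-LUTs per SLICE with two 5-LUTs each.

```python
from typing import Callable, Sequence

# Four 6-LUTs per SLICE, each usable as two 5-LUTs.
MAX_5LUTS_PER_SLICE = 8


def slice_utilization(used_5luts: Sequence[int],
                      per_slice_term: Callable[[int], float]) -> float:
    """Compute S as the sum of S_i over all SLICEs.

    used_5luts[i] is N_i, the number of used 5-LUTs in SLICE i.
    per_slice_term maps N_i to S_i (hypothetical; supplied by the caller).
    """
    total = 0.0
    for n_i in used_5luts:
        if not 0 <= n_i <= MAX_5LUTS_PER_SLICE:
            raise ValueError(f"N_i must be between 0 and 8, got {n_i}")
        total += per_slice_term(n_i)
    return total


def occupied(n_i: int) -> float:
    """Placeholder S_i: counts a SLICE as 1 if any 5-LUT in it is used."""
    return 1.0 if n_i > 0 else 0.0


print(slice_utilization([0, 3, 8], occupied))
```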


Performance Recovery

• Helpful to prohibit packing in certain cases for performance reasons

• Other used elements in a SLICE may block the “good” path from the O5 output to external interconnect.


Performance Recovery: XOR

[Figure: the Virtex-5 LUT/FF pair (LUT6 with O5 and O6 outputs, carry logic CY/XOR/CIN, F7 mux, AX input, FF with output AQ, A and AMUX outputs), illustrating the XOR case]


Performance Recovery: F7

[Figure: the Virtex-5 LUT/FF pair (LUT6 with O5 and O6 outputs, carry logic CY/XOR/CIN, F7 mux, AX input, FF with output AQ, A and AMUX outputs), illustrating the F7 case]


6-LUT Reduction

[Chart: % 6-LUT reduction (y-axis, 0 to 16) for each benchmark design (x-axis). Annotation: 5.5% 6-LUT reduction]


SLICE Reduction

[Chart: % SLICE reduction (y-axis, 0 to 25) for each benchmark design (x-axis). Annotation: 10.23% SLICE reduction]


Performance Results

[Chart: performance loss (%) (y-axis, -15 to 25) versus SLICE reduction (%) (x-axis, 0 to 25). Annotation: 3.3% performance degradation]


Overview

• Virtex-5 6-LUT Packing
• Virtex-5 DSP and Block RAM Packing
• Summary


New Type of Packing Problem

• Traditionally, packing is considered to be a problem of just LUTs and flops

• However, Virtex-5 contains large IP blocks that present their own packing problem


Virtex-5 Block RAMs

[Figure: a block RAM tile shown as one 36 Kb RAM or as two 18 Kb RAMs]

• A 36 Kbit block RAM tile can store:
  a) single 36 Kb RAM
  b) two independent 18 Kb RAMs
• Block RAM has configurable “aspect ratio”
• 18 Kb RAM can be configured as: 16K x 1, 8K x 2, 2K x 9, or 1K x 18

• Tools decide which independent 18 Kb block RAMs to locate in which tile
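The aspect ratios listed above are depth-by-width shapes of the same 18 Kb array, which a little arithmetic makes concrete. The snippet below is only a check of the numbers on the slide; the constant and function names are made up for the example.

```python
KBIT = 1024
CAPACITY_18KB = 18 * KBIT  # 18,432 bits

# (depth, width) pairs taken from the slide.
ASPECT_RATIOS = [(16 * KBIT, 1), (8 * KBIT, 2), (2 * KBIT, 9), (1 * KBIT, 18)]


def bits_used(depth: int, width: int) -> int:
    """Total bits addressed by a depth x width configuration."""
    return depth * width


for depth, width in ASPECT_RATIOS:
    bits = bits_used(depth, width)
    # Every listed shape must fit within the 18 Kb array.
    assert bits <= CAPACITY_18KB
    print(f"{depth // KBIT}K x {width}: {bits} of {CAPACITY_18KB} bits")
```

The x1 and x2 shapes address 16 Kb, while the x9 and x18 shapes use the full 18 Kb; the extra bit per byte in the wider shapes is commonly used for parity.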


Virtex-5 DSP48E Block

• A multiply-accumulate operation, pervasive in DSP circuits, can be realized in a single DSP48E.
• Multiple DSP48Es can be chained together to form more complex functions through the PCIN and PCOUT ports

[Figure: DSP48E block diagram with inputs A (25-bit), B (18-bit), C (48-bit) and PCIN; optional pipeline register/routing logic stages; a 25x18 multiplier; routing logic; an ALU; pattern detect; and the 48-bit outputs P and PCOUT]


Block RAM and DSP Floorplan

• Block RAM and DSP48E tiles are organized in columns

[Figure: floorplan showing columns of block RAM tiles alongside columns of Virtex-5 DSP tiles, each DSP tile drawn with two DSP48Es]


Block RAM/DSP Packing

• Problem: Placer algorithms are heuristic and sometimes do not find an optimal block RAM packing

• Goal: Leverage preferred block RAM packing patterns to achieve high performance

• Target area: DSP designs
  – DSP designs make heavy use of block RAMs and DSP blocks


DSP Block RAM Designs

• Most common DSP application is the Finite Impulse Response Filter or FIR filter
  – FIR filters have multiple instances of a “tap” which involve DSP and block RAMs


FIR Filter

• A Finite Impulse Response or FIR filter is a digital filter that takes a weighted average of the signals in a delay line

• An N-tap filter can be expressed as: $y[n] = c_0 x[n] + c_1 x[n-1] + \dots + c_{N-1} x[n-N+1]$
  – Where:
    • y[n] is the output of the filter at time n
    • x[n] is the data input “signal” at time n
    • c_i is the coefficient of tap i
• Each coefficient/data product in the sum is referred to as a “tap”
  – DSP units are used for the multiply and accumulate
  – Block RAMs are used to store the data and coefficients
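The N-tap sum can be illustrated with a short software model. This is a behavioral sketch, not a description of the hardware mapping: the inner loop plays the role of the chained multiply-accumulate taps, samples before time 0 are assumed to be zero, and the function name is hypothetical.

```python
from typing import List, Sequence


def fir_filter(coeffs: Sequence[float], samples: Sequence[float]) -> List[float]:
    """Compute y[n] = sum over i of coeffs[i] * x[n - i] for each input sample.

    Samples before time 0 are treated as zero (an assumption of this sketch).
    """
    delay_line = [0.0] * len(coeffs)  # holds x[n], x[n-1], ..., x[n-N+1]
    outputs = []
    for x in samples:
        # Shift the newest sample into the delay line.
        delay_line = [x] + delay_line[:-1]
        # Each iteration is one tap: multiply, then add to the running sum
        # handed on from the previous tap (the cascade in a DSP chain).
        acc = 0.0
        for c, d in zip(coeffs, delay_line):
            acc += c * d
        outputs.append(acc)
    return outputs


# A 2-tap averaging filter applied to a short ramp.
print(fir_filter([0.5, 0.5], [2.0, 4.0, 6.0]))  # [1.0, 3.0, 5.0]
```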


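FIR Designs – Use Case 1

• 2-tap FIR filter involving small block RAMs

[Figure: Tap 0 (DSP0) and Tap 1 (DSP1) cascaded through PCOUT and PCIN between the data input and data output; data RAMs RAMD0 and RAMD1 and coefficient RAMs RAMC0 and RAMC1 drive the DSP A and B inputs; labels mark an 18 Kb block RAM and a 36 Kb block RAM tile]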


Packing for Use Case 1

• Packing both 18k Block RAMs into a Block RAM tile permits a natural alignment between the DSP and Block RAMs

[Figure: floorplan with a column of block RAM tiles next to a column of Virtex-5 DSP tiles (two DSP48Es each); each block RAM tile operates as two independent 18 Kb block RAMs. Annotation: High Performance!]


FIR Designs – Use Case 2

• 2-tap FIR filter involving larger block RAMs

[Figure: Tap 0 (DSP0) and Tap 1 (DSP1) cascaded through PCOUT and PCIN; data RAMs RAMD0 and RAMD1 and coefficient RAMs RAMC0 and RAMC1 drive the DSP A and B inputs; labels mark an 18 Kb block RAM and a 36 Kb block RAM]


Packing for Use Case 2

• Two Block RAM columns feed one DSP column
• Again provides a natural alignment between the DSP and Block RAMs

[Figure: floorplan with one column of Virtex-5 DSP tiles (two DSP48Es each) and two columns of block RAM tiles]


Block RAM Chains

• Use Case: 18k Block RAM’s data input and output pins connected together (e.g. FIFO)

• Algorithm: Look for such chains and pack them together into single block RAM tile

• Special Case: 18k block RAMs separated by registers

[Figure: in → RAM0 (ports dia, doa, addra) → RAM1 (ports dib, dob, addrb) → out; the RAMs are 18 Kb block RAMs]
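The chain-packing idea can be sketched as a small graph walk. This is a minimal illustration under stated assumptions, not the authors' algorithm: the netlist is reduced to a hypothetical `drives` mapping from each 18 Kb RAM to the RAM fed by its data output, consecutive RAMs in a chain are paired into one tile, and the special case of RAMs separated by registers is not handled.

```python
from typing import Dict, List, Tuple


def pack_ram_chains(drives: Dict[str, str]) -> List[Tuple[str, ...]]:
    """Group chained 18 Kb RAMs into block RAM tiles (two RAMs per tile).

    drives[r] is the RAM whose data input is fed by r's data output
    (hypothetical netlist summary). Closed loops with no chain head
    are left unpacked by this sketch.
    """
    all_rams = set(drives) | set(drives.values())
    driven = set(drives.values())
    heads = sorted(all_rams - driven)  # RAMs that start a chain

    tiles: List[Tuple[str, ...]] = []
    placed = set()
    for head in heads:
        # Walk the chain from its head, skipping RAMs already placed.
        chain = []
        ram = head
        while ram is not None and ram not in placed:
            placed.add(ram)
            chain.append(ram)
            ram = drives.get(ram)
        # Pair consecutive RAMs; an odd RAM at the end gets its own tile.
        for i in range(0, len(chain), 2):
            tiles.append(tuple(chain[i:i + 2]))
    return tiles


# The slide's example: RAM0's data output feeds RAM1's data input.
print(pack_ram_chains({"RAM0": "RAM1"}))  # [('RAM0', 'RAM1')]
```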


Block RAM/DSP Packing Results

Circuit     Perf. with RAM Packing (MHz)   Perf. Baseline (MHz)   Percent Improvement
Circuit 1   500                            400                    25%
Circuit 2   450                            365                    23%
Circuit 3   500                            470                    6%
Circuit 4   425                            435                    -2%
Circuit 5   215                            200                    8%
Geomean     400                            359                    11%
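The geomean row can be reproduced from the five per-circuit frequencies. The snippet below only re-derives the table's summary numbers; the variable and function names are made up for the example.

```python
import math

# (frequency with RAM packing, baseline frequency) in MHz, from the table.
results = {
    "Circuit 1": (500, 400),
    "Circuit 2": (450, 365),
    "Circuit 3": (500, 470),
    "Circuit 4": (425, 435),
    "Circuit 5": (215, 200),
}


def geomean(values) -> float:
    """Geometric mean, computed via logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


packed = geomean(p for p, _ in results.values())
baseline = geomean(b for _, b in results.values())
improvement = (packed / baseline - 1) * 100

print(f"packed: {packed:.0f} MHz, baseline: {baseline:.0f} MHz, "
      f"improvement: {improvement:.1f}%")
```

The computed values agree with the table's rounded figures of about 400 MHz, 359 MHz, and 11%.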


Summary

• Described two architecture-specific packing approaches for a 65nm commercial FPGA: Xilinx Virtex-5
  – Dual-output LUT packing in placement:
    • Achieves 10.2% SLICE reduction and 5.5% LUT reduction
  – Packing for DSPs and block RAMs:
    • Achieves 11% performance improvement


Questions