Architecture-Specific Packing for Virtex-5 FPGAs

Taneem Ahmed, Paul Kundarewich, Jason Anderson, Brad Taylor, Rajat Aggarwal

February 25th, 2008

Overview

• Virtex-5 6-LUT Packing
• Virtex-5 DSP and Block RAM Packing
• Results
• Summary

Simplified FPGA Logic Element

[Figure: LUT with inputs A1 to A4]

Simplified FPGA Logic Block

[Figure: logic block with a 4-LUT and a flip-flop (FF), connected to the general interconnect]

Virtex-5 Logic Block

[Figure: logic block with a 6-LUT and a flip-flop (FF), connected to the general interconnect]

Dual-Output 6-LUT

[Figure: 6-LUT with inputs A1 to A6]

Dual-Output 6-LUT Usage

[Figure: 6-LUT with inputs A1 to A6; a 5-LUT on inputs A1 to A5 drives the O5 output]

Dual-Output Packing

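[Figure: two 5-LUT functions, Logic X and Logic Y, each with inputs A1 to A5 and an O5 output, shown in separate 6-LUTs and then in a single 6-LUT]

• Packed separately: number of 6-LUTs used: 2
• Packed together: number of 6-LUTs used: 1!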
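The saving on this slide can be illustrated with a minimal sketch. It assumes that two functions may share one dual-output 6-LUT only when the union of their inputs fits on the five shared pins A1 to A5, which is how the figure labels read; the function names and the greedy pairing order are hypothetical, and timing is ignored entirely.

```python
MAX_SHARED_INPUTS = 5  # assumption: both outputs share the pins A1..A5


def can_share_6lut(inputs_x, inputs_y):
    """True if two functions fit on the shared inputs of one dual-output 6-LUT."""
    return len(set(inputs_x) | set(inputs_y)) <= MAX_SHARED_INPUTS


def count_6luts(functions):
    """Greedily pair functions and return the number of 6-LUTs used.

    `functions` is a list of input-name collections, one per LUT function.
    """
    unpaired = []  # input sets occupying a 6-LUT that still has a free half
    used = 0
    for inputs in functions:
        inputs = set(inputs)
        partner = next((u for u in unpaired if can_share_6lut(u, inputs)), None)
        if partner is not None:
            unpaired.remove(partner)  # reuse the 6-LUT that was already counted
        else:
            used += 1
            if len(inputs) <= MAX_SHARED_INPUTS:
                unpaired.append(inputs)  # a true 6-input function cannot share
    return used


logic_x = {"a", "b", "c"}
logic_y = {"b", "c", "d"}
print(count_6luts([logic_x, logic_y]))  # one shared 6-LUT instead of two
```

A real packer would also weigh the performance impact of each pairing, which is the subject of the following slides.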

Virtex-5 LUT/FF Pair

Dual-Output Packing Tradeoff

[Figure: 6-LUT with O6 output]

Dual-Output Packing in Placer

• Goal: To reduce area without performance hit
  – Can be done pre-placement
    • Will be sub-optimal without delay estimates
  – Use delay estimates available during placement to make good decisions on when to merge two LUTs

• Approach:
  – Allow second 5-LUT to be used when performance impact is small
  – Incorporate LUT packing in placer’s cost function

Placer Cost Function

• Previous cost function:
  – Cost = a * W + b * T
  – W: wirelength cost; T: timing performance cost

• Extend cost function with two new terms
  – One based on 6-LUT utilization (L)
  – One based on SLICE utilization (S)
  – Cost = a * W + b * T + c * L + d * S
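As a rough illustration of how the extended cost could steer a merge decision, the sketch below evaluates Cost = a * W + b * T + c * L + d * S before and after a candidate move. The weight values, the `CostWeights` and `accept_merge` names, and the simple "accept if cheaper" rule are assumptions for illustration; the slides do not say how the placer accepts moves or how the weights are chosen.

```python
from dataclasses import dataclass


@dataclass
class CostWeights:
    a: float  # weight on wirelength cost W
    b: float  # weight on timing performance cost T
    c: float  # weight on 6-LUT utilization term L
    d: float  # weight on SLICE utilization term S


def placer_cost(W, T, L, S, w):
    """Extended cost: a * W + b * T + c * L + d * S."""
    return w.a * W + w.b * T + w.c * L + w.d * S


def accept_merge(before, after, w):
    """Hypothetical rule: accept a merge only if it lowers the total cost.

    `before` and `after` are (W, T, L, S) tuples.
    """
    return placer_cost(*after, w) < placer_cost(*before, w)


before = (100.0, 50.0, 10.0, 4.0)
after = (101.0, 51.0, 9.0, 4.0)  # one fewer 6-LUT, slightly worse W and T

print(accept_merge(before, after, CostWeights(1.0, 1.0, 0.5, 0.5)))  # False
print(accept_merge(before, after, CostWeights(1.0, 1.0, 4.0, 0.5)))  # True
```

With a small weight c the wirelength and timing penalty outweighs the saved 6-LUT; raising c makes the same merge worthwhile, which is the trade-off the extra terms are meant to expose.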

6-LUT Utilization Term

• L is computed based on all the used 6-LUT slots

[Equation: L and the definitions of its terms]

SLICE Utilization Term

• S is computed based on all the available SLICEs

• Let:
  – N_i = number of used 5-LUTs in SLICE i (at most 8)

$S = \sum_{i \ge 0} S_i$, summed over all available SLICEs

Performance Recovery

• It is helpful to prohibit packing in certain cases for performance reasons

• Other used elements in a SLICE may block the “good” path from the O5 output to the external interconnect.

Performance Recovery: XOR

[Figure: LUT6 with O6 output]

Performance Recovery: F7

[Figure: LUT6 with O6 output]

6-LUT Reduction

[Chart: 6-LUT reduction by benchmark design number; 5.5% 6-LUT reduction]

SLICE Reduction

[Chart: SLICE reduction by benchmark design number; 10.23% SLICE reduction]

Performance Results

[Chart: axis “SLICEs Reduction (%)” from 0 to 25; 3.3% performance degradation]

Overview

• Virtex-5 6-LUT Packing
• Virtex-5 DSP and Block RAM Packing
• Summary

New Type of Packing Problem

• Traditionally, packing is considered to be a problem of just LUTs and flops

• However, Virtex-5 contains large IP blocks that present their own packing problem

Virtex-5 Block RAMs

[Figure: 36 Kb RAM tile and 18 Kb RAM]

• A 36 Kb block RAM tile can store:
  a) a single 36 Kb RAM
  b) two independent 18 Kb RAMs

• Block RAM has configurable “aspect ratio”

• 18 Kb RAM can be configured as: 16K x 1, 8K x 2, 2K x 9, or 1K x 18

• Tools decide which independent 18 Kb block RAMs to locate in which tile
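The aspect ratios listed above can be checked with a little arithmetic. The sketch below multiplies depth by width for each configuration and also gives a lower bound on the number of tiles for a mix of RAMs. It assumes the x9 and x18 widths include the extra parity bits (which is why they reach 18 Kb while the x1 and x2 widths reach 16 Kb), and `tiles_needed` is a hypothetical helper that ignores every placement constraint.

```python
KBIT = 1024

# (depth, width) pairs for the 18 Kb RAM configurations listed on the slide
CONFIGS_18KB = [(16 * KBIT, 1), (8 * KBIT, 2), (2 * KBIT, 9), (1 * KBIT, 18)]


def capacity_bits(depth, width):
    """Total number of bits addressed by a depth x width configuration."""
    return depth * width


def tiles_needed(num_18kb, num_36kb):
    """Lower bound on tiles: one per 36 Kb RAM, one per pair of 18 Kb RAMs."""
    return num_36kb + (num_18kb + 1) // 2


for depth, width in CONFIGS_18KB:
    bits = capacity_bits(depth, width)
    print(f"{depth // KBIT}K x {width}: {bits} bits ({bits // KBIT} Kb)")

print(tiles_needed(num_18kb=3, num_36kb=1))  # at least 3 tiles
```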

Virtex-5 DSP48E Block

• A multiply-accumulate operation, pervasive in DSP circuits, can be realized in a single DSP48E.
• Multiple DSP48Es can be chained together to form more complex functions through the PCIN and PCOUT ports

[Figure: DSP48E block; labels: A (25-bit), B (18-bit), C (48-bit), 48-bit, pattern detect]

Block RAM and DSP Floorplan

• Block RAM and DSP48E tiles are organized in columns

[Figure: floorplan showing columns of block RAM tiles and DSP48E blocks, with a Virtex-5 DSP tile labeled]

Block RAM/DSP Packing

• Problem: Placer algorithms are heuristic and sometimes do not find an optimal block RAM packing

• Goal: Leverage preferred block RAM packing patterns to achieve high performance

• Target area: DSP designs
  – DSP designs make heavy use of block RAMs and DSP blocks

DSP Block RAM Designs

• The most common DSP application is the Finite Impulse Response (FIR) filter
  – FIR filters have multiple instances of a “tap”, each of which involves DSP blocks and block RAMs

FIR Filter

• A Finite Impulse Response (FIR) filter is a digital filter that takes a weighted average of the signals in a delay line

• An N-tap filter can be expressed as:
  y[n] = c_0*x[n] + c_1*x[n-1] + … + c_{N-1}*x[n-N+1]
  – Where:
    • y[n] is the output of the filter at time n
    • x[n] is the data input “signal” at time n
    • c_i is the coefficient of tap i

• Each coefficient/data product in the sum is referred to as a “tap”
  – DSP units used for the multiply and accumulate
  – Block RAMs used to store the data and coefficients
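The N-tap sum above can be written out directly. The sketch below is a plain software model, not a description of how the DSP48E chain is wired: each pass of the inner loop is one tap (one multiply and one accumulate), and samples before time 0 are assumed to be zero. The function name `fir` is hypothetical.

```python
def fir(coeffs, samples):
    """Return y[n] = sum over taps i of coeffs[i] * samples[n - i].

    Samples before time 0 are treated as zero.
    """
    outputs = []
    for n in range(len(samples)):
        acc = 0
        for i, c in enumerate(coeffs):  # one iteration per tap
            if n - i >= 0:
                acc += c * samples[n - i]  # multiply-accumulate for tap i
        outputs.append(acc)
    return outputs


# Feeding a unit impulse returns the coefficients themselves.
print(fir([1, 2, 3], [1, 0, 0, 0]))  # [1, 2, 3, 0]

# 2-tap filter, as in the use cases that follow.
print(fir([1, 2], [1, 1, 1]))  # [1, 3, 3]
```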

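FIR Designs – Use Case 1

• 2-tap FIR filter involving small block RAMs

[Figure: 2-tap FIR filter between data input and data output. Data RAMs RAMD0 and RAMD1 and coefficient RAMs RAMC0 and RAMC1 are 18 Kb block RAMs; DSP0 implements tap 0 and DSP1 implements tap 1; a 36 Kb block RAM tile is labeled]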

Packing for Use Case 1

• Packing both 18 Kb block RAMs into a block RAM tile permits a natural alignment between the DSP and block RAMs

High Performance!

[Figure: floorplan with block RAM tiles next to DSP48E blocks in a Virtex-5 DSP tile; each block RAM tile operates as two independent 18 Kb block RAMs]

FIR Designs – Use Case 2

• 2-tap FIR filter involving larger block RAMs

[Figure: 2-tap FIR filter with a data RAM and a coefficient RAM; labels: 18 Kb block RAM, 36 Kb block RAM]

Packing for Use Case 2

• Two block RAM columns feed one DSP column
• Again provides a natural alignment between the DSP and block RAMs

[Figure: floorplan with block RAM tiles on either side of the DSP48E blocks in a Virtex-5 DSP tile]

Block RAM Chains

• Use Case: 18 Kb block RAMs with data input and output pins connected together (e.g. FIFO)

• Algorithm: Look for such chains and pack them together into a single block RAM tile

• Special Case: 18 Kb block RAMs separated by registers

[Figure: input “in” feeds RAM0 (pins dia, doa), which feeds RAM1 (pins dib, dob); each is an 18 Kb block RAM]
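The chain search can be sketched as a walk over a small connectivity map. This is only an illustration under stated assumptions: the netlist is reduced to a hypothetical dictionary mapping each 18 Kb RAM to the RAM its data output feeds (or None), each RAM drives and is driven by at most one other RAM, intervening registers are assumed to have been skipped already, and pure loops with no starting RAM are ignored.

```python
def find_chains(drives):
    """Return lists of RAM names connected output-to-input, head first.

    `drives` maps each 18 Kb RAM name to the RAM fed by its data output,
    or None if its output does not feed another RAM.
    """
    driven = {dst for dst in drives.values() if dst is not None}
    chains = []
    for head in drives:
        if head in driven:
            continue  # not the start of a chain
        chain, seen = [head], {head}
        nxt = drives.get(head)
        while nxt is not None and nxt not in seen:
            chain.append(nxt)
            seen.add(nxt)
            nxt = drives.get(nxt)
        if len(chain) > 1:
            chains.append(chain)
    return chains


def pack_chain_pairs(chains):
    """Pair consecutive chain members, one pair per 36 Kb block RAM tile."""
    tiles = []
    for chain in chains:
        for k in range(0, len(chain) - 1, 2):
            tiles.append((chain[k], chain[k + 1]))
    return tiles


netlist = {"RAM0": "RAM1", "RAM1": None, "RAM2": None}
print(pack_chain_pairs(find_chains(netlist)))  # [('RAM0', 'RAM1')]
```

In a chain with an odd number of RAMs the last one is left unpaired and would be packed by whatever general method the tool uses.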

Block RAM/DSP Packing Results

Circuit     Perf. RAM Packing (MHz)   Perf. Baseline (MHz)   Percent Improvement
Circuit 1   500                       400                    25%
Circuit 2   450                       365                    23%
Circuit 3   500                       470                    6%
Circuit 4   425                       435                    -2%
Circuit 5   215                       200                    8%
Geomean     400                       359                    11%
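The summary row can be reproduced from the per-circuit numbers. The sketch below recomputes the percent improvement of each circuit and the geometric means of both columns; the dictionary layout and function names are illustrative only.

```python
import math

# circuit: (performance with RAM packing in MHz, baseline performance in MHz)
RESULTS = {
    "Circuit 1": (500, 400),
    "Circuit 2": (450, 365),
    "Circuit 3": (500, 470),
    "Circuit 4": (425, 435),
    "Circuit 5": (215, 200),
}


def percent_improvement(packed, baseline):
    """Relative change of the packed result over the baseline, in percent."""
    return 100.0 * (packed - baseline) / baseline


def geomean(values):
    """Geometric mean, computed through logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


for name, (packed, baseline) in RESULTS.items():
    print(f"{name}: {percent_improvement(packed, baseline):+.1f}%")

geo_packed = geomean(p for p, _ in RESULTS.values())
geo_base = geomean(b for _, b in RESULTS.values())
print(f"Geomean: {geo_packed:.0f} MHz vs {geo_base:.0f} MHz, "
      f"{percent_improvement(geo_packed, geo_base):+.0f}%")
```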

Summary

• Described two architecture-specific packing approaches for a 65 nm commercial FPGA: Xilinx Virtex-5
  – Dual-output LUT packing in placement:
    • Achieves 10.2% SLICE reduction and 5.5% LUT reduction
  – Packing for DSPs and block RAMs:
    • Achieves 11% performance improvement

Questions
