Designing for 100+ MHz

Upload: monica-russell

Post on 16-Jan-2016

Page 1: Designing for 100+ MHz

Page 2: 1999 Designs Demand...

Higher system speed

Higher integration
  - smaller size, less power, better reliability

Lower cost

Shorter development time

Better product differentiation

Page 3: Traditional Multi-Chip Boards

Discrete design components
  - CPU, memory
  - bus transceivers, PCI controller, FIFOs
  - Ethernet controller, graphics accelerator, MPEG, DSP, etc.
  - programmable logic as glue and custom function

Advantages:
  - well-documented sophisticated functions
  - readily available as IP in silicon

Page 4: Multi-Chip Board Problems

Physical size

Power consumption and reliability

PC board signal integrity

Limited flexibility
  - prevents design modifications and upgrades
  - prevents product diversification
  - prevents product customization

Poor product differentiation
  - standard parts = standard architecture

Page 5: FPGA Advantages

Smaller size

Lower power consumption

Better signal integrity
  - fewer PC-board issues

Enhanced flexibility
  - easy modifications, upgrades, etc.

Enhanced product differentiation
  - proprietary architectures

Page 6: FPGAs Users Want...

System clock rate of 100+ MHz

>100,000 gates

Efficient design methodologies

Availability of well-documented Cores

Reasonable cost

Page 7: The FPGA Solution

4th-generation FPGA: logic + memory + routing

Multi-standard SelectI/O

Temperature sensing

Delay-locked loop for fast clock and I/O

3.3 ns synchronous dual-port SRAM

500 Mbps SelectMAP configuration

Page 8: Now the Challenge...

Together, we can do it...
  - we'll supply the ingredients...
  - you use them intelligently

But don't forget...
  - the clock period is less than 10 ns!

Design a 100+ MHz system

Page 9: Designing for 100+ MHz

Volts, Amps, and Watts
  - PCB signal distribution
  - chip inputs and outputs
  - power and thermal considerations

Ones and zeros
  - logic emulation

Bits and bytes
  - memory hierarchy

Page 10: Moore Meets Einstein

Speed doubles every 5 years... but the speed of light never changes

[Figure: clock frequency (MHz) and trace length (inches per 1/4 clock period) plotted against year, '65 to '10, on a doubling scale from 1 to 2048]

Page 11: Volts, Amps, and Watts

PCB design issues
  - capacitive loading
  - transmission lines and termination

Chip inputs and outputs
  - clock distribution and DLLs
  - I/O standards

Power and thermal considerations
  - temperature-sensing diode
  - power supply decoupling

Configuration
  - new SelectMAP mode

Page 12: Capacitive Loading

Capacitance slows outputs and increases power
  - output delay increase: about 25 ps per pF of additional loading
  - output power dissipation increase: 11 µW per MHz per pF with a 3.3-V swing

Sources of capacitance
  - 10 pF max for each device pin
  - 2 pF per inch for narrow traces (0.8 pF/cm)
  - 130 pF per square inch for copper areas (20 pF/cm²)

IBIS files provide output impedance details
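
The loading rules of thumb above lend themselves to a quick estimate. The following is a minimal sketch that only multiplies out the figures quoted on the slide; the function names and the example net (4 inches of trace, 2 receiver pins, 50 MHz switching) are illustrative assumptions, not part of the presentation.

```python
# Rules of thumb quoted on the slide.
DELAY_PS_PER_PF = 25.0           # extra output delay per pF of load
POWER_UW_PER_MHZ_PER_PF = 11.0   # with a 3.3 V swing
PIN_PF_MAX = 10.0                # worst case per device pin
TRACE_PF_PER_INCH = 2.0          # narrow trace


def load_capacitance_pf(trace_inches, receiver_pins):
    """Estimate the load as trace capacitance plus worst-case pin capacitance."""
    return trace_inches * TRACE_PF_PER_INCH + receiver_pins * PIN_PF_MAX


def added_delay_ps(load_pf):
    """Extra output delay caused by the load."""
    return load_pf * DELAY_PS_PER_PF


def output_power_uw(load_pf, switching_mhz):
    """Extra output power dissipation caused by the load."""
    return load_pf * switching_mhz * POWER_UW_PER_MHZ_PER_PF


# Hypothetical net: 4 inches of trace, 2 receiver pins, switching at 50 MHz.
load = load_capacitance_pf(4, 2)
print(load, "pF,", added_delay_ps(load), "ps,", output_power_uw(load, 50), "uW")
```

The 11 µW figure is close to C·V² for 1 pF at 3.3 V and 1 MHz, which is why it scales with the square of the output swing.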

Page 13: Transmission Lines

Some traces must be treated as transmission lines to minimize ringing
  - transmission line if round trip > transition time
  - lumped capacitance if round trip < transition time

Signal delay on a PCB:
  - 140 to 180 ps per inch (50 to 70 ps/cm)

Lumped-capacitance trace length:
  - 3 inches max for a 1-ns transition time (7.5 cm)
  - 6 inches max for a 2-ns transition time (15 cm)
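
The round-trip rule can be written out directly. This sketch assumes a propagation delay of 160 ps per inch, the midpoint of the slide's 140 to 180 ps range; that default and the function names are assumptions made for illustration.

```python
def round_trip_ns(length_inches, delay_ps_per_inch=160.0):
    """Time for a signal edge to travel down the trace and back."""
    return 2 * length_inches * delay_ps_per_inch / 1000.0


def is_transmission_line(length_inches, transition_ns, delay_ps_per_inch=160.0):
    """True when the round trip exceeds the transition time."""
    return round_trip_ns(length_inches, delay_ps_per_inch) > transition_ns


def max_lumped_length_inches(transition_ns, delay_ps_per_inch=160.0):
    """Longest trace that can still be treated as a lumped capacitance."""
    return transition_ns * 1000.0 / (2 * delay_ps_per_inch)


for t in (1.0, 2.0):
    print(t, "ns transition ->", max_lumped_length_inches(t), "inches")
```

With the assumed 160 ps per inch the limits come out slightly above 3 and 6 inches, in line with the rounded values on the slide.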

Page 14: Terminated Transmission Lines

Reflections and ringing

Traditional Thevenin termination at the end

Dynamic termination at the end is better and saves power

Series termination at the source is best, for a single source and destination only!

[Figure: three terminated 50 Ω lines. Component values shown: VCC, 100 Ω and 100 Ω (50 Ω total), 50 Ω with 100 pF, 22 Ω, 27 Ω]

Page 15: On-Chip Clock Distribution

Clock distribution introduces delay
  - larger chips suffer more clock delay

[Figure: clock and data paths into the chip through the IOB and CLB]

Page 16: Clock Delay Problems

Clock delay increases clock-to-output times

Clock delay leads to unacceptable input hold time
  - set-up time is negative

Additional data delay can eliminate the hold time
  - set-up time becomes positive
  - but tolerance build-up widens the data-valid window

[Figure: IOB flip-flop (D, Q) with a clock distribution delay on the clock path and an optional delay on the data path; timing diagram of the required data-valid window without and with the data delay]

Page 17: DLLs Maximize I/O Speed

Clock-to-output time plus set-up time determines the I/O speed and data bandwidth
  - min clock period = max clock-to-out + max set-up

Traditional solution:
  - use highly buffered, balanced clock trees
    - needed to reduce internal clock skew
    - cannot totally eliminate the delay

The Virtex solution:
  - use a Delay-Locked Loop (DLL)
    - aligns the internal and external clocks
    - effectively eliminates the clock-distribution delay

Page 18: Virtex Has 4 Independent DLLs

DLLs adjust clock delay to align internal and external clocks
  - digital closed-loop control
  - 25 to 200-MHz range, 35-picosecond resolution

[Figure: the clock passes through an adjustable delay; a comparator generates the error signal that controls the delay. Clock and data reach the IOB and CLB]

Page 19: Fast Clock-to-Out With DLL

160 MHz inter-chip data rate
  - 16-mA LVTTL
  - IOB register to IOB register

[Figure: one clock feeds two Virtex FPGAs, each through a DLL; data travels from an output register (Q) in the first chip to an input register (D) in the second. Delays shown: 3.8 ns, 1.9 ns, 0.5 ns]
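
The quoted data rate can be cross-checked against the delays in the figure. This sketch assumes the three delays add up to the minimum clock period, in the spirit of the "min clock period = max clock-to-out + max set-up" rule on page 17; the transcript does not preserve which label belongs to which number, so the values are simply summed. The SSTL3 figures are those shown on page 29.

```python
def max_data_rate_mhz(delays_ns):
    """Maximum register-to-register rate if the delays sum to one clock period."""
    period_ns = sum(delays_ns)
    return 1000.0 / period_ns


lvttl = max_data_rate_mhz([3.8, 1.9, 0.5])   # page 19 figure values
sstl3 = max_data_rate_mhz([2.8, 1.9, 0.3])   # page 29 figure values
print(round(lvttl, 1), "MHz for LVTTL,", round(sstl3, 1), "MHz for SSTL3")
```

Under that assumption the sums give roughly 161 MHz and 200 MHz, matching the 160 MHz and 200 MHz headline rates.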

Page 20: LVTTL Data Rate with DLL

1.4 ns measured clock-to-output delay

Output standard = LVTTL Fast 16 mA (OBUF_F_16)

Temp = 100 C, Vdd = 2.375 V, Vcco = 3.3 V

Waveforms:
  1: CLKIN
  2: DATA OUT (no DLL)
  3: DATA OUT (DLL deskewed)

Timing:
             r->r      r->f
  w/o DLL    3.9 ns    3.9 ns
  w/ DLL     1.4 ns    1.4 ns

Page 21: Other DLL Functions

Double the incoming clock frequency
  - fast internal operation with a slow external clock

Clock mirroring to the PCB

Divide clock by 1.5, 2, 2.5, 3, 4, 5, 8, or 16

Adjust clock duty cycle to 50-50

Create four quadrature clock phases
  - input four sequential bits per clock period

Page 22: Duty Cycle Correction

~25% duty cycle in, 50% duty cycle out

[Figure: a 25 MHz clock with 25% duty cycle enters the 1X DLL of a Virtex FPGA; a 25 MHz clock with 50% duty cycle comes out]

Page 23: Clock Doubling and Mirroring

Clock mirror with less than 100 ps skew
  - simplifies PCB clock distribution

[Figure: actual HDTV customer example. A 37 MHz system clock presents one input load to the Virtex FPGA. DLL 1 and DLL 2 act as zero-delay internal clock buffers. Signals shown: 37 MHz internal, 74 MHz internal, and 74 MHz #1 and 74 MHz #2 driving the SDRAMs, all exactly aligned]

Page 24: Precise Clock Mirroring

2x system clock for board use

[Figure: a 66 MHz clock enters the 2X DLL of a Virtex FPGA; a 132 MHz clock comes out]

Page 25: Clock Division

Divide clock by 1.5, 2, 2.5, 3, 4, 5, 8, or 16
  - maintain synchronous edges

[Figure: waveforms for CLKIN at 200 MHz, CLKOUT at 200 MHz, and CLKDV at 12.5 MHz]
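
As a quick illustration of the divider options, the sketch below lists the CLKDV frequency for each divisor named on the slide. The function name is an illustrative assumption; the 200 MHz input matches the waveform example.

```python
# Divisors listed on the slide.
DIVISORS = [1.5, 2, 2.5, 3, 4, 5, 8, 16]


def clkdv_options_mhz(clkin_mhz):
    """Map each supported divisor to the resulting divided clock frequency."""
    return {d: clkin_mhz / d for d in DIVISORS}


for divisor, freq in clkdv_options_mhz(200).items():
    print("divide by", divisor, "->", round(freq, 2), "MHz")
```

Dividing 200 MHz by 16 gives the 12.5 MHz CLKDV shown in the waveform.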

Page 26: Multi-Standard SelectI/O

[Figure: one FPGA interfacing to a microprocessor, SRAM, SDRAM, FLASH, DSP, mixed-signal devices, and busses/backplanes (3/5V PCI, ISA, GTL...). Standards shown: GTL+, 5V tolerant, 2.5V SSTL, 1.8V, 3.3V LVTTL, 5V]

Page 27: Mix & Match Output Standards

User-supplied voltages determine output swing
  - 3.3 V, 2.5 V, 1.5 V
  - one voltage per bank
  - a bank is half of a chip edge

Output characteristics are programmable on a per-pin basis
  - push-pull or open-drain
  - LVTTL drive strength
    - 2-mA to 24-mA sink and source current
  - LVTTL slew rate

Page 28: Mix & Match Input Standards

Internal or user-supplied threshold voltage
  - selectable on a per-pin basis
  - one user-supplied threshold voltage per bank

Programmable over-voltage protection
  - 5-V tolerant or diode clamp to VCCO
  - selectable on a per-pin basis

[Figure: a bank of inputs, each comparing against either the internal reference or the VREF supplied to the bank]

Page 29: SSTL Clock-to-Out With DLL

200 MHz inter-chip data rate
  - SSTL 3, Class II (Stub Series Transceiver Logic)
  - IOB register to IOB register

[Figure: one clock feeds two Virtex FPGAs, each through a DLL; data travels from an output register (Q) in the first chip to an input register (D) in the second. Delays shown: 2.8 ns, 1.9 ns, 0.3 ns]

Page 30: SSTL Data Rate with DLL

1.3 ns measured clock-to-output delay
  - much lower noise than LVTTL

Output standard = SSTL 3 Class 2 (OBUF_SSTL3_II)

Temp = 100 C, Vdd = 2.375 V, Vcco = 3.3 V, Vtt = 1.5 V

Waveforms:
  1: CLKIN
  2: DATA OUT (no DLL)
  3: DATA OUT (DLL deskewed)

Timing:
             r->r      r->f
  w/o DLL    3.5 ns    3.8 ns
  w/ DLL     1.1 ns    1.3 ns

Page 31: From FPGA to System Component

'Redefining the FPGA'

"Virtex moves FPGAs from glue to system component" - Ron Neale, EE

[Figure: an FPGA (Chip 1) connected to a high-speed system backplane, a low-voltage CPU, SDRAM (133 MHz), and cache SRAM (Mbytes), with x1 and x2 clocks. Standards shown: GTL+, LVTTL, SSTL3, LVCMOS]

Page 32: Power and Thermal Issues

Power and heat are serious concerns

All CMOS power consumption is dynamic
  - proportional to VCC²
  - proportional to capacitance
  - proportional to frequency

Virtex conserves power
  - 2.5-V supply voltage
  - small geometries and short interconnects reduce capacitance
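
The three proportionalities combine into the familiar C·V²·f relation. The sketch below uses it to compare a 2.5 V supply with a 3.3 V supply; the 1 nF switched capacitance and 100 MHz frequency are arbitrary illustrative values that cancel out of the ratio, and any activity factor is omitted because the slide states only proportionality.

```python
def dynamic_power_w(c_farads, vcc_volts, f_hz):
    """CMOS dynamic power, proportional to C, to VCC squared, and to f."""
    return c_farads * vcc_volts ** 2 * f_hz


# Hypothetical load: 1 nF of switched capacitance at 100 MHz.
p_2v5 = dynamic_power_w(1e-9, 2.5, 100e6)
p_3v3 = dynamic_power_w(1e-9, 3.3, 100e6)
ratio = p_2v5 / p_3v3
print("2.5 V draws", round(ratio * 100, 1), "% of the 3.3 V power")
```

At equal capacitance and frequency, moving from 3.3 V to 2.5 V cuts dynamic power to a little over half.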

Page 33: Virtex Power Consumption

Virtex is designed to conserve power
  - 100 MHz 16-bit counters
    - 12.5 MHz average transition rate
    - 6.5 mW per counter including clock distribution
  - 100 MHz 8-bit counters
    - 25 MHz average transition rate
    - 5 mW per counter including clock distribution

  Device     Design                   Total power
  XCV300     384 16-bit counters      2.5 W
  XCV300     768 8-bit counters       3.7 W
  XCV1000    1536 16-bit counters     9.8 W
  XCV1000    3072 8-bit counters      14.7 W
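
The average transition rates quoted above follow from how a binary counter toggles: the least significant bit changes every clock, the next bit every second clock, and so on. The sketch below reproduces those rates and multiplies out the per-counter power; the function names are illustrative assumptions.

```python
def average_transition_rate_mhz(bits, clock_mhz):
    """Average toggle rate per bit of a free-running binary counter."""
    # Bit i (0 = LSB) toggles once every 2**i clocks.
    toggles_per_clock = sum(1.0 / 2 ** i for i in range(bits))
    return clock_mhz * toggles_per_clock / bits


def total_power_w(counters, mw_per_counter):
    """Total power for a number of identical counters."""
    return counters * mw_per_counter / 1000.0


print(round(average_transition_rate_mhz(16, 100), 2), "MHz for 16-bit counters")
print(round(average_transition_rate_mhz(8, 100), 2), "MHz for 8-bit counters")
print(round(total_power_w(384, 6.5), 2), "W for 384 16-bit counters")
```

The per-counter figures on the slide appear to be rounded, so multiplying them out reproduces the quoted totals only approximately for the larger designs.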

Page 34: Thermal Management

Temperature-sensing diode
  - matched to the Maxim MAX1617 A/D
  - programmable alarms
  - similar to the Pentium II solution

[Figure: the DXP and DXN diode pins of the Virtex FPGA connect to a Maxim MAX1617, which provides SMBCLK, SMBDATA, and ALERT]

Page 35: Power Supply Decoupling

CMOS power-supply current is dynamic
  - current pulse every active clock edge

Peak current can be 5x the average current
  - instantaneous current peaks can only be supplied by decoupling capacitors

Use one 0.1 µF ceramic chip capacitor for each power-supply pin
  - low L and R are more important than high C
  - double up for lower L and R if necessary
  - use direct vias to the supply planes, close to the power-supply pins

Page 36: Virtex Configuration

New byte-wide SelectMAP mode
  - up to 528 Mbps at 66 MHz
    - simple handshake protocol
  - up to 400 Mbps at 50 MHz
    - no handshake required

Configuration bit-stream length
  - 0.5 Mbits to 6.1 Mbits

[Figure: control logic (EPLD), a configuration EPROM, and the Virtex FPGA. Signals shown: Address, CS, WE, Data, Busy]
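
The SelectMAP rates are simply the byte width times the clock, which also gives a lower bound on configuration time. This sketch assumes an 8-bit bus transferring one byte per clock with no stalls from the handshake; the function names are illustrative.

```python
def selectmap_rate_mbps(clock_mhz, bus_bits=8):
    """Peak configuration bandwidth for a byte-wide port."""
    return clock_mhz * bus_bits


def config_time_ms(bitstream_mbits, rate_mbps):
    """Best-case time to load a bit-stream at the given rate."""
    return bitstream_mbits / rate_mbps * 1000.0


fast = selectmap_rate_mbps(66)   # with handshake
slow = selectmap_rate_mbps(50)   # without handshake
for mbits in (0.5, 6.1):
    print(mbits, "Mbits:", round(config_time_ms(mbits, fast), 2), "ms at", fast, "Mbps")
```

Even the largest bit-stream quoted on the slide would load in under 12 ms at the peak rate, assuming the data source keeps up.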

Page 37: Volts, Amps, and Watts: Recap

PCB design issues
  - minimize capacitance for higher speed
  - terminate transmission lines to reduce ringing

Chip inputs and outputs
  - use DLLs to maximize I/O bandwidth
  - use SelectI/O to interface with different standards

Power and thermal considerations
  - use the sensing diode to manage chip temperature
  - decouple the power supply well

Configuration
  - configure faster with the SelectMAP mode

Page 38: Designing for 100+ MHz

Volts, Amps, and Watts
  - PCB signal distribution
  - chip inputs and outputs
  - power and thermal considerations

Ones and zeros
  - logic emulation

Bits and bytes
  - memory hierarchy

Page 39: Spending the 10 ns Budget

Fast logic requires fast function generators
  - signals often pass through several function generators

Routing delays must also be kept short
  - there are routing delays between every function generator

Arithmetic delays are important
  - carry chains often create critical paths

Page 40: You Don't Have To Be An Expert

You don't have to be an FPGA architecture expert to implement high-performance designs
  - the benefits of a good architecture are automatic
    - all the logic goes faster
    - software provides easy access to the features

You can achieve high performance only with a good FPGA architecture
  - a good FPGA empowers its users

You'll design better if you know the architecture
  - matching your design style to the available features increases performance and/or lowers cost

Page 41: Virtex CLB

Logic and arithmetic delay reduction demands improvements in the CLB

Virtex CLB is divided into two slices, each with:
  - 2 function generators
  - 2 flip-flops
  - 2 bits of carry logic

[Figure: a CLB drawn as four function generators, each paired with carry logic]

Page 42: Fast Function Generators

Each function generator emulates 2 to 3 levels of logic
  - a 10-level logic path typically requires 3 to 5 function generators in series
  - at 100 MHz, they must be less than 2 ns each including the routing

Virtex has 0.6-ns function generators
  - leaves 1.4 ns for each route
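
The budget arithmetic on this slide can be spelled out as follows. This is a rough sketch that divides the clock period evenly among the function generators in series and ignores flip-flop set-up and clock-to-out times, as the slide's own numbers appear to; the function names are illustrative.

```python
def per_stage_budget_ns(clock_mhz, stages):
    """Clock period divided evenly among function generators in series."""
    return 1000.0 / clock_mhz / stages


def routing_budget_ns(clock_mhz, stages, function_generator_ns=0.6):
    """Time left for each route after the function generator delay."""
    return per_stage_budget_ns(clock_mhz, stages) - function_generator_ns


for stages in (3, 4, 5):
    print(stages, "stages:", round(routing_budget_ns(100, stages), 2), "ns per route")
```

Five function generators in series is the tight case: 2 ns per stage, leaving 1.4 ns per route.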

Page 43: Connecting Function Generators

Some functions need several function generators
  - F5 MUXs connect pairs of function generators
    - functions with 5 to 9 inputs
  - F6 MUXs connect all 4 function generators
    - functions with 6 to 17 inputs

[Figure: four function generators combined in pairs by two F5 MUXs, whose outputs feed an F6 MUX]

Page 44: Fast Local Routing

Local routing provides fast interconnects
  - in a CLB, function generators connect with minimal routing delays
  - fast paths between adjacent CLBs increase flexibility

[Figure: two adjacent CLBs, each with four function generators and carry logic]

Page 45: Use Pipelining for Speed

Shorter clock periods mean doing less each period
  - create a pipeline structure
  - pipeline stages operate concurrently
  - more functions are done at the same time
  - throughput increases

All function generators have output flip-flops
  - most pipeline support is "free"

Page 46: 16-Bit Pipeline in One LUT

In directly cascaded pipelines the flip-flops are not free

One SRLUT can implement up to 16 bits of delay
  - shift data in and select the appropriate tap

[Figure: a 16-bit shift register with Input, Delay Select, and Output]
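
A behavioral model may make the shift-and-select idea concrete. This is a minimal sketch of a 16-bit shift-register LUT, not a description of the actual silicon: the class and method names are hypothetical, and the convention that tap 0 holds the most recently shifted bit is an assumption.

```python
class ShiftRegisterLUT:
    """Behavioral model of a 16-bit shift-register LUT (SRLUT)."""

    DEPTH = 16

    def __init__(self):
        self.bits = [0] * self.DEPTH   # bits[0] holds the newest bit

    def clock(self, data_in):
        """Shift one bit in; the oldest bit falls off the end."""
        self.bits = [data_in & 1] + self.bits[:-1]

    def read(self, delay_select):
        """Return the bit at the selected tap (0 = newest, 15 = oldest)."""
        return self.bits[delay_select]


def delay_line(samples, delay_select):
    """Pass a bit stream through one SRLUT and read a fixed tap after each clock."""
    srl = ShiftRegisterLUT()
    out = []
    for sample in samples:
        srl.clock(sample)
        out.append(srl.read(delay_select))
    return out


print(delay_line([1, 0, 1, 1, 0, 0], 2))
```

Reading tap 2 after each clock returns the input stream delayed by two samples, so a single LUT stands in for a chain of pipeline flip-flops.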

Page 47: Fast Logic Needs Fast Routing

Our typical design with 3 to 5 CLBs needed an average routing delay of 1.4 ns or less
  - the Virtex routing architecture delivers this performance

Delay is independent of direction
  - dependably short delays

[Figure: vector-based interconnect; the circles show 1.4-ns routing regions]

Page 48: Go Farther, Faster

Virtex achieves its speed through a hierarchy of highly buffered routing resources
  - wires span 1, 2, or 6 CLBs

The Virtex routing architecture is designed for large arrays
  - today's FPGAs are big... but tomorrow's will be even bigger

Virtex is designed to maintain its performance even in very large arrays

Page 49: No Routing Congestion

For high-speed applications, routing must be dependably fast
  - not just capable of being fast

In the past, high device utilization has caused routing congestion
  - critical nets might be forced to meander

Virtex minimizes these problems
  - abundant resources prevent congestion

If it needs to be fast, it will be fast, automatically!

Page 50: Built-in Tri-State Busses

Bi-directional busses are supported directly by tri-state buffers built into each CLB
  - two drivers per CLB
  - segmentable every four CLB columns

[Figure: a row of CLBs sharing a tri-state bus]

Page 51: Arithmetic: A Special Case

Adders, accumulators, counters, and comparators all depend on carry chains

Carry-chain logic is usually much deeper than the rest of the design
  - 32 levels for a 16-bit ripple adder
  - too deep to use function generators at 100 MHz
  - arithmetic delays would limit performance

Dedicated carry logic provides the desired speed
  - 16-bit adders can operate at up to 200 MHz register-to-register

Page 52: Wide Arithmetic

64-bit adders would require 128 levels of logic
  - expensive complex carry schemes would be needed to preserve performance

Virtex minimizes the carry propagation delay
  - 100 ps per bit pair
  - zero routing delay between CLBs

Minimal performance loss for each extra bit

16-bit adders operate at up to 200 MHz

64-bit adders operate at up to 135 MHz
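
The two adder speeds are consistent with the 100 ps per bit pair figure, which the sketch below checks. It assumes the adder period grows linearly from the 16-bit, 200 MHz reference point by 100 ps for every additional pair of bits; that extrapolation and the function names are assumptions, not something the slide states.

```python
PS_PER_BIT_PAIR = 100.0   # carry propagation delay quoted on the slide


def adder_period_ns(bits, ref_bits=16, ref_mhz=200.0):
    """Estimated register-to-register period for an adder of the given width."""
    ref_period_ns = 1000.0 / ref_mhz
    extra_pairs = (bits - ref_bits) / 2
    return ref_period_ns + extra_pairs * PS_PER_BIT_PAIR / 1000.0


def adder_fmax_mhz(bits):
    """Estimated maximum clock rate for an adder of the given width."""
    return 1000.0 / adder_period_ns(bits)


for width in (16, 32, 64):
    print(width, "bits:", round(adder_fmax_mhz(width), 1), "MHz")
```

Going from 16 to 64 bits adds 24 bit pairs, or 2.4 ns, which takes the period from 5 ns to 7.4 ns and the clock rate to about 135 MHz.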

Page 53: Efficient Virtex Multipliers

Cascade vs. tree structure
  - cascade is simpler and smaller
  - tree is faster

Virtex gives the best of both worlds
  - as fast as a tree
  - smaller than a cascade

160 MHz clock rate for pipelined 16 x 16 multiplier

[Figure: two bar charts, delay and number of CLBs, comparing cascade, tree, and Virtex tree multipliers at 4 x 4, 8 x 8, and 16 x 16]

Page 54: Fast Address Decoders

Wide address decoders could slow operation
  - wide AND gates with invertible inputs

Virtex carry-chain MUXs can act as AND gates
  - combine function generator ANDs

64-bit decoders operate at up to 155 MHz

[Figure: a chain of carry MUXs, each selecting between 0 and the output of the previous stage]

Page 55: Speed Is Never Wasted

You can never have too much performance
  - excess performance can always be traded for size and cost reduction

Replace single-cycle functions with smaller multi-cycle versions
  - a 2-cycle multiplier is half the cost of a single-cycle multiplier

Reduce costs by designing down to the performance you need

Page 56: Creating a High-Speed Clock

Logic sometimes needs to operate faster than the available clock
  - multiple RAM accesses in a single cycle
  - low-speed PCB clock distribution for power or noise reduction

Virtex DLLs can double and redouble incoming clocks

[Figure: a 45 MHz clock is doubled to 90 MHz by DLL1 (2X) and doubled again to 180 MHz by DLL2 (2X)]

Page 57: Optimized for the Future

Deep sub-micron technology permits larger and larger array sizes
  - poses new circuit-design challenges
  - changes the rules of FPGA architecture

Across-chip routing is the most vulnerable
  - could easily limit design performance

Virtex is designed for long-term growth
  - even long, across-chip routes will remain fast

Virtex is tomorrow's FPGA... today!

Page 58: 10 ns is Long Enough

Virtex CLBs can implement relatively complex functions in 10 ns
  - 0.6 ns per 4-input function generator

Virtex offers fast interconnections
  - even across-chip when fully utilized
  - fast tri-state buses

Support for very fast arithmetic operations
  - 16-bit adders at 200 MHz

Page 59: Implement Designs Automatically

You don't have to be an FPGA wizard to use Virtex

Virtex is optimized for automated implementation
  - uniform structure
    - efficient mapping/synthesis
  - ample routing
    - simple placement and no congestion
  - predictable performance
    - effective synthesis

IP cores speed design even more
  - validated functionality with guaranteed performance

Page 60: Designing for 100+ MHz

Volts, Amps, and Watts
  - PCB signal distribution
  - chip inputs and outputs
  - power and thermal considerations

Ones and zeros
  - logic emulation

Bits and bytes
  - memory hierarchy

Page 61: 100+ MHz Memory

Virtex memory operates up to 200 MHz

High-speed memory has two benefits
  - data storage
    - "work-in-progress"
    - input/output buffers, FIFOs
  - accelerating complex functions
    - store pre-computed values in look-up tables

Page 62: Data Storage Hierarchy

Virtex supports 3 levels of memory hierarchy

On-chip SelectRAM+
  - small-to-medium memories
  - 0.6-ns read access time

On-chip Block SelectRAM+
  - larger memories
  - true dual-ported operation
  - 3.3-ns read access time

Fast SelectI/O interfaces to external RAM
  - DLL boosts memory bandwidth

Page 63: SelectRAM+

SelectRAM+ uses CLB LUTs as user memory
  - 16-deep RAMs
  - 32-deep RAMs
  - 16-deep dual-ported RAMs
  - 16-deep shift registers

Cascadable for larger memories
  - 128 or more words deep
  - uses logic resources for expansion

Page 64: Block SelectRAM+

Up to 32 dual-ported 4096-bit RAM blocks
  - synchronous read and write

True dual-port memory
  - each port has full read and write capability
  - different clocks for each port

Configurable aspect ratio
  - trade width for depth
    - 4096 x 1 bit to 256 x 16 bits
  - separate configurations for each port

Dedicated routing for memory expansion
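
The width-for-depth trade can be enumerated from the block size. The slide gives only the two endpoints (4096 x 1 and 256 x 16); this sketch assumes the intermediate widths are the powers of two in between, and the function name is illustrative.

```python
BLOCK_BITS = 4096   # bits per RAM block, from the slide
MAX_BLOCKS = 32     # largest block count quoted on the slide


def aspect_ratios(max_width=16):
    """List (depth, width) pairs for power-of-two widths up to max_width."""
    ratios = []
    width = 1
    while width <= max_width:
        ratios.append((BLOCK_BITS // width, width))
        width *= 2
    return ratios


for depth, width in aspect_ratios():
    print(depth, "x", width)
print("total block RAM:", MAX_BLOCKS * BLOCK_BITS, "bits")
```

Because each port is configured separately, one port could write 16-bit words while the other reads the same block as a narrower, deeper memory.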

Page 65: High-Speed Memory Interfaces

SelectI/O and DLLs together provide fast access to many types of external memory

Xilinx currently offers two reference designs
  - fully synthesized
  - automatic placement and routing

SDRAM ... up to 125 MHz

ZBTRAM (Zero Bus Turn-around) ... up to 143 MHz

Page 66: Input/Output Data Buffers

High-performance systems need data buffers to decouple internal operation from I/O activity
- I/O may be sporadic (burst-mode busses)
- I/O may be faster or slower
- I/O may be wider or narrower

I/O buffers can take several forms
- dual-ported RAMs
- ping-pong buffers
- FIFOs

Page 67: Dual-ported I/O Buffers

Block SelectRAM+ is ideal for I/O buffers
- dual-ported operation
  - independent clocks and controls
  - bridges between clock domains
  - simultaneous read and write
- port-specific aspect-ratio control
  - built-in rate/width conversions

SelectRAM+ provides similar benefits on a smaller scale
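A rough behavioral sketch of the built-in width conversion: one 4096-bit memory is written through a 16-bit port and read back through an 8-bit port. The class `DualPortBlock` is hypothetical, clocking is not modelled, and the bit ordering (the low byte of a 16-bit word at the lower 8-bit address) is an assumption of this model, not something the slide states.

```python
class DualPortBlock:
    """Behavioral model of one 4096-bit memory seen through two ports."""

    BITS = 4096

    def __init__(self):
        self.bits = [0] * self.BITS

    def write(self, addr, value, width):
        """Write a width-bit word; the port sees BITS // width words."""
        assert 0 <= addr < self.BITS // width
        for i in range(width):
            self.bits[addr * width + i] = (value >> i) & 1

    def read(self, addr, width):
        """Read a width-bit word through a port of the given width."""
        assert 0 <= addr < self.BITS // width
        return sum(self.bits[addr * width + i] << i for i in range(width))


ram = DualPortBlock()
ram.write(0, 0xBEEF, width=16)   # port A configured as 256 x 16
low = ram.read(0, width=8)       # port B configured as 512 x 8
high = ram.read(1, width=8)
print(hex(low), hex(high))
```

The narrow port needs two reads for each wide write, which is the rate-for-width exchange the slide refers to.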

Page 68: Ping Pong Buffers

Ping-pong buffers are pairs of blocks that alternate between input and processing

SRLUT for small buffers
- self-addressing input
- 0.6-ns read access

Larger buffers can use the dual-ported Block RAM
- one address bit alternates read/write areas
- 3.3-ns read access

[Figure: two 16-bit shift registers in a ping-pong arrangement, with Input, Select, Read Address, and Output signals]
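The Block RAM variant, where one address bit alternates the read and write areas, can be sketched as follows. This is a minimal software model under stated assumptions: the class `PingPongBuffer` and its method names are hypothetical, and the dual-ported memory is represented by a single Python list.

```python
class PingPongBuffer:
    """Two halves of one memory; a select bit swaps their roles."""

    def __init__(self, bank_size=16):
        self.bank_size = bank_size
        self.ram = [0] * (2 * bank_size)  # one memory, two halves
        self.select = 0                   # top address bit of the write half

    def write(self, offset, value):
        """Input side: always writes into the half chosen by select."""
        self.ram[self.select * self.bank_size + offset] = value

    def read(self, offset):
        """Processing side: always reads the other half."""
        return self.ram[(1 - self.select) * self.bank_size + offset]

    def swap(self):
        """Toggle the select bit once a block has been filled."""
        self.select ^= 1


buf = PingPongBuffer(bank_size=4)
for i in range(4):
    buf.write(i, 10 + i)          # fill the first block
buf.swap()
print([buf.read(i) for i in range(4)])  # process it while the other half fills
```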

Page 69: Small FIFOs in SRLUTs

Small FIFOs can be implemented in SRLUTs
- word count addresses the output data
- increment and enable SRLUT to Push
- decrement to Pop
- enable only for both

16-Byte FIFO in 4 CLBs
- 16 x 16 in 6 CLBs
- 200+ MHz

Expandable for deeper FIFOs

[Figure: 16-bit shift register with Input and Output, and an up/down word counter driven by Push and Pop]
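The push and pop rules above can be modelled in software. The sketch below is a behavioral approximation, not the Xilinx implementation: the class `SrlFifo` and its `clock` method are hypothetical, the newest word is assumed to sit at tap 0, and the output address is taken as the word count minus one.

```python
class SrlFifo:
    """16-deep FIFO built on an addressable shift register."""

    DEPTH = 16

    def __init__(self):
        self.taps = [0] * self.DEPTH  # taps[0] holds the newest word
        self.count = 0                # number of words currently stored

    def clock(self, push=False, pop=False, data=0):
        """Model one clock edge; return the popped word, or None."""
        if pop and self.count == 0:
            raise IndexError("pop on empty FIFO")
        if push and not pop and self.count == self.DEPTH:
            raise OverflowError("push on full FIFO")
        # The word count addresses the oldest stored word.
        out = self.taps[self.count - 1] if pop else None
        if push:
            # Enable the shift register: every word moves one tap deeper.
            self.taps = [data] + self.taps[:-1]
        # Push increments, pop decrements, both together leave it unchanged.
        self.count += int(push) - int(pop)
        return out


fifo = SrlFifo()
for word in (1, 2, 3):
    fifo.clock(push=True, data=word)
print(fifo.clock(pop=True))                # oldest word
print(fifo.clock(push=True, pop=True, data=4))
```

When push and pop coincide, the shift alone moves the next-oldest word into the addressed tap, which is why the counter stays unchanged in that case.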

Page 70: Large FIFOs in Block RAM

Large FIFOs can use the dual-ported block RAM
- add read and write address counters

Asynchronous push and pop

Different port sizes give rate-for-width conversion

Block RAM FIFOs can operate at up to 170 MHz including flag logic

[Figure: Block SelectRAM+ FIFO with Input and Output data ports; Push and Pop enable the write and read address counters, and control logic produces the Full and Empty flags]
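A simplified model of a FIFO built from a memory plus read and write address counters is sketched below. It runs in a single clock domain, so the asynchronous push and pop mentioned on the slide are not modelled. The class name and the flag scheme (counters that carry one extra wrap bit) are assumptions chosen for illustration; the slide does not say how the flags are generated.

```python
class BlockRamFifo:
    """FIFO from a RAM plus separate read and write address counters."""

    def __init__(self, depth=256):
        self.depth = depth
        self.ram = [0] * depth
        self.wr = 0  # write address counter, with one extra wrap bit
        self.rd = 0  # read address counter, with one extra wrap bit

    @property
    def empty(self):
        return self.wr == self.rd

    @property
    def full(self):
        # Counters differ by exactly one full lap of the memory.
        return (self.wr - self.rd) % (2 * self.depth) == self.depth

    def push(self, value):
        if self.full:
            raise OverflowError("push on full FIFO")
        self.ram[self.wr % self.depth] = value
        self.wr = (self.wr + 1) % (2 * self.depth)

    def pop(self):
        if self.empty:
            raise IndexError("pop on empty FIFO")
        value = self.ram[self.rd % self.depth]
        self.rd = (self.rd + 1) % (2 * self.depth)
        return value


fifo = BlockRamFifo(depth=4)
for word in range(4):
    fifo.push(word)
print(fifo.full, fifo.pop())
```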

Page 71: Pre-computing for Speed

Some functions are too complex for 10-ns logic implementation
- pipelining is not always possible

An alternative is to pre-compute all the possible results and store them in memory
- select a result according to the inputs

Function time is independent of complexity
- 0.6 ns SelectRAM+ access time
- 3.3 ns Block SelectRAM+ access time

The function table can be smaller than the logic
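As an illustration of the pre-computing idea, the sketch below evaluates a function once for every input pattern and afterwards answers by table lookup alone. The bit-count function and the helper names are only examples chosen here, not taken from the slides.

```python
def build_table(func, n_inputs):
    """Evaluate func for every n_inputs-bit input pattern."""
    return [func(x) for x in range(1 << n_inputs)]


def popcount(x):
    """Example function: number of 1 bits in x."""
    return bin(x).count("1")


# A 4-input function needs a 16-word table.
TABLE = build_table(popcount, 4)


def fast_popcount(x):
    """Lookup cost is the same whatever the original function was."""
    return TABLE[x]


print(fast_popcount(0b1011))
```

The lookup takes one memory access regardless of how much logic the original function would have needed.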

Page 72: Multiplication By A Constant

Sometimes, data has to be “scaled”
- multiplied by a constant value

A full multiplier is too expensive
- it can multiply by a variable
- unnecessarily general and too complex

Storing all multiples of the constant is a better alternative
- smaller and much faster

[Figure: a multiplier array taking Constant and Input to produce Scaled Data, compared with a product table addressed by Input to produce Scaled Data]

Page 73: 16-bit Scaler

A 2^16-word product table is impractical
- partition the input into nibbles
  - use 16-word LUTs for nibble products
  - combine the partial products in adders

Roughly half the CLBs of a full multiplier
- for a 16-bit coefficient: 36 CLBs vs. 62 CLBs

Pipeline the adders for extra speed

[Figure: the Input is split into nibbles, each addressing a LUT; the LUT outputs are weighted (x16, x256, x4096) and summed to give Scaled Data]
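The nibble partitioning can be checked with a short software model. This is a sketch of the arithmetic only, with hypothetical function names: one 16-word table of multiples serves all four nibbles, and shifts by 4, 8, and 12 bits stand in for the x16, x256, and x4096 weights.

```python
def make_product_table(constant):
    """The 16 multiples of the constant, one per nibble value."""
    return [constant * n for n in range(16)]


def scale(x, table):
    """Multiply a 16-bit input by the constant using nibble lookups."""
    assert 0 <= x < (1 << 16)
    total = 0
    for i in range(4):
        nibble = (x >> (4 * i)) & 0xF
        # Weight each partial product by 16**i (x1, x16, x256, x4096).
        total += table[nibble] << (4 * i)
    return total


table = make_product_table(12345)
print(scale(0xABCD, table) == 12345 * 0xABCD)
```

Four 16-word lookups plus three additions replace the 2^16-word table while producing the same product.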

Page 74: Changing the Constant

The SRLUT mode can be used to update the table
- “push-only” stack
- last 16 bits loaded define the table

A simple accumulator computes all products of a new constant

[Figure: an accumulator built from two registers, with Constant, Change Constant, Clear, and Load signals, feeding a 16-bit shift register that is addressed by Input to produce Output]
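One way to picture the table update is the model below, which is an assumption-laden sketch rather than the circuit on the slide. The accumulator is assumed to count upward from zero, the shift register is assumed to place the newest word at tap 0, and the read address is therefore inverted (15 minus the nibble). Whole words are shifted here, whereas each hardware LUT holds one bit of the product. All names are hypothetical.

```python
class Srl16:
    """16-word addressable shift register; newest word at tap 0."""

    def __init__(self):
        self.taps = [0] * 16

    def shift_in(self, word):
        self.taps = [word] + self.taps[:-1]

    def read(self, addr):
        return self.taps[addr]


def load_constant(srl, constant):
    """Push 0, C, 2C, ..., 15C; the last 16 pushes define the table."""
    acc = 0                    # accumulator cleared before loading
    for _ in range(16):
        srl.shift_in(acc)
        acc += constant        # next multiple needs only an adder


def lookup(srl, nibble):
    """Assumption: ascending load order, so the address is inverted."""
    return srl.read(15 - nibble)


srl = Srl16()
load_constant(srl, 7)
print(lookup(srl, 5))
load_constant(srl, 100)        # 16 more pushes replace the old table
print(lookup(srl, 5))
```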

Page 75: Large Function Tables

Larger functions can be implemented in the Block SelectRAM+
- 12-input functions
- micro-coded state machines

Data tables can also be implemented
- sine/cosine tables for DSP, for example
- dual-ported access gives the sine and cosine simultaneously
- a simple address offset gives 90° phase shift for accessing sine and cosine from a single table
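The address-offset trick can be sketched as follows. The table size (256 entries) and the 16-bit signed amplitude are assumptions picked so that the table would fit a 256 x 16 configuration; the two lookups stand in for the two ports of a dual-ported memory.

```python
import math

N = 256            # table entries for one full cycle (assumed size)
AMPLITUDE = 32767  # fits a 16-bit signed word (assumed scaling)

SINE = [round(AMPLITUDE * math.sin(2 * math.pi * i / N)) for i in range(N)]


def sin_cos(phase):
    """Return (sine, cosine) from one table.

    The cosine address is the sine address advanced by a quarter
    cycle, i.e. a 90-degree phase shift.
    """
    sine = SINE[phase % N]               # port A
    cosine = SINE[(phase + N // 4) % N]  # port B, offset address
    return sine, cosine


print(sin_cos(0))
```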

Page 76: Block RAM/ROM Creation

CORE Generator software creates RAMs and ROMs
- simple GUI interface

Initialization file is loaded into RAMs and ROMs at configuration time

Page 77: Memory Summary

Virtex has two kinds of internal memory
- distributed SelectRAM+ for small RAMs
- Block SelectRAM+ for larger RAMs

SelectRAM+
- 0.6 ns read access time
- 16- and 32-word RAMs
- 16-word dual-ported RAMs
- 16-word shift registers
  - sequential write/random-access read
  - FIFOs, pipelining, LUT functions, etc.

Page 78: Memory Summary

Dual-ported 4096-bit Block SelectRAM+
- 3.3 ns read access time
- true dual-ported operation
  - both ports are read/write
  - ports can be clocked asynchronously
- configurable aspect ratio
  - 4096 x 1 bit to 256 x 16 bits
  - configure ports differently for width/rate conversion

High-speed SelectI/O access to external RAM

Page 79: Designing for 100+ MHz

Volts, Amps, and Watts
- DLLs and flexible I/O standards
- fast inter-chip communication
- simple rules for good signal integrity

Ones and zeros
- fast logic and fast interconnect
- dependable high performance

Bits and bytes
- distributed SelectRAM+
- dual-ported Block SelectRAM+

Page 80: The Virtex Family

The complete Virtex Data Sheet is on your AppLinx CD-ROM and at www.xilinx.com/partinfo/virtex.pdf

Device     System Gates   Logic Cells   Block RAM
XCV50            57,906         1,758       32 Kb
XCV100          108,904         2,700       40 Kb
XCV150          164,674         3,888       48 Kb
XCV200          236,666         5,292       56 Kb
XCV300          322,970         6,912       64 Kb
XCV400          468,252        10,800       80 Kb
XCV600          661,111        15,552       96 Kb
XCV800          888,439        21,168      112 Kb
XCV1000       1,124,022        27,648      128 Kb

User I/O by package
CS144: 94 94
TQ144: 94 94
PQ/HQ240: 164 164 164 164 164 164 164 164
BG256: 180 180
BG352: 260 260 260
BG432: 316 316 316 316
BG560: 404 404 404 404
FG256: 176 176 176 176
FG456: 260 284 312
FG600: 404 404 404
FG680: 500 514 514
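The Block RAM row can be cross-checked against the 4096-bit block size quoted earlier in the deck. The short calculation below assumes 1 Kb means 1024 bits; the function and dictionary names are hypothetical, and the capacities are copied from the table.

```python
BLOCK_BITS = 4096  # bits per Block SelectRAM+

# Block RAM capacity in Kb, from the family table.
BLOCK_RAM_KB = {
    "XCV50": 32, "XCV100": 40, "XCV150": 48, "XCV200": 56, "XCV300": 64,
    "XCV400": 80, "XCV600": 96, "XCV800": 112, "XCV1000": 128,
}


def block_count(kbits):
    """Number of 4096-bit blocks, assuming 1 Kb = 1024 bits."""
    total_bits = kbits * 1024
    assert total_bits % BLOCK_BITS == 0
    return total_bits // BLOCK_BITS


for device, kbits in BLOCK_RAM_KB.items():
    print(device, block_count(kbits))
```

Under that assumption the largest device works out to 32 blocks, in line with the "up to 32" figure on the Block SelectRAM+ slide.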

Page 81: Designing for 100+ MHz