Designing for 100+ MHz
1Designing for 100+MHz
Designing for 100+ MHz
2Designing for 100+MHz
1999 Designs Demand...
Higher system speed
Higher integration
- smaller size, less power, better reliability
Lower cost
Shorter development time
Better product differentiation
3Designing for 100+MHz
Traditional Multi-Chip Boards
Discrete design components
- CPU, memory
- bus transceivers, PCI controller, FIFOs
- Ethernet controller, graphics accelerator, MPEG, DSP, etc.
- programmable logic as glue and custom function
Advantages:
- well-documented sophisticated functions
- readily available as IP in silicon
4Designing for 100+MHz
Multi-Chip Board Problems
Physical size
Power consumption and reliability
PC board signal integrity
Limited flexibility
- prevents design modifications and upgrades
- prevents product diversification
- prevents product customization
Poor product differentiation
- standard parts = standard architecture
5Designing for 100+MHz
FPGA Advantages
Smaller size
Lower power consumption
Better signal integrity
- fewer PC-board issues
Enhanced flexibility
- easy modifications, upgrades, etc.
Enhanced product differentiation
- proprietary architectures
6Designing for 100+MHz
FPGA Users Want...
System clock rate of 100+ MHz
>100,000 gates
Efficient design methodologies
Availability of well-documented Cores
Reasonable cost
7Designing for 100+MHz
The FPGA Solution
4th Generation FPGA: Logic + Memory + Routing
Multi-Standard SelectI/O
Temperature Sensing
Delay-Locked Loop for Fast Clock and I/O
3.3 ns Synchronous Dual-Port SRAM
500 Mbps SelectMAP Configuration
8Designing for 100+MHz
Now the Challenge...
Together, we can do it...
- we’ll supply the ingredients...
- you use them intelligently
But don’t forget...
- the clock period is less than 10 ns!
Design a 100+ MHz system
9Designing for 100+MHz
Designing for 100+ MHz
Volts, Amps, and Watts
- PCB signal distribution
- chip inputs and outputs
- power and thermal considerations
Ones and zeros
- logic emulation
Bits and bytes
- memory hierarchy
10Designing for 100+MHz
Moore Meets Einstein
Speed doubles every 5 years... but the speed of light never changes
[Chart: clock frequency (MHz) and trace length (inches per 1/4 clock period) versus year, 1965 to 2010; logarithmic axis from 1 to 2048]
11Designing for 100+MHz
Volts, Amps, and Watts
PCB design issues
- capacitive loading
- transmission lines and termination
Chip inputs and outputs
- clock distribution and DLLs
- I/O standards
Power and thermal considerations
- temperature-sensing diode
- power supply decoupling
Configuration
- new SelectMAP mode
12Designing for 100+MHz
Capacitive Loading
Capacitance slows outputs and increases power
- output delay increase:
  - ~25 ps per pF of additional loading
- output power dissipation increase:
  - 11 µW per MHz per pF with 3.3-V swing
Sources of capacitance
- 10 pF max for each device pin
- 2 pF per inch for narrow traces (0.8 pF/cm)
- 130 pF per inch² for copper areas (20 pF/cm²)
IBIS files provide output impedance details
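The loading rules of thumb above can be turned into a quick estimate. The sketch below is illustrative only: the function names and the example net (4 inches of narrow trace, two receiver pins, 50 MHz switching) are assumptions, and "per MHz" is taken to mean the output switching frequency. The 11 µW figure is consistent with C·V²·f for 1 pF at 3.3 V and 1 MHz (about 10.9 µW).

```python
# Rules of thumb taken from the slide; names and the example net are illustrative.
DELAY_PS_PER_PF = 25.0          # extra output delay per pF of added load
POWER_UW_PER_MHZ_PER_PF = 11.0  # extra output power per MHz per pF at a 3.3-V swing
PIN_PF_MAX = 10.0               # worst-case capacitance of one device pin
TRACE_PF_PER_INCH = 2.0         # narrow PCB trace


def load_capacitance_pf(trace_inches, receiver_pins):
    """Estimate the load on an output: trace plus worst-case receiver pins."""
    return trace_inches * TRACE_PF_PER_INCH + receiver_pins * PIN_PF_MAX


def loading_penalty(load_pf, switching_mhz):
    """Return (extra delay in ns, extra power in mW) caused by the load."""
    extra_delay_ns = load_pf * DELAY_PS_PER_PF / 1000.0
    extra_power_mw = load_pf * switching_mhz * POWER_UW_PER_MHZ_PER_PF / 1000.0
    return extra_delay_ns, extra_power_mw


load = load_capacitance_pf(trace_inches=4, receiver_pins=2)
delay_ns, power_mw = loading_penalty(load, switching_mhz=50)
print(f"{load:.0f} pF load: +{delay_ns:.2f} ns delay, +{power_mw:.1f} mW")
```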
13Designing for 100+MHz
Transmission Lines
Some traces must be treated as transmission lines to minimize ringing
- transmission line if round trip > transition time
- lumped capacitance if round trip < transition time
Signal delay on a PCB:
- 140 to 180 ps per inch (50 to 70 ps/cm)
Lumped-capacitance trace length:
- 3 inches max for a 1-ns transition time (7.5 cm)
- 6 inches max for a 2-ns transition time (15 cm)
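The length limits follow from the round-trip rule. A minimal sketch, assuming the limit is reached when the round-trip delay equals the transition time (the function name is illustrative):

```python
def max_lumped_length_inches(transition_ns, delay_ps_per_inch):
    """Longest trace whose round-trip delay does not exceed the transition time."""
    round_trip_ps_per_inch = 2.0 * delay_ps_per_inch
    return transition_ns * 1000.0 / round_trip_ps_per_inch


for transition_ns in (1.0, 2.0):
    shortest = max_lumped_length_inches(transition_ns, 180.0)  # slowest board
    longest = max_lumped_length_inches(transition_ns, 140.0)   # fastest board
    print(f"{transition_ns:.0f} ns edge: {shortest:.1f} to {longest:.1f} inches")
```

The slide's 3-inch and 6-inch limits fall inside the ranges this produces for the 140 to 180 ps/inch delay figures.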
14Designing for 100+MHz
Terminated Transmission Lines: Reflections and Ringing
Traditional Thevenin termination at the end
Dynamic termination at the end is better and saves power
Series termination at the source is best (single source and destination only!)
[Figure: termination schemes on 50 Ω lines; component values shown: 100 Ω and 100 Ω to VCC (50 Ω total), 50 Ω with 100 pF, and 22 Ω / 27 Ω series resistors]
15Designing for 100+MHz
On-Chip Clock Distribution
Clock distribution introduces delay
- larger chips suffer more clock delay
[Figure: clock and data paths reaching the IOB and CLB]
16Designing for 100+MHz
Clock Delay Problems
Clock delay increases clock-to-output times
Clock delay leads to unacceptable input hold time
- set-up time is negative
Additional data delay can eliminate the hold time
- set-up time becomes positive
- but tolerance build-up widens the data-valid window
[Figure: IOB flip-flop (D, Q) with data input, optional data delay, and clock distribution delay; timing diagram of the required data-valid window without and with the delay]
17Designing for 100+MHz
DLLs Maximize I/O Speed
Clock-to-output time plus set-up time determines the I/O speed and data bandwidth
- min clock period = max clock-to-out + max set-up
Traditional solution:
- use highly buffered, balanced clock trees
  - needed to reduce internal clock skew
  - cannot totally eliminate the delay
The Virtex solution:
- use a Delay-Locked Loop (DLL)
  - aligns the internal and external clocks
  - effectively eliminates the clock-distribution delay
18Designing for 100+MHz
Virtex Has 4 Independent DLLs
DLLs adjust clock delay to align internal and external clocks
- digital closed-loop control
- 25 to 200-MHz range, 35-picosecond resolution
[Figure: DLL with clock input, adjustable delay, comparator, and error feedback; clock and data reach the IOB and CLB]
19Designing for 100+MHz
Fast Clock-to-Out With DLL
160 MHz inter-chip data rate
- 16-mA LVTTL
- IOB register to IOB register
[Figure: two Virtex FPGAs, each with a DLL, sharing one clock; data goes from an IOB register (Q) in one to an IOB register (D) in the other; delays shown: 3.8 ns, 1.9 ns, 0.5 ns]
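The quoted data rate can be checked against the delays in the figure. This sketch assumes the three delays shown (3.8 ns, 1.9 ns, 0.5 ns) simply add up to the minimum register-to-register clock period; the slide text does not say which delay is clock-to-out, board delay, or set-up, so they are treated as an unlabeled list. The second call uses the figures from the SSTL slide later in the deck.

```python
def max_data_rate_mhz(path_delays_ns):
    """Data rate when the clock period equals the sum of the path delays."""
    period_ns = sum(path_delays_ns)
    return 1000.0 / period_ns


lvttl_mhz = max_data_rate_mhz([3.8, 1.9, 0.5])  # LVTTL figure: 6.2 ns total
sstl_mhz = max_data_rate_mhz([2.8, 1.9, 0.3])   # SSTL 3 figure: 5.0 ns total
print(f"LVTTL: {lvttl_mhz:.0f} MHz, SSTL 3: {sstl_mhz:.0f} MHz")
```

Under that assumption the totals come out at roughly 161 MHz and 200 MHz, in line with the 160 MHz and 200 MHz rates quoted on the two slides.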
20Designing for 100+MHz
LVTTL Data Rate with DLL
1.4 ns measured clock-to-output delay
Output standard = LVTTL Fast 16 mA (OBUF_F_16)
Temp = 100 °C, Vdd = 2.375 V, Vcco = 3.3 V
Waveforms:
1: CLKIN
2: DATA OUT (no DLL)
3: DATA OUT (DLL deskewed)
Timing
            r->r      r->f
w/o DLL     3.9 ns    3.9 ns
w/ DLL      1.4 ns    1.4 ns
21Designing for 100+MHz
Other DLL Functions
Double the incoming clock frequency
- fast internal operation, slow external clock
Clock mirroring to the PCB
Divide clock by 1.5, 2, 2.5, 3, 4, 5, 8, or 16
Adjust clock duty cycle to 50-50
Create four quadrature clock phases
- input four sequential bits per clock period
22Designing for 100+MHz
Duty Cycle Correction
~25% duty cycle in, 50% duty cycle out
[Figure: a 25 MHz clock with 25% duty cycle enters the Virtex FPGA DLL (1X) and leaves as a 25 MHz clock with 50% duty cycle]
23Designing for 100+MHz
Clock Doubling and Mirroring
Clock mirror with less than 100 ps skew
- simplifies PCB clock distribution
Actual HDTV customer example
[Figure: a 37 MHz system clock (1 input load) feeds DLL 1 and DLL 2 inside the Virtex FPGA; labels include zero-delay internal clock buffer, 37 MHz internal, 74 MHz internal, and exactly aligned 74 MHz outputs #1 and #2 to two SDRAMs]
24Designing for 100+MHz
Precise Clock Mirroring
2x system clock for board use
[Figure: a 66 MHz clock enters the Virtex FPGA DLL (2X), which outputs a 132 MHz clock]
25Designing for 100+MHz
Clock Division
Divide clock by 1.5, 2, 2.5, 3, 4, 5, 8, or 16
- maintain synchronous edges
[Waveforms: CLKIn 200 MHz, CLKout 200 MHz, CLKDV 12.5 MHz]
26Designing for 100+MHz
Multi-Standard SelectI/O
[Figure: an FPGA interfacing to a microprocessor, SRAM, DSP, mixed-signal devices, FLASH, SDRAM, and busses/backplanes (3/5V PCI, ISA, GTL…); I/O labels: GTL+, 5V tolerant, 2.5V SSTL, 1.8V, 3.3V LVTTL, 5V]
27Designing for 100+MHz
Mix & Match Output Standards
User-supplied voltages determine output swing
- 3.3 V, 2.5 V, 1.5 V
- one voltage per bank
- a bank is half of a chip edge
Output characteristics are programmable on a per-pin basis
- push-pull or open-drain
- LVTTL drive strength
  - 2-mA to 24-mA sink and source current
- LVTTL slew rate
28Designing for 100+MHz
Mix & Match Input Standards
Internal or user-supplied threshold voltage
- selectable on a per-pin basis
- one user-supplied threshold voltage per bank
Programmable over-voltage protection
- 5-V tolerant or diode clamp to VCCO
- selectable on a per-pin basis
[Figure: inputs using either an internal reference or a shared VREF]
29Designing for 100+MHz
SSTL Clock-to-Out With DLL
200 MHz inter-chip data rate
- SSTL 3, Class II (Stub Series Terminated Logic)
- IOB register to IOB register
[Figure: two Virtex FPGAs, each with a DLL, sharing one clock; data goes from an IOB register (Q) in one to an IOB register (D) in the other; delays shown: 2.8 ns, 1.9 ns, 0.3 ns]
30Designing for 100+MHz
SSTL Data Rate with DLL
1.3 ns measured clock-to-output delay
- much lower noise than LVTTL
Output standard = SSTL 3 Class II (OBUF_SSTL3_II)
Temp = 100 °C, Vdd = 2.375 V, Vcco = 3.3 V, Vtt = 1.5 V
Waveforms:
1: CLKIN
2: DATA OUT (no DLL)
3: DATA OUT (DLL deskewed)
Timing
            r->r      r->f
w/o DLL     3.5 ns    3.8 ns
w/ DLL      1.1 ns    1.3 ns
31Designing for 100+MHz
From FPGA to System Component: ‘Redefining the FPGA’
"Virtex moves FPGAs from glue to system component” - Ron Neale, EE
[Figure: an FPGA as the hub of a system; labels: GTL+, high-speed system backplane, low-voltage CPU, LVTTL, SDRAM (133 MHz), SSTL3, cache SRAM (Mbytes), LVCMOS, Chip 1, x1 CLK, x2 CLK]
32Designing for 100+MHz
Power and Thermal Issues
Power and heat are serious concerns
All CMOS power consumption is dynamic
- proportional to VCC²
- proportional to capacitance
- proportional to frequency
Virtex conserves power
- 2.5-V supply voltage
- small geometries and short interconnects reduce capacitance
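The three proportionalities combine into the usual P ∝ C·VCC²·f relation. A minimal sketch of what the supply voltage alone contributes, assuming a 3.3-V part as the comparison point (that baseline and the function name are assumptions, not figures from the slides):

```python
def dynamic_power_ratio(v_new, v_old, capacitance_ratio=1.0, frequency_ratio=1.0):
    """Relative dynamic power after changing supply, capacitance, and frequency."""
    return (v_new / v_old) ** 2 * capacitance_ratio * frequency_ratio


ratio = dynamic_power_ratio(v_new=2.5, v_old=3.3)
print(f"2.5 V vs 3.3 V at equal C and f: {ratio:.0%} of the power")
```

With capacitance and frequency held equal, the 2.5-V supply alone brings dynamic power down to a little under 60% of the 3.3-V figure; any reduction in capacitance multiplies in on top of that.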
33Designing for 100+MHz
Virtex Power Consumption
Virtex is designed to conserve power
- 100 MHz 16-bit counters
  - 12.5 MHz average transition rate
  - 6.5 mW per counter including clock distribution
- 100 MHz 8-bit counters
  - 25 MHz average transition rate
  - 5 mW per counter including clock distribution
Device     Design                   Total power
XCV300     384 16-bit counters      2.5 W
XCV300     768 8-bit counters       3.7 W
XCV1000    1536 16-bit counters     9.8 W
XCV1000    3072 8-bit counters      14.7 W
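The average transition rates quoted for the counters can be reproduced with a simple model. The sketch assumes a plain binary counter in which bit k changes once every 2^k clocks, so bit 0 changes at the clock rate, bit 1 at half of it, and so on; the function name is illustrative.

```python
def average_transition_rate_mhz(counter_bits, clock_mhz):
    """Mean per-bit transition rate of a free-running binary counter."""
    total_transitions_mhz = sum(clock_mhz / 2 ** k for k in range(counter_bits))
    return total_transitions_mhz / counter_bits


rate_16 = average_transition_rate_mhz(16, 100.0)
rate_8 = average_transition_rate_mhz(8, 100.0)
print(f"16-bit: {rate_16:.1f} MHz, 8-bit: {rate_8:.1f} MHz")

# Total for 384 sixteen-bit counters at the quoted 6.5 mW each, in watts.
total_w = 384 * 6.5 / 1000.0
print(f"384 x 6.5 mW = {total_w:.2f} W")
```

The per-counter figures multiply out to within a few percent of the totals on the slide.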
34Designing for 100+MHz
Thermal Management
Temperature-sensing diode
- matched to the Maxim MAX1617 A/D
- programmable alarms
- similar to the Pentium II solution
[Figure: Virtex FPGA diode pins DXP and DXN connected to a Maxim MAX1617, which provides SMBCLK, SMBDATA, and ALERT]
35Designing for 100+MHz
Power Supply Decoupling
CMOS power-supply current is dynamic
- current pulse every active clock edge
Peak current can be 5x the average current
- instantaneous current peaks can only be supplied by decoupling capacitors
Use one 0.1 µF ceramic chip capacitor for each power-supply pin
- low L and R are more important than high C
- double up for lower L and R if necessary
- use direct vias to the supply planes, close to the power-supply pins
36Designing for 100+MHz
Virtex Configuration
New byte-wide SelectMAP mode
- up to 528 Mbps at 66 MHz
  - simple handshake protocol
- up to 400 Mbps at 50 MHz
  - no handshake required
Configuration bit-stream length
- 0.5 Mbits to 6.1 Mbits
[Figure: a configuration EPROM and control logic (EPLD) driving the Virtex FPGA; signals: Address, CS, Data, WE, Busy]
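The SelectMAP rates are just the byte-wide bus width times the clock. A small sketch of the resulting configuration times, assuming 1 Mbit = 10^6 bits, one byte transferred on every clock, and no protocol overhead (the function names are illustrative):

```python
BUS_WIDTH_BITS = 8  # SelectMAP is byte-wide


def selectmap_mbps(clock_mhz):
    """Peak throughput with one byte transferred per clock."""
    return BUS_WIDTH_BITS * clock_mhz


def config_time_ms(bitstream_mbits, clock_mhz):
    """Best-case time to load a bit-stream of the given length."""
    return bitstream_mbits / selectmap_mbps(clock_mhz) * 1000.0


print(selectmap_mbps(66), selectmap_mbps(50))  # 528 and 400 Mbps
for mbits in (0.5, 6.1):
    print(f"{mbits} Mbits at 66 MHz: {config_time_ms(mbits, 66):.2f} ms")
```

Under these assumptions even the longest quoted bit-stream loads in under 12 ms at 66 MHz.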
37Designing for 100+MHz
Volts, Amps, and Watts: Recap
PCB design issues
- minimize capacitance for higher speed
- terminate transmission lines to reduce ringing
Chip inputs and outputs
- use DLLs to maximize I/O bandwidth
- use SelectI/O to interface with different standards
Power and thermal considerations
- use the sensing diode to manage chip temperature
- decouple the power supply well
Configuration
- configure faster with the SelectMAP mode
38Designing for 100+MHz
Designing for 100+ MHz
Volts, Amps, and Watts
- PCB signal distribution
- chip inputs and outputs
- power and thermal considerations
Ones and zeros
- logic emulation
Bits and bytes
- memory hierarchy
39Designing for 100+MHz
Spending the 10 ns Budget
Fast logic requires fast function generators
- signals often pass through several function generators
Routing delays must also be kept short
- there is a routing delay between each pair of function generators
Arithmetic delays are important
- carry chains often create critical paths
40Designing for 100+MHz
You Don’t Have To Be An Expert
You don’t have to be an FPGA architecture expert to implement high-performance designs
- the benefits of a good architecture are automatic
  - all the logic goes faster
  - software provides easy access to the features
You can achieve high performance only with a good FPGA architecture
- a good FPGA empowers its users
You’ll design better if you know the architecture
- matching your design style to the available features increases performance and/or lowers cost
41Designing for 100+MHz
Virtex CLB
Logic and arithmetic delay reduction demands improvements in the CLB
Virtex CLB is divided into two slices, each with:
- 2 function generators
- 2 flip-flops
- 2 bits of carry logic
[Figure: four function generators, each paired with carry logic]
42Designing for 100+MHz
Fast Function Generators
Each function generator emulates 2 to 3 levels of logic
- a 10-level logic path typically requires 3 to 5 function generators in series
- at 100 MHz, they must be less than 2 ns each including the routing
Virtex has 0.6-ns function generators
- leaves 1.4 ns for each route
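The budget arithmetic behind these figures is short enough to write out. A minimal sketch, assuming the clock period is split evenly across the function generators in series and ignoring flip-flop set-up and clock-to-out time (the function name is illustrative):

```python
def routing_budget_ns(clock_mhz, levels_in_series, function_generator_ns):
    """Routing delay left per level after the function generator delay."""
    period_ns = 1000.0 / clock_mhz
    per_level_ns = period_ns / levels_in_series
    return per_level_ns - function_generator_ns


for levels in (3, 4, 5):
    budget = routing_budget_ns(100.0, levels, 0.6)
    print(f"{levels} levels: {budget:.2f} ns of routing per level")
```

Five levels is the worst case on the slide and gives the 1.4 ns routing allowance; shallower paths leave more.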
43Designing for 100+MHz
Connecting Function Generators
Some functions need several function generators
- F5 MUXs connect pairs of function generators
  - functions with 5 to 9 inputs
- F6 MUXs connect all 4 function generators
  - functions with 6 to 17 inputs
[Figure: four function generators combined by two F5 MUXs and one F6 MUX]
44Designing for 100+MHz
Fast Local Routing
Local routing provides fast interconnects
- in a CLB, function generators connect with minimal routing delays
- fast paths between adjacent CLBs increase flexibility
[Figure: eight function generators with carry logic in adjacent CLBs]
45Designing for 100+MHz
Use Pipelining for Speed
Shorter clock periods mean doing less each period
- create a pipeline structure
- pipeline stages operate concurrently
- more functions are done at the same time
- throughput increases
All function generators have output flip-flops
- most pipeline support is “free”
46Designing for 100+MHz
16-Bit Pipeline in One LUT
In directly cascaded pipelines the flip-flops are not free
One SRLUT can implement up to 16 bits of delay
- shift data in and select the appropriate tap
[Figure: 16-bit shift register with Input, Delay Select, and Output]
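A behavioral model makes the "shift data in and select the appropriate tap" idea concrete. This is a software sketch, not the device primitive: the class and method names are invented, and the model reads the output before each clock edge so that a setting of N gives exactly N clocks of delay.

```python
class ShiftRegisterLUT:
    """Behavioral model of a 16-bit shift register with a selectable output tap."""

    DEPTH = 16

    def __init__(self):
        self.bits = [0] * self.DEPTH  # bits[0] holds the most recent input

    def clock(self, data_in):
        """Shift one bit in; the oldest bit falls off the end."""
        self.bits = [data_in & 1] + self.bits[:-1]

    def read(self, tap):
        """Return the bit that was shifted in tap + 1 clocks ago."""
        return self.bits[tap]


def delay_line(stream, delay_clocks):
    """Delay a bit stream by 1 to 16 clocks using a single shift register."""
    srl = ShiftRegisterLUT()
    delayed = []
    for bit in stream:
        delayed.append(srl.read(delay_clocks - 1))  # output seen before the edge
        srl.clock(bit)
    return delayed


print(delay_line([1, 0, 1, 1, 0, 0, 0, 0], delay_clocks=3))
```

Changing the delay is only a matter of changing the tap address; no flip-flops are added or removed.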
47Designing for 100+MHz
Fast Logic Needs Fast Routing
Our typical design with 3 to 5 CLBs needed an average routing delay of 1.4 ns or less
- the Virtex routing architecture delivers this performance
Delay is independent of direction
- dependably short delays
Vector-based interconnect
The circles show 1.4-ns routing regions
48Designing for 100+MHz
Go Farther, Faster
Virtex achieves its speed through a hierarchy of highly buffered routing resources
- wires span 1, 2, or 6 CLBs
The Virtex routing architecture is designed for large arrays
- today’s FPGAs are big… but tomorrow’s will be even bigger
Virtex is designed to maintain its performance even in very large arrays
49Designing for 100+MHz
No Routing Congestion
For high-speed applications, routing must be dependably fast
- not just capable of being fast
In the past, high device utilization has caused routing congestion
- critical nets might be forced to meander
Virtex minimizes these problems
- abundant resources prevent congestion
If it needs to be fast, it will be fast, automatically!
50Designing for 100+MHz
Built-in Tri-State Busses
Bi-directional busses are supported directly by tri-state buffers built into each CLB
- two drivers per CLB
- segmentable every four CLB columns
[Figure: a row of five CLBs sharing a bus]
51Designing for 100+MHz
Arithmetic: A Special Case
Adders, accumulators, counters, and comparators all depend on carry chains
Carry-chain logic is usually much deeper than the rest of the design
- 32 levels for a 16-bit ripple adder
- too deep to use function generators at 100 MHz
- arithmetic delays would limit performance
Dedicated carry logic provides the desired speed
- 16-bit adders can operate at up to 200 MHz register-to-register
52Designing for 100+MHz
Wide Arithmetic
64-bit adders would require 128 levels of logic
- expensive complex carry schemes would be needed to preserve performance
Virtex minimizes the carry propagation delay
- 100 ps per bit pair
- zero routing delay between CLBs
Minimal performance loss for each extra bit
16-bit adders operate at up to 200 MHz
64-bit adders operate at up to 135 MHz
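The two adder speeds are consistent with the quoted carry delay. The sketch below assumes a simple linear model: start from the 16-bit adder's 200 MHz period and add 100 ps for every extra bit pair, with nothing else changing. The model and the function name are assumptions made for illustration.

```python
CARRY_NS_PER_BIT_PAIR = 0.1  # 100 ps per bit pair, from the slide


def adder_fmax_mhz(width_bits, reference_bits=16, reference_mhz=200.0):
    """Estimate adder speed by extending the reference period with carry delay."""
    reference_period_ns = 1000.0 / reference_mhz
    extra_bit_pairs = (width_bits - reference_bits) / 2
    period_ns = reference_period_ns + extra_bit_pairs * CARRY_NS_PER_BIT_PAIR
    return 1000.0 / period_ns


for width in (16, 32, 64):
    print(f"{width}-bit adder: about {adder_fmax_mhz(width):.0f} MHz")
```

For 64 bits this gives 5.0 ns + 24 × 0.1 ns = 7.4 ns, or about 135 MHz, which is the figure on the slide.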
53Designing for 100+MHz
Efficient Virtex Multipliers
Cascade vs. tree structure
- cascade is simpler and smaller
- tree is faster
Virtex gives the best of both worlds
- as fast as a tree
- smaller than a cascade
160 MHz clock rate for pipelined 16 x 16 multiplier
[Charts: delay and number of CLBs for 4 x 4, 8 x 8, and 16 x 16 multipliers, comparing cascade, tree, and Virtex tree]
54Designing for 100+MHz
Fast Address Decoders
Wide address decoders could slow operation
- wide AND gates with invertible inputs
Virtex carry-chain MUXs can act as AND gates
- combine function generator ANDs
64-bit decoders operate at up to 155 MHz
[Figure: a chain of carry MUXs with 0/1 inputs]
55Designing for 100+MHz
Speed Is Never Wasted
You can never have too much performance
- excess performance can always be traded for size and cost reduction
Replace single-cycle functions with smaller multi-cycle versions
- a 2-cycle multiplier is half the cost of a single-cycle multiplier
Reduce costs by designing down to the performance you need
56Designing for 100+MHz
Creating a High-Speed Clock
Logic sometimes needs to operate faster than the available clock
- multiple RAM accesses in a single cycle
- low-speed PCB clock distribution for power or noise reduction
Virtex DLLs can double and redouble incoming clocks
[Figure: a 45 MHz clock doubled by DLL1 (2X) to 90 MHz and again by DLL2 (2X) to 180 MHz]
57Designing for 100+MHz
Optimized for the Future
Deep sub-micron technology permits larger and larger array sizes
- poses new circuit-design challenges
- changes the rules of FPGA architecture
Across-chip routing is the most vulnerable
- could easily limit design performance
Virtex is designed for long-term growth
- even long, across-chip routes will remain fast
Virtex is tomorrow’s FPGA… today!
58Designing for 100+MHz
10 ns is Long Enough
Virtex CLBs can implement relatively complex functions in 10 ns
- 0.6 ns per 4-input function generator
Virtex offers fast interconnections
- even across-chip when fully utilized
- fast tri-state buses
Support for very fast arithmetic operations
- 16-bit adders at 200 MHz
59Designing for 100+MHz
Implement Designs Automatically
You don’t have to be an FPGA wizard to use Virtex
Virtex is optimized for automated implementation
- uniform structure
  - efficient mapping/synthesis
- ample routing
  - simple placement and no congestion
- predictable performance
  - effective synthesis
IP cores speed design even more
- validated functionality with guaranteed performance
60Designing for 100+MHz
Designing for 100+ MHz
Volts, Amps, and Watts
- PCB signal distribution
- chip inputs and outputs
- power and thermal considerations
Ones and zeros
- logic emulation
Bits and bytes
- memory hierarchy
61Designing for 100+MHz
100+ MHz Memory
Virtex memory operates up to 200 MHz
High-speed memory has two benefits
- data storage
  - “work-in-progress”
  - input/output buffers, FIFOs
- accelerating complex functions
  - store pre-computed values in look-up tables
62Designing for 100+MHz
Data Storage Hierarchy
Virtex supports 3 levels of memory hierarchy
On-chip SelectRAM+
- small-to-medium memories
- 0.6-ns read access time
On-chip Block SelectRAM+
- larger memories
- true dual-ported operation
- 3.3-ns read access time
Fast SelectI/O interfaces to external RAM
- DLL boosts memory bandwidth
63Designing for 100+MHz
SelectRAM+
SelectRAM+ uses CLB LUTs as user memory
- 16-deep RAMs
- 32-deep RAMs
- 16-deep dual-ported RAMs
- 16-deep shift registers
Cascadable for larger memories
- 128 or more words deep
- uses logic resources for expansion
64Designing for 100+MHz
Block SelectRAM+
Up to 32 dual-ported 4096-bit RAM Blocks
- synchronous read and write
True dual-port memory
- each port has full read and write capability
- different clocks for each port
Configurable aspect ratio
- trade width for depth
  - 4096 x 1 bit to 256 x 16 bits
- separate configurations for each port
Dedicated routing for memory expansion
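Trading width for depth keeps the block at 4096 bits. The slide gives only the two end points (4096 x 1 and 256 x 16); the sketch below assumes the intermediate power-of-two widths are also available, and the function name is illustrative.

```python
BLOCK_BITS = 4096
ASSUMED_WIDTHS = (1, 2, 4, 8, 16)  # only 1 and 16 are stated on the slide


def aspect_ratios():
    """List (depth, width) pairs that use exactly one 4096-bit block."""
    return [(BLOCK_BITS // width, width) for width in ASSUMED_WIDTHS]


for depth, width in aspect_ratios():
    print(f"{depth} x {width}")

# Width conversion between ports: one 16-bit word written on one port
# corresponds to this many 1-bit reads on the other port.
reads_per_write = 16 // 1
```

Configuring the two ports with different widths is what gives the width-for-rate conversion used in the I/O buffer slides.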
65Designing for 100+MHz
High-Speed Memory Interfaces
SelectI/O and DLLs together provide fast access to many types of external memory
Xilinx currently offers two reference designs
- fully synthesized
- automatic placement and routing
SDRAM … up to 125 MHz
ZBTRAM (Zero Bus Turn-around) … up to 143 MHz
66Designing for 100+MHz
Input/Output Data Buffers
High-performance systems need data buffers to decouple internal operation from I/O activity
- I/O may be sporadic (burst-mode busses)
- I/O may be faster or slower
- I/O may be wider or narrower
I/O buffers can take several forms
- dual-ported RAMs
- ping-pong buffers
- FIFOs
67Designing for 100+MHz
Dual-ported I/O Buffers
Block SelectRAM+ is ideal for I/O buffers
- dual-ported operation
  - independent clocks and controls
  - bridges between clock domains
  - simultaneous read and write
- port-specific aspect-ratio control
  - built-in rate/width conversions
SelectRAM+ provides similar benefits on a smaller scale
68Designing for 100+MHz
Ping Pong Buffers
Ping-pong buffers are pairs of blocks that alternate between input and processing
SRLUT for small buffers
- self-addressing input
- 0.6-ns read access
Larger buffers can use the dual-ported Block RAM
- one address bit alternates read/write areas
- 3.3-ns read access
[Figure: two 16-bit shift registers with Input, Read Address, Select, and Output]
69Designing for 100+MHz
Small FIFOs in SRLUTs
Small FIFOs can be implemented in SRLUTs
- word count addresses the output data
- increment and enable SRLUT to Push
- decrement to Pop
- enable only for both
16-byte FIFO in 4 CLBs
- 16 x 16 in 6 CLBs
- 200+ MHz
Expandable for deeper FIFOs
[Figure: 16-bit shift register with Input and Output, addressed by an up/down word counter driven by Push and Pop]
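The push and pop rules can be checked with a small behavioral model. This is a software sketch with invented names, not a description of the actual circuit: a 16-entry shift register holds the words with the newest at position 0, and the word count selects the oldest entry as the output.

```python
class SrlFifo:
    """Behavioral model of a 16-deep FIFO built on an addressable shift register."""

    DEPTH = 16

    def __init__(self):
        self.words = [None] * self.DEPTH  # words[0] is the newest entry
        self.count = 0                    # word counter; count - 1 addresses the output

    def clock(self, push=False, pop=False, data_in=None):
        """Apply one clock. Returns the popped word, or None if nothing was popped."""
        if pop and self.count == 0:
            raise IndexError("pop from empty FIFO")
        if push and not pop and self.count == self.DEPTH:
            raise OverflowError("push to full FIFO")
        data_out = self.words[self.count - 1] if pop else None
        if push:
            # Enable the shift register: every stored word moves one place deeper.
            self.words = [data_in] + self.words[:-1]
        # Push alone counts up, pop alone counts down, both together leave it alone.
        self.count += int(push) - int(pop)
        return data_out


fifo = SrlFifo()
fifo.clock(push=True, data_in="a")
fifo.clock(push=True, data_in="b")
first = fifo.clock(pop=True)
second = fifo.clock(push=True, pop=True, data_in="c")
print(first, second, fifo.count)
```

The simultaneous push and pop case shows why "enable only" is enough: the shift moves the next-oldest word into the position the unchanged counter already addresses.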
70Designing for 100+MHz
Large FIFOs in Block RAM
Large FIFOs can use the dual-ported block RAM
- add read and write address counters
Asynchronous push and pop
Different port sizes give rate-for-width conversion
Block RAM FIFOs can operate at up to 170 MHz including flag logic
[Figure: Block SelectRAM+ with input and output data ports; Push and Pop enable the two address counters and WE; control logic generates Full and Empty]
71Designing for 100+MHz
Pre-computing for Speed
Some functions are too complex for 10-ns logic implementation
- pipelining is not always possible
An alternative is to pre-compute all the possible results and store them in memory
- select a result according to the inputs
Function time is independent of complexity
- 0.6 ns SelectRAM+ access time
- 3.3 ns Block SelectRAM+ access time
The function table can be smaller than the logic
72Designing for 100+MHz
Multiplication By A Constant
Sometimes, data has to be “scaled”
- multiplied by a constant value
A full multiplier is too expensive
- it can multiply by a variable
- unnecessarily general and too complex
Storing all multiples of the constant is a better alternative
- smaller and much faster
[Figure: a multiplier array taking Constant and Input to produce Scaled Data, compared with a product table taking Input to Scaled Data]
73Designing for 100+MHz
16-bit Scaler
A 2^16-word product table is impractical
- partition the input into nibbles
  - use 16-word LUTs for nibble products
  - combine the partial products in adders
Roughly half the CLBs of a full multiplier
- for a 16-bit coefficient: 36 CLBs vs. 62 CLBs
Pipeline the adders for extra speed
[Figure: the Input nibbles address four LUTs; partial products weighted x16, x256, and x4096 are added to form Scaled Data]
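The nibble partitioning can be demonstrated in a few lines. This sketch models the arithmetic only: one 16-entry table of products is shared by all four nibbles, whereas the hardware uses a separate LUT per nibble. The function names and the example constant are illustrative.

```python
def build_product_table(constant):
    """The 16 possible products of the constant with a 4-bit nibble."""
    return [constant * nibble for nibble in range(16)]


def scale(value, table):
    """Multiply a 16-bit value by the table's constant using nibble lookups."""
    if not 0 <= value < 1 << 16:
        raise ValueError("value must fit in 16 bits")
    result = 0
    for position in range(4):
        nibble = (value >> (4 * position)) & 0xF
        # Weight each partial product by x1, x16, x256, or x4096.
        result += table[nibble] << (4 * position)
    return result


table = build_product_table(12345)
print(scale(0x1234, table), 0x1234 * 12345)
```

Four 16-word lookups and three additions replace the 65,536-entry table that a direct lookup would need.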
74Designing for 100+MHz
Changing the Constant
The SRLUT mode can be used to update the table
- “push-only” stack
- last 16 bits loaded define the table
A simple accumulator computes all products of a new constant
[Figure: labels include Constant, Clear, Change Constant, two registers, Load, a 16-bit shift register, Input, and Output]
75Designing for 100+MHz
Large Function Tables
Larger functions can be implemented in the Block SelectRAM+
- 12-input functions
- micro-coded state machines
Data tables can also be implemented
- sine/cosine tables for DSP, for example
- dual-ported access gives the sine and cosine simultaneously
- a simple address offset gives 90° phase shift for accessing sine and cosine from a single table
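The address-offset trick relies on cos(x) = sin(x + 90°), so a quarter-table offset turns a sine lookup into a cosine lookup. A minimal sketch, assuming a 256-entry full-cycle table (matching the 256 x 16 block configuration) and floating-point samples instead of 16-bit words; the names are illustrative.

```python
import math

TABLE_SIZE = 256                 # assumed: one full sine cycle in 256 entries
QUARTER_CYCLE = TABLE_SIZE // 4  # address offset equal to a 90-degree phase shift

SINE_TABLE = [math.sin(2 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE)]


def sine(address):
    """Read the table at the given phase address."""
    return SINE_TABLE[address % TABLE_SIZE]


def cosine(address):
    """Read the same table a quarter cycle ahead."""
    return SINE_TABLE[(address + QUARTER_CYCLE) % TABLE_SIZE]


print(sine(32), cosine(32))  # 45 degrees: both about 0.7071
```

With a dual-ported block, one port can use the plain address and the other the offset address, so both values arrive in the same cycle.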
76Designing for 100+MHz
Block RAM/ROM Creation
CORE Generator software creates RAMs and ROMs
- simple GUI
Initialization file is loaded into RAMs and ROMs at configuration time
77Designing for 100+MHz
Memory Summary
Virtex has two kinds of internal memory
- distributed SelectRAM+ for small RAMs
- Block SelectRAM+ for larger RAMs
SelectRAM+
- 0.6 ns read access time
- 16- and 32-word RAMs
- 16-word dual-ported RAMs
- 16-word shift registers
  - sequential write/random-access read
  - FIFOs, pipelining, LUT functions, etc.
78Designing for 100+MHz
Memory Summary
Dual-ported 4096-bit Block SelectRAM+
- 3.3 ns read access time
- true dual-ported operation
  - both ports are read/write
  - ports can be clocked asynchronously
- configurable aspect ratio
  - 4096 x 1 bit to 256 x 16 bits
  - configure ports differently for width/rate conversion
High-speed SelectI/O access to external RAM
79Designing for 100+MHz
Designing for 100+ MHz
Volts, Amps, and Watts
- DLLs and flexible I/O standards
- fast inter-chip communication
- simple rules for good signal integrity
Ones and zeros
- fast logic and fast interconnect
- dependable high performance
Bits and bytes
- distributed SelectRAM+
- dual-ported Block SelectRAM+
80Designing for 100+MHz
The Virtex Family
The complete Virtex Data Sheet is on your AppLinx CD-ROM and at www.xilinx.com/partinfo/virtex.pdf
Device     System Gates   Logic Cells   Block RAM
XCV50      57,906         1,758         32 Kb
XCV100     108,904        2,700         40 Kb
XCV150     164,674        3,888         48 Kb
XCV200     236,666        5,292         56 Kb
XCV300     322,970        6,912         64 Kb
XCV400     468,252        10,800        80 Kb
XCV600     661,111        15,552        96 Kb
XCV800     888,439        21,168        112 Kb
XCV1000    1,124,022      27,648        128 Kb
User I/O by package (values for the devices offered in each package)
CS144: 94, 94
TQ144: 94, 94
PQ/HQ240: 164, 164, 164, 164, 164, 164, 164, 164
BG256: 180, 180
BG352: 260, 260, 260
BG432: 316, 316, 316, 316
BG560: 404, 404, 404, 404
FG256: 176, 176, 176, 176
FG456: 260, 284, 312
FG600: 404, 404, 404
FG680: 500, 514, 514
81Designing for 100+MHz
Designing for 100+ MHz