Advance Digital Design
Hassan Bhatti, Lecture 10
Field-Programmable Gate Arrays (FPGAs)
• Ease of reprogramming enables rapid prototyping
• Replacement of ASICs in the low-volume end of the market
• Register-rich tiled architecture of functional units and flexible channel-based interconnections
Overview Continued
• The ASIC Research Center has XESS boards with Xilinx chips on them.
• A design for any Xilinx chip must be compiled with the Xilinx tools.
FPGA Big Idea
Basic idea: a 2D array of combinational logic blocks (CL) and flip-flops (FF), with a means for the user to configure both:
1. the interconnection between the logic blocks,
2. the function of each block.
Idealized FPGA Logic Block
4-input Look-Up Table (4-LUT)
1. Implements combinational logic functions
Register
1. Optionally stores the output of the LUT
2. A latch determines whether the block output is read from the register or directly from the LUT
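The idealized logic block can be modeled in a few lines. This is a minimal behavioral sketch, not a vendor model: the class name `LogicBlock`, the helper `table_for`, and the `use_register` flag (standing in for the configuration latch) are all illustrative assumptions.

```python
class LogicBlock:
    """Idealized FPGA logic block: a 4-LUT feeding an optional register."""

    def __init__(self, truth_table, use_register=False):
        # truth_table holds the 16 configuration bits, one per input combination.
        if len(truth_table) != 16:
            raise ValueError("a 4-LUT needs exactly 16 configuration bits")
        self.truth_table = [int(bool(b)) for b in truth_table]
        self.use_register = use_register  # plays the role of the select latch
        self.reg = 0                      # flip-flop state, cleared at start

    def lut(self, a, b, c, d):
        # The four inputs form an address into the stored truth table.
        return self.truth_table[(a << 3) | (b << 2) | (c << 1) | d]

    def clock(self, a, b, c, d):
        # On a clock edge the register captures the current LUT output.
        self.reg = self.lut(a, b, c, d)

    def output(self, a, b, c, d):
        # The latch selects the registered or the combinational value.
        return self.reg if self.use_register else self.lut(a, b, c, d)


def table_for(func):
    """Build the 16-entry truth table of any 4-input Boolean function."""
    return [int(func((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1))
            for i in range(16)]


# Two example configurations of the same hardware block.
and4 = LogicBlock(table_for(lambda a, b, c, d: a & b & c & d))
xor4_reg = LogicBlock(table_for(lambda a, b, c, d: a ^ b ^ c ^ d),
                      use_register=True)
```

Changing only the 16 table bits and the select flag turns the same block into a different function, which is the sense in which the block is "configurable".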
Xilinx FPGA
• Xilinx pioneered the FPGA, launching its first device, the XC2064, in 1985; the XC4000 family followed in the early 1990s.
• Other generations such as Spartan and Spartan-XL are based on the XC4000.
• Each FPGA consists of:
– Configurable Logic Blocks (CLBs)
– Routing resources
– Input/Output Blocks (IOBs)
– An SRAM-based controller
XC 4000
XC 4000 Continued….
Architecture of CLBs
• Each CLB has two 4-input look-up tables (LUTs) and two registers.
• The two LUTs implement two independent logic functions, F and G.
• The outputs F’ and G’ of the two LUTs inside each CLB can be combined to form a more complex function H.
• CLBs are linked together to form carry and cascade chain circuits (not shown in the diagram).
Architecture of CLBs
Interconnect Resources of XC 4000
There are three types of interconnect:
1. Dedicated interconnects (direct): lines that provide routing between vertically and horizontally adjacent CLBs in the same row and column.
2. Double-length lines: traverse the distance of two CLBs before entering a switch matrix, skipping every other CLB.
3. Long lines (global): span the entire array vertically and horizontally. They have splitters that segment the lines.
XC 4000 Interconnect ….
Inside Interconnects
Architecture of the PIP (programmable interconnect point)
• Break-point PIP
– Connects or isolates two wire segments
• Cross-point PIP
– Turns corners
• Multiplexer PIP
– Directional and buffered
– Selects one of n inputs to drive the output
XC 4000 IOB
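Example
Implement the following functions on a single CLB of the XC4000 FPGA:
X = A’B’(C + D)
Y = AK + BK + C’D’K + AEJL
• Use look-up table F to implement X
• Use look-up table G for AEJL
• Use F, G and H for Y:
Y = K(A + B + C’D’) + AEJL = KX’ + AEJL = KF’ + G
Here the prime denotes the complement: by De Morgan's law, X’ = (A’B’(C + D))’ = A + B + C’D’, so the bracketed term is exactly the complement of the F output.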
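The mapping in this example can be checked exhaustively. The sketch below assumes a simple functional view of the CLB (function names `lut_f`, `lut_g`, `lut_h` are illustrative) and compares the three-LUT implementation against Y as originally stated for all 256 input combinations.

```python
from itertools import product


def lut_f(a, b, c, d):
    # F implements X = A'B'(C + D)
    return (1 - a) & (1 - b) & (c | d)


def lut_g(a, e, j, l):
    # G implements the product term AEJL
    return a & e & j & l


def lut_h(f, g, k):
    # H combines both LUT outputs with the extra input K: H = K.F' + G
    return (k & (1 - f)) | g


def y_reference(a, b, c, d, e, j, k, l):
    # Y exactly as given in the problem statement
    return (a & k) | (b & k) | ((1 - c) & (1 - d) & k) | (a & e & j & l)


# Collect every input combination where the CLB mapping disagrees with Y.
mismatches = []
for a, b, c, d, e, j, k, l in product((0, 1), repeat=8):
    clb_y = lut_h(lut_f(a, b, c, d), lut_g(a, e, j, l), k)
    if clb_y != y_reference(a, b, c, d, e, j, k, l):
        mismatches.append((a, b, c, d, e, j, k, l))

print("mismatches:", len(mismatches))
```

An empty mismatch list confirms that X is available on F while Y is produced on H from the same CLB.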
Illustrated
Spartan 2
• The ASIC Center has the Xess-100 board, which carries a Spartan-2 device.
• The architecture is based on the XC4000.
Inside the Board
Spartan-3E Architecture: Fundamental Elements
• Configurable Logic Blocks (CLBs)
– Consist of RAM-based look-up tables to implement logic, and storage elements that can be used as flip-flops or latches.
• Input Output Blocks (IOBs)
– Control the flow of data between the I/O pins and the internal logic.
– Support many different signal standards (tri-state, bidirectional, LVTTL, etc.).
• Block RAM (BRAM)
• 18-bit multiplier blocks
• Digital Clock Manager (DCM)
Spartan 3 Configurable Logic Blocks (CLBs)
• CLBs contain RAM-based look-up tables to implement logic, and storage elements that can be used as flip-flops or latches.
• CLBs can be programmed to perform a wide variety of logic functions as well as to store data.
[Figure labels: clock signal from outside world; special clock pin and pad; clock tree; flip-flops]
Spartan 3E IO Blocks (IOBs)
• IOBs control the flow of data between the I/O pins and the internal logic.
• Each IOB supports bidirectional data flow, 3-state operation, and numerous signal standards. (We will typically use LVTTL.) See the data sheet.
• Very low cost, high-performance logic solution for high-volume, consumer-oriented applications
• Multi-voltage, multi-standard SelectIO™ interface pins
– Up to 376 I/O pins or 156 differential signal pairs
– LVCMOS, LVTTL, HSTL, and SSTL single-ended signal standards
– 3.3V, 2.5V, 1.8V, 1.5V, and 1.2V signaling
I/O block continued
CLB’s – four slices per CLB
Top slice of CLB
Virtex Basic Architecture
• I/O Blocks (IOBs)
• Configurable Logic Blocks (CLBs)
• Clock management (DCMs, BUFGMUXes)
• Block SelectRAM™ resource
• Dedicated multipliers
• Programmable interconnect
Slices and CLBs
• Each Virtex-II CLB contains four slices
– Local routing provides feedback between slices in the same CLB, and it provides routing to neighboring CLBs
– A switch matrix provides access to general routing resources
[Figure: a CLB with its switch matrix, two TBUFs, slices S0 to S3, local routing, and the CIN and SHIFT connections]
Slice Structure
• The next few slides discuss the slice features
– LUTs
– MUXF5, MUXF6, MUXF7, MUXF8 (only the F5 and F6 MUX are shown in this diagram)
– Carry logic
– MULT_ANDs
– Sequential elements
Look-Up Tables
• Combinatorial logic is stored in Look-Up Tables (LUTs)
– Also called Function Generators (FGs)
– Capacity is limited by the number of inputs, not by the complexity
• Delay through the LUT is constant
[Figure: combinatorial logic with inputs A, B, C, D and output Z]
A B C D | Z
0 0 0 0 | 0
0 0 0 1 | 0
0 0 1 0 | 0
0 0 1 1 | 1
0 1 0 0 | 1
0 1 0 1 | 1
. . .
1 1 0 0 | 0
1 1 0 1 | 0
1 1 1 0 | 0
1 1 1 1 | 1
Connecting Look-Up Tables
• MUXF5 combines the LUTs in each slice
• MUXF6 combines slices S0 and S1
• MUXF6 combines slices S2 and S3
• MUXF7 combines the two MUXF6 outputs
• MUXF8 combines the two MUXF7 outputs (from the CLB above or below)
[Figure: one CLB with slices S0 to S3; each slice holds an F5 multiplexer plus one of the F6, F7, or F8 multiplexers]
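Why a multiplexer is enough to widen a function can be seen from the Shannon expansion: f(a,b,c,d,e) = e'·f(a,b,c,d,0) + e·f(a,b,c,d,1). The sketch below, with illustrative helper names (`make_lut4`, `muxf5`, `make_func5`), builds any 5-input function from two 4-LUTs and one F5-style multiplexer. It models behavior only, not the actual slice wiring.

```python
from itertools import product


def make_lut4(func):
    """Precompute the 16 table bits of a 4-input function, like a real LUT."""
    table = [func((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1) & 1
             for i in range(16)]
    return lambda a, b, c, d: table[(a << 3) | (b << 2) | (c << 1) | d]


def muxf5(lut0_out, lut1_out, sel):
    # 2:1 multiplexer choosing between the two LUT outputs of a slice.
    return lut1_out if sel else lut0_out


def make_func5(func):
    """Shannon expansion on the fifth input e: one LUT per cofactor."""
    lut_lo = make_lut4(lambda a, b, c, d: func(a, b, c, d, 0))
    lut_hi = make_lut4(lambda a, b, c, d: func(a, b, c, d, 1))
    return lambda a, b, c, d, e: muxf5(lut_lo(a, b, c, d),
                                       lut_hi(a, b, c, d), e)


def parity(a, b, c, d, e):
    return a ^ b ^ c ^ d ^ e


def majority(a, b, c, d, e):
    return int(a + b + c + d + e >= 3)


parity5 = make_func5(parity)
majority5 = make_func5(majority)

# Compare the LUT-plus-mux versions against the direct definitions.
all_match = all(
    parity5(*bits) == parity(*bits) and majority5(*bits) == majority(*bits)
    for bits in product((0, 1), repeat=5)
)
print("all 32 input combinations match:", all_match)
```

Applying the same step again with F6, F7, and F8 doubles the input count each time, which is the cascade the bullets above describe.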
Fast Carry Logic
• Simple, fast, and complete arithmetic logic
– Dedicated XOR gate for single-level sum completion
– Uses dedicated routing resources
– All synthesis tools can infer carry logic
[Figure: a CLB with two carry chains. The first chain runs from CIN through slices S0 and S1, with COUT going to S0 of the next CLB. The second chain runs from CIN through slices S2 and S3, with COUT going to CIN of S2 of the next CLB.]
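A one-bit adder built on a carry chain can be sketched as follows. This is a behavioral assumption about how such chains are commonly used (the LUT forms the propagate signal, a carry multiplexer forwards or replaces the carry, and the dedicated XOR completes the sum); the function names are illustrative and not taken from the slides.

```python
def carry_bit(a, b, cin):
    """One bit position: LUT + carry multiplexer + dedicated XOR."""
    p = a ^ b               # LUT output: propagate when the inputs differ
    cout = cin if p else a  # carry mux: pass cin along, else a decides (a == b)
    s = p ^ cin             # dedicated XOR completes the sum in one level
    return s, cout


def ripple_add(x, y, width=8, cin=0):
    """Chain carry_bit from bit 0 upward, as a column of slices would."""
    total = 0
    carry = cin
    for i in range(width):
        s, carry = carry_bit((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry


print(ripple_add(200, 100))  # 8-bit sum and final carry out
```

Only the propagate function lives in the LUT; the carry itself never enters general routing, which is why the chain is fast.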
MULT_AND Gate
• Highly efficient multiply and add implementation
– Earlier FPGA architectures require two LUTs per bit to perform the multiplication and addition
– The MULT_AND gate enables an area reduction by performing the multiply and the add in one LUT per bit
[Figure: a LUT driving CY_MUX and CY_XOR (ports CO, DI, CI, S), with the MULT_AND gate forming A x B]
Flexible Sequential Elements
• Either flip-flops or latches
• Two in each slice; eight in each CLB
• Inputs come from LUTs or from an independent CLB input
• Separate set and reset controls
– Can be synchronous or asynchronous
• All controls are shared within a slice
– Control signals can be inverted locally within a slice
[Figure: register primitives FDCPE (D, CE, PRE, CLR, Q), FDRSE (D, CE, S, R, Q), and LDCPE (D, CE, PRE, CLR, G, Q); the label _1 also appears in the figure]
Shift Register LUT (SRL16CE)
• Dynamically addressable serial shift registers
– Maximum delay of 16 clock cycles per LUT (128 per CLB)
– Cascadable to other LUTs or CLBs for longer shift registers
• Dedicated connection from Q15 to D input of the next SRL16CE
– Shift register length can be changed asynchronously by toggling address A
[Figure: a LUT configured as a chain of D flip-flops with clock enable; ports D, CE, CLK, A[3:0], Q, and Q15 (cascade out)]
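The addressable shift register can be modeled behaviorally. In this sketch the class mirrors the port names listed on the slide (D, CE, A, Q, Q15), but the Python interface itself is an assumption made for illustration.

```python
class SRL16CE:
    """Behavioral model of a 16-bit shift register with an addressable tap."""

    def __init__(self):
        self.bits = [0] * 16  # bits[0] holds the most recently shifted value

    def clock(self, d, ce=1):
        # Shift only when the clock enable is active.
        if ce:
            self.bits = [d & 1] + self.bits[:15]

    def q(self, addr):
        # Reading is combinational: address A selects a delay of A + 1 clocks.
        return self.bits[addr & 0xF]

    @property
    def q15(self):
        # Fixed last tap, used to cascade into the next SRL16CE.
        return self.bits[15]


# Cascade two registers to get a delay longer than 16 clocks.
first, second = SRL16CE(), SRL16CE()
stream = [(i * 7 + 3) % 5 % 2 for i in range(40)]  # arbitrary test pattern
for d in stream:
    carry_over = first.q15  # sample the cascade output before the clock edge
    first.clock(d)
    second.clock(carry_over)

print(first.q(3) == stream[-4], second.q15 == stream[-32])
```

Changing `addr` between reads changes the effective register length without touching the stored data, which is what "dynamically addressable" refers to.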
IOB Element
• Input path
– Two DDR registers
• Output path
– Two DDR registers
– Two 3-state enable DDR registers
• Separate clocks and clock enables for I and O
• Set and reset signals are shared
[Figure: IOB connected to the PAD, with three register pairs: 3-state (OCK1, OCK2, DDR MUX), Output (OCK1, OCK2, DDR MUX), and Input (ICK1, ICK2)]
SelectIO Standard
• Allows direct connections to external signals of varied voltages and thresholds
– Optimizes the speed/noise tradeoff
– Saves having to place interface components onto your board
• Differential signaling standards
– LVDS, BLVDS, ULVDS
– LDT
– LVPECL
• Single-ended I/O standards
– LVTTL, LVCMOS (3.3V, 2.5V, 1.8V, and 1.5V)
– PCI-X at 133 MHz, PCI (3.3V at 33 MHz and 66 MHz)
– GTL, GTLP
– and more!
Digitally Controlled Impedance (DCI)
• DCI provides
– Output drivers that match the impedance of the traces
– On-chip termination for receivers and transmitters
• DCI advantages
– Improves signal integrity by eliminating stub reflections
– Reduces board routing complexity and component count by eliminating external resistors
– Eliminates the effects of temperature, voltage, and process variations by using an internal feedback circuit
Other Virtex-II Features
• Distributed RAM and block RAM
– Distributed RAM uses the CLB resources (1 LUT = 16 RAM bits)
– Block RAM is a dedicated resource on the device (18-kb blocks)
• Dedicated 18 x 18 multipliers next to block RAMs
• Clock management resources
– Sixteen dedicated global clock multiplexers
– Digital Clock Managers (DCMs)
Distributed SelectRAM Resources
• Uses a LUT in a slice as memory
• Synchronous write
• Asynchronous read
– Accompanying flip-flops can be used to create synchronous read
• RAM and ROM are initialized during configuration
– Data can be written to RAM after configuration
• Emulated dual-port RAM
– One read/write port
– One read-only port
[Figure: LUT-based RAM primitives in a slice: RAM16X1S (D, WE, WCLK, A0 to A3, O), RAM32X1S (D, WE, WCLK, A0 to A4, O), and RAM16X1D (D, WE, WCLK, A0 to A3, DPRA0 to DPRA3, SPO, DPO)]
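The write and read behavior of the emulated dual-port RAM can be sketched as below. The class borrows the port names shown for RAM16X1D (A, DPRA, SPO, DPO, WE), but the Python interface and the `init` argument are illustrative assumptions.

```python
class Ram16x1D:
    """Behavioral model: 16 x 1-bit RAM, one read/write port, one read port."""

    def __init__(self, init=0):
        # Contents are loaded at configuration time; bit i of init is address i.
        self.mem = [(init >> i) & 1 for i in range(16)]

    def write_clock(self, a, d, we):
        # Synchronous write: takes effect only on a clock edge with WE high.
        if we:
            self.mem[a & 0xF] = d & 1

    def spo(self, a):
        # Asynchronous read on the read/write port address.
        return self.mem[a & 0xF]

    def dpo(self, dpra):
        # Asynchronous read on the independent read-only port address.
        return self.mem[dpra & 0xF]


ram = Ram16x1D(init=0x8001)      # addresses 0 and 15 start at 1
ram.write_clock(a=5, d=1, we=1)  # stored on the clock edge
ram.write_clock(a=6, d=1, we=0)  # ignored: write enable is low
print(ram.spo(5), ram.dpo(15), ram.dpo(6))
```

Both read methods return immediately from the current contents, which models the asynchronous read; wrapping a read result in a register would give the synchronous read mentioned above.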
Block SelectRAM Resources
• Up to 3.5 Mb of RAM in 18-kb blocks
– Synchronous read and write
• True dual-port memory
– Each port has synchronous read and write capability
– Different clocks for each port
• Supports initial values
• Synchronous reset on output latches
• Supports parity bits
– One parity bit per eight data bits
[Figure: 18-kb block SelectRAM memory with port A (DIA, DIPA, ADDRA, WEA, ENA, SSRA, CLKA, DOA, DOPA) and port B (DIB, DIPB, ADDRB, WEB, ENB, SSRB, CLKB, DOB, DOPB)]
Dedicated Multiplier Blocks
• 18-bit twos complement signed operation
• Optimized to implement multiply and accumulate functions
• Multipliers are physically located next to block SelectRAM™ memory
[Figure: 18 x 18 multiplier with inputs Data_A (18 bits) and Data_B (18 bits) and a 36-bit output; operand sizes shown: 4 x 4, 8 x 8, 12 x 12, and 18 x 18 signed]
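The arithmetic of an 18 x 18 twos complement multiplier can be illustrated with a short sketch. The function names are hypothetical; the point is only that two 18-bit signed operands always fit in a 36-bit signed product.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a twos complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def mult18x18(data_a, data_b):
    """Multiply two 18-bit twos complement words, returning 36 raw bits."""
    product = to_signed(data_a, 18) * to_signed(data_b, 18)
    return product & ((1 << 36) - 1)


# 0x3FFFF is -1 as an 18-bit value, so the product below is -2.
raw = mult18x18(0x3FFFF, 2)
print(hex(raw), to_signed(raw, 36))
```

Smaller signed operands such as 4 x 4 or 8 x 8 can use the same block after sign extension to 18 bits.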
Global Clock Routing Resources
• Sixteen dedicated global clock multiplexers
– Eight on the top-center of the die, eight on the bottom-center
– Driven by a clock input pad, a DCM, or local routing
• Global clock multiplexers provide the following:
– Traditional clock buffer (BUFG) function
– Global clock enable capability (BUFGCE)
– Glitch-free switching between clock signals (BUFGMUX)
• Up to eight clock nets can be used in each clock region of the device
– Each device contains four or more clock regions
Digital Clock Manager (DCM)
• Up to twelve DCMs per device
– Located on the top and bottom edges of the die
– Driven by clock input pads
• DCMs provide the following:
– Delay-Locked Loop (DLL)
– Digital Frequency Synthesizer (DFS)
– Digital Phase Shifter (DPS)
• Up to four outputs of each DCM can drive onto global clock buffers
– All DCM outputs can drive general routing
Spartan-3 versus Virtex-II
• Lower cost
• Smaller process = lower core voltage
– 0.09 micron versus 0.15 micron
– Vccint = 1.2V versus 1.5V
• Different I/O standard support
– New standards: 1.2V LVCMOS, 1.8V HSTL, and SSTL
– Default is LVCMOS, versus LVTTL
• More I/O pins per package
• Only one-half of the slices support RAM or SRL16s (SLICEM)
• Fewer block RAMs and multiplier blocks
– Same size and functionality
• Eight global clock multiplexers
• Two or four DCM blocks
• No internal 3-state buffers
– 3-state buffers are in the I/O
SLICEM and SLICEL
• Each Spartan™-3 CLB contains four slices
– Similar to the Virtex™-II
• Slices are grouped in pairs
– Left-hand SLICEM (Memory)
• LUTs can be configured as memory or SRL16
– Right-hand SLICEL (Logic)
• LUT can be used as logic only
[Figure: a CLB with its switch matrix and fast connects; the left-hand SLICEM pair (slices X0Y0 and X0Y1, with SHIFTIN and SHIFTOUT) and the right-hand SLICEL pair (slices X1Y0 and X1Y1); CIN and COUT carry connections]
Spartan-3E Features
• More gates per I/O than Spartan-3
• Removed some I/O standards
– Higher-drive LVCMOS
– GTL, GTLP
– SSTL2_II
– HSTL_II_18, HSTL_I, HSTL_III
– LVDS_EXT, ULVDS
• DDR Cascade
– Internal data is presented on a single clock edge
• 16 BUFGMUXes on left and right sides
– Drive half the chip only
– In addition to eight global clocks
• Pipelined multipliers
• Additional configuration modes
– SPI, BPI
– Multi-Boot mode
Virtex-II Pro Features
• 0.13 micron process
• Up to 24 RocketIO™ Multi-Gigabit Transceiver (MGT) blocks
– Serializer and deserializer (SERDES)
– Fibre Channel, Gigabit Ethernet, XAUI, Infiniband compliant transceivers, and others
– 8-, 16-, and 32-bit selectable FPGA interface
– 8B/10B encoder and decoder
• PowerPC™ RISC processor blocks
– Thirty-two 32-bit General Purpose Registers (GPRs)
– Low power consumption: 0.9mW/MHz
– IBM CoreConnect bus architecture support