FPGA Intra-cluster Routing Crossbar Design
Dr. Philip Brisk
Department of Computer Science and Engineering
University of California, Riverside
CS 223
Generating Highly Routable Sparse Crossbars for PLDs
Guy Lemieux, Paul Leventis, David Lewis
International Symposium on FPGAs, 2000
Basic Notation
Fully Populated Crossbar
• Full capacity – can connect as many signals as the number of outputs
• Flexibility – Can connect any input to any output
Full-capacity Minimal Crossbars
• Full capacity
• Reduced flexibility: you lose the ability to connect any input to any output
• p = m(n – m + 1) switches (n inputs, m outputs, n ≥ m)
Full-capacity Minimal Crossbars
…
• Area savings are minimal if n >> m: only m(m – 1) of the nm switches in a fully populated crossbar are removed
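A quick arithmetic sketch of the switch counts, taking p = m(n – m + 1) for the full-capacity minimal crossbar and nm for the fully populated one. The function names are illustrative, not from the paper; the 168x24 size is the one used in the results later in these slides.

```python
def full_crossbar_switches(n: int, m: int) -> int:
    """Switch count of a fully populated crossbar with n inputs and m outputs."""
    return n * m


def minimal_crossbar_switches(n: int, m: int) -> int:
    """Switch count of a full-capacity minimal crossbar: p = m(n - m + 1)."""
    if not 0 < m <= n:
        raise ValueError("expected 0 < m <= n")
    return m * (n - m + 1)


def relative_savings(n: int, m: int) -> float:
    """Fraction of switches removed when going from full to minimal."""
    full = full_crossbar_switches(n, m)
    return (full - minimal_crossbar_switches(n, m)) / full


if __name__ == "__main__":
    for n, m in [(6, 4), (168, 24)]:
        print(n, m, full_crossbar_switches(n, m),
              minimal_crossbar_switches(n, m),
              round(relative_savings(n, m), 3))
```

For a small 6x4 crossbar half of the switches disappear, but at 168x24 the saving is under 14%, which is why sparser (non-perfect) crossbars are worth studying.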
Perfect and Sparse Crossbars
• Perfect crossbars
  – Can disjointly route any m-sized subset of the n inputs to the m outputs
  – Both full and full-capacity minimal crossbars are perfect
• Sparse crossbars
  – Have p < m(n – m + 1) switches
  – Cannot be perfect
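To make "perfect" concrete, here is a minimal sketch that builds one possible full-capacity minimal crossbar and brute-force checks that every m-sized subset of inputs routes. The staggered layout (output j reaches inputs j through j + n – m) is an assumption chosen for illustration; the slides do not say which layout their figure uses.

```python
from itertools import combinations


def staggered_minimal_crossbar(n, m):
    """Return a list where entry j is the set of inputs that output j can reach.

    ASSUMPTION: staggered layout, output j has switches to inputs j .. j + (n - m).
    This uses exactly m * (n - m + 1) switches.
    """
    return [set(range(j, j + n - m + 1)) for j in range(m)]


def route_sorted(subset, crossbar):
    """Send the k-th smallest requested input to output k.

    Returns {input: output}, or None if some required switch is missing.
    """
    assignment = {}
    for k, inp in enumerate(sorted(subset)):
        if inp not in crossbar[k]:
            return None
        assignment[inp] = k
    return assignment


def is_perfect(n, m):
    """Brute force: can every m-sized subset of the n inputs be routed?"""
    xbar = staggered_minimal_crossbar(n, m)
    return all(route_sorted(s, xbar) is not None
               for s in combinations(range(n), m))


if __name__ == "__main__":
    print(is_perfect(6, 4))
```

The sorted assignment works for this layout because the k-th smallest of m distinct inputs always lies between k and k + n – m. The brute-force loop is only practical for small n and m.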
Bipartite Graph Representation
[Figure: a crossbar with inputs I1–I6 and outputs O1–O4, shown as a switch matrix and as the equivalent bipartite graph]
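The two views carry the same information. A small sketch of the conversion, assuming the switch matrix is stored as one row per output; the example matrix is hypothetical and is not the one drawn on the slide.

```python
def to_bipartite(switches):
    """Convert a switch matrix into bipartite adjacency lists.

    switches[o][i] is 1 when a switch joins input i to output o.
    Returns {input_name: [output_name, ...]}.
    """
    graph = {f"I{i + 1}": [] for i in range(len(switches[0]))}
    for o, row in enumerate(switches):
        for i, bit in enumerate(row):
            if bit:
                graph[f"I{i + 1}"].append(f"O{o + 1}")
    return graph


# Hypothetical 6-input, 4-output crossbar (rows are outputs O1..O4).
EXAMPLE = [
    [1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
]

if __name__ == "__main__":
    for name, outs in to_bipartite(EXAMPLE).items():
        print(name, "->", outs)
```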
Evaluation Challenge
• How “routable” is a given crossbar?
  – Build an FPGA, map 20+ applications, observe results
    • Slow, highly subject to the application mix
  – Monte Carlo test
    • Generate random test vectors
    • Route each test vector on the crossbar (network flow)
    • Report number of successes as a percentage
    • A highly routable sparse crossbar has a ≥ 95% success rate
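A minimal sketch of the Monte Carlo test. It routes each test vector with augmenting-path bipartite matching, which is the special case of network flow needed here. How the paper actually samples its test vectors is not stated on the slide, so uniform random subsets of a fixed size are an assumption, as are all names below.

```python
import random


def can_route(signals, neighbors):
    """True if every signal can be given its own output (bipartite matching)."""
    owner = {}  # output -> signal currently assigned to it

    def try_assign(sig, seen):
        for out in neighbors[sig]:
            if out in seen:
                continue
            seen.add(out)
            # Take a free output, or evict the current owner if it can move.
            if out not in owner or try_assign(owner[out], seen):
                owner[out] = sig
                return True
        return False

    return all(try_assign(sig, set()) for sig in signals)


def monte_carlo_routability(neighbors, vector_size, trials=10_000, seed=0):
    """Fraction of random test vectors (input subsets) that route."""
    rng = random.Random(seed)
    inputs = list(neighbors)
    hits = sum(can_route(rng.sample(inputs, vector_size), neighbors)
               for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    full = {i: [0, 1, 2, 3] for i in range(6)}  # fully populated 6x4
    print(monte_carlo_routability(full, vector_size=4, trials=1000))
```

A fully populated crossbar always scores 100%; a sparse one would be judged highly routable if the returned fraction is at least 0.95.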
Hall’s Theorem
• Given a bipartite graph G = (V, E)
  – X, Y are the bipartite independent sets of G
• G has a matching that covers every vertex of X if and only if $|N(S)| \ge |S|$ for every subset $S \subseteq X$
  – N(v) is the set of neighbors of vertex v
  – N(S) is the set of neighbors of all vertices in S
• Leverage Hall’s Theorem to generate routable sparse crossbars!
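Hall's condition can be checked directly on small graphs. The sketch below lists every subset S (up to a chosen size) whose neighborhood is smaller than S itself; the tiny example graph is made up for illustration.

```python
from itertools import combinations


def hall_violations(neighbors, max_size):
    """Return all subsets S with |S| <= max_size and |N(S)| < |S|.

    neighbors maps each input vertex to the set of outputs it can reach.
    An empty result means every subset up to max_size can be matched.
    """
    bad = []
    for k in range(1, max_size + 1):
        for subset in combinations(sorted(neighbors), k):
            reach = set().union(*(neighbors[v] for v in subset))
            if len(reach) < k:
                bad.append(subset)
    return bad


if __name__ == "__main__":
    # I1 and I2 compete for the single output O1.
    graph = {"I1": {"O1"}, "I2": {"O1"}, "I3": {"O1", "O2"}}
    print(hall_violations(graph, max_size=2))
```

The loop is exponential in the number of inputs, so it is a teaching aid rather than a test for realistic crossbar sizes.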
Practical Issues
• Cannot enumerate all subsets of m inputs
• |N(x)| should be approximately equal for all input vertices x in X
  – Otherwise, any subset containing a large number of low-degree vertices is unlikely to be routable
• |N(y)| should be approximately equal for all output vertices y in Y
  – Symmetric argument
Hamming Distance and Coding Theory
• Represent N(v) as a bitvector b_v
  – b_v[i] = 1 if v fans out to O_i
• Hamming distance
  – d(b_v1, b_v2): the number of bit positions in which the two bitvectors differ
• Strategy
  – Maximize d(b_vi, b_vj) for every pair of distinct vertices vi and vj
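A short sketch of the bitvector encoding and the pairwise Hamming distances. Packing each bitvector into a Python integer is an implementation choice made here, not something the slides prescribe.

```python
from itertools import combinations


def bitvector(outputs, m):
    """Pack the set of output indices (0 .. m-1) reached by an input into an int."""
    bits = 0
    for o in outputs:
        if not 0 <= o < m:
            raise ValueError("output index out of range")
        bits |= 1 << o
    return bits


def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def min_pairwise_distance(vectors):
    """Smallest Hamming distance over all pairs of distinct input vertices."""
    return min(hamming(a, b) for a, b in combinations(vectors, 2))


if __name__ == "__main__":
    vs = [bitvector(s, 4) for s in ({0, 1}, {1, 2}, {2, 3})]
    print(min_pairwise_distance(vs))
```

Two inputs at distance 0 reach exactly the same outputs and therefore compete for them, which is the situation the strategy tries to avoid.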
Switch Placement Optimizer
• Start with initial switch placement
• Generate random swap of switch positions
  – Accept the swap if there is an improvement
  – Otherwise, reject the swap
• Stop after a fixed number of swap candidates (e.g., 10K) fail to find an improvement
• Objective is to minimize:
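The loop above can be sketched as a simple hill climber. The objective formula is not reproduced in this transcript, so the cost used below (sum over input pairs of 1 / (1 + Hamming distance)) is a stand-in assumption that merely rewards large distances; it should not be read as the paper's cost function. Function names and the default patience are also illustrative.

```python
import random
from itertools import combinations


def hamming(a, b):
    """Number of positions in which two equal-length rows differ."""
    return sum(x != y for x, y in zip(a, b))


def cost(rows):
    # ASSUMPTION: stand-in objective, not the one from the paper.
    return sum(1.0 / (1 + hamming(a, b)) for a, b in combinations(rows, 2))


def optimize(rows, patience=10_000, seed=0):
    """rows[i][j] == 1 means input i has a switch to output j.

    Swap two cells at random, keep the swap only if the cost drops, and
    stop after `patience` consecutive candidates bring no improvement.
    """
    rng = random.Random(seed)
    rows = [list(r) for r in rows]  # work on a copy
    n, m = len(rows), len(rows[0])
    best, failures = cost(rows), 0
    while failures < patience:
        i1, j1 = rng.randrange(n), rng.randrange(m)
        i2, j2 = rng.randrange(n), rng.randrange(m)
        if rows[i1][j1] == rows[i2][j2]:
            failures += 1  # swapping equal cells changes nothing
            continue
        rows[i1][j1], rows[i2][j2] = rows[i2][j2], rows[i1][j1]
        new = cost(rows)
        if new < best:
            best, failures = new, 0
        else:
            rows[i1][j1], rows[i2][j2] = rows[i2][j2], rows[i1][j1]  # undo
            failures += 1
    return rows, best


if __name__ == "__main__":
    start = [[1, 1, 0, 0]] * 3  # three inputs with identical fan-out
    placed, final = optimize(start, patience=500)
    print(cost(start), final, placed)
```

Every swap exchanges a switch with an empty position, so the total switch count (the area) stays fixed while the placement changes.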
Example
Identical Hamming costs before and after the swap
Before: cannot route {1, 2, 3}
After: reduces Hamming costs
168x24 Crossbar, 10K Test Vectors
Altera Flex 8000 HP Plasma Hextant
# Switches vs. Routability
Using Sparse Crossbars within LUT Clusters
Guy Lemieux, David Lewis
International Symposium on FPGAs, 2001
Five Questions
1. Will depopulation save area, require greater routing area, or create unroutable architectures?
2. Will depopulation reduce or increase routing delays?
3. What amount of depopulation is reasonable?
4. How much area or delay reduction can be attained, if any?
5. What are the other effects of depopulating the cluster?
Architecture and Parameters
Results
Designing Efficient Input Interconnect Blocks for LUT Clusters Using Counting and Entropy
Wenyi Feng and Sinan Kaptanoglu
ACM Transactions on Reconfigurable Technology and Systems (TRETS), 1(1): article #6, March 2008
Note: Paper is from Actel (now Microsemi)
Count Configurations (Details Omitted)
784 Configurations 312 Configurations 256 Configurations
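Routing Requirement Vector (RRV)
• An ordered list of N subsets, each containing K distinct signals
• The ith subset is the K distinct signals to route to the ith K-LUT
• Total number of RRVs for the crossbar: $\binom{M}{K}^N$, since each of the N subsets is chosen independently from the M crossbar inputs
[Figure: crossbar with M inputs and KN outputs]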
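A small sketch of how quickly the RRV count grows. It assumes each of the N subsets is an independent choice of K distinct signals out of the M crossbar inputs, which gives C(M, K)^N; the parameter values in the example are hypothetical and not taken from the paper.

```python
from math import comb, log2


def total_rrvs(M, K, N):
    """Number of ordered lists of N subsets, each holding K of the M inputs."""
    return comb(M, K) ** N


if __name__ == "__main__":
    # Hypothetical cluster: 32 crossbar inputs feeding eight 4-LUTs.
    count = total_rrvs(32, 4, 8)
    print(count)
    print(round(log2(count), 1), "bits")
```

Even a modest cluster yields an astronomically large count, which is why working with its logarithm is more convenient than working with the count itself.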
Entropy of an Intra-cluster Routing Crossbar
• H = lg(# routable RRVs)
  – Accounts for equivalence of LUT inputs
• Why entropy?
  – # routable RRVs is huge
  – Minimum number of configuration bits to program the crossbar
  – Inversely correlated with usage of global routing muxes (details omitted)
    • If we reduce the routability of the crossbar, we will end up programming more global routing muxes to compensate for the entropy loss
Conceptual Idea
intra-cluster crossbar
global routing
Theorem
• Let P and L be the number of muxes and switches in a crossbar
  – The entropy is at most P lg(L/P)
  – The entropy per switch is at most lg(L/P) / (L/P)
  – These bounds are achieved only when each mux has size L/P and each configuration realizes a unique RRV
• Proof omitted because I DO NOT HATE YOU!
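A numeric sanity check of the bounds, not a proof. It assumes each mux selects exactly one of its inputs, so a crossbar whose muxes have sizes s_1 … s_P has s_1 × … × s_P configurations, and the entropy cannot exceed the lg of that count. The mux sizes below are made up.

```python
from math import log2


def config_entropy(mux_sizes):
    """lg of the number of mux configurations (an upper limit on entropy)."""
    return sum(log2(s) for s in mux_sizes)


def entropy_bound(P, L):
    """Theorem's bound: P * lg(L / P) for P muxes and L switches in total."""
    return P * log2(L / P)


def per_switch_bound(P, L):
    """Theorem's bound on entropy per switch: lg(L/P) / (L/P)."""
    return log2(L / P) / (L / P)


if __name__ == "__main__":
    for sizes in ([4, 4], [2, 6], [1, 7]):
        P, L = len(sizes), sum(sizes)
        print(sizes, round(config_entropy(sizes), 3), entropy_bound(P, L))
```

With the same 8 switches, two equal 4-input muxes reach the bound of 4 bits, while uneven splits fall short of it.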
What are we doing here?
• Lemieux and Lewis
  – Routability: Monte Carlo simulations
  – Area: Count switches
• Feng and Kaptanoglu
  – Routability: Crossbar entropy
  – Area: Entropy per switch
  – Caveat: Focus only on crossbars where we can count routable, non-redundant RRVs!
Type-1 Crossbar
• 1-level
  – L2 muxes are driven directly by crossbar input signals
  – # routable RRVs depends on L2 crossbar topology
• Not area-efficient due to big L2 muxes
• Xilinx Virtex-style
Type-2 Crossbar
• 2-level
  – L1 is sparsely populated
  – L2 is fully populated
• Fully populated L2 reduces area efficiency
• VPR
  – Fc,in determines L1 population density
Type-3 Crossbar
• 2-level, partitioned
  – L1 partition Pi only drives L2 partition Oi
  – From input m to LUT input n, all paths go through muxes in Pi and Oi exclusively
  – # routable RRVs is the product of # routable RRVs for each disjoint sub-crossbar
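Because the count for a partitioned crossbar is a product over its sub-crossbars, the entropy (its lg) is simply a sum. A tiny sketch with made-up per-partition counts:

```python
from math import log2, prod


def partitioned_entropy(routable_counts):
    """Entropy of a partitioned crossbar from per-partition routable RRV counts."""
    return sum(log2(c) for c in routable_counts)


if __name__ == "__main__":
    counts = [4, 8, 16]  # hypothetical counts for three disjoint sub-crossbars
    print(prod(counts), partitioned_entropy(counts))
```

This is what makes the partitioned design countable: each sub-crossbar can be analyzed on its own and the results added.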
Proposed Type-3 Crossbar and Generation Algorithm
• Each sub-crossbar is Type-2
• Can count # routable RRVs (details omitted)
Entropy vs. # Switches
Entropy vs. Global Routing Mux Usage
The Bottom Line…
• Who cares…
  – Theoretical properties are cute
  – Actel/Microsemi did not use these crossbars in their FPGAs
• Practical observation…
  – The cheaper you make the intra-cluster routing crossbar, the more expensive the global routing…
A 65nm flash-based FPGA fabric optimized for low cost and power
Jonathan W. Greene, et al.
International Symposium on FPGAs, 2011
Note: Paper is from Microsemi (Feng and Kaptanoglu are co-authors)
Corporate Secrets Divulged
• They used a Clos Network
  – Three parameters: m, n, r
Clos Network Properties
• Used when the physical circuit switching needs to exceed the capacity of the largest feasible single crossbar
• Much cheaper than a fully populated nxn crossbar
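A rough crosspoint count shows where the savings come from. The slides do not spell out what m, n and r mean, so the sketch assumes the usual textbook convention for a 3-stage Clos network: r ingress switches of size n×m, m middle switches of size r×r, and r egress switches of size m×n, serving r·n ports in total.

```python
def clos_crosspoints(m, n, r):
    """Crosspoints in a 3-stage Clos network (assumed textbook convention)."""
    ingress = r * (n * m)  # r switches, each n x m
    middle = m * (r * r)   # m switches, each r x r
    egress = r * (m * n)   # r switches, each m x n
    return ingress + middle + egress


def crossbar_crosspoints(n, r):
    """Crosspoints in one fully populated crossbar over the same r*n ports."""
    return (r * n) ** 2


if __name__ == "__main__":
    m, n, r = 16, 16, 16
    print(clos_crosspoints(m, n, r), crossbar_crosspoints(n, r))
```

For 256 ports this gives 12288 crosspoints against 65536 for the single crossbar. The advantage shrinks for small networks, so the comparison is worth redoing for the sizes of interest.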
Strict-sense Nonblocking Clos Network (m ≥ 2n – 1)
• An unused input on an ingress switch can always be connected to an unused output on an egress switch, without reconfiguration!
Rearrangeably Nonblocking Clos Network (m ≥ n)
• An unused input on an ingress switch can always be connected to an unused output on an egress switch, but reconfiguration may be necessary!
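The two classes can be told apart from m and n alone, using the standard conditions m ≥ 2n – 1 (strict-sense) and m ≥ n (rearrangeable). A minimal sketch, assuming m is the number of middle-stage switches and n the number of inputs per ingress switch; the function name and return strings are illustrative.

```python
def clos_blocking_class(m, n):
    """Classify a 3-stage Clos network by its middle-stage count m.

    ASSUMPTION: n is the number of inputs on each ingress switch.
    """
    if m >= 2 * n - 1:
        return "strict-sense nonblocking"
    if m >= n:
        return "rearrangeably nonblocking"
    return "blocking"


if __name__ == "__main__":
    for m in (3, 4, 6, 7):
        print(m, clos_blocking_class(m, n=4))
```

The gap between the two thresholds is the price of never having to reconfigure existing connections: nearly twice as many middle switches.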
Recursive Clos Network Design
• Scalable to any ODD number of stages
  – Replace center crossbar with a 3-stage Clos Network