Challenges in the Static Timing Analysis of FPGAs
Tom Spyrou, TAU 2015, 3/2015
Programmability: Where do FPGAs fit?
[Figure: device spectrum, in order: Intel CPU, TI DSP, MultiCore, ManyCore, GPU, FPGA, ASSP, ASIC. One axis is flexibility and programming abstraction, the other is performance, area and power efficiency.]
CPU:
- Market-agnostic
- Accessible to many programmers (C++)
- Flexible, portable
ASIC:
- Market-specific
- Fewer programmers
- Rigid, less programmable
- Hard to build (physical)
FPGA:
- Somewhat restricted market
- Harder to program (Verilog)
- More efficient than SW
- More expensive than ASIC
FPGA End Markets
Digital Consumer
- Entertainment: broadband, audio/video, video display
- Broadcast: studio, satellite, broadcasting
Communications
- Wireless: cellular, basestations, wireless LAN
- Networking: switches, routers
- Wireline: optical, metro, access
Computer and Storage
- Computer: servers, mainframe
- Storage: RAID, SAN
- Office Automation: copiers, printers, MFP
Industrial
- Instrumentation: medical, test equipment, manufacturing
- Security/Energy Mgmt.: card readers, control systems, ATM
- Auto: navigation, entertainment
- Military: secure comm., radar, guidance and control
FPGA User Programming Model
User writes Verilog (or VHDL, or schematic)
Quartus compiles the Verilog to a bitstream
- Synthesis: Verilog -> gates
- Tech-Mapping: gates -> device-specific LUTs & FFs
- Clustering: LUTs + FFs -> LAB clusters
- Placement: LABs -> placed LABs with an (x,y) position
- Routing: abstract connections -> exact routing
- STA: timing evaluated vs. constraints
- Assembly: routing converted to bitstream
- Programming: bitstream downloaded onto FPGA
(More on this in the Software Flow Section)
FPGA CAD
Map to LABs, not standard cells
Routing is setting mux select line bits
[Figure: user HDL entering the flow. Sample Verilog fragment, cut off on the slide:]
    // Begin: Write Control
    always @ (posedge wrbusy_int)
    begin
        write0 <= 1'b1;
        write1 <= 1'b0;
        writex <= 1'b0;
    end
    always @ (negedge wrbusy_int)
    begin
        write0 <= 1'b0;
    end
    always @ (posedge write0_done)
    begin
        write1 <= 1'b1;
[Figure: Quartus II tool flow built around the Quartus II database (device features and timing information). Blocks shown: Synthesis (3rd-party or Altera), Merge, Placement & Routing, Timing Analysis, Power, Assembler, Simulator (3rd-party or Altera EDA), Programmer.]
What is FPGA Fabric – Logic Array Block
[Figure: LAB logic cell with input muxing, logic cell, optional DFF and output muxing. LAB: x20.]
Bottom line: Quartus generates a configuration bitstream which sets the logic functions and routing steering to instantiate one hardware design in the device.
[Figure labels: hard blocks, routing fabric, secondary signals (CE, SLOAD, …)]
FPGA Interconnect Model
[Figure: routing example between LAB (4,6) and LAB (12,9) using H3 and V4 wires through DIM, LIM and LEIM muxes, with LUT inputs A, B, C, D and CRAM programming values (0xab0f, 0x81, 0xf0, 0x14, 0x44, 0x24).]
Wires are point-to-point
Individual bits, not groups or word-wise
Statically programmed by SW to establish the necessary connection
No bus, protocol, etc. routing (unless built on top)
Unique Challenges in STA of FPGAs
Fixed device with programmable LUTs, routing and various IP
I would like to break down the challenges into categories
Verification of the un-programmed device
- Many possible modes due to programmability
- Delay calculation of non-CMOS structures like pass-gate muxes
Verification of a user's compiled design
- CRPR analysis can be very expensive
  - Large clock latency and skew, tree used versus mesh
  - Long combinational paths with lots of re-convergent logic
- Slow logic that is still much faster than software on a CPU
- Incremental moves affect function, not just delay of instances
  - CRAM configuration constant changes
  - Mode changes
Unique Challenges in STA of FPGAs
Periphery and core have different challenges
- Programmable core logic implementing functions via look-up tables
- Peripheral IP blocks performing programmable but less flexible tasks
  - SerDes, DSP, RAM, ARM core, etc.
- Periphery blocks often implemented with ASIC-style flows
- Core is full custom with pass gates
  - Delay modelling and parasitic reduction are challenges
- Both have challenges due to configurability
I cannot cover all the challenges and will focus on 3
- LUT modelling
- Mode explosion: flat implementation with hierarchical modelling
- Modelling pass-gate-based multiplexors
LUT Overview
For the purposes of this tutorial, let's assume we have a 3-LUT, i.e. 3 inputs on the select lines to select one of 8 bits driven by the CRAM.
This 3-LUT can implement any logic function of 3 inputs by assigning appropriate values to the CRAM.
We call the 8-bit value b[7:0] the LUTMASK.
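As a minimal sketch of the lookup itself (not taken from the slides): the function name `lut3_eval` is hypothetical, and the bit ordering, with A as the least significant select bit, is an assumption chosen to agree with the example LUTMASKs used later in the deck.

```python
def lut3_eval(lutmask: int, a: int, b: int, c: int) -> int:
    """Return output Y of a 3-LUT by selecting one bit of the 8-bit LUTMASK."""
    # Assumed ordering: A drives the first-level muxes, so A is the LSB of the index.
    index = a | (b << 1) | (c << 2)
    return (lutmask >> index) & 1


# LUTMASK 10001000 (b7 and b3 set) behaves as Y = A & B.
AND_AB = 0b10001000
truth_table = [lut3_eval(AND_AB, a, b, c)
               for c in (0, 1) for b in (0, 1) for a in (0, 1)]
```

Here `truth_table` lists the output for b0 through b7 in order, so it can be compared directly against the CRAM column in the LUT diagram.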
[Figure: 3-LUT with select inputs A, B, C, CRAM bits b0..b7 and output Y]
Timing Arcs Dependency on LUTMASK
The existence and delays of the timing arcs from A=>Y, B=>Y, and C=>Y are dependent upon the LUTMASK.
- For example, if the bits are all 0s, then Y = 0 and there are no arcs from any of the inputs to the output. This is a degenerate case.
[Figure: 3-LUT with select inputs A, B, C, CRAM bits b0..b7 and output Y]
Timing Arcs Dependency on LUTMASK
The existence and delays of the timing arcs from A=>Y, B=>Y, and C=>Y are dependent upon the LUTMASK.
- For example, if the bits are 10001000 (as shown in the diagram), there is no arc for C=>Y. [This LUTMASK implements the logic function Y=A&B.]
Unateness is a function of LUTMASK
- This configuration should have positive-unate arcs
- Ignoring unateness will hurt fmax, but is not necessarily critical for early Quartus development
[Figure: 3-LUT with select inputs A, B, C and output Y; CRAM bits b0..b7 = 0, 0, 0, 1, 0, 0, 0, 1]
Timing Arcs Dependency on LUTMASK
The existence and delays of the timing arcs from A=>Y, B=>Y, and C=>Y are dependent upon the LUTMASK.
- Another example: if the bits are 10101010 (as shown in the diagram), there is no arc for B=>Y or C=>Y. [This LUTMASK implements the logic function Y=A.]
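The three examples (all 0s, 10001000, 10101010) can be checked with a small sketch. This is one straightforward way to test arc existence, not necessarily how Quartus does it: an arc from a pin exists when holding that pin at 0 versus 1 gives different output columns. The helper names are hypothetical and A is assumed to be the least significant select bit.

```python
def cofactor_bits(lutmask: int, k: int, pin: int, value: int) -> list:
    """LUT output bits for every input combination with `pin` held at `value`."""
    return [(lutmask >> idx) & 1
            for idx in range(1 << k)
            if ((idx >> pin) & 1) == value]


def has_arc(lutmask: int, k: int, pin: int) -> bool:
    """True if toggling `pin` can change Y for some setting of the other inputs."""
    return cofactor_bits(lutmask, k, pin, 0) != cofactor_bits(lutmask, k, pin, 1)


def arcs(lutmask: int, k: int = 3) -> list:
    """Names of the timing arcs that exist for this LUTMASK."""
    names = "ABCDEF"  # pin 0 is A (assumed LSB of the select index)
    return [f"{names[p]}=>Y" for p in range(k) if has_arc(lutmask, k, p)]


all_zero = arcs(0b00000000)   # degenerate case: constant output
and_ab = arcs(0b10001000)     # Y = A & B
buf_a = arcs(0b10101010)      # Y = A
```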
[Figure: 3-LUT with select inputs A, B, C and output Y; CRAM bits b0..b7 = 0, 1, 0, 1, 0, 1, 0, 1]
Enumerating Timing Arc Dependencies
One method to identify all the arcs as a function of the LUTMASK is to enumerate all 256 LUTMASK possibilities along with the arc dependencies.
This becomes infeasible with a 6-LUT, where there are 64 bits driven by CRAM, resulting in 2^64 enumerations.
An alternate method is to notice the pattern of dependencies.
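To make the scaling concrete, here is a hedged sketch of the brute-force table for the 3-LUT alongside the mask count for larger LUTs. The function names are hypothetical and A is assumed to be the least significant select bit. The point is only that the table is trivial at 256 entries and out of reach at 2^64.

```python
def arc_signature(lutmask: int, k: int) -> tuple:
    """For each pin 0..k-1, does an arc pin=>Y exist under this LUTMASK?"""
    sig = []
    for pin in range(k):
        exists = any(
            ((lutmask >> idx) & 1) != ((lutmask >> (idx | (1 << pin))) & 1)
            for idx in range(1 << k)
            if not (idx >> pin) & 1      # visit each pair once, from its pin=0 side
        )
        sig.append(exists)
    return tuple(sig)


def num_lutmasks(k: int) -> int:
    """A k-LUT has 2**k CRAM bits, hence 2**(2**k) possible LUTMASKs."""
    return 2 ** (2 ** k)


# Brute-force table for the 3-LUT: 256 entries, cheap to build.
table3 = {mask: arc_signature(mask, 3) for mask in range(num_lutmasks(3))}
no_arc_masks = [m for m, sig in table3.items() if not any(sig)]
```

Only the two constant masks (all 0s and all 1s) end up with no arcs at all. Evaluating `arc_signature` for a single mask costs on the order of k * 2^k bit tests, which is why a per-mask pattern check scales where a full table does not.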
[Figure: 3-LUT with select inputs A, B, C and output Y; CRAM bits b0..b7 = 0, 1, 0, 1, 0, 1, 0, 1]
Enumerating Timing Arc Dependencies
A positive-unate arc for A=>Y will exist if, for any of the first-level muxes, the first bit is a 0 and the second bit of the same mux is a 1.
- Formally, it may be written as: (!b0 && b1) || (!b2 && b3) || (!b4 && b5) || (!b6 && b7)
A negative-unate arc for A=>Y will exist if, for any of the first-level muxes, the first bit is a 1 and the second bit of the same mux is a 0.
- Formally, it may be written as: (b0 && !b1) || (b2 && !b3) || (b4 && !b5) || (b6 && !b7)
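The two expressions for input A generalize to any pin by pairing each LUTMASK bit where the pin is 0 with the bit where it is 1. The sketch below assumes that generalization and that A is the least significant select bit. The function names and the four result labels are hypothetical. When both conditions hold, the arc is reported as non-unate.

```python
def unateness(lutmask: int, k: int, pin: int) -> str:
    """Classify the arc pin=>Y of a k-LUT from its LUTMASK."""
    rising = falling = False
    for lo in range(1 << k):
        if (lo >> pin) & 1:
            continue                      # visit each bit pair once, from its pin=0 side
        hi = lo | (1 << pin)
        b_lo = (lutmask >> lo) & 1        # output with pin = 0
        b_hi = (lutmask >> hi) & 1        # output with pin = 1
        rising = rising or (b_lo == 0 and b_hi == 1)
        falling = falling or (b_lo == 1 and b_hi == 0)
    if rising and falling:
        return "non_unate"
    if rising:
        return "positive_unate"
    if falling:
        return "negative_unate"
    return "no_arc"


def slide_positive_a(lutmask: int) -> bool:
    """The slide's expression for a positive-unate A=>Y arc on a 3-LUT."""
    b = [(lutmask >> i) & 1 for i in range(8)]
    return bool((not b[0] and b[1]) or (not b[2] and b[3])
                or (not b[4] and b[5]) or (not b[6] and b[7]))
```

For example, `unateness(0b10001000, 3, 0)` classifies A=>Y for Y=A&B, and a mask such as `0b01100110` (A XOR B, chosen here for illustration) satisfies both the rising and falling conditions for A.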
[Figure: 3-LUT with select inputs A, B, C and output Y; CRAM bits b0..b7 = 0, 1, 0, 1, 0, 1, 0, 1]
LUT timing is an instance of case analysis
In ASIC-style STA, case analysis can be slow
It happens once and is not revisited during incremental timing
Symbolic simulation has acceptable runtime
In FPGA timing, especially incremental timing, the evaluation has to be done on every netlist modification that affects logic
Modes can explode for complex blocks
Imagine a large block with many modes
- Mode-dependent timing is used to gain accuracy in STA
This block is used by a parent block with many cell modes, continuing up multiple levels
The number of possible modes can explode, especially if automatic tools such as PrimeTime's extract_model command are used to enumerate them
Design teams want to do physical design at the highest possible level
Timing modelling, which needs to avoid an explosion of modes, wants to build models at a lower level
It is not uncommon to have a complex block like a DSP with 10K modes
We have no problem building these models, but they can be slow in STA even when the STA has been tuned to handle many more modes than in ASIC flows.
PrimeTime and other commercial tools simultaneously build the graph for, and delay-calculate, all modes. No commercial tool can load and link Altera's full chip.
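A rough illustration of why the count explodes when modes are enumerated automatically up the hierarchy: if every combination of child modes is treated as a distinct parent mode, the counts multiply at each level. The block structure and all numbers below are hypothetical, and real designs have fewer legal combinations than this upper bound.

```python
from math import prod


def worst_case_modes(block: dict) -> int:
    """Upper bound on modes: the block's own modes times the product of its children's."""
    children = block.get("children", [])
    return block["own_modes"] * prod(worst_case_modes(c) for c in children)


# Hypothetical three-level hierarchy.
leaf = {"own_modes": 4}
mid = {"own_modes": 3, "children": [leaf, leaf]}
top = {"own_modes": 2, "children": [mid, mid, mid]}

leaf_modes = worst_case_modes(leaf)
mid_modes = worst_case_modes(mid)
top_modes = worst_case_modes(top)
```

Building the timing model one level lower keeps the per-model count near `mid_modes` rather than `top_modes`, which is the motivation for the approaches that follow.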
Two possible approaches
Goal is to build models one level lower in the Verilog hierarchy and provide a netlist of models to Quartus and PrimeTime
Perform place and route hierarchically
- More work for design teams
- Less optimal results
Use ICC's hierarchical Verilog + flat SPEF to build a timing model below the top level
Hierarchical Place and Route / Extraction
Perform place and route hierarchically
- Pros
  - SPEF is divided naturally by hierarchical P&R and extraction
  - Manual floorplan of top level may improve QoR over automatic P&R
  - Run time of P&R for lower-level blocks will be dramatically faster, allowing more time for manual inspection and improvement of results
- Cons
  - Design engineer must manually floorplan the top level
  - Multiple runs of P&R and extraction to manage
  - Possible QoR degradation if floorplan is poorly done
Model extraction one level lower
Use ICC's hierarchical Verilog + an extracted flat SPEF to build a timing model below the top level
- Pros
  - No change to construction flow
- Cons
  - Some loss of accuracy on boundary RC delay calculation
  - RC tree of boundary nets turned into lumped R and lumped C
  - On the order of 5% of the final gate in the path to the output/input of the model
Approach
- Read hierarchical Verilog in PT + flat SPEF + SDC for top level
- write_parasitics -format spef -nets [get_nets -hier sub_instance_name/*] for sub block
- write_parasitics -format spef -nets [get_nets *] for top
- characterize_context -environment -timing sub_instance_name
- Avoid boundary nets in context with -no_boundary_annotations or no boundary nets in SPEF
- Post-process SPEF to remove prepended sub_instance_name from all names in map
- Restart pt_shell with current_design as sub_module
- Load SPEF and environment context
- extract_model
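The SPEF post-processing step can be sketched as follows. This is not the script used in the flow. It assumes a conventional SPEF `*NAME_MAP` section with one `*<index> <name>` entry per line, a `/` hierarchy divider, and no escaped dividers in names. The function name is hypothetical.

```python
import re


def strip_instance_prefix(spef_text: str, instance: str, divider: str = "/") -> str:
    """Remove a leading '<instance><divider>' from every name in the *NAME_MAP section."""
    prefix = instance + divider
    out, in_map = [], False
    for line in spef_text.splitlines():
        if line.strip() == "*NAME_MAP":
            in_map = True
        elif in_map and re.match(r"\*\d+\s", line):
            index, name = line.split(None, 1)      # e.g. "*12", "sub_inst/u1/n3"
            if name.startswith(prefix):
                line = f"{index} {name[len(prefix):]}"
        elif in_map and line.startswith("*"):
            in_map = False                         # next section, e.g. *PORTS or *D_NET
        out.append(line)
    return "\n".join(out)
```

Names outside the name map (for example, unmapped names inside `*D_NET` sections) are left untouched here. A production script would also need to honor the `*DIVIDER` declared in the SPEF header.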
FIHM – Model Validation Flow
Model comparison between:
- Flat model (golden)
- Hierarchical model (consuming molecule timing Liberty model)
Parasitics for hierarchical model validation generated by hacking the flat SPEF file:
- Rename standard cells' leaf pins to the molecule's boundary pins.
- Zero R and C for nets connected within a molecule block.
- Parasitics only extracted from flat SPEF for the nets connected to top-level elements or output ports.
Correlation Results
Testcase: mm_core_digital
Total timing paths = 1444.
- 60 timing paths are pessimistic by more than 20 ps compared to the flat model (4% of distribution)
- 14 timing paths are optimistic by more than 20 ps compared to the flat model (1% of distribution)
95% of total paths agreed within ±20 ps
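The quoted percentages follow directly from the path counts. A short check, using only the numbers on the slide (the variable and function names are illustrative):

```python
total_paths = 1444
pessimistic = 60   # pessimistic by more than 20 ps vs. the flat model
optimistic = 14    # optimistic by more than 20 ps vs. the flat model

within_20ps = total_paths - pessimistic - optimistic


def pct(count: int) -> float:
    """Share of all timing paths, in percent."""
    return 100.0 * count / total_paths


summary = {
    "pessimistic_pct": round(pct(pessimistic), 1),
    "optimistic_pct": round(pct(optimistic), 1),
    "within_20ps_pct": round(pct(within_20ps), 1),
}
```

This gives 1370 paths within ±20 ps, and the three shares round to the 4%, 1% and 95% quoted above.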
N-MOS gate multi-stage Multiplexors
Multiplexors are pervasive in an FPGA
They are designed using NMOS pass gates to save area
This causes a timing model challenge
The input pin capacitance changes with each select line configuration
Think of the mux as a set of switches
The output load is seen on the input
Usual use of Liberty assumes a fixed input capacitance or fixed receiver model
Quartus compiler uses fast SPICE, but we want a model for PrimeTime as well
N-MOS gate multi-stage Multiplexors
Select lines enabled by CRAM
2-stage one-hot mux
Input cap varies depending on the path taken and on whether other side loads' select lines are on or off
Each possible path through the multi-stage mux requires its own pin cap
Arc-specific receiver model
This is part of the CCS noise model
It would be nice if there were a more natural way to support arc- and mode-specific pin caps in Liberty
[Figure label: other NMOS inputs]
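A first-order way to picture the varying pin cap is to treat each pass gate as an ideal switch and sum the capacitance that becomes connected to an input. The sketch below does only that: it ignores pass-gate resistance and side loads from other inputs, and every capacitance value and name in it is hypothetical rather than taken from any device or Liberty model.

```python
from dataclasses import dataclass


@dataclass
class TwoStageMux:
    """Lumped-capacitance view of one input of a 2-stage one-hot pass-gate mux."""
    c_gate_in: float = 0.2   # hypothetical: diffusion cap at the pass-gate input, fF
    c_node: float = 0.5      # hypothetical: internal node between the stages, fF
    c_out: float = 1.5       # hypothetical: output node plus downstream load, fF

    def input_cap(self, stage1_on: bool, stage2_on: bool) -> float:
        """Capacitance seen at the input for one select-line configuration."""
        cap = self.c_gate_in
        if stage1_on:                 # first switch closed: internal node is visible
            cap += self.c_node
            if stage2_on:             # second switch closed: output load is visible
                cap += self.c_out
        return cap


mux = TwoStageMux()
caps = {(s1, s2): mux.input_cap(s1, s2)
        for s1 in (False, True) for s2 in (False, True)}
```

Even this toy model yields three distinct pin caps for a single input, which is the mismatch with a single fixed Liberty input capacitance that the slide describes.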
Incentive for EDA Companies to help
As each process generation becomes more complex, the number of unique chip starts decreases.
- Already from 12K down to less than 3K per year
Each chip that is designed will be increasingly hyper-optimized.
- Custom tricks that need to be modeled at the gate level
FPGA use is increasing as its ability to run 1 GHz designs at reasonable power approaches
FPGA compilers will not be able to model every effect
- ICC needs PrimeTime
- Encounter needs ETS
Eventually FPGA compilers may need to output their programmed CRAM bits as constants and do a Super-Signoff in commercial STA tools
- This could be a good possibility for market growth