tau 2015 spyrou fpga timing

Challenges in the Static Timing Analysis of FPGAs
Tom Spyrou
TAU 2015, 3/2015

Programmability: Where do FPGAs fit?

[Figure: spectrum of devices, in slide order: Intel CPU, TI DSP, MultiCore, ManyCore, GPU, FPGA, ASSP, ASIC. Axis labels: "Flexibility, Programming Abstraction" and "Performance, Area and Power Efficiency".]

CPU:
• Market-agnostic
• Accessible to many programmers (C++)
• Flexible, portable

ASIC:
• Market-specific
• Fewer programmers
• Rigid, less programmable
• Hard to build (physical)

FPGA:
• Somewhat restricted market
• Harder to program (Verilog)
• More efficient than SW
• More expensive than ASIC

FPGA End Markets

- Entertainment: Broadband, Audio/video, Video display
- Broadcast: Studio, Satellite, Broadcasting
- Wireless: Cellular, Basestations, Wireless LAN
- Networking: Switches, Routers
- Wireline: Optical, Metro, Access
- Computer: Servers, Mainframe
- Storage: RAID, SAN
- Office Automation: Copiers, Printers, MFP
- Instrumentation: Medical, Test equipment, Manufacturing
- Security/Energy Mgmt.: Card readers, Control systems, ATM
- Auto: Navigation, Entertainment
- Military: Secure comm., Radar, Guidance and control

Market groups: Computer and Storage, Communications, Industrial, Digital Consumer

FPGA User Programming Model

• User writes Verilog (or VHDL, or schematic)
• Quartus compiles the Verilog to a bitstream
- Synthesis: Verilog -> Gates
- Tech-Mapping: Gates -> Device-specific LUTs & FF
- Clustering: LUTs+FF -> LAB clusters
- Placement: LABs -> placed LABs with an (x,y) position
- Routing: Abstract connections -> exact routing
- STA: Timing evaluated vs. constraints
- Assembly: Routing converted to bitstream
- Programming: Bitstream downloaded onto FPGA

(More on this in the Software Flow Section)

FPGA CAD

• Map to LABs, not standard cells
• Routing is setting mux select line bits

[Figure: Verilog source fragment, cut off on the slide]

    // Begin: Write Control
    always @ (posedge wrbusy_int)
    begin
        write0 <= 1'b1;
        write1 <= 1'b0;
        writex <= 1'b0;
    end

    always @ (negedge wrbusy_int)
    begin
        write0 <= 1'b0;
    end

    always @ (posedge write0_done)
    begin
        write1 <= 1'b1;

[Figure: Quartus II tool flow around a central Quartus II Database (device features and timing information). Blocks shown: Synthesis (3rd-party or Altera), Merge, Placement & Routing, Timing Analysis, Power, Assembler, Simulator, Programmer, EDA (3rd-party or Altera).]

What is FPGA Fabric – Logic Array Block

[Figure labels: Input Muxing, Logic Cell, Optional DFF, Output Muxing]

Bottom line: Quartus generates a configuration bitstream that sets the logic functions and the routing steering, instantiating one hardware design in the device.

[Figure labels: LAB: x20, Hard Blocks, Routing Fabric, Secondary Signals (CE, SLOAD, …)]

FPGA Interconnect Model

[Figure: CRAM Programming. Labels: VDIM, HDIM, LIM and LEIM muxes; LAB (4,6) and LAB (12,9); wires V4 and H3; inputs ABCD; CRAM values 0xab0f, 0x81, 0xf0, 0x14, 0x44, 0x24.]

• Wires are point-to-point
• Individual bits, not groups or word-wise
• Statically programmed by SW to establish the necessary connection
• No bus, protocol, etc. routing (unless built on top)

Unique Challenges in STA of FPGAs

• Fixed device with programmable LUTs, routing and various IP
• I would like to break down the challenges into categories
• Verification of the un-programmed device
- Many possible modes due to programmability
- Delay calculation of non-CMOS structures like pass gate muxes
• Verification of a user's compiled design
- CRPR analysis can be very expensive
    Large clock latency and skew, tree used versus mesh
    Long combinational paths with lots of re-convergent logic
- Slow logic that is still much faster than software on a CPU
- Incremental moves affect function, not just delay of instances
    CRAM configuration constant changes
    Mode changes

Unique Challenges in STA of FPGAs

• Periphery and core have different challenges
- Programmable core logic implementing functions via look-up tables
- Peripheral IP blocks performing programmable but less flexible tasks
    SerDes, DSP, RAM, Arm core, etc.
- Periphery blocks often implemented with ASIC-style flows
- Core is full custom with pass gates
    Delay modelling and parasitic reduction are challenges
- Both have challenges due to configurability
• I cannot cover all the challenges and will focus on 3
- LUT modelling
- Mode explosion: flat implementation with hierarchical modelling
- Modelling pass-gate-based multiplexors

LUT Overview

• For the purposes of this tutorial, let's assume we have a 3-LUT, i.e. 3 inputs on the select lines to select one of 8 bits driven by the CRAM.
• This 3-LUT can be used to model any logic function of 3 bits by assigning appropriate values to the CRAM.
• We call the 8-bit value b[7:0] the LUTMASK.
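As a minimal sketch of the LUT just described, a 3-LUT can be viewed as an index into the LUTMASK. The function names are hypothetical, and the choice of A as the least-significant select bit is an assumption, picked to agree with the b0/b1 first-level pairing used in the unateness formulas later in the deck.

```python
def lut3_eval(lutmask: int, a: int, b: int, c: int) -> int:
    """Return output Y of a 3-LUT whose CRAM bits b[7:0] are `lutmask`.

    Assumption: A drives the first-level muxes, so the selected CRAM bit
    has index A + 2*B + 4*C.
    """
    index = (a & 1) | ((b & 1) << 1) | ((c & 1) << 2)
    return (lutmask >> index) & 1


def truth_table(lutmask: int) -> list:
    """List Y for every (A, B, C) combination, in CRAM bit order b0..b7."""
    return [lut3_eval(lutmask, i & 1, (i >> 1) & 1, (i >> 2) & 1)
            for i in range(8)]
```

Under this convention the LUTMASK 10001000 (0x88) evaluates as Y = A & B and 10101010 (0xAA) as Y = A, matching the examples on the following slides.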

[Figure: 3-LUT with select inputs A, B, C, CRAM bits b0 to b7 as data inputs, and output Y.]

Timing Arcs Dependency on LUTMASK

• The existence and delays of the timing arcs from A=>Y, B=>Y, and C=>Y are dependent upon the LUTMASK.
- For example, if bits are all 0s, then Y = 0 and there are no arcs from any of the inputs to the output. This is a degenerate case.


Timing Arcs Dependency on LUTMASK

• The existence and delays of the timing arcs from A=>Y, B=>Y, and C=>Y are dependent upon the LUTMASK.
- For example, if bits are 10001000 (as shown in the diagram), there is no arc for C=>Y. [This LUTMASK implements the logic function Y=A&B.]
• Unateness is a function of LUTMASK
- This configuration should have positive-unate arcs
- Ignoring unateness will hurt fmax, but is not necessarily critical for early Quartus development

[Figure: 3-LUT with select inputs A, B, C and output Y; CRAM bits b0 to b7 = 0, 0, 0, 1, 0, 0, 0, 1.]

Timing Arcs Dependency on LUTMASK

• The existence and delays of the timing arcs from A=>Y, B=>Y, and C=>Y are dependent upon the LUTMASK.
- Another example: if bits are 10101010 (as shown in the diagram), there is no arc for B=>Y or C=>Y. [This LUTMASK implements the logic function Y=A.]

[Figure: 3-LUT with select inputs A, B, C and output Y; CRAM bits b0 to b7 = 0, 1, 0, 1, 0, 1, 0, 1.]

Enumerating Timing Arc Dependencies

• One method to identify all the arcs as a function of the LUTMASK is to enumerate all 256 LUTMASK possibilities along with the arc dependencies.
• This becomes infeasible with a 6-LUT, where there are 64 bits driven by CRAM, resulting in 2^64 enumerations.
• An alternate method is to notice the pattern of dependencies.
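The brute-force enumeration is easy to sketch for a 3-LUT, and the same count shows why it does not scale. This is an illustration only: the function names are hypothetical, bit i of the mask is taken to be b_i with index A + 2*B + 4*C (an assumption), and an input is treated as having an arc whenever toggling it can change Y, ignoring unateness.

```python
def depends_on(lutmask: int, k: int, pin: int) -> bool:
    """True if toggling select input `pin` (0 = A) can change the output."""
    stride = 1 << pin
    for idx in range(1 << k):
        if idx & stride:
            continue  # visit each pair once, from the pin=0 side
        low = (lutmask >> idx) & 1
        high = (lutmask >> (idx | stride)) & 1
        if low != high:
            return True
    return False


def enumerate_arc_table(k: int) -> dict:
    """Map every LUTMASK of a k-LUT to a tuple of per-input arc flags."""
    return {mask: tuple(depends_on(mask, k, pin) for pin in range(k))
            for mask in range(2 ** (2 ** k))}


def num_lutmasks(k: int) -> int:
    """Number of distinct LUTMASK values for a k-LUT: 2^(2^k)."""
    return 2 ** (2 ** k)
```

Calling `enumerate_arc_table(3)` builds all 256 entries almost instantly, while `num_lutmasks(6)` is 2^64, far too many to tabulate. The per-mask check `depends_on`, by contrast, only inspects 2^(k-1) bit pairs per input.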


Enumerating Timing Arc Dependencies

• A positive-unate arc for A=>Y will exist if, for any of the first-level muxes, the first bit is a 0 and the second bit of the same mux is a 1.
- Formally, it may be written as: (!b0 && b1) || (!b2 && b3) || (!b4 && b5) || (!b6 && b7)
• A negative-unate arc for A=>Y will exist if, for any of the first-level muxes, the first bit is a 1 and the second bit of the same mux is a 0.
- Formally, it may be written as: (b0 && !b1) || (b2 && !b3) || (b4 && !b5) || (b6 && !b7)
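The two formulas above can be written as one loop over bit pairs. The sketch below is hypothetical (the name `arc_unateness` and the integer encoding, with bit i of `lutmask` equal to b_i, are assumptions). For input A it checks exactly the pairs (b0,b1), (b2,b3), (b4,b5), (b6,b7); applying the same pairwise comparison to B and C, with a different stride, is an extension the slide does not spell out.

```python
def arc_unateness(lutmask: int, k: int, pin: int) -> tuple:
    """Return (positive_arc_exists, negative_arc_exists) for one LUT input.

    `pin` 0 is A (first-level select), 1 is B, 2 is C, and so on.
    A pair is the two CRAM bits that differ only in the value of `pin`.
    """
    stride = 1 << pin
    positive = False
    negative = False
    for idx in range(1 << k):
        if idx & stride:
            continue  # each pair is visited once, from its pin=0 member
        low = (lutmask >> idx) & 1              # output when pin = 0
        high = (lutmask >> (idx | stride)) & 1  # output when pin = 1
        if low == 0 and high == 1:
            positive = True   # rising input can cause rising output
        elif low == 1 and high == 0:
            negative = True   # rising input can cause falling output
    return positive, negative
```

For LUTMASK 10001000 (0x88) this reports positive-unate arcs on A and B and no arc on C. A mask such as 0x66 (Y = A xor B under the assumed bit order) reports both flags set for A, i.e. a non-unate arc.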


LUT timing is an instance of case analysis

• In ASIC-style STA, case analysis can be slow
- Happens once and is not revisited during incremental timing
- Symbolic simulation has acceptable runtime
• In FPGA timing, especially incremental timing, the evaluation has to be done on every netlist modification that affects logic

Modes can explode for complex blocks

• Imagine a large block with many modes
- Mode-dependent timing is used to gain accuracy in STA
• This block is used by a parent block with many cell modes, continuing up multiple levels
• The number of possible modes can explode, especially if automatic tools such as PrimeTime's extract_model command are used to enumerate them
• Design teams want to do physical design at the highest possible level
• Timing modelling, which needs to avoid an explosion of modes, wants to build models at a lower level
• It is not uncommon to have a complex block like a DSP with 10K modes
• We have no problem building these models, but they can be slow in STA even when the STA has been tuned to handle many more modes than in ASIC flows.
• PrimeTime and other commercial tools simultaneously build the graph for, and delay-calculate, all modes. No commercial tool can load and link Altera's full chip.
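A rough way to see the explosion, under an assumption that the slide does not state: if each child instance's mode can be chosen independently of its siblings and of the parent's own modes, the mode count at each level is the product of the counts below it. The function name and all counts here are hypothetical.

```python
from math import prod


def combined_modes(own_modes: int, child_mode_counts: list) -> int:
    """Worst-case number of mode combinations for a block.

    Assumes every child mode combines freely with every other child mode
    and with each of the block's own modes (no pruning of illegal combos).
    """
    return own_modes * prod(child_mode_counts)


# Hypothetical hierarchy: four leaf cells with 10 modes each, one parent
# with 3 modes of its own, and a grandparent holding two such parents.
leaf_level = combined_modes(1, [10, 10, 10, 10])
parent_level = combined_modes(3, [10, 10, 10, 10])
grandparent_level = combined_modes(1, [parent_level, parent_level])
```

Even this small made-up hierarchy reaches 10,000 combinations one level up and 900 million two levels up, which is the motivation for building models at a lower level where the counts are still manageable.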

Two possible approaches

• Goal is to build models one level lower in the Verilog hierarchy and provide a netlist of models to Quartus and PrimeTime
• Perform Place and Route hierarchically
- More work for design teams
- Less optimal results
• Use ICC's hierarchical Verilog + flat SPEF to build a timing model below the top level

Hierarchical Place and Route / Extraction

• Perform Place and Route hierarchically
- Pros
    SPEF is divided naturally by hierarchical P&R and extraction
    Manual floorplan of top level may improve QoR over automatic P&R
    Run time of P&R for lower-level blocks will be dramatically faster, allowing more time for manual inspection and improvement of results
- Cons
    Design engineer must manually floorplan the top level
    Multiple runs of P&R and extraction to manage
    Possible QoR degradation if floorplan is poorly done

Model extraction one level lower

• Use ICC's hierarchical Verilog + an extracted flat SPEF to build a timing model below the top level
- Pros
    No change to construction flow
- Cons
    Some loss of accuracy on boundary RC delay calculation
    RC tree of boundary nets turned into lumped R and lumped C
    On the order of 5% of the final gate delay in the path to the output/input of the model
• Approach
- Read hierarchical Verilog in PT + flat SPEF + SDC for top level
- write_parasitics -format spef -nets [get_nets -hier sub_instance_name/*] for sub block
- write_parasitics -format spef -nets [get_nets *] for top
- characterize_context -environment -timing sub_instance_name
- Avoid boundary nets in context with -no_boundary_annotations or no boundary nets in SPEF
- Post-process SPEF to remove prepended sub_instance_name from all names in map
- Restart pt_shell with current_design as sub_module
- Load SPEF and environment context
- extract_model
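The SPEF post-processing step in the approach above could look roughly like the following. This is a sketch under stated assumptions, not the script used in the talk: the function name is hypothetical, the name map is assumed to follow the usual SPEF layout (a `*NAME_MAP` line followed by `*<index> <name>` entries, ended by the next `*KEYWORD` line), the hierarchy divider is assumed to be `/`, and escaped dividers inside names are not handled.

```python
import re

# One name-map entry: "*<index> <hierarchical name>"
_ENTRY = re.compile(r"^\*(\d+)\s+(\S+)\s*$")


def strip_instance_prefix(spef_lines, instance, divider="/"):
    """Drop a leading '<instance><divider>' from every *NAME_MAP entry.

    `spef_lines` is a list of lines without trailing newlines, for example
    the result of str.splitlines(). Lines outside the name map are kept.
    """
    prefix = instance + divider
    out = []
    in_map = False
    for line in spef_lines:
        stripped = line.strip()
        if stripped == "*NAME_MAP":
            in_map = True
            out.append(line)
            continue
        if in_map:
            match = _ENTRY.match(stripped)
            if match:
                index, name = match.groups()
                if name.startswith(prefix):
                    name = name[len(prefix):]
                out.append(f"*{index} {name}")
                continue
            if stripped.startswith("*"):
                in_map = False  # next section keyword ends the name map
        out.append(line)
    return out
```

A call such as `strip_instance_prefix(text.splitlines(), "sub_instance_name")` leaves entries that do not start with the instance name untouched, so the result should be inspected for stray top-level nets before it is loaded against the sub-module.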


FIHM – Model Validation Flow

• Model comparison between:
- Flat model (golden)
- Hierarchical model (consuming molecule timing Liberty model)
• Parasitics for hierarchical model validation generated by hacking the flat SPEF file:
- Rename standard cell's leaf pins to molecule's boundary pins.
- Zero R&C if nets connected within a molecule block.
- Parasitics only extracted from flat SPEF for the nets connected to top level elements or output ports.

Correlation Results

• Testcase: mm_core_digital
• Total timing paths = 1444
- 60 timing paths are pessimistic by > 20ps as compared to flat model (4% of distribution)
- 14 timing paths are optimistic by > 20ps as compared to flat model (1% of distribution)
• 95% of total paths agreed within ±20ps
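The rounded percentages follow directly from the path counts. A quick check, assuming that "agreed within ±20ps" means every path not counted in the two outlier buckets (the function name is hypothetical):

```python
def correlation_summary(total: int, pessimistic: int, optimistic: int) -> dict:
    """Percentage of paths in each bucket relative to the total path count."""
    within = total - pessimistic - optimistic
    return {
        "pessimistic_pct": 100.0 * pessimistic / total,
        "optimistic_pct": 100.0 * optimistic / total,
        "within_pct": 100.0 * within / total,
    }


summary = correlation_summary(total=1444, pessimistic=60, optimistic=14)
```

With the slide's numbers this gives about 4.2% pessimistic, 1.0% optimistic, and 94.9% within the ±20ps band, consistent with the 4%, 1% and 95% quoted.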


N-MOS gate multi-stage Multiplexors

• Multiplexors are pervasive in an FPGA
• They are designed using NMOS pass gates to save area
• This causes a timing model challenge
- The input pin capacitance changes with each select line configuration
- Think of the mux as a set of switches
- The output load is seen on the input
• Usual use of Liberty assumes a fixed input capacitance or fixed receiver model
• Quartus compiler uses fast SPICE but we want a model for PrimeTime as well

N-MOS gate multi-stage Multiplexors

• Select line enabled by CRAM
• 2-stage one-hot mux
• Input cap varies depending on path taken and if other side loads' select lines are on or off
• Each possible path through the multi-stage mux requires its own pin cap
- Arc-specific receiver model
- This is part of the CCS noise model
• It would be nice if there were a more natural way to support arc- and mode-specific pin caps in Liberty
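To illustrate why a single fixed pin capacitance does not fit, here is a deliberately simplified switch model of one input of a two-stage pass-gate mux. Everything in it is an assumption made for illustration: the function names, the capacitance values, and the idea that a closed switch simply exposes the downstream capacitance in full (real pass gates shield it resistively, and side loads on the internal node are ignored here).

```python
def input_pin_cap(c_pin: float, c_internal: float, c_load: float,
                  stage1_on: bool, stage2_on: bool) -> float:
    """Capacitance seen at a mux input for one select configuration.

    c_pin      : capacitance of the input pin itself
    c_internal : capacitance of the node between the two stages
    c_load     : load on the mux output
    """
    cap = c_pin
    if stage1_on:
        cap += c_internal        # first switch closed: internal node visible
        if stage2_on:
            cap += c_load        # both closed: the output load is seen too
    return cap


def pin_cap_table(c_pin: float, c_internal: float, c_load: float) -> dict:
    """Pin cap for each (stage1_on, stage2_on) select configuration."""
    return {(s1, s2): input_pin_cap(c_pin, c_internal, c_load, s1, s2)
            for s1 in (False, True) for s2 in (False, True)}
```

With made-up values of 1.0, 2.0 and 5.0 (arbitrary units), `pin_cap_table(1.0, 2.0, 5.0)` yields three distinct capacitances for the same pin, which is the kind of per-arc, per-mode variation the slide says Liberty's fixed receiver model does not express naturally.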

[Figure label: Other NMOS inputs]

Incentive for EDA Companies to help

• As each process generation becomes more complex, the number of unique chip starts decreases.
- Already down from 12K to fewer than 3K per year
• Each chip that is designed will be increasingly hyper-optimized.
- Custom tricks that need to be modeled at the gate level
• FPGA use is increasing as its ability to run 1 GHz designs at reasonable power approaches
• FPGA compilers will not be able to model every effect
- ICC needs PrimeTime
- Encounter needs ETS
• Eventually FPGA compilers may need to output their programmed CRAM bits as constants and do a Super-Signoff in commercial STA tools
- This could be a good possibility for market growth
