Timing Analysis Challenges for High-Speed CPUs at 90nm and Below
Avi Efrati, Moshe Kleyner

Agenda:
- ITRS Predictions & Design Challenges
- Timing Analysis at Intel
- Current issues and solutions
- Mid-term challenges
- Summary

The VLSI Chip in 2010...
- Process technology: 25nm gate length
- Transistors: 1,546 M
- Logic transistors: 300 M
- Size: 280 mm^2
- Clock frequency: 11.5 GHz
- Chip I/Os: 3,840
- Wiring levels (metals): 9 - 10
- Voltage: 0.8 - 1.0 V
- Power: 120 - 218 Watts
- Supply current: ~160 Amps
Source: ITRS ‘01 roadmap

Timing verification for Intel CPUs
- Synchronous design style, mostly
  - Multiple synchronized clocks, GHz range
  - NO trend to asynchronous design in near future
  - Deep pipelining
- Internal static timer – Tango
  - Cell-based, using abstract models for custom blocks
  - Handles transparent latches and sequential transparent loops, both BFS and DFS timing propagation options
  - Generates and uses proprietary abstract timing model for hierarchical timing
    - At each level an abstract timing model can be created for the next level
    - Typically 2-3 timing hierarchy levels
- PathMill used at device level, produces the same abstract model
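
A minimal sketch of BFS-style arrival time propagation on a small acyclic timing graph may help make the propagation option above concrete. It is not Tango's implementation: the graph encoding, the function name `propagate_arrivals`, and the example delays are all assumptions for illustration, and transparent latches and loops are not modeled.

```python
from collections import deque

def propagate_arrivals(edges, sources):
    """Propagate max arrival times through an acyclic timing graph.

    edges:   {(from_node, to_node): delay}  (hypothetical encoding)
    sources: {node: arrival time at a primary input}
    """
    successors, pending = {}, {}
    for (u, v), delay in edges.items():
        successors.setdefault(u, []).append((v, delay))
        pending[v] = pending.get(v, 0) + 1   # count of unprocessed fan-in edges
        pending.setdefault(u, 0)

    # Start from nodes with no fan-in; unspecified inputs arrive at time 0.
    arrival = {n: sources.get(n, 0.0) for n, k in pending.items() if k == 0}
    queue = deque(arrival)
    while queue:
        u = queue.popleft()
        for v, delay in successors.get(u, []):
            arrival[v] = max(arrival.get(v, float("-inf")), arrival[u] + delay)
            pending[v] -= 1
            if pending[v] == 0:      # all fan-ins seen: arrival[v] is final
                queue.append(v)
    return arrival


# Illustrative delays only (not taken from the slides).
edges = {("a", "g1"): 10.0, ("b", "g1"): 12.0, ("g1", "z"): 5.0, ("b", "z"): 20.0}
arrivals = propagate_arrivals(edges, {"a": 0.0, "b": 3.0})
print(arrivals["z"])
```

Each node is finalized only after all of its fan-in edges have been processed, which is what makes a breadth-first, levelized pass sufficient on an acyclic graph.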

What’s under the hood?
- Handling transparent loops
- False paths
- Hierarchical analysis
  - Shell models

Loops…
- Combinational loops are disallowed
  - Local self-resetting circuitry may exist
- Sequential loops exist
  - Formed by combinational paths and transparent latches
  - Actually form an SCC (strongly connected component), handled automatically
  - Typical for FSMs implemented with latches
[Figure: sequential loop example; clock labels clk, clk2, clk#]
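
Finding sequential loops as strongly connected components can be sketched with Tarjan's algorithm. The slides do not say which SCC algorithm is used, so this choice is an assumption, as are the latch names and the graph encoding (an edge means a combinational path from one transparent latch to another).

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; graph maps node -> iterable of successor nodes."""
    index, low, on_stack = {}, {}, set()
    stack, result = [], []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:           # v is the root of a component
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            result.append(component)

    for v in list(graph):
        if v not in index:
            visit(v)
    return result


def sequential_loops(graph):
    """Keep only components that contain a cycle (size > 1 or a self-edge)."""
    return [sorted(c) for c in strongly_connected_components(graph)
            if len(c) > 1 or c[0] in graph.get(c[0], ())]


# Hypothetical latch graph: L1 -> L2 -> L3 -> L1 forms a loop, FF1 does not.
latch_graph = {"L1": ["L2"], "L2": ["L3"], "L3": ["L1", "FF1"], "FF1": []}
loops = sequential_loops(latch_graph)
print(loops)
```

The recursive form is adequate for a small example; a production timer would need an iterative version to avoid recursion limits on large graphs.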

False Paths
- Manual marking of false paths, considered in timing analysis
- Automatic SAT-based false paths
  - Work done with K. Sakallah, U. Mich.
  - Applied in combinational logic
[Figure: circuit with signals a, b, c, d, e, f, g and output z, shown in several steps; annotations: b=0, c=1, d=0, e=1, c=0, then c=1 and c=0]
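
The idea behind automatic false-path detection can be illustrated with a toy check: a path is false if no input assignment satisfies all of its side-input requirements at once. The sketch below uses exhaustive enumeration as a stand-in for a SAT solver, and the signal names and conditions are hypothetical, loosely echoing a path that needs both c=1 and c=0.

```python
from itertools import product

def is_false_path(inputs, side_conditions):
    """Return True if no input assignment satisfies every side condition.

    inputs:          list of primary input names
    side_conditions: list of predicates over an {input: 0/1} assignment,
                     one per side input that must be non-controlling
    Exhaustive enumeration stands in for a SAT solver here.
    """
    for values in product((0, 1), repeat=len(inputs)):
        assignment = dict(zip(inputs, values))
        if all(cond(assignment) for cond in side_conditions):
            return False     # found a sensitizing assignment: path is true
    return True


inputs = ["b", "c", "d", "e"]

# Hypothetical path that needs c=1 at one gate and c=0 at a later gate.
conflicting = [lambda s: s["b"] == 0, lambda s: s["c"] == 1,
               lambda s: s["d"] == 0, lambda s: s["c"] == 0]
# Hypothetical path with compatible requirements.
compatible = [lambda s: s["b"] == 0, lambda s: s["c"] == 1, lambda s: s["e"] == 1]

print(is_false_path(inputs, conflicting), is_false_path(inputs, compatible))
```

Enumeration is exponential in the number of inputs, which is the reason a real flow would hand the same conditions to a SAT engine.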

Hierarchical Analysis
- Cannot analyze full-chip at transistor or gate level
  - Huge data, impractical run-time
- Abstract blocks as compact models
  - Hide internal details not relevant at chip level, assume pre-defined clocks
  - As accurate as possible electrical interface and timing model
  - Abstract model also supports timing transparency – BLUE BOX

Shell Model
- Interface cells and interconnect are preserved
  - User may select deeper than 1 shell
  - User may expose some transparent latches
  - Balance core complexity versus amount of cells exposed in full-chip, Deep Shell Model
- Cores are abstract timing models
- Full-chip analysis uses shell models of blocks
[Figure: shell model of blocks MB1 and MB2; labels: Core, Electrical Shell elements, Flat FC interconnect, Combinational Cells, FF1, FF2, L1, L2, L3, IN, OUT]

Current and near-term challenges
- CrossTalk impact on timing
- Active interconnect
- Mixed abstraction, device to full-chip
- Use of domino as characterized cells
- SoC challenges

CrossTalk impact on Timing
- CrossTalk has noise and timing impact
  - Search for highest peak noise while…
    - Victim transitions – for timing
    - Victim stable – for functional noise
- CrossTalk timing effect may be approximated as a Miller Xcap multiplier (MCF), but…
  - Default MCF may over- or under-estimate the effect
  - MCF is slope dependent, difficult to set upfront
  - AWE + superposition gives good results but may be too costly to apply everywhere
- Accuracy vs. run-time tradeoff is key
  - Timing filtering followed by local logic filtering
  - SMCF (smart MCF) or AWE-based peak
  - Timing iterations to converge CrossTalk impact
- Very active research in last few years !!
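
The combination of an MCF and timing filtering can be sketched as follows: a coupling capacitance is multiplied by the MCF only when the aggressor's switching window overlaps the victim's, and otherwise counted once. The default MCF of 2.0, the function names, and the capacitance and window values are assumptions for illustration; logic filtering and the iteration to convergence are not shown.

```python
def windows_overlap(win_a, win_b):
    """True if two (earliest, latest) switching windows intersect."""
    return win_a[0] <= win_b[1] and win_b[0] <= win_a[1]


def effective_cap(c_ground, couplings, victim_window, mcf=2.0):
    """Effective victim load with timing-window filtering.

    c_ground:  capacitance to ground
    couplings: list of (coupling_cap, aggressor_window)
    mcf:       Miller coupling factor applied to aggressors that can switch
               while the victim switches (2.0 is an assumed default)
    """
    total = c_ground
    for c_couple, aggressor_window in couplings:
        factor = mcf if windows_overlap(victim_window, aggressor_window) else 1.0
        total += factor * c_couple
    return total


# Illustrative values: capacitances in fF, windows in ps.
couplings = [(4.0, (0.0, 50.0)), (6.0, (200.0, 250.0))]
cap = effective_cap(10.0, couplings, victim_window=(40.0, 100.0))
print(cap)
```

Because the windows themselves depend on the delays computed with these capacitances, the filter has to be re-run after each timing pass until the windows stop changing.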

Fitting SMCF to experimental data
- Physically MCF depends on L = Tvic/Tagg
- Experimentally fitted with equation a - b*exp(-L)
[Figure: "Best fitting of MCF". X axis: slope ratio, Tvic/Tagg (0 to 3); Y axis: SMCF (1.4 to 2.6). Series: SMCF interpolated to err=0; SMCF best fitted to SMCF = a - b*exp(-L); SMCF initially used in experiments]
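
Since a - b*exp(-L) is linear in a and b once x = exp(-L) is substituted, the fit can be sketched with ordinary least squares. The slides give no fitted coefficients or raw data, so the points below are synthetic, generated from assumed values a = 2.6 and b = 1.2 purely to show that the fit recovers them.

```python
import math

def smcf(slope_ratio, a, b):
    """Slope-dependent Miller coupling factor, SMCF = a - b*exp(-L)."""
    return a - b * math.exp(-slope_ratio)


def fit_smcf(slope_ratios, mcf_values):
    """Least-squares fit of (a, b) using the substitution x = exp(-L)."""
    xs = [math.exp(-l) for l in slope_ratios]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(mcf_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, mcf_values))
    slope = sxy / sxx                    # slope of y versus x equals -b
    return mean_y - slope * mean_x, -slope


# Synthetic data from assumed coefficients (not measurements from the slides).
ratios = [0.25, 0.5, 1.0, 1.5, 2.0, 3.0]
samples = [smcf(l, 2.6, 1.2) for l in ratios]
a_fit, b_fit = fit_smcf(ratios, samples)
print(round(a_fit, 6), round(b_fit, 6))
```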

“Active” Interconnect
- For quite some time interconnect has not been negligible, now it becomes active!
  - Repeaters may be buffers, inverters, latches, flops
  - Virtual (early design) or real repeaters
- Interconnect may be:
  - Simple wire
  - Buffered (inverted or not)
  - Pipelined (and buffered)
- Pipelining the interconnect is considered simultaneously in RTL, floor plan and early timing
- Mutual inductance impact being assessed
- Asynchronous long-distance on-chip communication?
[Figure: interconnect between driver (Drv) and receivers (Rcv)]

Mixed Abstraction
- Layout becomes more cell-based… but circuit families in cells are more complex
  - Some circuits may be characterized as cells, some may require device-level analysis
  - Fluid cells & device-level optimization
- Comprehend devices, cells and abstract models in same run
  - Single timing graph
  - May need on-the-fly dynamic analysis on parts of circuit
    - Use circuit recognition capabilities
    - Requires stimuli generation
- More detailed waves, not only slope
  - Sophisticated timing checks for domino
  - Propagate also pulses, not only arrival time

Mixed-level Timing
- Cells, abstracts and devices co-exist at analysis level
  - Choose flexible abstraction/accuracy trade-off
[Figure: Core with mixed device/cells/abstracts]

Domino characterization
- Regular or footless domino as characterized cells
  - Will be supported in cell-based timing
  - Additional domino latches, etc…
- Delay similar to static cells and latches
- Checks are more complex !! …next page
[Figure: two schematics, "Footless And2" and "Domino And2", each labeled clk, inputs, domino node, keeper, output]
See Van Campenhout, Sakallah, Mudge paper 1999

Pulse Width Checks
- Need sufficiently wide pulse at domino node
  - Ensure pulse width to next stage
  - Ensure feedback can hold data
- Modeling issues
  - Slopes of inputs
  - Pulse width per discharge path
  - Translating inputs intersection into pulse at domino node
- Disallowing min-transparency converts pulse width to setup check
  - Non-transparency hold check
[Figure: waveforms for precharge/eval, inputs a and b, and the domino node]
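
A deliberately simplified sketch of the "inputs intersection" idea: treat each input as an ideal active window and require that the intersection of all windows is at least a minimum width. The function names and numbers are hypothetical, and the sketch ignores exactly the modeling issues listed above (input slopes and per-path discharge behavior).

```python
def input_intersection(windows):
    """Intersection of (start, end) active windows, or None if they never coincide."""
    start = max(w[0] for w in windows)
    end = min(w[1] for w in windows)
    return (start, end) if end > start else None


def pulse_width_ok(windows, min_width):
    """True if all inputs are simultaneously active for at least min_width."""
    overlap = input_intersection(windows)
    return overlap is not None and (overlap[1] - overlap[0]) >= min_width


# Illustrative windows in ps for inputs a and b of a two-input domino gate.
window_a = (100.0, 220.0)
window_b = (150.0, 260.0)
print(input_intersection([window_a, window_b]))
print(pulse_width_ok([window_a, window_b], min_width=60.0))
```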

SoC challenges
- Multi-core CPUs or high-integration SoC
  - New integration level in all areas – RTL, timing, layout, testing etc…
- Timing challenges
  - New level of hierarchical timing, more need for functionality-aware timing, better abstract models
  - Optimize interfaces without core re-design
  - Integrative approach, zoom-in from abstract to detailed in same environment
  - Multiple clocks, possibly asynchronous to each other
  - Inter-module communication, protocols, early spec and accurate verification
  - More in-die variation, instances of same module may operate at different Vcc/temperature etc…

Mid-term challenges
- MIS – Multiple Input Switching
- Process and environment variability
  - Voltage and temperature
  - Process variability
- Timing challenges due to leakage reduction techniques
  - Sleep transistors – usage methodology and support in timing

MIS – Multiple Input Switching
- More MIS situations as frequency increases
  - Fewer stages in clock cycle
  - Slope steepness increases slower than frequency
- Broad range of effects
  - Single stage well known
  - Impact across stages more subtle
    - Load stage may present different effective load due to Miller coupling
    - Either slow-down or speed-up
- Holding side input by real driver versus “ideal voltage” has accuracy impact
- Characterization/modeling issues

One gate slow-down / speed-up
[Figure, first panel: voltage (Volts, 0 to 1.2) vs. time (ps, 0 to 200) for inputs a and b; 12.6% pushout compared with the "single input switches" curve. Notes: Vds incremental across top device in series stack; mitigate with legging]
[Figure, second panel: voltage (Volts, -0.2 to 1.2) vs. time (ps, 0 to 200) for inputs a and b; 39.7% speedup compared with the "single input switches" curve. Note: effectively adds device strengths]

Two gates, Fanout pull-in
- c with a or b or both MIS
- Miller coupling c, o
- Position dependent
- No generic model
[Figure: voltage (Volts, 0 to 1.2) vs. time (ps, 0 to 200) for signals a, b, c, o and o2; 15.6% speedup. Annotations: Miller coupling, droop causes speedup on o; single input switching o; mitigate with legging, pushing down stack if only one signal critical]

Fanout Signal Location
- c with a, b or both MIS
- Either speedup or pushout based on connection
  - Connected to pin a: -15.6% to 12.6% variation
  - Connected to pin b: -0.8% to 0.3% variation
[Figure: voltage (Volts, 0 to 1.2) vs. time (ps, 0 to 200); signals a, b, c, o, o2]

MIS – Modeling issues
- Not so easy to model in CBD (Cell-Based Design)
  - Min/Max timing window provides a range of switching times
  - Window overlap of two inputs allows MIS but doesn’t guarantee it
- Assuming full MIS leads to over-design
  - Most important to check MIS effect on min-delay, which may lead to chip failure
  - Max-delay MIS may only reduce operating frequency
  - Possibly consider max-delay MIS as random variable over overlap window
- Easier to consider MIS in BFS timing propagation
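
A conservative min-delay check along these lines can be sketched as: if the min/max windows of two inputs overlap, assume MIS can occur and shorten the min delay by a characterized speedup; otherwise keep the single-input value. The function names and the `speedup_fraction` parameter are hypothetical, and since overlap allows MIS but does not guarantee it, the result is a pessimistic bound.

```python
def overlap_window(win_a, win_b):
    """Common part of two (min, max) arrival windows, or None if disjoint."""
    start = max(win_a[0], win_b[0])
    end = min(win_a[1], win_b[1])
    return (start, end) if start <= end else None


def min_delay_with_mis(single_input_min_delay, win_a, win_b, speedup_fraction):
    """Pessimistic min delay: apply the MIS speedup whenever windows overlap.

    speedup_fraction is an assumed, pre-characterized value (0.25 = 25% faster).
    """
    if overlap_window(win_a, win_b) is None:
        return single_input_min_delay
    return single_input_min_delay * (1.0 - speedup_fraction)


# Illustrative numbers in ps.
mis_delay = min_delay_with_mis(20.0, (100.0, 140.0), (130.0, 180.0), 0.25)
no_mis_delay = min_delay_with_mis(20.0, (100.0, 120.0), (130.0, 180.0), 0.25)
print(mis_delay, no_mis_delay)
```

Both input windows must be known before a gate is evaluated, which fits a breadth-first propagation order.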

Process and Environment Variability
- Both deterministic and random variation
  - The absolute variation of CD does not decrease at the same pace as channel length
  - Thus relative value of L and Vt variation increases
- Lower voltages, higher currents
  - Non-uniform Vdd on chip, consider Vdd in timing
  - Big drivers may “starve” neighbors
- Are variations causing significant critical path re-ordering?
  - “Nominal” timing is not good enough to accurately predict silicon
  - Worst-casing all effects reduces design space or makes design impossible
  - Consider chip map for deterministic variations
  - Need statistical approach in STA for random effects
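
The gap between worst-casing and a statistical treatment can be shown with a small calculation. The sketch assumes each stage delay has an independent Gaussian random component, which is an assumption of this example and not a claim from the slides; correlated and deterministic variation are not covered, and the stage values are illustrative.

```python
import math

def worst_case_delay(stages, k=3.0):
    """Every stage simultaneously at its own k-sigma corner."""
    return sum(mean + k * sigma for mean, sigma in stages)


def statistical_delay(stages, k=3.0):
    """k-sigma point of the path delay, assuming independent Gaussian stages."""
    mean = sum(m for m, _ in stages)
    sigma = math.sqrt(sum(s * s for _, s in stages))   # root-sum-square
    return mean + k * sigma


# Four identical stages: 10 ps mean, 1 ps sigma (illustrative values).
path = [(10.0, 1.0)] * 4
print(worst_case_delay(path), statistical_delay(path))
```

For this path the worst-case sum is 52 ps while the statistical 3-sigma point is 46 ps, and the gap widens as the number of stages grows.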

Reducing leakage power
- Most important for mobile and internet servers, as important as speed!
- Standby leakage
  - Power consumed when whole chip is idle, Tj is NOT high (spec temp. for mobile at 50C)
  - Impact on battery life for portable devices
- Active leakage
  - Power consumed due to device leakage when chip is working, and Tj is high (110C)
  - Subthreshold and gate leakage significantly higher
  - Impact on overall chip thermal design power and frequency
  - P_tot = P_switch + P_leak

Leakage Gating with Sleep Transistor
- Leakage is a main concern below 90nm
- Partition the chip to allow individual control of the sleep transistors
  - Sleep transistor is on while the block is working
  - Sleep transistor is off while the block is idle
[Figure: Blocks A, B, C and D, each with its own sleep control]

Sleep transistors in timing
- Difficult to comprehend in STA
  - Many cells share same virtual ground through one sleep transistor (legged/distributed in reality)
  - Voltage of virtual ground depends on current drawn by all active gates on same sleep transistor
    - Need to guarantee max/min voltage on virtual ground
    - How to verify statically min/max GND voltage?
- Need cell models and interaction models for cells on different virtual grounds
  - Logic grouping, by time of common switching
  - Estimate current needed in worst case
- Lack of support in timing tools is main limiting factor for using this technique
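
One way to picture "logic grouping by time of common switching" is a sweep over switching windows that finds the peak simultaneous current, followed by a first-order estimate of the virtual ground rise as current times sleep transistor on-resistance. This is a rough sketch under stated assumptions: constant current per cell over its window, a purely resistive sleep transistor, and hypothetical names and values throughout.

```python
def worst_case_current(cells):
    """Peak total current over time.

    cells: list of (start, end, current); each cell is assumed to draw a
    constant current during the half-open window [start, end).
    """
    events = []
    for start, end, current in cells:
        events.append((start, current))
        events.append((end, -current))
    # At equal times, negative deltas sort first, so a window ending at t
    # does not overlap one starting at t.
    events.sort()
    peak = running = 0.0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


def virtual_ground_rise(cells, r_sleep):
    """First-order virtual ground voltage: peak current times on-resistance."""
    return worst_case_current(cells) * r_sleep


# Illustrative values: windows in ps, currents in mA, resistance in ohms -> mV.
cells = [(0.0, 10.0, 1.0), (5.0, 15.0, 2.0), (12.0, 20.0, 4.0)]
print(worst_case_current(cells), virtual_ground_rise(cells, 10.0))
```

The resulting bound could then be compared against the max virtual ground voltage the cell models were characterized for.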

Summary
- STA is a key component of chip design
  - New VDSM and high frequency challenges
- Hierarchical models cope with full chip complexity
  - Electrical interaction across logical hierarchy boundaries
- CrossTalk, MIS, variability and more phenomena need efficient solutions
  - Will require more dynamic device-level analysis within static timing tools
  - Closer interaction with Logic/Satisfiability

Contributors
- Noel Menezes
- Florentin Dartu
- Ken Stevens
- Vladi Tsipenyuk
- Uri First
- Igor Keller
- Abhijit Dharchoudhury