01 / 2000 / PJZ / 6149-1SPPDG 06 / 2006 / SMC / 1
Design for Performance
Optimization of Virtex Devices
Steve Currie – 6/26/2006
Mayo Clinic SPPDG
507-538-5460
[email protected]@mayo.edu
06 / 2006 / SMC / 2SPPDG
About the Mayo SPPDG (Special Purpose Processor Development Group)
• Not generally a “product delivering” organization
  • Risk reduction efforts, proof-of-concept, prototypes, test vehicles
  • Evaluate emerging technologies
  • Push existing technology to the limits
• Commonly-known strengths
  • Power integrity analysis (power delivery design/analysis)
  • Signal integrity analysis
  • High-speed design/test
    • e.g., DC to 80 Gbps logic
06 / 2006 / SMC / 3SPPDG
Experience with Virtex FPGAs
• Done: 10 Gbps in V2 (XC2V6000 -6, BF957)
  • 16-bit LVDS busses @ 690 Mbps
  • Soft-SERDES implementation
  • Multiple clock domains
• Doing: 50 Gbps in V4 (FX100 & FX140 in FF1517 package)
  • SPI-5 (16+1 RocketIO @ 3.125 Gbps)
  • Dual 50 Gbps “lite” interfaces: 8+8 RocketIO @ 6.25 Gbps
  • 400+ bit, ~200 MHz SRAM interface
    • DDR2, QDR2, 84+ Gbps
    • Using nearly all IO in all banks
    • Phase-shifted IO reaching the different memory modules
  • Heavy internal resource utilization (% TBD)
[Figure: POE Prototype Two, front panel and top-down view]
06 / 2006 / SMC / 5SPPDG
General Concerns – 1
• % utilization of IO (SSO), core (timing/placement)
  • Higher speed processing and throughput could require intense IO operation (either “more” or “each-faster”)
  • Complex core processing at high speed requires extensive pipelining, or perhaps duplicate processing functions – internal timing becomes challenging!
• Jitter/clocking
  • High speed clock sources, fanout (routing, buffering), multiplication, clean-up
  • Clock recovery circuits
• SSO, power delivery
  • Aggressive decoupling competes with power supply stability
  • Rules of thumb break down as utilization % increases
06 / 2006 / SMC / 6SPPDG
General Concerns – 2
• Power-on reset, initial conditions
  • Large current spikes coming out of configuration/“reset”
  • Defining initial conditions of “analog” elements
• Large/wide bus termination
  • Discrete termination wastes less power than active termination, but at the cost of large footprint
    • Competes with power delivery system components
    • Could move to buried resistors, but there lies another set of problems
• “Secret” how-tos and inconsistent documentation
  • Many details of RocketIO operation were [mis]documented in the various documents available
  • We utilized an existing Titanium Support contract to get the “truth”
• 3rd-party IP often needed to push basic capability to acceptable performance
  • Attempting to saturate gigabit-ethernet with Xilinx “included” TCP/IP stack vs pay-for option
• Appnotes and their boundaries (assumptions/limitations) should be thoroughly understood before being used – don’t expect “cut and paste” simplicity
06 / 2006 / SMC / 7SPPDG
Our V2-Specific Challenges
• Multiple clock domains inside the part
  • Location of global clock pins vs DCMs, etc.
  • Unusable jitter with clock multipliers
    • Clean-up PLLs off-chip
• LVDS busses near their speed limits
  • Needed soft-SERDES macro and precise clock-to-data alignment
• Using a large % of the chip resources complicates timing
  • Hand-placement often required to make timing
• Xilinx Titanium Support provided very valuable in-depth knowledge and, hence, solutions to some problems
  • Having consultation during the design phase is better than having them debug/patch after the problems exist
06 / 2006 / SMC / 8SPPDG
Our V4-Specific Challenges – 1
• Core speed didn’t scale up from V2 as other capabilities did
  • We were hoping for 400 MHz, which appears unlikely
  • Requires “dual-parallel” data path at ½-rate inside, which increases the core-usage
• Package design is good, but power delivery recommendations don’t suit complex designs
  • Evaluation boards don’t follow these recommendations
  • SSO is still a problem, and the somewhat cumbersome SSO calculator is critical to make this work
  • Thorough power-delivery system analysis (HFSS, SiWave) requires knowledge of the package construction (and on-chip/package decoupling) which is difficult to acquire (NDA, etc.)
• Crosstalk analysis shows need for painful routing of memory IO
• RocketIO require a significant power filtering network for each transceiver (whether each transceiver is used, or not), further complicating an already dense layout
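As a rough illustration of why the half-rate, “dual-parallel” path costs core resources: sustained throughput is bus width times clock rate, so halving the clock doubles the width that every pipeline stage has to carry. The sketch below assumes the 50 Gbps target mentioned elsewhere in these slides and a 200 MHz half-rate clock (half of the hoped-for 400 MHz). The function name and the figures are illustrative assumptions, not taken from the actual design.

```python
def datapath_width_bits(throughput_bps: int, core_clock_hz: int) -> int:
    """Bits that must be handled per core clock cycle to sustain the throughput."""
    # Ceiling division on integers avoids any floating-point rounding.
    return -(-throughput_bps // core_clock_hz)


# Assumed figures: 50 Gbps target, 400 MHz hoped-for clock,
# 200 MHz as a hypothetical half-rate fallback.
TARGET_BPS = 50_000_000_000
full_rate_width = datapath_width_bits(TARGET_BPS, 400_000_000)
half_rate_width = datapath_width_bits(TARGET_BPS, 200_000_000)

print(f"400 MHz core clock: {full_rate_width} bits per cycle")
print(f"200 MHz core clock: {half_rate_width} bits per cycle")
```

Under these assumptions the internal bus doubles in width, which is one way to read the increased core usage noted above.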
06 / 2006 / SMC / 9SPPDG
Our V4-Specific Challenges – 2
• Power consumption
  • RocketIO were planned to be 10+ Gbps, hence they consume more power than if they had been designed for the current-errata maximum: 3.125 Gbps (and “Step 1” maximum: 6.25 Gbps)
  • Initial estimates showed 35 Watts per-FPGA for our desired capability – now a cooling challenge as well
• Power delivery system
  • No room for discrete termination AND decoupling, thus active termination (even with the power cost) is preferred over the problems with buried resistors (cost/debug)
• RocketIO usage requires 8b/10b per latest errata
  • Effectively reduces throughput capacity by 20%
  • Eliminates SPI-5 and 8-bit, 50 Gbps interfaces
  • Run-length problem, but 8b/10b also is DC-free: overkill
    • Could consider custom encoding scheme, but the 8b/10b is a “free” hard-macro in the RocketIO (fast, no extra resources used)
• Limited channel-bonding capabilities
  • Must do channel bonding in the core for unsupported interface protocols (increased power, core-usage)
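The 20% figure follows directly from the code rate: 8b/10b sends 10 line bits for every 8 payload bits. A minimal sketch of the arithmetic is below, using the lane counts and line rates quoted in these slides. It counts only the 16 data lanes of SPI-5 and treats the “+1” lane as non-payload, which is an assumption; the function name is hypothetical.

```python
def payload_gbps(lanes: int, line_rate_gbps: float,
                 data_bits: int = 8, coded_bits: int = 10) -> float:
    """Aggregate payload rate after block-code overhead (8b/10b by default)."""
    return lanes * line_rate_gbps * data_bits / coded_bits


# SPI-5: 16 data lanes at 3.125 Gbps (the "+1" lane is assumed to carry no payload).
spi5 = payload_gbps(16, 3.125)
# "Lite" interface: 8 lanes at 6.25 Gbps.
lite = payload_gbps(8, 6.25)

print(f"SPI-5 payload with 8b/10b: {spi5} Gbps")
print(f"8-lane payload with 8b/10b: {lite} Gbps")
```

Both interfaces carry 50 Gbps of raw line rate, so mandatory 8b/10b leaves 40 Gbps of payload on each, short of the 50 Gbps target.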
06 / 2006 / SMC / 10SPPDG
Our V5-Specific Concerns
• NDA-protected conversations have made us fond of the V5 roadmap, but there are concerns
  • Schedule and feature-set reliability
    • V4 slipped/changed repeatedly… what to expect from V5?
  • Implied SEE sensitivity with the addition of configuration frame ECC (post-configuration checking) – a 65nm problem?
06 / 2006 / SMC / 11SPPDG
Problem Summary
• Signal Integrity Analysis
  • Lots of SSO, dense routing, crosstalk (non LVDS data paths)
  • RocketIO link analysis
  • All require I/O spice models from Xilinx which must first be validated against hardware
  • Also require interconnect models (transmission lines)
• Power analysis/integrity
  • Power supply selection must tie in with decoupling design
  • Very low impedance power delivery helps with SSO, but is problematic for power supplies (extensive analysis of package, board, decoupling, supply required)
• Internal timing constraints and problems
  • Need for “hands on” place/route inside FPGAs to get peak performance
  • Design consultation might be appropriate (we used Xilinx Titanium Service)
• Architecture design for lowest clock jitter
  • Clock circuitry is different from V2-V2P-V4-V5
  • Need “inside” knowledge: design consultation, again
• ChipScope is a good internal debugging tool
06 / 2006 / SMC / 12SPPDG
One Specific Problem: High Speed Bus Clock/Data Alignment
• Problem: Multi-bit data bus and clock are captured at target FPGA with imperfect alignment
• A V2 solution: xapp268
  • Assumes all clock and data signals that make up a bus arrive “close” in phase, and uses DCM-delay to sample the clock with itself to find the “middle” of the clock for capture alignment
  • Clever, but isn’t finding the center of the data window
  • Requires global clock input and DCM for xapp to work as intended
    • Global clock input needed per bus – not so easy
• A more data-centric solution
  • Measure goodness of DLL setting by checking bit error rates on the data bits
  • Identifies the best clock to data alignment based off an “averaged” data window
  • Uses upstream data generation, local data-compare
    • More core resources used, but supports very high speeds, large/small busses, worse-matched routing, etc.
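The data-centric approach can be sketched in software as a sweep over phase settings: measure errors at each setting against a known pattern, find the widest error-free run, and park the clock in its middle. The code below is only a behavioral model under stated assumptions. `best_phase`, `make_bus_error_counter`, and the per-bit windows are hypothetical; the real design does this in FPGA logic with upstream data generation and a local compare. The model also ignores any wrap-around of the phase range.

```python
from typing import Callable, List, Optional, Tuple


def best_phase(error_count: Callable[[int], int], settings: range) -> Optional[int]:
    """Return the centre of the widest run of error-free settings, or None."""
    best: Optional[Tuple[int, int]] = None  # (first, last) of the widest clean run
    run_start: Optional[int] = None
    for s in settings:
        if error_count(s) == 0:
            if run_start is None:
                run_start = s
            if best is None or (s - run_start) > (best[1] - best[0]):
                best = (run_start, s)
        else:
            run_start = None  # an error ends the current clean run
    if best is None:
        return None
    return (best[0] + best[1]) // 2


def make_bus_error_counter(bit_windows: List[Tuple[int, int]]) -> Callable[[int], int]:
    """Model a bus: a bit is in error when the phase falls outside its own eye."""
    def count(setting: int) -> int:
        return sum(1 for lo, hi in bit_windows if not lo <= setting <= hi)
    return count


# Hypothetical per-bit eyes (in phase-setting units) for a 3-bit bus.
counter = make_bus_error_counter([(30, 110), (40, 120), (35, 130)])
print(best_phase(counter, range(-255, 256)))
```

Because errors are summed across all bits, the clean run is the overlap of the individual eyes, which is one way to read the “averaged” data window above.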
[Figure: Clock phase to data eye relationship for full-range, half-range, and modified half-range phase control algorithms. Phase settings for Xilinx FPGA high speed receivers, illustrating where the algorithm from Xilinx Application Note 268 placed the clock edge with respect to the data eye. The value in parentheses next to each receiver label (e.g., U1RX1) is the percentage from the center of the data eye at which the algorithm placed the clock edge, relative to half the eye. Panels: Full-Range, Half-Range, Half-Range Modified. Horizontal axis: DCM phase setting. A marker indicates the clock phase setting after reset.]
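One plausible reading of the percentage in the figure caption is the distance of the clock edge from the eye center, normalized to half the eye width, so 0% is dead center and 100% is on the eye boundary. The sketch below encodes that reading. The formula is an interpretation of the caption, and the function name and numbers are hypothetical.

```python
def offset_from_eye_center_pct(clock_setting: float,
                               eye_start: float,
                               eye_end: float) -> float:
    """Clock-edge distance from the eye centre, as a percentage of half the eye."""
    half_width = (eye_end - eye_start) / 2
    if half_width <= 0:
        raise ValueError("eye_end must be greater than eye_start")
    center = (eye_start + eye_end) / 2
    return 100 * abs(clock_setting - center) / half_width


# Hypothetical eye spanning phase settings 40..120, clock placed at setting 90.
print(offset_from_eye_center_pct(90, 40, 120))
```

With this definition, a receiver labeled near 0% was sampled close to the center of its eye, while one near 100% was sampled at the edge.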
06 / 2006 / SMC / 15SPPDG
High Speed Bus Clock to Data Alignment
• Problem: Multibit data bus and clock are captured at target FPGA with imperfect alignment
• V4 solutions
  • IDELAY and ISERDES
    • Per-bit clock to data alignment capability and hard SERDES macro
  • New clock resources compared to V2 (PMCD)
• V5 solutions
  • IDELAY + ODELAY
  • New/changed clock resources compared to V4
06 / 2006 / SMC / 16SPPDG
Summary
• Rules of thumb don’t cut it
  • Analysis and design are required to provide the kind/quantity of clean power needed for large, heavily utilized devices at high speed
  • Signal integrity analysis is required for dense routing and fast signals
• Devices change significantly from family to family
  • Unless you want to be an expert with each, hire design consultation
  • What was once hard may become easy, but it also means that what once worked might not any longer (design reuse)
• Data paths get more complicated with speed
  • Must manage clock/data alignment
  • Framing is required to properly align busses
  • Proper signal-integrity methodology becomes mandatory
• Power consumption is significant, but must be clean as well
  • Requires simultaneous analysis of package, board, and other power-delivery system components
  • RocketIO require an extensive power filter network
• Clock architecture
  • RocketIO: Recovered clocks, dedicated MGTCLK inputs (what frequency is best for the PLLs?) and the problems (e.g., jitter) with each requires intimate knowledge of the FPGA architecture
  • General: must understand on-chip clock resources and their limitations (“geographic” restrictions, jitter requirements OR jitter generated)
• Communications protocol implementations are somewhat limited
  • Hard-macros cater to a select set of protocols
  • Intrinsic performance limitations make some implementations improbable (e.g., SPI-5)
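To make the framing point in the summary concrete: after deserialization, a lane's word boundary is arbitrary, so the receiver has to find how many bit positions to slip before a known training word lines up. The sketch below models that search for one lane receiving a repeating training word. The function name, the 8-bit width, and the training pattern are hypothetical, and the slides do not describe the actual framing scheme.

```python
from typing import Optional


def find_slip(received_word: int, training_word: int, width: int = 8) -> Optional[int]:
    """Return the left-rotation that maps the received word onto the training
    word, or None if no rotation matches (the lane is not yet receiving it)."""
    mask = (1 << width) - 1
    received_word &= mask
    for slip in range(width):
        rotated = ((received_word << slip) | (received_word >> (width - slip))) & mask
        if rotated == training_word:
            return slip
    return None


# Hypothetical training word, and the same word as seen 3 bit positions out of frame.
TRAINING = 0b00101100
misframed = 0b10000101
print(find_slip(misframed, TRAINING))
```

Running the same search on every lane gives each lane its own correction, after which the bus can be realigned word by word. A training word that matches itself under rotation would make the answer ambiguous, so the pattern has to be chosen with that in mind.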