01 / 2000 / PJZ / 6149-1SPPDG 06 / 2006 / SMC / 1

Design for Performance
Optimization of Virtex Devices

Steve Currie – 6/26/2006

Mayo Clinic SPPDG

507-538-5460

[email protected]@mayo.edu

About the Mayo SPPDG (Special Purpose Processor Development Group)

• Not generally a “product delivering” organization
  • Risk reduction efforts, proof-of-concept, prototypes, test vehicles
  • Evaluate emerging technologies
  • Push existing technology to the limits
• Commonly-known strengths
  • Power integrity analysis (power delivery design/analysis)
  • Signal integrity analysis
  • High-speed design/test
    • e.g., DC to 80 Gbps logic

Experience with Virtex FPGAs

• Done: 10 Gbps in V2 (XC2V6000 -6, BF957)
  • 16-bit LVDS busses @ 690 Mbps
    • Soft-SERDES implementation
  • Multiple clock domains
• Doing: 50 Gbps in V4 (FX100 & FX140 in FF1517 package)
  • SPI-5 (16+1 RocketIO @ 3.125 Gbps)
  • Dual 50 Gbps “lite” interfaces: 8+8 RocketIO @ 6.25 Gbps
  • 400+ bit, ~200 MHz SRAM interface
    • DDR2, QDR2, 84+ Gbps
    • Using nearly all IO in all banks
    • Phase-shifted IO reaching the different memory modules
  • Heavy internal resource utilization (% TBD)

Figure: POE Prototype Two front panel and top down view (image 19837)

General Concerns – 1

• % utilization of IO (SSO), core (timing/placement)
  • Higher speed processing and throughput could require intense IO operation (either “more” or “each-faster”)
  • Complex core processing at high speed requires extensive pipelining, or perhaps duplicate processing functions – internal timing becomes challenging!
• Jitter/clocking
  • High speed clock sources, fanout (routing, buffering), multiplication, clean-up
  • Clock recovery circuits
• SSO, power delivery
  • Aggressive decoupling competes with power supply stability
  • Rules of thumb break down as utilization % increases

General Concerns – 2

• Power-on reset, initial conditions
  • Large current spikes coming out of configuration/“reset”
  • Defining initial conditions of “analog” elements
• Large/wide bus termination
  • Discrete termination wastes less power than active termination, but at the cost of large footprint
    • Competes with power delivery system components
    • Could move to buried resistors, but there lies another set of problems
• “Secret” how-tos and inconsistent documentation
  • Many details of RocketIO operation were [mis]documented in the various documents available
  • We utilized an existing Titanium Support contract to get the “truth”
• 3rd-party IP often needed to push basic capability to acceptable performance
  • Attempting to saturate gigabit-ethernet with Xilinx “included” TCP/IP stack vs pay-for option
  • Appnotes and their boundaries (assumptions/limitations) should be thoroughly understood before being used – don’t expect “cut and paste” simplicity

Our V2-Specific Challenges

• Multiple clock domains inside the part
  • Location of global clock pins vs DCMs, etc.
  • Unusable jitter with clock multipliers
    • Clean-up PLLs off-chip
• LVDS busses near their speed limits
  • Needed soft-SERDES macro and precise clock-to-data alignment
• Using a large % of the chip resources complicates timing
  • Hand-placement often required to make timing
• Xilinx Titanium Support provided very valuable in-depth knowledge and, hence, solutions to some problems
  • Having consultation during the design phase is better than having them debug/patch after the problems exist

Our V4-Specific Challenges – 1

• Core speed didn’t scale up from V2 as other capabilities did
  • We were hoping for 400 MHz, which appears unlikely
    • Requires “dual-parallel” data path at ½-rate inside, which increases the core-usage
• Package design is good, but power delivery recommendations don’t suit complex designs
  • Evaluation boards don’t follow these recommendations
  • SSO is still a problem, and the somewhat cumbersome SSO calculator is critical to make this work
  • Thorough power-delivery system analysis (HFSS, SiWave) requires knowledge of the package construction (and on-chip/package decoupling) which is difficult to acquire (NDA, etc.)
• Crosstalk analysis shows need for painful routing of memory IO
• RocketIO require a significant power filtering network for each transceiver (whether each transceiver is used, or not), further complicating an already dense layout
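
The core-usage cost of the “dual-parallel” ½-rate data path can be sketched with simple arithmetic: for a fixed throughput, halving the core clock doubles the number of bits that must be processed per cycle. The 50 Gbps and 400 MHz figures come from these slides; the 200 MHz half rate and the resulting widths are illustrative assumptions, not the actual design's bus widths.

```python
def datapath_width_bits(throughput_bps: int, clock_hz: int) -> int:
    """Smallest bus width (bits) that carries throughput_bps when one
    word is processed per clock cycle at clock_hz."""
    return -(-throughput_bps // clock_hz)  # ceiling division on integers


THROUGHPUT_BPS = 50_000_000_000   # 50 Gbps target from the slides
FULL_RATE_HZ = 400_000_000        # the hoped-for 400 MHz core clock
HALF_RATE_HZ = FULL_RATE_HZ // 2  # assumed half-rate clock (200 MHz)

full_width = datapath_width_bits(THROUGHPUT_BPS, FULL_RATE_HZ)
half_width = datapath_width_bits(THROUGHPUT_BPS, HALF_RATE_HZ)
print(f"{FULL_RATE_HZ // 10**6} MHz: {full_width} bits per cycle")
print(f"{HALF_RATE_HZ // 10**6} MHz: {half_width} bits per cycle")
```

Every register, mux, and pipeline stage on the data path scales with that width, which is where the extra core usage comes from.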

Our V4-Specific Challenges – 2

• Power consumption
  • RocketIO were planned to be 10+ Gbps, hence they consume more power than if they had been designed for the current-errata maximum: 3.125 Gbps (and “Step 1” maximum: 6.25 Gbps)
  • Initial estimates showed 35 Watts per-FPGA for our desired capability – now a cooling challenge as well
• Power delivery system
  • No room for discrete termination AND decoupling, thus active termination (even with the power cost) is preferred over the problems with buried resistors (cost/debug)
• RocketIO usage requires 8b/10b per latest errata
  • Effectively reduces throughput capacity by 20%
  • Eliminates SPI-5 and 8-bit, 50 Gbps interfaces
  • Run-length problem, but 8b/10b also is DC-free: overkill
    • Could consider custom encoding scheme, but the 8b/10b is a “free” hard-macro in the RocketIO (fast, no extra resources used)
• Limited channel-bonding capabilities
  • Must do channel bonding in the core for unsupported interface protocols (increased power, core-usage)
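
The 20% penalty from 8b/10b can be checked with a quick calculation. This is a minimal sketch, assuming the 50 Gbps targets refer to payload and that only the data lanes (16 for SPI-5, 8 for each “lite” interface) carry payload; `payload_gbps` is a hypothetical helper, not part of any Xilinx tool.

```python
from fractions import Fraction

# 8b/10b sends 10 line bits for every 8 payload bits.
EIGHT_B_TEN_B = Fraction(8, 10)


def payload_gbps(lanes: int, line_rate_gbps: str,
                 efficiency: Fraction = Fraction(1)) -> Fraction:
    """Aggregate payload rate (Gbps) over `lanes` serial lanes."""
    return lanes * Fraction(line_rate_gbps) * efficiency


interfaces = [
    ("SPI-5 (16 data lanes @ 3.125 Gbps)", 16, "3.125"),
    ("50 Gbps lite (8 lanes @ 6.25 Gbps)", 8, "6.25"),
]
for name, lanes, rate in interfaces:
    raw = payload_gbps(lanes, rate)
    coded = payload_gbps(lanes, rate, EIGHT_B_TEN_B)
    print(f"{name}: {float(raw):g} Gbps raw, {float(coded):g} Gbps with 8b/10b")
```

Under these assumptions both interfaces drop from 50 Gbps to 40 Gbps of payload, which is why mandatory 8b/10b rules them out at the listed lane counts and rates.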

Our V5-Specific Concerns

• NDA-protected conversations have made us fond of the V5 roadmap, but there are concerns
  • Schedule and feature-set reliability
    • V4 slipped/changed repeatedly… what to expect from V5?
  • Implied SEE sensitivity with the addition of configuration frame ECC (post-configuration checking) – a 65nm problem?

Problem Summary

• Signal Integrity Analysis
  • Lots of SSO, dense routing, crosstalk (non-LVDS data paths)
  • RocketIO link analysis
  • All require I/O SPICE models from Xilinx, which must first be validated against hardware
  • Also require interconnect models (transmission lines)
• Power analysis/integrity
  • Power supply selection must tie in with decoupling design
  • Very low impedance power delivery helps with SSO, but is problematic for power supplies (extensive analysis of package, board, decoupling, supply required)
• Internal timing constraints and problems
  • Need for “hands on” place/route inside FPGAs to get peak performance
  • Design consultation might be appropriate (we used Xilinx Titanium Service)
• Architecture design for lowest clock jitter
  • Clock circuitry differs across V2, V2P, V4, and V5
  • Need “inside” knowledge: design consultation, again
• ChipScope is a good internal debugging tool

One Specific Problem: High Speed Bus Clock/Data Alignment

• Problem: Multi-bit data bus and clock are captured at target FPGA with imperfect alignment
• A V2 solution: xapp268
  • Assumes all clock and data signals that make up a bus arrive “close” in phase, and uses DCM-delay to sample the clock with itself to find the “middle” of the clock for capture alignment
  • Clever, but isn’t finding the center of the data window
  • Requires global clock input and DCM for xapp to work as intended
    • Global clock input needed per bus – not so easy
• A more data-centric solution
  • Measure goodness of DLL setting by checking bit error rates on the data bits
  • Identifies the best clock to data alignment based off an “averaged” data window
  • Uses upstream data generation, local data-compare
  • More core resources used, but supports very high speeds, large/small busses, worse-matched routing, etc.
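
The data-centric approach can be sketched as a sweep: step through the phase settings, count bit errors summed over every data bit at each setting, and park the clock in the middle of the widest error-free run. This is a software sketch under stated assumptions, not the actual FPGA implementation; `toy_bus` is a hypothetical stand-in for the upstream generator plus local compare, and the -255..255 setting range is assumed.

```python
from typing import Callable, List, Sequence, Tuple


def sweep_bus_errors(settings: Sequence[int],
                     measure: Callable[[int], Sequence[int]]) -> List[int]:
    """Total bit errors summed over all data bits, one entry per setting."""
    return [sum(measure(s)) for s in settings]


def center_of_widest_window(settings: Sequence[int], errors: Sequence[int],
                            max_errors: int = 0) -> int:
    """Middle setting of the longest run whose error count <= max_errors."""
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for i, count in enumerate(errors):
        if count <= max_errors:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0
    if best_len == 0:
        raise ValueError("no phase setting met the error threshold")
    return settings[best_start + (best_len - 1) // 2]


def toy_bus(eyes: Sequence[Tuple[int, int]], errors_outside: int = 100):
    """Hypothetical hardware model: a bit is error-free only inside its eye."""
    def measure(setting: int) -> List[int]:
        return [0 if lo <= setting <= hi else errors_outside for lo, hi in eyes]
    return measure


settings = list(range(-255, 256))                      # assumed setting range
measure = toy_bus([(-40, 60), (-20, 80), (-30, 50)])   # three skewed data bits
errors = sweep_bus_errors(settings, measure)
best_setting = center_of_widest_window(settings, errors)
print(best_setting)
```

Because the errors are summed across the bus, the chosen setting is centered on the window common to all bits, which is one way to read the “averaged” data window above.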

Figure (JAN_3 / 2005 / ELHA / 21333): Clock Phase to Data Eye Relationship for Full-Range, Half-Range, and Modified Half-Range Phase Control Algorithm

Phase setting for Xilinx FPGA high speed receivers, illustrating where the algorithm from Xilinx Application Note 268 placed the clock edge with respect to the data eye. The value in parentheses for each receiver is the percentage from the center of the data eye at which the algorithm placed the clock edge, relative to half the eye.

Panels: Full-Range, Half-Range, Half-Range Modified. Horizontal axis: DCM Phase Setting. Legend: clock phase setting after reset. Receivers plotted: U1RX1, U2RX1, U2RX2, U3RX1, U4RX1, U4RX2, U4RX3, U6RX1, U7RX1, with reported values ranging from 0% to 94%.
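
The percentage quoted for each receiver can be read as the distance of the clock edge from the eye center, divided by half the eye width. The sketch below follows that reading of the caption; the function name and the example eye boundaries are hypothetical, with all positions expressed in DCM phase-setting steps.

```python
def percent_from_eye_center(clock_phase: float, eye_start: float,
                            eye_end: float) -> float:
    """Offset of the clock edge from the data-eye center, as a percentage
    of half the eye width (0 = dead center, 100 = on the eye boundary)."""
    if eye_end <= eye_start:
        raise ValueError("eye_end must be greater than eye_start")
    center = (eye_start + eye_end) / 2
    half_width = (eye_end - eye_start) / 2
    return abs(clock_phase - center) / half_width * 100


# Hypothetical eye spanning settings 40..120 (center 80, half width 40).
print(percent_from_eye_center(100, 40, 120))
```

On this scale a low value means the algorithm landed near the center of the eye, and a value near 100% means the clock edge sits almost on the eye boundary with little margin left.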

High Speed Bus Clock to Data Alignment

• Problem: Multibit data bus and clock are captured at target FPGA with imperfect alignment
• V4 solutions
  • IDELAY and ISERDES
    • Per-bit clock to data alignment capability and hard SERDES macro
  • New clock resources compared to V2 (PMCD)
• V5 solutions
  • IDELAY + ODELAY
  • New/changed clock resources compared to V4
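
Per-bit delay elements make it possible to deskew each data bit individually instead of moving one clock for the whole bus. As a rough illustration, the sketch below converts measured per-bit arrival times into delay-tap counts that line every bit up with the latest arrival. The tap size, tap count, and arrival times are illustrative assumptions (check the device data sheet for real values), and `deskew_taps` is a hypothetical helper, not a Xilinx API.

```python
from typing import List, Sequence


def deskew_taps(arrival_ps: Sequence[float], tap_ps: float = 75.0,
                num_taps: int = 64) -> List[int]:
    """Delay taps per bit so that every bit lines up with the latest one.

    tap_ps and num_taps are assumed delay-line parameters.
    """
    latest = max(arrival_ps)
    taps = []
    for t in arrival_ps:
        n = round((latest - t) / tap_ps)  # nearest whole tap
        if n >= num_taps:
            raise ValueError("skew exceeds the delay line range")
        taps.append(n)
    return taps


arrival_ps = [0.0, 150.0, 310.0, 75.0]  # hypothetical per-bit arrival times
taps = deskew_taps(arrival_ps)
aligned = [t + n * 75.0 for t, n in zip(arrival_ps, taps)]
print(taps, max(aligned) - min(aligned))
```

The residual skew after alignment is bounded by the tap size, so finer taps or better-matched routing are needed once the bit period gets close to that resolution.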

Summary

• Rules of thumb don’t cut it
  • Analysis and design are required to provide the kind/quantity of clean power needed for large, heavily utilized devices at high speed
  • Signal integrity analysis is required for dense routing and fast signals
• Devices change significantly from family to family
  • Unless you want to be an expert with each, hire design consultation
  • What was once hard may become easy, but it also means that what once worked might not any longer (design reuse)
• Data paths get more complicated with speed
  • Must manage clock/data alignment
  • Framing is required to properly align busses
  • Proper signal-integrity methodology becomes mandatory
• Power consumption is significant, but must be clean as well
  • Requires simultaneous analysis of package, board, and other power-delivery system components
  • RocketIO require an extensive power filter network
• Clock architecture
  • RocketIO: Recovered clocks, dedicated MGTCLK inputs (what frequency is best for the PLLs?) and the problems (e.g., jitter) with each require intimate knowledge of the FPGA architecture
  • General: must understand on-chip clock resources and their limitations (“geographic” restrictions, jitter requirements OR jitter generated)
• Communications protocol implementations are somewhat limited
  • Hard-macros cater to a select set of protocols
  • Intrinsic performance limitations make some implementations improbable (e.g., SPI-5)
