punt fabric failure - cisco · 2017-08-18 · asr9k punt fabric data path failure mahesh s {...


Page 1: Punt Fabric Failure - Cisco · 2017-08-18 · ASR9K Punt Fabric Data Path Failure Mahesh S { maheshk@cisco.com } Introduction The purpose of this white paper is to provide understanding

All contents are Copyright © 2006–2013 Cisco Systems, Inc. All rights reserved. This document is Cisco Confidential Information.

White Paper

ASR9K Punt Fabric Data Path Failure

Mahesh S { maheshk@cisco.com }

Introduction

This white paper explains the punt fabric data path failure message that can appear during ASR9K router operation. When it appears, the message has the following format:

RP/0/RSP0/CPU0:Sep 3 13:49:36.595 UTC: pfm_node_rp[358]: %PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED: Set|online_diag_rsp[241782]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/7/CPU0, 1) (0/7/CPU0, 2) (0/7/CPU0, 3) (0/7/CPU0, 4) (0/7/CPU0, 5) (0/7/CPU0, 6) (0/7/CPU0, 7)

This document is intended for customers, TME, TAC, the DE/DT community, and anyone else who wants to understand the error message and whether any action needs to be taken.

Abstract

A high-level understanding of ASR9K line cards, fabric cards, route processor cards, and chassis configurations is helpful, but the document does not require readers to be familiar with hardware details. The necessary background is provided before the paper explains the error message.

Suggestion on How to Read

The author suggests the following approach, both for learning the essentials and for using this paper as a reference during debugging.

• When there is no urgency to root-cause a punt fabric data path failure, read all sections of the document. From the start of the document up to the section “Analyzing Faults”, the paper builds the background needed to isolate the faulty component when such an error occurs.

• For further reading on both NP and punt fabric faults, use Cisco-provided documentation (if any).

• If you have a specific question that needs a quick answer, use the “FAQ” section. If the question is not listed there, check whether the rest of this white paper answers it.

• If a fault has occurred on a router and you are debugging to narrow down the faulty component, or checking whether it is a known issue, all sections starting from “Analyzing Faults” may help.

Background

A packet crossing the switch fabric traverses either two hops or three hops, depending on the line card type. Typhoon-generation line cards add an extra switch fabric element, while Trident-based line cards switch all traffic using only the fabric on the route processor card. The figure below shows the fabric elements for both line card types and their fabric connectivity to the route processor card.


[Figure: fabric elements of Trident line cards (A9K-4T with NP0 to NP3, bridges B0 and B1, and FIA0; A9K-8T with NP4 to NP7, bridges B2 and B3, and FIA1 in addition) and of a Typhoon line card (A9K-24x10G with NP0 to NP7, FIA0 to FIA3, and a switch fabric ASIC), each with a line card CPU, connected to the switch fabric on RSP0 and RSP1.]

Punt Fabric Diagnostic Packet Path

A diagnostic application running on the route processor card CPU periodically injects diagnostic packets destined to each NP. Each diagnostic packet is looped back inside the NP and re-injected towards the route processor card CPU that sourced it. This periodic health check of every NP, using a unique packet per NP, alerts on functional errors in the data path while the router is in operation. Note that the diagnostic application on both the active and the standby route processor card injects one packet per NP periodically and maintains a per-NP success or failure count. When the number of dropped diagnostic packets reaches the failure threshold, the application raises a fault.

Conceptual view of Diagnostic Path

Before describing the diagnostic path on Trident- and Typhoon-based line cards, this section gives a general outline of the fabric diagnostic path from both the active and the standby route processor card towards an NP on a line card.

Packet Path between Active Route Processor Card and Line card:

Diagnostic packets injected from the active route processor card into the fabric towards an NP are treated as unicast packets by the switch fabric. For unicast packets, the switch fabric chooses the outgoing link based on the current traffic load of the link, which subjects diagnostic packets to the traffic load on the box. (When there are multiple outgoing links towards an NP, the switch fabric ASIC chooses the link that is currently least loaded.) The diagram below depicts the diagnostic packet path sourced from the active route processor card. Note that the first link connecting the FIA on the line card to the XBAR on the route processor card is always chosen for packets destined to the NP. Response packets from the NP are subject to the link load distribution algorithm (if the line card is Typhoon based). This means that a response packet from the NP towards the active route processor card can take any of the backplane links connecting the line card to the route processor card, depending on the backplane link load.


[Figure: diagnostic packet path between the active RSP and a line card. The packet injected by the diagnostic application travels from the active RSP CPU through the FPGA and the active RSP FIA, across the switch fabric on RSP0 and RSP1, to the fabric ASIC on the line card (multiple FIAs or a single XBAR) and on to the NPs (NP0 to NP7); the response packet is looped back inside the NPU and returns to the RSP. The line card FPGA is present on all Trident-based line cards and absent on all Typhoon line cards except the 2x100GE and 1x100GE.]
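The per-NP bookkeeping described under Punt Fabric Diagnostic Packet Path can be sketched as follows. This is only an illustrative model, not the router's implementation: the class and method names are hypothetical, the threshold of 3 is taken from the alarm text ("failure threshold is 3"), and the sketch assumes that failures are counted consecutively and that a single successful probe clears an outstanding alarm.

```python
from collections import defaultdict

FAILURE_THRESHOLD = 3  # value shown in the alarm text ("failure threshold is 3")


class PuntPathMonitor:
    """Hypothetical model of per-NP probe bookkeeping; not the router's code."""

    def __init__(self, threshold=FAILURE_THRESHOLD):
        self.threshold = threshold
        self.consecutive_failures = defaultdict(int)  # (slot, np) -> count
        self.alarmed = set()                          # (slot, np) with alarm set

    def record(self, slot, np_id, response_received):
        """Record one probe result; return "Set", "Clear", or None."""
        key = (slot, np_id)
        if response_received:
            # Assumption: one good response resets the failure window.
            self.consecutive_failures[key] = 0
            if key in self.alarmed:
                self.alarmed.discard(key)
                return "Clear"
            return None
        self.consecutive_failures[key] += 1
        if (self.consecutive_failures[key] >= self.threshold
                and key not in self.alarmed):
            self.alarmed.add(key)
            return "Set"
        return None


monitor = PuntPathMonitor()
results = [monitor.record("0/7/CPU0", 1, ok) for ok in (False, False, False, True)]
print(results)  # [None, None, 'Set', 'Clear']
```

Because the active and standby route processor cards probe independently, each card would hold its own instance of such a tracker.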

Packet Path between Standby Route Processor Card and Line card:

[Figure: diagnostic packet path between the standby RSP and a line card. The packet injected by the diagnostic application travels from the standby RSP CPU through the FPGA and the standby RSP FIA, across the switch fabric on RSP0 and RSP1, to the fabric ASIC on the line card (multiple FIAs or a single XBAR) and on to the NPs (NP0 to NP7); the response packet is looped back inside the NPU and returns to the RSP. The line card FPGA is present on all Trident-based line cards and absent on all Typhoon line cards except the 2x100GE and 1x100GE.]

Diagnostic packets injected from the standby route processor card into the fabric towards an NP are treated as multicast packets by the switch fabric. Although they are multicast packets, there is no replication inside the fabric: every diagnostic packet sourced from the standby route processor card still reaches only one NP. The response packet from an NP towards the route processor card is also a multicast packet over the fabric with no replication. Hence the diagnostic application on the standby route processor card receives a single response packet per NP, one packet at a time. The diagnostic application thus keeps tabs on every NP in the system by injecting one packet per NP and expecting one response from every NP.

For multicast packets, the switch fabric chooses the outgoing link based on a field value in the packet header. This makes it possible to inject diagnostic packets over every fabric link between the route processor card and the line card backplane. The standby route processor card therefore monitors NP health over every fabric link connecting the route processor card and the line card slot. The diagram above depicts the diagnostic packet path sourced from the standby route processor card. Note that, unlike the active route processor card case, all links connecting the line card to the XBAR on the route processor card are exercised. Response packets from the NP take the same backplane link that was used in the route processor card to line card direction. This testing ensures that all links connecting the standby route processor card to the line card are monitored all the time.

Punt Fabric Diagnostic Packet Path on Trident Line Card

The diagram below depicts diagnostic packets sourced from the route processor card and destined to an NP, which loops them back towards the route processor card. It is important to note which data path links and ASICs are common to all NPs and which links and components are specific to a subset of NPs. (For example, B0 is common to NP0 and NP1, but FIA0 is common to all NPs. On the route processor card end, all links, data path ASICs, and the FPGA are common to all line cards and hence to all NPs in a chassis.)

[Figure: punt fabric diagnostic packet path on a Trident line card (A9K-4T: NP0 to NP3, bridges B0 and B1, FIA0, line card CPU), shown next to a Typhoon line card (A9K-24x10G: NP0 to NP7, FIA0 to FIA3, switch fabric ASIC, line card CPU). Both connect through the switch fabric on RSP0 and RSP1 to the RSP FIA, FPGA, and RSP CPU.]

Punt Fabric Diagnostic Packet Path on Typhoon Line Card

The diagram below depicts diagnostic packets sourced from the route processor card and destined to an NP, which loops them back towards the route processor card. It is important to note which data path links and ASICs are common to all NPs and which links and components are specific to a subset of NPs. (For example, B0 is common to NP0 and NP1, but FIA0 is common to all NPs. On the route processor card end, all links, data path ASICs, and the FPGA are common to all line cards and hence to all NPs in a chassis.)


[Figure: punt fabric diagnostic packet path on a Typhoon line card (A9K-24x10G: NP0 to NP7, FIA0 to FIA3, switch fabric ASIC, line card CPU), shown next to a Trident line card (A9K-4T: NP0 to NP3, bridges B0 and B1, FIA0, line card CPU). Both connect through the switch fabric on RSP0 and RSP1 to the RSP FIA, FPGA, and RSP CPU.]

In the next few sections, the document depicts the packet path to every NP. This is necessary to understand the punt fabric data path error message and to locate the failure point.

Punt Fabric Diagnostic Alarm and Failure Reporting

Failure to get a response from an NP in an ASR9K system results in an alarm. The online diagnostic application executing on the route processor card raises the alarm after three consecutive failures; it maintains a three-packet failure window for every NP. The active and standby route processor cards diagnose independently and in parallel. Hence, depending on the fault location, only the active card, only the standby card, or both can report the error and raise the alarm. By default, one diagnostic packet is sent towards each NP every 60 seconds. The format of the alarm message is shown below.

RP/0/RSP0/CPU0:Sep 3 13:49:36.595 UTC: pfm_node_rp[358]: %PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED: Set|online_diag_rsp[241782]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/7/CPU0, 1) (0/7/CPU0, 2) (0/7/CPU0, 3) (0/7/CPU0, 4) (0/7/CPU0, 5) (0/7/CPU0, 6) (0/7/CPU0, 7)

The message should be read as a failure to reach NP 1, 2, 3, 4, 5, 6, and 7 on line card 0/7/CPU0 from route processor card 0/RSP0/CPU0.

The attributes of the punt fabric loopback test can be seen in the list of online diagnostic tests:

RP/0/RSP1/CPU0:ios(admin)#sh diagnostic content location 0/rsp1/cpu0

RP 0/RSP1/CPU0:

  Diagnostics test suite attributes:
    M/C/* - Minimal bootup level test / Complete bootup level test / NA
    B/O/* - Basic ondemand test / not Ondemand test / NA
    P/V/* - Per port test / Per device test / NA
    D/N/* - Disruptive test / Non-disruptive test / NA
      S/* - Only applicable to standby unit / NA
      X/* - Not a health monitoring test / NA
      F/* - Fixed monitoring interval test / NA
      E/* - Always enabled monitoring test / NA
      A/I - Monitoring is active / Monitoring is inactive

                                                      Test Interval      Thre-
  ID   Test Name                          Attributes   (day hh:mm:ss.ms   shold)
  ==== ================================== ============ ================= =====
    1) PuntFPGAScratchRegister ---------> *B*N****A    000 00:01:00.000  1
    2) FIAScratchRegister --------------> *B*N****A    000 00:01:00.000  1
    3) ClkCtrlScratchRegister ----------> *B*N****A    000 00:01:00.000  1
    4) IntCtrlScratchRegister ----------> *B*N****A    000 00:01:00.000  1
    5) CPUCtrlScratchRegister ----------> *B*N****A    000 00:01:00.000  1
    6) FabSwitchIdRegister -------------> *B*N****A    000 00:01:00.000  1
    7) EccSbeTest ----------------------> *B*N****I    000 00:01:00.000  3
    8) SrspStandbyEobcHeartbeat --------> *B*NS***A    000 00:00:05.000  3
    9) SrspActiveEobcHeartbeat ---------> *B*NS***A    000 00:00:05.000  3
   10) FabricLoopback ------------------> MB*N****A    000 00:01:00.000  3
   11) PuntFabricDataPath --------------> *B*N****A    000 00:01:00.000  3
   12) FPDimageVerify ------------------> *B*N****I    001 00:00:00.000  1

Here the PuntFabricDataPath test sends a packet every minute and has a failure threshold of 3: the loss of up to two consecutive packets is tolerated, and three consecutive packet losses result in an alarm. The test attributes shown are default values. They can be changed with admin config commands if desired.
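Reading the alarm as "reporting route processor card, state, threshold, list of failed (slot, NP) pairs" lends itself to a small parser. The following is a minimal sketch that assumes the message layout is exactly as in the examples in this paper; the regular expressions and the function name are illustrative and have not been checked against other software releases.

```python
import re

# Patterns derived only from the example messages in this paper (assumption).
ALARM_RE = re.compile(
    r"^(?P<reporter>RP/\d+/RSP\d+/CPU\d+):"
    r".*?%PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED\s*:\s*"
    r"(?P<state>Set|Clear)\|"
    r".*?failure threshold is (?P<threshold>\d+)"
)
PAIR_RE = re.compile(r"\((\d+/\w+/CPU\d+),\s*(\d+)\)")


def parse_alarm(line):
    """Return reporter, state, threshold and failed (slot, NP) pairs, or None."""
    line = line.strip()
    match = ALARM_RE.search(line)
    if match is None:
        return None
    return {
        "reporter": match.group("reporter"),
        "state": match.group("state"),
        "threshold": int(match.group("threshold")),
        "failed": [(slot, int(np_id)) for slot, np_id in PAIR_RE.findall(line)],
    }


SAMPLE = (
    "RP/0/RSP0/CPU0:Sep 3 13:49:36.595 UTC: pfm_node_rp[358]: "
    "%PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED: "
    "Set|online_diag_rsp[241782]|System Punt/Fabric/data Path Test(0x2000004)|"
    "failure threshold is 3, (slot, NP) failed: (0/7/CPU0, 1) (0/7/CPU0, 2) "
    "(0/7/CPU0, 3) (0/7/CPU0, 4) (0/7/CPU0, 5) (0/7/CPU0, 6) (0/7/CPU0, 7)"
)

info = parse_alarm(SAMPLE)
print(info["reporter"], info["state"], [np_id for _, np_id in info["failed"]])
# RP/0/RSP0/CPU0 Set [1, 2, 3, 4, 5, 6, 7]
```

The reporter field matters for fault isolation, because the active and standby route processor cards raise this alarm independently.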

Trident Line Card Diagnostic Packet Path

A.1 NP0 Diagnostic Failure

Fabric Diagnostic Path

[Figure: fabric diagnostic path for NP0 on a Trident line card (A9K-4T): RSP CPU, FPGA, and RSP FIA connect through the switch fabric on RSP0 and RSP1 to FIA0, bridges B0 and B1, and NP0 to NP3; the line card CPU is also shown.]


The diagram above depicts the packet path between the route processor card CPU and line card NP0. The link connecting Bridge 0 (B0) and NP0 is the only link specific to NP0; all other links fall in the common path. Also note the packet path from the route processor card towards NP0. Although four links are available for packets destined to NP0 from the route processor card, the first link between the route processor card and the line card slot is used in the route processor card to line card direction. The returned packet from NP0 can be sent back to the active route processor card over either of the two fabric link paths between the line card slot and the active route processor card; which of the two links is used depends on the link load at that time. Response packets from NP0 towards the standby route processor card use both links, one link at a time; the choice of link is based on a header field that the diagnostic application populates.

A.1.1 NP0 Diagnostic Failure Analysis

Single Fault Scenario

If a single PFM punt fabric data path failure alarm is detected with only NP0 in the failure message (if more than one fault occurred, refer to the Multiple Fault Scenario section), the fault is only on the fabric path connecting the route processor card and the line card’s NP0.

RP/0/RSP0/CPU0:Sep 3 13:49:36.595 UTC: pfm_node_rp[358]: %PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED: Set|online_diag_rsp[241782]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/7/CPU0, 0)

[The discussion in this section applies to any line card slot in a chassis, regardless of chassis type.]

Using the data path diagram above, the fault has to be in one or more of the following:

• The link connecting NP0 and B0
• Inside B0, in the queues directed towards NP0
• Inside NP0

Multiple Fault Scenario

Multiple NP faults

If, apart from the PUNT_FABRIC_DATA_PATH_FAILED fault, any other faults are observed on NP0, or if PUNT_FABRIC_DATA_PATH_FAILED is also reported for other NPs on the same line card, then fault isolation can be done by correlating all the faults. For example, if both the PUNT_FABRIC_DATA_PATH_FAILED fault and the LC_NP_LOOPBACK_FAILED fault [refer to the appendix section to understand the loopback fault] occur on NP0, then the NP has stopped processing packets. This could be an early indication of a critical failure inside NP0. However, if only one of the two faults occurred, then the fault is localized either to the punt fabric data path or to the line card CPU to NP path.

If more than one NP on a line card has a punt fabric data path fault, walk up the tree path of fabric links to narrow down the faulty component. For example, if both NP0 and NP1 have the fault, then the fault has to be in B0 or in the link connecting B0 and FIA0. It is less likely that both NP0 and NP1 run into a critical internal error at the same time. Although less likely, nothing precludes both NP0 and NP1 from running into a critical error caused by incorrect processing of a particular kind of packet or by a bad packet.

Both route processor cards report Fault

If 0/RSP0/CPU0 is in the active redundancy state, 0/RSP1/CPU0 is in the standby state, and both route processor cards report a fault towards one or more NPs on a line card, then check all common links and components on the data path between the NP or NPs and both route processor cards.

A.2 NP1 Diagnostic Failure

The diagram below depicts the packet path between the route processor card CPU and line card NP1. The link connecting Bridge 0 (B0) and NP1 is the only link specific to NP1; all other links fall in the common path. Also note the packet path from the route processor card towards NP1. Although four links are available for packets destined to NP1 from the route processor card, the first link between the route processor card and the line card slot is used in the route processor card to line card direction. The returned packet from NP1 can be sent back to the active route processor card over either of the two fabric link paths between the line card slot and the active route processor card; which of the two links is used depends on the link load at that time. Response packets from NP1 towards the standby route processor card use both links, one link at a time; the choice of link is based on a header field that the diagnostic application populates.

Fabric Diagnostic Path

[Figure: fabric diagnostic path for NP1 on a Trident line card (A9K-4T): RSP CPU, FPGA, and RSP FIA connect through the switch fabric on RSP0 and RSP1 to FIA0, bridges B0 and B1, and NP0 to NP3; the line card CPU is also shown.]

A.2.1 NP1 Diagnostic Failure Analysis

Refer to section A.1.1 and apply the same reasoning to NP1 (instead of NP0).

A.3 NP2 Diagnostic Failure

The diagram below depicts the packet path between the route processor card CPU and line card NP2. The link connecting Bridge 1 (B1) and NP2 is the only link specific to NP2; all other links fall in the common path. Also note the packet path from the route processor card towards NP2. Although four links are available for packets destined to NP2 from the route processor card, the first link between the route processor card and the line card slot is used in the route processor card to line card direction. The returned packet from NP2 can be sent back to the active route processor card over either of the two fabric link paths between the line card slot and the active route processor card; which of the two links is used depends on the link load at that time. Response packets from NP2 towards the standby route processor card use both links, one link at a time; the choice of link is based on a header field that the diagnostic application populates.

Fabric Diagnostic Path

[Figure: fabric diagnostic path for NP2 on a Trident line card (A9K-4T): RSP CPU, FPGA, and RSP FIA connect through the switch fabric on RSP0 and RSP1 to FIA0, bridges B0 and B1, and NP0 to NP3; the line card CPU is also shown.]

A.3.1 NP2 Diagnostic Failure Analysis

Refer to section A.1.1 and apply the same reasoning to NP2 (instead of NP0) and B1 (instead of B0).

A.4 NP3 Diagnostic Failure

The diagram below depicts the packet path between the route processor card CPU and line card NP3. The link connecting Bridge 1 (B1) and NP3 is the only link specific to NP3; all other links fall in the common path. Also note the packet path from the route processor card towards NP3. Although four links are available for packets destined to NP3 from the route processor card, the first link between the route processor card and the line card slot is used in the route processor card to line card direction. The returned packet from NP3 can be sent back to the active route processor card over either of the two fabric link paths between the line card slot and the active route processor card; which of the two links is used depends on the link load at that time. Response packets from NP3 towards the standby route processor card use both links, one link at a time; the choice of link is based on a header field that the diagnostic application populates.

Fabric Diagnostic Path


[Figure: fabric diagnostic path for NP3 on a Trident line card (A9K-4T): RSP CPU, FPGA, and RSP FIA connect through the switch fabric on RSP0 and RSP1 to FIA0, bridges B0 and B1, and NP0 to NP3; the line card CPU is also shown.]

A.4.1 NP3 Diagnostic Failure Analysis

Refer to section A.1.1 and apply the same reasoning to NP3 (instead of NP0) and B1 (instead of B0).
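The reasoning in sections A.1 to A.4 (one failed NP points at that NP's own link, several failed NPs point further up the tree) can be sketched as a walk over the line card topology. The sketch below is only an illustration for a four-NP Trident card as described in this paper: the dictionary, the function names, and the "RSP path" label that lumps together everything shared by all line cards are assumptions. The result is the starting point for investigation (the returned component or its uplink), not a verdict, since several NPs can also fail independently.

```python
# Parent of each component on a four-NP Trident line card (sections A.1-A.4).
# "RSP path" is a hypothetical label for the links and ASICs shared by all
# line cards (fabric links, RSP FIA, FPGA).
PARENT = {
    "NP0": "B0", "NP1": "B0",
    "NP2": "B1", "NP3": "B1",
    "B0": "FIA0", "B1": "FIA0",
    "FIA0": "RSP path",
}


def path_to_rsp(component):
    """Components from `component` up to the route processor card."""
    path = [component]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_component(failed_nps):
    """First component shared by the paths of every failed NP."""
    if not failed_nps:
        raise ValueError("no failed NPs given")
    paths = [path_to_rsp(f"NP{n}") for n in sorted(failed_nps)]
    for component in paths[0]:
        if all(component in other for other in paths[1:]):
            return component
    raise ValueError("paths share no component")


print(lowest_common_component({0}))      # NP0
print(lowest_common_component({0, 1}))   # B0
print(lowest_common_component({0, 2}))   # FIA0
```

For NP0 and NP1 the walk stops at B0, which matches the text: the fault is then in B0 or in the link connecting B0 and FIA0. When both route processor cards report the fault, the shared components on the route processor card side are not the first suspects and the line card side deserves attention first; that correlation is left out of the sketch.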

Typhoon Line Card Diagnostic Packet Path

To establish the background for fabric punt packets, this section uses two examples: the first uses NP1 and the second uses NP3. The description and analysis can be extended to other NPs on any Typhoon-based card.

B.1 Typhoon NP1 Diagnostic Failure

The diagram below depicts the packet path between the route processor card CPU and line card NP1. The link connecting FIA0 and NP1 is the only link specific to the NP1 path. All other links between the line card slot and the route processor card slot fall in the common path. Links connecting the fabric ASIC on the line card to the FIAs on the line card are specific to a subset of NPs. (For example, both links between FIA0 and the local fabric ASIC on the line card are used for traffic to NP1.) Also note the packet path from the route processor card towards NP1 as depicted. Although eight links are available for packets destined to NP1 from the route processor card, a single path between the route processor card and the line card slot is used. The returned packet from NP1 can be sent back to the route processor card over eight fabric link paths between the line card slot and the route processor card. Each of these eight links is exercised one at a time when the diagnostic packet is destined back to the route processor card CPU.


Fabric Diagnostic Path

[Figure: fabric diagnostic path for NP1 on a Typhoon line card (A9K-24x10G): RSP CPU, FPGA, and RSP FIA connect through the switch fabric on RSP0 and RSP1 to the switch fabric ASIC on the line card, FIA0 to FIA3, and NP0 to NP7; the line card CPU is also shown.]

B.2 Typhoon NP3 Diagnostic Failure

The diagram below depicts the packet path between the route processor card CPU and line card NP3. The link connecting FIA1 and NP3 is the only link specific to the NP3 path. All other links between the line card slot and the route processor card slot fall in the common path. Links connecting the fabric ASIC on the line card to the FIAs on the line card are specific to a subset of NPs. (For example, both links between FIA0 and the local fabric ASIC on the line card are used for traffic to NP1.) Also note the packet path from the route processor card towards NP3 as depicted. Although eight links are available for packets destined to NP3 from the route processor card, a single path between the route processor card and the line card slot is used. The returned packet from NP3 can be sent back to the route processor card over eight fabric link paths between the line card slot and the route processor card. Each of these eight links is exercised one at a time when the diagnostic packet is destined back to the route processor card CPU.

Fabric Diagnostic Path

[Figure: fabric diagnostic path for NP3 on a Typhoon line card (A9K-24x10G): RSP CPU, FPGA, and RSP FIA connect through the switch fabric on RSP0 and RSP1 to the switch fabric ASIC on the line card, FIA0 to FIA3, and NP0 to NP7; the line card CPU is also shown.]


Analyzing Faults

This section categorizes faults as hard or transient and lists the steps to identify which type has occurred. Once the fault type is known, the section also specifies the CLIs that can be executed on the router to understand the fault and the actions that can be taken.

Transient Fault: If a Set PFM message is followed by a Clear PFM message, a fault occurred and the router corrected it by itself. Transient faults can occur due to environmental conditions or recoverable faults in hardware components, and sometimes they are hard to associate with any particular event. The suggested approach for transient errors is only to monitor for further occurrences. If a transient fault occurs more than once, treat it as a hard fault and analyze it using the recommendations and steps described for hard faults. In particular, if faults towards a given NP occur and clear at a rate of at least one per minute, and at least ten faults are observed within a span of ten minutes, treat the flurry as a hard fault even if the last fault in the flurry was cleared. An example of a transient fabric fault is listed below for clarity.

RP/0/RSP0/CPU0:Feb 5 05:05:44.051 : pfm_node_rp[354]:%PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED : Set|online_diag_rsp[237686]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/2/CPU0, 0)
RP/0/RSP0/CPU0:Feb 5 05:05:46.051 : pfm_node_rp[354]:%PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED : Clear|online_diag_rsp[237686]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/2/CPU0, 0)

Hard Fault: If a Set PFM message is not followed by a Clear PFM message, a fault occurred and either the fault handling code on the router did not correct it or the hardware fault is not recoverable. Hard faults can occur due to environmental conditions or unrecoverable faults in hardware components. The suggested approach for hard errors is to follow the guidelines in section “Steps to Analyze Hard Faults”. An example of a hard fabric fault is listed below for clarity. For the example message below there will not be a corresponding Clear PFM message.

RP/0/RSP0/CPU0:Feb 5 05:05:44.051 : pfm_node_rp[354]:%PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED : Set|online_diag_rsp[237686]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/2/CPU0, 0)

Recommendation: In a hard fault scenario, collect the output of all the CLIs mentioned in section “Data To Collect Before SR Creation” and raise an SR. In urgent cases, after collecting all debug output, initiate a reload of the route processor card or the line card (depending on fault isolation). If the error does not recover after the reload, initiate an RMA.
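The Set/Clear rule and the ten-faults-in-ten-minutes rule above can be sketched as a small log classifier. This is only an illustration under stated assumptions: the function name classify, its parameters, and the regular expressions are hypothetical and were written against the sample messages quoted in this paper; the year is supplied by the caller because the log timestamps do not carry one.

```python
import re
from datetime import datetime, timedelta

# Regexes written against the Set/Clear messages quoted in this paper.
PFM_RE = re.compile(
    r"(?P<ts>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\.\d{3}).*?"
    r"PUNT_FABRIC_DATA_PATH_FAILED\s*:\s*(?P<state>Set|Clear)\|"
    r".*?failed:\s*(?P<pairs>.*)$"
)
PAIR_RE = re.compile(r"\(\s*(\d+/\w+/CPU\d+)\s*,\s*(\d+)\s*\)")


def classify(lines, year=2013, flurry_count=10,
             flurry_window=timedelta(minutes=10)):
    """Map each (slot, NP) seen in a Set message to 'hard' or 'transient'."""
    events = {}  # (slot, np) -> list of (timestamp, state)
    for line in lines:
        m = PFM_RE.search(line)
        if m is None:
            continue
        # Log timestamps carry no year; `year` is a caller-supplied assumption.
        stamp = " ".join(m.group("ts").split())
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")
        for slot, np_id in PAIR_RE.findall(m.group("pairs")):
            key = (slot, int(np_id))
            events.setdefault(key, []).append((when, m.group("state")))

    verdict = {}
    for key, evs in events.items():
        evs.sort()
        sets = [when for when, state in evs if state == "Set"]
        if not sets:
            continue
        # Hard fault: the most recent message for this NP is still a Set.
        outstanding = evs[-1][1] == "Set"
        # Flurry: at least `flurry_count` Sets inside one `flurry_window`.
        flurry = any(
            sum(1 for t in sets[i:] if t - sets[i] <= flurry_window) >= flurry_count
            for i in range(len(sets))
        )
        verdict[key] = "hard" if outstanding or flurry else "transient"
    return verdict
```

Feeding the function the lines returned by the logging CLI yields one verdict per (slot, NP). To apply the stricter guideline that any repeated transient fault is treated as hard, the same sketch can be called with flurry_count=2 and a wide flurry_window.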

Steps to Analyze Transient Faults

• The first step is to find out whether the error occurred once or multiple times. [Use CLI 1 listed below in this section]
• The second step is to note the current status, that is, whether the error is outstanding or cleared. [Use CLI 2 listed below in this section]
• If the error status is changing between Set and Clear, then one or more faults within the fabric data path are repeatedly occurring and being rectified by either software or hardware.
• To monitor future occurrences of the fault (when the last status of the error is CLEAR and no new faults are occurring), provision SNMP traps or run a script that periodically collects “sh pfm location all” output and searches it for the error string.

CLIs to Use
1: show logging | inc "PUNT_FABRIC_DATA_PATH"
2: show pfm location all ; check whether the status type is SET or CLEAR
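The periodic check suggested in the last step above could look roughly like the following sketch. It deliberately leaves out how the CLI output is collected: fetch_output and on_fault are hypothetical callables supplied by the operator (for example, an SSH session wrapper and an alerting hook), and the assumption that the search string from CLI 1 also appears in the “sh pfm location all” output should be verified on the target release.

```python
import time

ERROR_STRING = "PUNT_FABRIC_DATA_PATH"  # the string searched for in CLI 1


def find_fault_lines(cli_output, needle=ERROR_STRING):
    """Return the lines of captured CLI output that mention the error string."""
    return [line.strip() for line in cli_output.splitlines() if needle in line]


def monitor(fetch_output, on_fault, interval_s=300.0, max_polls=None,
            sleep=time.sleep):
    """Poll fetch_output() and pass matching lines to on_fault.

    fetch_output: hypothetical callable returning the text of
                  "sh pfm location all".
    on_fault:     hypothetical callback that receives the matching lines.
    max_polls:    stop after this many polls; None means run until interrupted.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        hits = find_fault_lines(fetch_output())
        if hits:
            on_fault(hits)
        polls += 1
        # Do not sleep after the final poll.
        if max_polls is None or polls < max_polls:
            sleep(interval_s)
```

The polling interval of 300 seconds is an arbitrary default for illustration; SNMP traps remain the alternative named in the steps above.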

Steps to Analyze Hard Faults

If we view the fabric data path links on a line card as a tree (details are in section “Back Ground”), it follows that, depending on the point of fault, one or more NPs may become unreachable. When faults are reported on multiple NPs, use the CLIs listed in this section to examine them.

CLIs to Use

1: show logging | inc "PUNT_FABRIC_DATA_PATH"
The output might contain one or more NPs (example: NP2, NP3).

2: show controller fabric fia link-status location <lc>
Since both NP2 and NP3 (in section B.2) receive and send through a single FIA, it is reasonable to infer that the fault is in the associated FIA on the path.

3: show controller fabric crossbar link-status instance <0 and 1> location <LC or RSP>
If none of the NPs on a line card is reachable by the diagnostic application, it is reasonable to infer that the links connecting the line card slot to the route processor card have a fault, or that one of the ASICs that forward traffic between the route processor card and the line card has a fault.

show controller fabric crossbar link-status instance 0 location <lc>
show controller fabric crossbar link-status instance 0 location 0/rsp0/cpu0
show controller fabric crossbar link-status instance 1 location 0/rsp0/cpu0
show controller fabric crossbar link-status instance 0 location 0/rsp1/cpu0
show controller fabric crossbar link-status instance 1 location 0/rsp1/cpu0

4: show controller fabric fia link-status location 0/rsp*/cpu0
show controller fabric fia bridge sync-status location 0/rsp*/cpu0
If all NPs on all line cards report a fault, then the fault is most likely on a route processor card (active or standby). Refer to the links connecting the route processor card CPU to the FPGA and to the route processor card FIA in the background section.

show controller fabric fia link-status location 0/rsp0/cpu0
show controller fabric fia link-status location 0/rsp1/cpu0
show controller fabric fia bridge sync-status location 0/rsp0/cpu0
show controller fabric fia bridge sync-status location 0/rsp1/cpu0
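The tree-based reasoning in the steps above (a single NP, all NPs behind one FIA, all NPs on a line card, all NPs on all line cards) can be written down as a first-pass triage sketch. The function infer_fault_domain and the topology dictionary are hypothetical: the NP-to-FIA mapping differs per line card type and must be taken from the background section, and the result is only a hint about which CLI to run first, not a diagnosis.

```python
RSP_SUSPECT = "route processor card"
LC_LINK_SUSPECT = "line card to route processor links or crossbar ASIC"


def infer_fault_domain(failed, topology):
    """Suggest where to look first.

    failed:   iterable of (slot, np) pairs taken from the Set message.
    topology: {slot: {np: fia_index}}, a hypothetical description of which
              FIA serves each NP; the real mapping depends on the card type.
    Returns {slot or "system": [suspects]}.
    """
    failed = set(failed)
    bad_by_slot = {
        slot: {np_id for s, np_id in failed if s == slot} for slot in topology
    }

    # Every NP on every line card failed: suspect the route processor card.
    # With a single line card this cannot be told apart from a line card
    # problem, so the rule is applied only when several cards are present.
    if len(topology) > 1 and all(
        bad_by_slot[slot] == set(nps) for slot, nps in topology.items()
    ):
        return {"system": [RSP_SUSPECT]}

    result = {}
    for slot, nps in topology.items():
        bad = bad_by_slot[slot]
        if not bad:
            continue
        if bad == set(nps):
            # No NP on this card is reachable (step 3 above).
            result[slot] = [LC_LINK_SUSPECT]
            continue
        nps_by_fia = {}
        for np_id, fia in nps.items():
            nps_by_fia.setdefault(fia, set()).add(np_id)
        suspects, explained = [], set()
        for fia, members in sorted(nps_by_fia.items()):
            # All NPs behind a shared FIA failed (step 2 above).
            if len(members) > 1 and members <= bad:
                suspects.append(f"FIA{fia}")
                explained |= members
        suspects += [f"NP{np_id}" for np_id in sorted(bad - explained)]
        result[slot] = suspects
    return result
```

For a card whose topology entry is {0: 0, 1: 0, 2: 1, 3: 1}, a Set message listing only NP2 and NP3 points at FIA1, which matches the NP2/NP3 example in step 2.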

Past Failures

In the past, about 99% of the faults we have seen were recoverable, and in most cases a software-initiated recovery action fixed them. However, in very rare cases, unrecoverable errors are seen that can only be fixed with an RMA of the cards. This section draws on errors seen in the past so that they can serve as guidance if similar errors are observed.

Transient Error due to NP oversubscription


RP/0/RP1/CPU0:Jun 26 13:08:28.669 : pfm_node_rp[349]: %PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED : Set|online_diag_rsp[200823]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/10/CPU0, 0)
RP/0/RP1/CPU0:Jun 26 13:09:28.692 : pfm_node_rp[349]: %PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED : Clear|online_diag_rsp[200823]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/10/CPU0, 0)

Hard Fault due to NP Lock

When PUNT_FABRIC_DATA_PATH_FAILED occurs and the failure is due to NP lock up, messages similar to those listed below will appear.

LC/0/2/CPU0:Aug 26 12:09:15.784 CEST: prm_server_ty[303]: prm_inject_health_mon_pkt : Error injecting health packet for NP0 status = 0x80001702
LC/0/2/CPU0:Aug 26 12:09:18.798 CEST: prm_server_ty[303]: prm_inject_health_mon_pkt : Error injecting health packet for NP0 status = 0x80001702
LC/0/2/CPU0:Aug 26 12:09:21.812 CEST: prm_server_ty[303]:prm_inject_health_mon_pkt : Error injecting health packet for NP0 status = 0x80001702
LC/0/2/CPU0:Aug 26 12:09:24.815 CEST: prm_server_ty[303]: NP-DIAG health monitoring failure on NP0
LC/0/2/CPU0:Aug 26 12:09:24.815 CEST: pfm_node_lc[291]:%PLATFORM-NP-0-NP_DIAG : Set|prm_server_ty[172112]|Network Processor Unit(0x1008000)| NP diagnostics warning on NP0.
LC/0/2/CPU0:Aug 26 12:09:40.492 CEST: prm_server_ty[303]: Starting fast reset for NP 0
LC/0/2/CPU0:Aug 26 12:09:40.524 CEST: prm_server_ty[303]: Fast Reset NP0 - successful auto-recovery of NP

Failures between RSP440 and Typhoon line cards

Cisco has fixed an issue where, on rare occasions, links across the backplane between the RSP440 and Typhoon based line cards get retrained. The fabric link is retrained because the signal strength is not optimal. This issue is present in the base releases of 4.2.1, 4.2.2, 4.2.3, 4.3.0, 4.3.1, and 4.3.2. The SMU for all these releases is posted on CCO and tracked with DDTS CSCuj10837. When this known and fixed issue occurs on a router, either of the following can happen:

1. The link goes down and comes back up (transient).
2. The link is permanently down.

In both cases, platform fault messages will appear on the screen. Depending on how fast the link gets retrained, one or more NPs on the connected line card will report a fault. Logs of this issue from a router at a customer site are listed below. In this case all 8 NPs reported error. However, we have seen instances where the link was retrained quickly and only a subset of the NPs on the line card reported an error.

Jan 12 03:02:00 phlasr1.router1 3418: RP/0/RSP1/CPU0:Jan 12 03:02:00.857 : pfm_node_rp[357]: %PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED : Set|online_diag_rsp[348276]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/4/CPU0, 1) (0/4/CPU0, 2) (0/4/CPU0, 3) (0/4/CPU0, 4) (0/4/CPU0, 5) (0/4/CPU0, 6) (0/4/CPU0, 7)
Jan 12 04:16:37 phlasr1.router1 3431: RP/0/RSP1/CPU0:Jan 12 04:16:37.725 : pfm_node_rp[357]: %PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED : Clear|online_diag_rsp[348276]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/4/CPU0, 1) (0/4/CPU0, 2) (0/4/CPU0, 3) (0/4/CPU0, 4) (0/4/CPU0, 5) (0/4/CPU0, 6) (0/4/CPU0, 7)

Ltrace To Confirm Link Retraining

Using the CLI below, we can check whether a link got retrained.

RP/0/RSP0/CPU0:edge1.ord1#sh controllers fabric ltrace crossbar location 0/rsp1/cPU0 | include link_retrain
Sep 20 07:38:49.772 crossbar 0/RSP1/CPU0 t1 detail xbar_fmlc_handle_link_retrain: rcvd link_retrain for (1,1,1),(2,1,0),0.

Here (1,1,1) and (2,1,0) depict the end points of the link. The parameters within the parentheses should be read as (SlotId, Asic-Type, Asic-Instance). Hence (1,1,1) means slot 1 (RSP1 in a 9006 chassis), asic-type 1 (which identifies the generation of the fabric Xbar ASIC) and Xbar instance 1. (2,1,0) means physical slot 2 (line card 0/0/cpu0 in a 9006 chassis), Xbar asic-type 1 and Xbar instance 0. In (1,1,1),(2,1,0),0 the trailing number (0 in this case) identifies the link number, out of the 4 fabric links that connect to each route processor card. The link number ranges from 0 to 3.
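The reading of the link_retrain record described above can be illustrated with a small parser. This is a sketch only: the names Endpoint, LinkRetrain and parse_link_retrain are hypothetical, the regular expression was written against the single ltrace line shown in this paper, and the mapping from SlotId to a card name is left out because it depends on the chassis type.

```python
import re
from typing import NamedTuple, Optional


class Endpoint(NamedTuple):
    slot_id: int        # physical slot number
    asic_type: int      # generation of the fabric Xbar ASIC
    asic_instance: int  # Xbar instance on that card


class LinkRetrain(NamedTuple):
    first: Endpoint
    second: Endpoint
    link: int           # fabric link number, expected to be 0 to 3


RETRAIN_RE = re.compile(
    r"rcvd link_retrain for\s*"
    r"\((\d+),\s*(\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+),\s*(\d+)\),\s*(\d+)"
)


def parse_link_retrain(line: str) -> Optional[LinkRetrain]:
    """Extract both end points and the link number from one ltrace line."""
    m = RETRAIN_RE.search(line)
    if m is None:
        return None
    n = [int(x) for x in m.groups()]
    return LinkRetrain(Endpoint(*n[0:3]), Endpoint(*n[3:6]), n[6])
```

Applied to the ltrace line above, the parser returns end points (1, 1, 1) and (2, 1, 0) with link number 0, which can then be grouped per link to see whether one particular link is retraining repeatedly.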

Online Diagnostic Test Report

A summary of all online diagnostic tests and failures, along with the last time stamp at which each test passed, is listed in the output of “show diagnostic result location <node> [test <test-id> detail]”. (The test ID for punt fabric data path failure is 10. A list of all tests, along with the frequency of test packets, can be seen with “show diagnostic content location <node>”.) The output for the punt fabric data path test will be similar to the sample listed below.

RP/0/RSP0/CPU0:ios(admin)#show diagnostic result location 0/rsp0/cpu0 test 10 detail
Current bootup diagnostic level for RP 0/RSP0/CPU0: minimal
Test results: (. = Pass, F = Fail, U = Untested)
___________________________________________________________________________
10 ) FabricLoopback ------------------> .
     Error code ------------------> 0 (DIAG_SUCCESS)
     Total run count -------------> 357
     Last test execution time ----> Sat Jan 10 18:55:46 2009
     First test failure time -----> n/a
     Last test failure time ------> n/a
     Last test pass time ---------> Sat Jan 10 18:55:46 2009
     Total failure count ---------> 0
     Consecutive failure count ---> 0
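When this report is collected repeatedly, the counters can be pulled out of the text with a few lines of code. The sketch below assumes the layout of the sample above, with one "name ---> value" field per line; parse_diag_detail is a hypothetical helper name, and other releases may format the report differently.

```python
import re

# One field per line: "<name> ----> <value>", as in the sample report above.
FIELD_RE = re.compile(r"^\s*(?P<key>[^-\n]+?)\s*-{2,}>\s*(?P<value>.*?)\s*$")


def parse_diag_detail(text):
    """Return {field name: value string} for every 'name ---> value' line."""
    fields = {}
    for line in text.splitlines():
        m = FIELD_RE.match(line)
        if m:
            fields[m.group("key")] = m.group("value")
    return fields
```

A consecutive failure count that approaches the failure threshold reported in the PFM message (3 in the examples in this paper) is a useful early signal, and int(fields["Consecutive failure count"]) gives that number directly.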

FAQ

1. Does the primary or the standby route processor card send the keepalives or online diagnostic packets to every NP in the system?
A: Yes. Both route processor cards send online diagnostic packets to every NP.

2. Is the path the same when route processor card1 is active?
A: The diagnostic path is the same for route processor card0 and route processor card1. The path depends on the state (active or standby) of the route processor card. Section “Punt Fabric Diagnostic Packet Path” has more details.

3. How often does the route processor card send keepalives, and how many keepalives must be missed to trigger the alarm?
A: By default, one packet is sent towards every NP every minute. Three consecutive misses are required to trigger the fault.

4. How do we determine whether an NP is or has been oversubscribed?
A: One way to check whether an NP was oversubscribed in the past or is currently oversubscribed is to check for certain kinds of drops inside the NP and for tail drops in the FIA. IFDMA drops inside the NP occur when the NP is oversubscribed and cannot keep up with incoming traffic. FIA tail drops occur when the egress NP asserts flow control (asking the ingress line card to send less traffic). Under flow control, the FIA shows tail drops.

5. How do we determine whether an NP is locked up?
A: Typically an NP lock up is cleared by a fast reset. The reason logged for the fast reset will clearly state NP lock up.

6. What do we see if an NP is completely dead due to a hardware failure?
A: We see both a Punt Fabric Data Path failure for that NP and an NP loopback test failure. (The NP loopback test failure message will be similar to the one shown in the Appendix section, and the punt fabric data path failure message will be similar to the one shown in the Introduction section of this paper.)

7. Please explain the meaning and numbering of (9,1,0),(5,1,0),1 in the ltrace message “Sep 3 13:47:07.027 crossbar 0/7/CPU0 t1 detail xbar_fmlc_handle_link_retrain: rcvd link_retrain for (9,1,0),(5,1,0),1.”
A: Here (9,1,0) and (5,1,0) depict the end points of the link. The parameters within the parentheses should be read as (SlotId, Asic-Type, Asic-Instance). Hence (9,1,0) means slot 9 (line card 0/7/CPU0), asic-type 1 (which identifies the generation of the fabric Xbar ASIC) and Xbar instance 0. (5,1,0) means physical slot 5 (route processor card 0/rsp1/cpu0 in a 9010 chassis), Xbar asic-type 1 and Xbar instance 0. In (9,1,0),(5,1,0),1 the trailing number (1 in this case) identifies the link number, out of the 4 fabric links that connect a Typhoon line card to each route processor card. The link number ranges from 0 to 3.

8. Just because a diag message is sourced from one route processor card does not mean that it comes back to the same one. Yes or No?
A: No. Since diagnostic packets are sourced from both route processor cards and tracked on a per route processor card basis, a diagnostic packet sourced from a route processor card is looped back to the same route processor card by the NP.


9. The CSCuj10837 SMU and critical announcement is only for the link retrain event. How did we determine to put out an announcement for this and not the other events? Is this just the most common event, and if so again how did they determine that this is the cause of 99% of these messages?
A: After considerable effort spent on root cause analysis and verification of the fix, the hardware team is confident that the link retraining issues will be fixed. It is the software team's analysis that all or most of the diagnostic failures seen before are tied to link retraining. Hence the BU has acknowledged the issue tracked by CSCuj10837 as the single most prevalent cause of the issues seen so far between RSP440 and Typhoon based line cards. This SMU and root cause analysis do not apply to failures seen on Trident based line cards.

10. How long does it take to retrain the serdes once the decision to do so is made?
A: The decision to retrain is made as soon as the fault is detected. Fabric drivers take up to 5 seconds to detect a link failure. Once the failure is detected, the link gets retrained (in the absence of a hard fault) within 2 seconds. Hence in the worst case traffic will be black holed for up to 5 seconds plus the retrain time.

11. At what point is the decision to retrain made?
A: As soon as the link fault is detected, the decision to retrain is made by the fabric ASIC driver.

12. So it's only between FIA on active route processor card and fabric that we use the first link and then after that it's the least loaded link when there are multiple links available?
A: Correct. The first link connecting to the first Xbar instance on the active route processor card is used to inject traffic into the fabric. The response packet from the NP can reach the active route processor card over any of the links connecting back to that card. The choice of link depends on link load.

13. During the retrain are all packets sent over that fabric link lost?
A: Yes. In the worst case this can be up to 7 seconds (up to 5 seconds to detect the failure plus up to 2 seconds to retrain, as described in question 10). However, on average the link is retrained within 3 seconds if the fault is transient. If the link has a fatal unrecoverable fault for any reason, the link is administratively shut down and traffic is re-routed around the faulty link. Hence, on average within 5 seconds (3 seconds to detect and 2 seconds for retraining to fail), traffic drop within the switch fabric will stop and the faulty link will be isolated and not used for forwarding traffic.

14. How frequently do we expect to see a crossbar serdes retrain event after customers have the fix for CSCuj10837?
A: Once the customer has the fix for CSCuj10837, a serdes retrain on the backplane link between the RSP440 and a Typhoon based line card should never occur. If it does, an SR must be raised and proper debugging should be done.

Data To Collect Before SR Creation


The minimum logging to collect before any action is taken is:

• show logging
• show pfm location all
• admin sh diagn result loc 0/rsp0/cpu0 test 8 detail
• admin sh diagn result loc 0/rsp1/cpu0 test 8 detail
• admin sh diagn result loc 0/rsp0/cpu0 test 9 detail
• admin sh diagn result loc 0/rsp1/cpu0 test 9 detail
• admin sh diagn result loc 0/rsp0/cpu0 test 10 detail
• admin sh diagn result loc 0/rsp1/cpu0 test 10 detail
• admin sh diagn result loc 0/rsp0/cpu0 test 11 detail
• admin sh diagn result loc 0/rsp1/cpu0 test 11 detail
• show controller fabric fia link-status location <lc>
• show controller fabric fia link-status location <both rsp>
• show controller fabric fia bridge sync-status location <both rsp>
• show controller fabric crossbar link-status instance 0 location <lc>
• show controller fabric crossbar link-status instance 0 location <both rsp>
• show controller fabric crossbar link-status instance 1 location <both rsp>
• show controller fabric ltrace crossbar location <both rsp>
• show controller fabric ltrace crossbar location <affected lc>
• show tech fabric location <fault showing lc> file <path to file>
• show tech fabric location <both rsp> file <path to file>
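Because the list above has to be expanded for every affected line card and for both route processor cards, it can help to generate it. The following sketch only builds the command strings; collection_commands and its parameters are hypothetical names, the unabbreviated "show diagnostic result location" form and the "location" keyword on the ltrace command follow the usage shown elsewhere in this paper, and the file argument of "show tech fabric" is deliberately left as a placeholder for the operator to fill in.

```python
def collection_commands(line_cards,
                        rsps=("0/rsp0/cpu0", "0/rsp1/cpu0"),
                        tech_file="<path to file>"):
    """Expand the data collection checklist for the given node locations.

    line_cards: locations of the line cards showing the fault,
                for example ["0/2/cpu0"].
    rsps:       locations of both route processor cards.
    tech_file:  placeholder for the show tech output file.
    """
    line_cards, rsps = list(line_cards), list(rsps)
    cmds = ["show logging", "show pfm location all"]
    for test in (8, 9, 10, 11):
        for rsp in rsps:
            cmds.append(
                f"admin show diagnostic result location {rsp} test {test} detail")
    for node in line_cards + rsps:
        cmds.append(f"show controller fabric fia link-status location {node}")
    for rsp in rsps:
        cmds.append(f"show controller fabric fia bridge sync-status location {rsp}")
    for lc in line_cards:
        cmds.append(
            f"show controller fabric crossbar link-status instance 0 location {lc}")
    for instance in (0, 1):
        for rsp in rsps:
            cmds.append("show controller fabric crossbar link-status "
                        f"instance {instance} location {rsp}")
    for node in rsps + line_cards:
        cmds.append(f"show controller fabric ltrace crossbar location {node}")
    for node in line_cards + rsps:
        cmds.append(f"show tech fabric location {node} file {tech_file}")
    return cmds
```

Printing the returned list gives a checklist that can be pasted into a terminal session or handed to whatever automation is already in use; the sketch does not connect to the router itself.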

Some Useful Diagnostic Commands

show diagnostic ondemand settings
show diagnostic content location < loc >
show diagnostic result location < loc > [ test {id|id_list|all} ] [ detail ]
show diagnostic status
admin#diagnostic start location < loc > test {id|id_list|test-suite}
admin#diagnostic stop location < loc >
admin#diagnostic ondemand iterations < iteration-count >
admin#diagnostic ondemand action-on-failure {continue failure-count|stop}
admin-config#[ no ] diagnostic monitor location < loc > test {id | test-name} [disable]
admin-config#[ no ] diagnostic monitor interval location < loc > test {id | test-name} day hour:minute:second.millisec
admin-config#[ no ] diagnostic monitor threshold location < loc > test {id | test-name} failure count

Conclusion

As of the 4.3.4 release time frame, all issues related to punt fabric data path failure are addressed. For releases prior to 4.3.4, the only SMU for this issue can be obtained from CCO using CSCuj10837. The platform team has put in place state-of-the-art fault handling so that the router recovers in under a second if and when a recoverable data path failure occurs. We recommend using this document to understand the problem even if no such fault has been observed on your system.

Other References


https://supportforums.cisco.com/docs/DOC32083#What_to_collect_if_there_is_still_an_issue

Appendix

NP LoopBack Diagnostic Path

The diagnostic application executing on the line card CPU keeps tabs on the health of each NP by periodically checking that the NP is working. A packet is injected from the line card CPU towards the local NP, which the NP should loop back and return to the line card CPU. Any loss of such periodic packets is flagged with a platform message. An example of such a message is below.

“LC/0/7/CPU0:Aug 18 19:17:26.924 : pfm_node[182]: %PLATFORM-PFM_DIAGS-2-LC_NP_LOOPBACK_FAILED : Set|online_diag_lc[94283]|Line card NP loopback Test(0x2000006)|link failure mask is 0x8”

This message means that the test failed to get the loopback packet from NP3: in "link failure mask is 0x8", bit 3 is set, which identifies NP3. The output of the CLIs below can provide more details.

• admin show diagnostic result location 0/x/cpu0 test 9 detail
• show controllers NP counter NP(0-3) location 0/x/cpu0
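The mask decoding described above (bit n set means NPn failed the loopback test) can be sketched in a few lines. The helper names are hypothetical, and the regular expression assumes the message wording of the example quoted above.

```python
import re

MASK_RE = re.compile(r"link failure mask is (0x[0-9a-fA-F]+)")


def failed_nps_from_mask(mask):
    """Return the NP numbers whose bit is set in the link failure mask."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def failed_nps_from_message(message):
    """Find the mask in an LC_NP_LOOPBACK_FAILED message and decode it."""
    m = MASK_RE.search(message)
    if m is None:
        return []
    return failed_nps_from_mask(int(m.group(1), 16))
```

For the example message, the mask 0x8 decodes to NP3 only; a mask with several bits set would list every NP that failed to return the loopback packet.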

[Figure: block diagram of an A9K-4T Trident line card. Labeled blocks: LC CPU, LC PCIe switch, B0, B1, FIA0, NP0, NP1, NP2, NP3.]

Fabric Debug Commands

The commands listed in this section apply to all Trident based line cards as well as to the Typhoon based 100GE line card. Since the “bridge” FPGA is not present on Typhoon based line cards (except the 100GE Typhoon based line cards), the CLIs under “sh controller fabric fia bridge …” do not apply to them.


This pictorial representation helps to map each CLI to its location in the data path. This should help in isolating the drop or fault point.

Author

Mahesh S { maheshk@cisco.com } October/2013

 

Printed in USA C11-539588-01 04/10