understanding the driver trace file february 20, 2013

21
Understanding the driver trace file February 20, 2013

Upload: curtis-hodge

Post on 03-Jan-2016

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Understanding the driver trace file February 20, 2013

Understanding thedriver trace file

February 20, 2013

Page 2: Understanding the driver trace file February 20, 2013

February 20, 2012QLogic Confidential2

Understanding the driver trace file

Using the driver trace file to determine Fabric issues

• Where does it come from?• It is generated by the switch’s port application and is collected during the

creation of the support file• It cannot be modified

• What does it provide?• Associates timestamps to various Fabric events• Gives many of the port counters a frame of reference

Training material and supporting files can be found at: \\nasaroo\LOVE\support\docs\training\Reading_the_Driver_Trace_File

Page 3: Understanding the driver trace file February 20, 2013

QTS Weekly Report/QLogic Confidential February 20, 20123

Using the driver trace file to determine Fabric issues

What the driver trace file cannot do:• Record normal data/traffic flow, including these counters:

• Words In/Out• Frames In/Out• Words Rx/Tx• Frames Rx/Tx• TxWaits• PLOGI/PRLI/FLOGI• BBCR counters• Long Frames• Short Frames• EpIsoReason/EpConnect

• FReject• Decode Errors• Lost R_RDYs• Invalid CRC (except to Well-Known Addr)• LIP AL counters• Login/Logout (can be associated)• Fbusy (can be associated)• Bad Frames (some)

Understanding the driver trace file

Page 4: Understanding the driver trace file February 20, 2013

February 20, 2012QLogic Confidential4

Understanding the driver trace file

Using the driver trace file to determine Fabric issues

What the driver trace file can record:• Track the following counters:

• ALInit (Indirect)• ALInitError (Indirect)• Flow Errors (RBUF/TBUF)• InvalidDestAddr• LIP_F7_F7/LIP_F8_F7• Link Failures• Login/Logout (Indirect)• LossOfSync• Bad Frames - Invalid Source Addr & Well-Known Addr (Src and Dest IDs)• Primitive Sequence Errors (for LISM only)

• Tx/Rx Offline Sequence counters (OLS/NOS)• Tx/Rx Link Reset counters (LR/LRR sequences)• Class3Toss – Includes frame header data

• Destination ID• Source ID• Reject code

Page 5: Understanding the driver trace file February 20, 2013

Analyzing the driver trace file

Prerequisites

February 20, 2012QLogic Confidential5

Have the customer provide:• Topology view of the Fabric• Date and approximate time of the occurrence(s)• Port(s) of switch or switches involved• Plus, NTP needs to be enabled when reviewing multiple switches

Page 6: Understanding the driver trace file February 20, 2013

February 20, 2012

Analyzing the driver trace file

Modifying the driver trace file

Perform these steps:• Sort the file messages based on the line sequence number• Delete all messages not associated with the occurrence

Page 7: Understanding the driver trace file February 20, 2013

QLogic Confidential7

Analyzing the driver trace file

Run the driver trace Expect script

Provided output:• Timestamps of Received and Transmitted Link Resets• A count of discarded frames based on the Destination ID• Reject code for the discarded frame

February 20, 2012

Page 8: Understanding the driver trace file February 20, 2013

Analyzing the driver trace file

Driver trace Expect script output

February 20, 2012QLogic Confidential8

Timestamps of Received and Transmitted Link Resets

Apr 05 14:19:52.277 ag0Port: ag0p8 ENTRY: LR Transmit StateApr 05 14:19:59.244 ag0Port: ag0p5 ENTRY: LR Transmit StateApr 05 14:20:07.093 ag0Port: ag0p8 INPUT EVENT: LR ReceivedApr 05 14:20:10.710 ag0Port: ag0p5 INPUT EVENT: LR Received

total Rx LR : 10total Tx LR : 61total_frames: 242

Page 9: Understanding the driver trace file February 20, 2013

Analyzing the driver trace file

Driver trace Expect script output

February 20, 2012QLogic Confidential9

Discarded frames based on the Destination ID and Reject code

switch D_ID counts for all rejected frames - Sorted by timeouts

D_ID: 020800 CNT: 121 Timeouts: 121 Steering: 0 D_ID: 020c00 CNT: 83 Timeouts: 81 Steering: 0 D_ID: 020500 CNT: 38 Timeouts: 38 Steering: 0

RJT_Code: a4 Reason: other CNT: 2 RJT_Code: 82 Reason: Rx Timeout CNT: 150 RJT_Code: a3 Reason: Tx Timeout CNT: 90

Page 10: Understanding the driver trace file February 20, 2013

Analyzing the driver trace file

Reading driver trace events

February 20, 2012QLogic Confidential10

Link States

Link State = 9Beginning of LISM. Commonly see LIP(F7,F7)

Link State = 7Offline sequence (OLS) transition. Normally followed by NOS (Not Operational Sequence)

Link State = 3Link Reset sequence (LR and LRR)

Link State = 1Link active and online

Page 11: Understanding the driver trace file February 20, 2013

Analyzing the driver trace file

Reading driver trace events

February 20, 2012QLogic Confidential11

Link State example: Shutdown and bootup of a Host25946: Mar 01 18:18:46.538 ag0p0 ENTRY: handleMediaLossOfSignal25951: Mar 01 18:18:46.538 ag0p0 ENTRY: auto_speed_negotiate_task (If speed is set vs. auto, it’ll be seen here)25952: Mar 01 18:18:46.538 ag0p0 auto-negotiation enabled (If topology is set (FL_Port/F_Port), it’ll be seen here)25955: Mar 01 18:19:17.969 ag0p0 NORM state, speed=4G25961: Mar 01 18:19:17.969 ag0p0 Sent LIP because of SYNC_ACQ25985: Mar 01 18:19:18.149 ag0p0 Link State = 925986: Mar 01 18:19:18.158 ag0p0 Received LIPf7f7 (F7,F7 indicates Loop Init w/o AL_PA. Comes from power-up or hot-

plug)26010: Mar 01 18:19:18.188 ag0p0 OUTPUT EVENT: Transmit OLS26012: Mar 01 18:19:18.188 ag0p0 Link State = 726016: Mar 01 18:19:18.219 ag0p0 INPUT EVENT: Link Init Timer Expiry26017: Mar 01 18:19:18.219 ag0p0 ENTRY: OLS Transmit State26018: Mar 01 18:19:18.219 ag0p0 Link State = 726020: Mar 01 18:19:18.219 ag0p0 INPUT EVENT: LR Received26021: Mar 01 18:19:18.219 ag0p0 ENTRY: LR Receive State26023: Mar 01 18:19:18.219 ag0p0 Link State = 326027: Mar 01 18:19:18.235 ag0p0 INPUT EVENT: Receive Buffer Drain Timer Expiry26028: Mar 01 18:19:18.235 ag0p0 ENTRY: LR Receive State26029: Mar 01 18:19:18.235 ag0p0 ENTRY resetRbufCreditFinish(), PBR = 026030: Mar 01 18:19:18.235 ag0p0 OUTPUT EVENT: Transmit LRR26031: Mar 01 18:19:18.235 ag0p0 Link State = 326033: Mar 01 18:19:18.235 ag0p0 INPUT EVENT: Idles Received26034: Mar 01 18:19:18.235 ag0p0 ENTRY: Active State26035: Mar 01 18:19:18.235 ag0p0 OUTPUT EVENT: Transmit IDLE26036: Mar 01 18:19:18.235 ag0p0 Link State = 1

Page 12: Understanding the driver trace file February 20, 2013

Analyzing the driver trace file

Reading driver trace events

February 20, 2012QLogic Confidential12

HBA failing to log into switch port as Loop

385402: Jan 28 16:25:15.356 ag0p0 Received LIPf7f7 Requesting AL_PA: Power-up or Hot plug385408: Jan 28 16:25:15.356 ag0p0 OUTPUT EVENT: Transmit OLS OLS = Offline Sequence385413: Jan 28 16:25:15.387 ag0p0 INPUT EVENT: Link Init Timer Expiry 385418: Jan 28 16:25:15.387 ag0p0 INPUT EVENT: Idles Received Login achieved385424: Jan 28 16:25:15.490 ag0p0 INPUT EVENT: RTTOV Expiry R_T_TOV (100ms) has expired385425: Jan 28 16:25:15.490 ag0p0 ENTRY: Wait for OLS State 385426: Jan 28 16:25:15.490 ag0p0 OUTPUT EVENT: Transmit NOS Not Operational Sequence

385439: Jan 28 16:25:15.537 ag0p0 OUTPUT EVENT: Transmit OLS385447: Jan 28 16:25:15.584 ag0p0 Link shutdown called from loop, old_port()385451: Jan 28 16:25:15.584 ag0p0 OUTPUT EVENT: Transmit OLS385456: Jan 28 16:25:15.615 ag0p0 INPUT EVENT: Link Init Timer Expiry 385457: Jan 28 16:25:15.615 ag0p0 ENTRY: OLS Transmit State385461: Jan 28 16:25:15.615 ag0p0 INPUT EVENT: Idles Received385467: Jan 28 16:25:15.718 ag0p0 INPUT EVENT: RTTOV Expiry R_T_TOV (100ms) has expired385468: Jan 28 16:25:15.718 ag0p0 ENTRY: Wait for OLS State385469: Jan 28 16:25:15.718 ag0p0 OUTPUT EVENT: Transmit NOS Not Operational Sequence

Page 13: Understanding the driver trace file February 20, 2013

Analyzing the driver trace file

Reading driver trace events

February 20, 2012QLogic Confidential13

Link State example: Link Reset due to SCSI Bus Reset

25564: Mar 01 15:34:16.507 ag0p14 INPUT EVENT: LR Received25565: Mar 01 15:34:16.507 ag0p14 ENTRY: LR Receive State25567: Mar 01 15:34:16.507 ag0p14 Link State = 325576: Mar 01 15:34:16.523 ag0p14 INPUT EVENT: Receive Buffer Drain Timer Expiry25577: Mar 01 15:34:16.523 ag0p14 ENTRY: LR Receive State25579: Mar 01 15:34:16.523 ag0p14 OUTPUT EVENT: Transmit LRR25580: Mar 01 15:34:16.523 ag0p14 Link State = 325582: Mar 01 15:34:16.523 ag0p14 INPUT EVENT: Idles Received25583: Mar 01 15:34:16.523 ag0p14 ENTRY: Active State25584: Mar 01 15:34:16.523 ag0p14 OUTPUT EVENT: Transmit IDLE25585: Mar 01 15:34:16.523 ag0p14 Link State = 1

Page 14: Understanding the driver trace file February 20, 2013

Analyzing the driver trace file

Reading driver trace events

February 20, 201214

Link State example: Loss of connection. Storage controller failure (Seq. repeats)37556: May 22 05:34:16.448 ag0p2 INPUT EVENT: Idles Received 37912: May 22 07:05:29.252 ag0p2 INPUT EVENT: Idles Received37557: May 22 05:34:16.448 ag0p2 ENTRY: Active State 37913: May 22 07:05:29.252 ag0p2 ENTRY: Active State37558: May 22 05:34:16.448 ag0p2 OUTPUT EVENT: Transmit IDLE 37914: May 22 07:05:29.252 ag0p2 OUTPUT EVENT: Transmit IDLE37559: May 22 05:34:16.448 ag0p2 Link State = 1 37915: May 22 07:05:29.252 ag0p2 Link State = 1....No other events to/from this port, until (+1 hour, 30 minutes)37817: May 22 07:04:25.002 ag0p2 Received LIPf8f7 (F8,F7 indicates requesting Loop Init, due to Loop/connection failure)37825: May 22 07:04:25.018 ag0p2 ENTRY: handleMediaLossOfSignal37840: May 22 07:05:28.993 ag0p2 Sent LIP because of SYNC_ACQ (+1 minute, 4 seconds)37847: May 22 07:05:29.039 ag0p2 OUTPUT EVENT: Transmit OLS37852: May 22 07:05:29.070 ag0p2 ENTRY: OLS Transmit State37853: May 22 07:05:29.070 ag0p2 Link State = 737855: May 22 07:05:29.070 ag0p2 INPUT EVENT: Idles Received37856: May 22 07:05:29.070 ag0p2 Link State = 737861: May 22 07:05:29.173 ag0p2 INPUT EVENT: RTTOV Expiry (R_T_TOV stands for Receiver-Transmitter TimeOut Value. This is the amount of time that the receiver logic 37862: May 22 07:05:29.173 ag0p2 ENTRY: Wait for OLS State uses to determine loss of sync on the wire. Its default value is 100 milliseconds.)37863: May 22 07:05:29.173 ag0p2 OUTPUT EVENT: Transmit NOS37864: May 22 07:05:29.173 ag0p2 Link State = 937865: May 22 07:05:29.176 ag0p2 Received LIPf7f737896: May 22 07:05:29.237 ag0p2 ENTRY: OLS Transmit State37897: May 22 07:05:29.237 ag0p2 Link State = 737899: May 22 07:05:29.237 ag0p2 INPUT EVENT: LR Received37900: May 22 07:05:29.237 ag0p2 ENTRY: LR Receive State37902: May 22 07:05:29.237 ag0p2 Link State = 337906: May 22 07:05:29.252 ag0p2 INPUT EVENT: Receive Buffer Drain Timer Expiry37907: May 22 07:05:29.252 ag0p2 ENTRY: LR Receive State37909: May 22 07:05:29.252 ag0p2 OUTPUT EVENT: Transmit LRR37910: May 22 07:05:29.252 ag0p2 Link State = 3

Page 15: Understanding the driver trace file February 20, 2013

Analyzing the driver trace file

Reading driver trace events

February 20, 2012QLogic Confidential15

Class 3 Toss – Dropped Frames

11735: Apr 05 13:54:19.518 ag0p12 Frame info for Port12: SOF 3208827011736: Apr 05 13:54:19.518 ag0p12 Header(0-2) 01020c00 00020800 0880000811737: Apr 05 13:54:19.518 ag0p12 Header(3-5) 3900002f 80b80184 0001780011738: Apr 05 13:54:19.518 ag0p12 Payload ce265b11 6f056a4b da41e79211739: Apr 05 13:54:19.518 ag0p12 CRC c729a1fb TLR 396d3000

11735: Apr 05 13:54:19.518 ag0p12 Frame info for Port12: SOF 3208827011736: Apr 05 13:54:19.518 ag0p12 Header(0-2) 01020c00 00020800 0880000811737: Apr 05 13:54:19.518 ag0p12 Header(3-5) 3900002f 80b80184 0001780011738: Apr 05 13:54:19.518 ag0p12 Payload ce265b11 6f056a4b da41e79211739: Apr 05 13:54:19.518 ag0p12 CRC c729a1fb TLR 396d3000

User Port where dropped Frame was recorded.65535 = Aux Port

Reject code for the dropped framea3 = Tx Timeout82 = Rx Timeouta4 = Other (DID/SID not in NS etc)88 = ALPA Policy

Destination FC-ID Source FC-ID

Page 16: Understanding the driver trace file February 20, 2013

Analyzing the driver trace file

Reading driver trace events

February 20, 2012QLogic Confidential16

What happened?11671: Apr 05 13:54:19.436 ag0p9 Flow Control Error, RBUF empty count = 0 11904: Apr 05 13:54:19.522 ag0p9 INPUT EVENT: Idles Received11674: Apr 05 13:54:19.437 ag0p9 ENTRY: LR Transmit State 11906: Apr 05 13:54:19.522 ag0p9 Link State = 111675: Apr 05 13:54:19.437 ag0p9 OUTPUT EVENT: Transmit LR 11931: Apr 05 13:54:19.549 ag0p8 INPUT EVENT: Idles Received11677: Apr 05 13:54:19.437 ag0p9 INPUT EVENT: LRR Received 11934: Apr 05 13:54:19.549 ag0p8 Link State = 111689: Apr 05 13:54:19.452 ag0p9 RBUF on port 9 failed to empty during link credit reset 11942: Apr 05 13:54:19.725 ag0p12 OUTPUT EVENT: Transmit LR11691: Apr 05 13:54:19.452 ag0p9 OUTPUT EVENT: Transmit NOS 11944: Apr 05 13:54:19.725 ag0p12 INPUT EVENT: LRR Received11694: Apr 05 13:54:19.501 ag0p9 INPUT EVENT: OLS Received 11954: Apr 05 13:54:19.741 ag0p12 ENTRY: LRR Receive State11696: Apr 05 13:54:19.501 ag0p9 OUTPUT EVENT: Transmit LR 11959: Apr 05 13:54:19.741 ag0p12 INPUT EVENT: Idles Received11699: Apr 05 13:54:19.502 ag0p8 INPUT EVENT: LR Received 11960: Apr 05 13:54:19.741 ag0p12 ENTRY: Active State11700: Apr 05 13:54:19.502 ag0p8 ENTRY: LR Receive State 11961: Apr 05 13:54:19.741 ag0p12 Link State = 111705: Apr 05 13:54:19.507 ag0p9 INPUT EVENT: LRR Received11706: Apr 05 13:54:19.507 ag0p9 ENTRY: LRR Receive State11713: Apr 05 13:54:19.518 ag0p8 ENTRY: LR Receive State11715: Apr 05 13:54:19.518 ag0p8 TBUF blocked on port 12 during link credit reset.11716: Apr 05 13:54:19.518 ag0p8 RBUF on port 8 failed to empty during link credit reset11726: Apr 05 13:54:19.518 ag0p8 INPUT EVENT: LRR Received11735: Apr 05 13:54:19.518 ag0p12 Frame info for Port12: SOF 32088270 #Dropped frame11736: Apr 05 13:54:19.518 ag0p12 Header(0-2) 01020c00 00020800 0880000811740: Apr 05 13:54:19.518 ag0p12 Frame info for Port12: SOF 32088260 #Dropped frame11741: Apr 05 13:54:19.518 ag0p12 Header(0-2) 01020c00 00020900 0880000811745: Apr 05 13:54:19.518 ag0p12 Frame info for Port12: SOF 32088270 #Dropped frame11746: Apr 05 13:54:19.518 ag0p12 Header(0-2) 01020c00 00020800 0880000811750: Apr 05 13:54:19.518 ag0p12 Frame info for Port12: SOF 32088270 #Dropped frame11751: Apr 05 13:54:19.518 ag0p12 Header(0-2) 01020c00 00020900 0880000811755: Apr 05 13:54:19.518 ag0p12 Frame info for Port12: SOF 32088270 #Dropped frame11756: Apr 05 13:54:19.518 ag0p12 Header(0-2) 01020c00 00020800 0880000811760: Apr 05 13:54:19.518 ag0p12 Frame info for Port12: SOF 32088270 #Dropped frame11761: Apr 05 13:54:19.518 ag0p12 Header(0-2) 01020c00 00020900 0880000811899: Apr 05 13:54:19.522 ag0p9 ENTRY: LRR Receive State

Page 17: Understanding the driver trace file February 20, 2013

Analyzing the driver trace file

Case Study1 – Loop Port Login Failure

February 20, 2012QLogic Confidential17

Problem Description:

When Brocade 825 HBA (driver version 3.2.0.0) connected QLogic 5802(Huawei SNS2120) in P2P mode, the link was all right. When changed the topology mode of Brocade 825 to loop (speed was configured to Auto), the state of HBA port was failed.

Resolution Description: The HBA should wait 2 seconds in OPEN-INIT-SELECT-MASTER to resolve Loop Initialization Master based on LISM. Also, if a LIP(f7,f7) is received anywhere within the Initialization process, the Loop Initialization process will revert back to the beginning (Start Open Loop Initialization).

Note: Case Study materials show two different attempts:1.) When switch port is set to GL_Port2.) When switch port is set to FL_Port.

Page 18: Understanding the driver trace file February 20, 2013

Analyzing the driver trace file

Case Study2 – Single switch

February 20, 2012QLogic Confidential18

Problem Description:

Two disk arrays' controllers panicked at the same exact timeWe have an Xsan volume, built on four disk arrays called i-Stor (No-name Taiwanese storage). For Fibre Channel interconnect we utilize two 5600s.

Today we experienced the following issue:1) Both arrays registered 'FC Link Down' at the same time (April 5, 13:54:15, GMT +3)2) Both arrays controllers became completely unresponsive, we could not connect via IP or control them from LCD panel3) In FC Switch logs, a lot of link resets (Link down / Link up) were registered4) Xsan Volume, of course, became unresponsive, all processes hung up.

Would it be possible for you to check what exactly happened at lower levels of FC protocol at that time? That'd definitely help us a lot in investigating the problem. The arrays were connected to ports 5 and 8. Support files are attached.

Resolution Description: ATTO HBA stopped receiving frames, which appears to have caused instability of storage devices after the HBA recovered.

Page 19: Understanding the driver trace file February 20, 2013

Analyzing the driver trace file

Case Study3 – Multi-switch

February 20, 2012QLogic Confidential19

Problem Description:

Fabric OutagesHi we are having a few problems with LUN's dropping off the fibre fabric at present and suspect that it is a switch configuration error or a un compatibly issue between the 5600 & 5200 switch in the 6 switch fabric. we have 9000 frame which are looking role out shortly. but in the mean time can you have a look at the logs from our current configuration.

Notes:1.) Needed customer to provide a date or timestamp to make it easier to search through the driver trace files.2.) NTP being enabled is required.3.) Used ‘show lsdb’ output to show ISL flow. Also looked at ‘show steering’ and ‘show fabric’.All are in the show-support.txt file.

Resolution Description: Too much traffic moving across certain ISLs causing congestionand dropped frames. Many of these frames are having to traverse two hops to get to theirrespective destination. Adjust ISL connections and/or Host and storage locations.

SW1

SW3

SW5 SW6

SW4

SW2

Page 20: Understanding the driver trace file February 20, 2013

Analyzing the driver trace file

Case Study4 – SB9200

February 20, 2012QLogic Confidential20

Problem Description:

Numerous Link Reset errors3 days ago we were getting these errors for this port every few minutes. We had been having difficulties with the Sun StorageTek SL8500 tape drive it was attached to, so our Oracle field engineer came out and replaced the drive itself, and both the SFPs, one in the drive sled and the 9200 switch. We had hoped all problems with this had been resolved, but today we have gotten two more of he same Link Reset errors from that same port after the replacements. Please take a look at the attached support file and provide a recommendation.

Resolution Description: RMA of failing blade port.

Page 21: Understanding the driver trace file February 20, 2013

February 20, 201221 QLogic Confidential