VMware Error Examples

Upload: janice-randall

Post on 19-Dec-2015


SCSI Device I/O errors

Where to look for errors

Depending on the version of VMware, SCSI errors are logged in either:

/var/log/messages

or the vmkernel log.

The safest way to make sure you receive the correct logs is to request a vm-support dump.

QLogic Confidential

SCSI Errors: SCSI codes and devices

Example error: vmkernel: 8:23:44:19.128 cpu1:4142)ScsiDeviceIO: 1672: Command 0x28 to device "naa.6005076307ffc1370000000000001008" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0.

Note: In some versions of VMware this error starts with "FCP".

Breaking the error down:

• Command 0x28 – The type of command the error refers to. In this case 0x28 is a read command, READ(10). A good reference can be found here: http://en.wikipedia.org/wiki/SCSI_command

• naa.6005076307ffc1370000000000001008 – The target device of the command.
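The breakdown above can be done mechanically. The following is a minimal Python sketch, written only against the example line on this slide; the names `SCSI_IO_RE`, `OPCODES`, and `parse_scsi_io_error` are hypothetical and are not part of any VMware or QLogic tool. The opcode table is deliberately partial.

```python
import re

# Pattern written against the ScsiDeviceIO example line shown on this slide.
# Other VMware versions may format the line differently (assumption).
SCSI_IO_RE = re.compile(
    r'Command (?P<opcode>0x[0-9a-fA-F]+) to device "(?P<device>[^"]+)" failed '
    r'H:(?P<host>0x[0-9a-fA-F]+) D:(?P<dev>0x[0-9a-fA-F]+) '
    r'P:(?P<plugin>0x[0-9a-fA-F]+)'
    r'(?: (?:Possible|Valid) sense data: (?P<key>0x[0-9a-fA-F]+) '
    r'(?P<asc>0x[0-9a-fA-F]+) (?P<ascq>0x[0-9a-fA-F]+))?'
)

# A few standard SCSI opcodes; the full list is in the SCSI command reference.
OPCODES = {
    0x00: "TEST UNIT READY",
    0x12: "INQUIRY",
    0x28: "READ(10)",
    0x2A: "WRITE(10)",
}


def parse_scsi_io_error(line):
    """Return the fields of a ScsiDeviceIO error line, or None if no match."""
    match = SCSI_IO_RE.search(line)
    if match is None:
        return None
    groups = match.groupdict()
    result = {"device": groups.pop("device")}
    for name, text in groups.items():
        # Every remaining field is a hex number; sense fields may be absent.
        result[name] = int(text, 16) if text is not None else None
    result["opcode_name"] = OPCODES.get(result["opcode"], "unknown")
    return result


EXAMPLE = (
    'vmkernel: 8:23:44:19.128 cpu1:4142)ScsiDeviceIO: 1672: Command 0x28 to '
    'device "naa.6005076307ffc1370000000000001008" failed H:0x8 D:0x0 P:0x0 '
    'Possible sense data: 0x5 0x24 0x0.'
)

if __name__ == "__main__":
    print(parse_scsi_io_error(EXAMPLE))
```

Keeping the numeric fields as integers makes it straightforward to feed them into lookup tables for the status and sense codes.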

SCSI Errors: Host/Device/Plugin codes

Example error: vmkernel: 8:23:44:19.128 cpu1:4142)ScsiDeviceIO: 1672: Command 0x28 to device "naa.6005076307ffc1370000000000001008" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0.

Breaking the error down:

• H:0x8 D:0x0 P:0x0 – Identifies which layer is reporting the error: H = Host (initiator), D = Device (target), P = Plugin. Host, Device, and Plugin codes can be decoded here:

• Host: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1029039

• Device: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1030381

• Plugin: http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=2004086&sliceId=1&docTypeID=DT_KB_1_1&dialogID=295774258&stateId=0 0 295772714

• In this error the Host is reporting code 0x8, which indicates that the HBA driver has aborted the I/O. It can also occur if the HBA resets the target.

Note: The fact that the Host reported the error DOES NOT mean it is the cause. This is a common misconception.
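To show which layer reported a code, a small lookup can help. This is a sketch under stated assumptions: the host names follow the Linux DID_* host byte values, which the VMware host codes are assumed to mirror, and the device names are standard SCSI status codes. Both tables are partial and should be checked against the KB articles listed above; plugin codes are left undecoded. `HOST_STATUS`, `DEVICE_STATUS`, and `describe_status` are hypothetical names.

```python
# Partial tables (assumption: verify against the KB articles before relying on them).
HOST_STATUS = {
    0x0: "OK",
    0x1: "NO_CONNECT",
    0x2: "BUS_BUSY",
    0x3: "TIMEOUT",
    0x5: "ABORT",
    0x7: "ERROR",
    0x8: "RESET",
}

DEVICE_STATUS = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
    0x28: "TASK SET FULL",
}


def describe_status(host, device, plugin):
    """Return one description per layer that reported a non-zero code."""
    findings = []
    if host:
        findings.append(f"Host 0x{host:x} ({HOST_STATUS.get(host, 'unknown')})")
    if device:
        findings.append(f"Device 0x{device:x} ({DEVICE_STATUS.get(device, 'unknown')})")
    if plugin:
        # Plugin codes are not decoded here; see the Plugin KB article.
        findings.append(f"Plugin 0x{plugin:x} (see the Plugin KB article)")
    return findings


if __name__ == "__main__":
    # H:0x8 D:0x0 P:0x0 from the example error.
    print(describe_status(0x8, 0x0, 0x0))
```

The function only says who reported the condition. As the note above stresses, the reporting layer is not necessarily the cause.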

SCSI Errors: Sense codes

Example error: vmkernel: 8:23:44:19.128 cpu1:4142)ScsiDeviceIO: 1672: Command 0x28 to device "naa.6005076307ffc1370000000000001008" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0.

Breaking the error down:

0x5 0x24 0x0 – The sense data in Sense Key, ASC, ASCQ format. Often this is all zeros, but on occasion sense data is provided.

A good key for this information can be found here: http://en.wikipedia.org/wiki/Key_Code_Qualifier

For this error it says: Illegal Request, invalid field in CDB.
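The sense-data lookup can be sketched in a few lines of Python. The sense key names below are the standard SCSI ones; the ASC/ASCQ table holds only a handful of entries for illustration, so anything else falls through to a raw hex description. `SENSE_KEYS`, `ASC_ASCQ`, and `decode_sense` are hypothetical names.

```python
# Standard SCSI sense keys (partial list).
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
    0xB: "ABORTED COMMAND",
}

# A few (ASC, ASCQ) pairs; the Key Code Qualifier reference has the full table.
ASC_ASCQ = {
    (0x00, 0x00): "NO ADDITIONAL SENSE INFORMATION",
    (0x04, 0x00): "LOGICAL UNIT NOT READY, CAUSE NOT REPORTABLE",
    (0x24, 0x00): "INVALID FIELD IN CDB",
}


def decode_sense(text):
    """Decode a 'key asc ascq' string such as '0x5 0x24 0x0'."""
    key, asc, ascq = (int(part, 16) for part in text.split())
    key_name = SENSE_KEYS.get(key, f"sense key 0x{key:x}")
    detail = ASC_ASCQ.get(
        (asc, ascq), f"ASC 0x{asc:02x} ASCQ 0x{ascq:02x} (not in this table)"
    )
    return f"{key_name}: {detail}"


if __name__ == "__main__":
    print(decode_sense("0x5 0x24 0x0"))
```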

Sample errors to decode:

1) ScsiDeviceIO: 1672: Command 0x12 to device "eui.00173800049f0000" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x2 0x4 0x0.

• Command 0x12 = Inquiry

• eui.00173800049f0000 = Device name

• D:0x2 = Device (target) reported a Check Condition

• Sense data 0x2 0x4 0x0 = Not Ready - Cause not reportable

The examples below are often seen alongside the ScsiDeviceIO errors. They indicate the same events, but as reported by the QLogic driver:

Error: vmkernel: 8:23:44:19.128 cpu7:4142)<6>qla2xxx 0000:09:00.0: scsi(8:0:0): Abort command issued -- 1 f0784a4 2002.
Meaning: Indicates that a SCSI command abort has been issued to the target.

Error: vmkernel: qla2xxx 0000:03:00.0: scsi(1:0:1): DEVICE RESET ISSUED.
Meaning: Indicates that a Device (LUN) reset has been issued to the target.

Error: vmkernel: qla2xxx 0000:03:00.0: scsi(1:0:1): DEVICE RESET SUCCEEDED
Meaning: Indicates that a Device (LUN) reset has been successfully processed by the target.

ASYNC errors

Example errors:

Message: cpu4:4100)scsi(5): Asynchronous PORT UPDATE ignored 0000/0004/0600
Meaning: A fabric disruption occurred.

Message: scsi(5): Asynchronous LOOP UP (10 Gbps).
Meaning: The loop came up at the noted speed (in this case 10 Gbps).

Message: scsi(5): Asynchronous LOOP DOWN (10 Gbps).
Meaning: The loop went down; the noted speed is the link speed (in this case 10 Gbps).

Link status messages

Example errors:

Message: vmkernel: 0:00:21:01.647 cpu2:129) <6>scsi(0): LOOP DOWN detected.
Meaning: The loop is down.

Message: vmkernel: 0:00:21:54.032 cpu2:129) <6>scsi(0): LOOP UP detected.
Meaning: The loop is up.

Message: vmkernel: 0:00:21:26.285 cpu0:139) <6>scsi(0): Cable is unplugged...
Meaning: The physical link is down.
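When these link status messages appear in pairs, the length of the outage can be worked out from the timestamps. The sketch below assumes the vmkernel timestamp is host uptime in days:hours:minutes:seconds form, which matches the examples on this slide but may not hold for every VMware version. `LINK_RE`, `uptime_seconds`, and `loop_outages` are hypothetical names.

```python
import re

# Matches the LOOP DOWN / LOOP UP lines shown on this slide.
LINK_RE = re.compile(
    r"vmkernel: (\d+):(\d{2}):(\d{2}):(\d{2}\.\d+) "
    r".*?scsi\((\d+)\): LOOP (DOWN|UP) detected"
)


def uptime_seconds(days, hours, minutes, seconds):
    """Convert a D:HH:MM:SS.mmm uptime stamp (assumed format) to seconds."""
    return int(days) * 86400 + int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def loop_outages(lines):
    """Pair each LOOP DOWN with the next LOOP UP on the same scsi host.

    Returns a list of (host_number, outage_seconds) tuples.
    """
    down_since = {}
    outages = []
    for line in lines:
        match = LINK_RE.search(line)
        if match is None:
            continue  # not a loop up/down line
        days, hours, minutes, seconds, host, state = match.groups()
        stamp = uptime_seconds(days, hours, minutes, seconds)
        host = int(host)
        if state == "DOWN":
            # Keep the earliest DOWN if several are logged in a row.
            down_since.setdefault(host, stamp)
        elif host in down_since:
            outages.append((host, round(stamp - down_since.pop(host), 3)))
    return outages


SAMPLE_LOG = [
    "vmkernel: 0:00:21:01.647 cpu2:129) <6>scsi(0): LOOP DOWN detected.",
    "vmkernel: 0:00:21:26.285 cpu0:139) <6>scsi(0): Cable is unplugged...",
    "vmkernel: 0:00:21:54.032 cpu2:129) <6>scsi(0): LOOP UP detected.",
]

if __name__ == "__main__":
    for host, seconds in loop_outages(SAMPLE_LOG):
        print(f"scsi({host}) loop was down for about {seconds} s")
```

For the three sample lines, the loop on scsi(0) is down from 21:01.647 to 21:54.032, roughly 52 seconds.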

VMFS Heartbeat errors

Example error: vobd: Mar 01 13:24:16.429: 776658042771us: [esx.problem.vmfs.heartbeat.timedout] 4f44b9b5-5c051bb1-12a0-001018XXXXXX VT-315-CLU-XXXXXX.

Breaking the error down:

• esx.problem.vmfs.heartbeat.timedout = The ESX host's connectivity to the volume degraded because the host was unable to renew its heartbeat for a period of approximately 16 seconds (the VMFS lock-breaking lease timeout).

• After the periodic heartbeat renewal fails, VMFS declares that the heartbeat to the volume has timed out and suspends all I/O activity on the device until connectivity is restored or the device is declared inoperable.

• 4f44b9b5-5c051bb1-12a0-001018XXXXXX = This is the UUID for the volume the error is referring to.

• VT-315-CLU-XXXXXX = This is the volume the error is referring to.

Link failure and recovery

These errors are typically followed by messages confirming the loss of connection, like the one below:

• Hostd: [2012-03-01 13:24:16.429 FFEF8B90 info 'ha-eventmgr'] Event 73 : Lost access to volume 4f44b9b5-5c051bb1-12a0-001018XXXXXX (VT-315-CLU-XXXXXX) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

Once the link recovers, a similar set of messages is logged:

• Mar 1 13:24:21 vmkernel: 8:23:44:24.620 cpu2:4142)FS3: 398: Reclaimed heartbeat for volume 4f44b9b5-5c051bb1-12a0-001018XXXXXX (VT-315-CLU-XXXXXX): [Timeout] [HB state abcdef02 offset 4063232 gen 5 stamp 776664617984 uuid 4f439f16-eb47d490-548c-00215e5de25c jrnl <FB 29008> drv 8

• Mar 1 13:24:21 vobd: Mar 01 13:24:21.922: 776663535861us: [esx.problem.vmfs.heartbeat.recovered] 4f44b9b5-5c051bb1-12a0-001018XXXXXX VT-315-CLU-XXXXXX.

• Mar 1 13:24:21 Hostd: [2012-03-01 13:24:21.922 32EAFB90 info 'ha-eventmgr'] Event 74 : Successfully restored access to volume 4f44b9b5-5c051bb1-12a0-001018XXXXXX (VT-315-CLU-XXXXXX) following connectivity issues.
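The vobd timed-out and recovered events can be paired per volume to measure how long access was lost. This is a minimal sketch, assuming the number ending in "us" is a microsecond counter that is comparable between the two events (the sample values are consistent with that, but it is an assumption). `HEARTBEAT_RE` and `heartbeat_outages` are hypothetical names.

```python
import re

# Matches the vobd heartbeat events shown in the examples above.
HEARTBEAT_RE = re.compile(
    r"(?P<us>\d+)us: \[esx\.problem\.vmfs\.heartbeat\.(?P<event>timedout|recovered)\] "
    r"(?P<uuid>\S+) (?P<label>\S+?)\.?\s*$"
)


def heartbeat_outages(lines):
    """Return (uuid, label, seconds) for each timedout/recovered pair."""
    timed_out = {}
    outages = []
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match is None:
            continue  # not a heartbeat event
        uuid = match["uuid"]
        stamp_us = int(match["us"])
        if match["event"] == "timedout":
            timed_out[uuid] = stamp_us
        elif uuid in timed_out:
            # Microsecond counter difference, converted to seconds.
            seconds = (stamp_us - timed_out.pop(uuid)) / 1_000_000
            outages.append((uuid, match["label"], seconds))
    return outages


SAMPLE_VOBD = [
    "vobd: Mar 01 13:24:16.429: 776658042771us: "
    "[esx.problem.vmfs.heartbeat.timedout] "
    "4f44b9b5-5c051bb1-12a0-001018XXXXXX VT-315-CLU-XXXXXX.",
    "Mar 1 13:24:21 vobd: Mar 01 13:24:21.922: 776663535861us: "
    "[esx.problem.vmfs.heartbeat.recovered] "
    "4f44b9b5-5c051bb1-12a0-001018XXXXXX VT-315-CLU-XXXXXX.",
]

if __name__ == "__main__":
    for uuid, label, seconds in heartbeat_outages(SAMPLE_VOBD):
        print(f"{label} ({uuid}) lost its heartbeat for about {seconds:.2f} s")
```

For the sample events the counter difference is 5,493,090 microseconds, about 5.5 seconds, which agrees with the wall-clock timestamps 13:24:16.429 and 13:24:21.922.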

Misc. Log Messages

Log Messages and meanings

NOTE: These messages may be slightly different in newer versions of VMware

Message: "qla2x00_set_info starts at address = xxxxxxxx"
Meaning: Driver is reporting the starting address where the driver was loaded, in case an oops occurs in the driver.

Message: "qla2x00: Found VID=xxxx DID=yyyy SSVID=zzzz SSDID=vvvv"
Meaning: Driver is reporting which adapter it has found during initialization.

Message: "scsi(%d): Allocated xxxxx SRB(s)"
Meaning: Driver is reporting the number of simultaneous commands that can be executed by the adapter. The max_srbs option can change this number.

Message: "scsi(%d): 64 Bit PCI Addressing Enabled"
Meaning: Driver is reporting that it has configured the adapter for 64-bit PCI bus transfers.

Message: "scsi(%d): Verifying loaded RISC code..."
Meaning: Driver is reporting that it has verified the RISC code and it is running.

Message: "scsi(%d): Verifying chip..." extended"
Meaning: Driver is reporting that it has verified the chip on the adapter.

Message: "scsi(%d): Waiting for LIP to complete..."
Meaning: Driver is reporting that it is waiting on the firmware to become ready.

Message: "scsi(%d): LIP occurred, ..."
Meaning: Driver received a LIP async event from the firmware.

Message: "scsi(%d) LOOP UP detected"
Meaning: Driver received a loop up async event from the firmware.

Message: "scsi(%d) LOOP DOWN detected"
Meaning: Driver received a loop down async event from the firmware.

Message: "scsi(%d): Link node is up"
Meaning: Driver received a point-to-point async event from the firmware.

Message: "scsi%d: Topology - (%s), Host Loop address 0x0"
Meaning: Indicates the firmware connection type. %s will be one of the following: FL-PORT, N-PORT, F-PORT, NL-PORT, and host adapter loop ID.

Initialization messages and meanings (cont.)

Message: "scsi%d : QLogic XXXXXX PCI to Fibre Channel Host Adapter: ... Firmware version: 4.04.06, Driver version 7.08vm62"
Meaning: Driver is reporting information discovered during its initialization. This information includes the board ID, firmware version, and driver version.

Message: "qla%d Loop Down - aborting ISP"
Meaning: Indicates the driver is attempting to restart the loop by resetting the adapter. Usually done by the driver when sync is not detected by the firmware for a long time (4+ minutes), and usually means that the adapter port is not connected to the switch or loop.

Message: "scsi(%d): %s asynchronous Reset."
Meaning: Driver received an async reset event from the firmware. %s indicates the function name.

Message: "qla2x00: ISP System Error - mbx1=%x, mbx2=%x, mbx3=%x"
Meaning: Driver received an async ISP system error event from the firmware. Additional information follows the message (that is, mailbox values from the firmware).

Message: "scsi(%d): Configuration change detected: value %d."
Meaning: Driver received a change in connection async event from the firmware. Additional information follows the message (that is, mailbox 1 value from the firmware).

Message: "scsi(%d): Port database changed"
Meaning: Driver received a port database async event from the firmware.

Message: "scsi(%d): RSCN,..."
Meaning: Driver received a registered state change notification (RSCN) async event from the firmware. Additional information follows the message (that is, mailbox values from the firmware).

Message: "%s: Can't find adapter for host number %d\n"
Meaning: Indicates that the read from /proc/scsi/qla2X00 did not specify the correct adapter host number. %s indicates the function name.

Message: "scsi(%d): Cannot get topology - retrying"
Meaning: Firmware return status indicating it is busy.

Message: "%s(): **** SP->ref_count not zero\n"
Meaning: Indicates a coding error. %s is the function name.

Initialization messages and meanings (cont.)

Message: "qla_cmd_timeout: State indicates it is with ISP, But not in active array"
Meaning: Indicates a coding error. %s is the function.

Message: "cmd_timeout: LOST command state = 0x%x\n"
Meaning: Indicates the command is in an undefined state. 0x%x indicates the state number.

Message: "qla2x00: Status Entry invalid handle"
Meaning: Driver detected an invalid entry in the ISP response queue from the firmware. %x indicates the queue index.

Message: "%s(): **** CMD derives a NULL TGT_Q\n"
Meaning: Indicates the command does not point to an OS target.

Message: "scsi(%ld:%d:%d:%d): DEVICE RESET ISSUED.\n"
Meaning: Indicates a device reset is being issued to (host:bus:target:lun).

Message: "scsi(%ld:%d:%d:%d): LOOP RESET ISSUED.\n"
Meaning: Indicates a loop reset is being issued to (host:bus:target:lun).

Message: "%s(): **** CMD derives a NULL HA\n" or "%s(): **** CMD derives a NULL search HA\n"
Meaning: Indicates the command does not point to the adapter structure.

Message: "scsi(%ld:%d:%d:%d): now issue ADAPTER RESET.\n"
Meaning: Indicates an adapter reset is being issued to (host:bus:target:lun).

Message: "scsi(%d): Unknown status detected %x-%x"
Meaning: Indicates the status returned from the firmware is not supported. %x-%x is the completion-SCSI statuses.

Message: "scsi(%ld:%d:%d:%d): Enabled tagged queuing, queue depth %d.\n"
Meaning: Indicates the queue depth for the (host:bus:target:lun).

Message: "PCI cache line size set incorrectly (%d bytes) by BIOS/FW,"
Meaning: Indicates a correction in the cache size. %d is the cache size.

Message: "scsi(%d): Cable is unplugged..."
Meaning: Indicates the firmware state is in LOSS OF SYNC; therefore, the cable must be missing.

Initialization messages and meanings (cont.)

Message: "qla2x00: Performing ISP error recovery - ha=%p."
Meaning: Indicates the driver has started performing an adapter reset.

Message: "qla2x00_abort_isp(%d): **** FAILED ****"
Meaning: Indicates the driver failed performing an adapter reset.

Message: "%s(%ld): RISC paused, dumping HCCR (%x) and schedule an ISP abort (big-hammer)\n"
Meaning: Indicates the driver has detected the RISC in the pause state.

Message: "scsi(%ld): Mid-layer underflow detected (%x of %x bytes) wanted %x bytes...returning DID_ERROR status!\n"
Meaning: Indicates an underflow was detected.

Message: "%s(): Ran out of paths - pid %d"
Meaning: Indicates there are no more paths to try for the request. %s is the function name and %d is the mid-level processor identifier (PID).

Message: "WARNING %s(%d):ERROR Get host loop ID"
Meaning: Firmware failed to return the adapter loop ID.

Message: "WARNING qla2x00: couldn't register with scsi layer\n"
Meaning: Indicates the driver could not register with the SCSI layer, usually because it could not allocate the memory required for the adapter.

Message: "WARNING scsi(%d): [ERROR] Failed to allocate memory for adapter\n"
Meaning: Indicates the driver could not allocate all the kernel memory it needed.

Message: "WARNING qla2x00: Failed to initialize adapter\n"
Meaning: Indicates that a previously occurring error is preventing the adapter instance from initializing normally.

Message: "WARNING scsi%d: Failed to register resources.\n"
Meaning: Indicates the driver could not register with the kernel.

Message: "WARNING qla2x00: Failed to reserve interrupt %d already in use\n"
Meaning: Indicates the driver could not register for the interrupt IRQ because another driver is using it.
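Because the messages in these tables are given as printf-style templates, one way to look up a real log line is to turn each template into a regular expression. The sketch below is only an illustration: it handles just the specifiers that appear in these tables, the catalog holds three entries copied from the slides, and `SPECIFIERS`, `template_to_regex`, `CATALOG`, and `identify` are hypothetical names. As noted earlier, actual messages may differ between driver versions.

```python
import re

# Regex fragments for the printf specifiers used in the message tables.
SPECIFIERS = {
    "%ld": r"-?\d+",
    "%d": r"-?\d+",
    "%x": r"[0-9a-fA-F]+",
    "%p": r"(?:0x)?[0-9a-fA-F]+",
    "%s": r".+?",
}


def template_to_regex(template):
    """Turn a printf-style template into a compiled regex with capture groups."""
    parts = re.split(r"(%ld|%d|%x|%p|%s)", template)
    pattern = "".join(
        f"({SPECIFIERS[part]})" if part in SPECIFIERS else re.escape(part)
        for part in parts
    )
    return re.compile(pattern)


# Three templates and meanings taken from the tables in this deck.
CATALOG = {
    "scsi(%d): Cable is unplugged...":
        "Indicates the firmware state is in LOSS OF SYNC; "
        "therefore, the cable must be missing.",
    "scsi(%d): Port database changed":
        "Driver received a port database async event from the firmware.",
    "qla2x00_abort_isp(%d): **** FAILED ****":
        "Indicates the driver failed performing an adapter reset.",
}


def identify(line, catalog=CATALOG):
    """Return (meaning, captured_values) for the first matching template."""
    for template, meaning in catalog.items():
        match = template_to_regex(template).search(line)
        if match:
            return meaning, match.groups()
    return None


if __name__ == "__main__":
    log_line = "vmkernel: 0:00:21:26.285 cpu0:139) <6>scsi(0): Cable is unplugged..."
    print(identify(log_line))
```

Captured values come back as strings in template order, so the adapter number in the example is returned as "0".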

Initialization messages and meanings (cont.)

Message: "WARNING qla2x00: ISP Request Transfer Error"
Meaning: Driver received a Request Transfer Error async event from the firmware.

Message: "WARNING qla2100: ISP Response Transfer Error"
Meaning: Driver received a Response Transfer Error asynchronous event from the firmware.

Message: "WARNING Error entry invalid handle"
Meaning: Driver detected an invalid entry in the ISP response queue from the firmware. This error will cause an ISP reset to occur.

Message: "WARNING scsi%d: MS entry - invalid handle"
Meaning: Driver detected a management server command timeout.

Log messages valid only for ISP82xx

Message: qlcnic 0000:0d:00.1: PEG_HALT_STATUS1: 0x0, PEG_HALT_STATUS2: 0x0.
Meaning: Indicates that there has been a peg fault. A driver reset should follow this.

Message: qla2xxx 0000:0d:00.7: HW State: FAILED
qla2xxx 0000:0d:00.7: Disabling the board
Meaning: Firmware has fatally failed and the board will now be disabled.

Message: qla2xxx 0000:0d:00.7: qla2xxx: RESET TIMEOUT! drv_state= 0x4 drv_active=0x6
Meaning: The timeout for the successful completion of the reset of the firmware has occurred.

Message: qla2xxx: Initialization TIMEOUT!
Meaning: The timeout for the successful completion of the initialization of the firmware has occurred.

Message: qla2xxx 0000:0d:00.7: HW State: QUIESCENT
Meaning: Firmware has been put into the quiescent state.

Message: qla2xxx 0000:0d:00.7: HW State: READY
Meaning: Firmware has been initialized properly and is now ready to be used.

Message: qla2xxx 0000:0d:00.7: HW State: INITIALIZING
Meaning: The hardware state has changed to initializing and the device is now being initialized. If properly initialized, the device state should change to READY.

Message: qla2xxx 0000:0d:00.7: qla2xxx: QUIESCENT TIMEOUT! drv_state= 0x4 drv_active=0x6
Meaning: The timeout for the firmware to be in the quiescent mode has occurred.

Message: qla2xxx 0000:0d:00.7: HW State: NEED RESET
qla2xxx 0000:0d:00.7: qla82xx_abort_isp(4): reset_owner is 0x7
Meaning: The driver needs to reset the hardware, and the owner of the reset operation is 0x7.