
Sx WGRev 0.2

Sx Issue Triage and Debug Steps

Version 0.2

12/05/2014

Table of Contents
1. Sx Issue- Debug/Triage BKM
1.1 Scope
1.2 Target audience
1.2.1 Details
1.3 Requirements
1.3.1 Hardware requirements
1.3.2 Software requirements
1.4 Description
1.5 System reset / shutdown unexpectedly during cycling
1.6 Requirements
1.6.1 Hardware requirement
1.6.2 Software requirements
1.7 Description
1.7.1 Examples on collecting PMC log is shown below
1.8 Collect PMC log using RW tool – as explained below
1.8.1 Requirements
1.8.2 Description
1.9 System hangs in OS Phase
1.10 Requirements
1.10.1 Hardware requirement
1.10.2 Software requirements
1.11 Description
1.12 BCDEDIT
1.13 How to analyze crash dump
1.14 OS Crash (BSOD) during cycling
1.15 System hangs in BIOS Phase (POST code hangs) during cycling
1.16 Requirements
1.16.1 Hardware requirement
1.16.2 Software requirements
1.17 Description
1.18 The hardware connections snap
1.19 Teraterm Configuration
1.20 The cause of Sx failure viz., ME
1.21 Requirements
1.21.1 Hardware requirement
1.21.2 Software requirements
1.22 Description
1.23 Point of Contact

Revision Sheet

Release No. | Date | Revision Description
Rev. 0.2 | 12/05/2014 | Revised Document

Sx Issue- Debug/Triage BKM

Scope

The purpose of this document is to explain the basic triage and debug that must be performed on an Sx failure before filing a sighting. This is first-level triage and debug, so it may not cover every issue scenario.

Target audience

Sx Validation and debug team

1.2.1 Details

· First-level debug steps for the different Sx issues we come across are explained in detail below.

· System hangs with CATERR during cycling

· Collect the MCDump using ITP. Commands to collect the MCDump:

· itp.unlock()

· import sys

· sys.path.append(r"\\hsw-tb\hsw\itp_scripts")

· import BdwMCDump [Note: this is for BDW CPU]

Collect the AFD dump as explained below:

Requirements

1.3.1 Hardware requirements

· ITP box

· 5V/2.5A adaptor

1.3.2 Software requirements

· Platform debug tool kit

· DFx Abstraction layer

· Python console

Description

Step1: Connect the ITP to the board.

Step2: Launch the configuration console and select the appropriate target.

Step3: Start the Intel DAL Python Console.

Step4: Make sure the ITP connection to the target is established by typing 'itp.devicelist'.

Step5: Unlock the ITP with the 'itp.unlock()' command.

Step6: Open the Platform debug tool kit and navigate to State Freeze and dump under View.

Step7: Select Hang Base as the dump type and click Run.

Note: This triggers AFD collection; progress can be seen in the Message log window.

Step8: The dump file is saved in the output folder specified on the Output tab. Once the AFD has been generated, the 'Run' tab changes to 'Stop'.

System reset / shutdown unexpectedly during cycling

Collect the PMC log file using Stardebug – refer to the BKM below.

Requirements

1.6.1 Hardware requirement

· UTAG

1.6.2 Software requirements

· Stardebug application

Description

Below are the steps to establish a connection between Stardebug and the PCH.

Step1: Connect Stardebug to the rear XDP port.

Step2: Download the latest version of Stardebug from this link: https://houston.fm.intel.com/wiki/doku.php?id=swtwiki:userdocs:tools:stardebug:start#download

Step3: Locate Stardebug.exe in the extracted folder.

Step4: On launching Stardebug.exe, the debug blocks should be displayed. This indicates that the connection between Stardebug and the PCH is established.

1.7.1 Examples on collecting PMC log is shown below

· Once the PCH is connected to Stardebug (Step4 above), log collection can proceed.

· Before starting any log collection, make sure the script file that creates the log is in the same folder as stardebug.exe.

· Switch to the 'dft' tab by typing the 'sw dft' command.

· To start log collection, enter 'run LptPmDump1.5M.lua'.

· A text file, Pmdump.text, is created in the same folder; this is the PMC log file.

Collect PMC log using RW tool – as explained below

1.8.1 Requirements

Software requirements

· RWEverything

1.8.2 Description

Step1: Launch RWEverything and click the Memory icon.

 

 

Step2: Write 0x03030002 to 0xFED1F320.

Step3: Read the DWORD from 0xFED1F338.

Step4: Decode the PMC value using the bit tables below:

Reg 0x303 (bit 0 to bit 7):

· BIT 7: LTRESET# With Policy 1 (LTRST_POL1): This bit is set to '1' by hardware when a global reset is triggered by an LTRESET# assertion with LT_E2STS.LT_RESET_POLICY = 1.

· BIT 6: ME-Initiated Global Reset (ME_GBL): This bit is set to '1' by hardware when a global reset is triggered by an ME FW write of 1's to both GENCTL."ME Partition Reset" and GENCTL."ME-Initiated Host Reset with Power Cycle" in the same write cycle (this is ME FW's method of requesting a global reset).

· BIT 5: CPU Thermal Trip (CPU_TRIP): This bit is set to '1' by hardware when a global reset is triggered by a CPU thermal trip event (i.e. an assertion of the THRMTRIP# pin).

· BIT 4: ME-Initiated Power Button Override (ME_PBO): This bit is set to '1' by hardware when a global reset is triggered by an ME FW write of '1' to GENCTL."ME-Initiated Power Button Override".

· BIT 3: ICH Catastrophic Temperature Event (ICH_CAT_TMP): This bit is set to '1' by hardware when a global reset is triggered by a catastrophic temperature event from the ICH internal thermal sensor.

· BIT 2: PMC SUS RAM Uncorrectable Error (PMC_UNC_ERR): This bit is set to '1' by hardware when a global reset is triggered due to an uncorrectable parity error on a data read from one of the PMC SUS well register files.

· BIT 1: Power Button Override (PB_OVR): This bit is set to '1' by hardware when a global reset is triggered by a power button override (i.e. an assertion of the PWRBTN# pin for 5 seconds).

· BIT 0: SUS Well Power Failure Status (SUSFLR_STS): This bit is set to '1' by hardware when a global reset is triggered by loss of SUS well power. This includes DeepSx entry and G3.

Reg 0x304 (bit 8 to bit 15):

· BIT 5: AS Well Power Failure (ASW_FLR): This bit is set to '1' by hardware when a global reset is triggered by an unexpected loss of ASW power (i.e. a de-assertion of APWROK at an unexpected time).

· BIT 4: SYS_PWROK Failure (SYSPWR_FLR): This bit is set to '1' by hardware when a global reset is triggered by an unexpected loss of SYS_PWROK. FW arms this global reset source via GBLRST_CTL.EN_SYSPWR_FLR.

· BIT 3: PCH_PWROK Failure (PCHPWR_FLR): This bit is set to '1' by hardware when a global reset is triggered by an unexpected loss of PCH_PWROK. FW arms this global reset source via GBLRST_CTL.EN_PCHPWR_FLR.

· BIT 2: PMC Firmware Global Reset (PMC_FW): This bit is set to '1' by hardware when a global reset is triggered by a request from PMC firmware (i.e. a write of '1' to the GBLRST_CTL.TRIG_GBL bit).

· BIT 1: ME Firmware Watchdog Timer (ME_WDT): This bit is set to '1' by hardware when a global reset is triggered by the second expiration of the ME firmware watchdog timer.

· BIT 0: PMC Firmware Watchdog Timer (PMC_WDT): This bit is set to '1' by hardware when a global reset is triggered by the second expiration of the PMC firmware watchdog timer.

Reg 0x305 (bit 16 to bit 23):

· BIT 4: Over-Clocking WDT Expiration In ICC Survivability Mode (OC_WDT_EXP_ICCSURV): This bit is set to '1' by hardware when a global reset is triggered by the expiration of the over-clocking watchdog timer while running in a mode that has ICC survivability impact (OC_WDT_ICCSURV=1).

· BIT 3: Over-Clocking WDT Expiration In Non-ICC Survivability Mode (OC_WDT_EXP_NO_ICCSURV): This bit is set to '1' by hardware when a global reset is triggered by the expiration of the over-clocking watchdog timer while running in a mode that does not have ICC survivability impact (OC_WDT_ICCSURV=0).

· BIT 2: ADR GPIO Reset (ADR_GPIO_RST): This bit is set to '1' by hardware when a global reset is triggered by the assertion of the GPIO assigned to ADR.

· BIT 1: ME HW Uncorrectable Error (ME_UNCOR_ERR): This bit is set to '1' by hardware when a global reset is triggered by ME hardware due to the detection of an uncorrectable ECC or parity error on a data read from one of its SRAMs.

· BIT 0: CPU Thermal Runaway Watchdog Timer (CPU_THRM_WDT): This bit is set to '1' by hardware when a global reset is triggered by the expiration of the CPU Thermal Runaway Watchdog Timer.

· Collect the Windows Event Viewer (Eventvwr) log file

· BKM: Run EventVwr and open Windows Logs -> System
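The DWORD read in Step3 of the RW tool procedure can be decoded by hand from the bit tables above, or with a small lookup such as the sketch below. The helper name decode_reset_causes is hypothetical and not part of any tool. The sketch assumes, as the table headings state, that Reg 0x303 occupies bits 0-7, Reg 0x304 bits 8-15 and Reg 0x305 bits 16-23 of the value read.

```python
# Bit names copied from the tables above. Placement within the DWORD follows
# the "bit 0 to bit 7", "bit 8 to bit 15" and "bit 16 to bit 23" headings
# (an assumption taken from this document, not verified against hardware).
RESET_CAUSE_BITS = {
    # Reg 0x303
    0: "SUSFLR_STS", 1: "PB_OVR", 2: "PMC_UNC_ERR", 3: "ICH_CAT_TMP",
    4: "ME_PBO", 5: "CPU_TRIP", 6: "ME_GBL", 7: "LTRST_POL1",
    # Reg 0x304
    8: "PMC_WDT", 9: "ME_WDT", 10: "PMC_FW", 11: "PCHPWR_FLR",
    12: "SYSPWR_FLR", 13: "ASW_FLR",
    # Reg 0x305
    16: "CPU_THRM_WDT", 17: "ME_UNCOR_ERR", 18: "ADR_GPIO_RST",
    19: "OC_WDT_EXP_NO_ICCSURV", 20: "OC_WDT_EXP_ICCSURV",
}


def decode_reset_causes(dword):
    """Return the names of all documented reset-cause bits set in dword."""
    return [
        name
        for bit, name in sorted(RESET_CAUSE_BITS.items())
        if (dword >> bit) & 1
    ]


if __name__ == "__main__":
    # Example value only; substitute the DWORD read from 0xFED1F338.
    print(decode_reset_causes(0x00000002))  # ['PB_OVR']
```

Bits that are not listed in the tables are ignored, so an empty result means none of the documented causes is flagged.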

System hangs in OS Phase

(System control has transferred from BIOS to OS) during cycling. Examples: blank screen, frozen Windows display.

Connect WinDbg and analyze the current state of the system – refer below for the BKM:

Requirements

1.10.1 Hardware requirement

· Ajay's USB debug cable

1.10.2 Software requirements

· Windbg setup (x64 Preferable)

Description

Step1: Install the USB-to-USB converter driver on both the host and the target machine (driver copied here: \\akasha1\PSPV-Tools\windbg-driver). WinBlue OS has an inbox driver for the cable and installs it automatically.

Step2: Using the USBVIEW tool, find USB port1 on the target machine and connect the debug cable to it (the debug port is usually port1).

Step3: Press F2 while booting to enter BIOS setup, then change the settings as described in Step4 and Step5.

Step4: Go to Intel Advanced Menu -> PCH-IO Configuration -> USB Configuration and set XHCI Mode to Manual.

Step5: Set "Route USB 2.0 pins to which HC?" to Route Per-Pin, then route all pins to XHCI except pin#1 and pin#11, which should stay routed to EHCI.

Step6: After setting the BIOS options, boot into the OS.

BCDEDIT

From an elevated command prompt, run the commands below:

· bcdedit /debug on

· bcdedit /dbgsettings usb targetname:NAME  (replace NAME with any name)

· bcdedit /set {dbgsettings} busparams 0.29.0  (bus, device and function of the USB root controller)

· Restart the target system. Open WinDbg on the host machine and enter the target name under the USB tab (File -> Kernel Debugging -> USB).

· The target will now start sending debug messages to the kernel debug window.
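When several targets need the same setup, the three bcdedit commands can be scripted. The following is a minimal sketch, assuming Python is available on the target and the script is run from an elevated prompt. The function names are hypothetical; the bcdedit arguments are the ones listed above.

```python
import subprocess


def build_bcdedit_commands(target_name, busparams="0.29.0"):
    """Return the bcdedit command lines for USB kernel debugging.

    busparams defaults to the bus.device.function value used in this
    document; change it if the USB root controller differs.
    """
    return [
        ["bcdedit", "/debug", "on"],
        ["bcdedit", "/dbgsettings", "usb", "targetname:" + target_name],
        ["bcdedit", "/set", "{dbgsettings}", "busparams", busparams],
    ]


def apply_debug_settings(target_name, busparams="0.29.0"):
    """Run each command in order and stop at the first failure."""
    for argv in build_bcdedit_commands(target_name, busparams):
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # "sxdebug" is an example target name only.
    for argv in build_bcdedit_commands("sxdebug"):
        print(" ".join(argv))
```

Keeping the command construction separate from execution makes it possible to print and review the commands before running apply_debug_settings on a real target.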

 

How to analyze crash dump

Step1: Navigate to File > Open Crash Dump, then select the crash dump to be analyzed.

Step2: In the command bar, type '!analyze -v'. This is the command that analyzes the crash dump.

OS Crash (BSOD) during cycling

· Collect the dump file and analyze it.

· If no dump was created, connect WinDbg and take a dump, as explained above.

System hangs in BIOS Phase (POST code hangs) during cycling

· Collect the BIOS serial log – refer below for the BKM

Requirements

1.16.1 Hardware requirement

· RS232 Null-Modem cable

· RVP with debug BIOS flashed.

1.16.2 Software requirements

· Any UART terminal utility such as Putty or Teraterm

Description

Step1: Flash the debug BIOS, which can be downloaded from client download [ex: HSW_LP_LPT_V106.3_Debug.rom].

Step2: Enter BIOS setup using F2 -> Intel Advanced Menu -> Debug Configuration -> Serial Debug messages and set the value as per your requirement.

Step3: Connect the Null-Modem cable between the host and the RVP (a USB-to-Serial adapter may be needed on the host side).

Step4: Install a terminal program (Putty, Teraterm, or Termite) on the host with the following settings:

· Port: (Look in Device Manager/Ports)

· Baud rate: 115200

· Data: 8 bit

· Parity: none

· Stop: 1 bit

· Flow control: none

Step5: Open Putty, select Serial, and choose the COM port shown in Device Manager.

Step6: Boot the system (it will start sending debug messages to Putty).

Step7: Stop logging after the system boots to the OS.

A sample BIOS serial log: BIOS_Serial dump.txt
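For unattended cycling runs, the same serial settings can be used from a script instead of a terminal program. The sketch below assumes the third-party pyserial package is installed on the host. The function name, the idle timeout and the example port and file names are illustrative choices, not part of this BKM.

```python
# Settings from Step4 above, expressed as pyserial keyword arguments.
SERIAL_SETTINGS = {
    "baudrate": 115200,
    "bytesize": 8,     # 8 data bits
    "parity": "N",     # no parity
    "stopbits": 1,
    "xonxoff": False,  # no software flow control
    "rtscts": False,   # no hardware flow control
}


def capture_bios_log(port, out_path, idle_timeout_s=30.0):
    """Write everything received on port to out_path.

    Capture stops once no data has arrived for idle_timeout_s seconds.
    """
    import serial  # pyserial; imported here so the settings stay importable

    with serial.Serial(port, timeout=idle_timeout_s, **SERIAL_SETTINGS) as link:
        with open(out_path, "wb") as log:
            while True:
                chunk = link.read(4096)
                if not chunk:  # read timed out with no data
                    break
                log.write(chunk)


if __name__ == "__main__":
    # "COM3" is an example; use the port shown in Device Manager.
    capture_bios_log("COM3", "bios_serial_log.txt")
```

An idle timeout is used to end the capture because the serial stream itself carries no end marker; pick a value longer than the longest quiet period expected during POST.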

 

The hardware connections snap

 

Teraterm Configuration

The cause of Sx failure viz., ME

Collect the ME debug log – refer to the BKM below.

Requirements

1.21.1 Hardware requirement

· Dediprog hardware

1.21.2 Software requirements

· Dediprog flash utility

· FITC tool

Description

Step1: Take the SPI BIOS file.

Step2: Install the FITC tool.

Step3: Browse for the full BIOS image (16MB) file, modify it with the settings below, build a new image and flash it on your target system.

Step4: Make sure it is not a LAN-less image.

Step5: In FITC, under ME Region -> Configuration -> ME Debug Event Service, set as shown below:

Step6: In Event Filters, make sure group 87 has the value 0x1 (other groups can be left as-is).

Step7: To record the log, connect another computer to the same LAN as the DUT (note: the DUT must be connected using the built-in LAN, not an external PCIe card). On that computer, run PDA (Platform Debug Analyzer) or Wireshark to record all the packets sent. You should see many packets (hundreds or more) sent on UDP port 64507.

Step8: Reproduce your issue on the target.

Step9: Go to \\akasha1\temp\nramalin\Tools and install PDA.

Step10: Connect a LAN cable from the target to the host machine and ensure ping is successful.

Step11: Launch the PDA app and start capturing the log.
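Before starting a full capture, it can be useful to confirm that packets are arriving on UDP port 64507 at all. The sketch below is a hypothetical sanity check only and does not replace PDA or Wireshark. It assumes the packets are broadcast or addressed to the capturing host; if they are not, a packet sniffer is needed instead.

```python
import socket
import time

ME_DEBUG_UDP_PORT = 64507  # port named in Step7 above


def count_udp_packets(port=ME_DEBUG_UDP_PORT, duration_s=5.0, bind_addr=""):
    """Count UDP datagrams received on port during duration_s seconds."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((bind_addr, port))
    deadline = time.monotonic() + duration_s
    count = 0
    try:
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            sock.settimeout(remaining)
            try:
                sock.recvfrom(65535)
            except socket.timeout:
                break
            count += 1
    finally:
        sock.close()
    return count


if __name__ == "__main__":
    seen = count_udp_packets()
    print("datagrams received on port", ME_DEBUG_UDP_PORT, ":", seen)
```

A count of zero does not prove the DUT is silent; it may only mean the packets are not delivered to this host's socket layer.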

Checkpoints before proceeding with a sighting:

· Make sure to use the latest BKC stack from here: http://pspv.intel.com/sites/PSIV-BKC/Reports/SitePages/Home.aspx

· Make sure the system has all mandatory rework as applicable. Use this link to find the applicable rework: https://sharepoint.gar.ith.intel.com/sites/RVP_CrescentBay/CRB-RVP-CSF/SitePages/Home.aspx?RootFolder=%2Fsites%2FRVP%5FCrescentBay%2FCRB%2DRVP%2DCSF%2FShared%20Documents%2FBroadwell%20U%20%28ULT%29RVP&FolderCTID=0x012000FFD2A2D256C7284E9CA6E17A025EA5E5&View={2DC03B06-DDA9-4EAA-ABA6-CA8A28FBF446}

· Make sure to use the recommended BIOS settings. Refer to this link for the recommended BIOS settings: http://bkclc1.amr.corp.intel.com:7076/api/service/1975925211/447894345.zip

· Verify whether a similar issue has already been reported in Sx_WG. Refer to this link for the known issue list: http://pspv.intel.com/sites/pspv/broadwell/Lists/Blocking%20Issues/AllItems.aspx?ShowInGrid=True&View={6D9F3299-A5E6-4D20-B7B3-1FD542C2D19F}&InitialTabId=Ribbon.List&VisibilityContext=WSSTabPersistence

Point of Contact

Please mail [email protected] for feedback/query.
