reliability, availability and serviceability on linux

Sep 16, 2013

Open Source Group – Silicon Valley © 2013 SAMSUNG Electronics Co.

Not to be used for commercial purposes without getting permission

All information, opinions and ideas herein are exclusively the author's own opinion

Reliability, Availability and Serviceability on Linux

Mauro Carvalho Chehab

Linux Kernel Expert

Samsung Open Source Group


What is RAS (1)

● Used originally by IBM to measure mainframe robustness

● Reliability

– Probability that a system will produce correct outputs

– Generally measured as Mean Time Between Failures (MTBF)

– Enhanced by features that help to avoid, detect and repair hardware faults

● Availability

– Probability that a system is operational at a given time

– Generally measured as the percentage of time the system is up over a given period

● Examples:

– 99.9% (“three nines”) means 8.76 hours unavailable per year

– 99.999% (“five nines”) means 5.26 minutes of downtime per year

– Minimal down-time for service and repair.

– Detect and correct hardware faults as opposed to detect and repair
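The availability figures above follow from simple arithmetic. A minimal sketch, assuming a 365-day year; the function name is illustrative and not part of any tool mentioned in these slides:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # non-leap year (assumption)


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, allowed by an availability level."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


for level in (99.0, 99.9, 99.999):
    minutes = downtime_minutes_per_year(level)
    print(f"{level}% -> {minutes:.2f} minutes/year ({minutes / 60:.2f} hours/year)")
```

Each extra "nine" divides the allowed downtime by ten.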


What is RAS (2)

● Serviceability (or maintainability)

– Simplicity and speed with which a system can be repaired or maintained

– Generally measured as Mean Time To Repair (MTTR)

– Can be increased with redundant parts, and higher support grade (24/7/365)
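The reliability and serviceability metrics can be combined into an availability estimate. A minimal sketch, assuming the common steady-state relation availability = MTBF / (MTBF + MTTR), where MTTR is the mean time to repair; the figures used are hypothetical, not measurements:

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a system is operational, given MTBF and MTTR in hours."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical server: fails every 10,000 hours on average, takes 4 hours to repair.
availability = steady_state_availability(mtbf_hours=10_000, mttr_hours=4)
print(f"availability: {availability * 100:.3f}%")
```

Under this relation, halving the repair time improves availability about as much as doubling the time between failures.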


Improving RAS (1)

● In order to improve RAS, both IT services and hardware require improvements

● Examples of hardware measures

– CPU – detect errors in instruction execution and in the L1/L2/L3 caches;

– Memory – add error correction logic (ECC) to detect and correct errors;

– I/O – add CRC checksums for transferred data (PCIe has such a feature);

– Storage – RAID, journal file systems, checksums;

– Power/cooling – component duplication, over-design, surge protector, UPS

– System – hot swap of components, predictive failure analysis, partitioning of system components, virtual machines running on redundant servers, clustering, dynamic software update, independent CPU for RAS

– RAS servers have features to hot add/replace/remove I/O cards, which reduce down-time when adding new hardware or when replacing failing I/O cards based on PCIe AER reports

– Memory mirroring and active/active and active/standby configurations that reduce down-time


Improving RAS (2)

● Examples of IT measures

– 24x7x365 days on-site support; low latency support from their vendors

● Usage of Virtual Machines

– VM migration with minimal application down-time

– Cloud computing

● Predictive analysis

– Hardware/OS should provide data to detect systems/components degradation

– It should have tools to analyze and (hot)replace those degraded components


RAS features on Linux (1)

● Storage error reporting has been supported since early versions, as RAID/SAS/SAN controllers and their drivers offer measurements.

– There are also userspace tools to manage it.

● Machine Check Architecture – MCA

– CPU errors are provided on x86 machines since Pentium 4

– Depending on the processor, it can also provide memory and bus errors

– Kernel implements it at mcelog subsystem

● Fatal errors produce panic() and are reported at console;

– At userspace, the mcelog tool reads the corrected/non-fatal error data from time to time and reports at console

● The Kernel-userspace API is obscure: userspace receives a dump of a series of registers;

– Decoding those errors is CPU-specific;

– Kernel decodes those errors for fatal errors;

– The userspace tool decodes those errors for non-fatal/corrected ones

– Errors are reported also via a kernel trace event;


RAS features on Linux (2)

● EDAC (Error Detection and Correction) subsystem

– Provides a way to report errors detected by memory controllers to userspace;

– Some (old) drivers also report PCI errors via EDAC;

– Kernel decodes the error into the DIMM labels affected by an error;

● Errors are reported at console and via a kernel trace event

● The association between the memory architecture and the DIMMs is done via some files that are loaded by a userspace tool (edac-utils or rasdaemon)

– Most drivers talk directly with the memory controller (MC)

● That provides a more reliable error report

● BIOS data is not very reliable: in several cases, the same BIOS is used on different machines

– The DMI BIOS tables may contain the wrong DIMM labels

● Race conditions may happen on BIOSes that also collect error data

– There's one driver (ghes_edac) on Kernel 3.9+ that gets errors from the BIOS

● “firmware first” mode: the BIOS tells the OS not to talk with the MC directly
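Besides the console and trace events, EDAC drivers expose per-controller counters in sysfs. A minimal sketch that reads them, assuming the usual /sys/devices/system/edac/mc layout with ce_count and ue_count files; the function name is illustrative, and the files are only present when an EDAC driver is loaded:

```python
from pathlib import Path

# Usual sysfs location of EDAC memory controllers (assumption: an EDAC driver is loaded).
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counters(root: Path = EDAC_MC_ROOT) -> dict:
    """Map each memory controller (mc0, mc1, ...) to its corrected/uncorrected counts."""
    counters = {}
    if not root.is_dir():
        return counters  # no EDAC driver loaded, or not a Linux system
    for mc_dir in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc_dir / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # counter missing or unreadable
        counters[mc_dir.name] = entry
    return counters


if __name__ == "__main__":
    for mc, counts in read_edac_counters().items():
        print(mc, counts)
```

A steadily growing ce_count on one controller is the kind of data that predictive analysis can use before an uncorrected error occurs.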


RAS features on Linux (3)

● PCIe AER (Advanced Error Reporting)

– Some PCIe hardware provides ways to get AER error reports on the OS;

– AER logs data at console;

– It also reports errors via a kernel trace event

● Userspace tools:

– mcelog – collects and decodes MCA error events on x86;

– edac-utils – fills DIMM labels data and summarizes memory errors;

– rasdaemon

● Collects errors via kernel trace events from several sources:

– MCA, EDAC and PCIe AER events

● Fills DIMM labels data

● Stores error data into a persistent database (sqlite3)

● Allows querying/summarizing errors later

● Uses new resources available on Kernel 3.10
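Storing events in a database is what makes later queries and summaries cheap. A minimal sketch of such a query, using an in-memory database; the table layout (mc_event with label, err_type and err_count columns) and the sample rows are simplified assumptions for illustration, not the exact schema or data written by rasdaemon:

```python
import sqlite3


def summarize_by_label(conn: sqlite3.Connection) -> list:
    """Return (label, error type, total errors) rows, one per DIMM label and type."""
    cursor = conn.execute(
        "SELECT label, err_type, SUM(err_count) "
        "FROM mc_event GROUP BY label, err_type ORDER BY label, err_type"
    )
    return [(label, err_type, total) for label, err_type, total in cursor]


# Hypothetical, simplified table and rows (assumption, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mc_event ("
    "id INTEGER PRIMARY KEY, label TEXT, err_type TEXT, err_count INTEGER)"
)
conn.executemany(
    "INSERT INTO mc_event (label, err_type, err_count) VALUES (?, ?, ?)",
    [("DIMM_A1", "Corrected", 1), ("DIMM_B1", "Corrected", 1), ("DIMM_A1", "Corrected", 2)],
)

for label, err_type, total in summarize_by_label(conn):
    print(f"{err_type} on {label}: {total} error(s)")
```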


Typical D-RAM implementation

Source: http://lwn.net/Articles/250967/

A DIMM can have 1, 2 or 4 ranks

[Figure: DRAM organization. A rank (Rank 1) contains banks 0 to 7; each bank (Bank n) is a DRAM memory matrix addressed through a row decoder and a column decoder]


Memory Arrangements on PC

– IBM PC original architecture

– Classic server (most common) architecture

– AMD-64 and newer Intel CPU architecture (Nehalem, Sandy Bridge and upcoming)

Images taken from: http://lwn.net/Articles/250967/

NOTE: Nehalem-EX has an additional buffer chip between the RAM and the CPU, called Intel SMB (Intel Scalable Memory Buffer)

● It means that the CPU memory controller doesn't see the DIMMs directly

● This is to improve performance when there are lots of CPU sockets (-EX machines)

● Only the BIOS knows how the memory is organized on Nehalem-EX



Evolution of RAS on Kernel (1)

● Before Kernel 2.6.32

– EDAC reports errors via dmesg

– On EDAC, memory controllers are assumed to be Rank-based

– mcelog reports error via its own interface only

● Kernel 2.6.32

– Added kernel trace events on MCE;

● Kernel 3.5

– EDAC/HERM patches added support for modern memory architectures

● Proper support for modern Intel CPUs/MCs (Intel systems from 2002 onwards)

– Memory controllers are DIMM-based

– Memory controllers can be grouped in branches (FB-DIMM)

● Added kernel trace events for EDAC;


Evolution of RAS on Kernel (2)

● Kernel 3.9

– Added firmware first EDAC driver (ghes_edac);

– Added trace events for PCIe AER;

● Kernel 3.10 brings a series of new features useful for RAS events tracing:

– Allows creating an independent tracing facility for each process that uses traces;

– Added blocking functionality to trace_pipe_raw;

– Added “uptime” clock reference for tracing events;

● While the rasdaemon tool works with kernels below 3.10, it is optimized to use those new features found on Kernel 3.10.
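A minimal sketch of how a process could get its own tracing facility, assuming the bullet above refers to the instances directory of the kernel tracing filesystem and that it is mounted at /sys/kernel/debug/tracing. The function name and the instance name are hypothetical, and this is not how rasdaemon itself is implemented:

```python
from pathlib import Path

# Usual debugfs mount point of the tracing filesystem (assumption; it may differ).
TRACING_ROOT = Path("/sys/kernel/debug/tracing")


def enable_event_in_instance(
    instance: str, system: str, event: str, tracing_root: Path = TRACING_ROOT
) -> Path:
    """Create a private tracing instance and enable one event inside it."""
    instance_dir = tracing_root / "instances" / instance
    # On a real tracing filesystem, creating this directory makes the kernel
    # populate it with its own buffers and events/ tree, independent of other users.
    instance_dir.mkdir(exist_ok=True)
    enable_file = instance_dir / "events" / system / event / "enable"
    enable_file.write_text("1")  # writing "1" turns the event on for this instance only
    return enable_file
```

Run as root, a call such as enable_event_in_instance("demo", "ras", "mc_event") would enable the memory controller event in a hypothetical instance named demo, whose per-CPU trace_pipe_raw files can then be read without disturbing other tracing users.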


Firmware First vs. Hardware First

● Hardware-first approach

– Errors come directly from the hardware

– BIOS doesn't handle it

● It is faster

● Can help to avoid long SMI interrupts

– Requires deep knowledge of the hardware

● Firmware-first approach

– BIOS and/or dedicated CPUs collect errors

– The OS doesn't need deep knowledge of the hardware

– BIOS can mask/group errors, apply proprietary algorithms, avoid spurious reports

– Unfortunately, the current ACPI API doesn't expose the memory slot label, which makes it harder for the system admin to use


RASDAEMON (1)

● It is a new tool

– Currently, provided on Fedora 18, Fedora 19 and rawhide

– Has:

● A daemon that waits for kernel trace events (rasdaemon)

● A tool to configure DIMMs and do RAS reports (ras-mc-ctl)

● Some contrib tools to test EDAC and to fake inject errors

– Example:

● Dell T620 with 2 Sandy Bridge-EP Xeon CPUs (E5-2670)

● 2 8GB dual-rank DIMMs (Samsung M393B1K70DH0-YK0)

● Driver: sb-edac

ras-mc-ctl --layout
       +-----------------------------------------------------------------------------------------------+
       |                      mc0                      |                      mc1                      |
       | channel0  | channel1  | channel2  | channel3  | channel0  | channel1  | channel2  | channel3  |
-------+-----------------------------------------------------------------------------------------------+
slot2: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
slot1: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
slot0: |  8192 MB  |     0 MB  |     0 MB  |     0 MB  |  8192 MB  |     0 MB  |     0 MB  |     0 MB  |
-------+-----------------------------------------------------------------------------------------------+


RASDAEMON (2)

$ ras-mc-ctl --print-labels
LOCATION               CONFIGURED LABEL   SYSFS CONTENTS
mc0 channel 0 slot 0   DIMM_A1            CPU_SrcID#0_Channel#0_DIMM#0
                       DIMM_A2            0:0:1 missing
                       DIMM_A3            0:0:2 missing
                       DIMM_A4            0:0:3 missing
                       DIMM_A5            0:1:0 missing
                       DIMM_A6            0:1:1 missing
                       DIMM_A7            0:1:2 missing
                       DIMM_A8            0:1:3 missing
                       DIMM_A9            0:2:0 missing
                       DIMM_A10           0:2:1 missing
                       DIMM_A11           0:2:2 missing
                       DIMM_A12           0:2:3 missing
mc1 channel 0 slot 0   DIMM_B1            CPU_SrcID#1_Channel#0_DIMM#0
                       DIMM_B2            1:0:1 missing
                       DIMM_B3            1:0:2 missing
                       DIMM_B4            1:0:3 missing
                       DIMM_B5            1:1:0 missing
                       DIMM_B6            1:1:1 missing
                       DIMM_B7            1:1:2 missing
                       DIMM_B8            1:1:3 missing
                       DIMM_B9            1:2:0 missing
                       DIMM_B10           1:2:1 missing
                       DIMM_B11           1:2:2 missing
                       DIMM_B12           1:2:3 missing

# ras-mc-ctl --register-labels


RASDAEMON (3)

$ util/ras-mc-ctl --print-labels
LOCATION               CONFIGURED LABEL   SYSFS CONTENTS
mc0 channel 0 slot 0   DIMM_A1            DIMM_A1
                       DIMM_A2            0:0:1 missing
                       DIMM_A3            0:0:2 missing
                       DIMM_A4            0:0:3 missing
                       DIMM_A5            0:1:0 missing
                       DIMM_A6            0:1:1 missing
                       DIMM_A7            0:1:2 missing
                       DIMM_A8            0:1:3 missing
                       DIMM_A9            0:2:0 missing
                       DIMM_A10           0:2:1 missing
                       DIMM_A11           0:2:2 missing
                       DIMM_A12           0:2:3 missing
mc1 channel 0 slot 0   DIMM_B1            DIMM_B1
                       DIMM_B2            1:0:1 missing
                       DIMM_B3            1:0:2 missing
                       DIMM_B4            1:0:3 missing
                       DIMM_B5            1:1:0 missing
                       DIMM_B6            1:1:1 missing
                       DIMM_B7            1:1:2 missing
                       DIMM_B8            1:1:3 missing
                       DIMM_B9            1:2:0 missing
                       DIMM_B10           1:2:1 missing
                       DIMM_B11           1:2:2 missing
                       DIMM_B12           1:2:3 missing


RASDAEMON (4)

# rasdaemon -r -f
overriding event (931) ras:mc_event with new print handler
rasdaemon: ras:mc_event event enabled
rasdaemon: Enabled event ras:mc_event
overriding event (847) ras:aer_event with new print handler
rasdaemon: ras:aer_event event enabled
rasdaemon: Enabled event ras:aer_event
overriding event (56) mce:mce_record with new print handler
rasdaemon: mce:mce_record event enabled
rasdaemon: Enabled event mce:mce_record
rasdaemon: Listening to events for cpus 0 to 31
Calling ras_mc_event_opendb()
rasdaemon: cpu 0: Recording events at /var/lib/rasdaemon/ras-mc_event.db
cpu 12:rasdaemon: mc_event store: 0x19c6968
rasdaemon: register inserted at db
 <...>-2742 [732433178] 2507.782000: mc_event: 2013-08-15 19:58:50 -0300 1 Corrected error: FAKE ERROR on DIMM_A1 (mc: 0 location: 0:0:0 grain: 7 for EDAC testing only)
cpu 12:rasdaemon: mc_event store: 0x19c6968
 <...>-2742 [732433178] 2507.864000: mc_event: 2013-08-15 19:58:50 -0300 1 Corrected error: FAKE ERROR on DIMM_B1 (mc: 1 location: 0:0:0 grain: 7 for EDAC testing only)
cpu 12:rasdaemon: mc_event store: 0x19c6968

# ras-mc-ctl --summary
Memory controller events summary:
    Corrected on DIMM Label(s): 'DIMM_A1' location: 0:0:0:0 errors: 1
    Corrected on DIMM Label(s): 'DIMM_B1' location: 1:0:0:0 errors: 1
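The relation between the mc_event lines printed by rasdaemon and the per-DIMM summary can be sketched with a small parser. This is an illustration only, not the code used by ras-mc-ctl: the regular expression is an assumption derived from the two log lines shown above, and prefixing the location with the memory controller number simply mirrors the summary output.

```python
import re
from collections import Counter

# Pattern derived from the sample log lines (assumption, not an official format).
MC_EVENT_RE = re.compile(
    r"mc_event: (?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4}) "
    r"(?P<count>\d+) (?P<severity>\w+) error: (?P<message>.*?) on (?P<label>\S+) "
    r"\(mc: (?P<mc>\d+) location: (?P<location>[\d:]+)"
)

SAMPLE_LINES = [
    " <...>-2742 [732433178] 2507.782000: mc_event: 2013-08-15 19:58:50 -0300 1 "
    "Corrected error: FAKE ERROR on DIMM_A1 (mc: 0 location: 0:0:0 grain: 7 "
    "for EDAC testing only)",
    " <...>-2742 [732433178] 2507.864000: mc_event: 2013-08-15 19:58:50 -0300 1 "
    "Corrected error: FAKE ERROR on DIMM_B1 (mc: 1 location: 0:0:0 grain: 7 "
    "for EDAC testing only)",
]


def summarize(lines) -> Counter:
    """Count errors per (severity, DIMM label, mc:location), skipping other lines."""
    totals = Counter()
    for line in lines:
        match = MC_EVENT_RE.search(line)
        if match is None:
            continue  # not an mc_event line
        key = (match["severity"], match["label"], f"{match['mc']}:{match['location']}")
        totals[key] += int(match["count"])
    return totals


for (severity, label, location), errors in summarize(SAMPLE_LINES).items():
    print(f"{severity} on DIMM Label(s): '{label}' location: {location} errors: {errors}")
```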

Thank you.


Questions?
