devagent: analyzing the reliability of storage software ...esaule/nsf-pi-csr-2017... · [6]...

1
DevAgent: Analyzing the Reliability of Storage Software Stack via Smart Devices Mai Zheng, Computer Science Department, New Mexico StateUniversity [email protected] Background & Motivation New NVM-based components are revolutionizing the traditional storage systems, potentially creating new failuremodesdifficult tounderstand Observations & Approach Acknowledgement 0 0.5 1 1.5 2 0 1000 2000 3000 4000 5000 6000 7000 8000 0 = uncommitted, 1 = committed write requests 0 0.5 1 1.5 2 0 1000 2000 3000 4000 5000 6000 7000 8000 0 = dropped, 1 = committed write requests Results & Milestones This material is based upon work supported in part by the National Science Foundation (NSF) under Grant Number 1566554 (CRII). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF. [1] Understanding the Fault Resilience of File System Checkers (HotStorage’17) [2] On Fault Resilience of File System Checkers (FAST’17- WiP) [3] Do Not Blame Devices for All Failures (NVMW’17- Poster) [4] A Generic Framework for Testing Parallel File Systems (PDSW’16) [5] Reliability Analysis of SSDs under Power Fault (TOCS’16) [6] Emulating Realistic Flash Device Errors with High Fidelity (NAS’16- Poster ) SSDs exhibit different# of errors when tested on different OSes device driver block layer file system device user level OS (Kernel) SSD 1 SSD 2 SSD 3 Debian 6 (2.6.32) 317 991 2 Ubuntu 14 (3.16) 88 0 1 DevAgent-FW June 2015, flash-based servers in Algolia data center started corrupting files; developers “spent a big portion of two weeks just isolating machines and restoring them as quickly as possible ”; Samsung SSDs were mistakenly blamed & blacklisted (until one month later they identified a kernel bug) Failureexample1: Failureexample2: Limitation of exiting analysis tools: - heavily rely on kernel & assume kernel is correct We can no longer completely trust the kernel => analysis tools should minimize dependency/interference on kernel We can no longer focus on single component => cross-layer analysis We need to focus on generic & fundamental operations => interfaces b/w layers Publications: Prototyping DevAgent-FW on Cosmos OpenSSD Platform[3] Emulating realisticdevice behaviors for testing higher software layers [5][6] Exposing vulnerabilities in popular file system checkers [1][2] Analyzing fine-grained I/O behavior of parallel file systems [4] syscall interface device interface SCSI/NVMe system calls iSCSI NVMe/Fabrics DevAgent-SW DevAgent-Usr Two key functionalities: - record host/device interaction for diagnosis - emulatedevicebehavior for testing higher layers firmwareimplementation (minimal intrusioninto kernel) software implementation (good portability) - map function calls todevice- level commands for reasoning high-level semantics Identifying device-level impactof kernel bug patches [3] - real V.S. - emulated

Upload: others

Post on 26-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: DevAgent: Analyzing the Reliability of Storage Software ...esaule/NSF-PI-CSR-2017... · [6] Emulating Realistic Flash Device Errors with High Fidelity (NAS’16-Poster) SSDs exhibit

DevAgent:AnalyzingtheReliabilityofStorageSoftwareStackviaSmartDevicesMaiZheng,ComputerScienceDepartment,NewMexicoStateUniversity

[email protected]

Background&Motivation

• NewNVM-basedcomponentsarerevolutionizingthetraditionalstoragesystems,potentiallycreatingnewfailuremodesdifficulttounderstand

Observations&Approach

Acknowledgement

0

0.5

1

1.5

2

0 1000 2000 3000 4000 5000 6000 7000 8000

0 = uncom

mitted, 1

= commi

tted

write requests

0

0.5

1

1.5

2

0 1000 2000 3000 4000 5000 6000 7000 8000

0 = drop

ped, 1 =

committe

d

write requests

Results&Milestones

This material is based upon work supported in part by the National Science Foundation(NSF) under Grant Number 1566554 (CRII). Any opinions, findings, and conclusions orrecommendations expressed in this material are those of the author(s) and do notnecessarily reflect the views of the NSF.

[1]UnderstandingtheFaultResilienceofFileSystemCheckers(HotStorage’17)[2]OnFaultResilienceofFileSystemCheckers(FAST’17-WiP)[3]DoNotBlameDevicesforAllFailures(NVMW’17- Poster)[4]AGenericFrameworkforTestingParallelFileSystems (PDSW’16)[5]ReliabilityAnalysisofSSDsunderPowerFault (TOCS’16)[6]EmulatingRealisticFlashDeviceErrorswithHighFidelity(NAS’16- Poster)

SSDsexhibitdifferent#oferrorswhentestedondifferentOSes

devicedriver

blocklayer

filesystem

device

userlevel

OS (Kernel) SSD 1 SSD 2 SSD 3Debian 6 (2.6.32) 317 991 2Ubuntu 14 (3.16) 88 0 1

DevAgent-FW

June2015,flash-basedserversinAlgoliadatacenterstartedcorruptingfiles;developers“spentabigportionoftwoweeksjustisolatingmachinesandrestoringthemasquicklyaspossible”;SamsungSSDsweremistakenly blamed&blacklisted(untilonemonthlatertheyidentifiedakernelbug)

• Failureexample1:

• Failureexample2:

• Limitation of exiting analysis tools:- heavily rely on kernel &assumekernel is correct

• Wecan no longer completely trust thekernel=> analysis tools shouldminimize dependency/interferenceon kernel

• Wecan no longer focus on single component=> cross-layer analysis• Weneed to focus on generic & fundamental operations => interfacesb/w layers

• Publications:

• Prototyping DevAgent-FW onCosmosOpenSSD Platform[3]

• Emulatingrealisticdevicebehaviorsfortestinghighersoftwarelayers[5][6]

• Exposingvulnerabilitiesinpopularfilesystemcheckers[1][2]

• Analyzingfine-grainedI/Obehaviorofparallelfilesystems[4]

syscallinterface

deviceinterface SCSI/NVMe

systemcalls

iSCSINVMe/Fabrics DevAgent-SW

DevAgent-Usr

• Twokeyfunctionalities:- recordhost/deviceinteraction

fordiagnosis- emulatedevicebehaviorfor

testinghigherlayersfirmwareimplementation

(minimalintrusionintokernel)

softwareimplementation(goodportability)

- mapfunctioncallstodevice-levelcommandsforreasoninghigh-levelsemantics

• Identifyingdevice-levelimpactofkernelbugpatches[3]

- real

V.S.

- emulated