NERSC Reliability Data
NERSC - PDSI
Bill Kramer
Jason Hick
Akbar Mokhtarani
PDSI BOF, FAST08
Feb. 27, 2008
Production Systems Studied at NERSC
• HPSS: 2 High Performance Storage Systems
• Seaborg: IBM SP RS/6000, AIX, 416 nodes (380 compute)
• Bassi: IBM p575 POWER 5, AIX, 122 nodes
• DaVinci: SGI Altix 350 (SGI ProPack 4 64-bit Linux)
• Jacquard: Opteron Cluster, Linux, 356 nodes
• PDSF: Networked distributed computing, Linux
• NERSC Global File-system: Shared file-system based on IBM’s GPFS
Datasets
• Data were extracted from the problem-tracking database, paper records kept by the operations staff, and vendors’ repair records
• Coverage is 2001 - 2006, in some cases a subset of that period
• Preliminary results on systems availability and component failure were presented at the HEC FSIO Workshop last Aug.
• Have since done a more detailed analysis to classify the underlying causes of outages and failures
• Produced statistics for the NERSC Global File-system (NGF) and uploaded them to the CMU website. These differ from fsstats output; fsstats was used to cross-check results on some smaller directory trees on NGF
• Made workload characterizations of selected NERSC applications available. They were produced by IPM, a performance monitoring tool.
• Made trace data available for selected applications
• Results from a number of I/O-related studies done by other groups at NERSC were posted to the website.
Results
• Overall systems availability is 96% - 99%
• Seaborg and HPSS have comprehensive data for the 6-year period and show availability of 97% - 98.5% (counting both scheduled and unscheduled outages)
• Disk drive failure rates for Seaborg are consistent with “aging” and “infant mortality”; the average is 1.2%
• Tape drive failures for HPSS show the same pattern; the average rate (~19%) is higher than the manufacturer’s stated 3% (MTBF)
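The availability percentages above can be related to the outage minutes shown in the later charts. The following is a minimal sketch, assuming availability is computed as the fraction of a 365-day year not lost to scheduled or unscheduled outage; the function name and the example minutes are hypothetical and are not taken from the NERSC data.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored for simplicity


def availability_percent(scheduled_min: float, unscheduled_min: float) -> float:
    """Percent of the year the system was up, counting both outage types."""
    outage = scheduled_min + unscheduled_min
    return 100.0 * (1.0 - outage / MINUTES_PER_YEAR)


# Hypothetical outage figures, chosen only to illustrate the arithmetic.
example = availability_percent(scheduled_min=6000, unscheduled_min=4512)
print(f"{example:.1f}%")  # 10,512 outage minutes is 2% of a year -> 98.0%
```

Under this assumption, an availability of 97% - 98.5% corresponds to roughly 7,900 to 15,800 outage minutes per year.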
Average Annual Outage
Data since: Seaborg (2001), Bassi (Dec. 2005), Jacquard (July 2005), DaVinci (Sept. 2005), HPSS (2003), NGF (Oct. 2005), PDSF (2001)
[Chart: Average Annual System-wide Outage. Percent (0 to 4.5) for Seaborg, Bassi, Jacquard, PDSF, DaVinci, HPSS, NGF; series: Scheduled, Un-Scheduled]
Seaborg and HPSS Annual Outage
[Chart: HPSS Annual Outage (Archive). Minutes (0 to 9000) by year, 2001 - 2006; series: SW, SCH; SW, UNSCH; HW, SCH; HW, UNSCH]
[Chart: Seaborg Annual Outage. Percent (0 to 4) by year, 2001 - 2006; series: Scheduled, Unscheduled]
[Chart: HPSS Annual Outage. Percent (0 to 2.5) by year, 2001 - 2006; series: Scheduled, Unscheduled]
[Chart: Seaborg Annual Outage. Minutes (0 to 18000) by year, 2001 - 2006; series: HW, SCH; HW, UNSCH; SW, SCH; SW, UNSCH]
Disk and Tape Drives Failure
[Chart: HPSS Tape Drives Replaced. Number (0 to 20) and Percent (0 to 35) by year, 2001 - 2006; series: Actual, Percent]
[Chart: Seaborg Disks Replaced. Percent (0 to 1.8) and Number (0 to 80) by year, 2001 - 2006; series: Percentage, Actual]
Extra Slides
Outage Classifications
[Chart: Seaborg Category Outage. Minutes (0 to 18000) per category: Accounting, Benchmark, Control Work Station, Dedicated test, Disk, File system, HW, HW Upgrade, LoadLeveler, OSF Power, Security, SW, SW Upgrade, Switch]
[Chart: HPSS Outage by Category. Minutes (0 to 25000) per category: SW, SCH; SW, UNSCH; HW, SCH; HW, UNSCH; HPSS UPGRADE; NETWORK; OSF POWER]
Seaborg Data
• IBM SP RS/6000, AIX 5.2
• 416 nodes; 380 compute nodes
• 4280 disk drives (4160 SSA, 120 Fibre Channel)
• The large number of disk failures in 2003 can be attributed to “aging” of older drives and “infant mortality” of newer disks
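The replacement percentages quoted for Seaborg can be related to drive counts with simple arithmetic. This is a minimal sketch, assuming the rate is the number of drives replaced in a year divided by the number installed; the function name and the count of 51 replaced drives are hypothetical, and only the 4280-drive total comes from the slide. Because the installed population changed over 2001 - 2006, a real calculation would use each year's own population.

```python
def annual_replacement_rate(replaced: int, installed: int) -> float:
    """Percent of the installed drive population replaced in one year."""
    if installed <= 0:
        raise ValueError("installed must be positive")
    return 100.0 * replaced / installed


# 4280 is the drive count from the slide; 51 is a hypothetical yearly count.
rate = annual_replacement_rate(replaced=51, installed=4280)
print(f"{rate:.1f}%")  # 51 / 4280 is about 1.19% -> 1.2%
```

Read the other way, a 1.2% average rate on 4280 drives would mean on the order of 50 replacements per year.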
[Chart: title truncated in the source ("Number of"). Number (0 to 4500) by date installed, Jan-01 to Nov-06; series: Total, SSA, FC]
[Chart: Seaborg Disks Replaced. Percent (0 to 1.8) and Number (0 to 80) by year, 2001 - 2006; series: Percentage, Actual]
[Chart: Seaborg Disk Failure. Number (0 to 12) by date, Jan-01 to Nov-06; series: SSA, FASTT]
HPSS Data
• Two HPSS systems available at NERSC
• Eight tape silos with 100 tape drives attached
• Tape drives seem to show the same failure pattern as Seaborg’s disk drives: “aging” and “infant mortality”.
[Chart: HPSS Tape Drives. Number (0 to 120) by date installed, Jan-01 to Nov-06; series: Total, T9840A, T9940A, T9940B, T10KA]
[Chart: HPSS Tape Drive Failure. Number (0 to 3) by date, Jan-01 to Nov-06; series: T9840A, T9940A, T9940B]
[Chart: HPSS Tape Drives Replaced. Number (0 to 20) and Percent (0 to 35) by year, 2001 - 2006; series: Actual, Percent]
NGF stats
[Chart: NGF File size (11/28/2007). Count on a log scale (1 to 10000000) per power-of-two size bin in KB, from 0 - 1 up to 536870912 - 1073741823]
[Chart: NGF Directory Entries (11/28/2007). Count on a log scale (1 to 1000000) per power-of-two bin of entries per directory, from 0 - 1 up to 131072 - 262143]
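The bin edges on the two NGF charts (0 - 1, 2 - 3, 4 - 7, 8 - 15, and so on) are power-of-two ranges. The sketch below shows one way such a histogram could be built from a list of sizes; it is an illustration under that assumption, not the tool used to produce the NGF statistics, and the function names and sample values are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def bin_bounds(size_kb: int) -> Tuple[int, int]:
    """Return the inclusive (low, high) power-of-two bin containing size_kb."""
    if size_kb < 0:
        raise ValueError("size must be non-negative")
    if size_kb < 2:
        return (0, 1)  # the first bin covers both 0 and 1
    k = size_kb.bit_length() - 1  # largest k with 2**k <= size_kb
    return (1 << k, (1 << (k + 1)) - 1)


def histogram(sizes_kb: Iterable[int]) -> Dict[Tuple[int, int], int]:
    """Count how many values fall into each bin, ordered by bin."""
    counts = Counter(bin_bounds(s) for s in sizes_kb)
    return dict(sorted(counts.items()))


# Hypothetical sizes in KB, chosen to land on bin boundaries.
sample = [0, 1, 2, 3, 5, 7, 8, 1000, 1023, 1024]
for (low, high), count in histogram(sample).items():
    print(f"{low}-{high} KB: {count}")
```

Because each bin is twice as wide as the one before it, the counts are easiest to read on a log scale, which is how both charts plot them.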
Seaborg Outages
HPSS Outages