sata port multipliers considered harmful · sata port multipliers considered harmful peng li†...

2
SATA Port Multipliers Considered Harmful Peng Li (Student), James Hughes , John Plocher , David J. Lilja University of Minnesota - Twin Cities, FutureWei Technologies Abstract By using SATA port multipliers (PMs), the storage ca- pacity of a system can be increased significantly. How- ever, there has been little work studying the reliability of the SATA PMs. To study the potential reliability is- sues, this work proposes a recreatable process to create actual failures of hard disk drives (HDDs). Experimental results show that at least one combination of SATA con- trollers, the SATA PMs, and the HDDs does not provide resiliency when a single HDD fails. 1 Introduction SATA connectivity normally consists of a single drive connected to a single SATA controller port via a single cable. As a result, the maximum number of drives in a storage system is limited by the number of available SATA controller ports. The SATA port multiplier (PM) can be used to expand the drive scalability of storage systems cost-effectively. However, there has been little work studying the reliability of the SATA PMs. As stor- age systems reach the petabytes and exabytes era, system reliability becomes more challenging. The consequence of ignoring the reliability issue of the SATA PMs and us- ing them for a large scale storage system could be catas- trophic. The goal of this paper is to study the reliability of the SATA PMs by conducting several experiments. Our ex- perimental results show that, if one of the HDDs con- nected to the SATA PM fails, the other HDDs will not be accessible. Architecture designers should consider this effect when designing a system using the SATA PMs. The contributions of this paper are: (1) a demonstration of a reproducible process for creating actual failures in the HDDs for reliability testing; (2) a demonstration that at least one combination of the SATA controllers, the SATA PMs, and the HDDs does not provide resiliency when a single HDD fails. This paper does not explain why this error occurs. Rather, its contribution is show- ing that this type of error can occur in real systems. It also demonstrates a successful testing methodology. Section 2 presents the detailed test approaches and the results. Conclusions are drawn in Section 3. 2 Experimental Methodology and Results We first do experiments on a system which does not con- tain the SATA PM as shown in Fig. 1. Then we do the same experiments on the system which uses the SATA PM as shown in Fig. 2. The Marvell board has two SATA controllers, which are SATA0 and SATA1. In Fig. 1, the HDD connected to SATA0 is the OS HDD, on which we install the OS and the fio program. The test results also are recorded on this HDD. We use the Seagate enterprise HDD as the OS HDD. The HDD which is connected to SATA1 is used for tests. During the experiments, we re- move the cover of the test HDD to emulate fatal errors. Both Western Digital consumer HDDs and Seagate en- terprise HDDs are used as the test HDD. 88F6282 SOC SATA0 SATA1 OS HDD Sheeva CPU Test HDD Marvell Board Marvell Board Figure 1: The system without the SATA PM. The software flow of the first test system is shown in Fig. 3. Tasks 1 and 2 are generated by the fio program. They issue random I/O operations to the two HDDs si-

Upload: others

Post on 04-Jun-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SATA Port Multipliers Considered Harmful · SATA Port Multipliers Considered Harmful Peng Li† (Student), James Hughes‡, John Plocher‡, David J. Lilja† †University of Minnesota

SATA Port Multipliers Considered Harmful

Peng Li† (Student), James Hughes‡, John Plocher‡, David J. Lilja†

†University of Minnesota - Twin Cities, ‡FutureWei Technologies

Abstract

By using SATA port multipliers (PMs), the storage ca-pacity of a system can be increased significantly. How-ever, there has been little work studying the reliabilityof the SATA PMs. To study the potential reliability is-sues, this work proposes a recreatable process to createactual failures of hard disk drives (HDDs). Experimentalresults show that at least one combination of SATA con-trollers, the SATA PMs, and the HDDs does not provideresiliency when a single HDD fails.

1 Introduction

SATA connectivity normally consists of a single driveconnected to a single SATA controller port via a singlecable. As a result, the maximum number of drives ina storage system is limited by the number of availableSATA controller ports. The SATA port multiplier (PM)can be used to expand the drive scalability of storagesystems cost-effectively. However, there has been littlework studying the reliability of the SATA PMs. As stor-age systems reach the petabytes and exabytes era, systemreliability becomes more challenging. The consequenceof ignoring the reliability issue of the SATA PMs and us-ing them for a large scale storage system could be catas-trophic.

The goal of this paper is to study the reliability of theSATA PMs by conducting several experiments. Our ex-perimental results show that, if one of the HDDs con-nected to the SATA PM fails, the other HDDs will not beaccessible. Architecture designers should consider thiseffect when designing a system using the SATA PMs.The contributions of this paper are: (1) a demonstrationof a reproducible process for creating actual failures inthe HDDs for reliability testing; (2) a demonstration thatat least one combination of the SATA controllers, theSATA PMs, and the HDDs does not provide resiliencywhen a single HDD fails. This paper does not explain

why this error occurs. Rather, its contribution is show-ing that this type of error can occur in real systems.It also demonstrates a successful testing methodology.Section 2 presents the detailed test approaches and theresults. Conclusions are drawn in Section 3.

2 Experimental Methodology and Results

We first do experiments on a system which does not con-tain the SATA PM as shown in Fig. 1. Then we do thesame experiments on the system which uses the SATAPM as shown in Fig. 2. The Marvell board has two SATAcontrollers, which are SATA0 and SATA1. In Fig. 1, theHDD connected to SATA0 is the OS HDD, on which weinstall the OS and the fio program. The test results alsoare recorded on this HDD. We use the Seagate enterpriseHDD as the OS HDD. The HDD which is connected toSATA1 is used for tests. During the experiments, we re-move the cover of the test HDD to emulate fatal errors.Both Western Digital consumer HDDs and Seagate en-terprise HDDs are used as the test HDD.

88F6282 SOC

SATA0 SATA1

OS HDD

Sheeva CPU

Test HDD

Marvell BoardMarvell Board

Figure 1: The system without the SATA PM.

The software flow of the first test system is shown inFig. 3. Tasks 1 and 2 are generated by the fio program.They issue random I/O operations to the two HDDs si-

Page 2: SATA Port Multipliers Considered Harmful · SATA Port Multipliers Considered Harmful Peng Li† (Student), James Hughes‡, John Plocher‡, David J. Lilja† †University of Minnesota

88F6282 SOC

SATA0 SATA1

OS HDD

Sheeva CPU

SATA PM

Marvell BoardMarvell Board

Test HDD 1 Test HDD 2

Figure 2: The system using the SATA PM.

multaneously. While the fio program is running, we re-move the cover of the test HDD to emulate fatal errors.The test HDD stops working quickly after its cover is re-moved due to environmental contamination. In the entireprocess, we use the Linux command ‘iostat’ to record theI/O status of the two HDDs on the OS HDD.

Task 1

Task 2

Linux

RAM

Fail

OK

The OS HDD

The Test HDD

iostat

SATA1

SATA0

Sheeva CPU

Figure 3: Software flow of the system in Fig. 1.

0

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

2

28 54 80

106

132

158

184

210

236

262

288

314

340

366

392

418

444

470

496

522

548

574

600

626

652

678

704

730

756

782

T

P

S

Time (s)

The OS HDD The Test HDD

Begin to remove the cover of the Test HDD. 16 GB random read/write test finished for the OS HDD.

The Test HDD stops working (it's broken).

Figure 4: Test results of the system in Fig. 1 when usingthe Seagate enterprise HDD as the test HDD.

The Seagate enterprise HDD is first used as the testHDD. Fig. 4 shows the number of transfers per second(TPS) for the two HDDs as the test progresses. Westarted to remove the cover of the test HDD at about260s. The HDD stopped working after the entire coverwas removed (at 480s). It can be seen that the TPS of theOS HDD was not affected at all during the entire test pro-cess. It finishes all the required I/O operations issued bythe fio program. We then repeat the same experiment us-

ing the Western Digital consumer HDD as the test HDD,and get the same results.

The software flow of the second test system is shownin Fig. 5. Similar to the first experiment, Tasks 1 and2 are generated by the fio program. While the programis running, we remove the cover of the test HDD 2 toemulate fatal errors.

Task 1

Task 2

Linux

RAM

iostatSATA0

OK

Fail

The Test HDD 2

The Test HDD 1

OK

The OS HDD

SATA1

Sheeva CPU

SATA PM

Figure 5: Software flow of the system in Fig. 2.

0

1000

2000

3000

4000

5000

6000

2

28 54 80

106

132

158

184

210

236

262

288

314

340

366

392

418

444

470

496

522

548

574

600

626

652

678

704

730

756

782

T

P

S

Time (s)

The Test HDD 1 The Test HDD 2

Begin to remove the cover of the Test HDD 2. The Test HDD 1 stops working too (test has not been finished).

The Test HDD 2 stops working (it's broken).

Figure 6: Test results of the system in Fig. 2 when usingthe Seagate enterprise HDD as the test HDD.

We first use the Seagate enterprise HDD as the testHDDs, with results shown in Fig. 6. We started to re-move the cover of the test HDD 2 at about 320s. The testHDD 2 stopped working at about 520s after its cover hadbeen removed due to environmental contamination. Itcan be seen that the test HDD 1 also stopped working atabout 680s without finishing the required I/O operationsissued by the fio program, i.e., Task 1 shown in Fig. 5.We next use the Western Digital consumer HDD as thetest HDD and get the same experiment results.

3 Discussion and ConclusionThis paper presents a reproducible process for creatingactual failures in the HDDs for reliability testing. Whenthere was no SATA PM in the system, a single HDD fail-ure had no impact on the rest of the system. However,when the SATA PM was used, a forced failure of oneHDD caused the other HDD connected to the PM to alsofail. This means that at least one combination of SATAcontrollers, the SATA PMs, and the HDDs does not pro-vide resiliency when a single HDD fails. We must bevery careful when using the SATA PM to build a largescale storage system, or the consequence could be catas-trophic.

2