

IDO: Intelligent Data Outsourcing with Improved RAID Reconstruction Performance in Large-Scale Data Centers

Suzhen Wu§*, Hong Jiang*, Bo Mao*

§Xiamen University

*University of Nebraska–Lincoln


Data Deluge


Social Network

Business Intelligence

Scientific Simulation

Mobile Apps

2,300 tweets per second

275 EB data flowing per day in 2020

Safely storing such a huge volume of data poses a big challenge to system administrators!


Where Are We?

Laptop and Desktop: Interruptible Event

Data Center: Common Case


Disk Failure in the Real World

•  Higher error rates than expected

– Complete disk failures, 2%~4% on average;

– Latent sector errors, 3.45%;

•  Correlation in drive failures

– e.g., after one disk fails, another disk failure will likely occur soon.

•  RAID reconstruction becomes an operational state in data centers

– Increasing disk capacity and number of drives
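To put the failure rates above in perspective, the following sketch estimates the chance that at least one disk in an array fails within the period the rate refers to. It assumes independent failures and treats the 2% to 4% figure as a per-disk rate over one period. Both are assumptions made here, not claims from the slides, and the correlation noted above means real arrays can fare worse. The function name `prob_any_failure` and the array sizes are illustrative.

```python
def prob_any_failure(num_disks: int, per_disk_rate: float) -> float:
    """Probability that at least one of num_disks fails in one period.

    Assumes independent failures (an optimistic simplification, since
    drive failures are reported to be correlated).
    """
    if num_disks < 0:
        raise ValueError("num_disks must be non-negative")
    if not 0.0 <= per_disk_rate <= 1.0:
        raise ValueError("per_disk_rate must be within [0, 1]")
    # P(at least one failure) = 1 - P(no disk fails)
    return 1.0 - (1.0 - per_disk_rate) ** num_disks


# Illustrative array sizes (assumed); rates taken from the 2%~4% range.
for disks in (8, 100, 1000):
    for rate in (0.02, 0.04):
        p = prob_any_failure(disks, rate)
        print(f"{disks:5d} disks at {rate:.0%} per disk: {p:.1%}")
```

Even under the independence assumption, the probability approaches 1 quickly as the number of drives grows, which is consistent with treating reconstruction as an operational state rather than an exception.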


More Observations

•  Linux software RAID (MD) mailing list: many complaints about slow recovery speed.

•  Storage at Exascale: Some thoughts from Panasas CTO Garth Gibson. Disk failure is a normal case in exascale storage systems.

•  ……



RAID Reconstruction Challenges


•  Online RAID Reconstruction:

     •  Two challenges:

– Real-time user performance;

– Window of vulnerability.

[Figure: user I/O requests and reconstruction I/O requests both directed at the degraded RAID.]

The number of user I/O requests that can be removed from the degraded RAID directly affects reconstruction performance.
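A toy model can make the last point concrete. The sketch below assumes that a disk's time is split between user I/O and reconstruction I/O, and that reconstruction proceeds at the sequential bandwidth scaled by whatever share user I/O leaves free. The model, the function name `rebuild_time_s`, and the numbers (a 160 GB disk, 100 MB/s) are assumptions for illustration only; they are not measurements or formulas from the slides.

```python
def rebuild_time_s(capacity_mb: float, bandwidth_mb_s: float,
                   user_share: float, redirected: float = 0.0) -> float:
    """Toy estimate of rebuild time in seconds.

    user_share: fraction of disk time consumed by user I/O (0 <= x < 1).
    redirected: fraction of that user I/O served elsewhere (0 <= x <= 1).
    """
    if capacity_mb < 0 or bandwidth_mb_s <= 0:
        raise ValueError("capacity must be >= 0 and bandwidth > 0")
    if not 0.0 <= user_share < 1.0 or not 0.0 <= redirected <= 1.0:
        raise ValueError("shares out of range")
    # User I/O that still reaches the degraded RAID after redirection.
    remaining_user_share = user_share * (1.0 - redirected)
    # Reconstruction gets the rest of the disk time.
    return capacity_mb / (bandwidth_mb_s * (1.0 - remaining_user_share))


# Assumed example: 160 GB disk, 100 MB/s, user I/O using half the disk time.
for redirected in (0.0, 0.5, 0.9):
    t = rebuild_time_s(160_000, 100, 0.5, redirected)
    print(f"redirected {redirected:.0%} of user I/O: about {t:.0f} s")
```

In this simplified view, every user request moved off the degraded RAID returns disk time to the reconstruction process, which shortens the window of vulnerability.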


The State of the Art

•  Optimizing the reconstruction workflow:

– DOR (CMU PDL)

– Live-block recovery (USENIX FAST’04)

– PRO (USENIX FAST’07)

•  Optimizing the user I/O requests:

– MICRO (IEEE TC’08)

– WorkOut (USENIX FAST’09)

– VDF (USENIX ATC’11)


Comparison with the State of the Art

Schemes compared (table columns): PRO (FAST’07), WorkOut (FAST’09), VDF (USENIX’11), IDO (LISA’12)

Characteristics (table rows):

Proactive √

Temporal Locality √ √ √ √

Spatial Locality √ √

User I/O √ √ √

Reconstruction I/O √ √


Observation 1

•  RAID reconstruction is an operational state in large-scale data centers, which means that reactive schemes are inefficient.

– Reactive vs. proactive?

•  Existing studies are all reactive schemes.


Example 1: Reactive vs. Proactive



Observation 2

•  With large RAM and SSDs, temporal locality is poor at the HDD level. However, spatial locality is good due to the sequential accesses of HDDs.

– Temporal locality vs. spatial locality?

•  Existing studies mostly focus on temporal locality and ignore spatial locality.


Example 2: Temporal vs. Spatial

(a) Request-based approach: a zone holds blocks a, b, c, d; only the requested block “a” is migrated to the Surrogate Set.

(b) Zone-based approach: the whole hot zone (a, b, c, d) is migrated to the Surrogate Set.
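The difference between the two approaches can be sketched with a small simulation that counts how many requests the Surrogate Set absorbs. This is a simplification made here for illustration: any zone that is touched once is treated as hot, the first access to a block or zone triggers its migration, and the function name `absorbed_requests` and the zone size of 4 blocks are hypothetical.

```python
def absorbed_requests(trace, zone_size, zone_based):
    """Count requests served by the surrogate set instead of the degraded RAID.

    trace: iterable of block numbers in request order.
    zone_based: if True, the first access migrates the whole zone;
                if False, only the requested block is migrated.
    """
    migrated = set()  # zone ids (zone-based) or block numbers (request-based)
    absorbed = 0
    for block in trace:
        key = block // zone_size if zone_based else block
        if key in migrated:
            absorbed += 1      # already on the surrogate set
        else:
            migrated.add(key)  # served by the degraded RAID, then migrated
    return absorbed


sequential = list(range(8))   # spatial locality: a, b, c, d, ...
repeated = [0, 0, 0, 0]       # temporal locality: the same block again

print(absorbed_requests(sequential, 4, zone_based=False))
print(absorbed_requests(sequential, 4, zone_based=True))
print(absorbed_requests(repeated, 4, zone_based=False))
```

On the sequential trace the request-based approach absorbs nothing, because no block is requested twice, while the zone-based approach absorbs every request after the first one in each zone. On the repeated trace both approaches behave the same.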


The Motivation

[Figure: bar chart of user I/O traffic removed from the degraded RAID (0% to 100%) for the WebSearch2, Financial2, and Microsoft Project traces, comparing Reactive-request, Reactive-zone, Proactive-request, and Proactive-zone.]


IDO: Intelligent Data Outsourcing

•  The main idea:

–  Proactively identify the hot data zones;

–  Upon disk failure:

     •  Recover the hot data zones first;

     •  Migrate the hot data zones to the surrogate set;

     •  Redirect the user I/O requests.

•  The design objectives:

–  Reducing reconstruction time;

–  Improving user I/O performance;

–  Applicable to other background tasks.
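The first two steps of the main idea can be sketched as a small popularity tracker: zone access counts are collected during normal operation, and when a disk fails the hottest zones are placed at the front of the recovery order. The slides do not say how IDO measures hotness, so the plain access counter, the class name `HotZoneTracker`, and the parameter `k` (the number of zones treated as hot) are assumptions made for this illustration.

```python
from collections import Counter


class HotZoneTracker:
    """Tracks per-zone access counts before a failure (hypothetical sketch)."""

    def __init__(self, zone_size: int, num_zones: int):
        if zone_size <= 0 or num_zones <= 0:
            raise ValueError("zone_size and num_zones must be positive")
        self.zone_size = zone_size
        self.num_zones = num_zones
        self.counts = Counter()

    def record(self, block: int) -> None:
        """Record one user access to a block during normal operation."""
        zone = block // self.zone_size
        if not 0 <= zone < self.num_zones:
            raise ValueError("block outside the array")
        self.counts[zone] += 1

    def hot_zones(self, k: int) -> list:
        """Return up to k zone ids, most accessed first (ties by zone id)."""
        ranked = sorted(self.counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return [zone for zone, _ in ranked[:k]]

    def recovery_order(self, k: int) -> list:
        """Hot zones first, then the remaining zones in sequential order."""
        hot = self.hot_zones(k)
        hot_set = set(hot)
        rest = [z for z in range(self.num_zones) if z not in hot_set]
        return hot + rest


tracker = HotZoneTracker(zone_size=4, num_zones=4)
for block in (13, 14, 13, 5, 12, 6, 0):
    tracker.record(block)
print(tracker.recovery_order(k=2))  # hot zones lead, the rest follow in order
```

Because the counts are gathered before the failure, the recovery order is available the moment reconstruction starts, which is the sense in which the scheme is proactive rather than reactive.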


System Overview

[Figure: system architecture. Storage nodes are connected over a network. Each node runs a software RAID controller with an IDO module made up of a Hot Zone Identifier, a Data Migrator, a Request Distributor, and a Data Reclaimer. On the node holding the working / degraded RAID, RAID reconstruction rebuilds the failed disk onto a new disk, while data migration moves data to a surrogate RAID on another node, labeled working / surrogate RAID.]
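The slides name the IDO components but do not describe their internals, so the following is only a guess at how a request distributor and a data reclaimer could cooperate. It assumes that requests to migrated zones go to the surrogate RAID, that writes there mark the zone as needing write-back, and that reclaiming happens once reconstruction is done. The class and method names (`RequestDistributor`, `mark_migrated`, `route`, `reclaim`) are hypothetical.

```python
class RequestDistributor:
    """Routes block requests during reconstruction (hypothetical sketch)."""

    def __init__(self, zone_size: int):
        if zone_size <= 0:
            raise ValueError("zone_size must be positive")
        self.zone_size = zone_size
        self.migrated = set()  # zones currently held by the surrogate RAID
        self.dirty = set()     # migrated zones written to since migration

    def mark_migrated(self, zone: int) -> None:
        """Called after a hot zone has been copied to the surrogate RAID."""
        self.migrated.add(zone)

    def route(self, block: int, is_write: bool) -> str:
        """Return which RAID set should serve the request."""
        zone = block // self.zone_size
        if zone in self.migrated:
            if is_write:
                self.dirty.add(zone)  # must be written back later
            return "surrogate"
        return "degraded"

    def reclaim(self) -> list:
        """After reconstruction: list zones to write back, then reset state."""
        to_write_back = sorted(self.dirty)
        self.dirty.clear()
        self.migrated.clear()
        return to_write_back


dist = RequestDistributor(zone_size=4)
dist.mark_migrated(3)
print(dist.route(13, is_write=False))  # block 13 is in zone 3
print(dist.route(5, is_write=True))    # zone 1 was not migrated
```

Keeping the dirty set separate from the migrated set means that zones which were only read on the surrogate RAID need no write-back, so only updated data has to be copied to the recovered array.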


Performance Evaluation

•  The IDO prototype is implemented as a built-in module in Linux MD and is compared with WorkOut and VDF.

•  Intel Xeon 3440 processor, 8GB DDR memory, WDC WD1600AAJS SATA disks.

•  Trace-driven evaluations


RAID5 Results

[Figure: (a) average response time during recovery (ms) and (b) reconstruction time (s) for WorkOut, VDF, and IDO on the Fin1, Fin2, Web2, and Proj traces.]


RAID6 Results

[Figure: (a) average response time during recovery (ms) and (b) reconstruction time (s) for WorkOut, VDF, and IDO on the Fin1, Fin2, Web2, and Proj traces.]


Detailed Real-time Results

[Figure: user response time (ms, logarithmic axis) over reconstruction time (s) for WorkOut, VDF, and IDO on (a) WebSearch2.spc and (b) Microsoft Project, with markers showing where reconstruction ends for VDF, WorkOut, and IDO.]

Shorter reconstruction times

Lower user response times


Reduce I/Os and Sensitivity Study

•  Reduced I/Os:

[Figure: bar chart of percentage (% of total, 0 to 100) for WorkOut, VDF, and IDO on the Fin1, Fin2, Web2, and Proj traces; two bars carry the value labels 3.4 and 1.3.]

•  Sensitivity & overhead analysis (in the paper).


Extendibility Evaluation

[Figure: (a) re-synchronization time (s) and (b) average response time (ms) for Default, WorkOut, and IDO on the Fin1, Fin2, Web2, and Proj traces.]


Summary of IDO


•  RAID reconstruction is an operational state in large-scale data centers!

•  Salient features of IDO:

– Proactive;

– Exploits both temporal and spatial locality;

– Optimizes both user and reconstruction I/Os;

– Portability and extendibility.


Thanks!
