Journal-guided Resynchronization for Software RAID

Timothy E. Denehy, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau

University of Wisconsin, Madison

RAID Consistent Update Problem

• RAID task is to maintain consistency

• Challenging in the face of crashes
– Updates must be applied to more than one disk

• Inconsistency means window of vulnerability
– Disk failure may lead to data loss
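The window of vulnerability can be illustrated with a toy parity stripe. This is a minimal sketch, not code from the talk: the block contents, the three-data-plus-parity layout, and the helper name `xor_blocks` are all assumptions chosen for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together (RAID-5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# A stripe with three data blocks and one parity block (toy 4-byte blocks).
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)
consistent_before = xor_blocks(data) == parity

# Crash between the two writes: the new data reaches disk, the parity does not.
data[0] = b"\xff\xff\xff\xff"
consistent_after_crash = xor_blocks(data) == parity

# If disk 1 now fails, rebuilding it from the stale parity yields wrong contents.
reconstructed = xor_blocks([data[0], data[2], parity])
lost_block_1 = reconstructed != b"\x10\x20\x30\x40"
```

Until the parity is recomputed, a single disk failure in this stripe silently corrupts a block that was never written, which is why the stripe must be resynchronized after a crash.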


High-end RAID Solution

• Consistent update with non-volatile memory
– Logs writes in NVRAM until they reach disk

• Performance – logging to NVRAM is fast

• Reliability – data is safe in NVRAM

• Availability – recovery is fast

• But, enterprise systems are expensive

Software RAID Solutions

• Consistent update is challenging
– Performance versus reliability trade-off

• Performance: resynchronization after crash
– Scan entire volume to fix inconsistencies
– Extremely slow, hours for 100s of GBs to days for TBs
– Reliability: lengthens window of vulnerability
– Availability: consumes array bandwidth

• Reliability: log intentions to a bitmap
– Performance: extra writes to maintain bitmap
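The bitmap alternative can be sketched as follows. This is a hypothetical toy model rather than the Linux implementation: the class name `IntentBitmap`, the per-region granularity, and the eager clearing of bits are assumptions made to keep the example short.

```python
class IntentBitmap:
    """Toy write-intent bitmap: one dirty flag per fixed-size array region."""

    def __init__(self, num_regions):
        self.dirty = [False] * num_regions
        self.bitmap_writes = 0  # extra I/Os spent persisting the bitmap

    def begin_write(self, region):
        # The dirty bit must be on stable storage before the data write starts.
        if not self.dirty[region]:
            self.dirty[region] = True
            self.bitmap_writes += 1

    def end_write(self, region):
        # Clear the bit once data and redundancy are both on disk.
        if self.dirty[region]:
            self.dirty[region] = False
            self.bitmap_writes += 1

    def regions_to_resync(self):
        # After a crash, only regions still marked dirty need to be scanned.
        return [i for i, is_dirty in enumerate(self.dirty) if is_dirty]


bitmap = IntentBitmap(num_regions=8)
bitmap.begin_write(2)
bitmap.end_write(2)      # completed write: region 2 is clean again
bitmap.begin_write(5)    # crash happens before this write completes
to_resync = bitmap.regions_to_resync()
```

The counter makes the trade-off visible: resynchronization shrinks to the dirty regions, but every tracked write pays for additional bitmap updates.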

Cooperative Software RAID Solution

• Journaling file systems perform logging
– Maintain file system data structure consistency
– ext3, ReiserFS, JFS, NTFS

• Journal-guided resynchronization
– New ext3 mode: declared mode
– New software RAID interface: verify read
– Achieves performance, reliability, availability

Journal-guided Resync Overview

• Crash: What writes were outstanding?
– Narrow the range of possible inconsistencies
– Obtain information from journal (declared mode)

• Restart: journal-guided resynchronization
– Use journal to identify outstanding writes
– Communicate locations to RAID (verify read)
– Check redundancy and repair inconsistencies
– Greatly reduce the time for resynchronization

Outline

• Problem

• ext3 Background and Analysis

• ext3 Declared Mode and RAID Verify Read

• Journal-guided Resynchronization

• Evaluation

• Conclusion

ext3 Modes

• Data-journaling mode
– All data and metadata is written to the journal

• Ordered mode (default)
– Only metadata is written to the journal
– Strict ordering between data and metadata

• Writeback mode
– Only metadata is written to the journal
– No ordering between data and metadata

ext3 Transactions

• Updates are grouped into transactions

• Transaction states
– Running – collect updates in memory
– Commit – write updates to journal
– Checkpoint – write updates to home locations
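The three states can be modeled as a small state machine. This is a simplified sketch under stated assumptions, not ext3 code: the journal is a Python list, the disk is a dict keyed by home block number, and the names `TxState` and `Transaction` are hypothetical.

```python
from enum import Enum

class TxState(Enum):
    RUNNING = "running"
    COMMITTING = "committing"
    CHECKPOINTING = "checkpointing"
    DONE = "done"

class Transaction:
    def __init__(self, seq):
        self.seq = seq
        self.state = TxState.RUNNING
        self.updates = {}  # home block number -> new contents, held in memory

    def add_update(self, home_block, contents):
        if self.state is not TxState.RUNNING:
            raise RuntimeError("only a running transaction accepts updates")
        self.updates[home_block] = contents

    def commit(self, journal):
        # Write the updates to the journal, then mark the transaction complete.
        self.state = TxState.COMMITTING
        journal.append(("desc", self.seq, sorted(self.updates)))
        for block in sorted(self.updates):
            journal.append(("block", self.seq, self.updates[block]))
        journal.append(("commit", self.seq))

    def checkpoint(self, disk):
        # Copy the journaled updates to their home locations.
        self.state = TxState.CHECKPOINTING
        disk.update(self.updates)
        self.state = TxState.DONE


journal, disk = [], {}
tx = Transaction(seq=11)
tx.add_update(7, b"meta")
tx.add_update(3, b"data")
tx.commit(journal)
tx.checkpoint(disk)
```

Real transactions overlap (one can be running while another commits); the sketch runs a single transaction to completion to keep the state changes easy to follow.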

ext3 Journal Structures

• Journal superblock
– Head and tail pointers into journal file
– Transaction sequence number

• Descriptor block
– List of home locations for upcoming blocks

• Commit block
– Marks the end of a transaction
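A rough model of these structures shows how a journal scan separates committed transactions from incomplete ones. The field names and the in-memory record list are assumptions for illustration; the real on-disk layouts are more involved.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class JournalSuperblock:
    head: int       # where the next journal block will be written
    tail: int       # oldest journal block that is still needed
    sequence: int   # sequence number of the oldest transaction in the journal

@dataclass
class DescriptorBlock:
    sequence: int
    home_locations: List[int]  # home addresses of the blocks that follow

@dataclass
class CommitBlock:
    sequence: int

def scan_journal(records):
    """Split home locations into committed and uncommitted transactions."""
    pending, committed = {}, {}
    for record in records:
        if isinstance(record, DescriptorBlock):
            pending.setdefault(record.sequence, []).extend(record.home_locations)
        elif isinstance(record, CommitBlock):
            committed[record.sequence] = pending.pop(record.sequence, [])
    return committed, pending


sb = JournalSuperblock(head=3, tail=0, sequence=11)
records = [
    DescriptorBlock(11, [40, 41]),
    CommitBlock(11),
    DescriptorBlock(12, [50]),   # no commit block: transaction 12 is incomplete
]
committed, pending = scan_journal(records)
```

A transaction without a commit block is not replayed during recovery, which is why the commit block is written last.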

Data-journaling Write Analysis

[Figure: RAID array and journal contents (DESC11, METADATA, DATA, COMM blocks) in the Running, Committing, and Checkpointing states]

Running:
– collect file system updates in memory

Commit:
– write desc, meta, data to journal, wait (bounded)
– write commit to journal, wait (bounded)

Checkpoint:
– write journaled blocks to home, wait (known)
– update superblock (known)

Data-journaling Summary

• Provides a record of all outstanding writes
– Suitable for journal-guided resynchronization

• Offers poor performance

Block Type       Write Location
superblock       known, fixed
journal          bounded, fixed
home metadata    known, descriptors
home data        known, descriptors

Ordered Write Analysis

[Figure: RAID array and journal contents (DESC11, METADATA, COMM11 blocks) in the Running and Committing states]

Running:
– collect file system updates in memory
– pdflush may write data to home (unknown)

Commit:
– write data to home, wait (unknown)
– write desc and meta to journal, wait (bounded)
– write commit to journal, wait (bounded)

Ordered Summary

• Does not provide outstanding write record
– Unsuitable for journal-guided resynchronization

Block Type       Write Location
superblock       known, fixed
journal          bounded, fixed
home metadata    known, descriptors
home data        unknown

Declared Mode

• Variation of ordered mode
– Only metadata is journaled, strict ordering

• Declares its intent to write to home locations

• New journal structure: declare block
– List of home data locations for the transaction

• Space and performance overheads

Declared Write Analysis

[Figure: RAID array and journal contents (DECL11, DESC11, METADATA, COMM11 blocks) in the Running and Committing states]

Running:
– collect file system updates in memory
– pdflush may write data to home (unknown)

Commit:
– write declare to journal, wait (bounded)
– write data to home, wait (known)
– write desc and meta to journal, wait (bounded)
– write commit to journal, wait (bounded)
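The commit ordering for declared mode can be sketched as below. This is an illustrative model only: the function name `declared_mode_commit`, the tuple-based journal records, and the dict standing in for the disk are assumptions, and the waits between steps are reduced to comments.

```python
def declared_mode_commit(seq, data_updates, metadata_updates, journal, disk):
    """Apply one transaction in the order the declared mode slide describes."""
    # 1. Declare the home data locations in the journal, then wait.
    journal.append(("declare", seq, sorted(data_updates)))

    # 2. Write data blocks to their (now declared, hence known) home locations.
    for block, contents in data_updates.items():
        disk[block] = contents

    # 3. Write the descriptor and metadata blocks to the journal, then wait.
    journal.append(("desc", seq, sorted(metadata_updates)))
    for block in sorted(metadata_updates):
        journal.append(("meta", seq, metadata_updates[block]))

    # 4. Write the commit block to the journal, then wait.
    journal.append(("commit", seq))


journal, disk = [], {}
declared_mode_commit(
    seq=11,
    data_updates={100: b"file data", 101: b"more data"},
    metadata_updates={7: b"inode"},
    journal=journal,
    disk=disk,
)
record_kinds = [record[0] for record in journal]
```

Because the declare block reaches the journal before any data block is written home, a crash at any later step leaves a record of every location that might be inconsistent.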

Software RAID Verify Read

• File system must communicate possible inconsistencies to the software RAID layer

• New interface: verify read request
– Read block and verify its redundant information
– Repair redundant information if inconsistent

[Figure: verify read checks whether the stored parity P equals the XOR of the stripe's data blocks]
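A verify read for a parity stripe could look roughly like this. It is a toy sketch, not the actual RAID driver interface: the class `ToyStripe`, its single fixed parity block, and the return value are hypothetical, and a mirrored array would compare copies instead of recomputing parity.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

class ToyStripe:
    def __init__(self, data_blocks):
        self.data = list(data_blocks)
        self.parity = xor_blocks(self.data)

    def verify_read(self, index):
        """Read a block, check the stripe's parity, and repair it if stale.

        Returns (block contents, whether a repair was needed).
        """
        expected = xor_blocks(self.data)
        repaired = expected != self.parity
        if repaired:
            self.parity = expected
        return self.data[index], repaired


stripe = ToyStripe([b"\x01\x02", b"\x10\x20", b"\xaa\xbb"])
stripe.data[1] = b"\xff\xff"          # data written, crash before parity update
block, repaired_first = stripe.verify_read(1)
_, repaired_second = stripe.verify_read(1)
```

The second call finds nothing to repair, so issuing verify reads for every suspect block is enough to restore the array's redundancy.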

Journal-guided Resynchronization

[Figure: journal contents after a crash (DECL11, DESC11, METADATA, COMM11, DECL12) and the RAID stripes checked by verify read]

Recovery and Resynchronization:
– superblock write: verify read for superblock
– checkpointing: verify reads for descriptor home locations
– committing: verify reads for head of the journal
– home data writes: verify reads for declared home locations
– checkpoint committed transactions
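The recovery steps can be sketched as a function that gathers every block location needing a verify read. This is a simplified illustration: the record format, the function name `blocks_to_verify`, and the block numbers are assumptions, and the final step of checkpointing committed transactions is omitted.

```python
def blocks_to_verify(superblock_location, journal_records, journal_head_blocks):
    """Collect block numbers to pass to verify read after a crash."""
    targets = [superblock_location]            # superblock write in progress?

    for kind, _seq, locations in journal_records:
        if kind == "desc":                     # checkpoint in progress?
            targets.extend(locations)

    targets.extend(journal_head_blocks)        # commit in progress?

    for kind, _seq, locations in journal_records:
        if kind == "declare":                  # home data writes in progress?
            targets.extend(locations)

    # Drop duplicates while keeping the order of first appearance.
    return list(dict.fromkeys(targets))


records = [
    ("declare", 11, [100, 101]),
    ("desc", 11, [7, 8]),
    ("commit", 11, []),
    ("declare", 12, [101, 200]),   # transaction 12 crashed before committing
]
targets = blocks_to_verify(0, records, journal_head_blocks=[5000, 5001])
```

The number of targets is bounded by what the journal can describe, not by the size of the array.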

Declared Mode Evaluation

• Microbenchmarks (versus ordered mode)
– Random write (3% slowdown)
– Sequential write (5% slowdown)
– Sprite create, read, unlink (4% slowdown)

• Macrobenchmarks
– ssh Benchmark (3% speedup for unpack)
– Postmark (40% speedup - 5% slowdown)
  • Speedup from globally sorted write order
– TPC-B (20% - 5% slowdown)
  • Small transaction size increases declare overhead

Implementation Complexity

• Cooperative approach reduces complexity

Journal-guided Resynchronization

Module             Original Lines   Modified Lines   Change
Software RAID-5    3475             18               0.5 %
ext3               8621             69               0.8 %
Journaling         3472             308              8.9 %
Total              15568            395              2.5 %

Linux RAID-1 Intent Bitmap Logging

Module             Original Lines   Modified Lines   Change
Software RAID-1    3116             1193             38.3 %

Resynchronization Experiment

• Five-disk, 1 GB RAID-5 array

• Foreground process reading a set of files

• After 30 seconds, crash and restart machine
– Resynchronization begins
– Foreground process restarts

• Monitor foreground bandwidth and resync

Resynchronization Results

• Availability: foreground BW from 29.6 to 34.1 MB/s

• Reliability: vulnerability from 254 to 0.21 seconds
– Reduced from O(array size) to O(journal size)
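The size of these improvements follows directly from the reported numbers. The short calculation below uses only the figures on this slide; the variable names are illustrative, and it assumes the first value of each pair is the full-scan resynchronization and the second is journal-guided.

```python
# Figures reported on the slide.
vulnerability_full_scan_s = 254.0
vulnerability_journal_guided_s = 0.21
foreground_bw_full_scan_mbs = 29.6
foreground_bw_journal_guided_mbs = 34.1

# How many times shorter the window of vulnerability becomes.
window_reduction = vulnerability_full_scan_s / vulnerability_journal_guided_s

# Relative gain in foreground bandwidth during resynchronization.
bandwidth_gain = (
    foreground_bw_journal_guided_mbs - foreground_bw_full_scan_mbs
) / foreground_bw_full_scan_mbs

print(f"window of vulnerability shrinks about {window_reduction:.0f}x")
print(f"foreground bandwidth improves about {bandwidth_gain:.1%}")
```

Since the full scan is O(array size) and the journal-guided pass is O(journal size), the gap in the vulnerability window would be expected to widen on arrays larger than the 1 GB one used here.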

Conclusion

• RAID consistent updates are challenging

• Analyzed ext3 journaling, declared mode
– Identifies outstanding writes after a crash

• Software RAID verify read interface

• Journal-guided Resynchronization
– Leverages functionality, reducing complexity
– Provides performance, reliability, and availability

• Cooperation between layers is the key

Questions?

http://www.cs.wisc.edu/adsl/
