Conquest: Preparing for Life After Disks
An-I Andy Wang
Geoff Kuenning, Peter Reiher, Gerald Popek
2
Conquest Overview
File systems are optimized for disks
  Performance problem
  Complexity
Now we have tons of inexpensive RAM
  What can we do with that RAM?
3
Conquest Approach
Combine disk and persistent RAM (e.g., battery-backed RAM) in a novel way
Simplification
  > 20% fewer semicolons than ext2, reiserfs, and SGI XFS
Performance (under popular benchmarks)
  24% to 1900% faster than LRU disk caching
4
Outline of the Talk
Motivation
Conquest design (high level)
Conquest components
Performance evaluation
Conclusion
Motivation – Conquest Design – Conquest Components – Performance Evaluation – Conclusion
5
Motivation
Most file systems are built for disks
Problems with the disk assumption:
  Performance
  Complexity
6
Hardware Evolution
[Figure: accesses per second (log scale, 1 KHz to 1 GHz) versus year (1990 to 2000). CPU (50% /yr) and memory (50% /yr) improve faster than disk (15% /yr). Gap annotations: 10^5 and 10^6; (1 sec : 6 days) and (1 sec : 3 months).]
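The growth rates on this slide compound, which is why the gap between memory and disk widens so quickly. A minimal Python sketch of that arithmetic follows; the function name is illustrative, and the ten-year window is simply the span of the chart's year axis.

```python
def gap_growth(years, fast_rate=0.50, slow_rate=0.15):
    """Factor by which the faster-improving component pulls ahead.

    The default rates are the annual improvements quoted on the slide:
    50% per year for CPU and memory, 15% per year for disk.
    """
    return ((1 + fast_rate) / (1 + slow_rate)) ** years


if __name__ == "__main__":
    # Over a decade, the relative gap widens roughly 14-fold.
    print(f"gap after 10 years: {gap_growth(10):.1f}x")
```

Each year the ratio grows by about 30% (1.50 / 1.15), so even a modest difference in annual rates becomes an order of magnitude within a decade.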
7
Inside Pandora’s Box
[Figure: inside a disk drive, showing the disk arm and the disk platters]
Access time = seek time (disk arm) + rotational delay (disk platter) + transfer time
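As a worked example of the access-time formula, here is a small Python sketch. All drive parameters in the usage lines (8 ms seek, 7200 RPM, 40 MB/s, a 4 KB request) are hypothetical values chosen for illustration, not figures from the talk. The average rotational delay is approximated as half a revolution.

```python
def average_rotational_delay_ms(rpm):
    """Average wait for the sector to rotate under the head: half a revolution."""
    return 0.5 * 60_000.0 / rpm


def access_time_ms(seek_ms, rpm, request_bytes, transfer_mb_per_s):
    """Access time = seek time + rotational delay + transfer time (all in ms)."""
    transfer_ms = request_bytes / (transfer_mb_per_s * 1_000_000) * 1000.0
    return seek_ms + average_rotational_delay_ms(rpm) + transfer_ms


if __name__ == "__main__":
    # Hypothetical drive: 8 ms average seek, 7200 RPM, 40 MB/s sustained transfer.
    total = access_time_ms(seek_ms=8.0, rpm=7200, request_bytes=4096,
                           transfer_mb_per_s=40.0)
    print(f"4 KB access: {total:.2f} ms")
```

With numbers like these, the mechanical terms (seek and rotation) dominate, and the transfer itself is a small fraction of the total for small requests.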
8
Disk Optimization Methods
Disk arm scheduling
Group information on disk
Disk readahead
Buffered writes
Disk caching
Data mirroring
Hardware parallelism
9
Complexity
[Figure: "Bytes" shown together with the disk-related mechanisms around them: synchronization, predictive readahead, cache replacement, elevator algorithm, data clustering, data consistency, asynchronous write]
[Caceres et al., 1993; Hillyer et al., 1996; Qualstar 1998; Tanisys 1999; Micron Semiconductor Products 2000; Quantum 2000]
10
Storage Media Alternatives
[Figure: $/MB (log scale) versus accesses/sec (log scale) for tape, disk, (write once) flash memory, battery-backed DRAM, and magnetic RAM?, with a region labeled persistent RAM. Axis ticks shown: 10^-3, 10^0, 10^3, 10^6.]
[Grochowski 2000]
11
Price Trend of Persistent RAM
[Figure: $/MB (log scale, 10^-2 to 10^2) versus year (1995 to 2005) for paper/film, 3.5" HDD, 2.5" HDD, 1" HDD, and persistent RAM. Annotations: booming of digital photography; 4 to 10 GB of persistent RAM.]
12
Old Order; New World
Disk will stay around
  Cost, capacity, power, heat
RAM as a viable storage alternative
  PDAs, digital cameras, MP3 players
More architectural changes due to RAM
  A big assumption change from disk
  Rethink data structures, interfaces, applications
13
Getting a Fresh Start
What does it take to design and build a system that assumes ample persistent RAM as the primary storage medium?
14
Conquest Design
Design and build a disk/persistent-RAM hybrid file system
Deliver all file system services from memory, with the exception of high-capacity storage
Two separate data paths to memory and disk
Benefits:
  Simplicity
  Performance
15
Simplicity
Remove disk-related complexities for most files
Make things simpler for disk as well
Less complexity
  Fewer bugs
  Easier maintenance
  Shorter data paths
16
Performance
Overall
  All management performed in memory
Memory data path
  No disk-related overhead
Disk data path
  Faster speed due to simpler access models
17
Conquest Components
Media management
Metadata representation
Directory service
Allocation service
Persistence support
Resiliency support
[Iram 1993; Douceur et al., 1999; Roselli et al., 2000]
18
User Access Patterns
Small files
  Take little space (10%)
  Represent most accesses (90%)
Large files
  Take most space
  Mostly sequential accesses
Not characteristic of database applications
19
Files Stored in Persistent RAM
Small files (< 1 MB)
  No seek time or rotational delays
  Fast byte-level accesses
  Contiguous allocation
Metadata
  Fast synchronous update
  No dual representations
Executables and shared libraries
  In-place execution
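The placement rule on this slide can be sketched as a small decision function. This is an illustration only, not Conquest's kernel code: the `FileInfo` type, its field names, and the returned strings are hypothetical, and treating a file of exactly 1 MB as small is an assumption (the benchmark slides later treat 1-MB files as in-core and 1.01-MB files as on-disk).

```python
from dataclasses import dataclass

SMALL_FILE_LIMIT = 1 << 20  # 1 MB, the threshold quoted on the slide


@dataclass
class FileInfo:
    size: int                               # file size in bytes
    is_metadata: bool = False
    is_executable_or_library: bool = False  # hypothetical flag for illustration


def choose_medium(info: FileInfo) -> str:
    """Return "persistent_ram" or "disk" following the slide's categories."""
    # Metadata, executables, and shared libraries live in persistent RAM.
    if info.is_metadata or info.is_executable_or_library:
        return "persistent_ram"
    # Assumption: the 1 MB boundary is inclusive.
    if info.size <= SMALL_FILE_LIMIT:
        return "persistent_ram"
    return "disk"


if __name__ == "__main__":
    print(choose_medium(FileInfo(size=4096)))      # small file
    print(choose_medium(FileInfo(size=5 << 20)))   # large file
```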
20
Memory Data Path of Conquest
[Figure: two data paths side by side.
 Conventional File Systems: storage requests, IO buffer management, IO buffer, persistence support, disk management, disk.
 Conquest Memory Data Path: storage requests, persistence support, battery-backed RAM (small file and metadata storage).]
[Devlinux.com 2000]
21
Large-File-Only Disk Storage
Allocate in big chunks
  Lower access overhead
  Reduced management overhead
No fragmentation management
No tricks for small files
  Storing data in metadata
No elaborate data structures
  Wrapping a balanced tree onto disk cylinders
22
Sequential-Access Large Files
Sequential disk accesses
  Near-raw bandwidth
Well-defined readahead semantics
Read-mostly
  Little synchronization overhead (between memory and disk)
23
Disk Data Path of Conquest
[Figure: two data paths side by side.
 Conventional File Systems: storage requests, IO buffer management, IO buffer, persistence support, disk management, disk.
 Conquest Disk Data Path: storage requests, IO buffer management, IO buffer, disk management, disk (large-file-only file system), and battery-backed RAM (small file and metadata storage).]
24
Random-Access Large Files
Random access?
  Common definition: nonsequential access
  A typical movie has 150 scene changes
  MP3 stores the title at the end of the files
Near sequential access?
  Simplifies large-file metadata representation significantly
25
Logical File Representation
[Figure: a file consists of name(s), an i-node holding the file attributes, and data]
26
Physical File Representation
[Figure: a file consists of name(s), an i-node holding the file attributes and data locations, and data blocks]
27
Ext2 Data Representation
[Figure: an ext2 i-node (labeled 12) holding data block locations and index block locations; index blocks in turn hold data block locations or further index block locations]
28
Disadvantages with Ext2 Design
Designed for disk storage
Optimization for small files makes things complex
Random-access data structure for large files that are accessed mostly sequentially
Data access time dependent on the byte position in a file
Maximum file size is limited
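To make the "access time depends on the byte position" point concrete, the sketch below counts how many index blocks must be traversed to reach a given offset under the classic ext2 block map: 12 direct pointers, then single, double, and triple indirect blocks, with 4-byte block pointers. The 1 KB default block size is an assumption for illustration (ext2 also supports larger blocks), and the function name is hypothetical.

```python
DIRECT_POINTERS = 12   # direct block pointers held in the i-node
POINTER_SIZE = 4       # bytes per block pointer


def indirection_levels(offset, block_size=1024):
    """Number of index blocks traversed to reach the data block at `offset`."""
    pointers_per_block = block_size // POINTER_SIZE
    block = offset // block_size
    if block < DIRECT_POINTERS:
        return 0
    block -= DIRECT_POINTERS
    # Level 1 = single indirect, 2 = double indirect, 3 = triple indirect.
    for level in (1, 2, 3):
        span = pointers_per_block ** level
        if block < span:
            return level
        block -= span
    raise ValueError("offset exceeds what the block map can address")


if __name__ == "__main__":
    for offset in (0, 12 * 1024, 300 * 1024, 100 * 1024 * 1024):
        print(offset, indirection_levels(offset))
```

The final `ValueError` branch is the "maximum file size is limited" bullet in code form: once the triple-indirect range is exhausted, the map has no way to name further blocks.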
29
Conquest Representation
Persistent RAM
  Hash(file name) = location of data
  Offset(location of data)
Disk storage
  Per-file, doubly linked list of disk block segments (stored in persistent RAM)
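A per-file, doubly linked list of disk block segments might look like the following sketch. The class and field names are hypothetical, and the 4 KB block size is an assumption; the point is only that the list lives in memory, so finding the disk block for an offset is a memory walk rather than a chain of disk reads.

```python
class Segment:
    """One contiguous run of disk blocks belonging to a file."""

    def __init__(self, start_block, num_blocks):
        self.start_block = start_block
        self.num_blocks = num_blocks
        self.prev = None
        self.next = None


class LargeFile:
    """Large-file metadata: a doubly linked list of segments kept in memory."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.head = None
        self.tail = None

    def append_segment(self, start_block, num_blocks):
        seg = Segment(start_block, num_blocks)
        if self.tail is None:
            self.head = self.tail = seg
        else:
            seg.prev = self.tail
            self.tail.next = seg
            self.tail = seg
        return seg

    def locate(self, offset):
        """Walk the in-memory list and return the disk block holding `offset`."""
        remaining = offset // self.block_size
        seg = self.head
        while seg is not None:
            if remaining < seg.num_blocks:
                return seg.start_block + remaining
            remaining -= seg.num_blocks
            seg = seg.next
        raise ValueError("offset is past the end of the file")


if __name__ == "__main__":
    f = LargeFile()
    f.append_segment(start_block=1000, num_blocks=4)
    f.append_segment(start_block=5000, num_blocks=2)
    print(f.locate(4 * 4096))  # first block of the second segment
```

Sequential access just follows `next` from the current segment; only a random seek pays for the linear walk, and that walk touches memory, not disk.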
30
Advantages of Conquest Design
Direct data access for in-core files
Worst case: sequential memory search for random disk locations
Maximum file size limited by physical storage
31
Directory Service Requirements
Fast sequential traversal (e.g., ls)
Fast random lookup (e.g., locate file x)
Hard links (apply multiple names to data)
32
First Design
A doubly hashed table for each directory
  Conserves space
Problems:
  Dynamic resizing of directories
  Need to handle the current file position
    Important for rm -fr
[Fagin et al., 1979]
33
Second Design
A variant of extensible hash table for each directory
An old data structure fits nicely
[Figure: directory hash tables with entries such as "0100 | file1", "1001 | file2", "0011 | dir1", and "1110 | file2_hardlink" among empty slots]
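For readers unfamiliar with the data structure, here is a minimal sketch of textbook extendible hashing used as a directory. It is not Conquest's variant: the class names, the bucket capacity of 2, and the use of CRC-32 as the hash are all assumptions made for illustration. The table is indexed by the low bits of the hash, it doubles only when a full bucket already uses every index bit, and hard links are simply two names that map to the same metadata ID.

```python
import zlib


def name_hash(name):
    """Deterministic hash of a file name (CRC-32 is an arbitrary stand-in)."""
    return zlib.crc32(name.encode("utf-8"))


class Bucket:
    def __init__(self, depth):
        self.depth = depth   # number of low hash bits this bucket owns
        self.entries = {}    # file name -> metadata ID


class ExtendibleDirectory:
    def __init__(self, bucket_capacity=2):
        self.bucket_capacity = bucket_capacity
        self.global_depth = 0
        self.table = [Bucket(0)]

    def _slot(self, name):
        return name_hash(name) & ((1 << self.global_depth) - 1)

    def lookup(self, name):
        return self.table[self._slot(name)].entries.get(name)

    def insert(self, name, metadata_id):
        # Note: names whose hashes collide on all bits are not handled here.
        while True:
            bucket = self.table[self._slot(name)]
            if name in bucket.entries or len(bucket.entries) < self.bucket_capacity:
                bucket.entries[name] = metadata_id
                return
            self._split(bucket)

    def _split(self, bucket):
        if bucket.depth == self.global_depth:
            # Double the index; both halves initially share the same buckets.
            self.table = self.table + self.table
            self.global_depth += 1
        new_bit = 1 << bucket.depth
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        # Move entries whose newly significant hash bit is set.
        for name in list(bucket.entries):
            if name_hash(name) & new_bit:
                sibling.entries[name] = bucket.entries.pop(name)
        # Repoint the table slots that now belong to the sibling.
        for slot, owner in enumerate(self.table):
            if owner is bucket and slot & new_bit:
                self.table[slot] = sibling

    def names(self):
        """Sequential traversal: visit each distinct bucket once."""
        seen = set()
        for bucket in self.table:
            if id(bucket) not in seen:
                seen.add(id(bucket))
                yield from bucket.entries


if __name__ == "__main__":
    d = ExtendibleDirectory()
    d.insert("file1", 101)
    d.insert("file2", 102)
    d.insert("dir1", 103)
    d.insert("file2_hardlink", 102)   # hard link: same metadata ID as file2
    print(sorted(d.names()))
```

Because a split touches only one bucket, growing a directory never rehashes every entry, which is what makes resizing cheap compared with a fixed-size hash table.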
34
Additional Engineering Details
Popular hash functions randomize lower bits
Dynamic file positioning
Need to handle collisions
Memory overhead and complexity tradeoffs
35
Metadata Allocation Requirements
Keep track of usage status of metadata entries
Avoid duplicate allocation with unique IDs
Fast retrieval of metadata with a given ID
ID: 1| free
ID: 2| in use
ID: 3| free
ID: 4| free
ID: 5| in use
ID: 6| free
36
Existing Memory Allocation Services
Keep track of unallocated memory
No duplicate allocation of physical addresses
Hmm…
ADDR 0xe000000| free
ADDR 0xe000038| in use
ADDR 0xe000070| free
ADDR 0xe0000A8| free
ADDR 0xe0000E0| free
ADDR 0xe000118| in use
37
Conquest Metadata Management
Metadata = memory allocated by memory manager
Metadata ID = physical address of metadata
ID: 1| free
ID: 2| in use
ID: 3| free
ID: 4| free
ID: 5| in use
ID: 6| free
ADDR 0xe000000| free
ADDR 0xe000038| in use
ADDR 0xe000070| free
ADDR 0xe0000A8| free
ADDR 0xe0000E0| free
ADDR 0xe000118| in use
Usage status
Unique IDs and fast retrieval
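The idea that a metadata ID can simply be the address of the metadata is sketched below with a simulated arena. Python cannot hand out real physical addresses, so the base address and the 0x38-byte slot size are taken from the example addresses on the slide and used purely as numbers; the class and method names are hypothetical.

```python
class MetadataArena:
    """Simulated memory region in which a metadata ID is the slot's address."""

    def __init__(self, base=0xE000000, slot_size=0x38, num_slots=6):
        self.base = base
        self.slot_size = slot_size
        self.slots = [None] * num_slots   # None marks a free slot

    def allocate(self, metadata):
        """Store non-None metadata in the first free slot; return its address."""
        for index, current in enumerate(self.slots):
            if current is None:
                self.slots[index] = metadata
                return self.base + index * self.slot_size
        raise MemoryError("no free metadata slot")

    def _index(self, metadata_id):
        index, remainder = divmod(metadata_id - self.base, self.slot_size)
        if remainder or not 0 <= index < len(self.slots):
            raise KeyError(hex(metadata_id))
        return index

    def lookup(self, metadata_id):
        """Retrieval is address arithmetic; no separate ID table is consulted."""
        entry = self.slots[self._index(metadata_id)]
        if entry is None:
            raise KeyError(hex(metadata_id))
        return entry

    def free(self, metadata_id):
        self.slots[self._index(metadata_id)] = None


if __name__ == "__main__":
    arena = MetadataArena()
    first = arena.allocate({"name": "a"})
    second = arena.allocate({"name": "b"})
    print(hex(first), hex(second))
```

The allocator's own free/in-use bookkeeping doubles as the usage status, and uniqueness comes for free because no two live allocations share an address.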
38
Persistence Support
Restore file system states after a reboot
  Data
  Metadata
  Memory manager
    Keep track of metadata allocation
39
Linux Memory Manager (1)
Page allocator maintains individual pages
[Figure: page allocator]
40
Linux Memory Manager (2)
Zone allocator allocates memory in power-of-two sizes
[Figure: page allocator, zone allocator]
41
Linux Memory Manager (3)
Slab allocator groups allocations by sizes to reduce internal memory fragmentation
[Figure: page allocator, zone allocator, slab allocator]
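The fragmentation argument can be checked with a little arithmetic. The sketch below compares giving every object its own power-of-two chunk against packing same-sized objects back to back in a page, which is the basic idea behind grouping allocations by size. It is a simplification: the 56-byte object and 4 KB page are hypothetical values, and per-slab bookkeeping and alignment padding are ignored.

```python
def round_up_pow2(size):
    """Smallest power of two that is >= size (size must be at least 1)."""
    return 1 << (size - 1).bit_length()


def pow2_waste(object_size, count):
    """Bytes lost when each object gets its own power-of-two chunk."""
    return (round_up_pow2(object_size) - object_size) * count


def slab_tail_waste(object_size, page_size=4096):
    """Bytes left over in one page packed with same-sized objects."""
    return page_size % object_size


if __name__ == "__main__":
    size = 56                      # hypothetical object size in bytes
    per_page = 4096 // size        # objects that fit in one packed page
    print("power-of-two waste:", pow2_waste(size, per_page), "bytes")
    print("packed-page waste:", slab_tail_waste(size), "bytes")
```

For object sizes just above a power of two the difference is much larger, since nearly half of every power-of-two chunk goes unused.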
42
Linux Memory Manager (4)
Difficult to restore the persistent states
  Three layers of pointer-rich mappings
  Mixing of persistent and temporary allocations
[Figure: page allocator, zone allocator, slab allocator]
43
Conquest Persistence
Create memory zones with own instantiations of memory managers
[Figure: page allocator, zone allocator, slab allocator]
44
Conquest Persistence
Encapsulate all pointers within each zone
  Pointers can survive reboots
  No serialization and deserialization
Swapping and paging
  Disabled for Conquest memory zones
  Enabled for non-Conquest zones
45
Resiliency Support
Instantaneous metadata commit
No fsck (ad hoc metadata consistency check)
Built-in checkpointing
Pointer-switch commit semantics
[Figure: a commit performed by switching a pointer]
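Pointer-switch commit can be illustrated generically: prepare the new version off to the side, then make it visible with a single pointer assignment, so readers see either the old state or the new one and never a half-finished update. The sketch below is an assumption-laden illustration of that general technique, not Conquest's implementation; the class name is hypothetical and the copy is shallow.

```python
class PointerSwitchStore:
    """Holds one committed version; updates are built aside, then switched in."""

    def __init__(self, initial):
        self._current = dict(initial)

    def read(self):
        return self._current

    def commit(self, mutate):
        draft = dict(self._current)   # private (shallow) copy of the committed state
        mutate(draft)                 # if this raises, the committed state is untouched
        self._current = draft         # the single pointer switch is the commit


if __name__ == "__main__":
    store = PointerSwitchStore({"size": 1})
    store.commit(lambda d: d.update(size=2))
    print(store.read())
```

If the update fails partway, the old version is still the one being pointed at, which is why no after-the-fact consistency scan is needed for the committed state.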
46
Implementation Status
Kernel module under Linux 2.4.2
Fully functional and POSIX compliant
Modified memory manager to support Conquest persistence
Need to overcome BIOS limitations for distribution
Looking for licensing opportunities
47
Performance Evaluation
Architectural simplification
  Feature count
Performance improvement
  Memory-only workload
  Memory and disk workload
48
Conventional Data Path
Buffer allocation management
Buffer garbage collection
Data caching
Metadata caching
Predictive readahead
Write behind
Cache replacement
Metadata allocation
Metadata placement
Metadata translation
Disk layout
Fragmentation management
[Figure: Conventional File Systems data path: storage requests, IO buffer management, IO buffer, persistence support, disk management, disk]
49
Memory Path of Conquest
Buffer allocation management
Buffer garbage collection
Data caching
Metadata caching
Predictive readahead
Write behind
Cache replacement
Metadata allocation
Metadata placement
Metadata translation
Disk layout
Fragmentation management
Memory manager encapsulation
[Figure: Conquest Memory Data Path: storage requests, persistence support, battery-backed RAM (small file and metadata storage)]
50
Disk Path of Conquest
Buffer allocation management
Buffer garbage collection
Data caching
Metadata caching
Predictive readahead
Write behind
Cache replacement
Metadata allocation
Metadata placement
Metadata translation
Disk layout
Fragmentation management
[Figure: Conquest Disk Data Path: storage requests, IO buffer management, IO buffer, disk management, disk (large-file-only file system), and battery-backed RAM (small file and metadata storage)]
[Katcher 1997; Sweeney et al., 1996; Card et al., 1999; Namesys 2002]
51
PostMark Benchmark (1)
ISP workload (emails, web-based transactions)
40 to 250 MB working set with 2 GB physical RAM
Conquest is comparable to ramfs
At least 24% faster than the LRU disk cache
[Figure: trans / sec (0 to 9000) versus number of files (5000 to 30000) for SGI XFS, reiserfs, ext2fs, ramfs, and Conquest]
52
PostMark Benchmark (2)
10,000 files, 80 MB to 3.5 GB working set with 2 GB physical RAM
When both memory and disk components are exercised, Conquest can be several times faster than ext2fs, reiserfs, and SGI XFS
[Figure: trans / sec (0 to 5000) versus percentage of large files (0.0 to 10.0) for SGI XFS, reiserfs, ext2fs, and Conquest, with regions marked working set <= RAM and working set > RAM]
53
PostMark Benchmark (3)
10,000 files, 80 MB to 3.5 GB working set with 2 GB physical RAM
When working set > RAM, Conquest is 1.4 to 2 times faster than ext2fs, reiserfs, and SGI XFS
[Figure: trans / sec (0 to 120) versus percentage of large files (6.0 to 10.0) for SGI XFS, reiserfs, ext2fs, and Conquest]
54
Sprite LFS Microbenchmarks (1)
Small-file benchmark
  Operates on 10,000 1-KB files in three phases
[Figure: op / sec (0 to 180000) for the create, read, and delete phases for SGI XFS, reiserfs, ext2fs, ramfs, and Conquest]
55
Sprite LFS Microbenchmarks (2)
Modified large-file microbenchmark: 10 1-MB files (Conquest in-core files)
[Figure: MB / sec (0 to 700) for the seq write, seq read, rand write, rand read, and seq read phases for SGI XFS, reiserfs, ext2fs, ramfs, and Conquest]
56
Sprite LFS Microbenchmarks (3)
Modified large-file microbenchmark: 10 1.01-MB files (Conquest on-disk files)
[Figure: MB / sec (0 to 700) for the seq write, seq read, rand write, rand read, and seq read phases for SGI XFS, reiserfs, ext2fs, ramfs, and Conquest]
57
Sprite LFS Microbenchmarks (4)
Large-file microbenchmark: 40 100-MB files (Conquest on-disk files)
[Figure: MB / sec (0 to 30) for the seq write, seq read, rand write, rand read, and seq read phases for SGI XFS, reiserfs, ext2fs, and Conquest]
58
History’s Mystery
Puzzling Microbenchmark Numbers…
Geoffrey Kuenning: “If Conquest is slower than ext2, I will toss you off of the balcony…”
59
With me hanging off a balcony…
Original large-file microbenchmark: 1-MB file (Conquest in-core file)
[Figure: MB / sec (0 to 700) for the seq write, seq read, rand write, rand read, and seq read phases for SGI XFS, reiserfs, ext2fs, ramfs, and Conquest]
60
Odd Microbenchmark Numbers
Why are random reads slower than sequential reads?
[Figure: MB / sec (0 to 700) for the seq write, seq read, rand write, rand read, and seq read phases for SGI XFS, reiserfs, ext2fs, ramfs, and Conquest]
61
Odd Microbenchmark Numbers
Why are RAM-based file systems slower than disk-based file systems?
[Figure: MB / sec (0 to 700) for the seq write, seq read, rand write, rand read, and seq read phases for SGI XFS, reiserfs, ext2fs, ramfs, and Conquest]
62
A Series of Hypotheses
Warm-up effect?
  Maybe
  Why do RAM-based systems warm up slower?
Bad initial states?
  No
Pentium III streaming IO option?
  No
63
Effects of Cache Footprint Sizes
[Figure: two panels, "Large cache footprint" and "Small cache footprint". Each panel shows a file being written sequentially and then the same file being read sequentially, with labels for the footprint, the file end, the read position, and flushes.]
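One way to see why footprint size matters is a toy cache simulation. The sketch below assumes a plain LRU cache measured in blocks, which is an assumption made for illustration (the slide does not specify the replacement policy, and real L2 caches are not strictly LRU). It writes a file sequentially, reads the same file sequentially, and counts how many reads hit in the cache.

```python
from collections import OrderedDict


def sequential_read_hits(file_blocks, cache_blocks):
    """Write blocks 0..n-1 in order, read them back in order, count read hits."""
    cache = OrderedDict()   # keys are block numbers, ordered oldest to newest

    def touch(block):
        hit = block in cache
        if hit:
            cache.move_to_end(block)          # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)     # evict the least recently used
        return hit

    for block in range(file_blocks):          # sequential write pass
        touch(block)
    return sum(touch(block) for block in range(file_blocks))  # sequential read pass


if __name__ == "__main__":
    print("footprint larger than cache:", sequential_read_hits(10, 4))
    print("footprint fits in cache:", sequential_read_hits(4, 4))
```

When the footprint exceeds the cache, the write pass leaves only the end of the file cached, and each read of an early block evicts a block that will be needed shortly afterward, so the sequential read never hits. When the footprint fits, every read hits.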
64
Sprite LFS Microbenchmarks
Modified large-file microbenchmark: 10 1-MB files (Conquest in-core files)
Random accesses are faster than sequential accesses due to cache reuse
[Figure: MB / sec (0 to 700) for the seq write, seq read, rand write, rand read, and seq read phases for SGI XFS, reiserfs, ext2fs, ramfs, and Conquest]
66
Lessons Learned
Faster than LRU caching, unexpected
  Heavyweight disk handling
  Severe penalty for accessing memory content
Matching user access patterns to storage media offers considerable simplification and better performance
  Not an automatic result
  Need careful design
67
More Lessons Learned
Effects of L2 caching become highly visible in memory workloads (modern workloads)
Cannot blindly apply existing disk-based microbenchmarks to measure memory performance of file systems
Need to consider states of L2 cache and memory behaviors at each stage of microbenchmarking
68
Additional Lessons Learned
Don’t discuss your performance numbers next to a balcony…unless…
[McKusick et al., 1990; Ganger et al., 2000; Roselli et al., 2000; Seltzer et al., 2000]
69
Related Work (1)
Disk caching
  Assumption of scarce memory
  Complex mechanisms to maintain consistency
    Especially with the presence of metadata
RAM drives and RAM file systems
  Not meant to be persistent
  Use disk-related mechanisms
  Limitations on storage capacity
[Riedel 1998; ZDNet 1999]
70
Related Work (2)
Disk emulators
  RAM storage accessed through SCSI interface
Ad hoc approaches
  Manual transferring of files to and from ramfs
    Capacity limitation
  Background daemon to stage RAM files to a disk
    Semantic and name space problems
71
Going Beyond Conquest (1)
Matching usage patterns with heterogeneous machines in the distributed domain
  Specialized tasks for machines within a cluster
  Preferably self-organizing and self-evolving
State-rich computing
  Caching of runtime data structures
  Similar to /tmp
72
Going Beyond Conquest (2)
Separate storage of metadata from data
  Association of metadata with data of different fidelity
  Opportunity for hierarchical replication across devices with different calibers
Benchmarking memory performance of file systems
  Developing new memory benchmarks
73
Contributions
Demonstrated the feasibility of disk-memory hybrid file systems
Showed performance does not preclude simplicity
Pinpointed cache-related problems with modern benchmarks
Opened doors to many exciting areas of research
74
Conclusion
Conquest demonstrates how rethinking changes in underlying assumptions can lead to significant architectural and performance improvements
Radical changes in hardware, applications, and user expectations in the past decade should lead us to rethink other aspects of OS as well.