btrfs overview, variable blocksize support and others
TRANSCRIPT
Slide 1
Btrfs overview, variable blocksize support and others
IBM Linux Technology Center
Mingming Cao
Oct 13th, 2012
Slide 2
Agenda
Storage challenges
Btrfs basics
Btrfs variable blocksize support
Btrfs fragmentation issue
Btrfs performance
Slide 3
Linux filesystems
Data is essential to end users
− Linux has 50+ filesystems to choose from
Most-active local filesystems are Ext3/4, XFS, and Reiserfs
− Traditional Unix fs design + improvements
− Stable, reliable, easy to understand
New storage trends call for a new filesystem
− Other OSes are working on next-generation filesystems from scratch (ZFS, ReFS, etc.)
Slide 4
Storage trend needs
Better data integrity and high availability
Scalability and high performance
Easy data storage management
Support for new and existing media
Virtualization
Slide 5
Btrfs and a short history
Chris Mason started looking into a new fs in 2007
− After attending the LSF workshop
− Inspired by an IBM research report about COW-based snapshots
− Many experiences and lessons learned from other Linux filesystems (reiserfs, etc.)
− Lots of support from the community and multiple companies
Slide 6
Btrfs Design target (back in 2007)
Storage pools
Writable, named, recursive snapshots
Fast filesystem checking and recovery
Easy large-storage management for admins
Proactive error management
Better security
High scalability
− 128 CPU cores, 256 spindles, hundreds of subvolumes
Fast incremental backup
Slide 7
Btrfs time line target and now
2007: Chris Mason started btrfs
2008: Alpha testers on the desktop
2009: Upstream (kernel 2.6.29, FC11)
2010: Feature complete
2011: Preview in distro (RHEL6.2 preview)
2012: Enabled as root fs (SLES12sp2)
Slide 8
Btrfs basics -- btree
Everything is in a btree
Three basic on-disk data structures:
− Block header
− Keys
− Items
Every tree block is a node or a leaf
− All start with a block header and keys
− Only leaves store the actual data
Copy-on-write protects tree integrity

Nodes: [header, key/ptr 0 .... key/ptr N]
Leaves: [header][item0 .... itemN][free space][dataN ... data0]

struct btrfs_header {
    u8 csum[32];
    u8 fsid[16];
    __le64 blocknr;
    __le64 flags;
    u8 chunk_tree_uid[16];
    __le64 generation;
    __le64 owner;
    __le32 nritems;
    u8 level;
};

struct btrfs_disk_key {
    __le64 objectid;
    u8 type;
    __le64 offset;
};

struct btrfs_item {
    struct btrfs_disk_key key;
    __le32 offset;
    __le32 size;
};
Slide 9
Btrfs leaf nodes
Slide 10
Btrfs btree
One main btree plus 6 special-purpose btrees:
− fs tree (main tree)
− extent allocation tree
− chunk tree
− device allocation tree
− checksum tree
− data relocation tree
− log root tree
All share a single btree implementation
Metadata from different files and directories is mixed together in a block
Slide 11
Btrfs efficient metadata packing
[Diagram: metadata packing in ext3/4 vs btrfs; ext3/4 shows wasted space and extra disk seeks]
Slide 12
Btrfs transactions
No journalling block layer
No modification in place
Tree is cloned
New roots are added into the root tree
Will not commit and link the new root until all data and metadata are written to disk
Original trees/nodes can then be dereferenced (freed)
[Diagram: btree with node [4 7] over leaves [4 5 6] and [8 9 11]; after adding 12, a cloned leaf [8 9 11 12] hangs off a new root while the original tree remains intact]
Slide 13
Btrfs basic -- snapshot
Snapshots are readable and recursively writable
Snapshots in btrfs are really efficient
− Similar to transactions, implemented by just increasing the reference count on the original blocks
− Only the modified part is cloned/copied
Snapshots are subvolumes
File clone with cp --reflink
Slide 14
Btrfs basic-- checksums
Both data and metadata are checksummed
Checksums are validated after blocks are read from disk
Uses several background threads to offload the work
Based on checksums, btrfs performs background scrubbing, scanning both data and metadata blocks
Slide 15
Btrfs basic – multi disks support
Storage pool: easy to add and remove disks
Allocates space on its disks by allocating chunks
− 1 gigabyte chunks for data
− 256 megabyte chunks for metadata
All physical devices are hidden; logical devices are created to map a linear address space onto multiple disks
Disk type and speed can be mixed
− Potential to place hot data on fast disks
Filesystem is still chunked into blockgroups
− The extent allocation tree keeps track of the fs extent allocation information
Slide 16
Btrfs offers today
What ext4/xfs does:
Extents, journalling, online defragmentation, delayed allocation, preallocation, punch hole, fiemap, DIO/AIO, quotas, extended attributes, trim/discard, offline filesystem check (sort of)
What ext4/xfs does not:
Integrated LVM multi-disk support, snapshots/subvolumes, data and metadata checksumming, online scrubbing, compression, SSD support, space-efficient packing of small files, offline conversion from ext3/4 to btrfs, dynamic metadata usage
Slide 17
Btrfs challenges
Performance with sync
− Frequent commits generate many re-COWs of blocks
− The log tree logs only the changed file and directory metadata
− Helps, but still hurts in heavy-sync cases
Fragmentation issue
− Inherent to COW by nature
− Reduced by clustering and preallocation
− A large blocksize could reduce metadata fragmentation
Variable blocksize support
− For better performance and less fragmentation
− For architectures with large pages
Slide 18
blocksize vs pagesize
Filesystems are chunked into units of blocks; the fs handles mapping of on-disk blocks into memory
Linux uses the buffer_head structure to
− map logical to physical blocks
− ensure data=ordered journalling mode
buffer_head has issues
− Each page has a buffer_head structure
− Consumes lots of low memory
− Pages are hard to reclaim when the buffer_head is no longer referenced
− Hard to perform large chunks of IO
Slide 19
Various blocksize
Btrfs initially had the goal of supporting both large and small block sizes; it is now time to bring that back
Large blocksize
− Better performance
− Reduced fragmentation
Small blocksize (subpage)
− Space efficiency
− Better cross-architecture enablement
Slide 20
Btrfs basic data structures for IO
Needs data structures to replace the buffer_head
Basic IO data structures
− extent_map: maps logical to physical extents
− extent_state: tracks IO status
− A per-inode extent map tree and extent IO tree
− extent_buffer: used only for metadata
Slide 21
Btrfs metadata large block support
One single extent buffer is attached to multiple pages via page->private
Pages can be discontiguous in memory; page pointers are saved in the extent buffer
Extent readpage/writepage iterate over the pages to read/write one extent buffer; COW is done at extent-buffer granularity
mkfs.btrfs -l 64k generates much less fragmentation for metadata workloads
Slide 22
Extent buffer
Large metadata block support accepted in the 3.3 kernel
Still missing small blocksize support
− Unable to mount a small-blocksize btrfs on large-pagesize machines
− Implies we can't do live migration from ext3/4 to btrfs on large-pagesize architectures

struct extent_buffer {
    u64 start;
    unsigned long len;
    unsigned long map_start;
    unsigned long map_len;
    unsigned long bflags;
    struct extent_io_tree *tree;
    spinlock_t refs_lock;
    atomic_t refs;
    atomic_t io_pages;
    ....
    struct page *inline_pages[];
    struct page **pages;
};
Slide 23
Btrfs subpage blocksize
Metadata:
− Lots of assumptions in btrfs that extent buffer size = pagesize
− Chain up extent buffers inside a page
− Extent buffers radix-indexed by blocksize
− Teach extent buffer readpage/writepage to iterate over multiple extent buffers

struct extent_buffer {
    u64 start;
    unsigned long len;
    unsigned long map_start;
    unsigned long map_len;
    unsigned long bflags;
    struct extent_io_tree *tree;
    spinlock_t refs_lock;
    atomic_t refs;
    atomic_t io_pages;
    ....
    struct page *inline_pages[];
    struct page **pages;
    struct extent_buffer *next_eb_this_page;
};
Slide 24
Subpage data blocks
The existing extent map code is pretty much set up to support ranges smaller than a pagesize
Just need to link the extent map data structures together; page->private points to the first extent map
Using the extent IO tree for tracking IO status
Patch work in progress
Slide 25
Btrfs performance
Overall good for general workloads
− FFSB runs random/sequential read/write tests
− Fast on random IO operations
− Slow on metadata-intensive workloads (e.g. mail server)
− Lock contention becomes visible with multiple threads
− Tests were done on a 2-socket quad-core Intel system on a SAN with LVM
Slide 26
Btrfs performance – random reads
[Chart: Btrfs vs Ext4 vs XFS random reads using FFSB; throughput (MB/sec) at 1, 16, and 128 threads; all three filesystems cluster between roughly 206 and 216 MB/sec]
Slide 27
Btrfs performance – random writes
[Chart: Btrfs vs ext4 vs XFS random writes using FFSB; throughput (MB/sec) at 1, 16, and 128 threads]
Slide 28
Btrfs performance – Mail server
[Chart: btrfs vs ext4 vs XFS mail server workload (simulated with FFSB); throughput (MB/sec) at 1, 16, and 128 threads]
Slide 29
Btrfs performancedatabase workload study
[Chart: ext4, xfs, btrfs comparison of database transaction throughput]
Slide 30
Btrfs performancedatabase maintenance tasks
[Chart: elapsed time (seconds) for various maintenance tasks (create tablespace, backup, restore) for ext4, xfs, btrfs]
Slide 31
Btrfs database workload study
Ran a small OLTP workload on SSDs
− Btrfs does not perform so well
− COW causes bad fragmentation
− Sync operations are still expensive
− DIO and preallocation did not deliver nocow well enough at first
Further investigation of the fragmentation issue
− Large random IO writes with fallocate and DIO
− Using preallocation and direct IO does not mean nocow all the time
− The first time the preallocated space is filled, nocow is used
− After that the space goes back to COW, and fragmentation starts
Slide 32
Btrfs future work
RAID 5/6 support
Efficient offline fsck
De-duplication for btrfs
ENOSPC issue
Tiered storage
Slide 33
Links
Btrfs wiki page: https://btrfs.wiki.kernel.org/index.php/Main_Page
Contact: [email protected]
Thanks!