TRANSCRIPT
CompoundFS: Compounding I/O Operations in Firmware File Systems
Yujie Ren (1), Jian Zhang (2), and Sudarsun Kannan (1)
(1) Rutgers University; (2) ShanghaiTech University
In-storage Processors Are Powerful
Year:     2008          2013           2018
Device:   Intel X25-M   Samsung 840    Samsung 970
CPU:      2-core        3-core         5-core
RAM:      128MB DDR2    512MB LPDDR2   1GB LPDDR4
Price:    $7.4/GB       $0.92/GB       $0.80/GB
Latency:  ~70μs         ~60μs          ~40μs
B/W:      250 MB/s      500 MB/s       3300 MB/s
Software Latency Matters Now
OS Kernel Software Overhead Matters!
[Figure: a write() from the application traverses the VFS layer, the actual FS (e.g., PMFS, ext4), the page cache, the block I/O layer, and the device driver; kernel traps, data copies, and OS overhead are marked, and the OS software overhead alone is 1-4 μs.]
Current Solutions
• DirectFS designs (e.g., Strata (SOSP '17), SplitFS (SOSP '19), DevFS (FAST '18)) reduce software overhead by bypassing the OS kernel partially or fully
[Figure: architectures of the three systems: Strata places an FS Lib in the application with a separate FS Server in front of storage; SplitFS pairs an FS Lib for data-plane ops with a kernel DAX FS for control-plane ops; DevFS puts an FS Lib directly over a firmware FS inside the storage device. Data-plane and control-plane ops are marked separately.]
Limitation of Current Solutions
• DirectFS designs do not reduce boundary crossing
- Strata needs boundary crossing between the FS Lib and the FS Server
- SplitFS needs a kernel trap for control-plane operations
- DevFS suffers from high PCIe latency for every operation
• DirectFS designs do not efficiently reduce data copies
- Current solutions need multiple data copies back and forth between the application and the storage stack
• DirectFS designs do not utilize in-storage computation
- Current solutions only use host CPUs for I/O-related operations
Analysis Methodology
• File systems
- ext4-DAX: ext4 on byte-addressable storage, bypassing the page cache
- SplitFS: direct-access file system bypassing the kernel for data-plane ops
• Application
- LevelDB: well-known persistent key-value store
- db_bench: random write and read benchmarks
• Storage
- Emulated persistent memory on DRAM, as in prior work (e.g., SplitFS)
LevelDB Overhead Breakdown
• LevelDB spends significant time (~50%) in the OS storage stack
• Spends ~15% of its time on data copies between the application and the OS
• Spends ~20% of its time on application-level crash consistency (CRC of data)
[Figure: runtime percentage breakdown (0-80%) for value sizes of 256 and 4096 bytes on ext4-DAX and SplitFS; categories: Data allocation (OS), Data copy (OS), Filesystem update (OS), Lock (OS), Data allocation (user), Data copy (user), CRC32 (user).]
Our solution: CompoundFS
• Combine (compound) multiple file system I/O ops into one
• Offload I/O pre- and post-processing to storage-level CPUs
• Bypass OS kernel and provide direct-access
Our solution: CompoundFS
• Combine (compound) multiple file system I/O ops into one
- e.g., a write() after a read() compounded into write-after-read()
- Reduces boundary crossings between host and storage (e.g., syscalls)
• Offload I/O pre- and post-processing to storage-level CPUs
- e.g., a checksum() after a write() compounded into write-and-checksum()
- Storage CPUs perform the computation (e.g., checksum) and persist the result
- Reduces data movement cost across boundaries
• Bypass the OS kernel and provide direct access
- A firmware file system design provides direct access for data-plane and most control-plane operations
I/O Only Compound Operations
Read-modify-write:
• Traditional FS path: read(data) into user space, modify, then write(data) back: 2 syscalls (kernel traps) + 2 data copies across the user/kernel boundary
• CompoundFS path: a single read_modify_write(data) request; StorageFS performs the compound op (including the modify) inside the device, so only 1 data copy is needed with direct access
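The copy saving can be seen in a tiny simulation. The sketch below is purely illustrative (the in-memory `storage` byte array, the `copies` counters, and the function names are assumptions, not CompoundFS's real interface): the traditional path crosses the boundary twice per read-modify-write, while the compound path ships the modify to the "device" and crosses once.

```python
# Illustrative model: 'storage' stands in for the device; each boundary
# data copy on the traditional path is counted explicitly.
storage = bytearray(b"hello world")
copies = {"traditional": 0, "compound": 0}

def traditional_rmw(off, sz, modify):
    buf = bytes(storage[off:off + sz])    # read(): copy storage -> user
    copies["traditional"] += 1
    buf = modify(buf)                     # modify in user space
    storage[off:off + sz] = buf           # write(): copy user -> storage
    copies["traditional"] += 1

def compound_rmw(off, sz, modify):
    # read_modify_write(): the modify runs on the device CPU, so only
    # one compound request (one boundary copy) is issued.
    storage[off:off + sz] = modify(bytes(storage[off:off + sz]))
    copies["compound"] += 1

traditional_rmw(0, 5, lambda b: b.upper())   # 2 copies
compound_rmw(6, 5, lambda b: b.upper())      # 1 copy
print(bytes(storage), copies)
```

The same update costs two boundary copies on the traditional path but only one on the compound path.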
I/O + Compute Compound Operations
Write-and-checksum:
• Traditional FS path: write(data) followed by write(checksum): 2 syscalls + 2 data copies
• CompoundFS path: a single write_and_checksum(data) request; StorageFS handles the checksum calculation inside the device, so only 1 data copy is needed with direct access
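A minimal sketch of the write-and-checksum semantics, assuming CRC32 as the checksum and a dict standing in for the device (the `write_and_checksum`/`verify` names, the `(fd, off)` keying, and storing the 4-byte CRC at the head of the record are illustrative assumptions, loosely following the talk's `checksum_pos=head` example):

```python
import zlib

# Illustrative model of write_and_checksum(): the "device" computes the
# CRC32 and persists checksum + data as one operation, instead of the
# application issuing write(data) and then write(checksum).
storage = {}

def write_and_checksum(fd, buf, off):
    crc = zlib.crc32(buf)                            # computed in-device
    storage[(fd, off)] = crc.to_bytes(4, "little") + buf
    return len(buf)

def verify(fd, off):
    rec = storage[(fd, off)]
    crc, data = int.from_bytes(rec[:4], "little"), rec[4:]
    return crc == zlib.crc32(data)

write_and_checksum(fd=3, buf=b"key=value", off=10)
print(verify(3, 10))   # True: stored checksum matches the data
```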
CompoundFS Architecture
• UserLib (in host): converts POSIX I/O syscalls into CompoundFS compound ops and submits them through per-inode I/O queues with per-inode data buffers
- Op1: open(File1) -> fd1 (application thread 1)
- Op2+: read_modify_write(fd2, buf, off=30, sz=5) (compounded I/O ops, application thread 2)
- Op3*: write_and_checksum(fd1, buf, off=10, sz=1K, checksum_pos=head) (CRC calculation performed before the write())
- Op4: read(fd2, buf, off=30, sz=5)
• StorageFS (in device): I/O request processing threads on device CPU cores consume the per-inode queues
- A journal (TxB ... meta-data, NVM data block addr ... TxE) records updates
- A credential table maps CPUIDs to credentials
CompoundFS Implementation
• Command-based architecture based on PMFS (EuroSys '14)
- Control-plane ops (e.g., open) are issued as commands via ioctl()
- ioctl() carries the arguments for each I/O op
• Avoids VFS overhead
- Control-plane ops issued via ioctl() skip the VFS layer
• Avoids system call overhead
- UserLib and StorageFS share a command buffer
- UserLib adds requests to the command buffer
- StorageFS processes requests from the buffer
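The shared command buffer can be sketched as a simple ring: UserLib fills slots, StorageFS drains them, and no syscall is needed per request. The class, field names, and command layout below are illustrative assumptions, not CompoundFS's actual data structures (the real buffer would live in memory shared with the device):

```python
from collections import namedtuple

# Sketch of the UserLib/StorageFS shared command buffer as a ring.
Cmd = namedtuple("Cmd", "op fd off sz")

class CommandBuffer:
    def __init__(self, slots=8):
        self.buf = [None] * slots
        self.head = 0   # next slot StorageFS consumes
        self.tail = 0   # next slot UserLib fills

    def submit(self, cmd):             # UserLib side: no kernel trap
        if (self.tail + 1) % len(self.buf) == self.head:
            raise BufferError("command buffer full")
        self.buf[self.tail] = cmd
        self.tail = (self.tail + 1) % len(self.buf)

    def poll(self):                    # StorageFS side: device thread
        if self.head == self.tail:
            return None
        cmd, self.buf[self.head] = self.buf[self.head], None
        self.head = (self.head + 1) % len(self.buf)
        return cmd

cb = CommandBuffer()
cb.submit(Cmd("read_modify_write", fd=2, off=30, sz=5))
cb.submit(Cmd("write_and_checksum", fd=1, off=10, sz=1024))
print(cb.poll().op, cb.poll().op)   # drained in submission order
```

This mirrors how shared submission rings (e.g., NVMe queues) replace per-operation traps with polled shared memory.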
CompoundFS Challenges
• Crash-consistency model for compound I/O operations
• All-or-nothing model (current solution)
- An entire compound operation is one transaction
- Partially completed operations cannot be recovered
- e.g., write-and-checksum where only the data is persisted but the checksum is not
• All-or-something model (ongoing)
- Fine-grained journaling and partial recovery are supported
- Recovery could become complex
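The all-or-nothing model can be sketched with the journal's TxB/TxE markers: on recovery, any transaction whose TxE never made it to the journal is discarded wholesale, so a half-finished compound op (data persisted, checksum not) never survives. The `recover` function and the string-based journal records are illustrative assumptions:

```python
# Sketch of all-or-nothing recovery over a TxB ... TxE journal.
def recover(journal):
    committed, current = [], None
    for rec in journal:
        if rec == "TxB":          # transaction begin marker
            current = []
        elif rec == "TxE":        # transaction end marker: commit
            if current is not None:
                committed.append(current)
            current = None
        elif current is not None:
            current.append(rec)
    return committed              # open transaction (no TxE) is dropped

journal = ["TxB", "write data@10", "write crc@10", "TxE",
           "TxB", "write data@20"]          # crash before second TxE
print(recover(journal))   # only the complete compound op survives
```

The ongoing all-or-something model would instead journal at a finer grain so the salvageable prefix of an interrupted compound op could be kept.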
Evaluation Goal
• Effectiveness in reducing boundary crossings
• Effectiveness in reducing data copy overheads
• Ability to exploit the compute capability of modern storage
Experimental Setup
• Hardware platform
- Dual-socket 64-core Xeon Scalable CPU @ 2.6GHz
- 512GB Intel DC Optane NVM
• Emulated firmware-level FS
- Reserve dedicated device threads for handling I/O requests
- Add PCIe latency to every I/O operation
- Reduce CPU frequency to 1.2GHz for device CPUs
• State-of-the-art file systems
- ext4-DAX (kernel-level file system)
- SplitFS (user-level file system)
- DevFS (device-level file system)
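The emulation approach above (charging a fixed PCIe crossing cost to every device operation) can be sketched as a simple wrapper; the latency constant here is an arbitrary illustrative value, not the figure used in the evaluation:

```python
import time

# Sketch of the PCIe-latency emulation knob: every emulated device
# operation pays a fixed crossing cost before the work runs.
PCIE_LATENCY_S = 0.9e-6   # assumed value for illustration only

def emulated_device_op(fn, *args):
    time.sleep(PCIE_LATENCY_S)   # charge the PCIe crossing cost
    return fn(*args)

t0 = time.perf_counter()
result = emulated_device_op(sum, [1, 2, 3])
elapsed = time.perf_counter() - t0
print(result, elapsed >= PCIE_LATENCY_S)
```

Slowing the device CPUs (1.2GHz via frequency scaling in the real setup) similarly models the gap between host and in-storage compute.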
Micro-Benchmark
Read-modify-write and write-and-checksum:
• CompoundFS reduces unnecessary data movement and system call overhead by combining operations
• Even with slow device CPUs, CompoundFS can still provide gains from in-storage computation
[Figure: throughput (MB/s, 0-1200) for read-modify-write and write-and-checksum at value sizes of 256 and 4096 bytes, comparing ext4-DAX, SplitFS, DevFS, CompoundFS, and CompoundFS-slowCPU; CompoundFS achieves up to 2.1x speedup, and 1.25x even in the slow-CPU configuration.]
LevelDB
• CompoundFS also shows promising speedup in LevelDB
[Figure: db_bench random-write throughput (MB/s) and random-read latency (μs/op) for value sizes of 512 and 4096 bytes with 500k keys, comparing ext4-DAX, SplitFS, DevFS, CompoundFS, and CompoundFS-slowCPU; CompoundFS achieves up to 1.75x improvement.]
Conclusion
• Storage hardware is moving into the microsecond era
- Software overhead matters, and providing direct access is critical
- Storage compute capability can benefit I/O-intensive applications
• CompoundFS combines I/O ops and offloads computation
- Reduces boundary crossing (system call) and data copy overhead
- Takes advantage of in-storage compute resources
• Our ongoing work
- Fine-grained crash-consistency mechanism
- Efficient I/O scheduler for managing computation in storage