Download - Optimizing RocksDB for Open-Channel SSDs
![Page 1: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/1.jpg)
1
Towards Application Driven StorageOptimizing RocksDB for Open-‐Channel SSDs
Javier González <[email protected]>LinuxCon Europe 2015
Contributors: Matias Bjørling and Florin Petriuc
![Page 2: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/2.jpg)
Application Driven Storage: What is it?
2
RocksDB
Metadata Mgmt.
Standard Libraries User S
pace
Kernel Space
App-‐specific opt.
Page cache
Block I/O interface
FS-‐specific logic
![Page 3: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/3.jpg)
Application Driven Storage: What is it?
• Application Driven Storage - Avoid multiple (redundant)
translation layers
- Leverage optimization opportunities
- Minimize overhead when manipulating persistent data
- Make better decisions regarding latency, resource utilization, and data movement (compared to best-‐effort techniques today)
3
RocksDB
Metadata Mgmt.
Standard Libraries User S
pace
Kernel Space
App-‐specific opt.
Page cache
Block I/O interface
FS-‐specific logic
Generic <> Optimized
➡ Motivation: Give the tools to the applications that know how to manage their own storage
![Page 4: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/4.jpg)
Application Driven Storage Today
• Arrakis (https://arrakis.cs.washington.edu) - Remove the OS kernel entirely from normal application execution
• Samsung multi stream - Let the SSD know from where “I/O streams” emerge to make better
decisions
• Fusion I/O - Dedicated I/O stack to support a specific type of hardware
• Open-‐Channel SSDs - Expose SSD characteristics to the host and give it full control over its
storage
4
![Page 5: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/5.jpg)
Traditional Solid State Drives
• Flash complexity is abstracted away form the host by an embedded Flash Translation Layer (FTL) - Maps logical addresses (LBAs) to physical addresses (PPAs)
- Deals with flash constrains (next slide)
- Has enabled adoption by making SSDs compliant with the existing I/O stack
5
High throughput + Low latency Parallelism + Controller
![Page 6: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/6.jpg)
Page 0
Page 1
Page 2
Page n -1
…
StateOOBData
Flash memory 101
6
• Flash constrains: - Write at a page granularity
• Page state: Valid, invalid, erased
- Write sequentially in a block
- Write always to an erased page • Page becomes valid
- Updates are written to a new page • Old pages become invalid -‐ need for GC
- Read at a page granularity (seq./random reads)
- Erase at a block granularity (all pages in block) • Garbage collection (GC):
• Move valid pages to new block
• Erase valid and invalid pages -‐> erased state
![Page 7: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/7.jpg)
• Open Channel SSDs share control responsibilities with the Host in order to implement and maintain features that typical SSDs implemented strictly in the SSD device firmware
Open-‐Channel SSDs: Overview
7
• Host-‐based FTL manages: - Data placement - I/O scheduling - Over-‐provisioning - Garbage collection - Wear-‐leveling
• Host needs to know: - SSD features & responsibilities - SSD geometry
• NAND media idiosyncrasies • Die geometry (blocks & pages) • Channels, timings, etc. • Bad blocks & ECC
Host manages physical flash Application Driven Storage
Physical flash exposed to the host (Read, Write, Erase)
![Page 8: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/8.jpg)
Open-‐Channel SSDs: LightNVM
8
Key-Value/Object/FS/Block/etc.
Block Target Direct Flash Target
File-System
Block Manager (Generic, Vendor-specific, ...)
Open-Channel SSDs (NVMe, PCI-e, RapidIO, ...)
Kernel
User-space
Block Copy Engine
Metadata State Mgmt.
Bad Block State Mgmt.
XOR Engine ECC Engine
Error Handling
Etc.GC Engine
Raw NAND Geometry
Managed Geometry
Vendor-Specific Target
Hardware
Softw
are
LightNVM
Framew
ork
• Targets - Expose physical media to user-‐
space
• Block Managers - Manage physical SSD
characteristics - Evens out wear-‐leveling across
all flash
• Open-‐Channel SSD - Responsibility - Offload engines
![Page 9: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/9.jpg)
LightNVM’s DFlash: Application FTL
9
➡ DFlash is the LightNVM target supporting application FTLs
Open%Channel*SSD
get_block(),/,put_block(),/erase_block()
Block*Manager
DFlash*Target
Provisioning,interface Block,device
Applica:on
Provisioning,buffer
Block0 Block1 BlockN…
Applica@on,Logic
Normal,I/OblockNE>bppa,*,PAGE_SIZE
struct&nvm_tgt_type&/_dflash&=&{&&[…]&&.make_rq&&&&&&=&df_make_rq,&&.&end_io&&&&&&&&=&df_end_io,&&[…]};
FTL
syncpsynclibaioposixaio…
…
…
…
…
CH0CH1
CHN
Lun0 LunN
Physical,Flash,Layout,,E,NAND,E,specific,,E,Managed,by,controller
…
…
…
…
vblockManaged,Flash,,E,Exploit,parallelism,,E,Serve,applica@on,needs
type%1
type%2vblock
vblock type%3KERN
EL*SPA
CEUSER*SPAC
E
Data,placement
I/O,scheduling
OverEprovisioning
Garbage,collection
WearEleveling
struct vblock { uint64_t id; uint64_t owner_id; uint64_t nppas; uint64_t ppa_bitmap sector_t bppa; uint32_t vlun_id; uint8_t flags };
![Page 10: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/10.jpg)
Open-‐Channel SSDs: Challenges
1. Which classes of applications would benefit most from being able to manage physical flash? - Modify storage backend (i.e., no posix)
- Probably no file system, page cache, block I/O interface, etc.
2. Which changes do we need to make on these applications? - Make them work on Open-‐Channel SSDs
- Optimize them to take advantage of directly using physical flash (e.g., data structures, file abstractions, algorithms).
3. Which interfaces would (i) make the transition simpler, and (ii) simultaneously cover different classes of applications?
10
➡ New paradigm that we need to explore in the whole I/O stack
![Page 11: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/11.jpg)
RocksDB: Overview
11
• Embedded Key-‐Value persistent store • Based on Log-‐Structured Merge Tree • Optimized for fast storage • Server workloads • Fork from LevelDB • Open Source: - https://github.com/facebook/rocksdb
• RocksDB is not: - Not distributed
- No failover
- Not highly available
RocksDB Reference: The Story of RocksDB, Dhruba Borthakur and Haobo Xu (link)
The Log-‐Structured Merge-‐Tree, Patrick O'Neil, Edward Cheng Dieter Gawlick, Elizabeth O’Neil. Acta Informatica, 1996.
![Page 12: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/12.jpg)
RocksDB: Overview
12
Active MemTable
ReadOnlyMemTable
Log
Log
sst
sst
sst sst
sst sstCompaction
Flush
Switch Switch
LSM
Write Request
Read Request
Blooms
![Page 13: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/13.jpg)
Problem: RocksDB Storage Backend
13
sst
sst
sst sst
sst sst
User Data MetadataDB Log
WAL
… WALWALWALManifest Manifest Manifest…
Current
OtherLOG (Info)LOCK IDENTITY
Storage Backend
LSM Logic
Posix HDFS Win
RocksDB LSM
• Storage backend decoupled from LSM - WritableFile(): Sequential writes -‐> Only way to write to
secondary storage
- SequentialFile() -‐> Sequential reads. Used primarily for sstable user data and recovery
- RandomAccessFile() -‐> Random reads. Used primarily for metadata (e.g., CRC checks)
- Sstable: Persistent memtable
- DB Log: Write-‐Ahead Log (WAL)
- MANIFEST: File metadata
- IDENTITY: Instance ID
- LOCK: use_existing_db
- CURRENT: Superblock
- Info Log: Log & Debugging
![Page 14: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/14.jpg)
RocksDB: LSM using Physical Flash
• Objective: Fully optimize RocksDB for Flash memories - Control data placement:
• User data in sstables is close in the physical media (same block, adjacent blocks)
• Same for WAL and MANIFEST
- Exploit Parallelism: • Define virtual blocks based on file write patters in the storage backend
• Get blocks from different luns based on RocksDB’s LSM write patters
- Schedule GC and minimize over-‐provisioning • Use LSM sstable merging strategies to minimize (and ideally remove) the need for GC and
over-‐provisioning on the SSD
- Control I/O scheduling • Prioritize I/Os based on the LSM persistent needs (e.g., L0 and WAL have higher priority
than levels used for compacted data to maximize persistency in case of power loss)
14
➡ Implement an FTL optimized for RocksDB, which can be reused for similar applications (e.g., LevelDB, Cassandra, MongoDB)
![Page 15: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/15.jpg)
RocksDB + DFlash: Challenges
• Sstables (persistent memtables) - P1: Fit block sizes in L0 and further level (merges + compactions)
• No need for GC on SSD side -‐ RocksDB merging as GC (less write and space amplification)
- P2: Keep block metadata to reconstruct sstable in case of host crash
• WAL (Write-‐Ahead Log) and MANIFEST - P3: Fit block sizes (same as in sstables)
- P4: Keep block metadata to reconstruct the log in case of host crash
• Other Metadata - P5: Keep superblock metadata and allow to recover the database
- P6: Keep other metadata to account for flash constrains (e.g., partial pages, bad pages, bad blocks)
• Process - P7: Follow RocksDB architecture -‐ upstreamable solution
15
![Page 16: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/16.jpg)
Arena&block&(kArenaSize)(Op3mal&size&~&1/10&of&write_buffer_size)
write_buffer_size
…Flash'Block'0 Flash'Block'1
nppas'*'PAGE_SIZE
offset'0
gid:'128 gid:'273 gid:'481
EOF
space'amplificaHon(life'of'file)
P1, P3: Match flash block size
16
• WAL and MANIFEST are reused in future instances until replaced - P3: Ensure that WAL and
MANIFEST replace size fills up most of last block
• Sstable sizes follow a heuristic -‐ MemTable::ShouldFlushNow()P1:
- kArenaBlockSize = sizeof(block) - Conservative heuristic in terms of overallocation
• Few lost pages is better than allocating a new block
- Flash block size becomes a “static” DB tuning parameter that is used to optimize “dynamic” ones
➡ Optimize RocksDB bottom up (from storage backend to LSM)
DFlash File
![Page 17: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/17.jpg)
P2, P4, P6: Block Metadata
17
Block&Ending&
Metadata
Out$of$Bound
Block&Star+ng&Metadata
• Blocks can be checked for integrity • New DB instance can append; padding is maintained in OOB (P6) • Closing a block updates bad page & bad block information (P6)
First&Valid&Page
Intermediate&Page
Last&Page
Out&of&Bound
Block&Star:ng&Metadata
Block&Ending&
Metadata
RocksDB(data
RocksDB(data
RocksDB(data
Out&of&Bound
Out&of&Bound
Flash&Block
struct vblock_init_meta { char filename[100]; // RocksDB file GID
uint64_t owner_id; // Application owning the block size_t pos; // relative position in block };
struct vpage_meta { size_t valid_bytes; // Valid bytes from offset 0 uint8_t flags; // State of the page };
struct vblock_close_meta { size_t written_bytes; // Payload size size_t ppa_bitmap; // Updated valid page bitmap
size_t crc; // CRC of the whole block unsigned long next_id; // Next block ID (0 if last) uint8_t flags; // Vblock flags };
![Page 18: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/18.jpg)
P2, P4, P6: Crash Recovery
18
• A DFlash file can be reconstructed from individual blocks (P2, P4) 1. Metadata for the blocks forming a DFlash file is stored in MANIFEST
• The last WAL is not guaranteed to reach the MANIFEST -‐> RECOVERY metadata for DFLASH
2. On recovery, LightNVM provides an application with all its valid blocks
3. Each block stores enough metadata to reconstruct a DFLash file
OPCODEENCODEDMETADATA
OPCODEENCODEDMETADATA
OPCODEENCODEDMETADATA
…
Metadata/Type:3/Log3/Current3/Metadata3/Sstable
!"Private"(Env)
Enough/metadata/to/recover/database/in/a/new/instance
Private"(DFlash):vblocks/forming/the/DFlash/File
1
BLOCK BLOCK BLOCK3
Open%Channel*SSD
BM Block&(ownerID)
Block&(ownerID)
Block&(ownerID)
…
Block&ListRecovery
2
![Page 19: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/19.jpg)
19
• CURRENT is used to store RocksDB “superblock” - Points to current MANIFEST, which is used to reconstruct the DB when
creating a new instance. We append the block metadata that points to the blocks forming the current MANIFEST (P5)
P5: Superblock
OPCODEENCODEDMETADATA
OPCODEENCODEDMETADATA
OPCODEENCODEDMETADATA
…
Metadata/Type:3/Log3/Current3/Metadata3/Sstable
!"Private"(Env)
Enough/metadata/to/recover/database/in/a/new/instance
Private"(DFlash):vblocks/forming/the/DFlash/File
MANIFEST (DFlash)
CURRENT (Posix)
Normal Recovery
![Page 20: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/20.jpg)
P7: Work upstream
20
![Page 21: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/21.jpg)
RocksDB + DFlash: Prototype (1/2)
21
• Optimize RocksDB for Flash storage - Implement a user space append-‐only FS that deals with flash constrains
• Append-‐only: Updates are re-‐written and old data invalidated -‐> LSM understands this logic
• Page cache implemented in user space; use Direct I/O
• Only “sync” complete pages, and prefer closed blocks.
• In case of write failure, write to a new block (or mark bad page and re-‐try)
- Implement RocksDB’s file classes for DFlash: • WritableFile(): Sequential writes -‐> Only way to write to secondary storage
• SequentialFile() -‐> Used primarily for sstable user data and recovery
• RandomAccessFile() -‐> Used primarily for metadata (e.g., CRC checks)
• Use flash block as the central piece for storage optimizations - The Open-‐Channel SSD fabric is configured at first
• Define block size -‐ across luns and channels to exploit parallelism
• Define different types of luns with different block features
- RocksDB is configured with standard parameters (e.g., write buffer, cache) • DFlash backend tunes these parameters based on the type of lun and block
![Page 22: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/22.jpg)
RocksDB + DFlash: Prototype (2/2)
22
• Use LSM merging strategies as perfect garbage collection (GC) - All blocks in a DFlash file are either active or inactive -‐> no need to GC in SSD - Reduce over-‐provisioning significantly (~5%) - Predictable latency -‐> SSD is in stable state from the beginning
• Reuse RocksDB concepts and abstractions as much as possible - Store private metadata in MANIFEST - Store superblock in CURRENT - Minimize the amount of “visible” metadata -‐ use OOB, Root FS, etc.
• Separate persistent (meta)data between “fast” and “static” - Fast data is all user data (i.e., sstables) and the WAL - Fast metadata that follows user data rates (i.e., MANIFEST) - Static metadata is written once and seldom updated
• CURRENT: Superblock for MANIFEST
• LOCK and IDENTITY
![Page 23: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/23.jpg)
Architecture: RocksDB with DFlash
23
DFlash'Storage'Backend
LSM'Logic
Open8Channel'SSD'(Fast'Storage) Posix'FS'(Sta>c'Meta.)
sst
sst
sst sst
sst sst
User%Data DB%Log
WAL
…WALWALWAL
Metadata
Manifest
Metadata
CURRENT
LOCK
IDENTITY LOG'(Info)
Env'DFlashOp2miza2onsEnv%Op2ons(Flash%charac.%on%init)
DFlash'File'ClassesDFWritableFile()DFRandomAccessFile()DFSequen2alFile()
Persist%opera2ons Environment%Tunning
Env'DFlash
Private%metadata
Metadata,%Close,%Crash
![Page 24: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/24.jpg)
Architecture: DFlash + LightNVM
24
Open%Channel*SSDBM
DFlash
KERN
EL*SPA
CEUSER*SPAC
E
Provisioning)interface Block)device
DFlash)File)(blocks)
get_block()put_block()
DFWritableFileDFSequen?alFileDFRandomAccesslFile
Free)Blocks Used)Blocks Bad)Blocks
Sector)Calcula?ons
I/O)Path
Manifestsst WALData)Placementcontrolled)by)LSM
I/O
Env*DFlash Env*Posix
PxWritableFilePxSequen?alFilePxRandomAccesslFile
CURRENT
LOCKIDENTITY
LOG*(Info)
TradiGonal*SSD
File*System
I/O
RocksDB*LSM
Device)Features
Env)Op?onsOp?miza?ons
Init
Code
![Page 25: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/25.jpg)
Architecture: DFlash + LightNVM
25
• LSM is the FTL - DFlash target and RocksDB storage
backend take care of provisioning flash blocks
• Optimized critical I/O path - Sstables, WAL, and MANIFEST are
stored in the Open-‐Channel SSD, where we can provide QoS
• Enable a RocksDB distributed architecture - BM abstracts the storage fabric (e.g.,
NVM) and can potentially provide blocks from different drives -‐> single address space
Open%Channel*SSDBM
DFlash
KERN
EL*SPA
CEUSER*SPAC
E
Provisioning)interface Block)device
DFlash)File)(blocks)
get_block()put_block()
DFWritableFileDFSequen?alFileDFRandomAccesslFile
Free)Blocks Used)Blocks Bad)Blocks
Sector)Calcula?ons
I/O)Path
Manifestsst WALData)Placementcontrolled)by)LSM
I/O
Env*DFlash Env*Posix
PxWritableFilePxSequen?alFilePxRandomAccesslFile
CURRENT
LOCKIDENTITY
LOG*(Info)
TradiGonal*SSD
File*System
I/O
RocksDB*LSM
Device)Features
Env)Op?onsOp?miza?ons
Init
Code
![Page 26: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/26.jpg)
26
RocksDB make release
ENTRY KEYS with 4 threads
DFLASH (1 LUN) POSIX
WRITES 10000 keys 70MB/s 25MB/s
100000 keys 40MB/s 25MB/s
1000000 keys 25MB/s 20MB/s
Page-‐aligned write buffer
- DFlash write page cache + Direct I/O + flash page aligned write buffer is better optimized than best effort techniques and top-‐down optimizations (RocksDB parameters). Write buffer required by RocksDB due to small WAL writes
- If we manually tune buffer sizes with Posix, we obtain similar results. However, it requires lots of experimentation for each configuration
QEMU Evaluation: WritesBare Metal: ~180MB/s
![Page 27: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/27.jpg)
27
RocksDB make release ENTRY KEYS DFLASH
(1 LUN) POSIX
READ 10000 keys 5MB/s 300MB/s
100000 keys 5MB/s 500MB/s
1000000 keys 5MB/s 570MB/s
No page cache support
- Without a DFLASH page cache we need to issue an I/O for each read! • Sequential 20 byte-‐reads in same page would issue different PAGE_SIZE I/Os
QEMU Evaluation: Reads
![Page 28: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/28.jpg)
28
RocksDB make release ENTRY KEYS DFLASH
(1 LUN)
DFLASH (1 LUN)
(+ simple page cache)POSIX
READ 10000 keys 5MB/s 160MB/s 300MB/s
100000 keys 5MB/s 280MB/s 500MB/s
1000000 keys
5MB/s 300MB/s 570MB/s
- Posix + buffered I/O using Linux’s page cache is still better, but we have confirmed our hypothesis.
Simple page cache for reads
➡ User-‐space page cache is a necessary optimization when the generic OS page cache is on the way. Other databases use this technique (e.g., Oracle, MySQL)
QEMU Evaluation: “Fixing” Reads
![Page 29: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/29.jpg)
QEMU Evaluation: Insights
29
• Posix backend and DFlash backend (with 1 lun) should achieve very similar throughput for reads/writes when using same page cache and write buffer optimizations
• But… - DFlash allows to optimize buffer and cache sizes based on Flash
characteristics
- DFlash knows which file class is calling -‐ we can do prefetching for sequential reads (DFSequentialFile) at block granularity
- DFlash designed to implement a Flash optimized page cache using Direct I/O
• If the Open-‐Channel SSD exposes several LUNs, we can exploit parallelism within DFlash and RocksDB’s LSM write/read patterns - How many luns and their characteristics are organized is controller specific
![Page 30: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/30.jpg)
CNEX WestLake SDK: Overview
30
0"
200"
400"
600"
800"
1000"
1200"
1400"
1600"
1" 3" 5" 7" 9" 11" 13" 15" 17" 19" 21" 23" 25" 27" 29" 31" 33" 35" 37" 39" 41" 43" 45" 47" 49" 51" 53" 55" 57" 59" 61" 63"
MB/s"
LUNs"ac5ve"
Read/Write"Performance"(MB/s)"
Write"KB/s"
Read"KB/s"
FPGA Prototype Platform before ASIC: -‐ PCIe G3x4 or PCI G2x8 -‐ 4x10GE NVMoE -‐ 40 bit DDR3 -‐ 16 CH NAND
* Not real performance results -‐ FPGA prototype, not ASIC
![Page 31: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/31.jpg)
CNEX WestLake Evaluation
31
RocksDB make release
ENTRY KEYS with 4 threads
WRITES (1 LUN)
READS (1 LUN)
WRITES (8 LUNS)
READS (8 LUNS)
WRITES (64 LUNS)
READS (64 LUNS)
RocksDB DFLASH 10000 keys 21MB/s 40MB/s X X X X
100000 keys
21MB/s 40MB/s X X X X
1000000 keys
21MB/s 40MB/s X X X X
Raw DFLASH (with fio) 32MB/s 64MB/s 190MB/s 180MB/s 920MB/s 1,3GB/s
• We focus on a single I/O stream for first prototype -‐> 1 LUN
![Page 32: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/32.jpg)
CNEX WestLake Evaluation: Insights
32
• RocksDB checks for sstable integrity on writes (intermittent reads) - We pay the price of not having an optimized page cache also on writes
- Reads and writes are mixed in one single lun
• Ongoing work: Exploit parallelism in RocksDB’s I/O patters
RocksDB
Ac#ve&memtable WAL… SST1 SST2 SST3 … SSTN
Merging&&&Compac#on
w w w w w wr
W R
r r r r
R/W
r
ReadBonly&memtable
r
RO&MT
RO&MT…
R R
w r
W
MANIFEST
W
w r w r
Open,Channel1SSD
Block1Manager
LUN0 LUN1 LUN2 LUN3 LUN4 LUN5 LUN6 LUN7 LUN8 LUN9…
DFlash
get_block()put_block()
Virtual&LUN
Physical&LUN……
…
…
CH0CH1
CHN
Lun0 LunN
WestLake
- Do not mix R/W
- Different VLUN per path
- Different VLUN types
- Enabling I/O scheduling
- Block pool in DFlash (prefetching)
![Page 33: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/33.jpg)
CNEX WestLake Evaluation: Insights
33
• Also, in any Open-‐Channel SSD - DFlash will not get a performance hit when the SSD triggers GC -‐ RocksDB
does GC when merging sstables in LSM > L0 • SSD steady state is improved (and reached from the beginning)
• We achieve predictable latency
IOPS
Time
![Page 34: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/34.jpg)
Status and ongoing work
• Status: - get_block/put_block interface (first iteration) through the DFlash target
- RocksDB DFlash target that plugs into LightNVM. Source code to test in available (RocksDB, Kernel, and QEMU). Working on WestLake upstreaming
• Ongoing: - Implement functions to increase the synergy between the LSM and the
storage backend (i.e., tune write buffer based on block size) -‐> Upstreaming - Support libaio to enable async I/O in DFlash storage backend
• Need to deal with RocksDB design decisions (e.g., Get() assumes sync IO)
- Exploit device parallelism within RockDB internal structures
- Define different types of virtual luns and expose them to the application
- Other optimizations: double buffering, aligned memory in LSM, etc.
34
- Move RocksDB DFlash’s logic to liblightnvm -‐> append-‐only FS for Flash
![Page 35: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/35.jpg)
Conclusions
• Application Driven Storage - Demo working on real hardware: • (RocksDB -‐> LightNVM -‐> WestLake-‐powered SSD)
- QEMU support for testing and development
- More Open-‐Channel SSDs coming soon.
• RocksDB - DFlash dedicated backend -‐> append-‐only FS optimized for Flash
- Set the basis for moving to a distributed architecture while guaranteeing performance constrains (specially in terms of latency)
- Intention to upstream the whole DFlash storage backend
• LightNVM - Framework to support Open-‐Channel SSDs in Linux
- DFlash target to support application FTLs
35
![Page 36: Optimizing RocksDB for Open-Channel SSDs](https://reader036.vdocuments.us/reader036/viewer/2022082212/586f91431a28ab54768b7b95/html5/thumbnails/36.jpg)
36
Towards Application Driven StorageOptimizing RocksDB on Open-‐Channel SSDs with LightNVM
LinuxCon Europe 2015
Questions?
• Open-‐Channel SSD Project: https://github.com/OpenChannelSSD
• LightNVM: https://github.com/OpenChannelSSD/linux
• RocksDB: https://github.com/OpenChannelSSD/rocksdb
Javier González <[email protected]>