RocksDB storage engine for MySQL and MongoDB
TRANSCRIPT
![Page 1: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/1.jpg)
RocksDB Storage Engine
Igor Canadi | Facebook
![Page 2: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/2.jpg)
Overview
• Story of RocksDB
• Architecture
• Performance tuning
• Next steps
![Page 3: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/3.jpg)
Story of RocksDB
![Page 4: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/4.jpg)
Pre-2011
• FB infrastructure – many custom-built key-value stores
• LevelDB released
![Page 5: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/5.jpg)
Experimentation (2011 – 2013)
• First use-cases
• Not designed for server – many bottlenecks, stalls
• Optimization
• New features
![Page 6: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/6.jpg)
Explosion (2013 – 2015)
• Open sourced RocksDB
• Big success within Facebook
• External traction – LinkedIn, Yahoo, CockroachDB, …
![Page 7: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/7.jpg)
New Challenges (2015 – )
• Bring RocksDB to databases
![Page 8: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/8.jpg)
MongoRocks
• Running in production at Parse for 6 months
• Huge storage savings (5TB → 285GB)
• Document-level locking
![Page 9: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/9.jpg)
MyRocks
[Figure: two bar charts comparing InnoDB and RocksDB. Left: database size (relative). Right: bytes written (relative). Both y-axes run from 0 to 1.2.]
![Page 10: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/10.jpg)
Architecture
Log Structured Merge Trees
![Page 11: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/11.jpg)
Log Structured Merge Trees
Memtable (64MB)
Level 0 (256MB)
Level 1 (512MB)
Level 2 (5GB)
Level 3 (50GB)
Level 4 (500GB)
![Page 12: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/12.jpg)
Log Structured Merge Trees – write
(key,value) → Memtable (64MB)
Level 0 (256MB)
![Page 13: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/13.jpg)
Log Structured Merge Trees – flush
Memtable (64MB) → Level 0 (256MB)
![Page 14: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/14.jpg)
Log Structured Merge Trees – compaction
Level 2 (5GB) → Level 3 (50GB)
![Page 15: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/15.jpg)
Writes
• Foreground:
  • Writes go to memtable (skiplist) + write-ahead log
• Background:
  • When memtable is full, we flush to Level 0
  • When a level is full, we run compaction
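The foreground and background steps above can be sketched as a toy model. This is a minimal illustration, not RocksDB's implementation: the `ToyLSM` class, its entry-count memtable limit, the JSON log format, and the `toy.log` file name are all assumptions made for the example, and the flush runs inline rather than on a background thread.

```python
import json
import os
import tempfile


class ToyLSM:
    """Toy write path: append to a log, update the memtable, flush when full."""

    def __init__(self, directory, memtable_limit=4):
        self.memtable_limit = memtable_limit  # counted in entries, not bytes
        self.memtable = {}                    # a real engine uses a skiplist
        self.level0 = []                      # flushed runs, each sorted by key
        self.wal_path = os.path.join(directory, "toy.log")
        self.wal = open(self.wal_path, "a")

    def put(self, key, value):
        # Foreground: write to the log first, then to the memtable.
        self.wal.write(json.dumps([key, value]) + "\n")
        self.wal.flush()
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Background work in a real engine; done inline here for simplicity.
        self.level0.append(sorted(self.memtable.items()))
        self.memtable = {}
        # The flushed entries now live in Level 0, so the log can start over.
        self.wal.close()
        self.wal = open(self.wal_path, "w")

    def close(self):
        self.wal.close()


db = ToyLSM(tempfile.mkdtemp(), memtable_limit=2)
db.put("a", "1")   # stays in the memtable
db.put("b", "2")   # memtable is full: flushed to Level 0 as one sorted run
db.close()
```

Compaction is left out on purpose; it would merge sorted runs from one level into the next once a level exceeds its size target.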
![Page 16: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/16.jpg)
Reads
Memtable (64MB)
Level 0 (256MB)
Level 1 (512MB)
Level 2 (5GB)
Level 3 (50GB)
Level 4 (500GB)
![Page 17: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/17.jpg)
Reads
• Point queries
  • Bloom filters reduce reads from storage
  • Usually only 1 read IO
• Range scans
  • Bloom filters don’t help
  • Depends on amount of memory, 1-2 IO
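To illustrate why Bloom filters save read IO on point queries, here is a small sketch. The `BloomFilter` and `TableFile` classes, the SHA-256 based hashing, and the filter size are assumptions chosen for readability; RocksDB's real filter format differs.

```python
import hashlib


class BloomFilter:
    """Simplified Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def may_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class TableFile:
    """Stand-in for one table file with its filter; counts simulated reads."""

    def __init__(self, entries):
        self.entries = dict(entries)
        self.bloom = BloomFilter()
        for key in self.entries:
            self.bloom.add(key)
        self.reads = 0

    def get(self, key):
        if not self.bloom.may_contain(key):
            return None       # definitely absent: no storage read needed
        self.reads += 1       # possibly present: pay for one read
        return self.entries.get(key)


def point_lookup(files, key):
    """Search files from newest to oldest and return the first value found."""
    for table in files:
        value = table.get(key)
        if value is not None:
            return value
    return None


files = [TableFile({"a": "1"}), TableFile({"b": "2"}), TableFile({"c": "3"})]
value = point_lookup(files, "c")
```

A range scan cannot use this shortcut, because the filter only answers whether one specific key may be present.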
![Page 18: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/18.jpg)
RocksDB Files
rocksdb/> ls
MANIFEST-000032
000024.log
000031.log
000025.sst
000028.sst
000029.sst
000033.sst
000034.sst
LOG
LOG.old.1441234029851978
...
![Page 19: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/19.jpg)
RocksDB Files – MANIFEST
(initial state) Add file 1, Add file 2, Add file 3, Add file 4, …
(flush) Add file 9, Mark log 6 persisted
(compaction) Add file 10, Add file 11, Remove file 9, Remove file 8
Add new column family “system”
• Atomic updates to database metadata
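One way to picture the MANIFEST is as a list of edits that is replayed in order to recover the current metadata. The sketch below uses the edits from the slide; the dictionary layout and the `replay_manifest` helper are hypothetical, not the on-disk format. File 8 is assumed to belong to the part of the initial state that the slide elides, so removing it is a no-op here.

```python
def replay_manifest(edits):
    """Apply metadata edits in order; each edit is applied as one unit."""
    live_files = set()
    persisted_logs = set()
    column_families = {"default"}  # RocksDB always has a "default" column family
    for edit in edits:
        live_files.update(edit.get("add_files", ()))
        live_files.difference_update(edit.get("remove_files", ()))
        persisted_logs.update(edit.get("persisted_logs", ()))
        column_families.update(edit.get("add_column_families", ()))
    return live_files, persisted_logs, column_families


edits = [
    {"add_files": [1, 2, 3, 4]},                       # initial state (truncated)
    {"add_files": [9], "persisted_logs": [6]},         # flush
    {"add_files": [10, 11], "remove_files": [9, 8]},   # compaction
    {"add_column_families": ["system"]},
]
live_files, persisted_logs, column_families = replay_manifest(edits)
```

Because a compaction's additions and removals sit in the same edit, a reader of the MANIFEST sees either the old set of files or the new one, never a mix.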
![Page 20: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/20.jpg)
RocksDB Files – Write-ahead log
Write(A, B)
Write(C, D)
Write(E, F)
Delete(A)
Write(X, Y)
Delete(C)
• Persisted memtable state
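Replaying the log rebuilds the memtable after a restart. The following sketch replays the six records from the slide; the tuple encoding of records and the `TOMBSTONE` marker are assumptions for illustration.

```python
TOMBSTONE = object()  # marker for a deleted key (illustrative, not RocksDB's encoding)


def replay_wal(records):
    """Rebuild memtable contents by applying log records in order."""
    memtable = {}
    for record in records:
        op, key = record[0], record[1]
        if op == "write":
            memtable[key] = record[2]
        elif op == "delete":
            # A delete is recorded as a tombstone rather than removing the key,
            # so that older values in lower levels stay hidden.
            memtable[key] = TOMBSTONE
        else:
            raise ValueError(f"unknown log record type: {op!r}")
    return memtable


records = [
    ("write", "A", "B"),
    ("write", "C", "D"),
    ("write", "E", "F"),
    ("delete", "A"),
    ("write", "X", "Y"),
    ("delete", "C"),
]
memtable = replay_wal(records)
```

After replay, `E` and `X` hold live values, while `A` and `C` are tombstones.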
![Page 21: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/21.jpg)
RocksDB Files – Table files
(Data block) <key, value>
  • compressed
  • prefix encoded
(Data block)
(Data block)
(Data block)
(Data block)
(Data block)
(Data block)
(Data block)
(Index block) <key, block>
(Filter block)
(Statistics)
(Meta index block) Pointers to blocks
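Prefix encoding works well on sorted keys because neighbours tend to share a long prefix. Below is a simplified sketch of the idea; the `(shared_length, suffix)` pair format is an assumption, and the real block format adds details such as restart points that are not shown.

```python
import os


def prefix_encode(sorted_keys):
    """Store each key as (bytes shared with the previous key, remaining suffix)."""
    encoded = []
    previous = ""
    for key in sorted_keys:
        shared = len(os.path.commonprefix([previous, key]))
        encoded.append((shared, key[shared:]))
        previous = key
    return encoded


def prefix_decode(encoded):
    """Rebuild the full keys from (shared_length, suffix) pairs."""
    keys = []
    previous = ""
    for shared, suffix in encoded:
        key = previous[:shared] + suffix
        keys.append(key)
        previous = key
    return keys


keys = ["user:1000", "user:1001", "user:1002", "video:7"]
encoded = prefix_encode(keys)
```

In this example the second and third keys are stored as a single character each, since they share `user:100` with their predecessor.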
![Page 22: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/22.jpg)
RocksDB Files – LOG files
• Debugging output
• Tuning options
• Information about flushes and compactions
• Performance statistics
![Page 23: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/23.jpg)
Backups
• Table files are immutable
• Other files are append-only
• Easy and fast incremental backups
• Open sourced Rocks-Strata
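Because table files never change once written, an incremental backup only has to copy table files it has not seen before. The sketch below plans such a backup using the file names from the earlier listing; `plan_incremental_backup` is a hypothetical helper and does not describe how Rocks-Strata works.

```python
def plan_incremental_backup(db_files, already_backed_up):
    """Return (table files to copy once, other files to copy again)."""
    table_files = {name for name in db_files if name.endswith(".sst")}
    # Immutable table files: copy only the ones missing from the backup.
    new_table_files = sorted(table_files - set(already_backed_up))
    # Append-only files (MANIFEST, logs) may have grown, so refresh them.
    files_to_refresh = sorted(set(db_files) - table_files)
    return new_table_files, files_to_refresh


db_files = [
    "MANIFEST-000032", "000024.log", "000031.log",
    "000025.sst", "000028.sst", "000029.sst", "000033.sst", "000034.sst",
]
already_backed_up = {"000025.sst", "000028.sst", "000029.sst"}
new_table_files, files_to_refresh = plan_incremental_backup(db_files, already_backed_up)
```

A fuller version could copy only the appended tail of the MANIFEST and log files instead of the whole file.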
![Page 24: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/24.jpg)
Performance tuning
![Page 25: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/25.jpg)
Tombstones
• Deletions are deferred
• May cause higher P99 latencies
• Be careful with pathological workloads, e.g. queues
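A queue is a hard case because consumed items at the head are deleted, and every read of the head must step over their tombstones until compaction removes them. A toy sketch, assuming entries are `(key, value)` pairs in key order and `TOMBSTONE` marks a deleted key:

```python
TOMBSTONE = None  # illustrative marker for a deleted entry


def first_live_entry(sorted_entries):
    """Scan in key order and count the tombstones skipped before a live entry."""
    skipped = 0
    for key, value in sorted_entries:
        if value is TOMBSTONE:
            skipped += 1
            continue
        return key, value, skipped
    return None, None, skipped


# 1000 consumed (deleted) jobs followed by one pending job.
entries = [(f"job:{i:06d}", TOMBSTONE) for i in range(1000)]
entries.append(("job:001000", "payload"))
result = first_live_entry(entries)
```

The scan touches 1000 dead entries to return a single live one, which is the kind of work that shows up as higher tail latency.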
![Page 26: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/26.jpg)
Caching
Block cache
• Managed by RocksDB
• Uncompressed data
• Defaults to 1/3 of RAM
Page cache
• Managed by kernel
• Compressed data
![Page 27: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/27.jpg)
Memory usage
• Block cache
• Index and filter blocks (0.5 – 2% of the database)
• Memtables
• Blocks pinned by iterators
![Page 28: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/28.jpg)
Reduce memory usage
• Reduce block cache size – will increase CPU
• Increase block size – decrease index size
• Turn off bloom filters on bottom level
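The effect of block size on index size follows from the index holding one entry per data block. A rough estimate, assuming blocks are filled to the configured size; the 64 MiB file size and the 4 KiB and 16 KiB block sizes are illustrative values, not recommendations:

```python
import math


def index_entries(file_size_bytes, block_size_bytes):
    """One index entry per data block, assuming blocks are filled to block_size."""
    return math.ceil(file_size_bytes / block_size_bytes)


file_size = 64 * 2**20                                  # hypothetical 64 MiB table file
entries_small_blocks = index_entries(file_size, 4 * 2**10)    # 4 KiB blocks
entries_large_blocks = index_entries(file_size, 16 * 2**10)   # 16 KiB blocks
```

Quadrupling the block size cuts the number of index entries to a quarter, at the cost of reading larger blocks for each lookup.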
![Page 29: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/29.jpg)
Reduce CPU
• Profile the CPU usage
• Increase block cache size – will increase memory usage
• Turn off compression
• It might be tombstones
![Page 30: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/30.jpg)
Reduce write amplification
• Write amplification = 5 * num_levels
• Increase memtable and level 1 size
• Stronger (zlib, zstd) compression for bottom levels
• Try universal compaction
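The rule of thumb from the slide can be turned into a quick estimate. The helper names below are hypothetical, and the factor of 5 per level is the slide's approximation, not a measured constant.

```python
def estimated_write_amplification(num_levels, per_level_factor=5):
    """Slide's rule of thumb: write amplification = 5 * num_levels."""
    return per_level_factor * num_levels


def bytes_written_to_storage(ingested_bytes, num_levels):
    """Estimated bytes physically written for a given volume of user writes."""
    return ingested_bytes * estimated_write_amplification(num_levels)


four_levels = estimated_write_amplification(4)
three_levels = estimated_write_amplification(3)
```

Under this estimate, a larger memtable and level 1 help when they let the same data fit in fewer levels: going from four levels to three lowers the factor from 20 to 15.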
![Page 31: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/31.jpg)
Next steps
![Page 32: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/32.jpg)
Next steps
• Increase performance & stability
• Deploy MyRocks at Facebook
• External adoption of MyRocks and MongoRocks
• Build an ecosystem
![Page 33: RocksDB storage engine for MySQL and MongoDB](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f91431a28ab54768b7b91/html5/thumbnails/33.jpg)
Thank you