Tech Talk: RocksDB slides by Dhruba Borthakur & Haobo Xu of Facebook

DESCRIPTION
This presentation describes why Facebook decided to build yet another key-value store, the vision and architecture of RocksDB, and how it differs from other open-source key-value stores. Dhruba describes some of the salient features in RocksDB needed to support embedded-storage deployments. He explains typical workloads that could be primary use cases for RocksDB. He also lays out the roadmap for making RocksDB the key-value store of choice for highly multi-core processors and RAM-speed storage devices.

TRANSCRIPT
The Story of RocksDB
Dhruba Borthakur & Haobo Xu
Database Engineering @ Facebook
Embedded Key-Value Store for Flash and RAM
Monday, December 9, 13
A Client-Server Architecture with disks
Network roundtrip = 50 microseconds
Disk access = 10 milliseconds
Application Server → Database Server
Locally attached disks
Client-Server Architecture with fast storage
Network roundtrip = 50 microseconds
Application Server → Database Server
Latency dominated by network
SSD: 100 microsecs; RAM: 100 nanosecs
Architecture of an Embedded Database

Network roundtrip = 50 microseconds
Application Server + Database Server (embedded together)
Storage attached directly to application servers
SSD: 100 microsecs; RAM: 100 nanosecs
Any pre-existing embedded databases?

Open Source key-value stores:
1. Berkeley DB
2. SQLite
3. Kyoto TreeDB
4. LevelDB

FB Proprietary key-value stores:
1. High performance
2. No transaction log
3. Fixed-size keys
Comparison of open source databases

Random Reads:
  LevelDB        129,000 ops/sec
  Kyoto TreeDB   151,000 ops/sec
  SQLite3        134,000 ops/sec

Random Writes:
  LevelDB        164,000 ops/sec
  Kyoto TreeDB    88,500 ops/sec
  SQLite3          9,860 ops/sec
HBase and HDFS (in April 2012)

Random Reads:
  HDFS (1 node)   93,000 ops/sec
  HBase (1 node)  35,000 ops/sec

Details of this experiment: http://hadoopblog.blogspot.com/2012/05/hadoop-and-solid-state-drives.html
Log Structured Merge Architecture

• A Write Request from the application goes to the transaction log and to Read-Write data in RAM
• Read-Write data in RAM becomes Read-Only data in RAM / on disk
• Periodic Compaction merges the read-only data
• A Scan Request from the application reads across the RAM and on-disk data
LevelDB has low write rates

Facebook Application 1:
• Write rate of only 2 MB/sec per machine
• Only one CPU was used

We developed multithreaded compaction:
• 10x improvement in write rate
• 100% of CPUs are in use
LevelDB has stalls

Facebook Feed:
• P99 latencies were tens of seconds
• Single-threaded compaction

We implemented thread-aware compaction:
• Dedicated thread(s) to flush the memtable
• Pipelined memtables
• P99 reduced to less than a second
LevelDB has high write amplification

Facebook Application 2:
• Level-style compaction
• Write amplification of 70: very high

[Diagram: two compactions by level-style compaction. Stage 1: Level-0 holds 5 bytes, Level-1 holds 6 bytes, Level-2 holds 10 bytes. Stage 2: compacting Level-0 into Level-1 rewrites 11 bytes. Stage 3: compacting into Level-2 rewrites another 10 bytes.]
Our solution: lower write amplification

Facebook Application 2:
• We implemented Universal Style Compaction
• Start from the newest file; include the next file in the candidate set if
  candidate set size >= size of next file

[Diagram: a single compaction by universal-style compaction covers Level-0 (5 bytes), Level-1 (6 bytes) and Level-2 (10 bytes) in one pass.]

Write amplification reduced to < 10
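The candidate-selection rule above can be sketched as follows. This is an illustrative simplification of the rule as stated on the slide, not RocksDB's actual universal-compaction implementation (which adds configurable size-ratio and trigger parameters); the function name and sizes are made up for the example.

```python
# Simplified universal-compaction candidate selection: starting from the
# newest file, keep absorbing the next (older) file while the accumulated
# candidate-set size is at least that next file's size.

def pick_candidates(file_sizes):
    """file_sizes: sizes of sorted runs, ordered newest first."""
    if not file_sizes:
        return []
    candidates = [file_sizes[0]]
    total = file_sizes[0]
    for size in file_sizes[1:]:
        if total >= size:        # candidate set is big enough to absorb the next file
            candidates.append(size)
            total += size
        else:                    # next file is larger: stop and compact what we have
            break
    return candidates

print(pick_candidates([10, 4, 3]))   # all three runs merge in one pass
print(pick_candidates([1, 100]))     # the large old file is left untouched
```

Because one pass merges many runs at once, data is rewritten far fewer times than in level-style compaction, which is where the write-amplification reduction comes from.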
LevelDB has high read amplification

Secondary Index Service:
• LevelDB does not use blooms for scans

We implemented prefix scans:
• Range scans within the same key prefix
• Blooms created for the prefix
• Reduces read amplification
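The prefix-scan idea can be illustrated with a small sketch (not RocksDB's implementation): build a filter over key prefixes per file, so a range scan within one prefix can skip any file whose filter says the prefix is absent. A Python set stands in for a real bloom filter, and the class names and fixed prefix length are assumptions for the example.

```python
PREFIX_LEN = 4  # assumed fixed prefix length for this sketch

class SstFile:
    def __init__(self, pairs):
        self.data = dict(pairs)
        # one "bloom" entry per distinct key prefix present in this file
        self.prefix_filter = {k[:PREFIX_LEN] for k in self.data}

    def may_contain_prefix(self, prefix):
        return prefix in self.prefix_filter

def prefix_scan(files, prefix):
    """Collect all keys with the given prefix, skipping files via the filter."""
    out = {}
    for f in files:
        if not f.may_contain_prefix(prefix):
            continue                      # filter says "absent": no IO for this file
        for k, v in f.data.items():
            if k.startswith(prefix):
                out.setdefault(k, v)      # newer files shadow older ones
    return out

files = [SstFile([("user1:a", 1)]), SstFile([("user2:b", 2)])]
print(prefix_scan(files, "user"))   # {'user1:a': 1, 'user2:b': 2}
```

The read-amplification win is that files with no matching prefix cost zero reads, instead of one index lookup each.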
LevelDB: read-modify-write = 2X IOs

Counter increments:
• Get value, value++, Put value
• LevelDB uses 2X IOPS

We implemented MergeRecord:
• Put the "++" operation in a MergeRecord
• Background compaction merges all MergeRecords
• Uses only 1X IOPS
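A conceptual sketch of the MergeRecord idea for counters (the class and method names here are illustrative, not the RocksDB API): increments are appended as merge records without reading the old value, and reads or background compaction fold the pending deltas into the base value.

```python
class CounterStore:
    def __init__(self):
        self.log = {}  # key -> list of ("put", value) / ("merge", delta) records

    def put(self, key, value):
        self.log[key] = [("put", value)]

    def merge(self, key, delta):
        # blind write: no read needed, so only 1X IO per increment
        self.log.setdefault(key, []).append(("merge", delta))

    def get(self, key):
        value = 0
        for op, v in self.log.get(key, []):
            value = v if op == "put" else value + v
        return value

    def compact(self, key):
        # background compaction collapses all records into one base value
        self.log[key] = [("put", self.get(key))]

db = CounterStore()
db.put("hits", 10)
db.merge("hits", 1)
db.merge("hits", 1)
db.compact("hits")
print(db.get("hits"))   # 12
```

The key property: incrementing never performs a Get, so the 2X read-modify-write IO cost becomes 1X.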
LevelDB has a rigid design

LevelDB design:
• Cannot tune the system; fixed file sizes

We wanted a pluggable architecture:
• Pluggable compaction filter, e.g. TimeToLive
• Pluggable memtable/sstable for RAM/Flash
• Pluggable compaction algorithm
The changes we made to LevelDB

Inherited from LevelDB:
• Log Structured Merge DB
• Gets/Puts/Scans of keys
• Forward and reverse iteration

RocksDB:
• 10X higher write rate
• Fewer stalls
• 7X lower write amplification
• Blooms for range scans
• Ability to avoid read-modify-write
• Optimizations for flash or RAM
• And many more…
RocksDB is born!

• Key-value persistent store
• Embedded
• Optimized for fast storage
• Server workloads
What is it not?

• Not distributed
• No failover
• Not highly available: if the machine dies, you lose your data
RocksDB API

▪ Keys and values are arbitrary byte arrays.
▪ Data is stored sorted by key.
▪ The basic operations are Put(key, value), Get(key), Delete(key) and Merge(key, delta).
▪ Forward and backward iteration is supported over the data.
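To make the API surface above concrete, here is a minimal in-memory sketch (not RocksDB itself): byte-array keys kept in sorted order, with Put/Get/Delete and forward/backward iteration. The class name is invented for the example.

```python
import bisect

class TinyKV:
    def __init__(self):
        self.keys = []      # keys kept sorted, mirroring "stored sorted by key"
        self.values = {}

    def put(self, key, value):
        if key not in self.values:
            bisect.insort(self.keys, key)
        self.values[key] = value

    def get(self, key):
        return self.values.get(key)

    def delete(self, key):
        if key in self.values:
            del self.values[key]
            self.keys.remove(key)

    def iterate(self, reverse=False):
        ks = reversed(self.keys) if reverse else self.keys
        for k in ks:
            yield k, self.values[k]

db = TinyKV()
db.put(b"b", b"2"); db.put(b"a", b"1"); db.put(b"c", b"3")
db.delete(b"b")
print(list(db.iterate()))                # [(b'a', b'1'), (b'c', b'3')]
print(list(db.iterate(reverse=True)))    # [(b'c', b'3'), (b'a', b'1')]
```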
RocksDB Architecture

[Diagram: a Write Request appends to a log and goes into the Active MemTable; a Switch turns the Active MemTable into a Read-Only MemTable; Flush writes read-only memtables into sst files, and Compaction merges sst files within the LSM. Memtables and logs live in memory; sst files live on persistent storage. A Read Request consults both.]
Log Structured Merge Tree -- Writes

▪ New Puts are written to memory and optionally to the transaction log
▪ A log-write sync option can also be specified for each individual write
▪ We say RocksDB is optimized for writes; what does this mean?
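A hedged sketch of the write path just described (illustrative, not the RocksDB implementation): each Put appends to a write-ahead log and then updates the in-memory table, and a per-write `sync` flag controls whether the log is forced to stable storage before the call returns. Class and file names are assumptions for the example.

```python
import os, tempfile

class WriteAheadKV:
    def __init__(self, log_path):
        self.log = open(log_path, "ab")
        self.memtable = {}

    def put(self, key, value, sync=False):
        # 1. append the record to the transaction log
        self.log.write(f"{key}={value}\n".encode())
        if sync:
            self.log.flush()
            os.fsync(self.log.fileno())  # durable even if the machine dies now
        # 2. apply it to the in-memory table
        self.memtable[key] = value

path = os.path.join(tempfile.mkdtemp(), "wal.log")
db = WriteAheadKV(path)
db.put("k1", "v1")             # fast: the log write may sit in OS buffers
db.put("k2", "v2", sync=True)  # durable: fsync before returning
print(db.memtable["k2"])       # v2
```

"Optimized for writes" shows up here as: every write is a sequential log append plus a memory update, with durability cost paid only when the caller asks for it.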
RocksDB Write Path

[Diagram: the Write Request enters the Active MemTable and its log; Switch, Flush and Compaction then move the data down into the LSM's sst files.]
Log Structured Merge Tree -- Reads
▪ Data could be in memory or on disk
▪ Consult multiple files to find the latest instance of the key
▪ Use bloom filters to reduce IO
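The read path above can be illustrated with a small sketch (not the RocksDB code): check the memtables first, then sst files from newest to oldest, using a per-file "bloom" (a plain set here) to skip files that cannot contain the key. Names are invented for the example.

```python
class Sst:
    def __init__(self, pairs):
        self.data = dict(pairs)
        self.bloom = set(self.data)   # stand-in for a real bloom filter

def lsm_get(memtables, ssts, key):
    """memtables and ssts are ordered newest first."""
    for mt in memtables:
        if key in mt:
            return mt[key]            # newest value wins
    for sst in ssts:
        if key not in sst.bloom:
            continue                  # bloom says "definitely absent": no IO
        if key in sst.data:
            return sst.data[key]
    return None

memtables = [{"a": "new"}]
ssts = [Sst([("a", "old"), ("b", "1")]), Sst([("c", "2")])]
print(lsm_get(memtables, ssts, "a"))  # new  (memtable shadows the sst copy)
print(lsm_get(memtables, ssts, "c"))  # 2
print(lsm_get(memtables, ssts, "x"))  # None
```

Ordering newest-to-oldest is what guarantees the latest instance of a key is the one returned, and the bloom check is the IO-reduction step named on the slide.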
RocksDB Read Path

[Diagram: a Read Request checks the Active MemTable and the Read-Only MemTable in memory, then the sst files on persistent storage; blooms on the sst files cut unnecessary IO.]
RocksDB: Open & Pluggable

• Pluggable memtable format in RAM
• Customizable WAL (transaction log)
• Pluggable sst data format on storage
• Pluggable compaction
• Blooms
• Serves Write Requests and Get/Scan Requests from the application
Example: Customizable WALogging

• Our in-house replication solution wants to embed arbitrary blobs in the RocksDB WAL stream for log annotation
• Use case: indicate where a log record came from in multi-master replication
• Solution: a Put that only speaks to the log
Example: Customizable WALogging

[Diagram: in one write batch, the replication layer issues PutLogData("I came from Mars") and Put(k1, v1). The log receives both k1/v1 and "I came from Mars", while only k1/v1 reaches the Active MemTable.]
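A simplified sketch of the PutLogData behavior shown above (not the RocksDB API; the class and function names are made up): a write batch can carry log-only blobs that are appended to the WAL for downstream consumers such as replication, but never enter the memtable.

```python
class WriteBatch:
    def __init__(self):
        self.records = []   # ("put", key, value) or ("logdata", blob)

    def put(self, key, value):
        self.records.append(("put", key, value))

    def put_log_data(self, blob):
        self.records.append(("logdata", blob))

def apply_batch(batch, wal, memtable):
    for rec in batch.records:
        wal.append(rec)                 # everything in the batch reaches the log...
        if rec[0] == "put":
            memtable[rec[1]] = rec[2]   # ...but only Puts reach the memtable

wal, memtable = [], {}
batch = WriteBatch()
batch.put_log_data("I came from Mars")
batch.put("k1", "v1")
apply_batch(batch, wal, memtable)
print(len(wal), memtable)   # 2 {'k1': 'v1'}
```

Putting both records in one batch keeps the annotation atomically adjacent to the data it describes in the log stream.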
Example: Pluggable SST format

• One Facebook use case needs extremely fast responses but can tolerate some loss of durability
• Quick hack: mount the sst in tmpfs
• Still not performant: the existing sst format is block based
• Solution: a much simpler format that just stores sorted key/value pairs sequentially
  • No blocks, no caching; mmap the whole file
  • Build an efficient lookup index on load
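The "plain" sst idea can be sketched as follows: sorted key/value pairs stored sequentially, with a lookup index built when the file is loaded (here a sorted key list searched by bisection, standing in for the mmap'd file and its index). This illustrates the format's shape under those assumptions, not RocksDB's actual plain-table implementation.

```python
import bisect

class PlainSst:
    def __init__(self, sorted_pairs):
        # pairs must already be sorted by key, as the flush wrote them
        self.index = [k for k, _ in sorted_pairs]   # lookup index built on load
        self.values = [v for _, v in sorted_pairs]

    def get(self, key):
        i = bisect.bisect_left(self.index, key)     # O(log n), no block decode
        if i < len(self.index) and self.index[i] == key:
            return self.values[i]
        return None

sst = PlainSst([("a", 1), ("c", 3), ("e", 5)])
print(sst.get("c"))   # 3
print(sst.get("b"))   # None
```

With the whole file memory-resident, skipping block framing and block-cache bookkeeping is what makes each lookup a straight binary search.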
Example: Blooms for MemTable

• Same use case: after we optimized sst access, memtable lookup became a major cost in queries
• Problem: a Get has to go through memtable lookups that eventually return no data
• Solution: just add a bloom filter to the memtable!
RocksDB Read Path

[Diagram: the same read path, with blooms now on the Active MemTable and the Read-Only MemTable (in memory) as well as on the sst files (persistent storage), so a Get can skip memtables that cannot contain the key.]
Example: Pluggable memtable format

• Another Facebook use case has a distinct load phase during which no queries are issued
• Problem: write throughput is limited by the single writer thread
• Solution: a new memtable representation that does not keep keys sorted
Example: Pluggable memtable format

[Diagram: Write Requests append to an Unsorted MemTable; Switch turns it into a Read-Only MemTable, which is sorted as it is flushed ("Sort, Flush") into the LSM's sst files; Compaction proceeds as usual. Read Requests still consult memory and persistent storage.]
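An illustrative sketch of the unsorted-memtable idea above (names invented for the example): during the load phase, writes append in arrival order with no sorted-insert work on the writer thread; sorting happens once, at flush time, when the sst is produced.

```python
class UnsortedMemtable:
    def __init__(self):
        self.entries = []          # records kept in arrival order

    def put(self, key, value):
        self.entries.append((key, value))   # O(1) append, no ordered insert

    def sort_and_flush(self):
        # last write for each key wins, then emit sorted pairs (the sst contents)
        latest = {}
        for k, v in self.entries:
            latest[k] = v
        self.entries = []
        return sorted(latest.items())

mt = UnsortedMemtable()
mt.put("c", 3); mt.put("a", 1); mt.put("c", 4)
flushed = mt.sort_and_flush()
print(flushed)   # [('a', 1), ('c', 4)]
```

The trade-off is deliberate: point lookups against an unsorted memtable would be slow, which is acceptable here precisely because no queries run during the load phase.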
Possible workloads for RocksDB?
▪ Serving data to users via a website
▪ A spam detection backend that needs fast access to data
▪ A graph search query that needs to scan a dataset in realtime
▪ Distributed Configuration Management Systems
▪ Fast serve of Hive Data
▪ A Queue that needs a high rate of inserts and deletes
Futures

• Scale linearly with number of CPUs
  • 32-, 64- or higher-core machines
  • ARM processors
• Scale linearly with storage IOPS
  • Striped flash cards
  • RAM & NVRAM storage
Come Hack with us

• RocksDB is open sourced
  • http://rocksdb.org
  • Developers group: https://www.facebook.com/groups/rocksdb.dev/
• Help us HACK RocksDB