tech talk: rocksdb slides by dhruba borthakur & haobo xu of facebook

101
The Story of RocksDB Dhruba Borthakur & Haobo Xu Database Engineering@Facebook Embedded Key-Value Store for Flash and RAM Monday, December 9, 13

Upload: the-hive

Post on 27-Jan-2015

159 views

Category:

Technology


14 download

DESCRIPTION

This presentation describes the reasons why Facebook decided to build yet another key-value store, the vision and architecture of RocksDB and how it differs from other open source key-value stores. Dhruba describes some of the salient features in RocksDB that are needed for supporting embedded-storage deployments. He explains typical workloads that could be the primary use-cases for RocksDB. He also lays out the roadmap to make RocksDB the key-value store of choice for highly-multi-core processors and RAM-speed storage devices.

TRANSCRIPT

Page 1: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

The Story of RocksDB

Dhruba Borthakur & Haobo XuDatabase Engineering@Facebook

Embedded Key-Value Store for Flash and RAM

Monday, December 9, 13

Page 2: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Monday, December 9, 13

Page 3: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Monday, December 9, 13

Page 4: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

A Client-Server Architecture with disks

Network roundtrip = 50 micro sec

Disk access = 10 milli seconds

Application ServerDatabase

Server

Locally attached Disks

Monday, December 9, 13

Page 5: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Client-Server Architecture with fast storage

Network roundtrip = 50 micro secApplication Server

Database Server

Latency dominated by network

SSD RAM

100 microsecs

100 nanosecs

Monday, December 9, 13

Page 6: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Architecture of an Embedded Database

Network roundtrip = 50 micro sec

Application Server

Database Server

Monday, December 9, 13

Page 7: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Architecture of an Embedded Database

Network roundtrip = 50 micro sec

Application Server

Database Server

Monday, December 9, 13

Page 8: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Architecture of an Embedded Database

Network roundtrip = 50 micro sec

Application Server

Database Server

SSD RAM

100 microsecs

100 nanosecs

Monday, December 9, 13

Page 9: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Architecture of an Embedded Database

Network roundtrip = 50 micro sec

Application Server

Database Server

Storage attached directly to application servers

SSD RAM

100 microsecs

100 nanosecs

Monday, December 9, 13

Page 10: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Any pre-existing embedded databases?

Open Source

FB Proprietary

Monday, December 9, 13

Page 11: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Any pre-existing embedded databases?

1. Berkeley DB2. SQLite3. Kyoto TreeDB4. LevelDB

Open Source

FB Proprietary

Key-value stores

Monday, December 9, 13

Page 12: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Any pre-existing embedded databases?

1. High Performant2. No transaction log3. Fixed size keys

Open Source

FB Proprietary

Key-value stores

Monday, December 9, 13

Page 13: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Comparison of open source databases

Monday, December 9, 13

Page 14: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Comparison of open source databases

Random Reads

Monday, December 9, 13

Page 15: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Comparison of open source databases

Random Reads

Random Writes

Monday, December 9, 13

Page 16: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Comparison of open source databases

Random Reads

LevelDBKyoto TreeDBSQLite3

129,000 ops/sec151,000 ops/sec134,000 ops/sec

Random Writes

Monday, December 9, 13

Page 17: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Comparison of open source databases

Random Reads

LevelDBKyoto TreeDBSQLite3

129,000 ops/sec151,000 ops/sec134,000 ops/sec

LevelDBKyoto TreeDBSQLite3

164,000 ops/sec88,500 ops/sec

9,860 ops/sec

Random Writes

Monday, December 9, 13

Page 18: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

HBase and HDFS (in April 2012)

Details of this experiment: http://hadoopblog.blogspot.com/2012/05/hadoop-and-solid-state-drives.html

Monday, December 9, 13

Page 19: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

HBase and HDFS (in April 2012)

Random Reads

Details of this experiment: http://hadoopblog.blogspot.com/2012/05/hadoop-and-solid-state-drives.html

Monday, December 9, 13

Page 20: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

HBase and HDFS (in April 2012)

Random Reads

HDFS (1 node)HBase (1 node)

93,000 ops/sec35,000 ops/sec

Details of this experiment: http://hadoopblog.blogspot.com/2012/05/hadoop-and-solid-state-drives.html

Monday, December 9, 13

Page 21: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Log Structured Merge Architecture

Read Write data in RAM

Monday, December 9, 13

Page 22: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Log Structured Merge Architecture

Read Write data in RAM

Write Request from Application

Monday, December 9, 13

Page 23: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Log Structured Merge Architecture

Read Write data in RAM

Write Request from Application

Monday, December 9, 13

Page 24: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Log Structured Merge Architecture

Read Write data in RAM

Write Request from Application

Monday, December 9, 13

Page 25: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Log Structured Merge Architecture

Read Write data in RAM

Write Request from Application

Transaction log

Monday, December 9, 13

Page 26: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Log Structured Merge Architecture

Read Write data in RAM

Write Request from Application

Transaction log

Monday, December 9, 13

Page 27: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Log Structured Merge Architecture

Read Write data in RAM

Write Request from Application

Transaction log

Monday, December 9, 13

Page 28: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Log Structured Merge Architecture

Read Write data in RAM

Write Request from Application

Transaction log Read Only data in RAM on

disk

Monday, December 9, 13

Page 29: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Log Structured Merge Architecture

Read Write data in RAM

Write Request from Application

Transaction log Read Only data in RAM on

disk

Periodic Compaction

Monday, December 9, 13

Page 30: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Log Structured Merge Architecture

Read Write data in RAM

Write Request from ApplicationScan Request from Application

Transaction log Read Only data in RAM on

disk

Periodic Compaction

Monday, December 9, 13

Page 31: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Leveldb has low write ratesFacebook Application 1:

• Write rate 2 MB/sec only per machine

• Only one cpu was used

Monday, December 9, 13

Page 32: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Leveldb has low write ratesFacebook Application 1:

• Write rate 2 MB/sec only per machine

• Only one cpu was used

We developed multithreaded compaction

10ximprovement on

write rate

100%of cpus are

in use+Monday, December 9, 13

Page 33: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Leveldb has stallsFacebook Feed:

• P99 latencies were tens of seconds

• Single-threaded compaction

Monday, December 9, 13

Page 34: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Leveldb has stallsFacebook Feed:

• P99 latencies were tens of seconds

• Single-threaded compaction

We implemented thread aware compaction

Dedicated thread(s) to flush memtable Pipelined memtables

P99 reduced to less than a second

Monday, December 9, 13

Page 35: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Leveldb has high write amplification•Facebook Application 2:

• Level Style Compaction

• Write amplification of 70 very high

Monday, December 9, 13

Page 36: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Leveldb has high write amplification•Facebook Application 2:

• Level Style Compaction

• Write amplification of 70 very high

Two compactions by LevelDB Style Compaction

5 bytes

6 bytes

10 bytes

Level-0

Level-1

Level-2

Stage 1

11 bytes

10 bytes

Stage 2

10 bytes

Stage 3

Monday, December 9, 13

Page 37: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Our solution: lower write amplification•Facebook Application 2:

• We implemented Universal Style Compaction

• Start from newest file, include next file in candidate set if

• Candidate set size >= size of next file

Monday, December 9, 13

Page 38: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Our solution: lower write amplification•Facebook Application 2:

• We implemented Universal Style Compaction

• Start from newest file, include next file in candidate set if

• Candidate set size >= size of next file Single compaction by Universal Style Compaction

5bytes

6 bytes

10 bytes

Level-0

Level-1

Level-2

Stage 1

10 bytes

Stage 2

Write amplification reduced to <10Monday, December 9, 13

Page 39: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Leveldb has high read amplification

Monday, December 9, 13

Page 40: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Leveldb has high read amplification

•Secondary Index Service:

•Leveldb does not use blooms for scans

Monday, December 9, 13

Page 41: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Leveldb has high read amplification

•Secondary Index Service:

•Leveldb does not use blooms for scans

•We implemented prefix scans

• Range scans within same key prefix

• Blooms created for prefix

• Reduces read amplification

Monday, December 9, 13

Page 42: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Leveldb: read modify write = 2X IOs

Monday, December 9, 13

Page 43: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Leveldb: read modify write = 2X IOs

•Counter increments

• Get value, value++, Put value

• Leveldb uses 2X IOPS

Monday, December 9, 13

Page 44: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Leveldb: read modify write = 2X IOs

•Counter increments

• Get value, value++, Put value

• Leveldb uses 2X IOPS

•We implemented MergeRecord

• Put “++” operation in MergeRecord

• Background compaction merges all MergeRecords

• Uses only 1X IOPS

Monday, December 9, 13

Page 45: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Leveldb has a Rigid Design

Monday, December 9, 13

Page 46: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Leveldb has a Rigid Design

•LevelDB Design

• Cannot tune system, fixed file sizes

Monday, December 9, 13

Page 47: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Leveldb has a Rigid Design

•LevelDB Design

• Cannot tune system, fixed file sizes

•We wanted a pluggable architecture

• Pluggable compaction filter, e.g. TimeToLive

• Pluggable memtable/sstable for RAM/Flash

• Pluggable Compaction Algorithm

Monday, December 9, 13

Page 48: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

The Changes we did to LevelDB

Monday, December 9, 13

Page 49: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

The Changes we did to LevelDB

Inherited from LevelDB

• Log Structured Merge DB

• Gets/Puts/Scans of keys

• Forward and Reverse Iteration

Monday, December 9, 13

Page 50: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

The Changes we did to LevelDB

Inherited from LevelDB

• Log Structured Merge DB

• Gets/Puts/Scans of keys

• Forward and Reverse Iteration

RocksDB

• 10X higher write rate

• Fewer stalls

• 7x lower write amplification

• Blooms for range scans

• Ability to avoid read-modify-write

• Optimizations for flash or RAM

• And many more…

Monday, December 9, 13

Page 51: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB is born!

•Key-Value persistent store

•Embedded

•Optimized for fast storage

•Server workloads

Monday, December 9, 13

Page 52: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB is born!

•Key-Value persistent store

•Embedded

•Optimized for fast storage

•Server workloads

Monday, December 9, 13

Page 53: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

What is it not?

•Not distributed

•No failover

•Not highly-available, if machine dies you lose your data

Monday, December 9, 13

Page 54: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

What is it not?

•Not distributed

•No failover

•Not highly-available, if machine dies you lose your data

Monday, December 9, 13

Page 55: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB API▪ Keys and values are arbitrary byte arrays.

▪ Data is stored sorted by key.

▪ The basic operations are Put(key,value), Get(key), Delete(key) and Merge(key, delta)

▪ Forward and backward iteration is supported over the data.

Monday, December 9, 13

Page 56: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB Architecture

FlushCompaction

Active MemTable

ReadOnlyMemTable

log

log

log

LSM

sst sst

sst sst

sst

sst

d

Switch Switch

Monday, December 9, 13

Page 57: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB Architecture

Write Request

FlushCompaction

Active MemTable

ReadOnlyMemTable

log

log

log

LSM

sst sst

sst sst

sst

sst

d

Switch Switch

Monday, December 9, 13

Page 58: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB Architecture

Write Request

Read Request

FlushCompaction

Active MemTable

ReadOnlyMemTable

log

log

log

LSM

sst sst

sst sst

sst

sst

d

Switch Switch

Monday, December 9, 13

Page 59: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB Architecture

Write Request

Read Request

FlushCompaction

Active MemTable

ReadOnlyMemTable

log

log

log

LSM

sst sst

sst sst

sst

sst

d

Switch

Memory

Switch

Monday, December 9, 13

Page 60: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB Architecture

Write Request

Read Request

FlushCompaction

Active MemTable

ReadOnlyMemTable

log

log

log

LSM

sst sst

sst sst

sst

sst

d

Switch

Memory Persistent Storage

Switch

Monday, December 9, 13

Page 61: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Log Structured Merge Tree -- Writes

▪ Log Structured Merge Tree

▪ New Puts are written to memory and optionally to transaction log

▪ Also can specify log write sync option for each individual write

▪ We say RocksDB is optimized for writes, what does this mean?

Monday, December 9, 13

Page 62: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB Write Path

FlushCompaction

Active MemTable

ReadOnlyMemTable

log

log

log

LSM

sst sst

sst sst

sst

sst

d

Switch Switch

Monday, December 9, 13

Page 63: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB Write Path

Write Request

FlushCompaction

Active MemTable

ReadOnlyMemTable

log

log

log

LSM

sst sst

sst sst

sst

sst

d

Switch Switch

Monday, December 9, 13

Page 64: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Log Structured Merge Tree -- Reads

▪ Data could be in memory or on disk

▪ Consult multiple files to find the latest instance of the key

▪ Use bloom filters to reduce IO

Monday, December 9, 13

Page 65: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB Read Path

FlushCompaction

Active MemTable

ReadOnlyMemTable

log

log

log

LSM

sst sst

sst sst

sst

sst

d

Blooms

Monday, December 9, 13

Page 66: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB Read Path

Read Request

FlushCompaction

Active MemTable

ReadOnlyMemTable

log

log

log

LSM

sst sst

sst sst

sst

sst

d

Blooms

Monday, December 9, 13

Page 67: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB Read Path

Read Request

FlushCompaction

Active MemTable

ReadOnlyMemTable

log

log

log

LSM

sst sst

sst sst

sst

sst

d

Memory

Blooms

Monday, December 9, 13

Page 68: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB Read Path

Read Request

FlushCompaction

Active MemTable

ReadOnlyMemTable

log

log

log

LSM

sst sst

sst sst

sst

sst

d

Memory Persistent Storage

Blooms

Monday, December 9, 13

Page 69: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB: Open & Pluggable

Pluggable Memtable

format in RAM

Blooms

CustomizableWAL

Monday, December 9, 13

Page 70: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB: Open & Pluggable

Pluggable Memtable

format in RAM

Write Request from Application

Blooms

CustomizableWAL

Monday, December 9, 13

Page 71: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB: Open & Pluggable

Pluggable Memtable

format in RAM

Write Request from Application

Blooms

CustomizableWAL

Monday, December 9, 13

Page 72: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB: Open & Pluggable

Pluggable Memtable

format in RAM

Write Request from Application

Blooms

CustomizableWAL

Monday, December 9, 13

Page 73: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB: Open & Pluggable

Pluggable Memtable

format in RAM

Write Request from Application

Transaction log

Blooms

CustomizableWAL

Monday, December 9, 13

Page 74: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB: Open & Pluggable

Pluggable Memtable

format in RAM

Write Request from Application

Transaction log

Blooms

CustomizableWAL

Monday, December 9, 13

Page 75: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB: Open & Pluggable

Pluggable Memtable

format in RAM

Write Request from Application

Transaction log

Blooms

CustomizableWAL

Monday, December 9, 13

Page 76: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB: Open & Pluggable

Pluggable Memtable

format in RAM

Write Request from Application

Transaction log Pluggable sst data format on storage

Blooms

CustomizableWAL

Monday, December 9, 13

Page 77: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB: Open & Pluggable

Pluggable Memtable

format in RAM

Write Request from Application

Transaction log Pluggable sst data format on storage

Pluggable Compaction

Blooms

CustomizableWAL

Monday, December 9, 13

Page 78: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB: Open & Pluggable

Pluggable Memtable

format in RAM

Write Request from ApplicationGet or Scan Request from Application

Transaction log Pluggable sst data format on storage

Pluggable Compaction

Blooms

CustomizableWAL

Monday, December 9, 13

Page 79: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Example: Customizable WALogging

•In-house Replication solution wants to be able to embed arbitrary blob in the rocksdb WAL stream for log annotation

•Use Case: Indicate where a log record came from in multi-master replication

•Solution: A Put that only speaks to the log

Monday, December 9, 13

Page 80: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Example: Customizable WALogging

Active MemTable

log

Replication Layer In one write batch:

PutLogData(“I came from Mars”)Put(k1,v1)

k1 v1 “I came from Mars”

k1/v1

Monday, December 9, 13

Page 81: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Example: Customizable WALogging

Write Request

Active MemTable

log

Replication Layer In one write batch:

PutLogData(“I came from Mars”)Put(k1,v1)

k1 v1 “I came from Mars”

k1/v1

Monday, December 9, 13

Page 82: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Example: Pluggable SST format

•One Facebook use case needs extreme fast response but could tolerate some loss of durability

•Quick hack: mount sst in tmpfs

•Still not performant:

• existing sst format is block based

•Solution: A much simpler format that just stores sorted key/value pairs sequentially

• no blocks, no caching, mmap the whole file

• build efficient lookup index on load

Monday, December 9, 13

Page 83: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Example: Blooms for MemTable

•Same use case, after we optimized sst access, memtable lookup becomes a major cost in query

•Problem: Get needs to go through the memtable lookups that eventually return no data

•Solution: Just add a bloom filter to memtable!

Monday, December 9, 13

Page 84: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB Read Path

FlushCompaction

Active MemTable

ReadOnlyMemTable

log

log

log

LSM

sst sst

sst sst

sst

sst

d

Blooms

Blooms

Blooms

Monday, December 9, 13

Page 85: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB Read Path

Read Request

FlushCompaction

Active MemTable

ReadOnlyMemTable

log

log

log

LSM

sst sst

sst sst

sst

sst

d

Blooms

Blooms

Blooms

Monday, December 9, 13

Page 86: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB Read Path

Read Request

FlushCompaction

Active MemTable

ReadOnlyMemTable

log

log

log

LSM

sst sst

sst sst

sst

sst

d

Memory

Blooms

Blooms

Blooms

Monday, December 9, 13

Page 87: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

RocksDB Read Path

Read Request

FlushCompaction

Active MemTable

ReadOnlyMemTable

log

log

log

LSM

sst sst

sst sst

sst

sst

d

Memory Persistent Storage

Blooms

Blooms

Blooms

Monday, December 9, 13

Page 88: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Example: Pluggable memtable format

•Another Facebook use case has a distinct load phase where no query is issued.

•Problem: write throughput is limited by single writer thread

•Solution: A new memtable representation that does not keep keys sorted

Monday, December 9, 13

Page 89: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Example: Pluggable memtable format

Sort, Flush

Compaction

Unsorted MemTable

ReadOnlyMemTable

log

log

log

LSM

sst sst

sst sst

sst

sst

d

Switch Switch

Monday, December 9, 13

Page 90: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Example: Pluggable memtable format

Write Request

Sort, Flush

Compaction

Unsorted MemTable

ReadOnlyMemTable

log

log

log

LSM

sst sst

sst sst

sst

sst

d

Switch Switch

Monday, December 9, 13

Page 91: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Example: Pluggable memtable format

Write Request

Read Request

Sort, Flush

Compaction

Unsorted MemTable

ReadOnlyMemTable

log

log

log

LSM

sst sst

sst sst

sst

sst

d

Switch Switch

Monday, December 9, 13

Page 92: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Example: Pluggable memtable format

Write Request

Read Request

Sort, Flush

Compaction

Unsorted MemTable

ReadOnlyMemTable

log

log

log

LSM

sst sst

sst sst

sst

sst

d

Switch

Memory

Switch

Monday, December 9, 13

Page 93: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Example: Pluggable memtable format

Write Request

Read Request

Sort, Flush

Compaction

Unsorted MemTable

ReadOnlyMemTable

log

log

log

LSM

sst sst

sst sst

sst

sst

d

Switch

Memory Persistent Storage

Switch

Monday, December 9, 13

Page 94: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Possible workloads for RocksDB?

▪ Serving data to users via a website

▪ A spam detection backend that needs fast access to data

▪ A graph search query that needs to scan a dataset in realtime

▪ Distributed Configuration Management Systems

▪ Fast serve of Hive Data

▪ A Queue that needs a high rate of inserts and deletes

Monday, December 9, 13

Page 95: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Futures

Monday, December 9, 13

Page 96: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Futures

•Scale linearly with number of cpus

• 32, 64 or higher core machines

• ARM processors

Monday, December 9, 13

Page 97: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Futures

•Scale linearly with number of cpus

• 32, 64 or higher core machines

• ARM processors

•Scale linearly with storage iops

• Striped flash cards

• RAM & NVRAM storage

Monday, December 9, 13

Page 98: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Come Hack with us

Monday, December 9, 13

Page 99: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Come Hack with us

•RocksDB is Open Sourced

• http://rocksdb.org

• Developers group https://www.facebook.com/groups/rocksdb.dev/

Monday, December 9, 13

Page 100: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Come Hack with us

•RocksDB is Open Sourced

• http://rocksdb.org

• Developers group https://www.facebook.com/groups/rocksdb.dev/

•Help us HACK RocksDB

Monday, December 9, 13

Page 101: Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

Monday, December 9, 13