rocksdb vs. trocksdb - snia · rocksdb has some challenges for ssds: write amplification is very...

32
Remi Brasga Sr. Software Engineer | Memory Storage Strategy Division Toshiba Memory America, Inc. TRocksDB

Upload: others

Post on 21-May-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

Remi BrasgaSr. Software Engineer | Memory Storage Strategy Division

Toshiba Memory America, Inc.

TRocksDB

Page 2: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

Hi!

I’m Remi (Remington Brasga).

As a Sr. Software Engineer for the Memory Storage Strategy Division at Toshiba Memory America, Inc., I am dedicated to development and collaboration on open source software.I earned my B.S. and M.S. from University of California, Irvine.

Today, I am here fresh from having my 3rd son who has kept me very busy the past few weeks.

(thanks for letting me come here and rest!)

Page 3: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

Applications are King

• Software is eating the world

• Storage is “Defined by Software”– i.e., applications define how storage is used

• Yet, applications do not always use storage wisely

Page 4: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

For Example, RocksDB

RocksDB is a popular data storage engine…used by a wide range of database applications

Cassandra is a registered trademark of The Apache Software Foundation. Ceph is a trademark of Red Hat, Inc. or its subsidiaries in the United States and other countries. Python is a registered trademark of the Python Software Foundation. MariaDB is a registered trademark of MariaDB in the European Union and other regions.

ArangoDB Ceph™

Python®MyRocks Rockset

Cassandra® MariaDB®

All other company names, product names and service names may be trademarks of their respective companies.

Page 5: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

The Challenge of Rocks

While highly effective as a data storage engine,RocksDB has some challenges for SSDs:

Write Amplification is very large 20-30X (or more)as a result of the compaction layers rewriting the same data

This can result in impact to SSD endurance.

Page 6: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

Level 4Target 1000 GB

Level 1Target 1GB

Level 2Target 10 GB

Level 3Target 100 GB

Level 0

How does RocksDB work?

Keys and Values are stored together in pairs.

As more data is added, these pairs waterfall

through the levels.

This waterfall occurs during each compaction.

K+V

DataKey

Page 7: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

Let’s Demonstrate

Page 8: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

RocksDB…the Old Way…Log Structured Merge

CPU

DRAM

Compaction Level 2

Compaction Level 3

Compaction Level 1

Key & Value Key & Value Key & Value

New key value database entries

Saved to SSD in compaction level 1

LSM

Page 9: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

RocksDB…the Old Way…Log Structured Merge

CPU

DRAM

Compaction Level 2

Compaction Level 3

Compaction Level 1

More writes to compaction level 1

Key & Value Key & Value Key & Value Key & Value Key & Value Key & Value Key & Value

Compactionlevel 1 fills up

LSM

Page 10: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

RocksDB…the Old Way…Log Structured Merge

CPU

DRAM

Compaction Level 2

Compaction Level 3

Compaction Level 1 Key & Value Key & Value Key & Value Key & Value Key & Value Key & Value Key & ValueCompaction level 1 is full;

Written to RAM…

Consolidated for compaction level 2

LSM

Page 11: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

RocksDB…the Old Way…Log Structured Merge

CPU

DRAM

Compaction Level 2

Compaction Level 3

Compaction Level 1

Key & Value Key & Value Key & Value Key & Value Key & Value Key & Value Key & Value

Consolidated datawritten to compaction level 2

Application-generated write amplification

LSM

Page 12: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

RocksDB…the Old Way…Log Structured Merge

CPU

DRAM

Compaction Level 2

Compaction Level 3

Compaction Level 1

Compaction level 2 filling up Application-generated write amplification

Key & Value Key & Value Key & Value Key & Value Key & Value Key & Value Key & ValueKey & Value Key & Value Key & Value Key & Value Key & Value Key & Value Key & ValueKey & Value Key & Value Key & Value Key & Value Key & Value Key & Value Key & ValueKey & Value Key & Value Key & Value Key & Value Key & Value Key & Value Key & ValueKey & Value Key & Value Key & Value Key & Value Key & Value Key & Value Key & Value

LSM

Page 13: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

RocksDB…the Old Way…Log Structured Merge

CPU

DRAM

Compaction Level 2

Compaction Level 3

Compaction Level 1

Compaction level 2 fills up

Written to DRAM,and consolidated

LSM

Page 14: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

RocksDB…the Old Way…Log Structured Merge

CPU

DRAM

Compaction Level 2

Compaction Level 3

Compaction Level 1Written to compaction

level 3Application-generated

write amplificationLSM

Page 15: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

Result

As you can see, • This results in rewriting data “that doesn’t change” during compaction• This can have a significant impact on SSD health and endurance

The good news is…

There is a better way

Toshiba Memory Americahas

re-architected RocksDB to solve this SSD endurance challenge

Page 16: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

Inspiration

We looked at some of the academic work on Rocks

– Major inspiration came from the “wisckey” from the university of WisconsinWiscKey: Separating Keys from Values in SSD-conscious storage

https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf

Page 17: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

Coding Work

Significant code was involved to split the keys from the values

• Replaced the LSM key value pairs with the ring buffer (for a Value Log - VLog)

• The VLog is a large number of files, – with each file having about the same number of values as an SST* has keys

• VLogFiles are organized into FIFO structures called VLogRings

• One tricky issue - keeping track of the state of the VLog over a restart• Active recycling added to compactions to reduce instances of

uncompacted files

*SST=Stored String Table

Page 18: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

Coding Work

Work involved managing the keys

• While keys mostly are cached in memory • They still need to be written into LSM layers

• Compactions still occur, – about one per second as usual but only on the keys

• Coding for compaction involved – Compacting the keys– And cleaning up the stale data in the ring buffers

Page 19: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

Performance Tuning

Tuning performance in the code was a challenge

• Some changes made performance worse (against every expectation…)

• The split keys and values approach needed new test cases

• There are a number of new option settings to enable optimizationsthis added improvements to – Compaction picker improved ( to reduce need of active recycling )– Write amp and space amp measuring– Bulk load efficiency

Page 20: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

Coding Merge

Code Merge:

RocksDB continues to evolve – TRocksDB is maintaining version compatibility with each

current revision of RocksDB– TRocks is fully compatible and cross compiles with/into

regular RocksDB code today– Once compiled Rocks or TRocks can be enabled with option

settings

Page 21: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

How does TRocksDB work?

Keys and Valuesare split

Keys go into compaction layers (in memory)

Values are stored separately into

a ring buffer

Value Log

V

Level 4Target 1000 GB

Level 1Target 1GB

Level 2Target 10 GB

Level 3Target 100 GB

Level 0K

DataKey

V

Value is unaffected by compactions

Page 22: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

The Improved TRocksDB Method

Page 23: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

TRocksDB…the New Way Ring Buffer, Separate Keys/Values

CPU

DRAM

Key Value Key Value Key Value Key Value

Ring Buffer

LSM

Create new database entries

Page 24: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

TRocksDB…the New Way Ring Buffer, Separate Keys/Values

CPU

DRAM

Ring Buffer

LSM

Key Value Key Value Key Value Key Value

Write Values to Ring Buffer

Keys stored separately (cacheable)

SSD endurance improved

via lowered Write Amp

Lookups are lightening fast

with keys cached in DRAM

Page 25: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

Our Goal during Development

Improved(reduced) Write Amplification with

the same or better performance

Page 26: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

Improved Write Amplification

0

1

2

3

4

5

Test 1:Random Bulk Load

Test 2: Bulk Sequential Load

Test 3:Random Overwrites

0.6 0.6

2.1

1 1

4.9

Write Amplification(lower is better)

TRocks 1.2.5.23 Rocks FB 6.4

-40% -40%

-57%

Performance results based on internal Toshiba Memory testing. Benchmark tests were developed by Facebook and generated in July 2018. They measure RocksDB performance when data resides on flash storage and tested using the newest release RocksDB 6.1.fb. The benchmark information and test descriptions are available at https://github.com/facebook/rocksdb/wiki/Performance-Benchmarks.

Page 27: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

0

20,000

40,000

60,000

80,000

100,000

120,000

140,000

160,000

Test 1:Random Bulk Load

Test 2: Bulk Sequential Load

Test 3:Random Overwrites

Test 4:Random Key Reads

21,565

6,1559,595

118,244

29,742

6,938 9,890

155,081

Performance Comparison(lower is better)

TRocks 1.2.5.23 Rocks FB 6.4

-28%-11% -3%

-24%

Same or Better Performance

Performance results based on internal Toshiba Memory testing. Benchmark tests were developed by Facebook and generated in July 2018. They measure RocksDB performance when data resides on flash storage and tested using the newest release RocksDB 6.1.fb. The benchmark information and test descriptions are available at https://github.com/facebook/rocksdb/wiki/Performance-Benchmarks.

Seconds

Page 28: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

Tuning and Optimization

• Every database is different

• Rocks and TRocks have a wide range settings / optionsto optimize the database

• These settings are often confusing or “left in default”

GitHub is an exclusive trademark registered in the United States by GitHub, Inc. All other company names, product names and service names may be trademarks of their respective companies.

Page 29: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

Tuning and Optimization• Toshiba Memory America has

created a “tuning” tool

– It is on a Google spreadsheet• Available from the GitHub®

– Easy to use - just fill in the blanks

– It generates known good settings

Fill in the blanks

Generate “known-good” options settings

GitHub is an exclusive trademark registered in the United States by GitHub, Inc. All other company names, product names and service names may be trademarks of their respective companies.

Page 30: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

Call to Action

DownloadTestUse

Enjoy !

Join the projectimprove and contribute to the code

Page 31: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

Developed by Toshiba Memory America(Which will be renamed KIOXIA America, Inc. as of October 1, 2019)

Open Source

Improved write amplification = SSD last longerwith the same or better performance

Available today:https://github.com/ToshibaMemoryAmerica

All company names, product names and service names may be trademarks of their respective companies.

Page 32: RocksDB vs. TRocksDB - SNIA · RocksDB has some challenges for SSDs: Write Amplification is very large . 20-30X (or more) as a result of the compaction layers rewriting the same data

© 2019 Toshiba Memory America, Inc. All rights reserved. Information, including product pricing and specifications, content of services, and contact information is current and believed to be accurate on the date of the announcement, but is subject to change without prior notice. Technical and application information contained here is subject to the most recent applicable Toshiba Memory product specifications.