a next generation storage engine for nosql database systems preview: couchbase connect 2014

Preview: A Next Generation Storage Engine for NoSQL Database Systems Chiyoung Seo Software Architect, Couchbase Inc.

Upload: couchbase

Post on 29-May-2015


DESCRIPTION

The B+-tree has been one of the main index structures in the database field for more than four decades. However, typical B+-tree implementations show scalability and performance issues as modern global-scale Web and mobile applications generate unprecedented volumes of data. Various key-value storage engines based on variants of the B+-tree, such as the log-structured merge tree (LSM-tree), have been proposed to address these limitations. At Couchbase, we have also been working on a new key-value storage engine that provides high scalability and performance, and we recently released the beta version of ForestDB, whose main index structure is based on the Hierarchical B+-Tree based Trie, or HB+-Trie. In this presentation, we introduce ForestDB and discuss why it is a good fit for modern big data applications. We also explain various ForestDB optimizations that are planned especially for solid-state drives (SSDs).

TRANSCRIPT

Page 1: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Preview: A Next Generation Storage Engine for NoSQL Database Systems

Chiyoung Seo

Software Architect, Couchbase Inc.

Page 2: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

©2014 Couchbase, Inc. 2

Why do we need a new KV storage engine?

ForestDB
- HB+-Trie
- Block aligning
- Write Ahead Logging (WAL)
- Block buffer cache
- Evaluation

Optimizations for Solid-State Drives (SSDs)
- Async I/O to exploit parallel I/O capabilities from SSDs
- Volume manager inside ForestDB
- Lightweight and I/O efficient compaction
- Append / Prepend support

Summary

Contents

Page 3: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Why do we need a new KV storage engine?

Page 4: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014


Operate on huge volumes of unstructured data

Significant amount of new data is constantly generated from hundreds of millions of users or devices

Still require high performance and scalability in managing their ever-growing database

Underlying storage engine is one of the most critical parts in database systems to provide high performance / scalability

Modern Web / Mobile Applications

Page 5: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014


Main storage index structure in the database field

Generalization of binary search tree

Each node consists of two or more {key, value (or pointer)} pairs
- Fanout (or branch) degree: # of KV pairs in a node

Node size is generally a multiple of the page size

B+Tree

Page 6: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

B+Tree

[Figure: B+Tree structure. The root node and the index (non-leaf) nodes hold entries K1 P1 … Kf Pf; the leaf nodes hold entries K1 V1 … Kf Vf.]

Ki: i-th smallest key in the node

Pi: pointer corresponding to Ki

Vi: value corresponding to Ki

f: fanout degree

Page 7: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Not suitable to index variable or fixed-length long keys
- Significant space overhead as entire key strings are indexed in non-leaf nodes

Tree depth grows quickly as more data is loaded

I/O performance is degraded significantly as the data size gets bigger

Several variants of B+Tree were proposed
- LevelDB (Google)
- RocksDB (Facebook)
- TokuDB (Tokutek)

B+Tree Limitations
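
The link between long keys and tree depth can be illustrated with a rough calculation. This is only a sketch: the 4 KB node size, the 8-byte pointer, and the function name `btree_depth` are assumptions for illustration, not figures from the slides.

```python
def btree_depth(num_keys: int, key_size: int,
                node_size: int = 4096, ptr_size: int = 8) -> int:
    """Estimate the depth of a B+Tree whose entries store full keys."""
    # Each entry holds one key and one pointer, so longer keys lower the fanout.
    fanout = node_size // (key_size + ptr_size)
    if fanout < 2:
        raise ValueError("node too small for this key size")
    depth, capacity = 1, fanout
    while capacity < num_keys:
        capacity *= fanout  # one more level multiplies capacity by the fanout
        depth += 1
    return depth


for size in (8, 64, 256):
    print(f"key size {size:3d} B -> depth {btree_depth(100_000_000, size)}")
```

Under these assumptions, growing the key from 8 to 256 bytes drops the fanout from 256 to 15, so the same 100M keys need noticeably more levels, and each extra level is one more block read per lookup.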


Page 8: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Fast and scalable index structure for variable or fixed-length long keys
- Targeting block I/O storage devices: not only SSDs but also legacy HDDs

Less storage space overhead
- Reduce write amplification

Regardless of the pattern of keys
- Efficient both for keys that share a common prefix and for keys that do not

Goals


Page 9: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

ForestDB

Page 10: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014


Key-Value storage engine developed by Couchbase Caching / Storage team

Its main index structure is built from Hierarchical B+-Tree based Trie or HB+-Trie

- HB+-Trie was originally presented at ACM SIGMOD 2011 Programming Contest, by Jung-Sang Ahn who works at Couchbase

(http://db.csail.mit.edu/sigmod11contest/sigmod_2011_contest_poster_jungsang_ahn.pdf)

Significantly better read and write performance with less storage overhead

Supports various server OSs (CentOS, Ubuntu, Debian, Mac OS X, Windows) and mobile OSs (iOS, Android)

1.0 beta was released last week

ForestDB

Page 11: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Multi-Version Concurrency Control (MVCC) with append-only storage model

Write-Ahead Logging (WAL)

A value can be retrieved by its sequence number or disk offset in addition to a key

Custom compare function to support a customized key order

Snapshot support to provide different views of database

Rollback to revert the database to a specific point

Ranged iteration by keys or sequence numbers

Transactional support with read_committed or read_uncommitted isolation level

Manual or auto compaction configured per KV instance

Main Features

Page 12: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

ForestDB: Main Index Structure

Page 13: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Trie (prefix tree) whose node is B+Tree

A key is split into the list of fixed-size chunks (sub-string of the key)

HB+Trie (Hierarchical B+Tree based Trie)

[Figure: a variable-length key a83jgls83jgo29a… is split into fixed-size (e.g., 4-byte) chunks: Chunk1 = a83j, Chunk2 = gls8, Chunk3 = 3jgo, … Each node of the HB+Trie is a B+Tree. The search uses Chunk1 in the first B+Tree, then Chunk2, then Chunk3, until it reaches the document. Traversal is in lexicographical order.]
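
Splitting a key into fixed-size chunks can be sketched as follows. The helper name `split_into_chunks` is hypothetical; the 4-byte chunk size and the sample key come from the slide.

```python
def split_into_chunks(key: bytes, chunk_size: int = 4) -> list[bytes]:
    """Split a key into consecutive fixed-size chunks (the last may be shorter)."""
    return [key[i:i + chunk_size] for i in range(0, len(key), chunk_size)]


chunks = split_into_chunks(b"a83jgls83jgo")
print(chunks)  # [b'a83j', b'gls8', b'3jgo']
```

Each chunk then serves as the lookup key for one B+Tree level of the trie.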

Page 14: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Prefix Compression

As in the original trie, each node (B+Tree) is created on demand (except for the root node)

Example: chunk size = 1 byte

Insert ‘aaaa’

[Figure: the root B+Tree, which uses the 1st chunk as its key.]

Page 15: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Prefix Compression

Insert ‘aaaa’

Distinguishable by first chunk ‘a’

[Figure: the root B+Tree (1st chunk as key) holds one entry, a -> aaaa.]

Page 16: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Prefix Compression

Insert ‘bbbb’

Distinguishable by first chunk ‘b’

[Figure: the root B+Tree (1st chunk as key) holds two entries, a -> aaaa and b -> bbbb.]

Page 17: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Prefix Compression

Insert ‘aaab’

Cannot distinguish using first chunk ‘a’

[Figure: the root B+Tree (1st chunk as key) holds two entries, a -> aaaa and b -> bbbb.]

Page 18: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Prefix Compression

Insert ‘aaab’

Cannot distinguish using first chunk ‘a’

First distinguishable chunk: 4th

[Figure: the root B+Tree (1st chunk as key) holds two entries, a -> aaaa and b -> bbbb.]

Page 19: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Prefix Compression

Store skipped common prefix ‘aa’

[Figure: the root B+Tree (1st chunk as key) holds a -> new B+Tree and b -> bbbb. The new B+Tree uses the 4th chunk as key, skipping the common prefix ‘aa’, and holds a -> aaaa and b -> aaab.]

Page 20: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Prefix Compression

Insert ‘bbcd’

Cannot distinguish using first chunk ‘b’

[Figure: the root B+Tree (1st chunk as key) holds a -> B+Tree and b -> bbbb. The child B+Tree uses the 4th chunk as key, skipping the common prefix ‘aa’, and holds a -> aaaa and b -> aaab.]

Page 21: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Prefix Compression

Insert ‘bbcd’

Cannot distinguish using first chunk ‘b’

First distinguishable chunk: 3rd

[Figure: the root B+Tree (1st chunk as key) holds a -> B+Tree and b -> bbbb. The child B+Tree uses the 4th chunk as key, skipping the common prefix ‘aa’, and holds a -> aaaa and b -> aaab.]

Page 22: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Prefix Compression

Store skipped common prefix ‘b’

[Figure: the root B+Tree (1st chunk as key) holds a -> B+Tree and b -> B+Tree. The first child uses the 4th chunk as key, skipping the common prefix ‘aa’, and holds a -> aaaa and b -> aaab. The second child uses the 3rd chunk as key, skipping the common prefix ‘b’, and holds b -> bbbb and c -> bbcd.]
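
The insertion sequence on these slides can be replayed with a toy model. This is a simplified sketch, not ForestDB's implementation: a Python dict stands in for each B+Tree, chunk numbers are 0-based (so the slides' "4th chunk" is `chunk_no == 3`), and the names `Node`, `insert`, and `lookup` are hypothetical.

```python
CHUNK = 1  # chunk size in bytes, as in the slides' example

def chunk(key, i):
    """Return the i-th fixed-size chunk of key (0-based; '' past the end)."""
    return key[i * CHUNK:(i + 1) * CHUNK]

def first_diff(a, b, start):
    """Index of the first chunk at or after start where a and b differ."""
    i = start
    while chunk(a, i) == chunk(b, i):
        i += 1
    return i

class Node:
    """One B+Tree of the HB+Trie; a dict stands in for the B+Tree."""
    def __init__(self, chunk_no, skipped=""):
        self.chunk_no = chunk_no  # 0-based index of the chunk used as key
        self.skipped = skipped    # common prefix skipped since the parent
        self.entries = {}         # chunk -> document key or child Node

def insert(node, key):
    c = chunk(key, node.chunk_no)
    old = node.entries.get(c)
    if old is None or old == key:
        node.entries[c] = key  # distinguishable by this chunk
        return
    start = node.chunk_no + 1
    if isinstance(old, Node):
        if key[start * CHUNK:old.chunk_no * CHUNK] == old.skipped:
            insert(old, key)  # skipped prefix matches: descend
            return
        # Key diverges inside the skipped prefix: split it with a new node.
        path = key[:start * CHUNK] + old.skipped
        d = first_diff(key, path, start)
        child = Node(d, old.skipped[:(d - start) * CHUNK])
        child.entries[chunk(path, d)] = old
        old.skipped = old.skipped[(d - start + 1) * CHUNK:]
    else:
        # Two documents collide: index them by their first differing chunk.
        d = first_diff(key, old, start)
        child = Node(d, key[start * CHUNK:d * CHUNK])
        child.entries[chunk(old, d)] = old
    child.entries[chunk(key, d)] = key
    node.entries[c] = child

def lookup(node, key):
    while True:
        hit = node.entries.get(chunk(key, node.chunk_no))
        if not isinstance(hit, Node):
            return hit if hit == key else None
        start = node.chunk_no + 1
        if key[start * CHUNK:hit.chunk_no * CHUNK] != hit.skipped:
            return None  # key does not share the skipped prefix
        node = hit

root = Node(0)
for k in ("aaaa", "bbbb", "aaab", "bbcd"):
    insert(root, k)
```

After the four inserts, `root.entries["a"]` is a child keyed on the 4th chunk with skipped prefix `"aa"`, and `root.entries["b"]` is a child keyed on the 3rd chunk with skipped prefix `"b"`, matching the final slide of the sequence.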

Page 23: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Benefits

When keys are sufficiently long & uniform random (e.g., UUID or hash value)

When keys have common prefixes (e.g., secondary index keys)

Example: Chunk size = 2 bytes

Insert a83jfl2iejzm302k, dpwk3gjrieorigje, z9382h3igor8eh4k, 283hgoeir8goerha, 023o8f9o8zufisue

[Figure: a single root B+Tree (1st chunk as key) with entries a8 -> a83jfl2iejzm302k, dp -> dpwk3gjrieorigje, z9 -> z9382h3igor8eh4k, 28 -> 283hgoeir8goerha, 02 -> 023o8f9o8zufisue.]

Page 24: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Benefits

Majority of keys can be indexed by first chunk
- There will be only one B+Tree on HB+Trie
- We don’t need to store & compare entire key string

[Figure: with chunk size = 2 bytes, the five keys a83jfl2iejzm302k, dpwk3gjrieorigje, z9382h3igor8eh4k, 283hgoeir8goerha, and 023o8f9o8zufisue are all indexed by a single root B+Tree using their first chunks a8, dp, z9, 28, and 02.]

Page 25: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Benefits

When chunk size is n-bit
- Up to 2^n keys can be indexed by only the first chunk
- n = 32 (4 bytes): 2^32 ≈ 4 billion
- n = 64 (8 bytes): 2^64 ≈ 10^19

[Figure: with chunk size = 2 bytes, the five keys a83jfl2iejzm302k, dpwk3gjrieorigje, z9382h3igor8eh4k, 283hgoeir8goerha, and 023o8f9o8zufisue are all indexed by a single root B+Tree using their first chunks a8, dp, z9, 28, and 02.]
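
The capacity figures for an n-bit first chunk can be checked directly. The helper name `first_chunk_capacity` is hypothetical; the bound is simply the number of distinct n-bit values.

```python
def first_chunk_capacity(n_bits: int) -> int:
    """Number of distinct values an n-bit chunk can take."""
    return 2 ** n_bits


for n in (32, 64):
    print(f"n = {n}: {first_chunk_capacity(n):.3e} distinct first chunks")
```

This is an upper bound: it is reached only if every key has a different first chunk, which is why the slides single out long, uniformly random keys.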

Page 26: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

ForestDB maintains two index structures
- HB+Trie: key index
- Sequence B+Tree: sequence number (8-byte integer) index

Retrieve the file offset to a value using key or sequence number

ForestDB Index Structures

[Figure: the HB+Trie (looked up by key) and the sequence B+Tree (looked up by sequence number) both point to documents (Doc) stored in the DB file.]
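
A minimal sketch of the two-index idea: both indexes map to a file offset, and the document is read from an append-only file at that offset. Plain dicts stand in for the HB+Trie and the sequence B+Tree, a `bytearray` stands in for the DB file, and the record layout and the class name `TinyStore` are assumptions for illustration only.

```python
import struct

HEADER = struct.Struct(">HI")  # assumed record header: key length, value length


class TinyStore:
    def __init__(self):
        self.file = bytearray()  # stand-in for the append-only DB file
        self.key_index = {}      # stand-in for the HB+Trie: key -> offset
        self.seq_index = {}      # stand-in for the sequence B+Tree: seq -> offset
        self.seq = 0

    def set(self, key: bytes, value: bytes) -> int:
        offset = len(self.file)  # documents are only ever appended
        self.file += HEADER.pack(len(key), len(value)) + key + value
        self.seq += 1
        self.key_index[key] = offset
        self.seq_index[self.seq] = offset
        return self.seq

    def _read(self, offset: int):
        klen, vlen = HEADER.unpack_from(self.file, offset)
        start = offset + HEADER.size
        key = bytes(self.file[start:start + klen])
        value = bytes(self.file[start + klen:start + klen + vlen])
        return key, value

    def get(self, key: bytes):
        offset = self.key_index.get(key)
        return None if offset is None else self._read(offset)[1]

    def get_by_seq(self, seq: int):
        offset = self.seq_index.get(seq)
        return None if offset is None else self._read(offset)


store = TinyStore()
store.set(b"doc1", b"v1")
store.set(b"doc1", b"v2")
```

Here `store.get(b"doc1")` returns the newest value, while `store.get_by_seq(1)` still reaches the first version because the old record remains in the append-only file.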

Page 27: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

ForestDB: Write Ahead Logging

Page 28: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Append updates first, and update the main indexes later

Main purposes
- To maximize write throughput by sequential writes (append-only updates)
- To reduce # of index nodes to be written by batched updates

Write Ahead Logging

[Figure: the DB file is a sequence of appended Docs and Index nodes; a DB header (H) is appended for every commit. The WAL indexes are in-memory structures (hash tables): an ID index mapping h(key) -> offset and a seq no. index mapping h(seq no) -> offset.]

Page 29: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Write Ahead Logging

<Key query>
1. Retrieve WAL index first
2. If hit, return immediately
3. If miss, retrieve HB+Trie (or B+Tree)
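
The key-query order and the batched index update can be sketched as follows. This is an illustrative model only: dicts stand in for the in-memory WAL index and for the HB+Trie, the flush threshold counts documents, and the class name `WalIndexedStore` is hypothetical.

```python
class WalIndexedStore:
    def __init__(self, wal_limit: int = 4):
        self.wal_index = {}   # in-memory WAL index: key -> file offset
        self.main_index = {}  # stand-in for the HB+Trie: key -> file offset
        self.wal_limit = wal_limit
        self.flushes = 0      # how many batched main-index updates happened

    def set(self, key, offset):
        # The document itself would be appended to the file at `offset`;
        # only the WAL index is touched on the write path.
        self.wal_index[key] = offset
        if len(self.wal_index) >= self.wal_limit:
            self.flush()

    def flush(self):
        # Apply all pending entries to the main index in one batch.
        self.main_index.update(self.wal_index)
        self.wal_index.clear()
        self.flushes += 1

    def get(self, key):
        if key in self.wal_index:        # 1. retrieve WAL index first
            return self.wal_index[key]   # 2. hit: return immediately
        return self.main_index.get(key)  # 3. miss: consult the main index


db = WalIndexedStore(wal_limit=4)
for i in range(6):
    db.set(f"k{i}", i * 100)
```

With six writes and a limit of four, the main index is updated once for the first four keys, and the last two are still served from the WAL index.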

Page 30: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

ForestDB: Block Cache

Page 31: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

ForestDB has its own block cache layer

Managed on a block basis

Give higher priority to index node blocks than data blocks

Provide an option to bypass the OS page cache

Block Cache

[Figure: the HB+Trie (or Seq Index) and the WAL index issue block reads/writes to the block cache layer, which issues file reads/writes to the DB file (on the file system) when a cache miss or eviction occurs.]

Page 32: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014


Global LRU list for database files that are currently opened

Separate AVL tree for each file to keep track of dirty blocks

Separate hash table for each file with a key (block_id) and a value (pointer to a cache entry in either the clean LRU list or AVL tree)

Block Cache

[Figure: a file LRU list (File 4, File 2, File 1, File 5). Each file has a hash table mapping hash(BID) -> ptr to a cached block; dirty blocks are kept in an AVL-tree and clean blocks in a clean LRU list.]
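
The bookkeeping described on this slide can be approximated in a few lines. This is a loose sketch under several assumptions: `OrderedDict` plays the role of both LRU lists and of the per-file hash table for clean blocks, a plain dict sorted at flush time replaces the AVL tree, the eviction policy is invented for illustration, and the higher priority of index blocks is not modeled. The names `BlockCache` and `FileCache` are hypothetical.

```python
from collections import OrderedDict


class FileCache:
    def __init__(self):
        self.clean = OrderedDict()  # block_id -> data, oldest first (clean LRU)
        self.dirty = {}             # block_id -> data (the slides use an AVL tree)


class BlockCache:
    def __init__(self, capacity, backend):
        self.capacity = capacity    # max cached blocks across all files
        self.backend = backend      # dict (file, block_id) -> data; stands in for disk
        self.files = OrderedDict()  # file name -> FileCache, oldest first (file LRU)

    def _file(self, name):
        fc = self.files.setdefault(name, FileCache())
        self.files.move_to_end(name)  # mark the file as most recently used
        return fc

    def read(self, name, bid):
        fc = self._file(name)
        if bid in fc.dirty:
            return fc.dirty[bid]
        if bid in fc.clean:
            fc.clean.move_to_end(bid)
            return fc.clean[bid]
        data = self.backend[(name, bid)]  # cache miss: read from the file
        fc.clean[bid] = data
        self._evict()
        return data

    def write(self, name, bid, data):
        fc = self._file(name)
        fc.clean.pop(bid, None)
        fc.dirty[bid] = data
        self._evict()

    def flush(self, name):
        fc = self.files[name]
        for bid in sorted(fc.dirty):  # write back in block-id order
            self.backend[(name, bid)] = fc.dirty[bid]
            fc.clean[bid] = fc.dirty[bid]
        fc.dirty.clear()

    def _size(self):
        return sum(len(f.clean) + len(f.dirty) for f in self.files.values())

    def _evict(self):
        while self._size() > self.capacity:
            # Victim: the least recently used file that still caches a block.
            name, fc = next((n, f) for n, f in self.files.items()
                            if f.clean or f.dirty)
            if not fc.clean:
                self.flush(name)  # dirty blocks must reach disk before eviction
            fc.clean.popitem(last=False)


disk = {("f1", 0): b"a", ("f1", 1): b"b", ("f2", 0): b"c"}
cache = BlockCache(capacity=2, backend=disk)
cache.read("f1", 0)
cache.read("f1", 1)
cache.read("f2", 0)  # over capacity: evicts the oldest clean block of f1
```

Keeping dirty blocks ordered by block id lets a flush write them back in file order, which is presumably the purpose of the AVL tree on the slide.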

Page 33: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

ForestDB: Compaction

Page 34: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Manual compaction
- Performed by calling the compact public API manually

Daemon compaction
- A single daemon thread inside ForestDB manages the compaction automatically

A compactor thread can interleave with a writer thread
- While a compaction task is running, a writer thread can still write dirty items into the WAL section of a new file, which allows the compaction thread to be interleaved with the writer thread

Compaction

Page 35: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

ForestDB: Evaluation

Page 36: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Evaluation Environments
- 64-bit machine running CentOS 6.5
- Intel Xeon 2.00 GHz CPU (6 cores, 12 threads)
- 32GB RAM and Crucial M4 SSD

Data
- Key size 32 bytes and value size 1KB
- Load 100M items
- Logical data size 100GB total

ForestDB DGM Performance

Page 37: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

LevelDB
- Compression is disabled
- Write buffer size: 256 MB (initial load), 4 MB (otherwise)
- Buffer cache size: 8 GB

RocksDB
- Compression is disabled
- Write buffer size: 256 MB (initial load), 4 MB (otherwise)
- Maximum number of background compaction threads: 8
- Maximum number of background memtable flushes: 8
- Maximum number of write buffers: 8
- Buffer cache size: 8 GB (uncompressed)

ForestDB
- Compression is disabled
- WAL size: 4,096 documents
- Buffer cache size: 8 GB

KV Storage Engine Configurations

Page 38: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Initial Load Performance

3x ~ 6x less time

Page 39: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Initial Load Performance

4x less write overhead

Page 40: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Read-Only Performance

[Figure: read-only performance of ForestDB, LevelDB, and RocksDB. X-axis: # reader threads (1, 2, 4, 8). Y-axis: operations per second (0 to 30,000).]

2x ~ 5x

Page 41: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Write-Only Performance

[Figure: write-only performance of ForestDB, LevelDB, and RocksDB. X-axis: write batch size in # documents (1, 4, 16, 64, 256). Y-axis: operations per second (0 to 12,000).]

- Small batch size (e.g., < 10) is not usually common

3x ~ 5x

Page 42: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Write-Only Performance

[Figure: "Write Amplification" chart for ForestDB, LevelDB, and RocksDB. X-axis: write batch size in # documents (1, 4, 16, 64, 256). Y-axis: write amplification, normalized to a single doc size (0 to 450).]

ForestDB shows 4x ~ 20x less write amplification

Page 43: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Mixed Workload Performance

[Figure: "Mixed (Unrestricted) Performance" chart for ForestDB, LevelDB, and RocksDB. X-axis: # reader threads (1, 2, 4, 8). Y-axis: operations per second (0 to 12,000).]

2x ~ 5x

Page 44: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Optimizations for Solid-State Drives

Page 45: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

OS File System Stack Overhead

[Figure: two storage stacks side by side. Typical database storage stack: Database Storage Engine -> OS File System (Page Cache, Meta Data Mgmt, …) -> Block I/O Interface (SATA, PCI) -> SSDs. Advanced database storage stack: Database Storage Engine (Buffer Cache, Volume Manager) -> Block I/O Interface (SATA, PCI) -> SSDs.]

Page 46: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Bypass the entire OS file system stack

Volume manager
- Operate on unformatted disks
- Maintain the list of valid blocks used by database
- Garbage collect all invalid blocks

Buffer cache
- Allow us to use different cache policies based on application workload

Advanced Database Storage Stack

Page 47: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Required for append-only storage model
- Garbage collect stale data blocks

Use significant disk I/O bandwidth
- Read the entire database file and write all valid blocks into a new file

Affect other performance metrics
- Regular read / write performance drops significantly

Database Compaction
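
Why compaction costs so much I/O can be seen in a toy version of the copy step. This is a sketch only: a list of records stands in for the append-only file, a dict for the index, and the function name `compact` is hypothetical.

```python
def compact(log, index):
    """Copy only the live records of an append-only log into a new log.

    log   -- list of (key, value) records in append order
    index -- dict mapping each key to the position of its live record
    """
    new_log, new_index = [], {}
    for pos, (key, value) in enumerate(log):  # reads the entire old file
        if index.get(key) == pos:             # still the valid version?
            new_index[key] = len(new_log)
            new_log.append((key, value))      # rewritten into the new file
    return new_log, new_index


old_log = [("a", 1), ("b", 1), ("a", 2), ("b", 2), ("a", 3)]
old_index = {"a": 4, "b": 3}
new_log, new_index = compact(old_log, old_index)
```

Every record is read once and every live record is written again, so the cost grows with the file size rather than with the amount of stale data.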

Page 48: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

A logical page can change its physical address in flash memory whenever it is overwritten

For this reason, the Flash Translation Layer (FTL) maintains the mapping table between logical block addresses (LBA) and physical block addresses (PBA)

SWAT-Based Compaction Optimization

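[Figure: FTL address mapping (LBA -> PBA). Logical addresses in the file system (LBA) A, B, C, D, E, F, … map to physical addresses in flash memory (PBA); when A is overwritten, its new version A’ is written to a different physical page.]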

Page 49: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

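SWAT-Based Compaction Optimization

[Figure: the current DB file holds blocks A, B, C, D, E, F, G, H, I followed by the updated versions C’, F’, H’, I’, which make C, F, H, and I old versions. The new compacted DB file holds only the valid blocks G, E, A, B, D, C’, F’, H’, I’. Legend: Document; B+Tree (Node of HB+Trie); B+Tree Node; Old Ver. of B+Tree Node.]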

A new compacted file can be simply created by creating the new LBA to PBA mappings that contain the valid pages only in the current DB file

Need to extend the FTL by adding a new interface SWAT (Swap and Trim)
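
The remapping idea can be modeled with a dict as the FTL mapping table. This is a hypothetical model: the slides do not describe the actual SWAT interface, so the function name `swat`, its arguments, and the assumption that the new file uses a disjoint range of consecutive LBAs are all invented for illustration.

```python
def swat(ftl, old_lbas, valid, new_base):
    """Build the compacted file by remapping, without copying page contents.

    ftl      -- dict LBA -> PBA (the FTL mapping table), updated in place
    old_lbas -- LBAs of the current DB file, in file order
    valid    -- set of LBAs whose pages are still valid
    new_base -- first LBA of the new file (assumed not to overlap old_lbas)
    Returns the list of PBAs that were trimmed.
    """
    trimmed = []
    next_lba = new_base
    for lba in old_lbas:
        pba = ftl.pop(lba)
        if lba in valid:
            ftl[next_lba] = pba  # swap: the new LBA points at the same flash page
            next_lba += 1
        else:
            trimmed.append(pba)  # trim: the stale flash page can be reclaimed
    return trimmed


ftl = {0: 100, 1: 101, 2: 102, 3: 103}
freed = swat(ftl, old_lbas=[0, 1, 2, 3], valid={1, 3}, new_base=10)
```

In this model the valid pages are never read or rewritten; only mapping entries change, which is the source of the savings the slides attribute to SWAT.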

Page 50: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014


Implement SWAT interface on the OpenSSD development platform by adapting its FTL code

Total time taken for compactions was reduced by 17x

Number of compactions triggered was reduced by 4x

SWAT-Based Compaction Optimization

Page 51: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Exploit an async I/O library (e.g., libaio) to better utilize the parallel I/O capabilities of SSDs

Quite useful in querying secondary indexes when items satisfying a query predicate are located in multiple blocks on different channels

Utilizing Parallel Channels on SSDs
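
The general idea of keeping several block reads in flight at once can be sketched in Python. Note the substitution: this uses a thread pool with `os.pread` (available on Unix-like systems) rather than libaio, which has no standard Python binding. The 4 KB block size and the function name `read_blocks` are assumptions.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096  # assumed block size


def read_blocks(path, block_ids, workers=4):
    """Read several blocks concurrently; results follow the order of block_ids."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # pread takes an explicit offset, so threads can share one descriptor.
            return list(pool.map(lambda b: os.pread(fd, BLOCK, b * BLOCK),
                                 block_ids))
    finally:
        os.close(fd)


# Demo: a temporary file of 8 blocks, where block i is filled with byte value i.
with tempfile.NamedTemporaryFile(delete=False) as f:
    for i in range(8):
        f.write(bytes([i]) * BLOCK)
    path = f.name

blocks = read_blocks(path, [5, 1, 7])
os.remove(path)
```

Whether the requests actually land on different SSD channels depends on the device; the sketch only shows how an engine can submit them together instead of one at a time.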

Page 52: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

More and more applications use Append / Prepend APIs provided by Couchbase Server
- Mobile messaging services (e.g., Viber)

Value size gets bigger over time
- Compaction happens frequently

Append / Prepend Limitations

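[Figure: an append-only file with blocks A through I and updated versions C’, F’, H’, I’; two full copies of the document with key "foo" are stored, the old one and the new one. Legend: B+Tree (Node of HB+Trie); B+Tree Node; Old Ver. of B+Tree Node; Document.]

curr_val = Get("foo");
new_val = curr_val + delta;
Set("foo", new_val);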

Page 53: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Internally check if a key exists or not

If it exists, write only the delta value into the file, together with the offset of the old value

Appended / prepended values will be consolidated together periodically or as part of compaction

Less compaction overhead

Storage-Level Append / Prepend APIs

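[Figure: an append-only file with blocks A through I and updated versions C’, F’, H’, I’; for key "foo", only the delta is appended after the existing document. Legend: B+Tree (Node of HB+Trie); B+Tree Node; Old Ver. of B+Tree Node; Document.]

Append("foo", delta);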
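
The delta-with-back-pointer scheme can be sketched as a chain of records. This is an illustrative model, not ForestDB's on-disk format: a list stands in for the append-only file, list positions stand in for file offsets, and the class name `DeltaLog` and its methods are hypothetical.

```python
class DeltaLog:
    def __init__(self):
        self.records = []  # append-only: (offset of older record or None, payload)
        self.index = {}    # key -> offset of the newest record for that key

    def set(self, key, value: bytes):
        self.records.append((None, value))  # full value, no back pointer
        self.index[key] = len(self.records) - 1

    def append(self, key, delta: bytes):
        prev = self.index.get(key)           # internally check if the key exists
        self.records.append((prev, delta))   # write only the delta + old offset
        self.index[key] = len(self.records) - 1

    def get(self, key):
        off = self.index.get(key)
        if off is None:
            return None
        parts = []
        while off is not None:               # follow the chain back to the base
            off, payload = self.records[off]
            parts.append(payload)
        return b"".join(reversed(parts))

    def consolidate(self, key):
        value = self.get(key)
        if value is not None:
            self.set(key, value)             # rewrite the chain as one full value


log = DeltaLog()
log.set("foo", b"ab")
log.append("foo", b"cd")
log.append("foo", b"ef")
```

Each append writes a record the size of the delta rather than the whole value, at the price of a longer chain to follow on reads until the values are consolidated.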

Page 54: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

Summary

Page 55: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014

ForestDB
- Compact main index structure built from HB+-Trie
- High performance, space efficiency, and scalability

Various optimizations for Solid-State Drives
- Compaction
- Volume manager
- Exploiting parallel I/O channels on SSDs
- Append / Prepend

ForestDB integrations
- Couchbase Server secondary index
- Couchbase Lite (The Future of Couchbase Mobile session @5:10pm today)
- Couchbase Server KV engine

Summary

Page 56: A Next Generation Storage Engine for NoSQL Database Systems Preview: Couchbase Connect 2014


Questions?
