TokuMX Internals The “What”, “Why”, and “How” of Fractal Tree Indexing for MongoDB Zardosht Kasheff @zkasheff @tokutek

Post on 21-May-2020

Page 1: The “What”, “Why”, and “How” of Fractal Tree Indexing for ...files.meetup.com/1742411/The What, Why, and How of... · TokuMX Internals The “What”, “Why”, and “How”

TokuMX Internals

The “What”, “Why”, and

“How” of Fractal Tree

Indexing for MongoDB

Zardosht Kasheff

@zkasheff

@tokutek

What is TokuMX?

• TokuMX = MongoDB with improved storage (Fractal Tree Indexes!)

• Drop in replacement for MongoDB v2.2 applications

o Including replication and sharding

o Same data model

o Same query language

o Drivers just work

o 2.4 compatibility soon

• Open source: https://github.com/Tokutek/mongo/

TokuMX Benefits

The top 5 benefits of TokuMX are…

TokuMX Benefit #1

Improved write performance on large data

TokuMX Benefit #2

Compression! (up to 25x)

TokuMX achieved 11.6:1 compression

TokuMX Benefit #3

No Fragmentation.

TokuMX Benefit #4

Scale up

• No global read/write lock

• Document level locking

• Sysbench benchmark on data > RAM

TokuMX Benefit #4

Scale up

• No global read/write lock

• Document level locking

• Sysbench benchmark on data < RAM

TokuMX Benefit #5

Transactions: MVCC + multi-statement on single servers

TokuMX Top 5 Benefits Recap

• Improved write performance on large data

• Compression! (up to 25x)

• No fragmentation (Deprecated compact!)

• Scale up

• Transactions (MVCC + multi-statement)

Bottom line: TokuMX makes MongoDB applications stable and fast for large databases.

TokuMX: How?

Built a storage core from the ground up, with Fractal Tree indexes, a data structure designed with large data in mind.

• Some benefits thanks to Fractal Tree indexes

• Some benefits thanks to good old fashioned engineering

Benefits:

Thanks to Fractal Trees:

• Improved write performance on large data

• Compression! (up to 25x)

• No fragmentation (Deprecated compact!)

Thanks to good old fashioned engineering:

• Scale up

• Transactions (MVCC + multi-statement)

Agenda

• Focus on how TokuMX brings the benefits that Fractal Trees are responsible for. (We won’t focus on scale up and transactions.)

• Compare side-by-side the B-Tree (what many databases use) and the Fractal Tree. Understand the differences.

• Use the differences to show, one by one, how TokuMX’s Fractal Trees enable:

– Fast writes on big data

– Compression

– No fragmentation

But first, a spoiler…

Spoiler!!

• MySQL customer I/O utilization graph:

[Figure: I/O utilization, without Fractal Trees vs. with Fractal Trees]

It’s all about I/O!!

Fractal Trees v. B-Trees Contrast and Compare

Fractal Trees v. B-Trees

What is a B-Tree?

• Traditional data structure used in databases for over 40 years.

• Used in NEARLY ALL databases, such as MongoDB, MySQL, BerkeleyDB, etc…

Fractal Trees v. B-Trees

What is a B-Tree?

Simple and elegant data structure:

• Internal nodes store as many pivots and pointers as will fit.

• Leaf nodes store data.

Leaf Nodes

Internal Nodes

Fractal Trees v. B-Trees

What is a Fractal Tree?

Another simple and elegant data structure:

• Internal nodes store pivots, pointers, and buffers.

• Leaf nodes store data.

Pointers and pivots

Buffer

Leaf node

Fractal Trees v. B-Trees

What is a Fractal Tree?

Buffers are important:

• Batch up writes

• Will dig into what this means soon.

Pointers and pivots

Buffer

Leaf node

Fractal Trees v. B-Trees

Characteristics of B-Trees and Fractal Trees for large data:

• A very high percentage of leaf nodes do not fit in memory

• Therefore, accessing a random leaf node likely requires I/O

On disk, not in memory

Understanding TokuMX’s Fractal Tree Benefit #1: Write performance

Write performance. How…

100 million inserts into a collection with 3 secondary indexes

With Less I/O!

100 million inserts into a collection with 3 secondary indexes

Fractal Tree v B-Tree for write I/O

Fractal Trees have significantly better write performance than B-Trees when data > RAM

– B-Trees become I/O bound. (Disks do < 500 I/O per second)

– Fractal Trees are not I/O bound

This is why B-Tree insertion performance “falls off a cliff”.

[Figure: MySQL and MongoDB insertion performance, with the cliff marked]

Conventional Wisdom

This also leads to the following conventional wisdom:

• Keep indexes in memory.

• Keep “working set” in memory.

• Have a “right-most insertion pattern” on indexes

All of these tips are designed to work around the fact that B-Trees become I/O bound when writing to large databases.

Now let’s understand why…

How a B-Tree does writes

Random Writes require I/O

B-Tree algorithm for doing a write:

• Find the appropriate leaf node where the write belongs

• Bring the leaf node into memory (EXPENSIVE!)

• Modify the leaf node

For large data, nearly all B-Tree leaf nodes are not in memory, so the algorithm requires practically one I/O per write.
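
The claim of "practically one I/O per write" can be checked with a toy model. The sketch below is an illustration only, not MongoDB or TokuMX code: it assumes uniformly random writes, an LRU cache of leaf nodes, and counts one I/O for every leaf that has to be brought into memory. The function name `btree_write_ios` and all parameters are made up for this example.

```python
import random
from collections import OrderedDict


def btree_write_ios(num_writes, num_leaves, cache_leaves, seed=0):
    """Count leaf reads caused by random writes, with an LRU cache of leaves."""
    rng = random.Random(seed)
    cache = OrderedDict()  # leaf id -> True, ordered from least to most recent
    ios = 0
    for _ in range(num_writes):
        leaf = rng.randrange(num_leaves)  # leaf where this write belongs
        if leaf in cache:
            cache.move_to_end(leaf)  # already in memory: no I/O
        else:
            ios += 1  # bring the leaf node into memory
            cache[leaf] = True
            if len(cache) > cache_leaves:
                cache.popitem(last=False)  # evict the least recently used leaf
    return ios


# Data much larger than RAM: only 1% of the leaves fit in the cache.
big = btree_write_ios(num_writes=10_000, num_leaves=100_000, cache_leaves=1_000)
# Data smaller than RAM: every leaf fits in the cache.
small = btree_write_ios(num_writes=10_000, num_leaves=100, cache_leaves=1_000)
print(big, small)
```

When the cache holds only a small fraction of the leaves, nearly every write misses; when all leaves fit, the I/O count is bounded by the number of leaves.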

How a Fractal Tree does Writes

• Writes are batched as messages in buffers

• When a buffer is full, its messages spill into the buffers of child nodes (which also spill if they get full)

• Through spilling, messages eventually make it to leaf nodes.

Let’s zoom in here for the next slide

How a Fractal Tree does Writes

Internal nodes

How a Fractal Tree does Writes

When does a Fractal Tree do I/O for writes?

– When flushing a buffer’s worth of writes.

Here we see the BIG difference in I/O performance for Fractal Trees v. B-Trees:

B-Trees do an I/O to write one measly document.

Fractal Trees do an I/O to write a buffer’s worth of documents. This is why I/O is drastically reduced!
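
The buffered write path described on these slides can be sketched as a toy tree. This is a minimal illustration under stated assumptions, not TokuMX's implementation: the class names `Leaf` and `Internal`, the per-child buffers, the tiny capacity, and the policy of flushing the fullest buffer first are all choices made for this example. Each flush stands in for one I/O that carries a whole batch of messages to a child.

```python
import bisect


class Leaf:
    def __init__(self):
        self.docs = {}

    def apply(self, messages):
        for key, value in messages:
            self.docs[key] = value
        return 0  # nothing below a leaf, so no further flushes

    def get(self, key):
        return self.docs.get(key)


class Internal:
    def __init__(self, pivots, children, capacity):
        self.pivots = pivots        # sorted keys separating the children
        self.children = children    # len(pivots) + 1 child nodes
        self.capacity = capacity    # max buffered messages before spilling
        self.buffers = [[] for _ in children]  # pending messages per child

    def apply(self, messages):
        """Buffer messages; spill to children when over capacity.

        Returns the number of flushes (modelled I/Os) performed.
        """
        for key, value in messages:
            child = bisect.bisect_right(self.pivots, key)
            self.buffers[child].append((key, value))
        flushes = 0
        while sum(len(b) for b in self.buffers) > self.capacity:
            # Flush the fullest buffer so one I/O moves as much as possible.
            i = max(range(len(self.buffers)), key=lambda j: len(self.buffers[j]))
            batch, self.buffers[i] = self.buffers[i], []
            flushes += 1 + self.children[i].apply(batch)
        return flushes

    def get(self, key):
        i = bisect.bisect_right(self.pivots, key)
        # Buffered messages are newer than anything stored below.
        for k, v in reversed(self.buffers[i]):
            if k == key:
                return v
        return self.children[i].get(key)


root = Internal([100, 200, 300], [Leaf() for _ in range(4)], capacity=8)
keys = [(i * 37) % 400 for i in range(40)]  # 40 distinct, scattered keys
flushes = sum(root.apply([(k, f"doc{k}")]) for k in keys)
print(f"{len(keys)} writes, {flushes} flushes")
```

Because every flush moves a batch rather than a single document, the flush count stays well below the number of writes, while reads still see the latest value by checking buffers on the way down.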

Fractal Tree Wisdom

This also leads to the following wisdom for Fractal Trees:

• Indexes don’t need to fit in memory.

• “Working set” does not need to be in memory.

• Indexes don’t need to worry about their “insertion pattern”.

These capabilities reduce complexity of database design, and enable rich indexes and queries that B-trees cannot support.

Understanding TokuMX’s Fractal Tree Benefit #2: Compression

What Compression?

• BitTorrent Peer Snapshot Data (~31 million documents), 3 indexes

• http://cs.brown.edu/~pavlo/torrent/

TokuMX achieved 11.6:1 compression

Compression: How?

TokuMX compression algorithm is simple!

1. Take large chunks of data

2. Use standard compression algorithms (zlib, lzma, or quicklz) and compress them

3. There is no step 3!

The effectiveness of these compression algorithms depends on how much data you give them. TokuMX gives them lots of data, so TokuMX compresses well.

The secret is…

Compression: The Secret

TokuMX node sizes (4 MB) are larger than B-Tree node sizes

Small: 8KB or 16KB

Large: 4MB

Larger node size leads to better compression
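
The effect of chunk size on compression can be seen with a small experiment. The sketch below is only an illustration: the documents are synthetic (loosely shaped like peer records, not the BitTorrent data set from the talk), the helper `compressed_size` is hypothetical, and it uses Python's standard `zlib` rather than anything TokuMX-specific. It compresses the same bytes in independent 8KB chunks and in independent 64KB chunks and compares the totals.

```python
import random
import zlib


def compressed_size(data, chunk_size):
    """Total bytes after compressing each chunk independently with zlib."""
    return sum(
        len(zlib.compress(data[i:i + chunk_size]))
        for i in range(0, len(data), chunk_size)
    )


rng = random.Random(0)  # fixed seed so the synthetic data is reproducible
states = ["seeding", "leeching", "idle"]
docs = [
    '{"peer": %d, "ip": "10.%d.%d.%d", "state": "%s"}\n'
    % (i, rng.randrange(256), rng.randrange(256), rng.randrange(256),
       rng.choice(states))
    for i in range(20_000)
]
data = "".join(docs).encode()

small = compressed_size(data, 8 * 1024)    # B-Tree-sized chunks
large = compressed_size(data, 64 * 1024)   # larger chunks
print(len(data), small, large)
```

Each independently compressed chunk starts with no history, so smaller chunks repeat that start-up cost more often. Note that zlib's 32KB window limits how much extra context a very large chunk can use; an algorithm with a larger window, such as lzma, can benefit more from bigger inputs.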

So the question is, why do Fractal Trees have such large node sizes?

Fractal Trees: Why Large Nodes?

Again, it’s all about the I/O.

For writes:

– B-Trees: reading a large node to write one measly row is painful

– Fractal Trees: reading a large node to write a proportionally large buffer is not painful. In fact, it’s better. Reading larger nodes means more of the cost goes to disk bandwidth than to disk seeks.

Conclusion: Fractal Trees should use large nodes for writes, for better performance AND compression.

Fractal Trees: Large Nodes + Reads

What about reading a single document?

The problem:

• For a point query, we are reading one measly document

• Just as B-Trees don’t want to do a large I/O to write one measly document, Fractal Trees should not read 4MB to read one measly document.

Fractal Trees: Large Nodes + Reads

What about reading a single document?

The solution:

• Partition the 4MB leaf node into 64KB “basement nodes” (the value of 64KB is configurable)

• 64KB chunks are individually compressed, concatenated, and written to disk to represent a leaf node

• When flushing data for writes, read the full 4MB node

• When reading “one measly document”, read only the appropriate 64KB chunk of data

64KB chunks are a nice sweet spot for getting good compression and point query performance

Fractal Trees: Compression

Summary:

• Use large nodes: 4MB

• Partition leaf nodes into 64KB contiguous chunks

• Compress 64KB chunks individually with standard compression algorithms (zlib, lzma, or quicklz), getting good compression

• Concatenate compressed chunks to make a large compressed leaf node.
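
The layout summarized here can be sketched in a few lines. This is a minimal illustration, not TokuMX's on-disk format: the function names `write_leaf`, `read_basement`, and `read_leaf`, the in-memory offset index, and the synthetic leaf contents are all assumptions made for the example, and only zlib is shown.

```python
import zlib

BASEMENT_SIZE = 64 * 1024      # size of one "basement node" before compression
LEAF_SIZE = 4 * 1024 * 1024    # uncompressed leaf node size


def write_leaf(data, basement_size=BASEMENT_SIZE):
    """Compress each chunk on its own and concatenate the results.

    Returns the compressed leaf plus an index of (offset, length) per chunk.
    """
    blob = bytearray()
    index = []
    for start in range(0, len(data), basement_size):
        chunk = zlib.compress(data[start:start + basement_size])
        index.append((len(blob), len(chunk)))
        blob += chunk
    return bytes(blob), index


def read_basement(blob, index, chunk_no):
    """Point query path: decompress only the chunk that holds the document."""
    offset, length = index[chunk_no]
    return zlib.decompress(blob[offset:offset + length])


def read_leaf(blob, index):
    """Flush path: read and decompress the whole leaf node."""
    return b"".join(read_basement(blob, index, i) for i in range(len(index)))


# Synthetic 4MB leaf made of small, similar documents.
leaf = b"".join(
    b'{"_id": %d, "state": "seeding"}\n' % i for i in range(140_000)
)[:LEAF_SIZE]

blob, index = write_leaf(leaf)
one_chunk = read_basement(blob, index, 3)
print(len(index), len(leaf), len(blob), len(one_chunk))
```

A point query touches one compressed chunk, while a flush reads the whole concatenated node, which matches the two read paths the slides describe.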

Understanding TokuMX’s Fractal Tree Benefit #3: No Fragmentation

What is Fragmentation?

Fragmentation happens when nodes on disk get rearranged in random order, with wasted space accumulating between nodes.

Why MongoDB users care about fragmentation:

• Wasted space between blocks makes keeping the working set in memory more difficult and leads to disk bloat

• Blocks of data rearranged in random order lead to performance degradation

Workarounds for Fragmentation

MongoDB workarounds:

– Pad inserted documents with some additional space to account for future updates

– Occasionally bring the database down and run compact. This correctly rearranges blocks and removes wasted space

– Aggressively preallocate files to reserve space

TokuMX workarounds:

Why TokuMX does not Fragment

Why TokuMX users don’t care about fragmentation:

• On wasted space between blocks:

– Compression greatly mitigates the impact of wasted space on disk usage

– Write performance allows the working set to exceed memory

• On blocks of data being rearranged in random order:

– Short answer: large leaf nodes practically eliminate the I/O impact of rearranged data blocks (once again, it’s all about the I/O)

– Long answer: let’s do some analysis…

Impact of Rearranged blocks

First, let’s assume the following costs of disk access:

• Disk seek time: 10ms (100 I/Os per second)

• Disk bandwidth: 100MB/s

Numbers are meant to be nice estimates to make the math simple.

A question to ask ourselves that shows the impact of fragmentation:

At what rate (measured in bytes/second) can I read an entire B-Tree?

Impact of Rearranged blocks

At what rate (measured in bytes/second) can I read an entire B-Tree?

Non-fragmented B-Tree:

– All data is sequentially arranged, therefore sequentially accessed

– Effective rate: 100 MB/s (at most): great performance!

Fragmented B-Tree:

– Suppose node size is 8KB, and accessing a leaf node requires I/O

– Cost of reading a block of data is seek time + bandwidth time

– Seek time: 10ms, bandwidth time: roughly 100us, so the cost is dominated by the seek

– Effective rate: 8KB/10ms = 800 KB/s: poor performance!

This is the poor performance one sees with fragmentation, and why users want to compact.
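
Written out as a worked equation, using only the slide's assumed numbers (10ms seek, 100MB/s bandwidth, 8KB nodes; the symbol t_seek is introduced here for illustration), the fragmented B-Tree rate is:

```latex
\text{rate}_{\text{fragmented}}
  = \frac{\text{node size}}{t_{\text{seek}} + \dfrac{\text{node size}}{\text{bandwidth}}}
  = \frac{8\,\text{KB}}{10\,\text{ms} + \dfrac{8\,\text{KB}}{100\,\text{MB/s}}}
  = \frac{8\,\text{KB}}{10\,\text{ms} + 0.08\,\text{ms}}
  \approx 794\,\text{KB/s}
```

The transfer term is under 1% of the total, which is why the slide drops it and rounds to 8KB/10ms = 800 KB/s.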

Impact of Rearranged blocks

At what rate (measured in bytes/second) can I read an entire Fractal Tree?

Non-fragmented Fractal Tree:

– Effective rate: 100 MB/s (at most): great performance!

“Fragmented” Fractal Tree:

– Suppose node size is 1MB compressed, 4MB uncompressed

– Cost of reading a block of data is seek time + bandwidth time

– Seek time: 10ms, bandwidth time: 10ms

– Effective rate: 1 MB / 20 ms = 50 MB/s: great performance!

Large Fractal Tree nodes mitigate the I/O seek cost of a fragmented collection!
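
The same arithmetic can be run for any node size. The sketch below is just the slides' cost model (one seek plus transfer time per node) turned into a function; `effective_rate` and its defaults are names chosen for this example, and the disk numbers are the talk's round estimates, not measurements.

```python
KB = 1_000
MB = 1_000_000


def effective_rate(node_bytes, seek_s=0.010, bandwidth=100 * MB):
    """Bytes/second when every node read pays one seek plus its transfer time."""
    return node_bytes / (seek_s + node_bytes / bandwidth)


btree = effective_rate(8 * KB)    # fragmented B-Tree, 8KB nodes
fractal = effective_rate(1 * MB)  # "fragmented" Fractal Tree, 1MB compressed nodes

print(f"B-Tree:       {btree / KB:.0f} KB/s")
print(f"Fractal Tree: {fractal / MB:.0f} MB/s")
```

With 8KB nodes the seek dominates and the rate stays under 1 MB/s; with 1MB nodes the seek and the transfer cost the same, so the rate reaches half of the disk's sequential bandwidth.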

Summary on Fragmentation

• Don’t worry about fragmentation.