20131112 pluk fractal trees theory to practice

  • 8/12/2019 20131112 Pluk Fractal Trees Theory to Practice

    1/53

    Fractal Tree Indexes

    Theory to Practice

    Percona Live London 2013

    Tim Callaghan, [email protected]

    @tmcallaghan

    Tuesday, November 12, 13

  • 2/53

    Ever seen this?

    IO Utilization Graph, performance is IO limited

  • 3/53

    Who is Tokutek?

    Tokutek builds high-performance database software!

    TokuDB - storage engine for MySQL and MariaDB

    TokuMX - storage engine for MongoDB

    HDD & SSD storage

    Storage Engine

    Developer Interface

  • 4/53

    Who am I?

    17 year database consumer

    schema design, development, deployment

    database administration + infrastructure

    mostly Oracle

    5 year database producer

    2 years @ VoltDB

    2+ years @ Tokutek

  • 5/53

    Housekeeping

    Feedback is important to me

    Ideas for Webinars or Presentations?

    Who's using MongoDB?

    Anyone using TokuDB or TokuMX?

    Please ask questions

  • 6/53

    Agenda

    Why Fractal Tree indexes are cool

    What they enable in MySQL (TokuDB)

    What they enable in MongoDB (TokuMX)

    Q+A

  • 7/53

    Indexing:

    B-trees and Fractal Tree Indexes

  • 8/53

    B-trees

  • 9/53

    B-tree Overview - vocabulary

    Internal Nodes - Path to data

    Leaf Nodes - Actual Data - Sorted

    Pointers

    Pivots

  • 10/53

    B-tree Overview - example

    22

    10 99

    2, 3, 4 10,20 22,25 99

    * Pivot Rule is >=

  • 11/53

    B-tree Overview - search

    22

    10 99

    2, 3, 4 10,20 22,25 99

    Find 25
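
The lookup on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the diagram reads as a root pivot of 22, internal pivots 10 and 99, and four sorted leaves; the `ROOT` structure and the `search` function are hypothetical names, not anything from TokuDB or MySQL.

```python
from bisect import bisect_right

# Tree from the slide: internal nodes hold pivots, leaves hold sorted keys.
ROOT = {"pivots": [22], "children": [
    {"pivots": [10], "children": [[2, 3, 4], [10, 20]]},
    {"pivots": [99], "children": [[22, 25], [99]]},
]}

def search(node, key, path=None):
    """Return (found, path); path records (pivots, chosen child) per level."""
    path = [] if path is None else path
    if isinstance(node, list):           # reached a leaf
        return key in node, path
    # Pivot rule is >=: a key equal to a pivot lives in the right-hand child.
    idx = bisect_right(node["pivots"], key)
    path.append((node["pivots"], idx))
    return search(node["children"][idx], key, path)

found, path = search(ROOT, 25)
print(found, path)
```

Each level visited is one node read, which is why lookups stay cheap while the upper levels fit in RAM.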

  • 12/53

    B-tree Overview - insert

    22

    10 99

    2, 3, 4 10,15,20 22,25 99

    Insert 15
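
A rough sketch of the same insert, assuming the pivots are flattened into one list for brevity (an assumption of this example, not how a real B-tree stores them). Node splitting is left out; `leaves`, `pivots` and `insert` are illustrative names.

```python
from bisect import bisect_right, insort

leaves = [[2, 3, 4], [10, 20], [22, 25], [99]]
pivots = [10, 22, 99]  # pivot i is the smallest key that belongs in leaf i + 1

def insert(key):
    """Find the target leaf with the >= pivot rule and keep it sorted."""
    leaf = leaves[bisect_right(pivots, key)]
    insort(leaf, key)
    return leaf

print(insert(15))
```

The leaf has to be in memory to be modified, so once leaves no longer fit in RAM each insert can cost a disk read.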

  • 13/53

    B-tree Overview - performance

    [Diagram labels, top to bottom: RAM, RAM, DISK]

    22

    10 99

    2, 3, 4 10,20 22,25 99

    Performance is IO limited when data > RAM; one IO is needed for each insert/update

    (actually it's one IO for every index on the table)

  • 14/53

    Fractal Tree Indexes

  • 15/53

    Fractal Tree Indexes

    Similar to B-trees

    o store data in leaf nodes

    o use index key for ordering

    Different from B-trees

    o message buffers

    o big nodes (4MB vs. ~16KB)

    All internal nodes have message buffers

    As buffers overflow, they cascade down the tree

    Messages are eventually applied to leaf nodes

  • 16/53

    Fractal Tree Indexes - sample data

    25

    10 99

    2,3,4 10,20 22,25 99

    Looks a lot like a b-tree!

  • 17/53

    Fractal Tree Indexes - insert

    insert 15;

    25

    10 99

    2,3,4 10,20 22,25 99

    insert (15)

    search operations must consider messages along the way

    messages cascade down the tree as buffers fill up

    they are eventually applied to the leaf nodes, hundreds or thousands of operations for a single IO

    CPU and cache are conserved as important data is not ejected
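
The buffering idea can be illustrated with a toy model. This sketch assumes a single root buffer sitting over the leaves and counts one write per leaf touched by a flush; the class `ToyFractalTree`, its tiny `buffer_size`, and the flattened pivot list are all assumptions made for the example and do not reflect TokuDB's real node layout.

```python
from bisect import bisect_right, insort

class ToyFractalTree:
    """One root message buffer over sorted leaves (illustrative only)."""

    def __init__(self, pivots, leaves, buffer_size):
        self.pivots, self.leaves = pivots, leaves
        self.buffer, self.buffer_size = [], buffer_size
        self.leaf_ios = 0

    def _leaf_index(self, key):
        return bisect_right(self.pivots, key)

    def insert(self, key):
        self.buffer.append(key)            # message is buffered, no leaf IO yet
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        by_leaf = {}
        for key in self.buffer:            # group pending messages by target leaf
            by_leaf.setdefault(self._leaf_index(key), []).append(key)
        for idx, keys in by_leaf.items():
            for key in keys:
                insort(self.leaves[idx], key)
            self.leaf_ios += 1             # one write covers every message for this leaf
        self.buffer.clear()

    def contains(self, key):
        # Searches must consider messages that have not reached the leaf yet.
        return key in self.buffer or key in self.leaves[self._leaf_index(key)]

tree = ToyFractalTree([10, 22, 99], [[2, 3, 4], [10, 20], [22, 25], [99]], buffer_size=4)
tree.insert(15)
print(tree.contains(15), tree.leaf_ios, tree.leaves[1])
```

With a realistic buffer (megabytes rather than four entries), a single flush carries hundreds or thousands of messages, which is where the IO savings on the slide come from.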

  • 18/53

    Fractal Tree Indexes - other operations

    25

    10 99

    2,3,4 10,20 22,25 99

    add_column(c4 bigint), delete(99), increment(22,+5)

    ...

    insert (100), delete(8), delete(2)

    insert (8)

    Lots of operations can be messages!
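
As a hedged illustration of "operations as messages", the sketch below replays a list of pending messages against one leaf, oldest first. The tuple format, the `apply_messages` helper, and the row values are invented for this example; the slide only shows keys.

```python
def apply_messages(rows, messages):
    """Apply buffered messages in arrival order to a leaf held as {key: value}."""
    rows = dict(rows)                      # leave the caller's leaf untouched
    for op, key, arg in messages:
        if op == "insert":
            rows[key] = arg
        elif op == "delete":
            rows.pop(key, None)
        elif op == "increment" and key in rows:
            rows[key] += arg
    return rows

leaf = {22: 0, 25: 0, 99: 0}
pending = [("delete", 99, None), ("increment", 22, 5), ("insert", 100, 0)]
print(apply_messages(leaf, pending))
```

Order matters: an older insert(8) followed by a newer delete(8) leaves key 8 absent, which is why messages are applied in the order they entered the tree.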

  • 19/53

    TokuDB

    Fractal Tree Indexing +

    MySQL/MariaDB

  • 20/53

    What is TokuDB?

    Transactional MySQL Storage Engine - think InnoDB

    Available for MySQL 5.5 and MariaDB 5.5

    ACID and MVCC

    Free/OSS Community Edition http://github.com/Tokutek/ft-engine

    Enterprise Edition

    Commercial support + hot backup

    Performance + Compression + Agility

  • 21/53

    TokuDB Performance

    Warning - Benchmarks Ahead!

  • 22/53

    Indexed Insertion Performance

    High-performance insert/update/delete for large databases (> RAM) while maintaining indexes

    * old numbers, now > 25K/sec

  • 23/53

    Sysbench Performance

    Sysbench read/write workload, > RAM

    The fastest IO is the one you never have to do (compression)

  • 24/53

    Performance Advantages

    Efficient index maintenance, especially secondary indexes

    Clustered secondary indexes

    Additional copy of the row is stored in the index

    No additional IO to get row data from primary key

    Think better covering index (all non-indexed columns)

    Compression eliminates size concerns

    Big blocks = sequential IO for range scans

    Basement nodes are always co-located

    Multi-threaded bulk loader

  • 25/53

    TokuDB Compression

  • 26/53

    Compression: TokuDB vs. InnoDB

    InnoDB compression misses force node splits, which greatly reduce performance

    MySQL 5.6 dynamic padding (from FB), less cache

    Larger block size and flexible on-disk size wins!

    Multiple compression algorithms (lzma, quicklz, zlib)

    Larger, less frequent writes (much less IO)

    Why it matters on spinning disks:

    Compressed reads and amortized compressed writes overcome IO limitations

    Why it matters on flash/SSD:

    Buy less (250GB at 10x compression holds as much as 2.5TB)

    Large/less frequent writes are flash friendly

  • 27/53

    Compression + IO Reduction

    Server was at 90% IO utilization with InnoDB, 10% IO utilization with TokuDB

  • 28/53

    Compression Performance

    iiBench benchmark

  • 29/53

    Compression Achieved

    log data (extremely compressible)

  • 30/53

    TokuDB Agility

  • 31/53

    The Challenge of MySQL Schema Changes

    Common schema changes can take hours in MySQL

    Adding, dropping, or expanding a column

    Adding an index

    And the table is unavailable for writes during the process

    As a workaround, people generally

    Use a replication slave, then swap with master

    Use helper tools: Percona OSC, MySQL 5.6

    o These have IO, CPU, RAM consequences

  • 32/53

    Schema Changes Without Downtime

    In TokuDB, column add/drop/expand is instantaneous

    it's just a message

    Indexes can be created in the background while the table is fully available

    TokuDB just builds the index, it does not rebuild the table (MySQL getting better)

  • 33/53

    TokuMX

    Fractal Tree Indexing +

    MongoDB

  • 34/53

    What is TokuMX?

    TokuMX = MongoDB with improved storage (Fractal Tree indexes)

    Drop in replacement for MongoDB v2.2 applications

    Including replication and sharding

    Same data model

    Same query language

    Drivers just work

    Open Source

    http://github.com/Tokutek/mongo

    Performance + Compression + Transactions

  • 35/53

    MongoDB Storage

    db.test.insert({foo:55})
    db.test.ensureIndex({foo:1})

    PK index (_id + pointer)

    18

    4 5555

    (1,ptr5) (4,ptr1),(12,ptr8) (19,ptr7) (10000,ptr2)

    Secondary index (foo + pointer)

    85

    40 120

    (2,ptr5),(22,ptr6) (50,ptr4) (100,ptr7) (222,ptr3)

    memory mapped heap

    The pointer tells MongoDB where to look in the heap for the requested document (another IO)

  • 36/53

    TokuMX Storage

    db.test.insert({foo:55})
    db.test.ensureIndex({foo:1})

    PK index (_id + document)

    18

    4 5555

    (1,doc) (4,doc),(12,doc) (19,doc) (10000,doc)

    Secondary index (foo + _id)

    85

    40 120

    (2,4), (22,12) (50,19) (100,10000) (222,1)

    memory mapped heap

    One less IO per _id lookup, document is clustered in the index
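
The difference between the two storage layouts can be sketched with plain dictionaries. This is a toy cost model, assuming every index probe and every heap fetch counts as exactly one IO; the pointer names and helper functions are made up for the example, while the `_id` and `foo` values follow the TokuMX diagram.

```python
docs = {1: {"foo": 222}, 4: {"foo": 2}, 12: {"foo": 22},
        19: {"foo": 50}, 10000: {"foo": 100}}

# Pointer layout: both indexes map a key to a location in a separate heap.
heap = {f"ptr{i}": {"_id": _id, **doc}
        for i, (_id, doc) in enumerate(docs.items(), 1)}
pk_ptr = {doc["_id"]: ptr for ptr, doc in heap.items()}
foo_ptr = {doc["foo"]: ptr for ptr, doc in heap.items()}

# Clustered layout: the PK index stores the document, the secondary stores _id.
pk_doc = {_id: {"_id": _id, **doc} for _id, doc in docs.items()}
foo_id = {doc["foo"]: _id for _id, doc in docs.items()}

def by_id_pointer(_id):
    return heap[pk_ptr[_id]], 2        # index probe + heap fetch

def by_id_clustered(_id):
    return pk_doc[_id], 1              # document sits in the index itself

def by_foo_pointer(foo):
    return heap[foo_ptr[foo]], 2       # secondary probe + heap fetch

def by_foo_clustered(foo):
    return pk_doc[foo_id[foo]], 2      # secondary probe + PK probe

print(by_id_pointer(4), by_id_clustered(4))
```

In this model only the `_id` lookup gets cheaper; a secondary lookup still needs a second step unless the secondary index is itself clustered.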

  • 37/53

    TokuMX Performance

  • 38/53

    Performance - Indexed Insertion

    100 million inserts into a collection with 3 secondary indexes

  • 39/53

    Performance - Inserts on Indexed Arrays

    Indexed Insertion: Multikey (100 inserts per doc)

  • 40/53

    Performance - Replication

    TokuMX replication allows secondary servers to process replication without IO

    Simply injecting messages into the Fractal Tree Indexes on the secondary server

    The Hard Work was done on the primary

    o Uniqueness checking

    o Transactional locking

    o Update effort (read-before-write)

    Elimination of replication lag

    Your secondaries are fully available for read scaling! Wasn't that the point?

  • 41/53

    Performance - Lock Refinement

    TokuMX performs locking at the document level

    Extreme concurrency!

    [Diagram: lock granularity hierarchy, instance > database > collection > document. MongoDB v2.0 locks at the instance level, MongoDB v2.2 at the database level, TokuMX at the document level.]

  • 42/53

    Performance - Lock Refinement

  • 43/53

    Performance - Lock Refinement + Reduced IO

    Sysbench benchmark (> RAM)

  • 44/53

    Performance - Reduced IO

    Indexed insertion benchmark

  • 45/53

    Performance - Clustered Indexes

    Clustered secondary indexes

    Additional copy of the document is stored in the index

    No additional IO to get row data from primary key

    Think better covered index (all non-indexed fields)

    Good for point queries, great for range scans

    Compression eliminates size concerns

  • 46/53

    Performance - Memory Management

    Two approaches to memory management

    MongoDB = memory-mapped files

    o Operating system determines what data is important

    TokuMX = managed cache

    o User defined size

    o TokuMX determines what data is important

    Run multiple TokuMX instances on a single server

    Each has its own fixed cache size

  • 47/53

    TokuMX Compression

  • 48/53

    Compression

    MongoDB does not offer compression

    Compressed file systems?

    Shortened field names?

    o Remember: each field name is stored in every single document

    TokuMX easily achieves 5x-10x compression

    Buy less disk or flash

    Compressed reads and writes reduce overall IO

    TokuMX supports 3 compression types

    zlib, quicklz, lzma (size vs. speed)

    all data is compressed

    Use descriptive field names!

    They are easy to compress
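
The point about field names can be tried out with Python's standard `zlib` and `lzma` modules (quicklz is not in the standard library, so it is left out). This is a small experiment on synthetic JSON documents, not a measurement of TokuMX; the field names and the `sample` helper are invented for the example.

```python
import json
import lzma
import zlib

def sample(field_names, n=2000):
    """Build n JSON documents that differ only in how their fields are named."""
    a, b, c = field_names
    docs = [{a: i, b: f"user{i % 50}", c: "2013-11-12"} for i in range(n)]
    return "\n".join(json.dumps(d) for d in docs).encode()

short = sample(("i", "u", "d"))
long_ = sample(("request_id", "user_name", "created_date"))

for label, raw in (("short names", short), ("descriptive names", long_)):
    for codec in (zlib, lzma):
        packed = codec.compress(raw)
        ratio = round(len(raw) / len(packed), 1)
        print(label, codec.__name__, len(raw), len(packed), ratio)
```

The raw size grows noticeably with descriptive names, while the compressed sizes stay much closer together because the repeated names are highly redundant.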

  • 49/53

    Compression

    31 million documents, bit torrent peer data

    http://cs.brown.edu/~pavlo/torrent/

  • 50/53

    TokuMX Transactions

  • 51/53

    ACID + MVCC

    ACID

    In MongoDB, multi-insertion operations allow for partial success

    o Asked to store 5 documents, 3 succeeded

    We offer all or nothing behavior

    Document level locking

    MVCC

    In MongoDB, queries can be interrupted by writers.

    o The effects of these writers are visible to the reader

    TokuMX offers MVCC

    o Reads are consistent as of the operation start
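
Both guarantees can be illustrated with a toy copy-on-write store. This is only a sketch of the behavior described on the slide, assuming a single in-memory dictionary; `ToyStore` and its methods are hypothetical and say nothing about how TokuMX implements transactions.

```python
class ToyStore:
    """Readers keep a stable snapshot; writers publish a whole new version."""

    def __init__(self):
        self._data = {}

    def snapshot(self):
        # The published dict is never mutated in place, so this view is stable.
        return self._data

    def insert_many(self, docs):
        staged = dict(self._data)
        for doc in docs:
            if doc["_id"] in staged:
                # Fail before publishing: none of the batch becomes visible.
                raise KeyError(f"duplicate _id {doc['_id']}")
            staged[doc["_id"]] = doc
        self._data = staged            # all documents appear at once

store = ToyStore()
store.insert_many([{"_id": 1}, {"_id": 2}])
reader_view = store.snapshot()
store.insert_many([{"_id": 3}])
print(sorted(reader_view), sorted(store.snapshot()))
```

A reader that took its snapshot before the second insert keeps seeing two documents, and a batch containing a duplicate `_id` leaves the store unchanged.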

  • 52/53

    Multi-statement Transactions

    TokuMX brings the following to MongoDB

    db.runCommand({beginTransaction: 1, isolation: "mvcc"})

    ...perform 1 or more operations

    db.runCommand({rollbackTransaction: 1}) | db.runCommand({commitTransaction: 1})

    Not allowed in sharded environments

    mongos will reject

  • 53/53

    Tim Callaghan

    VP/Engineering, [email protected]

    @tmcallaghan

    Questions?
