bulk loadingdataintocassandra 140307160517 phpapp02

Upload: vibhasss

Post on 12-Oct-2015


TRANSCRIPT

  • 5/21/2018 Bulk Loadingdataintocassandra 140307160517 Phpapp02

    1/44

    Bulk-Loading Data into Cassandra

    Patricia Gorla

    @patriciagorla

    Cassandra Consultant

    www.thelastpickle.com

    Planet Cassandra 2014

    http://www.thelastpickle.com/
    About Us

    • Work with clients to deliver and improve Apache Cassandra services

    • Apache Cassandra committer, DataStax MVP, Hector maintainer, Apache Usergrid committer

    • Based in New Zealand & USA

    Why is bulk loading useful?

    • Performance tests

    • Migrating historical data

    • Changing topologies

    • How Data is Stored

    • Case Studies

      - Generating Dummy Data

      - Backfilling Historical Data

      - Changing Topologies

    • Conclusion

    Cassandra Write Path

    • Writes are written to both the commit log and the memtable.

    • The memtable is sorted.

    [Diagram: write, commit log, memtable]

    • The memtable is flushed out to sstables.

    [Diagram: write, commit log, memtable, two sstables]

    • Compaction helps keep read latency low.

    [Diagram: write, commit log, memtable, three sstables]
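    The write path described above can be mimicked with a small stand-alone Java model. This is only a sketch under stated assumptions: the class name WritePathSketch, the row-count flush threshold, and clearing the commit log on flush are simplifications introduced here for illustration, not Cassandra's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of the write path: commit log, sorted memtable, flush, compaction. */
class WritePathSketch {
    final List<String> commitLog = new ArrayList<>();                  // append-only, arrival order
    TreeMap<String, String> memtable = new TreeMap<>();                // kept sorted by key
    final List<TreeMap<String, String>> sstables = new ArrayList<>();  // oldest first
    private final int flushThreshold;                                  // hypothetical: rows per memtable

    WritePathSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    /** Every write goes to both the commit log and the memtable. */
    void write(String key, String value) {
        commitLog.add(key + "=" + value);
        memtable.put(key, value);
        if (memtable.size() >= flushThreshold) {
            flush();
        }
    }

    /** The memtable is already sorted, so it becomes an sstable as-is. */
    void flush() {
        if (memtable.isEmpty()) {
            return;
        }
        sstables.add(memtable);
        memtable = new TreeMap<>();
        commitLog.clear(); // simplification: flushed writes no longer need replay
    }

    /** Merges all sstables into one; values from newer sstables win. */
    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> table : sstables) {
            merged.putAll(table);
        }
        sstables.clear();
        sstables.add(merged);
    }

    /** Reads check the memtable, then sstables from newest to oldest. */
    String read(String key) {
        if (memtable.containsKey(key)) {
            return memtable.get(key);
        }
        for (int i = sstables.size() - 1; i >= 0; i--) {
            String value = sstables.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }
}
```

    In this model a read may have to visit every sstable until compaction merges them, which is the reason compaction keeps read latency low.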

    Sorted String Tables

    mykeyspace-mycf-jb-1-CompressionInfo.db

    mykeyspace-mycf-jb-1-Data.db          Contains all data needed to regenerate components

    mykeyspace-mycf-jb-1-Filter.db        Bloom filter over sstable

    mykeyspace-mycf-jb-1-Index.db         Index of row keys

    mykeyspace-mycf-jb-1-Statistics.db

    mykeyspace-mycf-jb-1-Summary.db       Index summary from Index.db file

    mykeyspace-mycf-jb-1-TOC.txt          Table of contents of all components
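    The file names above follow a keyspace-columnfamily-version-generation-component pattern. A minimal Java sketch of splitting such a name and looking up the role of each component follows; the class and method names are hypothetical, and it assumes keyspace and column family names contain no hyphens.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Splits sstable file names and describes the components named on the slides. */
class SSTableComponentSketch {
    static final Map<String, String> ROLES = new LinkedHashMap<>();
    static {
        ROLES.put("Data.db", "contains all data needed to regenerate components");
        ROLES.put("Index.db", "index of row keys");
        ROLES.put("Summary.db", "index summary from Index.db file");
        ROLES.put("Filter.db", "bloom filter over sstable");
        ROLES.put("TOC.txt", "table of contents of all components");
    }

    /** Returns {keyspace, columnFamily, version, generation, component}. */
    static String[] parse(String fileName) {
        String[] parts = fileName.split("-", 5);
        if (parts.length != 5) {
            throw new IllegalArgumentException("unexpected sstable file name: " + fileName);
        }
        return parts;
    }

    /** One-line description; components without a slide annotation get a fallback text. */
    static String describe(String fileName) {
        String[] p = parse(fileName);
        String role = ROLES.getOrDefault(p[4], "not described on the slides");
        return p[4] + " (generation " + p[3] + ", format " + p[2] + "): " + role;
    }
}
```

    For example, describe("mykeyspace-mycf-jb-1-Index.db") reports the Index.db component of generation 1 in format jb.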

    Set up keyspace and column family

    create keyspace test

    with placement_strategy = 'org.apache.cassandra.locator.Simp

    and strategy_options = {replication_factor:1};

    create column family test

    with comparator = 'AsciiType'

    and default_validation_class = 'AsciiType'

    and key_validation_class = 'AsciiType';

    SStableGen.java

    AbstractSSTableSimpleWriter writer = new SSTableSimpleUnsortedWriter(
        directory,
        partitioner,
        keyspace,
        columnFamily,
        AsciiType.instance,  // comparator
        null,                // subcomparator for super columns
        size_per_sstable_mb
    );

    ByteBuffer randomBytes = ByteBufferUtil.bytes(randomAscii(1024));
    KeyGenerator keyGen = new KeyGenerator();
    long dataSize = 0;

    writer = new SSTableSimpleUnsortedWriter();

    while (dataSize < max_data_bytes) {
        writer.newRow(key);
        for (int j=0; j
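    The loop above is cut off in the transcript, so its body is not recoverable. As a stand-alone illustration of the same idea (write rows until a byte budget is reached), here is a hedged Java sketch. The RowSink interface is a hypothetical stand-in for the Cassandra writer, and columnsPerRow, valueSize, and the key and column naming are assumptions, not taken from the slides.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;

/** Generates dummy rows until roughly maxDataBytes of column values have been written. */
class DummyDataSketch {
    /** Hypothetical stand-in for an sstable writer. */
    interface RowSink {
        void newRow(String key);
        void addColumn(String name, byte[] value);
    }

    /** Builds a string of random lowercase ASCII letters. */
    static String randomAscii(Random rnd, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append((char) ('a' + rnd.nextInt(26)));
        }
        return sb.toString();
    }

    /** Returns the number of rows written. The same random value is reused for every column. */
    static long generate(RowSink sink, long maxDataBytes, int columnsPerRow, int valueSize, long seed) {
        Random rnd = new Random(seed);
        byte[] value = randomAscii(rnd, valueSize).getBytes(StandardCharsets.US_ASCII);
        long dataSize = 0;
        long rows = 0;
        while (dataSize < maxDataBytes) {
            sink.newRow("key" + rows);
            rows++;
            for (int j = 0; j < columnsPerRow; j++) {
                sink.addColumn("col" + j, value);
                dataSize += value.length;
            }
        }
        return rows;
    }
}
```

    Reusing one pre-generated value keeps the generator itself cheap, so that a performance test measures the load rather than the random-data generation.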

    Examining sstable output

    patricia@dev:~/../data$ ls -lh mykeyspace/mycf

    total 64

    -rw-r--r-- 1 patricia staff 43B Feb 2 15:31 mykeyspace-mycf-jb-1-Com

    -rw-r--r-- 1 patricia staff 79K Feb 2 15:31 mykeyspace-mycf-jb-1-Dat

    -rw-r--r-- 1 patricia staff 16B Feb 2 15:31 mykeyspace-mycf-jb-1-Fil

    -rw-r--r-- 1 patricia staff 36B Feb 2 15:31 mykeyspace-mycf-jb-1-Ind

    -rw-r--r-- 1 patricia staff 4.3K Feb 2 15:31 mykeyspace-mycf-jb-1-Sta

    -rw-r--r-- 1 patricia staff 80B Feb 2 15:31 mykeyspace-mycf-jb-1-Sum

    -rw-r--r-- 1 patricia staff 79B Feb 2 15:31 mykeyspace-mycf-jb-1-TOC

    $ bin/sstableloader Keyspace1/ColFam1

    patricia@dev:~//cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d l

    Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db

    progress: [/127.0.0.1 1/1 (100)] [total: 100 - 0MB/s (avg: 0MB/s)]

    • Run command on separate server

    • Throttle command

    • Parallelise processes

    // list of orders by user
    customerOrders = new SSTableSimpleUnsortedWriter();

    // orders by order id
    orders = new SSTableSimpleUnsortedWriter();

    // assume orders are in date order
    for (Order order : oldOrders) {
        customerOrders.newRow(ByteBufferUtil.bytes(order.customerId));
        customerOrders.addColumn(ByteBufferUtil.bytes(order.orderId), ByteBufferUtil
            timestamp);

        // row key is the order id, matching the "orders by order id" table
        orders.newRow(ByteBufferUtil.bytes(order.orderId));
        orders.addColumn(ByteBufferUtil.bytes(customer_id), ByteBufferUtil.bytes(
            timestamp);
        orders.addColumn(ByteBufferUtil.bytes(date), ByteBufferUtil.bytes(order.d
        orders.addColumn(ByteBufferUtil.bytes(total), ByteBufferUtil.bytes(order.
    }

    customerOrders.close();
    orders.close();
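    The backfill above writes each historical order twice: once into a per-customer list and once into a row keyed by order id. The following stand-alone Java sketch models that denormalisation with in-memory sorted maps instead of sstable writers. The Order fields, the totalCents representation, and the class names are assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory model of the two denormalised tables filled by the backfill. */
class OrderBackfillSketch {
    /** Hypothetical order record. */
    static class Order {
        final String orderId;
        final String customerId;
        final String date;
        final long totalCents;

        Order(String orderId, String customerId, String date, long totalCents) {
            this.orderId = orderId;
            this.customerId = customerId;
            this.date = date;
            this.totalCents = totalCents;
        }
    }

    // row key: customer id; column names: order ids (values left empty)
    final Map<String, TreeMap<String, String>> customerOrders = new TreeMap<>();
    // row key: order id; columns: customer_id, date, total
    final Map<String, TreeMap<String, String>> orders = new TreeMap<>();

    /** Writes every order into both tables. */
    void backfill(List<Order> oldOrders) {
        for (Order order : oldOrders) {
            customerOrders
                .computeIfAbsent(order.customerId, k -> new TreeMap<>())
                .put(order.orderId, "");

            TreeMap<String, String> row = orders.computeIfAbsent(order.orderId, k -> new TreeMap<>());
            row.put("customer_id", order.customerId);
            row.put("date", order.date);
            row.put("total", Long.toString(order.totalCents));
        }
    }
}
```

    Each query pattern (orders for a customer, details of one order) is then served by a single row lookup, at the cost of writing every order twice.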

    $ bin/sstableloader Keyspace1/ColFam1

    patricia@dev:~//cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d \
        cass1,cass2,cass3

    Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db
        cass3,cass4,cass5,cass6]

    progress: [/cas1 3/3 (100)] [/cas2 0/4 (0)] [/cas3 0/0 (0)] [/cas4 0/0 (
        (0)] [/cas6 1/2 (50)] [total: 50 - 0MB/s (avg: 5MB/s)]
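    The loader invocation above, combined with the earlier advice to throttle the command and parallelise processes, could be scripted roughly as follows. This is a sketch, not a tested deployment script: the -d option and the bin/sstableloader path come from the slides, while the -t throttle option (in Mbit/s) is an assumption to verify against your Cassandra version, and the class name and parallelism parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Builds and runs sstableloader commands, one process per sstable directory. */
class LoaderRunnerSketch {
    /** Builds one command line; throttleMbits <= 0 means no throttle option is added. */
    static List<String> command(String sstableDir, String hosts, int throttleMbits) {
        List<String> cmd = new ArrayList<>(Arrays.asList("bin/sstableloader", "-d", hosts));
        if (throttleMbits > 0) {
            cmd.add("-t"); // assumed throttle flag; check `sstableloader --help`
            cmd.add(Integer.toString(throttleMbits));
        }
        cmd.add(sstableDir);
        return cmd;
    }

    /** Runs at most `parallelism` loaders at a time and returns their exit codes in order. */
    static List<Integer> runAll(List<String> dirs, String hosts, int throttleMbits, int parallelism)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String dir : dirs) {
                Callable<Integer> task = () ->
                    new ProcessBuilder(command(dir, hosts, throttleMbits)).inheritIO().start().waitFor();
                futures.add(pool.submit(task));
            }
            List<Integer> exitCodes = new ArrayList<>();
            for (Future<Integer> future : futures) {
                exitCodes.add(future.get());
            }
            return exitCodes;
        } finally {
            pool.shutdown();
        }
    }
}
```

    Running this from a machine that is not a cluster node follows the advice to run the command on a separate server.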

    patricia@dev:~/.../cassandra-2.0.4$ bin/nodetool compactionstats

    pending tasks: 30

    Active compaction remaining time : n/a

    CQL: Keep schema consistent

    cqlsh> CREATE KEYSPACE test

    WITH replication = {'class': 'SimpleStrategy', 'replication_fact

    cqlsh> CREATE COLUMNFAMILY "test" (id text PRIMARY KEY ) ;

    CQL3 Considerations

    • Uses CompositeType comparator

    Q&A
