bulk loadingdataintocassandra 140307160517 phpapp02
TRANSCRIPT
-
5/21/2018 Bulk Loadingdataintocassandra 140307160517 Phpapp02
1/44
Bulk-Loading Data into Cassandra
Patricia Gorla
@patriciagorla
Cassandra Consultant
www.thelastpickle.com
Planet Cassandra 2014
About Us
• Work with clients to deliver and improve Apache Cassandra services
• Apache Cassandra committer, DataStax MVP, Hector maintainer, Apache Usergrid committer
• Based in New Zealand & USA
-
Why is bulk loading useful?
• Performance tests
• Migrating historical data
• Changing topologies
-
• How Data is Stored
• Case Studies
  - Generating Dummy Data
  - Backfilling Historical Data
  - Changing Topologies
• Conclusion
-
Cassandra Write Path
• Writes are written to both the commit log and the memtable.
• The memtable is sorted.
[Diagram: write → commit log and memtable]
-
Cassandra Write Path
• The memtable is flushed out to sstables.
[Diagram: write → commit log and memtable → sstables]
-
Cassandra Write Path
• Compaction helps keep the read latency low.
[Diagram: write → commit log and memtable → sstables → compacted sstable]
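The write path above can be sketched as a toy model. This is only an illustration, assuming string keys and values and a made-up row-count flush threshold; the class `WritePathSketch` and all of its methods are hypothetical and are not Cassandra code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of the write path; not Cassandra's actual implementation. */
final class WritePathSketch {
    private final List<String> commitLog = new ArrayList<>();   // append-only, for durability
    private TreeMap<String, String> memtable = new TreeMap<>(); // kept sorted by key
    private final List<TreeMap<String, String>> sstables = new ArrayList<>(); // oldest first
    private final int flushThreshold;                           // assumed: flush after N keys

    WritePathSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    /** A write goes to both the commit log and the memtable. */
    void write(String key, String value) {
        commitLog.add(key + "=" + value);
        memtable.put(key, value);
        if (memtable.size() >= flushThreshold) {
            flush();
        }
    }

    /** The sorted memtable becomes a new sstable, which is never modified afterwards. */
    void flush() {
        if (memtable.isEmpty()) {
            return;
        }
        sstables.add(memtable);
        memtable = new TreeMap<>();
        commitLog.clear(); // flushed writes no longer need to be replayed
    }

    /** Merges all sstables into one, so a read has fewer places to look. */
    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> sstable : sstables) {
            merged.putAll(sstable); // oldest first, so newer values overwrite older ones
        }
        sstables.clear();
        if (!merged.isEmpty()) {
            sstables.add(merged);
        }
    }

    /** Checks the memtable first, then the sstables from newest to oldest. */
    String read(String key) {
        if (memtable.containsKey(key)) {
            return memtable.get(key);
        }
        for (int i = sstables.size() - 1; i >= 0; i--) {
            String value = sstables.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    int sstableCount() {
        return sstables.size();
    }

    public static void main(String[] args) {
        WritePathSketch node = new WritePathSketch(2);
        node.write("b", "1");
        node.write("a", "2"); // threshold reached: first sstable
        node.write("a", "3");
        node.write("c", "4"); // threshold reached: second sstable
        System.out.println("sstables before compaction: " + node.sstableCount());
        node.compact();
        System.out.println("sstables after compaction: " + node.sstableCount());
        System.out.println("a = " + node.read("a"));
    }
}
```

Bulk loading skips the commit log and memtable steps of this model: sstables are built offline and handed to the cluster directly.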
-
Sorted String Tables
mykeyspace-mycf-jb-1-CompressionInfo.db
mykeyspace-mycf-jb-1-Data.db          Contains all data needed to regenerate components
mykeyspace-mycf-jb-1-Filter.db        Bloom filter over sstable
mykeyspace-mycf-jb-1-Index.db         Index of row keys
mykeyspace-mycf-jb-1-Statistics.db
mykeyspace-mycf-jb-1-Summary.db       Index summary from Index.db file
mykeyspace-mycf-jb-1-TOC.txt          Table of contents of all components
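The file names above follow the pattern keyspace-columnfamily-version-generation-component. A minimal sketch of splitting a name into those parts, assuming keyspace and column family names contain no hyphens; the class `SSTableFileName` is hypothetical and not part of Cassandra.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Splits an sstable component file name into its parts (illustrative helper only). */
final class SSTableFileName {
    // <keyspace>-<column family>-<format version>-<generation>-<component>
    private static final Pattern NAME =
            Pattern.compile("^([^-]+)-([^-]+)-([a-z]+)-(\\d+)-(.+)$");

    final String keyspace;
    final String columnFamily;
    final String version;
    final int generation;
    final String component;

    private SSTableFileName(String keyspace, String columnFamily, String version,
                            int generation, String component) {
        this.keyspace = keyspace;
        this.columnFamily = columnFamily;
        this.version = version;
        this.generation = generation;
        this.component = component;
    }

    static SSTableFileName parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognised sstable file name: " + fileName);
        }
        return new SSTableFileName(m.group(1), m.group(2), m.group(3),
                Integer.parseInt(m.group(4)), m.group(5));
    }

    public static void main(String[] args) {
        SSTableFileName f = parse("mykeyspace-mycf-jb-1-Data.db");
        System.out.println(f.keyspace + " / " + f.columnFamily + " / " + f.version
                + " / " + f.generation + " / " + f.component);
    }
}
```

All files that share the same keyspace, column family, version and generation belong to one sstable.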
-
Case Studies: Generating Dummy Data
-
Set up keyspace and column family
create keyspace test
  with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
  and strategy_options = {replication_factor:1};
create column family test
  with comparator = 'AsciiType'
  and default_validation_class = 'AsciiType'
  and key_validation_class = 'AsciiType';
-
SStableGen.java
AbstractSSTableSimpleWriter writer = new SSTableSimpleUnsortedWriter(
    directory,
    partitioner,
    keyspace,
    columnFamily,
    AsciiType.instance,
    null,                 // subcomparator for super columns
    size_per_sstable_mb
);
-
ByteBuffer randomBytes = ByteBufferUtil.bytes(randomAscii(1024));
KeyGenerator keyGen = new KeyGenerator();
long dataSize = 0;
writer = new SSTableSimpleUnsortedWriter();
while (dataSize < max_data_bytes) {
    writer.newRow(key);
    for (int j=0; j…
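The loop on this slide is cut off in the transcript. One possible shape for such a loop is sketched below under stated assumptions: `RowWriter` is a hypothetical stand-in for the writer's newRow/addColumn/close calls so the example runs without Cassandra on the classpath, and the key naming, column naming and `columnsPerRow` parameter are invented for illustration rather than taken from the talk.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Random;

/** Sketch of a dummy-data loop; RowWriter is a hypothetical stand-in, not a Cassandra class. */
final class DummyDataSketch {

    /** Hypothetical stand-in for the newRow/addColumn/close calls of an sstable writer. */
    interface RowWriter {
        void newRow(ByteBuffer key);
        void addColumn(ByteBuffer name, ByteBuffer value, long timestamp);
        void close();
    }

    /** Writer that only counts calls, so the loop can be exercised on its own. */
    static final class CountingWriter implements RowWriter {
        long rows;
        long columns;
        boolean closed;
        public void newRow(ByteBuffer key) { rows++; }
        public void addColumn(ByteBuffer name, ByteBuffer value, long timestamp) { columns++; }
        public void close() { closed = true; }
    }

    static ByteBuffer bytes(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.US_ASCII));
    }

    static String randomAscii(int length, Random random) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append((char) ('a' + random.nextInt(26)));
        }
        return sb.toString();
    }

    /** Writes rows until at least maxDataBytes of column values exist; returns the row count. */
    static long generate(RowWriter writer, long maxDataBytes, int columnsPerRow, int valueSize) {
        if (columnsPerRow <= 0 || valueSize <= 0) {
            throw new IllegalArgumentException("columnsPerRow and valueSize must be positive");
        }
        ByteBuffer randomBytes = bytes(randomAscii(valueSize, new Random()));
        long dataSize = 0;
        long rows = 0;
        while (dataSize < maxDataBytes) {
            writer.newRow(bytes("key" + rows));
            long timestamp = System.currentTimeMillis() * 1000L; // microseconds
            for (int j = 0; j < columnsPerRow; j++) {
                // duplicate() shares the payload bytes but keeps an independent position
                writer.addColumn(bytes("col" + j), randomBytes.duplicate(), timestamp);
                dataSize += randomBytes.remaining();
            }
            rows++;
        }
        writer.close();
        return rows;
    }

    public static void main(String[] args) {
        CountingWriter writer = new CountingWriter();
        long rows = generate(writer, 10 * 1024, 5, 1024);
        System.out.println(rows + " rows, " + writer.columns + " columns");
    }
}
```

In a real generator the counting writer would be replaced by the sstable writer constructed on the previous slide.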
-
Examining sstable output
patricia@dev:~/../data$ ls -lh mykeyspace/mycf
total 64
-rw-r--r-- 1 patricia staff 43B Feb 2 15:31 mykeyspace-mycf-jb-1-Com
-rw-r--r-- 1 patricia staff 79K Feb 2 15:31 mykeyspace-mycf-jb-1-Dat
-rw-r--r-- 1 patricia staff 16B Feb 2 15:31 mykeyspace-mycf-jb-1-Fil
-rw-r--r-- 1 patricia staff 36B Feb 2 15:31 mykeyspace-mycf-jb-1-Ind
-rw-r--r-- 1 patricia staff 4.3K Feb 2 15:31 mykeyspace-mycf-jb-1-Sta
-rw-r--r-- 1 patricia staff 80B Feb 2 15:31 mykeyspace-mycf-jb-1-Sum
-rw-r--r-- 1 patricia staff 79B Feb 2 15:31 mykeyspace-mycf-jb-1-TOC
-
$ bin/sstableloader Keyspace1/ColFam1
patricia@dev:~/.../cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d l…
Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db
progress: [/127.0.0.1 1/1 (100)] [total: 100 - 0MB/s (avg: 0MB/s)]
-
$ bin/sstableloader Keyspace1/ColFam1
• Run command on separate server
• Throttle command
• Parallelise processes
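The throttling and parallelising tips can be combined in a small driver. This is a sketch under assumptions: it relies on sstableloader accepting `-t` as a throttle in Mbits alongside the `-d` node list (check `bin/sstableloader --help` for your version), and the class `ParallelLoaderSketch`, its method names and the pool size are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Builds throttled sstableloader commands and runs several at once (illustrative only). */
final class ParallelLoaderSketch {

    /** Command line for one sstable directory; "-t" is assumed to be the throttle option. */
    static List<String> command(String loaderPath, String sstableDir,
                                List<String> nodes, int throttleMbits) {
        return Arrays.asList(
                loaderPath,
                "-d", String.join(",", nodes),
                "-t", Integer.toString(throttleMbits),
                sstableDir);
    }

    /** Runs the commands with at most `parallelism` loaders at a time; returns exit codes. */
    static List<Integer> runAll(List<List<String>> commands, int parallelism) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (List<String> cmd : commands) {
                Callable<Integer> task =
                        () -> new ProcessBuilder(cmd).inheritIO().start().waitFor();
                futures.add(pool.submit(task));
            }
            List<Integer> exitCodes = new ArrayList<>();
            for (Future<Integer> future : futures) {
                exitCodes.add(future.get()); // blocks until that loader has finished
            }
            return exitCodes;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("cass1", "cass2", "cass3");
        // Only prints the command; call runAll(...) on a host where sstableloader is installed.
        System.out.println(String.join(" ", command("bin/sstableloader", "mykeyspace/mycf", nodes, 50)));
    }
}
```

Running the driver on a machine outside the cluster keeps the loader's CPU and disk use away from the nodes that are receiving the streams.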
-
Case Studies: Backfilling Historical Data
-
// list of orders by user
customerOrders = new SSTableSimpleUnsortedWriter();
// orders by order id
orders = new SSTableSimpleUnsortedWriter();

// assume orders are in date order
for (Order order : oldOrders) {
    customerOrders.newRow(ByteBufferUtil.bytes(order.customerId));
    customerOrders.addColumn(ByteBufferUtil.bytes(order.orderId), ByteBufferUtil…
        timestamp);

    // row key is the order id, matching the "orders by order id" writer above
    orders.newRow(ByteBufferUtil.bytes(order.orderId));
    orders.addColumn(ByteBufferUtil.bytes(customer_id), ByteBufferUtil.bytes(…
        timestamp);
    orders.addColumn(ByteBufferUtil.bytes(date), ByteBufferUtil.bytes(order.d…
    orders.addColumn(ByteBufferUtil.bytes(total), ByteBufferUtil.bytes(order.…
}
customerOrders.close();
orders.close();
-
Case Studies: Changing Topologies
-
$ bin/sstableloader Keyspace1/ColFam1
patricia@dev:~/.../cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d \
    cass1,cass2,cass3
Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db …
    cass3,cass4,cass5,cass6]
progress: [/cas1 3/3 (100)] [/cas2 0/4 (0)] [/cas3 0/0 (0)] [/cas4 0/0 (…
    (0)] [/cas6 1/2 (50)] [total: 50 - 0MB/s (avg: 5MB/s)]
-
$ bin/sstableloader Keyspace1/ColFam1
patricia@dev:~/.../cassandra-2.0.4$ bin/nodetool compactionstats
pending tasks: 30
Active compaction remaining time : n/a
-
Conclusion
-
CQL: Keep schema consistent
cqlsh> CREATE KEYSPACE test
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
cqlsh> CREATE COLUMNFAMILY "test" (id text PRIMARY KEY);
-
CQL3 Considerations
• Uses CompositeType comparator
-
Q&A
Patricia Gorla
@patriciagorla
Cassandra Consultant
www.thelastpickle.com
Planet Cassandra 2014