Cassandra & MapReduce

Cassandra & MR and how we used it in

Upload: vlaskinvlad

Post on 05-Dec-2014





DESCRIPTION

Some notes on Cassandra & MapReduce integration

TRANSCRIPT

Page 1: Cassandra & MapReduce

Cassandra & MR

and how we used it in

Page 2

Agenda

- Why do we need Cassandra & MapReduce

- 3 notes about Cassandra

- 3 notes about Cassandra + MapReduce

Page 3

Real Time Bidding (RTB)

Page 4

RTB

- How often? 10-30M auctions per day for mobile devices in Russia, i.e. about 10-30 GB of data per day

- What do we need? To be effective at showing ads

- They call it "big data & data mining". We decided to use Cassandra & Hadoop for that.

Page 5

1. Cassandra tokens

Δ = [T0, T4] is the token range, e.g. Δ = [-2^63, +2^63]

Every time one writes (K, V) into Cassandra:

- the partitioner computes token(K); suppose token(K) ∈ [T2, T3]

- then (K, V) will be put onto node 3 (with replication factor 1)
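As a rough illustration of the lookup above (this is not Cassandra's actual code; the four-node ring and its boundaries are hypothetical), node selection amounts to finding which token sub-range contains token(K):

```java
// Sketch of how a token ring maps a key's token to a node (replication factor 1).
// The boundaries and node count are made up for illustration.
public class TokenRing {
    // Sorted upper bounds of each node's sub-range: node i owns (bounds[i-1], bounds[i]].
    private final long[] upperBounds;

    public TokenRing(long[] upperBounds) {
        this.upperBounds = upperBounds;
    }

    // Returns the index of the node whose sub-range contains the given token.
    public int nodeFor(long token) {
        for (int i = 0; i < upperBounds.length; i++) {
            if (token <= upperBounds[i]) {
                return i;
            }
        }
        return 0; // wrap around to the first node
    }

    public static void main(String[] args) {
        // Four nodes splitting [-2^63, +2^63) into equal quarters.
        TokenRing ring = new TokenRing(new long[] {
            Long.MIN_VALUE / 2, 0L, Long.MAX_VALUE / 2, Long.MAX_VALUE
        });
        System.out.println(ring.nodeFor(-1L));            // second quarter -> node 1
        System.out.println(ring.nodeFor(Long.MAX_VALUE)); // last quarter -> node 3
    }
}
```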

Page 6

2. Cassandra load balancing

The partitioner generates tokens for your keys, i.e. it computes token(K).

Cassandra offers the following partitioners:

● Murmur3Partitioner (default): uniformly distributes data across the cluster based on MurmurHash hash values.

● RandomPartitioner: uniformly distributes data across the cluster based on MD5 hash values.

● ByteOrderedPartitioner: keeps an ordered distribution of data, lexically by key bytes.

The Murmur3Partitioner is the default partitioning strategy for new Cassandra clusters and the right choice for new clusters in almost all cases.

http://www.datastax.com/docs/1.2/cluster_architecture/partitioners
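As a sketch of what a hash-based partitioner like RandomPartitioner does (using the JDK's MessageDigest, not Cassandra's own classes), a token can be derived from the MD5 digest of the key bytes; lexically adjacent keys end up with unrelated tokens, which is what spreads load uniformly:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of a RandomPartitioner-style token: the MD5 hash of the key,
// interpreted as a non-negative big integer. This mimics the idea,
// not Cassandra's exact implementation.
public class Md5Token {
    public static BigInteger token(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            // Signum 1 forces a non-negative value in [0, 2^128).
            return new BigInteger(1, md5.digest(key.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Adjacent keys scatter across the token space.
        System.out.println(token("user:1000"));
        System.out.println(token("user:1001"));
    }
}
```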

Page 7

3. Cassandra indexes and knows

- Cassandra supports common data types, e.g. byte, string, long, double

- Cassandra supports secondary indexes, so you can select just the data you need instead of bulk-reading everything

- Cassandra knows how much data (how many records) is in every token range

Page 8

Cassandra & Map-Reduce

Google says:

1. Cassandra is integrated with Map-Reduce: http://wiki.apache.org/cassandra/HadoopSupport

2. That integration is outdated

3. It targets Hadoop 1.0.3 or some similarly old version. This means: please install a Hadoop + MR cluster yourself.

Page 9

Cassandra & Map-Reduce (what we want)

1. Cloudera Hadoop Distribution (CDH4): Cloudera Manager installs your cluster in a couple of clicks

2. Up to date (Cassandra 1.1.x - 1.2.x)

Solution:

A) Take the Cassandra sources from http://cassandra.apache.org/download/

B) Take the package org.apache.cassandra.hadoop and recompile it with dependencies from CDH4 & Cassandra 1.x. The resulting jar is ready to go for your map-reduce jobs.

Page 10

1. Allocate your cluster

DataStax says:

To configure a Cassandra cluster for Hadoop integration, overlay a Hadoop cluster over your Cassandra nodes. This involves installing a TaskTracker on each Cassandra node, and setting up a JobTracker and HDFS data node.

Why?

Page 11

Because this:

[diagram not included in the transcript]

works 100 times faster than this:

[diagram not included in the transcript]

Page 12

2. Number of map tasks

Job control parameter: InputSplitSize (default: 65536)

It estimates how many rows one map task will receive.

Every map task has its own token range to read data from: [-2^63, +2^63] / number of map tasks
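The arithmetic above can be sketched as follows (the row count is a made-up figure; in a real job the estimate comes from Cassandra's per-range record counts mentioned on page 7):

```java
import java.math.BigInteger;

// Sketch: how InputSplitSize turns an estimated row count into a number of
// map tasks, and how the full token range is divided among them.
public class SplitMath {
    // Width of the full token range [-2^63, +2^63).
    static final BigInteger TOKEN_SPACE = BigInteger.valueOf(2).pow(64);

    // Number of map tasks: ceil(estimatedRows / inputSplitSize).
    public static long numMapTasks(long estimatedRows, long inputSplitSize) {
        return (estimatedRows + inputSplitSize - 1) / inputSplitSize;
    }

    // Width of the token sub-range each map task reads.
    public static BigInteger rangeWidthPerTask(long numTasks) {
        return TOKEN_SPACE.divide(BigInteger.valueOf(numTasks));
    }

    public static void main(String[] args) {
        // Hypothetical table with 10M rows, default InputSplitSize of 65536.
        long tasks = numMapTasks(10_000_000L, 65536L);
        System.out.println(tasks);                    // 153 map tasks
        System.out.println(rangeWidthPerTask(tasks)); // token-range width per task
    }
}
```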

Page 13

3. How the job reads the data

Job control parameter: RangeBatchSize (default: 4096)

The bulk read size, applied together with your filters (primary & secondary indexes). Cassandra does the filtering on the server side, within each map task's token range ( [-2^63, +2^63] / number of map tasks ).
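A rough sketch of the batched-read loop described above (a sorted in-memory map stands in for a map task's token range on the server, and the predicate stands in for the index-based filters; nothing here uses real Cassandra APIs):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.function.Predicate;

// Sketch of RangeBatchSize-style reading: pull rows from a token range in
// fixed-size batches instead of one huge request, filtering "server side".
public class BatchedRangeReader {
    public static List<String> readRange(TreeMap<Long, String> server,
                                         long batchSize,
                                         Predicate<String> serverSideFilter) {
        List<String> result = new ArrayList<>();
        Long from = server.isEmpty() ? null : server.firstKey();
        while (from != null) {
            // Fetch at most batchSize rows starting at 'from'.
            List<Long> batchKeys = new ArrayList<>();
            for (Long k : server.tailMap(from).keySet()) {
                batchKeys.add(k);
                if (batchKeys.size() == batchSize) break;
            }
            for (Long k : batchKeys) {
                String row = server.get(k);
                if (serverSideFilter.test(row)) result.add(row); // filter applied on the "server"
            }
            // The next batch starts just past the last key we saw.
            Long last = batchKeys.get(batchKeys.size() - 1);
            SortedMap<Long, String> rest = server.tailMap(last + 1);
            from = rest.isEmpty() ? null : rest.firstKey();
        }
        return result;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> server = new TreeMap<>();
        for (long t = 0; t < 10; t++) server.put(t, "row" + t);
        // Read in batches of 4, keeping only even-numbered rows.
        System.out.println(readRange(server, 4,
                r -> Long.parseLong(r.substring(3)) % 2 == 0));
    }
}
```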

Page 14

Pros:

1. Easy to manage (Cassandra cluster & Cloudera Manager)

2. Easy to index

3. Query language & data type support

Cons:

1. Scaling is extremely expensive (every node should run Cassandra + Hadoop)

2. Reading is very slow

3. Reading large amounts of data is effectively impossible

Note: Netflix uses Cassandra to manage its data, but their map-reduce jobs read the SSTable files directly, bypassing Cassandra!

http://www.datastax.com/dev/blog/2012-in-review-performance

Conclusion