Cassandra & MapReduce

Cassandra & MR and how we used it in

Upload: vlaskinvlad

Post on 05-Dec-2014





DESCRIPTION

Some notes on Cassandra & MapReduce integration

TRANSCRIPT

Page 1: Cassandra & MapReduce

Cassandra & MR

and how we used it in

Page 2

Agenda

- Why do we need Cassandra & MapReduce

- 3 notes about Cassandra

- 3 notes about Cassandra + MapReduce

Page 3

Real Time Bidding (RTB)

Page 4

RTB

- How often? 10-30M auctions per day for mobile devices in Russia, i.e. about 10-30 GB of data per day

- What do we need? To be effective at showing ads

- They call it "big data & data mining". We decided to use Cassandra & Hadoop for that.

Page 5

1. Cassandra tokens

Δ = [T0, T4] is the token range, e.g. Δ = [-2^63, +2^63]

Every time one writes (K, V) into Cassandra:

- the partitioner computes token(K); suppose token(K) ∈ [T2, T3]

- then (K, V) will be put onto node 3 (with replication factor 1)
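As a rough illustration of the lookup above (this is not Cassandra's actual code; the four-node ring and its boundaries are hypothetical), node selection amounts to finding which token sub-range contains token(K):

```java
// Sketch of how a token ring maps a key's token to a node (replication factor 1).
// The boundaries and node count are made up for illustration.
public class TokenRing {
    // Sorted upper bounds of each node's sub-range: node i owns (bounds[i-1], bounds[i]].
    private final long[] upperBounds;

    public TokenRing(long[] upperBounds) {
        this.upperBounds = upperBounds;
    }

    // Returns the index of the node whose sub-range contains the given token.
    public int nodeFor(long token) {
        for (int i = 0; i < upperBounds.length; i++) {
            if (token <= upperBounds[i]) {
                return i;
            }
        }
        return 0; // wrap around to the first node
    }

    public static void main(String[] args) {
        // Four nodes splitting [-2^63, +2^63) into equal quarters.
        TokenRing ring = new TokenRing(new long[] {
            Long.MIN_VALUE / 2, 0L, Long.MAX_VALUE / 2, Long.MAX_VALUE
        });
        System.out.println(ring.nodeFor(-1L));            // second quarter -> node 1
        System.out.println(ring.nodeFor(Long.MAX_VALUE)); // last quarter -> node 3
    }
}
```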

Page 6

2. Cassandra load balancing

The partitioner generates tokens for your keys, i.e. it computes token(K).

Cassandra offers the following partitioners:

● Murmur3Partitioner (default): uniformly distributes data across the cluster based on MurmurHash hash values.

● RandomPartitioner: uniformly distributes data across the cluster based on MD5 hash values.

● ByteOrderedPartitioner: keeps an ordered distribution of data, lexically by key bytes.

The Murmur3Partitioner is the default partitioning strategy for new Cassandra clusters and the right choice for new clusters in almost all cases.

http://www.datastax.com/docs/1.2/cluster_architecture/partitioners
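As a sketch of what a hash-based partitioner like RandomPartitioner does (using the JDK's MessageDigest, not Cassandra's own classes), a token can be derived from the MD5 digest of the key bytes; lexically adjacent keys end up with unrelated tokens, which is what spreads load uniformly:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of a RandomPartitioner-style token: the MD5 hash of the key,
// interpreted as a non-negative big integer. This mimics the idea,
// not Cassandra's exact implementation.
public class Md5Token {
    public static BigInteger token(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            // Signum 1 forces a non-negative value in [0, 2^128).
            return new BigInteger(1, md5.digest(key.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Adjacent keys scatter across the token space.
        System.out.println(token("user:1000"));
        System.out.println(token("user:1001"));
    }
}
```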

Page 7

3. Cassandra indexes and knows

- Cassandra supports common data types, e.g. byte, string, long, double

- Cassandra supports secondary indexes, so you can select just the data you need instead of bulk-reading everything

- Cassandra knows how much data (how many records) is in every token range

Page 8

Cassandra & Map-Reduce

Google says:

1. Cassandra is integrated with Map-Reduce: http://wiki.apache.org/cassandra/HadoopSupport

2. That integration is outdated

3. It targets Hadoop 1.0.3 or some similarly old version. This means: please install a Hadoop + MR cluster yourself.

Page 9

Cassandra & Map-Reduce (what we want)

1. Cloudera Hadoop Distribution (CDH4): Cloudera Manager installs your cluster in a couple of clicks

2. Up to date (Cassandra 1.1.x - 1.2.x)

Solution:

A) Take the Cassandra sources from http://cassandra.apache.org/download/

B) Take the package org.apache.cassandra.hadoop and recompile it with dependencies from CDH4 & Cassandra 1.x. The resulting jar is ready to go for your map-reduce jobs.

Page 10

1. Allocate your cluster

DataStax says:

To configure a Cassandra cluster for Hadoop integration, overlay a Hadoop cluster over your Cassandra nodes. This involves installing a TaskTracker on each Cassandra node, and setting up a JobTracker and HDFS data node.

Why?

Page 11

Because this:

[diagram not included in the transcript]

works 100 times faster than this:

[diagram not included in the transcript]

Page 12

2. Number of map tasks

Job control parameter: InputSplitSize (default: 65536)

It estimates how many rows one map task will receive.

Every map task has its own token range to read data from: [-2^63, +2^63] / number of map tasks
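The arithmetic above can be sketched as follows (the row count is a made-up figure; in a real job the estimate comes from Cassandra's per-range record counts mentioned on page 7):

```java
import java.math.BigInteger;

// Sketch: how InputSplitSize turns an estimated row count into a number of
// map tasks, and how the full token range is divided among them.
public class SplitMath {
    // Width of the full token range [-2^63, +2^63).
    static final BigInteger TOKEN_SPACE = BigInteger.valueOf(2).pow(64);

    // Number of map tasks: ceil(estimatedRows / inputSplitSize).
    public static long numMapTasks(long estimatedRows, long inputSplitSize) {
        return (estimatedRows + inputSplitSize - 1) / inputSplitSize;
    }

    // Width of the token sub-range each map task reads.
    public static BigInteger rangeWidthPerTask(long numTasks) {
        return TOKEN_SPACE.divide(BigInteger.valueOf(numTasks));
    }

    public static void main(String[] args) {
        // Hypothetical table with 10M rows, default InputSplitSize of 65536.
        long tasks = numMapTasks(10_000_000L, 65536L);
        System.out.println(tasks);                    // 153 map tasks
        System.out.println(rangeWidthPerTask(tasks)); // token-range width per task
    }
}
```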

Page 13

3. How the job reads the data

Job control parameter: RangeBatchSize (default: 4096)

The bulk read size, applied together with your filters (primary & secondary indexes). Cassandra does the filtering on the server side, within each map task's token range ( [-2^63, +2^63] / number of map tasks ).
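A rough sketch of the batched-read loop described above (a sorted in-memory map stands in for a map task's token range on the server, and the predicate stands in for the index-based filters; nothing here uses real Cassandra APIs):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.function.Predicate;

// Sketch of RangeBatchSize-style reading: pull rows from a token range in
// fixed-size batches instead of one huge request, filtering "server side".
public class BatchedRangeReader {
    public static List<String> readRange(TreeMap<Long, String> server,
                                         long batchSize,
                                         Predicate<String> serverSideFilter) {
        List<String> result = new ArrayList<>();
        Long from = server.isEmpty() ? null : server.firstKey();
        while (from != null) {
            // Fetch at most batchSize rows starting at 'from'.
            List<Long> batchKeys = new ArrayList<>();
            for (Long k : server.tailMap(from).keySet()) {
                batchKeys.add(k);
                if (batchKeys.size() == batchSize) break;
            }
            for (Long k : batchKeys) {
                String row = server.get(k);
                if (serverSideFilter.test(row)) result.add(row); // filter applied on the "server"
            }
            // The next batch starts just past the last key we saw.
            Long last = batchKeys.get(batchKeys.size() - 1);
            SortedMap<Long, String> rest = server.tailMap(last + 1);
            from = rest.isEmpty() ? null : rest.firstKey();
        }
        return result;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> server = new TreeMap<>();
        for (long t = 0; t < 10; t++) server.put(t, "row" + t);
        // Read in batches of 4, keeping only even-numbered rows.
        System.out.println(readRange(server, 4,
                r -> Long.parseLong(r.substring(3)) % 2 == 0));
    }
}
```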

Page 14

Pros:

1. Easy to manage (Cassandra cluster & Cloudera Manager)

2. Easy to index

3. Query language & data type support

Cons:

1. Scaling is extremely expensive (every node should run Cassandra + Hadoop)

2. Reading is very slow

3. Reading large amounts of data is effectively impossible

Note: Netflix uses Cassandra to manage its data, but their map-reduce jobs read the SSTable files directly, bypassing Cassandra!

http://www.datastax.com/dev/blog/2012-in-review-performance

Conclusion