Cassandra Summit 2014: The Cassandra Experience at Orange - Season 2

Jean Armel Luce, Orange France. Thursday, September 11, 2014. The C* Experience at Orange, Season 2


DESCRIPTION

Presenter: Jean Armel Luce, Cassandra Administrator at Orange. At the Cassandra Summit Europe 2013, Jean Armel presented "The Cassandra Experience at Orange - Season 1", explaining the first steps of Cassandra at Orange (the choice of Cassandra, a migration without any interruption of service, and the improvements in QoS after the migration). For "The Cassandra Experience at Orange - Season 2", Jean Armel focuses on two features added to the PnS application over the last months: graphs and analytics. A Cassandra table must have one and only one primary key, while some data have many logical identifiers; designing the data as a graph may help. As for analytics, Hadoop + Hive make it possible to run analytics on data stored in Cassandra. This presentation highlights a few tips about installing Hadoop/Hive over C*, and about the isolation between map reduce tasks and online queries.

TRANSCRIPT

Jean Armel Luce

Orange France

Thursday, September 11, 2014

The C* Experience at Orange - Season 2

2

The Cassandra experience at Orange - season 1

Summary  

1.  Why did we choose C* for our application PnS?

2.  Our migration strategy (without any interruption of service)

3.  After the migration …

§  For more details, watch the Cassandra Summit Europe 2013 talk: –  https://www.youtube.com/watch?v=mefOE9K7sLI&feature=youtube_gdata

 Jean Armel Luce - Orange-France

3

The Cassandra experience at Orange - season 2  

Summary

1.  Short description of the application PnS

2.  Keyring: why design customer ids with graphs in C*?

3.  BYOHH (Bring Your Own Hadoop & Hive) with Cassandra

If you missed season 1: Short description of PnS

5

PnS: Short description of the application

§  PnS means Profile and Syndication: a highly available service for collecting and serving live data about Orange customers

§  End users:

–  Orange customers (www.orange.fr)

–  Sellers in Orange shops

–  Some services in Orange (advertisements, …)

6

PnS: The Big Picture

(Diagram) End users send millions of HTTP requests (REST or SOAP) to PnS, which must be fast and highly available. PnS exposes a web service to get or set the stored data, with post-processing of each datum. Data providers inject thousands of files (CSV, JSON or XML) through scheduled data injections. PnS issues R/W queries against the database.

7

PnS: Architecture at the end of 2013

(Diagram) One multi-DC cluster spanning two datacenters (Bagnolet and Sophia Antipolis) for high availability, serving web services (reads and writes) and batch updates.

8

PnS: Some key dates about the PnS3.0 project

Season 1 (from April 2013 to October 2013): Migration to C*

Season 2, part 1 (November 2013): Keyring

Season 2, part 2 (April 2014): Hadoop & Hive for Analytics

Keyring: Why design customer ids with graphs in C*?

10

PnS database design

§  Nearly 35 tables at the end of 2013  

CREATE TABLE customers (
    customer_id varchar,
    col1 varchar,
    col2 bigint,
    col3 set<text>,
    ...,
    coln timestamp,
    PRIMARY KEY (customer_id));

§  SELECT colx, coly, colz FROM customers WHERE customer_id = '???' ;

 


11

Customer ids

§  What is a customer id?

–  cell number

–  internet account

–  email address

–  ISE (internal identifier used by many other Orange applications)

–  …

§  For many reasons, data is stored in tables with different primary keys

–  some data is often retrieved using a cell number → stored when possible in a table whose PK is a cell number

–  … but not all customers have a cell number → such data is stored in a table whose PK is not a cell number

–  …

12

Customer ids translation

§  A PnS user knows only 1 customer id

§  He often needs to retrieve data indexed by another kind of customer id in the DB

(Diagram) A user who only knows the cell number (209) 123-4567 needs a customer_id translation before the application can run: SELECT * FROM pns WHERE cust_id = 'ISE_QWERTY'

13

Database design in the old relational databases

§  Design with secondary indexes ?

SELECT email_address FROM customer_ids WHERE cell_number = ???;

§  Requires a lot of secondary indexes with values having high cardinality

§  With C*, secondary indexes with values having a high cardinality are wasteful

     

(Table sketch) One row per customer: ISE is the primary key; cell_number, email_address, intranet account, …, idtypeN would each require a secondary index.

14

Design with graph for C*

(Diagram) Example keyring graphs. One customer's graph links IdType='Internet'/IdValue='99999999999', IdType='EMAIL'/IdValue='priam@orange.com', IdType='ISE'/IdValue='myISE1' and IdType='Cell'/IdValue=(209) 123-4567. Other small graphs link IdType='ISE'/IdValue='myISE2' with 'hecuba@orange.com' and (209) 123-4568, and IdType='ISE'/IdValue='myISE3' with (209) 123-4569, 'paris@orange.com' and 'alexander@orange.com'.

15

The new « Customer ids » table in C*

§  Table of edges between customer ids

CREATE TABLE graph (
    idvalue1 text,        -- value of the initial vertex of the edge
    idtype1 text,         -- type of the initial vertex of the edge
    idvalue2 text,        -- value of the terminal vertex of the edge
    idtype2 text,         -- type of the terminal vertex of the edge
    attr map<text, text>, -- a map column for storing any kind of property
    t timestamp,
    PRIMARY KEY ((idvalue1), idtype1, idtype2, idvalue2));

SELECT * FROM graph WHERE idvalue1 IN ('???');
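A minimal sketch of how this edge table behaves (Python, with an in-memory dict standing in for Cassandra; a real deployment would go through a driver): because the partition key is idvalue1, all edges leaving a vertex live in one partition and can be fetched with a single query.

```python
from collections import defaultdict

# Partition key = idvalue1; clustering columns = (idtype1, idtype2, idvalue2).
graph_table = defaultdict(list)

def insert_edge(idvalue1, idtype1, idvalue2, idtype2, attr=None):
    graph_table[idvalue1].append(
        {"idtype1": idtype1, "idtype2": idtype2,
         "idvalue2": idvalue2, "attr": attr or {}})

# An undirected edge is stored in both directions.
insert_edge("(209) 123-4567", "Cell", "myISE1", "ISE")
insert_edge("myISE1", "ISE", "(209) 123-4567", "Cell")

# Equivalent of: SELECT * FROM graph WHERE idvalue1 IN ('(209) 123-4567')
rows = graph_table["(209) 123-4567"]
```

One dict lookup, like one partition read, returns the whole neighborhood of a vertex.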

 


16

Small independent graphs

§  500,000,000 edges in the graph

§  The keyring graph is not a single large graph

§  It is rather a lot of small independent undirected graphs

Ø  Each vertex has a small neighborhood.

Ø  The search for a customer id is limited to a small subset of the edges and vertices

17

Atomicity

§  The edges are bi-directional (undirected)

–  We need to insert or update 2 rows for each edge

–  The atomic batch mode guarantees that the 2 directions are updated atomically
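A sketch of the two-row write described above: assembling the logged (atomic) batch that inserts both directions of an undirected edge. The CQL text is built by hand here for illustration; a real client would use a driver's batch API instead.

```python
# Template for one direction of the edge (timestamp column omitted for brevity).
INSERT_EDGE = ("INSERT INTO graph (idvalue1, idtype1, idvalue2, idtype2) "
               "VALUES ('{v1}', '{t1}', '{v2}', '{t2}');")

def edge_batch(v1, t1, v2, t2):
    """Return a logged batch writing the edge in both directions atomically."""
    return "\n".join([
        "BEGIN BATCH",                                   # logged => atomic
        INSERT_EDGE.format(v1=v1, t1=t1, v2=v2, t2=t2),  # direction 1 -> 2
        INSERT_EDGE.format(v1=v2, t1=t2, v2=v1, t2=t1),  # direction 2 -> 1
        "APPLY BATCH;",
    ])

batch = edge_batch("myISE1", "ISE", "priam@orange.com", "EMAIL")
```

If the coordinator fails mid-batch, the logged batch guarantees that either both directions are eventually applied or neither is.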


18


Optimization of the search of the shortest path

§  We know which kinds of customer ids are used by the PnS users

§  We know which kinds of customer ids are used for indexation

§  For each pair, the shortest paths are predefined in our application PnS (according to the kind of customer ids)
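A hypothetical sketch of what such a predefined-path table could look like (the id types and routes below are invented for illustration; the real mapping is internal to PnS). Given the id type a user holds and the id type a table is indexed by, the application already knows the route, and therefore the exact number of SELECTs a search will cost.

```python
# (start id type, target id type) -> predefined route through id types.
PREDEFINED_PATHS = {
    ("Cell", "ISE"):      ["Cell", "ISE"],              # direct neighbour
    ("EMAIL", "ISE"):     ["EMAIL", "Cell", "ISE"],     # two hops
    ("Internet", "Cell"): ["Internet", "ISE", "Cell"],  # two hops
}

def selects_needed(user_id_type, index_id_type):
    """One SELECT per edge traversed along the predefined route."""
    path = PREDEFINED_PATHS[(user_id_type, index_id_type)]
    return len(path) - 1
```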

19


Search API in the graph

§  An in-house C++ library offers an API for an iterative breadth-first graph exploration

§  Example: looking for H from A

(Diagram) A sample graph with vertices A to I; starting from A, the search expands level by level (A, then B and F, then their neighbors, …) until H is found.

SELECT * FROM graph WHERE credval1 IN ('B', 'F');
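The iterative breadth-first exploration can be sketched as follows (Python standing in for the in-house C++ library): at each level the whole frontier is fetched in one query, mirroring SELECT * FROM graph WHERE credval1 IN (...), so the number of queries equals the number of hops between the two ids.

```python
from collections import defaultdict

edges = defaultdict(set)

def add_edge(a, b):
    # Undirected edge: both directions are stored, as in the keyring table.
    edges[a].add(b)
    edges[b].add(a)

# The example graph of the slide: looking for H starting from A.
for a, b in [("A", "B"), ("A", "F"), ("B", "C"), ("B", "D"),
             ("F", "G"), ("C", "E"), ("C", "H"), ("G", "I")]:
    add_edge(a, b)

def bfs_queries(start, target):
    """Return how many SELECTs the search needs, or None if unreachable."""
    seen, frontier, queries = {start}, {start}, 0
    while frontier:
        queries += 1  # one SELECT ... WHERE credval1 IN (frontier) per level
        reached = set().union(*(edges[v] for v in frontier)) - seen
        if target in reached:
            return queries
        seen |= reached
        frontier = reached
    return None
```

On the slide's graph, finding H from A visits {B, F} then {C, D, G} then {E, H, I}: three levels, three SELECTs.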

20


Number of queries per search

§  Looking for a direct neighbour requires only 1 SELECT

§  Looking for a neighbour of a neighbour requires 2 SELECT

§  Looking for a neighbour of a neighbour of a neighbour requires 3 SELECT

§  …

21

Search response time

§  A search executes in 1, 2 or 3 reads → very low response time (thanks to FusionIO and C++ code)

§  Nearly 700 searches/sec, with a response time per search between 2 ms and 3.5 ms

22

Conclusions about Keyring

§  We had to rethink this feature, because C* != RDBMS

§  At first glance, a graph looks like an exotic design … but for our use case, it works well with C* … and FusionIO.

§  Favoring access to data through the partition key is very efficient for achieving low response time and linear scalability.

BYOHH: Bring Your Own Hadoop & Hive with Cassandra

24

Basic architecture of the Cassandra cluster

(Diagram) A pool of web servers in DC1 and a pool of web servers in DC2, each querying the C* nodes of its own datacenter.

§  Cluster without Hadoop: 2 datacenters, 16 nodes in each DC

§  RF (DC1, DC2) = (3, 3)

§  CL = ONE or LOCAL_QUORUM for online queries

§  Requests from web servers in DC1 are sent to C* nodes in DC1

§  Requests from web servers in DC2 are sent to C* nodes in DC2

25


Adding a new datacenter for analytics

§  Cluster with Hadoop/Hive: 3 datacenters, 16 nodes in DC1, 16 nodes in DC2, 4 nodes in DC3

§  RF (DC1, DC2, DC3) = (3, 3, 1)

§  Because RF = 1 in DC3, we need less storage space in this datacenter

§  We favor cheaper disks (SATA) in DC3 rather than SSDs or FusionIO cards
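The RF (3, 3, 1) layout above maps directly onto NetworkTopologyStrategy. A small sketch of the statement that would apply it (the keyspace name "pns" is an assumption; the datacenter names and replication factors come from the slide):

```python
# Per-datacenter replication factors from the slide: RF (DC1, DC2, DC3) = (3, 3, 1).
REPLICATION = {"DC1": 3, "DC2": 3, "DC3": 1}

def alter_keyspace(keyspace, replication):
    """Build the CQL that sets per-DC replication with NetworkTopologyStrategy."""
    opts = ", ".join("'%s': %d" % (dc, rf)
                     for dc, rf in sorted(replication.items()))
    return ("ALTER KEYSPACE %s WITH replication = "
            "{'class': 'NetworkTopologyStrategy', %s};" % (keyspace, opts))

stmt = alter_keyspace("pns", REPLICATION)
```

With RF = 1 in DC3, each analytics node holds a single copy of its token ranges, which is why cheaper SATA disks suffice there.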

26

Architecture of the Cassandra cluster with the new datacenter for analytics

(Diagram) The pools of web servers in DC1 and DC2 keep querying their local datacenter; DC3 is added alongside for analytics.

27

Potential impacts of map reduce tasks on online queries

(Diagram) Map reduce tasks in DC3 take all the resources (CPU, RAM, IO, …). This causes timeouts for online READ queries sent with CL=ONE, and hinted handoffs accumulate for online update queries not replicated in DC3.

28

(Available consistency levels: ANY, LOCAL_ONE, ONE, LOCAL_QUORUM, EACH_QUORUM, QUORUM, ALL)

Solution for timeouts (online READ queries)

Isolation between online queries and map reduce tasks:

§  Use a LOCAL CONSISTENCY LEVEL:

–  For map reduce tasks in DC3: LOCAL_ONE

–  For online queries in DC1 or DC2: LOCAL_ONE or LOCAL_QUORUM

LOCAL_ONE is available since C* 1.2.12 (cf. JIRA CASSANDRA-6238)


29


Guarantee on resources for online queries

§  Use CGROUPS:

–  Can guarantee a minimum of CPU/RAM for online queries

–  Cgroups cannot be used for disk I/O (map tasks call C* processes when reading data on disk)

Solution for Hinted Handoffs (online WRITE queries) 1/2

(Diagram) Hinted handoffs accumulate in DC1 and DC2 for online update queries not replicated in DC3.

30


Solution for Hinted Handoffs (online WRITE queries) 2/2

Swap global and local read repair chances

§  By default, in C* 1.2:

–  read_repair_chance = 0.1

–  dclocal_read_repair_chance = 0.0

§  For highly read tables, the read repairs are not sent to DC3:

–  Set read_repair_chance = 0.0

–  Set dclocal_read_repair_chance = 0.1

Ø  Less load and disk IO in DC3

dclocal_read_repair_chance = 0.1 is now the default since C* 2.0.9 (cf. JIRA CASSANDRA-7320)
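A sketch of the CQL that applies this swap (the table names below are illustrative; the two property values come from the slide):

```python
def swap_read_repair(table):
    """Build the ALTER TABLE that swaps global and DC-local read repair chances."""
    return ("ALTER TABLE %s WITH read_repair_chance = 0.0 "
            "AND dclocal_read_repair_chance = 0.1;" % table)

# Applied to the highly read tables (names are examples, not the real schema).
statements = [swap_read_repair(t) for t in ("customers", "graph")]
```

With read_repair_chance = 0.0, read repairs no longer touch DC3 replicas; dclocal_read_repair_chance = 0.1 keeps repairs flowing within the local datacenter.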


31


Tradeoff “ease of exploitation vs optimization”

§  256 vnodes per C* node is usually recommended

§  There is at least 1 map task per virtual node in DC3

–  Disabling virtual nodes in DC3 shortens the execution time, but adding new nodes in DC3 is less easy

–  Enabling virtual nodes in DC3 makes adding new nodes easier, but lengthens the execution time

What is the right number of vnodes? 64 vnodes per node looks like a good tradeoff.

32


Contributions and open sourced modules

§  Hive Handler open sourced by Orange

§  Works with CDH4.4 and C* 1.2.13

§  Feature added to this handler: authentication

§  Github: https://github.com/Orange-OpenSource/cassandra_handler_for_hive

Thanks to Cyril Scetbon for this handler

33


Conclusions about BYOHH

§  The installation of Hadoop & Hive is tricky, but we didn't have a choice for analytics because CQL has many limitations

§  We had to rethink our architecture. Now, we are able to do analytics with Hadoop + Hive with a better isolation between online and analytics queries.

§  We have also discovered an interesting ecosystem around C* which offers more capabilities. With this ecosystem, we can benefit from the strengths of C* and work around some of its limitations.

Conclusions  

35


Season 2: Conclusions

1. Rethink

2. Adapt

3. Leverage

36


Thank you

37


Questions

38


A few answers about hardware/OS version /Java version/Cassandra version/Hadoop version

§  Hardware:

–  16 nodes in DC1 and DC2 at the end of 2013:

§  2 CPUs (6 cores each), Intel® Xeon® 2.00 GHz

§  64 GB RAM

§  FusionIO® 800 GB MLC

–  4 nodes in DC3:

§  2 CPUs (6 cores each), Intel® Xeon® 2.00 GHz

§  24 GB RAM

§  SATA disks, 15K rpm

§  OS: Ubuntu Precise (12.04 LTS)

§  Cassandra version: 1.2.13

§  Hadoop version: CDH 4.4 (with Hive 0.10): Hadoop 2 with MRv1

§  Hive handler: https://github.com/Orange-OpenSource/cassandra_handler_for_hive

§  Java version: Java7u45 (GC CMS with option CMSClassUnloadingEnabled)

39


A few answers about data and requests

§  Data types:

§  Volume: 6 TB at the end of 2013

§  elementary types: boolean, integer, string, date

§  collection types

§  complex types: json, xml (between 1 and 20 KB)

§  Requests:

§  10,000 requests/sec at the end of 2013

§  80% get

§  20% set

§  Consistency level used by PnS for online queries and batch updates:

§  LOCAL_ONE (95% of the queries)

§  LOCAL_QUORUM (5% of the queries)