jstein.cassandra.nyc.2011

Upload: joe-stein

Post on 11-May-2015

DESCRIPTION

RealTime Analytics With Cassandra

TRANSCRIPT

Page 1: jstein.cassandra.nyc.2011

/*
 * Joe Stein
 * http://www.linkedin.com/in/charmalloc
 * @allthingshadoop
 * @cassandranosql
 * @allthingsscala
 * @charmalloc
 */

http://www.medialets.com

Cassandra as the central nervous system of your distributed systems

Page 2: jstein.cassandra.nyc.2011

Overview

• Architecture
• Aggregate Metrics/Time Series
• Implementation Over Cassandra

Page 3: jstein.cassandra.nyc.2011

Medialets

Architecture

Page 4: jstein.cassandra.nyc.2011

Medialets

• Largest deployment of rich media ads for mobile devices
• Over 300,000,000 devices supported
• 3-4 TB of new data every day
• Thousands of services in production
• Hundreds of thousands of events received every second
• Response times are measured in microseconds
• Languages
  – 35% JVM (20% Scala & 10% Java)
  – 30% Ruby
  – 20% C/C++
  – 13% Python
  – 2% Bash

Page 5: jstein.cassandra.nyc.2011

The million foot view

[Architecture diagram: AdServing, Collection, and Muse components, three mysql instances, Cassandra, Hadoop, and Kafka]

Page 6: jstein.cassandra.nyc.2011

Medialets

Aggregate Metrics/Time Series

Page 7: jstein.cassandra.nyc.2011

Let's look at just one data point captured

• 09/10/2011 11:12:13
• App = Yahoo!
• Platform = iOS
• OS = 4.3.4
• Device = iPad2,1
• Resolution = 768x1024
• Events
  – videoPlayPercent = 38
  – Taste = great

Page 8: jstein.cassandra.nyc.2011

The time series part of it

• 09/10/2011 11:12:13

Quarter  Q3
Month    201109
Week     201136
Day      20110910
Hour     2011091011
Minute   201109101112
Second   20110910111213
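The bucket keys above can be derived mechanically from the event timestamp. The following is a minimal sketch, not code from the talk: the class `TimeBuckets` and the method `keysFor` are hypothetical names, and ISO-8601 week numbering is an assumption (it does reproduce 201136 for this date).

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.IsoFields;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not from the talk): derives one key per time bucket.
final class TimeBuckets {
    private static String fmt(LocalDateTime t, String pattern) {
        return t.format(DateTimeFormatter.ofPattern(pattern));
    }

    // Returns bucket name -> key, from the coarsest bucket to the finest.
    static Map<String, String> keysFor(LocalDateTime t) {
        Map<String, String> keys = new LinkedHashMap<>();
        keys.put("Quarter", "Q" + t.get(IsoFields.QUARTER_OF_YEAR));
        keys.put("Month", fmt(t, "yyyyMM"));
        // Assumption: ISO-8601 week-based year and week number.
        keys.put("Week", String.format("%d%02d",
                t.get(IsoFields.WEEK_BASED_YEAR),
                t.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR)));
        keys.put("Day", fmt(t, "yyyyMMdd"));
        keys.put("Hour", fmt(t, "yyyyMMddHH"));
        keys.put("Minute", fmt(t, "yyyyMMddHHmm"));
        keys.put("Second", fmt(t, "yyyyMMddHHmmss"));
        return keys;
    }

    public static void main(String[] args) {
        // The data point from the slide: 09/10/2011 11:12:13
        keysFor(LocalDateTime.of(2011, 9, 10, 11, 12, 13))
                .forEach((bucket, key) -> System.out.println(bucket + " " + key));
    }
}
```

Because every key is a zero-padded string, a finer bucket always starts with the key of the coarser bucket that contains it (the Day key is a prefix of the Hour key, and so on).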

Page 9: jstein.cassandra.nyc.2011

Metrics For Different Wants

Yahoo! + iOS + 4.3.4 + iPad2,1 + 768x1024

Yahoo! + videoPlayPercent = 30 + Taste = great

Yahoo! + Taste = great

Yahoo! + videoPlayPercent = 30

iPad2,1 + videoPlayPercent = 30 + Taste = great

768x1024 + videoPlayPercent = 30 + Taste = great

iOS + 4.3.4 + iPad2,1

Page 10: jstein.cassandra.nyc.2011

Medialets

Implementation Over Cassandra

Page 11: jstein.cassandra.nyc.2011

Storing the time series

Column families hold your rows of data. Each row in each column family corresponds to the time period you are dealing with, so an "event" occurring at 09/10/2011 12:13:14 becomes 4 rows, one per column family:

BySecond = 20110910121314
ByMinute = 201109101213
ByHour = 2011091012
ByDay = 20110910

CREATE COLUMN FAMILY ByDay
  WITH default_validation_class=CounterColumnType
  AND key_validation_class=UTF8Type AND comparator=UTF8Type;

CREATE COLUMN FAMILY ByHour
  WITH default_validation_class=CounterColumnType
  AND key_validation_class=UTF8Type AND comparator=UTF8Type;

CREATE COLUMN FAMILY ByMinute
  WITH default_validation_class=CounterColumnType
  AND key_validation_class=UTF8Type AND comparator=UTF8Type;

CREATE COLUMN FAMILY BySecond
  WITH default_validation_class=CounterColumnType
  AND key_validation_class=UTF8Type AND comparator=UTF8Type;

Page 12: jstein.cassandra.nyc.2011

Why multiple column families?

http://www.datastax.com/docs/1.0/configuration/storage_configuration

Page 13: jstein.cassandra.nyc.2011

Generically group by

• app+platform+osversion+device+resolution

• app+event1+event2

• app+event1

• app+event2

• device+event1+event2

• resolution+event1+event2

• platform+osversion+device

Page 14: jstein.cassandra.nyc.2011

As columns – names are composites

• app+platform+osversion+device+resolution#Yahoo!+iOS+4.3.4+iPad2,1+768x1024

• app+event1+event2#Yahoo!+videoPlayPercent=30+Taste=great

• app+event1#Yahoo!+Taste=great

• app+event2#Yahoo!+videoPlayPercent=30

• device+event1+event2#iPad2,1+videoPlayPercent=30+Taste=great

• resolution+event1+event2#768x1024+videoPlayPercent=30+Taste=great

• platform+osversion+device#iOS+4.3.4+iPad2,1
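A composite column name of this shape is just the grouping's dimension names joined with "+", then "#", then the matching values joined with "+". The sketch below shows one way to build such names. It is an illustration rather than the author's code: `CompositeColumns`, `columnName`, and `columnNames` are hypothetical, and it assumes event values are stored as "name=value" strings under the keys event1 and event2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hector or Skeletor.
final class CompositeColumns {
    // Builds "dim1+dim2#value1+value2" for one grouping of dimensions.
    static String columnName(List<String> dims, Map<String, String> point) {
        List<String> values = new ArrayList<>();
        for (String dim : dims) {
            values.add(point.get(dim));
        }
        return String.join("+", dims) + "#" + String.join("+", values);
    }

    // One column name per grouping; each one is a counter to increment.
    static List<String> columnNames(List<List<String>> groupings, Map<String, String> point) {
        List<String> names = new ArrayList<>();
        for (List<String> dims : groupings) {
            names.add(columnName(dims, point));
        }
        return names;
    }

    public static void main(String[] args) {
        Map<String, String> point = new LinkedHashMap<>();
        point.put("app", "Yahoo!");
        point.put("platform", "iOS");
        point.put("osversion", "4.3.4");
        point.put("device", "iPad2,1");
        point.put("resolution", "768x1024");
        point.put("event1", "videoPlayPercent=30"); // assumed encoding of an event
        point.put("event2", "Taste=great");

        List<List<String>> groupings = Arrays.asList(
                Arrays.asList("app", "platform", "osversion", "device", "resolution"),
                Arrays.asList("app", "event1", "event2"),
                Arrays.asList("platform", "osversion", "device"));
        columnNames(groupings, point).forEach(System.out::println);
    }
}
```

Keeping the dimension names in front of the "#" means all columns for one grouping sort next to each other, which is what makes prefix range queries over a row possible.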

Page 15: jstein.cassandra.nyc.2011

The rows

• ByHour=2011091011
  – app+platform+osversion+device+resolution#Yahoo!+iOS+4.3.4+iPad2,1+768x1024
  – app+event1+event2#Yahoo!+videoPlayPercent=30+Taste=great
  – app+event1#Yahoo!+Taste=great
  – app+event2#Yahoo!+videoPlayPercent=30
  – device+event1+event2#iPad2,1+videoPlayPercent=30+Taste=great
  – resolution+event1+event2#768x1024+videoPlayPercent=30+Taste=great
  – platform+osversion+device#iOS+4.3.4+iPad2,1

• ByDay=20110910
  – app+platform+osversion+device+resolution#Yahoo!+iOS+4.3.4+iPad2,1+768x1024
  – app+event1+event2#Yahoo!+videoPlayPercent=30+Taste=great
  – app+event1#Yahoo!+Taste=great
  – app+event2#Yahoo!+videoPlayPercent=30
  – device+event1+event2#iPad2,1+videoPlayPercent=30+Taste=great
  – resolution+event1+event2#768x1024+videoPlayPercent=30+Taste=great
  – platform+osversion+device#iOS+4.3.4+iPad2,1
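To make the write pattern concrete: each event increments every composite column in every time-bucket row, so one event costs (number of rows) x (number of columns) counter increments. The following in-memory model is only a sketch of that bookkeeping. `CounterRows` and its methods are hypothetical and stand in for Cassandra counter column families; nothing here talks to a cluster.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for counter column families.
final class CounterRows {
    // column family -> row key -> (sorted) column name -> count
    private final Map<String, Map<String, TreeMap<String, Long>>> store = new HashMap<>();

    void increment(String columnFamily, String rowKey, String column) {
        store.computeIfAbsent(columnFamily, k -> new HashMap<>())
                .computeIfAbsent(rowKey, k -> new TreeMap<>())
                .merge(column, 1L, Long::sum);
    }

    // Records one event: every column is incremented in every time-bucket row.
    // Returns the number of counter increments performed.
    int record(Map<String, String> rowKeyByColumnFamily, List<String> columns) {
        int writes = 0;
        for (Map.Entry<String, String> row : rowKeyByColumnFamily.entrySet()) {
            for (String column : columns) {
                increment(row.getKey(), row.getValue(), column);
                writes++;
            }
        }
        return writes;
    }

    long count(String columnFamily, String rowKey, String column) {
        Map<String, TreeMap<String, Long>> rows = store.get(columnFamily);
        if (rows == null) {
            return 0L;
        }
        TreeMap<String, Long> columns = rows.get(rowKey);
        return columns == null ? 0L : columns.getOrDefault(column, 0L);
    }
}
```

With the two rows and seven columns shown above, a single event would produce 14 increments in this model, which is why the choice of time buckets and groupings is worth making deliberately.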

Page 16: jstein.cassandra.nyc.2011

Inserting data with Hector

• mutator.insertCounter("20110910", "ByDay", HFactory.createCounterColumn("app+platform+osversion+device+resolution#Yahoo!+iOS+4.3.4+iPad2,1+768x1024", 1))

• mutator.insertCounter("20110910", "ByDay", HFactory.createCounterColumn("app+event1+event2#Yahoo!+videoPlayPercent=30+Taste=great", 1))

• mutator.insertCounter("20110910", "ByDay", HFactory.createCounterColumn("app+event1#Yahoo!+Taste=great", 1))

• mutator.insertCounter("20110910", "ByDay", HFactory.createCounterColumn("app+event2#Yahoo!+videoPlayPercent=30", 1))

• mutator.insertCounter("20110910", "ByDay", HFactory.createCounterColumn("device+event1+event2#iPad2,1+videoPlayPercent=30+Taste=great", 1))

• mutator.insertCounter("20110910", "ByDay", HFactory.createCounterColumn("resolution+event1+event2#768x1024+videoPlayPercent=30+Taste=great", 1))

• mutator.insertCounter("20110910", "ByDay", HFactory.createCounterColumn("platform+osversion+device#iOS+4.3.4+iPad2,1", 1))

Page 17: jstein.cassandra.nyc.2011

Inserting data with Skeletor

Skeletor is the Scala wrapper of Hector for Cassandra
https://github.com/joestein/skeletor

aggregateColumnNames("AppPlatformOSVersionDeviceResolution") = "app+platform+osversion+device+resolution#"

def ccAppPlatformOSVersionDeviceResolution(c: (String) => Unit) = {
  c(aggregateColumnNames("AppPlatformOSVersionDeviceResolution") + app + p(platform) + p(osversion) + p(device) + p(resolution))
}

// rows we are going to write to
aggregateKeys(KEYSPACE \ "ByMonth") = month // 201109
aggregateKeys(KEYSPACE \ "ByDay") = day // 20110910
aggregateKeys(KEYSPACE \ "ByHour") = hour // 2011091012
aggregateKeys(KEYSPACE \ "ByMinute") = minute // 201109101213

def r(columnName: String): Unit = {
  aggregateKeys.foreach { tuple: (ColumnFamily, String) => {
    val (columnFamily, row) = tuple
    if (row != null && row.size > 0)
      rows add (columnFamily -> row has columnName inc) // increment the counter
  }}
}

ccAppPlatformOSVersionDeviceResolution(r)

Page 18: jstein.cassandra.nyc.2011

Retrieving Data

MultigetSliceCounterQuery

• setColumnFamily("ByDay")
• setKeys("20110910")
• setRange("app+event1#", "app+event1#~", false, 1000)
• We will get all the apps and counts for event1

• setRange("app+event2#", "app+event2#~", false, 1000)
• We will get all the apps and the counts for event2

By app tastes great vs less filling

• Sample code for the aggregate metrics and retrieving them https://github.com/joestein/apophis

• What is with the tilde?

Page 19: jstein.cassandra.nyc.2011

Sort for success

Not magic, just Cassandra
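The tilde works because columns within a row are stored sorted by name, and "~" (0x7E) is the last printable ASCII character. A slice from a prefix to that prefix plus "~" therefore covers every ASCII column name that starts with the prefix. The sketch below imitates this with a `TreeMap`, assuming ASCII column names using the "#" separator from the earlier slides. `TildeRange` and `slice` are hypothetical names, and `TreeMap` ordering is only a stand-in for the UTF8Type comparator.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical illustration: a TreeMap plays the role of one sorted row.
final class TildeRange {
    // Returns every column whose name sorts between prefix and prefix + "~",
    // both ends included.
    static SortedMap<String, Long> slice(TreeMap<String, Long> columns, String prefix) {
        return columns.subMap(prefix, true, prefix + "~", true);
    }

    public static void main(String[] args) {
        TreeMap<String, Long> row = new TreeMap<>();
        row.put("app+event1#Yahoo!+Taste=great", 3L);
        row.put("app+event1#ExampleApp+Taste=great", 1L); // ExampleApp is made up
        row.put("app+event1+event2#Yahoo!+videoPlayPercent=30+Taste=great", 2L);
        row.put("app+event2#Yahoo!+videoPlayPercent=30", 2L);

        // Only the two "app+event1#" columns fall inside the range.
        slice(row, "app+event1#").forEach((name, count) ->
                System.out.println(name + " -> " + count));
    }
}
```

One caveat of this trick: a value that itself begins with "~" followed by more characters would sort after the end bound and be missed, so it relies on values not starting with a tilde.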

Page 20: jstein.cassandra.nyc.2011

A few more things about retrieving data

• You need to start backwards from here.

• If you want to do things ad hoc, then map/reduce is better.

• Sometimes more rows are better, allowing more nodes to do work
  – If you need to look at 100,000 metrics, it is better to pull them out of 100 rows than out of 1
  – Don't be afraid to make CFs and composite keys out of Time + Aggregate data
    • 20111023+app=Yahoo!
    • This could be the row that holds ALL of the app information for that day. If you want to look at 100 apps at once, with 1000 metrics for each per time period, this could be the way to go.
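A composite row key of the form shown above (20111023+app=Yahoo!) can be produced by joining the time bucket with one dimension and its value. The sketch below is an assumption-laden illustration: `CompositeRowKeys`, `rowKey`, and `rowKeys` are hypothetical names, and "ExampleApp" is a made-up second app.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for Time + Aggregate row keys.
final class CompositeRowKeys {
    // Builds keys such as "20111023+app=Yahoo!".
    static String rowKey(String timeBucket, String dimension, String value) {
        return timeBucket + "+" + dimension + "=" + value;
    }

    // One row key per value, e.g. one row per app for the same day.
    static List<String> rowKeys(String timeBucket, String dimension, List<String> values) {
        List<String> keys = new ArrayList<>();
        for (String value : values) {
            keys.add(rowKey(timeBucket, dimension, value));
        }
        return keys;
    }

    public static void main(String[] args) {
        rowKeys("20111023", "app", Arrays.asList("Yahoo!", "ExampleApp"))
                .forEach(System.out::println);
    }
}
```

A list of keys like this is the kind of input a multiget query takes, so the metrics for many apps are read from many rows, which may live on different nodes, instead of from one very wide row.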

Page 21: jstein.cassandra.nyc.2011

[email protected]/showcase

Medialets
The rich media ad platform for mobile.

Q & A

/*
 * Joe Stein
 * http://www.linkedin.com/in/charmalloc
 * @allthingshadoop
 * @cassandranosql
 * @allthingsscala
 * @charmalloc
 * http://github.com/joestein
 */