jstein.cassandra.nyc.2011

Post on 11-May-2015

DESCRIPTION

RealTime Analytics With Cassandra

TRANSCRIPT

1

/*
 * Joe Stein
 * http://www.linkedin.com/in/charmalloc
 * @allthingshadoop
 * @cassandranosql
 * @allthingsscala
 * @charmalloc
 */

http://www.medialets.com

Cassandra as the central nervous system of your distributed systems

2

Overview

• Architecture
• Aggregate Metrics/Time Series
• Implementation Over Cassandra

3

Medialets

Architecture

4

Medialets

• Largest deployment of rich media ads for mobile devices
• Over 300,000,000 devices supported
• 3-4 TB of new data every day
• Thousands of services in production
• Hundreds of thousands of events received every second
• Response times are measured in microseconds
• Languages
  – 35% JVM (20% Scala & 10% Java)
  – 30% Ruby
  – 20% C/C++
  – 13% Python
  – 2% Bash

5

The million foot view

[Architecture diagram: AdServing, Collection, Muse, mysql (x3), Cassandra, Hadoop, Kafka]

6

Aggregate Metrics/Time Series

7

Let's look at just one data point captured

• 09/10/2011 11:12:13
• App = Yahoo!
• Platform = iOS
• OS = 4.3.4
• Device = iPad2,1
• Resolution = 768x1024
• Events
  – videoPlayPercent = 38
  – Taste = great

8

The time series part of it

• 09/10/2011 11:12:13

Quarter  Q3
Month    201109
Week     201136
Day      20110910
Hour     2011091011
Minute   201109101112
Second   20110910111213
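The period keys on this slide can be derived mechanically from the event timestamp. The following is a minimal Java sketch, not code from the talk: the class and method names are hypothetical, and it assumes ISO-8601 week numbering, which gives week 36 for this date.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.IsoFields;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not from the talk): maps one timestamp to its period keys.
final class TimeBuckets {

    private static String fmt(LocalDateTime t, String pattern) {
        return t.format(DateTimeFormatter.ofPattern(pattern, Locale.ROOT));
    }

    // Returns period name -> key, coarsest period first.
    static Map<String, String> bucketsFor(LocalDateTime t) {
        Map<String, String> buckets = new LinkedHashMap<>();
        buckets.put("Quarter", "Q" + t.get(IsoFields.QUARTER_OF_YEAR));
        buckets.put("Month", fmt(t, "yyyyMM"));
        // Assumption: ISO-8601 week-based year and week number.
        buckets.put("Week", String.format(Locale.ROOT, "%d%02d",
                t.get(IsoFields.WEEK_BASED_YEAR),
                t.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR)));
        buckets.put("Day", fmt(t, "yyyyMMdd"));
        buckets.put("Hour", fmt(t, "yyyyMMddHH"));
        buckets.put("Minute", fmt(t, "yyyyMMddHHmm"));
        buckets.put("Second", fmt(t, "yyyyMMddHHmmss"));
        return buckets;
    }

    public static void main(String[] args) {
        // The data point from the slide: 09/10/2011 11:12:13
        bucketsFor(LocalDateTime.of(2011, 9, 10, 11, 12, 13))
                .forEach((period, key) -> System.out.println(period + " " + key));
    }
}
```

Because every key is a zero-padded prefix of the next finer one, keys for the same period type sort chronologically as plain strings.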

9

Metrics For Different Wants

Yahoo! + iOS + 4.3.4 + iPad2,1 + 768x1024

Yahoo! + videoPlayPercent = 30 + Taste = great

Yahoo! + Taste = great

Yahoo! + videoPlayPercent = 30

iPad2,1 + videoPlayPercent = 30 + Taste = great

768x1024 + videoPlayPercent = 30 + Taste = great

iOS + 4.3.4 + iPad2,1

10

Implementation Over Cassandra

11

Storing the time series

Column families hold your rows of data. The row key in each column family is the time period you are dealing with, so an “event” occurring at 09/10/2011 12:13:14 becomes 4 rows, one per column family:

BySecond = 20110910121314
ByMinute = 201109101213
ByHour = 2011091012
ByDay = 20110910

CREATE COLUMN FAMILY ByDay
WITH default_validation_class=CounterColumnType
AND key_validation_class=UTF8Type AND comparator=UTF8Type;

CREATE COLUMN FAMILY ByHour
WITH default_validation_class=CounterColumnType
AND key_validation_class=UTF8Type AND comparator=UTF8Type;

CREATE COLUMN FAMILY ByMinute
WITH default_validation_class=CounterColumnType
AND key_validation_class=UTF8Type AND comparator=UTF8Type;

CREATE COLUMN FAMILY BySecond
WITH default_validation_class=CounterColumnType
AND key_validation_class=UTF8Type AND comparator=UTF8Type;

12

Why multiple column families?

http://www.datastax.com/docs/1.0/configuration/storage_configuration

13

Generically group by

• app+platform+osversion+device+resolution

• app+event1+event2

• app+event1

• app+event2

• device+event1+event2

• resolution+event1+event2

• platform+osversion+device

14

As columns – names are composites

• app+platform+osversion+device+resolution#Yahoo!+iOS+4.3.4+iPad2,1+768x1024

• app+event1+event2#Yahoo!+videoPlayPercent=30+Taste=great

• app+event1#Yahoo!+Taste=great

• app+event2#Yahoo!+videoPlayPercent=30

• device+event1+event2#iPad2,1+videoPlayPercent=30+Taste=great

• resolution+event1+event2#768x1024+videoPlayPercent=30+Taste=great

• platform+osversion+device#iOS+4.3.4+iPad2,1
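The naming scheme on this slide (dimension names joined with `+`, then `#`, then the matching values joined with `+`) can be generated from a list of group-bys. This is a rough Java sketch under that reading of the slide. The class, the method names, and the choice to store each event as a single `name=value` string are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not from the talk): builds composite column names.
final class CompositeColumns {

    // Builds "<dim1>+<dim2>#<value1>+<value2>" for one group-by.
    static String columnName(List<String> dimensions, Map<String, String> dataPoint) {
        List<String> values = new ArrayList<>();
        for (String dimension : dimensions) {
            String value = dataPoint.get(dimension);
            if (value == null) {
                throw new IllegalArgumentException("missing dimension: " + dimension);
            }
            values.add(value);
        }
        return String.join("+", dimensions) + "#" + String.join("+", values);
    }

    // One column name per group-by, in the order the group-bys are listed.
    static List<String> columnNames(List<List<String>> groupBys, Map<String, String> dataPoint) {
        List<String> names = new ArrayList<>();
        for (List<String> dimensions : groupBys) {
            names.add(columnName(dimensions, dataPoint));
        }
        return names;
    }
}
```

With a data point of `app=Yahoo!`, `platform=iOS`, `osversion=4.3.4`, `device=iPad2,1`, the group-by `platform+osversion+device` yields `platform+osversion+device#iOS+4.3.4+iPad2,1`, matching the last bullet above.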

15

The rows

• ByHour=2011091011
  – app+platform+osversion+device+resolution#Yahoo!+iOS+4.3.4+iPad2,1+768x1024
  – app+event1+event2#Yahoo!+videoPlayPercent=30+Taste=great
  – app+event1#Yahoo!+Taste=great
  – app+event2#Yahoo!+videoPlayPercent=30
  – device+event1+event2#iPad2,1+videoPlayPercent=30+Taste=great
  – resolution+event1+event2#768x1024+videoPlayPercent=30+Taste=great
  – platform+osversion+device#iOS+4.3.4+iPad2,1

• ByDay=20110910
  – app+platform+osversion+device+resolution#Yahoo!+iOS+4.3.4+iPad2,1+768x1024
  – app+event1+event2#Yahoo!+videoPlayPercent=30+Taste=great
  – app+event1#Yahoo!+Taste=great
  – app+event2#Yahoo!+videoPlayPercent=30
  – device+event1+event2#iPad2,1+videoPlayPercent=30+Taste=great
  – resolution+event1+event2#768x1024+videoPlayPercent=30+Taste=great
  – platform+osversion+device#iOS+4.3.4+iPad2,1
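Put differently, each event increments the same set of columns once in every time-period row it belongs to. The sketch below models that fan-out with nested in-memory maps standing in for the counter column families. It is an illustration only: no Cassandra client is involved, and all names are hypothetical.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for counter column families (illustration only).
final class CounterFanOut {

    // columnFamily -> rowKey -> columnName -> count
    private final Map<String, Map<String, Map<String, Long>>> store = new TreeMap<>();

    // rowKeys maps a column family name to the row key for this event,
    // e.g. "ByDay" -> "20110910". Every column is incremented in every row.
    void record(Map<String, String> rowKeys, List<String> columnNames) {
        for (Map.Entry<String, String> entry : rowKeys.entrySet()) {
            Map<String, Long> row = store
                    .computeIfAbsent(entry.getKey(), cf -> new TreeMap<>())
                    .computeIfAbsent(entry.getValue(), key -> new TreeMap<>());
            for (String columnName : columnNames) {
                row.merge(columnName, 1L, Long::sum); // counter += 1
            }
        }
    }

    // Returns 0 for a counter that was never incremented.
    long count(String columnFamily, String rowKey, String columnName) {
        return store.getOrDefault(columnFamily, Collections.emptyMap())
                .getOrDefault(rowKey, Collections.emptyMap())
                .getOrDefault(columnName, 0L);
    }
}
```

Two events in different hours of the same day would leave a count of 1 in each `ByHour` row and 2 in the shared `ByDay` row, so a query for the day never has to sum the hours.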

16

Inserting data with Hector

• mutator.insertCounter("20110910", "ByDay", HFactory.createCounterColumn("app+platform+osversion+device+resolution#Yahoo!+iOS+4.3.4+iPad2,1+768x1024", 1));

• mutator.insertCounter("20110910", "ByDay", HFactory.createCounterColumn("app+event1+event2#Yahoo!+videoPlayPercent=30+Taste=great", 1));

• mutator.insertCounter("20110910", "ByDay", HFactory.createCounterColumn("app+event1#Yahoo!+Taste=great", 1));

• mutator.insertCounter("20110910", "ByDay", HFactory.createCounterColumn("app+event2#Yahoo!+videoPlayPercent=30", 1));

• mutator.insertCounter("20110910", "ByDay", HFactory.createCounterColumn("device+event1+event2#iPad2,1+videoPlayPercent=30+Taste=great", 1));

• mutator.insertCounter("20110910", "ByDay", HFactory.createCounterColumn("resolution+event1+event2#768x1024+videoPlayPercent=30+Taste=great", 1));

• mutator.insertCounter("20110910", "ByDay", HFactory.createCounterColumn("platform+osversion+device#iOS+4.3.4+iPad2,1", 1));

17

Inserting data with Skeletor

Skeletor is the Scala wrapper of Hector for Cassandra
https://github.com/joestein/skeletor

aggregateColumnNames("AppPlatformOSVersionDeviceResolution") = "app+platform+osversion+device+resolution#"

def ccAppPlatformOSVersionDeviceResolution(c: (String) => Unit) = {
  c(aggregateColumnNames("AppPlatformOSVersionDeviceResolution") + app + p(platform) + p(osversion) + p(device) + p(resolution))
}

// rows we are going to write to
aggregateKeys(KEYSPACE \ "ByMonth") = month   // 201109
aggregateKeys(KEYSPACE \ "ByDay") = day       // 20110910
aggregateKeys(KEYSPACE \ "ByHour") = hour     // 2011091012
aggregateKeys(KEYSPACE \ "ByMinute") = minute // 201109101213

def r(columnName: String): Unit = {
  aggregateKeys.foreach { tuple: (ColumnFamily, String) => {
    val (columnFamily, row) = tuple

    if (row != null && row.size > 0)
      rows add (columnFamily -> row has columnName inc) // increment the counter
  }}
}

ccAppPlatformOSVersionDeviceResolution(r)

18

Retrieving Data

MultigetSliceCounterQuery

• setColumnFamily("ByDay")
• setKeys("20110910")
• setRange("app+event1=", "app+event1=~", false, 1000)
  – We will get all the apps and counts for event1
• setRange("app+event2=", "app+event2=~", false, 1000)
  – We will get all the apps and the counts for event2

By app tastes great vs less filling

• Sample code for the aggregate metrics and retrieving them https://github.com/joestein/apophis

• What is with the tilde?

19

Sort for success

Not magic, just Cassandra
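One way to read the tilde: column names in a row are kept in comparator order, so a slice from a prefix to that prefix plus `~` returns every name that continues the prefix with ordinary printable ASCII, because `~` (0x7E) sorts after letters, digits, and common punctuation. The sketch below uses a `TreeMap` as a stand-in for one sorted row. It uses the `#` separator from the composite-names slide. `OtherApp` and the counts are made-up sample data, and for ASCII names Java's string ordering matches byte order.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// TreeMap stand-in for one sorted row of counter columns (illustration only).
final class PrefixSlice {

    // Returns the columns in [prefix, prefix + "~"], like a forward range slice.
    // Names whose next character sorts after '~' (for example non-ASCII text)
    // would fall outside this range.
    static NavigableMap<String, Long> sliceByPrefix(NavigableMap<String, Long> row, String prefix) {
        return row.subMap(prefix, true, prefix + "~", true);
    }

    public static void main(String[] args) {
        NavigableMap<String, Long> byDay = new TreeMap<>();
        byDay.put("app+event1#Yahoo!+Taste=great", 5L);
        byDay.put("app+event1#OtherApp+Taste=great", 4L); // hypothetical app
        byDay.put("app+event1+event2#Yahoo!+videoPlayPercent=30+Taste=great", 2L);
        byDay.put("app+event2#Yahoo!+videoPlayPercent=30", 3L);
        byDay.put("platform+osversion+device#iOS+4.3.4+iPad2,1", 7L);

        // Only the two "app+event1#..." columns fall inside the range.
        sliceByPrefix(byDay, "app+event1#")
                .forEach((name, count) -> System.out.println(name + " = " + count));
    }
}
```

The `app+event1+event2#...` column is excluded because `+` (0x2B) sorts after `#` (0x23), which places it beyond the `app+event1#~` end of the range.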

20

A few more things about retrieving data

• You need to start backwards from here.

• If you want to do things ad hoc, then map/reduce is better

• Sometimes more rows are better, allowing more nodes to do work
  – If you need to look at 100,000 metrics, it is better to pull this out of 100 rows than out of 1
  – Don’t be afraid to make CF and composite keys out of Time+Aggregate data
    • 20111023+app=Yahoo!
    • This could be the row that holds ALL of the app information for that day. If you want to look at 100 apps at once with 1000 metrics for each per time period, this could be the way to go
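A Time+Aggregate row key such as `20111023+app=Yahoo!` can be built with plain string concatenation, and a list of such keys is what a multiget over many apps would be given. The helper below is a hypothetical sketch, with names invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not from the talk): builds Time+Aggregate row keys.
final class ShardedRowKeys {

    // Builds a key of the form "<timeBucket>+<dimension>=<value>".
    static String rowKey(String timeBucket, String dimension, String value) {
        return timeBucket + "+" + dimension + "=" + value;
    }

    // One row key per value, e.g. one row per app for the same day.
    static List<String> rowKeys(String timeBucket, String dimension, List<String> values) {
        List<String> keys = new ArrayList<>();
        for (String value : values) {
            keys.add(rowKey(timeBucket, dimension, value));
        }
        return keys;
    }
}
```

Spreading the metrics over one row per app means the rows can land on different nodes, which is the point the bullets above make about letting more nodes do the work.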

connect@medialets.com
www.medialets.com/showcase

Medialets: The rich media ad platform for mobile.

21

Q & A

/*
 * Joe Stein
 * http://www.linkedin.com/in/charmalloc
 * @allthingshadoop
 * @cassandranosql
 * @allthingsscala
 * @charmalloc
 * http://github.com/joestein
 */
