jstein.cassandra.nyc.2011
RealTime Analytics With Cassandra
1
/*
 * Joe Stein
 * http://www.linkedin.com/in/charmalloc
 * @allthingshadoop
 * @cassandranosql
 * @allthingsscala
 * @charmalloc
 */
http://www.medialets.com
Cassandra as the central nervous system of your distributed systems
2
Overview
• Architecture
• Aggregate Metrics/Time Series
• Implementation Over Cassandra
3
Medialets
Architecture
4
Medialets
• Largest deployment of rich media ads for mobile devices
• Over 300,000,000 devices supported
• 3-4 TB of new data every day
• Thousands of services in production
• Hundreds of thousands of events received every second
• Response times are measured in microseconds
• Languages
  – 35% JVM (20% Scala & 10% Java)
  – 30% Ruby
  – 20% C/C++
  – 13% Python
  – 2% Bash
The million foot view
[Architecture diagram; labels shown: AdServing, Collection, Muse, mysql (three instances), Cassandra, Hadoop, Kafka]
6
Medialets
Aggregate Metrics/Time Series
7
Let's look at just one data point captured
• 09/10/2011 11:12:13
• App = Yahoo!
• Platform = iOS
• OS = 4.3.4
• Device = iPad2,1
• Resolution = 768x1024
• Events
  – videoPlayPercent = 38
  – Taste = great
8
The time series part of it
• 09/10/2011 11:12:13
Quarter Q3
Month 201109
Week 201136
Day 20110910
Hour 2011091011
Minute 201109101112
Second 20110910111213
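The breakdown above can be sketched in a few lines of Java. This is only an illustration, not code from the talk: the class name TimeBuckets and the method keysFor are hypothetical, the date is read as September 10, 2011, and the week key is assumed to be the ISO-8601 week number.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.IsoFields;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not code from the talk.
final class TimeBuckets {
    private static String fmt(LocalDateTime t, String pattern) {
        return t.format(DateTimeFormatter.ofPattern(pattern, Locale.ROOT));
    }

    // Returns granularity name -> key, coarsest first.
    static Map<String, String> keysFor(LocalDateTime t) {
        Map<String, String> keys = new LinkedHashMap<>();
        keys.put("Quarter", "Q" + t.get(IsoFields.QUARTER_OF_YEAR));
        keys.put("Month", fmt(t, "yyyyMM"));
        // Assumption: week keys use the ISO-8601 week-based year and week number.
        keys.put("Week", String.format(Locale.ROOT, "%d%02d",
                t.get(IsoFields.WEEK_BASED_YEAR),
                t.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR)));
        keys.put("Day", fmt(t, "yyyyMMdd"));
        keys.put("Hour", fmt(t, "yyyyMMddHH"));
        keys.put("Minute", fmt(t, "yyyyMMddHHmm"));
        keys.put("Second", fmt(t, "yyyyMMddHHmmss"));
        return keys;
    }

    public static void main(String[] args) {
        // 09/10/2011 11:12:13, read as September 10, 2011.
        keysFor(LocalDateTime.of(2011, 9, 10, 11, 12, 13))
                .forEach((name, key) -> System.out.println(name + " " + key));
    }
}
```

Because every key is a zero-padded string, keys of the same granularity sort chronologically as plain text.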
9
Metrics For Different Wants
Yahoo! + iOS + 4.3.4 + iPad2,1 + 768x1024
Yahoo! + videoPlayPercent = 30 + Taste = great
Yahoo! + Taste = great
Yahoo! + videoPlayPercent = 30
iPad2,1 + videoPlayPercent = 30 + Taste = great
768x1024 + videoPlayPercent = 30 + Taste = great
iOS + 4.3.4 + iPad2,1
10
Medialets
Implementation Over Cassandra
11
Storing the time series
Column Families hold your rows of data. The key of each row in each column family is the time period you are dealing with. So an “event” occurring at 09/10/2011 12:13:14 will become 4 rows, one per column family:
BySecond = 20110910121314
ByMinute = 201109101213
ByHour = 2011091012
ByDay = 20110910

CREATE COLUMN FAMILY ByDay
  WITH default_validation_class=CounterColumnType
  AND key_validation_class=UTF8Type AND comparator=UTF8Type;

CREATE COLUMN FAMILY ByHour
  WITH default_validation_class=CounterColumnType
  AND key_validation_class=UTF8Type AND comparator=UTF8Type;

CREATE COLUMN FAMILY ByMinute
  WITH default_validation_class=CounterColumnType
  AND key_validation_class=UTF8Type AND comparator=UTF8Type;

CREATE COLUMN FAMILY BySecond
  WITH default_validation_class=CounterColumnType
  AND key_validation_class=UTF8Type AND comparator=UTF8Type;
12
Why multiple column families?
http://www.datastax.com/docs/1.0/configuration/storage_configuration
13
Generically group by
• app+platform+osversion+device+resolution
• app+event1+event2
• app+event1
• app+event2
• device+event1+event2
• resolution+event1+event2
• platform+osversion+device
14
As columns – names are composites
• app+platform+osversion+device+resolution#Yahoo!+iOS+4.3.4+iPad2,1+768x1024
• app+event1+event2#Yahoo!+videoPlayPercent=30+Taste=great
• app+event1#Yahoo!+Taste=great
• app+event2#Yahoo!+videoPlayPercent=30
• device+event1+event2#iPad2,1+videoPlayPercent=30+Taste=great
• resolution+event1+event2#768x1024+videoPlayPercent=30+Taste=great
• platform+osversion+device#iOS+4.3.4+iPad2,1
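A composite name is the group-by definition, a "#", and then the matching values joined with "+". A minimal sketch of that rule follows. The class CompositeColumns and its method are hypothetical, and event1 is assumed to stand for videoPlayPercent=30 and event2 for Taste=great, as in the three-part names above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical helper, not code from the talk.
final class CompositeColumns {
    // groupBy is an ordered list of dimension names, e.g. [app, event1, event2].
    static String columnName(List<String> groupBy, Map<String, String> event) {
        List<String> values = new ArrayList<>();
        for (String dimension : groupBy) {
            values.add(Objects.requireNonNull(event.get(dimension), "missing " + dimension));
        }
        return String.join("+", groupBy) + "#" + String.join("+", values);
    }

    public static void main(String[] args) {
        Map<String, String> event = new LinkedHashMap<>();
        event.put("app", "Yahoo!");
        event.put("platform", "iOS");
        event.put("osversion", "4.3.4");
        event.put("device", "iPad2,1");
        event.put("resolution", "768x1024");
        event.put("event1", "videoPlayPercent=30"); // assumed mapping of event1
        event.put("event2", "Taste=great");         // assumed mapping of event2

        List<List<String>> groupBys = Arrays.asList(
                Arrays.asList("app", "platform", "osversion", "device", "resolution"),
                Arrays.asList("app", "event1", "event2"),
                Arrays.asList("platform", "osversion", "device"));
        for (List<String> groupBy : groupBys) {
            System.out.println(columnName(groupBy, event));
        }
    }
}
```

Keeping the group-by definition at the front of the name means all columns for one kind of aggregate sit next to each other when the names are sorted.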
15
The rows
• ByHour=2011091011
  – app+platform+osversion+device+resolution#Yahoo!+iOS+4.3.4+iPad2,1+768x1024
  – app+event1+event2#Yahoo!+videoPlayPercent=30+Taste=great
  – app+event1#Yahoo!+Taste=great
  – app+event2#Yahoo!+videoPlayPercent=30
  – device+event1+event2#iPad2,1+videoPlayPercent=30+Taste=great
  – resolution+event1+event2#768x1024+videoPlayPercent=30+Taste=great
  – platform+osversion+device#iOS+4.3.4+iPad2,1
• ByDay=20110910
  – app+platform+osversion+device+resolution#Yahoo!+iOS+4.3.4+iPad2,1+768x1024
  – app+event1+event2#Yahoo!+videoPlayPercent=30+Taste=great
  – app+event1#Yahoo!+Taste=great
  – app+event2#Yahoo!+videoPlayPercent=30
  – device+event1+event2#iPad2,1+videoPlayPercent=30+Taste=great
  – resolution+event1+event2#768x1024+videoPlayPercent=30+Taste=great
  – platform+osversion+device#iOS+4.3.4+iPad2,1
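To make the layout concrete, here is an in-memory stand-in for counter column families. It is not Cassandra or Hector code: the class CounterStore and its methods are hypothetical, and nested sorted maps only mimic the column family, row and column structure. One event increments the same column names in every time row it belongs to.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration only; a real deployment would use Cassandra counters.
final class CounterStore {
    // column family -> row key -> column name -> count
    private final Map<String, Map<String, Map<String, Long>>> data = new TreeMap<>();

    void increment(String columnFamily, String rowKey, String columnName, long by) {
        data.computeIfAbsent(columnFamily, cf -> new TreeMap<>())
            .computeIfAbsent(rowKey, row -> new TreeMap<>())
            .merge(columnName, by, Long::sum);
    }

    long get(String columnFamily, String rowKey, String columnName) {
        return data.getOrDefault(columnFamily, Collections.emptyMap())
                   .getOrDefault(rowKey, Collections.emptyMap())
                   .getOrDefault(columnName, 0L);
    }

    // rowKeys maps column family -> row key for one event's timestamp.
    void record(Map<String, String> rowKeys, List<String> columnNames) {
        for (Map.Entry<String, String> e : rowKeys.entrySet()) {
            for (String column : columnNames) {
                increment(e.getKey(), e.getValue(), column, 1L);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> rowKeys = new LinkedHashMap<>();
        rowKeys.put("ByHour", "2011091011");
        rowKeys.put("ByDay", "20110910");
        List<String> columns = Arrays.asList(
                "app+event1#Yahoo!+Taste=great",
                "platform+osversion+device#iOS+4.3.4+iPad2,1");

        CounterStore store = new CounterStore();
        store.record(rowKeys, columns);
        store.record(rowKeys, columns); // a second identical event
        System.out.println(store.get("ByDay", "20110910", "app+event1#Yahoo!+Taste=great"));
    }
}
```

The write cost grows with (number of time granularities) x (number of group-bys), which is the trade made to keep reads down to a single row slice.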
16
Inserting data with Hector
• mutator.insertCounter("20110910", "ByDay", HFactory.createCounterColumn("app+platform+osversion+device+resolution#Yahoo!+iOS+4.3.4+iPad2,1+768x1024", 1))
• mutator.insertCounter("20110910", "ByDay", HFactory.createCounterColumn("app+event1+event2#Yahoo!+videoPlayPercent=30+Taste=great", 1))
• mutator.insertCounter("20110910", "ByDay", HFactory.createCounterColumn("app+event1#Yahoo!+Taste=great", 1))
• mutator.insertCounter("20110910", "ByDay", HFactory.createCounterColumn("app+event2#Yahoo!+videoPlayPercent=30", 1))
• mutator.insertCounter("20110910", "ByDay", HFactory.createCounterColumn("device+event1+event2#iPad2,1+videoPlayPercent=30+Taste=great", 1))
• mutator.insertCounter("20110910", "ByDay", HFactory.createCounterColumn("resolution+event1+event2#768x1024+videoPlayPercent=30+Taste=great", 1))
• mutator.insertCounter("20110910", "ByDay", HFactory.createCounterColumn("platform+osversion+device#iOS+4.3.4+iPad2,1", 1))
17
Inserting data with Skeletor
Skeletor is the Scala wrapper of Hector for Cassandra
https://github.com/joestein/skeletor

aggregateColumnNames("AppPlatformOSVersionDeviceResolution") = "app+platform+osversion+device+resolution#"

def ccAppPlatformOSVersionDeviceResolution(c: (String) => Unit) = {
  c(aggregateColumnNames("AppPlatformOSVersionDeviceResolution") + app + p(platform) + p(osversion) + p(device) + p(resolution))
}

//rows we are going to write to
aggregateKeys(KEYSPACE \ "ByMonth") = month //201109
aggregateKeys(KEYSPACE \ "ByDay") = day //20110910
aggregateKeys(KEYSPACE \ "ByHour") = hour //2011091012
aggregateKeys(KEYSPACE \ "ByMinute") = minute //201109101213

def r(columnName: String): Unit = {
  aggregateKeys.foreach { tuple: (ColumnFamily, String) => {
    val (columnFamily, row) = tuple
    if (row != null && row.size > 0)
      rows add (columnFamily -> row has columnName inc) //increment the counter
  }}
}

ccAppPlatformOSVersionDeviceResolution(r)
18
Retrieving Data
MultigetSliceCounterQuery
• setColumnFamily("ByDay")
• setKeys("20110910")
• setRange("app+event1#", "app+event1#~", false, 1000)
• We will get all the apps and counts for event1
• setRange("app+event2#", "app+event2#~", false, 1000)
• We will get all the apps and the counts for event2
By app tastes great vs less filling
• Sample code for the aggregate metrics and retrieving them https://github.com/joestein/apophis
• What is with the tilde?
19
Sort for success
Not magic, just Cassandra
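The tilde works because column names are kept sorted, and "~" is one of the last printable ASCII characters. A slice from a prefix to the prefix plus "~" therefore covers every column name that starts with that prefix and continues with ordinary printable ASCII. The sketch below imitates this with a Java TreeMap rather than Cassandra. The class PrefixSlice, the counts, and the app name OtherApp are made up for illustration, and the column names use the "#" separator from the earlier column slides.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustration with a sorted map; not a Cassandra or Hector call.
final class PrefixSlice {
    // Columns whose names fall in [prefix, prefix + "~"], in sorted order.
    // Assumes names are printable ASCII, so nothing after the prefix sorts above '~'.
    static NavigableMap<String, Long> slice(NavigableMap<String, Long> row, String prefix) {
        return row.subMap(prefix, true, prefix + "~", true);
    }

    public static void main(String[] args) {
        NavigableMap<String, Long> byDay = new TreeMap<>();
        byDay.put("app+event1#Yahoo!+Taste=great", 5L);            // made-up count
        byDay.put("app+event1#OtherApp+Taste=great", 2L);          // hypothetical app
        byDay.put("app+event1+event2#Yahoo!+videoPlayPercent=30+Taste=great", 3L);
        byDay.put("app+event2#Yahoo!+videoPlayPercent=30", 4L);
        byDay.put("platform+osversion+device#iOS+4.3.4+iPad2,1", 9L);

        slice(byDay, "app+event1#")
                .forEach((name, count) -> System.out.println(name + " = " + count));
    }
}
```

Note that "app+event1+event2#..." is excluded: at the position of the "#" it has a "+", which sorts after "#", so it falls past the end of the range.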
20
A few more things about retrieving data
• You need to start backwards from here.
• If you want to do things ad hoc, then map/reduce is better
• Sometimes more rows are better, allowing more nodes to do work
  – If you need to look at 100,000 metrics it is better to pull this out of 100 rows than out of 1
  – Don’t be afraid to make CF and composite keys out of Time+Aggregate data
    • 20111023+app=Yahoo!
    • This could be the row that holds ALL of the app information for that day. If you want to look at 100 apps at once, with 1000 metrics for each per time period, this could be the way to go
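A small sketch of the Time+Aggregate row key idea: one row per app per day, so a request for many apps becomes many row keys that can be fetched together. The class TimeAggregateKeys and its methods are hypothetical, and the second app name is invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not code from the talk.
final class TimeAggregateKeys {
    // Builds a key such as 20111023+app=Yahoo!
    static String rowKey(String timeBucket, String dimension, String value) {
        return timeBucket + "+" + dimension + "=" + value;
    }

    // One row key per value, e.g. one per app for the same day.
    static List<String> rowKeys(String timeBucket, String dimension, List<String> values) {
        List<String> keys = new ArrayList<>();
        for (String value : values) {
            keys.add(rowKey(timeBucket, dimension, value));
        }
        return keys;
    }

    public static void main(String[] args) {
        // "OtherApp" is an invented name used only for illustration.
        List<String> keys = rowKeys("20111023", "app", Arrays.asList("Yahoo!", "OtherApp"));
        keys.forEach(System.out::println);
    }
}
```

Since rows are distributed across the cluster by key, spreading the metrics over one row per app lets more nodes share the read.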
Medialets
The rich media ad platform for mobile.
21
Q & A
/*
 * Joe Stein
 * http://www.linkedin.com/in/charmalloc
 * @allthingshadoop
 * @cassandranosql
 * @allthingsscala
 * @charmalloc
 * http://github.com/joestein
 */