Transforming Mobile Push Notifications with Big Data


DESCRIPTION

How we at Plumbee collect and process data at scale and how this data is used to send relevant mobile push notifications to our players to keep them engaged. Presented as part of a Tech Talk: http://engineering.plumbee.com/blog/2014/11/07/tech-talk-push-notifications-big-data/

TRANSCRIPT

Transforming Mobile Push Notifications with Big Data
Dennis Waldron, Data Engineering
Pablo Varela, Systems Engineering

Who is Plumbee?

● 12.8M Installs
● 209K Daily Active Users
● 818K Monthly Active Users

● Social Games Studio
● Mirrorball Slots & Bingo
● Facebook Canvas, iOS

Data Providers

In-house data = 99.9% of all data

In Total:

● 98TB (907 days of data)
● All stored in Amazon S3

Daily:

● 78GB compressed
● ~450M events/day
● 4,800 events/second (peak)

Architecture - Overview

[Diagram: End Users (Desktop & Mobile) → Application/Game Servers → Events (JSON) → SQS Analytics Queue → Log Aggregators → Events (JSON) → Amazon S3 (Simple Storage Service) → Amazon EMR (Elastic MapReduce) → Amazon Redshift → Analytics (SQL Queries) by Plumbee Employees. Daily batch processing and the resulting aggregates are orchestrated by DataPipeline, all within Amazon Web Services.]

Game Data

● Collect everything!
● RPC events intercepted by annotated endpoints (Requests)
● All mutating state changes recorded: DynamoDB, MySQL, Memcache (Blob Updates)
● Custom telemetry (Other):
  ○ Client: click tracking, loading time statistics, GPU data...
  ○ Server: promotions, transactions, Facebook user data...

[Chart: event volume by category — RPC 77%, Other 15%, blob updates generated by DynamoDB/MySQL/Memcache 9%]

Game Data - Example RPC Endpoint Annotation

/**
 * Example annotation
 */
@SQSRequestLog(requestMessage = SpinRequest.class)
@RequestMapping("/spin")
public SpinResponse spin(SpinRequest spinRequest) {

}

Example Event - userStats

● All events are recorded in JSON.
● Structure:
  ○ Headers
  ○ Categorization Data (metadata)
  ○ Payload (message)
● Important Headers:
  ○ timestamp
  ○ testVariant
  ○ plumbeeUid
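Purely as an illustration of that shape (the three headers are from the slide; every other field name and all values below are invented), a userStats event might look something like:

{
  "timestamp": 1415318400000,
  "testVariant": "control",
  "plumbeeUid": 466264,
  "type": "userStats",
  "subType": "userStats-update",
  "payload": {
    "level": 12,
    "coinBalance": 150000
  }
}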

Architecture - Collection

[Same pipeline diagram, highlighting the collection stage: End Users → Application/Game Servers → Events (JSON) → SQS Analytics Queue → Log Aggregators → Amazon S3]

Data Collection (I) - PUT

[Diagram: Application/Game Servers (producers) → Events (JSON) → SQS Queue → Log Aggregators (consumers)]

What is SQS (Simple Queue Service)?

A cloud-based message queue for transmitting messages between producers and consumers

SQS Provides:

● ACK/FAIL semantics
● Unlimited number of messages
● Scales transparently
● Buffer zone
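As a rough sketch of the producer side (not code from the talk: the queue URL, class name, and use of the AWS SDK for Java are all assumptions), pushing a JSON event onto the analytics queue could look like this:

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.SendMessageRequest;

public class AnalyticsEventProducer {

    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

    // Placeholder queue URL - the real analytics queue is not named in the talk
    private static final String QUEUE_URL =
            "https://sqs.us-east-1.amazonaws.com/123456789012/analytics-queue";

    /** Sends a single JSON-encoded event to the analytics queue. */
    public void send(String jsonEvent) {
        sqs.sendMessage(new SendMessageRequest()
                .withQueueUrl(QUEUE_URL)
                .withMessageBody(jsonEvent));
    }
}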

What is Apache Flume?

A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

[Diagram: SQS Queue → Apache Flume (consumer)]

Data Collection (II) - GET

Amazon S3 (Simple Storage Service)

S3 Data:

● Partitioned by: date / type / sub_type
● Compressed with: Snappy
● Aggregated in 512MB chunks

● Pluggable component architecture
● Durability via transactions
● File channels use Elastic Block Store (EBS) volumes (network-attached storage)
  ○ Protects against hardware failure
● SQS Flume Plugin: https://github.com/plumbee/flume-sqs-source

Data Collection (III) - Flume

[Diagram: Flume Agent — SQS Queue → Source (Custom) → Channel (File Based) → Sink (HDFS) → S3 Bucket. Each hop (A, B, C) is wrapped in a transaction, and A + B + C make up the flow.]

Architecture - Processing

[Same pipeline diagram, highlighting the daily batch processing stage: Amazon S3 → Amazon EMR (Elastic MapReduce) → aggregates → Amazon Redshift, orchestrated by DataPipeline]

● Daily activity
● Orchestrated by Amazon DataPipeline
● Includes generation of reports
● Configured with JSON

What is DataPipeline?

A cloud-based data workflow service that helps you process and move data between different AWS services.

Extract, Transform, Load

[Diagram: DataPipeline building blocks — schedule, command, resource]

What is Elastic Map Reduce?

A cloud-based MapReduce implementation for processing vast amounts of data, built on top of the open-source Hadoop framework.

Two phases:

● Map() procedure -> filtering & sorting
● Reduce() -> summary operation

Extract & Transform (I)

[Diagram: word-count example — raw data (Penguin, Horse, Cake, Cake, Penguin, Penguin, Penguin, Horse, Horse) goes through map(), is sorted into per-key queues, and reduce() emits the result: Penguin: 4, Horse: 3, Cake: 2]

What is Hive?

An open-source Apache project which provides a SQL-like interface to summarize, query, and analyze large datasets by leveraging Hadoop's MapReduce infrastructure.

● Not really SQL: HQL -> HiveQL
● No transactions, no materialized views, limited subquery support, ...

Extract & Transform (II)

SELECT plumbeeuid, COUNT(*) AS spins
FROM eventlog
-- Partitioned data access
WHERE event_date = '2014-11-18'
  AND event_type = 'rpc'
  AND event_sub_type = 'rpc-spin'
-- Aggregation
GROUP BY plumbeeuid;

Table: Eventlog

● Mounted on top of raw data
● SerDe provides JSON parsing
● Target data via partition filters

● Hive has limitations!
  ○ Speed, JSON
● Most of our transformations use: Streaming MapReduce Jobs

What is Streaming?

“A Hadoop utility that allows you to create and run MapReduce jobs using any executable script as a mapper or reducer”

Extract & Transform (III)

# mapper.py - run via Hadoop Streaming (-mapper); emits one (plumbeeUid, 1) pair per rpc-spin event
import json
import sys

for line in sys.stdin:
    data = json.loads(line)
    print str(data['plumbeeUid']) + '\t' + '1'

# reducer.py - run via Hadoop Streaming (-reducer); sums the counts for each plumbeeUid
import sys
from collections import defaultdict

results = defaultdict(int)

for line in sys.stdin:
    plumbee_uid, count = line.split('\t')
    results[plumbee_uid] += int(count)

print results

Emits key-value pairs: 466264 => 1, 376166 => 1, 983131 => 1, 466264 => 1

Hadoop sorts and shuffles the data making sure matching keys are processed by a single reducer!

JSON rpc-spin Data

Result: { 466264: 2, 376166: 1, 983131: 1 }


Results

EMR Transformed data:

● Referred to as aggregates
● Stored in S3
● Accessible via EMR cluster

Load (I) - Problem

[Diagram: Raw S3 JSON Data → EMR Transformation (Hive & Streaming Jobs) → Aggregated Data; 5.4TB]

Problem

● We don't run long-lived EMR clusters.
● EMR requires specialist knowledge.
● EMR is slow; processing and cluster boot-up happen "offline".

Use Amazon Redshift for fast “online” data access

What is Redshift?

A column-oriented database which uses Massively Parallel Processing (MPP) techniques to support analytics-style, SQL-based workloads across large datasets.

Power comes from:

● Query parallelization
● Column-oriented design

Redshift Provides:

● Low-latency JDBC and ODBC access
● Fault tolerance
● Automated backups
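Because Redshift exposes standard JDBC, the same aggregate-style SQL can be run from code as well as from analysts' tools. A minimal sketch (the cluster endpoint, credentials, and the daily_spins table are invented for illustration; it assumes the Redshift or PostgreSQL JDBC driver is on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RedshiftQueryExample {

    public static void main(String[] args) throws Exception {
        // Placeholder cluster endpoint, database and credentials
        String url = "jdbc:redshift://example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/analytics";

        try (Connection conn = DriverManager.getConnection(url, "analytics_user", "secret");
             Statement stmt = conn.createStatement();
             // Aggregate over a hypothetical daily_spins table loaded from the S3 aggregates
             ResultSet rs = stmt.executeQuery(
                     "SELECT plumbeeuid, SUM(spins) AS total_spins "
                   + "FROM daily_spins GROUP BY plumbeeuid ORDER BY total_spins DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getLong("plumbeeuid") + " -> " + rs.getLong("total_spins"));
            }
        }
    }
}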

Load (II) - Redshift

Redshift (x3 nodes): 0.33s
EMR (x20 nodes): 135.46s

Load (II) - Column-Oriented Databases

ID | First Name | Last Name | Country
 1 | Penguin    | Situation | GB
 2 | Cheese     | Labs      | US
 3 | Horse      | Barracks  | GB

Row-oriented Database - MySQL (stores the table above row by row)

● Easy to add/modify records
● Could read irrelevant data
● Great for fast lookups (OLTP)

Column-oriented Database - Redshift (stores the same table column by column)

● Only reads in relevant data
● Adding rows requires multiple updates to column data
● Great for aggregation queries (OLAP)

Architecture - Revisit

[Full pipeline diagram again: End Users (Desktop & Mobile) → Application/Game Servers → Events (JSON) → SQS Analytics Queue → Log Aggregators → Amazon S3 → Amazon EMR → aggregates → Amazon Redshift → Analytics (SQL Queries) by Plumbee Employees, with daily batch processing orchestrated by DataPipeline]

Q&A

Targeted Push Notifications

Mirrorball Slots: Kingdom of Riches

Mirrorball Slots: Challenges

● Recurring timed event
● Collect symbols from non-winning spins
● Get free coins if enough symbols are collected

Some players ask for notifications

Use Cases

Building blocks

Data Collection

[Diagram: Players → Data Collection → Amazon Redshift]

Architecture - Overview

[Diagram: collected player data lands in Amazon S3 and Amazon Redshift; a Trigger starts the Segmentation Workers (Targeting) and Batch Processors, and the Publisher delivers the Mobile Push to Players via Amazon SNS]

User Targeting

User targeting

Run SQL queries directly against Redshift

[Diagram: SQL Query → Amazon Redshift → User Segment]

User targeting: Query example

-- Target all mobile users
SELECT plumbee_uid, arn
FROM mobile_user;

User targeting: Query example (II)

-- Target lapsed users (1 week lapse)
SELECT plumbee_uid, arn
FROM mobile_user
WHERE last_play_time < DATEADD(day, -7, GETDATE());

Architecture - Mobile Push

[Same diagram, highlighting the mobile push stage: Batch Processors → Publisher → Amazon SNS → Players]

Amazon Simple Notification Service

What is SNS?

“Amazon Simple Notification Service (Amazon SNS) is a fast, flexible, fully managed push messaging service”


Amazon SNS: Device Registration

[Diagram: Players register their device with the Game Servers; the servers register the device endpoint with Amazon SNS and emit a registration event to the SQS Analytics Queue, which ends up in Amazon Redshift]

Amazon SNS: ARN Retrieval

private String getArnForDeviceEndpoint(String platformApplicationArn, String deviceToken) {
    CreatePlatformEndpointRequest request = new CreatePlatformEndpointRequest()
            .withPlatformApplicationArn(platformApplicationArn)
            .withToken(deviceToken);

    CreatePlatformEndpointResult result = snsClient.createPlatformEndpoint(request);
    return result.getEndpointArn();
}

Amazon SNS: Analytics Event

private String registerEndpointForApplicationAndPlatform(final long plumbeeUid,
        String platformARN, String platformToken) {
    final String deviceEndpointARN = getArnForDeviceEndpoint(platformARN, platformToken);

    // platformName identifies the push provider; it is defined elsewhere in the class (not shown on the slide)
    sqsLogger.queueMessage(new HashMap<String, Object>() {{
        put("notification", "register");
        put("plumbeeUid", plumbeeUid);
        put("provider", platformName);
        put("endpoint", deviceEndpointARN);
    }}, null);

    return deviceEndpointARN;
}

Amazon SNS: Mobile Push

private void publishMessage(UserData userData, String jsonPayload) {
    amazonSNS.publish(new PublishRequest()
            .withTargetArn(userData.getEndpoint())
            .withMessageStructure("json")
            .withMessage(jsonPayload));
}

{"default": "The 5 day Halloween Challenge has started today! Touch to play NOW!"}

Payload example
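The slide only shows the "default" text; when a message is published with MessageStructure "json", SNS also accepts per-platform keys (e.g. "GCM", "APNS") whose values are the platform payloads embedded as JSON strings. A sketch of building such a payload, assuming Jackson is available (this helper is not from the talk):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import com.fasterxml.jackson.databind.ObjectMapper;

public class PushPayloadBuilder {

    private final ObjectMapper mapper = new ObjectMapper();

    /** Builds an SNS message body for MessageStructure "json": default text plus per-platform payloads. */
    public String build(String text) throws Exception {
        Map<String, String> message = new HashMap<String, String>();
        message.put("default", text);
        // Platform-specific payloads are themselves JSON documents, embedded as strings
        message.put("GCM", mapper.writeValueAsString(
                Collections.singletonMap("data", Collections.singletonMap("message", text))));
        message.put("APNS", mapper.writeValueAsString(
                Collections.singletonMap("aps", Collections.singletonMap("alert", text))));
        return mapper.writeValueAsString(message);
    }
}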

Architecture - Orchestration

[Same diagram, highlighting the orchestration stage: the Trigger, Segmentation Workers, Batch Processors and Publisher that sit between Amazon S3/Redshift and Amazon SNS]

Amazon Simple Workflow

What is Amazon SWF?

“Amazon Simple Workflow (Amazon SWF) is a task coordination and state management service for cloud applications.”

What Amazon SWF provides

● Consistent execution state management
● Tracking of workflow executions and tasks
● Non-duplicated dispatch of tasks
● Task routing and queuing
● The AWS Flow Framework


Mobile Push: Scheduling

[Diagram: Trigger → Publish Service → Amazon Simple Workflow]
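A rough sketch of what the trigger/publish step could look like when it kicks off a push workflow using the plain SWF client (the domain, workflow type, and input are placeholders; the talk mentions the AWS Flow Framework, which may be used instead of this lower-level call):

import com.amazonaws.services.simpleworkflow.AmazonSimpleWorkflow;
import com.amazonaws.services.simpleworkflow.AmazonSimpleWorkflowClientBuilder;
import com.amazonaws.services.simpleworkflow.model.StartWorkflowExecutionRequest;
import com.amazonaws.services.simpleworkflow.model.WorkflowType;

public class PushWorkflowTrigger {

    private final AmazonSimpleWorkflow swf = AmazonSimpleWorkflowClientBuilder.defaultClient();

    /** Starts one push-notification workflow execution; SWF tracks its state from here on. */
    public void trigger(String campaignId, String targetingSql) {
        swf.startWorkflowExecution(new StartWorkflowExecutionRequest()
                .withDomain("mobile-push")                      // placeholder domain
                .withWorkflowId("push-" + campaignId)           // must be unique per execution
                .withWorkflowType(new WorkflowType()
                        .withName("PushNotificationWorkflow")   // placeholder workflow type
                        .withVersion("1.0"))
                .withInput(targetingSql));                      // e.g. the Redshift targeting query
    }
}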

Mobile Push: Targeting

[Diagram: Amazon SWF dispatches a query task to a Segmentation Worker on Amazon EC2; the worker queries Amazon Redshift and writes the target users to Amazon S3]

Mobile Push: Processing

[Diagram: Amazon SWF publishes batches 1-N to Processing Workers, which read the targeting data and push notifications to end users]

Mobile Push: Reporting

[Diagram: Amazon SWF dispatches a send task to a Reporting Worker on Amazon EC2, which sends the report via Amazon SES]
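A minimal sketch of the reporting step with the SES client (sender, recipient, and subject are placeholders; the actual report content and format are not shown in the talk):

import com.amazonaws.services.simpleemail.AmazonSimpleEmailService;
import com.amazonaws.services.simpleemail.AmazonSimpleEmailServiceClientBuilder;
import com.amazonaws.services.simpleemail.model.Body;
import com.amazonaws.services.simpleemail.model.Content;
import com.amazonaws.services.simpleemail.model.Destination;
import com.amazonaws.services.simpleemail.model.Message;
import com.amazonaws.services.simpleemail.model.SendEmailRequest;

public class PushReportMailer {

    private final AmazonSimpleEmailService ses = AmazonSimpleEmailServiceClientBuilder.defaultClient();

    /** Emails a plain-text delivery summary once all notification batches have been pushed. */
    public void sendReport(String summary) {
        ses.sendEmail(new SendEmailRequest()
                .withSource("push-reports@example.com")                        // placeholder sender
                .withDestination(new Destination().withToAddresses("data-team@example.com"))
                .withMessage(new Message()
                        .withSubject(new Content("Mobile push delivery report"))
                        .withBody(new Body().withText(new Content(summary)))));
    }
}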

Demo (II)

Q&A
