Processing 19 billion messages in real time and not dying in the process
TRANSCRIPT
Juan Martín Pampliega
Fernando Otero
Procesando 19 billones de mensajes en tiempo real sin morir en el intento
(Processing 19 billion messages in real time without dying in the attempt)
About these 2 guys ;-)

Juan (@juanpampliega) — Software Engineer. Sr. Data Engineer at Jampp; Assistant Professor @ITBA. Fav Beer: Paulaner Hefe-Weißbier.

Fernando (@fotero) — Computer Scientist. Principal Engineer at Jampp; Spark-Jobserver committer; user of and contributor to Spark since 1.0. Fav Beer: Guinness.
About Jampp
We are a tech company that helps companies grow their mobile business by driving engaged users to their apps.
We are a team of 75 people, 30% of them in engineering.
Located in 6 cities across the US, Latin America, Europe and Africa.
Machine learning
Post-install event optimisation
Dynamic Product Ads and Segments
Data Science
Programmatic Buying
We have to be fast!
Our platform combines machine learning with big data for programmatic ad buying which optimizes towards in-app activity.
We process 220,000+ RTB ad bid requests per second (19+ billion per day), which amounts to about 40 TB of data per day.
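Those two figures are consistent; a quick sanity check (our arithmetic, not from the slides):

```python
# 220,000 bid requests/second sustained over a full day:
requests_per_day = 220_000 * 60 * 60 * 24
print(requests_per_day)  # 19,008,000,000 — matching the "19+ billion per day" on the slide
```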
[Diagram: tracking flow — Exchange → Jampp Tracking → Tracking Platform → AppStore / Google Play → Postback]
How does programmatic advertising work?
RTB: Real Time Bidding
[Diagram: RTB auction flow]
1. An ad request reaches the exchange.
2. The exchange sends OpenRTB bid requests to bidders (100 ms deadline).
3. Bidders respond: OpenRTB bid $3, OpenRTB bid $2.
4. Win notification: $2.01 (second price: the $3 bidder pays just above the $2 runner-up).
5. The ad is served (auction won).
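The win notification above follows second-price auction mechanics: the highest bidder wins but pays the runner-up's bid plus a minimum increment. A minimal sketch of that settlement rule (our illustration, not Jampp's code or the OpenRTB spec):

```python
def settle_second_price(bids, increment=0.01):
    """Return (winner, clearing_price) for a second-price auction:
    the highest bidder wins and pays the runner-up's bid plus a
    small increment (e.g. bids of $3 and $2 clear at $2.01)."""
    if not bids:
        return None, None
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner, top = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else top
    # Never charge above the winner's own bid.
    return winner, round(min(top, runner_up + increment), 2)
```

For the flow in the diagram, `settle_second_price({"bidder_a": 3.00, "bidder_b": 2.00})` clears at $2.01 for bidder_a.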
Data @ Jampp
● Started using RDBMSs on Amazon Web Services.
● Data grew exponentially.
● Last year: 1500%+ growth in in-app events, 500%+ in RTB bids.
● We needed to evolve our architecture.
Initial Jampp Infrastructure

[Diagram: initial data architecture — collectors C1…Cn ("Cupper") behind a load balancer write clicks, installs and events to MySQL; bidders B1…Bn feed auctions, bids and impressions through a replicator into PostgreSQL; an API (pivots) and user segments sit on top.]
Emerging Needs
● Increase log forensics capabilities.
● More historical and more granular data.
● Make data readily available to other systems.
● MySQL was impossible to scale easily.
Evolution of our Data Architecture

[Diagram: initial evolution — collectors C1…Cn ("Cupper") behind a load balancer still write clicks, installs and events to MySQL; click-redirect and ELB logs are shipped to an EMR Hadoop cluster queried through AirPal; user segments feed back into the platform.]
New System Characteristics
● The new system was based on Amazon Elastic MapReduce using Facebook PrestoDB and Apache Spark.
● Data imported hourly from RDBMSs with Sqoop.
● Logs are imported every 10 minutes from different sources to S3 tables.
● Scalable storage and processing capabilities using HDFS, S3, YARN and Hive for ETLs and data storage.
Aspects that needed improvement
● Data still imported in batch mode. Delay was larger for MySQL data than with the Python replicator.
● EMR is not great for long-running clusters.
● Data still accumulated in RDBMSs (clicks, installs, events).
● Lots of joins still needed for day-to-day data analysis work.
Current Architecture

Stream processing
● Continuous event processing of unbounded datasets.
● Makes data available in real time and evens out the load of data processing.
● Requires:
○ A persistent log for fault tolerance.
○ A scalable processing layer to deal with spikes in throughput.
○ Support for data enrichment and for handling out-of-order data.
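To make the out-of-order requirement concrete, here is a minimal, hypothetical event-time window accumulator (names, window size and lateness bound are ours, not Jampp's):

```python
from collections import defaultdict


def assign_window(ts, size=60):
    """Map an event timestamp (in seconds) to the start of its tumbling window."""
    return ts - ts % size


class WindowedCounter:
    """Counts events per 60-second event-time window, accepting late
    (out-of-order) events up to `allowed_lateness` seconds behind the
    maximum timestamp seen so far; anything later is dropped."""

    def __init__(self, size=60, allowed_lateness=120):
        self.size = size
        self.allowed_lateness = allowed_lateness
        self.max_ts = 0
        self.counts = defaultdict(int)

    def add(self, ts):
        self.max_ts = max(self.max_ts, ts)
        if ts < self.max_ts - self.allowed_lateness:
            return False  # too late: drop (or divert to a side output)
        self.counts[assign_window(ts, self.size)] += 1
        return True
```

An event with timestamp 15 arriving after one with timestamp 70 still lands in the correct window; one arriving more than 120 seconds behind the watermark is rejected.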
A Real-Time Architecture

Current state of the evolution
● Real-time event processing architecture based on best practices for stream processing on AWS.
● Uses Amazon Kinesis for durable streaming data and AWS Lambda for data processing.
● DynamoDB and Redis are used as temporary data stores for enrichment and analytics.
● S3 provides a source of truth for batch data applications, and Kinesis for stream processing.
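A Kinesis-triggered Lambda function receives records base64-encoded in the trigger event. A minimal sketch of such a handler (the enrichment step is our hypothetical example, not Jampp's actual processing):

```python
import base64
import json


def handler(event, context=None):
    """Hypothetical AWS Lambda handler for a Kinesis trigger: decodes each
    record's base64 payload and tags it with its shard id. In a pipeline
    like the one described, results would go to DynamoDB/Redis/S3."""
    enriched = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Illustrative enrichment: the eventID is "<shardId>:<sequenceNumber>".
        payload["shard_id"] = record["eventID"].split(":")[0]
        enriched.append(payload)
    return enriched
```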
Our Real-Time Architecture
Benefits of the Evolution
● Enables the use of stream processing frameworks to keep data as fresh as economically possible.
● Decouples data from processing, enabling multiple Big Data engines to run on different clusters/infrastructure.
● Easy on-demand scaling provided by AWS managed services such as Lambda, DynamoDB and EMR.
● Monitoring, logs and alerts managed by AWS CloudWatch.
Still, it isn’t perfect...
● There is no easy way to manage windows and out-of-order data with AWS Lambda.
● Consistency semantics of DynamoDB and S3.
● Price of AWS managed services at large event volumes, compared to custom-maintained solutions.
● The ACID guarantees of RDBMSs are not an easy thing to part with.
● SQL and indexes in RDBMSs make forensics easier.
Lessons Learned
● Traditional data warehouses are still relevant. To cope with data volumes: lose detail.
● RDBMSs can fit a lot of use cases initially: unified log, OLAP, near real-time processing.
● Take full advantage of your public cloud provider’s offerings (rent first, build later).
● Development and staging for Big Data projects should involve production traffic, or be prepared for trouble.
● PrestoDB is really amazing with regard to performance, maturity and feature set.
● Kinesis, DynamoDB and Firehose use HTTPS as the transport protocol, which is slow and requires aggressive batching.
● All Amazon services require automatic retries with exponential back-off plus jitter.
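That last lesson can be sketched as a small retry helper using capped exponential back-off with "full jitter" (names, defaults and the retryable exception are our illustration, not an AWS SDK API):

```python
import random
import time


def with_backoff(call, max_attempts=5, base=0.05, cap=2.0,
                 retryable=(ConnectionError,), sleep=time.sleep):
    """Retry `call` with capped exponential back-off plus full jitter:
    before retry n, sleep a uniform random time in [0, min(cap, base * 2**n)].
    Re-raises the last error once max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retryable:
            if attempt == max_attempts - 1:
                raise
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The randomized sleep spreads retries out so that a burst of throttled clients does not hammer the service in lockstep.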
Questions?
geeks.jampp.com