Processing 19 billion messages in real time and NOT dying in the process


TRANSCRIPT

Page 1: Processing 19 billion messages in real time and NOT dying in the process

Juan Martín Pampliega

Fernando Otero


Page 2

About these 2 guys ;-)

Juan
Software Engineer
Sr. Data Engineer, Jampp
Assistant Professor @ ITBA
@juanpampliega
Fav Beer: Paulaner Hefe-Weißbier

Fernando
Computer Scientist
Principal Engineer, Jampp
Spark-Jobserver committer
User of & contributor to Spark since 1.0
@fotero
Fav Beer: Guinness

Page 3

About Jampp

We are a tech company that helps companies grow their mobile business by driving engaged users to their apps.

We are a team of 75 people, 30% of whom are on the engineering team.

Located in 6 cities across the US, Latin America, Europe and Africa.

Machine learning

Post-install event optimisation

Dynamic Product Ads and Segments

Data Science

Programmatic Buying

Page 4

We have to be fast!

Our platform combines machine learning with big data for programmatic ad buying that optimizes towards in-app activity.

We process 220,000+ RTB ad bid requests per second (19+ billion per day), which amounts to about 40 TB of data per day.
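A quick back-of-the-envelope check of those figures (the ~2 KB average payload per request is derived here, not a number from the deck):

```python
# Back-of-the-envelope check of the throughput figures above.
requests_per_second = 220000
requests_per_day = requests_per_second * 60 * 60 * 24
print(requests_per_day)                  # 19008000000 -> roughly 19 billion per day

bytes_per_day = 40e12                    # ~40 TB of data per day
print(bytes_per_day / requests_per_day)  # ~2100 bytes -> ~2 KB per request (derived)
```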

Page 5

How do programmatic ads work?

[Diagram: Exchange, Jampp, Tracking Platform, AppStore / Google Play, Postback]

Page 6

RTB: Real-Time Bidding

[Diagram: Ad Request → OpenRTB bid requests (100 ms) to bidders → bids of $3 and $2 → win notification at $2.01 → Ad (auction won)]
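The $3 and $2 bids clearing at $2.01 suggest a second-price auction. Below is a minimal sketch of that mechanic; it is not Jampp's or any exchange's actual code, and the bidder names and the $0.01 increment are assumptions.

```python
def run_auction(bids, increment=0.01):
    """Second-price auction: the highest bid wins but pays the second-highest bid plus an increment.

    bids: dict mapping bidder name -> bid price in dollars.
    Returns (winner, clearing_price), or (None, None) if there are no bids.
    """
    if not bids:
        return None, None
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner, top_bid = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else top_bid
    return winner, round(min(top_bid, runner_up + increment), 2)

# Matches the diagram: the $3 bid wins and is charged $2.01.
print(run_auction({"dsp_a": 3.00, "dsp_b": 2.00}))  # ('dsp_a', 2.01)
```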

Page 7

Data @ Jampp

● Started using RDBMSs on Amazon Web Services.

● Data grew exponentially.

● Last year: 1500%+ growth in in-app events, 500%+ growth in RTB bids.

● We needed to evolve our architecture.

Page 8

Initial Data Architecture

Page 9

Initial Jampp Infrastructure

[Diagram: Load Balancer → "Cupper" collectors C1…Cn (click, install, event) → MySQL; bidders B1, B2 … Bn → PostgreSQL (auctions, bids, impressions); Replicator; API (pivots); user segments]

Page 10

Emerging Needs

● Increase log-forensics capabilities.

● More historical and granular data.

● Make data readily available to other systems.

● MySQL was impossible to scale easily.

Page 11

Evolution of our Data Architecture

Page 12

Initial Evolution

[Diagram: Load Balancer → "Cupper" collectors C1…Cn (click, install, event, click redirect) → MySQL; ELB logs → EMR Hadoop cluster (nodes C1…Cn); AirPal; user segments]

Page 13

New System Characteristics

● The new system was based on Amazon Elastic MapReduce (EMR) using Facebook PrestoDB and Apache Spark (a query sketch follows after this list).

● Data is imported hourly from the RDBMSs with Sqoop.

● Logs are imported every 10 minutes from different sources into S3 tables.

● Scalable storage and processing capabilities using HDFS, S3, YARN and Hive for ETLs and data storage.
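As an illustration of how such an EMR/PrestoDB setup can be queried, here is a minimal sketch using the presto-python-client; the host, catalog, schema, table and column names are hypothetical, not taken from the talk.

```python
import prestodb  # pip install presto-python-client

# Connect to the PrestoDB coordinator on the EMR cluster
# (host, catalog, schema, table and columns below are hypothetical).
conn = prestodb.dbapi.connect(
    host="emr-presto-coordinator.internal",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
# Hourly click counts for one day from the Hive table fed by the hourly Sqoop import.
cur.execute("""
    SELECT date_trunc('hour', created_at) AS hour, count(*) AS clicks
    FROM clicks
    WHERE created_at >= timestamp '2017-04-01 00:00:00'
      AND created_at <  timestamp '2017-04-02 00:00:00'
    GROUP BY 1
    ORDER BY 1
""")
for hour, clicks in cur.fetchall():
    print(hour, clicks)
```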

Page 14

Aspects that needed improvement

● Data was still imported in batch mode. The delay was larger for MySQL data than with the Python Replicator.

● EMR is not great for long-running clusters.

● Data was still being accumulated in the RDBMSs (clicks, installs, events).

● Lots of joins were still needed for day-to-day data analysis work.

Page 15

Current Architecture

Page 16

Stream processing

● Continuous event processing of unbounded datasets.

● Makes data available in real time and evens out the load of data processing.

● Requires:

  ○ A persistent log for fault tolerance.

  ○ A scalable processing layer to deal with spikes in throughput.

  ○ Support for data enrichment and for handling out-of-order data (a toy sketch follows after this list).
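A framework-free toy sketch of the last requirement: tumbling event-time windows that tolerate out-of-order arrivals up to an allowed lateness. The window size, lateness and event times are made up for illustration.

```python
from collections import defaultdict

WINDOW = 60             # window size in seconds (assumed)
ALLOWED_LATENESS = 30   # how late an event may arrive and still be counted (assumed)

counts = defaultdict(int)   # window start -> event count
watermark = 0               # highest event time seen so far

def process(event_time):
    """Assign one event to its event-time window, dropping events that are too late."""
    global watermark
    watermark = max(watermark, event_time)
    window_start = event_time - (event_time % WINDOW)
    # The window still accepts events while its end + lateness is ahead of the watermark.
    if window_start + WINDOW + ALLOWED_LATENESS > watermark:
        counts[window_start] += 1
    # else: the event is too late and is dropped (or routed to a side output)

for t in [5, 20, 61, 50, 130, 30]:   # 50 and 30 arrive out of order
    process(t)
print(dict(counts))   # {0: 3, 60: 1, 120: 1}: 50 arrived late but counted, 30 was too late
```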

Page 17

A Real-Time Architecture

Page 18

Current state of the evolution

● Real-time event processing architecture based on best practices for stream processing on AWS.

● Uses Amazon Kinesis for durable streaming data and AWS Lambda for data processing (see the sketch after this list).

● DynamoDB and Redis are used as temporary data stores for enrichment and analytics.

● S3 provides us with a source of truth for batch data applications, and Kinesis does the same for stream processing.
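A minimal sketch of the Kinesis-to-Lambda pattern described above; the stream wiring, DynamoDB table name and payload fields are assumptions, not Jampp's actual code.

```python
import base64
import json

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("event_counters")  # hypothetical DynamoDB table

def handler(event, context):
    """AWS Lambda handler invoked with a batch of Kinesis records."""
    for record in event["Records"]:
        # Kinesis delivers record payloads base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Keep a running per-campaign counter (the field name is hypothetical).
        table.update_item(
            Key={"campaign_id": payload["campaign_id"]},
            UpdateExpression="ADD event_count :one",
            ExpressionAttributeValues={":one": 1},
        )
```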

Page 19

Our Real-Time Architecture

Page 20

Benefits of the Evolution

● Enables the use of stream processing frameworks to keep data as fresh as economically possible.

● Decouples data from processing, so multiple big data engines can run on different clusters/infrastructure.

● Easy on-demand scaling provided by AWS managed services such as AWS Lambda, Amazon DynamoDB and Amazon EMR.

● Monitoring, logs and alerts managed with Amazon CloudWatch.

Page 21

Still, it isn't perfect...

● There is no easy way to manage windows and out-of-order data with AWS Lambda.

● Consistency of DynamoDB and S3.

● Price of AWS managed services at large event volumes compared to custom-maintained solutions.

● The ACID guarantees of RDBMSs are not an easy thing to part with.

● SQL and indexes in RDBMSs make forensics easier.

Page 22

Lessons Learned

● Traditional data warehouses are still relevant. To cope with data volumes: lose detail.

● RDBMSs can fit a lot of use cases initially: unified log, OLAP, near-real-time processing.

● Take full advantage of your public cloud provider's offerings (rent first, build later).

● Development and staging for Big Data projects should involve production traffic, or be prepared for trouble.

Page 23

Lessons Learned (cont.)

● PrestoDB is really amazing with regard to performance, maturity and feature set.

● Kinesis, DynamoDB and Firehose use HTTPS as their transport protocol, which is slow and requires aggressive batching.

● All Amazon services require automatic retries with exponential back-off plus jitter.
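A minimal sketch of that retry policy ("full jitter" exponential back-off) as a wrapper around any AWS SDK call; the delay parameters are illustrative.

```python
import random
import time

def call_with_retries(fn, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Run fn(), retrying on failure with exponential back-off plus full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            # Full jitter: sleep a random duration up to the capped exponential delay.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))

# Example (illustrative): retry a Kinesis put_record call.
# call_with_retries(lambda: kinesis.put_record(StreamName="events",
#                                              Data=b"{}", PartitionKey="key"))
```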

Page 24

Questions?

geeks.jampp.com