Processing 19 billion messages in real time and not dying in the process
TRANSCRIPT
Juan Martín Pampliega
Fernando Otero
Procesando 19 billones de mensajes en tiempo real sin morir en el intento
(Processing 19 billion messages in real time without dying in the attempt)
About these 2 guys ;-)

Juan (@juanpampliega) — Software Engineer. Sr. Data Engineer at Jampp; Assistant Professor @ITBA. Fav Beer: Paulaner Hefe-Weißbier.

Fernando (@fotero) — Computer Scientist. Principal Engineer at Jampp; Spark-Jobserver committer; user of and contributor to Spark since 1.0. Fav Beer: Guinness.
About Jampp
We are a tech company that helps companies grow their mobile business by driving engaged users to their apps.
We are a team of 75 people, 30% of them in engineering.
Located in 6 cities across the US, Latin America, Europe and Africa.
Machine learning
Post-install event optimisation
Dynamic Product Ads and Segments
Data Science
Programmatic Buying
We have to be fast!
Our platform combines machine learning with big data for programmatic ad buying which optimizes towards in-app activity.
We process 220,000+ RTB ad bid requests per second (19+ billion per day), which amounts to about 40 TB of data per day.
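Those two figures are consistent; a quick sanity check (our arithmetic, not from the slides):

```python
# 220,000 bid requests/second sustained over a full day:
requests_per_day = 220_000 * 60 * 60 * 24
print(requests_per_day)  # 19,008,000,000 — matching the "19+ billion per day" on the slide
```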
[Diagram: tracking flow — Exchange → Jampp Tracking → Tracking Platform → AppStore / Google Play → Postback]
How does programmatic advertising work?
RTB: Real Time Bidding
[Diagram: RTB auction flow]
1. An ad request reaches the exchange.
2. The exchange sends OpenRTB bid requests to bidders (100 ms deadline).
3. Bidders respond: OpenRTB bid $3, OpenRTB bid $2.
4. Win notification: $2.01 (second price: the $3 bidder pays just above the $2 runner-up).
5. The ad is served (auction won).
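The win notification above follows second-price auction mechanics: the highest bidder wins but pays the runner-up's bid plus a minimum increment. A minimal sketch of that settlement rule (our illustration, not Jampp's code or the OpenRTB spec):

```python
def settle_second_price(bids, increment=0.01):
    """Return (winner, clearing_price) for a second-price auction:
    the highest bidder wins and pays the runner-up's bid plus a
    small increment (e.g. bids of $3 and $2 clear at $2.01)."""
    if not bids:
        return None, None
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner, top = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else top
    # Never charge above the winner's own bid.
    return winner, round(min(top, runner_up + increment), 2)
```

For the flow in the diagram, `settle_second_price({"bidder_a": 3.00, "bidder_b": 2.00})` clears at $2.01 for bidder_a.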
Data @ Jampp
● Started using RDBMSs on Amazon Web Services.
● Data grew exponentially.
● Last year: 1500%+ growth in in-app events, 500%+ in RTB bids.
● We needed to evolve our architecture.
Initial Jampp Infrastructure

[Diagram: initial data architecture — collectors C1…Cn ("Cupper") behind a load balancer write clicks, installs and events to MySQL; bidders B1…Bn feed auctions, bids and impressions through a replicator into PostgreSQL; an API (pivots) and user segments sit on top.]
Emerging Needs
● Increase log forensics capabilities.
● More historical and more granular data.
● Make data readily available to other systems.
● MySQL was impossible to scale easily.
Evolution of our Data Architecture

[Diagram: initial evolution — collectors C1…Cn ("Cupper") behind a load balancer still write clicks, installs and events to MySQL; click-redirect and ELB logs are shipped to an EMR Hadoop cluster queried through AirPal; user segments feed back into the platform.]
New System Characteristics
● The new system was based on Amazon Elastic MapReduce using Facebook PrestoDB and Apache Spark.
● Data imported hourly from RDBMSs with Sqoop.
● Logs are imported every 10 minutes from different sources to S3 tables.
● Scalable storage and processing capabilities using HDFS, S3, YARN and Hive for ETLs and data storage.
Aspects that needed improvement
● Data still imported in batch mode. Delay was larger for MySQL data than with the Python replicator.
● EMR is not great for long-running clusters.
● Data still accumulated in RDBMSs (clicks, installs, events).
● Lots of joins still needed for day-to-day data analysis work.
Current Architecture

Stream processing
● Continuous event processing of unbounded datasets.
● Makes data available in real time and evens out the load of data processing.
● Requires:
○ A persistent log for fault tolerance.
○ A scalable processing layer to deal with spikes in throughput.
○ Support for data enrichment and for handling out-of-order data.
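To make the out-of-order requirement concrete, here is a minimal, hypothetical event-time window accumulator (names, window size and lateness bound are ours, not Jampp's):

```python
from collections import defaultdict


def assign_window(ts, size=60):
    """Map an event timestamp (in seconds) to the start of its tumbling window."""
    return ts - ts % size


class WindowedCounter:
    """Counts events per 60-second event-time window, accepting late
    (out-of-order) events up to `allowed_lateness` seconds behind the
    maximum timestamp seen so far; anything later is dropped."""

    def __init__(self, size=60, allowed_lateness=120):
        self.size = size
        self.allowed_lateness = allowed_lateness
        self.max_ts = 0
        self.counts = defaultdict(int)

    def add(self, ts):
        self.max_ts = max(self.max_ts, ts)
        if ts < self.max_ts - self.allowed_lateness:
            return False  # too late: drop (or divert to a side output)
        self.counts[assign_window(ts, self.size)] += 1
        return True
```

An event with timestamp 15 arriving after one with timestamp 70 still lands in the correct window; one arriving more than 120 seconds behind the watermark is rejected.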
A Real-Time Architecture

Current state of the evolution
● Real-time event processing architecture based on best practices for stream processing on AWS.
● Uses Amazon Kinesis for durable streaming data and AWS Lambda for data processing.
● DynamoDB and Redis are used as temporary data stores for enrichment and analytics.
● S3 provides a source of truth for batch data applications, and Kinesis for stream processing.
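A Kinesis-triggered Lambda function receives records base64-encoded in the trigger event. A minimal sketch of such a handler (the enrichment step is our hypothetical example, not Jampp's actual processing):

```python
import base64
import json


def handler(event, context=None):
    """Hypothetical AWS Lambda handler for a Kinesis trigger: decodes each
    record's base64 payload and tags it with its shard id. In a pipeline
    like the one described, results would go to DynamoDB/Redis/S3."""
    enriched = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Illustrative enrichment: the eventID is "<shardId>:<sequenceNumber>".
        payload["shard_id"] = record["eventID"].split(":")[0]
        enriched.append(payload)
    return enriched
```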
Our Real-Time Architecture
Benefits of the Evolution
● Enables the use of stream processing frameworks to keep data as fresh as economically possible.
● Decouples data from processing, enabling multiple Big Data engines to run on different clusters/infrastructure.
● Easy on-demand scaling provided by AWS managed services such as Lambda, DynamoDB and EMR.
● Monitoring, logs and alerts managed by AWS CloudWatch.
Still, it isn’t perfect...
● There is no easy way to manage windows and out-of-order data with AWS Lambda.
● Consistency semantics of DynamoDB and S3.
● Price of AWS managed services at large event volumes, compared to custom-maintained solutions.
● The ACID guarantees of RDBMSs are not an easy thing to part with.
● SQL and indexes in RDBMSs make forensics easier.
Lessons Learned
● Traditional data warehouses are still relevant. To cope with data volumes: lose detail.
● RDBMSs can fit a lot of use cases initially: unified log, OLAP, near real-time processing.
● Take full advantage of your public cloud provider’s offerings (rent first, build later).
● Development and staging for Big Data projects should involve production traffic, or be prepared for trouble.
● PrestoDB is really amazing with regard to performance, maturity and feature set.
● Kinesis, DynamoDB and Firehose use HTTPS as the transport protocol, which is slow and requires aggressive batching.
● All Amazon services require automatic retries with exponential back-off plus jitter.
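That last lesson can be sketched as a small retry helper using capped exponential back-off with "full jitter" (names, defaults and the retryable exception are our illustration, not an AWS SDK API):

```python
import random
import time


def with_backoff(call, max_attempts=5, base=0.05, cap=2.0,
                 retryable=(ConnectionError,), sleep=time.sleep):
    """Retry `call` with capped exponential back-off plus full jitter:
    before retry n, sleep a uniform random time in [0, min(cap, base * 2**n)].
    Re-raises the last error once max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retryable:
            if attempt == max_attempts - 1:
                raise
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The randomized sleep spreads retries out so that a burst of throttled clients does not hammer the service in lockstep.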
Questions?
geeks.jampp.com