innovation with aerospike · 2018-05-25 · innovation with aerospike: rob russo platform architect...
TRANSCRIPT
AEROSPIKE USER SUMMIT 2018
Innovation with Aerospike:
ROB RUSSO, Platform Architect
AppLovin
How AppLovin Handles Billions of Requests per Day
AEROSPIKE USER SUMMIT | Proprietary & Confidential | All rights reserved. © 2018 Aerospike Inc
A Typical Day at AppLovin
AppLovin at Scale
BACKGROUND
Our production Aerospike setup consists of:
✓ 3+ billion active devices
✓ 35 individual clusters
✓ Over 350 servers running Aerospike
✓ 700TB+ of disk and 70TB+ of RAM
✓ 250B+ total Aerospike records
✓ 4B+ objects expired daily
A Typical Day At AppLovin
Our Aerospike Use Cases
BACKGROUND
At AppLovin we’ve successfully applied Aerospike to the following use cases:
✓ Device data storage
✓ Real-time XDR ad data
✓ Down-funnel event tracking
✓ Dimension counting
✓ Decision tracking
✓ Atomic state tracking
✓ Spark functionality enhancements
Speeding Up Development and BI
BACKGROUND
The Problem
• Aerospike data is not quickly accessible in bulk
• Business intelligence questions require scanAll
  • How many devices have X feature?
  • What is the device distribution of Y?
• Takes up to a day for feedback
• Questions usually lead to follow-ups
• Slow backup/restore reduces room for error
Aerospike Backup/Restore with Spark
BACKGROUND
The Solution
• Develop a Spark job to export Aerospike to HDFS
• Reduces query time by 750x
• BI becomes interactive and iterative
• Data becomes usable in all Spark jobs
• Capable of a full cluster restore
• Saved in Parquet for query speed
• Reduces backup time by 80%
Aerospike Backup/Restore with Spark
RESULTS
The Results
[Figure: bar chart "BI Answer Time" in hours (y-axis 0 to 25), comparing "ScanAll Export + Spark Job" with "Spark Job"; annotated "750x faster!"]
BI & Debugging with Apache Zeppelin
HOW IT WORKS
Scaling it Up
HOW IT WORKS
Aerospike Export Benchmarks
RESULTS
[Figure: bar chart of export throughput in records/s (y-axis 0 to 1,600,000) for asbackup, asbackup parallel, ALBackup V1, ALBackup V2, and ALBackup V3]
Combining Spark and Aerospike
BACKGROUND
What Next?
• We’ve found a way to augment Aerospike with Spark
• What about the reverse?
• How can we use the unique capabilities of Aerospike to augment Spark?
  • Aggregation
  • Set operations
  • Joins
Down-Funnel Event Matching at Scale
BACKGROUND
THE APPLICATION
• Matching down-funnel events to their source event
• Typically involves matching disproportionately sized datasets
• Perfect fit for real-time matching using Aerospike
THE PROBLEM
• The original Vertica query began to fail before the real-time solution was ready
• Current Vertica solution:
  • 384 cores, 2 TB RAM, 1 hour runtime, 2.7M events/s
• Quickly needed to replace it with an Apache Spark query
• Immediately noticed the heavy resource utilization
• Vertica and Spark use the same inefficient joins
Spark to the Rescue
HOW IT WORKS
FIRST APPROACH
▪ Use Spark SQL as a near drop-in replacement
▪ Resulting performance was a choice between:
  ▪ 2000+ cores, 4TB RAM, 1 hour runtime, 2.7M events/s
  ▪ 384 cores, 1TB RAM, 5 hour runtime, 500K events/s
▪ At least it runs successfully!
SECOND APPROACH
▪ Use a Spark broadcast join to avoid the large shuffle
▪ Clearly the fastest option: 100x faster than approach 1
▪ Result: failed at scale due to the 2GB broadcast limit
[Diagrams: Hash Join, Broadcast Join]
Avoiding the Shuffle
HOW IT WORKS
THE SOLUTION
• Install an Aerospike in-memory namespace on all Spark servers
• Use Aerospike within the Spark job to avoid large data shuffles
• Aerospike inter-node communication is limited to small hashes
THE METHOD
• Load the smaller dataset into Aerospike:
  • Key off the join key
  • Store required join data in bins
• Iterate over the large dataset locally and query Aerospike for matches
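The method above can be sketched in plain Python, with a dict standing in for the in-memory Aerospike namespace. This only illustrates the lookup-join idea and is not AppLovin's implementation; the names `load_small_side` and `lookup_join` and the sample records are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

Row = Tuple[str, dict]  # (join key, fields)


def load_small_side(rows: Iterable[Row]) -> Dict[str, dict]:
    """Load the smaller dataset into the key-value store, keyed on the join key."""
    store: Dict[str, dict] = {}
    for join_key, bins in rows:
        # "bins" hold only the fields the join needs.
        store[join_key] = bins
    return store


def lookup_join(large_rows: Iterable[Row],
                store: Dict[str, dict]) -> Iterator[Tuple[str, dict, dict]]:
    """Stream the large dataset and emit a row for every key found in the store."""
    for join_key, payload in large_rows:
        match = store.get(join_key)  # point lookup instead of a shuffle
        if match is not None:
            yield join_key, payload, match


# Hypothetical data: few down-funnel events, many source events.
installs = [("dev-1", {"installed_at": 100}), ("dev-3", {"installed_at": 250})]
impressions = [("dev-1", {"ad": "A"}), ("dev-2", {"ad": "B"}), ("dev-3", {"ad": "C"})]

store = load_small_side(installs)
matches = list(lookup_join(impressions, store))
```

In a real job each Spark partition would run the lookup loop against the co-located Aerospike node, so the large dataset never has to be repartitioned by key.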
The Aerospike-Spark Join
HOW IT WORKS
[Diagram: the join shown in two steps, "Step 1" and "Step 2"]
Proof of Concept
HOW IT WORKS
INITIAL IMPLEMENTATION RESULTS
~10M reads/s
Used standard get/put
How to scale it up?
1. First simplify the test case, isolate to Aerospike
2. Pick a consistent benchmark query
3. Tune Aerospike configuration
4. Re-implement with Aerospike batch protocol
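Moving from single gets to a batch protocol replaces one round trip per key with one request per group of keys. A minimal sketch of that grouping, assuming a hypothetical `batch_get` function that stands in for a batch read call (backed by a dict here so the example runs on its own); the batch size is arbitrary.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


def chunked(keys: Iterable[str], size: int) -> Iterator[List[str]]:
    """Split a stream of keys into lists of at most `size` keys."""
    it = iter(keys)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def batched_lookup(keys: Iterable[str],
                   batch_get: Callable[[List[str]], List[Optional[dict]]],
                   batch_size: int) -> Iterator[Tuple[str, Optional[dict]]]:
    """Issue one request per batch and pair each key with its result."""
    for batch in chunked(keys, batch_size):
        for key, record in zip(batch, batch_get(batch)):
            yield key, record


# Stand-in for the store; a real job would call the client's batch read here.
store = {"k1": {"v": 1}, "k3": {"v": 3}}
calls: List[int] = []


def batch_get(keys: List[str]) -> List[Optional[dict]]:
    calls.append(len(keys))  # record how many keys each request carried
    return [store.get(k) for k in keys]


results = list(batched_lookup(["k1", "k2", "k3", "k4", "k5"], batch_get, batch_size=2))
```

Five keys with a batch size of two produce three requests instead of five; missing keys come back as `None`.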
Full Scale Implementation
RESULTS
BENCHMARK RESULTS
Final job resources:
384 cores
500GB RAM (4x less)
5 minute runtime (12x faster)
230M reads/s (7x faster than Spark alone)
6M reads/s per node
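The "4x less" and "12x faster" figures are consistent with a comparison against the Vertica job quoted earlier (384 cores, 2 TB RAM, 1 hour runtime). The short check below assumes that baseline, which the slide does not state explicitly, and treats 2 TB as 2000 GB. The node count is only a rough inference from the two read rates, not a number given in the talk.

```python
# Assumed baseline: the Vertica job (2 TB RAM, 1 hour runtime).
baseline_ram_gb = 2000
baseline_runtime_min = 60

# Final Spark + Aerospike job, as listed on the slide.
final_ram_gb = 500
final_runtime_min = 5
total_reads_per_s = 230e6
reads_per_s_per_node = 6e6

ram_reduction = baseline_ram_gb / final_ram_gb          # memory ratio
speedup = baseline_runtime_min / final_runtime_min      # runtime ratio
implied_nodes = total_reads_per_s / reads_per_s_per_node  # rough inference only

print(f"RAM: {ram_reduction:.0f}x less, runtime: {speedup:.0f}x faster, "
      f"about {implied_nodes:.0f} nodes implied")
```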
[Figure: bar chart "Join Throughputs" in millions of records/s (y-axis 0 to 3,500), comparing Spark Shuffle, Spark + Aerospike, and Spark Broadcast]
Peak Reads/s
OUTCOME
The Final Comparison
OUTCOME
[Figure: bar chart of throughput in millions of records/s (y-axis 0 to 4.5), comparing Vertica, Spark Shuffle, and Spark + Aerospike]
Bringing it to the Masses
OUTCOME
FORMALIZING THE FRAMEWORK
• Needed to provide a simple API to make use of it
• Abstract away all of the Aerospike-specific logic
• Hide the use of namespaces, sets and bins
• Use TTL records to prevent the cluster from filling
• Provide feature parity with Spark’s built-in set/join functions
DEVELOPER API
largeRDD.aeroSubtract(broadcastRDD)
largeRDD.aeroIntersect(broadcastRDD)
largeRDD.aeroJoin(broadcastRDD)
largeRDD.aeroLeftJoin(broadcastRDD)
largeDF.aeroJoin(broadcastDF, columns)
largeDF.aeroLeftJoin(broadcastDF, columns)
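The semantics these methods would need in order to mirror Spark's own `subtract`, `intersection`, `join`, and `leftOuterJoin` can be sketched on plain Python lists. The function names below follow the slide's API, but the bodies are an assumption about the intended behavior, not AppLovin's code; a dict or set stands in for the Aerospike records holding the broadcast side, and duplicate handling is simplified.

```python
from typing import Any, Hashable, List, Optional, Tuple

Pair = Tuple[Hashable, Any]


def aero_join(large: List[Pair], small: List[Pair]) -> List[Tuple[Hashable, Tuple[Any, Any]]]:
    """Inner join: keep large-side rows whose key exists on the small side."""
    store = dict(small)  # duplicate small-side keys overwrite each other (key conflicts)
    return [(k, (v, store[k])) for k, v in large if k in store]


def aero_left_join(large: List[Pair], small: List[Pair]) -> List[Tuple[Hashable, Tuple[Any, Optional[Any]]]]:
    """Left join: keep every large-side row, with None where there is no match."""
    store = dict(small)
    return [(k, (v, store.get(k))) for k, v in large]


def aero_subtract(large: List[Hashable], small: List[Hashable]) -> List[Hashable]:
    """Elements of the large side that are absent from the small side."""
    present = set(small)
    return [x for x in large if x not in present]


def aero_intersect(large: List[Hashable], small: List[Hashable]) -> List[Hashable]:
    """Elements of the large side that are also on the small side."""
    present = set(small)
    return [x for x in large if x in present]


# Hypothetical sample data.
large_pairs = [("a", 1), ("b", 2), ("c", 3)]
small_pairs = [("a", "x"), ("c", "z")]

joined = aero_join(large_pairs, small_pairs)
left_joined = aero_left_join(large_pairs, small_pairs)
subtracted = aero_subtract(["a", "b", "c"], ["a", "c"])
intersected = aero_intersect(["a", "b", "c"], ["a", "c"])
```

In all four cases only the small side is loaded into the store and the large side is scanned once, which is what lets the large dataset stay in place.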
Limitations and Future Work
FOLLOW UP
LIMITATIONS
• Aerospike set/bin limits
• How to properly deal with TTL or removing records
• Key conflicts
• Potential for cascading failures
• Need to know where to apply it
FUTURE IMPROVEMENT IDEAS
• Plug in to the cost-based optimizer
• Determine if SQL support is possible
• Launch dedicated job clusters through Mesos?
Aerospike and Spark
FOLLOW UP
Key Takeaways
▪ Quick, efficient access to Aerospike data from Spark can speed up BI and development by 750x
▪ Backup solutions through Spark can perform up to 2x faster than asbackup alone
▪ A hybrid Aerospike-Spark setup can perform 7x faster than Spark alone