Data: How Does It Work?
An overview of the Castle Global Analytics Pipeline (CGAP™)
1. General overview
2. Our current solution
3. Numbers and comparisons
Raw data aggregated from multiple sources
Extract, transform, and load (ETL)
Persist ingestible data to warehouse
Analyze ingested data for insights
Exhibit A: Generic Data Pipeline
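The four stages of the generic pipeline can be sketched as plain functions. This is a toy illustration of the shape of the flow, not CGAP code; every name here is hypothetical:

```python
import json

def aggregate(sources):
    """Collect raw records from every source into one stream."""
    for source in sources:
        yield from source

def etl(records):
    """Extract fields, transform to a flat shape, load as JSON lines."""
    for r in records:
        yield json.dumps({"event": r["type"], "value": r["payload"]})

def persist(lines, warehouse):
    """Append ingestible rows to the warehouse (here, just a list)."""
    warehouse.extend(lines)

def analyze(warehouse):
    """Derive a trivial insight: events counted by type."""
    counts = {}
    for line in warehouse:
        row = json.loads(line)
        counts[row["event"]] = counts.get(row["event"], 0) + 1
    return counts

clients = [{"type": "click", "payload": 1}, {"type": "click", "payload": 2}]
servers = [{"type": "push", "payload": 3}]
warehouse = []
persist(etl(aggregate([clients, servers])), warehouse)
print(analyze(warehouse))  # {'click': 2, 'push': 1}
```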
CGAP™ - How We Obtain Raw Data
● Two main types of real-time events: client and server
● Challenges: client call site consistency, batching, QA
● Multiple persistent databases
● Real-time events temporarily stored in Kafka (revisited later)
● Databases always up to date, copied over at intervals
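A sketch of how client and server events might share a common envelope before being produced to Kafka. The envelope fields here are illustrative assumptions, not the real CGAP schema:

```python
import json
import time
import uuid

def make_event(source, event_type, properties):
    """Wrap an event in a common envelope before it is sent to Kafka.
    These field names are hypothetical, not the real CGAP schema."""
    return {
        "id": str(uuid.uuid4()),
        "source": source,            # "client" or "server"
        "type": event_type,
        "ts": int(time.time() * 1000),
        "properties": properties,
    }

# Client events come from many call sites (hence the consistency
# challenge); server events are emitted by backend services.
client_evt = make_event("client", "screen_view", {"screen": "home"})
server_evt = make_event("server", "push_sent", {"campaign": "daily"})

# Events are batched and serialized before being produced to a topic.
batch = [json.dumps(e) for e in (client_evt, server_evt)]
```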
Kafka Diagram
The ETL Job (Spark)
● Multiple batch jobs, one per Kafka topic, run at intervals
● Concurrently read out of multiple Kafka partitions
● Save offset state per job
● Parse, flatten, write JSON
● Takes care of compression and uploading to cloud storage
● Extra batch load jobs run for persistent relational DBs
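The "parse, flatten, write JSON" step can be shown in miniature. In the real job, Spark reads the Kafka partitions concurrently and tracks offsets; this pure-Python `flatten` (a hypothetical helper, not CGAP code) shows only the transform:

```python
import json

def flatten(record, parent="", sep="_"):
    """Recursively flatten nested dicts so each value becomes one
    top-level column, e.g. {"user": {"id": 7}} -> {"user_id": 7}."""
    out = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            out.update(flatten(value, name, sep))
        else:
            out[name] = value
    return out

raw = {"event": "push", "user": {"id": 7, "device": {"os": "ios"}}}
line = json.dumps(flatten(raw))
# -> {"event": "push", "user_id": 7, "user_device_os": "ios"}
```

Flattening up front keeps every value addressable as a plain column once the JSON lands in the warehouse.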
Spark Diagram
Spark Diagram (cont.)
Data Warehousing (AWS Redshift)
● Load data from AWS S3 into Redshift
● Two clusters, Kiwi/Plaza, 10 nodes total
● Scaling/upgrading requires some write downtime
● Support for full SQL joins, based on PostgreSQL 8.0
● Specialized proprietary hardware/software layers
● Scales up to petabytes
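Loading the Spark output from S3 into Redshift is done with the `COPY` command. A sketch of building such a statement (the table, bucket prefix, and IAM role below are placeholders, not the real ones):

```python
def copy_statement(table, s3_prefix, iam_role):
    """Build a Redshift COPY command that loads gzipped JSON from S3.
    Table, prefix, and role names here are placeholders."""
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS JSON 'auto' GZIP;"
    )

sql = copy_statement(
    "events",
    "s3://example-bucket/etl-output/2016/01/01/",
    "arn:aws:iam::123456789012:role/RedshiftCopy",
)
```

`JSON 'auto'` lets Redshift map JSON keys to columns by name, which pairs well with the flattened output of the ETL job.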
Business Intelligence
● Primarily using Mode as the UI to Redshift; very cheap
● Allows for some visualizations from queries
Our Numbers
● Kiwi: half a billion events ingested daily
○ 200MM from clients, 280MM from server (170MM pushes)
○ Uncompressed size of JSON: 500GB/day
○ Compressed size, Spark output: 8GB/day
○ Compressed and indexed size, Redshift: 15GB/day
● Plaza: 14MM events per day, mostly server
○ Plaza loads: 4 hours in 53s vs. Kiwi 4 hours in 8 minutes
Our Numbers (cont.)
● Currently storing 1.2TB of compressed Redshift data
○ Roughly 2 months of full Kiwi data (14 billion rows)
○ Past 2 months, archive and coalesce to save space
● At Plaza’s rate, can store over two years on just 256GB
● Costs:
○ S3 storage long-term: $5000/year
○ Redshift, current status: $16,000/year
○ Machines for batch load: 1 MBP
○ Mode: $40/month
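A back-of-envelope check using only the slide's own figures, showing the effective compression and retention the deck claims:

```python
# Kiwi figures from the slides above.
raw_gb_per_day = 500           # uncompressed JSON
spark_gb_per_day = 8           # compressed Spark output
redshift_gb_per_day = 15       # compressed + indexed in Redshift

spark_ratio = raw_gb_per_day / spark_gb_per_day        # 62.5x
redshift_ratio = raw_gb_per_day / redshift_gb_per_day  # ~33x

# 1.2TB of Redshift data at 15GB/day is about 80 days,
# in the same ballpark as "roughly 2 months".
days_retained = 1200 / redshift_gb_per_day
```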
Thanks to...
● The Apache Foundation for making open source systems
● AWS for providing us their economies of scale
● You, for listening