![Page 1: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/1.jpg)
HONGBO ZENG
Airbnb
![Page 2: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/2.jpg)
![Page 3: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/3.jpg)
![Page 4: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/4.jpg)
Agenda
• Data Platform at Airbnb
• Cluster Evolution
• Incremental Data Replication - ReAir
• Unified Streaming and Batch Processing - AirStream
![Page 5: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/5.jpg)
![Page 6: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/6.jpg)
Scale of Data Infrastructure at Airbnb
• >13B events collected
• >35PB warehouse size
• 1400+ machines (Hadoop + Presto + Spark)
• 5x YoY data growth
![Page 7: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/7.jpg)
Data Platform
[Architecture diagram. Components shown: Event Logs, MySQL Dumps, Kafka, Sqoop, Gold Cluster (HDFS, Hive, Yarn), Silver Cluster (HDFS, Hive, Yarn), Spark Cluster (Spark), ReAir, AirStream, Airflow Scheduling, S3, Presto Cluster, AirPal, SuperSet, Tableau]
![Page 8: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/8.jpg)
![Page 9: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/9.jpg)
![Page 10: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/10.jpg)
Original Cluster
Setup
• Single HDFS, MR and Hive installation
• c3.8xlarge (32 cores / 60GB mem / 640GB disk) + 3TB of EBS volume
• 800 nodes
• Tested DN on different AZs
• All data managed by Hive
Challenges
• Limited isolation between production / ad hoc
• Ad hoc
  - Difficult to meet SLAs
  - Harder to plan capacity
• Disaster recovery
• Difficult rollouts
![Page 11: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/11.jpg)
Two Clusters
• Two independent HDFS, MR, Hive metastores
• d2.8xlarge w/ 48TB local
• ~250 instances in final setup
• Replication of common / critical data - Silver is a superset of Gold
• For disaster recovery, separate AZs
[Diagram: Gold Cluster (HDFS, Hive, Yarn) -> Replication -> Silver Cluster (HDFS, Hive, Yarn)]
![Page 12: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/12.jpg)
Multi-Cluster Trade-Offs
Advantages
• Failure isolation with user jobs
• Easy capacity planning
• Guarantee SLAs
• Able to test new versions
• Disaster recovery
Disadvantages
• Data synchronization
• User confusion
• Operational overhead
![Page 13: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/13.jpg)
![Page 14: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/14.jpg)
![Page 15: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/15.jpg)
Warehouse Replication Approaches
Batch
• Scan HDFS, metastore
• Copy relevant entries
• Simple, no state
• High latency
Incremental
• Record changes in source
• Copy/re-run operations on destination
• More complex, more state
• Low latency (seconds)
![Page 16: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/16.jpg)
Incremental Replication
• Record Changes on Source
• Convert Changes to Replication Primitives
• Run Primitives on the Destination
![Page 17: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/17.jpg)
Record Changes on Source
• Hive provides hooks API to fire at specific points
  - Pre-execute
  - Post-execute
  - Failure
• Use post-execute to log objects that are created into an audit log
• In critical path for queries
![Page 18: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/18.jpg)
Example Audit Log Entry
![Page 19: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/19.jpg)
Convert Changes to Primitive Operations
• 3 types of objects - DB, table, partition
• 3 types of operations - Copy, rename, drop
• 9 different primitive operations
• Idempotent
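The 3 x 3 combination on this slide can be enumerated in a few lines. This is a minimal sketch, not ReAir's actual code: the names `ObjectType`, `Operation`, `Primitive`, and `apply_drop` are hypothetical, and a plain Python set stands in for the destination warehouse.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class ObjectType(Enum):
    DB = "db"
    TABLE = "table"
    PARTITION = "partition"


class Operation(Enum):
    COPY = "copy"
    RENAME = "rename"
    DROP = "drop"


@dataclass(frozen=True)
class Primitive:
    """One replication primitive (hypothetical shape, for illustration only)."""
    operation: Operation
    object_type: ObjectType
    name: str = ""


# Every (operation, object type) pair is one kind of primitive: 3 x 3 = 9.
PRIMITIVE_KINDS = list(product(Operation, ObjectType))


def apply_drop(existing: set, primitive: Primitive) -> set:
    """Idempotent drop: dropping a missing object is a no-op, so a retry is safe."""
    return existing - {primitive.name}
```

Because each primitive describes a target state rather than a delta, running it a second time leaves the destination unchanged, which is what makes retries safe.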
![Page 20: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/20.jpg)
Primitive Example
CREATE TABLE srcpart (key STRING) PARTITIONED BY (ds STRING)
• Copy Table
INSERT OVERWRITE TABLE srcpart PARTITION(ds='1') SELECT key FROM src
• Copy Partition
ALTER TABLE srcpart SET FILEFORMAT TEXTFILE
• Copy Table
ALTER TABLE srcpart RENAME to srcpart_old
• Rename table
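The four mappings on this slide can be reproduced with a toy classifier. Per the earlier slides, the real pipeline derives primitives from the audit log written by the post-execute hook, not by parsing SQL; the string matching below is purely illustrative, and `primitive_for` is a hypothetical name.

```python
import re


def primitive_for(statement: str) -> str:
    """Return the primitive that the slide pairs with a Hive statement (toy rules)."""
    s = " ".join(statement.split()).upper()
    if re.search(r"\bRENAME TO\b", s):
        return "rename table"
    # PARTITION(...) on an INSERT means a single partition changed.
    if s.startswith("INSERT") and re.search(r"\bPARTITION\s*\(", s):
        return "copy partition"
    # Creating a table or altering its metadata re-copies the table definition.
    if s.startswith(("CREATE TABLE", "ALTER TABLE")):
        return "copy table"
    raise ValueError(f"statement not covered by this sketch: {statement!r}")
```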
![Page 21: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/21.jpg)
Copy Table Flow
[Flowchart: decision nodes "source exists?" and "dest exists and the same?" (Y/N branches), followed by the steps: copy to temp location, verify the copy, tmp -> dest, add metadata, done]
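The flow above can be sketched with dictionaries standing in for the two clusters. The branch outcomes are assumptions, since the Y/N labels did not survive extraction: a missing source is treated as nothing to do, and an identical destination as already done. `copy_table` and the table layout are hypothetical.

```python
import copy


def copy_table(source: dict, dest: dict, table: str) -> str:
    """Copy one table from source to dest; safe to call repeatedly."""
    if table not in source:                 # source exists? (assumed: N -> done)
        return "skipped: source missing"
    if dest.get(table) == source[table]:    # dest exists and the same? (assumed: Y -> done)
        return "skipped: already in sync"
    tmp = copy.deepcopy(source[table])      # copy to temp location
    if tmp != source[table]:                # verify the copy
        raise RuntimeError("copy verification failed")
    dest[table] = {"data": tmp["data"]}             # tmp -> dest
    dest[table]["metadata"] = tmp["metadata"]       # add metadata
    return "copied"
```

Copying to a temporary location first means a failed or partial copy never becomes visible at the destination path.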
![Page 22: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/22.jpg)
![Page 23: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/23.jpg)
Batch Infrastructure
[Architecture diagram. Components shown: Event Logs, MySQL Dumps, Kafka, Sqoop, Gold Cluster (HDFS, Hive, Yarn), Silver Cluster (HDFS, Hive, Yarn), Spark Cluster (Spark), ReAir, Airflow Scheduling, S3, Presto Cluster, AirPal, SuperSet, Tableau]
![Page 24: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/24.jpg)
AirStream
Source -> Process -> Sink
![Page 25: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/25.jpg)
Streaming at Airbnb - AirStream
[Diagram: Sources (Kafka, S3, HDFS, …) -> Cluster (Spark Streaming, Airflow Scheduling, HBase, HDFS) -> Sinks (Datadog, Kafka, DynamoDB, ElasticSearch, …)]
![Page 26: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/26.jpg)
Lambda Architecture
![Page 27: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/27.jpg)
Lambda Architecture
[Diagram: AirStream covers a Batch path (Hive, Spark SQL) and a Streaming path (Kafka, Spark Streaming), with State Storage]
![Page 28: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/28.jpg)
Sources
Streaming
    source: [
      {
        name: source_example,
        type: kafka,
        config: {
          topic: "example_topic",
        }
      }
    ]
Batch
    source: [
      {
        name: source_example,
        type: hive,
        sql: {
          select * from db.table where ds='2017-06-05';
        }
      }
    ]
![Page 29: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/29.jpg)
Computation
Streaming/Batch
    process: [{
      name = process_example,
      type = sql,
      sql = """
        SELECT listing_id, checkin_date, context.source as source
        FROM source_example
        WHERE user_id IS NOT NULL
      """
    }]
![Page 30: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/30.jpg)
Sinks
Streaming
    sink: [
      {
        name = sink_example
        input = process_example
        type = hbase_update
        hbase_table_name = test_table
        bulk_upload = false
      }
    ]
Batch
    sink: [
      {
        name = sink_example
        input = process_example
        type = hbase_update
        hbase_table_name = test_table
        bulk_upload = true
      }
    ]
![Page 31: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/31.jpg)
Computation Flow
[Diagram: the Streaming flow and the Batch flow contain the same nodes: Source, Process_A, Process_B, Process_A1, Sink_A2, Sink_B2]
![Page 32: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/32.jpg)
Unified API through AirStream
• Declarative job configuration
• Streaming source vs static source
• Computation operators and sinks can be shared by streaming and batch jobs
• Computation flow is shared by streaming and batch
• A single driver executes jobs in both streaming and batch mode
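The idea of one driver running the same computation over a streaming or a static source can be sketched in plain Python. This is an illustration under loud assumptions: ordinary iterables stand in for Spark Streaming and Spark batch inputs, and `run_job`, `streaming_source`, `EVENTS`, and the sample rows are hypothetical. Only `process_example` mirrors the filter from the Computation slide.

```python
from typing import Callable, Dict, Iterable, Iterator

Row = Dict[str, object]

# Hypothetical sample events; not taken from the talk.
EVENTS = [
    {"listing_id": 1, "checkin_date": "2017-06-05", "user_id": 42},
    {"listing_id": 2, "checkin_date": "2017-06-06", "user_id": None},
]


def streaming_source() -> Iterator[Row]:
    """Stand-in for a Kafka topic: rows arrive one at a time."""
    yield from EVENTS


def process_example(rows: Iterable[Row]) -> Iterator[Row]:
    """Shared computation: keep rows whose user_id is not null."""
    for row in rows:
        if row.get("user_id") is not None:
            yield {"listing_id": row["listing_id"], "checkin_date": row["checkin_date"]}


def run_job(source: Iterable[Row],
            process: Callable[[Iterable[Row]], Iterator[Row]],
            sink: Callable[[Row], None]) -> None:
    """Single driver: only the source differs between streaming and batch."""
    for row in process(source):
        sink(row)
```

Calling `run_job(streaming_source(), process_example, out.append)` and `run_job(EVENTS, process_example, out.append)` exercises the same process and sink code; only the source changes.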
![Page 33: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/33.jpg)
Shared State Storage
![Page 34: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/34.jpg)
Shared Global State Store
[Diagram: within AirStream, multiple Spark Streaming jobs and multiple Spark Batch jobs share HBase Tables]
![Page 35: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/35.jpg)
Why HBase
• Well integrated with the Hadoop ecosystem
• Efficient API for streaming writes and bulk uploads
• Rich API for sequential scan and point-lookups
• Merged view based on version
![Page 36: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/36.jpg)
Unified Write API
[Diagram: DataFrame -> Re-partition into <Region 1, [RowKey, Value]>, <Region 2, [RowKey, Value]>, …, <Region N, [RowKey, Value]> -> Puts or HFile BulkLoad -> HBase (Region 1, Region 2, …, Region N)]
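The re-partition step can be illustrated without Spark or HBase. An HBase region owns a contiguous, sorted range of row keys, so the region for a key can be found by binary search over the sorted region start keys. The sketch below assumes the start keys are already known; `region_index` and `repartition_by_region` are hypothetical helpers, not AirStream's API.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def region_index(region_start_keys: List[bytes], row_key: bytes) -> int:
    """Index of the region whose key range contains row_key.

    region_start_keys must be sorted and begin with b"" (the first region).
    """
    return bisect_right(region_start_keys, row_key) - 1


def repartition_by_region(
    rows: Iterable[Tuple[bytes, object]], region_start_keys: List[bytes]
) -> Dict[int, List[Tuple[bytes, object]]]:
    """Group (row_key, value) pairs by target region, sorted by key within each."""
    groups: Dict[int, List[Tuple[bytes, object]]] = defaultdict(list)
    for row_key, value in rows:
        groups[region_index(region_start_keys, row_key)].append((row_key, value))
    for group in groups.values():
        group.sort(key=lambda kv: kv[0])  # bulk-loaded files need keys in order
    return dict(groups)
```

Once rows are grouped this way, each group can be written either as individual Puts (streaming) or as one sorted file per region (bulk load).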
![Page 37: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/37.jpg)
Rich Read API
[Diagram: Spark Streaming/Batch Jobs read HBase Tables via Multi-Gets, Prefix Scan, Time Range Scan]
![Page 38: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/38.jpg)
Merged Views
Versions of one row key, ordered by time:
• R1, V200, TS200 (streaming write)
• R1, V150, TS150 (streaming write)
• R1, V01, TS01 (streaming write)
• …
![Page 39: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/39.jpg)
Merged Views
Versions of one row key, ordered by time:
• R1, V200, TS200 (streaming write)
• R1, V150, TS150 (streaming write)
• R1, V100, TS100 (batch bulk upload)
• R1, V01, TS01 (streaming write)
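The merged view relies on every write carrying a version timestamp, so a late batch upload slots in between streaming writes instead of overwriting them. A minimal in-memory sketch of that behavior follows; `VersionedStore` is a hypothetical stand-in for an HBase table, not HBase's client API.

```python
from collections import defaultdict
from typing import Dict, Optional


class VersionedStore:
    """Keeps every version of a row, keyed by timestamp."""

    def __init__(self) -> None:
        self._cells: Dict[str, Dict[int, str]] = defaultdict(dict)

    def put(self, row_key: str, value: str, ts: int) -> None:
        # Streaming writes and batch uploads use the same call;
        # the timestamp alone decides where the version lands.
        self._cells[row_key][ts] = value

    def get(self, row_key: str, max_ts: Optional[int] = None) -> Optional[str]:
        """Latest value at or before max_ts (latest overall if max_ts is None)."""
        versions = self._cells.get(row_key, {})
        eligible = [ts for ts in versions if max_ts is None or ts <= max_ts]
        return versions[max(eligible)] if eligible else None
```

With the versions from the slide, a batch upload at TS100 that arrives after the streaming write at TS200 does not change what a reader of the latest version sees.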
![Page 40: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/40.jpg)
Our Foundations
• Unified streaming and batch processing
• Shared global state store
![Page 41: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/41.jpg)
MySQL DB Snapshot Using Binlog Replay
![Page 42: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/42.jpg)
Database Snapshot
• Large amount of data: multiple large MySQL DBs
• Realtime-ness: minutes delay / hours delay
• Transaction: need to keep transactions across different tables
• Schema change: table schema evolves
Move Elephant
![Page 43: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/43.jpg)
Binlog Replay on Spark
[Diagram: spinal tap and seed feed an AirStream Job; timing labels: 20+ hr, 4+ hr, 1 hr, 15, 5 mins]
![Page 44: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/44.jpg)
• Streaming and batch share logic: binlog file reader, DDL processor, transaction processor, DML processor.
• Merged by binlog position: <filenum, offset>
• Idempotent: log can be replayed multiple times.
• Schema changes: full schema change history.
[Diagram (Lambda Architecture): MySQL Instance -> Binlog (realtime/history) -> Log Parser -> Transaction Processor, Change Processor, Schema Processor -> HBase; labels: DML, DDL, XVID]
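Merging by binlog position is what makes replay idempotent: a change is applied only if its <filenum, offset> is newer than the last one applied to that row. The sketch below shows the idea with Python tuples, which compare lexicographically; `SnapshotTable` and its fields are hypothetical, and a dict stands in for HBase.

```python
from typing import Dict, Optional, Tuple

Position = Tuple[int, int]  # (filenum, offset) within the binlog


class SnapshotTable:
    """Row snapshot that tolerates replayed and out-of-order binlog events."""

    def __init__(self) -> None:
        self.rows: Dict[int, dict] = {}
        self.positions: Dict[int, Position] = {}  # last applied position per key

    def apply(self, position: Position, pk: int, row: Optional[dict]) -> bool:
        """Apply one change (row=None means delete). Returns False if stale."""
        if position <= self.positions.get(pk, (-1, -1)):
            return False  # already applied, or superseded by a later change
        self.positions[pk] = position
        if row is None:
            self.rows.pop(pk, None)
        else:
            self.rows[pk] = row
        return True
```

Replaying the same log, or applying the streaming and batch halves in either order, converges to the same snapshot.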
![Page 45: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/45.jpg)
Streaming Ingestion & Realtime Interactive Query
![Page 46: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/46.jpg)
Realtime Ingestion and Interactive Query
[Diagram: Kafka -> AirStream (Spark Streaming) -> HBase; Query Engine (Spark SQL, Hive SQL, Presto SQL); DataPortal]
![Page 47: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/47.jpg)
Interactive Query in SqlLab
![Page 48: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/48.jpg)
Thanks
![Page 49: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/49.jpg)
Realtime OLAP with Druid
![Page 50: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/50.jpg)
Realtime Ingestion for Druid
[Diagram: Kafka -> AirStream (Spark Streaming) -> Druid Beam -> Druid; labels: Dimension, Metrics]
![Page 51: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/51.jpg)
Superset Powered by Druid
![Page 52: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/52.jpg)
Realtime Indexing
![Page 53: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/53.jpg)
Realtime Indexing
[Diagram: Hive (Table A, Table B, Table C) via Spark Batch, and Kafka (Event, Event, Event, …) via Spark Streaming, go through AirStream into Elastic Search; es_version = mutation id]
![Page 54: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/54.jpg)
Backup Slides
![Page 55: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/55.jpg)
Tips
![Page 56: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/56.jpg)
Moving Window Computation
![Page 57: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/57.jpg)
Long Window Computation
What if window is weeks, months, or even years?
![Page 58: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/58.jpg)
Distinct in a Large Window
I don’t want approximation. What should I do?
![Page 59: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/59.jpg)
Distinct Count
Row keys, each with its timestamp:
• Listing 1, Visitor 01, TS100
• Listing 1, Visitor 02, TS100
• Listing 1, Visitor 03, TS99
• Listing 1, Visitor 04, TS98
[Diagram: rows along a Time axis, read with Prefix Scan with TimeRange]
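An exact distinct count falls out of the key design: each (listing, visitor) pair is a single row, so repeat visits overwrite the same key and a prefix scan restricted to a time range counts each visitor once. The sketch below assumes a `listing|visitor` key format and a half-open time window; `distinct_visitors` and the dict-based table are hypothetical.

```python
from typing import Dict


def distinct_visitors(cells: Dict[str, int], listing: str, min_ts: int, max_ts: int) -> int:
    """Count row keys under the listing prefix with min_ts <= timestamp < max_ts."""
    prefix = f"{listing}|"
    return sum(
        1
        for row_key, ts in cells.items()
        if row_key.startswith(prefix) and min_ts <= ts < max_ts
    )


# Row key -> timestamp of the latest write, using the values from the slide.
CELLS = {
    "listing1|visitor01": 100,
    "listing1|visitor02": 100,
    "listing1|visitor03": 99,
    "listing1|visitor04": 98,
}
```

For example, `distinct_visitors(CELLS, "listing1", 99, 101)` counts visitors 01, 02, and 03, with no approximation involved.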
![Page 60: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/60.jpg)
Moving Average
Versions of one row key, ordered by time:
• Listing 1, Total Review Cnt: 100, TS100
• Listing 1, Total Review Cnt: 98, TS99
• …
• Listing 1, Total Review Cnt: 50, TS50
• …
• Listing 1, Total Review Cnt: 01, TS01
Window 1: Count Difference / Time Elapsed
Window 2: Count Difference / Time Elapsed
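Because each version stores a running total, the rate over any window needs only the two versions at its edges: count difference divided by time elapsed. A minimal sketch using the numbers from the slide follows; `moving_average` and the dict layout are hypothetical.

```python
from typing import Dict

# Timestamp -> cumulative review count for Listing 1 (values from the slide).
TOTAL_REVIEW_CNT: Dict[int, int] = {1: 1, 50: 50, 99: 98, 100: 100}


def moving_average(snapshots: Dict[int, int], start_ts: int, end_ts: int) -> float:
    """Average count increase per time unit between two stored versions."""
    if end_ts <= start_ts:
        raise ValueError("end_ts must be later than start_ts")
    return (snapshots[end_ts] - snapshots[start_ts]) / (end_ts - start_ts)
```

No intermediate events have to be kept in memory, which is why the window can span weeks or months.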
![Page 61: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/61.jpg)
Schema Enforcement
Streaming Events
![Page 62: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/62.jpg)
Thrift -> DataFrame
https://github.com/airbnb/airbnb-spark-thrift
[Diagram: Thrift Event; Thrift Class -> FieldMetaData -> Struct Type; Thrift Object -> FieldValue -> Row; Struct Type and Row -> DataFrame]
![Page 63: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/63.jpg)
Summary
![Page 64: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/64.jpg)
Unify Batch and Streaming Computation
![Page 65: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/65.jpg)
Global State Store Using HBase
![Page 66: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/66.jpg)
Run Primitives on Destination
• Serial execution
  - Easy to reason about operations
  - Very slow
• Parallel execution
  - Fast and scalable
  - Ordering is important: e.g. create table before copying a partition
  - DAG of primitive operations
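The DAG idea can be sketched with Python's standard `graphlib` module (Python 3.9+). This is an illustration, not ReAir's scheduler: the primitive names and `run_in_waves` are hypothetical. Each wave holds primitives whose dependencies are all complete, so everything in one wave could run in parallel.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# primitive -> primitives that must finish first
DEPENDENCIES: Dict[str, Set[str]] = {
    "copy table srcpart": set(),
    "copy partition srcpart/ds=1": {"copy table srcpart"},
    "copy partition srcpart/ds=2": {"copy table srcpart"},
}


def run_in_waves(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Return primitives grouped into waves that respect the ordering."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all dependencies satisfied
        waves.append(ready)                 # a real runner would execute these concurrently
        sorter.done(*ready)
    return waves
```

Here the table copy forms the first wave and both partition copies form the second, which matches the "create table before copying a partition" constraint.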
![Page 67: HONGBO ZENG Airbnb Ä c+ - pic. · PDF file•Data Platform at Airbnb • Cluster Evolution • Incremental Data Replication - ReAir • Unified Streaming and Batch Processing - AirStream](https://reader031.vdocuments.us/reader031/viewer/2022030500/5aacf3c97f8b9a2e088d9eaa/html5/thumbnails/67.jpg)