Download - Meetup realtime datacollection
![Page 1: Meetup realtime datacollection](https://reader035.vdocuments.us/reader035/viewer/2022081715/5463ae06af795904328b5f31/html5/thumbnails/1.jpg)
1
CONDUIT: REAL TIME DATA COLLECTION AT
SCALESharad Agarwal, Amareshwari & Inder Singh
![Page 2: Meetup realtime datacollection](https://reader035.vdocuments.us/reader035/viewer/2022081715/5463ae06af795904328b5f31/html5/thumbnails/2.jpg)
2
Agenda
• About INMOBI
• Data Collection Challenges
• Goals • Design/tech stack deep dive
• Publisher/Consumer eco-system
![Page 3: Meetup realtime datacollection](https://reader035.vdocuments.us/reader035/viewer/2022081715/5463ae06af795904328b5f31/html5/thumbnails/3.jpg)
3
As you grow to multi-region..
Web ApplicationWeb
Applications
Other ApplicationsOther
Applications Consumer ApplicationsConsumer
Applications
High volume complex data flows across WAN
WAN
![Page 4: Meetup realtime datacollection](https://reader035.vdocuments.us/reader035/viewer/2022081715/5463ae06af795904328b5f31/html5/thumbnails/4.jpg)
4
Data Collection Challenges
• Adhoc log aggregation• Duplicate data transfer• Tightly coupled – point to point• No Reliability Guarantees• Network glitches lead to huge backlog• High peak bandwidth requirement – Transfers in
Bursts• No support or different data paths for real-time
usecase
![Page 5: Meetup realtime datacollection](https://reader035.vdocuments.us/reader035/viewer/2022081715/5463ae06af795904328b5f31/html5/thumbnails/5.jpg)
5
Goals
• collect event data from distributed sub-systems in reliable, efficient, scalable and uniform way for batch as well as near real-time consumption.
• Decouple data consumers from producers• Savings on Network Bandwidth
• Reduce peak network requirements due to bulk data transfers in spurts.
• Minimize Duplicate data transfers across WAN
![Page 6: Meetup realtime datacollection](https://reader035.vdocuments.us/reader035/viewer/2022081715/5463ae06af795904328b5f31/html5/thumbnails/6.jpg)
6
A_dc1 B_dc3B_dc1
A
DC1 Consumers DC2 Consumers DC3 Consumers
B A
DC1 Producers DC2 Producers DC3 ProducersB_dc2
Control Flow
Data Flow
![Page 7: Meetup realtime datacollection](https://reader035.vdocuments.us/reader035/viewer/2022081715/5463ae06af795904328b5f31/html5/thumbnails/7.jpg)
7
How to Achieve?
• Use data collection/relay technology– Kafka
• Pull semantics. Data stored on Kafka brokers.
– Scribe• Push semantics. Data typically written to HDFS.
– Flume• Push semantics similar to Scribe. Being rewritten as Flume
NG• Promising but in nascent stage
http://sharadag.tumblr.com/post/13549427326/durable-event-data-transport-at-scale
![Page 8: Meetup realtime datacollection](https://reader035.vdocuments.us/reader035/viewer/2022081715/5463ae06af795904328b5f31/html5/thumbnails/8.jpg)
8
WAN
Primary Secondary
Worker
ZK Cluster
CollectorsVIP
DISTCP
ConsumersStreaming Consumer
Batch Consumer
Producer
|
|
Producer
Powered by Scribe
For HA
![Page 9: Meetup realtime datacollection](https://reader035.vdocuments.us/reader035/viewer/2022081715/5463ae06af795904328b5f31/html5/thumbnails/9.jpg)
9
Producer
• Producers publish messages using Publisher API- Transparent to the underlying publishing technology (scribe, flume, etc)
![Page 10: Meetup realtime datacollection](https://reader035.vdocuments.us/reader035/viewer/2022081715/5463ae06af795904328b5f31/html5/thumbnails/10.jpg)
10
Consumer
• Batch- Data is published in HDFS cluster in min directories:
../streams/<stream>/YYYY/MM/DD/HR/MN
../streams_local/<stream>/YYYY/MM/DD/HR/MN
• Streaming- Streaming Consumer API through an iterator
interface.- Streams messages directly from HDFS - Streaming from multiple clusters in parallel- Checkpoint the stream at any time- Kafka alike static consumer groups for L.B.
![Page 11: Meetup realtime datacollection](https://reader035.vdocuments.us/reader035/viewer/2022081715/5463ae06af795904328b5f31/html5/thumbnails/11.jpg)
11
WAN
Client library
Consumer
Consumer
Producer
|
|Producer
Consumer
Producer
|
|Producer
publish
next
![Page 12: Meetup realtime datacollection](https://reader035.vdocuments.us/reader035/viewer/2022081715/5463ae06af795904328b5f31/html5/thumbnails/12.jpg)
12
Salient Features
• Data compression
• Data merging
• Mirroring
• Streaming consumer API for low latency data transfers
• E2E Audit for SLA, reliability
![Page 13: Meetup realtime datacollection](https://reader035.vdocuments.us/reader035/viewer/2022081715/5463ae06af795904328b5f31/html5/thumbnails/13.jpg)
13
1. Conduit - https://github.com/InMobi/conduit
2. Pintail – Soon to follow
Open Source
![Page 14: Meetup realtime datacollection](https://reader035.vdocuments.us/reader035/viewer/2022081715/5463ae06af795904328b5f31/html5/thumbnails/14.jpg)
14
Thank you!