Meetup: Real-time Data Collection
TRANSCRIPT
1
CONDUIT: REAL-TIME DATA COLLECTION AT SCALE
Sharad Agarwal, Amareshwari & Inder Singh
2
Agenda
• About InMobi
• Data Collection Challenges
• Goals
• Design/tech stack deep dive
• Publisher/Consumer eco-system
3
As you grow to multi-region..
[Diagram: Web Applications, Other Applications and Consumer Applications connected by high-volume, complex data flows across the WAN]
4
Data Collection Challenges
• Ad hoc log aggregation
• Duplicate data transfer
• Tightly coupled – point-to-point
• No reliability guarantees
• Network glitches lead to huge backlogs
• High peak bandwidth requirement – transfers in bursts
• No support, or different data paths, for real-time use cases
5
Goals
• Collect event data from distributed sub-systems in a reliable, efficient, scalable and uniform way, for batch as well as near real-time consumption
• Decouple data consumers from producers
• Save network bandwidth
• Reduce peak network requirements due to bulk data transfers in spurts
• Minimize duplicate data transfers across the WAN
6
[Diagram: producers and consumers in DC1, DC2 and DC3; streams A and B are collected per data center (A_dc1, B_dc1, B_dc2, B_dc3), with control flow and data flow shown between data centers]
7
How to Achieve?
• Use a data collection/relay technology:
– Kafka: pull semantics; data stored on Kafka brokers.
– Scribe: push semantics; data typically written to HDFS.
– Flume: push semantics similar to Scribe; being rewritten as Flume NG. Promising, but in a nascent stage.
http://sharadag.tumblr.com/post/13549427326/durable-event-data-transport-at-scale
8
[Architecture diagram: producers send to collectors behind a VIP (powered by Scribe); primary and secondary workers, coordinated via a ZK cluster for HA, move data across the WAN with DistCp; streaming consumers and batch consumers read the resulting streams]
9
Producer
• Producers publish messages using the Publisher API
– Transparent to the underlying publishing technology (Scribe, Flume, etc.)
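The decoupling described above can be sketched as a thin facade over an interchangeable transport. This is an illustrative sketch only, not Conduit's actual API; the class and method names below are assumptions:

```python
class ScribeBackend:
    """Illustrative stand-in for a Scribe transport."""
    def __init__(self):
        self.sent = []

    def send(self, category, message):
        self.sent.append((category, message))


class Publisher:
    """Facade that hides the underlying publishing technology
    (Scribe, Flume, ...) from producer applications."""
    def __init__(self, backend):
        self._backend = backend

    def publish(self, stream, message):
        # Producers only ever call publish(); swapping the backend
        # requires no change to producer code.
        self._backend.send(stream, message)


backend = ScribeBackend()
publisher = Publisher(backend)
publisher.publish("click_events", b"user=42&ts=1354000000")
```

Because producers depend only on `publish()`, the transport can be replaced (e.g. Scribe swapped for Flume) without touching application code.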
10
Consumer
• Batch
– Data is published to the HDFS cluster in minute directories:
../streams/<stream>/YYYY/MM/DD/HR/MN
../streams_local/<stream>/YYYY/MM/DD/HR/MN
• Streaming
– Streaming Consumer API through an iterator interface
– Streams messages directly from HDFS
– Streams from multiple clusters in parallel
– Can checkpoint the stream at any time
– Kafka-like static consumer groups for load balancing
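A minimal sketch of the two consumption styles above: building the minute-directory path for batch reads, and an iterator that can checkpoint and resume for streaming reads. The function and class names are illustrative assumptions, not the real consumer API:

```python
from datetime import datetime


def minute_dir(root, stream, ts):
    """Build a .../streams/<stream>/YYYY/MM/DD/HR/MN minute directory."""
    return "%s/streams/%s/%04d/%02d/%02d/%02d/%02d" % (
        root, stream, ts.year, ts.month, ts.day, ts.hour, ts.minute)


class StreamingConsumer:
    """Iterates messages in minute order; position can be
    checkpointed at any time and restored later."""
    def __init__(self, minutes):
        # minutes: list of (minute_label, [messages]) in time order
        self._flat = [m for _, msgs in minutes for m in msgs]
        self._pos = 0  # index of the next message to deliver

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos >= len(self._flat):
            raise StopIteration
        msg = self._flat[self._pos]
        self._pos += 1
        return msg

    def checkpoint(self):
        return self._pos

    def seek(self, ckpt):
        self._pos = ckpt
```

For example, `minute_dir("/conduit", "clicks", datetime(2012, 11, 27, 10, 5))` yields `/conduit/streams/clicks/2012/11/27/10/05`, and a consumer that crashes after `checkpoint()` can `seek()` back and resume without losing messages.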
11
[Diagram: producers call publish through the client library; messages flow across the WAN; consumers in each data center retrieve them with next]
12
Salient Features
• Data compression
• Data merging
• Mirroring
• Streaming consumer API for low latency data transfers
• E2E audit for SLA and reliability tracking
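One way to picture an end-to-end audit is to count messages per (stream, time window) at both the publish and consume ends and flag any mismatch. A toy sketch under that assumption (the function and data shapes are illustrative, not Conduit's audit implementation):

```python
from collections import Counter


def audit(published, consumed):
    """Compare per-(stream, minute) message counts at the publish
    and consume ends; a non-zero difference indicates loss
    (positive) or duplication (negative)."""
    pub = Counter((stream, minute) for stream, minute in published)
    con = Counter((stream, minute) for stream, minute in consumed)
    return {key: pub[key] - con[key]
            for key in set(pub) | set(con)
            if pub[key] != con[key]}


published = [("clicks", "10:05")] * 3 + [("views", "10:05")] * 2
consumed = [("clicks", "10:05")] * 3 + [("views", "10:05")]
```

Here `audit(published, consumed)` reports that one `views` message from minute 10:05 never reached a consumer, which is the kind of signal an SLA/reliability audit needs.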
13
Open Source
1. Conduit – https://github.com/InMobi/conduit
2. Pintail – soon to follow
14
Thank you!