siphon - near real time databus using kafka, eric boyd, nitin kumar

27
Thursday, April 14, 2016 Siphon – Near Real Time Databus Using Kafka Eric Boyd – CVP Engineering – Microsoft Nitin Kumar – Principal Eng Manager - Microsoft

Upload: confluent

Post on 16-Apr-2017

1.869 views

Category:

Engineering


3 download

TRANSCRIPT

Page 1: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

Thursday, April 14, 2016

Siphon – Near Real Time Databus Using KafkaEric Boyd – CVP Engineering – Microsoft

Nitin Kumar – Principal Eng Manager - Microsoft

Page 2: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
Page 3: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

Linux is a

cancer

Page 4: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

Thursday, April 14, 2016

Page 5: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
Page 6: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

Ads Oslo Schedule

Page 7: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

Ads Oslo Feature List

Page 8: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
Page 9: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

Bing Ads Execution

• Shipped once every 6 months

• Averaged 3 marketplace experiments per month

• Big bets on marketplace features that didn’t work.

• Focused teams on 6 tracks with

independent metrics.

• Pushed teams to ship as quickly as they

could, focusing only on moving their

metric.

• Built/borrowed infrastructure to enable

much more rapid experimentation.

• Over 3 years got to a rate of >1000

experiments a month

Page 10: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

Profitability!!

Eric joinsMSFT

Page 11: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

What drove the turnaround?

• Focus on small teams with clear metrics each team was driving.

• Pushing each team to experiment and iterate as fast as possible. Data alone determines what gets shipped.

• Iterated on key metrics until we found the ones with the most impact.

• Commitment that we would get 1.5-2% better each month, and ship a package of experimentally tested improvements each month.

Page 12: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

Relationship with Open Source

• From “Linux is a cancer…”

• To contributing to open source • Storm with C# - SCP.NET (http://www.nuget.org/packages/Microsoft.SCP.Net.SDK/)

• Spark with C# - Mobius (https://github.com/Microsoft/Mobius)

• Kafka with C# - C# Client for Kafka (https://github.com/Microsoft/Kafkanet)

• BOND (https://github.com/Microsoft/bond)

• Across MSFT• C#• VSCode• Hyper-V drivers for Linux• https://github.com/Microsoft/ with 18 pages of repositories!

Page 13: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

Microsoft Big Data History

• Massive batch oriented systems• Hundreds of thousands of machines• Exabytes of storage• SQL-like language with C# extensions

Page 14: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

Moving to streaming

Page 15: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

Data Bus

Devices Services

Streaming Processing

BatchProcessing

Applications

Scalable pub/sub for NRT data streams

Interactive analytics

Page 16: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

Vision

• A Databus for all Near Real Time (NRT) data in an organization.

• Quick and Easy Publication, Discovery and Subscription of NRT dataset.

• Compatibility with various Stream Processing systems like Storm, Spark, Splunk.

Page 17: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

Siphon Adoption

15 months since launch

Excel Word Outlook

Windows 10

Page 18: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

Usage

Bing Ads Campaign perf

Bing Live site telemetryCortana

Office 365

0

10

20

30

40

50

60

70

80

Thro

ugh

pu

t (i

n G

Bp

s)

Siphon Data Volume (Ingress and Egress)

Volume published (GBps) Volume subscribed (GBps) Total Volume (GBps)

0

2

4

6

8

10

12

14

16

18

Thro

ugh

pu

t (e

ven

ts p

er s

ec)

Mill

ion

s

Siphon Events per second (Ingress and Egress)

EPS In Eps Out Total EPS

1.3 millionEVENTS PER SECOND INGRESS AT PEAK

~1 trillionEVENTS PER DAY PROCESSED AT PEAK

3.5 petabytesPROCESSED PER DAY

100 thousandUNIQUE DEVICES AND MACHINES

1,300PRODUCTION KAFKA BROKERS

Page 19: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

Scale: Kafka at Microsoft (Ads, Bing, Office)

Kafka Brokers 1300+ across 5 Datacenters

Operating System Windows Server 2012 R2

Hardware Spec 12 Cores, 32 GB RAM, 4x2 TB HDD (JBOD), 10 GB Network

Incoming Events 1.3 million per sec, (112 Billion per day, 500 TB per day)

Outgoing Events 5 million per sec, (~1 Trillion per day, 3.5 PB per day)

Kafka Topics/Partitions 50+/5000+

Kafka version 0.8.1.1 (3 way replication)

Page 20: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

Siphon Architecture

Asia DC

Zookeeper Canary

Kafka

Collector

Agent

Services Data Pull (Agent)

Services Data Push

Device Proxy Services

Consumer API (Push/

Pull)

Europe DC

Zookeeper Canary

Kafka

US DC

Zookeeper Canary

Kafka

Streaming

Batch

Audit Trail

Open Source

Microsoft Internal

Siphon

Page 21: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

Multiple sources and schemas

Siphon Bond

Schema

Part

A Main Header

MessageId

AuditId

TimeStamp

Part

B Extended HeaderKey-Value[]

Part

C Payload

CSV

XML

JSONJSON

XML

CSVSiphon Bond

Schema

Bond (https://github.com/Microsoft/bond) Cross platform framework for working with schematized data. Cross language (de) serialization. Similar to Protobuf, Thrift and AVRO.

Page 22: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

Collector – Data Ingestion (Producer)

• Http(s) Server • Restful API with SSL support.• Abstraction from Kafka

internals (Partition, Kafka version)• Throttling, QPS Monitoring• PII scrubbing• Load balancing/failover to multiple DCs• Supported for both Windows and Linux

servers.

Device Proxy Services

Collector

Kafka Brokers

Broker

Broker

Broker

Broker

P0

P1

P2

P3

P4

P5

P6

P7

P8

P9

P10

P11

Collector

Collector

Load

Bal

ance

r

Services Data Push

Agent

Services Data Pull (Agent)

Open Source

Microsoft Internal

Siphon

URL : http://localhost/produce/<version>?topic=<toipic>Method : POST

Page 23: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

Pull & Push Consumers

Virtual Network A

HLC

Pull

Kafka Brokers

Broker

Broker

Broker

Broker

P0

P2

P3

P4

P5

P6

P7

P8

P9

P10

P11

P1Collector

Collector

RES

T A

PI

Virtual Network B

Pull• RESTful API with SSL support• Works for out of network consumers• Supports metadata and data operation• Implement Simple consumer APIs• Spark streaming receiver for Kafka REST

Push• Configurable push to destinations like HDFS,

Cosmos, Kafka.• Utilizes KafkaNet - .NET High Level Consumer

(https://github.com/Microsoft/Kafkanet)

Page 24: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

High Level Consumer

Monitoring using Canary

Device Proxy Services

Collector

Kafka Brokers

Broker

Broker

Broker

Broker

P0

P1

P2

P3

P4

P5

P6

P7

P8

P9

P10

P11

Collector

Collector

Load

Bal

ance

rServices Data Push

Agent

Services Data Pull (Agent)

Synthetic message

Audit Trail

Canary - https://github.com/Microsoft/Availability-Monitor-for-Kafka

Page 25: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

High Level Consumer

Device Proxy Services

Collector

Kafka Brokers

Broker

Broker

Broker

Broker

P0

P1

P2

P3

P4

P5

P6

P7

P8

P9

P10

P11

Collector

Collector

Load

Bal

ance

rServices Data Push

Agent

Services Data Pull (Agent)

Audit Trail

Sampled vs Full Auditing support

Data completeness – Audit Trail

Page 26: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

Production Experience – Telemetry Charts

• Monitoring using ELK• E2E Latency

• Data Completeness

• Processing Lag

• EPS breakdown by data center.

Page 27: Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

Key Takeaways

• Scale out with Kafka (50K -> 1M -> multi-million Events Per sec)

• Ability to build tunable Auditing/Monitoring

• Producer/Consumer Restful API provides a nice abstraction

• Config driven Pub/Sub system