scaling the yelp’s logging pipeline [email protected] ......with apache kafka enrico canzonieri...

52
Scaling the Yelp’s logging pipeline with Apache Kafka Enrico Canzonieri [email protected] @EnricoC89

Upload: others

Post on 25-May-2020

16 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Scaling the Yelp’s logging pipeline with Apache Kafka

Enrico [email protected] @EnricoC89

Page 2: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Yelp’s MissionConnecting people with great

local businesses.

Page 3: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Yelp StatsAs of Q1 2016

90M 3270%102M

Page 4: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

What to expect

● High-level architecture overview● Kafka best-practices● No code

Page 5: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Logging? What is that?

Page 6: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

A log is a stream of events

Page 7: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Logs provide valuable operational and business

visibility.

Page 8: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Producing and consuming logs should be easy

Page 9: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Logs at scale

Page 10: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Consuming logs becomes hard

Solution: Centralized logging!

Page 11: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Simple aggregation strategy: upload all logs to a centralized datastore

Pros:● Easy to implement (cron,

logrotate, etc)● Unique place to access logsCons:● No real time streaming● Colocation not aggregation

Amazon S3

Page 12: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Log aggregation systems

Page 13: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

ScribePush based architecture

Page 14: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Yelp’s logging pipeline V1.0

Scribe Leaf

Web app

Scribe aggregator xinetd

Yelp Server

Log aggregator

S3 uploader

tail -follow=name /nail/scribe/<category>/current

consumer consumerScribe Leaf

Web app

Yelp Server

Amazon S3

Page 15: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Why use Scribe?

● Simple to configure● Multilang support (via Thrift)● Stable (most of the times) and fast● Support arbitrary message size

Page 16: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

● No support for log consumption● Doesn’t really scale● No replication● Abandoned project● Lack of plugins

Why move away from Scribe?

Page 17: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Stream processing: Statmonster

Page 18: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Stream processing: Statmonster

● Statmonster is a real-time metrics pipeline● The first stage extract multiple metrics from

each line● Consumes most of the high volume logs

Page 19: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

● In 2013 the metrics from high volume logs started falling behind

● The first stage of the pipeline was too slow to keep up with the increased volume of logs

Stream processing: Statmonster

Page 20: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

● Short term solution: sample logs to save timeliness

● Long term solution: run Statmonster on Apache Storm.

Stream processing: Statmonster

Page 21: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

● Short term solution: sample logs to save timeliness

● Long term solution: run Statmonster on Apache Storm.

Stream processing: Statmonster

Page 22: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

The real bottleneckA

ggre

gato

rs

Statmonster

Statmonster

Statmonster

tail -f /nail/scribe/access

tail -f /nail/scribe/access

tail -f /nail/scribe/access

Can’t scale the consumer without increasing the number of physical log aggregators

Page 23: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Need to scale the source!

Page 24: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Kafka into the mix

● High-throughput Publish-subscribe messaging system

● Distributed, replicated commit log● Client library available for a variety of

languages ● Configurable retention

Page 25: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Kafka Message (topic, partition, offset)

Message

0 1 2 3 4 5 6 7 8 9 10

11

12

0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8 9 10

11

12

Partition0

Partition1

Partition2

Old New

Topic

Page 26: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Kafka Producer

● Java producer is async by default● Messages are produced in batch● Support either at-least-once or at-

most-once semantic● Support compression out of the

box

Page 27: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Kafka Consumer

● Identified by a group id● Consumers coordinate to consume

from different partitions● Ability to re-play messages● Native support for offset commit

Page 28: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Architecture overview

Page 29: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Yelp’s logging pipeline V1.5

Scribe Leaf

Scribe aggregator

Kafka consumer

Kafka consumerLegacy consumer

Legacy consumer

Amazon S3

Scribe Leaf Sekretar

Page 30: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Sekretar

● Act as a Scribe aggregator speaking the Scribe protocol

● Produce messages to Kafka● Map a log category into a topic● Act as an intermediate buffer● Handle big messages

Page 31: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Statmonster 2.0

StatmonsterKafka consumer

access partition 0

StatmonsterKafka consumer

StatmonsterKafka consumer

access partition 1

access partition N

Page 32: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Yelp’s logging pipeline V2.0

Scribe Leaf

Legacy consumer Kafka

consumerScribe Kafka Reader

Legacy consumer

Amazon S3

Scribe Leaf

Sekretar

Page 33: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Multi-regional architectureData center 1 Data center 2

Leaf

Aggregate logs

consumer

Leaf

Amazon S3

Kafka consumer

Kafka consumer

Sekretar Sekretar

Page 34: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Configuration

Page 35: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Availability vs Consistency

Cluster side:● Unclean leader election● Replication factor● Minimum in-sync replicas (ISR)Producer side:● acks

Disabledunclean.leader.election.enable

3: 1 leader + 2 followersdefault.replication.factor

Quorum: 3/2 + 1 = 2 min.insync.replicas

All (or -1): all replicas in the ISR need to ackacks

Page 36: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Provisioning partitionsHow many partitions for a topic?

Producer

Consumer

Partition 1

Partition 2

ProducerProducer

Partition 3

Partition 1

Partition 2

Producer

Topic 1

Topic 2

Group 1

Consumer Consumer Consumer

Group 2

Consumer

Page 37: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Criteria 1: consumer

Consumer speed Number of different consumer groups

Page 38: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Criteria 2: Kafka

Physical Limits of a single broker egress/ingressLog retentionRecovery speed upon broker failures

Page 39: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Criteria 3: producer

Ingress rate into Kafka in msg/s and bytes/s

Page 40: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Accounting for spikesMessage rate (msg/s) percentiles: 50, 75, 90, 95, 99, 99.5

Page 41: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Partitions auto-creation

Regular traffic increase

Automatically increase the number of partitions based on 99th percentile of bytes/s and msg/s over a time window of 3 days.

Use a combined threshold of 700 msg/s and 500 KiB/s.

Page 42: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Tooling

Page 43: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

https://github.com/yahoo/kafka-managerKafka Manager

Page 44: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Kafka UtilsContains several command line tools to help operating and maintaining a Kafka cluster.https://github.com/Yelp/kafka-utils

● Cluster rolling restart● Consumer offset management● Cluster rebalance and broker decommission● Healthchecks

Page 45: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Monitoring

Page 46: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Monitoring Sekretar● Ingress: msg/s, bytes/s● Egress: msg/s, (compressed) bytes/s● Message delay● Producer latency ● Producer memory buffer● Timed-out message count

Page 47: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Monitoring Sekretar

Unhealthy Kafka cluster

Page 48: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Kafka monitoringKafka exports metrics via jmx (http://docs.confluent.io/1.0/kafka/monitoring.html):● Offline partitions● Under replicated partitions● Controller count

Kafka-Utils includes a check for min.isr

Page 49: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Consumer monitoring

● Consumer speed (msg/s)● Kafka speed● Consumer lag (msg)

Page 50: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Consumer monitoringConsumer slow down or logs volume increase

Page 51: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

Apache Kafka @ Yelp

18 clusters~ 100 brokers> 20K topics~ 45K partitions> 5TB of data per day~ 25 Billion messages per day

Page 52: Scaling the Yelp’s logging pipeline enrico@yelp.com ......with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp’s Mission Connecting people with great local businesses

@YelpEngineering

fb.com/YelpEngineers

engineeringblog.yelp.com

github.com/yelp