Kafka Quotas Talk at LinkedIn

©2015 LinkedIn Corporation. All Rights Reserved. Aditya Auradkar & Dong Lin

Upload: aditya-auradkar

Post on 28-Jan-2018

TRANSCRIPT

Page 1: Kafka Quotas Talk at LinkedIn

Aditya Auradkar & Dong Lin

Page 2: Kafka Quotas Talk at LinkedIn

Motivation: Why is this important?

● Shared resources in a multi-tenant environment

● Bad clients can hurt others

– Bootstrapping consumers

– Buggy clients

● Better QoS for well-behaved clients

● Preserve throughput and latency for everyone else

● API Limits/Billing

Page 3: Kafka Quotas Talk at LinkedIn

Clients and Client-Ids

● Quotas are enforced per client-id

● Why client-id?

● No quotas per topic

● No quotas per topic * client-id combination

● Blanket produce and fetch quota for all clients

Page 4: Kafka Quotas Talk at LinkedIn

Quota Overrides

● Certain clients justify higher quotas

● Changing quotas through broker config requires rolling bounces, which take too long and require too much effort

● Store overrides in ZooKeeper

● Brokers parse config change notifications

● Apply new quota immediately
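A minimal sketch of how a blanket default plus per-client overrides could be looked up and changed at runtime. `QuotaStore`, `quota_for`, and `on_config_change` are hypothetical names, and the ZooKeeper watch that would deliver the change notification is left out; this is an illustration, not the broker's actual code.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class QuotaStore:
    """Per-client-id byte-rate quotas with a blanket default (hypothetical helper)."""
    default_bytes_per_sec: float
    overrides: Dict[str, float] = field(default_factory=dict)

    def quota_for(self, client_id: str) -> float:
        # Clients without an override fall back to the blanket default.
        return self.overrides.get(client_id, self.default_bytes_per_sec)

    def on_config_change(self, client_id: str, new_quota: Optional[float]) -> None:
        # Assumed to be called when a change notification arrives.
        # The new value is used on the next lookup, so no restart is needed.
        # Passing None removes the override.
        if new_quota is None:
            self.overrides.pop(client_id, None)
        else:
            self.overrides[client_id] = float(new_quota)


store = QuotaStore(default_bytes_per_sec=1_048_576)
store.on_config_change("mirror-maker", 10 * 1_048_576)  # made-up client-id
```

Because lookups always read the current map, an override takes effect on the very next request from that client-id.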

Page 5: Kafka Quotas Talk at LinkedIn

Quota Overrides

{
  "version": 1,
  "config": {
    "producer_byte_rate": "1048576",
    "consumer_byte_rate": "1048576"
  }
}
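The override payload stores byte rates as strings, so a reader has to convert them before use. The sketch below parses the payload shown on the slide; `parse_override` is a hypothetical helper, and rejecting any version other than 1 is an assumption made for the example.

```python
import json

payload = """
{
  "version": 1,
  "config": {
    "producer_byte_rate": "1048576",
    "consumer_byte_rate": "1048576"
  }
}
"""


def parse_override(raw: str) -> dict:
    """Return the override's byte rates as integers (bytes/sec)."""
    doc = json.loads(raw)
    # Assumption: only version 1 payloads are understood by this sketch.
    if doc.get("version") != 1:
        raise ValueError(f"unsupported version: {doc.get('version')!r}")
    # Rates arrive as strings in the payload; convert them to ints.
    return {key: int(value) for key, value in doc["config"].items()}


rates = parse_override(payload)
```

Here 1048576 bytes/sec is 1 MiB per second for both produce and fetch traffic.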

Page 6: Kafka Quotas Talk at LinkedIn

Broker Metrics

● Metrics created for each client

● Clients can come and go

● Don’t need to retain client metrics forever

● GC metrics if inactive for longer than 1 hr

● Recreate if client reconnects
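The lifecycle above (create on first use, drop after an hour of inactivity, recreate on reconnect) can be sketched as follows. `ClientMetrics` and its methods are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
from typing import Dict, List

INACTIVITY_LIMIT_SEC = 3600.0  # "inactive for longer than 1 hr" from the slide


class ClientMetrics:
    """Per-client byte counters that are dropped when a client goes quiet."""

    def __init__(self) -> None:
        self._bytes: Dict[str, int] = {}
        self._last_seen: Dict[str, float] = {}

    def record(self, client_id: str, nbytes: int, now: float) -> None:
        # Creates the entry on first use, and again after it was expired.
        self._bytes[client_id] = self._bytes.get(client_id, 0) + nbytes
        self._last_seen[client_id] = now

    def expire_inactive(self, now: float) -> List[str]:
        # Remove every client not seen within the inactivity limit.
        stale = [cid for cid, seen in self._last_seen.items()
                 if now - seen > INACTIVITY_LIMIT_SEC]
        for cid in stale:
            del self._bytes[cid]
            del self._last_seen[cid]
        return stale

    def total_bytes(self, client_id: str) -> int:
        return self._bytes.get(client_id, 0)


metrics = ClientMetrics()
metrics.record("client-a", 100, now=0.0)
metrics.record("client-b", 50, now=3000.0)
expired = metrics.expire_inactive(now=3601.0)  # only client-a is stale
```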

Page 7: Kafka Quotas Talk at LinkedIn

Enforcement

● Reduce client throughput to desired rate

● Compute delay based on current throughput

● Small violations result in small delays

● Use smaller measurement windows to avoid long pauses

● Client side metrics available to detect throttling
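Computing a delay "based on current throughput" needs a rate measured over a bounded window. The sketch below is one simple way to do that with a sliding window; `SlidingWindowRate` is a hypothetical class, and the broker's real measurement scheme may differ.

```python
from collections import deque


class SlidingWindowRate:
    """Bytes/sec over the most recent `window_sec` seconds (illustrative only)."""

    def __init__(self, window_sec: float) -> None:
        self.window_sec = window_sec
        self._events = deque()  # (timestamp, nbytes) pairs, oldest first
        self._total = 0

    def record(self, nbytes: int, now: float) -> None:
        self._events.append((now, nbytes))
        self._total += nbytes

    def rate(self, now: float) -> float:
        # Drop events that have fallen out of the window, then average.
        while self._events and self._events[0][0] <= now - self.window_sec:
            _, old_bytes = self._events.popleft()
            self._total -= old_bytes
        return self._total / self.window_sec


meter = SlidingWindowRate(window_sec=1.0)
meter.record(500, now=0.0)
meter.record(500, now=0.5)
```

A shorter window forgets a burst sooner, which is one reason smaller windows help avoid long pauses.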

Page 8: Kafka Quotas Talk at LinkedIn

Delay Calculation

● Delay = W * (μ - Q) / μ

● W = window size, μ = observed rate, Q = desired rate
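A worked sketch of the formula exactly as written on the slide. `throttle_delay` is a hypothetical function name, and returning zero when the client is within quota is an assumption. One way to read the formula: if a client transfers at rate μ for W - Delay and then idles for Delay, its average over the window is μ * (W - Delay) / W = Q.

```python
def throttle_delay(window_sec: float, observed_rate: float, quota_rate: float) -> float:
    """Delay = W * (mu - Q) / mu, as on the slide; zero when within quota."""
    if observed_rate <= quota_rate:
        return 0.0
    return window_sec * (observed_rate - quota_rate) / observed_rate


# Quota of 10 MBps measured over a 1-second window.
small = throttle_delay(1.0, observed_rate=11.0, quota_rate=10.0)  # 1/11 s, about 91 ms
large = throttle_delay(1.0, observed_rate=20.0, quota_rate=10.0)  # 0.5 s
```

A 10% overshoot yields a delay of about 91 ms, while doubling the quota yields 500 ms, so small violations get small delays. Under this formula the delay never exceeds W, which fits the advice to use smaller windows to avoid long pauses.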

Page 9: Kafka Quotas Talk at LinkedIn

Enforcement

[Diagram: produce request path through the broker. Components: producer, request channel, replica manager, log, quota manager, delay queue.]

Steps: 1. request, 2. process, 3. append, 4. record metric, 5. delay, 6. dequeue, 7. response
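The produce path can be sketched with a small delay queue. The step numbers in the comments follow the diagram; `DelayQueue` and `handle_produce` are hypothetical stand-ins, the delay is passed in rather than computed, and time is supplied by the caller so the example is deterministic.

```python
import heapq
from typing import List, Tuple


class DelayQueue:
    """Holds responses until their release time (hypothetical stand-in)."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str]] = []
        self._seq = 0  # tie-breaker so equal release times stay in FIFO order

    def put(self, release_at: float, response: str) -> None:
        heapq.heappush(self._heap, (release_at, self._seq, response))
        self._seq += 1

    def pop_ready(self, now: float) -> List[str]:
        # 6. dequeue: return every response whose delay has elapsed.
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[2])
        return ready


def handle_produce(log: List[bytes], payload: bytes, delay_sec: float,
                   now: float, delayed: DelayQueue) -> None:
    log.append(payload)  # 3. append
    # 4. record metric: a quota manager would update the client's byte
    #    rate here and derive delay_sec from it (omitted in this sketch).
    delayed.put(now + delay_sec, f"ack:{len(log) - 1}")  # 5. delay


log: List[bytes] = []
queue = DelayQueue()
handle_produce(log, b"first", delay_sec=0.0, now=0.0, delayed=queue)
handle_produce(log, b"second", delay_sec=0.5, now=0.0, delayed=queue)
```

In this reading of the diagram the data is appended before any delay is applied, so only the response (step 7) is held back; the write itself is never rejected.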

Page 10: Kafka Quotas Talk at LinkedIn

Enforcement

[Diagram: fetch request path through the broker. Components: consumer, request channel, replica manager, log, quota manager, delay queue.]

Steps: 1. request, 2. process, 3. fetch offsets, 4. record metric, 5. delay, 6. dequeue, 7. response (zero copy)

Page 11: Kafka Quotas Talk at LinkedIn

Slowdown vs Error

● Error handling is hard

● Tricky to implement backoff and retries

● All client implementations need to handle quota errors

● Need something easier

Page 12: Kafka Quotas Talk at LinkedIn

Getting Started

● Important broker configs

– quota.producer.default (in bytes/sec)

– quota.consumer.default (in bytes/sec)

● Apply overrides

./bin/kafka-configs.sh --alter \
  --add-config 'producer_byte_rate=1048576,consumer_byte_rate=1048576' \
  --entity-type clients --entity-name TestTopic \
  --zookeeper localhost:2181

● Read overrides

./bin/kafka-configs.sh --describe \
  --entity-type clients --entity-name TestTopic \
  --zookeeper localhost:2181

Page 13: Kafka Quotas Talk at LinkedIn

Monitoring

● Producer metrics

– throttle-time avg and max

● Consumer metrics

– throttle-time avg and max

● Broker metrics

– byte-rate and avg throttle-time per client-id

– byte-rate is used for enforcement

● ZookeeperConsumerConnector and SimpleConsumer metrics also available

Page 14: Kafka Quotas Talk at LinkedIn

Rollout Strategy

● Deploy without enforcement

● Monitor metrics to track throughput for all clients

● Identify candidates for overrides

● Start with high thresholds
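One way to turn the monitoring step into a list of override candidates is to compare each client's observed peak against the default you plan to enforce. The function name, client-ids, and numbers below are all made up for illustration.

```python
from typing import Dict, List


def override_candidates(peak_bytes_per_sec: Dict[str, float],
                        planned_default: float) -> List[str]:
    """Client-ids whose observed peak would exceed the planned default quota."""
    return sorted(cid for cid, peak in peak_bytes_per_sec.items()
                  if peak > planned_default)


# Hypothetical peaks collected while quotas were deployed without enforcement.
observed = {"etl-job": 40e6, "web-frontend": 2e6, "mirror-maker": 75e6}
candidates = override_candidates(observed, planned_default=50e6)
```

With a high starting threshold of 50 MB/s, only `mirror-maker` would need an override before enforcement is switched on.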

Page 15: Kafka Quotas Talk at LinkedIn

Evaluation

● Validate quota functionality

– broker-throughput <= sum(quota_of_clientId)

– sum(client-throughput) <= quota_of_clientId

● Evaluate performance improvement for clients

– Throughput and latency

– Clients with different throughput demand
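The two validation inequalities can be checked mechanically. In the sketch below, `samples` holds one measured rate per client instance, and instances sharing a client-id are summed before comparing against that id's quota. The function name and data layout are assumptions, and a real check would likely allow some tolerance because measured rates fluctuate.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def validate_quotas(samples: Iterable[Tuple[str, float]],
                    quotas: Dict[str, float]) -> Tuple[bool, List[str]]:
    """Return (broker_ok, client-ids that exceed their own quota)."""
    per_client_id: Dict[str, float] = defaultdict(float)
    for client_id, rate in samples:
        per_client_id[client_id] += rate  # sum(client-throughput) per client-id

    # sum(client-throughput) <= quota_of_clientId
    violators = sorted(cid for cid, total in per_client_id.items()
                       if total > quotas[cid])
    # broker-throughput <= sum(quota_of_clientId)
    broker_ok = sum(per_client_id.values()) <= sum(quotas[cid] for cid in per_client_id)
    return broker_ok, violators


quotas = {"a": 10.0, "b": 10.0}
ok_result = validate_quotas([("a", 4.0), ("a", 5.0), ("b", 9.0)], quotas)
bad_result = validate_quotas([("a", 4.0), ("a", 5.0), ("b", 12.0)], quotas)
```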

Page 16: Kafka Quotas Talk at LinkedIn

Evaluation – Validate Quota Functionality

● Unlimited quota

[Figure: two plot panels, producer and consumer]

Page 17: Kafka Quotas Talk at LinkedIn

Evaluation – Validate Quota Functionality

● quota.producer.default = quota.consumer.default = 50 MBps

[Figure: two plot panels, producer and consumer]

Page 18: Kafka Quotas Talk at LinkedIn

Evaluation – Validate Quota Functionality

● quota.producer.default = quota.consumer.default = 10 MBps

[Figure: two plot panels, producer and consumer]

Page 19: Kafka Quotas Talk at LinkedIn

Evaluation – Client Performance Improvement

● Small client running alone

Page 20: Kafka Quotas Talk at LinkedIn

Evaluation – Client Performance Improvement

● Small client running alone

● Clients join together

Page 21: Kafka Quotas Talk at LinkedIn

Evaluation – Client Performance Improvement

● Small client running alone

● Clients join together

● Clients join in presence of quota

Page 22: Kafka Quotas Talk at LinkedIn

Evaluation – Client Performance Improvement

● Small client running alone

● Clients join together

● Clients join in presence of quota

● Comparison

Page 23: Kafka Quotas Talk at LinkedIn

Evaluation – Producer Performance Improvement

[Figure: producer latency (ms) vs. time (sec, 0 to 600); series: alone, together, quota]

Page 24: Kafka Quotas Talk at LinkedIn

Evaluation – Producer Performance Improvement

● Producer runs at 2 MBps alone (alone)

[Figure: producer latency (ms) vs. time (sec, 0 to 600); series: alone, together, quota]

              Alone
Latency (ms)  1.5

Page 25: Kafka Quotas Talk at LinkedIn

Evaluation – Producer Performance Improvement

● Producer runs at 2 MBps alone (alone)

● Producer runs with other producers without quota (together)

[Figure: producer latency (ms) vs. time (sec, 0 to 600); series: alone, together, quota]

              Alone   Together
Latency (ms)  1.5     23.6

Page 26: Kafka Quotas Talk at LinkedIn

Evaluation – Producer Performance Improvement

● Producer runs at 2 MBps alone (alone)

● Producer runs with other producers without quota (together)

● Producer runs with other producers with 10 MBps quota (quota)

[Figure: producer latency (ms) vs. time (sec, 0 to 600); series: alone, together, quota]

              Alone   Together   Quota
Latency (ms)  1.5     23.6       2.5
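A short arithmetic check on the three latency figures from the table makes the comparison concrete. Only the three numbers come from the slide; the variable names are illustrative.

```python
latency_ms = {"alone": 1.5, "together": 23.6, "quota": 2.5}

# How much slower the small producer gets once other producers join.
slowdown_without_quota = latency_ms["together"] / latency_ms["alone"]
slowdown_with_quota = latency_ms["quota"] / latency_ms["alone"]
# Fraction of the "together" latency that the quota removes.
reduction = 1 - latency_ms["quota"] / latency_ms["together"]

print(f"together vs alone: {slowdown_without_quota:.1f}x")
print(f"quota vs alone:    {slowdown_with_quota:.1f}x")
print(f"quota cuts the 'together' latency by {reduction:.0%}")
```

Without quotas the small producer's latency is roughly 15.7 times its standalone value; with the 10 MBps quota it is about 1.7 times, a reduction of about 89% relative to the unthrottled case.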

Page 27: Kafka Quotas Talk at LinkedIn

Evaluation – Consumer Performance Improvement

[Figure: consumer throughput (MBps) vs. time (sec, 0 to 600); series: alone, together, quota, alone_quota]

Page 28: Kafka Quotas Talk at LinkedIn

Evaluation – Consumer Performance Improvement

● Consumer runs alone (alone)

● Consumer runs alone with 50 MBps quota (alone-quota)

[Figure: consumer throughput (MBps) vs. time (sec, 0 to 600); series: alone, together, quota, alone_quota]

                   alone   alone-quota
Throughput (MBps)  87      45

Page 29: Kafka Quotas Talk at LinkedIn

Evaluation – Consumer Performance Improvement

● Consumer runs alone (alone)

● Consumer runs alone with 50 MBps quota (alone-quota)

● Consumer runs with other consumers without quota (together)

[Figure: consumer throughput (MBps) vs. time (sec, 0 to 600); series: alone, together, quota, alone_quota]

                   alone   alone-quota   together
Throughput (MBps)  87      45            31

Page 30: Kafka Quotas Talk at LinkedIn

Evaluation – Consumer Performance Improvement

● Consumer runs alone (alone)

● Consumer runs alone with 50 MBps quota (alone-quota)

● Consumer runs with other consumers without quota (together)

● Consumer runs with other consumers with 50 MBps quota (quota)

[Figure: consumer throughput (MBps) vs. time (sec, 0 to 600); series: alone, together, quota, alone_quota]

                   alone   alone-quota   together   quota
Throughput (MBps)  87      45            31         40
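The throughput table can be read against the 50 MBps quota with a couple of lines of arithmetic. The four numbers and the quota value come from the slide; the names are illustrative.

```python
throughput_mbps = {"alone": 87, "alone-quota": 45, "together": 31, "quota": 40}
QUOTA_MBPS = 50  # quota used in the "alone-quota" and "quota" runs

# Both throttled runs should stay at or below the configured quota.
within_quota = all(throughput_mbps[run] <= QUOTA_MBPS
                   for run in ("alone-quota", "quota"))
# Relative gain for the consumer when the other consumers are throttled.
gain_from_quota = throughput_mbps["quota"] / throughput_mbps["together"] - 1
```

Both throttled runs stay under the quota, and with other consumers present the measured consumer gets about 29% more throughput when quotas are on (40 vs. 31 MBps).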

Page 31: Kafka Quotas Talk at LinkedIn

Evaluation - Summary

● Quota functionality is enforced

● Quotas improve performance for existing clients when large clients join

Page 32: Kafka Quotas Talk at LinkedIn

Future Work

● Throttle replica traffic (e.g. during bootstrap)

● Throttle more request types (OffsetCommitRequest etc.)

● Client-id authentication for use in multi-tenancy environment

Page 33: Kafka Quotas Talk at LinkedIn

Acknowledgements

● LinkedIn Kafka Engineering team

● Confluent Inc

● John McClean (formerly at LI)