Troubleshooting Kafka's socket server: from incident to resolution

Joel Koshy, LinkedIn

TRANSCRIPT

Page 1: Troubleshooting Kafka's socket server: from incident to resolution

Troubleshooting Kafka’s socket server: from incident to resolution

Joel Koshy, LinkedIn

Page 2: Troubleshooting Kafka's socket server: from incident to resolution

The incident
Occurred a few days after upgrading to pick up quotas and SSL

[Timeline: internal releases x25 → x38, picking up multi-port support (KAFKA-1809, KAFKA-1928), SSL (KAFKA-1690), and various quota patches across April 5, June 3, and August 18; the incident occurred on October 13.]

Page 3: Troubleshooting Kafka's socket server: from incident to resolution

The incident
A broker (which happened to be the controller) failed in our queuing Kafka cluster

Page 4: Troubleshooting Kafka's socket server: from incident to resolution

The incident
● Alerts fire; NOC engages SYSOPS/Kafka-SRE
● Kafka-SRE restarts the broker
● Broker failure does not generally cause prolonged application impact
  ○ but in this incident…

Page 5: Troubleshooting Kafka's socket server: from incident to resolution

The incident
Multiple applications begin to report “issues”: socket timeouts to the Kafka cluster

Posts search was one such impacted application

Page 6: Troubleshooting Kafka's socket server: from incident to resolution

The incident
Two brokers report high request and response queue sizes

Page 7: Troubleshooting Kafka's socket server: from incident to resolution

The incident
Two brokers report high request queue size and request latencies

Page 8: Troubleshooting Kafka's socket server: from incident to resolution

The incident
● Other observations
  ○ High CPU load on those brokers
  ○ Throughput degrades to ~ half the normal throughput
  ○ Tons of broken pipe exceptions in server logs
  ○ Application owners report socket timeouts in their logs

Page 9: Troubleshooting Kafka's socket server: from incident to resolution

Remediation
Shifted site traffic to another data center

“Kafka outage ⇒ member impact

Multi-colo is critical!”

Page 10: Troubleshooting Kafka's socket server: from incident to resolution

Remediation
● Controller moves did not help
● Firewall the affected brokers
● The above helped, but the cluster fell over again after dropping the rules
● Suspect misbehaving clients on broker failure
  ○ … but x25 never exhibited this issue

sudo iptables -A INPUT -p tcp --dport <broker-port> -s <other-broker> -j ACCEPT

sudo iptables -A INPUT -p tcp --dport <broker-port> -j DROP
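(Rule order matters here: the ACCEPT for the other brokers is appended first, so inter-broker/replication traffic keeps flowing while the trailing DROP keeps client traffic off the affected broker.)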

Page 11: Troubleshooting Kafka's socket server: from incident to resolution
Page 12: Troubleshooting Kafka's socket server: from incident to resolution

“Good dashboards/alerts

Skilled operators

Clear communication

Audit/log all operations

CRM (crew resource management) principles apply to operations”

Pages 13-17: Troubleshooting Kafka's socket server: from incident to resolution

Remediation
Friday night ⇒ roll back to x25 and debug later

… but SREs had to babysit the rollback

[Rolling downgrade, animated across these slides: the x38 brokers are downgraded to x25 one at a time, with leader moves and firewalling of the broker being downgraded at each step.]

Page 18: Troubleshooting Kafka's socket server: from incident to resolution

… oh and BTW: be careful when saving a lot of public-access/server logs:

● Can cause GC pauses: the copy saturates disk I/O, so the JVM stalls waiting on it (note the low user/sys times versus the high real time below)

[Times: user=0.39 sys=0.01, real=8.01 secs]

● Use ionice or rsync --bwlimit to limit the I/O impact

Page 19: Troubleshooting Kafka's socket server: from incident to resolution

The investigation

Page 20: Troubleshooting Kafka's socket server: from incident to resolution

Attempts at reproducing the issue

● Test cluster
  ○ Tried killing controller
  ○ Multiple rolling bounces
  ○ Could not reproduce
● Upgraded the queuing cluster to x38 again
  ○ Could not reproduce
● So nothing…

Page 21: Troubleshooting Kafka's socket server: from incident to resolution

Understanding queue backups…

Pages 22-31: Troubleshooting Kafka's socket server: from incident to resolution

Life-cycle of a Kafka request

[Diagram, built up across these slides: client connections enter the network layer, where the Acceptor hands new connections to one of several Processors; each Processor feeds a shared request queue and drains its own response queue. The API layer consists of a pool of API handlers plus the purgatory and the quota manager.]

● The Acceptor accepts new connections and assigns them to Processors
● A Processor reads a request off a connection and then turns off read interest from that connection (for ordering)
● The request awaits handling in the request queue (queue-time)
● An API handler handles the request (local-time)
● Long-poll requests wait in the purgatory for their condition to be satisfied (remote-time)
● The response is held by the quota manager if a quota is violated (quota-time)
● The response awaits its Processor in the response queue (response-queue-time)
● The Processor writes the response back to the client (response-send-time) and turns read interest back on for that connection

Total time = queue-time + local-time + remote-time + quota-time + response-queue-time + response-send-time
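The same decomposition written out as a minimal sketch (field names are illustrative, loosely mirroring the broker's request-log fields, not actual broker code):

case class RequestTimes(
  requestQueueTimeMs: Long,  // await handling in the request queue
  localTimeMs: Long,         // API handler work
  remoteTimeMs: Long,        // purgatory wait (long-poll requests)
  quotaTimeMs: Long,         // held because a quota was violated
  responseQueueTimeMs: Long, // await a processor in the response queue
  responseSendTimeMs: Long   // writing the response back to the client
) {
  def totalTimeMs: Long =
    requestQueueTimeMs + localTimeMs + remoteTimeMs +
      quotaTimeMs + responseQueueTimeMs + responseSendTimeMs
}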

Pages 32-34: Troubleshooting Kafka's socket server: from incident to resolution

Investigating high request times
● Total time is useful for monitoring
● but high total time is not necessarily bad

[Charts across these slides: one request type's total time is low; another's is high but “normal” because the time is spent in purgatory.]

Page 35: Troubleshooting Kafka's socket server: from incident to resolution

Investigating high request times
● First look for high local time
  ○ then high response send time
    ■ then high remote (purgatory) time → generally a non-issue (but caveats described later)
● High request queue/response queue times are effects, not causes

(A compact code restatement of this triage order follows.)
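A minimal sketch of that ordering (thresholds and names are illustrative, not broker code):

object RequestTriage {
  case class ObservedTimes(localMs: Long, responseSendMs: Long, remoteMs: Long)

  // Follow the ordering above: local time first, then response send time, then remote time.
  def triage(t: ObservedTimes, slowMs: Long = 100): String =
    if (t.localMs > slowMs) "high local time: look at what the API handler is doing"
    else if (t.responseSendMs > slowMs) "high response send time: slow client or network"
    else if (t.remoteMs > slowMs) "high remote time: usually a benign purgatory wait (see caveats later)"
    else "no single component is slow; queue times are symptoms, so check queue sizes and handler idle ratio"
}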

Page 36: Troubleshooting Kafka's socket server: from incident to resolution

High local times during incident (e.g., fetch)

Page 37: Troubleshooting Kafka's socket server: from incident to resolution

How are fetch requests handled?
● Get physical offsets to be read from local log during response
● If fetch from follower (i.e., replica fetch):
  ○ If follower was out of ISR and just caught up, then expand ISR (ZooKeeper write)
  ○ Maybe satisfy eligible delayed produce requests (with acks -1)
● Else (i.e., consumer fetch):
  ○ Record/update byte-rate of this client
  ○ Throttle the request on quota violation

Page 38: Troubleshooting Kafka's socket server: from incident to resolution

Could these cause high local times?
● Get physical offsets to be read from local log during response → should be fast
● If fetch from follower (i.e., replica fetch):
  ○ If follower was out of ISR and just caught up, then expand ISR (ZooKeeper write) → maybe
  ○ Satisfy eligible delayed produce requests (with acks -1) → not using acks -1
● Else (i.e., consumer fetch):
  ○ Record/update byte-rate of this client → should be fast
  ○ Throttle the request on quota violation → delayed outside API thread

Page 39: Troubleshooting Kafka's socket server: from incident to resolution

ISR churn?

Page 40: Troubleshooting Kafka's socket server: from incident to resolution

ISR churn? … unlikely

● Low ZooKeeper write latencies
● Churn in this incident: effect of some other root cause
● Long request queues can cause churn
  ○ ⇒ follower fetches time out
    ■ ⇒ fall out of ISR (ISR shrink happens asynchronously in a separate thread)
  ○ Outstanding fetch catches up and ISR expands

Page 41: Troubleshooting Kafka's socket server: from incident to resolution

High local times during incident (e.g., fetch)

Besides, fetch-consumer (not just follower) has high local time

Page 42: Troubleshooting Kafka's socket server: from incident to resolution

Could these cause high local times? (revisiting the list from page 38)

With acks -1 not in use, ISR expansion deemed unlikely, and quota throttling delayed outside the API thread, the remaining suspect in the consumer-fetch path is “Record/update byte-rate of this client”.

Test this…

Page 43: Troubleshooting Kafka's socket server: from incident to resolution

Quota metrics: maintains byte-rate metrics on a per-client-id basis

2015/10/10 03:20:08.393 [] [] [] [logger] Completed request:Name: FetchRequest; Version: 0; CorrelationId: 0; ClientId: 2c27cc8b_ccb7_42ae_98b6_51ea4b4dccf2; ReplicaId: -1; MaxWait: 0 ms; MinBytes: 0 bytes from connection <clientIP>:<brokerPort>-<localAddr>;totalTime:6589,requestQueueTime:6589,localTime:0,remoteTime:0,responseQueueTime:0,sendTime:0,securityProtocol:PLAINTEXT,principal:ANONYMOUS

??!

Page 44: Troubleshooting Kafka's socket server: from incident to resolution

Quota metrics - a quick benchmark

for (clientId ← 0 until N) {
  timer.time {
    quotaMetrics.recordAndMaybeThrottle(sensorId, 0, DefaultCallBack)
  }
}
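A self-contained version of that loop; quotaMetrics.recordAndMaybeThrottle is broker-internal, so the stub below is an assumption to make the sketch runnable, and you would swap in the real quota manager when benchmarking an actual broker build:

import java.util.concurrent.TimeUnit

object QuotaMetricsBench {
  // Stand-in for the broker's recordAndMaybeThrottle; replace with the real call.
  def recordAndMaybeThrottle(clientId: String, value: Double)(callback: Int => Unit): Unit =
    callback(0)

  def main(args: Array[String]): Unit = {
    val n = if (args.nonEmpty) args(0).toInt else 100000
    val start = System.nanoTime()
    for (clientId <- 0 until n)
      recordAndMaybeThrottle(s"client-$clientId", 0.0)(_ => ())
    val elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start)
    println(f"$n%d distinct client-ids in $elapsedMs%d ms (${elapsedMs.toDouble / n}%.4f ms per record)")
  }
}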

Page 45: Troubleshooting Kafka's socket server: from incident to resolution

Quota metrics - a quick benchmark

Page 46: Troubleshooting Kafka's socket server: from incident to resolution

Quota metrics - a quick benchmark

Fixed in KAFKA-2664

Page 47: Troubleshooting Kafka's socket server: from incident to resolution

meanwhile in our queuing cluster… due to climbing client-id counts

Page 48: Troubleshooting Kafka's socket server: from incident to resolution

Rolling bounce of cluster forced the issue to recur on brokers that had high client-id metric counts

  ○ Used jmxterm to check per-client-id metric counts before the experiment (a small JMX sketch follows this list)
  ○ Hooked up profiler to verify during incident
    ■ Generally avoid profiling/heapdumps in production due to interference
  ○ Did not see it in the earlier rolling bounce because there were only a few client-id metrics at the time
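A small sketch of that JMX check. The ObjectName pattern below (kafka.server:type=Fetch,client-id=*) is an assumption for 0.9-era quota metrics; list the beans with jmxterm first and adjust the pattern to what your broker actually exposes:

import javax.management.ObjectName
import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}

object ClientIdMetricCount {
  def main(args: Array[String]): Unit = {
    val hostPort = if (args.nonEmpty) args(0) else "localhost:9999" // broker host:jmxPort
    val url = new JMXServiceURL(s"service:jmx:rmi:///jndi/rmi://$hostPort/jmxrmi")
    val connector = JMXConnectorFactory.connect(url)
    try {
      val mbsc = connector.getMBeanServerConnection
      // Count per-client-id quota MBeans; a fast-growing count points at rogue client-ids.
      val pattern = new ObjectName("kafka.server:type=Fetch,client-id=*")
      println(s"per-client-id Fetch metric MBeans: ${mbsc.queryNames(pattern, null).size}")
    } finally connector.close()
  }
}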

Pages 49-50: Troubleshooting Kafka's socket server: from incident to resolution

Troubleshooting: macro vs micro

MACRO (generally more effective)
● Observe week-over-week trends
● Formulate theories
● Test theory (micro-experiments)
● Deploy fix and validate

MICRO (live debugging; sometimes warranted, but invasive and tedious)
● Instrumentation
● Attach profilers
● Take heapdumps
● Trace-level logs, tcpdump, etc.

Page 51: Troubleshooting Kafka's socket server: from incident to resolution

How to fix high local times
● Optimize the request’s handling, e.g.:
  ○ cached topic metadata as opposed to ZooKeeper reads (see KAFKA-901)
  ○ and KAFKA-1356
● Make it asynchronous
  ○ e.g., we will do this for StopReplica in KAFKA-1911
● Put it in a purgatory (usually if the response depends on some condition); but be aware of the caveats:
  ○ Higher memory pressure if request purgatory size grows
  ○ Expired requests are handled in the purgatory expiration thread (which is good)
  ○ but satisfied requests are handled in the API thread of the satisfying request ⇒ if a request satisfies several delayed requests then local time can increase for the satisfying request

Page 52: Troubleshooting Kafka's socket server: from incident to resolution

as for rogue clients…

2015/10/10 03:20:08.393 [] [] [] [logger] Completed request:Name: FetchRequest; Version: 0; CorrelationId: 0; ClientId: 2c27cc8b_ccb7_42ae_98b6_51ea4b4dccf2; ReplicaId: -1; MaxWait: 0 ms; MinBytes: 0 bytes from connection <clientIP>:<brokerPort>-<localAddr>;totalTime:6589,requestQueueTime:6589,localTime:0,remoteTime:0,responseQueueTime:0,sendTime:0,securityProtocol:PLAINTEXT,principal:ANONYMOUS

“Get apps to use wrapper libraries that implement good client behavior, shield from API changes, and so on…”
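One way to spot such clients is to scan the broker's request log for these “Completed request” lines and aggregate by client id. A rough sketch; the regexes are tailored to the log format shown above and are assumptions, so adjust them to your broker's exact output:

import scala.io.Source

object RogueClientScan {
  private val ClientIdRe  = """ClientId: ([^;]+);""".r.unanchored
  private val TotalTimeRe = """totalTime:(\d+)""".r.unanchored

  def main(args: Array[String]): Unit = {
    val source = Source.fromFile(args(0)) // path to the broker request log
    try {
      // clientId -> (request count, summed totalTime in ms)
      val totals = scala.collection.mutable.Map.empty[String, (Long, Long)]
      for (line <- source.getLines() if line.contains("Completed request")) {
        (line, line) match {
          case (ClientIdRe(id), TotalTimeRe(t)) =>
            val (count, sum) = totals.getOrElse(id, (0L, 0L))
            totals(id) = (count + 1, sum + t.toLong)
          case _ =>
        }
      }
      println(s"distinct client ids: ${totals.size}")
      totals.toSeq.sortBy(-_._2._2).take(20).foreach { case (id, (count, totalMs)) =>
        println(f"$id%-40s requests=$count%-8d total-time-ms=$totalMs%d")
      }
    } finally source.close()
  }
}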

Page 53: Troubleshooting Kafka's socket server: from incident to resolution

not done yet! This lesson needs repeating...

Pages 54-56: Troubleshooting Kafka's socket server: from incident to resolution

After deploying the metrics fix to some clusters…

[Charts across these slides: after the deployment, the clusters show persistent URP (under-replicated partitions)]

Applications also begin to report higher than usual consumer lag

Page 57: Troubleshooting Kafka's socket server: from incident to resolution

Root cause: zero-copy broke for plaintext

[Chart annotations: upgraded cluster → rolled back → with fix (KAFKA-2517)]

Page 58: Troubleshooting Kafka's socket server: from incident to resolution

The lesson...

Page 59: Troubleshooting Kafka's socket server: from incident to resolution

Monitor these closely!

● Request queue size
● Response queue sizes
● Request latencies:
  ○ Total time
  ○ Local time
  ○ Response send time
  ○ Remote time
● Request handler pool idle ratio

(A small JMX polling sketch for these follows.)
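A rough polling sketch over JMX for these request-path metrics. The MBean name patterns below are assumptions (they roughly match 0.9-era brokers exposing yammer metrics through JmxReporter); list the beans on your broker first and adjust, since older versions quote the name components differently:

import javax.management.ObjectName
import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}

object RequestPathMetricsDump {
  // Candidate MBean name patterns; verify against your broker version.
  private val patterns = Seq(
    "kafka.network:type=RequestChannel,name=RequestQueueSize",
    "kafka.network:type=RequestChannel,name=ResponseQueueSize,*",
    "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=*",
    "kafka.network:type=RequestMetrics,name=LocalTimeMs,request=*",
    "kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=*",
    "kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=*",
    "kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent"
  )

  def main(args: Array[String]): Unit = {
    val url = new JMXServiceURL(s"service:jmx:rmi:///jndi/rmi://${args(0)}/jmxrmi") // args(0) = host:jmxPort
    val connector = JMXConnectorFactory.connect(url)
    try {
      val mbsc = connector.getMBeanServerConnection
      for (pattern <- patterns) {
        val names = mbsc.queryNames(new ObjectName(pattern), null).iterator()
        while (names.hasNext) {
          val name = names.next()
          // Print whatever attributes each bean exposes rather than guessing attribute names.
          for (attr <- mbsc.getMBeanInfo(name).getAttributes.map(_.getName))
            try println(s"$name $attr = ${mbsc.getAttribute(name, attr)}")
            catch { case _: Exception => () }
        }
      }
    } finally connector.close()
  }
}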

Page 60: Troubleshooting Kafka's socket server: from incident to resolution

Continuous validation on trunk

Page 61: Troubleshooting Kafka's socket server: from incident to resolution

Any other high latency requests?

Image courtesy of ©Nevit Dilmen

Page 62: Troubleshooting Kafka's socket server: from incident to resolution

Local times

[Chart of local times by request type: ConsumerMetadata, ControlledShutdown, Fetch, LeaderAndIsr, TopicMetadata, OffsetCommit, OffsetFetch, Offsets (by time), Produce, StopReplica (for delete=true), UpdateMetadata]

ControlledShutdown, LeaderAndIsr, StopReplica, and UpdateMetadata are (typically 1:N) broker-to-broker requests

Pages 63-66: Troubleshooting Kafka's socket server: from incident to resolution

Broker-to-broker request latencies - less critical

[Diagram, built up across these slides: the same network layer (Acceptor, Processor, request queue, response queue) and API layer (API handler, purgatory) as before, now showing a single broker-to-broker connection.]

● The read bit is off while a request is in flight, so a slow broker-to-broker request ties up at most one API handler
● But if the requesting broker times out and retries, each retry can tie up another API handler
● Configure socket timeouts >> MAX(latency)
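(On the broker side, the settings to check here include controller.socket.timeout.ms for controller-to-broker requests and replica.socket.timeout.ms for replica fetches; keep them comfortably above the worst-case latencies observed for these request types.)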

Page 67: Troubleshooting Kafka's socket server: from incident to resolution

Broker-to-broker request latency improvements

● Broker to controller
  ○ ControlledShutdown (KAFKA-1342)
● Controller to broker
  ○ StopReplica [delete = true] should be asynchronous (KAFKA-1911)
  ○ LeaderAndIsr: batch request - maybe worth optimizing or putting in a purgatory? Haven’t looked closely yet…

Page 68: Troubleshooting Kafka's socket server: from incident to resolution

The end