Troubleshooting Kafka's socket server: from incident to resolution
TRANSCRIPT
Troubleshooting Kafka's socket server: from incident to resolution
Joel Koshy, LinkedIn
The incident: occurred a few days after upgrading to pick up quotas and SSL
(Timeline: multi-port support (KAFKA-1809, KAFKA-1928), SSL (KAFKA-1690), and various quota patches landed between the x25 and x38 releases; dates on the timeline: April 5, June 3, August 18, October 13.)
The incident: a broker (which happened to be the controller) failed in our queuing Kafka cluster
The incident:
● Alerts fire; NOC engages SYSOPS/Kafka-SRE
● Kafka-SRE restarts the broker
● Broker failure does not generally cause prolonged application impact
  ○ but in this incident…
The incident: multiple applications begin to report “issues”: socket timeouts to the Kafka cluster
Posts search was one such impacted application
The incident: two brokers report high request and response queue sizes
The incident: two brokers report high request queue size and request latencies
The incident: other observations
● High CPU load on those brokers
● Throughput degrades to ~ half the normal throughput
● Tons of broken pipe exceptions in server logs
● Application owners report socket timeouts in their logs
Remediation: shifted site traffic to another data center
Kafka outage ⇒ member impact
Multi-colo is critical!
Remediation
● Controller moves did not help
● Firewall the affected brokers
● The above helped, but the cluster fell over again after dropping the rules
● Suspect misbehaving clients on broker failure
  ○ … but x25 never exhibited this issue

# Allow inter-broker traffic on the broker port…
sudo iptables -A INPUT -p tcp --dport <broker-port> -s <other-broker> -j ACCEPT
# … and drop everything else (client traffic); rule order matters, so the ACCEPT must precede the DROP
sudo iptables -A INPUT -p tcp --dport <broker-port> -j DROP
Good dashboards/alerts
Skilled operators
Clear communication
Audit/log all operations
CRM (crew resource management) principles apply to operations
Remediation: Friday night ⇒ roll back to x25 and debug later
… but SREs had to babysit the rollback
(Rolling downgrade, broker by broker: move leaders off the broker, firewall it, downgrade it from x38 to x25, then repeat for the next broker.)
… oh and BTW, be careful when saving a lot of public-access/server logs:
● Can cause GC pauses
  [Times: user=0.39 sys=0.01, real=8.01 secs]
  (user time ≪ real time ⇒ the collector was stalled, e.g., on disk I/O from the log copy)
● Use ionice or rsync --bwlimit
The investigation
Attempts at reproducing the issue
● Test cluster
  ○ Tried killing the controller
  ○ Multiple rolling bounces
  ○ Could not reproduce
● Upgraded the queuing cluster to x38 again
  ○ Could not reproduce
● So nothing…
Understanding queue backups…
Life-cycle of a Kafka request
(Diagram: in the network layer, an Acceptor and several Processors handle the client connections, each Processor with its own response queue, all feeding one shared request queue; in the API layer, a pool of API handlers serves the request queue, with the purgatory and the quota manager alongside.)
● The Acceptor hands new client connections to the Processors
● A Processor reads a request off a connection, and then turns off read interest from that connection (for ordering)
● The request awaits handling in the request queue → total time = queue-time
● An API handler handles the request → + local-time
● The request may wait in the purgatory (long-poll requests) → + remote-time
● It is held if its quota is violated → + quota-time
● The response awaits its Processor in the response queue → + response-queue-time
● The Processor writes the response → + response-send-time
● … and turns read interest back on from that connection
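To make the hand-off concrete, here is a minimal, hypothetical Java sketch of the threading model described above: several processors feed one shared request queue, each processor owns its response queue, and a response is always written back by the processor that read the request. All names, queue sizes, and thread counts are invented for illustration; this is not Kafka's socket server code.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class RequestLifecycleSketch {

    record Request(int processorId, String payload) {}   // tagged with the processor that read it
    record Response(int processorId, String payload) {}

    public static void main(String[] args) throws InterruptedException {
        int numProcessors = 4, numHandlers = 8;
        BlockingQueue<Request> requestQueue = new ArrayBlockingQueue<>(500);   // shared by all processors
        List<BlockingQueue<Response>> responseQueues = new ArrayList<>();
        for (int p = 0; p < numProcessors; p++)
            responseQueues.add(new ArrayBlockingQueue<>(500));                 // one per processor

        // API handler threads: take from the shared request queue (queue-time ends here),
        // do the work (local-time), then hand the response back to the owning processor.
        for (int h = 0; h < numHandlers; h++) {
            Thread handler = new Thread(() -> {
                try {
                    while (true) {
                        Request req = requestQueue.take();
                        Response resp = new Response(req.processorId(), "handled " + req.payload());
                        responseQueues.get(req.processorId()).put(resp);       // response-queue-time starts
                    }
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
            handler.setDaemon(true);
            handler.start();
        }

        // Processor threads: the real processors also read requests off sockets and toggle read
        // interest; here they only drain their own response queue (response-send-time).
        for (int p = 0; p < numProcessors; p++) {
            final int id = p;
            Thread processor = new Thread(() -> {
                try {
                    while (true) {
                        Response resp = responseQueues.get(id).take();
                        System.out.println("processor-" + id + " sends: " + resp.payload());
                        // ...and would re-enable read interest on that connection here
                    }
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
            processor.setDaemon(true);
            processor.start();
        }

        // Simulate a few incoming requests, each tagged with the processor that "read" it.
        for (int i = 0; i < 8; i++)
            requestQueue.put(new Request(i % numProcessors, "req-" + i));
        Thread.sleep(500);                                                     // let the demo drain
    }
}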
Investigating high request times
● Total time is useful for monitoring
● but high total time is not necessarily bad
  (most requests show low total times; long-poll requests sitting in purgatory show high but “normal” total times)
Investigating high request times
● First look for high local time
  ○ then high response-send time
    ■ then high remote (purgatory) time → generally a non-issue (but caveats described later)
● High request-queue/response-queue times are effects, not causes (see the triage sketch below)
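A tiny, hypothetical helper makes that triage order explicit; the thresholds and names are invented, and the sample values mirror the fetch request log line shown later (6589 ms of pure request-queue time).

public class RequestTimeTriage {

    record Times(double queueMs, double localMs, double remoteMs, double respQueueMs, double sendMs) {}

    // Mirrors the order of investigation above: queue times are symptoms, so check the others first.
    static String triage(Times t) {
        if (t.localMs() > 100)  return "high local time -> inspect how the API handler serves this request";
        if (t.sendMs() > 100)   return "high response-send time -> inspect response size / slow clients";
        if (t.remoteMs() > 100) return "high remote (purgatory) time -> usually normal for long-poll requests";
        return "only queue times are high -> an effect of some other request type, not the cause";
    }

    public static void main(String[] args) {
        System.out.println(triage(new Times(6589, 0, 0, 0, 0)));
    }
}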
High local times during incident (e.g., fetch)
How are fetch requests handled? (sketched in code below)
● Get physical offsets to be read from the local log during the response
● If fetch from follower (i.e., replica fetch):
  ○ If the follower was out of the ISR and just caught up, then expand the ISR (ZooKeeper write)
  ○ Maybe satisfy eligible delayed produce requests (with acks -1)
● Else (i.e., consumer fetch):
  ○ Record/update the byte rate of this client
  ○ Throttle the request on quota violation
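A rough, hypothetical sketch of that branching (the stub methods stand in for the real broker internals; this is not Kafka's actual code):

public class FetchHandlingSketch {

    static void handleFetch(boolean fromFollower, String clientId, long bytesRead) {
        readFromLocalLog();                             // common: find physical offsets in the local log
        if (fromFollower) {
            maybeExpandIsr();                           // possible ZooKeeper write if the follower just caught up
            maybeCompleteDelayedProduces();             // acks=-1 produce requests waiting on this follower
        } else {
            recordClientByteRate(clientId, bytesRead);  // per-client-id quota metrics
            maybeThrottle(clientId);                    // throttle on quota violation
        }
    }

    public static void main(String[] args) {
        handleFetch(true, null, 4096);                  // replica fetch
        handleFetch(false, "some-consumer", 4096);      // consumer fetch
    }

    // Stubs standing in for the real implementations:
    static void readFromLocalLog() {}
    static void maybeExpandIsr() {}
    static void maybeCompleteDelayedProduces() {}
    static void recordClientByteRate(String clientId, long bytes) {}
    static void maybeThrottle(String clientId) {}
}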
Could these cause high local times?
● Get physical offsets to be read from the local log during the response → should be fast
● If fetch from follower (i.e., replica fetch):
  ○ If the follower was out of the ISR and just caught up, then expand the ISR (ZooKeeper write) → ISR churn?
  ○ Satisfy eligible delayed produce requests (with acks -1) → not using acks -1
● Else (i.e., consumer fetch):
  ○ Record/update the byte rate of this client → maybe
  ○ Throttle the request on quota violation → delayed outside the API thread
ISR churn? … unlikely
● Low ZooKeeper write latencies
● Churn in this incident: an effect of some other root cause
● Long request queues can cause churn
  ○ ⇒ follower fetches time out
    ■ ⇒ fall out of the ISR (ISR shrink happens asynchronously in a separate thread)
  ○ The outstanding fetch catches up and the ISR expands
High local times during incident (e.g., fetch)
Besides, fetch-consumer (not just follower) has high local time
Could these cause high local times? (same list, revisited)
● Satisfying delayed produce requests → not using acks -1
● Throttling on quota violation → delayed outside the API thread
● Reading offsets from the local log and expanding the ISR → should be fast
● Recording/updating the byte rate of this client → test this…
Quota metrics: maintains byte-rate metrics on a per-client-id basis
2015/10/10 03:20:08.393 [] [] [] [logger] Completed request:Name: FetchRequest; Version: 0; CorrelationId: 0; ClientId: 2c27cc8b_ccb7_42ae_98b6_51ea4b4dccf2; ReplicaId: -1; MaxWait: 0 ms; MinBytes: 0 bytes from connection <clientIP>:<brokerPort>-<localAddr>;totalTime:6589,requestQueueTime:6589,localTime:0,remoteTime:0,responseQueueTime:0,sendTime:0,securityProtocol:PLAINTEXT,principal:ANONYMOUS
??! (6589 ms of pure request-queue time)
Quota metrics - a quick benchmark
for (clientId ← 0 until N) {
  timer.time {
    quotaMetrics.recordAndMaybeThrottle(sensorId, 0, DefaultCallBack)
  }
}
Fixed in KAFKA-2664
Meanwhile, in our queuing cluster… (due to climbing client-id counts)
● Rolling bounce of the cluster forced the issue to recur on brokers that had high client-id metric counts
  ○ Used jmxterm to check per-client-id metric counts before the experiment (see the sketch below)
  ○ Hooked up a profiler to verify during the incident
    ■ Generally avoid profiling/heapdumps in production due to interference
  ○ Did not see it in the earlier rolling bounce because there were only a few client-id metrics at the time
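For reference, the same per-client-id metric count can be pulled programmatically instead of via jmxterm; a small Java sketch follows. The JMX endpoint and the ObjectName pattern are assumptions — adjust them to the broker's JMX settings and to the metric names your Kafka version actually registers.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ClientIdMetricCount {
    public static void main(String[] args) throws Exception {
        // Hypothetical broker JMX endpoint
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Assumed pattern: any MBean in the kafka.server domain that carries a client-id key
            int count = conn.queryNames(new ObjectName("kafka.server:client-id=*,*"), null).size();
            System.out.println("per-client-id MBeans: " + count);
        }
    }
}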
Troubleshooting: macro vs micro

MACRO (generally more effective)
● Observe week-over-week trends
● Formulate theories
● Test theory (micro-experiments)
● Deploy fix and validate

MICRO (live debugging; sometimes warranted, but invasive and tedious)
● Instrumentation
● Attach profilers
● Take heapdumps
● Trace-level logs, tcpdump, etc.
How to fix high local times
● Optimize the request's handling, e.g.:
  ○ cached topic metadata as opposed to ZooKeeper reads (see KAFKA-901 and KAFKA-1356)
● Make it asynchronous (a minimal sketch of this pattern follows below)
  ○ e.g., we will do this for StopReplica in KAFKA-1911
● Put it in a purgatory (usually if the response depends on some condition); but be aware of the caveats:
  ○ Higher memory pressure if the request purgatory size grows
  ○ Expired requests are handled in the purgatory expiration thread (which is good)
  ○ but satisfied requests are handled in the API thread of the satisfying request ⇒ if a request satisfies several delayed requests, local time can increase for the satisfying request
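A minimal sketch of the “make it asynchronous” option, assuming a dedicated background executor; the names are hypothetical and this is not the KAFKA-1911 implementation.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AsyncHandlingSketch {

    private static final ExecutorService background = Executors.newSingleThreadExecutor();

    // The API handler thread only enqueues the slow work and returns immediately,
    // so a slow request cannot pin down the handler pool.
    static void handleSlowRequest(String request) {
        background.submit(() -> System.out.println("processing " + request + " in the background"));
    }

    public static void main(String[] args) throws InterruptedException {
        handleSlowRequest("StopReplica(delete=true)");
        background.shutdown();
        background.awaitTermination(5, TimeUnit.SECONDS);
    }
}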
As for rogue clients… recall the earlier fetch request with MaxWait 0 ms, MinBytes 0, and ClientId 2c27cc8b_ccb7_42ae_98b6_51ea4b4dccf2.
Get apps to use wrapper libraries that implement good client behavior, shield them from API changes, and so on…
Not done yet! This lesson needs repeating…
After deploying the metrics fix to some clusters…
● Persistent URP (under-replicated partitions) after the deployment
● Applications also begin to report higher than usual consumer lag
● Root cause: zero-copy broke for plaintext
(Charts compare the upgraded cluster, after rolling back, and with the fix (KAFKA-2517).)
The lesson...
● Request queue size
● Response queue sizes
● Request latencies:
  ○ Total time
  ○ Local time
  ○ Response send time
  ○ Remote time
● Request handler pool idle ratio
Monitor these closely! (a JMX polling sketch follows below)
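As a starting point, here is a sketch that polls a few of these over JMX; the MBean and attribute names below are the commonly documented ones but can differ across Kafka versions, so treat them as assumptions to verify against your broker.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RequestMetricsPoller {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            Object queueSize = conn.getAttribute(
                new ObjectName("kafka.network:type=RequestChannel,name=RequestQueueSize"), "Value");
            Object fetchTotalP99 = conn.getAttribute(
                new ObjectName("kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Fetch"), "99thPercentile");
            Object handlerIdle = conn.getAttribute(
                new ObjectName("kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent"), "OneMinuteRate");
            System.out.println("request queue size          = " + queueSize);
            System.out.println("Fetch total time p99 (ms)   = " + fetchTotalP99);
            System.out.println("request handler idle ratio  = " + handlerIdle);
        }
    }
}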
Continuous validation on trunk
Any other high latency requests?
Image courtesy of ©Nevit Dilmen
Local times by request type: ConsumerMetadata, ControlledShutdown, Fetch, LeaderAndIsr, TopicMetadata, OffsetCommit, OffsetFetch, Offsets (by time), Produce, StopReplica (for delete=true), UpdateMetadata
Of these, ControlledShutdown, LeaderAndIsr, StopReplica, and UpdateMetadata are (typically 1:N) broker-to-broker requests.
Broker-to-broker request latencies: less critical
● The read bit is turned off while a request is in flight, so a slow broker-to-broker request ties up at most one API handler
● But if the requesting broker times out and retries, each retry ties up another API handler
● So configure socket timeouts >> MAX(latency)
Broker-to-broker request latency improvements
● Broker to controller
  ○ ControlledShutdown (KAFKA-1342)
● Controller to broker
  ○ StopReplica [delete = true] should be asynchronous (KAFKA-1911)
  ○ LeaderAndIsr: batch request - maybe worth optimizing or putting in a purgatory? Haven't looked closely yet…
The end