Troubleshooting Kafka's socket server: from incident to resolution
TRANSCRIPT
Troubleshooting Kafka's socket server: from incident to resolution
Joel Koshy, LinkedIn
The incident: occurred a few days after upgrading to pick up quotas and SSL
(Timeline: multi-port support (KAFKA-1809, KAFKA-1928), SSL (KAFKA-1690), and various quota patches landed between the x25 and x38 releases; dates on the timeline: April 5, June 3, August 18, October 13.)
The incident: a broker (which happened to be the controller) failed in our queuing Kafka cluster
The incident:
● Alerts fire; NOC engages SYSOPS/Kafka-SRE
● Kafka-SRE restarts the broker
● Broker failure does not generally cause prolonged application impact
  ○ but in this incident…
The incident: multiple applications begin to report “issues”: socket timeouts to the Kafka cluster
Posts search was one such impacted application
The incident: two brokers report high request and response queue sizes
The incident: two brokers report high request queue size and request latencies
The incident: other observations
● High CPU load on those brokers
● Throughput degrades to ~ half the normal throughput
● Tons of broken pipe exceptions in server logs
● Application owners report socket timeouts in their logs
Remediation: shifted site traffic to another data center
Kafka outage ⇒ member impact
Multi-colo is critical!
Remediation
● Controller moves did not help
● Firewall the affected brokers
● The above helped, but the cluster fell over again after dropping the rules
● Suspect misbehaving clients on broker failure
  ○ … but x25 never exhibited this issue

# Allow inter-broker traffic on the broker port…
sudo iptables -A INPUT -p tcp --dport <broker-port> -s <other-broker> -j ACCEPT
# … and drop everything else (client traffic); rule order matters, so the ACCEPT must precede the DROP
sudo iptables -A INPUT -p tcp --dport <broker-port> -j DROP
Good dashboards/alerts
Skilled operators
Clear communication
Audit/log all operations
CRM (crew resource management) principles apply to operations
Remediation: Friday night ⇒ roll back to x25 and debug later
… but SREs had to babysit the rollback
(Rolling downgrade, broker by broker: move leaders off the broker, firewall it, downgrade it from x38 to x25, then repeat for the next broker.)
… oh and BTW, be careful when saving a lot of public-access/server logs:
● Can cause GC pauses
  [Times: user=0.39 sys=0.01, real=8.01 secs]
  (user time ≪ real time ⇒ the collector was stalled, e.g., on disk I/O from the log copy)
● Use ionice or rsync --bwlimit
The investigation
Attempts at reproducing the issue
● Test cluster
  ○ Tried killing the controller
  ○ Multiple rolling bounces
  ○ Could not reproduce
● Upgraded the queuing cluster to x38 again
  ○ Could not reproduce
● So nothing…
Understanding queue backups…
Life-cycle of a Kafka request
(Diagram: in the network layer, an Acceptor and several Processors handle the client connections, each Processor with its own response queue, all feeding one shared request queue; in the API layer, a pool of API handlers serves the request queue, with the purgatory and the quota manager alongside.)
● The Acceptor hands new client connections to the Processors
● A Processor reads a request off a connection, and then turns off read interest from that connection (for ordering)
● The request awaits handling in the request queue → total time = queue-time
● An API handler handles the request → + local-time
● The request may wait in the purgatory (long-poll requests) → + remote-time
● It is held if its quota is violated → + quota-time
● The response awaits its Processor in the response queue → + response-queue-time
● The Processor writes the response → + response-send-time
● … and turns read interest back on from that connection
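To make the hand-off concrete, here is a minimal, hypothetical Java sketch of the threading model described above: several processors feed one shared request queue, each processor owns its response queue, and a response is always written back by the processor that read the request. All names, queue sizes, and thread counts are invented for illustration; this is not Kafka's socket server code.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class RequestLifecycleSketch {

    record Request(int processorId, String payload) {}   // tagged with the processor that read it
    record Response(int processorId, String payload) {}

    public static void main(String[] args) throws InterruptedException {
        int numProcessors = 4, numHandlers = 8;
        BlockingQueue<Request> requestQueue = new ArrayBlockingQueue<>(500);   // shared by all processors
        List<BlockingQueue<Response>> responseQueues = new ArrayList<>();
        for (int p = 0; p < numProcessors; p++)
            responseQueues.add(new ArrayBlockingQueue<>(500));                 // one per processor

        // API handler threads: take from the shared request queue (queue-time ends here),
        // do the work (local-time), then hand the response back to the owning processor.
        for (int h = 0; h < numHandlers; h++) {
            Thread handler = new Thread(() -> {
                try {
                    while (true) {
                        Request req = requestQueue.take();
                        Response resp = new Response(req.processorId(), "handled " + req.payload());
                        responseQueues.get(req.processorId()).put(resp);       // response-queue-time starts
                    }
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
            handler.setDaemon(true);
            handler.start();
        }

        // Processor threads: the real processors also read requests off sockets and toggle read
        // interest; here they only drain their own response queue (response-send-time).
        for (int p = 0; p < numProcessors; p++) {
            final int id = p;
            Thread processor = new Thread(() -> {
                try {
                    while (true) {
                        Response resp = responseQueues.get(id).take();
                        System.out.println("processor-" + id + " sends: " + resp.payload());
                        // ...and would re-enable read interest on that connection here
                    }
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
            processor.setDaemon(true);
            processor.start();
        }

        // Simulate a few incoming requests, each tagged with the processor that "read" it.
        for (int i = 0; i < 8; i++)
            requestQueue.put(new Request(i % numProcessors, "req-" + i));
        Thread.sleep(500);                                                     // let the demo drain
    }
}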
Investigating high request times
● Total time is useful for monitoring
● but high total time is not necessarily bad
  (most requests show low total times; long-poll requests sitting in purgatory show high but “normal” total times)
Investigating high request times
● First look for high local time
  ○ then high response-send time
    ■ then high remote (purgatory) time → generally a non-issue (but caveats described later)
● High request-queue/response-queue times are effects, not causes (see the triage sketch below)
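A tiny, hypothetical helper makes that triage order explicit; the thresholds and names are invented, and the sample values mirror the fetch request log line shown later (6589 ms of pure request-queue time).

public class RequestTimeTriage {

    record Times(double queueMs, double localMs, double remoteMs, double respQueueMs, double sendMs) {}

    // Mirrors the order of investigation above: queue times are symptoms, so check the others first.
    static String triage(Times t) {
        if (t.localMs() > 100)  return "high local time -> inspect how the API handler serves this request";
        if (t.sendMs() > 100)   return "high response-send time -> inspect response size / slow clients";
        if (t.remoteMs() > 100) return "high remote (purgatory) time -> usually normal for long-poll requests";
        return "only queue times are high -> an effect of some other request type, not the cause";
    }

    public static void main(String[] args) {
        System.out.println(triage(new Times(6589, 0, 0, 0, 0)));
    }
}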
High local times during incident (e.g., fetch)
How are fetch requests handled? (sketched in code below)
● Get physical offsets to be read from the local log during the response
● If fetch from follower (i.e., replica fetch):
  ○ If the follower was out of the ISR and just caught up, then expand the ISR (ZooKeeper write)
  ○ Maybe satisfy eligible delayed produce requests (with acks -1)
● Else (i.e., consumer fetch):
  ○ Record/update the byte rate of this client
  ○ Throttle the request on quota violation
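A rough, hypothetical sketch of that branching (the stub methods stand in for the real broker internals; this is not Kafka's actual code):

public class FetchHandlingSketch {

    static void handleFetch(boolean fromFollower, String clientId, long bytesRead) {
        readFromLocalLog();                             // common: find physical offsets in the local log
        if (fromFollower) {
            maybeExpandIsr();                           // possible ZooKeeper write if the follower just caught up
            maybeCompleteDelayedProduces();             // acks=-1 produce requests waiting on this follower
        } else {
            recordClientByteRate(clientId, bytesRead);  // per-client-id quota metrics
            maybeThrottle(clientId);                    // throttle on quota violation
        }
    }

    public static void main(String[] args) {
        handleFetch(true, null, 4096);                  // replica fetch
        handleFetch(false, "some-consumer", 4096);      // consumer fetch
    }

    // Stubs standing in for the real implementations:
    static void readFromLocalLog() {}
    static void maybeExpandIsr() {}
    static void maybeCompleteDelayedProduces() {}
    static void recordClientByteRate(String clientId, long bytes) {}
    static void maybeThrottle(String clientId) {}
}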
Could these cause high local times?
● Get physical offsets to be read from the local log during the response → should be fast
● If fetch from follower (i.e., replica fetch):
  ○ If the follower was out of the ISR and just caught up, then expand the ISR (ZooKeeper write) → ISR churn?
  ○ Satisfy eligible delayed produce requests (with acks -1) → not using acks -1
● Else (i.e., consumer fetch):
  ○ Record/update the byte rate of this client → maybe
  ○ Throttle the request on quota violation → delayed outside the API thread
ISR churn? … unlikely
● Low ZooKeeper write latencies
● Churn in this incident: an effect of some other root cause
● Long request queues can cause churn
  ○ ⇒ follower fetches time out
    ■ ⇒ fall out of the ISR (ISR shrink happens asynchronously in a separate thread)
  ○ The outstanding fetch catches up and the ISR expands
High local times during incident (e.g., fetch)
Besides, fetch-consumer (not just follower) has high local time
Could these cause high local times? (same list, revisited)
● Satisfying delayed produce requests → not using acks -1
● Throttling on quota violation → delayed outside the API thread
● Reading offsets from the local log and expanding the ISR → should be fast
● Recording/updating the byte rate of this client → test this…
Quota metrics: maintains byte-rate metrics on a per-client-id basis
2015/10/10 03:20:08.393 [] [] [] [logger] Completed request:Name: FetchRequest; Version: 0; CorrelationId: 0; ClientId: 2c27cc8b_ccb7_42ae_98b6_51ea4b4dccf2; ReplicaId: -1; MaxWait: 0 ms; MinBytes: 0 bytes from connection <clientIP>:<brokerPort>-<localAddr>;totalTime:6589,requestQueueTime:6589,localTime:0,remoteTime:0,responseQueueTime:0,sendTime:0,securityProtocol:PLAINTEXT,principal:ANONYMOUS
??! (6589 ms of pure request-queue time)
Quota metrics - a quick benchmark
for (clientId ← 0 until N) {
  timer.time {
    quotaMetrics.recordAndMaybeThrottle(sensorId, 0, DefaultCallBack)
  }
}
Fixed in KAFKA-2664
Meanwhile, in our queuing cluster… (due to climbing client-id counts)
● Rolling bounce of the cluster forced the issue to recur on brokers that had high client-id metric counts
  ○ Used jmxterm to check per-client-id metric counts before the experiment (see the sketch below)
  ○ Hooked up a profiler to verify during the incident
    ■ Generally avoid profiling/heapdumps in production due to interference
  ○ Did not see it in the earlier rolling bounce because there were only a few client-id metrics at the time
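For reference, the same per-client-id metric count can be pulled programmatically instead of via jmxterm; a small Java sketch follows. The JMX endpoint and the ObjectName pattern are assumptions — adjust them to the broker's JMX settings and to the metric names your Kafka version actually registers.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ClientIdMetricCount {
    public static void main(String[] args) throws Exception {
        // Hypothetical broker JMX endpoint
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Assumed pattern: any MBean in the kafka.server domain that carries a client-id key
            int count = conn.queryNames(new ObjectName("kafka.server:client-id=*,*"), null).size();
            System.out.println("per-client-id MBeans: " + count);
        }
    }
}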
Troubleshooting: macro vs micro

MACRO (generally more effective)
● Observe week-over-week trends
● Formulate theories
● Test theory (micro-experiments)
● Deploy fix and validate

MICRO (live debugging; sometimes warranted, but invasive and tedious)
● Instrumentation
● Attach profilers
● Take heapdumps
● Trace-level logs, tcpdump, etc.
How to fix high local times
● Optimize the request's handling, e.g.:
  ○ cached topic metadata as opposed to ZooKeeper reads (see KAFKA-901 and KAFKA-1356)
● Make it asynchronous (a minimal sketch of this pattern follows below)
  ○ e.g., we will do this for StopReplica in KAFKA-1911
● Put it in a purgatory (usually if the response depends on some condition); but be aware of the caveats:
  ○ Higher memory pressure if the request purgatory size grows
  ○ Expired requests are handled in the purgatory expiration thread (which is good)
  ○ but satisfied requests are handled in the API thread of the satisfying request ⇒ if a request satisfies several delayed requests, local time can increase for the satisfying request
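A minimal sketch of the “make it asynchronous” option, assuming a dedicated background executor; the names are hypothetical and this is not the KAFKA-1911 implementation.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AsyncHandlingSketch {

    private static final ExecutorService background = Executors.newSingleThreadExecutor();

    // The API handler thread only enqueues the slow work and returns immediately,
    // so a slow request cannot pin down the handler pool.
    static void handleSlowRequest(String request) {
        background.submit(() -> System.out.println("processing " + request + " in the background"));
    }

    public static void main(String[] args) throws InterruptedException {
        handleSlowRequest("StopReplica(delete=true)");
        background.shutdown();
        background.awaitTermination(5, TimeUnit.SECONDS);
    }
}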
As for rogue clients… recall the earlier fetch request with MaxWait 0 ms, MinBytes 0, and ClientId 2c27cc8b_ccb7_42ae_98b6_51ea4b4dccf2.
Get apps to use wrapper libraries that implement good client behavior, shield them from API changes, and so on…
Not done yet! This lesson needs repeating…
After deploying the metrics fix to some clusters…
● Persistent URP (under-replicated partitions) after the deployment
● Applications also begin to report higher than usual consumer lag
● Root cause: zero-copy broke for plaintext
(Charts compare the upgraded cluster, after rolling back, and with the fix (KAFKA-2517).)
The lesson...
● Request queue size
● Response queue sizes
● Request latencies:
  ○ Total time
  ○ Local time
  ○ Response send time
  ○ Remote time
● Request handler pool idle ratio
Monitor these closely! (a JMX polling sketch follows below)
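As a starting point, here is a sketch that polls a few of these over JMX; the MBean and attribute names below are the commonly documented ones but can differ across Kafka versions, so treat them as assumptions to verify against your broker.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RequestMetricsPoller {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            Object queueSize = conn.getAttribute(
                new ObjectName("kafka.network:type=RequestChannel,name=RequestQueueSize"), "Value");
            Object fetchTotalP99 = conn.getAttribute(
                new ObjectName("kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Fetch"), "99thPercentile");
            Object handlerIdle = conn.getAttribute(
                new ObjectName("kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent"), "OneMinuteRate");
            System.out.println("request queue size          = " + queueSize);
            System.out.println("Fetch total time p99 (ms)   = " + fetchTotalP99);
            System.out.println("request handler idle ratio  = " + handlerIdle);
        }
    }
}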
Continuous validation on trunk
Any other high latency requests?
Image courtesy of ©Nevit Dilmen
Local times by request type: ConsumerMetadata, ControlledShutdown, Fetch, LeaderAndIsr, TopicMetadata, OffsetCommit, OffsetFetch, Offsets (by time), Produce, StopReplica (for delete=true), UpdateMetadata
Of these, ControlledShutdown, LeaderAndIsr, StopReplica, and UpdateMetadata are (typically 1:N) broker-to-broker requests.
Broker-to-broker request latencies: less critical
● The read bit is turned off while a request is in flight, so a slow broker-to-broker request ties up at most one API handler
● But if the requesting broker times out and retries, each retry ties up another API handler
● So configure socket timeouts >> MAX(latency)
Broker-to-broker request latency improvements
● Broker to controller
  ○ ControlledShutdown (KAFKA-1342)
● Controller to broker
  ○ StopReplica [delete = true] should be asynchronous (KAFKA-1911)
  ○ LeaderAndIsr: batch request - maybe worth optimizing or putting in a purgatory? Haven't looked closely yet…
The end