TROUBLESHOOTING COUCHBASE
David Haikney, Head of Technical Support, Couchbase EMEA
©2015 Couchbase Inc. 2
What Can Possibly Go Wrong?
Help! My Node is Down
Why Did My Operation Fail?
Some Handy Tools
The following presentation is based on actual events…
Help! My Node is Down
Why Does A Node Go Down?...
[Diagram: a four-node cluster (Node 1 to Node 4)]
…Because Heartbeats Have Gone Missing
[Diagram: the four-node cluster (Node 1 to Node 4)]
1. Server Itself is Down
Server (or VM) going offline takes Couchbase with it
Occurs during planned or “unscheduled” maintenance
Troubleshooting Tips!
What matters is the server’s status at the time of the event
For VMs, check status in the management console
Check the server’s uptime
Check Couchbase’s own uptime: cbstats | grep uptime
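The uptime check can be scripted. The sketch below is illustrative only: it assumes the stats text arrives as "key: value" lines, and the helper names and sample text are hypothetical, not real cbstats output. An uptime shorter than the time elapsed since the event suggests the process restarted after it.

```python
from typing import Optional


def parse_uptime_seconds(stats_text: str) -> Optional[int]:
    """Return the `uptime` value from 'key: value' style stats text, if present."""
    for line in stats_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "uptime":
            return int(value.strip())
    return None


def restarted_since(uptime_seconds: int, seconds_since_event: int) -> bool:
    """True when the process has been up for less time than has passed since the event."""
    return uptime_seconds < seconds_since_event


if __name__ == "__main__":
    # Hypothetical sample text, shaped like 'key: value' stats output.
    sample = " threads:   4\n uptime:    5400\n"
    uptime = parse_uptime_seconds(sample)
    print(uptime, restarted_since(uptime, 7200))
```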
2. Server is Unreachable
Server unavailable on network
Usually related to wider datacenter event
Our assailant may have fled the scene!
Troubleshooting Tips!
Verify connectivity from other cluster nodes (ping / ssh)
Check system and network logs around time of first report
Standard network monitoring should be deployed
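Alongside ping and ssh, a TCP-level check from another cluster node shows whether the Couchbase ports themselves are reachable. This is a minimal sketch using only the standard library. The function names are hypothetical, and the default ports (8091 for the admin interface, 11210 for data) are an assumption to adjust for your deployment.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_nodes(hosts, ports=(8091, 11210)):
    """Map each (host, port) pair to whether it accepted a TCP connection."""
    return {(host, port): can_connect(host, port) for host in hosts for port in ports}


if __name__ == "__main__":
    # Hypothetical node address; replace with your own cluster nodes.
    for target, ok in check_nodes(["127.0.0.1"]).items():
        print(target, "reachable" if ok else "UNREACHABLE")
```

A refused or timed-out connection only says the port was unreachable from this host; the network and system logs are still needed to explain why.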
3. Couchbase Service is Offline
Server is available but Couchbase service is not running
Relatively rare since babysitter restarts failed processes
Troubleshooting Tips!
Check which Couchbase processes are running
Check dmesg for Linux’s OOM Killer:
May 14 04:26:49 cgs1 kernel: memcached invoked oom-killer
May need to reduce the server quota
Attempt to restart the service
Possibly warming up: monitor for progress with cbstats
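Scanning saved dmesg or syslog output for OOM Killer activity is easy to automate. The sketch below matches lines shaped like the example above. The function name and pattern are illustrative assumptions, and other kernels may word the message differently.

```python
import re

# Matches "<process> invoked oom-killer", as in the example log line.
OOM_PATTERN = re.compile(r"(?P<process>\S+) invoked oom-killer")


def find_oom_events(log_lines):
    """Return (process, line) pairs for every line reporting an oom-killer invocation."""
    events = []
    for line in log_lines:
        match = OOM_PATTERN.search(line)
        if match:
            events.append((match.group("process"), line.strip()))
    return events


if __name__ == "__main__":
    lines = ["May 14 04:26:49 cgs1 kernel: memcached invoked oom-killer"]
    for process, line in find_oom_events(lines):
        print(process, "->", line)
```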
4. Couchbase Too Slow to Respond
Service is running but failed to send timely heartbeats
Common trigger of autofailover
THP, Sizing, Swap, Hypervisor Disturbance
Troubleshooting Tips!
Ensure Transparent Huge Pages (THP) are disabled on Linux systems
Follow Sizing Best Practice
Check for system swap usage
Avoid VM over-commit or migration
Increase the autofailover timeout
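The THP setting can be verified programmatically. On mainline Linux kernels the active mode is the bracketed word in /sys/kernel/mm/transparent_hugepage/enabled, for example "always madvise [never]". The sketch below assumes that path (some distributions use a different one) and the function names are hypothetical.

```python
def active_thp_setting(sysfs_text):
    """Return the bracketed (active) mode, e.g. 'never' from 'always madvise [never]'."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_thp_setting(path="/sys/kernel/mm/transparent_hugepage/enabled"):
    """Read the active THP mode from sysfs; return None if the file is unavailable."""
    try:
        with open(path) as handle:
            return active_thp_setting(handle.read())
    except OSError:
        return None


if __name__ == "__main__":
    mode = read_thp_setting()
    print("THP mode:", mode, "(OK)" if mode == "never" else "(check this node)")
```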
Sizing Matters
8 Buckets
60 Design Docs
10 XDCR Streams
8 CPU Cores
See Perry Krug’s session at 5:15pm today!
Why Did My Operation Fail?
The lifecycle of a simple get() Operation
[Diagram: Client (Your Application and the Couchbase SDK) connected over the Network to Couchbase Server Node 1 and Couchbase Server Node 2]
Possible Pitfalls: Operation couldn’t be dispatched
[Diagram: the Client (Your Application and the Couchbase SDK), marked with a question mark]
Client unsuccessfully initialised
Server connections exhausted
Stop the World Garbage Collection
Consecutive Failovers
Troubleshooting Tips!
Use a singleton pattern
Check node health on cluster
Check vbucket map from server
If necessary, restart client
Tune Garbage Collection settings
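One way to apply the singleton pattern is to create the client connection once and share it, so each request does not open its own connections and exhaust the server. The sketch below is generic and SDK-agnostic: `factory` is a hypothetical callable standing in for whatever your SDK uses to open a connection.

```python
import threading

_lock = threading.Lock()
_connection = None


def get_connection(factory):
    """Return the shared connection, creating it with factory() on first use only."""
    global _connection
    if _connection is None:
        with _lock:
            # Re-check inside the lock so two threads cannot both create a connection.
            if _connection is None:
                _connection = factory()
    return _connection


if __name__ == "__main__":
    # Hypothetical factory; a real one would open an SDK connection.
    first = get_connection(lambda: object())
    second = get_connection(lambda: object())
    print(first is second)
```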
Possible Pitfalls: Operation Did Not Complete
[Diagram: Client (Your Application and the Couchbase SDK) and a Couchbase Server Node]
Did operation arrive at server? Firewalls often intervene!
Did server have difficulty responding?
Did client receive a response?
Troubleshooting Tips!
Look for a pattern as to which clients and servers are affected
Code defensively and retry (at least once)
Check network and server health
When all else fails, tcpdump / wireshark…
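Coding defensively with a retry can look roughly like the sketch below. It is not tied to any SDK: `TemporaryFailure` is a hypothetical stand-in for the transient error your client raises (a timeout or a temporary out-of-memory response, say), and the attempt count and delays are illustrative.

```python
import time


class TemporaryFailure(Exception):
    """Hypothetical stand-in for an SDK's transient error type."""


def with_retry(operation, attempts=3, base_delay=0.05,
               retry_on=(TemporaryFailure, TimeoutError), sleep=time.sleep):
    """Call operation(); on a transient error, back off and retry up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_get():
        state["calls"] += 1
        if state["calls"] < 2:
            raise TemporaryFailure("server busy")
        return "document"

    print(with_retry(flaky_get), "after", state["calls"], "calls")
```

Retrying only the error types known to be transient keeps genuine failures, such as a missing key, visible to the application.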
tcpdump and Wireshark
Wireshark is Couchbase aware
Uses tcpdump packet capture
Can be noisy so filter wisely
Filter on a specific client, a specific server, and port 11210
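A capture filter along these lines can be assembled in a few lines. The sketch below only builds the filter expression and the tcpdump argument list, and does not run the capture. The helper names and IP addresses are hypothetical, and `-i any` assumes a Linux host.

```python
def build_capture_filter(client=None, server=None, port=11210):
    """Build a pcap filter expression limited to the given client, server and port."""
    parts = []
    if client:
        parts.append(f"host {client}")
    if server:
        parts.append(f"host {server}")
    parts.append(f"tcp port {port}")
    return " and ".join(parts)


def tcpdump_command(outfile, client=None, server=None, port=11210):
    """Return the tcpdump argument list that writes matching packets to outfile."""
    return ["tcpdump", "-i", "any", "-w", outfile,
            build_capture_filter(client, server, port)]


if __name__ == "__main__":
    # Hypothetical addresses for one client and one server node.
    print(" ".join(tcpdump_command("cb.pcap", "10.0.0.5", "10.0.0.9")))
```

The resulting capture file can then be opened in Wireshark for decoding.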
Possible Pitfalls: Not The Response I Expected
Customer: My operation failed with a “Key Does Not Exist” error
CB Support: OK, are you sure the key exists?
Customer: Yes, our logs show the key being created and we don’t do deletes.
CB Support: The delete counter on your cluster is increasing…
Troubleshooting Tips!
Simplest explanation is often the correct one
Test your application code in node down, failover and rebalance scenarios
Defensive coding for all Couchbase operations, e.g. for Temporary Out Of Memory
Have a simple client for quick tests to isolate the problem: < 10 lines of Python!
Some Other Handy Tools
Scenario: Troubleshooting a 2.2.0 XDCR case
5 Million docs went swimming one day,
Off to a cluster far away,
XDCR said quack-quack-quack-quack,
But only 4,999,999 docs came back
Finding the Needle in the Haystack
Identify a single vbucket that exhibits the problem
  Immediately narrows the problem space to 1/1024th of the data set
  cbstats vbucket-details
Interrogate the files on disk to find the discrepancy
  couch_dbdump --no-body --json --by-id <file>
  jq tool is very useful for CLI json parsing!
Diff the source cluster and the destination cluster
  Easier on a static data set but still feasible on a live cluster
  With ops in flight, perform the diff twice and take the intersection
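The diff-twice-and-intersect step can be sketched as follows. This assumes each line of the dump output is a JSON object carrying the document key in an "id" field. Treat that shape, and the function names, as assumptions to check against your own dump files.

```python
import json


def ids_from_dump(lines):
    """Collect document ids from JSON-per-line dump output, skipping other lines."""
    ids = set()
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if "id" in record:
            ids.add(record["id"])
    return ids


def missing_on_destination(source_lines, dest_lines):
    """Ids present in the source vbucket file but absent from the destination."""
    return ids_from_dump(source_lines) - ids_from_dump(dest_lines)


def persistently_missing(first_diff, second_diff):
    """Ids missing in both passes; ops merely in flight drop out of the intersection."""
    return first_diff & second_diff


if __name__ == "__main__":
    src = ['{"id": "a"}', '{"id": "b"}', '{"id": "c"}']
    pass1 = missing_on_destination(src, ['{"id": "a"}'])
    pass2 = missing_on_destination(src, ['{"id": "a"}', '{"id": "c"}'])
    print(sorted(persistently_missing(pass1, pass2)))
```

A document that was still replicating during the first pass appears in the first diff only, so the intersection leaves the documents that are genuinely missing.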
Final Thoughts
You now know the common causes of Node and Operation Failures
Troubleshooting requires taking a logical path through the scenarios
Tools exist to help you isolate the problem
We are an open kitchen: issues.couchbase.com
Support team available for 24 x 7 x 365 emergency assistance…
Co-located with developers for fastest response time
… but we hope you’ll never need us!
Thank you.