Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015

TROUBLESHOOTING COUCHBASE David Haikney Head of Technical Support Couchbase EMEA

Upload: couchbase

Post on 26-Jul-2015



TROUBLESHOOTING COUCHBASE

David Haikney, Head of Technical Support, Couchbase EMEA


©2015 Couchbase Inc. 2

What Can Possibly Go Wrong?

Help! My Node is Down

Why Did My Operation Fail?

Some Handy Tools

The following presentation is based on actual events…


Help! My Node is Down


Why Does A Node Go Down?...

[Diagram: cluster of four nodes, Node 1 to Node 4]


…Because Heartbeats Have Gone Missing

[Diagram: cluster of four nodes, Node 1 to Node 4]



1. Server Itself is Down

Server (or VM) going offline takes Couchbase with it

Occurs during planned or “unscheduled” maintenance

Troubleshooting Tips!

What matters is server status at the time of the event

(VMs) Check status in management console

Check server’s uptime

Check Couchbase’s own uptime: cbstats | grep uptime
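The uptime comparison in the tips above can be sketched roughly as follows. This assumes a Linux host (where /proc/uptime exists) and that the Couchbase uptime in seconds has already been read from the cbstats output. The function names and the 60-second tolerance are illustrative assumptions, not part of any Couchbase tool.

```python
def parse_proc_uptime(text):
    """Return host uptime in seconds from the contents of /proc/uptime."""
    # /proc/uptime holds two numbers: seconds since boot and idle seconds.
    return float(text.split()[0])


def classify_restart(host_uptime_s, couchbase_uptime_s, seconds_since_event,
                     tolerance_s=60):
    """Guess what restarted around the time of the event (illustrative logic)."""
    # A host that has been up for less time than has passed since the event
    # must have booted at or after the event.
    if host_uptime_s <= seconds_since_event + tolerance_s:
        return "server restarted"
    # Host stayed up, but the Couchbase process is younger than the event.
    if couchbase_uptime_s <= seconds_since_event + tolerance_s:
        return "only Couchbase restarted"
    return "neither restarted"
```

If neither restarted, the node was most likely unreachable or too slow rather than down.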


2. Server is Unreachable

Server unavailable on network

Usually related to wider datacenter event

Our assailant may have fled the scene!

Troubleshooting Tips!

Verify connectivity from other cluster nodes (ping / ssh)

Check system and network logs around time of first report

Standard network monitoring should be deployed
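A minimal sketch of a connectivity check to run from another cluster node. It uses a plain TCP connect rather than ping or ssh, so it also shows whether a specific port is reachable. The function name and default timeout are assumptions.

```python
import socket


def can_connect(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

For example, `can_connect(node_address, 11210)` tests the port named later in this deck for packet capture. `node_address` is a placeholder for a real node.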


3. Couchbase Service is Offline

Server is available but Couchbase service is not running

Relatively rare since babysitter restarts failed processes

Troubleshooting Tips!

Check which Couchbase processes are running

Check dmesg for Linux’s OOM Killer:

May 14 04:26:49 cgs1 kernel: memcached invoked oom-killer

May need to reduce the server quota

Attempt to restart the service

Possibly warming up - monitor for progress with cbstats
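The OOM killer check can be automated with a small log scan. This is a sketch that assumes log lines shaped like the sample on this slide; the pattern and function name are illustrative. It reports the process that invoked the OOM killer, which is not necessarily the process that was killed.

```python
import re

# Matches log lines such as:
#   May 14 04:26:49 cgs1 kernel: memcached invoked oom-killer
OOM_PATTERN = re.compile(r"(?P<process>\S+) invoked oom-killer")


def find_oom_invokers(log_text):
    """Return the names of processes that invoked the OOM killer, in log order."""
    return [m.group("process") for m in OOM_PATTERN.finditer(log_text)]
```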


4. Couchbase Too Slow to Respond

Service is running but failed to send timely heartbeats

Common trigger of autofailover

THP, Sizing, Swap, Hypervisor Disturbance

Troubleshooting Tips!

Ensure Transparent Huge Pages (THP) are disabled on Linux systems

Follow Sizing Best Practice

Check for system swap usage

Avoid VM over-commit or migration

Increase the autofailover timeout
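The THP and swap checks can be sketched as two small parsers. They assume the usual Linux formats: the THP file lists its modes with the active one in square brackets, and /proc/meminfo reports SwapTotal and SwapFree in kB. The function names are illustrative.

```python
def parse_thp_setting(text):
    """Return the active THP mode from text such as 'always madvise [never]'."""
    for word in text.split():
        # The kernel marks the active mode with square brackets.
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def swap_used_kb(meminfo_text):
    """Return swap in use (kB) from the contents of /proc/meminfo."""
    values = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name in ("SwapTotal", "SwapFree"):
            values[name] = int(rest.split()[0])
    return values["SwapTotal"] - values["SwapFree"]
```

On most distributions the THP setting is read from /sys/kernel/mm/transparent_hugepage/enabled; some older systems use a different path. A result other than "never" from the first function, or a non-zero result from the second, is worth investigating.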


Sizing Matters

8 Buckets

60 Design Docs

10 XDCR Streams

8 CPU Cores

See Perry Krug’s session at 5:15pm today!


Why Did My Operation Fail?


The lifecycle of a simple get() Operation

[Diagram: Client (Your Application and Couchbase SDK) connected over the Network to Couchbase Server Node 1 and Couchbase Server Node 2]


Possible Pitfalls: Operation couldn’t be dispatched

[Diagram: Client (Your Application and Couchbase SDK)]

Client unsuccessfully initialised

Server Connections exhausted

Stop the World Garbage Collection

Consecutive Failovers

Troubleshooting Tips!

Use a singleton pattern

Check node health on cluster

Check vbucket map from server

If necessary, restart client

Tune Garbage Collection Settings
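A minimal sketch of the singleton tip: the application creates one client and shares it, instead of opening a new connection for every operation and exhausting server connections. The `connect` argument is a hypothetical factory standing in for whatever call the SDK in use provides to open a connection; it is not a Couchbase API.

```python
import threading

_client = None
_client_lock = threading.Lock()


def get_client(connect):
    """Return the one shared client, creating it with connect() on first use."""
    global _client
    if _client is None:
        with _client_lock:
            # Re-check inside the lock so two threads cannot both connect.
            if _client is None:
                _client = connect()
    return _client
```

Every caller then goes through `get_client(...)`, and the factory runs only once per process.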


Possible Pitfalls: Operation Did Not Complete

[Diagram: Client (Your Application and Couchbase SDK) and Couchbase Server Node]

Did operation arrive at server? Firewalls often intervene!

Did server have difficulty responding?

Did client receive a response?

Troubleshooting Tips!

Look for a pattern as to which clients and servers are affected

Code defensively and retry (at least once)

Check network and server health

When all else fails, tcpdump / wireshark…
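"Code defensively and retry (at least once)" can be sketched as a small wrapper. This is a generic helper, not an SDK feature. The name, the defaults and the simple linear back-off are assumptions, and `retry_on` should be narrowed to the exception types your SDK raises for timeouts or temporary failures.

```python
import time


def with_retry(operation, retries=1, delay_s=0.1, retry_on=(Exception,)):
    """Run operation(); on a listed exception, wait and retry up to `retries` times."""
    attempt = 0
    while True:
        try:
            return operation()
        except retry_on:
            if attempt >= retries:
                # Out of retries: let the caller see the original error.
                raise
            attempt += 1
            # Wait a little longer after each failed attempt.
            time.sleep(delay_s * attempt)
```

Retrying everything blindly can mask real problems, so log each retry along with the client and server involved to help spot a pattern.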


tcpdump and Wireshark

Wireshark is Couchbase aware

Uses tcpdump packet capture

Can be noisy so filter wisely

Specific client

Specific server

Port 11210
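As a rough illustration of filtering wisely, the helper below builds a capture filter limited to one client, one server and port 11210, plus a tcpdump argument list that writes a file Wireshark can open. The function names, output file name and interface choice are assumptions.

```python
def couchbase_capture_filter(client_ip=None, server_ip=None, port=11210):
    """Build a tcpdump filter limited to one client, one server and one port."""
    parts = ["tcp port %d" % port]
    if client_ip:
        parts.append("host %s" % client_ip)
    if server_ip:
        parts.append("host %s" % server_ip)
    # Every condition must hold, so the terms are joined with "and".
    return " and ".join(parts)


def tcpdump_command(filter_expr, out_file="capture.pcap", interface="any"):
    """Return the argument list for a tcpdump run that writes a capture file."""
    return ["tcpdump", "-i", interface, "-w", out_file, filter_expr]
```

The list can be passed to `subprocess.run` on a host where tcpdump is installed and the user has capture privileges.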


Possible Pitfalls: Not The Response I Expected

Customer: My operation failed with a “Key Does Not Exist” Error

CB Support: OK, are you sure the key exists?

Customer: Yes, our logs show the key being created and we don’t do deletes.

CB Support: The delete counter on your cluster is increasing….

Troubleshooting Tips!

Simplest explanation is often the correct one

Test your application code in node down, failover and rebalance scenarios

Defensive Coding for all Couchbase operations

E.g. for Temporary Out Of Memory

Have a simple client for quick tests to isolate the problem

< 10 lines of python!


Some Other Handy Tools


Scenario: Troubleshooting a 2.2.0 XDCR case

5 Million docs went swimming one day,

Off to a cluster far away,

XDCR said quack-quack-quack-quack,

But only 4,999,999 docs came back


Finding the Needle in the Haystack

Identify a single vbucket that exhibits the problem

Immediately narrows the problem space to 1/1024th of the data set

cbstats vbucket-details

Interrogate the files on disk to find the discrepancy

couch_dbdump --no-body --json --by-id <file>

jq tool is very useful for CLI json parsing!

Diff the source cluster and the destination cluster

Easier on a static data set but still feasible on a live cluster

With ops in flight, perform the diff twice and take the intersection
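The "diff twice and take the intersection" step can be sketched with plain sets. It assumes the document ids for one vbucket have already been extracted from each cluster, for example from the dump output above; the function names are illustrative.

```python
def missing_ids(source_ids, destination_ids):
    """Return ids present on the source but absent from the destination."""
    return set(source_ids) - set(destination_ids)


def persistent_missing_ids(first_diff, second_diff):
    """Keep only ids missing in both passes; ids merely in flight drop out."""
    return set(first_diff) & set(second_diff)
```

A document that was still replicating during the first pass appears in the first diff but not the second, so only genuinely missing documents survive the intersection.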


Final Thoughts

You now know the common causes of Node and Operation Failures

Troubleshooting requires taking a logical path through the scenarios

Tools exist to help you isolate the problem

We are an open kitchen: issues.couchbase.com

Support team available for 24 x 7 x 365 emergency assistance….

Co-located with developers for fastest response time

… but we hope you’ll never need us!


Thank you.