Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015

Post on 26-Jul-2015

TRANSCRIPT

TROUBLESHOOTING COUCHBASE

David Haikney, Head of Technical Support, Couchbase EMEA

©2015 Couchbase Inc. 2

What Can Possibly Go Wrong?

Help! My Node is Down

Why Did My Operation Fail?

Some Handy Tools

The following presentation is based on actual events…

Help! My Node is Down

Why Does A Node Go Down?...

[Diagram: four-node cluster (Node 1, Node 2, Node 3, Node 4)]

…Because Heartbeats Have Gone Missing

[Diagram: four-node cluster (Node 1, Node 2, Node 3, Node 4)]

1. Server Itself is Down

Server (or VM) going offline takes Couchbase with it

Occurs during planned or “unscheduled” maintenance

Troubleshooting Tips!

What matters is server status at the time of the event

For VMs, check status in the management console

Check server’s uptime

Couchbase’s own uptime: cbstats|grep uptime
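The uptime check can be scripted. The following is a minimal sketch, assuming cbstats output has already been captured as text in its "name: value" layout; the sample text, the function names and the 7200-second figure are illustrative assumptions, not taken from the talk.

```python
import re


def parse_uptime_seconds(cbstats_output):
    """Return the 'uptime' stat (in seconds) from cbstats text, or None if absent."""
    for line in cbstats_output.splitlines():
        # Anchor at the start of the line so other stats ending in "uptime" are ignored.
        match = re.match(r"\s*uptime:\s*(\d+)\s*$", line)
        if match:
            return int(match.group(1))
    return None


def restarted_since(uptime_seconds, seconds_since_event):
    """True if the process has been up for less time than has passed since the event."""
    return uptime_seconds < seconds_since_event


# Hypothetical sample text; real output contains many more stats.
sample = """\
 threads:      4
 uptime:       5400
 version:      3.0.3
"""

uptime = parse_uptime_seconds(sample)
print(uptime, restarted_since(uptime, seconds_since_event=7200))
```

If the Couchbase uptime is shorter than the time elapsed since the node was reported down, the service restarted after the event; comparing it with the server's own uptime shows whether the whole machine went with it.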

2. Server is Unreachable

Server unavailable on network

Usually related to wider datacenter event

Our assailant may have fled the scene!

Troubleshooting Tips!

Verify connectivity from other cluster nodes (ping/ssh)

Check system and network logs around time of first report

Standard network monitoring should be deployed
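Connectivity can also be verified from a script run on another cluster node. This sketch swaps ping/ssh for a plain TCP connect to the data port (11210, the port named later in the talk); the function names and timeout are assumptions chosen for illustration.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, DNS failures and timeouts.
        return False


def unreachable_nodes(nodes, port=11210, timeout=2.0):
    """Return the subset of nodes that did not accept a TCP connection."""
    return [node for node in nodes if not can_connect(node, port, timeout)]


# Example call with hypothetical host names:
#   unreachable_nodes(["node1.example.com", "node2.example.com"])
```

A refused connection and a timeout both count as unreachable here; telling them apart (service down versus network path broken) needs the system and network logs.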

3. Couchbase Service is Offline

Server is available but Couchbase service is not running

Relatively rare since babysitter restarts failed processes

Troubleshooting Tips!

Check which Couchbase processes are running

Check dmesg for Linux’s OOM Killer:

May 14 04:26:49 cgs1 kernel: memcached invoked oom-killer

May need to reduce the server quota

Attempt to restart the service

Possibly warming up - monitor for progress with cbstats
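The OOM Killer check can be automated over saved dmesg or syslog text. This is a rough sketch that keys on the "invoked oom-killer" wording of the log line shown on this slide; the function name and pattern are illustrative assumptions.

```python
import re

# Matches the process name immediately before "invoked oom-killer".
OOM_PATTERN = re.compile(r"(?P<process>\S+) invoked oom-killer")


def find_oom_events(log_text):
    """Return a (process, line) tuple for every line reporting an OOM Killer invocation."""
    events = []
    for line in log_text.splitlines():
        match = OOM_PATTERN.search(line)
        if match:
            events.append((match.group("process"), line.strip()))
    return events


log = "May 14 04:26:49 cgs1 kernel: memcached invoked oom-killer\n"
for process, line in find_oom_events(log):
    print(process, "->", line)
```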

4. Couchbase Too Slow to Respond

Service is running but failed to send timely heartbeats

Common trigger of autofailover

Typical causes: THP, Sizing, Swap, Hypervisor Disturbance

Troubleshooting Tips!

Ensure Transparent Huge Pages (THP) are disabled on Linux systems

Follow Sizing Best Practice

Check for system swap usage

Avoid VM Over-commit or migration

Increase the autofailover timeout
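The THP and swap checks lend themselves to a small script. The sketch below assumes the standard Linux locations /sys/kernel/mm/transparent_hugepage/enabled and /proc/meminfo (some distributions use a different THP path) and parses their text; the helper names and sample values are illustrative.

```python
import re


def active_thp_mode(enabled_text):
    """Return the bracketed (active) mode from the THP 'enabled' file, or None."""
    match = re.search(r"\[(\w+)\]", enabled_text)
    return match.group(1) if match else None


def swap_used_kb(meminfo_text):
    """Return SwapTotal minus SwapFree in kB, or None if either field is missing."""
    values = {}
    for line in meminfo_text.splitlines():
        match = re.match(r"(SwapTotal|SwapFree):\s+(\d+)\s+kB", line)
        if match:
            values[match.group(1)] = int(match.group(2))
    if "SwapTotal" not in values or "SwapFree" not in values:
        return None
    return values["SwapTotal"] - values["SwapFree"]


# Hypothetical file contents; on a live node read the two files instead.
print(active_thp_mode("always madvise [never]\n"))
print(swap_used_kb("SwapTotal:  2097148 kB\nSwapFree:   2000000 kB\n"))
```

Any active mode other than "never", or a non-zero swap figure, is worth following up on a node that has been slow to send heartbeats.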

Sizing Matters

8 Buckets, 60 Design Docs, 10 XDCR Streams

8 CPU Cores

See Perry Krug’s session at 5:15pm today!

Why Did My Operation Fail?

The lifecycle of a simple get() Operation

[Diagram: Client (Your Application and Couchbase SDK) connected over the Network to Couchbase Server Node 1 and Couchbase Server Node 2]

Possible Pitfalls: Operation couldn’t be dispatched

[Diagram: Client (Your Application and Couchbase SDK), marked with a question mark]

Client unsuccessfully initialised

Server Connections exhausted

Stop the World Garbage Collection

Consecutive Failovers

Troubleshooting Tips!

Use a singleton pattern

Check node health on cluster

Check vbucket map from server

If necessary, restart client

Tune Garbage Collection Settings
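The singleton advice amounts to creating one client per process and sharing it, rather than opening a new set of server connections for every request. A minimal sketch follows; the SharedClient wrapper and its factory argument are hypothetical names, and the factory stands in for whatever call your SDK uses to open a connection.

```python
import threading


class SharedClient:
    """Create the underlying client once and hand the same instance to every caller."""

    def __init__(self, factory):
        self._factory = factory      # callable that opens the real connection
        self._lock = threading.Lock()
        self._client = None

    def get(self):
        # The lock ensures concurrent first calls still create only one client.
        with self._lock:
            if self._client is None:
                self._client = self._factory()
            return self._client


# Demonstration with a stand-in factory that counts how often it is called.
created = []
shared = SharedClient(lambda: created.append("connect") or object())
first, second = shared.get(), shared.get()
print(first is second, len(created))
```

Keeping the instance in one module-level object also gives a single place to rebuild the client if a restart turns out to be necessary.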

Possible Pitfalls: Operation Did Not Complete

[Diagram: Client (Your Application and Couchbase SDK) and Couchbase Server Node]

Did operation arrive at server? Firewalls often intervene!

Did server have difficulty responding?

Did client receive a response?

Troubleshooting Tips!

Look for a pattern as to which clients and servers are affected

Code defensively and retry (at least once)

Check network and server health

When all else fails, tcpdump / wireshark…
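"Code defensively and retry" can be sketched as a small wrapper with exponential backoff. TemporaryFailure below is a hypothetical stand-in for whichever transient errors your SDK raises (timeouts, temporary out of memory); the attempt count and delays are arbitrary illustrative values.

```python
import time


class TemporaryFailure(Exception):
    """Hypothetical stand-in for an SDK's transient errors."""


def with_retry(operation, attempts=3, base_delay=0.05, sleep=time.sleep):
    """Call operation(); on TemporaryFailure wait and retry, doubling the delay each time."""
    for attempt in range(attempts):
        try:
            return operation()
        except TemporaryFailure:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * (2 ** attempt))


# Demonstration: an operation that fails twice, then succeeds.
outcomes = [TemporaryFailure(), TemporaryFailure(), "value"]


def flaky_get():
    result = outcomes.pop(0)
    if isinstance(result, Exception):
        raise result
    return result


print(with_retry(flaky_get))
```

Only errors known to be transient should be retried; a genuine "key does not exist" response is an answer, not a failure.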

tcpdump and Wireshark

Wireshark is Couchbase aware

Uses tcpdump packet capture

Can be noisy so filter wisely

Specific client

Specific server

Port 11210
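The three filters combine into a single capture expression. This sketch builds a tcpdump argument list from them; the IP addresses, output file name and the choice of the "any" interface are illustrative assumptions.

```python
def capture_filter(client=None, server=None, port=11210):
    """Build a capture filter limited to the data port and, optionally, two hosts."""
    parts = ["tcp port %d" % port]
    if client:
        parts.append("host %s" % client)
    if server:
        parts.append("host %s" % server)
    return " and ".join(parts)


def tcpdump_command(filter_expr, out_file="couchbase.pcap", interface="any"):
    """Return an argument list that writes matching packets to a file for Wireshark."""
    return ["tcpdump", "-i", interface, "-w", out_file, filter_expr]


# Hypothetical client and server addresses.
expr = capture_filter(client="10.0.0.5", server="10.0.0.21")
print(" ".join(tcpdump_command(expr)))
```

Writing to a file with -w keeps the capture cheap on the affected host, and the resulting file can be opened in Wireshark elsewhere.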

Possible Pitfalls: Not The Response I Expected

Customer: My operation failed with a “Key Does Not Exist” Error

CB Support: OK, are you sure the key exists?

Customer: Yes, our logs show the key being created and we don’t do deletes.

CB Support: The delete counter on your cluster is increasing….

Troubleshooting Tips!

Simplest explanation is often the correct one

Test your application code in node down, failover and rebalance scenarios

Defensive Coding for all Couchbase operations

E.g. for Temporary Out Of Memory

Have a simple client for quick tests to isolate the problem

< 10 lines of Python!
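A quick-test client can be as small as one write followed by one read. The sketch below is written against any object offering upsert(key, value) and get(key).value; the commented-out lines show how it might be pointed at a real cluster, assuming the Couchbase Python SDK 2.x and a bucket named "default" on localhost. InMemoryBucket is a hypothetical stand-in so the check can be run without a cluster.

```python
def round_trip(bucket, key="support-smoke-test", value=None):
    """Write a document, read it back, and report whether the values match."""
    if value is None:
        value = {"ok": True}
    bucket.upsert(key, value)
    return bucket.get(key).value == value


class InMemoryBucket:
    """Hypothetical stand-in with the same two methods, for offline runs."""

    class _Result:
        def __init__(self, value):
            self.value = value

    def __init__(self):
        self._docs = {}

    def upsert(self, key, value):
        self._docs[key] = value

    def get(self, key):
        return self._Result(self._docs[key])


print(round_trip(InMemoryBucket()))

# Against a real cluster (assumed SDK 2.x API, host and bucket name):
#   from couchbase.bucket import Bucket
#   print(round_trip(Bucket("couchbase://localhost/default")))
```

Running the same few lines from the affected application host and from a known-good host helps separate application problems from cluster problems.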

Some Other Handy Tools

Scenario: Troubleshooting a 2.2.0 XDCR case

5 Million docs went swimming one day,

Off to a cluster far away,

XDCR said quack-quack-quack-quack,

But only 4,999,999 docs came back

Finding the Needle in the Haystack

Identify a single vbucket that exhibits the problem
Immediately narrows the problem space to 1/1024th of the data set
cbstats vbucket-details

Interrogate the files on disk to find the discrepancy
couch_dbdump --no-body --json --by-id <file>
jq tool is very useful for CLI json parsing!

Diff the source cluster and the destination cluster
Easier on a static data set but still feasible on a live cluster
With ops in flight, perform the diff twice and take the intersection
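The "diff twice and take the intersection" step is plain set arithmetic once the document IDs for one vbucket have been extracted from each cluster. A minimal sketch, with hypothetical IDs and function names:

```python
def missing_from_destination(source_ids, destination_ids):
    """IDs present on the source but absent from the destination."""
    return set(source_ids) - set(destination_ids)


def persistent_discrepancies(first_diff, second_diff):
    """IDs missing in both passes, i.e. not explained by operations still in flight."""
    return set(first_diff) & set(second_diff)


# Hypothetical document IDs from two passes over the same vbucket.
# Pass 1: "c" and "d" have not arrived at the destination.
pass1 = missing_from_destination({"a", "b", "c", "d"}, {"a", "b"})
# Pass 2: "d" has since replicated, "e" is new and still in flight.
pass2 = missing_from_destination({"a", "b", "c", "d", "e"}, {"a", "b", "d"})

print(sorted(persistent_discrepancies(pass1, pass2)))
```

Documents that appear in only one of the two diffs were simply in transit at the time; the ones that survive the intersection are the candidates for the lost document.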

Final Thoughts

You now know the common causes of Node and Operation Failures

Troubleshooting requires taking a logical path through the scenarios

Tools exist to help you isolate the problem

We are an open kitchen: issues.couchbase.com

Support team available for 24 x 7 x 365 emergency assistance….

Co-located with developers for fastest response time

… but we hope you’ll never need us!

Thank you.
