
Page 1:

© 2011 Cisco. All rights reserved. Cisco Confidential.

[Diagram: Couchbase Server Node write path. The app server's client library issues a Write; data enters Memory (Managed Cache), is queued to disk (Queue to disk → Disk), and is placed on the Replication Queue → NIC for replicas to other nodes within the cluster. The Response is returned before writing to disk.]

Data is persisted to disk by Couchbase Server asynchronously, based on rules in the system.

Great performance, but if a node fails, all data in its queues is lost forever with no chance of recovery. This is not acceptable for some of the Videoscape apps.

[Diagram: the same write path, but the Response is returned after writing to disk.]

[Diagram: the same write path, but the Response is returned after writing to disk and to one or more replicas(?).]

• Trading performance for better reliability, on a per-operation basis.

• In Couchbase this is supported, but there is a performance impact that needs to be quantified (assumed to be significant until proven otherwise).

• Node failure can still result in the loss of data in the replication queue, meaning local writes to disk may not yet have replicas on other nodes.

• Q1. Is the above mode supported by Couchbase (i.e., respond to the write only after writing to local disk and ensuring that at least one replica is successfully copied to another node within the cluster)?

• Q2. If supported, can this be done on a per-operation basis?

• Q3. Do you have any performance data on this and the previous case?

Performance vs. reliability (fault tolerance and recoverability). Please see the notes section below. A hedged per-operation durability sketch follows.
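Regarding Q1 and Q2: the client SDKs of this era expose durability requirements on individual operations. The sketch below is a minimal illustration, assuming the 1.x Couchbase Java client (CouchbaseClient with PersistTo/ReplicateTo overloads); the package locations, enum names, and host/bucket values are assumptions to be checked against the SDK documentation, not an answer taken from these slides.

import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;
import com.couchbase.client.CouchbaseClient;
import net.spy.memcached.PersistTo;   // assumed package for the durability enums
import net.spy.memcached.ReplicateTo;

public class DurableWriteSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder node list, bucket name, and (empty) password.
        List<URI> nodes = Arrays.asList(URI.create("http://127.0.0.1:8091/pools"));
        CouchbaseClient client = new CouchbaseClient(nodes, "default", "");

        // A plain client.set(key, exp, value) returns once the managed cache
        // accepts the mutation (the "response before writing to disk" case).
        // With the durability overload, the future completes only after the
        // write is persisted on the master node's disk AND replicated to the
        // memory of at least one other node in the cluster.
        boolean ok = client.set("doc::1", 0, "{\"v\":1}",
                PersistTo.MASTER, ReplicateTo.ONE).get(5, TimeUnit.SECONDS);
        System.out.println("durable write acknowledged: " + ok);

        client.shutdown(3, TimeUnit.SECONDS);
    }
}

Because the durability requirement is passed per call, ordinary low-latency writes and durable writes can be mixed in the same application, which is the per-operation trade-off Q2 asks about.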

Page 2:


Replication vs. Persistence

[Diagram: an App Server writes to Server 1; each of Server 1, Server 2, and Server 3 has a Managed Cache, Disk Queue, Disk, and Replication Queues; the write flows from Server 1's replication queues to the other servers.]

• Replication allows us to block the write while the node replicates it to the memcached (memory) layer of two other nodes. This is typically a very quick operation, so control can be returned to the app server almost immediately instead of waiting on a write to disk.
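A minimal sketch of what that looks like from the client, again assuming the 1.x Java SDK and the same imports as the sketch on Page 1: PersistTo.ZERO skips any wait for disk, and ReplicateTo.TWO blocks only until the mutation is in the memory of two other nodes.

// Assumed 1.x Java SDK; "client" is a connected CouchbaseClient as on Page 1.
// PersistTo.ZERO: do not wait for any disk write.
// ReplicateTo.TWO: wait until the mutation is in the managed cache of two other nodes.
static boolean replicateOnlyWrite(CouchbaseClient client, String key, Object value)
        throws Exception {
    return client.set(key, 0, value, PersistTo.ZERO, ReplicateTo.TWO)
                 .get(5, TimeUnit.SECONDS);
}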

Page 3:


[Diagram: Couchbase Server Node write path as before (client library → Memory (Managed Cache) → Queue to disk → Disk; Replication Queue → NIC for replicas to other nodes within the cluster), plus an XDCR Queue → NIC path carrying replicas to other DCs.]

When are the replicas to other DCs put in the XDCR queue?

• Based on the user manual, this is done after writing to local disk! This will obviously add major latencies.

• Q5. Is there another option to accelerate this, as in the case of local replicas? Our understanding is that when the “replication queue” is used for local replication, Couchbase puts the data in the replication queue before writing to disk. Is this correct? Why is this different for the “XDCR queue” case, i.e., write to disk first?

XDCR: Cross Data Center Replication

Response after writing the replica to the remote DC?

• Q4. Is this mode supported by Couchbase (i.e., respond to the client only after the replica is successfully copied to a remote data center node/cluster)?

Please see notes section below

Page 4:



XDCR: Cross Data Center Replication

Q4. Is this mode supported by Couchbase (i.e., respond to the client only after the replica is successfully copied to a remote data center node/cluster)?

As a workaround from the application side, the write operation can be overloaded so that, before it returns control to the app, it does two gets for the item that was written: one against the current cluster and one against the remote cluster, to determine when that write has persisted to the remote cluster (data center write). See the pseudo-code in the notes.
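Those notes are not included in this transcript; the following is a hedged reconstruction of the described workaround, not the original pseudo-code. It assumes the 1.x Java SDK and two already-connected CouchbaseClient instances; localClient and remoteClient are hypothetical names for clients pointed at the local cluster and the XDCR destination cluster.

// Hedged reconstruction of the workaround described above (not the actual notes).
// Write locally, then poll both clusters until the mutation is visible remotely.
// (Imports assumed: com.couchbase.client.CouchbaseClient, java.util.concurrent.TimeUnit.)
static boolean writeAndWaitForRemote(CouchbaseClient localClient,
                                     CouchbaseClient remoteClient,
                                     String key, Object value,
                                     long timeoutMs) throws Exception {
    if (!localClient.set(key, 0, value).get(5, TimeUnit.SECONDS)) {
        return false;                          // local write was not acknowledged
    }
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
        Object local = localClient.get(key);   // confirm local visibility
        Object remote = remoteClient.get(key); // has XDCR delivered it yet?
        if (local != null && local.equals(remote)) {
            return true;                       // remote cluster has this write
        }
        Thread.sleep(50);                      // brief back-off before polling again
    }
    return false;                              // remote copy not observed in time
}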

We will support synchronous writes to a remote data center from the client side in the 2.0.1 release of Couchbase Server, available Q1 2013.

Lastly, XDCR is only necessary to span AWS Regions; a Couchbase cluster without XDCR configured can span AWS zones without issue.

Q5. Is there another option to accelerate this, as in the case of local replicas? Our understanding is that when the “replication queue” is used for local replication, Couchbase puts the data in the replication queue before writing to disk. Is this correct? Why is this different for the “XDCR queue” case, i.e., write to disk first?

XDCR replication to a remote DC is built on a different technology from the in-memory intra-cluster replication. XDCR is done along with writing data to disk so that it is more efficient: since the flusher that writes to disk de-duplicates data (that is, it only writes the last mutation of a document to disk), XDCR can benefit from this. This is particularly helpful for write-heavy / update-heavy workloads. The XDCR queue sends less data over the wire and is hence more efficient.

Lastly, we are also looking at a prioritized disk-write queue as a roadmap item for 2013. This feature could be used to accelerate a write's persistence to disk, reducing latencies for indexes and XDCR.

Q6. – per command

Page 5:


[Diagram: a Data Center with Rack #1, Rack #2, ..., Rack #n containing Couchbase nodes 1–7.]

• Q6. Can Couchbase support rack-aware replication (as in Cassandra)?

• If we can't control where the replicas are placed, a rack failure could lose all replicas (i.e., docs become unavailable until the rack recovers)!

• Q7. How does Couchbase deal with that today? We need at least one of the replicas to be on a different rack.

• Note that in actual deployments we can't always assume that Couchbase nodes will be placed in different racks.

• See the following link for an AWS/Hadoop example: http://bradhedlund.s3.amazonaws.com/2011/hadoop-network-intro/Understanding_Hadoop_Clusters_and_the_Network-bradhedlund_com.pdf

[Diagram: an app server with client library writes File-1; its replicas (File-1 Replica-1, File-1 Replica-2) are placed on other nodes. Rack-aware replication?]

Please see notes section below

Page 6:


Couchbase Server Sharding Approach

Page 7:


Fail Over Node

[Diagram: a Couchbase Server cluster of five nodes (SERVER 1–SERVER 5); each node holds an ACTIVE set and a REPLICA set of documents (Doc 1–Doc 9).]

• App servers accessing docs

• Requests to Server 3 fail

• Cluster detects the server failed, promotes replicas of its docs to active, and updates the cluster map

• Requests for docs now go to the appropriate server

• Typically a rebalance would follow (a sketch of enabling auto-failover appears at the end of this page)

[Diagram labels: APP SERVER 1 and APP SERVER 2, each with the Couchbase client library and cluster map; user-configured replica count = 1; COUCHBASE SERVER CLUSTER.]
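As a side note on the failover sequence above: whether the cluster fails a node over automatically is governed by the auto-failover setting. The sketch below enables it over the management REST interface; the endpoint and parameter names (/settings/autoFailover with enabled and timeout) and the host/credentials are assumptions to verify against the Couchbase Server documentation, not something stated in these slides.

// Illustrative sketch (assumed REST endpoint): enable auto-failover so a
// downed node is failed over automatically after ~30 seconds.
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class EnableAutoFailover {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://127.0.0.1:8091/settings/autoFailover"); // placeholder host
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        String auth = Base64.getEncoder()
                .encodeToString("Administrator:password".getBytes(StandardCharsets.UTF_8)); // placeholder creds
        conn.setRequestProperty("Authorization", "Basic " + auth);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write("enabled=true&timeout=30".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode()); // expect 200 on success
    }
}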

Page 8:


Rack-aware replication

[Diagram: a Data Center with Rack #1–Rack #4; vBuckets (e.g., vb1, vb256, vb277, vb500, vb900, vb1000) have their Active copy and their Replica 1–3 copies spread across nodes in different racks; an app server with client library writes into the cluster.]

• Couchbase supports rack-aware replication through the number of replica copies and by limiting the number of Couchbase nodes in a given rack.
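The replica count referred to here is set per bucket. As a hedged illustration only, the sketch below creates a bucket with two replicas through the management REST interface; the endpoint and form parameters (/pools/default/buckets with name, bucketType, ramQuotaMB, replicaNumber, authType, saslPassword), the bucket name, and the credentials are assumptions to verify against the Couchbase Server REST API documentation.

// Hedged sketch (assumed REST endpoint): create a bucket whose data is kept
// as one active copy plus two replica copies (replicaNumber=2).
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class CreateBucketWithReplicas {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://127.0.0.1:8091/pools/default/buckets"); // placeholder host
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        String auth = Base64.getEncoder()
                .encodeToString("Administrator:password".getBytes(StandardCharsets.UTF_8)); // placeholder creds
        conn.setRequestProperty("Authorization", "Basic " + auth);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        // "example-bucket" is a hypothetical name; combined with limiting the
        // nodes per rack, two replicas keep a full copy outside any single rack.
        String body = "name=example-bucket&bucketType=couchbase&ramQuotaMB=512"
                + "&replicaNumber=2&authType=sasl&saslPassword=";
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode()); // 202 Accepted is typical on success
    }
}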

Page 9:


Availability Zone-aware replication (AWS EAST)

[Diagram: Zone A (Nodes 1–3), Zone B (Nodes 4–6), and Zone C (Nodes 7–9) forming a single Couchbase cluster.]

• Q8. If all the nodes of a single Couchbase cluster (nodes 1–9) are distributed among three availability zones (AZs) as shown, can Couchbase support AZ-aware replication? That is, can it ensure that the replicas of a doc are distributed across different zones, so that a zone failure does not result in doc unavailability?

Assume Inter-AZ latency ~1.5 ms

Page 10:


Availability Zone-aware replication (AWS EAST)

[Diagram repeated from Page 9: Zone A (Nodes 1–3), Zone B (Nodes 4–6), Zone C (Nodes 7–9); assume inter-AZ latency ~1.5 ms.]

• Couchbase currently supports zone affinity through the number of replicas and by limiting the number of Couchbase nodes in a given zone.

• The replica factor is applied at the bucket level, and up to 3 replicas can be specified. Each replica is one full copy of the data set, distributed across the available nodes in the cluster. With 3 replicas, the cluster contains 4 full copies of the data: 1 active + 3 replicas.

• By limiting the number of Couchbase nodes in a given zone to the replica count, losing a zone does not result in data loss, as a full copy of the data still exists in another zone.

• In the example above, consider a worst-case scenario with 3 replicas enabled: the active copy lives on node 1, replica 1 on node 2, replica 2 on node 3, and replica 3 on node 4. If Zone A goes down, those nodes can be automatically failed over, which promotes replica 3 on node 4 to active.

• Explicit zone-aware replication (affinity) is a roadmap item for 2013.

Page 11:


Throughput / latency with scale

We need performance tests showing scaling from 3 nodes to 50-80 nodes in a single cluster, with performance (throughput and latencies) reported in 5- or 10-node increments. During these tests, the following configurations/assumptions must be used:

• Nodes are physically distributed in 3 availability zones (e.g. AWS EAST zones).

• No data loss (of any acknowledged write) when a single node fails in any zone, or when an entire availability zone fails. It is OK to lose unacknowledged writes, since clients can deal with that. To achieve this, we need:

o Durable writes enabled (i.e., don't ack the client's write request until the write is physically done to disk on the local node and at least one more replica is written to disk on other nodes in different availability zones).

Even though the shared performance tests look great, unfortunately the test assumptions used (lack of reliable writes) are unrealistic for our Videoscape deployments/use-case scenarios.

We need performance test results that are close to our use case! Please see the following Netflix tests and their write-durability / multi-zone assumptions, which are very close to our use case: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

Q9. Please advise if you are willing to conduct such tests!

Page 12:


Throughput / latency with scale

We need performance test results that are close to our use case! Please see the following Netflix tests and their write-durability / multi-zone assumptions, which are very close to our use case: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

The Netflix test describes that writes can be durable across regions, but does not specify whether the latency and throughput they saw were from quorum writes or single writes (single writes are faster and map to an unacknowledged write). This is similar to our benchmark numbers: although we have the ability to do durable writes across regions and racks, we do not specify it in the data. Can you confirm that the Netflix benchmark numbers for latency/throughput/CPU utilization were done wholly with quorum writes?