
Couchbase Server Under the Hood: An Architectural Overview

Couchbase Server is an open-source, distributed NoSQL document-oriented database for interactive applications, uniquely suited for those needing a flexible data model, easy scalability, consistently high performance, and always-on 24x365 availability. This paper dives deep into the internal architecture of Couchbase Server. It assumes a basic understanding of the product and is intended for readers looking for a greater technical understanding of what's under the hood.

High-Level Deployment Architecture

Couchbase Server has a true shared-nothing architecture. It is set up as a cluster of multiple servers behind an application server. Figure 1 shows two application servers interacting with a Couchbase Server cluster. It also illustrates that documents stored in Couchbase Server can be replicated to other servers in the cluster. Replication is discussed later in this paper.

Figure 1: Couchbase Server deployment architecture


A bucket can be configured to replicate up to three times within a cluster. Because Couchbase Server splits buckets into partitions, a server manages only a subset of active and replica partitions. This means that at any point in time, for a given partition, only one copy is active with zero or more replica partitions on other servers. If a server hosting an active partition fails, the cluster promotes one of the replica partitions to active, ensuring the application can continue to access data without any downtime.

Figure 2: Mapping document IDs to partitions and then to servers


It’s possible to store JSON documents or binary data in Couchbase Server. Documents are uniformly distributed across the cluster and stored in data containers called buckets. Buckets provide a logical grouping of physical resources within a cluster. For example, when you create a bucket in Couchbase Server, you can allocate a certain amount of RAM for caching data and configure the number of replicas.

By design, each bucket is split into 1024 logical partitions called vBuckets. Partitions are mapped to servers across the cluster, and this mapping is stored in a lookup structure called the cluster map. Applications use Couchbase's smart clients to interact with the server. Smart clients hash the document ID, which the application specifies for every document added to Couchbase Server. As shown in Figure 2, the client applies a hash function (CRC32) to the document ID to determine the partition, consults the cluster map to find the server hosting that partition, and sends the document directly to the server where it should reside.
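The following Python sketch illustrates the principle of this client-side routing. It is not the exact algorithm used by the Couchbase SDKs (which may mask or shift the CRC32 digest differently), but it shows how a document ID can be hashed to one of 1024 partitions and then resolved to a server through a cluster map; the server addresses are placeholders.

import zlib

NUM_VBUCKETS = 1024

# Hypothetical cluster map: partition ID -> address of the server hosting
# the active copy. A real cluster map also lists replica servers.
cluster_map = {vb: "server%d:11210" % (vb % 3) for vb in range(NUM_VBUCKETS)}

def partition_for(doc_id: str) -> int:
    """Hash a document ID to a partition (vBucket) ID."""
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_VBUCKETS

def server_for(doc_id: str) -> str:
    """Look up the server currently hosting the active copy of the partition."""
    return cluster_map[partition_for(doc_id)]

print(server_for("user::1234"))  # e.g. 'server1:11210'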


Client Architecture

To talk to Couchbase Server, applications use memcached compatible smart-client SDKs in a variety of languages including Java, .NET, PHP, Python, Ruby, Node.js, C, and C++. These clients are cluster topology aware and receive updates to the cluster map. They automatically send requests from the application to the appropriate server. With Couchbase Server 3.0, it is possible to use the version 2.0 Couchbase smart clients to communicate with the cluster over SSL.

How does data flow from a client application to Couchbase Server? And what happens within a server node when it receives a request to read or write data?

After a client first connects to the cluster, it requests the cluster map from the Couchbase Server cluster and maintains an open connection with the server for streaming updates. The cluster map is shared with all the servers in a Couchbase Server cluster and with the Couchbase clients. As shown in Figure 3, data flows from a client to the server using the following steps:

1. A user interacts with an application, resulting in the need to update or retrieve a document in Couchbase Server.

2. The application server contacts Couchbase Server via the smart client SDK.

3. The smart client SDK takes the document that needs to be updated and hashes its document ID to a partition ID. With the partition ID and the cluster map, the client can figure out on which server and in which partition this document belongs. The client can then update the document on that server.

4. and 5. When a document arrives in a cluster, Couchbase Server replicates the document within the cluster and to other clusters (depending on durability configuration), caches it in memory, and asynchronously stores it on disk. This process is described in more detail below.

Figure 3: Flow of data from an application to a Couchbase Server cluster

In Couchbase Server, mutations happen at the document level. Clients retrieve the entire document from the server, modify certain fields, and write the entire document back to Couchbase Server. When Couchbase Server receives a request to write a document, as shown in Figure 4 below, the following occurs:

1. Every server in a Couchbase Server cluster has its own object-managed cache. The client writes a document into the cache, and the server sends the client a confirmation. By default, the client does not have to wait for the server to persist and replicate the document, as that happens asynchronously. However, using client SDK durability constraints, you can design your application to wait for replication and/or persistence to complete before the confirmation is returned (see the sketch following this list).

2. The document is added to the intra-cluster replication queue to be replicated to other servers within the cluster.

3. If cross datacenter replication (XDCR) is set up, the document is added to the XDCR queue and streamed from the source cluster to the destination cluster.

4. The document is streamed to the view engine to be incrementally indexed.

5. The document is also added to the disk write queue to be asynchronously persisted to disk. The document is persisted to disk after the disk-write queue is flushed.
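The write path above is fire-and-forget by default. As a hedged illustration of the durability constraints mentioned in step 1, the sketch below assumes the 2.x Couchbase Python SDK, in which the upsert operation accepts persist_to and replicate_to arguments; the connection string and bucket name are placeholders.

from couchbase.bucket import Bucket

# Placeholder connection string and bucket for this sketch.
bucket = Bucket("couchbase://127.0.0.1/default")

doc = {"type": "user", "name": "Ada"}

# Default behavior: the call returns once the document is in the
# object-managed cache of the active node; persistence and replication
# happen asynchronously.
bucket.upsert("user::1234", doc)

# Durability constraints: block until the mutation has been persisted to
# disk on at least one node and replicated to at least one replica.
bucket.upsert("user::1234", doc, persist_to=1, replicate_to=1)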

Figure 4: Data flow within Couchbase Server during a write operation


Per Node Architecture

As shown in Figure 5, every server in a Couchbase Server cluster is architecturally identical and made up of two main components: the data manager and the cluster manager. This uniform architecture on every node helps the database scale linearly and avoids any single point of failure.

Figure 5: Node architecture diagram of Couchbase Server. Each node runs a data manager (object-managed cache, storage engine, and query engine) and a cluster manager (built on Erlang/OTP, exposing the REST management API and web UI). Ports: 8091 for the admin console, 8092 for the query API, and 11211/11210 for data access.

Within the data manager and cluster manager, there are various other sub-components responsible for different functions including rebalancing, replication, and recovery. To keep all the different components in sync and to move data between them at high speed, Couchbase Server 3.0 introduces the Database Change Protocol (DCP), a protocol for replicating changes to components, servers, and datacenters via streams. DCP is at the heart of Couchbase Server 3.0 and extends the memory-centric architecture of Couchbase Server by removing I/O bottlenecks from many of the critical functions described in other sections of this paper.


Data Manager

In Couchbase Server, the data manager stores and retrieves data in response to data operation requests from applications. It exposes three ports to the network: 11211 for non-smart clients, which can be proxied via Moxi (a memcached proxy for Couchbase Server) if needed; 11210 for smart clients; and 11207 for secure SSL communication from smart clients. The data manager is written in C and C++.

The data manager has three parts - the object-managed cache, the storage engine, and the query engine. This section will explain the first two parts.

Object-Managed Cache

Every server in a Couchbase Server cluster includes a built-in multi-threaded object-managed cache, which provides consistent low-latency for read and write operations. The application can access it using memcached compatible APIs such as get, set, delete, append, and prepend.

The size of the cache for each bucket per server can be configured through the bucket memory quota. Hashtables reside in the object-managed cache and represent bucket partitions in Couchbase Server. They offer a quick way of detecting whether a document exists in memory or not. As shown in Figure 6, each document in a partition will have a corresponding entry in the hashtable, which stores the document’s ID (the key), document metadata and the document’s value.

Figure 6: Hashtable item in Couchbase Server. Each item holds the key, a pointer to the next item, the document value, and metadata including the value pointer, expiration time, CAS identifier, sequence number, revision number, lock expiry (for the get-and-lock API), flags, and the NRU bit.
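As a conceptual aid only (the real structure is a C/C++ object inside the data manager), the Python sketch below mirrors the fields listed in Figure 6; the field names are illustrative, not Couchbase's actual identifiers.

from dataclasses import dataclass
from typing import Optional

@dataclass
class HashTableItem:
    key: str                                  # document ID used as the hashtable key
    next_item: Optional["HashTableItem"]      # chaining pointer to the next item
    value: Optional[bytes]                    # cached value; None if ejected to disk
    # Metadata tracked per document:
    expiration_time: int                      # TTL expressed as an absolute expiry time
    cas: int                                  # compare-and-swap identifier
    sequence_number: int                      # DCP sequence number of the last mutation
    revision_number: int                      # revision counter used in conflict resolution
    lock_expiry: int                          # expiry of a get-and-lock lock; 0 if unlocked
    flags: int                                # application-defined flags
    nru: bool                                 # not-recently-used bit consulted by ejection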

As shown in Figure 7, disk I/O requests in Couchbase Server are handled through the disk read-write queue. These requests are prioritized based on bucket priority (low or high), and are processed by worker threads in a global shared thread pool. The number of threads in the thread pool depends on the number of CPUs available on the server. Before a worker thread accesses the hashtable, it needs to get a lock. Couchbase Server uses fine-grained locking to boost request throughput; each hashtable has multiple locks, and each lock controls access to certain sections of that hashtable. As more documents are added to the cluster, the hashtable grows dynamically at runtime.
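A minimal sketch of the fine-grained locking idea described above: each hashtable keeps an array of locks, and a worker thread picks the lock that guards the section of the table containing its key. The lock count and hash choice here are arbitrary illustrations, not Couchbase's actual values.

import threading
import zlib

NUM_LOCKS = 32  # illustrative; one lock guards a slice of the hashtable

locks = [threading.Lock() for _ in range(NUM_LOCKS)]
hashtable = {}

def lock_for(key: str) -> threading.Lock:
    """Pick the lock guarding the hashtable section this key falls into."""
    return locks[zlib.crc32(key.encode()) % NUM_LOCKS]

def set_item(key: str, value: bytes) -> None:
    # Only the section containing this key is locked, so writers to other
    # sections of the same hashtable can proceed in parallel.
    with lock_for(key):
        hashtable[key] = value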

Figure 7: Persisting and replicating Couchbase Server checkpoints. The figure shows partition hashtables and their checkpoints feeding the disk read-write queue (with low- and high-priority queues served by the global shared thread pool), the backfill and replication queues, the append-only b-tree storage engine, the item pager, expiry pager, and checkpoint manager, client API operations (get, set, add, delete, and so on), and memory-to-memory replication to other nodes.


To keep track of document changes, Couchbase Server uses data structures called checkpoints. These are linked lists that keep track of outstanding changes that still need to be persisted by the storage engine and replicated to other replica partitions.

Low latency memory-to-memory replication of documents to other servers occurs via DCP. When replication starts, multiple DCP streams are set up, and checkpoints are replicated based on sequence numbers from active to replica partitions. DCP manages internal state using replication cursors that move forward only after the replica partitions on the destination server have seen the last sequence number that was sent. If replication is interrupted – for example, due to a server crash or intermittent network connection failure – replication does not have to be restarted from the beginning, and can continue from the last sequence number seen by the replica.

Since Couchbase Server enables the user to have more data in the system than available RAM, there can be a significant amount of data that needs to be read from disk in order to transfer it to another node. After all the in-memory data has been read, a backfill process kicks in so that data from disk can also be replicated. The backfill process starts up and decides whether to bulk load data from disk. It looks at the resident item ratio, which is the amount of data cached in RAM versus the total data in the system. If that ratio is lower than 95 percent, it bulk loads the whole partition from disk into a temporary RAM buffer, and continues to monitor memory usage until the 95 percent threshold is met. If the resident item ratio is higher than 95 percent, it backs off until memory usage comes back down, and then resumes.
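A simplified sketch of the backfill decision just described. The thresholds, and exactly what each one is measured against, are deliberately schematic here and not Couchbase's actual implementation.

RESIDENT_RATIO_THRESHOLD = 0.95  # threshold described above; simplified

def should_bulk_load(items_in_ram: int, total_items: int) -> bool:
    """Bulk load the whole partition into a temporary RAM buffer if most of it is on disk."""
    return (items_in_ram / total_items) < RESIDENT_RATIO_THRESHOLD

def backfill_step(memory_used: int, memory_limit: int) -> str:
    """While bulk loading, back off when memory usage climbs too high, then resume."""
    if memory_used >= memory_limit:
        return "back off until memory usage comes back down"
    return "keep streaming from the temporary buffer"

print(should_bulk_load(items_in_ram=400_000, total_items=1_000_000))  # True: mostly on disk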

As the hashtable gets scanned, Couchbase Server begins streaming the keys and documents over the DCP connection to the replica. Anything that is already cached in RAM is sent very quickly; anything else is either taken from that temporary buffer space (if available) or read in a one-off fashion from disk. During this period, mutations can change documents in memory. These changes are tracked using checkpoints. The replication cursor will see that new checkpoint items have been added and will continually stream them to the destination. During replication, working set metadata is also replicated and maintained on the replica servers to reduce latencies after a rebalance or failover operation.

Read and write operations in Couchbase Server first go to the in-memory object-managed cache. If a document being read is not in the cache, it is fetched from disk. By default, Couchbase Server caches all metadata in memory to improve I/O performance for the entire application dataset. This is great for low-latency data access. However, not all workloads require low-latency access to the entire dataset; some might prefer to reserve memory for the "hotter" parts of the working set. For example, in a telemetry application, sensors gather and store massive numbers of data points from all devices, but queries touch only the last 24 hours of data. For such applications, where the working set is large and the resident ratio is less than 100%, tunable memory can be used to cache only the hot working set and reduce the overall memory footprint of the database.

Disk writes in Couchbase Server are asynchronous. This means that as soon as the document is placed into the hashtable, and the mutation is added to the corresponding checkpoint, a success response is sent back to the client. If a mutation already exists for the same document, de-duplication takes place and only the latest updated document is persisted to disk. In Couchbase Server 3.0, there are multiple worker threads in the shared thread pool so that multiple tasks can simultaneously read and write data from disk to maximize disk I/O throughput. For example, when a thread is assigned to run a task from an I/O queue and a second task is requested, another thread is assigned to pick up the second task. Threads and buckets are decoupled so that threads in the thread pool can run tasks for any bucket. As shown in Figure 7, separate internal I/O queues exist for low- and high-priority requests, and threads spend an appropriate fraction of time dispatching tasks from these queues, thus avoiding starvation.
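A toy illustration of the de-duplication behavior described above: if several mutations to the same document land in the same checkpoint before it is flushed, only the latest version needs to be written to disk. This is a schematic model, not the actual checkpoint implementation.

# Pending mutations for one checkpoint, keyed by document ID so that a
# later mutation to the same document replaces the earlier one.
pending = {}

def queue_mutation(doc_id: str, value: bytes) -> None:
    pending[doc_id] = value  # de-duplication: last write wins

def flush_to_disk() -> list:
    # Only the latest version of each document reaches the storage engine.
    batch = list(pending.items())
    pending.clear()
    return batch

queue_mutation("user::1", b'{"visits": 1}')
queue_mutation("user::1", b'{"visits": 2}')
print(len(flush_to_disk()))  # 1 write for 2 mutations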

Couchbase Server processes read operations by looking up the document ID in the corresponding active partition's hashtable. If the document is in the cache (the application's working set), the document body is read from the cache and returned to the client. If the document is not found in the cache, it is read from disk into the cache, and then sent to the client.

Disk fetch requests in Couchbase Server are batched. Documents are read from the underlying storage engine, and the corresponding entries in the hashtable are filled. If there is no more space in memory, Couchbase Server needs to free up space from the object-managed cache. This process is called ejection. With tunable memory in Couchbase Server 3.0, it's possible to configure Couchbase Server on a per bucket basis to eject only values (the default), or to eject the key and metadata along with the value (full ejection).


In value-only ejection, as shown in Figure 8, the application’s entire key space is maintained in the hashtable. When ejection happens, the key and metadata remain in memory while the value is flushed to disk. When an application requests a key and its value is not in memory, a request is made to the storage engine to read the value and populate the hashtable in memory before it is returned to the client.

In full ejection, the application's entire key space is not maintained in the hashtable. When ejection happens, the key and metadata along with the value are flushed to disk. As shown in Figure 9, when an application requests a key that is not in memory, a request is made to the storage engine to read the metadata and value and populate the hashtable in memory before the document is returned to the client.
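The difference between the two modes can be summarized in a small sketch of the read path. The storage-engine calls here (read_value, read_meta_value) follow the labels in Figures 8 and 9 and are illustrative, not a real API.

def get(key, hashtable, storage, full_ejection=False):
    """Schematic read path for value-only vs. full ejection."""
    item = hashtable.get(key)
    if item is None:
        if not full_ejection:
            # Value-only ejection keeps every key and its metadata resident,
            # so a missing hashtable entry means the document does not exist.
            return None
        # Full ejection: the key may still exist on disk, so fetch both the
        # metadata and the value from the storage engine and repopulate the cache.
        item = storage.read_meta_value(key)
        if item is None:
            return None
        hashtable[key] = item
    elif item.get("value") is None:
        # Value-only ejection: key and metadata are resident but the value was
        # ejected; read just the value back from the storage engine.
        item["value"] = storage.read_value(key)
    return item["value"]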

Figure 8: Value-only ejection in Couchbase Server. A GET for key "foo" finds the key and metadata in the hashtable item, and the batch reader fetches only the value from the storage engine (read_value).

Figure 9: Full ejection in Couchbase Server. A GET for key "foo" finds no resident entry, so the batch reader fetches both the metadata and the value from the storage engine (read_meta_value).

Couchbase Server tracks documents that are not recently accessed using the not-recently-used (NRU) metadata bit. Documents with the NRU bit set are ejected out of cache to the disk. Finally, after the disk fetch is complete, pending client connections are notified of the asynchronous I/O completion so that they can complete the read operation.

To keep memory usage per bucket under control, Couchbase Server periodically runs a background task called the item pager. The high and low watermarks for the item pager are determined by the bucket memory quota: the default high watermark is 85 percent of the bucket memory quota, and the low watermark is 75 percent. Both watermarks are configurable at runtime. If the high watermark is reached, the item pager scans the hashtable, ejecting items that are not recently used.
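A schematic sketch of the item pager's trigger condition using the default percentages above; the bucket quota is an assumed value and the ejection loop is simplified.

BUCKET_QUOTA_BYTES = 1024 * 1024 * 1024          # assumed 1 GiB bucket quota
HIGH_WATERMARK = int(BUCKET_QUOTA_BYTES * 0.85)  # default 85 percent
LOW_WATERMARK = int(BUCKET_QUOTA_BYTES * 0.75)   # default 75 percent

def run_item_pager(memory_used: int, items: list) -> int:
    """Eject not-recently-used values until usage falls back to the low watermark."""
    if memory_used < HIGH_WATERMARK:
        return memory_used  # nothing to do yet
    for item in items:
        if memory_used <= LOW_WATERMARK:
            break
        if item["nru"] and item["value"] is not None:
            memory_used -= len(item["value"])
            item["value"] = None  # value ejected; key and metadata stay (default mode)
    return memory_used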

The server also tracks read and write operations over the last 24 hours in an access log file. By default, a periodic daemon task is scheduled to run daily at 10AM UTC; it scans each partition hashtable to gather the list of currently resident items and writes those items to the access log file. If a server needs to be restarted, the access log file is used for server warmup, which is described next.

When a Couchbase Server cluster is started for the first time, it creates necessary database files and begins serving data immediately. However, if there is already data on disk (likely because a node rebooted or due to a service restart), the servers need to read data off of disk to fill up the cache. This process is called warmup.


Warmup consists of the following steps:

1. If value-only ejection mode is used:

a. For each partition, Couchbase Server loads the keys and metadata for all the documents into memory.

b. If there is still space left in memory, Couchbase Server uses the access log file to prioritize which document’s content should be loaded.

2. If full ejection mode is used:

a. If there is nothing in the access log to load, the warmup process is complete.

b. Otherwise, for each partition, Couchbase Server loads only the keys that are in the access log file.

Full ejection provides much faster system warmup. Warmup is a necessary step before Couchbase Server is able to serve requests, and can be monitored.

Couchbase Server supports document expiration using time to live (TTL). By default, all documents have a TTL of 0, which means the document is kept indefinitely. However, when adding, setting, or replacing a document, a custom TTL can be specified. Expiration of documents is lazy, meaning the server doesn't immediately delete items from disk. Instead, it runs a background task every 60 minutes (by default) called the expiry pager. The expiry pager tracks and removes expired documents from both the hashtable and the storage engine through a delete operation.
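As a hedged example of setting a TTL from an application, the sketch below again assumes the 2.x Couchbase Python SDK, whose mutation operations accept a ttl argument in seconds; the connection details are placeholders.

from couchbase.bucket import Bucket

bucket = Bucket("couchbase://127.0.0.1/default")  # placeholder connection

# TTL of 0 (the default) keeps the document indefinitely.
bucket.upsert("session::abc", {"user": "ada"})

# A custom TTL: this session document expires 30 minutes after the write.
# The expiry pager (or a later access) eventually removes it from the
# hashtable and the storage engine.
bucket.upsert("session::abc", {"user": "ada"}, ttl=30 * 60)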

Storage Engine

With Couchbase Server's append-only storage engine design, document mutations go only to the end of the data file. If a document is added, it is written at the end of the file; if it is updated or deleted, the change is likewise recorded at the end of the file. Files are always open in append mode, and there are never any in-place updates. This improves disk write performance because updates are written sequentially. Figure 10 illustrates how Couchbase Server stores documents.

Figure 10: Data file b-tree structure in Couchbase Server


Couchbase Server uses multiple files for storage: one data file per partition, plus multiple index files for active and replica indexes. A third type of file, the master file, holds metadata related to Couchbase Server views and is discussed later in the section on views.

As shown in Figure 10, Couchbase Server organizes data files as b-trees. The root nodes shown in red contain pointers to intermediate nodes. These intermediate nodes contain pointers to the leaf nodes shown in blue. The root and intermediate nodes also track the sizes of documents under their sub-tree. The leaf nodes store the document ID, document metadata and pointers to document content. For index files the root and intermediate nodes track index items under their sub-tree.

To prevent the data file from growing too large and eventually eating up all available disk space, Couchbase Server periodically cleans up stale data from the append-only storage. It tracks how much of each data file still holds live data relative to the file's current size on disk. If the proportion of stale data rises above a certain threshold (configurable through the admin UI), a process called compaction is triggered. Compaction scans through the current data file and copies non-stale data into a new data file. Since it is an online operation, Couchbase Server can still service requests. When the compaction process completes, Couchbase Server copies over any data modified since the start of compaction to the new data file. The old data file is then deleted, and the new data file is used until the next compaction cycle.
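A minimal sketch of the compaction trigger described above, assuming a fragmentation-style threshold of 30 percent purely for illustration; the real threshold is whatever is configured in the admin UI.

FRAGMENTATION_THRESHOLD = 0.30  # illustrative; configurable in the admin UI

def needs_compaction(live_data_bytes: int, file_size_bytes: int) -> bool:
    """True when enough of the append-only file is stale to justify rewriting it."""
    stale_fraction = 1.0 - (live_data_bytes / file_size_bytes)
    return stale_fraction > FRAGMENTATION_THRESHOLD

# A 4 GB data file holding 2.5 GB of live documents is ~37% stale, so compact.
print(needs_compaction(live_data_bytes=2_500_000_000, file_size_bytes=4_000_000_000))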

Cluster Manager

The cluster manager supervises server configuration and interaction between servers within a Couchbase Server cluster. It is a critical component that manages replication and rebalancing operations in Couchbase Server. Although the cluster manager executes locally on each cluster node, it elects a cluster-wide orchestrator node to oversee cluster conditions and carry out the appropriate cluster management functions.

If a machine in the cluster crashes or becomes unavailable, the cluster orchestrator, after waiting for the timeout interval, notifies all other machines in the cluster and promotes to active status all the replica partitions associated with the server that is down. The cluster map is updated on all the cluster nodes and the clients. This process of activating the replicas is known as failover. Failover can be configured to be automatic or manual. Additionally, it can be triggered by an external monitoring script via the REST API.

If the orchestrator node crashes, existing nodes will detect that it is no longer available and will elect a new orchestrator immediately so that the cluster continues to operate without disruption.

At a high level, there are three primary cluster manager components on each Couchbase Server node:

⋅ The heartbeat watchdog periodically communicates with the cluster orchestrator using the heartbeat protocol. It provides regular health updates of the server. If the orchestrator crashes, existing cluster server nodes will detect the failed orchestrator and elect a new orchestrator.

⋅ The process monitor monitors what the local data manager does, restarts failed processes as required, and contributes status information to the heartbeat process.

⋅ The configuration manager receives, processes, and monitors a node’s local configuration. It controls the cluster map and active replication streams. When the cluster starts, the configuration manager pulls configuration of other cluster nodes and updates its local copy.

Along with handling metric aggregation and consensus functions for the cluster, the cluster manager also provides a REST API for cluster management and a rich web admin UI. As of Couchbase Server 3.0, you can connect to the admin UI over either HTTP or HTTPS.

Figure 11: Couchbase Server admin UI


As shown in Figure 11, the admin UI provides a centralized view of cluster metrics across the cluster. It's possible to drill down on the metrics to see how well a particular server is functioning or whether there are any areas that need attention. The cluster manager is built on Erlang/OTP, a proven environment for developing and operating fault-tolerant distributed systems.

In addition to per-node operations that are always executing at each node in a Couchbase Server cluster, certain components are active on only one node at a time, including:

⋅ The rebalance orchestrator. When the number of servers in the cluster changes due to scaling out or failures, data partitions must be redistributed. This ensures that data is evenly distributed throughout the cluster, and that application access to the data is load balanced evenly across all the servers. This process is called rebalancing. It’s possible to trigger rebalancing using an explicit action from the admin UI or through a REST call. When a rebalance begins, the rebalance orchestrator calculates a new cluster map based on the current pending set of servers to be added and removed from the cluster. It streams the cluster map to all the servers in the cluster. During rebalance, the cluster moves data via DCP directly between two server nodes in the cluster. As the cluster moves each partition from one location to another, an atomic and consistent switchover takes place between the two nodes, and the cluster updates each connected client library with a current cluster map. Throughout migration and redistribution of partitions among servers, any given partition on a server will be in one of the following states:

⋅ Active: The server hosting the partition is servicing all requests for this partition.

⋅ Replica: The server hosting the partition cannot handle client requests, but it will receive replication commands. Rebalance marks destination partitions as replica until they are ready to be switched to active.

⋅ Dead: The server is not in any way responsible for this partition.

⋅ The node health monitor receives heartbeat updates from individual nodes in the cluster, updating configuration and raising alerts as required.

⋅ The partition state and replication manager is responsible for establishing and monitoring the current network of replication streams.

Query Engine

With Couchbase Server, it's possible to easily index and query JSON documents. As shown in Figure 12 below, view indexes are defined using design documents and views. Each design document can have multiple views, and each Couchbase Server bucket can have multiple design documents. Design documents allow the user to group views together. Views within a design document are processed in parallel, and different design documents can be processed in parallel at different times.

Figure 12: Design documents and views in Couchbase Server. Indexers are allocated per design document, all views within a design document are updated at the same time, and views can only access data in the bucket namespace.

Views in Couchbase Server are defined in JavaScript using a map function, which pulls out data from the documents, and an optional reduce function, which aggregates the data emitted by the map function. The map function defines what attributes to build the index on.
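For reference, a design document is itself a JSON document that groups view definitions. The sketch below builds one as a Python dictionary; the design document and view names are made up, and the map function mirrors the name-based view shown later in Figure 15.

import json

# Hypothetical design document "users" containing a single view "by_name".
design_doc = {
    "views": {
        "by_name": {
            # Map function (JavaScript source stored as a string): emits the
            # name attribute as the index key. A "reduce" entry is optional.
            "map": "function (doc, meta) { if (doc.name) emit(doc.name, null); }"
        }
    }
}

print(json.dumps(design_doc, indent=2))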


Indexes in Couchbase Server are distributed; each server indexes the data it contains. As shown in Figure 13, indexes are created on active documents and optionally on replica documents.

Figure 13: Distributed indexing and querying in Couchbase Server


Applications can query Couchbase Server using any of the Couchbase SDKs, even during a rebalance. Queries are sent via port 8092 to a randomly selected server within the cluster. The server that receives the query sends the request to the other servers in the cluster and aggregates their results while streaming the query result back to the client.
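As a hedged illustration of querying a view over the port-8092 REST interface, the sketch below reuses the hypothetical "users"/"by_name" names from the earlier design-document example; the host and bucket are placeholders, and only Python's standard library is used.

import json
import urllib.parse
import urllib.request

host, bucket = "127.0.0.1", "default"  # placeholders: one cluster node and a bucket

params = urllib.parse.urlencode({
    "startkey": json.dumps("A"),  # keys in view queries are JSON-encoded
    "endkey": json.dumps("Z"),
    "stale": "false",             # force an index update before answering
})

url = "http://%s:8092/%s/_design/users/_view/by_name?%s" % (host, bucket, params)

with urllib.request.urlopen(url) as resp:
    result = json.loads(resp.read())

for row in result.get("rows", []):
    print(row["id"], row["key"])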

During initial view building, Couchbase Server reads the partition data files on every server and applies the map function across every document. Before writing the index file, the reduce function is applied to the b-tree nodes; if the node is a non-root node, the output of the reduce function is a partial reduce value of its sub-tree. If the node is a root node, the reduce value is a full reduce value over the entire b-tree.

As shown in Figure 14, the map function is applied to every document, and the resulting key-value pairs are stored in the index.

Figure 14: Index file b-tree structure in Couchbase Server. Leaf nodes store [key, document ID, value] entries, while intermediate and root nodes store reduce values computed over their key ranges.


As with primary indexes, it is possible to emit one or more attributes of the document as the key and query for an exact match or a range. For example, as shown in Figure 15, a view index is defined using the name attribute.

Views in Couchbase Server are built asynchronously and are hence eventually indexed. In Couchbase Server 3.0, the view engine makes use of streaming DCP to consume changes made to documents in memory and stream them to the indexing engine at RAM speeds, without first writing them to disk. This provides faster view consistency and fresher data. In Couchbase Server 3.0, the view engine has also been improved by rewriting the index update components in C. By default, queries on these views are eventually consistent with respect to document updates. Indexes are created by Couchbase Server based on the view definition, but updating of these indexes can be controlled at the point of data querying, rather than each time data is inserted. The stale query parameter can be used to control when an index gets updated.

Indexing in Couchbase Server is incremental. If a view is accessed for the first time, the map and reduce functions are applied to all the documents within the bucket. Each subsequent access to the view processes only the documents that have been added, updated, or deleted since the last time the view index was updated. If a node failure happens, the view engine can roll back to a consistent state based on DCP sequence numbers without incurring a full index rebuild. By default, every 5 seconds the view engine checks whether there have been 5000 document mutations, and if so triggers a view update operation.

For each changed document, Couchbase Server invokes the corresponding map function and stores the map output in the index. If a reduce function is defined, the reduce result is also calculated and stored. Couchbase Server maintains a back index that maps IDs of documents to a list of keys the view has produced. If a document is deleted or updated, the key emitted by the map output is removed from the view and the back index. Additionally, if a document is updated, new keys emitted by the map output are added to the back index and the view.

Because of the incremental nature of the view update process, information is only appended to the index stored on disk. This helps ensure that the index is updated efficiently. Compaction will optimize the index size on disk and the structure.

During a rebalance operation, partitions change state – the original partition transitions from active to dead, and the new one transitions from replica to active. Since the view index on each node is synchronized with the partitions on the same server, when the partition has migrated to a different server, the documents that belong to a migrated partition should not be used in the view result anymore. To keep track of the relevant partitions that can be used in the view index, Couchbase Server has a modified indexing data structure called the B-Superstar index.

Figure 15: Defining a view index with the name attribute. The map function below emits the document's name as the index key with a null value.

function (doc, meta) {
  if (doc.name)
    emit(doc.name, null);
}


In the B-Superstar index, each non-leaf b-tree node has a bitmask of 1024 bits, with each bit corresponding to a partition. As shown in Figure 16, at query time, when the query engine traverses a b-tree, a key-value pair is not factored into the query result if the bit for its partition ID is not set in the bitmask. To maintain consistency for indexes during rebalance or failover, partitions can be excluded by a single bit change in the bitmask. The server updates the bitmask every time the cluster manager tells the query engine that partitions have changed to active or dead states.
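A small sketch of the bitmask check, representing the 1024-bit mask as a Python integer; this illustrates the filtering idea rather than the actual on-disk structure.

NUM_VBUCKETS = 1024

def set_partition_valid(bitmask: int, vbucket_id: int) -> int:
    """Mark a partition's results as usable by setting its bit."""
    return bitmask | (1 << vbucket_id)

def set_partition_dead(bitmask: int, vbucket_id: int) -> int:
    """Exclude a partition (e.g. after rebalance or failover) with a single bit change."""
    return bitmask & ~(1 << vbucket_id)

def include_row(bitmask: int, row_vbucket_id: int) -> bool:
    """A key-value pair counts toward the query result only if its partition bit is set."""
    return bool(bitmask & (1 << row_vbucket_id))

mask = 0
for vb in range(NUM_VBUCKETS):
    mask = set_partition_valid(mask, vb)
mask = set_partition_dead(mask, 42)                  # partition 42 migrated away
print(include_row(mask, 42), include_row(mask, 7))   # False True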

Rack-Zone Awareness

To ensure enterprise-class availability and reliability, active data and its replicas should be stored on different server racks. Rack-zone awareness in Couchbase Server allows logical grouping of the servers in a cluster, where each server group physically belongs to a rack or availability zone. This intelligent data replication ensures that data remains available despite disruptions such as power outages or switch or rack failures.

For example, as shown in Figure 17, there are 3 datacenters each with 2 physical racks. There is an equal distribution of servers within each server group and the configuration is balanced. Within the datacenter, even if a rack fails, rack-zone awareness in Couchbase Server ensures that data will still be available to the application. Across datacenters, cross datacenter replication (XDCR) provides high availability for the Couchbase Server cluster, and is discussed next.

Figure 16: A distributed range query touches only a subset of partitions. In the example, a range query from start key "c" to end key "h" prunes b-tree nodes using the per-node partition bitmask, so only nodes whose bitmasks intersect the valid partitions are visited and reduced.

Figure 17: Rack-zone awareness in Couchbase Server



Cross Datacenter Replication

Cross datacenter replication (XDCR) provides an easy way to replicate active data to multiple, geographically diverse datacenters, either for disaster recovery or to bring data closer to users.

Figure 18 shows how XDCR and intra-cluster replication occur simultaneously. Intra-cluster replication takes place within the clusters at both Datacenter 1 and Datacenter 2, while at the same time XDCR replicates documents across datacenters. On each node, XDCR uses DCP to stream in-memory document mutations to the destination cluster. Replica documents are sent to the Couchbase Server object-managed cache for low-latency read/write operations on the destination cluster.

XDCR can be set up on a per bucket basis. Depending on the application requirements, the developer might want to replicate only a subset of the data in Couchbase Server between two clusters. With XDCR, it is possible to selectively pick which buckets to replicate between two clusters, in either a unidirectional or bidirectional fashion. As shown in Figure 19, bidirectional XDCR is set up between bucket A (cluster 1) and bucket D (cluster 2). There is unidirectional XDCR between bucket B (cluster 1) and bucket C (cluster 2), and from bucket E (cluster 2) to bucket C (cluster 1).

Figure 18: Cross datacenter replication architecture in Couchbase Server

Figure 19: Replicating selective buckets between two Couchbase Server clusters


XDCR supports continuous replication of data via DCP. There are multiple DCP streams per XDCR connection. These streams are spread across the partitions and they move data in parallel from the source cluster to the destination cluster.

XDCR is cluster topology aware. The source and destination clusters can have a different number of servers. If a server in the destination cluster goes down, XDCR obtains the updated cluster topology information and continues replicating data to the available servers in the destination cluster.

XDCR is push-based. The source cluster uses DCP sequence numbers to keep track of what data the destination cluster last received. If the replication process is interrupted, for example due to a server crash or an intermittent network connection failure, replication does not have to be restarted from the beginning. Instead, once the replication link is restored, replication can continue from the last sequence number seen by the destination cluster.


By default, XDCR in Couchbase Server is designed to optimize bandwidth. This includes optimizations like mutation de-duplication as well as checking the destination cluster to see whether a mutation needs to be shipped across. If the destination cluster has already seen the latest version of the data mutation, the source cluster does not send it. However, in some use cases with active-active XDCR, it may be preferable to skip checking whether the destination cluster has already seen the latest version of the data mutation and to optimistically replicate all mutations. Couchbase Server supports optimistic XDCR replication: for documents with a size within the XDCR size threshold (which is configurable), no pre-check is performed and the mutation is replicated to the destination cluster.
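A schematic sketch of that decision, under the assumption of a configurable size threshold; the threshold value and the metadata pre-check are simplified placeholders for the behavior described above.

OPTIMISTIC_THRESHOLD_BYTES = 256  # illustrative; the real threshold is configurable

def replicate_mutation(doc_size, destination_has_latest):
    """Decide whether to pre-check with the destination or replicate optimistically."""
    if doc_size <= OPTIMISTIC_THRESHOLD_BYTES:
        # Small document: skip the round trip and ship the mutation immediately.
        return "replicate optimistically"
    if destination_has_latest():
        # The destination already has this revision; save the bandwidth.
        return "skip"
    return "replicate after pre-check"

print(replicate_mutation(128, destination_has_latest=lambda: False))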

Secure XDCR provides on-the-wire encryption for XDCR. In the absence of a virtual private network (VPN), data sent between networks is typically sent over a wide area network (WAN). Data moving across a WAN is vulnerable unless it is encrypted. With Secure XDCR, introduced in Couchbase Server 2.5, data can be transmitted using SSL encryption between data centers.

Within a cluster, Couchbase Server provides strong consistency at the document level. On the other hand, XDCR provides eventual consistency across clusters. Built-in conflict resolution will pick the same “winner” on both clusters if the same document was mutated on both clusters before it was replicated across. If a conflict occurs, the document with the most updates will be considered the “winner.” If both the source and the destination clusters have the same number of updates for a document, additional metadata such as numerical sequence, CAS value, document flags and expiration TTL value are used to pick the “winner.” XDCR applies the same rule across clusters to make sure document consistency is maintained.
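A hedged sketch of that resolution rule: compare the number of updates first, then fall back to the additional metadata listed above. The field names are illustrative, and the exact tie-break order within the metadata is simplified.

def pick_winner(doc_a: dict, doc_b: dict) -> dict:
    """Return the revision both clusters would agree on, per the rule above."""
    # Primary rule: the document that has seen more updates wins.
    # Tie-breakers: sequence number, CAS value, document flags, expiration TTL.
    def rank(doc):
        return (doc["rev"], doc["seqno"], doc["cas"], doc["flags"], doc["ttl"])
    return doc_a if rank(doc_a) >= rank(doc_b) else doc_b

dc1 = {"rev": 3, "seqno": 10, "cas": 111, "flags": 0, "ttl": 0, "body": {"v": 1}}
dc2 = {"rev": 5, "seqno": 12, "cas": 222, "flags": 0, "ttl": 0, "body": {"v": 2}}
print(pick_winner(dc1, dc2)["body"])  # dc2 wins: it has seen more updates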

Figure 20 shows an example of how conflict detection and resolution works in Couchbase Server. Bidirectional replication is set up between Datacenter 1 and Datacenter 2, and both clusters start off with the same JSON document (Doc 1). Two additional updates to Doc 1 then happen on Datacenter 2. In the case of a conflict, Doc 1 on Datacenter 2 is chosen as the winner because it has seen more updates.

Figure 20: Conflict detection strategy in Couchbase Server



Conclusion

While this paper describes the architecture of Couchbase Server, the best way to get to know the technology is to download and use it. Couchbase Server is engineered to be a general purpose distributed database, providing an integrated cache, key-value store, and document database to support a broad and growing set of use cases. Couchbase Server supports partial and full working data sets in memory and large data sets on persistent storage. It supports operational data and machine data. Couchbase Server is deployed in data centers and on public cloud infrastructure. It's an operational database for interactive and enterprise applications. Couchbase Mobile extends the operational database to mobile phones, tablets, and other connected devices. To download Couchbase Server, visit http://www.couchbase.com/downloads.

About Couchbase

Couchbase provides the world's most complete, most scalable and best performing NoSQL database. Couchbase Server is designed from a simple yet bold vision: build the first and best general-purpose NoSQL database. That goal has resulted in an industry-leading solution that includes a shared-nothing architecture, a single node type, a built-in caching layer, true auto-sharding, and the world's first NoSQL mobile offering: Couchbase Mobile, a complete NoSQL mobile solution comprised of Couchbase Server, Couchbase Sync Gateway, and Couchbase Lite. Couchbase Server and all Couchbase Mobile products are open source projects. Couchbase counts many of the world's biggest brands as its customers, including Amadeus, Bally's, Beats Music, Cisco, Comcast, Concur, Disney, eBay / PayPal, Neiman Marcus, Orbitz, Rakuten / Viber, Sky, Tencent and Verizon, as well as hundreds of other household names worldwide. Couchbase is headquartered in Silicon Valley, and has raised $115 million in funding from Accel Partners, Adams Street Partners, Ignition Partners, Mayfield Fund, North Bridge Venture Partners and WestSummit. www.couchbase.com

2440 West El Camino Real | Ste 101
Mountain View, California 94040
1-650-417-7500
www.couchbase.com