understanding views in couchbase server – couchbase live new york 2015

Understanding Views in Couchbase Server

Perry Krug | Solutions Engineering Manager, Couchbase

©2015 Couchbase Inc.

Agenda

Introduction - Ways to Query

Views in Couchbase

Database Design Considerations for Views

How Views Work

Configuration Settings and their Effects

Resource Requirements


Introduction


Introduction – Data Access with Couchbase Server

Couchbase provides multiple forms of data access and indexing

Key-value/document access:

– Pro: operations are extremely and predictably fast

• Data is cached in memcached-like layer

• Reads and writes are evenly distributed across all nodes through consistent hashing

– Con: Limited application flexibility

• More complexity required for maintaining lists and retrieving data by non-primary key

• No dynamic querying
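The even distribution of key-value traffic can be sketched as follows. This is a simplified illustration only: the CRC32-modulo hash, the partition count of 1024, the node names, and the partition map are all assumptions made for the example, and real client libraries take the map from the cluster and use their own hashing details.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 1024                      # assumed partition count
NODES = ["node1", "node2", "node3"]        # hypothetical node names

# Hypothetical partition map: partition id -> node that owns it.
PARTITION_MAP = {p: NODES[p % len(NODES)] for p in range(NUM_PARTITIONS)}

def node_for_key(key: str) -> str:
    """Hash the document ID to a partition, then look up the owning node."""
    partition = zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS
    return PARTITION_MAP[partition]

# Every client computes the same owner for a given key, without a lookup service.
spread = Counter(node_for_key(f"user::{i}") for i in range(30000))
```

Because the key alone determines the node, a lookup by document ID is a single hop. Finding documents by any other attribute needs one of the index types described next.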




Global Secondary Indexes (as of 4.0)

– Indexes are partitioned independently from data

– Pro: Much faster index scan

• All (or most) of index is in one place

• Built-in managed cache

– Con: More resources needed to keep up-to-date

• All mutations are flowing to one point for a given index

• Faster disk and CPU needed in general




Views in Couchbase are “local indexes”:

– Indexes are partitioned alongside data

– Pro: Incremental map-reduce keeps indexes updated incrementally

• Append-only B-Tree format

• Each node only processes its own changes, so the whole index is updated in parallel

– Con: Querying has limited performance and scalability due to “scatter-gather”

• Latencies of 50-100ms

• Throughput up to 4k queries/sec

• More than 20 nodes not ideal



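[Figure: share of application workload by access method. Recoverable labels: Key/Value Access (Transactional Workloads), Views (Analytics), N1QL/GSI, Patch Management, Many others. Percentages shown: 90%, 75%, 10%.]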

Understand the “performance-critical” paths of your application


Couchbase Server Views


Indexing and Querying via Views

Multiple roles

– Primary Index: All document IDs

– Secondary Index: Sub-document IDs and/or compound keys

– View: “Materialized” view on data (transformation)

– Aggregation/Reduction

– Multi-dimensional/Spatial

Very efficient

– All values pre-computed

– Reductions computed throughout tree

– Each node processes subset of dataset
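To illustrate why reductions stored throughout the tree make aggregate queries cheap, here is a toy two-level index. It assumes a sum reduction and fixed-size leaf chunks; the class and method names are hypothetical, and the real on-disk B-tree is considerably more involved.

```python
class ReducedIndex:
    """Sorted rows split into leaf chunks; each chunk stores its pre-computed sum."""

    def __init__(self, rows, chunk_size=4):
        rows = sorted(rows)                                  # (key, value) pairs
        self.chunks = [rows[i:i + chunk_size]
                       for i in range(0, len(rows), chunk_size)]
        self.sums = [sum(v for _, v in chunk) for chunk in self.chunks]

    def range_sum(self, start, end):
        """Sum values with start <= key <= end, reusing whole-chunk sums."""
        total, rows_read = 0, 0
        for chunk, chunk_sum in zip(self.chunks, self.sums):
            first, last = chunk[0][0], chunk[-1][0]
            if last < start or first > end:
                continue                                     # chunk entirely outside
            if start <= first and last <= end:
                total += chunk_sum                           # use the stored reduction
            else:
                for key, value in chunk:                     # boundary chunk: scan rows
                    if start <= key <= end:
                        total += value
                        rows_read += 1
        return total, rows_read
```

Only the chunks at the edges of the range are scanned row by row; everything in between is answered from the stored sums.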



Use Cases:

Best for “low-volume” querying, i.e. semi-offline or not directly user facing

Real-time Analytics:

– Incremental map-reduce keeps very large indexes updated efficiently

Spatial Views for Geospatial/Multidimensional indexing:

– True R-tree == More accurate results than other implementations

– Bounding Box queries (Polygon / n-way on the roadmap)

– Fully supported as of 4.0 (previously experimental)



Exact match query

Range query

With/without reduction

With/without grouping
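These query shapes correspond to parameters of the view REST API (`key`, `startkey`, `endkey`, `reduce`, `group`). A minimal sketch of building such query URLs follows. The host, bucket, design document, and view names are placeholders, and `view_url` is a hypothetical helper, not part of any SDK.

```python
import json
from urllib.parse import urlencode

def view_url(host, bucket, ddoc, view, **params):
    """Build a view query URL; key-like parameters are JSON-encoded."""
    json_params = {"key", "startkey", "endkey"}
    query = {}
    for name, value in params.items():
        if name in json_params:
            query[name] = json.dumps(value)        # "abc" becomes "\"abc\""
        elif isinstance(value, bool):
            query[name] = "true" if value else "false"
        else:
            query[name] = value
    path = f"/{bucket}/_design/{ddoc}/_view/{view}"
    return f"http://{host}:8092{path}?{urlencode(query)}"

# Placeholder names throughout.
exact = view_url("localhost", "beers", "stats", "by_country",
                 key="Belgium", reduce=False)
ranged = view_url("localhost", "beers", "stats", "by_country",
                  startkey="A", endkey="C", reduce=False)
grouped = view_url("localhost", "beers", "stats", "by_country",
                   reduce=True, group=True)
```

In practice an SDK would build these requests; the sketch only shows which parameter expresses which query shape.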


Staleness/Consistency/Updating

Stale = OK

– Give me whatever you’ve got!

– Might not include most recent changes

– Fastest and most predictable

Stale = false

– Make my query consistent!

– Include all latest changes

– Will be slower and variable depending on how many changes are waiting

Stale = update_after

– Give me whatever you’ve got!

– But also kick off an update in the background

Couchbase updates view indexes in the background periodically
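The three staleness options can be compared with a toy model. This is a sketch under stated assumptions: the class is hypothetical, the "index" is a plain dictionary, and the update that a real server runs in the background is performed inline here.

```python
class ToyViewIndex:
    """In-memory stand-in for a view index with not-yet-indexed changes."""

    def __init__(self):
        self.indexed = {}      # changes already reflected in the index
        self.pending = {}      # mutations the indexer has not processed yet

    def write(self, doc_id, value):
        self.pending[doc_id] = value

    def update(self):
        """Fold pending mutations into the index (the indexer's job)."""
        self.indexed.update(self.pending)
        self.pending.clear()

    def query(self, stale="update_after"):
        if stale == "false":
            self.update()                  # catch up first, then answer
            return dict(self.indexed)
        result = dict(self.indexed)        # "ok" and "update_after" answer at once
        if stale == "update_after":
            self.update()                  # a real server does this in the background
        return result
```

With `stale="false"` the cost of the query depends on how many mutations are pending, which is why its latency is variable.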


How Views Work


Views over the years…

Couchbase Server 1.8: No Views

Couchbase Server 2.0:

– Introduction of views

– View processing only happens after mutation is persisted to disk

– Stale=false didn’t account for changes in RAM

Couchbase Server 3.0:

– View processing rewritten in C++

– View processing happens from RAM

– Drastic improvement in latency for stale=false, which is now strongly consistent

– Geo-spatial views experimental

Couchbase Server 4.0:

– Geo-spatial views fully supported

– Introduction of GSI



Organized in Design Documents

– One or more DD per bucket

– One or more views per DD

Incremental Map-Reduce

– Spread load across nodes

– Each node indexes its own data

Map: process, filter, map and emit a row

Reduce: aggregate mapped data. Built in: _count, _sum, _stats
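Actual view functions are written in JavaScript; the Python sketch below only mimics the flow. A map function emits rows, the rows are ordered by key, and a reduce aggregates the values. The document shape and all function names are hypothetical, and the reduce helpers merely imitate what `_count`, `_sum`, and `_stats` compute.

```python
def map_beer(doc, emit):
    """Emit one row per beer document: key = country, value = abv."""
    if doc.get("type") == "beer" and "abv" in doc and "country" in doc:
        emit(doc["country"], doc["abv"])

def run_map(docs, map_fn):
    rows = []
    for doc in docs:
        map_fn(doc, lambda key, value: rows.append((key, value)))
    return sorted(rows)                      # view rows are ordered by key

def reduce_count(values):
    return len(values)

def reduce_sum(values):
    return sum(values)

def reduce_stats(values):
    return {"sum": sum(values), "count": len(values),
            "min": min(values), "max": max(values),
            "sumsqr": sum(v * v for v in values)}

docs = [
    {"type": "beer", "country": "BE", "abv": 8},
    {"type": "beer", "country": "US", "abv": 5},
    {"type": "brewery", "country": "BE"},    # filtered out by the map function
    {"type": "beer", "country": "BE", "abv": 6},
]
rows = run_map(docs, map_beer)
values = [value for _, value in rows]
```

Incremental map-reduce means only changed documents are re-mapped and only the affected reductions are recomputed; this sketch recomputes everything for brevity.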


View Processing after Write


[Diagram: the application server writes DOC 1 to the managed cache; from there it is passed to the disk queue (to disk), the replication queue (to other node), and the view engine.]

Single-node type means easier administration and scaling

Writes are async by default

Index changes processed from RAM, stored to disk

Each node only processes an index for its own data


Indexing and Querying


[Diagram: three nodes (Couchbase Server 1, 2, 3), each holding active and replica shards; shards 1–9 are spread across the nodes.]

Application has single logical connection to cluster (client object)

Indexing work is distributed amongst nodes in parallel

Each node has index for data stored on it

Optional replica indexes

Client library round-robins queries

Any node receives query, “scatters” to other nodes and “gathers” response

Application sees a single logical database
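The scatter-gather step can be sketched as a merge of sorted partial results. The node names, index contents, and function names below are hypothetical, and real nodes exchange results over the network rather than through a shared dictionary.

```python
import heapq

# Hypothetical per-node index partitions, each already sorted by key.
NODE_INDEXES = {
    "server1": [("apple", "doc3"), ("mango", "doc7")],
    "server2": [("banana", "doc1"), ("pear", "doc9")],
    "server3": [("cherry", "doc4"), ("kiwi", "doc2")],
}

def scan_local(node, startkey, endkey):
    """What each node does: range-scan only its own index."""
    return [row for row in NODE_INDEXES[node] if startkey <= row[0] <= endkey]

def scatter_gather(startkey, endkey, limit=None):
    """What the receiving node does: ask every node, then merge the sorted parts."""
    partials = [scan_local(node, startkey, endkey) for node in NODE_INDEXES]  # scatter
    merged = list(heapq.merge(*partials))                                     # gather
    return merged[:limit] if limit is not None else merged
```

Every query involves every node, so the slowest node sets the response time. That is the reason query latency and throughput do not improve as nodes are added.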


Design Considerations


Best Practices - Selection, Projection, Aggregation

Try to avoid computing too many things in a View

Check for attribute existence

Pre-Filter data to avoid unnecessary entries in the View

– Use document types to make Views more selective

Project (map) only necessary data by emitting it as part of the value

– Do not emit the full document

– Back-reference via the original document id (SDKs have built-in functionality)

Use the built-in reduce functions if possible

Leverage default “staleness” (update_after) instead of stale=false
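The selection and projection advice can be illustrated by comparing two map functions. Real map functions are JavaScript; this Python sketch uses hypothetical documents and helper names, and measures index size simply as the length of the serialized rows.

```python
import json

def map_full(doc, emit):
    """Anti-pattern: no type filter, no existence check, whole document as value."""
    emit(doc.get("name"), doc)

def map_selective(doc, emit):
    """Pre-filter by type, check attributes exist, project only what is needed."""
    if doc.get("type") != "user":
        return
    if "name" not in doc or "email" not in doc:
        return
    emit(doc["name"], doc["email"])

def build_index(docs, map_fn):
    rows = []
    for doc_id, doc in docs.items():
        def emit(key, value, _id=doc_id):
            # Each row keeps the document ID as a back-reference.
            rows.append({"id": _id, "key": key, "value": value})
        map_fn(doc, emit)
    return rows

docs = {
    "user::1": {"type": "user", "name": "ann",
                "email": "ann@example.com", "bio": "x" * 500},
    "user::2": {"type": "user", "name": "bob"},      # no email: skipped
    "order::1": {"type": "order", "total": 42},      # wrong type: skipped
}
full_size = len(json.dumps(build_index(docs, map_full)))
selective_size = len(json.dumps(build_index(docs, map_selective)))
```

The selective index holds one small row instead of three rows carrying whole documents; anything else the application needs can be fetched by document ID.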




Number of Design Documents per Bucket

Indexers are allocated per Design Document

Bad cases

– One Design Document contains all Views

All Views are updated at the same time

A lot to do for the Indexer

– One View per Design Document

Resource intensive because one Indexer per View

Good: find a balance between the two!
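As an illustration of grouping, a design document is a JSON object whose `views` field holds named views, each with a `map` function (JavaScript source stored as a string) and an optional `reduce`. The view names, document fields, and the design document name below are hypothetical.

```python
import json

# Hypothetical design document grouping two related views over "user" documents.
user_views = {
    "views": {
        "by_email": {
            "map": ("function (doc, meta) {"
                    " if (doc.type == 'user' && doc.email) {"
                    " emit(doc.email, null); } }"),
        },
        "count_by_country": {
            "map": ("function (doc, meta) {"
                    " if (doc.type == 'user' && doc.country) {"
                    " emit(doc.country, null); } }"),
            "reduce": "_count",
        },
    }
}

# JSON body to store as a design document, e.g. one named "users" (name assumed).
body = json.dumps(user_views, indent=2)
```

Views that are queried together and read the same documents are reasonable candidates for one design document; unrelated or very heavy views may be better kept apart.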


Separated Buckets for Indexing / Querying

Creating a View on one large Bucket can be heavyweight

– Extra processing of data that doesn’t need to be indexed

– Slower overall indexing of data that does

Separate data to be indexed / queried

– Short-lived data

– High write rate (if it doesn’t need to be indexed)

– Binary data separate from metadata (JSON or other)

Don’t create too many Buckets!


XDCR – Separated Cluster for Indexing

Separate the load

– Reporting cluster vs. operational one

– Different hardware possible

– Active-Passive XDCR


Configuration Settings and their Effects


Indexing Settings

Index Path

– Separated disks for data and indexes

– Improve I/O performance


Indexing Settings

Indexing Interval

– Controls how up-to-date the index is by default

– ‘stale = false’ as explained before


Indexing Settings

Max. number of indexers working in parallel

– Increase the number of threads per node

– Higher level of concurrency

– Higher disk and CPU load


Rebalance Settings

Index-aware rebalance

– Indexing by default as part of rebalancing

– Ensures that you get query results from a new node during rebalance that are consistent with the query results you would have received from the node before rebalance started

– Performance impact if enabled, so rebalance takes significantly more time


Rebalance Settings

Rebalance before compaction

– Default is 16, so 16 vBuckets are moved before rebalance is paused for compaction

– Higher value may increase rebalance performance

– Implicitly increases rebalance priority


Rebalance Settings

Rebalance moves per node

– Default is 1

– Number of vBuckets moved at a time during the rebalance operation


Compaction Settings

(Auto) Compaction

– Append only storage engine

– In-place updates are expensive

– Removes tombstone objects and fragmentation

Process Data and View compaction in parallel

– Implies a heavier processing and disk I/O load during compaction process
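Why an append-only engine needs compaction can be shown with a toy store. The class is hypothetical, and fragmentation is measured here as the share of stale records, which is a simplification of the file-size-based measure a real server would use.

```python
class AppendOnlyStore:
    """Toy append-only file: updates and deletes append records, never overwrite."""

    def __init__(self):
        self.log = []                         # (key, value); value None = tombstone

    def set(self, key, value):
        self.log.append((key, value))

    def delete(self, key):
        self.log.append((key, None))          # tombstone marks the key as deleted

    def live(self):
        latest = {}
        for key, value in self.log:           # later records win
            latest[key] = value
        return {k: v for k, v in latest.items() if v is not None}

    def fragmentation(self):
        """Fraction of records that are stale versions or tombstones."""
        return 1 - len(self.live()) / len(self.log) if self.log else 0.0

    def compact(self):
        """Rewrite only live records into a fresh log, dropping stale data."""
        self.log = list(self.live().items())
```

Compaction reads and rewrites the live data, which is where the extra processing and disk I/O load comes from.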





Resource Requirements

More CPU cores are recommended

Configure your OS File System Buffer!

Use SSDs for Views!

Factors that drive CPU and disk (size, I/O) requirements:

– Number of Views per Design Document

– Number of emitted items

– Compaction

– Complexity of Map/Reduce functions

– Size of the emitted value

[Chart: query latency (ms) and query throughput (q/s)]

Thank you
