CouchConf Israel 2013: Couchbase Server in Production
TRANSCRIPT
1
Couchbase Server 2.0 in Production
Perry Krug
Sr. Solutions Architect
2
Typical Couchbase production environment
Application users
Load Balancer
Application Servers
Servers
3
We’ll focus on App-Couchbase interaction …
Application users
Load Balancer
Application Servers
Servers
4
… at each step of the application lifecycle
Dev/Test Size Deploy Monitor Manage
5
KEY CONCEPTS
6
Couchbase Single Node Architecture
Replication, Rebalance, Shard State Manager
REST management API/Web UI
8091 Admin Console
Erlang/OTP
11210 / 11211 Data access ports
Object-managed Cache
Storage Engine
8092 Query API
Query Engine
HTTP
Data Manager, Cluster Manager
11
Couchbase Single Node
[Diagram: App Server and Couchbase Server Node, with Doc 1 shown at each stage. Labels: Managed Cache, Disk Queue, Disk, Replication Queue (to other node), XDCR Queue (to other cluster), View engine.]
12
Couchbase deployment
[Diagram: Web Applications, each with a Couchbase Client Library, and four Couchbase Server nodes. Arrows: Data Flow, Cluster Management, Replication Flow.]
13
Couchbase in a Cluster
User Configured Replica Count = 1
[Diagram: COUCHBASE SERVER CLUSTER of SERVER 1, SERVER 2 and SERVER 3, each holding ACTIVE and REPLICA documents (Doc 1 to Doc 9). APP SERVER 1 and APP SERVER 2 each run the COUCHBASE Client Library with a CLUSTER MAP. Arrows: READ/WRITE/UPDATE, Query.]
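The role of the CLUSTER MAP held by each client library can be sketched roughly as below. This is a toy model, not the Couchbase client: the shard count, the use of CRC32 as the hash, the round-robin placement, and the names `ClusterMap`, `shard_for` and `server_for` are all assumptions made for illustration.

```python
import zlib

NUM_SHARDS = 8  # assumption: a small shard count chosen only for readability


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a document key to a shard with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


class ClusterMap:
    """Hypothetical cluster map: the active and replica server for each shard."""

    def __init__(self, servers, num_shards=NUM_SHARDS):
        if len(servers) < 2:
            raise ValueError("one replica needs at least two servers")
        self.num_shards = num_shards
        # Round-robin placement; the replica always lives on the next server.
        self.active = [servers[i % len(servers)] for i in range(num_shards)]
        self.replica = [servers[(i + 1) % len(servers)] for i in range(num_shards)]

    def server_for(self, key: str) -> str:
        """Return the server holding the active copy of the key."""
        return self.active[shard_for(key, self.num_shards)]


cluster_map = ClusterMap(["SERVER 1", "SERVER 2", "SERVER 3"])
print(cluster_map.server_for("Doc 5"))
```

Because every client computes the same shard for a key and looks it up in the same map, reads and writes for a document go straight to the one server that holds its active copy.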
15
NODE AND CLUSTER SIZING
Dev-Test Size Deploy Monitor Manage
16
Size Couchbase Server
Sizing == performance
• Serve reads out of RAM
• Enough IO for writes and disk operations
• Mitigate inevitable failures
Reading Data
Application Server: “Give me document A”
Server: “Here is document A”
Writing Data
Application Server: “Please store document A”
Server: “OK, I stored document A”
17
Scaling out permits matching of aggregate flow rates so queues do not grow
[Diagram: three Application Servers connected over the network to three Servers]
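The flow-rate argument on this slide comes down to simple arithmetic, sketched here. The rates are invented numbers, and `queue_growth_per_sec` and `nodes_needed` are hypothetical helper names, not part of any Couchbase tool.

```python
import math


def queue_growth_per_sec(incoming_ops, nodes, drain_ops_per_node):
    """Net queue growth for the whole cluster; zero or less means queues do not grow."""
    return incoming_ops - nodes * drain_ops_per_node


def nodes_needed(incoming_ops, drain_ops_per_node):
    """Smallest node count whose aggregate drain rate matches the incoming rate."""
    return math.ceil(incoming_ops / drain_ops_per_node)


incoming = 50_000        # assumed application write rate, ops/sec
drain_per_node = 12_000  # assumed sustained drain rate of one node, ops/sec

print(queue_growth_per_sec(incoming, 3, drain_per_node))  # growth with 3 nodes
print(nodes_needed(incoming, drain_per_node))             # nodes to stop growth
```

With these assumed rates three nodes fall behind, so the queue keeps growing until more nodes are added.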
18
How many nodes?
5 key factors determine the number of nodes needed:
1) RAM
2) Disk
3) CPU
4) Network
5) Data Distribution/Safety
Couchbase Servers
Web application server
Application user
19
RAM sizing
1) Total RAM:
• Managed document cache:
– Working set
– Metadata
– Active+Replicas
• Index caching (I/O buffer)
Keep working set in RAM for best read performance
[Diagram: Reading Data. Application Server: “Give me document A”; Server: “Here is document A”]
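A rough way to turn the RAM factors above into a number is sketched below. The formula is a simplification, and the per-document metadata size, the usable fraction of RAM and all workload figures are placeholder assumptions to be replaced with measured values.

```python
import math


def ram_needed_bytes(num_docs, avg_value_bytes, working_set_ratio,
                     replicas=1, metadata_bytes_per_doc=64, usable_fraction=0.7):
    """Rough cluster-wide RAM estimate for the managed document cache.

    metadata_bytes_per_doc and usable_fraction are placeholder assumptions.
    """
    copies = 1 + replicas  # active and replica data share RAM
    metadata = num_docs * metadata_bytes_per_doc * copies  # metadata for every doc
    values = num_docs * working_set_ratio * avg_value_bytes * copies  # working set only
    return (metadata + values) / usable_fraction


def nodes_for_ram(total_bytes, ram_quota_per_node_bytes):
    """Nodes needed to hold the estimate at a given per-node quota."""
    return math.ceil(total_bytes / ram_quota_per_node_bytes)


GB = 1024 ** 3
total = ram_needed_bytes(num_docs=100_000_000, avg_value_bytes=2_000,
                         working_set_ratio=0.33)
print(round(total / GB, 1), "GB across the cluster")
print(nodes_for_ram(total, 16 * GB), "nodes at a 16 GB quota each")
```

The index cache (file system cache) is not included; it has to come out of the RAM left over after the quota.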
20
Working set ratio depends on your application
• Late stage social game: many users no longer active; few logged in at any given time. working/total set = .01
• Ad network: any cookie can show up at any time. working/total set = 1
• Business application: users logged in during the day; day moves around the globe. working/total set = .33
[Diagram: Reading Data, three Servers]
21
RAM sizing – Working set managed cache
As memory usage grows, some cached data will be removed from RAM to make space:
• Active and replica data share RAM
• Threshold based (NRU, favoring active data)
• Only cleanly persisted data can be “ejected”
• Only data values can be “ejected”, which means RAM can fill up with metadata
22
RAM Sizing - View/Index cache (disk I/O)
• File system cache availability for the index has a big impact on performance:
• Test runs based on 10 million items with 16GB bucket quota and 4GB, 8GB system RAM availability for indexes
• Performance results show that by doubling system cache availability:
– query latency reduces by half
– throughput increases by 50%
• Leave RAM free for the file system cache by keeping quotas below total RAM
23
Disk sizing: Space and I/O
2) Disk
• Sustained write rate
• Rebalance capacity
• Backups
• XDCR
• Compaction
• Total dataset: (active + replicas + indexes)
• Append-only
[Diagram: Writing Data. Application Server: “Please store document A”; Server: “OK, I stored document A”. Callouts: I/O, Space.]
24
Disk sizing: I/O
Factors impacting the disk I/O needed:
• Peak write load
• Sustained write load
• Compaction
• XDCR
• Views/indexing
Configurable paths/partitions for data and indexes allow for separation of space and I/O
25
Disk sizing: Space
Factors impacting the amount of disk space needed:
• Total data set
• Indexes
• Overhead for compaction (~3x): both data and indexes are “append-only”
Configurable paths/partitions for data and indexes allow for separation of space and I/O
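The space factors above can be combined into a back-of-the-envelope estimate, sketched here. The ~3x overhead is the figure from the slide; the data sizes are invented, and treating indexes as unreplicated is an assumption of this sketch.

```python
def disk_space_needed_bytes(active_data_bytes, index_bytes, replicas=1,
                            compaction_overhead=3.0):
    """Rough cluster-wide disk space: (active + replicas + indexes) times the
    append-only overhead. Indexes are counted once here, which is an assumption."""
    total_dataset = active_data_bytes * (1 + replicas) + index_bytes
    return total_dataset * compaction_overhead


GB = 1024 ** 3
needed = disk_space_needed_bytes(active_data_bytes=200 * GB, index_bytes=50 * GB)
print(needed / GB, "GB")
```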
26
Disk sizing: Impact of Views on IO and Space
• Number of Design Documents
– Extra space for each DD
– Extra IO to process for each DD
– Segregate views by DD
• Complexity of Views (IO)
• Amount of view output (space)
– Emit as little as possible
– Doc ID automatically included
• Use Development views and extrapolate
27
Disk sizing: Append only
• Append-only file format puts all new/updated/deleted items at the end of the on-disk file.
– Better performance and reliability
– No more fragmentation!
• This can lead to invalidated data in the “back” of the file.
• Need to compact data
28
Disk compaction
Initial file layout: Doc A, Doc B, Doc C
Update some data: Doc A, Doc B, Doc C, Doc A’, Doc D, Doc B’, Doc A’’
After compaction: Doc C, Doc B’, Doc A’’, Doc D
29
Disk compaction
• Compaction happens automatically:
– Settings for “threshold” of stale data
– Settings for time of day
– Split by data and index files
– Per-bucket or global
• Reduces size of on-disk files – data files AND index files
• Temporarily increased disk I/O and CPU, but no downtime!
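The interplay of append-only writes, stale data and threshold-triggered compaction can be illustrated with a toy model. This is not the Couchbase storage engine: the class `AppendOnlyFile`, its methods and the 30% threshold are assumptions for the example, and deletes are left out for brevity.

```python
class AppendOnlyFile:
    """Toy append-only store: every write is appended, old versions stay in the file."""

    def __init__(self):
        self.records = []  # list of (doc_id, value) in write order

    def write(self, doc_id, value):
        self.records.append((doc_id, value))

    def live_records(self):
        """Latest version of each document, kept in file order."""
        latest = {doc_id: pos for pos, (doc_id, _) in enumerate(self.records)}
        return [rec for pos, rec in enumerate(self.records) if latest[rec[0]] == pos]

    def fragmentation(self):
        """Fraction of records that are stale (superseded by a later write)."""
        if not self.records:
            return 0.0
        return 1 - len(self.live_records()) / len(self.records)

    def compact_if_needed(self, threshold=0.3):
        """Rewrite the file with live records only once stale data passes the threshold."""
        if self.fragmentation() >= threshold:
            self.records = self.live_records()
            return True
        return False


f = AppendOnlyFile()
for doc_id, value in [("A", "v1"), ("B", "v1"), ("C", "v1"),
                      ("A", "v2"), ("D", "v1"), ("B", "v2"), ("A", "v3")]:
    f.write(doc_id, value)

print(f.fragmentation())      # share of the file that is stale before compaction
print(f.compact_if_needed())  # whether compaction ran
print(f.records)              # remaining live records
```

Until compaction runs, the file holds every version ever written, which is why disk space has to be sized well above the live data set.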
30
CPU sizing
3) CPU
• Disk writing
• Views/compaction/XDCR
• RAM r/w performance not impacted
1.8 used VERY little CPU. Under the same workloads, 2.0 should not be much different.
New 2.0 features will require more CPU
31
Network sizing
4) Network
• Client traffic
• Replication (writes)
• Rebalancing
• XDCR
Reads+Writes
Replication (multiply writes) and Rebalancing
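The "multiply writes" point can be put into numbers with a short sketch. The operation rates and document size are invented, and `network_bytes_per_sec` is a hypothetical helper; rebalancing and XDCR traffic come on top of what it returns.

```python
def network_bytes_per_sec(reads_per_sec, writes_per_sec, avg_doc_bytes, replicas=1):
    """Client traffic plus intra-cluster replication; rebalance and XDCR are extra."""
    client = (reads_per_sec + writes_per_sec) * avg_doc_bytes
    # Each write is sent over the network once more for every replica.
    replication = writes_per_sec * replicas * avg_doc_bytes
    return client + replication


total = network_bytes_per_sec(reads_per_sec=70_000, writes_per_sec=30_000,
                              avg_doc_bytes=2_000, replicas=1)
print(total / 1e6, "MB/sec")
```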
32
Consistent low latency with varying doc sizes
Consistently low latencies in microseconds for varying document sizes with a mixed workload
33
Data Distribution
5) Data Distribution / Safety (assuming one replica):
• 1 node = BAD
• 2 nodes = …better…
• 3+ nodes = BEST!
Note: Many applications will need more than 3 nodes
Servers fail, be prepared. The more nodes, the less impact a failure will have.
34
How many nodes? (recap)
Features new to 2.0 will affect sizing requirements:
• Views/Indexing/Querying
• XDCR
• Append-only file format
5 key factors still determine the number of nodes needed:
1) RAM
2) Disk
3) CPU
4) Network
5) Data Distribution
Couchbase Servers
Web application server
Application user
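One way to combine the factors is to size each one separately and take the largest result, as sketched below. The function `nodes_for_cluster`, the unit-free numbers and the minimum of three nodes for safety are assumptions of this sketch; CPU is left out because it is hard to reduce to a single figure.

```python
import math


def nodes_for_cluster(ram_total, ram_per_node, disk_total, disk_per_node,
                      net_total, net_per_node, min_nodes_for_safety=3):
    """Return (node count, limiting factor, per-factor counts)."""
    per_factor = {
        "ram": math.ceil(ram_total / ram_per_node),
        "disk": math.ceil(disk_total / disk_per_node),
        "network": math.ceil(net_total / net_per_node),
        "safety": min_nodes_for_safety,
    }
    limiting = max(per_factor, key=per_factor.get)
    return per_factor[limiting], limiting, per_factor


# Assumed totals and per-node capacities (GB of RAM, GB of disk, MB/sec of network).
count, limiting, per_factor = nodes_for_cluster(
    ram_total=193, ram_per_node=16,
    disk_total=1350, disk_per_node=500,
    net_total=260, net_per_node=100,
)
print(count, limiting, per_factor)
```

In this made-up case RAM is the limiting factor, so adding disk or network capacity per node would not reduce the node count.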
35
MONITORING
Dev-Test Size Deploy Monitor Manage
36
Key resources: RAM, Disk, Network, CPU
[Diagram: three Application Servers connected over the NETWORK to three Servers, each with RAM and DISK]
37
Monitoring
Once in production, the heart of operations is monitoring
• RAM Usage
• Disk space and I/O:
– write queues / read activity / indexing
• Network bandwidth, replication queues
• CPU Usage
• Data distribution (balance, replicas)
38
Monitoring
IMMENSE amount of information available
• Real-time traffic graphs
• REST API accessible
• Per bucket, per node and aggregate statistics
• Application and inter-node traffic
• RAM <-> Disk
• Inter-system timing
39
Key Stats to Monitor
• Working set doesn’t fit in RAM
– Cache miss rate / disk fetches
• Disk I/O not keeping up
– Disk write queue size
• Internal replication lag
– TAP queues
• Indexing not keeping up
• XDCR lag
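A monitoring script built around these key stats might boil down to a threshold check like the one sketched here. The stat names, sample values and limits are illustrative assumptions, not the names the server reports; in practice the samples would come from the statistics exposed over the REST API, and limits from your own baseline.

```python
def check_stats(samples, thresholds):
    """Return the names of stats whose latest sample exceeds its threshold."""
    alerts = []
    for name, limit in thresholds.items():
        values = samples.get(name)
        if values and values[-1] > limit:
            alerts.append(name)
    return alerts


# Hypothetical stat names and limits, chosen only for the example.
thresholds = {
    "cache_miss_rate": 1.0,         # percent of reads that went to disk
    "disk_write_queue": 1_000_000,  # items waiting to be persisted
    "replication_queue": 100_000,   # items waiting to be sent to replicas
}
samples = {
    "cache_miss_rate": [0.1, 0.2, 3.5],
    "disk_write_queue": [20_000, 18_000, 25_000],
    "replication_queue": [0, 0, 0],
}
print(check_stats(samples, thresholds))
```

Only the most recent sample is compared here; a real check would more likely look at a trend, since a queue that keeps growing matters more than a single spike.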
40
41
MANAGEMENT AND MAINTENANCE
Dev-Test Size Deploy Monitor Manage
42
Management/Maintenance
• Scaling
• Upgrading/Scheduled maintenance
• Backup/Restore
• Dealing with Failures
43
Scaling
Couchbase Scales out Linearly:
Need more RAM? Add nodes…
Need more Disk IO or space? Add nodes…
Couchbase also makes it easy to scale up by swapping smaller nodes out for larger ones without any disruption
44
Couchbase + Cisco + Solarflare
[Chart: Operations per second vs. Number of servers in cluster]
High throughput with 1.4 GB/sec data transfer rate using 4 servers
Linear throughput scalability
45
Additional benchmark details
• Cluster of 8 nodes running Couchbase Server 1.8.0
• One server used as the client to run the workload
• Workload used for the test was Couchbase’s streaming load generator
• GET and SET operations were performed in a 70:30 ratio
Test System and Parameters
• Couchbase Server 1.8.0
• Cisco Nexus 5548UP Switch
• Solarflare SFN5122F 10 Gigabit Ethernet Enhanced Small Form-Factor Pluggable (SFP+) server adapters
• Solarflare OpenOnload
• Servers: Nine Cisco UCS C200 M2 High-Density Rack Servers with Intel Xeon processor X5670 six-core 2.93-GHz CPU, running Red Hat Enterprise Linux (RHEL) 5.5 x86 64-bit, with 100-GB RAM and four 2-TB hard drives
46
Upgrade
Upgrade existing Couchbase Server 1.8 to Couchbase Server 2.0!
1. Add nodes of new version, rebalance…
2. Remove nodes of old version, rebalance…
3. Done!
No disruption
General use for software upgrade, hardware refresh, planned maintenance
47
Easy to Maintain Couchbase
• Use remove+rebalance on “malfunctioning” node:
– Protects data distribution and “safety”
– Replicas recreated
– Best to “swap” with new node to maintain capacity and move minimal amount of data
48
Backup
[Diagram labels: cbbackup, Data Files, three Servers, network]
49
Restore
2) “cbrestore” used to restore data into live/different cluster
Data Files
cbrestore
50
Failures Happen!
Hardware
Network
Bugs
51
Easy to Manage failures with Couchbase
• Failover (automatic or manual):
– Replica data and indexes promoted for immediate access
– Replicas not recreated
– Do NOT failover healthy node
– Perform rebalance after returning cluster to full or greater capacity
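What failover does to the data layout can be shown with a small model. This is a sketch, not Couchbase's implementation: the shard dictionaries and the functions `failover` and `unprotected` are hypothetical, and each shard is assumed to have one active copy and at most one replica.

```python
def failover(shards, failed_node):
    """Promote replicas of shards whose active copy was on the failed node.

    Replicas are not recreated here; that is left to a later rebalance.
    """
    for shard in shards:
        if shard["active"] == failed_node:
            # The replica takes over as the active copy.
            shard["active"], shard["replica"] = shard["replica"], None
        elif shard["replica"] == failed_node:
            shard["replica"] = None
    return shards


def unprotected(shards):
    """Ids of shards that currently have no replica."""
    return [s["id"] for s in shards if s["replica"] is None]


shards = [
    {"id": 0, "active": "SERVER 1", "replica": "SERVER 2"},
    {"id": 1, "active": "SERVER 2", "replica": "SERVER 3"},
    {"id": 2, "active": "SERVER 3", "replica": "SERVER 1"},
]
failover(shards, "SERVER 3")
print(shards)
print(unprotected(shards))
```

After the failover all data is still reachable, but some shards have no replica left, which is why a rebalance should follow once the cluster is back at full or greater capacity.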
52
Fail Over
[Diagram: COUCHBASE SERVER CLUSTER of SERVER 1 to SERVER 5, each with Active Docs and Replica Docs (Doc 1 to Doc 9). APP SERVER 1 and APP SERVER 2 each run the COUCHBASE CLIENT LIBRARY with a CLUSTER MAP.]
53
Dev/Test Size Deploy Monitor Manage
Conclusion
54
Want more?
Lots of details and best practices in our documentation:
http://www.couchbase.com/docs/