Karthik Gandhi
Running Cassandra on Apache Mesos across multiple datacenters at Uber
Percona Live, April 2017
“Transportation as reliable as running water, everywhere, for everyone”
Uber’s mission
99.99% available
cheap and efficient
Cluster Management @ Uber
● Statically partitioned machines across different services
● Move from custom deployment system to everything running on Mesos
● Gain efficiency by increasing machine utilization
○ Co-locate services on the same machine
○ Can lead to 30% fewer machines [1]
● Build stateful service frameworks to run on Mesos
[1] “Large-scale cluster management at Google with Borg”, EuroSys 2015
Cassandra advantages
● Horizontal scalability
○ Scales reads and writes linearly as new nodes are added
● High availability
○ Fault tolerant with tunable consistency levels
● Low latency, solid performance
● Operational simplicity
○ Homogeneous cluster, no SPOF
● Rich data model
○ Columns, composite keys, counters, secondary indexes
DCOS Cassandra Service
Mesosphere-Uber collaboration
Uber
● Abhishek Verma
● Karthik Gandhi
● Matthias Eichstaedt
● Zhitao Li
● Zhiyan Shao
Mesosphere
● Chris Lambert
● Keith Chambers
● Kenneth Owens
● Mohit Soni
https://github.com/mesosphere/dcos-cassandra-service
Cassandra service architecture
[Architecture diagram. Components shown: the dcos-cassandra-service framework (web interface, control plane API); Aurora (DC1) and Aurora (DC2); a deployment system; a Mesos master (leader) and a Mesos master (standby); a ZooKeeper quorum of three ZK nodes; three Mesos agents hosting C* nodes 1a, 2a, 1b, 2b and 1c, which belong to C* cluster 1 and C* cluster 2. Client apps use the CQL interface.]
Cassandra Service: Features
● Custom seed provider
● Increasing cluster size
● Replacing a dead node
● Backup/Restore
● Cleanup
● Repair
● Multi-datacenter support
Mesos primitives
● Persistent volumes
○ Data stored outside of the sandbox directory
○ Offered to the same task if it crashes and restarts
● Dynamic reservations
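As a rough illustration of how the two primitives combine, the sketch below builds the JSON description of a disk resource that is reserved for a role and carries a persistent volume. The field layout follows the resource format shown in older Mesos documentation (a `role` plus a `reservation` object); newer Mesos releases changed how reservations are encoded. The role name, principal, volume id, size and container path are all hypothetical values chosen for the example.

```python
import json


def persistent_volume(role, principal, size_mb, volume_id, container_path):
    """Describe a reserved disk resource that carries a persistent volume.

    Field names follow the older Mesos resource JSON layout; this is an
    assumption, so check it against the Mesos version in use.
    """
    return {
        "name": "disk",
        "type": "SCALAR",
        "scalar": {"value": size_mb},
        # Dynamic reservation: the resource is reserved for this role.
        "role": role,
        "reservation": {"principal": principal},
        # Persistent volume: the data lives outside the task sandbox.
        "disk": {
            "persistence": {"id": volume_id, "principal": principal},
            "volume": {"mode": "RW", "container_path": container_path},
        },
    }


# All values below are hypothetical.
resource = persistent_volume(
    role="cassandra",
    principal="cassandra-principal",
    size_mb=2048,
    volume_id="node-0-data",
    container_path="volume",
)
print(json.dumps(resource, indent=2))
```

Because the volume is tied to the reservation, a restarted task can be offered the same data again, which is the behavior the bullets above describe.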
Plan, Phases and Blocks
● Plan
○ Phases
■ Reconciliation
■ Deployment
■ Backup
■ Restore
■ Cleanup
■ Repair
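The hierarchy can be pictured with a small model. The slide names plans, phases and blocks but does not define a block, so this sketch assumes a block is the smallest unit of work inside a phase (for example, one node) and that blocks run strictly in order. The class names, status strings and `step` helper are hypothetical and are not the framework's API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Block:
    name: str
    action: Callable[[], bool]  # returns True on success
    status: str = "PENDING"


@dataclass
class Phase:
    name: str
    blocks: List[Block] = field(default_factory=list)


@dataclass
class Plan:
    phases: List[Phase] = field(default_factory=list)

    def run(self) -> bool:
        """Run every block of every phase in order; stop at the first failure."""
        for phase in self.phases:
            for block in phase.blocks:
                block.status = "IN_PROGRESS"
                if not block.action():
                    block.status = "ERROR"
                    return False
                block.status = "COMPLETE"
        return True


log = []


def step(label):
    """Build a stand-in action that records its label and succeeds."""
    def action():
        log.append(label)
        return True
    return action


plan = Plan([
    Phase("Reconciliation", [Block("reconcile", step("reconcile"))]),
    Phase("Deployment", [Block(f"node-{i}", step(f"deploy node-{i}")) for i in range(3)]),
])
ok = plan.run()
```

Running blocks one at a time matches the node-by-node style of the operations listed in this deck, at the cost of slower rollouts on large clusters.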
Spinning up a new Cassandra cluster
https://www.youtube.com/watch?v=gbYmjtDKSzs
Using the Cassandra cluster
https://www.youtube.com/watch?v=qgqO39DteHo
Automate Cassandra operations
● Repair
○ Synchronize all data across replicas
■ Last write wins
○ Anti-entropy mechanism
○ Repair each node's primary token range, node-by-node
● Cleanup
○ Remove data whose ownership has changed
■ Because of addition or removal of nodes
○ Cleanup node-by-node
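The "last write wins" rule used by repair can be sketched as a merge over replica contents keyed by write timestamp. This is a simplified model rather than Cassandra's implementation: each replica is a plain dictionary, and tombstones and timestamp ties are left out. The keys and values are made up.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, str]  # (write timestamp in microseconds, value)


def reconcile(replicas: List[Dict[str, Cell]]) -> Dict[str, Cell]:
    """Merge replica contents, keeping the newest cell for every key."""
    merged: Dict[str, Cell] = {}
    for replica in replicas:
        for key, cell in replica.items():
            # Keep the cell with the larger write timestamp.
            if key not in merged or cell[0] > merged[key][0]:
                merged[key] = cell
    return merged


# Three replicas that have drifted apart (hypothetical data).
replica_a = {"user:1": (100, "alice"), "user:2": (250, "bob")}
replica_b = {"user:1": (180, "alicia")}
replica_c = {"user:2": (200, "robert"), "user:3": (90, "carol")}

repaired = reconcile([replica_a, replica_b, replica_c])
```

After repair, every replica would hold the contents of `repaired` for the ranges it owns.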
Cleanup operation
https://www.youtube.com/watch?v=VxRLSl8MpYI
Failure scenarios
● Executor failure
○ Restarted automatically
● Cassandra daemon failure
○ Restarted automatically
● Node failure
○ Manual REST endpoint to replace node
● Scheduling framework failure
○ Existing nodes keep running, new nodes cannot be added
● Mesos master failure: new leader election
Experiments
Cluster startup
For each node in the cluster:
1. Receive and accept offer
2. Launch task
3. Fetch executor, JRE, Cassandra binaries from S3/HDFS
4. Launch executor
5. Launch Cassandra daemon
6. Wait for its mode to transition STARTING -> JOINING -> NORMAL
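The final step, waiting for the mode transition, can be sketched as a polling loop. `get_mode` is a hypothetical callable that returns the node's current mode as a string; how the framework actually reads the mode is not stated in the slides. The poll interval and timeout defaults are also assumptions.

```python
import time
from typing import Callable

EXPECTED_ORDER = ["STARTING", "JOINING", "NORMAL"]


def wait_until_normal(get_mode: Callable[[], str],
                      poll_interval_s: float = 1.0,
                      timeout_s: float = 600.0) -> bool:
    """Poll the node's mode until it reaches NORMAL or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    furthest = 0  # index of the furthest mode seen so far
    while time.monotonic() < deadline:
        mode = get_mode()
        if mode not in EXPECTED_ORDER:
            raise ValueError(f"unexpected mode: {mode}")
        position = EXPECTED_ORDER.index(mode)
        if position < furthest:
            raise ValueError(f"mode went backwards to {mode}")
        furthest = position
        if mode == "NORMAL":
            return True
        time.sleep(poll_interval_s)
    return False


# Simulated node that walks through the expected modes.
modes = iter(["STARTING", "STARTING", "JOINING", "NORMAL"])
reached_normal = wait_until_normal(lambda: next(modes), poll_interval_s=0.0)
```

A scheduler following the list above would start the next node only once this call returns `True`.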
Aurora hogging offers
Aurora hogs offers
● Aurora is designed to be the only framework running on Mesos, controlling all the machines
● Holds on to all received offers
○ Does not accept or reject them
● Mesos waits for the --offer_timeout duration, then rescinds the offer
● --offer_timeout config
○ Duration of time before an offer is rescinded from a framework. This helps fairness when running frameworks that hold on to offers, or frameworks that accidentally drop offers. If not set, offers do not timeout.
● Set to 5 mins in our setup, reduced to 10 secs
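To see why the timeout matters, consider a deliberately simple model: whenever an offer goes to a framework that neither accepts nor declines it, the offered resources stay unavailable to other frameworks for the full `--offer_timeout`. The sketch below converts the two settings mentioned above into seconds and compares the resulting wait. The small duration parser and the `hogged_rounds` parameter are illustrative assumptions, not Mesos code, and the real allocator is more involved.

```python
UNIT_SECONDS = {"secs": 1, "mins": 60, "hrs": 3600}


def parse_duration(text: str) -> float:
    """Convert strings such as '5mins' or '10secs' into seconds."""
    for unit, factor in UNIT_SECONDS.items():
        if text.endswith(unit):
            return float(text[: -len(unit)]) * factor
    raise ValueError(f"unsupported duration: {text}")


def wait_before_reoffer(offer_timeout: str, hogged_rounds: int = 1) -> float:
    """Seconds the resources stay unavailable if the offer is first held
    by a hogging framework for `hogged_rounds` consecutive rounds."""
    return parse_duration(offer_timeout) * hogged_rounds


before = wait_before_reoffer("5mins")   # original setting
after = wait_before_reoffer("10secs")   # reduced setting
print(f"held for {before:.0f} s before, {after:.0f} s after")
```

Under this model, each hogged round costs 300 seconds with the original setting and 10 seconds with the reduced one.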
Cluster startup time
Framework can start ~ one new node per minute
Long term solution: dynamic reservations
● Dynamically reserve all of the machines' resources to the “cassandra” role
● Resources are offered only to cassandra frameworks
● Improves node startup time: 30s/node
● Node failure replacement or updates are much faster
Tuning JVM Garbage collection
Changed from CMS to G1 garbage collector
Left: https://github.com/apache/cassandra/blob/cassandra-2.2/conf/cassandra-env.sh#L213
Right: https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_tune_jvm_c.html?scroll=concept_ds_sv5_k4w_dk__tuning-java-garbage-collection
| Metric | CMS | G1 | G1 : CMS Factor |
|---|---|---|---|
| op rate | 1951 | 13765 | 7.06 |
| latency mean (ms) | 3.6 | 0.4 | 9.00 |
| latency median (ms) | 0.3 | 0.3 | 1.00 |
| latency 95th percentile (ms) | 0.6 | 0.4 | 1.50 |
| latency 99th percentile (ms) | 1 | 0.5 | 2.00 |
| latency 99.9th percentile (ms) | 11.6 | 0.7 | 16.57 |
| latency max (ms) | 13496.9 | 4626.9 | 2.92 |
G1 garbage collector is much better without any tuning
Using cassandra-stress, 32 threads client
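The "G1 : CMS Factor" column appears to be an improvement ratio: G1 divided by CMS for the op rate, where higher is better, and CMS divided by G1 for the latencies, where lower is better. The short check below recomputes the column from the CMS and G1 values under that reading. The `higher_is_better` flag is an interpretation, not something the slide states.

```python
rows = [
    # (metric, CMS, G1, higher_is_better)
    ("op rate", 1951, 13765, True),
    ("latency mean (ms)", 3.6, 0.4, False),
    ("latency median (ms)", 0.3, 0.3, False),
    ("latency 95th percentile (ms)", 0.6, 0.4, False),
    ("latency 99th percentile (ms)", 1, 0.5, False),
    ("latency 99.9th percentile (ms)", 11.6, 0.7, False),
    ("latency max (ms)", 13496.9, 4626.9, False),
]


def improvement_factor(cms, g1, higher_is_better):
    """How many times better G1 is than CMS for one metric."""
    return g1 / cms if higher_is_better else cms / g1


factors = {
    name: round(improvement_factor(cms, g1, hib), 2)
    for name, cms, g1, hib in rows
}
for name, factor in factors.items():
    print(f"{name}: {factor:.2f}")
```

The recomputed values match the reported factors to two decimal places.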
Cluster Setup
● 3 nodes
● Local DC
● 24 cores, 128 GB RAM, 2TB SAS drives
● Cassandra running on bare metal
● Cassandra running in a Mesos container
Read Latency
Bare metal vs Mesos managed cluster
Bare metal: mean 0.38 ms, P95 0.74 ms, P99 0.91 ms
Mesos: mean 0.44 ms, P95 0.76 ms, P99 0.98 ms
Read Throughput
Bare metal vs Mesos managed cluster
[Chart: read throughput, bare metal vs Mesos]
Write Latency
Bare metal vs Mesos managed cluster
Bare metal: mean 0.43 ms, P95 0.94 ms, P99 1.05 ms
Mesos: mean 0.48 ms, P95 0.93 ms, P99 1.26 ms
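The container overhead implied by the read and write latency figures can be worked out as a relative difference. The sketch assumes the first set of numbers on each latency slide belongs to bare metal and the second to Mesos, following the left-to-right order of the panel labels. That pairing is an assumption.

```python
# Latencies in milliseconds; the bare metal / Mesos pairing is assumed.
results = {
    "read": {
        "bare_metal": {"mean": 0.38, "p95": 0.74, "p99": 0.91},
        "mesos": {"mean": 0.44, "p95": 0.76, "p99": 0.98},
    },
    "write": {
        "bare_metal": {"mean": 0.43, "p95": 0.94, "p99": 1.05},
        "mesos": {"mean": 0.48, "p95": 0.93, "p99": 1.26},
    },
}


def overhead_percent(bare: float, mesos: float) -> float:
    """Relative latency increase of Mesos over bare metal, in percent."""
    return (mesos - bare) / bare * 100


overheads = {
    op: {
        stat: round(overhead_percent(v["bare_metal"][stat], v["mesos"][stat]), 1)
        for stat in v["bare_metal"]
    }
    for op, v in results.items()
}
for op, stats in overheads.items():
    print(op, stats)
```

Under that pairing, the absolute differences stay within a fraction of a millisecond even where the relative overhead at P99 looks large.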
Write Throughput
Bare metal vs Mesos managed cluster
[Chart: write throughput, bare metal vs Mesos]
Running across datacenters
● Four datacenters
○ Each running a dcos-cassandra-service instance
○ Sync datacenter phase
■ Periodically exchange seeds with external datacenters
● Cassandra nodes gossip topology
○ Discover nodes in other datacenters
Asynchronous cross-dc replication latency
● Write a row to dc1 using consistency level LOCAL_ONE
○ Write timestamp to a file when operation completed
● Spin in a loop to read the same row using consistency LOCAL_ONE in dc2
○ Write timestamp to a file when operation completed
● Difference between the two gives asynchronous replication latency
○ p50: 44.69 ms, p95: 46.38 ms, p99: 47.44 ms
● Round trip ping latency
○ 77.8ms
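The analysis half of this experiment can be sketched as follows: pair each write-completion timestamp from dc1 with the matching read-completion timestamp from dc2, subtract, and report percentiles. The timestamps are made up, the nearest-rank percentile is one of several common definitions, and the method only works if the clocks in the two datacenters are synchronized.

```python
import math
from typing import List, Sequence


def replication_latencies_ms(write_done_ms: Sequence[float],
                             read_done_ms: Sequence[float]) -> List[float]:
    """Per-row delay between the write completing in dc1 and the row
    becoming readable in dc2 (both timestamps in milliseconds)."""
    return [r - w for w, r in zip(write_done_ms, read_done_ms)]


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical timestamps, as they might be read back from the two files.
write_done = [1000.0, 2000.0, 3000.0, 4000.0]
read_done = [1044.0, 2045.0, 3046.0, 4047.0]

latencies = replication_latencies_ms(write_done, read_done)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
```

As a sanity check on the reported numbers: half of the 77.8 ms round trip is 38.9 ms, so if the network path is symmetric, the measured p50 of 44.69 ms sits roughly 5.8 ms above the one-way network time.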
Thank you