scylla summit 2017: cry in the dojo, laugh in the battlefield: how we constantly try to bring scylla...

Cry in the dojo, laugh in the battlefield: how we constantly try to bring Scylla to its knees so you don't have to. Roy Dahan, QA Manager, Scylla

Upload: scylladb

Post on 22-Jan-2018

TRANSCRIPT

Cry in the dojo, laugh in the battlefield: how we constantly try to bring Scylla to its knees so you don't have to.

QA Manager, Scylla

Roy Dahan

Roy Dahan

Roy has over 10 years of experience testing large-scale distributed systems, with a focus on storage/data systems, and managing small to large teams responsible for all testing aspects using a highly automated approach.

Our Goal

▪ Achieve the Highest Levels of System Stability & Availability

▪ Maintain Data Integrity

▪ Prevent Performance Degradation Over Time

▪ Increase User Confidence

All of the above, even when BAD THINGS happen in “Production-like Environments”

How We Test Scylla

Scylla Testing

▪ Unit: ✓ scylla-unittest

▪ Functional: ✓ dtest

▪ Compatibility: ✓ dtest ✓ Driver Tests

▪ Integration Tests: ✓ Janus-Graph ✓ Titan-test ✓ Spark

▪ Scale / Performance: ✓ S-C-T

▪ Stress / Load: ✓ S-C-T ✓ Cassandra Stress

▪ System / Longevity: ✓ S-C-T ✓ Jepsen

Distributed Tests (dtest)

▪ Functional “Black Box” Tests

▪ Verify our Compatibility with Cassandra

▪ Enhanced & Extended to Catch Scylla Regressions

▪ Around 10% (208) of the Reported Issues on the Scylla Project Reference a dtest (Detected/Reproduced by dtest)

▪ About 675 Tests Run Regularly as Part of the “Regression Suite”

Scylla-Cluster-Tests (SCT)

▪ Automation Library and Test Collection for Scylla & Cassandra Clusters

▪ Supports Multiple Backends such as: AWS / GCE / OpenStack / Libvirt

▪ Tests are Based on Chaos Engineering Principles:

o Build a Hypothesis around Steady State Behavior

o Vary Real-world Events

o Automate Experiments to Run Continuously

▪ Around 4% (105) of the Reported Issues on the Scylla Project Reference an SCT Test (Detected/Reproduced by SCT test)

SCT Longevity Testing

Test Setup (Our Defaults):

▪ Cluster of N Scylla DB nodes (N=6)

▪ Set of X Loader Nodes (X=2)

▪ Scylla Monitoring Server

[Diagram: two clients connected to a cluster of nodes]

SCT Longevity Testing

Test Setup - Example on GCE:

SCT Longevity Testing

The Test flow:

▪ Client Side Loaders Run Workloads: a Set of Cassandra-Stress Loads (Write, Mixed, Counters, User Profiles)

▪ The Test Runs for X hours / days / weeks

▪ A “Nemesis” Out of the Predefined List is Randomly Selected

o Some Nemeses Disrupt Nodes in the Cluster

o Some Run Standard Cluster Operations

Current Nemesis types:

StopStartService, StopWaitStartService, Drainer, Decommission, CorruptThenRepair, CorruptThenRebuild, NoCorruptRepair, Refresh, MajorCompaction, ModifyTableProperties, Enospc
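
The test flow can be sketched roughly as follows. This is a minimal illustration of "pick a random nemesis at a fixed interval", not SCT's actual scheduler: `FakeNode`, `run_nemesis_loop`, the two toy nemesis functions, and the injectable `sleep` and `rng` parameters are all hypothetical names introduced for this example.

```python
import random


class FakeNode:
    """Hypothetical stand-in for a cluster node; records what was done to it."""

    def __init__(self, name):
        self.name = name
        self.events = []


def stop_start_service(node):
    # Illustrative disruption: a real nemesis would restart the service remotely.
    node.events.append("StopStartService")


def major_compaction(node):
    # Illustrative standard operation: a real nemesis would trigger a compaction.
    node.events.append("MajorCompaction")


def run_nemesis_loop(nodes, nemeses, rounds, interval_s, rng, sleep):
    """Pick a random nemesis and a random target node once per round."""
    history = []
    for _ in range(rounds):
        nemesis = rng.choice(nemeses)
        target = rng.choice(nodes)
        nemesis(target)
        history.append((nemesis.__name__, target.name))
        sleep(interval_s)  # wait before the next disruption
    return history


if __name__ == "__main__":
    nodes = [FakeNode("node-%d" % i) for i in range(6)]
    history = run_nemesis_loop(
        nodes,
        [stop_start_service, major_compaction],
        rounds=5,
        interval_s=0,
        rng=random.Random(42),
        sleep=lambda seconds: None,  # no real waiting in this demo
    )
    for name, target in history:
        print(name, "->", target)
```

Passing the random generator and the sleep function as arguments keeps the loop deterministic and fast to exercise, which is convenient when testing the test harness itself.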

SCT Longevity Testing

Test Fixture Example:

test_duration: 5760
stress_cmd: ["cassandra-stress write cl=QUORUM duration=5760m -schema 'replication(factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=1000 -pop seq=1..100000000 -log interval=5",
             "cassandra-stress counter_write cl=QUORUM duration=5760m -schema 'replication(factor=3) compaction(strategy=DateTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=1000 -pop seq=1..1000000",
             "cassandra-stress user profile=/tmp/cs_mv_profile.yaml ops'(insert=3,read1=1,read2=1,read3=1)' cl=QUORUM duration=5760m -port jmx=6868 -mode cql3 native -rate threads=100"]
n_db_nodes: 6
n_loaders: 2
n_monitor_nodes: 1
nemesis_class_name: 'ChaosMonkey'
nemesis_interval: 5
failure_post_behavior: keep
space_node_threshold: 644245094
ip_ssh_connections: 'private'
experimental: 'true'
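
As a rough sanity check on a fixture like this one, the sketch below verifies that each cassandra-stress load runs for the same number of minutes as `test_duration`. It assumes `test_duration` is expressed in minutes (suggested by the matching `duration=5760m` values, but not stated on the slide). The helper names are hypothetical and the stress commands are abbreviated to the parts the check needs.

```python
import re

# Values copied from the fixture; stress commands shortened for readability.
fixture = {
    "test_duration": 5760,  # assumed to be minutes, like the 'm' suffix below
    "stress_cmd": [
        "cassandra-stress write cl=QUORUM duration=5760m",
        "cassandra-stress counter_write cl=QUORUM duration=5760m",
        "cassandra-stress user cl=QUORUM duration=5760m",
    ],
}


def stress_durations_minutes(commands):
    """Extract the duration=<N>m value from each cassandra-stress command."""
    durations = []
    for cmd in commands:
        match = re.search(r"\bduration=(\d+)m\b", cmd)
        if match is None:
            raise ValueError("no minute-based duration in: %s" % cmd)
        durations.append(int(match.group(1)))
    return durations


def check_fixture(config):
    """Return the test length in days after checking the loads cover it."""
    total = config["test_duration"]
    for minutes in stress_durations_minutes(config["stress_cmd"]):
        if minutes != total:
            raise ValueError(
                "stress duration %d != test_duration %d" % (minutes, total)
            )
    return total / (60 * 24)


if __name__ == "__main__":
    print("test length in days:", check_fixture(fixture))
```

With the values above, 5760 minutes works out to four days of continuous load.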

SCT Longevity Testing

Nemesis Code Examples:

def disrupt_destroy_data_then_repair(self):
    self._set_current_disruption('CorruptThenRepair %s' % self.target_node)
    # Delete set of sstables from data directory
    self._destroy_data()
    # Try to save the node
    self.repair_nodetool_repair()

def disrupt_stop_wait_start_scylla_server(self, sleep_time=300):
    self._set_current_disruption('StopWaitStartService %s' % self.target_node)
    self.target_node.remoter.run('sudo systemctl stop scylla-server.service')
    self.target_node.wait_db_down()
    self.log.info("Sleep for %s seconds", sleep_time)
    time.sleep(sleep_time)
    self.target_node.remoter.run('sudo systemctl start scylla-server.service')
    self.target_node.wait_db_up()

SCT Longevity Testing

Test Verification & Analysis:

▪ Application Load (cassandra-stress) Doesn’t Stop

▪ Auto Detection of:

• Coredumps

• Errors

• Exceptions

• Operation failures (repair, add node, refresh, compaction, etc.)

▪ Auto Detection of Performance Degradations (unexpected lower throughput / higher latencies due to operations)

▪ Compare Nemesis Execution Durations Across Builds to Detect Possible Regressions
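
The last point, comparing nemesis execution durations across builds, could look something like the sketch below. It is only an illustration of the idea: the function name, the 20% tolerance, and the sample numbers are assumptions made up for this example, not values from the talk.

```python
def find_duration_regressions(baseline, candidate, tolerance=0.20):
    """Return nemeses whose average duration grew by more than `tolerance`.

    Both arguments map a nemesis name to its average execution time in seconds.
    """
    regressions = {}
    for name, base_avg in baseline.items():
        new_avg = candidate.get(name)
        if new_avg is None or base_avg <= 0:
            continue  # nothing to compare against
        growth = (new_avg - base_avg) / base_avg
        if growth > tolerance:
            regressions[name] = round(growth, 2)
    return regressions


if __name__ == "__main__":
    # Invented average durations in seconds, for illustration only.
    baseline = {"Decommission": 200.0, "Drainer": 50.0, "Refresh": 8.0}
    candidate = {"Decommission": 300.0, "Drainer": 52.0, "Refresh": 8.0}
    print(find_duration_regressions(baseline, candidate))
```

A relative threshold is used here so that short operations and long operations are judged on the same scale. A real analysis would probably also account for data set size and run-to-run variance.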

SCT Longevity Testing

Longevity monitoring example:

“Total Requests Served” (op/s) correlated with Nemesis executions.

SCT Longevity Testing

Longevity monitoring example:

“Requests Rate Served” (op/s per instance) correlated with Nemesis executions.

SCT Longevity Testing

Longevity monitoring example:

“CPU utilization” (% per instance) correlated with Nemesis executions.

SCT Longevity Testing

Test Summary Output - Nemesis Execution:

50GB DataSet Test: (Nemesis every 5 minutes, 4 days)

----------------------------------------------
| Nemesis Type         | Count | Avg Time(s) |
----------------------------------------------
| CorruptThenRebuild   |   103 |       93.79 |
| Decommission         |   111 |      231.89 |
| Drainer              |   109 |       48.27 |
| CorruptThenRepair    |   113 |      285.71 |
| Refresh              |    95 |        7.72 |
| NoCorruptRepair      |    97 |      331.73 |
| StopStartService     |   133 |       26.92 |
| MajorCompaction      |   134 |       20.63 |
| ModifyTable          |   197 |        1.50 |
| Enospc               |   114 |       26.33 |
| StopWaitStartService |    98 |       66.30 |
----------------------------------------------

1TB DataSet Test: (Nemesis every 30 minutes, 6 days)

----------------------------------------------
| Nemesis Type         | Count | Avg Time(s) |
----------------------------------------------
| CorruptThenRebuild   |     2 |      732.50 |
| Decommission         |     7 |     2913.86 |
| Drainer              |     6 |      213.00 |
| CorruptThenRepair    |     5 |     4942.60 |
| Refresh              |     6 |       10.50 |
| NoCorruptRepair      |     3 |     2835.33 |
| StopStartService     |     2 |      195.00 |
| MajorCompaction      |     3 |      663.33 |
| ModifyTable          |     6 |        4.67 |
| Enospc               |     6 |      221.00 |
| StopWaitStartService |     6 |      492.17 |
----------------------------------------------

SCT Longevity Testing

Nemesis Execution Analysis:

Auto-analysis and reports based on test statistics stored automatically in ElasticSearch

Example of Issue detected by Longevity

Example of Nemesis Added due to Issue

Example of Nemesis Added due to Issue

def disrupt_modify_table_comment(self):
    self._set_current_disruption('ModifyTableProperties %s' % self.target_node)
    comment = ''.join(random.choice(string.ascii_letters) for i in xrange(24))
    cmd = "ALTER TABLE keyspace1.standard1 with comment = '{}';".format(comment)
    self.target_node.remoter.run('cqlsh -e "{}" {}'.format(cmd, self.target_node.private_ip_address), verbose=True)

def disrupt_modify_table_gc_grace_time(self):
    self._set_current_disruption('ModifyTableProperties %s' % self.target_node)
    gc_grace_seconds = random.choice(xrange(216000, 864000))
    cmd = "ALTER TABLE keyspace1.standard1 with comment = 'gc_grace_seconds changed' AND" \
          " gc_grace_seconds = {};".format(gc_grace_seconds)
    self.target_node.remoter.run('cqlsh -e "{}" {}'.format(cmd, self.target_node.private_ip_address), verbose=True)
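
The slide code targets Python 2 (`xrange`) and mixes statement building with remote execution. As a sketch only, the statement-building part could be pulled into standalone Python 3 helpers like the ones below, which makes it easy to check without a cluster. The function names and the `rng` and `table` parameters are hypothetical; the CQL text and value ranges are copied from the slide.

```python
import random
import string


def build_comment_statement(rng, table="keyspace1.standard1"):
    """Build an ALTER TABLE statement that sets a random 24-letter comment."""
    comment = "".join(rng.choice(string.ascii_letters) for _ in range(24))
    return "ALTER TABLE {} with comment = '{}';".format(table, comment)


def build_gc_grace_statement(rng, table="keyspace1.standard1"):
    """Build an ALTER TABLE statement that sets a random gc_grace_seconds."""
    # Same half-open range as random.choice(xrange(216000, 864000)).
    gc_grace_seconds = rng.randrange(216000, 864000)
    return (
        "ALTER TABLE {} with comment = 'gc_grace_seconds changed' AND"
        " gc_grace_seconds = {};".format(table, gc_grace_seconds)
    )


if __name__ == "__main__":
    rng = random.Random()
    print(build_comment_statement(rng))
    print(build_gc_grace_statement(rng))
```

The resulting strings would still be handed to `cqlsh -e` on the target node, as in the original nemesis methods.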

Multi DC Longevity - The plot thickens

Test Setup (Our Defaults):

▪ Cluster of N Scylla DB nodes (N=15)

▪ Across M “Data Centers” (M=3)

▪ Set of X Loader Nodes (X=3)

▪ Scylla Monitoring Server

▪ Set of Cassandra-Stress commands running on the loaders (Write, Mixed, Counters, User Profiles)

The tc utility is used to impose random network delays, packet drops, and packet reordering between Data Centers.

[Diagram: three data centers (DC1, DC2, DC3), each with its own client]
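
To make the tc step more concrete, the sketch below builds a `tc qdisc ... netem` command line with random delay, loss, and reorder values. It is an assumption about how such impairments might be generated, not the command SCT actually runs: the interface name `eth0`, the value ranges, and the function names are all made up for illustration, and the code only builds the argument list without executing it.

```python
import random


def build_netem_command(device, delay_ms, jitter_ms, loss_pct, reorder_pct):
    """Build a `tc qdisc add ... netem` command as an argument list."""
    return [
        "tc", "qdisc", "add", "dev", device, "root", "netem",
        "delay", "%dms" % delay_ms, "%dms" % jitter_ms,  # base delay and jitter
        "loss", "%d%%" % loss_pct,                        # random packet drops
        "reorder", "%d%%" % reorder_pct,                  # needs a delay to act on
    ]


def random_netem_command(device, rng):
    """Pick random impairment values (illustrative ranges) for one interface."""
    return build_netem_command(
        device,
        delay_ms=rng.randint(10, 200),
        jitter_ms=rng.randint(1, 20),
        loss_pct=rng.randint(0, 5),
        reorder_pct=rng.randint(0, 25),
    )


if __name__ == "__main__":
    # Printed only; running it for real requires root on the target node.
    print(" ".join(random_netem_command("eth0", random.Random())))
```

Returning a list rather than a single string makes the command safe to pass to a process runner without shell quoting concerns.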

Performance Regression

▪ Set of Predefined Workloads & Setups:

○ Write

○ Read

○ Mixed

○ Customer Workloads

▪ Storing Results (Op/s, Throughput, Latency) in ElasticSearch

▪ Master Daily Regression Suite - Automatically Compare Results with a Previous Build & “Best” Build

▪ Release Regression Suite - Automatically Compare Results with Previous Releases (including RCs)
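
A comparison against the previous build and the "best" build might be sketched as below. This is a simplified illustration for a throughput metric, where higher is better; the function name, the 5% tolerance, and the sample numbers are assumptions, and a latency check would flip the sign of the comparison.

```python
def compare_throughput(current, previous, best, tolerance=0.05):
    """Compare a build's op/s against the previous and best builds.

    Returns, per reference build, the relative change and whether the drop
    exceeds `tolerance`.
    """
    report = {}
    for label, reference in (("previous", previous), ("best", best)):
        change = (current - reference) / reference
        report[label] = {
            "change": round(change, 3),
            "regression": change < -tolerance,
        }
    return report


if __name__ == "__main__":
    # Invented op/s figures, for illustration only.
    report = compare_throughput(current=90000.0, previous=100000.0, best=110000.0)
    for label, entry in report.items():
        print(label, entry)
```

Comparing against the best build as well as the previous one helps catch slow drifts, where each daily build is only slightly worse than the one before it.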

Performance Regression

Test-Write - Total Op rate (op/s) by Release:

Performance Regression

Test-Write - 99th Percentile Latency (ms) by Release:

Large Scale Tests

▪ Clusters of 100s of Nodes

▪ 10s of TB DataSets

▪ Multi-Core Scylla nodes

▪ Many sstables

Sample of a 101-node Scylla cluster running on AWS.

On QA Roadmap

Longevity:

▪ Embed CharybdeFS (fault injection FS) in Longevity

▪ Extend workload types

▪ Two+ Nemesis in Parallel

▪ Adding more “Sudden Death” Types of Nemesis

▪ Enable “sstables integrity checker”

Load & Scale:

▪ XXL Cluster Sizes (1000+ nodes)

▪ Enhance Load Testing to More Server Dimensions (Network, Disk)

On QA Roadmap

Performance:

▪ Add more “Real World Workloads” to Daily Regressions

▪ Performance Impact Per Operation (e.g. repair, majorCompaction)

▪ Collecting Latency Histograms for Various Load Types

3rd Party Integration:

▪ Spark & Titan Integration Suites

▪ Java & Golang Driver Integration Suites

Tools & Infrastructure:

▪ Enhance auto analysis based on Statistics in ElasticSearch

▪ Running SCT using an Existing Env

THANK YOU

[email protected]

Please stay in touch

Any questions?