Cassandra at Teads
TRANSCRIPT
Cassandra @ Lyon Cassandra Users
Romain Hardouin - Cassandra architect @ Teads
2017-02-16
Cassandra @ Teads
I. About Teads
II. Architecture
III. Provisioning
IV. Monitoring & alerting
V. Tuning
VI. Tools
VII. C’est la vie
VIII. A light fork
I . About Teads
Teads is the inventor of native video advertising
With inRead, an award-winning format*
*IPA Media owner survey 2014, IAB recognized format
27 offices in 21 countries
500+ global employees
1.2B users global reach
90+ R&D employees
Teads growth: tracking events
Advertisers (to name a few)
Publishers (to name a few)
II. Architecture
Custom C* 2.1.16
Backports & patches:
jvm.options from C* 3.0
logback from C* 2.2
Apache Cassandra version
Usage
Up to 940K qps: Writes vs Reads
Topology
2 regions: EU & US (3rd region APAC coming soon)
4 clusters, 7 DCs, 110 nodes
Up to 150 with temporary DCs
HP server blades: 1 cluster, 18 nodes
AWS nodes
i2.2xlarge: 8 vCPU, 61 GB RAM, 2 x 800 GB attached SSD in RAID0
c3.4xlarge: 16 vCPU, 30 GB RAM, 2 x 160 GB attached SSD in RAID0
c4.4xlarge: 16 vCPU, 30 GB RAM, EBS 3.4 TB + 1 TB
AWS instance types
Tons of counters
Big Data, wide rows
Many billions of keys, LCS with TTL
20 x c4.4xlarge with SSD GP2
3.4 TB data: 10,000 IOPS ⇒ 16KB
1 TB commitlog: 3,000 IOPS ⇒ 16KB
25 tables: batch + real time
Temporary DC
Cheap storage, great for STCS
Snapshots (S3 backup)
No coupling between disks and CPU/RAM
High latency => high I/O wait
Throughput: 160 MB/s
Unsteady performance
More on EBS nodes
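The GP2 numbers above hang together: IOPS are counted in 16 KB operations, so the volume's throughput ceiling is roughly IOPS x block size. A quick sanity check (illustrative arithmetic only):

```python
# GP2 counts one I/O per 16 KB block, so effective throughput is
# roughly IOPS x block size.
def ebs_throughput_mb_s(iops, block_kb=16):
    return iops * block_kb / 1024  # MB/s

print(ebs_throughput_mb_s(10_000))  # data volume: 156.25, close to the 160 MB/s cap
print(ebs_throughput_mb_s(3_000))   # commitlog volume: 46.875
```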
Physical nodes
HP Apollo XL170r Gen9: 12 CPU Xeon @ 2.60GHz, 128 GB RAM, 3 x 1.5 TB high-end SSD in RAID0
Hardware nodes
For Big Data, supersedes EBS DC
DC/Cluster split
Instance type change
20 x i2.2xlarge → 20 x c3.4xlarge
Counters
Cheaper and more CPUs
Counters Rebuild
DC X DC Y
Workload isolation
20 x i2.2xlarge 20 x c3.4xlarge
Counters + Big Data
Counters
20 x c4.4xlarge
Big Data
EBS
Step 1: DC split
DC A DC B DC C
Rebuild +
Workload isolation
20 x c4.4xlarge
Big Data
EBS
Step 2: Cluster split
Big Data
AWS Direct Connect
Data model
“KISS” principle
No fancy stuff: no secondary index, no list/set/tuple, no UDT
III. Provisioning
Capistrano → Chef
Custom cookbooks: C*, C* tools, C* reaper, Datadog wrapper
Chef provisioning to spawn a cluster
Now
C* cookbook michaelklishin/cassandra-chef-cookbook + Teads custom wrapper
Terraform + Chef provisioner
Future
IV. Monitoring & alerting
Past
OpsCenter (Free)
Turnkey dashboards
Support is reactive
Main metrics only
Per-host graphs impossible with many hosts
Ring view
More than monitoring
Lots of metrics
Still lacks some metrics
Dashboard creation: no templates
Agent is heavy
Free version limitations:
Data stored in production cluster
Apache C* <= 2.1 only
DataStax OpsCenter
Free version (v5)
All metrics you want
Dashboard creation
Templating: TimeBoard vs ScreenBoard
Graph creation: aggregation, trend, rate, anomaly detection
No turnkey dashboards yet
May change: TLP templates
Additional fees if >350 metrics
We need to increase this limit for our use case
Now we can easily:
Find outliers
Compare a node to average
Compare two DCs
Explore a node’s metrics
Create overview dashboards
Create advanced dashboards for troubleshooting
Datadog’s cassandra.yaml
- include:
    bean_regex: org.apache.cassandra.metrics:type=ReadRepair,name=*
    attribute:
      - Count
- include:
    bean_regex: org.apache.cassandra.metrics:type=CommitLog,name=(WaitingOnCommit|WaitingOnSegmentAllocation)
    attribute:
      - Count
      - 99thPercentile
- include:
    bean: org.apache.cassandra.metrics:type=CommitLog,name=TotalCommitLogSize
- include:
    bean: org.apache.cassandra.metrics:type=ThreadPools,path=transport,scope=Native-Transport-Requests,name=MaxTasksQueued
    attribute:
      Value:
        alias: cassandra.ntr.MaxTasksQueued
ScreenBoard
TimeBoard
Example
Hints monitoring during maintenance on physical nodes
Storage
Streaming
Datadog Alerting
Down node
Exceptions
Commitlog size
High latency
High GC
High IO wait
High pendings
Many hints
Long thrift connections
Clock out of sync
Disk space
…
Don’t miss this one
Don’t forget
V. Tuning
Java 8
CMS → G1
cassandra-env.sh
-Dcassandra.max_queued_native_transport_requests=4096
-Dcassandra.fd_initial_value_ms=4000
-Dcassandra.fd_max_interval_ms=4000
GC logs enabled
-XX:MaxGCPauseMillis=200
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:G1HeapRegionSize=32m
-XX:G1HeapWastePercent=25
-XX:InitiatingHeapOccupancyPercent=?
-XX:ParallelGCThreads=#CPU
-XX:ConcGCThreads=#CPU
-XX:+ExplicitGCInvokesConcurrent
-XX:+ParallelRefProcEnabled
-XX:+UseCompressedOops
jvm.options
-XX:HeapDumpPath=<dir with enough free space>
-XX:ErrorFile=<custom dir>
-Djava.io.tmpdir=<custom dir>
-XX:-UseBiasedLocking
-XX:+UseTLAB
-XX:+ResizeTLAB
-XX:+PerfDisableSharedMem
-XX:+AlwaysPreTouch
...
Backport from C* 3.0
num_tokens: 256
native_transport_max_threads: 256 or 128
compaction_throughput_mb_per_sec: 64
concurrent_compactors: 4 or 2
concurrent_reads: 64
concurrent_writes: 128 or 64
concurrent_counter_writes: 128
hinted_handoff_throttle_in_kb: 10240
max_hints_delivery_threads: 6 or 4
memtable_cleanup_threshold: 0.6, 0.5 or 0.4
memtable_flush_writers: 4 or 2
trickle_fsync: true
trickle_fsync_interval_in_kb: 10240
dynamic_snitch_badness_threshold: 2.0
internode_compression: dc
AWS nodes
Heap: c3.4xlarge: 15 GB, i2.2xlarge: 24 GB
EBS volume != disk
compaction_throughput_mb_per_sec: 32
concurrent_compactors: 4
concurrent_reads: 32
concurrent_writes: 64
concurrent_counter_writes: 64
trickle_fsync_interval_in_kb: 1024
AWS nodes
Heap: c4.4xlarge: 15 GB
num_tokens: 8
initial_token: ...
native_transport_max_threads: 512
compaction_throughput_mb_per_sec: 128
concurrent_compactors: 4
concurrent_reads: 64
concurrent_writes: 128
concurrent_counter_writes: 128
hinted_handoff_throttle_in_kb: 10240
max_hints_delivery_threads: 6
memtable_cleanup_threshold: 0.4
memtable_flush_writers: 8
trickle_fsync: true
trickle_fsync_interval_in_kb: 10240
Hardware nodes
More on this later Heap: 24 GB
Why 8 tokens?
Better repair performance, important for Big Data
Evenly distributed tokens, stored in a Chef data bag
Hardware nodes
$ ./vnodes_token_generator.py --json --indent 2 --servers hosts_interleaved_racks.txt 4
{
  "192.168.1.1": "-9223372036854775808,-4611686018427387905,-2,4611686018427387901",
  "192.168.2.1": "-7686143364045646507,-3074457345618258604,1537228672809129299,6148914691236517202",
  "192.168.3.1": "-6148914691236517206,-1537228672809129303,3074457345618258600,7686143364045646503"
}
https://github.com/rhardouin/cassandra-scripts
Watch out! Know the drawbacks
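The idea behind the generator can be sketched in a few lines: spread servers × tokens_per_server points evenly over the Murmur3 ring and deal them out round-robin so consecutive tokens land on different racks. A minimal reimplementation of that math (illustrative sketch, not the actual script):

```python
# Evenly distributed, rack-interleaved vnode tokens over the Murmur3 ring,
# which spans [-2**63, 2**63 - 1]. Sketch of the generator's math only.
def interleaved_tokens(servers, tokens_per_server):
    total = len(servers) * tokens_per_server
    spacing = 2**64 // total                 # equal gaps over the whole ring
    tokens = {s: [] for s in servers}
    for i in range(total):
        # round-robin assignment interleaves tokens across servers/racks
        tokens[servers[i % len(servers)]].append(-2**63 + i * spacing)
    return tokens

servers = ["192.168.1.1", "192.168.2.1", "192.168.3.1"]
result = interleaved_tokens(servers, 4)
print(result["192.168.1.1"])
# [-9223372036854775808, -4611686018427387905, -2, 4611686018427387901]
```

This reproduces the sample output above: each server owns four evenly spaced tokens, and neighbouring tokens always belong to different servers.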
Small entries, lots of reads
compression = {'chunk_length_kb': '4', 'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
+ nodetool scrub (few GB)
Compression
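Why such a small chunk size: Cassandra decompresses a whole compression chunk to serve a single read, so for tiny entries the default 64 KB chunk multiplies read I/O. A back-of-the-envelope illustration (the 200-byte entry size is an assumed example, not from the talk):

```python
# Read amplification = chunk size / entry size: every point read
# decompresses one full chunk. Smaller chunks help small, read-heavy entries.
entry_bytes = 200  # assumed small entry for illustration
for chunk_kb in (64, 4):  # default vs the tuned value above
    amplification = chunk_kb * 1024 / entry_bytes
    print(f"{chunk_kb} KB chunk -> ~{amplification:.0f}x read amplification")
```

The trade-off is compression ratio and chunk-index memory, hence the "know the drawbacks" warning.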
Disabled on 2 small clusters
dynamic_snitch: false
Lower hop count
Dynamic Snitch
Client side latency
Dynamic Snitch
P95 / P75 / Mean
Which node to decommission?
Downscale
Clients
Scala apps: DataStax driver wrapper
Spark & Spark Streaming: DataStax Spark Cassandra Connector
DataStax driver policy
LatencyAwarePolicy → TokenAwarePolicy
LatencyAwarePolicy: hotspots due to premature node eviction
Needs thorough tuning and a steady workload
We dropped it
TokenAwarePolicy: shuffle replicas depending on CL
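Why shuffling replicas matters: without it, a token-aware policy always puts the same (primary) replica first in the query plan, so at CL ONE that node takes all the first-contact load for its ranges. A toy illustration of the effect (plain Python, not the DataStax driver itself):

```python
# Toy model of a token-aware query plan for one token range with RF=3.
import random
from collections import Counter

replicas = ["node1", "node2", "node3"]  # replicas owning some token range

def query_plan(shuffle):
    plan = list(replicas)
    if shuffle:
        random.shuffle(plan)  # spread first-contact load across replicas
    return plan

random.seed(42)
shuffled_first = Counter(query_plan(True)[0] for _ in range(9000))
unshuffled_first = Counter(query_plan(False)[0] for _ in range(9000))
print(unshuffled_first)  # every first contact hits node1 -> hotspot
print(shuffled_first)    # roughly 3000 per node
```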
For cross-region scheduled jobs
VPN between AWS regions
20 executors with 6GB RAM
output.consistency.level = (LOCAL_)ONE
output.concurrent.writes = 50
connection.compression = LZ4
Useless writes: 99% of empty unlogged batches on one DC
What an optimization!
VI. Tools
{Parallel SSH + cron} on steroids
Security
History: who/what/when/why
Output is kept
CQL migration
Rolling restart
Nodetool or JMX commands
Backup and snapshot jobs
“Job Scheduler & Runbook Automation”
We added a “comment” field
Scheduled range repair
Segments: up to 20,000 for TB tables
Hosted fork for C* 2.1
We will probably switch to TLP’s fork
We do not use incremental repairs
See fix in C* 4.0
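The segment count is a trade-off: more segments mean each validation and any resulting streaming covers less data, and a failed segment is cheap to retry. A rough sizing illustration (the 2 TB table size is an assumed example):

```python
# How big is one repair segment when a TB-scale table is split into
# 20,000 segments? (2 TB is an assumed example size.)
table_tb = 2
segments = 20_000
mb_per_segment = table_tb * 1024 * 1024 / segments
print(f"~{mb_per_segment:.0f} MB per segment")
```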
cassandra_snapshotter
Backup on S3
Scheduled with Rundeck
We created and use a fork
Some PRs merged upstream
Restore PR still to be merged
Logs management
"C* " and "out of sync"
"C* " and "new session: will sync" | count
...
Alerts on pattern
"C* " and "[ERROR]"
"C* " and "[WARN]" and not ( … )
...
VII. C’est la vie
Cassandra issues & failures
OS reboot… seems harmless, right?
Cassandra service enabled
Want a clue?
C* 2.0 + counters
Upgrade to C* 2.1 was a relief
Without any obvious reason
Upgrade 2.0 → 2.1
LCS cluster suffered: high load, growing pending compactions
Switch to off-heap memtables
Less GC => less load
Reduced client load
Better after sstable upgrade
Took days
Upgrade 2.0 → 2.1
Lots of NTR “All time blocked”
NTR queue undersized for our workload: 128 (hard-coded)
We added a property to test CASSANDRA-11363 and set the value higher and higher… up to 4096
NTR pool needs to be sized accordingly
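A toy model of the symptom (plain Python, not Cassandra's actual thread-pool code): with a bounded task queue and no headroom, every request that arrives while the queue is full shows up in the "All time blocked" counter.

```python
# Toy model of a bounded Native-Transport-Requests task queue: requests
# that find the queue full are counted as blocked (no consumer drains the
# queue in this sketch, so it only illustrates the capacity effect).
from queue import Queue, Full

def submit(tasks, queue_size):
    q = Queue(maxsize=queue_size)
    all_time_blocked = 0
    for t in range(tasks):
        try:
            q.put_nowait(t)
        except Full:
            all_time_blocked += 1  # request must wait or time out
    return all_time_blocked

print(submit(5000, 128))   # 4872 blocked with the old hard-coded size
print(submit(5000, 4096))  # 904 blocked after raising the limit
```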
After replacing nodes
DELETE FROM system.peers WHERE peer = '<replaced node>';
Used by DataStax Driver for auto discovery
AWS issues & failures
When you have to shoot, shoot, don’t talk!
The Good, the Bad and the Ugly
We’ll shoot your instance
We shot your instance
EBS volume latency spike
EBS volume unreachable
SSD
dstat -tnvlr
Hardware issues & failures
One SSD failed
CPUs suddenly became slow on one server
Smart Array Battery BIOS bug
Yup, not an SSD...
VIII. A light fork
Why a fork?
1. Need to add a patch ASAP: high blocked NTR (CASSANDRA-11363)
Requires deploying from source
2. Why not backport interesting tickets?
3. Why not add small features/fixes? Expose task queue length via JMX (CASSANDRA-12758)
You betcha!
A tiny fork
We keep it as small as possible to fit our needs
Even smaller when we upgrade to C* 3.0: backports will be obsolete
« Hey, if your fork is tiny, is it that useful? »
One example: repair
Backport of CASSANDRA-12580
Paulo Motta: log2 vs ln
Get this info without the DEBUG burden
Original fix
One-liner fix
The most impressive result for a set of tables:
Before: 23 days
With CASSANDRA-12580: 16 hours
Longest repair for a table: 2.5 days
Impossible to repair this table before the patch
Fits in gc_grace_seconds
It was a critical fix for us
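As I read the "log2 vs ln" note, the one-liner concerns the depth of the Merkle tree built during validation: using the natural log where base 2 was intended yields a much shallower tree, so each leaf hash covers a far larger token span and any mismatch streams far more data than necessary. A rough illustration of the magnitude (the 10^9 partition count is an assumed example):

```python
# Illustrative only: how much shallower a tree ln gives compared to log2
# for a large partition count, assuming depth ~ log(partitions).
import math

partitions = 10**9                       # assumed example size
depth_ln = int(math.log(partitions))     # natural log -> 20 levels
depth_log2 = int(math.log2(partitions))  # base 2      -> 29 levels

# each missing level doubles the token span covered by one leaf hash
overstream_factor = 2 ** (depth_log2 - depth_ln)
print(depth_ln, depth_log2, overstream_factor)  # 20 29 512
```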
It should have landed in 2.1.17 IMHO*
Repair is a mandatory operation in many use cases
Paulo already made the patch for 2.1
C* 2.1 is widely used
[*] Full post: http://www.mail-archive.com/[email protected]/msg49344.html
« Why bother with backports? Why not upgrade to 3.0? »
Because C* is critical for our business
We don’t need fancy stuff (SASI, MV, UDF, ...)
We just want a rock-solid, scalable DB
C* 2.1 does the job for the time being
We plan to upgrade to C* 3.0 in 2017
We will do thorough tests ;-)
« What about C* 2.2? »
C* 2.2 has some nice improvements:
Bootstrapping with LCS: send source sstable level [1]
Range movement causes CPU & performance impact [2]
Resumable bootstrap/rebuild streaming [3]
[1] CASSANDRA-7460
[2] CASSANDRA-9258
[3] CASSANDRA-8838, CASSANDRA-8942, CASSANDRA-10810
But the migration path 2.2 → 3.0 is risky
Just my opinion based on the users mailing list
DSE never used 2.2
Questions?
Thanks!
C* devs for their awesome work
Zenika for hosting this meetup
You for being here!