Cassandra at Teads
TRANSCRIPT
Cassandra @ Lyon Cassandra Users
Romain Hardouin - Cassandra architect @ Teads
2017-02-16
Cassandra @ Teads
I. About Teads
II. Architecture
III. Provisioning
IV. Monitoring & alerting
V. Tuning
VI. Tools
VII. C’est la vie
VIII. A light fork
I . About Teads
Teads is the inventor of native video advertising
With inRead, an award-winning format*
*IPA Media owner survey 2014, IAB recognized format
27 offices in 21 countries
500+ global employees
1.2B users global reach
90+ R&D employees
Teads growth: tracking events
Advertisers (to name a few)
Publishers (to name a few)
II. Architecture
Custom C* 2.1.16
Backports & patches:
jvm.options from C* 3.0
logback from C* 2.2
Apache Cassandra version
Usage
Up to 940K qps: Writes vs Reads
Topology
2 regions: EU & US (3rd region APAC coming soon)
4 clusters, 7 DCs, 110 nodes
Up to 150 with temporary DCs
HP server blades: 1 cluster, 18 nodes
AWS nodes
i2.2xlarge: 8 vCPU, 61 GB RAM, 2 x 800 GB attached SSD in RAID0
c3.4xlarge: 16 vCPU, 30 GB RAM, 2 x 160 GB attached SSD in RAID0
c4.4xlarge: 16 vCPU, 30 GB RAM, EBS 3.4 TB + 1 TB
AWS instance types
Tons of counters
Big Data, wide rows
Many billions of keys, LCS with TTL
20 x c4.4xlarge with SSD GP2
3.4 TB data: 10,000 IOPS ⇒ 16KB
1 TB commitlog: 3,000 IOPS ⇒ 16KB
25 tables: batch + real time
Temporary DC
Cheap storage, great for STCS
Snapshots (S3 backup)
No coupling between disks and CPU/RAM
High latency => high I/O wait
Throughput: 160 MB/s
Unsteady performance
More on EBS nodes
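The GP2 numbers above hang together: IOPS are counted in 16 KB operations, so the volume's throughput ceiling is roughly IOPS x block size. A quick sanity check (illustrative arithmetic only):

```python
# GP2 counts one I/O per 16 KB block, so effective throughput is
# roughly IOPS x block size.
def ebs_throughput_mb_s(iops, block_kb=16):
    return iops * block_kb / 1024  # MB/s

print(ebs_throughput_mb_s(10_000))  # data volume: 156.25, close to the 160 MB/s cap
print(ebs_throughput_mb_s(3_000))   # commitlog volume: 46.875
```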
Physical nodes
HP Apollo XL170r Gen9: 12 CPU Xeon @ 2.60GHz, 128 GB RAM, 3 x 1.5 TB high-end SSD in RAID0
Hardware nodes
For Big Data, supersedes EBS DC
DC/Cluster split
Instance type change
20 x i2.2xlarge → 20 x c3.4xlarge
Counters
Cheaper and more CPUs
Counters Rebuild
DC X DC Y
Workload isolation
20 x i2.2xlarge 20 x c3.4xlarge
Counters + Big Data
Counters
20 x c4.4xlarge
Big Data
EBS
Step 1: DC split
DC A DC B DC C
Rebuild +
Workload isolation
20 x c4.4xlarge
Big Data
EBS
Step 2: Cluster split
Big Data
AWS Direct Connect
Data model
“KISS” principle
No fancy stuff: no secondary index, no list/set/tuple, no UDT
III. Provisioning
Capistrano → Chef
Custom cookbooks: C*, C* tools, C* reaper, Datadog wrapper
Chef provisioning to spawn a cluster
Now
C* cookbook michaelklishin/cassandra-chef-cookbook + Teads custom wrapper
Terraform + Chef provisioner
Future
IV. Monitoring & alerting
Past
OpsCenter (Free)
Turnkey dashboards
Support is reactive
Main metrics only
Per-host graphs impossible with many hosts
Ring view
More than monitoring
Lots of metrics
Still lacks some metrics
Dashboard creation: no templates
Agent is heavy
Free version limitations:
Data stored in production cluster
Apache C* <= 2.1 only
DataStax OpsCenter
Free version (v5)
All metrics you want
Dashboard creation
Templating: TimeBoard vs ScreenBoard
Graph creation: aggregation, trend, rate, anomaly detection
No turnkey dashboards yet
May change: TLP templates
Additional fees if >350 metrics
We need to increase this limit for our use case
Now we can easily:
Find outliers
Compare a node to average
Compare two DCs
Explore a node’s metrics
Create overview dashboards
Create advanced dashboards for troubleshooting
Datadog’s cassandra.yaml
- include:
    bean_regex: org.apache.cassandra.metrics:type=ReadRepair,name=*
    attribute:
      - Count
- include:
    bean_regex: org.apache.cassandra.metrics:type=CommitLog,name=(WaitingOnCommit|WaitingOnSegmentAllocation)
    attribute:
      - Count
      - 99thPercentile
- include:
    bean: org.apache.cassandra.metrics:type=CommitLog,name=TotalCommitLogSize
- include:
    bean: org.apache.cassandra.metrics:type=ThreadPools,path=transport,scope=Native-Transport-Requests,name=MaxTasksQueued
    attribute:
      Value:
        alias: cassandra.ntr.MaxTasksQueued
ScreenBoard
TimeBoard
Example
Hints monitoring during maintenance on physical nodes
Storage
Streaming
Datadog Alerting
Down node
Exceptions
Commitlog size
High latency
High GC
High IO wait
High pendings
Many hints
Long thrift connections
Clock out of sync
Disk space
…
Don’t miss this one
Don’t forget
V. Tuning
Java 8
CMS → G1
cassandra-env.sh
-Dcassandra.max_queued_native_transport_requests=4096
-Dcassandra.fd_initial_value_ms=4000
-Dcassandra.fd_max_interval_ms=4000
GC logs enabled
-XX:MaxGCPauseMillis=200
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:G1HeapRegionSize=32m
-XX:G1HeapWastePercent=25
-XX:InitiatingHeapOccupancyPercent=?
-XX:ParallelGCThreads=#CPU
-XX:ConcGCThreads=#CPU
-XX:+ExplicitGCInvokesConcurrent
-XX:+ParallelRefProcEnabled
-XX:+UseCompressedOops
jvm.options
-XX:HeapDumpPath=<dir with enough free space>
-XX:ErrorFile=<custom dir>
-Djava.io.tmpdir=<custom dir>
-XX:-UseBiasedLocking
-XX:+UseTLAB
-XX:+ResizeTLAB
-XX:+PerfDisableSharedMem
-XX:+AlwaysPreTouch
...
Backport from C* 3.0
num_tokens: 256
native_transport_max_threads: 256 or 128
compaction_throughput_mb_per_sec: 64
concurrent_compactors: 4 or 2
concurrent_reads: 64
concurrent_writes: 128 or 64
concurrent_counter_writes: 128
hinted_handoff_throttle_in_kb: 10240
max_hints_delivery_threads: 6 or 4
memtable_cleanup_threshold: 0.6, 0.5 or 0.4
memtable_flush_writers: 4 or 2
trickle_fsync: true
trickle_fsync_interval_in_kb: 10240
dynamic_snitch_badness_threshold: 2.0
internode_compression: dc
AWS nodes
Heap: c3.4xlarge: 15 GB, i2.2xlarge: 24 GB
EBS volume != disk
compaction_throughput_mb_per_sec: 32
concurrent_compactors: 4
concurrent_reads: 32
concurrent_writes: 64
concurrent_counter_writes: 64
trickle_fsync_interval_in_kb: 1024
AWS nodes
Heap: c4.4xlarge: 15 GB
num_tokens: 8
initial_token: ...
native_transport_max_threads: 512
compaction_throughput_mb_per_sec: 128
concurrent_compactors: 4
concurrent_reads: 64
concurrent_writes: 128
concurrent_counter_writes: 128
hinted_handoff_throttle_in_kb: 10240
max_hints_delivery_threads: 6
memtable_cleanup_threshold: 0.4
memtable_flush_writers: 8
trickle_fsync: true
trickle_fsync_interval_in_kb: 10240
Hardware nodes
More on this later Heap: 24 GB
Why 8 tokens?
Better repair performance, important for Big Data
Evenly distributed tokens, stored in a Chef data bag
Hardware nodes
$ ./vnodes_token_generator.py --json --indent 2 --servers hosts_interleaved_racks.txt 4
{
  "192.168.1.1": "-9223372036854775808,-4611686018427387905,-2,4611686018427387901",
  "192.168.2.1": "-7686143364045646507,-3074457345618258604,1537228672809129299,6148914691236517202",
  "192.168.3.1": "-6148914691236517206,-1537228672809129303,3074457345618258600,7686143364045646503"
}
https://github.com/rhardouin/cassandra-scripts
Watch out! Know the drawbacks
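The idea behind the generator can be sketched in a few lines: spread servers × tokens_per_server points evenly over the Murmur3 ring and deal them out round-robin so consecutive tokens land on different racks. A minimal reimplementation of that math (illustrative sketch, not the actual script):

```python
# Evenly distributed, rack-interleaved vnode tokens over the Murmur3 ring,
# which spans [-2**63, 2**63 - 1]. Sketch of the generator's math only.
def interleaved_tokens(servers, tokens_per_server):
    total = len(servers) * tokens_per_server
    spacing = 2**64 // total                 # equal gaps over the whole ring
    tokens = {s: [] for s in servers}
    for i in range(total):
        # round-robin assignment interleaves tokens across servers/racks
        tokens[servers[i % len(servers)]].append(-2**63 + i * spacing)
    return tokens

servers = ["192.168.1.1", "192.168.2.1", "192.168.3.1"]
result = interleaved_tokens(servers, 4)
print(result["192.168.1.1"])
# [-9223372036854775808, -4611686018427387905, -2, 4611686018427387901]
```

This reproduces the sample output above: each server owns four evenly spaced tokens, and neighbouring tokens always belong to different servers.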
Small entries, lots of reads
compression = {'chunk_length_kb': '4', 'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
+ nodetool scrub (few GB)
Compression
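Why such a small chunk size: Cassandra decompresses a whole compression chunk to serve a single read, so for tiny entries the default 64 KB chunk multiplies read I/O. A back-of-the-envelope illustration (the 200-byte entry size is an assumed example, not from the talk):

```python
# Read amplification = chunk size / entry size: every point read
# decompresses one full chunk. Smaller chunks help small, read-heavy entries.
entry_bytes = 200  # assumed small entry for illustration
for chunk_kb in (64, 4):  # default vs the tuned value above
    amplification = chunk_kb * 1024 / entry_bytes
    print(f"{chunk_kb} KB chunk -> ~{amplification:.0f}x read amplification")
```

The trade-off is compression ratio and chunk-index memory, hence the "know the drawbacks" warning.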
Disabled on 2 small clusters
dynamic_snitch: false
Lower hop count
Dynamic Snitch
Client side latency
Dynamic Snitch
P95 / P75 / Mean
Which node to decommission?
Downscale
Clients
Scala apps: DataStax driver wrapper
Spark & Spark Streaming: DataStax Spark Cassandra Connector
DataStax driver policy
LatencyAwarePolicy → TokenAwarePolicy
LatencyAwarePolicy: hotspots due to premature node eviction
Needs thorough tuning and a steady workload
We dropped it
TokenAwarePolicy: shuffle replicas depending on CL
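Why shuffling replicas matters: without it, a token-aware policy always puts the same (primary) replica first in the query plan, so at CL ONE that node takes all the first-contact load for its ranges. A toy illustration of the effect (plain Python, not the DataStax driver itself):

```python
# Toy model of a token-aware query plan for one token range with RF=3.
import random
from collections import Counter

replicas = ["node1", "node2", "node3"]  # replicas owning some token range

def query_plan(shuffle):
    plan = list(replicas)
    if shuffle:
        random.shuffle(plan)  # spread first-contact load across replicas
    return plan

random.seed(42)
shuffled_first = Counter(query_plan(True)[0] for _ in range(9000))
unshuffled_first = Counter(query_plan(False)[0] for _ in range(9000))
print(unshuffled_first)  # every first contact hits node1 -> hotspot
print(shuffled_first)    # roughly 3000 per node
```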
For cross-region scheduled jobs
VPN between AWS regions
20 executors with 6GB RAM
output.consistency.level = (LOCAL_)ONE
output.concurrent.writes = 50
connection.compression = LZ4
Useless writes: 99% of empty unlogged batches on one DC
What an optimization!
VI. Tools
{Parallel SSH + cron} on steroids
Security
History: who/what/when/why
Output is kept
CQL migration
Rolling restart
Nodetool or JMX commands
Backup and snapshot jobs
“Job Scheduler & Runbook Automation”
We added a “comment” field
Scheduled range repair
Segments: up to 20,000 for TB tables
Hosted fork for C* 2.1
We will probably switch to TLP’s fork
We do not use incremental repairs
See fix in C* 4.0
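The segment count is a trade-off: more segments mean each validation and any resulting streaming covers less data, and a failed segment is cheap to retry. A rough sizing illustration (the 2 TB table size is an assumed example):

```python
# How big is one repair segment when a TB-scale table is split into
# 20,000 segments? (2 TB is an assumed example size.)
table_tb = 2
segments = 20_000
mb_per_segment = table_tb * 1024 * 1024 / segments
print(f"~{mb_per_segment:.0f} MB per segment")
```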
cassandra_snapshotter
Backup on S3
Scheduled with Rundeck
We created and use a fork
Some PRs merged upstream
Restore PR still to be merged
Logs management
"C* " and "out of sync"
"C* " and "new session: will sync" | count
...
Alerts on pattern
"C* " and "[ERROR]"
"C* " and "[WARN]" and not ( … )
...
VII. C’est la vie
Cassandra issues & failures
OS reboot… seems harmless, right?
Cassandra service enabled
Want a clue?
C* 2.0 + counters
Upgrade to C* 2.1 was a relief
Without any obvious reason
Upgrade 2.0 → 2.1
LCS cluster suffered: high load, growing pending compactions
Switch to off-heap memtables
Less GC => less load
Reduced client load
Better after sstable upgrade
Took days
Upgrade 2.0 → 2.1
Lots of NTR “All time blocked”
NTR queue undersized for our workload: 128 (hard-coded)
We added a property to test CASSANDRA-11363 and set the value higher and higher… up to 4096
NTR pool needs to be sized accordingly
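A toy model of the symptom (plain Python, not Cassandra's actual thread-pool code): with a bounded task queue and no headroom, every request that arrives while the queue is full shows up in the "All time blocked" counter.

```python
# Toy model of a bounded Native-Transport-Requests task queue: requests
# that find the queue full are counted as blocked (no consumer drains the
# queue in this sketch, so it only illustrates the capacity effect).
from queue import Queue, Full

def submit(tasks, queue_size):
    q = Queue(maxsize=queue_size)
    all_time_blocked = 0
    for t in range(tasks):
        try:
            q.put_nowait(t)
        except Full:
            all_time_blocked += 1  # request must wait or time out
    return all_time_blocked

print(submit(5000, 128))   # 4872 blocked with the old hard-coded size
print(submit(5000, 4096))  # 904 blocked after raising the limit
```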
After replacing nodes
DELETE FROM system.peers WHERE peer = '<replaced node>';
Used by DataStax Driver for auto discovery
AWS issues & failures
When you have to shoot, shoot, don’t talk!
The Good, the Bad and the Ugly
We’ll shoot your instance
We shot your instance
EBS volume latency spike
EBS volume unreachable
SSD
dstat -tnvlr
Hardware issues & failures
One SSD failed
CPUs suddenly became slow on one server
Smart Array Battery BIOS bug
Yup, not an SSD...
VIII. A light fork
Why a fork?
1. Need to add a patch ASAP: high blocked NTR (CASSANDRA-11363)
Requires deploying from source
2. Why not backport interesting tickets?
3. Why not add small features/fixes? Expose task queue length via JMX (CASSANDRA-12758)
You betcha!
A tiny fork
We keep it as small as possible to fit our needs
Even smaller when we upgrade to C* 3.0: backports will be obsolete
« Hey, if your fork is tiny, is it that useful? »
One example: repair
Backport of CASSANDRA-12580
Paulo Motta: log2 vs ln
Get this info without the DEBUG burden
Original fix
One-liner fix
The most impressive result for a set of tables:
Before: 23 days
With CASSANDRA-12580: 16 hours
Longest repair for a table: 2.5 days
Impossible to repair this table before the patch
Fits in gc_grace_seconds
It was a critical fix for us
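As I read the "log2 vs ln" note, the one-liner concerns the depth of the Merkle tree built during validation: using the natural log where base 2 was intended yields a much shallower tree, so each leaf hash covers a far larger token span and any mismatch streams far more data than necessary. A rough illustration of the magnitude (the 10^9 partition count is an assumed example):

```python
# Illustrative only: how much shallower a tree ln gives compared to log2
# for a large partition count, assuming depth ~ log(partitions).
import math

partitions = 10**9                       # assumed example size
depth_ln = int(math.log(partitions))     # natural log -> 20 levels
depth_log2 = int(math.log2(partitions))  # base 2      -> 29 levels

# each missing level doubles the token span covered by one leaf hash
overstream_factor = 2 ** (depth_log2 - depth_ln)
print(depth_ln, depth_log2, overstream_factor)  # 20 29 512
```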
It should have landed in 2.1.17 IMHO*
Repair is a mandatory operation in many use cases
Paulo already made the patch for 2.1
C* 2.1 is widely used
[*] Full post: http://www.mail-archive.com/[email protected]/msg49344.html
« Why bother with backports? Why not upgrade to 3.0? »
Because C* is critical for our business
We don’t need fancy stuff (SASI, MV, UDF, ...)
We just want a rock-solid, scalable DB
C* 2.1 does the job for the time being
We plan to upgrade to C* 3.0 in 2017
We will do thorough tests ;-)
« What about C* 2.2? »
C* 2.2 has some nice improvements:
Bootstrapping with LCS: send source sstable level [1]
Range movement causes CPU & performance impact [2]
Resumable bootstrap/rebuild streaming [3]
[1] CASSANDRA-7460
[2] CASSANDRA-9258
[3] CASSANDRA-8838, CASSANDRA-8942, CASSANDRA-10810
But the migration path 2.2 → 3.0 is risky
Just my opinion based on the users mailing list
DSE never used 2.2
Questions?
Thanks!
C* devs for their awesome work
Zenika for hosting this meetup
You for being here!