
Page 1: Using Time Window Compaction Strategy For Time Series Workloads

Jeff Jirsa

Using TimeWindowCompactionStrategy for Time Series Workloads

Page 2: Using Time Window Compaction Strategy For Time Series Workloads


1. Who Am I?
2. LSM DBs
3. TWCS
4. The 1%
5. Things Nobody Else Told You About Compaction
6. Q&A

Page 3: Using Time Window Compaction Strategy For Time Series Workloads

Who Am I? (Or: Why You Should Believe Me)

Page 4: Using Time Window Compaction Strategy For Time Series Workloads


Page 5: Using Time Window Compaction Strategy For Time Series Workloads


Page 6: Using Time Window Compaction Strategy For Time Series Workloads


We’ve Spent Some Time With Time Series

• We keep some data from sensors for a fixed time period

  • Processes
  • DNS queries
  • Executables created
• It's a LOT of data
  • 2015 Talk: One million writes per second with 60 nodes
  • Multiple Petabytes Per Cluster

Page 7: Using Time Window Compaction Strategy For Time Series Workloads


We’ve Spent Some Time With Time Series

• TWCS was written to solve problems CrowdStrike faced in production

• It wasn't meant to be clever; it was meant to be efficient and easy to reason about

• I'm on the pager rotation; this directly impacts my quality of life

Page 8: Using Time Window Compaction Strategy For Time Series Workloads


We’ve Spent Some Time With Time Series

• I have better things to do on my off time


Page 11: Using Time Window Compaction Strategy For Time Series Workloads

Log Structured – Database, Not Cabins
If You're Going To Use Cassandra, Let's Make Sure We Know How It Works

Page 12: Using Time Window Compaction Strategy For Time Series Workloads


Log Structured Merge Trees

• Cassandra write path:
  1. First the Commitlog
  2. Then the Memtable
  3. Eventually flushed to an SSTable
• Each SSTable is written exactly once
• Over time, Cassandra combines those data files:
  • Duplicate cells are merged
  • Obsolete data is purged
• On reads, Cassandra searches for data in each SSTable, merging any existing records and returning the result

Page 13: Using Time Window Compaction Strategy For Time Series Workloads


Real World, Real Problems

• If you can't keep compaction happy, your cluster will never be happy
• The write path relies on efficient flushing
  • If your compaction strategy falls behind, you can block flushes (CASSANDRA-9882)
• The read path relies on efficient merging
  • If your compaction strategy falls behind, each read may touch hundreds or thousands of sstables
• IO-bound clusters are common, even with SSDs
  • Dynamic Snitch - latency + "severity"

Page 14: Using Time Window Compaction Strategy For Time Series Workloads


What We Hope For

• We accept that we need to compact sstables sometimes, but we want to do it when we have a good reason
• Good reasons:
  • Data has been deleted and we want to reclaim space
  • Data has been overwritten and we want to avoid merges on reads
  • Our queries span multiple sstables, and we're having to touch a lot of sstables on each read
• Bad Reasons:
  • We hit some magic size threshold and we want to join two non-overlapping files together
• We're aiming for a situation where the merge on read is tolerable
  • Bloom filter is your friend – let's read from as few sstables as possible
• We want as few tombstones as possible (this includes expired data)
  • Tombstones create garbage, garbage creates sadness

Page 15: Using Time Window Compaction Strategy For Time Series Workloads

Use The Defaults?
It's Not Just Naïve, It's Also Expensive

Page 16: Using Time Window Compaction Strategy For Time Series Workloads


The Basics: SizeTieredCompactionStrategy

• Each time min_threshold (default 4) files of the same size appear, combine them into a new file
• Over time, you'll naturally end up with a distribution of old data in large files and new data in small files
• Deleted data in large files stays on disk longer than desired because those files are very rarely compacted
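For reference, this is roughly what the configuration being described looks like in CQL (a sketch; the keyspace and table names are hypothetical, and the min_threshold shown is the default):

    ALTER TABLE metrics.stcs_example
    WITH compaction = {
        'class': 'SizeTieredCompactionStrategy',
        'min_threshold': '4'   -- compact once 4 similarly-sized sstables accumulate
    };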

Page 17: Using Time Window Compaction Strategy For Time Series Workloads


SizeTieredCompactionStrategy


Page 18: Using Time Window Compaction Strategy For Time Series Workloads


SizeTieredCompactionStrategy

If each of the smallest blocks represents 1 day of data, and each write had a 90 day TTL, when do you actually delete files and reclaim disk space?

Page 19: Using Time Window Compaction Strategy For Time Series Workloads


SizeTieredCompactionStrategy

• Expensive IO:
  • Far more writes than necessary; you'll recompact old data weeks after it was written
  • Reads may touch a ton of sstables – we have no control over how data will be arranged on disk
• Expensive Operationally:
  • Expired data doesn't get dropped until you happen to re-compact the table it's in
  • You have to keep up to 50% spare disk

Page 20: Using Time Window Compaction Strategy For Time Series Workloads

TWCS
Because Everything Else Made Me Sad

Page 21: Using Time Window Compaction Strategy For Time Series Workloads


Kübler-Ross Stages of Grief

• Denial
• Anger
• Bargaining
• Depression
• Acceptance

Page 22: Using Time Window Compaction Strategy For Time Series Workloads


Sad Operator: Stages of Grief

• Denial
  • STCS and LCS aren't gonna work, but DTCS will fix it
• Anger
  • DTCS seemed to be the fix, and it didn't work, either
• Bargaining
  • What if we tweak all these sub-properties? What if we just fix things one at a time?
• Depression
  • Still SOL at ~hundred-node scale
  • Can we get through this? Is it time for a therapist's couch?

Page 23: Using Time Window Compaction Strategy For Time Series Workloads


Page 24: Using Time Window Compaction Strategy For Time Series Workloads


Sad Operator: Stages of Grief

• Acceptance
  • Compaction is pluggable; we'll write it ourselves
  • Designed to be simple and efficient
  • Group sstables into logical buckets
  • STCS in the newest time window
  • No more confusing options, just Window Size + Window Unit
    • Base time seconds? Max age days? Overloading min_threshold for grouping? Not today.
    • "12 Hours", "3 Days", "6 Minutes"
  • Configure buckets so you have 20-30 buckets on disk (see the CQL sketch below)
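Those two options map directly onto compaction sub-properties in CQL. A minimal sketch (the keyspace, table, and columns are hypothetical):

    CREATE TABLE metrics.twcs_example (
        sensor_id uuid,
        event_time timestamp,
        value double,
        PRIMARY KEY (sensor_id, event_time)
    ) WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'HOURS',   -- MINUTES, HOURS, or DAYS
        'compaction_window_size': '12'       -- i.e., "12 Hours"
    };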

Page 25: Using Time Window Compaction Strategy For Time Series Workloads


That's It.

• 90 day TTL
  • Unit = Days, # = 3
  • Each file on disk spans 3 days of data (except the first window); expect ~30 windows + the first window
  • Expect to have at least 3 days of extra data on disk*
• 2 hour TTL
  • Unit = Minutes, # = 10
  • Each file on disk represents 10 minutes of data; expect 12-13 windows + the first window
  • Expect to have at least 10 minutes of extra data on disk*
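In CQL, with default_time_to_live carrying the TTL, those two configurations look roughly like this (table names are hypothetical; TTLs are in seconds):

    -- 90 day TTL, 3-day windows: ~30 windows on disk plus the active one
    ALTER TABLE metrics.long_retention_events
    WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': '3'
    } AND default_time_to_live = 7776000;   -- 90 * 86400

    -- 2 hour TTL, 10-minute windows: ~12-13 windows plus the active one
    ALTER TABLE metrics.short_retention_events
    WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'MINUTES',
        'compaction_window_size': '10'
    } AND default_time_to_live = 7200;      -- 2 * 3600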

Page 26: Using Time Window Compaction Strategy For Time Series Workloads


Page 27: Using Time Window Compaction Strategy For Time Series Workloads


Example: IO (Real Cluster)

Page 28: Using Time Window Compaction Strategy For Time Series Workloads


Example: Load (Real Cluster)

Page 29: Using Time Window Compaction Strategy For Time Series Workloads


The Only Real Optimization You Need

• Align your partition keys to your TWCS windows
  • Bloom filter reads will only touch a single sstable
  • Deletion gets much easier because you get rid of overlapping ranges
  • Bucketing partitions keeps partition sizes reasonable (< 100MB), which saves you a ton of GC pressure
• If you're using a 30 day TTL and 1 day TWCS windows, put a "day_of_year" field into the partition key (sketched below)
  • Use parallel async reads to read more than one day at a time
  • Spread reads across multiple nodes
  • Each node should touch exactly 1 sstable on disk (watch timezones)
  • That sstable is probably hot for all partitions, so it'll be in page cache
• Extrapolate for other windows (you may have to chunk things up into 3 day buckets or 30 minute buckets, but it'll be worth it)
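A minimal sketch of that layout in CQL, assuming 1-day windows and a 30 day TTL (the keyspace, table, and columns are illustrative, not from the talk):

    CREATE TABLE metrics.events_by_day (
        sensor_id uuid,
        day int,                -- day-of-year bucket, aligned to the 1-day TWCS window
        event_time timestamp,
        value double,
        PRIMARY KEY ((sensor_id, day), event_time)
    ) WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': '1'
    } AND default_time_to_live = 2592000;   -- 30 days, in seconds

    -- Each query targets one (sensor_id, day) partition, and therefore ~one sstable;
    -- issue one of these per day, in parallel and async, to span a wider range
    SELECT event_time, value
    FROM metrics.events_by_day
    WHERE sensor_id = 5132b130-ae79-11e6-80f5-76304dec7eb7 AND day = 207;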

Page 30: Using Time Window Compaction Strategy For Time Series Workloads

What We’ve Discussed Is Good Enough For 99% Of Time Series Use Cases

But let’s make sure the 1% knows what’s up

Page 31: Using Time Window Compaction Strategy For Time Series Workloads


Out Of Order Writes

• If we mix write timestamps with "USING TIMESTAMP"…
  • Life isn't over; it just potentially blocks expiration
• Goal:
  • Avoid mixing timestamps within any given sstable
• Options:
  • Don't mix in the memtable
  • Don't use the memtable

Page 32: Using Time Window Compaction Strategy For Time Series Workloads


Out Of Order Writes

• Don't commingle in the memtable
• If we have a queue-like workflow, consider the following option:
  1. Pause the Kafka consumer / Celery worker / etc.
  2. "nodetool flush"
  3. Write old data with "USING TIMESTAMP" (sketched below)
  4. "nodetool flush"
  5. Resume consumers/workers for new data
• Positives: No commingled data
• Negatives: Have to pause ingestion
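Step 3 is an ordinary CQL write with an explicit timestamp. A sketch against the hypothetical events_by_day table from earlier (the values are illustrative; USING TIMESTAMP takes microseconds since the epoch):

    -- Backfill an old event between the two flushes
    INSERT INTO metrics.events_by_day (sensor_id, day, event_time, value)
    VALUES (5132b130-ae79-11e6-80f5-76304dec7eb7, 121, '2016-04-30 12:00:00+0000', 42.0)
    USING TIMESTAMP 1462017600000000;   -- microseconds, matching the old event's time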

Page 33: Using Time Window Compaction Strategy For Time Series Workloads


Out Of Order Writes

• Don't use the memtable
• CQLSSTableWriter
  • Yuki has a great blog post at: http://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated
  • Write sstables offline
  • Stream them in with sstableloader
• Positives: No commingled data, no pausing ingestion, incredibly fast, easy to parallelize
• Negatives: Requires code (but it's not difficult code; your ops team should be able to do it)

Page 34: Using Time Window Compaction Strategy For Time Series Workloads


Per-Window Major Compaction

• At the end of each window, you're going to see a major compaction for all sstables in that window

Page 35: Using Time Window Compaction Strategy For Time Series Workloads


Per-Window Major Compaction

• At the end of each window, you're going to see a major compaction for all sstables in that window
• Expect a jump in CPU usage, disk usage, and disk IO
• The DURATION of these increases depends on your write rate and window size
  • Larger windows will take longer to compact because you'll have more data on disk
• If this is a problem for you, you're under-provisioned

Page 36: Using Time Window Compaction Strategy For Time Series Workloads

Per-Window Major Compaction


CPU Usage

During the end-of-window major compaction, CPU usage on ALL OF THE NODES (in all DCs) will increase at the same time.

This will likely impact your read latency.

When you validate TWCS, make sure your application behaves well at this transition.

We can surely fix this; we just need to find a way to do it without cluttering the options.

Page 37: Using Time Window Compaction Strategy For Time Series Workloads

Per-Window Major Compaction


Disk Usage

During the daily major, disk usage on ALL OF THE NODES will increase at the same time.

Page 38: Using Time Window Compaction Strategy For Time Series Workloads

Per-Window Major Compaction


Disk Usage

In some cases, you'll see the window major compaction run twice because of the timing of flush. You can manually flush (via cron) to work around it if it bothers you.

This is on my list of things to fix.

There's no reason to do two majors; better to either delay the first major until we're sure it's time, or keep a record that we've already done a window major compaction and skip the second one.

Page 39: Using Time Window Compaction Strategy For Time Series Workloads

There Are Things Nobody Told You About Compaction

The More You Know…

Page 40: Using Time Window Compaction Strategy For Time Series Workloads


Things Nobody Told You About Compaction

• Compaction Impacts Read Performance More Than Write Performance
  • Typical advice is to use LCS if you need fast reads and STCS if you need fast writes
  • LCS optimizes reads by limiting the # of potential SSTables you'll need to touch on the read path
  • The goal of LCS (fast reads / low latency) and the act of keeping levels are in competition with each other
  • It takes a LOT of IO for LCS to keep up, and it's generally not a great fit for most time series use cases
  • LCS will negatively impact your read latencies in any sufficiently busy cluster

Page 41: Using Time Window Compaction Strategy For Time Series Workloads


Things Nobody Told You About Compaction

• You can change the compaction strategy on a single node using JMX
  • The change won't persist through restarts, but it's often a great way to test / canary a strategy before rolling it out to the full cluster
• You can change other useful things via JMX, too. No need to restart to change:
  • Compaction threads
  • Compaction throughput
• If you see an IO impact from changing compaction strategies, you can slow-roll the change out to the cluster using JMX.

Page 42: Using Time Window Compaction Strategy For Time Series Workloads


Things Nobody Told You About Compaction

• Compaction Task Prioritization

Page 43: Using Time Window Compaction Strategy For Time Series Workloads


Things Nobody Told You About Compaction

• Compaction Task Prioritization
  • Just kidding; stuff's going to run in an order you don't like.
  • There's nothing you can do about it (yet)
• If you run Cassandra long enough, you'll eventually OOM or run a box out of disk doing cleanup, bootstrap, validation compactions, or similar
  • We run watchdog daemons that watch for low disk/RAM conditions and interrupt cleanups/compactions
  • Not provided, but it's a 5-line shell script
• 2.0 -> 2.1 was a huge change
  • Cleanup / scrub used to be single threaded
  • Someone thought it was a good idea to make it parallel (CASSANDRA-5547)
  • Now cleanup/scrub blocks normal sstable compactions
  • If you run parallel operations, be prepared to interrupt and restart them if you run out of disk or RAM, or if your sstable count gets too high (CASSANDRA-11179). Consider using -seq or userDefinedCleanup (JMX)
  • CASSANDRA-11218 (priority queue)

Page 44: Using Time Window Compaction Strategy For Time Series Workloads


Things Nobody Told You About Compaction

• "Fully Expired"
  • Cassandra is super conservative
  • Find the global minTimestamp of any overlapping sstable, compacting sstable, and memtables
    • This is the oldest "live" data
  • Build a list of "candidates" that we think are fully expired
  • See if the candidates are completely older than that global minTimestamp
• Operators are not as conservative
  • CASSANDRA-7019 / Philip Thompson's talk from yesterday
• When you're running out of disk space, Cassandra's definition may seem silly
  • Any out-of-order write can "block" a lot of data from being deleted
  • Read repair, hints, whatever
  • This used to be hard to figure out; Cassandra now has `sstableexpiredblockers`

Page 45: Using Time Window Compaction Strategy For Time Series Workloads


Things Nobody Told You About Compaction

• Tombstone compaction sub-properties
  • Show of hands if you've ever set these on a real cluster

Page 46: Using Time Window Compaction Strategy For Time Series Workloads


Things Nobody Told You About Compaction

• Tombstone compaction sub-properties
• Cassandra has logic to try to eliminate mostly-expired sstables
• Three basic knobs:
  1. What % of the table must be tombstones for it to be worth compacting?
  2. How long has it been since that file was created?
  3. Should we try to compact the tombstones away even if we suspect it's not going to be successful?

Page 47: Using Time Window Compaction Strategy For Time Series Workloads


Things Nobody Told You About Compaction

• Tombstone compaction sub-properties
• Cassandra has logic to try to eliminate mostly-expired sstables
• Three basic knobs:
  1. What % of the table must be tombstones for it to be worth compacting?
     • tombstone_threshold (0.2 -> 0.8)
  2. How long has it been since that file was created?
     • tombstone_compaction_interval (how much IO do you have?)
  3. Should we try to compact the tombstones away even if we suspect it's not going to be successful?
     • unchecked_tombstone_compaction (false -> true)
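Set together in CQL, those knobs are compaction sub-properties. A sketch using the values mentioned above, against the hypothetical events_by_day table from earlier:

    ALTER TABLE metrics.events_by_day
    WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': '1',
        'tombstone_threshold': '0.8',              -- default is 0.2
        'tombstone_compaction_interval': '86400',  -- seconds since sstable creation
        'unchecked_tombstone_compaction': 'true'   -- default is false
    };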

Page 48: Using Time Window Compaction Strategy For Time Series Workloads


Q&A


Spoilers

TWCS is available in mainline Cassandra in 3.0.8 and newer.

If you’re running 2.0, 2.1, or 2.2, you can build a JAR from source on github.com/jeffjirsa/twcs

You PROBABLY don’t need to do anything special to change from DTCS -> TWCS
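The strategy change itself is an ordinary ALTER TABLE; a sketch, reusing the hypothetical table from earlier:

    ALTER TABLE metrics.events_by_day
    WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': '1'
    };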

Page 49: Using Time Window Compaction Strategy For Time Series Workloads


Thanks!


CrowdStrike Is Hiring

Talk to me about TWCS on Twitter: @jjirsa

Find me on IRC: jeffj on Freenode (#cassandra)

If you're running 2.0, 2.1, or 2.2, you can build a JAR from source at https://github.com/jeffjirsa/twcs