C* Summit 2013: Time Is Money (Jake Luciani and Carl Yeksigian)
DESCRIPTION
This session will focus on our approach to building a scalable time-series database for financial data using Cassandra 1.2 and CQL3. We will discuss how we deal with a heavy mix of reads and writes, as well as how we monitor and track the performance of the system.
TRANSCRIPT
Time is Money
Financial Time Series Jake Luciani and Carl Yeksigian
BlueMountain Capital
About this talk Part 1: Our use case and architecture Part 2: Our deployment and tuning Part 3: Q&A
Know your problem. 1000s of consumers ..creating and reading data as fast as possible ..consistent to all readers ..and handle ad-hoc user queries ..quickly ..across data centers.
Know your data.
AAPL price
MSFT price
Know your queries.
Time Series Query
Start, End, and Periodicity define the query
1 minute periods
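The (start, end, periodicity) query shape above can be sketched in pure Python. This is a hypothetical in-memory model, not the actual service code: raw ticks are folded into fixed periods, last write wins within each period.

```python
from collections import OrderedDict

# Hypothetical sketch: resample raw (timestamp_seconds, value) ticks into
# fixed periods, keeping the last observed value in each period -- the shape
# of a (start, end, periodicity) time-series query.
def time_series_query(points, start, end, period):
    buckets = OrderedDict()
    for ts, value in sorted(points):
        if start <= ts < end:
            buckets[ts - (ts - start) % period] = value  # last write wins
    return list(buckets.items())

ticks = [(0, 100.0), (30, 101.0), (60, 102.0), (150, 103.0)]
print(time_series_query(ticks, start=0, end=180, period=60))
# -> [(0, 101.0), (60, 102.0), (120, 103.0)]
```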
Know your queries.
Cross Section Query
As Of time defines the query
As Of Time (11am)
Know your queries. Cross sections are random; precomputing and storing every possible cross section is not feasible. We also support bi-temporality.
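The bi-temporal cross-section lookup can be sketched as follows. This is an illustrative in-memory model (not the actual service): for a given as-of time, return the most recently *known* value, which is what the DESC clustering order plus LIMIT 1 achieves in the real schema.

```python
# Hypothetical sketch of a bi-temporal cross-section lookup. Each row is
# (asof_ticks, knowledge_ticks, value); we want the value for one as-of
# time as it was known before some knowledge cutoff.
def cross_section(rows, asof, knowledge_limit):
    candidates = [(k, v) for (a, k, v) in rows
                  if a == asof and k < knowledge_limit]
    if not candidates:
        return None
    # DESC clustering order + LIMIT 1 == take the max knowledge_ticks
    return max(candidates)[1]

rows = [
    (1100, 1105, 10.0),  # value as first recorded
    (1100, 1200, 10.5),  # correction learned later
    (1200, 1205, 11.0),
]
print(cross_section(rows, asof=1100, knowledge_limit=1150))  # -> 10.0
print(cross_section(rows, asof=1100, knowledge_limit=9999))  # -> 10.5
```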
Let's optimize for Time Series.
CREATE TABLE tsdata (
  id blob,
  property text,
  asof_ticks bigint,
  knowledge_ticks bigint,
  value blob,
  PRIMARY KEY (id, property, asof_ticks, knowledge_ticks)
) WITH COMPACT STORAGE
  AND CLUSTERING ORDER BY (asof_ticks DESC, knowledge_ticks DESC);
Data Model (CQL 3)
SELECT * FROM tsdata WHERE id = 0x12345 AND property = 'lastPrice' AND asof_ticks >= 1234567890 AND asof_ticks <= 2345678901
CQL3 Queries: Time Series
CQL3 Queries: Cross Section SELECT * FROM tsdata WHERE id = 0x12345 AND property = 'lastPrice' AND asof_ticks = 1234567890 AND knowledge_ticks < 2345678901 LIMIT 1
A Service, not an app
[Architecture diagram: many App instances connect through a fat client to the Olympus Thrift Service, which sits in front of C*]
Complex Value Types. Not every value is a double. Some values belong together (bid and ask should always come back together). Thrift structures as values. Typed, extensible schema. Union types give us a way to deserialize any type.
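The union-type idea can be sketched like this. The real system uses Thrift structs; this hypothetical sketch uses JSON only to show the shape: every stored value carries a type tag, so a reader can deserialize any value without knowing its concrete type in advance.

```python
import json

# Hypothetical sketch of the tagged-union value encoding. Not the actual
# Thrift schema: JSON stands in for Thrift serialization here.
def encode(tag, payload):
    return json.dumps({"type": tag, "value": payload}).encode()

def decode(blob):
    obj = json.loads(blob.decode())
    return obj["type"], obj["value"]

# Bid and ask belong together, so they travel as one tagged value.
blob = encode("quote", {"bid": 99.5, "ask": 99.7})
tag, value = decode(blob)
print(tag, value["bid"], value["ask"])  # -> quote 99.5 99.7
```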
Ad-hoc querying UI
But that's the easy part...
(cue transition)
Scaling... The first rule of scaling is that you do not just turn everything to 11.
Scaling... Step 1 - Fast Machines for your workload Step 2 - Avoid Java GC for your workload Step 3 - Tune Cassandra for your workload Step 4 - Prefetch and cache for your workload
Can't fix what you can't measure. Riemann (http://riemann.io) lets us easily push application and system metrics into a single system. We push 6k metrics per second to a single Riemann instance.
Metrics: Riemann Yammer Metrics with Riemann
https://gist.github.com/carlyeks/5199090
Metrics: Riemann. A push, stream-based metrics library. Riemann Dash for "Why is it slow?"; Graphite for "Why was it slow?"
VisualVM: The greatest tool EVER Many useful plugins... Just start jstatd on each server and go!
Scaling Reads: Machines. SSDs for hot data. JBOD config. As many cores as possible (>16). 10GbE network. Bonded network cards. Jumbo frames.
JBOD is a lifesaver. SSDs are great until they aren't. JBOD allowed passive recovery in the face of simultaneous disk failures (the SSDs had a bad firmware).
Scaling Reads: Cassandra Changes we've made: • Configuration • Compaction • Compression
Leveled Compaction. Wide rows mean data can be spread across a huge number of SSTables. Leveled Compaction puts a bound on the worst case (* in theory). Fewer SSTables to read means lower latency. [Diagram: levels L0-L5, with only the SSTables that must be read highlighted]
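Why leveled compaction bounds the worst case can be put in back-of-the-envelope form. This is an illustrative sketch, not Cassandra code, and the sizing parameters (SSTable size, 10x fan-out) are assumptions for illustration: within L1 and deeper, SSTables are non-overlapping, so a point read touches at most one SSTable per level, plus every L0 file (L0 may overlap).

```python
# Hypothetical back-of-the-envelope for the leveled-compaction read bound.
# Each level holds ~fanout times the previous one and is internally
# non-overlapping, so a point read touches at most one SSTable per level
# in L1..Ln, plus all of L0 (which may overlap).
def level_capacity_mb(level, sstable_mb=160, fanout=10):
    return sstable_mb * fanout ** level  # L1 = 1.6 GB, L2 = 16 GB, ...

def worst_case_reads(l0_files, deepest_level):
    return l0_files + deepest_level  # all of L0, one SSTable per L1..Ln

print(level_capacity_mb(2))    # -> 16000 (MB)
print(worst_case_reads(4, 3))  # -> 7 SSTables in the worst case
```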
Leveled Compaction: Breaking Bad. Under high write load, we are forced to read all of the L0 files. [Diagram: levels L0-L5, with reads touching every L0 SSTable]
Hybrid Compaction: Breaking Better. Size-tier Level 0; on by default in 2.0. [Diagram: levels L0-L5; L0 is size-tiered, L1-L5 remain leveled]
Overlapping Compaction. Instead of forcing a combination of L0 files with L1, we can just push files up. This allows a higher level of concurrency in compactions. We still know which SSTables might contain the keys. We can force a proper compaction at any configurable level. [Diagram: levels L0-L5, with files promoted directly to higher levels]
Optimized C library. The read path needs to be fast for our workload. CRC checks and composite comparisons eat a lot of cycles. CRC is implemented on-chip for some architectures (why not use it?). We want to move some of these operations into a JNI library to reduce latency and improve throughput.
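The CRC check itself is simple to illustrate. This sketch uses Python's zlib.crc32 purely to show what is being verified (the talk's point is moving this into native code that can use hardware CRC instructions, and the hardware variant is CRC32C rather than zlib's CRC-32):

```python
import zlib

# Illustrative only: recompute the checksum of a payload and compare it to
# the stored value, the check the talk wants to move into a JNI/C library
# backed by on-chip CRC instructions.
def verify(payload, expected_crc):
    return zlib.crc32(payload) & 0xFFFFFFFF == expected_crc

data = b"tsdata chunk"
crc = zlib.crc32(data) & 0xFFFFFFFF
print(verify(data, crc))          # -> True
print(verify(b"corrupted", crc))  # -> False
```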
Current Stats: 16 nodes; 2 data centers; replication factor 6; 200k writes/sec at EACH_QUORUM; 150k reads/sec at LOCAL_QUORUM; >30 million time series; >15 billion points; 10 TB on disk (compressed); read latency 50%/95% is 1 ms / 5 ms.
Questions? Thank you! @tjake and @carlyeks