Building a Time Series Database
![Page 1: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/1.jpg)
© Man 2015
Building a time series database… 10^12 rows and counting
James Blackburn
@jimmybb
@ManAHLTech
![Page 2: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/2.jpg)
2
Agenda
Data Storage At AHL
1. AHL
2. Data: size and shape
3. Implementation: Arctic
4. Performance
5. Conclusion
![Page 3: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/3.jpg)
3
AHL Systematic Fund Management
![Page 4: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/4.jpg)
4
AHL Systematic Fund Management
![Page 5: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/5.jpg)
5
AHL Systematic Fund Management
![Page 6: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/6.jpg)
6
AHL Systematic Fund Management
![Page 7: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/7.jpg)
Quant researchers
• Interactive work – latency sensitive
• Batch jobs run on a cluster – maximize throughput
• Historical data
• New data
• ... want control of storing their own data
Trading system
• Auditable – SVN for data
• Stable
• Performant
7
Overview – Data consumers
![Page 8: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/8.jpg)
8
AHL’s Data Pipeline
![Page 9: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/9.jpg)
…
[2-2092492678] FFIM6.NaE 2015-11-11 13:38:18.330 4 UPDATE TIMACT: '13:38', BID: 6192.0, QUOTIM: '13:38:18', QUOTIM_MS: 49098000.0, BIDSIZE: 1.0, EXCHTIM: '13:38:18'
[2-2092492759] FFIM6.NaE 2015-11-11 13:38:18.676 16 UPDATE TIMACT: '13:38', ASKSIZE: 1.0, QUOTIM: '13:38:18', QUOTIM_MS: 49098000.0, ASK: 6200.5, EXCHTIM: '13:38:18'
[2-2092493019] FFIM6.NaE 2015-11-11 13:38:20.333 14 UPDATE TIMACT: '13:38', BID: 6192.5, QUOTIM: '13:38:20', QUOTIM_MS: 49100000.0, BIDSIZE: 1.0, EXCHTIM: '13:38:20'
[2-2092493079] FFIM6.NaE 2015-11-11 13:38:20.685 2 UPDATE TIMACT: '13:38', ASKSIZE: 1.0, QUOTIM: '13:38:20', QUOTIM_MS: 49100000.0, ASK: 6201.0, EXCHTIM: '13:38:20'
…
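Each message above carries only the fields that changed: one update has BID and BIDSIZE, the next has ASK and ASKSIZE. A minimal sketch of parsing one such line into a Python dict follows. The layout (bracketed sequence id, symbol, timestamp, a numeric field, message type, then comma-separated `KEY: value` pairs) is inferred from the sample only, and the names `LINE_RE` and `parse_tick` are hypothetical, not part of any feed-handler API.

```python
import ast
import re
from datetime import datetime

# Layout inferred from the sample lines; treat it as an assumption.
LINE_RE = re.compile(
    r"\[(?P<seq>[^\]]+)\]\s+(?P<symbol>\S+)\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<num>\d+)\s+(?P<kind>\w+)\s+(?P<fields>.*)"
)


def parse_tick(line):
    """Split one tick message into its header parts and a sparse field dict."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError("unrecognised tick line: %r" % line)
    fields = {}
    for pair in m.group("fields").split(", "):
        key, _, raw = pair.partition(": ")
        # Values are Python-style literals: quoted strings or floats.
        fields[key] = ast.literal_eval(raw)
    return {
        "seq": m.group("seq"),
        "symbol": m.group("symbol"),
        "timestamp": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f"),
        "kind": m.group("kind"),
        "fields": fields,
    }


sample = ("[2-2092492678] FFIM6.NaE 2015-11-11 13:38:18.330 4 UPDATE "
          "TIMACT: '13:38', BID: 6192.0, QUOTIM: '13:38:18', "
          "QUOTIM_MS: 49098000.0, BIDSIZE: 1.0, EXCHTIM: '13:38:18'")
tick = parse_tick(sample)
print(tick["symbol"], tick["timestamp"], sorted(tick["fields"]))
```

The point of the sketch is the shape of the result: a timestamp plus a sparse set of columns, which is what a tick store has to hold efficiently.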
9
Tick Data
![Page 10: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/10.jpg)
10
TimeSeries – single stock
![Page 11: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/11.jpg)
Data sizes we deal with…
• ~1MB x 1000s: 1x a day price data, 10k rows (30 years)
• ~0.5GB x 1000s: 1-minute data, 4M rows (20 years)
• ~1GB x 1000s: 10k x 10k data matrices, 100M cells (30 years)
• ~30TB: tick data, 100k msgs/s, 1.2B msgs/day
… and different shapes
• Time series of prices
• Event data
• News data
• Metadata
• What’s next?
11
Data sizes
![Page 12: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/12.jpg)
12
lib.read('US Equity Adjusted Prices')
Out[4]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 9605 entries, 1983-01-31 21:30:00 to 2014-02-14 21:30:00
Columns: 8103 entries, AST10000 to AST9997
dtypes: float64(8631)
Problems - Scale
![Page 13: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/13.jpg)
13
Equity Prices: 77M float64s ≈ 600 MB of data ≈ 5 Gbits!
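The arithmetic behind those figures, as a quick check. The row and column counts come from the DataFrame summary (9605 x 8103); the only assumption is 8 bytes per float64 with no index or container overhead.

```python
rows, cols = 9605, 8103        # DatetimeIndex entries x columns
bytes_per_float64 = 8          # raw values only, no pandas overhead

cells = rows * cols
size_bytes = cells * bytes_per_float64
size_bits = size_bytes * 8

print("cells: %.1fM" % (cells / 1e6))
print("size:  %.0f MB" % (size_bytes / 1e6))
print("wire:  %.1f Gbit" % (size_bits / 1e9))
```

So a single read of this one DataFrame moves roughly 5 Gbit of raw data before any compression.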
Problems - Scale
![Page 14: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/14.jpg)
14
SQL
cx_Oracle
~ 200us per Row.
10k rows => 2.2s
![Page 15: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/15.jpg)
15
SQL
How do we query 4 million rows?
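A back-of-envelope extrapolation of the cx_Oracle figure (2.2 s for 10k rows) to a 4M-row one-minute series. It assumes the per-row cost stays constant as the result set grows, which is a simplification.

```python
seconds_for_10k = 2.2                 # measured figure quoted on the slide
per_row = seconds_for_10k / 10000     # ~220 microseconds per row
rows = 4000000                        # 1-minute data, 20 years

estimate = per_row * rows
print("%.0f s (~%.0f min)" % (estimate, estimate / 60))
```

At that rate one series takes around a quarter of an hour to load, which is far from interactive.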
![Page 16: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/16.jpg)
Many different existing data stores
• Relational databases
• Tick databases
• Flat files
• HDF5 files
• Caches
16
Overview – Databases
![Page 17: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/17.jpg)
17
Can we build one system to rule them all?
Overview – Databases
![Page 18: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/18.jpg)
18
Implementation
https://github.com/manahl/arctic/
![Page 19: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/19.jpg)
Requirements
• Easy to use – and we mean easy
• Fast – as fast as local files
• Scalable – unbounded in data-size and number of clients
• Agile – any data shape; new shapes; iterative development
• Complete – all data behind the simple API
19
Project Requirements
![Page 20: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/20.jpg)
Goals
• 20 years of 1 minute data in <1s
• 200 instruments x all history x once a day data <1s
• Single data store for all data types
• 1x day data to tick data
• Data versioning + Audit
20
Project Goals
![Page 21: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/21.jpg)
Data bucketed into named Libraries
• One minute
• Daily
• User-data: jbloggs.EOD
• Metadata Index
Pluggable Library types:
• VersionStore
• TickStore
• Pickle Store
• … pluggable …
https://github.com/manahl/arctic/blob/master/howtos/how_to_custom_arctic_library.py
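The idea of named libraries backed by pluggable store types can be sketched in plain Python. This is a toy, in-memory illustration only: `LIBRARY_TYPES`, `register_type`, `PickleLikeStore` and `ToyStore` are hypothetical names and do not reproduce Arctic's implementation; the linked how-to shows the real mechanism.

```python
import pickle

# Hypothetical registry mapping a type name to a store class.
LIBRARY_TYPES = {}


def register_type(name, cls):
    """Make a store class available under a type name."""
    if name in LIBRARY_TYPES:
        raise ValueError("library type %r already registered" % name)
    LIBRARY_TYPES[name] = cls


class PickleLikeStore(object):
    """Toy store that keeps any picklable object per symbol."""

    def __init__(self):
        self._data = {}

    def write(self, symbol, item):
        self._data[symbol] = pickle.dumps(item)

    def read(self, symbol):
        return pickle.loads(self._data[symbol])


class ToyStore(object):
    """Maps library names such as 'jbloggs.EOD' to store instances."""

    def __init__(self):
        self._libraries = {}

    def initialize_library(self, name, lib_type):
        # Look up the requested type and create an empty library of it.
        self._libraries[name] = LIBRARY_TYPES[lib_type]()

    def list_libraries(self):
        return sorted(self._libraries)

    def __getitem__(self, name):
        return self._libraries[name]


register_type("PickleLike", PickleLikeStore)
store = ToyStore()
store.initialize_library("jbloggs.EOD", "PickleLike")
library = store["jbloggs.EOD"]
library.write("SYMBOL", {"close": [1.0, 2.0]})
print(store.list_libraries(), library.read("SYMBOL"))
```

Because callers only see `write` and `read`, a new data shape means registering a new store class rather than changing the client code.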
21
Arctic Libraries
![Page 22: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/22.jpg)
22
![Page 23: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/23.jpg)
23
![Page 24: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/24.jpg)
24
![Page 25: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/25.jpg)
Document ~= Python Dictionary / Java HashMap
Flexible schema, rapid prototyping
OpenSource database
Great support
#1 NoSQL DB (#4 overall) http://db-engines.com/en/ranking
25
Why MongoDB
![Page 26: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/26.jpg)
Arctic key-value store
26
from arctic import Arctic
a = Arctic('research')                       # Connect to the data store
a.list_libraries()                           # What data libraries are available
library = a['jbloggs.EOD']                   # Get a Library
library.list_symbols()                       # List symbols
library.write('SYMBOL', <TS or other data>)  # Write
library.read('SYMBOL', version=…)            # Read, with an optional version
library.snapshot('snapshot-name')            # Create a named snapshot of the library
library.list_snapshots()                     # List snapshots
https://github.com/manahl/arctic/blob/master/howtos/how_to_use_arctic.py
Arctic API
![Page 27: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/27.jpg)
27
![Page 28: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/28.jpg)
28
Arctic - TickStore
Arctic('localhost').initialize_library('tickdb', 'TickStoreV3')
![Page 29: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/29.jpg)
29
Implementation – slicing a chunk
![Page 30: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/30.jpg)
30
Implementation – a chunk
{
  ID: ObjectId('52b1d39eed5066ab5e87a56d'),
  SYMBOL: 'symbol',
  INDEX: Binary('…, datetime, …', 0),
  COLUMNS: {
    ASK: {
      DATA: Binary('<compressed>', 0),
      DTYPE: '<f8',
      ROWMASK: Binary('...', 0)
    },
    ...
  },
  START: DateTime(2015-01-01),
  END: DateTime(2015-11-12),
  SEGMENT: 1386933906826L,
  SHA: 1386933906826L,
  VERSION: 3,
}
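One way to read the per-column layout: DATA holds only the values that exist, DTYPE says how to decode them, and ROWMASK records which rows of the chunk carry a value for that column. A minimal stdlib-only sketch of that idea follows. The bit order, the compressed mask and the helper names `encode_column` / `decode_column` are assumptions for illustration, and `zlib` stands in for LZ4, which is not in the standard library.

```python
import struct
import zlib


def encode_column(rows, name):
    """Pack one sparse column from a bucket of tick dicts."""
    values = [row[name] for row in rows if name in row]
    # One bit per row: set where the row has a value for this column.
    mask = bytearray((len(rows) + 7) // 8)
    for i, row in enumerate(rows):
        if name in row:
            mask[i // 8] |= 1 << (i % 8)
    raw = struct.pack("<%dd" % len(values), *values)  # '<f8' doubles
    return {
        "DATA": zlib.compress(raw),             # zlib stands in for LZ4
        "DTYPE": "<f8",
        "ROWMASK": zlib.compress(bytes(mask)),
    }


def decode_column(col, n_rows):
    """Rebuild a dense list with None where the row had no value."""
    raw = zlib.decompress(col["DATA"])
    values = iter(struct.unpack("<%dd" % (len(raw) // 8), raw))
    mask = zlib.decompress(col["ROWMASK"])
    return [next(values) if (mask[i // 8] >> (i % 8)) & 1 else None
            for i in range(n_rows)]


# Field values taken from the tick sample earlier in the deck.
ticks = [
    {"BID": 6192.0, "BIDSIZE": 1.0},
    {"ASK": 6200.5, "ASKSIZE": 1.0},
    {"BID": 6192.5, "BIDSIZE": 1.0},
    {"ASK": 6201.0, "ASKSIZE": 1.0},
]
ask = encode_column(ticks, "ASK")
print(decode_column(ask, len(ticks)))
```

Storing each column as one compressed binary blob per chunk means a read fetches a handful of documents rather than millions of rows.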
![Page 31: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/31.jpg)
31
Implementation – TickStore
Sym1
Sym2
![Page 32: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/32.jpg)
32
![Page 33: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/33.jpg)
33
Arctic - VersionStore
Arctic('localhost').initialize_library('library')
![Page 34: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/34.jpg)
34
Implementation – VersionStore
Snap A
Snap B
Sym1, v1
Sym1, v3
Sym2, v4
Sym2, v5
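The relationship between symbols, versions and snapshots can be sketched with a toy in-memory store: every write creates a new version, and a snapshot pins the version each symbol had at that moment. `ToyVersionStore` is a hypothetical illustration that assumes per-symbol version numbers; it is not Arctic's VersionStore.

```python
class ToyVersionStore(object):
    """In-memory illustration of versioned writes and named snapshots."""

    def __init__(self):
        self._versions = {}    # symbol -> list of data, index = version - 1
        self._snapshots = {}   # snapshot name -> {symbol: version number}

    def write(self, symbol, data):
        """Append a new version; earlier versions are never modified."""
        history = self._versions.setdefault(symbol, [])
        history.append(data)
        return len(history)

    def read(self, symbol, version=None):
        """version: None (latest), an int, or a snapshot name."""
        history = self._versions[symbol]
        if version is None:
            number = len(history)
        elif isinstance(version, int):
            number = version
        else:
            number = self._snapshots[version][symbol]
        return history[number - 1]

    def snapshot(self, name):
        """Pin the current version of every symbol under a name."""
        if name in self._snapshots:
            raise ValueError("snapshot %r already exists" % name)
        self._snapshots[name] = dict(
            (symbol, len(history))
            for symbol, history in self._versions.items())

    def list_snapshots(self):
        return sorted(self._snapshots)


lib = ToyVersionStore()
lib.write("Sym1", [1.0, 2.0])
lib.snapshot("Snap A")
lib.write("Sym1", [1.0, 2.0, 3.0])
print(lib.read("Sym1"), lib.read("Sym1", version="Snap A"))
```

Reading through a snapshot name always returns the pinned data, which is what makes results auditable and reproducible after later writes.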
![Page 35: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/35.jpg)
© Man 2014 35
Architecture – final system
[Diagram] Reuters (RMDS Message Bus), Bloomberg, Banks -> Kafka Queues -> Arctic -> MongoDB Cluster -> Arctic ->
MongoDB Cluster:
• 16 (micro-)shard cluster
• Master + 1 replica
• Linux
• 8 cores, 256 GB RAM
• 96TB Disk
• Infiniband network
• LZ4 compressed data
![Page 36: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/36.jpg)
36
Performance
![Page 37: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/37.jpg)
Flat files on NFS – Random market
37
Results – Performance Once a Day Data
![Page 38: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/38.jpg)
HDF5 files – Random instrument
38
Results – Performance One Minute Data
![Page 39: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/39.jpg)
Random E-Mini S&P contract from 2013
© Man 2013 39
Results – TickStore
![Page 40: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/40.jpg)
40
Results – TickStore II
Infiniband saturated
25x greater tick throughput
With just 2 machines!
![Page 41: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/41.jpg)
Random E-Mini S&P contract from 2013
41
Results – System Load
Legend: OtherTick, Mongo (x2). N Tasks = 32
![Page 42: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/42.jpg)
42
Performance II
![Page 43: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/43.jpg)
43
TickStore message input
![Page 44: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/44.jpg)
44
TimeSeries Query Throughput
![Page 45: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/45.jpg)
Low latency:
- 1xDay data: 4ms for 10,000 rows (vs. 2,210ms from SQL)
- OneMinute / Tick data: 1s for 3.5M rows Python (vs. 15s – 40s+ from OtherTick)
- 1s for 15M rows Java
Parallel Access:
- Cluster with 256+ concurrent data access
- Consistent throughput – little load on the Mongo server
Efficient:
- 10-15x reduction in network load
- Negligible decompression cost (lz4: 1.8Gb/s)
45
Conclusion
![Page 46: Building a Time Series Database](https://reader035.vdocuments.us/reader035/viewer/2022062310/586f75391a28ab10258b5f91/html5/thumbnails/46.jpg)
46
The Future
- Python 3 support
- Mac Support
- VersionStore write performance
- OpenSource other native clients
  - JavaScript
  - Java
  - C#
- Contributions Welcome!