openTSDB - Metrics for a distributed world - NETWAYS
openTSDB - Metrics for a distributed world
Oliver Hankeln / gutefrage.net
@mydalon
Who am I?
Senior Engineer - Data and Infrastructure at gutefrage.net GmbH
Previously worked in software development
DevOps advocate
Who is Gutefrage.net?
Germany's biggest Q&A platform
#1 German site (mobile), about 5M unique users
#3 German site (desktop), about 17M unique users
> 4 million page impressions/day
Part of the Holtzbrinck group
Running several platforms (Gutefrage.net, Helpster.de, Cosmiq, Comprano, ...)
What you will get
Why we chose openTSDB
What is openTSDB?
How does openTSDB store the data?
Our experiences
Some advice
Why we chose openTSDB
We were looking at some options

                   Munin   Graphite   openTSDB   Ganglia
Scales well        no      sort of    yes        yes
Keeps all data     no      no         yes        no
Creating metrics   easy    easy       easy       easy
We have a winner!
openTSDB: scales well, keeps all data, creating metrics is easy
Bingo!
Separation of concerns
$ unzip|strip|touch|finger|grep|mount|fsck|more|yes|fsck|fsck|fsck|umount|sleep
The ecosystem
App feeds metrics in via RabbitMQ
We base Icinga checks on the metrics
We are evaluating Etsy Skyline for anomaly detection
We deploy sensors via Chef
openTSDB
Written at StumbleUpon, but open source
Uses HBase (which is based on HDFS) as its storage
Distributed system (multiple TSDs)
The big picture
[Diagram: HBase, several TSDs, UI, API, tcollector. Annotation: "This is really a cluster"]
Putting data into openTSDB
$ telnet tsd01.acme.com 4242
put proc.load.avg5min 1382536472 23.2 host=db01.acme.com
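The same put line can also be built and sent from a script. The following is a minimal sketch in Python, not part of the original slides: the helper names format_put and send_put are made up for illustration, and the host name and port are simply the ones from the telnet example above.

```python
import socket


def format_put(metric, timestamp, value, tags):
    """Build one line in the telnet-style `put` format shown above."""
    tag_str = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {metric} {int(timestamp)} {value} {tag_str}\n"


def send_put(line, host, port=4242, timeout=5.0):
    """Send a single put line to a TSD over TCP (host and port are assumptions)."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(line.encode("ascii"))


line = format_put("proc.load.avg5min", 1382536472, 23.2, {"host": "db01.acme.com"})
print(line, end="")

# Sending requires a reachable TSD, so it is left commented out here:
# send_put(line, "tsd01.acme.com")
```

Keeping formatting and sending separate makes it possible to check the line format without a running TSD.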
It gets even better
tcollector is a Python script that runs your collectors
handles the network connection, starts your collectors at set intervals
does basic process management
adds the host tag, does deduplication
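To make the last point concrete, here is a rough sketch of what "adds host tag, does deduplication" could look like. It is not tcollector's actual code: the class and function names are hypothetical, and the real deduplication rules are more involved than "drop a value that repeats the previous one".

```python
class Deduplicator:
    """Drops a data point whose value repeats the previous one for the same series.

    Simplified assumption for illustration; tcollector's real rules differ.
    """

    def __init__(self):
        self._last = {}

    def is_new(self, metric, tags_key, value):
        key = (metric, tags_key)
        if self._last.get(key) == value:
            return False
        self._last[key] = value
        return True


def process(lines, hostname, dedup):
    """Turn collector output lines into put commands, adding a host tag if missing."""
    for raw in lines:
        metric, timestamp, value, *tags = raw.split()
        if not any(t.startswith("host=") for t in tags):
            tags.append(f"host={hostname}")
        tags_key = " ".join(sorted(tags))
        if dedup.is_new(metric, tags_key, value):
            yield f"put {metric} {timestamp} {value} {tags_key}"


out = list(process(
    ["roll.a.d6 1382536472 4", "roll.a.d6 1382536473 4", "roll.a.d6 1382536474 2"],
    hostname="db01.acme.com",
    dedup=Deduplicator(),
))
for line in out:
    print(line)
```

In this example the second data point is dropped because its value repeats the first one.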
A simple tcollector script
#!/usr/bin/php
<?php
# Cast a die
$die = rand(1,6);
echo "roll.a.d6 " . time() . " " . $die . "\n";
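A collector can be written in any language, as long as it prints lines of the form "metric timestamp value" to standard output. As an illustration, the same die-rolling collector might look like this in Python; the function name roll_line and its parameters are hypothetical.

```python
#!/usr/bin/env python3
import random
import time


def roll_line(now=None, rng=random):
    """Return one data point in the `metric timestamp value` format."""
    timestamp = int(time.time() if now is None else now)
    return f"roll.a.d6 {timestamp} {rng.randint(1, 6)}"


if __name__ == "__main__":
    # Flush so the line reaches the parent process immediately.
    print(roll_line(), flush=True)
```

Passing the time and the random source as parameters keeps the output format easy to check in isolation.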
What was that HDFS again?
HDFS is a distributed filesystem suitable for petabytes of data on thousands of machines
Runs on commodity hardware
Takes care of redundancy
Used by e.g. Facebook, Spotify, eBay, ...
Okay... and HBase?
HBase is a NoSQL database / data store on top of HDFS
Modeled after Google's Bigtable
Built for big tables (billions of rows, millions of columns)
Automatic sharding by row key
How openTSDB stores the data
Keys are key!
Data is sharded across regions based on its row key
You query data based on the row key
You can query row key ranges (e.g. A...D)
So: think about key design
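A toy model may help to see why the key matters. The sketch below is not HBase code: it keeps row keys in a sorted Python list, assigns them to regions by hypothetical start keys, and answers a range query with two binary searches. The region boundaries and server names are invented for illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical regions: each region serves all keys from its start key
# up to (but not including) the next region's start key.
REGION_STARTS = ["", "H", "P"]
REGION_SERVERS = ["server-a", "server-b", "server-c"]


def region_for(row_key):
    """Return the server responsible for a row key."""
    return REGION_SERVERS[bisect_right(REGION_STARTS, row_key) - 1]


def scan(sorted_keys, start, stop):
    """Return all keys in [start, stop), relying on the keys being sorted."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


keys = sorted(["Alpha", "Bravo", "Delta", "Hotel", "Papa", "Zulu"])
print([region_for(k) for k in keys])
print(scan(keys, "A", "E"))
```

Because both the placement and the range scan depend only on the sort order of the key, the key layout decides which rows sit next to each other and which server holds them.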
Take 1
Row key format: timestamp, metric id

Row key          Value
1382536472, 5    17
1382536472, 6    24
1382536472, 8    12
1382536473, 5    134
1382536473, 6    10
1382536473, 8    99
1382536474, 5    12
1382536474, 6    42

(Diagram labels: Server A, Server B)
Solution: Swap timestamp and metric id
Row key format: metric id, timestamp

Row key          Value
5, 1382536472    17
6, 1382536472    24
8, 1382536472    12
5, 1382536473    134
6, 1382536473    10
8, 1382536473    99
5, 1382536474    12
6, 1382536474    42

(Diagram labels: Server A, Server B)
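The slides do not spell out why the swap helps, but the usual argument is that keys starting with a timestamp always sort at the end of the table, so all new writes land on the same region. The following sketch illustrates that argument with the metric ids and timestamps from the slides; the split points are hypothetical and only serve the illustration.

```python
from bisect import bisect_right
from collections import Counter


def region_of(key, split_points):
    """Index of the region holding `key`; regions are split at `split_points`."""
    return bisect_right(split_points, key)


def write_distribution(keys, split_points):
    """Count how many writes land on each region."""
    return Counter(region_of(k, split_points) for k in keys)


metrics = [5, 6, 8]
new_timestamps = range(1382536472, 1382536475)  # three seconds of incoming data

# Take 1: (timestamp, metric id). Hypothetical splits derived from older data.
ts_first = [(ts, m) for ts in new_timestamps for m in metrics]
ts_splits = [(1382000000, 0), (1382400000, 0)]

# Solution: (metric id, timestamp). Hypothetical splits between metric ids.
metric_first = [(m, ts) for ts in new_timestamps for m in metrics]
metric_splits = [(6, 0), (8, 0)]

print(dict(write_distribution(ts_first, ts_splits)))
print(dict(write_distribution(metric_first, metric_splits)))
```

With timestamp-first keys all nine writes hit the last region, while metric-first keys spread them evenly over the three regions.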
Take 2
Metric ID first, then timestamp
Searching through many rows is slower than searching through fewer rows. (Obviously)
So: Put multiple data points into one row
Take 2 continued
Row key: 5, 1382608800
  Cell name:   +23  +35  +94  +142
  Data point:   17    1   23    42
Row key: 5, 1382612400
  Cell name:   +13  +25  +88   +89
  Data point:    3   44   12     2
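The layout above can be sketched in a few lines: the row key carries the metric id and the timestamp rounded down to the hour, and each cell is named after the offset in seconds from that base. This is an illustrative sketch only; the function name to_rows and the dictionary representation are assumptions, not openTSDB's actual storage code.

```python
from collections import defaultdict

ROW_SPAN = 3600  # seconds per row, matching the hourly base timestamps above


def to_rows(points):
    """Group (metric_id, timestamp, value) points into rows keyed by (metric_id, base)."""
    rows = defaultdict(dict)
    for metric_id, timestamp, value in points:
        base = timestamp - timestamp % ROW_SPAN
        rows[(metric_id, base)][timestamp - base] = value
    return dict(rows)


points = [
    (5, 1382608823, 17), (5, 1382608835, 1), (5, 1382608894, 23), (5, 1382608942, 42),
    (5, 1382612413, 3), (5, 1382612425, 44), (5, 1382612488, 12), (5, 1382612489, 2),
]
rows = to_rows(points)
for key, cells in rows.items():
    print(key, cells)
```

The eight data points end up in two rows, which reproduces the example from the slide.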
Where are the tags stored?
They are put at the end of the row key
Both tag names and tag values are represented by IDs
The Row Key
3 Bytes - metric ID
4 Bytes - timestamp (rounded down to the hour)
3 Bytes - tag ID (per tag)
3 Bytes - tag value ID (per tag)
Total: 7 Bytes + 6 Bytes * number of tags
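As a worked example of the size formula, the following sketch packs a row key byte by byte. It follows the field widths listed above; the big-endian encoding, the sorting of tags by ID, and the example IDs are assumptions made for illustration.

```python
import struct


def uid(n):
    """Encode an ID as 3 big-endian bytes."""
    return n.to_bytes(3, "big")


def row_key(metric_id, timestamp, tags):
    """Build a row key: metric ID, hour-aligned timestamp, then (tag ID, tag value ID) pairs."""
    base = timestamp - timestamp % 3600
    key = uid(metric_id) + struct.pack(">I", base)
    for tag_id, value_id in sorted(tags):  # sorting by tag ID is an assumption
        key += uid(tag_id) + uid(value_id)
    return key


key = row_key(5, 1382536472, [(1, 42)])
print(len(key), key.hex())
```

With one tag the key is 7 + 6 * 1 = 13 bytes long; every additional tag adds another 6 bytes to every row key of that time series.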
Let's look at some graphs
Our experiences
What works well
We store about 200M data points in several thousand time series with no issues
tcollector decouples measurement from storage
Creating new metrics is really easy
Challenges
The UI is seriously lacking
No annotation support out of the box
Only 1 s time resolution (and only one value per second per time series)
Salvation is coming
openTSDB 2 is around the corner
millisecond precision
annotations and metadata
improved API
Friendly advice
Pick a naming scheme and stick to it
Use tags wisely (no more than 6 or 7 tags per data point)
Use tcollector
Wait for openTSDB 2 ;-)
Questions?
Please contact me:
@mydalon
I'll upload the slides and tweet about it