
Inside the InfluxDB Storage Engine

Paul Dix paul@influxdb.com

@pauldix

TSDB

Inverted Index

preliminary intro materials…

Everything is indexed by time and series

Shards

10/10/2015   10/11/2015   10/12/2015   10/13/2015

Data organized into shards of time; each is an underlying DB. Efficient to drop old data.

InfluxDB data

temperature,device=dev1,building=b1 internal=80,external=18 1443782126

Measurement: temperature
Tags: device=dev1,building=b1
Fields: internal=80,external=18
Timestamp: 1443782126

We actually store up to ns scale timestamps, but I couldn't fit them on the slide.
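A rough illustration of how that line breaks apart. This is a deliberately simplified sketch (it ignores escaping and field type parsing), not InfluxDB's actual parser:

package main

import (
	"fmt"
	"strings"
)

// parseLine splits a line-protocol point into measurement, tags, fields, and
// timestamp. Simplified sketch: no escaping, no field type handling.
func parseLine(line string) (measurement, tags, fields, ts string) {
	parts := strings.SplitN(line, " ", 3) // "<measurement,tags> <fields> <timestamp>"
	mt := strings.SplitN(parts[0], ",", 2)
	measurement = mt[0]
	if len(mt) > 1 {
		tags = mt[1]
	}
	if len(parts) > 1 {
		fields = parts[1]
	}
	if len(parts) > 2 {
		ts = parts[2]
	}
	return
}

func main() {
	m, t, f, ts := parseLine("temperature,device=dev1,building=b1 internal=80,external=18 1443782126")
	fmt.Println(m, "|", t, "|", f, "|", ts)
	// temperature | device=dev1,building=b1 | internal=80,external=18 | 1443782126
}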

Each series and field maps to a unique ID:

temperature,device=dev1,building=b1#internal -> 1
temperature,device=dev1,building=b1#external -> 2

Data per ID is tuples ordered by time:

1 -> (1443782126, 80)
2 -> (1443782126, 18)
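A minimal sketch of those two mappings in Go (type and variable names are illustrative, not the engine's actual code):

package main

import "fmt"

// A (time, value) tuple; data per ID is kept ordered by time.
type point struct {
	unixTime int64
	value    float64
}

func main() {
	// Each series + field maps to a unique ID.
	seriesFieldToID := map[string]uint64{
		"temperature,device=dev1,building=b1#internal": 1,
		"temperature,device=dev1,building=b1#external": 2,
	}

	// Per-ID data is a time-ordered list of tuples.
	dataByID := map[uint64][]point{
		1: {{1443782126, 80}},
		2: {{1443782126, 18}},
	}

	id := seriesFieldToID["temperature,device=dev1,building=b1#internal"]
	fmt.Println(id, dataByID[id]) // 1 [{1443782126 80}]
}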

Arranging in Key/Value Stores

Key (ID, Time)    Value
1,1443782126      80
1,1443782127      81    <- new data
2,1443782126      18
2,1443782130      17
2,1443782256      15
3,1443700126      18

The key space is ordered.

Many existing storage engines have this model
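A minimal sketch of that layout on top of an ordered key space, in Go. The 16-byte key encoding (8-byte ID, 8-byte timestamp, both big-endian) is an assumption for illustration, and a sorted slice stands in for the storage engine's ordered keys:

package main

import (
	"encoding/binary"
	"fmt"
	"sort"
)

// encodeKey packs (ID, time) big-endian so that byte-wise ordering of keys
// matches (ID, time) ordering; this is what makes reading one series a
// contiguous sweep of the key space. Illustrative encoding, not the engine's.
func encodeKey(id uint64, ts int64) string {
	var b [16]byte
	binary.BigEndian.PutUint64(b[0:8], id)
	binary.BigEndian.PutUint64(b[8:16], uint64(ts))
	return string(b[:])
}

func main() {
	kv := map[string]float64{
		encodeKey(1, 1443782126): 80,
		encodeKey(1, 1443782127): 81,
		encodeKey(2, 1443782126): 18,
		encodeKey(2, 1443782130): 17,
		encodeKey(2, 1443782256): 15,
		encodeKey(3, 1443700126): 18,
	}

	// Scan series 1: every key from (1, 0) up to (2, 0).
	lo, hi := encodeKey(1, 0), encodeKey(2, 0)
	keys := make([]string, 0, len(kv))
	for k := range kv {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		if k >= lo && k < hi {
			fmt.Println(binary.BigEndian.Uint64([]byte(k)[8:]), kv[k])
		}
	}
}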

New Storage Engine?!

First we used LSM Trees:

• deletes expensive
• too many open file handles

Then mmap COW B+Trees:

• write throughput
• compression

Neither met our requirements:

• High write throughput
• Awesome read performance
• Better compression
• Writes can't block reads
• Reads can't block writes
• Write multiple ranges simultaneously
• Hot backups
• Many databases open in a single process

Enter InfluxDB's Time Structured Merge Tree (TSM Tree)

like LSM, but different

Components

• WAL
• In-memory cache
• Index Files

Similar to LSM Trees: the WAL is the same, the in-memory cache is like MemTables, and the index files are like SSTables.

awesome time series data -> WAL (an append only file) -> in-memory index -> on-disk index (periodic flushes)

Memory mapped!
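Sketched as a Go struct (all names are assumptions for illustration, not InfluxDB's actual types), the engine holds those components together roughly like this:

package main

import "os"

// Illustrative sketch of the TSM components described above.
type tsmPoint struct {
	unixNano int64
	value    float64
}

type tsmFile struct {
	path string // immutable, memory-mapped index file produced by a flush or compaction
}

type engine struct {
	wal   *os.File              // append-only write-ahead log
	cache map[string][]tsmPoint // in-memory index of recent writes, keyed by series+field
	files []*tsmFile            // on-disk index (TSM) files
}

func main() { _ = engine{} }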

TSM File (file layout diagram)

Compression

Timestamps: encoding based on precision and deltas

Timestamps (best case): Run length encoding

Deltas are all the same for a block
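A minimal sketch of that best-case check, assuming a block of timestamps: if every delta in the block is identical, the whole block can be described by (first timestamp, delta, count).

package main

import "fmt"

// runLengthEncodable reports whether a block of timestamps can be stored as
// (first, delta, count), the best case described above. Sketch only.
func runLengthEncodable(ts []int64) (first, delta int64, count int, ok bool) {
	if len(ts) < 2 {
		return 0, 0, 0, false
	}
	delta = ts[1] - ts[0]
	for i := 2; i < len(ts); i++ {
		if ts[i]-ts[i-1] != delta {
			return 0, 0, 0, false
		}
	}
	return ts[0], delta, len(ts), true
}

func main() {
	// Regularly spaced timestamps, e.g. a 10s collection interval.
	ts := []int64{1443782120, 1443782130, 1443782140, 1443782150}
	fmt.Println(runLengthEncodable(ts)) // 1443782120 10 4 true
}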

Timestamps (good case): Simple8B

Anh and Moffat, "Index compression using 64-bit words"

Timestamps (worst case): raw values

nanosecond timestamps with large deltas

float64: double delta, from Facebook's Gorilla (google: gorilla time series facebook)

https://github.com/dgryski/go-tsz
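Gorilla's float64 scheme works on the XOR of consecutive values' bit patterns: values that repeat or change only slightly leave long runs of zero bits, which cost almost nothing to store. A rough sketch of that first step (not the full bit packing):

package main

import (
	"fmt"
	"math"
	"math/bits"
)

func main() {
	// Consecutive float64 samples from a slowly changing series.
	values := []float64{80.0, 80.0, 80.5}

	prev := math.Float64bits(values[0])
	for _, v := range values[1:] {
		cur := math.Float64bits(v)
		xor := prev ^ cur
		// Only the "meaningful" middle bits of the XOR need to be stored;
		// long runs of leading/trailing zeros compress extremely well.
		fmt.Printf("xor=%016x leading=%d trailing=%d\n",
			xor, bits.LeadingZeros64(xor), bits.TrailingZeros64(xor))
		prev = cur
	}
}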

booleans are bits!

int64 uses double delta with zig-zag encoding; zig-zag is the same as in Protobufs
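Zig-zag maps signed values to unsigned ones so that small negative deltas stay small (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...). A minimal sketch:

package main

import "fmt"

// zigzag encodes a signed integer so that values near zero, positive or
// negative, become small unsigned integers.
func zigzag(n int64) uint64 {
	return uint64((n << 1) ^ (n >> 63))
}

func unzigzag(u uint64) int64 {
	return int64(u>>1) ^ -int64(u&1)
}

func main() {
	for _, d := range []int64{0, -1, 1, -2, 1000} {
		fmt.Println(d, "->", zigzag(d), "->", unzigzag(zigzag(d)))
	}
}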

string uses Snappy, the same compression LevelDB uses

(might add dictionary compression)

Updates: write the new value, resolve at query time

Deletes: write a tombstone, resolve at query time & compaction

Compactions

• Combine multiple TSM files

• Put all series points into same file

• Series points in 1k blocks

• Multiple levels

• Full compaction when cold for writes

Example Query

select percentile(90, value) from cpu where time > now() - 12h and "region" = 'west' group by time(10m), host

How to map to series?

Inverted Index!

Inverted Index

series to ID:
cpu,host=A,region=west#idle -> 1
cpu,host=B,region=west#idle -> 2

measurement to fields:
cpu -> [idle]

host to values:
host -> [A, B]

region to values:
region -> [west]

postings lists:
cpu -> [1, 2]
host=A -> [1]
host=B -> [2]
region=west -> [1, 2]
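For the example query, the series to scan come from intersecting postings lists, e.g. the list for the cpu measurement with the list for region=west. A small sketch of that intersection (the map and function names are illustrative):

package main

import "fmt"

// Postings lists as on the slide: measurement or tag pair -> sorted series IDs.
var postings = map[string][]uint64{
	"cpu":         {1, 2},
	"host=A":      {1},
	"host=B":      {2},
	"region=west": {1, 2},
}

// intersect returns the IDs present in both sorted lists.
func intersect(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			out = append(out, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
	}
	return out
}

func main() {
	// where "region" = 'west' on measurement cpu -> series IDs [1 2]
	fmt.Println(intersect(postings["cpu"], postings["region=west"]))
}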

Index V1

• In-memory

• Load on boot

• Memory constrained

• Slower boot times with high cardinality

Index V2

time series meta data: we have an in-memory index; do we already have an on-disk index? nope

WAL (an append only file) -> in-memory index -> on-disk indices (periodic flushes) -> on-disk index (compactions)

Index File Layout

Robin Hood Hashing

• Can fully load table

• No linked lists for lookup

• Perfect for read-only hashes

Positions     [ 0, 1, 2, 3, 4 ]

Start:
Keys          [  ,  ,  ,  ,  ]
Probe Lengths [ 0, 0, 0, 0, 0 ]

A hashes to 0; slot 0 is empty:
Keys          [ A,  ,  ,  ,  ]
Probe Lengths [ 0, 0, 0, 0, 0 ]

B hashes to 1; slot 1 is empty:
Keys          [ A, B,  ,  ,  ]
Probe Lengths [ 0, 0, 0, 0, 0 ]

C hashes to 1; slot 1 is taken, so C probes to slot 2 (probe length 1):
Keys          [ A, B, C,  ,  ]
Probe Lengths [ 0, 0, 1, 0, 0 ]

D hashes to 0; slot 0 is taken by A (probe 0). At slot 1, D's probe length (1) is greater than B's (0), so D steals slot 1:
Keys          [ A, D, C,  ,  ]
Probe Lengths [ 0, 1, 1, 0, 0 ]

B, displaced with probe 1, ties C at slot 2 and continues to slot 3 (probe 2):
Keys          [ A, D, C, B,  ]
Probe Lengths [ 0, 1, 1, 2, 0 ]

Steal from probe rich, give to probe poor.

Refinement: track the average probe length.

Keys          [ A, D, C, B,  ]
Probe Lengths [ 0, 1, 1, 2, 0 ]    Average: 1

Lookup for D starts at hash(D) + average = 0 + 1.
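A compact sketch of the insertion rule from this walkthrough, assuming a fixed-size table and a hard-coded stand-in for the hash function:

package main

import "fmt"

// robinHood sketches the insertion rule above: walk forward from the key's
// home slot, and whenever the resident key's probe length is shorter than
// ours, swap ("steal from probe rich, give to probe poor").
type robinHood struct {
	keys  []string
	probe []int
	used  []bool
	home  map[string]int // illustrative stand-in for a hash function
}

func (t *robinHood) insert(key string) {
	i, probe := t.home[key], 0
	for {
		if !t.used[i] {
			t.keys[i], t.probe[i], t.used[i] = key, probe, true
			return
		}
		if t.probe[i] < probe {
			// Displace the "richer" resident and keep inserting it instead.
			t.keys[i], key = key, t.keys[i]
			t.probe[i], probe = probe, t.probe[i]
		}
		i = (i + 1) % len(t.keys)
		probe++
	}
}

func main() {
	t := &robinHood{
		keys:  make([]string, 5),
		probe: make([]int, 5),
		used:  make([]bool, 5),
		home:  map[string]int{"A": 0, "B": 1, "C": 1, "D": 0},
	}
	for _, k := range []string{"A", "B", "C", "D"} {
		t.insert(k)
	}
	fmt.Println(t.keys, t.probe) // [A D C B ] [0 1 1 2 0]
}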

Cardinality Estimation

HyperLogLog++

More Details & Write Up to Come!

Thank you.

Paul Dix
paul@influxdb.com
@pauldix
