
Inside the InfluxDB Storage Engine

Paul Dix paul@influxdb.com

@pauldix

TSDB

Inverted Index

preliminary intro materials…

Everything is indexed by time and series

Shards

10/10/2015   10/11/2015   10/12/2015   10/13/2015

Data organized into shards of time; each is an underlying DB. Efficient to drop old data.

InfluxDB data

temperature,device=dev1,building=b1 internal=80,external=18 1443782126

Measurement: temperature
Tags: device=dev1,building=b1
Fields: internal=80,external=18
Timestamp: 1443782126

We actually store up to ns scale timestamps, but I couldn't fit them on the slide.
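A rough illustration of how that line breaks apart. This is a deliberately simplified sketch (it ignores escaping and field type parsing), not InfluxDB's actual parser:

package main

import (
	"fmt"
	"strings"
)

// parseLine splits a line-protocol point into measurement, tags, fields, and
// timestamp. Simplified sketch: no escaping, no field type handling.
func parseLine(line string) (measurement, tags, fields, ts string) {
	parts := strings.SplitN(line, " ", 3) // "<measurement,tags> <fields> <timestamp>"
	mt := strings.SplitN(parts[0], ",", 2)
	measurement = mt[0]
	if len(mt) > 1 {
		tags = mt[1]
	}
	if len(parts) > 1 {
		fields = parts[1]
	}
	if len(parts) > 2 {
		ts = parts[2]
	}
	return
}

func main() {
	m, t, f, ts := parseLine("temperature,device=dev1,building=b1 internal=80,external=18 1443782126")
	fmt.Println(m, "|", t, "|", f, "|", ts)
	// temperature | device=dev1,building=b1 | internal=80,external=18 | 1443782126
}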

Each series and field maps to a unique ID:

temperature,device=dev1,building=b1#internal -> 1
temperature,device=dev1,building=b1#external -> 2

Data per ID is tuples ordered by time:

1 -> (1443782126, 80)
2 -> (1443782126, 18)
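A minimal sketch of those two mappings in Go (type and variable names are illustrative, not the engine's actual code):

package main

import "fmt"

// A (time, value) tuple; data per ID is kept ordered by time.
type point struct {
	unixTime int64
	value    float64
}

func main() {
	// Each series + field maps to a unique ID.
	seriesFieldToID := map[string]uint64{
		"temperature,device=dev1,building=b1#internal": 1,
		"temperature,device=dev1,building=b1#external": 2,
	}

	// Per-ID data is a time-ordered list of tuples.
	dataByID := map[uint64][]point{
		1: {{1443782126, 80}},
		2: {{1443782126, 18}},
	}

	id := seriesFieldToID["temperature,device=dev1,building=b1#internal"]
	fmt.Println(id, dataByID[id]) // 1 [{1443782126 80}]
}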

Arranging in Key/Value Stores

Key (ID, Time)    Value
1,1443782126      80
1,1443782127      81    <- new data
2,1443782126      18
2,1443782130      17
2,1443782256      15
3,1443700126      18

The key space is ordered.

Many existing storage engines have this model
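A minimal sketch of that layout on top of an ordered key space, in Go. The 16-byte key encoding (8-byte ID, 8-byte timestamp, both big-endian) is an assumption for illustration, and a sorted slice stands in for the storage engine's ordered keys:

package main

import (
	"encoding/binary"
	"fmt"
	"sort"
)

// encodeKey packs (ID, time) big-endian so that byte-wise ordering of keys
// matches (ID, time) ordering; this is what makes reading one series a
// contiguous sweep of the key space. Illustrative encoding, not the engine's.
func encodeKey(id uint64, ts int64) string {
	var b [16]byte
	binary.BigEndian.PutUint64(b[0:8], id)
	binary.BigEndian.PutUint64(b[8:16], uint64(ts))
	return string(b[:])
}

func main() {
	kv := map[string]float64{
		encodeKey(1, 1443782126): 80,
		encodeKey(1, 1443782127): 81,
		encodeKey(2, 1443782126): 18,
		encodeKey(2, 1443782130): 17,
		encodeKey(2, 1443782256): 15,
		encodeKey(3, 1443700126): 18,
	}

	// Scan series 1: every key from (1, 0) up to (2, 0).
	lo, hi := encodeKey(1, 0), encodeKey(2, 0)
	keys := make([]string, 0, len(kv))
	for k := range kv {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		if k >= lo && k < hi {
			fmt.Println(binary.BigEndian.Uint64([]byte(k)[8:]), kv[k])
		}
	}
}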

New Storage Engine?!

First we used LSM Trees:

• deletes expensive
• too many open file handles

Then mmap COW B+Trees:

• write throughput
• compression

Neither met our requirements:

• High write throughput
• Awesome read performance
• Better compression
• Writes can't block reads
• Reads can't block writes
• Write multiple ranges simultaneously
• Hot backups
• Many databases open in a single process

Enter InfluxDB's Time Structured Merge Tree (TSM Tree)

like LSM, but different

Components

• WAL
• In-memory cache
• Index Files

Similar to LSM Trees: the WAL is the same, the in-memory cache is like MemTables, and the index files are like SSTables.

awesome time series data -> WAL (an append only file) -> in-memory index -> on-disk index (periodic flushes)

Memory mapped!
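Sketched as a Go struct (all names are assumptions for illustration, not InfluxDB's actual types), the engine holds those components together roughly like this:

package main

import "os"

// Illustrative sketch of the TSM components described above.
type tsmPoint struct {
	unixNano int64
	value    float64
}

type tsmFile struct {
	path string // immutable, memory-mapped index file produced by a flush or compaction
}

type engine struct {
	wal   *os.File              // append-only write-ahead log
	cache map[string][]tsmPoint // in-memory index of recent writes, keyed by series+field
	files []*tsmFile            // on-disk index (TSM) files
}

func main() { _ = engine{} }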

TSM File (file layout diagram)

Compression

Timestamps: encoding based on precision and deltas

Timestamps (best case): Run length encoding

Deltas are all the same for a block
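A minimal sketch of that best-case check, assuming a block of timestamps: if every delta in the block is identical, the whole block can be described by (first timestamp, delta, count).

package main

import "fmt"

// runLengthEncodable reports whether a block of timestamps can be stored as
// (first, delta, count), the best case described above. Sketch only.
func runLengthEncodable(ts []int64) (first, delta int64, count int, ok bool) {
	if len(ts) < 2 {
		return 0, 0, 0, false
	}
	delta = ts[1] - ts[0]
	for i := 2; i < len(ts); i++ {
		if ts[i]-ts[i-1] != delta {
			return 0, 0, 0, false
		}
	}
	return ts[0], delta, len(ts), true
}

func main() {
	// Regularly spaced timestamps, e.g. a 10s collection interval.
	ts := []int64{1443782120, 1443782130, 1443782140, 1443782150}
	fmt.Println(runLengthEncodable(ts)) // 1443782120 10 4 true
}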

Timestamps (good case): Simple8B

Anh and Moffat, "Index compression using 64-bit words"

Timestamps (worst case): raw values

nanosecond timestamps with large deltas

float64: double delta, from Facebook's Gorilla (google: gorilla time series facebook)

https://github.com/dgryski/go-tsz
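Gorilla's float64 scheme works on the XOR of consecutive values' bit patterns: values that repeat or change only slightly leave long runs of zero bits, which cost almost nothing to store. A rough sketch of that first step (not the full bit packing):

package main

import (
	"fmt"
	"math"
	"math/bits"
)

func main() {
	// Consecutive float64 samples from a slowly changing series.
	values := []float64{80.0, 80.0, 80.5}

	prev := math.Float64bits(values[0])
	for _, v := range values[1:] {
		cur := math.Float64bits(v)
		xor := prev ^ cur
		// Only the "meaningful" middle bits of the XOR need to be stored;
		// long runs of leading/trailing zeros compress extremely well.
		fmt.Printf("xor=%016x leading=%d trailing=%d\n",
			xor, bits.LeadingZeros64(xor), bits.TrailingZeros64(xor))
		prev = cur
	}
}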

booleans are bits!

int64 uses double delta with zig-zag encoding; zig-zag is the same as in Protobufs
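Zig-zag maps signed values to unsigned ones so that small negative deltas stay small (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...). A minimal sketch:

package main

import "fmt"

// zigzag encodes a signed integer so that values near zero, positive or
// negative, become small unsigned integers.
func zigzag(n int64) uint64 {
	return uint64((n << 1) ^ (n >> 63))
}

func unzigzag(u uint64) int64 {
	return int64(u>>1) ^ -int64(u&1)
}

func main() {
	for _, d := range []int64{0, -1, 1, -2, 1000} {
		fmt.Println(d, "->", zigzag(d), "->", unzigzag(zigzag(d)))
	}
}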

string uses Snappy, the same compression LevelDB uses

(might add dictionary compression)

Updates: write the new value, resolve at query time

Deletes: write a tombstone, resolve at query time & compaction

Compactions

• Combine multiple TSM files

• Put all series points into same file

• Series points in 1k blocks

• Multiple levels

• Full compaction when cold for writes

Example Query

select percentile(90, value) from cpu where time > now() - 12h and "region" = 'west' group by time(10m), host

How to map to series?

Inverted Index!

Inverted Index

series to ID:
cpu,host=A,region=west#idle -> 1
cpu,host=B,region=west#idle -> 2

measurement to fields:
cpu -> [idle]

host to values:
host -> [A, B]

region to values:
region -> [west]

postings lists:
cpu -> [1, 2]
host=A -> [1]
host=B -> [2]
region=west -> [1, 2]
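For the example query, the series to scan come from intersecting postings lists, e.g. the list for the cpu measurement with the list for region=west. A small sketch of that intersection (the map and function names are illustrative):

package main

import "fmt"

// Postings lists as on the slide: measurement or tag pair -> sorted series IDs.
var postings = map[string][]uint64{
	"cpu":         {1, 2},
	"host=A":      {1},
	"host=B":      {2},
	"region=west": {1, 2},
}

// intersect returns the IDs present in both sorted lists.
func intersect(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			out = append(out, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
	}
	return out
}

func main() {
	// where "region" = 'west' on measurement cpu -> series IDs [1 2]
	fmt.Println(intersect(postings["cpu"], postings["region=west"]))
}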

Index V1

• In-memory

• Load on boot

• Memory constrained

• Slower boot times with high cardinality

Index V2

time series meta data: we have an in-memory index; do we already have an on-disk index? nope

WAL (an append only file) -> in-memory index -> on-disk indices (periodic flushes) -> on-disk index (compactions)

Index File Layout

Robin Hood Hashing

• Can fully load table

• No linked lists for lookup

• Perfect for read-only hashes

Positions     [ 0, 1, 2, 3, 4 ]

Start:
Keys          [  ,  ,  ,  ,  ]
Probe Lengths [ 0, 0, 0, 0, 0 ]

A hashes to 0; slot 0 is empty:
Keys          [ A,  ,  ,  ,  ]
Probe Lengths [ 0, 0, 0, 0, 0 ]

B hashes to 1; slot 1 is empty:
Keys          [ A, B,  ,  ,  ]
Probe Lengths [ 0, 0, 0, 0, 0 ]

C hashes to 1; slot 1 is taken, so C probes to slot 2 (probe length 1):
Keys          [ A, B, C,  ,  ]
Probe Lengths [ 0, 0, 1, 0, 0 ]

D hashes to 0; slot 0 is taken by A (probe 0). At slot 1, D's probe length (1) is greater than B's (0), so D steals slot 1:
Keys          [ A, D, C,  ,  ]
Probe Lengths [ 0, 1, 1, 0, 0 ]

B, displaced with probe 1, ties C at slot 2 and continues to slot 3 (probe 2):
Keys          [ A, D, C, B,  ]
Probe Lengths [ 0, 1, 1, 2, 0 ]

Steal from probe rich, give to probe poor.

Refinement: track the average probe length.

Keys          [ A, D, C, B,  ]
Probe Lengths [ 0, 1, 1, 2, 0 ]    Average: 1

Lookup for D starts at hash(D) + average = 0 + 1.
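A compact sketch of the insertion rule from this walkthrough, assuming a fixed-size table and a hard-coded stand-in for the hash function:

package main

import "fmt"

// robinHood sketches the insertion rule above: walk forward from the key's
// home slot, and whenever the resident key's probe length is shorter than
// ours, swap ("steal from probe rich, give to probe poor").
type robinHood struct {
	keys  []string
	probe []int
	used  []bool
	home  map[string]int // illustrative stand-in for a hash function
}

func (t *robinHood) insert(key string) {
	i, probe := t.home[key], 0
	for {
		if !t.used[i] {
			t.keys[i], t.probe[i], t.used[i] = key, probe, true
			return
		}
		if t.probe[i] < probe {
			// Displace the "richer" resident and keep inserting it instead.
			t.keys[i], key = key, t.keys[i]
			t.probe[i], probe = probe, t.probe[i]
		}
		i = (i + 1) % len(t.keys)
		probe++
	}
}

func main() {
	t := &robinHood{
		keys:  make([]string, 5),
		probe: make([]int, 5),
		used:  make([]bool, 5),
		home:  map[string]int{"A": 0, "B": 1, "C": 1, "D": 0},
	}
	for _, k := range []string{"A", "B", "C", "D"} {
		t.insert(k)
	}
	fmt.Println(t.keys, t.probe) // [A D C B ] [0 1 1 2 0]
}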

Cardinality Estimation

HyperLogLog++

More Details & Write Up to Come!

Thank you.

Paul Dix
paul@influxdb.com
@pauldix
