Ordered Set Problems Giulio Ermanno Pibiri 07/06/2019 [email protected] http://pages.di.unipi.it/pibiri

Page 1: Ordered Set Problems - pages.di.unipi.it

Ordered Set Problems

Giulio Ermanno Pibiri

07/06/2019

[email protected] http://pages.di.unipi.it/pibiri

Page 5: Ordered Set Problems - pages.di.unipi.it

The Static Ordered Set Problem

Given a set of n items and an order relation defined on them, we are asked to design a data structure that supports

Access, Contains, Successor, Predecessor efficiently.

Let us assume our items are integers drawn from some universe of size u ≥ n.

If the integers are not to be compressed: use an array.

Operations are made efficient by binary search with loop unrolling,

cutting off to an SSE/AVX (SIMD) linear search on small segments.

If the keys are uniformly distributed, interpolation search can help:

O(log log n) time with high probability.
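
The two search strategies mentioned above can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes Successor(x) means "the smallest element ≥ x" (the slides do not define it), the function names `successor_binary` and `successor_interpolation` are hypothetical, and the loop unrolling and SIMD cut-off are omitted.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Assumed semantics: Successor(x) = smallest element >= x, or nullopt if none.
std::optional<uint32_t> successor_binary(const std::vector<uint32_t>& a, uint32_t x) {
    auto it = std::lower_bound(a.begin(), a.end(), x);  // plain binary search
    if (it == a.end()) return std::nullopt;
    return *it;
}

// Same query answered with interpolation search on a sorted array.
std::optional<uint32_t> successor_interpolation(const std::vector<uint32_t>& a, uint32_t x) {
    if (a.empty() || x > a.back()) return std::nullopt;
    if (x <= a.front()) return a.front();
    // Invariant: a[lo] < x <= a[hi], hence a[hi] - a[lo] > 0.
    std::size_t lo = 0, hi = a.size() - 1;
    while (hi - lo > 1) {
        // Guess the position proportionally to where x falls in [a[lo], a[hi]].
        uint64_t num = uint64_t(x - a[lo]) * (hi - lo);
        std::size_t mid = lo + std::size_t(num / (a[hi] - a[lo]));
        mid = std::clamp(mid, lo + 1, hi - 1);  // stay strictly inside (lo, hi)
        if (a[mid] < x) lo = mid; else hi = mid;
    }
    return a[hi];
}
```

On uniformly distributed keys the interpolation guess lands close to the answer, which is where the O(log log n) behaviour comes from; on skewed keys this simple variant can degrade to a linear number of probes.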

Let us also assume n is so big that we must compress the set.

Page 6: Ordered Set Problems - pages.di.unipi.it

Sorted integer sets are ubiquitous

Inverted indexes

Databases

Semantic data

Geospatial data

Graph compression

E-Commerce

Page 7: Ordered Set Problems - pages.di.unipi.it

The Static Compressed Ordered Set Problem

A large body of research describes different space/time trade-offs (from ~1970 to 2019):

• Elias’ Gamma and Delta
• Elias-Fano
• Variable-Byte Family
• Binary Interpolative Coding
• Simple Family
• PForDelta
• QMX
• Quasi-Succinct
• Partitioned Elias-Fano
• Clustered Elias-Fano
• Optimal Variable-Byte
• DINT

+ set intersection, union and decode

Page 13: Ordered Set Problems - pages.di.unipi.it

Partitioning by Cardinality

The problem with (almost all) such representations is that Access, Contains, and Predecessor/Successor

are not natively supported: we can only decode sequentially.

Solution 1: Introduce some redundancy to accelerate queries:

the so-called skip pointers.

Example with blocks of B = 4 integers (stored components: Upperbounds, Offsets, Bits):

Upperbounds: 14 34 49 98

Sequence:    3 9 10 14 | 23 24 25 34 | 38 42 44 49 | 50 65 71 98
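
The skip-pointer idea can be sketched as follows. This is only an illustration under stated assumptions: the `SkipIndex` type and its members are hypothetical names, the blocks are kept uncompressed in a flat array as a stand-in for the compressed bit stream, and Successor(x) is assumed to return the smallest element ≥ x.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

struct SkipIndex {
    std::size_t block_size;              // B (must be > 0)
    std::vector<uint32_t> values;        // stand-in for the compressed "Bits"
    std::vector<uint32_t> upperbounds;   // last (largest) value of each block
    std::vector<std::size_t> offsets;    // where each block starts in `values`

    SkipIndex(std::vector<uint32_t> sorted, std::size_t B)
        : block_size(B), values(std::move(sorted)) {
        for (std::size_t begin = 0; begin < values.size(); begin += B) {
            std::size_t end = std::min(begin + B, values.size());
            offsets.push_back(begin);
            upperbounds.push_back(values[end - 1]);
        }
    }

    // Binary search on the upperbounds, then scan ("decode") a single block.
    std::optional<uint32_t> successor(uint32_t x) const {
        auto it = std::lower_bound(upperbounds.begin(), upperbounds.end(), x);
        if (it == upperbounds.end()) return std::nullopt;
        std::size_t block = std::size_t(it - upperbounds.begin());
        std::size_t begin = offsets[block];
        std::size_t end = std::min(begin + block_size, values.size());
        for (std::size_t i = begin; i < end; ++i) {
            if (values[i] >= x) return values[i];
        }
        return std::nullopt;  // not reached: the block's upperbound is >= x
    }
};
```

With the 16-integer sequence above and B = 4, the constructor produces the upperbounds 14, 34, 49, 98, and a query only ever scans one block of at most B integers.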

Solution 2: Redesign the data structure.

Page 17: Ordered Set Problems - pages.di.unipi.it

Partitioning by Universe

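Does this remind you of something?

[Elias-Fano 1971-1975]

[van Emde Boas 1974-1975]

Figure: a universe of size u split into √u slices of √u values each, with one summary bit per slice.

summary: 1 1 0 1

bitmap:  1 0 0 1 | 1 1 0 1 | 0 0 0 0 | 0 1 1 1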

Page 24: Ordered Set Problems - pages.di.unipi.it

Partitioning by Universe

Assume a slice size of 2^3.

Contains(x):
  i = x >> 3
  search for x - (i << 3) in the i-th slice

Successor(x):
  i = x >> 3
  search for the successor of x - (i << 3) in the i-th slice
  (if the i-th slice is empty or x - (i << 3) > max_value in the i-th slice, then return the first value to the right)

Example: x = 010101 in binary (21), so i = x >> 3 = 010 (2) and x - 16 = 5 (101).

Intersecting two lists only requires intersecting the slices they have in common.
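
The Contains and Successor procedures for a slice size of 2^3 can be sketched as follows. This is a hedged illustration: `SlicedSet` and `kSliceBits` are hypothetical names, each slice is stored as a plain sorted array of low bits, every key is assumed to be smaller than the given universe size, and Successor(x) is assumed to return the smallest element ≥ x.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

constexpr uint32_t kSliceBits = 3;  // slice size 2^3, as in the slide

struct SlicedSet {
    // slices[i] holds the sorted low 3 bits of the keys whose high bits equal i.
    std::vector<std::vector<uint8_t>> slices;

    // Assumption: every key in `sorted` is smaller than `universe`.
    SlicedSet(const std::vector<uint32_t>& sorted, uint32_t universe)
        : slices((universe + (1u << kSliceBits) - 1) >> kSliceBits) {
        for (uint32_t x : sorted) {
            uint32_t i = x >> kSliceBits;
            slices[i].push_back(uint8_t(x - (i << kSliceBits)));
        }
    }

    bool contains(uint32_t x) const {
        uint32_t i = x >> kSliceBits;
        if (i >= slices.size()) return false;
        uint8_t low = uint8_t(x - (i << kSliceBits));
        return std::binary_search(slices[i].begin(), slices[i].end(), low);
    }

    std::optional<uint32_t> successor(uint32_t x) const {
        uint32_t i = x >> kSliceBits;
        if (i >= slices.size()) return std::nullopt;
        uint8_t low = uint8_t(x - (i << kSliceBits));
        auto it = std::lower_bound(slices[i].begin(), slices[i].end(), low);
        if (it != slices[i].end()) return (i << kSliceBits) + *it;
        // Slice i is empty or all its values are smaller: first value to the right.
        for (uint32_t j = i + 1; j < slices.size(); ++j) {
            if (!slices[j].empty()) return (j << kSliceBits) + slices[j].front();
        }
        return std::nullopt;
    }
};
```

For x = 21 the code computes i = 2 and looks for the low part 5 in slice 2, mirroring the worked example. The linear scan over the following slices is where a summary structure would help in a real implementation.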

Page 28: Ordered Set Problems - pages.di.unipi.it

Bitmaps

Good old data structure for storing dense sets: the x-th bit is set if integer x is in the set.

1 1 0 0 0 1 0 1 1 0 1 1 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 1 0 1 0

S = {0,1,5,7,8,10,11,14,18,21,22,28,30}

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

• Contains: testing a bit
• Successor/Predecessor: __builtin_ctzll
• Select: __builtin_ctzll
• Max: __builtin_clzll
• Min: __builtin_ctzll
• Decode: __builtin_ctzll
• Insertion: setting a bit
• Deletion: clearing a bit

Nothing is better than a bitmap for dense sets.
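
A few of the bitmap operations listed above can be sketched with the GCC/Clang builtins as follows. The `Bitmap` type is a hypothetical illustration, not code from the talk: bit x lives in word x / 64 at position x % 64 (least-significant bit first, unlike the left-to-right drawing), keys passed to insert, erase and contains are assumed to be smaller than the universe size, and Successor(x) is assumed to return the smallest element ≥ x.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

struct Bitmap {
    std::vector<uint64_t> words;

    explicit Bitmap(std::size_t universe) : words((universe + 63) / 64, 0) {}

    void insert(uint64_t x) { words[x >> 6] |= uint64_t(1) << (x & 63); }     // set a bit
    void erase(uint64_t x) { words[x >> 6] &= ~(uint64_t(1) << (x & 63)); }   // clear a bit
    bool contains(uint64_t x) const { return (words[x >> 6] >> (x & 63)) & 1; }

    // Smallest element >= x: mask away the bits below x, then count trailing zeros.
    std::optional<uint64_t> successor(uint64_t x) const {
        std::size_t w = x >> 6;
        if (w >= words.size()) return std::nullopt;
        uint64_t word = words[w] & (~uint64_t(0) << (x & 63));
        while (true) {
            if (word != 0) return (uint64_t(w) << 6) + __builtin_ctzll(word);
            if (++w == words.size()) return std::nullopt;
            word = words[w];
        }
    }

    // Largest element: last non-zero word, position of its highest set bit.
    std::optional<uint64_t> max() const {
        for (std::size_t w = words.size(); w-- > 0;) {
            if (words[w] != 0) return (uint64_t(w) << 6) + 63 - __builtin_clzll(words[w]);
        }
        return std::nullopt;
    }

    // Decode: repeatedly extract the lowest set bit of each word.
    std::vector<uint64_t> decode() const {
        std::vector<uint64_t> out;
        for (std::size_t w = 0; w < words.size(); ++w) {
            uint64_t word = words[w];
            while (word != 0) {
                out.push_back((uint64_t(w) << 6) + __builtin_ctzll(word));
                word &= word - 1;  // clear the lowest set bit
            }
        }
        return out;
    }
};
```

Both builtins are undefined for a zero argument, which is why every call is guarded by a non-zero check.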

Page 30: Ordered Set Problems - pages.di.unipi.it

Roaring [Lemire et al. 2013]

Assume u = 2^32.

≤ 2^16 spans of 2^16 values each

Dense: cardinality > 4096
Sparse: otherwise

[Figure: three spans labelled Sparse, Dense, Sparse]

Dense spans are represented with bitmaps of 2^16 bits.

Sparse spans are represented with sorted arrays of 16-bit integers.

This ensures at most 16 bits per key (excluding overhead).
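
The arithmetic behind the 4096 threshold can be checked with a short sketch. The names below (`SpanKind`, `classify`, `payload_bits`, `span_index`, `span_value`) are hypothetical and do not come from the CRoaring API; the split of a key into high and low 16 bits is assumed from the "2^16 spans of 2^16 values" partitioning.

```cpp
#include <cstddef>
#include <cstdint>

enum class SpanKind { Sparse, Dense };

constexpr std::size_t kSpanSize = std::size_t(1) << 16;  // 2^16 values per span
constexpr std::size_t kDenseThreshold = 4096;

// Assumed split of a 32-bit key: high 16 bits pick the span, low 16 bits are stored.
constexpr uint16_t span_index(uint32_t x) { return uint16_t(x >> 16); }
constexpr uint16_t span_value(uint32_t x) { return uint16_t(x & 0xFFFF); }

constexpr SpanKind classify(std::size_t cardinality) {
    return cardinality > kDenseThreshold ? SpanKind::Dense : SpanKind::Sparse;
}

// Payload size in bits, excluding any per-span overhead.
constexpr std::size_t payload_bits(std::size_t cardinality) {
    return classify(cardinality) == SpanKind::Dense
               ? kSpanSize           // bitmap of 2^16 bits
               : 16 * cardinality;   // sorted array of 16-bit integers
}
```

At exactly 4096 keys the sorted array costs 16 × 4096 = 65536 bits, the same as the bitmap, so switching to a bitmap above that point keeps the payload at or below 16 bits per key.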

Page 31: Ordered Set Problems - pages.di.unipi.it

Slicing

Assume u = 2^32.

≤ 2^16 slices of 2^16 values each
Dense: cardinality > 2^16/2 (ensures at most 2 bits per key)

Within a slice of 2^16 values: ≤ 2^8 slices of 2^8 values each
Dense: cardinality ≥ 31 (ensures at most 8 bits per key)

Dense slices are represented with bitmaps of 2^16 or 2^8 bits.
Sparse slices are represented with sorted arrays of 8-bit integers.

[Figure: slices of 2^16 values labelled Sparse, Dense, Sparse; below them, slices of 2^8 values labelled D, S, D, S, D]

Page 32: Ordered Set Problems - pages.di.unipi.it

Intersection

• Dense vs. Dense (Bitmap vs. Bitmap): bitwise AND operations + (usually) automatic compiler vectorization

• Dense vs. Sparse (Bitmap vs. Array): given the array A, check if bit A[i] is set in the bitmap

• Sparse vs. Sparse (Array vs. Array): vectorized processing using _mm_cmpestrm and _mm_shuffle_epi8 SIMD instructions

Intersection between lists has to intersect only the slices in common between the lists.
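
The three intersection cases can be sketched in scalar C++ as follows. The function names are hypothetical, bitmaps are assumed to be arrays of 64-bit words with bit x at position x % 64 of word x / 64, and the sparse-vs-sparse case deliberately swaps the SIMD kernel for `std::set_intersection` as a plain scalar stand-in.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Dense vs. Dense: word-by-word AND (a loop compilers can usually vectorize).
std::vector<uint64_t> intersect_dense_dense(const std::vector<uint64_t>& a,
                                            const std::vector<uint64_t>& b) {
    std::vector<uint64_t> out(std::min(a.size(), b.size()));
    for (std::size_t i = 0; i < out.size(); ++i) out[i] = a[i] & b[i];
    return out;
}

// Dense vs. Sparse: keep each array element whose bit is set in the bitmap.
std::vector<uint16_t> intersect_dense_sparse(const std::vector<uint64_t>& bitmap,
                                             const std::vector<uint16_t>& array) {
    std::vector<uint16_t> out;
    for (uint16_t v : array) {
        std::size_t w = std::size_t(v) >> 6;
        if (w < bitmap.size() && ((bitmap[w] >> (v & 63)) & 1)) out.push_back(v);
    }
    return out;
}

// Sparse vs. Sparse: scalar merge of two sorted arrays (stand-in for the SIMD version).
std::vector<uint16_t> intersect_sparse_sparse(const std::vector<uint16_t>& a,
                                              const std::vector<uint16_t>& b) {
    std::vector<uint16_t> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
    return out;
}
```

A full list intersection would first match the slices present in both lists and then dispatch each matching pair to one of these three routines.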

Page 33: Ordered Set Problems - pages.di.unipi.it

Summing up

Partitioning by Cardinality (PC)

Partitioning by Universe (PU)

2 different paradigms

Page 34: Ordered Set Problems - pages.di.unipi.it

Experimental Comparison — Setting

C++ sources
https://github.com/jermp/s_indexes
https://github.com/jermp/dint
https://github.com/ot/ds2i
https://github.com/RoaringBitmap/CRoaring

Machine Intel i7-4790K CPU @4GHz, 32 GiB RAM, Linux 4.13.0

Compiler gcc 7.2.0 (with all optimizations: -march=native and -O3)

Datasets

Page 35: Ordered Set Problems - pages.di.unipi.it

Experimental Comparison — Setting

Datasets

Configurations

Page 36: Ordered Set Problems - pages.di.unipi.it

Experimental Comparison — Compression Effectiveness

bits per integer

PC-based methods, such as BIC and PEF, are best for space usage.

Slicing (PU-based) stands in a trade-off position.

Page 37: Ordered Set Problems - pages.di.unipi.it

Experimental Comparison — Sequential Decoding Time

ns per integer

PU-based methods are as fast as the fastest (vectorized) PC-based methods.

Page 38: Ordered Set Problems - pages.di.unipi.it

Experimental Comparison — Intersection Time

µs per intersection

PU-based methods outperform PC-based methods.

Page 39: Ordered Set Problems - pages.di.unipi.it

Experimental Comparison — Point Queries

Access: ns per query

Successor: ns per query

Page 40: Ordered Set Problems - pages.di.unipi.it

Experimental Comparison — The Trade-Off Curve

Density = 1/1000

Page 42: Ordered Set Problems - pages.di.unipi.it

Future Research Directions

The Dynamic Ordered Set Problem = The Static Ordered Set Problem + insertions / deletions

Theory: Fusion Trees, van Emde Boas Trees, Exponential Search Trees, Y-Fast Tries, Dynamic Elias-Fano

Practice: Red-Black Trees, B-Trees

Memory management is the challenge.

Page 43: Ordered Set Problems - pages.di.unipi.it

The Dynamic Ordered Set Problem — On-going Work

Insert

n = 1,000,000 32-bit keys uniformly distributed

Page 44: Ordered Set Problems - pages.di.unipi.it

The Dynamic Ordered Set Problem — On-going Work

Successor

n = 1,000,000 32-bit keys uniformly distributed

Page 45: Ordered Set Problems - pages.di.unipi.it

The Dynamic Ordered Set Problem — On-going Work

Heap usage

Page 46: Ordered Set Problems - pages.di.unipi.it

Any questions?

Thanks for your attention, time, and patience!