cmu scs 15-721 (spring 2017) :: database compression · cmu 15-721 (spring 2017) observation . i/o...

57
Andy Pavlo // Carnegie Mellon University // Spring 2016 ADVANCED DATABASE SYSTEMS Lecture #11 – Database Compression 15-721 @Andy_Pavlo // Carnegie Mellon University // Spring 2017

Upload: others

Post on 25-Jun-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

Andy Pavlo // Carnegie Mellon University // Spring 2016

ADVANCED DATABASE SYSTEMS

Lecture #11 – Database Compression

15-721

@Andy_Pavlo // Carnegie Mellon University // Spring 2017

Page 2: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

TODAY’S AGENDA

Background Naïve Compression OLAP Columnar Compression OLTP Index Compression

2

Page 3: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

OBSERVATION

I/O is the main bottleneck if the DBMS has to fetch data from disk.

In-memory DBMSs are more complicated → Compressing the database reduces DRAM requirements

and processing.

Key trade-off is speed vs. compression ratio

→ In-memory DBMSs (always?) choose speed.

3

Page 4: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

REAL-WORLD DATA CHARACTERISTICS

Data sets tend to have highly skewed distributions for attribute values. → Example: Zipfian distribution of the Brown Corpus

Data sets tend to have high correlation between attributes of the same tuple. → Example: Zip Code to City, Order Date to Ship Date

4

Page 5: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

DATABASE COMPRESSION

Goal #1: Must produce fixed-length values.

Goal #2: Allow the DBMS to postpone decompression as long as possible during query execution.

5

Page 6: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

LOSSLESS VS. LOSSY COMPRESSION

When a DBMS uses compression, it is always lossless because people don’t like losing data.

Any kind of lossy compression is has to be performed at the application level.

Some new DBMSs support approximate queries → Example: BlinkDB, SnappyData

6

Page 7: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

COMPRESSION GRANUL ARIT Y

Choice #1: Block-level

→ Compress a block of tuples for the same table.

Choice #2: Tuple-level

→ Compress the contents of the entire tuple (NSM-only).

Choice #3: Attribute-level

→ Compress a single attribute value within one tuple. → Can target multiple attributes for the same tuple.

Choice #4: Column-level

→ Compress multiple values for one or more attributes stored for multiple tuples (DSM-only).

7

Page 8: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

ZONE MAPS

Pre-computed aggregates for blocks of data. DBMS can check the zone map first to decide whether it wants to access the block.

8

Zone Map

val 100 400 280 1400

type MIN MAX AVG SUM

5 COUNT

Original Data

val 100 200 300 400 400

SELECT * FROM table WHERE val > 600

Page 9: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

NAÏVE COMPRESSION

Compress data using a general purpose algorithm. Scope of compression is only based on the data provided as input. → LZO (1996), LZ4 (2011), Snappy (2011), Zstd (2015)

Considerations → Computational overhead → Compress vs. decompress speed.

9

Page 10: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

NAÏVE COMPRESSION

Choice #1: Entropy Encoding

→ More common sequences use less bits to encode, less common sequences use more bits to encode.

Choice #2: Dictionary Encoding

→ Build a data structure that maps data segments to an identifier. Replace those segments in the original data with a reference to the segments position in the dictionary data structure.

10

Page 11: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

MYSQL INNODB COMPRESSION

11

[1,2,4,8]

KB

Source: MySQL 5.7 Documentation

Buffer Pool Disk Pages

Compressed page0

mod log

Compressed page0

mod log

Compressed page1

mod log

Compressed page2

mod log

Page 12: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

MYSQL INNODB COMPRESSION

11

[1,2,4,8]

KB

Source: MySQL 5.7 Documentation

Buffer Pool Disk Pages

Compressed page0

mod log

Compressed page0

mod log

Compressed page1

mod log

Compressed page2

mod log

Updates

Page 13: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

MYSQL INNODB COMPRESSION

11

16 KB

[1,2,4,8]

KB

Source: MySQL 5.7 Documentation

Buffer Pool Disk Pages

Uncompressed

page0

Compressed page0

mod log

Compressed page0

mod log

Compressed page1

mod log

Compressed page2

mod log

Updates

Page 14: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

NAÏVE COMPRESSION

The data has to be decompressed first before it can be read and (potentially) modified. → This limits the “scope” of the compression scheme.

These schemes also do not consider the high-level meaning or semantics of the data.

12

Page 15: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

OBSERVATION

We can perform exact-match comparisons and natural joins on compressed data if predicates and data are compressed the same way. → Range predicates are more tricky…

13

SELECT * FROM users WHERE name = ‘Andy’

SELECT * FROM users WHERE name = XX

NAME SALARY Andy 99999 Dana 88888

NAME SALARY XX AA YY BB

Page 16: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

COLUMNAR COMPRESSION

Null Suppression Run-length Encoding Bitmap Encoding Delta Encoding Incremental Encoding Mostly Encoding Dictionary Encoding

14

Page 17: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

COMPRESSION VS. MSSQL INDEXES

The MSSQL columnar indexes were a second copy of the data (aka fractured mirrors). → The original data was still stored as in NSM format.

We are now talking about compressing the primary copy of the data. Many of the same techniques are applicable.

15

Page 18: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

NULL SUPPRESSION

Consecutive zeros or blanks in the data are replaced with a description of how many there were and where they existed. → Example: Oracle’s Byte-Aligned Bitmap Codes (BBC)

Useful in wide tables with sparse data.

16

DATABASE COMPRESSION SIGMOD RECORD 1993

Page 19: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

RUN-LENGTH ENCODING

Compress runs of the same value in a single column into triplets: → The value of the attribute. → The start position in the column segment. → The # of elements in the run.

Requires the columns to be sorted intelligently to maximize compression opportunities.

17

Page 20: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

BITMAP ENCODING

Store a separate Bitmap for each unique value for a particular attribute where an offset in the vector corresponds to a tuple. → Can use the same compression schemes that we talked

about for Bitmap indexes.

Only practical if the value cardinality is low.

18

MODEL 204 ARCHITECTURE AND PERFORMANCE High Performance Transaction Systems 1987

Page 21: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

DELTA ENCODING

Recording the difference between values that follow each other in the same column. → The base value can be stored in-line or in a separate look-

up table. → Can be combined with RLE to get even better

compression ratios.

19

Original Data

time

12:01 12:00

12:03 12:02

12:04

temp

99.4 99.5

99.6 99.5

99.4

Page 22: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

DELTA ENCODING

Recording the difference between values that follow each other in the same column. → The base value can be stored in-line or in a separate look-

up table. → Can be combined with RLE to get even better

compression ratios.

19

Original Data

time

12:01 12:00

12:03 12:02

12:04

temp

99.4 99.5

99.6 99.5

99.4

Compressed Data

time

+1 12:00

+1 +1

+1

temp

-0.1 99.5

+0.1 +0.1

-0.2

Page 23: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

DELTA ENCODING

Recording the difference between values that follow each other in the same column. → The base value can be stored in-line or in a separate look-

up table. → Can be combined with RLE to get even better

compression ratios.

19

Original Data

time

12:01 12:00

12:03 12:02

12:04

temp

99.4 99.5

99.6 99.5

99.4

Compressed Data

time

+1 12:00

+1 +1

+1

temp

-0.1 99.5

+0.1 +0.1

-0.2

Page 24: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

DELTA ENCODING

Recording the difference between values that follow each other in the same column. → The base value can be stored in-line or in a separate look-

up table. → Can be combined with RLE to get even better

compression ratios.

19

Original Data

time

12:01 12:00

12:03 12:02

12:04

temp

99.4 99.5

99.6 99.5

99.4

Compressed Data

time

(+1,4) 12:00

temp

-0.1 99.5

+0.1 +0.1

-0.2

Compressed Data

time

+1 12:00

+1 +1

+1

temp

-0.1 99.5

+0.1 +0.1

-0.2

Page 25: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

INCREMENTAL ENCODING

Type of delta encoding whereby common prefixes or suffixes and their lengths are recorded so that they need not be duplicated. This works best with sorted data.

20

Original Data

rob robbed robbing robot

Common Prefix

-

Page 26: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

INCREMENTAL ENCODING

Type of delta encoding whereby common prefixes or suffixes and their lengths are recorded so that they need not be duplicated. This works best with sorted data.

20

Original Data

rob robbed robbing robot

Common Prefix

- rob

Page 27: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

INCREMENTAL ENCODING

Type of delta encoding whereby common prefixes or suffixes and their lengths are recorded so that they need not be duplicated. This works best with sorted data.

20

Original Data

rob robbed robbing robot

Common Prefix

- rob robb rob

Page 28: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

INCREMENTAL ENCODING

Type of delta encoding whereby common prefixes or suffixes and their lengths are recorded so that they need not be duplicated. This works best with sorted data.

20

Original Data

rob robbed robbing robot

Common Prefix

- rob robb rob

Compressed Data

rob bed ing ot

0 3 4 3

Prefix

Length

Suffix

Page 29: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

MOSTLY ENCODING

When the values for an attribute are “mostly” less than the largest size, you can store them as a smaller data type. → The remaining values that cannot be compressed are

stored in their raw form.

21

Source: Redshift Documentation

Original Data

int64

4 2

6 99999999

8

Compressed Data

mostly8

4 2

6 XXX

8

offset 3

value 99999999

Page 30: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

DICTIONARY COMPRESSION

Replace frequent patterns with smaller codes. Most pervasive compression scheme in DBMSs.

Need to support fast encoding and decoding. Need to also support range queries.

22

DICTIONARY-BASED ORDER-PRESERVING STRING COMPRESSION FOR MAIN MEMORY COLUMN STORES SIGMOD 2009

Page 31: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

DICTIONARY COMPRESSION

When to construct the dictionary? What should the scope be of the dictionary? How do we allow for range queries? How do we enable fast encoding/decoding?

23

Page 32: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

DICTIONARY CONSTRUCTION

Choice #1: All At Once

→ Compute the dictionary for all the tuples at a given point of time.

→ New tuples must use a separate dictionary or the all tuples must be recomputed.

Choice #2: Incremental

→ Merge new tuples in with an existing dictionary. → Likely requires re-encoding to existing tuples.

24

Page 33: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

DICTIONARY SCOPE

Choice #1: Block-level

→ Only include a subset of tuples within a single table. → Potentially lower compression ratio, but can add new

tuples more easily.

Choice #2: Table-level

→ Construct a dictionary for the entire table. → Better compression ratio, but expensive to update.

Choice #3: Multi-Table

→ Can be either subset or entire tables. → Sometimes helps with joins and set operations.

25

Page 34: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

MULTI-AT TRIBUTE ENCODING

Instead of storing a single value per dictionary entry, store entries that span attributes. → I’m not sure any DBMS actually implements this.

26

Original Data Compressed Data

val2

101 202

101 202

101

val1

B A

C A

B

val1+val2

YY XX

ZZ XX

YY

val2

101 202

101

val1

B A

C

code

YY XX

ZZ

Page 35: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

ENCODING / DECODING

A dictionary needs to support two operations: → Encode: For a given uncompressed value, convert it into

its compressed form. → Decode: For a given compressed value, convert it back

into its original form.

No magic hash function will do this for us. We need two data structures to support operations in both directions.

27

Page 36: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

ORDER-PRESERVING COMPRESSION

The encoded values need to support sorting in the same order as original values.

28

Original Data

name Andrea Joy Andy Dana

Compressed Data

code 10 20 30 40

value Andrea Andy Dana Joy

name 10 40 20 30

Page 37: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

ORDER-PRESERVING COMPRESSION

The encoded values need to support sorting in the same order as original values.

28

SELECT * FROM users WHERE name LIKE ‘And%’

Original Data

name Andrea Joy Andy Dana

Compressed Data

code 10 20 30 40

value Andrea Andy Dana Joy

name 10 40 20 30

SELECT * FROM users WHERE name BETWEEN 10 AND 20

Page 38: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

ORDER-PRESERVING COMPRESSION

29

SELECT name FROM users WHERE name LIKE ‘And%’

Original Data

name Andrea Joy Andy Dana

Compressed Data

code 10 20 30 40

value Andrea Andy Dana Joy

name 10 40 20 30

Still have to perform seq scan

Page 39: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

ORDER-PRESERVING COMPRESSION

29

SELECT name FROM users WHERE name LIKE ‘And%’

Original Data

name Andrea Joy Andy Dana

Compressed Data

code 10 20 30 40

value Andrea Andy Dana Joy

name 10 40 20 30

SELECT DISTINCT name FROM users WHERE name LIKE ‘And%’

Still have to perform seq scan

Only need to access dictionary

Page 40: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

DICTIONARY IMPLEMENTATIONS

Hash Table:

→ Fast and compact. → Unable to support range and prefix queries.

B+Tree:

→ Slower than a hash table and takes more memory. → Can support range and prefix queries.

30

Page 41: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

SHARED-LEAVES TREES

31

DICTIONARY-BASED ORDER-PRESERVING STRING COMPRESSION FOR MAIN MEMORY COLUMN STORES SIGMOD 2009

Decode Index

• • •

Encode Index

Decode Index

value aab

code 10

aae 20 aaf 30 aaz 40

value zzb

code 960

zzm 970 zzx 980 zzz 990

Sorted

Shared Leaf

Page 42: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

SHARED-LEAVES TREES

31

DICTIONARY-BASED ORDER-PRESERVING STRING COMPRESSION FOR MAIN MEMORY COLUMN STORES SIGMOD 2009

Decode Index

• • •

Encode Index

Decode Index

value aab

code 10

aae 20 aaf 30 aaz 40

value zzb

code 960

zzm 970 zzx 980 zzz 990

Original

Value

Encoded

Value

Sorted

Shared Leaf

Page 43: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

SHARED-LEAVES TREES

31

DICTIONARY-BASED ORDER-PRESERVING STRING COMPRESSION FOR MAIN MEMORY COLUMN STORES SIGMOD 2009

Decode Index

• • •

Encode Index

Decode Index

value aab

code 10

aae 20 aaf 30 aaz 40

value zzb

code 960

zzm 970 zzx 980 zzz 990

Original

Value

Encoded

Value

Encoded

Value

Original

Value

Sorted

Shared Leaf

Page 44: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

SHARED-LEAVES TREES

31

DICTIONARY-BASED ORDER-PRESERVING STRING COMPRESSION FOR MAIN MEMORY COLUMN STORES SIGMOD 2009

Decode Index

• • •

Encode Index

Decode Index

value aab

code 10

aae 20 aaf 30 aaz 40

value zzb

code 960

zzm 970 zzx 980 zzz 990

Original

Value

Encoded

Value

Encoded

Value

Original

Value

Sorted

Shared Leaf

Incremental

Encoding

Page 45: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

OBSERVATION

An OLTP DBMS cannot use the OLAP compression techniques because we need to support fast random tuple access. → Compressing & decompressing “hot” tuples on-the-fly

would be too slow to do during a txn.

Indexes consume a large portion of the memory for an OLTP database…

33

Page 46: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

OLTP INDEX OVERHEAD

34

Tuples

Primary

Indexes

Secondary

Indexes

TPC-C 42.5% 33.5% 24.0%

Articles 64.8% 22.6% 12.6%

Voter 45.1% 54.9% 0%

57.5%

54.9%

35.2%

Page 47: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

HYBRID INDEXES

Split a single logical index into two physical indexes. Data is migrated from one stage to the next over time. → Dynamic Stage: New data, fast to update. → Static Stage: Old data, compressed + read-only.

All updates go to dynamic stage. Reads may need to check both stages.

35

REDUCING THE STORAGE OVERHEAD OF MAIN-MEMORY OLTP DATABASES WITH HYBRID INDEXES SIGMOD 2016

Page 48: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

HYBRID INDEXES

36

Dynamic

Index

Static

Index

Insert

Update

Delete

Bloom Filter

Merge

Page 49: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

HYBRID INDEXES

36

Dynamic

Index

Static

Index

Insert

Update

Delete

Read

Bloom Filter

Merge

Page 50: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

HYBRID INDEXES

36

Dynamic

Index

Static

Index

Insert

Update

Delete

Read

Bloom Filter

Merge

Read

Read

Page 51: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

COMPACT B+TREE

37

20

10 35

6 12 23 38

Empty Slots

Page 52: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

COMPACT B+TREE

37

12

6 12 23 38

Page 53: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

COMPACT B+TREE

37

12

6 12 23 38

Pointers

Page 54: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

COMPACT B+TREE

37

12

6 12 23 38

Computed Offset

Page 55: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

HYBRID INDEXES

38

Source: Huanchen Zhang

50% Reads / 50% Writes 50 million Entries

5.1 5.0

1.7

6.2

12.6

2.0

0

4

8

12

16

Random Int

Mono-Inc Int

Email

Th

ro

ug

hp

ut (M

op

/se

c)

1.3

1.8

3.2

0.9 0.9

2.3

0

1

2

3

4

Random Int

Mono-Inc Int

Email M

em

or

y (G

B)

Original B+Tree Hybrid B+Tree

Page 56: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

PARTING THOUGHTS

Dictionary encoding is probably the most useful compression scheme because it does not require pre-sorting.

The DBMS can combine different approaches for even better compression.

It is important to wait as long as possible during query execution to decompress data.

39

Page 57: CMU SCS 15-721 (Spring 2017) :: Database Compression · CMU 15-721 (Spring 2017) OBSERVATION . I/O is the main bottleneck if the DBMS has to fetch data from disk. In-memory DBMSs

CMU 15-721 (Spring 2017)

NEXT CL ASS

Physical vs. Logical Logging

40