enhancing bitmap indices - purdue university · time (msec) wah uniform ab uniform wah landsat ab...

24
1 Enhancing Bitmap Indices Guadalupe Canahuate The Ohio State University Advisor: Hakan Ferhatosmanoglu

Upload: others

Post on 23-May-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Enhancing Bitmap Indices - Purdue University · Time (msec) WAH Uniform AB Uniform WAH Landsat AB Landsat WAH HEP AB HEP Direct Access Over Compressed Bitmaps n Execution time depends

1

Enhancing Bitmap Indices

Guadalupe Canahuate

The Ohio State University

Advisor: Hakan Ferhatosmanoglu

Page 2: Enhancing Bitmap Indices - Purdue University · Time (msec) WAH Uniform AB Uniform WAH Landsat AB Landsat WAH HEP AB HEP Direct Access Over Compressed Bitmaps n Execution time depends

2

Introduction

n Bitmap indices:

¡ Widely used in data warehouses and

read-only domains

¡ Implemented in commercial DBMS such

as Oracle, Informix and Sybase

¡ Extremely efficient for execution of point

and range queries

Page 3: Enhancing Bitmap Indices - Purdue University · Time (msec) WAH Uniform AB Uniform WAH Landsat AB Landsat WAH HEP AB HEP Direct Access Over Compressed Bitmaps n Execution time depends

3

Applications

n Data warehouses

n Scientific data

¡ High-energy physics: Simulations are

continuously run, and notable events are stored

with all the details.

¡ Climate modeling: sensor data.

¡ Astro-physics: telescopes devoted for observations.

n Visualization applicationsSupernova Observatory

Page 4: Enhancing Bitmap Indices - Purdue University · Time (msec) WAH Uniform AB Uniform WAH Landsat AB Landsat WAH HEP AB HEP Direct Access Over Compressed Bitmaps n Execution time depends

4

Bitmap Indices

n Data is quantized into b categories using b bits

n Each tuple is encoded based on which category its

attribute belongs to:

¡ Bitmap Equality Encoding (BEE)

(Projection Index, Binning):

suited for point queries

¡ Bitmap Range Encoding (BRE):

suited for range queries

n Fast Bitwise Operations over bit vectors (AND, OR,

NOT, XOR)

1100102

1001003

1110011

321321

REEE

Value

Page 5: Enhancing Bitmap Indices - Purdue University · Time (msec) WAH Uniform AB Uniform WAH Landsat AB Landsat WAH HEP AB HEP Direct Access Over Compressed Bitmaps n Execution time depends

5

Bitmap Compression

n Run-length encoders

¡ Runs of 0s à (0, run-count)

¡ Byte-aligned Bitmap Code (BBC) [Oracle ‘94]

¡ Word-Aligned Hybrid Code (WAH) [LBNL ‘02]

¡ No need to decompress for query processing

¡ Full scan of the bitmap is needed

Page 6: Enhancing Bitmap Indices - Purdue University · Time (msec) WAH Uniform AB Uniform WAH Landsat AB Landsat WAH HEP AB HEP Direct Access Over Compressed Bitmaps n Execution time depends

6

Our Goal

n Enhance bitmap indices:¡ Reduce Index Size (ICDE’05, TKDE sub)

n Improve compression

¡ Direct access over compressed bitmaps (VLDB’06)n Bitmap Hashing

¡ Handle Missing Data (EDBT’06) n Query support / semantics

¡ Handle data updates (SSDBM’07)n Appends of new data

Page 7: Enhancing Bitmap Indices - Purdue University · Time (msec) WAH Uniform AB Uniform WAH Landsat AB Landsat WAH HEP AB HEP Direct Access Over Compressed Bitmaps n Execution time depends

7

Our Goal

n Enhance bitmap indices:¡ Reduce Index Size (ICDE’05, TKDE sub)

n Improve compression

¡ Direct access over compressed bitmaps (VLDB’06)n Bitmap Hashing

¡ Handle Missing Data (EDBT’06) n Query support / semantics

¡ Handle data updates (SSDBM’07)n Appends of new data

Page 8: Enhancing Bitmap Indices - Purdue University · Time (msec) WAH Uniform AB Uniform WAH Landsat AB Landsat WAH HEP AB HEP Direct Access Over Compressed Bitmaps n Execution time depends

8

Improving Compression by Data

Reorganization

n NP-Complete

n Most TSP heuristics are ineffective

n Minimize the hamming distance of adjacent numbers

n Gray-code: A space filling-curve for hamming space

n A scalable, in-place algorithm to gray-sort the tuples

Submitted to TKDE

Page 9: Enhancing Bitmap Indices - Purdue University · Time (msec) WAH Uniform AB Uniform WAH Landsat AB Landsat WAH HEP AB HEP Direct Access Over Compressed Bitmaps n Execution time depends

9

Reordering Framework

Column

Ordering

Row

Ordering

Run Length

Encoding

Input Bitmaps

Compressed

Bitmaps

Reorder (A, start, end, b)

1: i ß start

2: j ß end

3: while i < j do

4: Decrement j until S(j,Cols[b]) = 0

5: Increment i until S(i,Cols[b]) = 1

6: if i < j then

7: Swap the ith and jth tuples on the table

8: end if9: end while

10: if b < no_of_bits then

11: Cols[b+1] = nextCol();

12: Reorder (A, start, j, b+1)

13: Reorder (A, j+1, end, b+1)

14: Reverse (j+1, end)

15: end if

EE and RE

Bitmaps

Col Ordering

Criteria

Word-Aligned

Hybrid (WAH)

Gray Code

Ordering

Lossless

compression

Page 10: Enhancing Bitmap Indices - Purdue University · Time (msec) WAH Uniform AB Uniform WAH Landsat AB Landsat WAH HEP AB HEP Direct Access Over Compressed Bitmaps n Execution time depends

10

n Using Bitmap Data

¡ Number of Set Bits

n Static: Most 1s Asc/Desc

n Dynamic: Most 1s in the Longest Run

¡ Compressibility

n Static:

n Dynamic: Weighted Sum of Compressibility over the Gray code segments

n Using Query Workloads

¡ Most frequent accessed columns first

Column Ordering Criteria

iis

nC −=

2

Page 11: Enhancing Bitmap Indices - Purdue University · Time (msec) WAH Uniform AB Uniform WAH Landsat AB Landsat WAH HEP AB HEP Direct Access Over Compressed Bitmaps n Execution time depends

11

0

10000

20000

30000

40000

50000

60000

70000

80000

1 11 21 31 41 51 61 71 81 91 101 111

Column

Siz

e (

32-b

it W

ord

s)

0

5000

10000

15000

20000

25000

30000

35000

40000

1 11 21 31 42 52 63 75 85 97 107 117

Column

Siz

e (

32-b

it W

ord

s)

Experimental Results

n 2-10x improvement

in compression

over already

compressed

bitmaps

n 4-7x improvement

in query execution

time

0

5000

10000

15000

20000

25000

30000

35000

40000

1 11 21 31 42 52 63 75 85 97 107 117

Column

Siz

e (

32-b

it W

ord

s)

WAH

Gray Code

Gray Code + Col Order

Page 12: Enhancing Bitmap Indices - Purdue University · Time (msec) WAH Uniform AB Uniform WAH Landsat AB Landsat WAH HEP AB HEP Direct Access Over Compressed Bitmaps n Execution time depends

12

Our Goal

n Enhance bitmap indices:¡ Reduce Index Size (ICDE’05, TKDE sub)

n Improve compression

¡ Direct access over compressed bitmaps (VLDB’06)n Bitmap Hashing

¡ Handle Missing Data (EDBT’06) n Query support / semantics

¡ Handle data updates (SSDBM’07)n Appends of new data

Page 13: Enhancing Bitmap Indices - Purdue University · Time (msec) WAH Uniform AB Uniform WAH Landsat AB Landsat WAH HEP AB HEP Direct Access Over Compressed Bitmaps n Execution time depends

13

Compression and Direct

Access

n Bitmaps need to be compressed:¡ Row numbers do not longer correspond to

the bit position in the bitmap

n Cannot pin-point a row¡ Need to scan the bitmap

n Enable direct access over the compressed bitmaps

n Trade-off accuracy vs. efficiency¡ No false negatives

VLDB 2006

Page 14: Enhancing Bitmap Indices - Purdue University · Time (msec) WAH Uniform AB Uniform WAH Landsat AB Landsat WAH HEP AB HEP Direct Access Over Compressed Bitmaps n Execution time depends

14

Direct Access Over

Compressed Bitmaps

n Hashing/Bloom Filters¡ A 2m bit array indexed using k independent

hash functions

¡ False positives can happen, but false negatives cannot

¡ Only the set bits are inserted into the AB

¡ Three levels of encoding:n Per table, per attribute, per bitmap column

¡ Precision estimation

Page 15: Enhancing Bitmap Indices - Purdue University · Time (msec) WAH Uniform AB Uniform WAH Landsat AB Landsat WAH HEP AB HEP Direct Access Over Compressed Bitmaps n Execution time depends

15

0

200

400

600

800

1000

1200

1400

1600

0 2000 4000 6000 8000 10000

# of Rows QueriedE

xe

c.

Tim

e (

ms

ec

)

WAH Uniform

AB Uniform

WAH Landsat

AB Landsat

WAH HEP

AB HEP

Direct Access Over

Compressed Bitmaps

n Execution time depends on the number of rows queried, not in the size of the bitmaps

n For queries over less than ~15% of the rows, execution time is up to 3 orders of magnitude faster than WAH

n Always set the array

size to be smaller than

WAH bitmap

Page 16: Enhancing Bitmap Indices - Purdue University · Time (msec) WAH Uniform AB Uniform WAH Landsat AB Landsat WAH HEP AB HEP Direct Access Over Compressed Bitmaps n Execution time depends

16

Our Goal

n Enhance bitmap indices:¡ Reduce Index Size (ICDE’05, TKDE sub)

n Improve compression

¡ Direct access over compressed bitmaps (VLDB’06)n Bitmap Hashing

¡ Handle Missing Data (EDBT’06) n Query support / semantics

¡ Handle data updates (SSDBM’07)n Appends of new data

Page 17: Enhancing Bitmap Indices - Purdue University · Time (msec) WAH Uniform AB Uniform WAH Landsat AB Landsat WAH HEP AB HEP Direct Access Over Compressed Bitmaps n Execution time depends

17

Handling missing data

n Incomplete databases are common in all research and industry domains

n Index performance degrades in the presence of missing data

n Missing data map to a distinguished category

n Query Semantics:¡ Missing is a match

n In a patient database retrieve the records whose first name is “John” and middle name “Paul”.

¡ Missing is not a matchn In a census database retrieve the interviewee that

answered “C” in question 4 and “A” in question 8

EDBT 2006

Page 18: Enhancing Bitmap Indices - Purdue University · Time (msec) WAH Uniform AB Uniform WAH Landsat AB Landsat WAH HEP AB HEP Direct Access Over Compressed Bitmaps n Execution time depends

18

Handling Missing Data

Range EncodedRange EncodedEquality EncodedEquality Encoded

0

0

0

0

1

0

0

0

0

1

B1,5

0

1

0

0

0

0

1

0

0

0

B1,0

0

1

0

1

0

0

1

0

0

0

B1,1

1

1

0

1

0

0

1

0

1

0

B1,2

1

1

1

1

0

0

1

1

1

0

B1,3

1

1

1

1

0

1

1

1

1

0

B1,4

1

1

1

1

1

1

1

1

1

1

B1,5

00100210

00001missing9

0100038

0001017

0000056

1000045

00001missing4

0100033

0010022

0000051

B1,4B1,3B1,2B1,1B1,0ValueRecord

Page 19: Enhancing Bitmap Indices - Purdue University · Time (msec) WAH Uniform AB Uniform WAH Landsat AB Landsat WAH HEP AB HEP Direct Access Over Compressed Bitmaps n Execution time depends

19

Query Processing with Missing Data

Range EncodedEquality Encodedv1 ≤ Ai ≤ v2 =

Missing is a Match

Missing is not a Match

Page 20: Enhancing Bitmap Indices - Purdue University · Time (msec) WAH Uniform AB Uniform WAH Landsat AB Landsat WAH HEP AB HEP Direct Access Over Compressed Bitmaps n Execution time depends

20

Our Goal

n Enhance bitmap indices:¡ Reduce Index Size (ICDE’05, TKDE sub)

n Improve compression

¡ Direct access over compressed bitmaps (VLDB’06)n Bitmap Hashing

¡ Handle Missing Data (EDBT’06) n Query support / semantics

¡ Handle data updates (SSDBM’07)n Appends of new data

Page 21: Enhancing Bitmap Indices - Purdue University · Time (msec) WAH Uniform AB Uniform WAH Landsat AB Landsat WAH HEP AB HEP Direct Access Over Compressed Bitmaps n Execution time depends

21

Handling Data Updates

n Look for this project in the poster session… J

SSDBM 2007

Page 22: Enhancing Bitmap Indices - Purdue University · Time (msec) WAH Uniform AB Uniform WAH Landsat AB Landsat WAH HEP AB HEP Direct Access Over Compressed Bitmaps n Execution time depends

22

Conclusion

n Enhance bitmap indices:¡ Improve performance

n Better compression

n Direct access over compressed bitmap

¡ Overcome limitationsn Handle updates efficiently

n Support more types of queries

n Extend the applicability of bitmaps to more domains that can benefit from bitmaps good performance

Page 23: Enhancing Bitmap Indices - Purdue University · Time (msec) WAH Uniform AB Uniform WAH Landsat AB Landsat WAH HEP AB HEP Direct Access Over Compressed Bitmaps n Execution time depends

23

Ongoing Work

n Support new types of queries

¡ Similarity Queries

n Integrate bitmaps to the coastal monitoring system (NSF Cyber Infrastructure Project)

n Bitmaps on emerging hardware technologies

Page 24: Enhancing Bitmap Indices - Purdue University · Time (msec) WAH Uniform AB Uniform WAH Landsat AB Landsat WAH HEP AB HEP Direct Access Over Compressed Bitmaps n Execution time depends

24

Questions and Comments

n Thank you!

Email: [email protected]