Approximate Encoding for Direct Access and Query Processing over Compressed Bitmaps
Tan Apaydin – The Ohio State University
Guadalupe Canahuate – The Ohio State University
Hakan Ferhatosmanoglu – The Ohio State University
Ali Saman Tosun – University of Texas at San Antonio
Presentation Outline
Motivation
Goal
Approximate Bitmaps (AB) encoding
AB example
Theoretical analysis
Experiments and Results
Conclusion
Motivation
Bitmap indices
  Data warehouses
  Scientific data
  Visualization applications
  Bitwise operations
Bitmap compression
  Run-length encoders
  Word Aligned Hybrid (WAH)
  Byte-aligned Bitmap Code (BBC)
Motivation
The row numbers no longer correspond to the bit positions in the bitmap
Queries over a few particular rows are as expensive as queries asking for all the rows
Commonly, users are only interested in a small subset of the dataset at a time.
For example:
  A query over the transactions of the last 7 days
  Spatial queries over objects in a specific geographical area
Motivation
Visualization applications
  Millions of different readings ordered by their geographic location
  Users ask range queries over some of the readings for a given area
  The answers are highlighted on the screen
  Several degrees of resolution make approximate answers acceptable
Our Goal
Enable direct access over any subset of the bitmap
Achieve effective compression
Maintain bitwise operations for query execution
Trade off efficiency vs. accuracy
No false negatives
The approach
Our solution is inspired by Bloom filters
  A $2^m$-bit array indexed using k independent hash functions
  A data object is inserted by setting the k positions in the array corresponding to the hash values of the object
  False positives can happen, but false negatives cannot
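The Bloom filter behaviour described here can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class name `BloomFilter`, the use of SHA-256 with a seed prefix to obtain k hash values, and the one-byte-per-bit storage are all assumptions made for readability.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter over a 2**m-bit array with k hash functions."""

    def __init__(self, m: int, k: int):
        self.size = 1 << m                  # 2**m positions
        self.k = k
        self.bits = bytearray(self.size)    # one byte per bit, for clarity only

    def _positions(self, item: str):
        # Assumption: k hash values are derived by prefixing a seed to the item.
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        # Insertion sets the k positions given by the hash values.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        # True if all k positions are set; this may be a false positive,
        # but an inserted item can never be reported as absent.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=10, k=3)
for key in ("11", "54", "67"):
    bf.add(key)
```

Every inserted key tests as present afterwards; a key that was never inserted usually tests as absent, but may collide with positions set by other keys.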
Approximate Bitmaps (AB)
A Bloom filter-like structure
Only the set bits are inserted into the AB
Three levels of encoding: per table, per attribute, per bitmap column
Parameters:
  The hash string mapping function, F
  The k hash functions, {H1(x),…,Hk(x)}
  The size of the AB, $n = \alpha s = 2^m$
Precision in terms of α and k: $P \approx 1 - (1 - e^{-k/\alpha})^k$
AB Example
Col:  1  2  3  4  5  6  7  8  9
      A1 A2 A3 B1 B2 B3 C1 C2 C3
1     1  0  0  0  0  1  0  0  1
2     0  1  0  0  1  0  0  1  0
3     0  0  1  1  0  0  1  0  0
4     0  0  1  0  0  1  0  0  1
5     1  0  0  1  0  0  1  0  0
6     1  0  0  0  1  0  1  0  0
7     0  1  0  0  1  0  0  1  0
8     0  0  1  0  0  1  0  0  1
A bitmap table for a dataset with 8 rows and 3 attributes. Each attribute is divided into 3 categories.
Bitmap table size: 72 bits
Number of set bits: 24
F(i,j) = concatenate(i,j) = x
H1(x) = x mod 32
m = 5
AB size: $2^5 = 32$ bits
AB Example - Insertion
Initially all bits in the AB are zero
AB (bit positions 0 to 31): 00000000000000000000000000000000
AB Example - Insertion
To insert the set bit in (1,1): x = 11, H(11) = 11 mod 32 = 11, AB(11) = 1
AB (bit positions 0 to 31): 00000000000100000000000000000000
AB Example - Insertion
To insert the set bit in (5,4): x = 54, H(54) = 54 mod 32 = 22, AB(22) = 1
AB (bit positions 0 to 31): 00000000000100000000001000000000
AB Example - Insertion
After all insertions
AB (bit positions 0 to 31): 01110100100100101101001001001100
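The insertion steps of this example can be replayed with a short script. It is a sketch under the example's own parameters (F as decimal concatenation, a single hash H(x) = x mod 32); the names `ROWS`, `F`, `H`, and `build_ab` are chosen here for illustration.

```python
# Bitmap table from the example: 8 rows, columns A1..C3 (numbered 1..9).
ROWS = [
    "100001001",
    "010010010",
    "001100100",
    "001001001",
    "100100100",
    "100010100",
    "010010010",
    "001001001",
]
AB_SIZE = 32  # 2**5 bits


def F(i: int, j: int) -> int:
    """Hash string of cell (i, j): decimal concatenation, e.g. (5, 4) -> 54."""
    return int(f"{i}{j}")


def H(x: int) -> int:
    """The single hash function of the example."""
    return x % AB_SIZE


def build_ab(rows):
    ab = [0] * AB_SIZE
    for i, row in enumerate(rows, start=1):
        for j, bit in enumerate(row, start=1):
            if bit == "1":              # only set bits are inserted
                ab[H(F(i, j))] = 1
    return ab


ab_string = "".join(str(b) for b in build_ab(ROWS))
print(ab_string)  # 01110100100100101101001001001100
```

The 24 set bits land on only 14 distinct positions, because several cells share the same value of x mod 32.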
AB Example - Analysis
The underlined positions are false positives
Only 8 of the 48 zero cells in the bitmap table map to a set bit in the AB
AB (bit positions 0 to 31): 01110100100100101101001001001100
Estimated precision:
  α = AB size / set bits = 32/24 = 1.33, k = 1
  $FP = 1 - e^{-k/\alpha}$ (for k = 1)
  $P = 1 - FP = 1 - (1 - e^{-1/1.33}) \approx 47\%$
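Both figures on this slide can be checked mechanically: the count of zero cells that hit a set AB bit, and the estimated precision for α = 32/24 and k = 1. The script below is a sketch using the example's parameters; the variable names are illustrative, and the table is repeated only so the snippet runs on its own.

```python
import math

ROWS = [
    "100001001", "010010010", "001100100", "001001001",
    "100100100", "100010100", "010010010", "001001001",
]
AB_SIZE = 32


def position(i: int, j: int) -> int:
    # F(i, j) = decimal concatenation, H(x) = x mod 32, as in the example.
    return int(f"{i}{j}") % AB_SIZE


cells = [(i, j, bit) for i, row in enumerate(ROWS, start=1)
         for j, bit in enumerate(row, start=1)]

# Build the AB from the set bits only.
ab = [0] * AB_SIZE
for i, j, bit in cells:
    if bit == "1":
        ab[position(i, j)] = 1

# A false positive is a zero cell whose AB position is nevertheless set.
zero_cells = [(i, j) for i, j, bit in cells if bit == "0"]
false_positives = [(i, j) for i, j in zero_cells if ab[position(i, j)]]

set_bits = sum(bit == "1" for _, _, bit in cells)
alpha = AB_SIZE / set_bits                      # 32 / 24
k = 1
estimated_precision = 1 - (1 - math.exp(-k / alpha)) ** k

print(len(false_positives), "of", len(zero_cells))   # 8 of 48
print(round(estimated_precision, 2))                 # 0.47
```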
AB Example - Retrieval
Consider this query, asking for 4 rows
This is a range query over 4 rows, where the third attribute falls into C1 or C2
AB (bit positions 0 to 31): 01110100100100101101001001001100
Row 4: (4,7): H(47) = 15, AB(15) = 0; (4,8): H(48) = 16, AB(16) = 1
Row 5: (5,7): H(57) = 25, AB(25) = 1, stop
AB Example - Retrieval
Row 6: (6,7): H(67) = 3, AB(3) = 1, stop
Approx Query Answer: {1,1,1,0}
Exact Answer: {0,1,1,0}
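The probing procedure walked through for rows 4, 5, and 6 can be sketched as follows. The function name `probe_row` is hypothetical; the snippet starts from the final AB bit string and covers only the three rows whose probes are listed on the slides.

```python
AB = "01110100100100101101001001001100"  # AB after all insertions


def F(i: int, j: int) -> int:
    return int(f"{i}{j}")        # decimal concatenation, as in the example


def H(x: int) -> int:
    return x % len(AB)           # x mod 32


def probe_row(i: int, columns) -> int:
    """Return 1 as soon as one queried column of row i hits a set AB bit."""
    for j in columns:
        if AB[H(F(i, j))] == "1":
            return 1             # stop probing this row
    return 0


# Third attribute in C1 or C2 corresponds to columns 7 and 8.
answers = {i: probe_row(i, (7, 8)) for i in (4, 5, 6)}
print(answers)  # {4: 1, 5: 1, 6: 1}
```

Row 4 is reported as a match only because (4,8) collides with a set position, which is the false positive visible when comparing the approximate and exact answers. Only the queried rows are touched, so the cost does not depend on the total number of rows.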
Approximate Bitmaps (AB) – Mapping Function F
F maps each cell in the bitmap table to a unique string (the hashing string)
For one AB per table and one AB per attribute, the bit in row i, column j is identified by F(i,j) = i << w || j, where w is large enough to accommodate all j
For one AB per column, the bit in row i is identified by F(i,j) = i
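A minimal sketch of the two mapping functions, reading `i << w || j` as "shift i left by w bits and place j in the low w bits". The function names and the range check are assumptions added for illustration.

```python
def hash_string_table(i: int, j: int, w: int) -> int:
    """F(i, j) for one AB per table or per attribute: i << w || j."""
    if not 0 <= j < (1 << w):
        raise ValueError("w is too small to accommodate column j")
    return (i << w) | j


def hash_string_column(i: int) -> int:
    """F(i, j) for one AB per column: the row number alone identifies the bit."""
    return i


# With 9 columns, w = 4 bits is enough to hold every j.
keys = {hash_string_table(i, j, w=4) for i in range(1, 9) for j in range(1, 10)}
print(len(keys))  # 72 distinct hash strings for 72 cells
```

Because j never overflows its w bits, distinct cells always receive distinct hash strings, which is the uniqueness property the slide requires of F.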
Approximate Bitmaps (AB) – Hash Functions
Single hash function
  Called once and the result is divided into pieces. Each piece is considered as the value of a different hash function.
  Secure Hash Algorithm (SHA), developed by the National Institute of Standards and Technology (NIST)
Multiple hash functions
  Independent hash functions
  For a large number of hash functions, similar performance

Hash Function  H0                H1                H2                ...  H9
Bits           159..144          143..128          127..112          ...  15..0
SHA Output     0100100010001010  1000010100100001  0111100011100010  ...  0000010101110011
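The single-hash approach can be illustrated with SHA-1 from Python's standard `hashlib`, whose 160-bit output splits into ten 16-bit pieces as in the table. How the hash string is serialized before hashing (here, its decimal text) and the function name `sha1_pieces` are assumptions; the slides do not specify them.

```python
import hashlib


def sha1_pieces(x: int, piece_bits: int = 16):
    """Split the 160-bit SHA-1 digest of x into pieces H0, H1, ... (H0 = top bits)."""
    digest = hashlib.sha1(str(x).encode("ascii")).digest()   # 20 bytes = 160 bits
    value = int.from_bytes(digest, "big")
    count = 160 // piece_bits
    mask = (1 << piece_bits) - 1
    # Piece t covers bits (159 - t*piece_bits) down to (160 - (t+1)*piece_bits).
    return [(value >> (160 - piece_bits * (t + 1))) & mask for t in range(count)]


pieces = sha1_pieces(54)
print(len(pieces))  # 10
```

To index an AB of $2^m$ bits with m below 16, each piece would additionally be reduced, for example by keeping its low m bits.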
Approximate Bitmaps (AB) – FP Rate
FP rate: probability that all k bits are set by another data object
$FP = \left(1 - \left(1 - \frac{1}{n}\right)^{ks}\right)^k \approx \left(1 - e^{-ks/n}\right)^k = \left(1 - e^{-k/\alpha}\right)^k$
n is the size of the AB
s is the number of set bits
n = αs, α = n/s
[Figure: FP rate (log scale) vs. k (1 to 19), for α = 4, 8, 16, 32]
[Figure: FP rate vs. α (1 to 19), for k = 1, 2, 3, 4, 5]
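The false-positive rate on this slide is the standard Bloom filter approximation, (1 - e^(-k/α))^k, and it can be evaluated numerically to see how k and α interact. The helper names below are illustrative, and `best_k` simply scans integer values of k up to an assumed limit.

```python
import math


def fp_rate(k: int, alpha: float) -> float:
    """Approximate FP rate for k hash functions and alpha = n / s."""
    return (1.0 - math.exp(-k / alpha)) ** k


def precision(k: int, alpha: float) -> float:
    return 1.0 - fp_rate(k, alpha)


def best_k(alpha: float, k_max: int = 20) -> int:
    """Integer k in 1..k_max with the lowest approximate FP rate."""
    return min(range(1, k_max + 1), key=lambda k: fp_rate(k, alpha))


for alpha in (4, 8, 16, 32):
    k = best_k(alpha)
    print(alpha, k, fp_rate(k, alpha))
```

For a fixed α the rate falls as k grows and then rises again, which matches the known Bloom filter optimum near k = α ln 2.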
Approximate Bitmaps (AB) – Size
In terms of α: n = αs, m = ceil(log2(αs))
One AB per dataset: s = |A|*N (|A| attributes, N rows)
One AB per attribute: s = N
One AB per column: s depends on the data distribution
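As a worked illustration of the size formulas, the sketch below computes m and the resulting AB size for the Uniform dataset (100,000 rows, 2 attributes) at α = 8. The choice of α = 8 and the function name `ab_size` are assumptions made for the example.

```python
import math


def ab_size(alpha: float, s: int):
    """Return (m, n) with m = ceil(log2(alpha * s)) and n = 2**m bits."""
    m = math.ceil(math.log2(alpha * s))
    return m, 1 << m


N = 100_000      # rows in the Uniform dataset
A = 2            # attributes in the Uniform dataset

per_dataset = ab_size(8, A * N)    # one AB for the whole table: s = |A| * N
per_attribute = ab_size(8, N)      # one AB per attribute: s = N

print(per_dataset)    # (21, 2097152)
print(per_attribute)  # (20, 1048576)
```

Because n is rounded up to a power of two, the effective ratio n/s can be somewhat larger than the requested α.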
Experimental Setup
Three datasets:
Dataset   Rows       Attributes  Columns
Uniform   100,000    2           100
Landsat   275,465    60          900
HEP       2,173,762  6           66
Query by sampling (randomly selecting the columns queried)
Varying the number of rows queried from 100 to 10K
Experimental Results - Size
Always use the maximum α that produces an AB smaller than or comparable in size to WAH
[Figure: bitmap size (KWords) vs. alpha (2, 4, 8, 16), one panel each for the Uniform, HEP, and Landsat datasets; series: WAH, Per Dataset, Per Attribute, Per Column]
Experimental Results - Precision
[Figure: Precision vs. # of Hash Functions (k = 1 to 10); series: uniform α=16, landsat α=8, hep α=4, hep α=8]
As α increases, the precision increases steadily and is very close to 1 for larger α
Precision increases as k increases up to the optimum point
Beyond that point, a large number of hash functions produces more collisions
Experimental Results – Exec Time
[Figure: execution time (msec) vs. # of rows queried (0 to 10,000); series: WAH and AB for the Uniform, Landsat, and HEP datasets]
Execution time of the AB depends on the number of rows queried, not on the number of rows in the dataset
For queries over fewer than 10% to 15% of the rows, AB execution is up to 3 orders of magnitude faster than WAH
Conclusion
AB encoding approximates the bitmaps using multiple hashing of the set bits
Allows efficient retrieval of any subset of rows and columns
Trade-off between bitmap size and precision
Three levels of encoding
Approximate query answers are given without database access