bitmap indexing. bitmap index ubitmap index: specialized index that takes advantage wread-mostly...
TRANSCRIPT
![Page 1: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/1.jpg)
Bitmap Indexing
![Page 2: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/2.jpg)
Bitmap Index Bitmap index: specialized index that takes
advantage Read-mostly data: data produced from scientific
experiments can be appended in large groups Fast operations
“Predicate queries” can be performed with bitwise logical operations• Predicate ops: =, <, >, <=, >=, range,• Logical ops: AND, OR, XOR, NOT
They are well supported by hardware Easy to compress, potentially small index size Each individual bitmap is small and frequently
used ones can be cached in memory
![Page 3: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/3.jpg)
Bitmap Index
Can be useful for stable columns with few values
Bitmap: String of bits: 0 (no match) or 1 (match) One bit for each row
Bitmap index record Column value Bitmap DBMS converts bit position into row identifier.
![Page 4: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/4.jpg)
Bitmap Index Example
RowId FacSSN … FacRank 1 098-55-1234 Asst 2 123-45-6789 Asst 3 456-89-1243 Assc 4 111-09-0245 Prof 5 931-99-2034 Asst 6 998-00-1245 Prof 7 287-44-3341 Assc 8 230-21-9432 Asst 9 321-44-5588 Prof 10 443-22-3356 Assc 11 559-87-3211 Prof 12 220-44-5688 Asst
FacRank Bitmap Asst 110010010001 Assc 001000100100 Prof 000101001010
Faculty Table
Bitmap Index on FacRank
![Page 5: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/5.jpg)
Compressing Bitmaps:Run Length Encoding
Bit vector for Assc: 0010000000010000100
Runs: 2, 8, 4 Can we do: 10 1000 100? Is this
unambiguous? Fixed bits per run (max=4)
0010 1000 0100 Variable bits per run
1010 11101000 110100 11101000 is broken into: 1110 (#bits),
1000 (value)• i.e., 4 bits are required, value is 8
![Page 6: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/6.jpg)
Operation-efficient Compression Methods
Uncompressed:0000000000001111000000000 ......0000001000000001111111100000000 .... 0000100
Compressed:12, 0, 0, 0, 0, 1000, 8, 0, 0, 0, 0, 0, 0, 0, 0, 1000
Store very short sequences as-is
AND/OR/COUNT operations:Can uncompress on the fly
Based on variations of Run Length Compression
![Page 7: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/7.jpg)
Indexing High-Dimensional Data
Typically, high-dimensional datasets are collections of points, not regions. E.g., Feature vectors in multimedia applications. Very sparse
Nearest neighbor queries are common. R-tree becomes worse than sequential scan for most
datasets with more than a dozen dimensions.
As dimensionality increases contrast (ratio of distances between nearest and farthest points) usually decreases; “nearest neighbor” is not meaningful. In any given data set, advisable to empirically test contrast.
![Page 8: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/8.jpg)
External Sorting
![Page 9: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/9.jpg)
Why Sort? A classic problem in computer science! Data requested in sorted order
e.g., find students in increasing gpa order Sorting is first step in bulk loading B+ tree
index. Sorting useful for eliminating duplicate
copies in a collection of records (Why?) Sort-merge join algorithm involves sorting. Problem: sort 1Gb of data with 1Mb of
RAM. why not virtual memory?
![Page 10: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/10.jpg)
2-Way Sort: Requires 3 Buffers
Pass 1: Read a page, sort it, write it. only one buffer page is used
Pass 2, 3, …, etc.: three buffer pages used.
Main memory buffers
INPUT 1
INPUT 2
OUTPUT
DiskDisk
![Page 11: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/11.jpg)
Two-Way External Merge Sort
Each pass we read + write each page in file.
N pages in the file => the number of passes
So total cost is:
Idea: Divide and
conquer: sort subfiles and merge
log2 1N
2 12N Nlog
Input file
1-page runs
2-page runs
4-page runs
8-page runs
PASS 0
PASS 1
PASS 2
PASS 3
9
3,4 6,2 9,4 8,7 5,6 3,1 2
3,4 5,62,6 4,9 7,8 1,3 2
2,34,6
4,7
8,91,35,6 2
2,3
4,46,7
8,9
1,23,56
1,22,3
3,4
4,56,6
7,8
![Page 12: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/12.jpg)
General External Merge Sort
To sort a file with N pages using B buffer pages: Pass 0: use B buffer pages. Produce
sorted runs of B pages each. Pass 2, …, etc.: merge B-1 runs.
N B/
B Main memory buffers
INPUT 1
INPUT B-1
OUTPUT
DiskDisk
INPUT 2
. . . . . .
. . .
More than 3 buffer pages. How can we utilize them?
![Page 13: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/13.jpg)
Cost of External Merge Sort
Number of passes: Cost = 2N * (# of passes) E.g., with 5 buffer pages, to sort 108 page
file: Pass 0: = 22 sorted runs of 5 pages
each (last run is only 3 pages) Pass 1: = 6 sorted runs of 20 pages
each (last run is only 8 pages) Pass 2: 2 sorted runs, 80 pages and 28 pages Pass 3: Sorted file of 108 pages
1 1 log /B N B
108 5/
22 4/
![Page 14: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/14.jpg)
Number of Passes of External Sort
N B=3 B=5 B=9 B=17 B=129 B=257100 7 4 3 2 1 11,000 10 5 4 3 2 210,000 13 7 5 4 2 2100,000 17 9 6 5 3 31,000,000 20 10 7 5 3 310,000,000 23 12 8 6 4 3100,000,000 26 14 9 7 4 41,000,000,000 30 15 10 8 5 4
![Page 15: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/15.jpg)
Using B+ Trees for Sorting Scenario: Table to be sorted has B+ tree
index on sorting column(s). Idea: Can retrieve records in order by
traversing leaf pages. Is this a good idea? Cases to consider:
B+ tree is clustered Good idea! B+ tree is not clustered Could be a very
bad idea!
![Page 16: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/16.jpg)
Clustered B+ Tree Used for Sorting
Cost: root to the left-most leaf, then retrieve all leaf pages (data pages)
If leaves contain pointers? Additional cost of retrieving data records: each page fetched just once.
Always better than external sorting!
(Directs search)
Data Records
Index
Data Entries("Sequence set")
![Page 17: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/17.jpg)
Unclustered B+ Tree Used for Sorting
Each data entry contains rid of a data record. In general, one I/O per data record!
(Directs search)
Data Records
Index
Data Entries("Sequence set")
![Page 18: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/18.jpg)
Query Processing
![Page 19: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/19.jpg)
Relational Operations We will consider how to implement:
Selection ( ) Selects a subset of rows from relation. Projection ( ) Deletes unwanted columns from
relation. Join ( ) Allows us to combine two relations. Set-difference ( ) Tuples in reln. 1, but not in reln. 2. Union ( ) Tuples in reln. 1 and in reln. 2. Aggregation (SUM, MIN, etc.) and GROUP BY
Since each op returns a relation, ops can be composed! After we cover the operations, we will discuss how to optimize queries formed by composing them.
![Page 20: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/20.jpg)
Schema for Examples
Bids: Each tuple is 40 bytes long, 100 tuples per
page, 1000 pages. Buyers:
Each tuple is 50 bytes long, 80 tuples per page, 500 pages.
Buyers(id: integer, name: string, rating: integer, age: real)Bids (bid: integer, pid: integer, day: dates, product: string)
![Page 21: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/21.jpg)
21
Access Path Algorithm + data structure used to
locate rows satisfying some condition File scan: can be used for any condition Hash: equality search; all search key
attributes of hash index are specified in condition
B+ tree: equality or range search; a prefix of the search key attributes are specified in condition
Binary search: Relation sorted on a sequence of attributes and some prefix of sequence is specified in condition
![Page 22: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/22.jpg)
Access Paths A tree index matches (a conjunction of)
terms that involve only attributes in a prefix of the search key. E.g., Tree index on <a, b, c> matches the
selection a=5 AND b=3, and a=5 AND b>6, but not b=3.
A hash index matches (a conjunction of) terms that has a term attribute = value for every attribute in the search key of the index. E.g., Hash index on <a, b, c> matches a=5 AND
b=3 AND c=5; but it does not match b=3, or a=5 AND b=3, or a>5 AND b=3 AND c=5.
![Page 23: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/23.jpg)
23
Access Paths Supported by B+
tree Example: Given a B+ tree whose search key
is the sequence of attributes a2, a1, a3, a4 Access path for search a1>5 a2=3.0 a3=‘x’ (R):
find first entry having a2=3.0 a1>5 a3=‘x’ and scan leaves from there until entry having a2>3.0 . Select satisfying entries
Access path for search a2=3.0 a3 >‘x’ (R): locate first entry having a2=3.0 and scan leaves until entry having a2>3.0 . Select satisfying entries
No access path for search a1>5 a3 =‘x’ (R)
![Page 24: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/24.jpg)
24
Choosing an Access Path Selectivity of an access path refers to its
cost Higher selectivity means lower cost (#pages)
If several access paths cover a query, DBMS should choose the one with greatest selectivity
Size of domain of attribute is a measure of the selectivity of domain
Example: CrsCode=‘CS305’ Grade=‘B’ - a B+ tree with search key CrsCode is more selective than a B+ tree with search key Grade
![Page 25: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/25.jpg)
25
Computing Selection
No index on attr: If rows unsorted, cost = M
• Scan all data pages to find rows satisfying the condition
If rows sorted on attr, cost = log2 M + (cost of scan)• Use binary search to locate first data page containing row
in which (attr = value)• Scan further to get all rows satisfying (attr op value)
condition: (attr op value)
![Page 26: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/26.jpg)
26
Computing Selection
B+ tree index on attr (for equality or range search): Locate first index entry corresponding to a row
in which (attr = value); cost = depth of tree Clustered index - rows satisfying condition
packed in sequence in successive data pages; scan those pages; cost depends on number of qualifying rows
Unclustered index - index entries with pointers to rows satisfying condition packed in sequence in successive index pages; scan entries and sort pointers to identify table data pages with qualifying rows, each page (with at least one such row) fetched once
condition: (attr op value)
![Page 27: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/27.jpg)
27
Unclustered B+ Tree Index
B+ Tree
Index entriessatisfyingcondition
Data File
data page
![Page 28: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/28.jpg)
28
Computing Selection
Hash index on attr (for equality search only): Hash on value; cost 1.2 (to account for
possible overflow chain) to search the (unique) bucket containing all index entries or rows satisfying condition• Unclustered index - sort pointers in index
entries to identify data pages with qualifying rows, each page (containing at least one such row) fetched once
condition: (attr = value)
![Page 29: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/29.jpg)
29
Complex Selections
Conjunctions: a1 =x a2 <y a3=z (R) Use most selective access path Use multiple access paths
Disjunction: (a1 =x or a2 <y) and (a3=z) (R) DNS (disjunctive normal form) (a1 =x a3 =z) or (a2 < y a3=z) Use file scan if one disjunct requires file scan If better access paths exist, and combined
selectivity is better than file scan, use the better access paths, else use a file scan
![Page 30: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/30.jpg)
Two Approaches to General Selections
First approach: Find the most selective access path, retrieve tuples using it, and apply any remaining terms that don’t match the index: Most selective access path: An index or file scan that we
estimate will require the fewest page I/Os. Terms that match this index reduce the number of tuples
retrieved; other terms are used to discard some retrieved tuples, but do not affect number of tuples/pages fetched.
Consider day<8/9/94 AND pid=5 AND bid=3. A B+ tree index on day can be used; then, pid=5 and bid=3 must be checked for each retrieved tuple. Similarly, a hash index on <pid, bid> could be used; day<8/9/94 must then be checked.
![Page 31: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/31.jpg)
Intersection of Rids Second approach (if we have 2 or more
matching indexes that use pointers to data entries): Get sets of rids of data records using each matching
index. Then intersect these sets of rids Retrieve the records and apply any remaining terms. Consider day<8/9/94 AND pid=5 AND bid=3. If we
have a B+ tree index on day and an index on bid, we can retrieve rids of records satisfying day<8/9/94 using the first, rids of recs satisfying bid=3 using the second, intersect, retrieve records and check pid=5.
![Page 32: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/32.jpg)
Projection
The expensive part is removing duplicates. SQL systems don’t remove duplicates unless the keyword DISTINCT
is specified in a query. Sorting Approach: Sort on <bid, pid> and remove
duplicates. (Can optimize this by dropping unwanted information while sorting.)
Hashing Approach: Hash on <bid, pid> to create partitions. Load partitions into memory one at a time, build in-memory hash structure, and eliminate duplicates.
If there is an index with both B.pid and B.bid in the search key, scan the index!
SELECT DISTINCT B.bid, B.pidFROM Bids B
![Page 33: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/33.jpg)
33
Projection based on Sorting: Duplicate Elimination
Needed in computing projection and union
Sort-based projection: Sort rows of relation at cost of 2M Log B-1M Eliminate unwanted columns in first scan
(no cost): Log B-1M’ (M’ is projected file size) Eliminate duplicates on completion of last
merge step (no cost) Final Cost: M + 2M’ Log B-1M’
![Page 34: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/34.jpg)
34
Duplicate Elimination Hash-based projection
Phase 1: input rows, project, hash remaining columns using an B-1 output hash function, and create B-1 buckets on disk, cost = M (original file size) + M’ (projected file size)
Phase 2: sort each bucket to eliminate duplicates, cost (assuming a bucket fits in memory) = 2M’
Total cost = M+3M’
![Page 35: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/35.jpg)
Projection Based on Hashing Partitioning phase: Read R using one input buffer.
For each tuple, discard unwanted fields, apply hash function h1 to choose one of B-1 output buffers. Result is B-1 partitions (of tuples with no unwanted fields).
2 tuples from different partitions guaranteed to be distinct. Duplicate elimination phase: For each partition,
read it and build an in-memory hash table, using hash fn h2 (<> h1) on all fields, while discarding duplicates. If partition does not fit in memory, can apply hash-based
projection algorithm recursively to this partition. Cost: For partitioning, read R, write out each tuple,
but with fewer fields. This is read in next phase.
![Page 36: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be](https://reader036.vdocuments.us/reader036/viewer/2022062321/56649eb45503460f94bbcbfd/html5/thumbnails/36.jpg)
Discussion of Projection Sort-based approach is the standard; better
handling of skew and result is sorted. If an index on the relation contains all wanted
attributes in its search key, can do index-only scan. Apply projection techniques to data entries (much
smaller!)
If an ordered (i.e., tree) index contains all wanted attributes as prefix of search key, can do even better: Retrieve data entries in order (index-only scan), discard
unwanted fields, compare adjacent tuples to check for duplicates.