bitmap indexing. bitmap index ubitmap index: specialized index that takes advantage wread-mostly...

Bitmap Indexing

Bitmap Index Bitmap index: specialized index that takes

advantage Read-mostly data: data produced from scientific

experiments can be appended in large groups Fast operations

“Predicate queries” can be performed with bitwise logical operations• Predicate ops: =, <, >, <=, >=, range,• Logical ops: AND, OR, XOR, NOT

They are well supported by hardware Easy to compress, potentially small index size Each individual bitmap is small and frequently

used ones can be cached in memory

Bitmap Index

Can be useful for stable columns with few values

Bitmap: String of bits: 0 (no match) or 1 (match) One bit for each row

Bitmap index record Column value Bitmap DBMS converts bit position into row identifier.

Bitmap Index Example

RowId FacSSN … FacRank 1 098-55-1234 Asst 2 123-45-6789 Asst 3 456-89-1243 Assc 4 111-09-0245 Prof 5 931-99-2034 Asst 6 998-00-1245 Prof 7 287-44-3341 Assc 8 230-21-9432 Asst 9 321-44-5588 Prof 10 443-22-3356 Assc 11 559-87-3211 Prof 12 220-44-5688 Asst

FacRank Bitmap Asst 110010010001 Assc 001000100100 Prof 000101001010

Faculty Table

Bitmap Index on FacRank

Compressing Bitmaps:Run Length Encoding

Bit vector for Assc: 0010000000010000100

Runs: 2, 8, 4 Can we do: 10 1000 100? Is this

unambiguous? Fixed bits per run (max=4)

0010 1000 0100 Variable bits per run

1010 11101000 110100 11101000 is broken into: 1110 (#bits),

1000 (value)• i.e., 4 bits are required, value is 8

Operation-efficient Compression Methods

Uncompressed:0000000000001111000000000 ......0000001000000001111111100000000 .... 0000100

Compressed:12, 0, 0, 0, 0, 1000, 8, 0, 0, 0, 0, 0, 0, 0, 0, 1000

Store very short sequences as-is

AND/OR/COUNT operations:Can uncompress on the fly

Based on variations of Run Length Compression

Indexing High-Dimensional Data

Typically, high-dimensional datasets are collections of points, not regions. E.g., Feature vectors in multimedia applications. Very sparse

Nearest neighbor queries are common. R-tree becomes worse than sequential scan for most

datasets with more than a dozen dimensions.

As dimensionality increases contrast (ratio of distances between nearest and farthest points) usually decreases; “nearest neighbor” is not meaningful. In any given data set, advisable to empirically test contrast.

External Sorting

Why Sort? A classic problem in computer science! Data requested in sorted order

e.g., find students in increasing gpa order Sorting is first step in bulk loading B+ tree

index. Sorting useful for eliminating duplicate

copies in a collection of records (Why?) Sort-merge join algorithm involves sorting. Problem: sort 1Gb of data with 1Mb of

RAM. why not virtual memory?

2-Way Sort: Requires 3 Buffers

Pass 1: Read a page, sort it, write it. only one buffer page is used

Pass 2, 3, …, etc.: three buffer pages used.

Main memory buffers

INPUT 1

INPUT 2

OUTPUT

DiskDisk

Two-Way External Merge Sort

Each pass we read + write each page in file.

N pages in the file => the number of passes

So total cost is:

Idea: Divide and

conquer: sort subfiles and merge

log2 1N

2 12N Nlog

Input file

1-page runs

2-page runs

4-page runs

8-page runs

PASS 0

PASS 1

PASS 2

PASS 3

9

3,4 6,2 9,4 8,7 5,6 3,1 2

3,4 5,62,6 4,9 7,8 1,3 2

2,34,6

4,7

8,91,35,6 2

2,3

4,46,7

8,9

1,23,56

1,22,3

3,4

4,56,6

7,8

General External Merge Sort

To sort a file with N pages using B buffer pages: Pass 0: use B buffer pages. Produce

sorted runs of B pages each. Pass 2, …, etc.: merge B-1 runs.

N B/

B Main memory buffers

INPUT 1

INPUT B-1

OUTPUT

DiskDisk

INPUT 2

. . . . . .

. . .

More than 3 buffer pages. How can we utilize them?

Cost of External Merge Sort

Number of passes: Cost = 2N * (# of passes) E.g., with 5 buffer pages, to sort 108 page

file: Pass 0: = 22 sorted runs of 5 pages

each (last run is only 3 pages) Pass 1: = 6 sorted runs of 20 pages

each (last run is only 8 pages) Pass 2: 2 sorted runs, 80 pages and 28 pages Pass 3: Sorted file of 108 pages

1 1 log /B N B

108 5/

22 4/

Number of Passes of External Sort

N B=3 B=5 B=9 B=17 B=129 B=257100 7 4 3 2 1 11,000 10 5 4 3 2 210,000 13 7 5 4 2 2100,000 17 9 6 5 3 31,000,000 20 10 7 5 3 310,000,000 23 12 8 6 4 3100,000,000 26 14 9 7 4 41,000,000,000 30 15 10 8 5 4

Using B+ Trees for Sorting Scenario: Table to be sorted has B+ tree

index on sorting column(s). Idea: Can retrieve records in order by

traversing leaf pages. Is this a good idea? Cases to consider:

B+ tree is clustered Good idea! B+ tree is not clustered Could be a very

bad idea!

Clustered B+ Tree Used for Sorting

Cost: root to the left-most leaf, then retrieve all leaf pages (data pages)

If leaves contain pointers? Additional cost of retrieving data records: each page fetched just once.

Always better than external sorting!

(Directs search)

Data Records

Index

Data Entries("Sequence set")

Unclustered B+ Tree Used for Sorting

Each data entry contains rid of a data record. In general, one I/O per data record!

(Directs search)

Data Records

Index

Data Entries("Sequence set")

Query Processing

Relational Operations We will consider how to implement:

Selection ( ) Selects a subset of rows from relation. Projection ( ) Deletes unwanted columns from

relation. Join ( ) Allows us to combine two relations. Set-difference ( ) Tuples in reln. 1, but not in reln. 2. Union ( ) Tuples in reln. 1 and in reln. 2. Aggregation (SUM, MIN, etc.) and GROUP BY

Since each op returns a relation, ops can be composed! After we cover the operations, we will discuss how to optimize queries formed by composing them.

Schema for Examples

Bids: Each tuple is 40 bytes long, 100 tuples per

page, 1000 pages. Buyers:

Each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Buyers(id: integer, name: string, rating: integer, age: real)Bids (bid: integer, pid: integer, day: dates, product: string)

21

Access Path Algorithm + data structure used to

locate rows satisfying some condition File scan: can be used for any condition Hash: equality search; all search key

attributes of hash index are specified in condition

B+ tree: equality or range search; a prefix of the search key attributes are specified in condition

Binary search: Relation sorted on a sequence of attributes and some prefix of sequence is specified in condition

Access Paths A tree index matches (a conjunction of)

terms that involve only attributes in a prefix of the search key. E.g., Tree index on <a, b, c> matches the

selection a=5 AND b=3, and a=5 AND b>6, but not b=3.

A hash index matches (a conjunction of) terms that has a term attribute = value for every attribute in the search key of the index. E.g., Hash index on <a, b, c> matches a=5 AND

b=3 AND c=5; but it does not match b=3, or a=5 AND b=3, or a>5 AND b=3 AND c=5.

23

Access Paths Supported by B+

tree Example: Given a B+ tree whose search key

is the sequence of attributes a2, a1, a3, a4 Access path for search a1>5 a2=3.0 a3=‘x’ (R):

find first entry having a2=3.0 a1>5 a3=‘x’ and scan leaves from there until entry having a2>3.0 . Select satisfying entries

Access path for search a2=3.0 a3 >‘x’ (R): locate first entry having a2=3.0 and scan leaves until entry having a2>3.0 . Select satisfying entries

No access path for search a1>5 a3 =‘x’ (R)

24

Choosing an Access Path Selectivity of an access path refers to its

cost Higher selectivity means lower cost (#pages)

If several access paths cover a query, DBMS should choose the one with greatest selectivity

Size of domain of attribute is a measure of the selectivity of domain

Example: CrsCode=‘CS305’ Grade=‘B’ - a B+ tree with search key CrsCode is more selective than a B+ tree with search key Grade

25

Computing Selection

No index on attr: If rows unsorted, cost = M

• Scan all data pages to find rows satisfying the condition

If rows sorted on attr, cost = log2 M + (cost of scan)• Use binary search to locate first data page containing row

in which (attr = value)• Scan further to get all rows satisfying (attr op value)

condition: (attr op value)

26

Computing Selection

B+ tree index on attr (for equality or range search): Locate first index entry corresponding to a row

in which (attr = value); cost = depth of tree Clustered index - rows satisfying condition

packed in sequence in successive data pages; scan those pages; cost depends on number of qualifying rows

Unclustered index - index entries with pointers to rows satisfying condition packed in sequence in successive index pages; scan entries and sort pointers to identify table data pages with qualifying rows, each page (with at least one such row) fetched once

condition: (attr op value)

27

Unclustered B+ Tree Index

B+ Tree

Index entriessatisfyingcondition

Data File

data page

28

Computing Selection

Hash index on attr (for equality search only): Hash on value; cost 1.2 (to account for

possible overflow chain) to search the (unique) bucket containing all index entries or rows satisfying condition• Unclustered index - sort pointers in index

entries to identify data pages with qualifying rows, each page (containing at least one such row) fetched once

condition: (attr = value)

29

Complex Selections

Conjunctions: a1 =x a2 <y a3=z (R) Use most selective access path Use multiple access paths

Disjunction: (a1 =x or a2 <y) and (a3=z) (R) DNS (disjunctive normal form) (a1 =x a3 =z) or (a2 < y a3=z) Use file scan if one disjunct requires file scan If better access paths exist, and combined

selectivity is better than file scan, use the better access paths, else use a file scan

Two Approaches to General Selections

First approach: Find the most selective access path, retrieve tuples using it, and apply any remaining terms that don’t match the index: Most selective access path: An index or file scan that we

estimate will require the fewest page I/Os. Terms that match this index reduce the number of tuples

retrieved; other terms are used to discard some retrieved tuples, but do not affect number of tuples/pages fetched.

Consider day<8/9/94 AND pid=5 AND bid=3. A B+ tree index on day can be used; then, pid=5 and bid=3 must be checked for each retrieved tuple. Similarly, a hash index on <pid, bid> could be used; day<8/9/94 must then be checked.

Intersection of Rids Second approach (if we have 2 or more

matching indexes that use pointers to data entries): Get sets of rids of data records using each matching

index. Then intersect these sets of rids Retrieve the records and apply any remaining terms. Consider day<8/9/94 AND pid=5 AND bid=3. If we

have a B+ tree index on day and an index on bid, we can retrieve rids of records satisfying day<8/9/94 using the first, rids of recs satisfying bid=3 using the second, intersect, retrieve records and check pid=5.

Projection

The expensive part is removing duplicates. SQL systems don’t remove duplicates unless the keyword DISTINCT

is specified in a query. Sorting Approach: Sort on <bid, pid> and remove

duplicates. (Can optimize this by dropping unwanted information while sorting.)

Hashing Approach: Hash on <bid, pid> to create partitions. Load partitions into memory one at a time, build in-memory hash structure, and eliminate duplicates.

If there is an index with both B.pid and B.bid in the search key, scan the index!

SELECT DISTINCT B.bid, B.pidFROM Bids B

33

Projection based on Sorting: Duplicate Elimination

Needed in computing projection and union

Sort-based projection: Sort rows of relation at cost of 2M Log B-1M Eliminate unwanted columns in first scan

(no cost): Log B-1M’ (M’ is projected file size) Eliminate duplicates on completion of last

merge step (no cost) Final Cost: M + 2M’ Log B-1M’

34

Duplicate Elimination Hash-based projection

Phase 1: input rows, project, hash remaining columns using an B-1 output hash function, and create B-1 buckets on disk, cost = M (original file size) + M’ (projected file size)

Phase 2: sort each bucket to eliminate duplicates, cost (assuming a bucket fits in memory) = 2M’

Total cost = M+3M’

Projection Based on Hashing Partitioning phase: Read R using one input buffer.

For each tuple, discard unwanted fields, apply hash function h1 to choose one of B-1 output buffers. Result is B-1 partitions (of tuples with no unwanted fields).

2 tuples from different partitions guaranteed to be distinct. Duplicate elimination phase: For each partition,

read it and build an in-memory hash table, using hash fn h2 (<> h1) on all fields, while discarding duplicates. If partition does not fit in memory, can apply hash-based

projection algorithm recursively to this partition. Cost: For partitioning, read R, write out each tuple,

but with fewer fields. This is read in next phase.

Discussion of Projection Sort-based approach is the standard; better

handling of skew and result is sorted. If an index on the relation contains all wanted

attributes in its search key, can do index-only scan. Apply projection techniques to data entries (much

smaller!)

If an ordered (i.e., tree) index contains all wanted attributes as prefix of search key, can do even better: Retrieve data entries in order (index-only scan), discard

unwanted fields, compare adjacent tuples to check for duplicates.

bitmap indexing. bitmap index ubitmap index: specialized index that takes advantage wread-mostly...

Documents