chapter 11 indexing & hashing. 2 n sophisticated database access methods n basic concerns:...

33
Chapter 11 Indexing & Hashing

Upload: charles-lindsey

Post on 01-Jan-2016

233 views

Category:

Documents


4 download

TRANSCRIPT

Page 1: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

Chapter 11

Indexing & Hashing

Page 2: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

2

Indexing & Hashing Sophisticated database access methods

Basic concerns: access/insertion/deletion time, space overhead

Indexing

An index is specified on one (or more) field(s), called search key field, of the record, which is not necessarily unique

Different index structures associated with different search keys

Allows fast random access to records

Index record (forms an access path to the data record), is of the form < field value, pointer to record >

Page 3: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

3

Indexing Dense Index: For every unique search-key value, there is

an index record

Sparse Index: Index records are created for some search-key values

Sparse index is slower, but requires less space & overhead

Primary Index:

Defined on an ordered data file, ordered on a search key field & is usually the primary key.

A sequentially ordered file with a primary index is called index-sequential file

A binary search on the index yields a pointer to the record

Index value is the search-key value of the first data record in the block

Page 4: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

4

Figure. Dense index

Figure. Sparse index

Page 5: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

5

Figure. Primary index on the ordering key field of a file

Page 6: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

6

Primary Index Index Deletion

(Dense Index) delete the search-key value (of the deleted record) from the index file if the deleted record is the last record with the search-key value

(Sparse Index) if the deleted record is the last record with the search-key value & the search-key value v (of the deleted record) exists in the index file F, replace v by w, where w is the next search-key value in order; if v and w are in F, simply delete v

Index Insertion

(Dense Index) if the search-key value v (of the new record) does not exist in the index file, insert v

(Sparse Index) if a new data block B is created, then the 1st search-key value of B is inserted into the index file

Page 7: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

7

Multi-Level Indices

First-level index: the original index file

Second-level index: primary index to the original index file

(Rare) Third-level index: top level index (fit in one disk block)

Form a search tree, such as B-tree or B+-tree structures

Insertion/deletion of new indexes are not trivial in indexed files

Page 8: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

8

Figure. A two-level primary index

Page 9: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

9

Secondary Indices Defined on an unordered data file, i.e., not by the indexed field

order (can be defined on a candidate key/non-key field)

Each pointer often points to a bucket which consists of pointers to records with the same search-key value. The bucket

structure can be eliminated if the index is dense, and the search-key values form a primary key, i.e., unique

Advantages:

i. Improve the performance of queries that use candidate keys

ii. Eliminate extra pointers within the records

iii. Eliminate the need for scanning records sequentially

Disadvantages: overhead/modification

Types of Secondary Indices:

Dense: pointers in a bucket point to records w/ same search-key values

Sparse: a pointer in a bucket points to records w/ search-key values in the appropriate range

Page 10: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

10

Figure. A secondary index on a key field of a file.

Page 11: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

11

Figure. A secondary index on a non-key field implemented using a level of indirection

Page 12: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

12

B+-Tree (Multi-level) Indices Frequently used index structure in DB

Allow efficient insertion/deletion of new/existing search-key values

A balanced tree structure: all leaf nodes are at the same level (which may form a dense index)

Each node, corresponding to a disk block, has the format:

P1 K1 P2 … Pn-1 Kn-1 Pn

where Pi, 1 i n, is a pointer

Ki, 1 i n-1, is a search-key value &

Ki < Kj, i < j, i.e., search-key values are in order

P1 K1 … Ki-1 Pi Ki … Kn-1 Pn

In each leaf node, Pi points to either (i) a data record with search-key value Ki or (ii) a bucket of pointers, each points to a data record with search-key value Ki

X X XX < K1

Ki-1 X < Ki Kn-1 X

Page 13: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

13

B+-Tree (Multi-level) Indices Each leaf node is kept between half full & completely full, i.e.,

((n-1)/2, n-1) search-key values

Non-leaf nodes form a sparse index

Each non-leaf node (except the root) must have (n/2, n) pointers

No. of Block accesses required for searching a search-key value @leaf-node level is log n/2(K)

where K = no. of unique search-key values & n = no. of indices/node

Insertion into a full node causes a split into two nodes which may propagate to higher tree levels

Note: if there are n search-key values to be split, put the first ((n-1)/2 in the existing node & the remaining in a new node

A less than half full node caused by a deletion must be merged with neighboring nodes

Page 14: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

14

B+-Tree Algorithms Algorithm 1. Searching for a record with search-key value K, using a B+-Tree .

Begin

n block containing root node of B+-Tree ;

read block n;

while (n is not a leaf node of the B+-Tree) do

begin

q number of tree pointers in node n;

if K < n.K1 /* n.Ki refers to the ith search-key value in node n */

then n n.P1 /* n.Pi refers to the ith pointer in node n */

else if K n.Kq-1

then n n.Pq

else begin

search node n for an entry i such that n.Ki-1 K < n.Ki;

n n.Pi ;

end; /*ELSE*/

read block n;

end; /*WHILE*/

search block n for entry Ki with K = Ki; /*search leaf node*/

if found, then read data file block with address Pi and retrieve record else record with search-key value K is not in the data file;

end. /*Algorithm 1*/

Page 15: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

15

B+-Tree Algorithms Algorithm 2. Inserting a record with search-key value K in a B+-Tree of order p.

/* A B+-Tree of order p contains at most p-1 values an p pointers*/

Begin n block containing root node of B+-Tree ;

read block n;

set stack S to empty;

while (n is not a leaf node of the B+-Tree ) do

begin

push address of n on stack S; /* S holds parent nodes that are needed in case of split */

q number of tree pointers in node n;

if K < n.K1 /* n.Ki refers to the ith search-key value in node n */

then n n.P1 /* n.Pi refers to the ith pointer in node n */

else if K n.Kq-1

then n n.Pq

else begin

search node n for an entry i such that n.Ki-1 K < n.Ki;

n n.Pi ;

end; /* ELSE */

read block n;

end; /* WHILE */

search block n for entry Ki with K = Ki; /* search leaf node */

Page 16: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

16

Algorithm 2 Continueif found

then return /*record already in index file - no insertion is needed */

else

begin /* insert entry in B+-Tree to point to record */

create entry (P, K), where P points to file block containing new record;

if leaf node n is not full

then insert entry (P, K) in correct position in leaf node n

else

begin /* leaf node n is full – split */

copy n to temp; /* temp is an oversize leaf node to hold extra entry */

insert entry (P, K) in temp in correct position; /* temp now holds p+1 entries of the form (Pi, Ki) */

new a new empty leaf node for the tree;* j p/2

n first j entries in temp (up to entry (Pj, Kj));

n.Pnext new; /* Pnext points to the next leaf node*/

new remaining entries in temp;

* K Kj+1;

/* Now we must move (K, new) and insert in parent internal node. However, if parent is full, split may propagate */

finished false;

Page 17: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

17

Algorithm 2 continueRepeat

if stack S is empty, then /*no parent node*/

begin /* new root node is created for the B+-Tree */

root a new empty internal node for the tree;

* root <n, K, new>; /* set P1 to n & P2 to new */

finished true;

end

else

begin

n pop stack S;

if internal node n is not full, then

begin /* parent node not full - no split */

insert (K, new) in correct position in internal node n;

finished true

end

else

Page 18: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

18

Algorithm 2 continuebegin /* internal node n is full with p tree pointers – split */ copy n to temp; /* temp is an oversize internal node */

insert (K, new) in temp in correct position; /* temp has p+1 tree pointers */

new a new empty internal node for the tree;

* j ((p + 1)/2 n entries up to tree pointer Pj in temp;

/* n contains <P1, K1, P2, K2, .., Pj-1, Kj-1, Pj> */

new entries from tree pointer Pj+1 in temp;

/*new contains < Pj+1 , Kj+1, .., Kp-1, Pp, Kp, Pp+1 >*/

* K Kj;

/* now we must move (K, new) and insert in parent internal node */

end

end

until finished

end; /* ELSE */

end; /* ELSE */

end. /* Algorithm 2 */

Page 19: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

19

Hashing Uses dense index

Avoids accessing an index structure to locate data

Allocate search-key values to different buckets

(Static Hash Function) given a search-key value v, a hash function h computes (assigns) the address of the desired bucket (which contains a pointer to the record) for v

h: K B

where K: set of search-key values

B: set of (fixed) bucket addresses

The hash function maps a search-key value to a bucket b and perform a (linear) search of every record in b

An ideal hash function Uniform distribution of search-key values, i.e., same no. of

search-key values in each bucket Random distribution of search-key values, i.e., each search-key

value has the same possibility

Page 20: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

20

Dynamic (Extendable) Hash Function (EHF) Resolves the problems of static hashing

Allowing hash function to be modified dynamically, accommodating changes in DB size (no reserved buckets for future

growth)

Minimizing space overhead, i.e., bucket address table (b-a-t) is small

Allows buckets to be split or combined to maintain space efficiency

Buckets are created on demand, as records are inserted.

Result: low performance overhead (reorganization requires one bucket at a time)

Page 21: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

21

Dynamic (Extendable) Hash Function (EHF)

EHF uses i bits, which grows and shrinks with DB size, as an offset into b-a-t

i bits (which changes as file grows) of h(K) are required to determine the correct bucket for K

All entries of the i-bit b-a-t pointed to the same bucket j have a common hash prefix (chp) and bucket j is associated with an integer ij to denote the length

of the chp

No. of entries of b-a-t that point to bucket j = 2(i - ij)

Page 22: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

22Figure. General extendable hash structure

= 2

= 1

= 2

= 2

Page 23: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

23

Dynamic (Extendable) Hash Function (EHF)

Lookup K, a search-key value: locate the bucket pointed to by the b-a-t entry which is determined by the

first i high-order bits of h(K)

Insert a record r with search-key value K

1. Lookup K and locate bucket j

2. If j is not full, insert the info of K in j and r in the file

3. If j is full, create a new bucket z. There are two cases to be considered:

Page 24: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

24

Dynamic (Extendable) Hash Function (EHF)

(a) Case i = ij (only one entry in b-a-t points to j):

1.Increase i by 1, i.e., doubling the size of b-a-t. Each entry is replaced by 2 entries which contain the same pointer as the original entry

2.(For the b-a-t entry that causes the split) Set the 2nd entry created from the entry for j to point to z

3.Set ij = i(new) and iz = i(new)

4.Rehash the records in j based on (new) i and redistribute records in j and r

5.Re-attempt to insert r and repeat the whole process if r and all records in j have the same hash prefix

Page 25: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

25

Figure. Sample deposit file

Figure. Hash function for branch-name

Figure. Initial extendable hash structure (Each bucket can hold up to 2 records)

Figure. Hash structure after 3 insertions (Downtown, Round Hill, Perryridge)

DowntownRound Hill

Page 26: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

26

Figure. Sample deposit file Figure. Hash function for branch-name

Figure. Hash structure after four insertions

*

*

*

Page 27: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

27

Hashing Insert a record r with search-key value K

(b) Case i > ij (> 1 entry in b-a-t points to j):

1. iz = ij + 1 and ij = ij + 1

2. Adjust entries in b-a-t that point to j: set the first half of entries point to j and the remaining ones to z

3. Rehash and allocate records in j

4. Reattempt to insert r and repeat the whole process (of insertion) if r and all records in j have the same hash prefix

Delete a record r with search-key value K:1. Lookup K and locate bucket j

2. Remove K from j and r from the file. Remove j if j becomes empty

3. Adjust b-a-t if necessary

Disadvantages Lookup involves additional level of indirection (must access b-a-t) Additional complexity in implementation

Page 28: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

28

Figure. Sample deposit file Figure. Hash function for branch-name

Figure. Extendable hash structure for the deposit file

Page 29: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

29

Figure. Sample account file

Figure. Hash function for branch-name

Figure. Initial extendable hash structure.

Page 30: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

30

Figure 11.28 Hash structure after four insertions

Page 31: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

31Figure 11.29 Hash structure after seven insertions

Redwood A-222 700 0011

Round Hill A-305 350 1101

Figure 11.29 Hash structure after nine insertions

Page 32: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

32

Figure 11.30 Extendable hash structure for the account file

Page 33: Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing

33

Indexing & Hashing Expected types of queries is critical to the choice

between indexing and hashing

Comparison

For query with an equality comparison of an attribute, hashing is preferable

For query with a range of values specified, indexing is preferable

Most DB systems use indexing - difficult to find a good hash function that preserves order to support

range queries