more on indexes

1

More on IndexesSecondary Indexes

B-Trees

Source: our textbook, slides by Hector Garcia-Molina

2

Secondary Indexes Sometimes we want multiple indexes on a

relation. Ex: search Candies(name,manf) both by name and

by manufacturer Typically the file would be sorted using the

key (ex: name) and the primary index would be on that field.

The secondary index is on any other attribute (ex: manf).

Secondary index also facilitates finding records, but cannot rely on them being sorted

3

Sparse Secondary Index? No! Since records are not sorted on

that key, cannot predict the location of a record from the location of any other record.

Thus secondary indexes are always dense.

4

Sequencefield

5030

7020

4080

10100

6090

• Sparse index302080

100

90...

does not make sense!

5

Design of Secondary Indexes

Always dense, usually with duplicates Consists of key-pointer pairs ("key"

means search key, not relation key) Entries in index file are sorted by key Therefore second-level index is sparse

6

Secondary indexesSequencefield

5030

7020

4080

10100

6090

10203040

506070...

105090...

sparsesecond-

leveldensefirst-level

7

Secondary Index and Duplicate Keys

Scheme in previous diagram wastes space in the present of duplicate keys

If a search key value appears n times in the data file, then there are n entries for it in the index.

8

Duplicate values & secondary indexes

1020

4020

4010

4010

4030

10101020203040404040...

one option...

Problem:excess overhead!

• disk space• search time

9

Buckets To avoid repeating values, use a level of

indirection Put buckets between the secondary

index file and the data file One entry in index for each search key

K; its pointer goes to a location in a "bucket file", called the bucket for K

Bucket holds pointers to all records with search key K

10

Duplicate values & secondary indexes

1020

4020

4010

4010

4030

102030405060...

buckets

11

Why “bucket” idea is useful

Indexes Recordsname: primary Emp

(name,dept,floor,...)dept: secondaryfloor: secondary

12

Query: SELECT name FROM Emp WHERE dept = 'Toy' AND floor = 2dept index Emp floor index

Toy 2

Intersect Toy dept bucket and floor 2

bucket to get set of matching Emp’sSaves disk I/O's

13

Summary of Indexes So Far

Advantages: simple index is sequential file, good for scans

Disadvantages either inserts are expensive or lose sequentiality (cf. next slide)

Instead use B-tree data structure to implement index

14

Example Index (sequential)

continuous

free space

102030

405060

708090

39313536

323834

33

overflow area(not sequential)

15

B-Trees Several related data structures Key features are:

automatically adjust number of levels of indexes as size of data file changes

storage on blocks is managed to keep every block between half full and full => no overflow blocks needed

We'll actually study B+ trees

16

B-Tree Structure an example of a balanced search tree:

every root-to-leaf path has same length each node (vertex) in the tree is a block,

which contains search keys and pointers parameter n, which is largest value so that

n+1 pointers and n keys fit in one block Ex: If block size is 4096 bytes, keys are 4 bytes,

and pointers are 8 bytes, then n = 340.

17

Constraints on B-Tree Nodes

Keys in leaf nodes are copies of keys from data file, in sorted order

Root contains between 2 and n+1 index node pointers

Each internal node contains between (n+1)/2 and n+1 index node pointers Each non-leaf node consists of

ptr1,key1,ptr2,key2,…,keym-1,ptrm

where ptri points to index node with keys between keyi-1 and keyi

18

Constraints (cont'd) Each leaf contains between

(n+1)/2 and n data record pointers, plus a "next leaf" pointer

Associated with each data record pointer is a key, and the pointer points to the data record with that key

19

In textbook’s notationn=3Leaf:

Non-leaf:

30 3530

30 35

30

20

Sample non-leaf

to keys to keys to keys to keys< 57 57 k<81 81k<95 95

57 81 95

21

Sample leaf node:From non-leaf node

to next leafin

sequence57 81 95

To re

cord

wi

th k

ey 5

7To

reco

rd

with

key

81

To re

cord

wi

th k

ey 8

5

22

Full nodemin. node

Non-leaf

Leaf

n=3

120

150

180

30

3 5 11 30 35

coun

ts e

ven

if nu

ll

23

Root

B-Tree Example n=3

100

120

150

180

30

3 5 11 30 35 100

101

110

120

130

150

156

179

180

200

24

Insert into B+tree(a) simple case

space available in leaf(b) leaf overflow(c) non-leaf overflow(d) new root

25

(a) Insert key = 32 n=33 5 11 30 31

30

100

32

26

(a) Insert key = 7 n=3

3 5 11 30 31

30

100

3 5 7

7

27

(c) Insert key = 160 n=3

100

120

150

180

150

156

179

180

200

160

180

160

179

28

(d) New root, insert 45 n=3

10 20 30

1 2 3 10 12 20 25 30 32 40 40 45

40

30new root

29

(a) Simple case - no example(b) Coalesce with neighbor (sibling)(c) Re-distribute keys(d) Cases (b) or (c) at non-leaf

Deletion from B-tree

30

(b) Coalesce with sibling Delete 50

10 40 100

10 20 30 40 50

n=4

40

31

(c) Redistribute keys Delete 50

10 40 100

10 20 30 35 40 50

n=4

35

35

32

40 4530 3725 2620 2210 141 3

10 20 30 40

(d) Non-leaf coalese Delete 37

n=4

40

30

25

25

new root

33

B-tree deletions in practice

– Often, coalescing is not implemented Too hard and not worth it!

34

Applications of B-Trees B-tree is used to implement indexes The data record pointers in the leaves

correspond to the data record pointers in sequential indexes

Some example uses: B-tree search key is primary key for data file,

leaf pointers form a dense index on the file B-tree search key is primary key for data file,

leaf pointers form a sparse index on the file B-tree search key is not primary key, leaf

pointers form a dense index on the file

35

B-Trees with Duplicate Keys

Change definition of B-tree: If key K appears in an internal node,

then K is the smallest "new" key in the subtree S rooted at the pointer that follows K in the node

"New" means K does not appear in the part of the B-tree to the left of S but it does appear in S

Allow null key in certain situations

36

Example B-Tree with Duplicates

17

-- 37 437

2 3 5 7 13 13 17 23 23 23 23 37 41 43 47

37

Lookup in B-Trees Assume no duplicate keys. Assume B-tree is a dense index. To find the record with key K, search starting

at the root and ending at a leaf: if current node is not a leaf and has keys K1,

K2, …, Kn, find the smallest key, Ki, in the sequence that is ≤ K.

follow the (i+1)-st pointer to a node at the next level and repeat

when a leaf node is reached, find the key with value K and follow the associated pointer to the data record

38

Range Queries with B-Trees

Range query: a query in which a range of values is sought. Examples: SELECT * FROM R WHERE R.k > 40; SELECT * FROM R WHERE R.k >= 10 AND R.k <=

25; To find all keys in the range [a,b]:

Do a lookup on a: leads to leaf where a could be Search the leaf for all keys ≥ a If we find a key > b, we are done Else follow next-leaf pointer and continue

searching in the next leaf Continue until finding a key > b or no more

leaves

39

Efficiency of B-Trees B-trees allow lookup, insertion and deletion of

records with very few disk I/Os Number of disk I/Os is number of levels in the

B-tree plus cost of any reorganization If n is at least 10, then splitting/merging blocks

will be rare and usually limited to the leaves For typical sizes of keys, pointers, blocks and

files, 3 levels suffice (see next slide) Also can keep root block of B-tree in memory

40

Size of B-Tree Assume

4096 bytes per block 4 bytes per key (e.g., integer) 8 bytes per pointer no header info in the block

Then n = 340 (can keep n keys and n+1 pointers in a block)

Assume on average a block has 255 pointers Count:

one node at level 1 (the root) 255 nodes at level 2 255*255 = 65,025 nodes at level 3 (leaves) each leaf has 255 pointers, so total number of records

is more than 16 million

more on indexes

Documents