Advanced Data Structures
NTUA Spring 2007
B+-trees and External Memory Hashing
Model of Computation
Data stored on disk(s). Minimum transfer unit: a page (or block) = b bytes or B records.
N records -> N/B = n pages
I/O complexity: measured in number of pages.
[Figure: memory hierarchy: CPU, Memory, Disk.]
I/O complexity
An ideal index has space O(N/B), update overhead O(1) or O(log_B(N/B)), and search complexity O(a/B) or O(log_B(N/B) + a/B),
where a is the number of records in the answer.
But sometimes CPU performance is also important: minimize cache misses -> don't waste CPU cycles.
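To make these bounds concrete, a quick back-of-the-envelope calculation in Python (the numbers N = 10^9 and B = 100 are illustrative, not from the slides):

import math

# Illustrative numbers: one billion records, 100 records per page.
N = 1_000_000_000          # records
B = 100                    # records per page (block)
n = math.ceil(N / B)       # number of pages on disk

# A B+-tree with fanout ~B reaches any leaf in about log_B(n) page reads.
height = math.ceil(math.log(n, B))

print(f"pages n = {n:,}")          # 10,000,000 pages
print(f"tree height ~ {height}")   # about 4 page reads per lookup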
B+-tree (http://en.wikipedia.org/wiki/B-tree)
Records must be ordered over an attribute: SSN, Name, etc.
Queries: exact match and range queries over the indexed attribute: "find the name of the student with ID=087-34-7892" or "find all students with gpa between 3.00 and 3.5".
B+-tree: properties
Insert/delete at log_F(N/B) cost; keep tree height-balanced (F = fanout).
Minimum 50% occupancy (except for root). Each node contains d <= m <= 2d entries.
Two types of nodes: index (non-leaf) nodes and (leaf) data nodes; each node is stored in 1 page (disk-based method).
[BM72] Rudolf Bayer and McCreight, E. M. Organization and Maintenance of Large Ordered Indexes. Acta Informatica 1, 173-189, 1972
Example
[Figure: an example B+-tree. The root holds keys 100, 120, 150, 180; an internal node holds key 30; the leaf pages hold (3 5 11), (30 35), (100 101 110), (120 130), (150 156 179), (180 200). A detail of an index node with keys 57, 81, 95 shows its pointers leading to the subtrees with keys < 57, 57 <= k < 81, 81 <= k < 95, and >= 95. A detail of a data (leaf) node with keys 57, 81, 95 shows each entry pointing to the record with that key, plus a sequence pointer from the leaf to the next leaf in key order.]
B+-tree rules (tree of order n)
(1) All leaves at the same lowest level (balanced tree)
(2) Pointers in leaves point to records, except for the "sequence pointer" to the next leaf
(3) Number of pointers/keys for a B+-tree:

                        Max ptrs | Max keys | Min ptrs    | Min keys
  Non-leaf (non-root)   n        | n-1      | ⌈n/2⌉       | ⌈n/2⌉ - 1
  Leaf (non-root)       n        | n-1      | ⌈(n-1)/2⌉   | ⌈(n-1)/2⌉
  Root                  n        | n-1      | 2           | 1
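The table can be captured in a few lines of Python (a sketch; the function name ptr_bounds is mine, and the exact rounding convention for the minima varies between textbooks):

from math import ceil

def ptr_bounds(order_n: int, node_type: str):
    """Return (max_ptrs, max_keys, min_ptrs, min_keys) for a B+-tree of order n,
    following the table above."""
    if node_type == "root":
        return order_n, order_n - 1, 2, 1
    if node_type == "leaf":
        m = ceil((order_n - 1) / 2)
        return order_n, order_n - 1, m, m
    if node_type == "non-leaf":
        return order_n, order_n - 1, ceil(order_n / 2), ceil(order_n / 2) - 1
    raise ValueError(node_type)

print(ptr_bounds(4, "non-leaf"))   # (4, 3, 2, 1)
print(ptr_bounds(4, "leaf"))       # (4, 3, 2, 2)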
Insert into B+-tree
(a) simple case: space available in leaf
(b) leaf overflow
(c) non-leaf overflow
(d) new root
[Figure (a): insert key = 32, n=4. The leaf (30 31) has room, so 32 is simply added: (30 31 32).]
[Figure (b): insert key = 7, n=4. The leaf (3 5 11) overflows and splits into (3 5) and (7 11); the key 7 is copied up into the parent.]
[Figure (c): insert key = 160, n=4. The leaf (150 156 179) overflows and splits into (150 156) and (160 179); the key 160 moves up into the parent index node, which also overflows and splits, pushing a key up one more level.]
[Figure (d): insert key = 45, n=4. The leaf (30 32 40) overflows and splits into (30 32) and (40 45); key 40 moves up and overflows the index node (10 20 30), which splits, and key 30 is pushed up to become a new root; the tree grows one level taller.]
Insertion
Find correct leaf L. Put data entry onto L.
If L has enough space, done!
Else, must split L (into L and a new node L2): redistribute entries evenly, copy up the middle key, and insert an index entry pointing to L2 into the parent of L.
This can happen recursively. To split an index node, redistribute entries evenly, but push up the middle key. (Contrast with leaf splits.)
Splits "grow" the tree; a root split increases the height.
Tree growth: gets wider, or one level taller at the top.
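A minimal sketch of the leaf-split step in Python (leaves are plain sorted lists, the parent update is left to the caller, and the names CAPACITY and insert_into_leaf are mine, not from any particular implementation):

import bisect

CAPACITY = 4  # max entries per leaf page (illustrative)

def insert_into_leaf(leaf, key):
    """Insert key into a sorted leaf; if it overflows, split it evenly and
    return (new_right_leaf, key_to_copy_up), else (None, None)."""
    bisect.insort(leaf, key)
    if len(leaf) <= CAPACITY:
        return None, None            # simple case: space available
    mid = len(leaf) // 2
    right = leaf[mid:]               # new leaf L2
    del leaf[mid:]                   # old leaf keeps the left half
    return right, right[0]           # copy up the smallest key of L2

leaf = [3, 5, 11, 30]
l2, up = insert_into_leaf(leaf, 7)
print(leaf, l2, up)                  # [3, 5] [7, 11, 30] 7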
Deletion from B+-tree
(a) simple case (no example)
(b) coalesce with neighbor (sibling)
(c) re-distribute keys
(d) cases (b) or (c) at non-leaf
[Figure (a): simple case, delete 30, n=5. The leaf (10 20 30) simply becomes (10 20); no minimum-occupancy violation.]
[Figure (b): coalesce with sibling, delete 50, n=5. The leaf (40 50) becomes (40), below the minimum, and is coalesced with its sibling (10 20 30); the separating key 40 is removed from the parent.]
[Figure (c): redistribute keys, delete 50, n=5. The leaf (40 50) becomes (40), below the minimum, so it borrows key 35 from its sibling (10 20 30 35); the leaves become (10 20 30) and (35 40), and the separating key in the parent changes from 40 to 35.]
[Figure (d): non-leaf coalesce, delete 37, n=5. Leaves (1 3), (10 14), (20 22), (25 26), (30 37), (40 45) sit under index nodes (10 20) and (30 40), with 25 in the root. Deleting 37 leaves the leaf (30) under-occupied, so it is coalesced with (40 45) and key 40 is removed from its parent; that index node is now under-occupied too, so the two index nodes are coalesced, key 25 is pulled down from the old root, and the combined node becomes the new root.]
Deletion
Start at root, find leaf L where the entry belongs. Remove the entry.
If L is at least half-full, done! If L has only d-1 entries:
try to re-distribute, borrowing from a sibling (an adjacent node with the same parent as L);
if re-distribution fails, merge L and the sibling.
If a merge occurred, must delete the entry (pointing to L or the sibling) from the parent of L.
Merges could propagate to the root, decreasing the height.
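A minimal sketch of the re-distribute-or-merge decision for a leaf, in Python (the leaf is a plain sorted list, D is the minimum occupancy, the parent-key update is left to the caller, and the names are mine):

D = 2  # minimum number of entries per leaf page (illustrative)

def fix_underflow(left_sibling, leaf):
    """`leaf` has fewer than D entries after a deletion.
    Borrow the largest key of the left sibling if it can spare one,
    otherwise merge the two leaves. The caller must then update
    (or delete) the separating key in the parent node."""
    if len(left_sibling) > D:
        leaf.insert(0, left_sibling.pop())   # re-distribute one key
        return "redistributed"
    left_sibling.extend(leaf)                # coalesce into one leaf
    leaf.clear()
    return "merged"

sib, leaf = [10, 20, 30, 35], [40]           # case (c): 50 was just deleted
print(fix_underflow(sib, leaf), sib, leaf)   # redistributed [10, 20, 30] [35, 40]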
Complexity
Optimal method for 1-d range queries:
Tree height: log_d(N/d)
Space: O(N/d)
Updates: O(log_d(N/d))
Query: O(log_d(N/d) + a/d)
where d = B/2
Example
[Figure: the same example B+-tree as before (root keys 100, 120, 150, 180; leaves (3 5 11), (30 35), (100 101 110), (120 130), (150 156 179), (180 200)), used to illustrate the range query Range[32, 160].]
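A sketch of how such a range query uses the leaf sequence pointers, in Python (leaves are modelled as (keys, next_leaf) pairs; the descent to the starting leaf is not shown, and the function name is mine):

def range_query(first_leaf, low, high):
    """Answer Range[low, high]: start at the first leaf that can contain `low`,
    then follow the leaf sequence pointers, stopping at the first key above `high`."""
    leaf, out = first_leaf, []
    while leaf is not None:
        keys, nxt = leaf
        for k in keys:
            if k > high:
                return out
            if k >= low:
                out.append(k)
        leaf = nxt
    return out

# Leaves of the example tree, chained via their sequence pointers:
l6 = ([180, 200], None)
l5 = ([150, 156, 179], l6)
l4 = ([120, 130], l5)
l3 = ([100, 101, 110], l4)
l2 = ([30, 35], l3)
print(range_query(l2, 32, 160))   # [35, 100, 101, 110, 120, 130, 150, 156]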
Other issues
Internal node architecture [Lomet01]: reduce the overhead of tree traversal.
Prefix compression: in index nodes store only the prefix that differentiates consecutive sub-trees; the fanout is increased (a small sketch follows).
Cache-sensitive B+-tree: place keys in a way that reduces cache faults during the binary search within each node; eliminate pointers so that a cache line contains more keys for comparison.
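To illustrate the prefix-compression idea on string keys, a small Python sketch (an illustration of the principle only, not Lomet's actual page layout; the function name is mine):

def shortest_separator(left_max: str, right_min: str) -> str:
    """Shortest string s with left_max < s <= right_min: the shortest prefix of
    right_min that is still greater than left_max. Storing s instead of the
    full key in the index node increases the fanout."""
    for i in range(1, len(right_min) + 1):
        prefix = right_min[:i]
        if prefix > left_max:
            return prefix
    return right_min

# Largest key in the left subtree vs. smallest key in the right subtree:
print(shortest_separator("Johansson", "Miller"))   # 'M' is enough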
References
[BM72] Rudolf Bayer, Edward M. McCreight: Organization and Maintenance of Large Ordered Indexes. Acta Informatica 1: 173-189 (1972)
[L01] David B. Lomet: The Evolution of Effective B-tree: Page Organization and Techniques: A Personal Account. SIGMOD Record 30(3): 64-69 (2001). http://www.acm.org/sigmod/record/issues/0109/a1-lomet.pdf
[B-Y95] Ricardo A. Baeza-Yates: Fringe Analysis Revisited. ACM Comput. Surv. 27(1): 111-119 (1995)
Selection Queries
B+-tree is perfect, but... to answer a selection query (ssn=10) it needs to traverse a full root-to-leaf path.
In practice, 3-4 block accesses (depending on the height of the tree and on buffering).
Any better approach?
Yes! Hashing: static hashing, dynamic hashing.
Hashing
Hash-based indexes are best for equality selections. Cannot support range searches.
Static and dynamic hashing techniques exist; trade-offs similar to ISAM vs. B+ trees.
Static Hashing
# primary pages fixed, allocated sequentially, never de-allocated; overflow pages if needed.
h(k) MOD N = bucket to which the data entry with key k belongs (N = # of buckets).
[Figure: key -> h(key) mod N -> one of the N primary bucket pages 0 .. N-1, each with a possible chain of overflow pages.]
Static Hashing (Contd.)
Buckets contain data entries. The hash function works on the search key field of record r; use its value MOD N to distribute values over the range 0 ... N-1.
h(key) = (a * key + b) usually works well; a and b are constants... more later.
Long overflow chains can develop and degrade performance. Extensible and Linear Hashing are dynamic techniques that fix this problem.
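A minimal static-hashing sketch in Python (the values of N, a, b and the use of plain lists as bucket pages are illustrative choices, not from the slides):

N = 4                        # number of primary buckets, fixed when the file is created
a, b = 31, 7                 # constants of h(key) = (a*key + b); illustrative values

buckets = [[] for _ in range(N)]   # each list stands for a primary page plus its overflow chain

def h(key: int) -> int:
    return (a * key + b) % N

def insert(key: int) -> None:
    buckets[h(key)].append(key)    # a long list here corresponds to a long overflow chain

for k in (10, 11, 12, 13, 14):
    insert(k)
print(buckets)                     # keys spread over buckets 0 .. N-1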
Extensible Hashing
Situation: a bucket (primary page) becomes full. Why not re-organize the file by doubling the # of buckets? Reading and writing all pages is expensive!
Idea: use a directory of pointers to buckets; double the # of buckets by doubling the directory, splitting just the bucket that overflowed. The directory is much smaller than the file, so doubling it is much cheaper. Only one page of data entries is split. No overflow pages!
The trick lies in how the hash function is adjusted!
Example
[Figure: a directory of size 4 (GLOBAL DEPTH 2) with entries 00, 01, 10, 11 pointing to Bucket A (4* 12* 32* 16*), Bucket B (1* 5* 7* 13*), and Bucket C (10*); each bucket is labelled with its LOCAL DEPTH.]
For simplicity we denote r by h(r).
The directory is an array of size 4. The bucket for record r has the directory entry whose index equals the `global depth' least significant bits of h(r):
if h(r) = 5 = binary 101, it is in the bucket pointed to by 01;
if h(r) = 7 = binary 111, it is in the bucket pointed to by 11.
Handling Inserts
Find the bucket where the record belongs. If there's room, put it there.
Else, if the bucket is full, split it: increment the local depth of the original page, allocate a new page with the new local depth, re-distribute records from the original page, and add an entry for the new page to the directory.
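A compact extendible-hashing sketch in Python that follows these steps (the bucket capacity and the class and field names are mine; as in the slides, h(r) is taken to be r itself):

class ExtendibleHash:
    CAP = 4                                    # data entries per bucket page (illustrative)

    def __init__(self):
        self.global_depth = 1
        self.directory = [{"depth": 1, "keys": []},   # bucket for last bit 0
                          {"depth": 1, "keys": []}]   # bucket for last bit 1

    def _bucket(self, key):
        # Bucket address = the `global depth' least significant bits of h(key) = key.
        return self.directory[key & ((1 << self.global_depth) - 1)]

    def insert(self, key):
        b = self._bucket(key)
        if len(b["keys"]) < self.CAP:          # room: just put it there
            b["keys"].append(key)
            return
        # Full bucket: split it.
        if b["depth"] == self.global_depth:    # need one more bit -> double the directory
            self.directory = self.directory + self.directory
            self.global_depth += 1
        b["depth"] += 1
        image = {"depth": b["depth"], "keys": []}     # the `split image'
        old = b["keys"] + [key]
        b["keys"] = []
        bit = 1 << (b["depth"] - 1)
        for i, slot in enumerate(self.directory):     # re-point the slots that now belong to the image
            if slot is b and (i & bit):
                self.directory[i] = image
        for k in old:                          # re-distribute the entries
            self.insert(k)

eh = ExtendibleHash()
for k in [4, 12, 32, 16, 1, 5, 13, 10, 7, 21, 19, 15, 20]:
    eh.insert(k)
print(eh.global_depth)                         # directory has doubled as buckets filled up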
Example: Insert 21, then 19, 15
21 = 10101, 19 = 10011, 15 = 01111.
[Figure: 21 (last two bits 01) lands in Bucket B, which overflows and is split: B keeps 1* 5* 21* 13*, and a new Bucket D takes the entries whose last two bits are 11. 19 and 15 then go to Bucket D, which ends up holding 15* 7* 19*. The directory (global depth 2) does not need to be doubled.]
Insert h(r)=20 (Causes Doubling)
20 = 10100.
[Figure: 20 (last two bits 00) belongs to Bucket A (4* 12* 32* 16*), which is full, so A is split at local depth 3 into Bucket A (32* 16*) and its `split image' Bucket A2 (4* 12* 20*). Because the local depth now exceeds the global depth, the directory doubles to 8 entries (000 ... 111) and the GLOBAL DEPTH becomes 3; Buckets B (1* 5* 21* 13*), C (10*), and D (15* 7* 19*) keep LOCAL DEPTH 2.]
Points to Note
20 = binary 10100. The last 2 bits (00) tell us r belongs in either A or A2; the last 3 bits are needed to tell which.
Global depth of the directory: the max # of bits needed to tell which bucket an entry belongs to.
Local depth of a bucket: the # of bits used to determine whether an entry belongs to this bucket.
When does a bucket split cause directory doubling? When, before the insert, the local depth of the bucket equals the global depth: the insert makes the local depth greater than the global depth, and the directory is doubled by copying it over and `fixing' the pointer to the split image page.
Linear Hashing
This is another dynamic hashing scheme, alternative to Extensible Hashing.
Motivation: Ext. Hashing uses a directory that grows by doubling… Can we do better? (smoother growth)
LH: split buckets from left to right, regardless of which one overflowed (simple, but it works!!)
Linear Hashing (Contd.)
Directory avoided in LH by using overflow pages (chaining approach).
Splitting proceeds in `rounds'. A round ends when all NR initial (for round R) buckets have been split. The current round number is Level.
Search: to find the bucket for data entry r, compute hLevel(r):
if hLevel(r) is in the range `Next to NR', r belongs here; else, r could belong to bucket hLevel(r) or to bucket hLevel(r) + NR, and we must apply hLevel+1(r) to find out.
Family of hash functions: h0, h1, h2, h3, ...; with hi(k) = k mod (2^i * N0), so hi+1(k) = hi(k) or hi(k) + 2^i * N0.
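A tiny Python check of this hash family, using N0 = 4 and the keys of bucket 1 in the example that follows:

N0 = 4                                  # initial number of buckets

def h(i: int, k: int) -> int:
    """The i-th function of the family: h_i(k) = k mod (2^i * N0)."""
    return k % ((1 << i) * N0)

for k in (5, 9, 13, 17):
    # h1 maps k either to h0(k) or to h0(k) + N0
    print(k, h(0, k), h(1, k))          # e.g. 5 -> 1 under h0 but 5 under h1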
Linear Hashing: Example
Initially: h(x) = x mod N (N = 4 here); assume 3 records per bucket. Insert 17: 17 mod 4 = 1.
[Figure: buckets 0, 1, 2, 3 hold (4 8), (5 9 13), (6), (7 11). Inserting 17 overflows bucket 1, which gets an overflow page; but split bucket 0, anyway!!]
Linear Hashing: Example
To split bucket 0, use another function h1(x): h0(x) = x mod N, h1(x) = x mod (2*N).
[Figure: bucket 0's records are re-hashed with h1 and redistributed between bucket 0 (8) and a new bucket 4 (4); the split pointer advances to bucket 1, which still holds 5, 9, 13 with 17 on its overflow page.]
Linear Hashing: Example
h0(x) = x mod N, h1(x) = x mod (2*N). Insert 15 and 3.
[Figure: 15 goes to bucket 3 (7 11 15); 3 also hashes to bucket 3 and lands on an overflow page, which triggers the next split: the bucket at the split pointer (bucket 1) is split, and its records 5, 9, 13, 17 are redistributed with h1 into bucket 1 (9, 17) and a new bucket 5 (5, 13). Buckets 0 through 5 now exist.]
Linear Hashing: Search
h0(x) = x mod N (for the un-split buckets); h1(x) = x mod (2*N) (for the split ones).
[Figure: the buckets 0 through 5 from the previous slide; buckets before the split pointer are addressed with h1, the rest with h0.]
Linear Hashing: Search
Algorithm for Search:
  Search(k):
    b = h0(k)
    if b < split-pointer then b = h1(k)
    read bucket b and search there
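The same search as a small runnable Python sketch (the bucket contents mirror the state right after the first split in the example above; the variable names are mine):

N0 = 4                                   # number of buckets at the start of the round
split_pointer = 1                        # buckets 0 .. split_pointer-1 are already split

# State after splitting bucket 0: buckets 0..4, with 17 on bucket 1's overflow chain.
buckets = [[8], [5, 9, 13, 17], [6], [7, 11], [4]]

def h0(k): return k % N0
def h1(k): return k % (2 * N0)

def search(k):
    b = h0(k)
    if b < split_pointer:                # this bucket was already split this round,
        b = h1(k)                        # so use the next hash function
    return k in buckets[b]

print(search(4), search(8), search(17))  # True True True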
References
[Litwin80] Witold Litwin: Linear Hashing: A New Tool for File and Table Addressing. VLDB 1980: 212-223
http://www.cs.bu.edu/faculty/gkollios/ada01/Papers/linear-hashing.PDF
[B-YS-P98] Ricardo A. Baeza-Yates, Hector Soza-Pollman: Analysis of Linear Hashing Revisited. Nord. J. Comput. 5(1): (1998)