hashingsaksagan.ceng.metu.edu.tr/.../week7_hashing_section2.pdf–extendible hashing ceng 351 12...

27
CENG 351 1 Hashing

Upload: others

Post on 16-Oct-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 1

Hashing

Page 2: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 2

Motivation

• The primary goal is to locate the desired record in

a single access of disk.

– Sequential search: O(N)

– B+ trees: O(logk N)

– Hashing: O(1)

• In hashing, the key of a record is transformed into

an address and the record is stored at that address.

• Hash-based indexes are best for equality selections.

Cannot support range searches.

• Static and dynamic hashing techniques exist.

Page 3: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 3

Hash-based Index

• Data entries are kept in buckets (an abstract term)

• Each bucket is a collection of one primary page and

zero or more overflow pages.

• Given a search key value, k, we can find the bucket

where the data entry k* is stored as follows:

– Use a hash function, denoted by h

– The value of h(k) is the address for the desired bucket.

h(k) should distribute the search key values uniformly over

the collection of buckets

Page 4: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 4

Design Factors

• Bucket size: the number of records that can

be held at the same address.

• Loading factor: the ratio of the number of

records put in the file to the total capacity of

the buckets.

• Hash function: should evenly distribute the

keys among the addresses.

• Overflow resolution technique.

Page 5: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 5

Hash Functions

• Key mod N:

– N is the size of the table, better if it is prime.

• Folding:

– e.g. 123|456|789: add them and take mod.

• Truncation:

– e.g. 123456789 map to a table of 1000 addresses by picking 3 digits of the key.

• Squaring:

– Square the key and then truncate

• Radix conversion:

– e.g. 1 2 3 4 treat it to be base 11, truncate if necessary.

Page 6: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 6

Static Hashing• Primary Area: # primary pages fixed, allocated

sequentially, never de-allocated; (say M buckets).

– A simple hash function: h(k) = f(k) mod M

• Overflow area: disjoint from the primary area. It

keeps buckets which hold records whose key maps

to a full bucket.

– Adding the address of an overflow bucket to a primary

area bucket is called chaining.

• Collision does not cause a problem as long as there

is still room in the mapped bucket. Overflow occurs

during insertion when a record is hashed to the

bucket that is already full.

Page 7: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 7

Example• Assume f(k) = k. Let M = 5. So, h(k) = k mod 5

• Bucket factor = 3 records.

Insert: 12, 35, 44, 60, 6, 46,57,33,62,17

0

1

2

3

4

Primary area

overflow

12

35

44

60

46

57

33

62 17

6

Page 8: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 8

Load Factor (Packing density)

• To limit the amount of overflow we allocate more

space to the primary area than we need (i.e. the

primary area will be, say, 70% full)

• Load Factor =

=> Lf =

# of records in the file

# of spaces in primary area

n

M * Bkfr

Page 9: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 9

Effects of Lf and Bkfr

• Performance can be enhanced by the choice

of bucket size and load factor.

• In general, a smaller load factor means

– less overflow and a faster fetch time;

– but more wasted space.

• A larger Bkfr means

– less overflow in general,

– but slower fetch.

Page 10: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 10

Insertion and Deletion

• Insertion: New records are inserted at the end of the chain.

• Deletion: Two ways are possible:

1. Mark the record to be deleted

2. Consolidate sparse buckets when deleting records.

– In the 2nd approach:

• When a record is deleted, fill its place with the last record in the chain of the current bucket.

• Deallocate the last bucket when it becomes empty.

Page 11: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 11

Problem of Static Hashing

• The main problem with static hashing: the number of

buckets is fixed:

– Long overflow chains can develop and degrade performance.

– On the other hand, if a file shrinks greatly, a lot of bucket

space will be wasted.

• There are some other hashing techniques that allow dynamically

growing and shrinking hash index. These include:

– linear hashing

– extendible hashing

Page 12: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 12

Linear Hashing

• It maintains a constant load factor.

• Thus avoids reorganization.

• It does so, by incrementally adding new

buckets to the primary area.

• In linear hashing the last bits in the hash

number are used for placing the records.

Page 13: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 13

Example

000 8 16 32

001 17 25

010 34 50

011 11 27

100 28

101 5

110 14

111 55 15

Last 3 bitse.g.

34: 100010

28: 011100

08: 001000

13: 001101

21: 010101

37: 100101

12: 001100

Insert: 13, 21, 37,12

Lf = 14/24

= 58%

13 21 37

Lf = 15/24

= 63%

Lf = 16/24

= 67%

Lf = 17/24

= 70%

Page 14: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 14

Insertion of records

• To expand the table: split an existing bucket

denoted by k digits into two buckets using

the last k+1 digits.

• e.g.

000

1000

0000

Page 15: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 15

Expanding the table

0000 16 32

001 17 25

010 34 50

011 11 27

100 28 12

101 5 13 21

110 14

111 55 15

1000 8

37

Boundary

value

Page 16: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 16

0000 16 32

0001 17

0010 34 50

011 11 27

100 28 12

101 5 13 21

110 14

111 55 15

1000 8

1001 25

1010 26

Boundary

value

37

k = 3

Hash # 1000: uses last 4 digits

Hash # 1101: uses last 3 digits

Page 17: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 17

Fetching a record

• Calculate the hash function.

• Look at the last k digits.

– If it’s less than the boundary value, the location

is in the bucket labeled with the last k+1 digits.

– Otherwise it is in the bucket labeled with the

last k digits.

• Follow overflow chains as with static

hashing.

Page 18: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 18

Insertion

• Search for the correct bucket into which to place the new record.

• If the bucket is full, allocate a new overflow bucket.

• If there are now Lf*Bkfr records more than needed for the given Lf,

– Add one more bucket to the primary area.

– Distribute the records from the bucket chain at the boundary value between the original area and the new primary area buckets

– Add 1 to the boundary value.

Page 19: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 19

Deletion

• Read in a chain of records.

• Replace the deleted record with the last record in the chain.

– If the last overflow bucket becomes empty, deallocate it.

• When the number of records is Lf * Bkfr less than the number

needed for Lf, contract the primary area by one bucket.

Compressing the table is exact opposite of expanding it:

• Keep the total # of records in the file and buckets in primary

area.

• When we have Lf * Bkfr fewer records than needed,

consolidate the last bucket with the bucket which shares the

same last k digits.

Page 20: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 20

Extendible Hashing

• Hash function returns b bits

• Only the prefix i bits are used to hash the item

• There are 2i entries in the bucket address table

• Let ij be the length of the common hash prefix for data bucket j, there

is 2(i-ij) entries in bucket address table points to j

Extendable Hashingi

i2

bucket2

i3

bucket3

i1

bucket1

Data bucket

Bucket address table

Length of common hash prefixHash prefix

Page 21: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 21

Splitting a bucket: Case 1

• Splitting (Case 1: ij=i)

– Only one entry in bucket address table points to data

bucket j

– i++; split data bucket j to j, z; ij=iz=i; rehash all items

previously in j;

Extendable Hashing

3

3

2

1

3

000

001

010

011

100

101

110

111

2

00

01

10

11

2

2

1

Page 22: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 22

Splitting: Case 2

• Splitting (Case 2: ij< i)

– More than one entry in bucket address table point to

data bucket j

– split data bucket j to j, z; ij = iz = ij +1; Adjust the

pointers previously point to j to j and z; rehash all items

previously in j;

Extendable Hashing

2

2

1

3

000

001

010

011

100

101

110

111

2

2

2

3

000

001

010

011

100

101

110

1112

Page 23: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 23

Example

• Suppose the hash function is h(x) = x mod 8 and each

bucket can hold at most two records. Show the extendable

hash structure after inserting 1, 4, 5, 7, 8, 2, 20.

Extendable Hashing

1 4 5 7 8 2 20

001 100 101 111 000 010 100

1

4

00

1

11

4

5

1

0

1

00

01

10

11

1

8

1

2

4

5

2

7

2

Page 24: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 24

Example

3

000

001

010

011

100

101

110

111

1

8

2

4

20

2

7

2

2

3

5

3

1

8

2

2

2

2

4

5

2

7

2

00

01

10

11

inserting 1, 4, 5, 7, 8, 2, 20

1 4 5 7 8 2 20

001 100 101 111 000 010 100

Extendable Hashing

Page 25: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 25

Comments on Extendible Hashing

• If directory fits in memory, equality search answered with one disk access.

– A typical example: a100MB file with 100 bytes/entry and a page size of 4K contains 1,000,000 records (as data entries) but only about 25,000 directory elements

chances are high that directory will fit in memory.

• If the distribution of hash values is skewed (e.g., a large number of search key values all are hashed to the same bucket ), directory can grow large.

– But this kind of skew can be avoided with a well-tuned hashing function

Page 26: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 26

Comments on Extendible Hashing

• Delete: If removal of data entry makes

bucket empty, can be merged with a “buddy”

bucket. If each directory element points to

same bucket as its split image, can halve

directory.

Page 27: Hashingsaksagan.ceng.metu.edu.tr/.../week7_Hashing_section2.pdf–extendible hashing CENG 351 12 Linear Hashing •It maintains a constant load factor. •Thus avoids reorganization

CENG 351 27

Summary

• Hash-based indexes: best for equality searches,

cannot support range searches.

• Static Hashing can lead to long overflow chains.

• Extendible Hashing avoids overflow pages by

splitting a full bucket when a new data entry is to

be added to it.

– Directory to keep track of buckets, doubles periodically.

– Can get large with skewed data; additional I/O if this

does not fit in main memory.