1 cse 326: data structures: hash tables lecture 12: monday, feb 3, 2003

39
1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

Upload: herbert-chapman

Post on 03-Jan-2016

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

1

CSE 326: Data Structures: Hash Tables

Lecture 12: Monday, Feb 3, 2003

Page 2: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

2

Review: Hash Tables

• A hash table = implementation of a dictionary

Main idea: use an array – direct access, in time O(1)

Problem: keys are not integersSolution: use a hash function, h(key)=index

• No hash function is perfect collisions.• Two ways to deal with collisions

Page 3: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

3

3

2

1

0

6

5

4

a d

e b

c

Review: Hashing with Separate Chaining

• Put a little dictionary at each entry– choose type as appropriate

– common case is unordered linked list (chain)

• Properties– performance degrades with

length of chains can be greater than 1

h(a) = h(d)h(e) = h(b)

What was??

Page 4: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

4

Review: Closed HashingProblem with separate chaining:

Memory consumed by pointers – 32 (or 64) bits per key!

What if we only allow one Key at each entry?– two objects that hash to the same spot can’t

both go there– first one there gets the spot– next one must go in another spot

• Properties 1– performance degrades with difficulty of

finding right spot

a

c

e3

2

1

0

6

5

4

h(a) = h(d)h(e) = h(b)

d

b

Page 5: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

5

Review: Closed Hashing

• Given an item X, try cells h0(X), h1(X), h2(X), …, hi(X)

• hi(X) = (Hash(X) + F(i)) mod TableSize – Define F(0) = 0

• F is the collision resolution function. Some possibilities:– Linear: F(i) = i – Quadratic: F(i) = i2 – Double Hashing: F(i) = iHash2(X)

Page 6: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

6

Deletion with Separate Chaining

Why is this slide blank?

Page 7: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

7

0

1

2

73

2

1

0

6

5

4

delete(2)

0

1

73

2

1

0

6

5

4

find(7)

Where is it?!

Deletion in Closed Hashing

What should we do instead?

Page 8: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

8

0

1

2

73

2

1

0

6

5

4

delete(2)

0

1

#

73

2

1

0

6

5

4

find(7)Indicates deleted value:if you find it, probe again

Lazy Deletion

But now what is the problem?

Page 9: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

9

The Squished Pigeon Principle

• An insert using Closed Hashing cannot work with a load factor of 1 or more.– Quadratic probing can fail if > ½

– Linear probing and double hashing slow if > ½

– Lazy deletion never frees space

• Separate chaining becomes slow once > 1– Eventually becomes a linear search of long chains

• How can we relieve the pressure on the pigeons?

REHASH!

Page 10: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

10

Rehashing Example

Separate chaining h1(x) = x mod 5 rehashes to h2(x) = x mod 11

=1

=5/11

1 2 3 4

1 2 3 4 5 6 7 8 9 10

0

0

25 3752

8398

25 37 83 52 98

Page 11: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

11

Rehashing Amortized Analysis

• Consider sequence of n operationsinsert(3); insert(19); insert(2); …

• What is the max number of rehashes?• What is the total time?

– let’s say a regular hash takes time a, and rehashing an array contain k elements takes time bk.

• Amortized time = (an+b(2n-1))/n = O( 1 )

log n

log

(1 2 4 8 ... ) 2

(2 1)

ni

i o

an b n an b

an b n

Page 12: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

12

Rehashing without Stretching

• Suppose input is a mix of inserts and deletes– Never more than TableSize/2 active keys

– Rehash when =1 (half the table must be deletions)

• Worst-case sequence:– T/2 inserts, T/2 deletes, T/2 inserts, Rehash,

T/2 deletes, T/2 inserts, Rehash, …

• Rehashing at most doubles the amount of work – still O(1)

Page 13: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

13

Case Study

• Spelling dictionary– 50,000 words

– static

– arbitrary(ish) preprocessing time

• Goals– fast spell checking

– minimal storage

• Practical notes– almost all searches are

successful

– words average about 8 characters in length

– 50,000 words at 8 bytes/word is 400K

– pointers are 4 bytes

– there are many regularities in the structure of English words

Why?

Page 14: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

14

Solutions

• Solutions– sorted array + binary search– separate chaining– open addressing + linear probing

Page 15: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

15

Storage

• Assume words are strings and entries are pointers to strings

Array +binary search Separate chaining

Closed hashing

n pointers

table size + 2n pointers =n/ + 2n n/ pointers

Page 16: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

16

Analysis

• Binary search– storage: n pointers + words = 200K+400K = 600K– time: log2n 16 probes per access, worst case

• Separate chaining - with = 1 – storage: n/ + 2n pointers + words = 200K+400K+400K =

1GB– time: 1 + /2 probes per access on average = 1.5

• Closed hashing - with = 0.5– storage: n/ pointers + words = 400K + 400K = 800K

– time: probes per access on average = 1.5

1

11

2

1

50K words, 4 bytes @ pointer

Page 17: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

17

Approximate Hashing• Suppose we want to reduce the space

requirements for a spelling checker, by accepting the risk of once in a while overlooking a misspelled word

• Ideas?

Page 18: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

18

Approximate Hashing

Strategy:– Do not store keys, just a bit indicating cell is in

use– Keep low so that it is unlikely that a

misspelled word hashes to a cell that is in use

Page 19: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

19

Example

• 50,000 English words• Table of 500,000 cells, each 1 bit

– 8 bits per byte

• Total memory: 500K/8 = 62.5 K– versus 800 K separate chaining, 600 K open addressing

• Correctly spelled words will always hash to a used cell

• What is probability a misspelled word hashes to a used cell?

Page 20: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

20

Rough Error Calculation

• Suppose hash function is optimal - hash is a random number

• Load factor 0.1– Lower if several correctly spelled words hash to

the same cell

• So probability that a misspelled word hashes to a used cell is 10%

Page 21: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

21

Exact Error Calculation

• What is expected load factor?

used cells (Probability a cell is used)(table size)

table size table sizeProbability a cell is used = 1 - (Prob. cell not used)

=1-(Prob. 1st word doesn't use cell)...(Prob. last word doesn't use cell)

=

number wor

50,000

ds1-((table

499,9991

50

size - 1)/ta

0,0000.0

ble

9

s z

5

i e)

Page 22: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

22

Puzzler

• Suppose you have a HUGE hash table, that you often need to re-initialize to “empty”. How can you do this in small constant time, regardless of the size of the table?

Page 23: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

23

A Random Hash…• Extensible hashing

– Hash tables for disk-based databases – minimizes number disk accesses

• Minimal perfect hash function– Hash a given set of n keys into a table of size n with no collisions

– Might have to search large space of parameterized hash functions to find

– Application: compilers

• One way hash functions– Used in cryptography

– Hard (intractable) to invert: given just the hash value, recover the key

Page 24: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

24

Databases

• A database is a set of records, each a tuple of values– E.g.: [ name, ss#, dept., salary ]

• How can we speed up queries that ask for all employees in a given department?

• How can we speed up queries that ask for all employees whose salary falls in a given range?

Page 25: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

25

Hash Tables on Secondary Storage (Disks)

Main differences:

• One bucket = one block, hence may hold multiple keys

• Open chaining: use overflow blocks when needed

• Closed chaining never used

Page 26: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

26

• Assume 1 bucket (block) stores 2 keys + pointers

• h(e)=0

• h(b)=h(f)=1

• h(g)=2

• h(a)=h(c)=3

Hash Table Example

e

b

f

g

a

c

0

1

2

3

Page 27: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

27

• Search for a:

• Compute h(a)=3

• Read bucket 3

• 1 disk access

Searching in a Hash Table

e

b

f

g

a

c

0

1

2

3

Page 28: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

28

• Place in right bucket, if space

• E.g. h(d)=2

Insertion in Hash Table

e

b

f

g

d

a

c

0

1

2

3

Page 29: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

29

• Create overflow block, if no space• E.g. h(k)=1

• More over-flow blocksmay be needed

Insertion in Hash Table

e

b

f

g

d

a

c

0

1

2

3

k

Page 30: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

30

Hash Table Performance

• Excellent, if no overflow blocks

• Degrades considerably when number of keys exceeds the number of buckets (I.e. many overflow blocks).

Page 31: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

31

Extensible Hash Table

• Allows has table to grow, to avoid performance degradation

• Assume a hash function h that returns numbers in {0, …, 2k – 1}

• Start with n = 2i << 2k , only look at first i most significant bits

Page 32: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

32

Extensible Hash Table

• E.g. i=1, n=2i=2, k=4

• Note: we only look at the first bit (0 or 1)

0(010)

1(011)

i=1 1

1

01

Page 33: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

33

Insertion in Extensible Hash Table

• Insert 11100(010)

1(011)

1(110)

i=1 1

1

01

Page 34: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

34

Insertion in Extensible Hash Table

• Now insert 1010

• Need to extend table, split blocks

• i becomes 2

0(010)

1(011)

1(110), 1(010)

i=1 1

1

01

Page 35: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

35

Insertion in Extensible Hash Table

0(010)

10(11)

10(10)

i=2 1

2

00011011

11(10) 2

Page 36: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

36

Insertion in Extensible Hash Table

• Now insert 0000, then 0101

• Need to split block

0(010)

0(000), 0(101)

10(11)

10(10)

i=2 1

2

00011011

11(10) 2

Page 37: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

37

Insertion in Extensible Hash Table

• After splitting the block00(10)

00(00)

10(11)

10(10)

i=2

2

2

00011011

11(10) 2

01(01) 2

Page 38: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

38

Extensible Hash Table

• How many buckets (blocks) do we need to touch after an insertion ?

• How many entries in the hash table do we need to touch after an insertion ?

Page 39: 1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003

39

Performance Extensible Hash Table

• No overflow blocks: access always O(1)– More precisely: exactly one disk I/O

• BUT:– Extensions can be costly and disruptive– After an extension table may no longer fit in

memory