Data Structures - CSCI 102
Copyright © William C. Cheng

Data Structure Limitations

Vectors, Linked Lists, Stacks, Queues, Deques
    Can’t provide fast insertion/removal and fast lookup at the same time
Binary Search Trees, Heaps
    Provide consistently fast operations, but must maintain an internal ordering
How can we further improve the performance of lookup, add & removal?
    What if we didn’t care about the ordering of the elements at all?

Lookup Tables

For operations where we only care about fast add/remove/search, not fast traversal, we create a table structure to optimize for fast lookup
    Each value in the table has a unique key
    The key is used as a short identifier to look up an entire value in the table
Example
    Your student ID is used to look up your student record (e.g. name, GPA, etc.)

What kind of operations do we need to perform on a lookup table?
    Search(key)
        See if a particular value identified by key is in the table
    Insert(key,value)
        Insert a new value identified by key into the table
    Remove(key)
        Remove the value identified by key from the table
We don’t care as much about traversal (visiting all elements) in this scenario
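
As a point of comparison, the three operations can be written against the simplest possible storage. The sketch below is illustrative only: the class name `NaiveLookupTable` and its unsorted-vector storage are assumptions, not something from the slides. Every operation scans the vector, so each one is O(N), which is the cost a table structure is meant to avoid.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical baseline: a lookup table backed by an unsorted vector.
class NaiveLookupTable {
public:
    // Insert(key, value): add the pair, or overwrite the value if key exists.
    void insert(int key, const std::string& value) {
        for (auto& entry : entries_) {
            if (entry.first == key) { entry.second = value; return; }
        }
        entries_.push_back(std::make_pair(key, value));
    }

    // Search(key): pointer to the stored value, or nullptr if key is absent.
    const std::string* search(int key) const {
        for (const auto& entry : entries_) {
            if (entry.first == key) return &entry.second;
        }
        return nullptr;
    }

    // Remove(key): erase the entry; returns true if something was removed.
    bool remove(int key) {
        for (std::size_t i = 0; i < entries_.size(); ++i) {
            if (entries_[i].first == key) {
                entries_.erase(entries_.begin() + i);
                return true;
            }
        }
        return false;
    }

private:
    std::vector<std::pair<int, std::string>> entries_;
};
```

Calling `insert(3285, "John Doe")` and then `search(3285)` returns a pointer to the stored name; `search` on a key that was never inserted returns `nullptr`.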

Sample Object

We want to keep a directory of all the students at USC and be able to look them up by their student ID
    Let’s assume ID is a unique integer

    #include <string>

    struct Student {
        std::string name;
        double gpa;
        int id;
    };

Direct Address Table

If we can guarantee that student IDs will always range from 0 to N (e.g. 0 to 4999), we could just store them in an array of N+1 elements:

    Student data[5000];   // IDs 0 to 4999 need 5000 slots

Then when we want to grab a particular student, we know Student N is at index N:

    int id = 3285;
    Student s = data[id];

Direct Address Table

[Figure: Student IDs 0, 2 and 4 are used directly as indexes into the Data array (slots 0 to 4999). Each used slot holds a Student object with Name, GPA and ID fields: John Doe (GPA 3.20), Jane Doe (GPA 2.62), Some Guy (GPA 3.7, ID 4).]

Direct Address Table

Direct Addressing
    Maps keys directly to the indexes in an array
    Unused array indexes need to be marked
        Generally use NULL
    Operations are fast
        O(1) worst case
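
A minimal sketch of such a table, assuming IDs in the range 0 to 4999 and a pointer array in which a null pointer marks an unused slot. The class name `DirectAddressTable`, the constant `MAX_ID` and the choice to heap-allocate each record are assumptions made for illustration.

```cpp
#include <string>

struct Student {
    std::string name;
    double gpa;
    int id;
};

// Assumption for this sketch: valid IDs are 0..MAX_ID, so MAX_ID + 1 slots.
const int MAX_ID = 4999;

class DirectAddressTable {
public:
    DirectAddressTable() {
        for (int i = 0; i <= MAX_ID; ++i) slots_[i] = nullptr;  // mark unused
    }
    ~DirectAddressTable() {
        for (int i = 0; i <= MAX_ID; ++i) delete slots_[i];
    }
    // The table owns its records, so copying is disabled in this sketch.
    DirectAddressTable(const DirectAddressTable&) = delete;
    DirectAddressTable& operator=(const DirectAddressTable&) = delete;

    // Insert: O(1). Replaces any record already stored under the same ID.
    bool insert(const Student& s) {
        if (!inRange(s.id)) return false;
        delete slots_[s.id];
        slots_[s.id] = new Student(s);
        return true;
    }

    // Search: O(1). A null result means no student has this ID.
    const Student* search(int id) const {
        return inRange(id) ? slots_[id] : nullptr;
    }

    // Remove: O(1). Returns false if the slot was already unused.
    bool remove(int id) {
        if (!inRange(id) || slots_[id] == nullptr) return false;
        delete slots_[id];
        slots_[id] = nullptr;
        return true;
    }

private:
    static bool inRange(int id) { return id >= 0 && id <= MAX_ID; }
    Student* slots_[MAX_ID + 1];
};
```

Each operation touches exactly one array element, which is where the O(1) worst case comes from. The array itself always has 5000 entries, no matter how few students are stored.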

Direct Address Table

Direct Addressing Issues
    Key Restrictions
        Keys must fall into a nice, uniform range
        Keys must be numeric
    Array Size
        If there are N possible keys, then data[] must be of size N
        Our array could get HUGE
        What if we’re only using a small number of keys?
        Tons of space is wasted
How can we get around these limitations?

Hash Functions

Hash Functions
    A function that maps key values to array indexes
    Input records all have a unique key
    The hash function maps key to an array index
    Records are stored at data[hash(key)]
    Ideally every unique key also has unique hash(key)
Direct Addressing essentially uses a hash function that does nothing

    int directAddressHash(int studentId) {
        return studentId;
    }

Hash Tables

[Figure: Student IDs (keys) 0, 2 and 4 pass through the Hash Function; hash(0), hash(2) and hash(4) select the slots of the Data array that hold the Student objects (Name, GPA, ID): John Doe (3.2, ID 0), Jane Doe (2.6, ID 2), Some Guy (3.7, ID 4).]

Hash Tables

Hash Functions
    How can we avoid having to make our array gigantic to hold all possible keys?
    Recall direct addressing:

        int directAddressHash(int studentId) {
            return studentId;
        }

    Simple solution: use modular arithmetic
        Size of the backing array is no longer dependent on the number of possible keys

        int modularHash(int studentId) {
            return studentId % ARRAY_SIZE;
        }
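
A small variation on the modular hash, written as a sketch. It takes the table size as a parameter and also folds negative keys into range, because in C++ the `%` operator can yield a negative result for a negative left operand. The two-argument signature and the negative-key handling are assumptions added for illustration.

```cpp
// Maps any int key into the index range 0 .. arraySize-1.
// Assumes arraySize > 0.
int modularHash(int key, int arraySize) {
    int r = key % arraySize;           // may be negative if key is negative
    return r < 0 ? r + arraySize : r;  // fold into 0 .. arraySize-1
}
```

With a table of size 5, keys 123 and 128 both map to index 3. The array can now be small, but two distinct keys can land on the same slot.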

Hash Functions

What makes a good hash function?
    Fast
        Hashing is supposed to be faster than a binary search tree. hash(key) needs to be O(1)
    Deterministic
        If we have a key K, then hash(K) must always give the same result
    Uniform distribution
        The hash function should uniformly distribute keys across all of the available indexes in the storage array

Hash Functions

Making a good hash function is hard
Making a hash function
    Map your data into the set of natural numbers
        N = {0, 1, 2, ...}
        For strings, use things like ASCII letter codes
    Try to be independent of any patterns that may exist in the data
    Handle variants of the same pattern
        E.g. make sure "get" and "gets" hash differently
    Prime numbers are your friend
        Prime table sizes tend to yield better results
You won’t usually have to write your own, but you should know what the default hash function does
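
One common way to follow this advice for strings is a polynomial hash over the character codes, reduced modulo a prime table size. The sketch below is an assumption-laden illustration, not a recommended production hash: the helper name `stringHash`, the multiplier 31 and the table size 101 used in the note are all choices made here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: polynomial hash over the character codes of key.
// Assumes tableSize > 0; a prime tableSize is the usual choice.
std::size_t stringHash(const std::string& key, std::size_t tableSize) {
    std::size_t h = 0;
    for (unsigned char c : key) {
        // Mix in each character, keeping h within 0 .. tableSize-1.
        h = (h * 31 + c) % tableSize;
    }
    return h;
}
```

With a table size of 101, "get" hashes to 18 and "gets" hashes to 67, so this pair of variants lands in different slots. Other pairs can still collide.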

Hash Tables

Hashing Issues
    Hash Tables do not maintain any ordering of their internal elements
    Creating a perfect hash function is almost impossible
Collisions
    When two distinct keys generate the same hash value it’s called a collision
        hash(K1) == hash(K2)

Collision Handling

Open Addressing
    If we try to insert a new element and there’s a collision, keep probing the hash table until we find a vacant space
    All data is stored directly in the hash table. No extra data structures are needed.
Probing
    If a collision occurs, use a deterministic algorithm to calculate the next array index to check (based on the initial hash result)

Open Addressing (Linear Probing)

Start with an empty Hash Table (Data, slots 0 to 4)

Insert "John Doe" (GPA 2.8) with ID = 123
    hash(123) = 1
    data[1] is empty, no collision
    store it there
Hash Table contains one item

Insert "Jane Doe" (GPA 3.4) with ID = 202
    hash(202) = 3
    data[3] is empty, no collision
    store it there
Hash Table contains two items

Insert "Some Guy" (GPA 3.5) with ID = 401
    hash(401) = 1
    data[1] is non-empty, collision!
    hash(401)+1 = 2
    data[2] is empty, no collision
    store it there
Hash Table contains three items

    Slot    Name        GPA    ID
    0       (empty)
    1       John Doe    2.8    123
    2       Some Guy    3.5    401
    3       Jane Doe    3.4    202
    4       (empty)
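
The walkthrough can be sketched in code as follows. The slides do not say which hash function produced hash(123) = 1, hash(202) = 3 and hash(401) = 1, so the hypothetical `toyHash` below simply hard-codes those three values and falls back to a modular hash for any other key. The class name `LinearProbingTable`, the wrap-around from the last slot to slot 0, and the omission of removal are also assumptions of this sketch.

```cpp
#include <string>

struct Student {
    std::string name;
    double gpa;
    int id;
};

const int TABLE_SIZE = 5;

// Toy hash reproducing the values shown in the walkthrough (hard-coded,
// since the slides do not define the function). Other keys use id % size.
int toyHash(int id) {
    switch (id) {
        case 123: return 1;
        case 202: return 3;
        case 401: return 1;
        default:  return id % TABLE_SIZE;  // assumes id >= 0
    }
}

class LinearProbingTable {
public:
    LinearProbingTable() {
        for (int i = 0; i < TABLE_SIZE; ++i) used_[i] = false;
    }

    // Returns the slot where s was stored, or -1 if the table is full.
    int insert(const Student& s) {
        int start = toyHash(s.id);
        for (int step = 0; step < TABLE_SIZE; ++step) {
            int index = (start + step) % TABLE_SIZE;  // probe next slot, wrapping
            if (!used_[index] || slots_[index].id == s.id) {
                slots_[index] = s;
                used_[index] = true;
                return index;
            }
        }
        return -1;
    }

    // Follows the same probe sequence as insert; nullptr if id is absent.
    const Student* search(int id) const {
        int start = toyHash(id);
        for (int step = 0; step < TABLE_SIZE; ++step) {
            int index = (start + step) % TABLE_SIZE;
            if (!used_[index]) return nullptr;  // vacant slot ends the search
            if (slots_[index].id == id) return &slots_[index];
        }
        return nullptr;
    }

private:
    Student slots_[TABLE_SIZE];
    bool used_[TABLE_SIZE];
};
```

Inserting John Doe (123), Jane Doe (202) and Some Guy (401) in that order stores them in slots 1, 3 and 2, matching the table above. Removal is left out on purpose: simply marking a slot vacant would cut the probe sequence and make later entries unreachable by `search`, so a real implementation needs an extra "deleted" marker or a re-insertion step.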

Open Addressing (Linear Probing)

What is the Big O of each of these operations?
    Search(key)          Average: O(1), Worst Case: O(N)
    Insert(key,value)    Average: O(1), Worst Case: O(N)
    Remove(key)          Average: O(1), Worst Case: O(N)
Operations depend on the table’s load factor
    How big is the table?
    How many slots are taken already?
    load factor = (# of elements) / (size of array)
"Utilization"

Collision Handling

Chaining
    Each slot in the Hash Table can now contain a list of elements instead of a single element
    When multiple items hash to the same slot, they are placed in the list at that slot
    This requires the overhead of an extra list for each slot that contains one or more elements

Chaining

Hash Table contains two items: "John Doe" (GPA 2.8, ID 123) and "Jane Doe" (GPA 3.4, ID 202)

Insert "Some Guy" (GPA 3.5) with ID = 401
    hash(401) = 1
    data[1] is non-empty, collision!
    Chaining says to add the new entry to the list at data[1]
    Insert Some Guy in the list at data[1]
Hash Table contains three items
    data[1] now holds a list with both John Doe and Some Guy
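
A minimal chaining sketch, using a `std::vector` of `std::list` as the table. As in the walkthrough, the hash values for IDs 123, 202 and 401 are hard-coded in a hypothetical `toyHash` because the slides do not define the function. The class name `ChainedTable` and the choice to insert at the front of the chain are assumptions.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <vector>

struct Student {
    std::string name;
    double gpa;
    int id;
};

const int TABLE_SIZE = 5;

// Toy hash reproducing the walkthrough values; other keys use id % size.
int toyHash(int id) {
    switch (id) {
        case 123: return 1;
        case 202: return 3;
        case 401: return 1;
        default:  return id % TABLE_SIZE;  // assumes id >= 0
    }
}

class ChainedTable {
public:
    ChainedTable() : slots_(TABLE_SIZE) {}

    // O(1): push onto the front of the chain. Assumes the caller never
    // inserts an ID that is already present.
    void insert(const Student& s) {
        slots_[toyHash(s.id)].push_front(s);
    }

    // Walks one chain only; cost grows with that chain's length.
    const Student* search(int id) const {
        const std::list<Student>& chain = slots_[toyHash(id)];
        for (const Student& s : chain) {
            if (s.id == id) return &s;
        }
        return nullptr;
    }

    // Walks one chain and unlinks the matching entry, if any.
    bool remove(int id) {
        std::list<Student>& chain = slots_[toyHash(id)];
        for (auto it = chain.begin(); it != chain.end(); ++it) {
            if (it->id == id) {
                chain.erase(it);
                return true;
            }
        }
        return false;
    }

    std::size_t chainLength(int slot) const { return slots_[slot].size(); }

private:
    std::vector<std::list<Student>> slots_;
};
```

After inserting the three students, `chainLength(1)` is 2 (John Doe and Some Guy share slot 1) and `chainLength(3)` is 1. Unlike open addressing, removal needs no special marker, because deleting a list node does not disturb any other key's lookup path.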

Chaining

What is the Big O of each of these operations?
    Search(key)          Average: O(1), Worst Case: O(N)
    Insert(key,value)    Average: O(1), Worst Case: O(1)
    Remove(key)          Average: O(1), Worst Case: O(N)
Operations depend on the average length of a chain (except for insert)

Collision Handling

The Problem
    If a malicious user knows what hash function you’re using, they can intentionally cause your worst-case behavior
Universal Hashing
    When the Hash Table is created, randomly choose a hash function independent of the keys that are going to be stored
    No single input gives worst-case behavior (just like randomized Quicksort)
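
The slides do not name a particular family of hash functions. One classic choice, used here as an assumption, is h(k) = ((a*k + b) mod p) mod m, where p is a prime larger than any key, m is the table size, and a and b are drawn at random when the table is created. The class name `RandomizedHash` and the prime 2^31 - 1 are choices made for this sketch.

```cpp
#include <cstdint>
#include <random>

// Prime modulus for the family (2^31 - 1). The universality argument
// assumes every key is smaller than this value.
const std::uint64_t PRIME = 2147483647ULL;

class RandomizedHash {
public:
    // Picks a in [1, PRIME-1] and b in [0, PRIME-1] from the supplied
    // generator. Assumes tableSize > 0.
    RandomizedHash(std::uint64_t tableSize, std::mt19937_64& rng)
        : m_(tableSize) {
        std::uniform_int_distribution<std::uint64_t> pickA(1, PRIME - 1);
        std::uniform_int_distribution<std::uint64_t> pickB(0, PRIME - 1);
        a_ = pickA(rng);
        b_ = pickB(rng);
    }

    // Deterministic once constructed: the same key always gives the same slot.
    // a_ < 2^31 and key < 2^32, so a_ * key + b_ fits in 64 bits.
    std::uint64_t operator()(std::uint32_t key) const {
        return ((a_ * key + b_) % PRIME) % m_;
    }

private:
    std::uint64_t m_;
    std::uint64_t a_;
    std::uint64_t b_;
};
```

A table would construct one `RandomizedHash` at creation time and use it for its whole lifetime. Seeding the generator from an unpredictable source such as `std::random_device` is what keeps an attacker from knowing which function was picked; a fixed seed is only suitable for tests.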

Collision Handling

Multi-Level Hashing
    Like chaining, but each element in the hash table holds another hash table with a different hash function
    By hashing multiple times, we can greatly decrease the odds of a collision
Perfect Hashing
    If the set of possible keys is static (never changes), we can develop a perfect multi-level hash to give O(1) worst case performance
    e.g. The reserved keywords in a programming language are a static set of keys
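
To make the idea of a perfect hash for a static key set concrete, here is a deliberately simplified, single-level sketch rather than the multi-level scheme described above. It tries a range of seeds for a polynomial string hash and keeps the first one under which no two keys share a slot. The function names `seededHash` and `findPerfectSeed`, the seed range, and the oversized table are all assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: polynomial hash whose multiplier is the seed.
std::size_t seededHash(const std::string& key, std::uint32_t seed,
                       std::size_t tableSize) {
    std::uint32_t h = 0;
    for (unsigned char c : key) {
        h = h * seed + c;  // unsigned arithmetic wraps around on overflow
    }
    return h % tableSize;
}

// Tries seeds 1..maxSeed and returns the first one that maps every key to
// its own slot. Returns 0 if no such seed is found in that range.
std::uint32_t findPerfectSeed(const std::vector<std::string>& keys,
                              std::size_t tableSize, std::uint32_t maxSeed) {
    for (std::uint32_t seed = 1; seed <= maxSeed; ++seed) {
        std::vector<bool> taken(tableSize, false);
        bool collision = false;
        for (const std::string& key : keys) {
            std::size_t slot = seededHash(key, seed, tableSize);
            if (taken[slot]) { collision = true; break; }
            taken[slot] = true;
        }
        if (!collision) return seed;
    }
    return 0;
}
```

For a fixed keyword list such as "if", "else", "while", "for" and "return" with a table of 25 slots, the search runs once, ahead of time. Afterwards every lookup is a single hash plus one comparison, which is the O(1) worst case. The approach only works because the key set never changes; adding a key could invalidate the chosen seed.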

Hash Tables

Other Notes
    Hash Tables generally do provide a way for you to retrieve a list of the known keys
        Just keep in mind there is no guaranteed ordering of the keys
    C++ had no built-in hash table before C++11
        unordered_map, once only a proposal, has been part of the standard library since C++11
        Google Sparse Hash provides C++ hash tables
        Boost C++ Libraries provides hash tables
            http://www.boost.org/
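
Since C++11 the standard library does ship a hash table, `std::unordered_map`. The sketch below shows the three lookup-table operations with it. The `Student` struct follows the earlier slides, while the helper names `buildDirectory` and `findStudent` are made up for this example.

```cpp
#include <string>
#include <unordered_map>

struct Student {
    std::string name;
    double gpa;
    int id;
};

// Insert(key, value): build a small directory keyed by student ID.
std::unordered_map<int, Student> buildDirectory() {
    std::unordered_map<int, Student> directory;
    directory[123] = Student{"John Doe", 2.8, 123};
    directory[202] = Student{"Jane Doe", 3.4, 202};
    directory[401] = Student{"Some Guy", 3.5, 401};
    return directory;
}

// Search(key): find() returns end() when the key is absent.
const Student* findStudent(const std::unordered_map<int, Student>& directory,
                           int id) {
    auto it = directory.find(id);
    return it == directory.end() ? nullptr : &it->second;
}
```

Remove(key) is `directory.erase(id)`, which returns the number of elements removed (0 or 1). Iterating over the map visits every key, but in no guaranteed order.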