09 Buckets

TRANSCRIPT

  • Slide 1/40

    Comp 122, Spring 2004

    Keys into Buckets:

    Lower bounds, Linear-time sort, & Hashing

  • Slide 2/40

    Comparison-based Sorting

    Comparison sort

    Only comparisons of pairs of elements may be used to gain order information about a sequence.

    Hence, a lower bound on the number of comparisons will be a lower bound on the complexity of any comparison-based sorting algorithm.

    All our sorts so far have been comparison sorts. The best worst-case complexity so far is Θ(n lg n) (merge sort and heapsort).

    We prove a lower bound of Ω(n lg n) for any comparison sort: merge sort and heapsort are optimal.

    The idea is simple: there are n! possible output orderings, so we need a tree with n! leaves, and therefore height at least lg(n!) = Ω(n lg n).

  • Slide 3/40

    Decision Tree

    For insertion sort operating on three elements.

    [Decision tree: the root compares 1:2; its ≤-child compares 2:3 and its >-child compares 1:3; one more level of comparisons (1:3 and 2:3) leads to the six leaves ⟨1,2,3⟩, ⟨1,3,2⟩, ⟨3,1,2⟩, ⟨2,1,3⟩, ⟨2,3,1⟩, ⟨3,2,1⟩, taking the right branch on >.]

    Contains 3! = 6 leaves.

    Simply unroll all loops for all possible inputs.

    Node i:j means compare A[i] to A[j].

    Leaves show outputs; no two paths go to the same leaf!

  • Slide 4/40

    Decision Tree (Contd.)

    Execution of the sorting algorithm corresponds to tracing a path from root to leaf. The tree models all possible execution traces.

    At each internal node, a comparison a_i ≤ a_j is made. If a_i ≤ a_j, follow the left subtree; else follow the right subtree. View the tree as if the algorithm splits in two at each node, based on the information it has determined up to that point.

    When we come to a leaf, the ordering a_π(1) ≤ a_π(2) ≤ … ≤ a_π(n) is established.

    A correct sorting algorithm must be able to produce any permutation of its input.

    Hence, each of the n! permutations must appear at one or more of the leaves of the decision tree.
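    To make the decision-tree model concrete, the sketch below (Python, with names of my own choosing) runs insertion sort on every permutation of three elements and records the outcome of each comparison; all 3! = 6 inputs produce distinct comparison traces, i.e., distinct root-to-leaf paths.

    from itertools import permutations

    def insertion_sort_trace(a):
        """Sort a copy of `a`, recording the boolean outcome of every comparison."""
        a = list(a)
        trace = []
        for i in range(1, len(a)):
            j = i
            while j > 0:
                greater = a[j - 1] > a[j]    # the comparison a decision-tree node makes
                trace.append(greater)
                if not greater:
                    break
                a[j - 1], a[j] = a[j], a[j - 1]
                j -= 1
        return a, tuple(trace)

    traces = set()
    for p in permutations([1, 2, 3]):
        result, trace = insertion_sort_trace(p)
        assert result == [1, 2, 3]
        traces.add(trace)
    assert len(traces) == 6    # 3! leaves: no two inputs follow the same path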

  • Slide 5/40

    A Lower Bound for Worst Case

    The worst-case number of comparisons for a sorting algorithm is the length of the longest path from the root to any of the leaves in its decision tree, i.e., the height of the decision tree.

    A lower bound on the running time of any comparison sort is therefore given by a lower bound on the heights of all decision trees in which each permutation appears as a reachable leaf.

  • Slide 6/40

    Optimal sorting for three elements

    Any decision tree that sorts three elements has 3! = 6 leaves, and hence 5 internal nodes.

    [The same decision tree as on the earlier slide.]

    There must be a worst-case path of length ⌈lg 6⌉ = 3.

  • Slide 7/40

    A Lower Bound for Worst Case

    Proof:

    It suffices to bound the height of a decision tree.

    The number of leaves is at least n! (the number of possible outputs).

    A binary tree with at least n! leaves has at least n! − 1 internal nodes, and its height is at least lg(n!) = Ω(n lg n). QED

    Theorem 8.1:

    Any comparison sort algorithm requires Ω(n lg n) comparisons in the worst case.
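    The last step uses lg(n!) = Ω(n lg n); one standard justification (not spelled out on the slide) keeps only the larger half of the factors of n!:

        lg(n!) = Σ_{j=1}^{n} lg j ≥ Σ_{j=⌈n/2⌉}^{n} lg j ≥ (n/2) · lg(n/2) = Ω(n lg n)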

  • Slide 8/40

    Beating the lower bound

    We can beat the lower bound if we don't base our sort on comparisons:

    Counting sort, for keys in [0..k] with k = O(n).

    Radix sort, for keys with a fixed number of digits.

    Bucket sort, for random keys (uniformly distributed).

  • Slide 9/40

    Counting Sort

    Assumption: we sort integers in {0, 1, 2, …, k}.

    Input: A[1..n] ∈ {0, 1, 2, …, k}ⁿ. Array A and the values n and k are given.

    Output: B[1..n], sorted. Assume B is already allocated and given as a parameter.

    Auxiliary storage: C[0..k] of counts.

    Runs in linear time if k = O(n).

  • Slide 10/40

    Counting-Sort (A, B, k)

    CountingSort(A, B, k)
    1.  for i ← 0 to k
    2.      do C[i] ← 0                      ▹ O(k): init counts
    3.  for j ← 1 to length[A]
    4.      do C[A[j]] ← C[A[j]] + 1         ▹ O(n): count
    5.  for i ← 1 to k
    6.      do C[i] ← C[i] + C[i − 1]        ▹ O(k): prefix sum
    7.  for j ← length[A] downto 1
    8.      do B[C[A[j]]] ← A[j]
    9.         C[A[j]] ← C[A[j]] − 1         ▹ O(n): reorder
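    As a cross-check, here is the same algorithm as runnable Python (0-indexed; a sketch of the slide's pseudocode, not code from the lecture):

    def counting_sort(A, k):
        """Stable O(n + k) sort of a list A of integers in {0, 1, ..., k}."""
        n = len(A)
        C = [0] * (k + 1)              # O(k): init counts
        for x in A:                    # O(n): count occurrences
            C[x] += 1
        for i in range(1, k + 1):      # O(k): prefix sums; C[x] = # of keys <= x
            C[i] += C[i - 1]
        B = [0] * n
        for x in reversed(A):          # O(n): reorder; the reversed scan keeps it stable
            C[x] -= 1
            B[C[x]] = x
        return B

    assert counting_sort([2, 5, 3, 0, 2, 3, 0, 3], 5) == [0, 0, 2, 2, 3, 3, 3, 5]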

  • Slide 11/40

    Radix Sort

    Used to sort on card-sorters: do a stable sort on each column, one column at a time.

    The human operator is part of the algorithm!

    Key idea: sort on the least significant digit first, then on the remaining digits in sequential order. The sorting method used to sort each digit must be stable.

    If we start with the most significant digit, we'll need extra storage.

  • Slide 12/40

    An Example

    Input    After sorting    After sorting      After sorting
             on LSD           on middle digit    on MSD

    392      631              928                356
    356      392              631                392
    446      532              532                446
    928      495              446                495
    631      356              356                532
    532      446              392                631
    495      928              495                928

  • Slide 13/40

    Radix-Sort(A, d)

    Correctness of Radix Sort

    By induction on the number of digits sorted:

    Assume that radix sort works for d − 1 digits; show that it works for d digits.

    A radix sort of d digits ≡ a radix sort of the low-order d − 1 digits followed by a (stable) sort on digit d.

    RadixSort(A, d)
    1. for i ← 1 to d
    2.     do use a stable sort to sort array A on digit i
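    A runnable Python sketch of the same procedure, using a stable counting sort on each digit (base 10 is my assumption; any fixed base works):

    def radix_sort(A, d):
        """Sort non-negative integers of at most d decimal digits."""
        for i in range(d):                       # digit 0 = least significant
            A = stable_sort_on_digit(A, i)
        return A

    def stable_sort_on_digit(A, i):
        """Stable counting sort of A on decimal digit i."""
        digit = lambda x: (x // 10 ** i) % 10
        C = [0] * 10
        for x in A:
            C[digit(x)] += 1
        for b in range(1, 10):                   # prefix sums
            C[b] += C[b - 1]
        B = [0] * len(A)
        for x in reversed(A):                    # reversed scan preserves stability
            C[digit(x)] -= 1
            B[C[digit(x)]] = x
        return B

    assert radix_sort([392, 356, 446, 928, 631, 532, 495], 3) == [356, 392, 446, 495, 532, 631, 928]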

  • Slide 14/40

    Algorithm Analysis

    Each pass over n d-digit numbers takes time Θ(n + k), assuming counting sort is used for each pass.

    There are d passes, so the total time for radix sort is Θ(d(n + k)).

    When d is a constant and k = O(n), radix sort runs in linear time.

    Radix sort, if it uses counting sort as the intermediate stable sort, does not sort in place. If primary memory storage is an issue, quicksort or another in-place sorting method may be preferable.

  • Slide 15/40

    Bucket Sort

    Assumes the input is generated by a random process that distributes the elements uniformly over [0, 1).

    Idea:

    Divide [0, 1) into n equal-sized buckets.

    Distribute the n input values into the buckets.

    Sort each bucket.

    Then go through the buckets in order, listing the elements in each one.

  • Slide 16/40

    An Example

    [Figure not reproduced in the transcript.]

  • Slide 17/40

    Bucket-Sort (A)

    BucketSort(A)
    1. n ← length[A]
    2. for i ← 1 to n
    3.     do insert A[i] into list B[⌊n · A[i]⌋]
    4. for i ← 0 to n − 1
    5.     do sort list B[i] with insertion sort
    6. concatenate the lists B[0], B[1], …, B[n − 1] together in order
    7. return the concatenated lists

    Input: A[1..n], where 0 ≤ A[i] < 1 for all i.

    Auxiliary array: B[0..n − 1] of linked lists, each list initially empty.
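    The pseudocode translates almost line for line into Python; a minimal sketch (with insertion sort per bucket, as in line 5):

    import math

    def bucket_sort(A):
        """Sort a list A of floats with 0 <= A[i] < 1 (uniform inputs assumed)."""
        n = len(A)
        B = [[] for _ in range(n)]          # n empty buckets
        for x in A:
            B[math.floor(n * x)].append(x)  # bucket i covers [i/n, (i+1)/n)
        out = []
        for bucket in B:
            insertion_sort(bucket)          # cheap when buckets stay small
            out.extend(bucket)              # concatenate the buckets in order
        return out

    def insertion_sort(a):
        for i in range(1, len(a)):
            x, j = a[i], i - 1
            while j >= 0 and a[j] > x:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = x

    data = [0.78, 0.17, 0.39, 0.26, 0.72, 0.94, 0.21, 0.12, 0.23, 0.68]
    assert bucket_sort(data) == sorted(data)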

  • Slide 18/40

    Analysis

    Relies on no bucket getting too many values.

    All lines except the insertion sorting in line 5 take O(n) altogether.

    Intuitively, if each bucket gets a constant number of elements, it takes O(1) time to sort each bucket, hence O(n) sorting time for all buckets.

    We expect each bucket to have few elements, since the average is 1 element per bucket. But we need to do a careful analysis.

  • Slide 19/40

    Analysis (Contd.)

    Let the random variable n_i = number of elements placed in bucket B[i].

    Insertion sort runs in quadratic time. Hence, the time for bucket sort is:

        T(n) = Θ(n) + Σ_{i=0}^{n−1} O(n_i²)    (8.1)

    Taking expectations of both sides and using linearity of expectation, we have:

        E[T(n)] = Θ(n) + Σ_{i=0}^{n−1} E[O(n_i²)]    (by linearity of expectation)

                = Θ(n) + Σ_{i=0}^{n−1} O(E[n_i²])    (since E[aX] = a·E[X])

  • Slide 20/40

    Analysis (Contd.)

    Claim: E[n_i²] = 2 − 1/n.

    Proof:

    Define indicator random variables X_ij = I{A[j] falls in bucket i}.

    Pr{A[j] falls in bucket i} = 1/n.

        n_i = Σ_{j=1}^{n} X_ij    (8.2)

  • Slide 21/40

    Analysis (Contd.)

        E[n_i²] = E[ ( Σ_{j=1}^{n} X_ij )² ]

                = E[ Σ_{j=1}^{n} Σ_{k=1}^{n} X_ij · X_ik ]

                = E[ Σ_{j=1}^{n} X_ij² + Σ_{j=1}^{n} Σ_{1≤k≤n, k≠j} X_ij · X_ik ]

                = Σ_{j=1}^{n} E[X_ij²] + Σ_{j=1}^{n} Σ_{1≤k≤n, k≠j} E[X_ij · X_ik]    (8.3)

    by linearity of expectation.

  • Slide 22/40

    Analysis (Contd.)

        E[X_ij²] = 0² · Pr{A[j] doesn't fall in bucket i} + 1² · Pr{A[j] falls in bucket i}

                 = 0 · (1 − 1/n) + 1 · (1/n)

                 = 1/n

    For j ≠ k: since X_ij and X_ik are independent random variables,

        E[X_ij · X_ik] = E[X_ij] · E[X_ik] = (1/n) · (1/n) = 1/n²

  • Slide 23/40

    Analysis (Contd.)

    (8.3) hence becomes:

        E[n_i²] = Σ_{j=1}^{n} (1/n) + Σ_{j=1}^{n} Σ_{1≤k≤n, k≠j} (1/n²)

                = n · (1/n) + n(n − 1) · (1/n²)

                = 1 + (n − 1)/n

                = 2 − 1/n

    Substituting this into (8.1), we have:

        E[T(n)] = Θ(n) + Σ_{i=0}^{n−1} O(2 − 1/n) = Θ(n) + O(n) = Θ(n)
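    As a sanity check (my addition, not part of the lecture), a small simulation agrees with the claim E[n_i²] = 2 − 1/n:

    import random

    def mean_ni_squared(n, trials=20000):
        """Estimate E[n_0²]: drop n uniform keys into n buckets, square bucket 0's count."""
        total = 0
        for _ in range(trials):
            counts = [0] * n
            for _ in range(n):
                counts[random.randrange(n)] += 1   # bucket index of a uniform key in [0, 1)
            total += counts[0] ** 2
        return total / trials

    n = 10
    print(mean_ni_squared(n), 2 - 1 / n)   # both come out ≈ 1.9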

  • Slide 24/40

    Comp 122, Spring 2004

    Hash Tables

  • Slide 25/40

    Dictionary

    Dictionary:

    Dynamic-set data structure for storing items indexed using keys.

    Supports the operations Insert, Search, and Delete.

    Applications:

    Symbol table of a compiler.

    Memory-management tables in operating systems.

    Large-scale distributed systems.

    Hash Tables: an effective way of implementing dictionaries.

    Generalization of ordinary arrays.

  • Slide 26/40

    Direct-address Tables

    Direct-address Tables are ordinary arrays.

    Facilitate direct addressing.

    The element whose key is k is obtained by indexing into the kth position of the array.

    Applicable when we can afford to allocate an array with one position for every possible key,

    i.e., when the universe of keys U is small.

    Dictionary operations can be implemented to take

    O(1) time.

    Details in Sec. 11.1.
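    A direct-address table is little more than an array indexed by key; a minimal Python sketch (naming is mine) for keys drawn from {0, …, m−1}:

    class DirectAddressTable:
        """Dictionary for keys from the small universe {0, ..., m-1}."""
        def __init__(self, m):
            self.T = [None] * m           # one slot per possible key

        def insert(self, key, value):     # O(1)
            self.T[key] = value

        def search(self, key):            # O(1)
            return self.T[key]

        def delete(self, key):            # O(1)
            self.T[key] = None

    t = DirectAddressTable(100)
    t.insert(42, "answer")
    assert t.search(42) == "answer"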

  • Slide 27/40

    Hash Tables

    Notation:

    U: universe of all possible keys.

    K: set of keys actually stored in the dictionary; |K| = n.

    When U is very large, arrays are not practical: typically |K| ≪ |U|, so an array with one slot per possible key would waste most of its space.

  • Slide 28/40

    Hashing

    Hash function h: a mapping from U to the slots of a hash table T[0..m−1]:

        h : U → {0, 1, …, m−1}

    With arrays, key k maps to slot A[k].

    With hash tables, key k maps, or "hashes", to slot T[h(k)].

    h(k) is the hash value of key k.
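    The slide does not commit to a particular h; for concreteness, one common choice (an assumption here) is the division method, h(k) = k mod m:

    m = 13                      # number of slots; typically a prime not too close to a power of 2
    h = lambda k: k % m         # division-method hash: h(k) is in {0, ..., m-1}

    assert h(27) == 1 and h(40) == 1    # 27 and 40 collide: h(27) = h(40)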

  • Slide 29/40

    Hashing

    [Figure: keys k1, …, k5 from K ⊂ U map into slots 0..m−1 of table T; h(k1), h(k3), h(k4) are distinct slots, while h(k2) = h(k5): a collision.]

  • Slide 30/40

    Issues with Hashing

    Multiple keys can hash to the same slot ⇒ collisions are possible.

    Design hash functions such that collisions are minimized. But avoiding collisions altogether is impossible ⇒ design collision-resolution techniques.

    Search will cost Θ(n) time in the worst case. However, all operations can be made to have an expected complexity of Θ(1).

  • Slide 31/40

    Methods of Resolution

    Chaining:

    Store all elements that hash to the same slot in a linked list.

    Store a pointer to the head of the linked list in the hash table slot.

    Open Addressing:

    All elements are stored in the hash table itself.

    When collisions occur, use a systematic (consistent) procedure to store elements in free slots of the table.

    [Figure: a small chained table, with lists such as k1 → k4 and k5 → k6 hanging off their slots.]

  • Slide 32/40

    Collision Resolution by Chaining

    [Figure: keys k1, …, k8 from K ⊂ U hash into T[0..m−1] with h(k1) = h(k4), h(k2) = h(k5) = h(k6), and h(k3) = h(k7); colliding keys are chained off their shared slot, and k8 has a slot to itself.]

  • Slide 33/40

    Collision Resolution by Chaining

    [Figure: the same keys, now showing the chains stored in the table: one slot's list holds k1 and k4, another holds k2, k5, and k6, another holds k7 and k3, and k8 occupies its own slot.]

  • Slide 34/40

    Hashing with Chaining

    Dictionary Operations:

    Chained-Hash-Insert (T, x)

    Insert x at the head of list T[h(key[x])].

    Worst-case complexity: O(1).

    Chained-Hash-Delete (T, x)

    Delete x from the list T[h(key[x])].

    Worst-case complexity: proportional to the length of the list with singly linked lists; O(1) with doubly linked lists.

    Chained-Hash-Search (T, k)

    Search for an element with key k in list T[h(k)].

    Worst-case complexity: proportional to the length of the list.
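    A compact runnable sketch of the three operations in Python, with a Python list standing in for each linked list (so Delete costs O(chain length), as with singly linked lists):

    class ChainedHashTable:
        def __init__(self, m):
            self.m = m
            self.T = [[] for _ in range(m)]          # one chain per slot

        def _h(self, key):                           # division-method hash (an assumption)
            return hash(key) % self.m

        def insert(self, key, value):                # O(1): prepend to the chain
            self.T[self._h(key)].insert(0, (key, value))

        def search(self, key):                       # O(length of chain)
            for k, v in self.T[self._h(key)]:
                if k == key:
                    return v
            return None

        def delete(self, key):                       # O(length of chain) in this representation
            chain = self.T[self._h(key)]
            self.T[self._h(key)] = [(k, v) for (k, v) in chain if k != key]

    t = ChainedHashTable(8)
    t.insert("a", 1)
    t.insert("b", 2)
    assert t.search("a") == 1
    t.delete("a")
    assert t.search("a") is None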


  • Slide 35/40

    Analysis of Chained-Hash-Search

    Load factor α = n/m = average number of keys per slot, where m = number of slots and n = number of elements stored in the hash table.

    Worst-case complexity: Θ(n) + time to compute h(k).

    The average case depends on how well h distributes the keys among the m slots. Assume:

    Simple uniform hashing: any key is equally likely to hash into any of the m slots, independent of where any other key hashes to.

    O(1) time to compute h(k).

    Then the time to search for an element with key k is Θ(|T[h(k)]|), and the expected length of a linked list is the load factor α = n/m.

  • Slide 36/40

    Expected Cost of an Unsuccessful Search

    Theorem: An unsuccessful search takes expected time Θ(1 + α).

    Proof:

    Any key not already in the table is equally likely to hash to any of the m slots.

    To search unsuccessfully for any key k, we need to search to the end of the list T[h(k)], whose expected length is α.

    Adding the time to compute the hash function, the total time required is Θ(1 + α).


  • Slide 37/40

    Expected Cost of a Successful Search

    Theorem: A successful search takes expected time Θ(1 + α).

    Proof:

    The probability that a list is searched is proportional to the number of elements it contains.

    Assume that the element being searched for is equally likely to be any of the n elements in the table.

    The number of elements examined during a successful search for an element x is 1 more than the number of elements that appear before x in x's list; these are the elements inserted after x was inserted.

    Goal: find the average, over the n elements x in the table, of how many elements were inserted into x's list after x was inserted.


  • Slide 38/40

    Expected Cost of a Successful Search

    Proof (contd.):

    Let x_i be the ith element inserted into the table, and let k_i = key[x_i].

    Define indicator random variables X_ij = I{h(k_i) = h(k_j)}, for all i, j.

    Simple uniform hashing ⇒ Pr{h(k_i) = h(k_j)} = 1/m ⇒ E[X_ij] = 1/m.

    The expected number of elements examined in a successful search is:

        E[ (1/n) · Σ_{i=1}^{n} ( 1 + Σ_{j=i+1}^{n} X_ij ) ]

    where Σ_{j=i+1}^{n} X_ij is the number of elements inserted after x_i into the same slot as x_i.


  • Slide 39/40

    Proof (Contd.)

        E[ (1/n) · Σ_{i=1}^{n} ( 1 + Σ_{j=i+1}^{n} X_ij ) ]

          = (1/n) · Σ_{i=1}^{n} ( 1 + Σ_{j=i+1}^{n} E[X_ij] )    (linearity of expectation)

          = (1/n) · Σ_{i=1}^{n} ( 1 + Σ_{j=i+1}^{n} 1/m )

          = 1 + (1/(n·m)) · Σ_{i=1}^{n} (n − i)

          = 1 + (1/(n·m)) · ( n² − n(n + 1)/2 )

          = 1 + (n − 1)/(2m)

          = 1 + α/2 − α/(2n)

    Expected total time for a successful search = time to compute the hash function + time to search = O(2 + α/2 − α/(2n)) = O(1 + α).


  • Slide 40/40

    Expected Cost: Interpretation

    If n = O(m), then α = n/m = O(m)/m = O(1).

    ⇒ Searching takes constant time on average.

    Insertion is O(1) in the worst case.

    Deletion takes O(1) worst-case time when the lists are doubly linked.

    Hence, all dictionary operations take O(1) time on average with hash tables with chaining.