09 Buckets

TRANSCRIPT

  • Slide 1/40

    Comp 122, Spring 2004

    Keys into Buckets:

    Lower bounds, Linear-time sort, & Hashing

  • Slide 2/40

    Comparison-based Sorting

    Comparison sort

    Only comparisons of pairs of elements may be used to gain order information about a sequence.

    Hence, a lower bound on the number of comparisons will be a lower bound on the complexity of any comparison-based sorting algorithm.

    All our sorts so far have been comparison sorts. The best worst-case complexity so far is Θ(n lg n) (merge sort and heapsort).

    We prove a lower bound of Ω(n lg n) for any comparison sort: merge sort and heapsort are optimal.

    The idea is simple: there are n! possible output orderings, so we need a tree with n! leaves, and therefore height at least lg(n!) = Ω(n lg n).

  • Slide 3/40

    Decision Tree

    For insertion sort operating on three elements.

    [Decision tree: the root compares 1:2; its ≤-child compares 2:3 and its >-child compares 1:3; one more level of comparisons (1:3 and 2:3) leads to the six leaves ⟨1,2,3⟩, ⟨1,3,2⟩, ⟨3,1,2⟩, ⟨2,1,3⟩, ⟨2,3,1⟩, ⟨3,2,1⟩, taking the right branch on >.]

    Contains 3! = 6 leaves.

    Simply unroll all loops for all possible inputs.

    Node i:j means compare A[i] to A[j].

    Leaves show outputs; no two paths go to the same leaf!

  • Slide 4/40

    Decision Tree (Contd.)

    Execution of the sorting algorithm corresponds to tracing a path from root to leaf. The tree models all possible execution traces.

    At each internal node, a comparison a_i ≤ a_j is made. If a_i ≤ a_j, follow the left subtree; else follow the right subtree. View the tree as if the algorithm splits in two at each node, based on the information it has determined up to that point.

    When we come to a leaf, the ordering a_π(1) ≤ a_π(2) ≤ … ≤ a_π(n) is established.

    A correct sorting algorithm must be able to produce any permutation of its input.

    Hence, each of the n! permutations must appear at one or more of the leaves of the decision tree.
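    To make the decision-tree model concrete, the sketch below (Python, with names of my own choosing) runs insertion sort on every permutation of three elements and records the outcome of each comparison; all 3! = 6 inputs produce distinct comparison traces, i.e., distinct root-to-leaf paths.

    from itertools import permutations

    def insertion_sort_trace(a):
        """Sort a copy of `a`, recording the boolean outcome of every comparison."""
        a = list(a)
        trace = []
        for i in range(1, len(a)):
            j = i
            while j > 0:
                greater = a[j - 1] > a[j]    # the comparison a decision-tree node makes
                trace.append(greater)
                if not greater:
                    break
                a[j - 1], a[j] = a[j], a[j - 1]
                j -= 1
        return a, tuple(trace)

    traces = set()
    for p in permutations([1, 2, 3]):
        result, trace = insertion_sort_trace(p)
        assert result == [1, 2, 3]
        traces.add(trace)
    assert len(traces) == 6    # 3! leaves: no two inputs follow the same path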

  • Slide 5/40

    A Lower Bound for Worst Case

    The worst-case number of comparisons for a sorting algorithm is the length of the longest path from the root to any of the leaves in its decision tree, i.e., the height of the decision tree.

    A lower bound on the running time of any comparison sort is therefore given by a lower bound on the heights of all decision trees in which each permutation appears as a reachable leaf.

  • Slide 6/40

    Optimal sorting for three elements

    Any decision tree that sorts three elements has 3! = 6 leaves, and hence 5 internal nodes.

    [The same decision tree as on the earlier slide.]

    There must be a worst-case path of length ⌈lg 6⌉ = 3.

  • Slide 7/40

    A Lower Bound for Worst Case

    Proof:

    It suffices to bound the height of a decision tree.

    The number of leaves is at least n! (the number of possible outputs).

    A binary tree with at least n! leaves has at least n! − 1 internal nodes, and its height is at least lg(n!) = Ω(n lg n). QED

    Theorem 8.1:

    Any comparison sort algorithm requires Ω(n lg n) comparisons in the worst case.
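    The last step uses lg(n!) = Ω(n lg n); one standard justification (not spelled out on the slide) keeps only the larger half of the factors of n!:

        lg(n!) = Σ_{j=1}^{n} lg j ≥ Σ_{j=⌈n/2⌉}^{n} lg j ≥ (n/2) · lg(n/2) = Ω(n lg n)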

  • Slide 8/40

    Beating the lower bound

    We can beat the lower bound if we don't base our sort on comparisons:

    Counting sort, for keys in [0..k] with k = O(n).

    Radix sort, for keys with a fixed number of digits.

    Bucket sort, for random keys (uniformly distributed).

  • Slide 9/40

    Counting Sort

    Assumption: we sort integers in {0, 1, 2, …, k}.

    Input: A[1..n] ∈ {0, 1, 2, …, k}ⁿ. Array A and the values n and k are given.

    Output: B[1..n], sorted. Assume B is already allocated and given as a parameter.

    Auxiliary storage: C[0..k] of counts.

    Runs in linear time if k = O(n).

  • Slide 10/40

    Counting-Sort (A, B, k)

    CountingSort(A, B, k)
    1.  for i ← 0 to k
    2.      do C[i] ← 0                      ▹ O(k): init counts
    3.  for j ← 1 to length[A]
    4.      do C[A[j]] ← C[A[j]] + 1         ▹ O(n): count
    5.  for i ← 1 to k
    6.      do C[i] ← C[i] + C[i − 1]        ▹ O(k): prefix sum
    7.  for j ← length[A] downto 1
    8.      do B[C[A[j]]] ← A[j]
    9.         C[A[j]] ← C[A[j]] − 1         ▹ O(n): reorder
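    As a cross-check, here is the same algorithm as runnable Python (0-indexed; a sketch of the slide's pseudocode, not code from the lecture):

    def counting_sort(A, k):
        """Stable O(n + k) sort of a list A of integers in {0, 1, ..., k}."""
        n = len(A)
        C = [0] * (k + 1)              # O(k): init counts
        for x in A:                    # O(n): count occurrences
            C[x] += 1
        for i in range(1, k + 1):      # O(k): prefix sums; C[x] = # of keys <= x
            C[i] += C[i - 1]
        B = [0] * n
        for x in reversed(A):          # O(n): reorder; the reversed scan keeps it stable
            C[x] -= 1
            B[C[x]] = x
        return B

    assert counting_sort([2, 5, 3, 0, 2, 3, 0, 3], 5) == [0, 0, 2, 2, 3, 3, 3, 5]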

  • Slide 11/40

    Radix Sort

    Used to sort on card-sorters: do a stable sort on each column, one column at a time.

    The human operator is part of the algorithm!

    Key idea: sort on the least significant digit first, then on the remaining digits in sequential order. The sorting method used to sort each digit must be stable.

    If we start with the most significant digit, we'll need extra storage.

  • Slide 12/40

    An Example

    Input    After sorting    After sorting      After sorting
             on LSD           on middle digit    on MSD

    392      631              928                356
    356      392              631                392
    446      532              532                446
    928      495              446                495
    631      356              356                532
    532      446              392                631
    495      928              495                928

  • Slide 13/40

    Radix-Sort(A, d)

    Correctness of Radix Sort

    By induction on the number of digits sorted:

    Assume that radix sort works for d − 1 digits; show that it works for d digits.

    A radix sort of d digits ≡ a radix sort of the low-order d − 1 digits followed by a (stable) sort on digit d.

    RadixSort(A, d)
    1. for i ← 1 to d
    2.     do use a stable sort to sort array A on digit i
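    A runnable Python sketch of the same procedure, using a stable counting sort on each digit (base 10 is my assumption; any fixed base works):

    def radix_sort(A, d):
        """Sort non-negative integers of at most d decimal digits."""
        for i in range(d):                       # digit 0 = least significant
            A = stable_sort_on_digit(A, i)
        return A

    def stable_sort_on_digit(A, i):
        """Stable counting sort of A on decimal digit i."""
        digit = lambda x: (x // 10 ** i) % 10
        C = [0] * 10
        for x in A:
            C[digit(x)] += 1
        for b in range(1, 10):                   # prefix sums
            C[b] += C[b - 1]
        B = [0] * len(A)
        for x in reversed(A):                    # reversed scan preserves stability
            C[digit(x)] -= 1
            B[C[digit(x)]] = x
        return B

    assert radix_sort([392, 356, 446, 928, 631, 532, 495], 3) == [356, 392, 446, 495, 532, 631, 928]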

  • Slide 14/40

    Algorithm Analysis

    Each pass over n d-digit numbers takes time Θ(n + k), assuming counting sort is used for each pass.

    There are d passes, so the total time for radix sort is Θ(d(n + k)).

    When d is a constant and k = O(n), radix sort runs in linear time.

    Radix sort, if it uses counting sort as the intermediate stable sort, does not sort in place. If primary memory storage is an issue, quicksort or another in-place sorting method may be preferable.

  • Slide 15/40

    Bucket Sort

    Assumes the input is generated by a random process that distributes the elements uniformly over [0, 1).

    Idea:

    Divide [0, 1) into n equal-sized buckets.

    Distribute the n input values into the buckets.

    Sort each bucket.

    Then go through the buckets in order, listing the elements in each one.

  • Slide 16/40

    An Example

    [Figure not reproduced in the transcript.]

  • Slide 17/40

    Bucket-Sort (A)

    BucketSort(A)
    1. n ← length[A]
    2. for i ← 1 to n
    3.     do insert A[i] into list B[⌊n · A[i]⌋]
    4. for i ← 0 to n − 1
    5.     do sort list B[i] with insertion sort
    6. concatenate the lists B[0], B[1], …, B[n − 1] together in order
    7. return the concatenated lists

    Input: A[1..n], where 0 ≤ A[i] < 1 for all i.

    Auxiliary array: B[0..n − 1] of linked lists, each list initially empty.
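    The pseudocode translates almost line for line into Python; a minimal sketch (with insertion sort per bucket, as in line 5):

    import math

    def bucket_sort(A):
        """Sort a list A of floats with 0 <= A[i] < 1 (uniform inputs assumed)."""
        n = len(A)
        B = [[] for _ in range(n)]          # n empty buckets
        for x in A:
            B[math.floor(n * x)].append(x)  # bucket i covers [i/n, (i+1)/n)
        out = []
        for bucket in B:
            insertion_sort(bucket)          # cheap when buckets stay small
            out.extend(bucket)              # concatenate the buckets in order
        return out

    def insertion_sort(a):
        for i in range(1, len(a)):
            x, j = a[i], i - 1
            while j >= 0 and a[j] > x:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = x

    data = [0.78, 0.17, 0.39, 0.26, 0.72, 0.94, 0.21, 0.12, 0.23, 0.68]
    assert bucket_sort(data) == sorted(data)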

  • Slide 18/40

    Analysis

    Relies on no bucket getting too many values.

    All lines except the insertion sorting in line 5 take O(n) altogether.

    Intuitively, if each bucket gets a constant number of elements, it takes O(1) time to sort each bucket, hence O(n) sorting time for all buckets.

    We expect each bucket to have few elements, since the average is 1 element per bucket. But we need to do a careful analysis.

  • Slide 19/40

    Analysis (Contd.)

    Let the random variable n_i = number of elements placed in bucket B[i].

    Insertion sort runs in quadratic time. Hence, the time for bucket sort is:

        T(n) = Θ(n) + Σ_{i=0}^{n−1} O(n_i²)    (8.1)

    Taking expectations of both sides and using linearity of expectation, we have:

        E[T(n)] = Θ(n) + Σ_{i=0}^{n−1} E[O(n_i²)]    (by linearity of expectation)

                = Θ(n) + Σ_{i=0}^{n−1} O(E[n_i²])    (since E[aX] = a·E[X])

  • Slide 20/40

    Analysis (Contd.)

    Claim: E[n_i²] = 2 − 1/n.

    Proof:

    Define indicator random variables X_ij = I{A[j] falls in bucket i}.

    Pr{A[j] falls in bucket i} = 1/n.

        n_i = Σ_{j=1}^{n} X_ij    (8.2)

  • Slide 21/40

    Analysis (Contd.)

        E[n_i²] = E[ ( Σ_{j=1}^{n} X_ij )² ]

                = E[ Σ_{j=1}^{n} Σ_{k=1}^{n} X_ij · X_ik ]

                = E[ Σ_{j=1}^{n} X_ij² + Σ_{j=1}^{n} Σ_{1≤k≤n, k≠j} X_ij · X_ik ]

                = Σ_{j=1}^{n} E[X_ij²] + Σ_{j=1}^{n} Σ_{1≤k≤n, k≠j} E[X_ij · X_ik]    (8.3)

    by linearity of expectation.

  • Slide 22/40

    Analysis (Contd.)

        E[X_ij²] = 0² · Pr{A[j] doesn't fall in bucket i} + 1² · Pr{A[j] falls in bucket i}

                 = 0 · (1 − 1/n) + 1 · (1/n)

                 = 1/n

    For j ≠ k: since X_ij and X_ik are independent random variables,

        E[X_ij · X_ik] = E[X_ij] · E[X_ik] = (1/n) · (1/n) = 1/n²

  • Slide 23/40

    Analysis (Contd.)

    (8.3) hence becomes:

        E[n_i²] = Σ_{j=1}^{n} (1/n) + Σ_{j=1}^{n} Σ_{1≤k≤n, k≠j} (1/n²)

                = n · (1/n) + n(n − 1) · (1/n²)

                = 1 + (n − 1)/n

                = 2 − 1/n

    Substituting this into (8.1), we have:

        E[T(n)] = Θ(n) + Σ_{i=0}^{n−1} O(2 − 1/n) = Θ(n) + O(n) = Θ(n)
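    As a sanity check (my addition, not part of the lecture), a small simulation agrees with the claim E[n_i²] = 2 − 1/n:

    import random

    def mean_ni_squared(n, trials=20000):
        """Estimate E[n_0²]: drop n uniform keys into n buckets, square bucket 0's count."""
        total = 0
        for _ in range(trials):
            counts = [0] * n
            for _ in range(n):
                counts[random.randrange(n)] += 1   # bucket index of a uniform key in [0, 1)
            total += counts[0] ** 2
        return total / trials

    n = 10
    print(mean_ni_squared(n), 2 - 1 / n)   # both come out ≈ 1.9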

  • Slide 24/40

    Comp 122, Spring 2004

    Hash Tables

  • Slide 25/40

    Dictionary

    Dictionary:

    Dynamic-set data structure for storing items indexed using keys.

    Supports the operations Insert, Search, and Delete.

    Applications:

    Symbol table of a compiler.

    Memory-management tables in operating systems.

    Large-scale distributed systems.

    Hash Tables: an effective way of implementing dictionaries.

    Generalization of ordinary arrays.

  • Slide 26/40

    Direct-address Tables

    Direct-address Tables are ordinary arrays.

    Facilitate direct addressing.

    The element whose key is k is obtained by indexing into the kth position of the array.

    Applicable when we can afford to allocate an array with one position for every possible key,

    i.e., when the universe of keys U is small.

    Dictionary operations can be implemented to take

    O(1) time.

    Details in Sec. 11.1.
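    A direct-address table is little more than an array indexed by key; a minimal Python sketch (naming is mine) for keys drawn from {0, …, m−1}:

    class DirectAddressTable:
        """Dictionary for keys from the small universe {0, ..., m-1}."""
        def __init__(self, m):
            self.T = [None] * m           # one slot per possible key

        def insert(self, key, value):     # O(1)
            self.T[key] = value

        def search(self, key):            # O(1)
            return self.T[key]

        def delete(self, key):            # O(1)
            self.T[key] = None

    t = DirectAddressTable(100)
    t.insert(42, "answer")
    assert t.search(42) == "answer"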

  • Slide 27/40

    Hash Tables

    Notation:

    U: universe of all possible keys.

    K: set of keys actually stored in the dictionary; |K| = n.

    When U is very large, arrays are not practical: typically |K| ≪ |U|, so an array with one slot per possible key would waste most of its space.

  • Slide 28/40

    Hashing

    Hash function h: a mapping from U to the slots of a hash table T[0..m−1]:

        h : U → {0, 1, …, m−1}

    With arrays, key k maps to slot A[k].

    With hash tables, key k maps, or "hashes", to slot T[h(k)].

    h(k) is the hash value of key k.
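    The slide does not commit to a particular h; for concreteness, one common choice (an assumption here) is the division method, h(k) = k mod m:

    m = 13                      # number of slots; typically a prime not too close to a power of 2
    h = lambda k: k % m         # division-method hash: h(k) is in {0, ..., m-1}

    assert h(27) == 1 and h(40) == 1    # 27 and 40 collide: h(27) = h(40)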

  • Slide 29/40

    Hashing

    [Figure: keys k1, …, k5 from K ⊂ U map into slots 0..m−1 of table T; h(k1), h(k3), h(k4) are distinct slots, while h(k2) = h(k5): a collision.]

  • Slide 30/40

    Issues with Hashing

    Multiple keys can hash to the same slot ⇒ collisions are possible.

    Design hash functions such that collisions are minimized. But avoiding collisions altogether is impossible ⇒ design collision-resolution techniques.

    Search will cost Θ(n) time in the worst case. However, all operations can be made to have an expected complexity of Θ(1).

  • Slide 31/40

    Methods of Resolution

    Chaining:

    Store all elements that hash to the same slot in a linked list.

    Store a pointer to the head of the linked list in the hash table slot.

    Open Addressing:

    All elements are stored in the hash table itself.

    When collisions occur, use a systematic (consistent) procedure to store elements in free slots of the table.

    [Figure: a small chained table, with lists such as k1 → k4 and k5 → k6 hanging off their slots.]

  • Slide 32/40

    Collision Resolution by Chaining

    [Figure: keys k1, …, k8 from K ⊂ U hash into T[0..m−1] with h(k1) = h(k4), h(k2) = h(k5) = h(k6), and h(k3) = h(k7); colliding keys are chained off their shared slot, and k8 has a slot to itself.]

  • Slide 33/40

    Collision Resolution by Chaining

    [Figure: the same keys, now showing the chains stored in the table: one slot's list holds k1 and k4, another holds k2, k5, and k6, another holds k7 and k3, and k8 occupies its own slot.]

  • Slide 34/40

    Hashing with Chaining

    Dictionary Operations:

    Chained-Hash-Insert (T, x)

    Insert x at the head of list T[h(key[x])].

    Worst-case complexity: O(1).

    Chained-Hash-Delete (T, x)

    Delete x from the list T[h(key[x])].

    Worst-case complexity: proportional to the length of the list with singly linked lists; O(1) with doubly linked lists.

    Chained-Hash-Search (T, k)

    Search for an element with key k in list T[h(k)].

    Worst-case complexity: proportional to the length of the list.
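    A compact runnable sketch of the three operations in Python, with a Python list standing in for each linked list (so Delete costs O(chain length), as with singly linked lists):

    class ChainedHashTable:
        def __init__(self, m):
            self.m = m
            self.T = [[] for _ in range(m)]          # one chain per slot

        def _h(self, key):                           # division-method hash (an assumption)
            return hash(key) % self.m

        def insert(self, key, value):                # O(1): prepend to the chain
            self.T[self._h(key)].insert(0, (key, value))

        def search(self, key):                       # O(length of chain)
            for k, v in self.T[self._h(key)]:
                if k == key:
                    return v
            return None

        def delete(self, key):                       # O(length of chain) in this representation
            chain = self.T[self._h(key)]
            self.T[self._h(key)] = [(k, v) for (k, v) in chain if k != key]

    t = ChainedHashTable(8)
    t.insert("a", 1)
    t.insert("b", 2)
    assert t.search("a") == 1
    t.delete("a")
    assert t.search("a") is None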


  • Slide 35/40

    Analysis of Chained-Hash-Search

    Load factor α = n/m = average number of keys per slot, where m = number of slots and n = number of elements stored in the hash table.

    Worst-case complexity: Θ(n) + time to compute h(k).

    The average case depends on how well h distributes the keys among the m slots. Assume:

    Simple uniform hashing: any key is equally likely to hash into any of the m slots, independent of where any other key hashes to.

    O(1) time to compute h(k).

    Then the time to search for an element with key k is Θ(|T[h(k)]|), and the expected length of a linked list is the load factor α = n/m.

  • Slide 36/40

    Expected Cost of an Unsuccessful Search

    Theorem: An unsuccessful search takes expected time Θ(1 + α).

    Proof:

    Any key not already in the table is equally likely to hash to any of the m slots.

    To search unsuccessfully for any key k, we need to search to the end of the list T[h(k)], whose expected length is α.

    Adding the time to compute the hash function, the total time required is Θ(1 + α).


  • Slide 37/40

    Expected Cost of a Successful Search

    Theorem: A successful search takes expected time Θ(1 + α).

    Proof:

    The probability that a list is searched is proportional to the number of elements it contains.

    Assume that the element being searched for is equally likely to be any of the n elements in the table.

    The number of elements examined during a successful search for an element x is 1 more than the number of elements that appear before x in x's list; these are the elements inserted after x was inserted.

    Goal: find the average, over the n elements x in the table, of how many elements were inserted into x's list after x was inserted.


  • Slide 38/40

    Expected Cost of a Successful Search

    Proof (contd.):

    Let x_i be the ith element inserted into the table, and let k_i = key[x_i].

    Define indicator random variables X_ij = I{h(k_i) = h(k_j)}, for all i, j.

    Simple uniform hashing ⇒ Pr{h(k_i) = h(k_j)} = 1/m ⇒ E[X_ij] = 1/m.

    The expected number of elements examined in a successful search is:

        E[ (1/n) · Σ_{i=1}^{n} ( 1 + Σ_{j=i+1}^{n} X_ij ) ]

    where Σ_{j=i+1}^{n} X_ij is the number of elements inserted after x_i into the same slot as x_i.


  • Slide 39/40

    Proof (Contd.)

        E[ (1/n) · Σ_{i=1}^{n} ( 1 + Σ_{j=i+1}^{n} X_ij ) ]

          = (1/n) · Σ_{i=1}^{n} ( 1 + Σ_{j=i+1}^{n} E[X_ij] )    (linearity of expectation)

          = (1/n) · Σ_{i=1}^{n} ( 1 + Σ_{j=i+1}^{n} 1/m )

          = 1 + (1/(n·m)) · Σ_{i=1}^{n} (n − i)

          = 1 + (1/(n·m)) · ( n² − n(n + 1)/2 )

          = 1 + (n − 1)/(2m)

          = 1 + α/2 − α/(2n)

    Expected total time for a successful search = time to compute the hash function + time to search = O(2 + α/2 − α/(2n)) = O(1 + α).


  • Slide 40/40

    Expected Cost: Interpretation

    If n = O(m), then α = n/m = O(m)/m = O(1).

    ⇒ Searching takes constant time on average.

    Insertion is O(1) in the worst case.

    Deletion takes O(1) worst-case time when the lists are doubly linked.

    Hence, all dictionary operations take O(1) time on average with hash tables with chaining.