hashing best study

Upload: shanmu-technocrat

Post on 05-Apr-2018


  • 7/31/2019 Hashing Best Study

    1/40

    File Structures SNU-OOPSLA Lab. 1

    Chap12. Extendible Hashing

    SNU-OOPSLA-LAB

    File Structures by Folk, Zoellick and Riccardi

    Chapter Objectives

    Describe the problem solved by extendible hashing and related approaches

    Explain how extendible hashing works; show how it combines tries with conventional, static hashing

    Use the buffer, file, and index classes of previous chapters to implement extendible hashing, including deletion

    Review studies of extendible hashing performance

    Examine alternative approaches to the same problem, including dynamic hashing, linear hashing, and hashing schemes that control splitting by allowing for overflow buckets

    Contents

    12.1 Introduction

    12.2 How extendible hashing works

    12.3 Implementation

    12.4 Deletion

    12.5 Extendible hashing performance

    12.6 Alternative approaches

    12.1 Introduction

    Dynamic files undergo a lot of growth

    Static hashing

    described in chapter 11 (direct hashing)

    typically worse than B-Tree for dynamic files

    eventually requires file reorganization

    Extendible hashing

    hashing for dynamic files

    Fagin, Nievergelt, Pippenger, and Strong (ACM TODS 1979)

    Overview(1)

    Direct-access (hashed) files have a static size, so they are not suitable for files whose size is unknown in advance

    A dynamic file structure is desired that retains fast retrieval by primary key and that also expands and contracts as the number of records in the file fluctuates (without reorganizing the whole file)

    Similar motivation!

    Indexed-sequential File ==> B tree

    Hashing ==> Extendible Hashing

    Overview(2)

    Extendible Hashing

    Primary key --(hashing function)--> H(key) --(extract first d bits)--> directory index --(table look-up)--> file pointer

    12.2 How Extendible Hashing works

    Idea from tries (radix searching)

    The branching factor of the tree is equal to the number of alternative symbols in each position of the key

    e.g.) Radix 26 trie: able, abrahms, adams, anderson, andrews, baird

    Use the first n characters for branching

    [Figure: radix 26 trie that branches on the leading characters of the keys and ends in the leaves able, abrahms, adams, anderson, andrews, baird]

    Extendible Hashing

    H maps keys to a fixed address space, with size the largest prime less than a power of 2 (65521 < 2^16)

    File pointers point to blocks of records known as buckets, where an entire bucket is read by one physical data transfer; buckets may be added to or removed from the file dynamically

    The first d bits of H(key) are used as an index into a directory array containing 2^d entries, which usually resides in primary memory

    The value d, the directory size (2^d), and the number of buckets change automatically as the file expands and contracts

    Extendible Hashing Example

    Directory with d=3 and 4 buckets

    Directory entry       Bucket    Bucket depth    Keys held
    000, 001, 010, 011    B0        d=1             H(key)=0..
    100                   B100      d=3             H(key)=100..
    101                   B101      d=3             H(key)=101..
    110, 111              B11       d=2             H(key)=11..

    Turning the trie into a directory

    Using a trie for extendible hashing

    (1) Use a radix 2 trie:

    Keys in A: beginning with 0

    Keys in B: beginning with 10

    Keys in C: beginning with 11

    (2) Retrieve from secondary storage the buckets containing keys, instead of individual keys

    [Figure: radix 2 trie in which branch 0 leads to bucket A, branch 10 to bucket B, and branch 11 to bucket C]

    Representation of Trie (1)

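    A tree is not preferable (the directory is not big); use a flattened array instead

    1. Make a complete (full) binary tree

    2. Collapse it into the directory structure

    [Figure: the trie extended to a complete binary tree of depth 2, then collapsed into a directory]

    Directory entry    Bucket
    00                 A
    01                 A
    10                 B
    11                 C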

    Representation of Trie(2)

    The directory is a complete binary tree

    Directory entry: a pointer to the associated bucket

    Given an address beginning with the bits 10, directory entry 10 (binary, i.e. entry 2) gives the pointer to the associated bucket

    Hashing is introduced so that addresses are uniformly distributed

    Retrieve a record

    Steps in retrieving a record with a given key

    find H(given key)

    extract first d bits of H(given key)

    use this value as an index into the directory to find a pointer

    use this pointer to read a bucket into primary memory

    locate the desired record within the bucket (scan)
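The retrieval steps above can be sketched in a few lines. This is a minimal in-memory illustration, not the textbook's implementation: the `Bucket` and `Directory` structs, the 16-bit hash width, and the names `firstBits` and `retrieve` are all assumptions made for the example, and the caller is assumed to have computed H(key) already.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory stand-in for a bucket stored on disk.
struct Bucket {
    std::vector<std::pair<std::uint16_t, std::string>> records;  // (H(key), record)
};

// Hypothetical directory: 2^depth cells, several of which may share a bucket.
struct Directory {
    int depth;
    std::vector<const Bucket*> cells;
};

// Extract the first (most significant) d bits of a 16-bit hash value.
std::size_t firstBits(std::uint16_t hashVal, int d) {
    return d == 0 ? 0 : static_cast<std::size_t>(hashVal >> (16 - d));
}

// Directory look-up, bucket "read", then a scan inside the bucket.
// Returns nullptr when no record with this hash value is present.
const std::string* retrieve(const Directory& dir, std::uint16_t hashVal) {
    const Bucket* bucket = dir.cells[firstBits(hashVal, dir.depth)];
    for (const auto& rec : bucket->records)
        if (rec.first == hashVal) return &rec.second;
    return nullptr;
}
```

With a directory of depth 2 whose cells point to buckets A, A, B, C, a hash value beginning with bits 10 selects cell 2 and therefore bucket B; only that one bucket is scanned.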

    Expansion & Contraction(1)

    A pair of adjacent buckets with the same value of d that share a common value of the first d-1 bits of H(key) can be combined if their average load is below 50%, so that all records fit into one bucket

    File contraction is the reverse of expansion; the directory can be compacted and d decremented whenever all pairs of pointers have the same values

    Expansion & Contraction(2)

    Bucket B0 overflows, then splits into B00 and B01

    Directory with d=3

    Directory entry    Bucket    Bucket depth    Keys held
    000, 001           B00       d=2             H(key)=00..
    010, 011           B01       d=2             H(key)=01..
    100                B100      d=3             H(key)=100..
    101                B101      d=3             H(key)=101..
    110, 111           B11       d=2             H(key)=11..

    Expansion & Contraction(3)

    Bucket B100 overflows, d increases to 4

    Directory with d=4

    Directory entry           Bucket    Bucket depth    Keys held
    0000, 0001, 0010, 0011    B00       d=2             H(key)=00..
    0100, 0101, 0110, 0111    B01       d=2             H(key)=01..
    1000                      B1000     d=4             H(key)=1000..
    1001                      B1001     d=4             H(key)=1001..
    1010, 1011                B101      d=3             H(key)=101..
    1100, 1101, 1110, 1111    B11       d=2             H(key)=11..

    Splitting to Handle Overflow (1)

    When overflow occurs

    e.g. 1) Overflow of bucket A

    Split A into A and D

    Start using an additional, previously unused address bit

    No need to expand the directory

    Directory entry    Before    After
    00                 A         A
    01                 A         D
    10                 B         B
    11                 C         C
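The case in which a bucket overflows but the directory already has a spare bit for it can be sketched as follows. This is an illustrative in-memory model under stated assumptions: `HashFile`, `Bucket`, `splitBucket`, the use of bucket numbers instead of file addresses, and the 16-bit address width are all hypothetical choices for the example, not the textbook's classes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bucket: its own depth plus the full 16-bit addresses of its keys.
struct Bucket {
    int depth;
    std::vector<std::uint16_t> addrs;
};

// Hypothetical file: directory cells hold bucket numbers (indexes into buckets).
struct HashFile {
    int depth;                    // directory depth d
    std::vector<int> cells;       // 2^d bucket numbers
    std::vector<Bucket> buckets;  // stand-in for the bucket file
};

// Split a bucket whose depth is smaller than the directory depth.
// No directory expansion is needed; returns the number of the new bucket.
int splitBucket(HashFile& f, int oldNo) {
    assert(f.buckets[oldNo].depth < f.depth);
    const int newDepth = ++f.buckets[oldNo].depth;
    const int newNo = static_cast<int>(f.buckets.size());
    f.buckets.push_back(Bucket{newDepth, {}});

    // Cells that pointed at the old bucket and have the newly used bit set
    // are redirected to the new bucket.
    const int shift = f.depth - newDepth;
    for (std::size_t i = 0; i < f.cells.size(); ++i)
        if (f.cells[i] == oldNo && ((i >> shift) & 1u)) f.cells[i] = newNo;

    // Redistribute the keys on the same bit of their full addresses.
    std::vector<std::uint16_t> keep;
    for (std::uint16_t a : f.buckets[oldNo].addrs) {
        if ((a >> (16 - newDepth)) & 1u) f.buckets[newNo].addrs.push_back(a);
        else keep.push_back(a);
    }
    f.buckets[oldNo].addrs = keep;
    return newNo;
}
```

Starting from cells A, A, B, C (bucket numbers 0, 0, 1, 2) with A at depth 1, splitting A leaves the directory at four cells and changes only cell 01, which now refers to the new bucket D.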

    Splitting to Handle Overflow(2)

    e.g. 2) Overflow of bucket B

    There are no additional unused bits (need to expand the directory)

    1. Divide B using 3 bits of the hash address

    2. Make a complete (full) binary tree

    3. Collapse it into the directory structure

    Directory entry    Bucket
    00                 A
    01                 A
    10                 B
    11                 C

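    1. Result of overflow of bucket B

    [Figure: trie in which bucket B is split on a third address bit into B (100) and D (101)]

    2. Complete binary tree

    [Figure: the trie extended to a complete binary tree of depth 3]

    3. Directory

    Directory entry       Bucket
    000, 001, 010, 011    A
    100                   B
    101                   D
    110, 111              C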
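Expanding the directory amounts to doubling the array so that every old cell becomes a pair of adjacent cells that still share one bucket. The sketch below assumes the directory is a plain vector of bucket numbers; the function name `doubleDirectory` and that representation are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Double the directory: old cell i becomes new cells 2i and 2i+1,
// both referring to the bucket that cell i referred to.
void doubleDirectory(std::vector<int>& cells, int& depth) {
    std::vector<int> bigger(cells.size() * 2);
    for (std::size_t i = 0; i < cells.size(); ++i) {
        bigger[2 * i] = cells[i];
        bigger[2 * i + 1] = cells[i];
    }
    cells.swap(bigger);
    ++depth;
}
```

Applied to the cells A, A, B, C at depth 2, this yields A, A, A, A, B, B, C, C at depth 3. The overflowing bucket B now occupies two cells (100 and 101), so it can be split in the ordinary way, with cell 101 redirected to the new bucket D.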

    Creating Address

    Function hash(KEY)

    Fold/Add hashing algorithm

    Do not MOD the hash value by the address space, since no fixed address space exists

    Output from the hash function for a number of keys

    bill         0000 0011 0110 1100
    lee          0000 0100 0010 1000
    pauline      0000 1111 0110 0101
    alan         0100 1100 1010 0010
    julie        0010 1110 0000 1001
    mike         0000 0111 0100 1101
    elizabeth    0010 1100 0110 1010
    mark         0000 1010 0000 0111

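    #include <cstring>   // strlen

    // Fold-and-add hash: combines the key two characters at a time.
    int Hash (char * key)
    {
        int sum = 0;
        int len = strlen(key);
        if (len % 2 == 1) len ++; // make len even; the extra character read is the terminating '\0'
        for (int j = 0; j < len; j += 2)
            sum = (sum + 100 * key[j] + key[j+1]) % 19937;
        return sum;
    }

    Figure 12.7 Function Hash (key) returns an integer hash value for key for a 15 bit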

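    int MakeAddress (char * key, int depth)
    {
        int retval = 0;
        int mask = 1;
        int hashVal = Hash(key);
        // reverse the bits: move the low-order bits of hashVal into retval, lowest bit first
        for (int j = 0; j < depth; j++)
        {
            retval = retval << 1;          // make room for the next bit
            int lowbit = hashVal & mask;   // take the lowest remaining bit
            retval = retval | lowbit;      // append it to the address
            hashVal = hashVal >> 1;        // discard the bit just used
        }
        return retval;
    }

    Figure 12.9 Function MakeAddress(key,depth)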
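As a check on the table of hash outputs, the two functions can be restated in self-contained form and run on the sample keys. This is a sketch under small assumptions: the parameters are `const char*` so that string literals can be passed, the loop body of `MakeAddress` is written in the usual bit-reversal form, and `toBinary16` is a hypothetical helper added only to print a value the way the table does.

```cpp
#include <cstring>
#include <string>

// Fold-and-add hash as in Figure 12.7, restated with a const parameter.
int Hash(const char* key) {
    int sum = 0;
    int len = static_cast<int>(std::strlen(key));
    if (len % 2 == 1) len++;  // odd length: the terminating '\0' is folded in as padding
    for (int j = 0; j < len; j += 2)
        sum = (sum + 100 * key[j] + key[j + 1]) % 19937;
    return sum;
}

// Reverse the low `depth` bits of the hash value, as in Figure 12.9.
int MakeAddress(const char* key, int depth) {
    int retval = 0;
    int hashVal = Hash(key);
    for (int j = 0; j < depth; j++) {
        retval = (retval << 1) | (hashVal & 1);
        hashVal >>= 1;
    }
    return retval;
}

// Hypothetical helper: render the low 16 bits of a value as a binary string.
std::string toBinary16(int value) {
    std::string bits;
    for (int i = 15; i >= 0; --i) bits += ((value >> i) & 1) ? '1' : '0';
    return bits;
}
```

For the key bill, the two folding steps give 9905 and then (9905 + 10908) mod 19937 = 876, which is 0000 0011 0110 1100 in binary and agrees with the table. `MakeAddress("bill", 3)` takes the three low-order bits 100 in reverse order and returns 001, that is, 1.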

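    class Bucket: protected TextIndex
    {
    protected:
        Bucket (Directory & dir, int maxKeys = defaultMaxKeys);
        int Insert (char * key, int recAddr);
        int Remove (char * key);
        Bucket * Split ();                  // split the bucket and redistribute the keys
        int NewRange (int & newStart, int & newEnd);
        int Redistribute (Bucket & newBucket);
        int FindBuddy ();                   // find the bucket that is the buddy of this one
        int TryCombine ();                  // attempt to combine with the buddy
        int Combine (Bucket * buddy, int buddyIndex);
        int Depth;                          // number of address bits used by this bucket
        Directory & Dir;                    // directory that contains the bucket
        int BucketAddr;                     // address of the bucket in the file
        friend class Directory;
        friend class BucketBuffer;
    };

    Figure 12.10 Main members of class Bucket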

    class Directory
    {
    public:
        Directory (..);
        ~Directory ();
        int Open (..);
        int Create ();
        int Close ();
        int Insert ();
        int Delete ();
        int Search ();
    protected:
        int DoubleSize ();
        int Collapse ();
        int InsertBucket (.);
        int Find ();
        int StoreBucket ();
        int LoadBucket () ..
    };

    Figure 12.11 Definition of class Directory

    12.4 Deletion

    When to combine buckets

    Buddy buckets: buckets that are siblings at the leaf level of the tree (a buddy is something like a friend)

    e.g., B and D (entries 100 and 101) in the directory that results from the overflow of bucket B are buddy buckets

    Examine the directory to see if we can make changes there

    Shrink the directory if none of the buckets requires the depth of address information that is currently available in the directory

    Buddy Bucket

    Given a bucket with an address uvwxy, where u, v, w, x, and y have values of either 0 or 1, the buddy bucket, if it exists, has the address uvwxz, such that

    z = y XOR 1

    If enough keys are deleted, the contents of buddy buckets can be combined into a single bucket
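The buddy address is a one-line computation once it is known that a buddy can exist. The sketch below assumes the usual condition that only a bucket whose depth equals the directory depth has a buddy (a shallower bucket is shared by several cells and has no single sibling); the free function `findBuddy` and its parameters are hypothetical names for the example.

```cpp
// Returns the directory index of the buddy bucket, or -1 if there is none.
int findBuddy(unsigned bucketAddr, int bucketDepth, int dirDepth) {
    if (dirDepth == 0) return -1;              // single-cell directory: no sibling
    if (bucketDepth < dirDepth) return -1;     // bucket is shared by several cells
    return static_cast<int>(bucketAddr ^ 1u);  // flip the last bit: z = y XOR 1
}
```

For a directory of depth 3, the bucket at address 100 has the buddy 101 and vice versa, while a bucket of depth 2 has none.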

    Collapsing the Directory

    Collapse condition

    If the directory is a single cell, downsizing is impossible

    If there is a pair of directory cells that do not both point to the same bucket, collapsing is impossible

    Allocating space

    Allocate half the size of the original

    Copy the bucket reference shared by each cell pair to a single cell in the new directory
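The collapse test and the copy step can be sketched directly from the conditions above. As before, the directory is assumed to be a plain vector of bucket numbers, and the name `collapse` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Halve the directory if every pair of adjacent cells shares one bucket.
// Returns true if the directory was collapsed, false if it was left unchanged.
bool collapse(std::vector<int>& cells, int& depth) {
    if (depth == 0) return false;                  // a single cell cannot shrink
    for (std::size_t i = 0; i < cells.size(); i += 2)
        if (cells[i] != cells[i + 1]) return false;  // some pair differs
    std::vector<int> smaller(cells.size() / 2);
    for (std::size_t i = 0; i < smaller.size(); ++i)
        smaller[i] = cells[2 * i];                 // one reference per cell pair
    cells.swap(smaller);
    --depth;
    return true;
}
```

A directory A, A, B, B at depth 2 collapses to A, B at depth 1; a second call then fails because the remaining pair refers to two different buckets.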

    12.5 Extendible Hashing Performance

    Time: O(1)

    If the directory can be kept in RAM: a single access

    Otherwise: two accesses are necessary

    Space utilization of the bucket

    r (# of records), b (block size), N (# of blocks)

    Utilization = r / (bN)

    Average utilization ==> 0.69

    Space utilization for the directory

    How large a directory should we expect to have, given an expected number of keys?

    Expected value for the directory size by Flajolet (1983)

    Estimated directory size = $\frac{3.92}{b}\, r^{(1 + 1/b)}$
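To get a feel for the directory-size estimate, it can be evaluated for sample values. The function name `estimatedDirectorySize` and the sample numbers are assumptions made for illustration; the formula is the one quoted above, with r the number of records and b the bucket (block) size.

```cpp
#include <cmath>

// Estimated number of directory cells: (3.92 / b) * r^(1 + 1/b).
double estimatedDirectorySize(double r, double b) {
    return 3.92 / b * std::pow(r, 1.0 + 1.0 / b);
}
```

For example, with r = 1000 records and b = 10 records per bucket the estimate comes to roughly 780 cells, so in practice the directory would be rounded up to the next power of two.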

    Space utilization for buckets

    Periodic and fluctuating

    With uniformly distributed addresses, all the buckets tend to fill up at the same time -> split at the same time

    As buckets fill up: 90%

    After a concentrated series of splits: 50%

    r: # of records, b: block size, N: # of blocks, with N ~= r / (b ln 2)

    Utilization = r / (bN) ~= ln 2 = 0.69

    Average utilization of 69%

    B-tree space utilization

    Normal B-tree: 67%, B-tree with redistribution on insertion: 85%
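The 69% figure for bucket utilization can be checked by substituting the estimate for the number of blocks into the definition of utilization. This is a short worked step, assuming N is approximately r / (b ln 2):

```latex
\[
\text{Utilization} = \frac{r}{bN}
\approx \frac{r}{b \cdot \dfrac{r}{b \ln 2}}
= \ln 2 \approx 0.69
\]
```

The record count r and the block size b cancel, which is why the average utilization does not depend on either of them.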

    12.6 Alternative Approaches(1): Dynamic Hashing

    Similar to extendible hashing

    Use a directory to track bucket addresses

    Extend the directory through the use of tries

    Start with a hash function that covers an address space of a fixed size

    When overflow occurs, the bucket splits, forming the leaves of a trie that grows down from the original address node

    Alternative Approaches(2): Dynamic Hashing

    Two kinds of nodes

    External node: references a data bucket

    Internal node: points to two child index nodes

    When a node's bucket splits and the node gains children, it changes from an external node to an internal node

    Two hash functions

    Apply the first hash function to reach the original address space

    If an external node is found: the search is completed

    If an internal node is found: apply the second hash function

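    [Figure: growth of the index in dynamic hashing, panels (a), (b), (c); in each panel the original address space holds nodes 1 to 4, and the trie grows below it as buckets split (node 4 into 40 and 41, then node 41 into 410 and 411 and node 2 into 20 and 21)]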

    Dynamic Hashing vs. Extendible Hashing(1)

    Overflow handling

    Both schemes extend the hash function locally, as a binary search trie

    Both schemes use a directory structure

    Dynamic hashing: a linked structure

    Extendible hashing: a perfect tree expressible as an array

    Space utilization

    The same for both schemes (space utilization: 69%)

    Dynamic Hashing and Extendible Hashing(2)

    Growth of directory

    Dynamic hashing: slower, more gradual growth

    Extendible hashing: extend directory by doubling it

    Actual size of an index node

    An index node in dynamic hashing is larger than a directory cell in extendible hashing (because of pointers)

    Page fault

    Dynamic hashing: more than one page fault (with a linked structure for the directory)

    Extendible hashing: single page fault

    Alternative Approaches(3): Linear Hashing

    Unlike extendible hashing and dynamic hashing, linear hashing does not use a directory.

    The actual address space is extended one bucket at a time as buckets overflow

    Because the extension of the address space does not necessarily correspond to the bucket that is overflowing, linear hashing necessarily involves the use of overflow buckets, even as the address space expands

    No directories: avoids the additional seek resulting from an additional layer

    Use more bits of the hashed value

    h_d(k): depth-d hashing function (using function make_address)
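Because buckets are split one at a time in address order, a key's bucket is found by choosing between the depth-d and depth-(d+1) functions. The sketch below is an assumption-laden illustration of that rule: `linearAddress` and the split pointer `n` (the number of buckets already split in the current round) are hypothetical names, and h_d(k) is taken to be the low d bits of the hash value, chosen so that bucket 00 splits into 000 and 100.

```cpp
// Address of a key in linear hashing.
// hashVal: hash value of the key; d: current depth;
// n: buckets 0 .. n-1 have already been split in this round (assumed bookkeeping).
unsigned linearAddress(unsigned hashVal, int d, unsigned n) {
    unsigned addr = hashVal & ((1u << d) - 1u);       // h_d(k): low d bits
    if (addr < n)                                     // this bucket was already split
        addr = hashVal & ((1u << (d + 1)) - 1u);      // h_{d+1}(k): one more bit
    return addr;
}
```

With d = 2 and only bucket 00 split so far (n = 1), keys whose low two bits are 00 are sent to 000 or 100 according to their third bit, while keys for 01, 10, and 11 still use two-bit addresses.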

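    The growth of address space in linear hashing(1)

    [Figure, panels (a) to (d): (a) buckets a, b, c, d at addresses 00, 01, 10, 11; (b) bucket A is added at address 100, with overflow bucket w; (c) bucket B is added at address 101, with overflow bucket x; (d) bucket C is added at address 110, with overflow buckets x and y]

    (continued...)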

    The growth of address space in linear hashing(2)

    [Figure, panel (e): buckets a, b, c, d, A, B, C, D at addresses 00, 01, 10, 11, 100, 101, 110, 111, with overflow bucket x]

    Alternative Approaches(5): Approaches to Controlling Splitting

    Postpone splitting: increase space utilization

    B-Tree: redistribution rather than splitting

    Hashing: placing records in chains of overflow buckets to postpone splitting

    Triggering event for splitting

    Linear hashing

    Every time any bucket overflows

    The overflowing bucket is not necessarily the one that is split

    Litwin (1980): use the overall load factor of the file as the trigger

    Below 2 seeks, 75% ~ 80% storage utilization

    Alternative Approaches(5): Approaches to Controlling Splitting

    Postpone splitting for extendible hashing

    Use chained overflow buckets

    Avoid doubling directory space

    1.1 seeks, 76% ~ 81% storage utilization

    Let's Review!

    12.1 Introduction

    12.2 How extendible hashing works

    12.3 Implementation

    12.4 Deletion

    12.5 Extendible hashing performance

    12.6 Alternative approaches