Empirical studies of some hashing functions

Olumide Owolabi

Department of Mathematics and Computer Science, University of Port Harcourt, Port Harcourt, Nigeria

Received 18 February 2002; revised 20 July 2002; accepted 14 September 2002

Abstract

The best hash function for a particular data set can often be found by empirical studies. The studies reported here are aimed at discovering

the most appropriate function for hashing Nigerian names. Five common hash functions—the division, multiplication, midsquare, radix

conversion and random methods—along with two collision-handling techniques—linear probing and chaining—were initially tried out on

three data sets, each consisting of about 1000 words. The first data set consists of Nigerian names, the second of English names, and the third

of words with computing associations. The major finding is that the performance of these functions with the Nigerian names is comparable to

those for the other data sets. Also, the superiority of the random and division methods over others is confirmed, even though the division

method will often be preferred for its ease of computation. It is also demonstrated that chaining, as a technique for collision-handling, is to be

preferred. The hash methods and collision-handling methods were further tested by using much larger data sets and long multiple word

strings. These further tests confirmed the previous findings.

© 2002 Elsevier Science B.V. All rights reserved.

Keywords: Hashing; Hash function efficiency; Collision resolution; Probing; Chaining

1. Introduction

Hashing constitutes a very important class of techniques

for storing and accessing data in large information systems.

Hash function methods, also known as scatter-storage

techniques, locate a piece of data in a table by transforming

the search key directly into a table address [5].

Computer-based information, usually organized as records with several fields, is identified by certain

fields known as the key fields. In storing information about

employees in a company, for example, the data may be

organized into records with fields such as employee number,

name, department, rank, duty and salary. We may choose to

store and access the records based, for example, on the name

field. This then becomes our key field.

Several methods exist for accomplishing a task such as

the one just stated. The records could be stored using

sequential organization in which they are ordered according

to the key field. To locate a desired record, the file is

searched from the top until the desired record is encoun-

tered. This method is, of course, simplistic and inefficient.

Methods that improve on this basic approach include the

various indexed and tree organizations [4,8].

The great advantage of the hash approach over these

other methods is that the table address for a given record can

be directly computed from the key. Suppose we have a table

with N locations and we desire to store M records (M < N)

in the table; a hash function maps the M keys into the N table

locations. The hash approach applies a function, h, to a key,

k, to produce the table address: address = h(k).

Ideally, a hash function should distribute the records so evenly over the available address space that no two records hash into the same table location. In such a case,

every key can be located by looking into only one slot in the

table. It frequently occurs, however, that several records

map into the same table location. This phenomenon is

referred to as collision. Thus, mechanisms referred to as

collision-handling techniques exist alongside hashing func-

tions to resolve collision cases [9].

A hash function that ensures that no collisions occur is

known as a perfect hash function [3]. If the function can map

the M keys into exactly M locations, it is known as a minimally

perfect hash function [2,10]. So far, such functions have only

been found to work for rather small static data sets [1,11].

Given that hash functions generally suffer from col-

lisions, it then becomes necessary to find means of

measuring their efficiencies. The goodness of a hash

function can be measured by the average number of table

slots examined—called probes—to locate any given record

[12]. The number of probes required to find an empty slot or

locate a table entry will depend, amongst other things, on

how full the table is.

The proportion of the hash table that is currently

occupied is called the load factor. This is the ratio of the number of entries to the table size (M/N). If the table is relatively

sparsely occupied, i.e. if the load factor is low, then we can

expect to find an empty slot fairly soon. But if the load factor

is high, then collisions are more likely. This suggests that it

is desirable not to have the table too fully occupied.

Normally, a load factor of between 0.6 and 0.75 will

represent an acceptable figure for most applications [7,9].

As an indication of the importance of hashing methods in

information systems, a great deal of effort has been devoted

to quantifying the efficiency of hashing functions. Indi-

cations from these studies are that the efficiency of hashing

functions can depend to a large extent on the type of data

involved. Empirical studies are essential to evaluate the

goodness of hashing functions in any particular context [6,7,

9,12].

The work reported in this paper arose out of our efforts at

conducting research into large-scale information systems

particularly adapted to local needs. In a prototype

information retrieval system under development, hash

functions are employed to map names—which are largely

Nigerian—into table addresses. The aim in this project is to

study some common hash functions as well as collision-

handling methods to determine which combination will give

the best performance for our particular data sets.

Apart from Nigerian names, hash functions have to be

employed for other data in the system. We therefore decided

to test these functions on two other data sets—English

names and words with computing associations. In addition

to deciding what hash function/collision-handling method

combination is most suitable for the Nigerian names, this

study will also enable us to know whether or not we need to

employ a different combination for each data type in the

system.

Having introduced the subject of our study in this section of the paper, we describe in Sections 2 and 3 the hash functions and collision-handling techniques that were evaluated in this work. These are followed by a discussion of the

experiments and results in Section 4, and then Section 5

concludes the paper.

2. Hash functions

In these studies we decided to examine the perform-

ances of the five most common hash functions. These

are: the division, multiplication, midsquare, radix and

random methods. Brief descriptions of these methods are

given below. It is required to map the set of keys into N

table locations. Hence, N is the size of the hash table. A

key string, s, is first converted to a numerical value, k,

which is then converted into an address between 0 and

N − 1, inclusive. In each case the numerical value of a

key string is computed using the function:

f(s) = Σ_{i=1}^{n} C_i W^i,

where n = length(s) is the number of characters in the string, W is the number of characters in the base alphabet, and C_i is the ordinal position of character i in the alphabet.
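
To make the conversion concrete, a minimal Python sketch of the key-to-number step is given below. The 26-letter lowercase alphabet, the skipping of characters outside that alphabet, and the reading of the formula as C_i multiplied by W raised to the power i are assumptions of this illustration; the paper does not spell out these details.

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"
    W = len(ALPHABET)  # number of characters in the base alphabet

    def key_value(s: str) -> int:
        # f(s) = sum over i of C_i * W**i, with C_i the 1-based ordinal of character i.
        total = 0
        for i, ch in enumerate(s.lower(), start=1):
            if ch not in ALPHABET:
                continue                       # assumption: ignore non-alphabet characters
            c_i = ALPHABET.index(ch) + 1       # ordinal position of character i
            total += c_i * W ** i
        return total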

2.1. Division method

Given a key, k, the division method uses the remainder

modulo N as the hash address:

h(k) = k mod N

Selecting an appropriate hash table size is important to the

success of this method. For example, a hash table size

divisible by two would yield even hash values for even keys,

and odd hash values for odd keys. This is an undesirable

property, as all keys would hash to even addresses if they

happened to be even. To obtain a good spread, the hash table

size should be a prime number not too close to a power of

two.
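
A sketch of the division method follows, reusing the illustrative key_value helper above. The sample table size (a prime not close to a power of two) and the sample key are chosen only for illustration.

    def h_division(k: int, n: int) -> int:
        # Division method: the table address is the remainder of the key modulo n.
        return k % n

    # Example with an illustrative prime table size, as the text recommends:
    address = h_division(key_value("okoro"), 1009)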

2.2. Multiplication method

With this method, the normalized hash address, in the

range 0–1, is first computed. The result is then applied to a

hash table of arbitrary size. The normalized hash address is

computed by multiplying the key value, k, by a constant c,

(c = 0.618034 has been found to be a good choice [4]), and

then taking the fractional part. Multiplying the resulting

fraction by N gives a hash address between 0 and N − 1.
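
The multiplication method can be sketched as follows; floating-point precision issues for very large keys are ignored in this simple illustration.

    C = 0.618034  # constant suggested in the text, close to (sqrt(5) - 1) / 2

    def h_multiplication(k: int, n: int) -> int:
        # Multiplication method: take the fractional part of k * C, then scale to the table size.
        frac = (k * C) % 1.0          # normalized hash address in [0, 1)
        return int(frac * n)          # integer address between 0 and n - 1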

2.3. Midsquare method

To properly hash up the bits in a given key, this method

squares the key value, k. Thereafter the n middle bits are

extracted from this value to form the hash address. The size

of the hash table will determine the value of n. For example,

10 bits will address up to 1024 hash table locations.
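
A possible rendering of the midsquare method is shown below. Aligning the "middle" bits symmetrically within the squared value is an assumption of this sketch; the paper does not state the exact alignment.

    def h_midsquare(k: int, n_bits: int = 10) -> int:
        # Midsquare method: square the key and extract the middle n_bits bits.
        # With n_bits = 10 the address covers up to 1024 table locations,
        # matching the example in the text.
        squared = k * k
        shift = max((squared.bit_length() - n_bits) // 2, 0)   # drop the low-order half
        return (squared >> shift) & ((1 << n_bits) - 1)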

2.4. Radix conversion method

This method is based on the idea that if the same number

is expressed in two digital representations with radices that

are relatively prime to each other, then the respective digits

will have very little correlation. Consider the key, for

example, as a string of octal digits. Now regard this same

string as digits in a different base, 11 say. The resulting base

11 number is then converted to base 10. The hash address is

this number modulo N.
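
One reasonable reading of this description is sketched below: the numeric key is written out in octal, the same digit string is reinterpreted in base 11, and the result is reduced modulo the table size. The exact digit encoding is not given in the paper, so this is an assumption.

    def h_radix(k: int, n: int) -> int:
        # Radix conversion method: relatively prime radices (8 and 11) decorrelate the digits.
        octal_digits = oct(k)[2:]            # the key written as a string of octal digits (0-7)
        as_base_11 = int(octal_digits, 11)   # the same digit string read as a base-11 number
        return as_base_11 % n                # hash address is that number modulo the table size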

2.5. The random method

This method uses a pseudo-random number generator to

generate a sequence of addresses, which are more or less

random. Pseudo-random number generators have the

property that when seeded with a particular value, the

sequence generated is deterministic. This property is

sometimes used in collision handling. The pseudo-random

number generator employed here is seeded with the key to

generate a value between 0 and 1. Multiplying this by the

size of the hash table gives the hash address.
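
A sketch of the random method is given below; Python's default pseudo-random generator stands in for whichever generator the paper actually used.

    import random

    def h_random(k: int, n: int) -> int:
        # Random method: seed the generator with the key so the address is deterministic,
        # then scale the first value in [0, 1) to the table size.
        rng = random.Random(k)
        return int(rng.random() * n)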

3. Collision-handling techniques

Collision-handling techniques fall into two general

classes: open addressing and separate chaining.

When a key hashes into a location already occupied by

another record, open addressing techniques work by

searching the table circularly till an empty slot is found.

The search for an empty slot is conducted by adding a

certain increment to the computed address each time. Thus

the search for a slot will visit the locations h(k), h(k) + f1, h(k) + f2, …, h(k) + fi, …

Open addressing can be performed by linear probing, in

which case fi = i. It can also be done by quadratic probing,

in which case fi is a quadratic function. Alternatively, it can

be done randomly, in which case fi, i = 1, 2, …, is a

sequence of pseudo-random numbers. Searching for a key

follows the same pattern. In all these cases, the sequence fi is

always deterministic for a given key; this ensures that the

key will always be found if it is in the table.
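
A minimal open-addressing sketch with linear probing (fi = i) follows; h_division and key_value are the illustrative helpers defined earlier, not the paper's own code.

    def probe_insert(table: list, key: str, n: int) -> int:
        # Step circularly from h(k) until an empty slot (or the key itself) is found.
        # Returns the number of probes used.
        k = key_value(key)
        for i in range(n):
            slot = (h_division(k, n) + i) % n
            if table[slot] is None or table[slot] == key:
                table[slot] = key
                return i + 1
        raise RuntimeError("hash table is full")

    def probe_search(table: list, key: str, n: int):
        # Searching follows the same probe sequence; an empty slot means the key is absent.
        k = key_value(key)
        for i in range(n):
            slot = (h_division(k, n) + i) % n
            if table[slot] is None:
                return None
            if table[slot] == key:
                return slot
        return None

    # table = [None] * 1009   # example table; the size is chosen only for illustration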

Separate chaining techniques, as opposed to open

addressing, utilize a storage area separate from the primary

table to accommodate colliding entries. Thus, when a key

hashes into a location already filled, this key is stored in the

next reserved location with a pointer to it stored in the

primary table. There is therefore no need to visit other table

locations in case of a collision; the new entry is simply

attached to the end of the list at the location where a

collision has occurred.
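
Separate chaining can be sketched as below, with each primary slot holding a Python list as its chain; again this is an illustration built on the earlier helpers, not the paper's implementation.

    def chain_insert(table: list, key: str, n: int) -> None:
        # A colliding key is simply appended to the chain at its slot.
        slot = h_division(key_value(key), n)
        table[slot].append(key)

    def chain_search(table: list, key: str, n: int) -> int:
        # Returns the number of chain positions examined (probes), or 0 if the key is absent.
        slot = h_division(key_value(key), n)
        for probes, stored in enumerate(table[slot], start=1):
            if stored == key:
                return probes
        return 0

    # table = [[] for _ in range(1009)]   # one (initially empty) chain per slot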

Chaining can also be implemented by using bucket

addressing. In this case, locations in the reserved area are

grouped into buckets of a reasonable fixed size. When there

is a collision, an empty bucket is attached to the table

location where the collision has occurred. Colliding entries

are inserted in this bucket until it is full. If a collision occurs

at a location with a full bucket, a new bucket is simply

attached to the last one.
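
The bucket-addressing variant might look like the following sketch; the bucket size and the list-of-buckets layout are illustrative choices, since the paper gives neither.

    BUCKET_SIZE = 4   # illustrative fixed bucket size

    def bucket_insert(table: list, key: str, n: int) -> None:
        # Colliding entries fill fixed-size buckets chained to the slot;
        # when the last bucket is full, a new empty bucket is attached to it.
        slot = h_division(key_value(key), n)
        buckets = table[slot]
        if not buckets or len(buckets[-1]) >= BUCKET_SIZE:
            buckets.append([])
        buckets[-1].append(key)

    # table = [[] for _ in range(1009)]   # each slot starts with no buckets attached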

4. Experiments and results

As stated previously, three separate

data sets were initially used in testing the five hash functions

described in Section 2. For collisions, simple chaining and

linear probing were used. The first data set is a collection of

Nigerian names, the second a collection of English names,

while the third is made up of terms with computing

associations. Each consists of about a thousand words. The

Nigerian names range in length from 3 to 12 characters with

an average of 6.15; the English names range in length from 3

to 13 with an average of 6.42; lastly the computing terms have

lengths from 12 to 17, with an average of 7.45 characters.

For our 1000 words we have chosen a table size of 1537.

This figure satisfies the peculiarities of the different hash

functions. For example, with the mid-square function, we

compute the hash address from the middle 11 bits; this can

address up to 2048 locations. Multiplying this number by

3/4 and adding 1 gives a maximum of 1537 hash locations.

In our experiments, each hash function was tested in turn,

and for each function the collision-handling methods were

separately employed. This process was repeated for each of

the three data sets. For each data set, all the words were first

inserted into the table. Each word was then retrieved in turn, and the total number of probes required to locate all the words was summed. From this figure, the average number of probes

was computed. The results are shown in Tables 1–3.
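
The measurement loop just described can be sketched as follows, using the chaining helpers from Section 3; the word list itself and the exact probe-counting conventions of the original experiments are not reproduced here.

    def average_probes(words: list, n: int = 1537) -> float:
        # Load every word into a freshly built chained table, then retrieve
        # each word and average the probes needed to locate it.
        table = [[] for _ in range(n)]
        for w in words:
            chain_insert(table, w, n)
        total = sum(chain_search(table, w, n) for w in words)
        return total / len(words)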

Table 1
Results for Nigerian names (average number of probes)

Hash function     Linear probing    Separate chaining
Division               2.50              1.36
Multiplication         1.81              1.33
Midsquare              2.84              1.77
Radix                  2.00              1.32
Random                 1.48              1.18

Table 2
Results for English names (average number of probes)

Hash function     Linear probing    Separate chaining
Division               1.87              1.31
Multiplication         1.88              1.31
Midsquare              3.98              2.64
Radix                  1.95              1.32
Random                 1.20              1.14

Table 3
Results for computing words (average number of probes)

Hash function     Linear probing    Separate chaining
Division               1.90              1.34
Multiplication         1.73              1.28
Midsquare              3.23              2.10
Radix                  1.80              1.33
Random                 1.20              1.14


From Tables 1–3, it is apparent that the random method

of hashing requires the least number of probes on average to

locate a given key. It can thus be inferred that a good

pseudo-random number generator that can be seeded with a

key will be efficient. It is true, however, that it requires more

computational effort than some of the other hash functions.

We can conclude that the performances of the division,

multiplication and radix conversion methods are on a par. The

division method, however, has the advantage that it is

easiest to compute. These findings are consistent with those

of Lum [6] and Kohonen [5], who found that the

division and random methods give the best performance.

It is significant to note that with the chaining method of

collision-handling, the three data sets yield a more or less

uniform set of figures. Also, the hash functions have a more

impressive performance with chaining. Kohonen [5] also

found that separate chaining works best in most cases. With

this method of collision handling, it does not appear that the characteristics of the Nigerian names are significantly different from those of the other data types. This might be due to the fact that all the data sets are drawn from the same alphabet.

Further tests were conducted on these hash functions and

collision-handling methods using much larger dictionaries

containing mixed types of strings. The hash table size was

also proportionately increased so as to keep the load factor

at about 0.65 when the table is fully loaded, since this was

the factor used in the initial tests. The results were very

similar to what we have in Tables 1–3.

Also, a dictionary of long multi-word strings, with lengths varying from 14 to 48 and averaging 27 characters per string, was used. Table 4 shows that the performance is in line with the results for short single-word strings. This shows that string

length is not necessarily a significant factor with respect to

the efficiency of hashing functions.

A major finding of these studies, therefore, is that we

need not think of using special hash functions to handle

Nigerian names.

5. Conclusion

These preliminary studies have shown that it might not

be necessary to devise special functions for hashing

Nigerian names. The performances of the common hash

functions for Nigerian names are comparable to those for

other words drawn from the same alphabet.

In addition, the superiority of hash functions such as the

random and division methods over others has been

confirmed. Chaining is also shown to be a good choice for

handling collisions. These findings will be of tremendous

help in our work of designing information-processing

systems.

References

[1] M.D. Brain, A.L. Tharp, Near-perfect hashing for large word sets,

Software: Practice and Experience 19 (1989) 967–978.

[2] C.C. Chang, A scheme for constructing ordered minimal perfect

hashing functions, Info. Sci. 39 (1986) 187–195.

[3] G.V. Cormack, R.N.S. Horspool, M. Kaisewerth, Practical perfect

hashing, Comput. J. 28 (1985) 54–58.

[4] D.E. Knuth, The Art of Computer Programming, vol. 3: Sorting and Searching, Addison-Wesley, Reading, MA, 1973, pp.

506–542.

[5] T. Kohonen, Content-Addressable Memories, Springer, Berlin, 1987,

pp. 39–100.

[6] V.Y. Lum, General performance analysis of key-to-address trans-

formation methods using an abstract file concept, Commun. ACM 16

(1973) 603–612.

[7] J.D. Maurer, T.G. Lewis, Hash table methods, Comput. Surv. 7 (1975)

5–19.

[8] N.E. Miller, File Structures Using Pascal, Benjamin/Cummings,

California, 1987, pp. 209–260.

[9] R. Morris, Scatter storage techniques, Commun. ACM 11 (1968)

38–44.

[10] R.W. Sebasta, M.A. Taylor, Minimal perfect hash functions for

reserved word lists, SIGPLAN Notices 20 (1985) 47–53.

[11] R. Sprugnoli, Perfect hashing functions: a single probe method for

static sets, Commun. ACM 20 (1977) 841–850.

[12] J.D. Ullman, A note on the efficiency of hashing functions, J. ACM 19

(1972) 569–575.

Table 4
Results for multi-word strings (average number of probes)

Hash function     Linear probing    Separate chaining
Division               1.90              1.32
Multiplication         1.80              1.31
Midsquare              3.32              2.12
Radix                  1.74              1.31
Random                 1.21              1.15
