Empirical studies of some hashing functions
Olumide Owolabi
Department of Mathematics and Computer Science, University of Port Harcourt, Port Harcourt, Nigeria
Received 18 February 2002; revised 20 July 2002; accepted 14 September 2002
Abstract
The best hash function for a particular data set can often be found by empirical studies. The studies reported here are aimed at discovering
the most appropriate function for hashing Nigerian names. Five common hash functions—the division, multiplication, midsquare, radix
conversion and random methods—along with two collision-handling techniques—linear probing and chaining—were initially tried out on
three data sets, each consisting of about 1000 words. The first data set consists of Nigerian names, the second of English names, and the third
of words with computing associations. The major finding is that the performance of these functions with the Nigerian names is comparable to
those for the other data sets. Also, the superiority of the random and division methods over others is confirmed, even though the division
method will often be preferred for its ease of computation. It is also demonstrated that chaining, as a technique for collision-handling, is to be
preferred. The hash methods and collision-handling methods were further tested by using much larger data sets and long multiple word
strings. These further tests confirmed the previous findings.
© 2002 Elsevier Science B.V. All rights reserved.
Keywords: Hashing; Hash function efficiency; Collision resolution; Probing; Chaining
1. Introduction
Hashing constitutes a very important class of techniques
for storing and accessing data in large information systems.
Hash function methods, also known as scatter-storage
techniques, locate a piece of data in a table by transforming
the search key directly into a table address [5].
Computer-based information, typically organized as
records with several fields, is usually identified by certain
fields known as the key fields. In storing information about
employees in a company, for example, the data may be
organized into records with fields such as employee number,
name, department, rank, duty and salary. We may choose to
store and access the records based, for example, on the name
field. This then becomes our key field.
Several methods exist for accomplishing a task such as
the one just stated. The records could be stored using
sequential organization in which they are ordered according
to the key field. To locate a desired record, the file is
searched from the top until the desired record is encoun-
tered. This method is, of course, simplistic and inefficient.
Methods that improve on this basic approach include the
various indexed and tree organizations [4,8].
The great advantage of the hash approach over these
other methods is that the table address for a given record can
be directly computed from the key. Suppose we have a table
with N locations and we wish to store M records (M < N)
in the table; a hash function then maps the M keys into the
N table locations. The hash approach applies a function, h,
to a key, k, to produce the table address: Address = h(k).
Ideally, a hash function should distribute the records so
evenly over the available address space that no two
records hash into the same table location. In such a case,
every key can be located by looking into only one slot in the
table. It frequently occurs, however, that several records
map into the same table location. This phenomenon is
referred to as collision. Thus, mechanisms referred to as
collision-handling techniques exist alongside hashing func-
tions to resolve collision cases [9].
A hash function that ensures that no collisions occur is
known as a perfect hash function [3]. If the function can map
the M keys into exactly M locations, it is known as a minimally
perfect hash function [2,10]. So far, such functions have only
been found to work for rather small static data sets [1,11].
Given that hash functions generally suffer from col-
lisions, it then becomes necessary to find means of
measuring their efficiencies. The goodness of a hash
PII: S0950-5849(02)00174-X
Information and Software Technology 45 (2003) 109–112
www.elsevier.com/locate/infsof
E-mail address: [email protected] (O. Owolabi).
function can be measured by the average number of table
slots examined—called probes—to locate any given record
[12]. The number of probes required to find an empty slot or
locate a table entry will depend, amongst other things, on
how full the table is.
The proportion of the hash table that is currently
occupied is called the load factor: the ratio of the
number of entries to the table size (M/N). If the table is relatively
sparsely occupied, i.e. if the load factor is low, then we can
expect to find an empty slot fairly soon. But if the load factor
is high, then collisions are more likely. This suggests that it
is desirable not to have the table too fully occupied.
Normally, a load factor of between 0.6 and 0.75 will
represent an acceptable figure for most applications [7,9].
As an indication of the importance of hashing methods in
information systems, a great deal of effort has been devoted
to quantifying the efficiency of hashing functions. Indi-
cations from these studies are that the efficiency of hashing
functions can depend to a large extent on the type of data
involved. Empirical studies are essential to evaluate the
goodness of hashing functions in any particular context [6,7,
9,12].
The work reported in this paper arose out of our efforts at
conducting research into large-scale information systems
particularly adapted to local needs. In a prototype
information retrieval system under development, hash
functions are employed to map names—which are largely
Nigerian—into table addresses. The aim in this project is to
study some common hash functions as well as collision-
handling methods to determine which combination will give
the best performance for our particular data sets.
Apart from Nigerian names, hash functions have to be
employed for other data in the system. We therefore decided
to test these functions on two other data sets—English
names and words with computing associations. In addition
to deciding what hash function/collision-handling method
combination is most suitable for the Nigerian names, this
study will also enable us to know whether or not we need to
employ a different combination for each data type in the
system.
Having introduced the subject of our study in this section
of the paper, Sections 2 and 3 describe the hash functions
and collision-handling techniques that were evaluated in
this work. These are followed by a discussion of the
experiments and results in Section 4, and then Section 5
concludes the paper.
2. Hash functions
In these studies we decided to examine the perform-
ances of the five most common hash functions. These
are: the division, multiplication, midsquare, radix and
random methods. Brief descriptions of these methods are
given below. It is required to map the set of keys into N
table locations. Hence, N is the size of the hash table. A
key string, s, is first converted to a numerical value, k,
which is then converted into an address between 0 and
N − 1, inclusive. In each case the numerical value of a
key string is computed using the function:

f(s) = Σ_{i=1}^{n} C_i · W^i,

where n = length(s) is the number of characters in the
string, W is the number of characters in the base alphabet,
and C_i is the ordinal position of character i in the
alphabet.
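As an illustration, the key-to-number conversion above can be sketched in Python. The function name `key_value` and the 26-letter lower-case alphabet (so W = 26, with C_i the 1-based position of the i-th letter) are assumptions made for this example, not details taken from the paper:

```python
def key_value(s: str, W: int = 26) -> int:
    """Compute f(s) = sum over i = 1..n of C_i * W**i.

    Assumes a 26-letter lower-case alphabet; C_i is the 1-based
    ordinal position of the i-th character of s in that alphabet.
    """
    total = 0
    for i, ch in enumerate(s.lower(), start=1):
        C_i = ord(ch) - ord('a') + 1  # ordinal position in the alphabet
        total += C_i * W ** i
    return total
```

For instance, `key_value("ab")` evaluates to 1·26 + 2·26², so strings that differ in character order yield different values.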
2.1. Division method
Given a key, k, the division method uses the remainder
modulo N as the hash address:

h(k) = k mod N
Selecting an appropriate hash table size is important to the
success of this method. For example, a table size divisible
by two would yield even hash values for even keys and odd
hash values for odd keys, so the addresses would inherit any
parity bias in the keys instead of spreading them evenly.
To obtain a good spread, the hash table
size should be a prime number not too close to a power of
two.
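A minimal Python sketch of the division method follows. The default table size N = 1009 is an illustrative choice of a prime not too close to a power of two, as the text recommends:

```python
def h_division(k: int, N: int = 1009) -> int:
    """Division method: the hash address is the key modulo the table size."""
    # N should be a prime not too close to a power of two.
    return k % N
```

For example, `h_division(2020)` gives 2, since 2020 = 2 × 1009 + 2.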
2.2. Multiplication method
With this method, the normalized hash address, in the
range 0–1, is first computed. The result is then applied to a
hash table of arbitrary size. The normalized hash address is
computed by multiplying the key value, k, by a constant c
(c = 0.618034 has been found to be a good choice [4]), and
then taking the fractional part. Multiplying the resulting
fraction by N gives a hash address between 0 and N − 1.
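The method can be sketched in Python as follows (the function name is an assumption; the constant 0.618034 is the one cited in the text from [4]):

```python
def h_multiplication(k: int, N: int) -> int:
    """Multiplication method: scale the fractional part of k*c to 0..N-1."""
    c = 0.618034              # constant recommended in the text [4]
    frac = (k * c) % 1.0      # normalized hash address in [0, 1)
    return int(frac * N)      # scale to a table address in 0 .. N - 1
```

Because only the fractional part is kept, the table size N need not be prime for this method, which is why the text says the result can be applied to a table of arbitrary size.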
2.3. Midsquare method
To mix the bits of a given key thoroughly, this method
squares the key value, k, and then extracts the middle n bits
of the result to form the hash address. The size
of the hash table will determine the value of n. For example,
10 bits will address up to 1024 hash table locations.
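A sketch of the midsquare method in Python is given below. The assumption that keys fit in 32 bits (so the square is treated as a 64-bit quantity) is made for the example; the text itself does not fix a key width:

```python
def h_midsquare(k: int, n: int = 10, key_bits: int = 32) -> int:
    """Midsquare method: take the middle n bits of the squared key.

    Assumes keys fit in key_bits bits, so the square occupies
    2 * key_bits bits. With n = 10 the address space is 1024 slots,
    matching the example in the text.
    """
    sq = k * k                              # square the key
    total_bits = 2 * key_bits               # width of the square
    shift = (total_bits - n) // 2           # drop low-order bits on one side
    return (sq >> shift) & ((1 << n) - 1)   # keep the middle n bits
```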
2.4. Radix conversion method
This method is based on the idea that if the same number
is expressed in two digital representations with radices that
are relatively prime to each other, then the respective digits
will have very little correlation. Consider the key, for
example, as a string of octal digits. Now regard this same
string as digits in a different base, 11 say. The resulting base
11 number is then converted to base 10. The hash address is
this number modulo N.
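The radix conversion described above can be sketched in Python as follows, with octal (base 8) and base 11 as the two relatively prime radices named in the text; the function name is an assumption for the example:

```python
def h_radix(k: int, N: int) -> int:
    """Radix conversion method.

    Read the octal (base-8) digits of k, reinterpret that digit
    string as a base-11 number, and reduce the result modulo N.
    """
    value, place = 0, 1
    while k > 0:
        value += (k % 8) * place  # next octal digit, weighted in base 11
        k //= 8
        place *= 11
    return value % N
```

For example, 8 is "10" in octal; reinterpreted in base 11 this is 11, so `h_radix(8, 1537)` gives 11.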
O. Owolabi / Information and Software Technology 45 (2003) 109–112110
2.5. The random method
This method uses a pseudo-random number generator to
generate a sequence of addresses, which are more or less
random. Pseudo-random number generators have the
property that when seeded with a particular value, the
sequence generated is deterministic. This property is
sometimes used in collision handling. The pseudo-random
number generator employed here is seeded with the key to
generate a value between 0 and 1. Multiplying this by the
size of the hash table gives the hash address.
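A minimal Python sketch of the random method follows, using the standard library generator in place of whichever generator the original study employed (which the paper does not name):

```python
import random

def h_random(k: int, N: int) -> int:
    """Random method: seed a pseudo-random generator with the key."""
    rng = random.Random(k)        # same seed always yields the same sequence,
    return int(rng.random() * N)  # so each key maps to one fixed address
```

The determinism of the seeded generator is what makes this a usable hash function: hashing the same key twice must visit the same address.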
3. Collision-handling techniques
Collision-handling techniques fall into two general
classes: open addressing and separate chaining.
When a key hashes into a location already occupied by
another record, open addressing techniques work by
searching the table circularly till an empty slot is found.
The search for an empty slot is conducted by adding a
certain increment to the computed address each time. Thus
the search for a slot will visit the locations: h(k), h(k) + f_1,
h(k) + f_2, …, h(k) + f_i, …
Open addressing can be performed by linear probing, in
which case f_i = i. It can also be done by quadratic probing,
in which case f_i is a quadratic function of i. Alternatively, it
can be done randomly, in which case f_i, i = 1, 2, …, is a
sequence of pseudo-random numbers. Searching for a key
follows the same pattern. In all these cases, the sequence f_i is
always deterministic for a given key; this ensures that the
key will always be found if it is in the table.
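Open addressing with linear probing (f_i = i) can be sketched in Python as follows; the function names and the use of `None` to mark an empty slot are assumptions made for the example:

```python
def insert_linear(table, key, h):
    """Insert key by open addressing with linear probing (f_i = i)."""
    N = len(table)
    for i in range(N):
        addr = (h(key) + i) % N   # visit h(k), h(k)+1, ... circularly
        if table[addr] is None:
            table[addr] = key
            return addr
    raise RuntimeError("hash table is full")

def search_linear(table, key, h):
    """Search follows the same probe sequence as insertion."""
    N = len(table)
    for i in range(N):
        addr = (h(key) + i) % N
        if table[addr] == key:
            return addr
        if table[addr] is None:   # an empty slot means the key is absent
            return None
    return None
```

Because the probe sequence is deterministic for a given key, a search retraces exactly the slots that insertion would have visited, so a stored key is always found.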
Separate chaining techniques, as opposed to open
addressing, utilize a storage area separate from the primary
table to accommodate colliding entries. Thus, when a key
hashes into a location already filled, this key is stored in the
next reserved location with a pointer to it stored in the
primary table. There is therefore no need to visit other table
locations in case of a collision; the new entry is simply
attached to the end of the list at the location where the
collision has occurred.
Chaining can also be implemented by using bucket
addressing. In this case, locations in the reserved area are
grouped into buckets of a reasonable fixed size. When there
is a collision, an empty bucket is attached to the table
location where the collision has occurred. Colliding entries
are inserted in this bucket until it is full. If a collision occurs
at a location with a full bucket, a new bucket is simply
attached to the last one.
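Separate chaining can be sketched in Python with a list per table location (the class and method names are assumptions for the example; the probe count, one for the table slot plus the position along the chain, follows the probe-counting convention used in the experiments of Section 4):

```python
class ChainedTable:
    """Separate chaining: each table location holds a list of entries."""

    def __init__(self, N, h):
        self.slots = [[] for _ in range(N)]  # one chain per table location
        self.h = h

    def insert(self, key):
        # No other table locations are visited on a collision: the
        # colliding entry is simply appended to the chain at its address.
        self.slots[self.h(key)].append(key)

    def probes_to_find(self, key):
        """One probe for the slot itself, plus the position in its chain."""
        chain = self.slots[self.h(key)]
        return 1 + chain.index(key) if key in chain else None
```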
4. Experiments and results
As has been previously stated in this paper, three separate
data sets were initially used in testing the five hash functions
described in Section 2. For collisions, simple chaining and
linear probing were used. The first data set is a collection of
Nigerian names, the second a collection of English names,
while the third is made up of terms with computing
associations. Each consists of about a thousand words. The
Nigerian names range in length from 3 to 12 characters with
an average of 6.15; the English names range in length from 3
to 13 with an average of 6.42; lastly, the computing terms have
lengths from 12 to 17, with an average of 7.45 characters.
For our 1000 words we chose a table size of 1537.
This figure satisfies the constraints of the different hash
functions. For example, with the midsquare function, we
compute the hash address from the middle 11 bits; this can
address up to 2048 locations. Multiplying this number by
3/4 and adding 1 gives a maximum of 1537 hash locations.
In our experiments, each hash function was tested in turn,
and for each function the two collision-handling methods were
employed separately. This process was repeated for each of
the three data sets. For each data set, all the words were first
inserted into the table. Each word was then retrieved in turn,
summing up the total number of probes required to locate all
the words. From this figure, the average number of probes
was computed. The results are shown in Tables 1–3.
Table 2
Results for English names
Hash function Collision-handling method
Linear probing Separate chaining
Division 1.87 1.31
Multiplication 1.88 1.31
Midsquare 3.98 2.64
Radix 1.95 1.32
Random 1.20 1.14
Table 3
Results for computing words
Hash function Collision-handling method
Linear probing Separate chaining
Division 1.90 1.34
Multiplication 1.73 1.28
Midsquare 3.23 2.10
Radix 1.80 1.33
Random 1.20 1.14
Table 1
Results for Nigerian names
Hash function Collision-handling method
Linear probing Separate chaining
Division 2.50 1.36
Multiplication 1.81 1.33
Midsquare 2.84 1.77
Radix 2.00 1.32
Random 1.48 1.18
From Tables 1–3, it is apparent that the random method
of hashing requires the least number of probes on average to
locate a given key. It can thus be inferred that a good
pseudo-random number generator that can be seeded with a
key will be efficient. It is true, however, that it requires more
computational effort than some of the other hash functions.
We can conclude that the performances of the division,
multiplication and radix conversion methods are on a par. The
division method, however, has the advantage that it is
easiest to compute. These findings are consistent with those
of Lum [6] and Kohonen [5] in which they found that the
division and random methods give the best performance.
It is significant to note that with the chaining method of
collision-handling, the three data sets yield a more or less
uniform set of figures. Also, the hash functions have a more
impressive performance with chaining. Kohonen [5] also
found that separate chaining works best in most cases. With
this method of collision handling, it does not appear that the
characteristics of the Nigerian names are significantly
different from those of the other data types. This might be
because all the data sets are drawn from the same alphabet.
Further tests were conducted on these hash functions and
collision-handling methods using much larger dictionaries
containing mixed types of strings. The hash table size was
also proportionately increased so as to keep the load factor
at about 0.65 when the table is fully loaded, since this was
the factor used in the initial tests. The results were very
similar to what we have in Tables 1–3.
Also, a dictionary of long multi-word strings, with
lengths varying from 14 to 48 characters and averaging 27
per string, was used. Table 4 shows that the results are in line
with those for short single-word strings. This suggests that string
length is not necessarily a significant factor with respect to
the efficiency of hashing functions.
A major finding of these studies, therefore, is that we
need not think of using special hash functions to handle
Nigerian names.
5. Conclusion
These preliminary studies have shown that it might not
be necessary to devise special functions for hashing
Nigerian names. The performances of the common hash
functions for Nigerian names are comparable to those for
other words drawn from the same alphabet.
In addition, the superiority of hash functions such as the
random and division methods over others has been
confirmed. Chaining is also shown to be a good choice for
handling collisions. These findings will be of tremendous
help in our work of designing information-processing
systems.
References
[1] M.D. Brain, A.L. Tharp, Near-perfect hashing for large word sets,
Software: Practice and Experience 19 (1989) 967–978.
[2] C.C. Chang, A scheme for constructing ordered minimal perfect
hashing functions, Info. Sci. 39 (1986) 187–195.
[3] G.V. Cormack, R.N.S. Horspool, M. Kaiserswerth, Practical perfect
hashing, Comput. J. 28 (1985) 54–58.
[4] D.E. Knuth, The Art of Computer Programming, vol. 3: Sorting and
Searching, Addison-Wesley, Reading, MA, 1973, pp. 506–542.
[5] T. Kohonen, Content-Addressable Memories, Springer, Berlin, 1987,
pp. 39–100.
[6] V.Y. Lum, General performance analysis of key-to-address trans-
formation methods using an abstract file concept, Commun. ACM 16
(1973) 603–612.
[7] J.D. Maurer, T.G. Lewis, Hash table methods, Comput. Surv. 7 (1975)
5–19.
[8] N.E. Miller, File Structures Using Pascal, Benjamin/Cummings,
California, 1987, pp. 209–260.
[9] R. Morris, Scatter storage techniques, Commun. ACM 11 (1968)
38–44.
[10] R.W. Sebasta, M.A. Taylor, Minimal perfect hash functions for
reserved word lists, SIGPLAN Notices 20 (1985) 47–53.
[11] R. Sprugnoli, Perfect hashing functions: a single probe method for
static sets, Commun. ACM 20 (1977) 841–850.
[12] J.D. Ullman, A note on the efficiency of hashing functions, J. ACM 19
(1972) 569–575.
Table 4
Results for multi-word strings
Hash function Collision-handling method
Linear probing Separate chaining
Division 1.90 1.32
Multiplication 1.80 1.31
Midsquare 3.32 2.12
Radix 1.74 1.31
Random 1.21 1.15