664 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 42, NO 2, MARCH 1996

average delay in a network of queues is difficult to obtain; here we were able to use the fact that the service time is a constant to obtain the desired result. Our approach utilized i) a simple hypothetical priority discipline as a “shortcut” and ii) the queue-equivalence lemma of Morrison [10]. This result is applicable, in particular, to a packet network where data packets are of fixed duration. Of course, most networks do not have a simple tree topology and in general involve a more complicated structure, where some packets are allowed to bypass certain nodes and some packets exit the system before reaching the root of the tree. The solution developed here does not apply to these cases.

A useful extension of our approach would be along the lines considered in [3], which addresses a network of discrete-time queues where the different queues have constant, but different service times. Such a system, again, arises in a packet radio network where the data capacity of the network varies from link to link. The solution for the delay in such systems would be very useful because it would permit the exact computation of delay for more realistic network models and would facilitate the solution to the optimal capacity allocation problem. In fact, an approximate solution for the case of unequal capacities was obtained in [3] by modeling the output of each link as Bernoulli with rate equal to the utilization of that link. The results of this approximation compared favorably with simulation, especially for large networks.

REFERENCES

[1] L. Kleinrock, Communication Nets: Stochastic Message Flow and Delay. New York: McGraw-Hill, 1964.
[2] D. P. Bertsekas and R. Gallager, Data Networks. Englewood Cliffs, NJ: Prentice-Hall, 1987.
[3] E. Modiano, J. Wieselthier, and A. Ephremides, “An approach for the analysis of packet delay in an integrated mobile radio network,” in Proc. 1993 Conf. on Information Science and Systems, Baltimore, MD, Apr. 1993, p. 138.
[4] A. M. Viterbi, “Approximate analysis of time-synchronous packet networks,” IEEE J. Sel. Areas Commun., vol. SAC-4, pp. 879–890, Sept. 1986.
[5] J. A. Morrison, “Two discrete-time queues in tandem,” IEEE Trans. Commun., vol. COM-27, pp. 563–573, Mar. 1979.
[6] O. J. Boxma, “On a tandem queueing model with identical service times at both counters, I,” Adv. Appl. Prob., vol. 11, pp. 616–643, 1979.
[7] —, “On a tandem queueing model with identical service times at both counters, II,” Adv. Appl. Prob., vol. 11, pp. –59, 1979.
[8] H. Dupuis and B. Hajek, “A simple formula for mean delay for independent regenerative sources,” Queueing Syst., vol. 16, pp. 195–239, 1994.
[9] O. Kella, “Parallel and tandem fluid networks with independent Levy inputs,” Ann. Appl. Prob., vol. 3, pp. 682–695, 1993.
[10] J. A. Morrison, “A combinatorial lemma and its application to concentrating trees of discrete-time queues,” Bell Syst. Tech. J., vol. 57, no. 5, pp. 1645–1652, May–June 1978.
[11] G. D. Stamoulis and J. N. Tsitsiklis, “Efficient routing schemes for multiple broadcasts in hypercubes,” Rep. LIDS-P-1948, Laboratory for Information and Decision Systems, MIT, 1990.

Hashing of Databases Based on Indirect Observations of Hamming Distances

Vladimir B. Balakirsky

Abstract—We describe hashing of databases as a problem of information and coding theory. It is shown that the triangle inequality for the Hamming distances between binary vectors may essentially decrease the computational effort of a search for a pattern in a database. Introduction of the Lee distance in the space which consists of the Hamming distances leads to a new metric space where the triangle inequality can be effectively used.

Index Terms—Hashing, searching for patterns, coding, decoding.

I. INTRODUCTION

One of the important problems in computer science can be represented as follows: we are given a collection of items, and we wish to store these items and, upon demand, retrieve the items whose key values match given key values. A particular approach to the storage and retrieval problem is known as hashing: we use the key value of an item to compute an address for the storage of the item. Since the mapping between keys and addresses is not one-to-one, different keys may receive the same address; these events are known as collisions, and their resolution is of the main interest for a hashing scheme [1]–[4].

Similar ideas form the basis for the procedures referred to as external hashing [5]. It is known that the cost per storage unit increases as the access time decreases. Main memories usually have random access. Since their size is limited by cost requirements, databases are stored in the secondary memory with a rather slow access [6]. The hashing technique can essentially reduce the total number of accesses required if the values of a hash function applied to the key of each item of the database are stored in the main memory. To find the item with a certain key, we calculate the value of the hash function for that key and access only the items whose key values have the same value of the hash function. In the present correspondence, we address external hashing and formulate the following problem (to simplify formalization, the item and its key value are identified): we are given a collection of items, which are binary vectors of the same length stored in the external memory, and a part of random-access memory (RAM) that can be filled with the values of a hash function corresponding to each item. For a given pattern, which is a binary vector of the same length as the items, we must find all the items located at a Hamming distance of less than a fixed threshold value.
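The access-saving mechanism described above can be sketched in a few lines of Python; the toy hash function, the record layout, and the in-RAM index below are illustrative assumptions, not the construction developed later in this correspondence.

```python
# Sketch of external hashing with an in-RAM filter (exact-match case).
# Assumption: records are bit-strings; the "external memory" is a plain
# list standing in for slow storage, and we count how many records are
# actually touched.

def hash_value(record: str, h_bits: int = 8) -> int:
    """Toy hash function: fold the bits of the record into h_bits."""
    v = 0
    for i, bit in enumerate(record):
        v ^= int(bit) << (i % h_bits)
    return v

def search_exact(pattern, external_records, ram_hashes):
    """Access only records whose stored hash matches the pattern's hash."""
    target = hash_value(pattern)
    accesses = 0
    matches = []
    for j, hj in enumerate(ram_hashes):
        if hj == target:              # cheap comparison in RAM
            accesses += 1             # one expensive external access
            if external_records[j] == pattern:
                matches.append(j)
    return matches, accesses

records = ["0110", "1001", "0110", "1111"]
ram = [hash_value(r) for r in records]        # built at preprocessing time
found, cost = search_exact("0110", records, ram)
```

Only hash-equal records are fetched from the slow storage; any record with a colliding hash value would show up as an extra access resolved by the final comparison.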

We develop external hashing in two directions. First, the value of the threshold can be positive, which puts some restrictions on the hash functions allowed. This is because we want to get information about the Hamming distances between binary vectors based on a comparison of the values of the hash function corresponding to these vectors. To make it possible, we use the metric properties of the Hamming space and extend some approaches by Koshelev [7] to the

Manuscript received February 10, 1994; revised October 20, 1995. The work was supported by a Scholarship from the Swedish Institute, Stockholm, Sweden.

The author is with the Department of Information Theory, University of Lund, P.O. Box 118, S-22100 Lund, Sweden, on leave from The Data Security Association “Confident,” 193060 St. Petersburg, Russia.

Publisher Item Identifier S 0018-9448(96)01479-4.

0018-9448/96$05.00 © 1996 IEEE

problem of finding the neighboring points in random digital fields (these results are briefly stated in Section III). Second, we represent the hashing problem as a special case of the data compression problem and use some results of coding theory which lead to a hashing procedure with the minimal possible average number of accesses to the external memory when the threshold is equal to zero (the optimality of the algorithm for positive thresholds is an open problem).

The basic quantity we are interested in is the number of accesses to the external memory after the use of the data stored in RAM. The average of this quantity should be minimized over all possible databases and patterns. This means that the hashing procedure must be fixed when we are given a probability distribution on databases, rather than a particular database. Section III contains an example where conventional hashing schemes may lead to essentially different results for different probability distributions, and the best result is attained when each item is an independently chosen binary vector. Our hashing scheme gives the same result (in the asymptotic sense) for any probability distribution.

The hashing procedure introduced in this correspondence is parametrized by a binary matrix called a code. A random coding method [8], developed for the hashing scheme, leads to a theorem of existence type, i.e., we claim that there exists a code such that the average number of accesses to the external memory is upper-bounded by some expression. If the size of the database grows exponentially with the length of the items, then the size of a code grows slower than a linear function of that length. Therefore, we can find a good code much more easily than in a similar problem of coding theory, where the size of the code is usually assumed to be an exponential function of the code length.

From a formal point of view, the basic idea of the correspondence can be represented as the use of the Lee metric over the space consisting of the Hamming distances between binary vectors, since it gives an opportunity to organize an effective procedure for selecting the vectors which can be close to each other in the Hamming sense. Some properties of coverings of the Hamming space in the Lee metric, developed in this correspondence, can be useful for universal data compression with a fidelity criterion [8], [9].

The correspondence is organized as follows. In Section II, we formulate the mathematical problem and the main result. In Section III, we discuss possible indirect estimations of the Hamming distance between binary vectors. The hashing algorithm, based on these estimations, is described in Section IV. In Section V, we prove the main result. Section VI contains an example of the hashing procedure applied to the database consisting of the 50 names of the states of the U.S.A. Some illustrations of hashing methods and proofs of auxiliary results are given in the Appendix.

II. MATHEMATICAL MODEL FOR THE SEARCH PROBLEM AND FORMULATION OF RESULTS

Let X be a binary matrix of size M × L that will be referred to as the database (DB), and let x ∈ {0, 1}^L be a vector that will be referred to as the pattern. Let x_1, …, x_M be the rows of the matrix X, which will be referred to as the records of the DB. The problem is to form the set

J_T(x, X) = { j : d_H(x, x_j) ≤ T }   (2.1)

for given x, X, and a threshold value T, where d_H(·, ·) denotes the Hamming distance.

We assume that the DB is stored in the external memory of a computer, but a preprocessing of the DB is possible, and there are Ml bits of RAM available to store the results. The operations should be defined by a function f which maps binary vectors of length L to binary vectors of length l. The l-tuples f(x_1), …, f(x_M) contain

information about x_1, …, x_M, which may be used for selecting the numbers of the records that are close to a given pattern, using the Hamming distance as the measure of closeness. We can write the results of this selection via the values of a Boolean function φ_T, which is defined for all pairs (x, f(x_1)), …, (x, f(x_M)) and equal to 1 for the pairs (x, f(x_j)), j ∈ J_T(x, X). Thus we represent hashing of the DB as an assignment of the functions

f : {0, 1}^L → {0, 1}^l,   φ_T : {0, 1}^L × {0, 1}^l → {0, 1}   (2.2)

such that the following statement:

d_H(x, x_j) ≤ T  ⇒  φ_T(x, f(x_j)) = 1   (2.3)

is valid for all vectors x, x_j ∈ {0, 1}^L. The functions f and φ_T will be referred to as encoding and decoding, respectively. In general, the function f may depend on T, but we restrict our attention to the case when only the decoding depends on T.

Remarks: In computer science, the function f is referred to as a hash function, and hashing as an assignment of such a function. This terminology follows from the fact that if T = 0 then φ_0 = f^{-1}. In our case, T can be positive, and it is not evident how to assign φ_T. Note that similar problems arise in noiseless data compression and in data compression with a fidelity criterion [8], [9].

Using (2.3), we conclude that

J_T(x, X) ⊆ J(x, X | f, φ_T)   (2.4)

where

J(x, X | f, φ_T) = { j : φ_T(x, f(x_j)) = 1 }.   (2.5)

Therefore, the problem can be solved in two steps:
1) Calculate φ_T(x, f(x_1)), …, φ_T(x, f(x_M)) using the data stored in RAM, and construct the set J(x, X | f, φ_T).
2) Transmit the records x_j, j ∈ J(x, X | f, φ_T), from the external memory to RAM and construct the set J_T(x, X) after calculating the Hamming distances d_H(x, x_j) and comparing them with the threshold value T.
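The two steps can be sketched as follows, assuming (for illustration only) that the stored value f(x_j) is the Hamming weight of x_j and that the decoding accepts a record when the weights differ by at most T; this is a valid filter since |wt(x) − wt(x_j)| ≤ d_H(x, x_j).

```python
# Two-step search: (1) use RAM data to form a candidate set,
# (2) fetch candidates from external memory and verify exactly.
# Assumption: the hash is the record's Hamming weight; this is not the
# encoding constructed in Section IV, just a simple valid filter.

def d_H(a: str, b: str) -> int:
    return sum(ca != cb for ca, cb in zip(a, b))

def two_step_search(x, records, T):
    f = [r.count("1") for r in records]        # precomputed, kept in RAM
    wx = x.count("1")
    # Step 1: cheap filtering in RAM.
    candidates = [j for j in range(len(records)) if abs(wx - f[j]) <= T]
    # Step 2: exact verification touches only the candidates.
    J_T = [j for j in candidates if d_H(x, records[j]) <= T]
    return J_T, len(candidates) - len(J_T)     # result, "extra" fetches

db = ["00000", "00111", "11111", "10000"]
result, extra = two_step_search("00001", db, T=1)
```

Here one candidate survives the filter but fails verification; that surplus is exactly the quantity the correspondence sets out to minimize.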

The complexity of the procedure is determined by the number of computations at the first and second steps. However, supposing that the transmission plays the determining role, we characterize the effectiveness of the algorithm by the quantity

N_T(x, X | f, φ_T) = | J(x, X | f, φ_T) \ J_T(x, X) |
                   = Σ_{j=1}^{M} χ{ d_H(x, x_j) > T } · χ{ φ_T(x, f(x_j)) = 1 }.   (2.6)

Hereafter, χ denotes the indicator function: χ{β} = 1 if the statement β is true, and 0 otherwise. The transmissions counted in (2.6) can be interpreted as “extra” transmissions.

Suppose that the matrix X is chosen as a DB with probability P(X), and the vector x ∈ {0, 1}^L appears as a pattern with probability p(x | X). Then

N_T(p, P | f, φ_T) = Σ_x Σ_X p(x | X) · P(X) · N_T(x, X | f, φ_T)   (2.7)

is the average number of “extra” transmissions. We set the mathematical problem as the minimization of (2.7) for given p and P over all admissible pairs (f, φ_T). In other words, it is required to evaluate the following function:

N_T(p, P) = min_{f, φ_T} N_T(p, P | f, φ_T)   (2.8)

where the minimum is taken over all pairs (f, φ_T) satisfying (2.2) and (2.3). The aim of this correspondence is to propose an encoding and decoding method providing a sufficiently strong upper bound on the expression on the right-hand side of (2.7) when T is small compared to L^{1/2}. Obviously, the bound we will obtain is also an upper bound on N_T(p, P).

Note that the same problem can also be posed when the DB is fixed. This case is included in our considerations because there are no restrictions on P, which may be defined as 0 for all DB's that do not coincide with the fixed one. The problem we have formulated is more general since, in practice, the contents of the DB may be changed during operation.

Theorem: Let

M_t(X | x) = | { j : d_H(x, x_j) = t } |   (2.9)

be the number of records located at the Hamming distance t from a given pattern x, where t = 0, …, L. Then, for any even q, there exist an encoding f* and a decoding φ_T*, satisfying (2.2) and (2.3), such that

[bound (2.10), a sum over t = T + 1, …, L of terms involving M_t(X | x), is not legible in the source scan]

where

[definition (2.11) of the quantity P_L is not legible in the source scan]

and m is the largest integer such that

[condition (2.12) is not legible in the source scan].

Corollary: Let

T + 2 = (2πεL)^{1/2}

and

P(X) > 0 ⇒ x_i ≠ x_j for all i ≠ j.

[The remaining displayed equations (2.13)–(2.17) of the Corollary are not legible in the source scan; (2.15) is the condition on M that defines the parameter ε, and (2.16) is the resulting bound on the average number of “extra” transmissions.]

The proofs are given in Section V.

Discussion: 1) The key point of the statement above is the fact that the

expression on the right-hand side of (2.16) does not depend on p and P. Let us suppose that T = 0 and the pattern is one of the records of the DB, chosen in accordance with the uniform distribution. Suppose also that the first l bits of each record are stored in RAM. If all the records are random vectors whose components are independently chosen from {0, 1} with probability 1/2, then the average number of “extra” transmissions to RAM is equal to M·2^{-l}, and we cannot improve this result. Let us change the input data in the following way: suppose that x_1 is chosen as before, but all the other records are different random vectors chosen from the set consisting of the vectors x' such that d_H(x_1, x') ≤ εL, where the parameter ε is defined in (2.15); in computer science, such a construction is called a cluster [4]. Then the average number of “extra” transmissions to RAM is approximately equal to M·2^{-lH(ε)}, where H(ε) = −ε log₂ ε − (1 − ε) log₂ (1 − ε) is the binary entropy function (M ≈ 2^{LH(ε)}; “almost all” records and “almost all” patterns are located at a distance εL from x_1; and there are approximately 2^{lH(ε)} groups of records of size 2^{(L−l)H(ε)} having the same prefixes of length l). Thus the characteristics of the hashing algorithm can be essentially different in the first and second cases. Furthermore, it is not evident how to organize a hashing procedure in the second case to get M·2^{-l} as the number of “extra” transmissions, since the position of the cluster is a uniformly distributed random variable. The theorem above shows that such a procedure exists, and the proof gives a construction of the hash function.
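The gap between the two cases in this discussion can be made concrete numerically; the values of L, l, and ε below are arbitrary illustrative choices, not parameters from the correspondence.

```python
# Compare the average number of "extra" transmissions for the two input
# models: independent records (M * 2^-l) versus a cluster of radius
# eps*L around x_1 (approximately M * 2^(-l*H(eps))).
# Assumption: L, l, eps are illustrative values only.
import math

def H(p: float) -> float:
    """Binary entropy function H(p) = -p log2 p - (1-p) log2 (1-p)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

L, l, eps = 64, 16, 0.11
M = 2 ** round(L * H(eps))           # M ~ 2^{L H(eps)}, as in the text
independent = M * 2 ** (-l)          # prefix hashing, independent records
clustered = M * 2 ** (-l * H(eps))   # prefix hashing, clustered records

# The clustered case leaves far more "extra" transmissions:
ratio = clustered / independent      # equals 2^{l (1 - H(eps))} > 1
```

Since H(ε) < 1 for ε ≠ 1/2, the ratio grows exponentially in l, which is exactly why a fixed prefix-based scheme cannot handle the clustered distribution well.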

2) The proof of the Theorem shows that in order to calculate f*(x_j) we find the Hamming distances between x_j and l/log₂ q₀ binary vectors that are chosen in a special way, and then find the Lee distances between these numbers and l/log₂ q₀ integers stored in RAM; q₀ = (2πεL)^{1/2}. To calculate φ_T*(x, f*(x_j)), we compare the Lee distances with the threshold value T. The complexity of these operations and the required size of RAM may be neglected in comparison with the computational effort needed to access the external memory and transmit a record to the RAM.

3) A simple generalization of the proof of the Theorem allows us to claim that there exists the same pair (f*, φ_T*) providing the bound (2.10) multiplied by L^{T/2}, for all T < L^{1/2}. This increment has little practical effect on the exponent of the bound when L is large enough.

III. ESTIMATIONS OF THE HAMMING DISTANCE BETWEEN BINARY VECTORS BASED ON INDIRECT OBSERVATIONS

Let x, x_j, and c be binary vectors of length L. Then, due to the triangle inequality for the Hamming distance, we can write

d_H(x, x_j) ≥ | d_H(c, x) − d_H(c, x_j) |   (3.1)

and

| d_H(c, x) − d_H(c, x_j) | > T  ⇒  d_H(x, x_j) > T.   (3.2)

Hence, we can select a binary vector c and assign the encoding f(x_j) as the binary representation of d_H(c, x_j), and the decoding as

φ_T(x, f(x_j)) = χ{ | d_H(c, x) − d_H(c, x_j) | ≤ T }.
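A sketch of this single-reference-vector filter; the vectors and threshold below are illustrative choices.

```python
# Indirect filtering via the triangle inequality (3.1)-(3.2):
# store only d_H(c, x_j) for each record; at query time, any j with
# |d_H(c, x) - d_H(c, x_j)| > T is rejected without touching the record.

def d_H(a: str, b: str) -> int:
    return sum(ca != cb for ca, cb in zip(a, b))

def survivors(x, stored_dists, c, T):
    dx = d_H(c, x)
    return [j for j, dj in enumerate(stored_dists) if abs(dx - dj) <= T]

c = "0000"
db = ["0011", "1111", "0001", "1110"]
stored = [d_H(c, r) for r in db]          # precomputed at preprocessing time
keep = survivors("0011", stored, c, T=1)  # candidate indices

# Soundness check: every record within distance T survives the filter.
true_close = [j for j, r in enumerate(db) if d_H("0011", r) <= 1]
```

The filter never loses a true match (that is exactly (3.2) read in the contrapositive); it may keep false candidates, which the exact verification step then removes.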

This hashing method is illustrated in Appendix II. If c ∈ {0, 1}^L is chosen at random in such a way that every

component takes values 0 and 1 independently of the other components with probability 1/2, then

Pr{ d_H(c, x_j) = d_j | x_j } = B_L(d_j)   (3.3)

where

B_L(d_j) = C(L, d_j) · 2^{−L},  for all d_j = 0, …, L   (3.4)

is the binomial distribution. It is well known [8] that the entropy of this distribution is equal to

H(B_L) = − Σ_{d=0}^{L} B_L(d) log₂ B_L(d) ≈ log₂ √(πeL/2).

This means that if we observe a sequence of m independent random variables such that every component is distributed in accordance

with the binomial distribution, and m is large enough, then, roughly speaking, it is sufficient to have approximately m·H(B_L) bits to store this sequence. In our case, we can, in principle, get this sequence by selecting m ≤ l / log₂ (L + 1) vectors c_1, …, c_m at random and assigning the encoding of x_j as the binary representation of d_H(c_1, x_j), …, d_H(c_m, x_j), but m cannot be large compared to L. Besides, we need to provide the property that any record of the DB which is close to a given pattern is not missed at the first step of the decoding procedure.

Discussion: If T = 0, then any function f transforming binary vectors of length L to binary vectors of length l may be used for encoding. The decoding then consists of calculating the value of this function for a given pattern and selecting all records of the DB which have the same value. However, if T > 0, then by using a function which is a distance (i.e., for which the triangle inequality is valid), we can decrease the computational effort compared to an arbitrary function. The distance function, as follows from our results, can also be effectively used when T = 0. The triangle inequality is known as a tool which allows us to essentially reduce the number of measurements of the Hamming distances in the problem of finding the neighboring points in random digital fields [7]. This problem can be represented as follows. Suppose we are given M binary vectors x_1, …, x_M of length L, and we want to find all the pairs (i, j) such that d_H(x_i, x_j) ≤ T, where T is a given constant. The direct solution of this problem requires M(M − 1)/2 measurements. However, as was shown in [7], we can fix m = O(ln M / ln α) and measure mM distances d_H(x_i, x_k), i = 1, …, M, k = 1, …, m, i.e., we fill in the first m columns and the first m rows of the matrix consisting of the Hamming distances. Using the triangle inequality, we can exclude a number of positions of that matrix, since we definitely know that the corresponding vectors are at a distance greater than T. The number of positions where we cannot come to such a conclusion is O(M ln M). These considerations are valid if T ≤ m, where α is defined by the equation 1 − H(α) = log₂ M / L, and when we are interested in the average characteristics taken over the ensemble of independent and uniformly distributed vectors x_1, …, x_M.

Our considerations can be presented as an extension of the approach of [7] in two directions: 1) we are also interested in the uniform distribution of the integers which are the elements of the first m columns and the first m rows of the matrix of distances; 2) we are dealing with arbitrary probability distributions on the collections of the vectors x_1, …, x_M.

The idea of introducing the Lee distance in our problem is based on the observation that pairs of random binary vectors of length L are located at a Hamming distance close to L/2. Therefore, we expect that a direct use of (3.1) to restrict the domain of the search leads to a nonoptimal tradeoff between the number of “extra” transmissions of records from the external memory and the size of RAM required to store the results of preprocessing.

Let q be an integer, and let d_H^q(c, x_j) denote the remainder of the division of d_H(c, x_j) by q. Then

0 ≤ d_H^q(c, x_j) ≤ q − 1

and

d_H(c, x_j) = d_H^q(c, x_j) + sq  for some s ∈ {0, 1, …}.

Therefore

| d_H(c, x) − d_H(c, x_j) | ≥ d_Lee(d_H(c, x), d_H^q(c, x_j))   (3.5)

where

d_Lee(d_H(c, x), d_H^q(c, x_j)) = min_{s = 0, 1, …} | d_H(c, x) − (d_H^q(c, x_j) + sq) |   (3.6)

is the Lee distance between d_H(c, x) and d_H^q(c, x_j). However, in accordance with (3.1), the expression on the left-hand side of (3.5) is a lower bound on d_H(x, x_j). Thus

d_H(x, x_j) ≥ d_Lee(d_H(c, x), d_H^q(c, x_j)).   (3.7)

This means that we can assign an encoding and decoding, satisfying (2.2) and (2.3), based on the Lee distance. To store a value of the Lee distance, we need only log₂ q bits instead of log₂ (L + 1) bits for a value of the Hamming distance, and this hashing method may lead to a more effective tradeoff between the accuracy of a lower bound on d_H(x, x_j) and the required size of RAM. We will examine this point in the next sections, where the properties of the Lee distance given below will be used.

Lemma 1: If q is even, then the following inequality:

[inequality (3.8), not legible in the source scan]

is valid for any d ≥ 0 and any x_j ∈ {0, 1}^L, where P_L is defined in (2.11).

Lemma 2: Let x, x_j ∈ {0, 1}^L be chosen in such a way that

d_H(x, x_j) = t,

q is even, and k < q/2. If t and k are both even or both odd, then

[equation not legible in the source scan].

Otherwise

[equation not legible in the source scan].

The proofs of Lemmas 1 and 2 are given in Appendix I.

Discussion: Let x_j be given and let c be chosen at random in accordance with the uniform distribution. Let us calculate the Hamming distance d = d_H(c, x_j) and generate a new random variable d*, which is equal to 0 if d is even and equal to 1 if d is odd, i.e., d* = d mod 2. It is well known that if L is odd, then

Σ_{d even} B_L(d) = Σ_{d odd} B_L(d) = 1/2.

The sums in this equation are the probabilities that d* = 0 and d* = 1. Thus we obtain a uniformly distributed bit. To get good characteristics of the hashing algorithm, we need to keep this property and reduce d modulo q > 2, since we want to use the triangle inequality and estimate the Hamming distance between x_j and x when T > 0. The results of Lemmas 1 and 2 show that this is possible if

q < (3 P_L^{−1})^{1/2}.

Note also that the probability written on the left-hand side of (3.8) will be obtained if we pick up every qth term of the binomial distribution B_L(d_j) when the first term is equal to d_j^q. This procedure is illustrated in Fig. 1.

Fig. 1. The probability on the left-hand side of (3.8) is equal to the sum of the ordinates of the impulses; d_j = d_j^q defines the first impulse.
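The picking-every-qth-term computation behind Fig. 1 can be checked numerically; L and q below are illustrative.

```python
# Probability that d_H(c, x_j) mod q equals residue r when c is uniform
# on {0,1}^L: the sum of every q-th term of the binomial distribution
# B_L(d), starting at d = r. For moderate q each residue gets close to 1/q.
from math import comb

def residue_prob(L: int, q: int, r: int) -> float:
    return sum(comb(L, d) for d in range(r, L + 1, q)) / 2 ** L

L, q = 64, 4
probs = [residue_prob(L, q, r) for r in range(q)]
```

For L = 64 and q = 4, every residue probability lands within about 10⁻⁹ of 1/4; this near-uniformity of the reduced distance is what the Lemmas quantify.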

IV. HASHING ALGORITHM

Let us assign an even q and an integer m satisfying (2.12). Besides, let us assign binary vectors c_1, …, c_m ∈ {0, 1}^L and assume that they are stored in RAM. Let d_H^q(c_i, x_j) be the Hamming distance between c_i and x_j, reduced modulo q, and let

f(x_j) = ( d_H^q(c_1, x_j), …, d_H^q(c_m, x_j) ),
φ_T(x, f(x_j)) = χ{ d_Lee(d_H(c_i, x), d_H^q(c_i, x_j)) ≤ T for all i = 1, …, m }.   (4.1)

Using (3.7), we conclude that this pair (encoding, decoding) satisfies (2.2) and (2.3). The hashing algorithm is illustrated in Appendix II.
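Putting the pieces together, here is an end-to-end sketch of the procedure with m reference vectors; the parameters and the choice of c_1, …, c_m are illustrative (taken from the data itself rather than optimized), and the encoding/decoding follow our reading of (4.1).

```python
# End-to-end sketch: keep d_H(c_i, x_j) mod q in RAM (m*log2(q) bits per
# record); a record survives decoding only if every Lee distance is <= T;
# survivors are fetched from "external memory" and verified exactly.
import random

def d_H(a, b):
    return sum(ca != cb for ca, cb in zip(a, b))

def d_lee(full, reduced, q):
    best, s = None, 0
    while True:
        v = abs(full - (reduced + s * q))
        best = v if best is None else min(best, v)
        if reduced + s * q > full:
            return best
        s += 1

def preprocess(records, refs, q):
    """RAM table: reduced distances from each record to each reference."""
    return [[d_H(c, r) % q for c in refs] for r in records]

def search(x, records, refs, ram, q, T):
    dx = [d_H(c, x) for c in refs]
    cand = [j for j, row in enumerate(ram)
            if all(d_lee(dx[i], row[i], q) <= T for i in range(len(refs)))]
    hits = [j for j in cand if d_H(x, records[j]) <= T]  # external accesses
    return hits, len(cand)

random.seed(1)
L, q, T = 32, 6, 1
records = ["".join(random.choice("01") for _ in range(L)) for _ in range(200)]
refs = records[:3]                       # illustrative reference vectors
ram = preprocess(records, refs, q)
hits, fetched = search(records[0], records, refs, ram, q, T)
```

By (3.7) the filter is sound: every record within distance T of the pattern is fetched, and the surplus fetched − len(hits) is the number of “extra” transmissions in the sense of (2.6).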

V. PROOF OF THE THEOREM AND COROLLARY

Let (f"? &) be a pair (encoding, decoding) defined in (4.1), which leads to the minimal average number of "extra" transmissions, where the minimum is taken on all binary vectors cl , . . . , cm E (0, l }L , i.e.

Let us introduce the code ensemble consisting of all collections of m binary vectors of length L, such that each component of each vector is an independent binary variable, equal to 0 or 1 with probability 1/2. Then we can use the random coding arguments [8] and claim that the minimal value of the expression on the right-hand side of (5.1), taken over all possible vectors c_1, ..., c_m, is not greater than the

This inequality coincides with (2.10), and the Theorem has been proved.

To prove the Corollary we minimize the expression on the right-hand side of (2.10) over all integers M_t(X|x) satisfying the restrictions

Σ_{t} M_t(X|x) = M.

Hence, if M satisfies (2.15), then the numbers M_t(X|x) should be as large as possible for t = 0, ..., εL, and all the other numbers should be equal to zero. Therefore

≤ M · (T + 2)^m · εL · (1/q + B_L(L/2))^m. (5.2)

Let us assign q = m. Then 2/q ≥ Pr{·} and, using (2.13) and (5.2), we write

where exp_2 z = 2^z for all z. Using (2.17) we conclude that the expression on the right-hand side of (5.3) coincides with (2.16).


VI. AN EXAMPLE OF THE HASHING PROCEDURE

Let us suppose that the DB consists of the binary representations (in ASCII codes) of the names of the 50 states of the U.S.A., written in capital letters. Then M = 50 and L = 13 · 8 = 104, since 13 is the maximal length of the names (for example, MASSACHUSETTS) and we assume that gaps are added to each name of length less than 13.
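The padding step can be sketched as follows (a few sample names only; blanks pad each name to 13 characters, so every record occupies 13 · 8 = 104 bits):

```python
WIDTH = 13  # length of the longest name, e.g. MASSACHUSETTS

def record_bits(name):
    # Pad with blanks to 13 characters, then take the 8-bit ASCII codes.
    padded = name.ljust(WIDTH)
    return ''.join(format(ord(ch), '08b') for ch in padded)

for name in ["OHIO", "MASSACHUSETTS", "NEW YORK"]:
    bits = record_bits(name)
    print(name, len(bits))  # every record is 104 bits long
```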

This example was considered in [10], where a hashing procedure that assigns a unique integer belonging to the set {1, ..., 53} to each name was proposed. The procedure is based on the observation that it is possible to pick up two letters, k_1 and k_2, from each name in such a way that all the pairs (k_1, k_2) are different. An integer corresponding to the pair (k_1, k_2) is assigned as v_1(k_1) + v_2(k_2),

where v_1 and v_2 are integer-valued functions constructed in a special way. The letters k_1 and k_2 were chosen from a name of length r in accordance with the following algorithm:

1) if r = 9, then k_1 is the last letter and k_2 is the third letter;
2) if r ≠ 9, then k_1 is the first letter and k_2 is the (r − ⌊r/2⌋)th letter, where ⌊r/2⌋ denotes the maximal integer not exceeding r/2.

A generalization of this procedure to other DB's seems to be difficult.
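The letter-selection rule above can be sketched as follows (the integer-valued tables v_1 and v_2 of [10] are not reproduced here, so only the pair (k_1, k_2) is computed):

```python
def pick_letters(name):
    # Letter-pair selection rule of [10]; `name` carries no padding.
    r = len(name)
    if r == 9:
        k1, k2 = name[-1], name[2]              # last letter, third letter
    else:
        k1, k2 = name[0], name[r - r // 2 - 1]  # first, (r - floor(r/2))th
    return k1, k2

print(pick_letters("TENNESSEE"))  # length 9: last and third letters
print(pick_letters("OHIO"))       # length 4: first and second letters
```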

Let us suppose that l = 8, i.e., one byte can be used to store a representative of each name in RAM. If we assign two codewords of length 104, then this byte can be constructed in such a way that the 4kth, ..., (4k + 3)rd bits contain the Hamming distance between the element of the DB and the kth codeword, reduced modulo 16, where k = 0, 1. For example, the codewords can be given by the matrix at the bottom of this page (we use hexadecimal notation).
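The packing of the two reduced distances into one byte can be sketched as follows. The 104-bit strings below are illustrative stand-ins, not the paper's record or codewords, and the nibble ordering is an assumption of this sketch:

```python
def hamming_bits(a, b):
    # Hamming distance between two equal-length bit strings
    return sum(x != y for x, y in zip(a, b))

def byte_signature(record_bits, codeword_bits):
    # Bits 4k..4k+3 of the byte hold d_H(record, c_k) mod 16, k = 0, 1.
    sig = 0
    for k, c in enumerate(codeword_bits):
        d = hamming_bits(record_bits, c) % 16
        sig |= d << (4 * k)
    return sig

# Illustrative 104-bit strings (not the paper's actual data)
rec = '01' * 52   # distance 52 to c0 -> 52 mod 16 = 4
c0 = '0' * 104
c1 = '10' * 52    # distance 104 to rec -> 104 mod 16 = 8
print(format(byte_signature(rec, (c0, c1)), '08b'))  # -> 10000100
```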

Then we obtain different bytes for 42 names and three groups of names having the same byte. These groups are as follows:


{ALASKA, MONTANA, NEW YORK}, {HAWAII, INDIANA, PENNSYLVANIA}, {ALABAMA, TENNESSEE}.

If we assume that each name appears as a pattern with probability 1/50 and we want to find this name in the DB (the case T = 0), then the average number of "extra" transmissions to RAM is equal to

2 · 3/50 + 2 · 3/50 + 1 · 2/50 = 14/50.
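The arithmetic can be checked directly: under the uniform-pattern assumption, a name whose byte is shared by g names costs g − 1 extra transmissions, so a collision group of size g contributes (g − 1) · g/50 on average:

```python
from fractions import Fraction

M = 50
group_sizes = [3, 3, 2]  # the three groups of names sharing a byte

# Each of the g names in a group needs g - 1 "extra" transmissions,
# and each name is queried with probability 1/M.
extra = sum(Fraction(g - 1, 1) * Fraction(g, M) for g in group_sizes)
print(extra)  # 14/50 in lowest terms, i.e. 7/25
```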

Since 211 bytes are not values of our hash function, we can use the triangle inequality to correct some misprints in the patterns (the case T > 0). Note that we also need a "good" code, which assigns a byte to each letter, to realize this possibility in an effective way.

The matrix C, given above, is a result of the simulation, where we examined different matrices, chosen at random, and selected the matrices that provide a minimal average number of "extra" transmissions. The estimate of the average value of this number, taken over the matrices we have examined, is equal to 0.60, while the estimate of the square root of the variance is equal to 0.15. We assume that if the size of the DB is large enough, then the average number of "extra" transmissions is a very stable random variable in the code ensemble, and some analytical upper bounds on the moments of this variable confirm the assumption. A more detailed discussion of this point will be presented in the future.
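The random search described above can be sketched as follows. The database, sizes, and number of candidate matrices here are toy choices of this sketch; the actual experiment used the 50 state names, two 104-bit codewords, and q = 16:

```python
import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def avg_extra(db, codewords, q):
    # Average number of "extra" transmissions for a uniform query:
    # a record whose signature is shared by g records costs g - 1 extras.
    counts = {}
    for x in db:
        sig = tuple(hamming(c, x) % q for c in codewords)
        counts[sig] = counts.get(sig, 0) + 1
    return sum(g * (g - 1) for g in counts.values()) / len(db)

random.seed(2)
L, m, q, M = 24, 2, 16, 30  # toy parameters
db = [tuple(random.randint(0, 1) for _ in range(L)) for _ in range(M)]

# Examine 200 random candidate matrices and keep the best one.
best = min(
    ([tuple(random.randint(0, 1) for _ in range(L)) for _ in range(m)]
     for _ in range(200)),
    key=lambda cw: avg_extra(db, cw, q),
)
print(avg_extra(db, best, q))
```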

Fig. 2. Disposition of the points ..., d^(−1), d^(0), d^(1), ... and the corresponding impulses at the binomial curve; the impulses at the points L/2 ± iq/2, i = 0, 1, ..., are marked by stars.

APPENDIX I

Proof of Lemma 1

Suppose that L is even, and denote the elements of the residue class [d_j^*] by d^(s), s = 0, ±1, ..., in such a way that (Fig. 2)

d^(0) ∈ [L/2 − q/2, L/2 + q/2)

and d^(s) = d^(0) + sq for all s = ±1, ±2, .... Then we can write

Pr{d_H^q(c, x_j) = d_j^* | x_j} = Σ_s B_L(d^(s)). (A1)

Suppose that d^(0) ≥ L/2. Then

B_L(L/2 + q/2 + sq) ≤ B_L(d^(s)) ≤ B_L(L/2 + sq)   for all s = 0, 1, 2, ...

and

B_L(L/2 + sq) ≤ B_L(d^(s)) ≤ B_L(L/2 + q/2 + sq)   for all s = −1, −2, .... (A2)

Using (A1) and (A2), we conclude that

b ≤ Pr{d_H^q(c, x_j) = d_j^* | x_j} ≤ b + B_L(L/2) (A3)

where

b = Σ_{s>0} B_L(L/2 + sq/2).

Inequalities (A3) follow from the identities

B_L(L/2 + sq/2) = B_L(L/2 − sq/2),
B_L(L/2 + q/2 + sq) = B_L(L/2 − q/2 − sq)

for all s = −1, −2, ....

It is easy to see that inequalities (A3) are also valid for d^(0) < L/2. Furthermore, they are valid for all d_j^* ≥ 0. Since

C = ( A1 C6 87 B4 DD 52 23 20 D9 9E 7F 4C 95
      AA 9B 38 11 76 77 E4 4D 02 13 50 49 4E )


Proof of Lemma 2: Let I be the set that consists of all indices at which x differs from x_j, and let

Using the first inequalities in (A3), we obtain

b < 1/2.

X =
  x_1 : 0 0 0 0 0 0 0 0
  x_2 : 0 1 1 0 1 0 1 1
  x_3 : 0 1 1 1 0 0 0 0
  x_4 : 1 0 0 0 0 0 0 0

APPENDIX II
ILLUSTRATIONS OF HASHING METHODS

Let

Σ_{d=0}^{t} χ{d_Lee(d, t − d) = k}.

If

d_Lee(d, t − d) = k

then there exists an integer s such that

|d − (t − d + sq)| = k

and either

2d = t − k + sq (A4)

or

2d = t + k + sq. (A5)

Then

The values of the hash function can be defined as binary representations of these integers.

Let

z = (0, 0, 0, 0, 0, 0, 0, 1).

The vector z differs from all rows of the matrix X, and it differs from the row x_1 in exactly one bit. Therefore, using (2.1) we write

J_0(z, X) = ∅,  J_1(z, X) = {1}.

Inequalities (A4) and (A5) may be satisfied if and only if (iff) t and k are both even or both odd. Then

Pr{d_Lee(d_H(c, x), d_H^q(c, x_j)) = k | x, x_j}
  = Pr{d_H(c_I, x_I) ∈ [(t − k)/2]_{q/2} | x_I}
  + Pr{d_H(c_I, x_I) ∈ [(t + k)/2]_{q/2} | x_I}

where [i]_{q/2} denotes the residue class modulo q/2, i.e., [i]_{q/2} consists of the integers i* + sq/2, s = 0, 1, ..., where i* ∈ [0, q/2) is assigned in such a way that i − i* = sq/2 for some integer s. Therefore, substitution of the leaders of the residue classes [(t + k)/2]_{q/2} and [(t − k)/2]_{q/2} for d_j^*, t for L, and q/2 for q in the result of Lemma 1 completes the proof.
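Assuming the standard Lee-distance definition on residues modulo q, the parity condition used before (A4) and (A5) can be checked numerically; the choices q = 4 and t = 6 below are illustrative:

```python
def lee_distance(a, b, q):
    # Lee distance between residues a and b modulo q
    diff = (a - b) % q
    return min(diff, q - diff)

q, t = 4, 6  # q even, as required throughout the paper

# d_Lee(d, t - d) = k is solvable only when t and k have the same
# parity, as used before (A4) and (A5): here t is even, so odd k
# yields an empty solution set.
for k in range(q // 2 + 1):
    ds = [d for d in range(t + 1) if lee_distance(d % q, (t - d) % q, q) == k]
    print(k, ds)
```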

Since d_H(c_1, z) = 1, to select the rows that can coincide with z, we find the indices j such that

d_H(c_1, x_j) = 1

and to select the rows that can differ from z in one bit, we find the indices j such that

d_H(c_1, x_j) ∈ {0, 1, 2}.

Then we will construct the sets

J_0(z, X | f, φ_0) = {4},  J_1(z, X | f, φ_1) = {1, 4}

and

N_0(z, X | f, φ_0) = 1 − 0 = 1,  N_1(z, X | f, φ_1) = 2 − 1 = 1.

The hashing method that includes the previous one as a special case consists in the assignment of integers m and q, connected by (2.12), the assignment of a matrix C of size m × L, and the calculation of d_H^q(c_i, x_j), where i = 1, ..., m, j = 1, ..., M. Let

m = 2 , q = 4

and


Then

f(x_1) = 0 0 0 1
f(x_2) = 0 1 1 0
f(x_3) = 1 1 1 0
f(x_4) = 0 1 1 0

and

The values of the hash function are binary representations of these integers, so that two bits are used to store each integer.

Then

(d_H(c_1, z), d_H(c_2, z)) = (1, 4)

(d_H^q(c_1, z), d_H^q(c_2, z)) = (1, 0)

and to construct the set J_0(z, X) we select the indices j such that

d_H^q(c_1, x_j) = 1,  d_H^q(c_2, x_j) = 0

and to construct the set J_1(z, X) we select the indices j such that

d_H^q(c_1, x_j) ∈ {1 − 1 = 0, 1, 1 + 1 = 2},  d_H^q(c_2, x_j) ∈ {0 − 1 + 4 = 3, 0, 0 + 1 = 1}.

It leads to the following equations:

J_0(z, X | f_C, φ_0) = {2, 4} ∩ ∅ = ∅,  J_1(z, X | f_C, φ_1) = {1, 2, 4} ∩ {1} = {1}

and

Note that the last hashing algorithm leads to the assignment of the same value of the hash function to the rows x_2 and x_4, while the Hamming distance between these rows is equal to 6.

ACKNOWLEDGMENT

The author wishes to thank Prof. V. Koshelev for the support and encouragement. The author also wishes to thank Prof. R. Ahlswede and Dr. G. Lindh for interesting discussions, as well as anonymous reviewers for helpful comments.

REFERENCES

[1] D. E. Knuth, The Art of Computer Programming, vol. 3, Sorting and Searching. New York: Addison-Wesley, 1973.

[2] H. R. Lewis and L. Denenberg, Data Structures & Their Algorithms. New York: Harper Collins, 1991.

[3] J. S. Vitter and W.-C. Chen, Design and Analysis of Coalesced Hashing. New York: Oxford Univ. Press, 1987.

[4] G. D. Knott, “Hashing functions,” Comput. J., vol. 18, no. 3, pp. 265-278, Aug. 1975.

[5] G. H. Gonnet and P.-A. Larson, "External hashing with limited internal storage," J. ACM, vol. 35, no. 1, pp. 161-184, Jan. 1988.

[6] J. P. Hayes, Computer Architecture and Organization, 2nd ed. New York: McGraw-Hill, 1988.

[7] V. N. Koshelev, “Computational aspects of the triangle axiom,” Cybern. Comput. Technique, no. 4. Moscow: Nauka, 1988, pp. 204-216 (in Russian).

[8] R. G. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968.

[9] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.

[10] C. C. Chang, C. Y. Chen, and J. K. Yan, "On the design of a machine-independent perfect hashing scheme," Comput. J., vol. 34, no. 5, pp. 469-474, Oct. 1991.

Approaching Capacity of a Continuous Channel by Discrete Input Distributions

Heinrich Schwarte, Member, IEEE

Abstract-In this paper memoryless channels with general alphabets and an input constraint are considered. Sufficient conditions are given under which channel capacity can be approached by discrete input distributions or by uniform input distributions with finite support. As an example, the Additive White Gaussian Noise channel is considered.

Index Terms-Continuous channel, channel capacity, input distributions.

I. INTRODUCTION Consider the Additive White Gaussian Noise (AWGN) channel

with signal-to-noise ratio Γ. Given an input symbol x ∈ R, the channel output Y_x is defined by Y_x = x + N, where N is a Gaussian random variable with zero mean and unit variance. As is well known, the capacity C = sup{I(X, Y_X): X random variable independent of N, EX² ≤ Γ} of this channel equals C = I(G, Y_G) = (1/2) log (1 + Γ). Here G denotes a Gaussian random variable independent of N with zero mean and variance Γ. Furthermore, I(X, Y_X) < C if X is not Gaussian and EX² ≤ Γ. Recently, Sun and van Tilborg [1] constructed a sequence (X_n)_{n∈N} of random variables with finite support and uniform distribution on the support set such that EX_n² ≤ Γ and I(X_n, Y_{X_n}) → C for this channel. Whereas one expects to be able to approach channel capacity by distributions with finite support (this holds by the very definition for the capacity of channels with infinite input alphabet given by Gallager [2, p. 324] and Augustin [3]), the authors' observation may seem surprising.
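Numerically, the capacity formula can be evaluated as follows (natural logarithms are assumed here, so the value is in nats; base 2 would give bits):

```python
import math

def awgn_capacity(snr):
    # C = (1/2) * log(1 + SNR) for the AWGN channel, in nats
    return 0.5 * math.log(1.0 + snr)

# At unit SNR the capacity is (1/2) log 2 nats.
print(round(awgn_capacity(1.0), 4))
```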

In this correspondence, we extend Sun and van Tilborg's result to a much broader class of channels. The setup is as follows. We consider memoryless channels with general alphabets in separable metric spaces, possibly with an input constraint. The channels are assumed continuous in the sense that output distributions corresponding to

Manuscript received July 14, 1994; revised September 5, 1995. The material in this paper was presented in part at the International Symposium on Information Theory and its Applications, Sydney, Australia, Nov. 20-24, 1994.

The author was with the Institute for Experimental Mathematics, University of Essen. He is now with the Department of Mathematics, University of Essen, 45117 Essen, Germany.

Publisher Item Identifier S 0018-9448(96)01028-0.

0018-9448/96$05.00 © 1996 IEEE