
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 6, NO. 2, APRIL 1994 341


G-Tree: A New Data Structure for Organizing Multidimensional Data

Akhil Kumar


Abstract-This paper describes an efficient data structure called the G-tree (or grid tree) for organizing multidimensional data. The data structure combines the features of grids and B-trees in a novel manner. It also exploits an ordering property that numbers the partitions in such a way that partitions that are spatially close to one another in a multidimensional space are also close in terms of their partition numbers. This structure adapts well to dynamic data spaces with a high frequency of insertions and deletions, and to nonuniform distributions of data. We demonstrate that it is possible to perform insertion, retrieval, and deletion operations, and to run various range queries efficiently using this structure. A comparison with the BD tree, zkdb tree, and KDB tree is carried out, and the advantages of the G-tree over the other structures are discussed. The simulated bucket utilization rates for the G-tree are also reported.

Index Terms-Data structure, multidimensional data, B-tree, KDB tree, BD tree, grid files, G-tree, bucket utilization, range queries.

I. INTRODUCTION

B-trees are an efficient data structure for indexing one-dimensional data. However, they are not suitable for indexing multidimensional data. A query such as "Find all employees who are between 30 and 35 years old and earn a salary between 100K and 120K" is an example of a two-dimensional range query. One way of answering such a query is by maintaining one-dimensional indexes on age and salary, and using one or both indexes to narrow down the set of relevant employee records. Another way is to maintain a composite index on age and salary.

None of these solutions is, however, efficient, and therefore specialized structures are required to handle multidimensional queries. One approach for organizing multidimensional data consists of having multidimensional binary trees [1] (see also [14] for an excellent coverage of various data structures described in this paper). However, this is a main memory resident structure, not suitable for storing on disk. Two useful disk-based data structures for organizing multidimensional data are KDB trees and grids.

KDB trees [12] successively divide the entire k-dimensional data space into smaller k-dimensional regions or hyper-rectangles. The regions are organized into a tree similar to a B-tree. The nonleaf pages of the B-tree are called region pages, while the leaf pages are called point pages because the entries in the leaf level of this tree point to the data tuples. Each point page corresponds to one disk page or a bucket. The entries in the nonleaf pages of the tree consist of nonintersecting k-dimensional regions, and pointers to lower level pages that contain subregions within that region.

Grid files [8], [7] partition a data space into a grid structure by splitting each dimension into several nonuniformly spaced intervals. Such a grid may be viewed as an n-dimensional array where a block or element in the grid is defined by identifying its index along each dimension. Each grid element points to a disk bucket or disk page. More than one grid element can point to the same disk bucket if the total number of points contained in them together is less than the capacity of the disk page. A directory is maintained in order to determine the physical location of each grid element. The problem with this structure is that directory maintenance can be very expensive, and if the data is nonuniform, the size of the directory can be very large.

Manuscript received November 9, 1990; revised May 26, 1992. This work was supported in part by the NSF under Grant RI-9110880. The author is with the Graduate School of Management, Cornell University, Ithaca, NY 14853. IEEE Log Number 9212684.
1041-4347/94$04.00 © 1994 IEEE

Another technique for organizing multidimensional data is multiattribute hashing [13]. The basic idea is to concatenate several attribute values in the construction of a hash key, and to store the data points in a hash table. This technique works well for exact match queries but is not suitable for range queries. Other hashing-based methods, such as those presented in [2] and [11], have similar shortcomings.

This paper describes a new data structure called the G-tree that combines features of both grids and B-trees in a novel manner. It is called a G-tree (or grid tree) because the multidimensional space is divided into a grid of variable size partitions, and the partitions are organized into a B-tree. The data structure that comes closest to our proposed structure is the BD tree [4], [5]. (The BANG file [6] is a variant of the BD tree.) Both methods employ a variable-length partition numbering scheme. The partitions in the G-tree correspond to disk pages or buckets, and points are assigned to a partition until it is full. A full partition is split into two equal subpartitions, and the points in the original partition are distributed among the new partitions. Since the partitions are completely ordered, they can be stored in a B-tree.

On the other hand, the BD tree proposal organizes the partitions in a binary search tree such that an internal node of this tree acts like a decision node. It contains a partition number, and the two branches, the in and out branches, emerging from the node represent alternative search paths, depending upon whether a point belongs to that partition or not. The leaf-level nodes point to disk buckets that contain the data. Therefore, while the G-tree maps a partition directly into a bucket, the BD tree allows buckets to be defined by including some and excluding other subpartitions from a partition. A more detailed comparison between the two schemes is given in Section IV. The zkdb tree [9], [10] also uses a similar numbering scheme, called z-order, to organize multidimensional data into a B-tree. However, the z-order fields are of fixed length.

This paper is organized as follows. Section II explains our data structure and gives detailed algorithms for performing insertions and deletions. Section III discusses our query processing algorithm. Then, Section IV compares the G-tree with the BD tree, zkdb tree, and KDB tree. Section V presents results of simulation experiments performed to study utilization figures for this structure, and Section VI concludes the paper.

II. G-TREE

Section II-A describes the details of our data structure, while Sections II-B and II-C give algorithms for performing insertions and deletions. Finally, Section II-D gives an example that illustrates the algorithms.

A. Data Structure

We assume that each data point (or tuple) lies in an n-dimensional space, and the dimensions are numbered 1, 2, ..., n. It is also assumed that the range within which the points lie along dimension i is [l_i, h_i].

A multidimensional data space is divided into several partitions, each containing not more than a maximum number of entries, denoted by the parameter max_entries. Each partition corresponds to one disk page or bucket and, upon becoming full, is split into two. The main features of our scheme are:

Fig. 1. Partitioning scheme.

• The data space is divided into nonoverlapping partitions of variable size.
• Each partition is assigned a unique partition number.
• A total ordering is defined on the partition numbers, and they are stored in a B-tree.
• Empty partitions are not stored in the tree to save space.

We first describe our partition-numbering scheme and partition arithmetic, and then turn to the other details of our structure. This numbering scheme has also been discussed in [4] and [5].

Partition Numbering and Splitting: A partition is numbered as a binary string of 0s and 1s. Successive parts of Fig. 1 show how partitions are split and new partitions created. In this example, there are only two dimensions, 1 and 2 (or dimensions X and Y, respectively). Initially, the entire data space is divided into two partitions by splitting its range along dimension 1 (or X axis) into two equal subpartitions, numbered 0 and 1 (see Fig. 1(a)). As more points are added to a partition and it becomes full, it is subdivided to create two new partitions of equal size. (In Fig. 1, we assume that max_entries is 2, i.e., each partition can accommodate only two entries.)

When partition 0 of Fig. 1(a) is full, its range along the Y-axis is split to create two new partitions, 00 and 01, of equal size (see Fig. 1(b)). In this case, the Y-axis is called the splitting dimension. Now, if more points are added, and partition 01 also becomes full, then its range along the X-axis is split to create two new partitions 010 and 011 (see Fig. 1(c)). Finally, when partition 011 becomes full, it is split into partitions 0110 and 0111 (see Fig. 1(d)).¹ We shall refer to the partition that is split as the parent partition and the two new partitions resulting from the split as its child partitions.

With only two dimensions, the splitting dimension alternates; in general, with n dimensions, the splitting dimension recycles with a periodicity of n such that each dimension appears once in a cycle.
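The numbering scheme can be made concrete by decoding a partition number back into the region it covers. The following is a minimal sketch, not part of the original algorithms: the function name partition_region and the unit-square bounds are assumptions made for illustration. It reads the bit string from left to right, cycles through the dimensions with period n, and halves the current range at each bit (0 keeps the lower half, 1 the upper half).

```python
def partition_region(number, lows, highs):
    """Return the (lows, highs) corner tuples of the region a partition covers.

    number: partition number as a bit string, e.g. "0110".
    lows, highs: per-dimension bounds of the whole data space.
    """
    lows, highs = list(lows), list(highs)
    n = len(lows)
    for k, bit in enumerate(number):
        dim = k % n                       # splitting dimension cycles with period n
        mid = (lows[dim] + highs[dim]) / 2
        if bit == "0":
            highs[dim] = mid              # bit 0 keeps the lower half
        else:
            lows[dim] = mid               # bit 1 keeps the upper half
    return tuple(lows), tuple(highs)


# The partitions created in the Fig. 1 walkthrough, on an assumed unit square.
for p in ["0", "1", "00", "01", "010", "011", "0110", "0111"]:
    print(p, partition_region(p, (0.0, 0.0), (1.0, 1.0)))
```

Each additional bit halves the region along exactly one dimension, so a child partition always lies inside its parent and the two children of a parent never overlap.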

Partition Number Arithmetic: The nonempty partitions that span the data space are maintained in a B-tree-like structure, called the G-tree. A leaf-level page in the G-tree points to a disk page containing all the points that lie in a partition, while higher level pages point to pages at the next lower level. In this subsection, we first define various operations on partitions (such as >, <, ⊂, etc.), and then show that the partition numbers in a G-tree are totally ordered.

¹ It should be evident that leading zeros in the partition number string are significant. Clearly, partition 1 is different from partition 01. Although the partition numbers should strictly be viewed as strings, the quotation marks are omitted for brevity.

Assume two partitions P1 and P2, b1 and b2 bits long, respectively, and also assume that b = min(b1, b2). Our definitions below are based on comparing the b most significant bits of P1, MSB(P1, b), and the b most significant bits of P2, MSB(P2, b), as binary numbers.

Definition 1: P1 > P2 if MSB(P1, b) > MSB(P2, b).
Definition 2: P1 < P2 if MSB(P1, b) < MSB(P2, b).
Definition 3: P1 ⊂ P2 if MSB(P1, b) = MSB(P2, b) and b1 > b2.
Definition 4: P1 ⊆ P2 if P1 ⊂ P2 or P1 = P2.

For example, consider that P1 = 01101 and P2 = 0111; here, since b1 = 5 and b2 = 4, b = min(5, 4) = 4. Since the first 4 bits of P1 are smaller than those of P2, we say that P1 < P2 or P2 > P1. For another example, consider that P1 = 011011 and P2 = 0110. Here, the first 4 bits agree and P1 is the longer string, so P1 is a subpartition of P2 (i.e., P1 ⊂ P2). Similarly, we define the superpartition (⊃) relation as follows:

Definition 5: P1 ⊃ P2 if P2 ⊂ P1.
Definition 6: P1 ⊇ P2 if P1 ⊃ P2 or P1 = P2.

Two other operations on partitions are defined as follows:

Definition 7: The parent partition of P, parent(P), is determined by removing the last (or the least significant) bit from P. For example, parent(01110) = 0111.
Definition 8: The complement of partition P, compl(P), is determined by inverting its last (or the least significant) bit. For example, compl(01110) = 01111.

Lemma 1: The partitions in a G-tree do not overlap and are totally ordered by the > relation.

Proof: Initially, the G-tree contains two nonoverlapping partitions, and any subsequent partition split results in a larger partition being replaced by two nonoverlapping smaller partitions. On the other hand, the two smaller partitions can also be merged (after deletions occur) and replaced by the larger partition. Nevertheless, no two partitions in the G-tree overlap. Consequently, there does not exist any pair of partitions P1 and P2 such that P1 ⊇ P2 or P1 ⊆ P2. Hence, either P1 > P2 or P1 < P2. Moreover, since no two partitions overlap, it follows that each point lies in exactly one and only one partition. □

B. Insertion Algorithm

This section describes our algorithm for inserting multidimensional points into the G-tree. The main algorithm, along with two functions it calls, assign and search, is listed below, and a detailed description follows.

Algorithm Insert(x_1, x_2, ..., x_n)     /* coordinates of point to be inserted */
{
    P_in = assign(x_1, x_2, ..., x_n, b);
    P_act = search(P_in);
    flag = 0;
    while (flag ≠ 1)
        if (num_points(P_act) < max_entries)
        {                                    /* insert point into partition P_act */
            insert (x_1, x_2, ..., x_n) into partition P_act;
            flag = 1;
        }
        else
        {                                    /* split a partition */
            P0 = concatenate(P_act, 0);      /* P0 and P1 are child partitions of P_act */
            P1 = concatenate(P_act, 1);
            delete P_act from G-tree;
            reallocate all points in P_act to P0 and P1;
            if (num_points(P0) > 0)          /* if P0 is a nonempty partition */
                insert P0 into G-tree;
            if (num_points(P1) > 0)          /* if P1 is a nonempty partition */
                insert P1 into G-tree;
            P_in = assign(x_1, x_2, ..., x_n, size(P0));
            P_act = search(P_in);
        }
}

Function assign(x_1, x_2, ..., x_n, b)   /* computes an initial partition number */
{
    P = "";                                  /* null string */
    for (k = 1; k ≤ b; k++)                  /* repeat b times */
    {
        i = ((k − 1) mod n) + 1;             /* splitting dimension cycles 1, 2, ..., n */
        if (x_i < (l_i + h_i)/2)
        {
            concatenate 0 to P;
            h_i = (l_i + h_i)/2;
        }
        else
        {
            concatenate 1 to P;
            l_i = (l_i + h_i)/2;
        }
    }
    return(P);
}

Function search(P_in)     /* search G-tree and find exact partition to which point belongs, given an initial partition P_in */
{
    P = smallest partition number in G-tree;
    while (P < P_in)                         /* search for P_in in G-tree */
        P = P_next;
    if (P_in ⊆ P)
        P_act = P;                           /* partition to which point belongs exists */
    else
    {                                        /* partition to which point belongs does not exist */
        P_act = P_in;                        /* find the new partition to insert */
        while (P_prev < parent(P_act) < P_next)   /* find largest missing partition */
            P_act = parent(P_act);
        insert P_act into G-tree;
    }
    return(P_act);
}

The first step in inserting a point is to compute its partition number. Since it is not possible to exactly determine the partition to which a point belongs, an approximate, initial partition number is computed by assuming that the point will be inserted into a partition with dimensions equal to those of the smallest partition created so far. The smallest partition is the one with the largest number of bits in its number. The function assign is called with the coordinates of the point (x_1, x_2, ..., x_n) and the number of bits b in the desired partition number as parameters (assuming that the number of bits in the smallest partition is b). It returns an initial partition number P_in, which is b bits long and contains the point.

The next step is to use the initial partition number P_in as a handle to search the G-tree and return P_act, the actual partition to which the new point belongs. This is performed by function search. The G-tree is searched for a partition P such that P_in ⊆ P. If such a P is found, the point belongs to this partition, and P_act is set equal to P. As already shown in Lemma 1, such a P, if one exists, will be unique. On the other hand, if partition P is not found, the search terminates as soon as either the last partition in the G-tree is encountered, or the first partition number higher than P_in is located. In both cases, this means that a new partition must be inserted. This new partition is P_in itself or its largest ancestor that does not overlap with an existing partition in the G-tree. In order to determine the largest ancestor, successive ancestors of P_in must be compared with P_prev and P_next (the next lower and the next higher entries than P_in in the G-tree, respectively). This is performed within the while loop in function search.

Once the correct partition number is found, a check is performed to see if this partition can accommodate one more entry. If so, the new entry is added to this partition; otherwise, the partition must be split. This means that two new child partitions of P_act must be created by appending a 0 and a 1 to it. The two new partitions created by splitting P_act are denoted P0 and P1. Partition P_act is deleted from the G-tree, and all the points that belong to it are reallocated to P0 and P1. The two new partitions, if nonempty,² are inserted into the G-tree. On the other hand, if a partition is empty, then our algorithm does not require that it should be inserted into the G-tree. This helps in reducing the size of the tree.
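The two functions can be sketched in a few lines of Python. This is an illustrative reading rather than a definitive implementation: a sorted Python list stands in for the leaf level of the B-tree, partition numbers are plain bit strings, and containment is tested as "the stored number is a prefix of P_in", which is how the search uses the relation. Because stored partitions never overlap, ordinary string comparison orders them the same way as the comparison of most significant bits. The partition list below is our reading of the Fig. 2 legend with the two empty partitions left out; that list and the names TREE and found are assumptions of the sketch.

```python
import bisect


def assign(point, b, lows, highs):
    """Compute the b-bit initial partition number P_in that contains `point`."""
    lows, highs = list(lows), list(highs)
    n = len(point)
    bits = []
    for k in range(b):
        dim = k % n                        # cycle through the dimensions
        mid = (lows[dim] + highs[dim]) / 2
        if point[dim] < mid:
            bits.append("0")
            highs[dim] = mid
        else:
            bits.append("1")
            lows[dim] = mid
    return "".join(bits)


def search(tree, p_in):
    """Return (P_act, found) for a sorted list of non-overlapping partition numbers."""
    i = bisect.bisect_right(tree, p_in) - 1     # last entry not greater than P_in
    if i >= 0 and p_in.startswith(tree[i]):
        return tree[i], True                    # an existing partition contains P_in
    neighbours = [q for q in (tree[i] if i >= 0 else None,
                              tree[i + 1] if i + 1 < len(tree) else None) if q]
    p_act = p_in
    # Climb to the largest ancestor that does not overlap P_prev or P_next.
    while len(p_act) > 1 and not any(q.startswith(p_act[:-1]) for q in neighbours):
        p_act = p_act[:-1]
    return p_act, False


# Non-empty partitions as read from the Fig. 2 legend (an assumption of this sketch).
TREE = sorted(["000000", "000001", "000010", "000011", "0001", "0010",
               "001100", "001101", "001110", "001111", "0100", "010100",
               "010101", "010110", "010111", "01110", "01111", "11"])

p_in = assign((0.9, 0.1), 6, (0.0, 0.0), (1.0, 1.0))
print(p_in, search(TREE, p_in))
```

Unlike the listing, this search only reports the partition to create and leaves the actual insertion into the tree to the caller, which keeps the function free of side effects.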

Unless all the points go into only one of P0 or P1, the split will result in the creation of space for an additional point in both P0 and P1. In this case, function assign is called again to determine the partition to which the point belongs, and the point is inserted into the partition.³ On the other hand, if all the points upon splitting go into only one of P0 or P1, then clearly that partition would still remain full. This means that if the newly inserted point also belongs to this partition, which is already full, then another split is necessary. The splitting must be repeated until the points in P_act are distributed across more than one partition, and space is made for the new point.

C. Deletion Algorithm

The deletion algorithm is listed below, and then a description follows.

Algorithm Delete(x_1, x_2, ..., x_n)
{
    P_in = assign(x_1, x_2, ..., x_n, b);
    P = smallest partition in G-tree;
    while (P_in > P)                         /* search for P_in in G-tree */
        P = P_next;
    if (P_in ⊆ P)                            /* partition containing the point exists in the tree */
    {
        P_act = P;
        delete point (x_1, x_2, ..., x_n) from P_act;
        while (num_points(P_act) + num_points(compl(P_act)) ≤ max_entries)
        {
            reassign points in P_act and compl(P_act) to parent(P_act);
            delete P_act and compl(P_act) from G-tree;
            insert parent(P_act) into G-tree;
            P_act = parent(P_act);
        }
    }
    else                                     /* partition containing the point does not exist */
        print("error - point does not exist");
}

² If all the points of P are reassigned to only one partition, say P0, then P1 remains empty, and in that case it is not inserted into the tree.
³ The function size, which appears as an argument in the call to assign, returns the number of bits in P0.

Fig. 2. An example of partitioning for several data points. Legend (decimal - binary partition number): 0 - 000000, 1 - 000001, 2 - 000010, 3 - 000011, 4 - 0001, 8 - 0010, 12 - 001100, 13 - 001101, 14 - 001110, 15 - 001111, 16 - 0100, 20 - 010100, 21 - 010101, 22 - 010110, 23 - 010111, 24 - 0110, 28 - 01110, 30 - 01111, 32 - 10, 48 - 11.

In deleting a point, the first step is to determine the partition in which it lies. As in the case of insertions, an approximate, initial partition P_in is computed by function assign. Next, a search is carried out in the G-tree for an actual partition P_act such that P_act ⊇ P_in. If such a P_act is found, the given point is searched in P_act and deleted from it. On the other hand, if no P_act such that P_act ⊇ P_in is found, it means that the given point does not exist.

Once partition P_act is located, the next step is to check if P_act and compl(P_act) (if one exists) can be merged into a single partition. If the total number of points in partitions P_act and compl(P_act) together is less than or equal to max_entries,⁴ then it is possible to merge the two partitions and replace them by their parent partition. In this case, all the points belonging to P_act and compl(P_act) are assigned to parent(P_act), partitions P_act and compl(P_act) are deleted from the G-tree, and parent(P_act) is inserted into it.

D. An Example

This section gives an example to illustrate our data structure, and the insertion and deletion algorithms. Fig. 2 shows the partitioning of a two-dimensional space resulting from the addition and deletion of several points. Each dot represents a data point, and it is assumed that the maximum number of points in a partition is two. Since the data distribution is nonuniform, the partition sizes vary considerably.

The legend alongside Fig. 2 shows the equivalent binary representations of the partition numbers appearing on the grid itself in decimal form. The conversion of partition numbers from binary to decimal needs some explanation. Notice that the smallest partition (in terms of area) in this figure is 6 bits long; however, some larger partition numbers are fewer than 6 bits long. These are padded with trailing zeros before making the conversion to decimal form. Thus, a 4-bit partition number such as 0001 is padded with two trailing zeros and translated into decimal 4. Similarly, a 2-bit partition number such as 10 is padded with 4 trailing zeros and translated into decimal 32. It is important to note that this conversion to decimal form is done only to facilitate understanding; internally, the partition numbers are stored as binary strings. The partition numbers are organized into a G-tree as shown in Fig. 3, and the entries in the leaf-level pages of this tree correspond to disk buckets that contain the data tuples. Further, note that partitions 24 and 32 are missing from the G-tree because they are empty partitions.
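The padding rule used for the decimal labels is easy to check mechanically. The helper below is only an illustration of that rule; the name padded_decimal is ours, and the width of 6 is the length of the smallest partition in Fig. 2.

```python
def padded_decimal(number, width):
    """Pad a partition number with trailing zeros to `width` bits; return its decimal value."""
    return int(number.ljust(width, "0"), 2)


for p in ["0001", "10", "01111", "101010"]:
    print(p, padded_decimal(p, 6))
```

For the two partition numbers quoted in the text, 0001 and 10, the helper returns 4 and 32, respectively.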

⁴ If the complement of P_act does not exist, the number of points in it is obviously 0.



Fig. 3. G-tree for the partitions of Fig. 2.

To illustrate our insertion algorithm, we assume that the range of points along both X and Y dimensions is [0,1] and consider how a point with coordinates (0.9, 0.1) is inserted. Since the smallest partition in the G-tree is 6 bits long, we first compute the 6-bit partition number, 101010 (decimal 42), of the partition to which the point belongs. Next, a search is carried out in the G-tree to find 101010, or an ancestor of it. This search fails, and terminates when the next higher partition number, 11, is encountered. This means that partition 101010, or an ancestor of it, must be inserted. The largest ancestor of 101010 that does not overlap with any existing partition in the tree is 10 (decimal 32), and therefore partition 10 would be inserted into the tree. The new point belongs to this partition.

For another case of insertion, consider the addition of a point with coordinates (0.1, 0.6). This point belongs to partition 0100 (decimal 16). However, since partition 0100 is already full, it must be split into partitions 01000 and 01001. The existing two points must be reassigned, one to each new partition, and the new point assigned to 01000.
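The complete insertion path, including repeated splitting, can be sketched as a small class. This is a toy model under stated assumptions, not the paper's implementation: a sorted key list plus a dictionary of buckets replaces the disk-based B-tree, the class and method names are ours, and the points are assumed to be distinct (more than max_entries identical points could never be separated by splitting). The sample points are invented and are not the points of Fig. 2.

```python
import bisect


class GTree:
    """Toy G-tree: a sorted key list stands in for the B-tree (an assumption)."""

    def __init__(self, lows, highs, max_entries=2):
        self.lows, self.highs = tuple(lows), tuple(highs)
        self.max_entries = max_entries
        self.keys = []        # sorted, non-overlapping partition numbers
        self.buckets = {}     # partition number -> list of points

    def assign(self, point, b):
        lows, highs = list(self.lows), list(self.highs)
        bits = []
        for k in range(b):
            dim = k % len(point)
            mid = (lows[dim] + highs[dim]) / 2
            if point[dim] < mid:
                bits.append("0")
                highs[dim] = mid
            else:
                bits.append("1")
                lows[dim] = mid
        return "".join(bits)

    def search(self, p_in):
        i = bisect.bisect_right(self.keys, p_in) - 1
        if i >= 0 and p_in.startswith(self.keys[i]):
            return self.keys[i]                     # existing partition
        near = self.keys[max(i, 0):i + 2]           # P_prev and P_next, if any
        p_act = p_in
        while len(p_act) > 1 and not any(q.startswith(p_act[:-1]) for q in near):
            p_act = p_act[:-1]                      # largest missing ancestor
        return p_act

    def _store(self, number, points):
        bisect.insort(self.keys, number)
        self.buckets[number] = points

    def insert(self, point):
        b = max((len(k) for k in self.keys), default=1)
        p_act = self.search(self.assign(point, b))
        while True:
            if p_act not in self.buckets:           # empty partition: create it
                self._store(p_act, [point])
                return p_act
            if len(self.buckets[p_act]) < self.max_entries:
                self.buckets[p_act].append(point)
                return p_act
            # Split: redistribute the full bucket over the two child partitions.
            self.keys.remove(p_act)
            size = len(p_act) + 1
            for old in self.buckets.pop(p_act):
                child = self.assign(old, size)
                if child not in self.buckets:
                    self._store(child, [])
                self.buckets[child].append(old)
            p_act = self.search(self.assign(point, size))


tree = GTree((0.0, 0.0), (1.0, 1.0), max_entries=2)
for pt in [(0.1, 0.1), (0.2, 0.8), (0.3, 0.3), (0.9, 0.1),
           (0.4, 0.9), (0.45, 0.95), (0.4, 0.6)]:
    tree.insert(pt)
print(tree.keys)
```

With these seven points the last insertion splits partition 011, finds that both existing points fall into 0111, and therefore creates the previously empty sibling 0110 for the new point.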

To understand the deletion algorithm, consider that the point with coordinates (0.4, 0.9) lying in partition 01111 (decimal 30) is to be deleted. The coordinates of this point are first converted into a 6-bit partition number, 011111 (decimal 31). Then a search for this partition is conducted in the G-tree by function search, and as a result its parent partition 01111 is found. Next, the point is deleted from this partition. At this stage, partition 01111 can be merged with its complementary partition 01110 to form partition 0111, which would contain two points. Moreover, partition 0111 and its complement 0110 (an empty partition) can also be merged to form 011. Therefore, as a result of deleting the point (0.4, 0.9), partitions 01111 and 01110 would be removed from the G-tree, replaced by partition 011, and the points contained in them reassigned to partition 011.
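Deletion with merging can be sketched in the same spirit. The code below is a hedged illustration: buckets are kept in a plain dictionary, the lookup is a linear scan rather than a B-tree search, and the function names are ours. It also adds one guard that does not appear in the listing: a complement that has itself been split into smaller stored partitions is not merged, since replacing it by the parent would overlap those partitions. The sample buckets are invented so as to mimic the situation around partitions 01110 and 01111 described above.

```python
def assign(point, b, lows, highs):
    """Compute the b-bit partition number that contains `point`."""
    lows, highs = list(lows), list(highs)
    bits = []
    for k in range(b):
        dim = k % len(point)
        mid = (lows[dim] + highs[dim]) / 2
        if point[dim] < mid:
            bits.append("0")
            highs[dim] = mid
        else:
            bits.append("1")
            lows[dim] = mid
    return "".join(bits)


def compl(number):
    """Invert the last bit of a partition number (Definition 8)."""
    return number[:-1] + ("1" if number[-1] == "0" else "0")


def delete(buckets, point, lows, highs, max_entries=2):
    """Delete `point` from `buckets` (partition number -> points); return the final partition."""
    b = max(len(p) for p in buckets)
    p_in = assign(point, b, lows, highs)
    # Linear scan instead of a B-tree search, purely for brevity.
    p_act = next((p for p in buckets if p_in.startswith(p)), None)
    if p_act is None or point not in buckets[p_act]:
        raise KeyError("point does not exist")
    points = buckets.pop(p_act)
    points.remove(point)
    while len(p_act) > 1:
        sibling = compl(p_act)
        # Added guard: a complement that has itself been split cannot be merged.
        if any(p != sibling and p.startswith(sibling) for p in buckets):
            break
        if len(points) + len(buckets.get(sibling, [])) > max_entries:
            break
        points += buckets.pop(sibling, [])   # merge into parent(P_act)
        p_act = p_act[:-1]
    if points:                               # empty partitions are not stored
        buckets[p_act] = points
    return p_act


# Invented sample data, arranged like the neighbourhood of partition 0111 in the text.
buckets = {"0100": [(0.1, 0.6), (0.2, 0.7)], "01110": [(0.3, 0.8)],
           "01111": [(0.4, 0.9), (0.45, 0.8)], "11": [(0.9, 0.9)]}
result = delete(buckets, (0.4, 0.9), (0.0, 0.0), (1.0, 1.0))
print(result, sorted(buckets))
```

On this data the two remaining points end up in partition 011, as in the worked example; the merge stops there because the complement 010 is represented in the tree by the smaller partition 0100.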

III. PROCESSING RANGE QUERIES

This section describes how range queries and partial match queries are processed using the G-tree. Partial match queries are treated as a special case of general range queries of the form

$q_1^l \le x_1 \le q_1^h$ and $q_2^l \le x_2 \le q_2^h$ and ... and $q_n^l \le x_n \le q_n^h$,

where $[q_i^l, q_i^h]$ is the range of the query along dimension i.

Our basic strategy consists of first identifying the smallest and largest partition numbers that could overlap with the query region. All partitions that lie within this range potentially contain points that overlap the query region. Next, the G-tree is searched and the partitions lying in this range are tested to determine whether they are fully contained in the query region, overlap the query region (but are not fully contained), or are outside the query region. If a partition is fully contained, then all points in it satisfy the query. On the other hand, if it overlaps, then each point in it must be examined and checked individually. Finally, a partition that lies outside the query region does not have to be considered any further. The algorithm is listed below, followed by a description.


Algorithm range_query(q_1^l, ..., q_n^l, q_1^h, ..., q_n^h)
{
    P_lo = assign(q_1^l, q_2^l, ..., q_n^l, b);
    P_hi = assign(q_1^h, q_2^h, ..., q_n^h, b);
    P = smallest partition in G-tree;
    while (P < P_lo)                         /* search for P_lo in G-tree */
        P = P_next;
    while (P_lo ≤ P ≤ P_hi)
    {
        if (overlap(P) == 1)                 /* P is contained in query region */
            output(all points in P);
        else if (overlap(P) == 2)            /* P overlaps query region */
            examine each point in P and output it if it satisfies query;
        P = P_next;                          /* next partition in G-tree */
    }
}

Function overlap(P)
{
    overlap = 0;
    b = size(P);                             /* i.e., number of bits in P */
    x_j^l = l_j and x_j^h = h_j for j = 1, ..., n;   /* start from the full data space */
    for (i = 1; i ≤ ⌈b/n⌉; i++)
        for (j = 1; j ≤ n; j++)
        {
            p = n × (i − 1) + j;
            if (p ≤ b)
                if (bit in position p == 0)
                    x_j^h = (x_j^l + x_j^h)/2;
                else
                    x_j^l = (x_j^l + x_j^h)/2;
        }
    j = 1;
    while (j ≤ n and x_j^l ≥ q_j^l and x_j^h ≤ q_j^h)
        j++;
    if (j == n + 1)
        overlap = 1;                         /* partition P is contained in query region */
    else
    {
        j = 1;
        while (j ≤ n and x_j^l ≤ q_j^h and x_j^h ≥ q_j^l)
            j++;
        if (j == n + 1)
            overlap = 2;                     /* partition P overlaps the query region */
    }
    return(overlap);
}

First, the function assign is called to transform the leftmost and rightmost points of the query region into b-bit long partition numbers, P_lo and P_hi, respectively. Next, the G-tree is searched for partition P_lo. All partitions in the range P_lo through P_hi are checked, by invoking function overlap, to determine if they lie wholly within the query region or overlap it. Function overlap transforms the partition number into the coordinates of its leftmost and rightmost points (x^l and x^h, respectively) and then conducts two sets of tests. The first test checks if both the leftmost and rightmost points lie within the query region for all dimensions; if so, the partition is fully contained within the region and a value of 1 is returned. If this test fails, then the second test checks for an overlap between the partition and the query region, and returns a value of 2, if successful.

We illustrate the above algorithm with an example that relates to Fig. 2. Say the query region is defined by: 0.3 < x < 0.4, and 0.3 < y < 0.6 (shown by the dotted rectangle in Fig. 2). The coordinates of the leftmost point in the query region are (0.3, 0.3), while the rightmost point has coordinates (0.4, 0.6). These two points lie in the 6-bit partitions 001100 (decimal 12) and 011010 (decimal 26), respectively. Next, the G-tree is searched for all partitions in this range (i.e., decimal 12 through 26), and each partition is tested for whether it overlaps or is wholly contained in the query region. The partitions that overlap the query region are denoted by decimal 12, 13, 14, and 15, while the other partitions are disjoint from the query region and can be ignored. Hence, partitions 12, 13, 14, and 15 must be examined for all points that satisfy the query.
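The worked query can be replayed with a short sketch of the two routines. As before, this is an illustration under assumptions rather than the paper's code: the stored partitions are our reading of the Fig. 2 legend without the two empty partitions, a sorted list replaces the B-tree, and a stored partition that is a prefix of P_lo is treated as lying within the range because it contains the lower corner of the query. The names TREE, region, and hits are ours.

```python
def assign(point, b, lows, highs):
    """Compute the b-bit partition number that contains `point`."""
    lows, highs = list(lows), list(highs)
    bits = []
    for k in range(b):
        dim = k % len(point)
        mid = (lows[dim] + highs[dim]) / 2
        if point[dim] < mid:
            bits.append("0")
            highs[dim] = mid
        else:
            bits.append("1")
            lows[dim] = mid
    return "".join(bits)


def region(number, lows, highs):
    """Decode a partition number into its lower and upper corner."""
    lows, highs = list(lows), list(highs)
    for k, bit in enumerate(number):
        dim = k % len(lows)
        mid = (lows[dim] + highs[dim]) / 2
        if bit == "0":
            highs[dim] = mid
        else:
            lows[dim] = mid
    return lows, highs


def overlap(number, q_lo, q_hi, lows, highs):
    """Return 1 if the partition lies inside the query box, 2 if it overlaps it, else 0."""
    x_lo, x_hi = region(number, lows, highs)
    dims = range(len(q_lo))
    if all(q_lo[j] <= x_lo[j] and x_hi[j] <= q_hi[j] for j in dims):
        return 1
    if all(x_lo[j] <= q_hi[j] and x_hi[j] >= q_lo[j] for j in dims):
        return 2
    return 0


def range_query(keys, q_lo, q_hi, lows, highs):
    """Return (partition, code) pairs for stored partitions in [P_lo, P_hi] that touch the query."""
    b = max(len(k) for k in keys)
    p_lo = assign(q_lo, b, lows, highs)
    p_hi = assign(q_hi, b, lows, highs)
    hits = []
    for p in sorted(keys):
        if p < p_lo and not p_lo.startswith(p):
            continue                      # P < P_lo: before the range
        if p > p_hi:
            break                         # P > P_hi: past the range
        code = overlap(p, q_lo, q_hi, lows, highs)
        if code:
            hits.append((p, code))
    return hits


# Non-empty partitions as read from the Fig. 2 legend (an assumption of this sketch).
TREE = ["000000", "000001", "000010", "000011", "0001", "0010",
        "001100", "001101", "001110", "001111", "0100", "010100",
        "010101", "010110", "010111", "01110", "01111", "11"]

hits = range_query(TREE, (0.3, 0.3), (0.4, 0.6), (0.0, 0.0), (1.0, 1.0))
print(hits)
```

On this partition list the query returns exactly the four partitions 001100 through 001111 (decimal 12 to 15), each with code 2, so their points would still have to be checked individually.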

IV. COMPARISON WITH RELATED STRUCTURES

In this section, the G-tree is compared with the BD tree [4], [5], zkdb tree [9], [10], and KDB tree [12], and the important differences between them are highlighted.

A. BD Tree

The G-tree comes closest to the BD tree because it uses the same partition numbering scheme. The main difference lies in how partitions are mapped into buckets, and how they are indexed. In the G-tree, a partition maps directly into a bucket, while in the BD tree, buckets can exclude some subpartitions from a given partition, and hence they can be defined in a more complex manner. Therefore, the BD tree guarantees a utilization of at least 50%, while the G-tree makes no such guarantees and its absolute worst-case performance can be very poor. The advantage of the G-tree over the BD tree lies in its ability to better index the partitions for efficient access. The G-tree index is similar to a B-tree, which is a balanced structure. On the other hand, the BD tree index consists of a binary search tree that can get highly unbalanced unless special methods are applied. Moreover, a binary search tree is a main memory structure, while the B-tree is more suitable for storage on disk.

B. zkdb Tree

In the zkdb tree, each data point is transformed into a fixed-length z-order (similar to a partition number), and the z-orders are organized in a B-tree. The shortcoming of this structure lies in the fixed length of the z-order, which means that the correct length of each z-order value must be predetermined. Since two different points must not have the same z-order, the exact length of this field would depend upon the total number of points in a data space and also their distribution pattern. This can be a problem in a dynamic structure. The G-tree does not suffer from this drawback because it permits variable-length partition numbers and each partition number is only as long as is necessary.

The long z-order values also affect the efficiency of the query processing algorithm presented in [9], [10]. The algorithm decomposes each query region into small boxes, each defined by a z-order as many bits long as the z-order for a data point. An ordered list is formed consisting of all the boxes in the query region and is merged with the data points stored in the tree. The performance of this algorithm is O(Q + D), where D is the number of data points in the query region and Q is the number of boxes in it. In our algorithm, partitions that do not overlap with the query region can be eliminated from consideration as explained in the previous section. Moreover, if a partition is wholly contained within the query region, then the points in it do not have to be checked individually.

C. KDB Tree

In the KDB tree, an entry in a higher level page defines a largerregion and points to a page containing smaller, nonoverlapping

regions that lie within the larger region. On the other hand, the G-tree is a single dimension structure like the B-tree. We discuss nextissues related to fanout, utilization, and ease of performing variousoperations in the context of these two structures.

First, we shall compare the fanout¹ of a region page (i.e., a nonleaf page) in a KDB tree with the fanout of a nonleaf page in the G-tree. An entry in a nonleaf page of the G-tree is a (partition #, page #) pair, 8 bytes long (assuming 4 bytes per field). On the other hand, a region page entry in the KDB tree is a (region id., page #) pair. Since 2n values must be stored to define an n-dimensional hyper-rectangular region, each KDB entry would occupy 20 bytes in the two-dimensional case. Therefore, the fanout for a G-tree page is 2.5 times as large as the fanout for a KDB tree page in the two-dimensional case. This relative advantage increases even further for higher dimensional data.
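The arithmetic behind the 2.5 figure can be checked with a short sketch. It assumes, as the text does, 4 bytes per field, and further assumes that a G-tree partition number still fits in one field as the dimensionality grows; the page size `PAGE_BYTES` is a hypothetical value used only to turn entry sizes into entry counts, and page header overhead is ignored.

```python
FIELD_BYTES = 4     # field width assumed in the text
PAGE_BYTES = 4096   # hypothetical page size, not taken from the paper


def gtree_entry_bytes() -> int:
    """(partition #, page #) pair: two fields."""
    return 2 * FIELD_BYTES


def kdb_entry_bytes(n_dims: int) -> int:
    """(region id., page #) pair: 2n boundary values plus one page number."""
    return (2 * n_dims + 1) * FIELD_BYTES


def fanout_ratio(n_dims: int) -> float:
    """G-tree fanout divided by KDB fanout for pages of equal size."""
    return kdb_entry_bytes(n_dims) / gtree_entry_bytes()


def entries_per_page(entry_bytes: int) -> int:
    """Whole entries that fit in one page, ignoring any page header."""
    return PAGE_BYTES // entry_bytes
```

Under these assumptions `fanout_ratio(2)` is 20/8 = 2.5 and `fanout_ratio(3)` is 28/8 = 3.5, which matches the statement that the advantage grows with dimensionality.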

The two structures also differ in the manner that page splits are handled. In the G-tree, when a partition becomes full, it is split into two equal subpartitions. In the KDB tree when a point page (i.e., at the leaf level of the KDB tree) is full, it is split into two (usually) unequal point pages because the splitting is done so as to assign an equal number of points to each new page. This is good from a utilization point of view, but it creates problems when region pages (at the next level above the point pages) become full and are split into two new region pages. In this case, there are often point pages that span two region pages, and they must also be split in order to ensure that a point page lies entirely within one region page.
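The two splitting policies can be contrasted with a minimal sketch. Both functions are simplified illustrations written for this comparison, not the published algorithms: `midpoint_split` halves the region regardless of where the points lie, while `median_split` places the boundary so that each side receives about the same number of points (distinct coordinate values are assumed).

```python
def midpoint_split(points, lo, hi, dim):
    """Equal-area split: halve the interval [lo, hi) along dimension `dim`."""
    mid = (lo + hi) / 2
    left = [p for p in points if p[dim] < mid]
    right = [p for p in points if p[dim] >= mid]
    return mid, left, right


def median_split(points, dim):
    """Equal-count split: put the boundary at the median coordinate along `dim`."""
    ordered = sorted(points, key=lambda p: p[dim])
    half = len(ordered) // 2
    boundary = ordered[half][dim]   # first coordinate of the right-hand page
    return boundary, ordered[:half], ordered[half:]


# Skewed sample: five points crowd the left edge, one sits far to the right.
sample = [(0.05, 0.5), (0.10, 0.2), (0.15, 0.8), (0.20, 0.4), (0.25, 0.6), (0.90, 0.1)]
mid, g_left, g_right = midpoint_split(sample, 0.0, 1.0, dim=0)
boundary, k_left, k_right = median_split(sample, dim=0)
```

On this skewed sample the midpoint split leaves five points on one side and one on the other, whereas the median split yields three and three. The price of the balanced counts is a data-dependent boundary (0.20 here rather than 0.5), which is what later forces point pages to be cut when a region page above them is split.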

The way in which point pages are split also affects the algorithm for handling deletions in the KDB tree. When points are deleted from the KDB tree, point pages can be merged together. In order to ensure that the new point page is rectangular, often a reorganization must be performed in a somewhat arbitrary manner. This means that handling deletions is hard, and moreover, if the structure is very dynamic, then the utilization can deteriorate over time. On the other hand, the G-tree has the nice property that the partitioning scheme is identical for a given set of points regardless of the order of insertions and deletions.
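The order-independence property can be demonstrated for insertions with a small sketch. It is a simplification under stated assumptions: points lie in the unit square with distinct coordinates, partitions are halved alternately along x and y, partitions are keyed by illustrative binary strings, empty partitions are kept, and deletions and the B-tree index are omitted. The names `bit_at`, `insert`, and `build` are hypothetical.

```python
import random


def bit_at(point, depth):
    """Bit selected at split `depth`: 0 for the lower half, 1 for the upper half."""
    coord = point[depth % 2]           # alternate x, y, x, y, ...
    level = depth // 2 + 1             # times this dimension has been halved
    return int(coord * (1 << level)) & 1


def insert(leaves, point, max_entries):
    """Add `point` to the leaf partition that contains it, splitting on overflow."""
    key = ""
    while key not in leaves:           # descend until an existing leaf is reached
        key += str(bit_at(point, len(key)))
    leaves[key].append(point)
    while len(leaves[key]) > max_entries:
        pts = leaves.pop(key)
        depth = len(key)
        leaves[key + "0"] = [p for p in pts if bit_at(p, depth) == 0]
        leaves[key + "1"] = [p for p in pts if bit_at(p, depth) == 1]
        # at most one half can still overflow; keep splitting that one
        key += "0" if len(leaves[key + "0"]) > max_entries else "1"


def build(points, max_entries=4):
    """Insert points in the given order and return the resulting partitioning."""
    leaves = {"": []}
    for p in points:
        insert(leaves, p, max_entries)
    return {k: sorted(v) for k, v in leaves.items()}


if __name__ == "__main__":
    rng = random.Random(1)
    pts = [(rng.random(), rng.random()) for _ in range(200)]
    shuffled = pts[:]
    rng.shuffle(shuffled)
    print(build(pts) == build(shuffled))
```

Because every split falls on a fixed midpoint, a partition is split exactly when it holds more than `max_entries` points, so the final set of partitions depends only on the set of points and not on the insertion sequence.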

V. UTILIZATION

The G-tree structure was simulated and tested. The purpose of these tests was to measure the average utilization of partitions (or buckets) in the G-tree and compare it with the utilization of buckets in other structures. The maximum number of points in a disk bucket is denoted by max-entries, and

$$\text{utilization} = \frac{\text{number of entries}}{\textit{max-entries} \times n},$$

where $n$ is the number of buckets.

We ran several experiments in which a large number of points were inserted into the G-tree. The points were pseudo-randomly generated and drawn from two-dimensional uniform, exponential, and normal distributions. For each distribution, 50000 points were inserted into the G-tree, and the experiments were repeated several times with various seed values. The average values are reported here. The parameter max-entries was also varied to see how it affects the utilization. The results are given in Table I.

Table I shows the percentage utilization as a function of the number of points inserted for different values of max-entries. The average utilization is around 69% and depends only very slightly upon the nature of the distribution from which the data is drawn. These utilization values are very close to those for both BD trees [5] and grid files [8], and better than those for KDB trees since the results reported in [12] show that the utilization of KDB trees lies in the range of 60 ± 10%.
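Reading utilization as the fraction of available bucket slots that are occupied, the measurement can be sketched as follows. The function name and the sample bucket sizes are hypothetical and chosen only for illustration; they are not data from the experiments.

```python
def utilization(bucket_sizes, max_entries):
    """Occupied slots divided by total slots over all buckets."""
    if not bucket_sizes:
        raise ValueError("at least one bucket is required")
    if any(size > max_entries for size in bucket_sizes):
        raise ValueError("a bucket holds more than max_entries points")
    return sum(bucket_sizes) / (max_entries * len(bucket_sizes))


# Four hypothetical buckets with a capacity of 25 points each.
example = utilization([25, 12, 18, 14], max_entries=25)
```

For the sample above, 69 of 100 slots are occupied, giving a utilization of 0.69.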

¹The fanout is defined as the number of entries in a page of the index.



TABLE I
G-TREE BUCKET UTILIZATION (%) FOR VARIOUS VALUES OF max-entries AND DATA DISTRIBUTIONS

max-entries   uniform   exponential   normal
25            69.38     69.40         68.80
50            69.78     69.21         68.61
100           70.57     69.43         68.17

VI. CONCLUSION

A new data structure called the G-tree was proposed for indexing multidimensional data. It is based on organizing variable-sized grids or partitions into a B-tree. Algorithms for searching, inserting and deleting points from this structure were given and illustrated with examples. An algorithm for processing multidimensional range queries was also given. The G-tree was compared with other related structures, such as the BD tree, zkdb tree, and the KDB tree, and the strengths and weaknesses of each approach were discussed. Finally, simulation experiments were run to determine the utilization of buckets in the G-tree. The average utilization in these experiments for various data distributions was approximately 69%. This is better than the utilization for KDB trees and compares favorably with the utilization for BD trees and grid files. The results show that even a relatively simple partitioning scheme like the one employed performs well and produces average bucket utilization similar to or better than the utilization achieved from more complex partitioning schemes.


REFERENCES

[1] J. L. Bentley, "Multidimensional binary search trees used for associative searching," Commun. ACM, vol. 18, no. 9, pp. 509-517, Sept. 1975.
[2] W. Burkhard, "Interpolation-based index maintenance," BIT, vol. 23, no. 3, pp. 274-294, 1983.
[3] D. Comer, "The ubiquitous B-tree," ACM Comput. Surveys, vol. 11, no. 2, pp. 121-137, June 1979.
[4] S. Dandamudi and P. Sorenson, "An empirical performance comparison of some variations of the k-d tree and BD tree," Int. J. Comput. Inform. Sci., vol. 14, no. 3, pp. 135-159, June 1985.
[5] S. Dandamudi and P. Sorenson, "Algorithms for BD-trees," Software Practice and Experience, vol. 16, no. 12, pp. 1077-1096, Dec. 1986.
[6] M. Freeston, "The bang file: A new kind of grid file," in Proc. ACM SIGMOD Conf., San Francisco, May 1987, pp. 260-269.
[7] K. Hinrichs, "Implementation of the grid file: Design concepts and experience," BIT, vol. 25, no. 4, pp. 569-592, 1985.
[8] J. Nievergelt, H. Hinterberger, and K. Sevcik, "The grid file: An adaptable, symmetric multikey file structure," ACM Trans. Database Syst., vol. 9, no. 1, pp. 38-71, Mar. 1984.
[9] J. Orenstein and T. Merrett, "A class of data structures for associative searching," in Proc. 3rd ACM SIGACT-SIGMOD Symp. Principles of Database Systems, Waterloo, Apr. 1984, pp. 181-190.
[10] J. Orenstein, "Spatial query processing in an object-oriented database system," in Proc. ACM SIGMOD Conf., Washington, DC, May 1986, pp. 326-336.
[11] M. Ouksel and P. Scheuermann, "Storage mappings for multidimensional linear dynamic hashing," in Proc. 2nd ACM SIGACT-SIGMOD Symp. Principles of Database Systems, Atlanta, Mar. 1983, pp. 90-105.
[12] J. T. Robinson, "The KDB tree: A search structure for large multidimensional dynamic indexes," in Proc. ACM SIGMOD Conf., Ann Arbor, MI, Apr. 1981, pp. 10-18.
[13] J. B. Rothnie and T. Lozano, "Attribute based file organization in a paged memory environment," Commun. ACM, vol. 17, no. 2, Feb. 1974.
[14] H. Samet, The Design and Analysis of Spatial Data Structures. Reading, MA: Addison-Wesley, 1991.
