
Fractal Dimension and Space Filling Curve
Approximate Space-filling Curve

Ana Karina Tavares da Moura Gomes

Dissertation submitted to obtain the Master Degree in

Information Systems and Computer Engineering

Supervisor: Prof. Dr. Andreas Wichert

Examination Committee

Chairperson: Prof. Dr. Mário Jorge Costa Gaspar da Silva
Supervisor: Prof. Dr. Andreas Miroslaus Wichert
Member of the Committee: Prof. Dr. Pável Pereira Calado

June 2015


Dedicated to my beloved son Enzo.

You were my greatest motivation to finish the course and not give up,

regardless of the difficulties. I love you so much Zo!


Acknowledgments

First of all, I would like to thank my mother Ju and my sister Carol for all the support they gave me. I have special thanks to Carol, who so many times listened to me without understanding anything just to help me think out loud. Thank you for all the love and care. Love you sis!

I would like to thank my advisor for the patience and for the advice, but mainly for helping me grow as a student and future professional, allowing me to think and decide for myself. Thank you!

I would like to thank my mother-in-law Cristina for taking care of my child, allowing me to work in the afternoons.

I would like to thank my closest friends Ana Silva and Jose Leira for all the support. Thank you for softening the worst days of work. You are one of the greatest treasures that I gained from this course. Ana, thank you for not giving up on me and for being an angel! Ze, thank you for being my thesis partner and for listening to all my endless doubts.

Finally, I have special thanks to my husband Tiago. Thank you for not judging me, for believing in me (even when I didn't) and for your endless patience. There are no words to describe how grateful I am for having you in my life. Thank you for all the love, all the support and understanding during all these years. Love you!


Resumo

Curvas preenchedoras de espaço são fractais gerados por computador que podem ser usadas para indexar espaços de baixas dimensões. Existem estudos anteriores que as usam como método de acesso nestes cenários, contudo são muito conservadores, usando-as apenas em espaços até quatro dimensões. Adicionalmente, os métodos alternativos tendem a apresentar um desempenho pior do que uma pesquisa linear quando os espaços ultrapassam as dez dimensões. Deste modo, no contexto da minha tese, estudo as curvas preenchedoras de espaço e as suas propriedades, assim como os desafios apresentados pelos dados multidimensionais. Proponho o uso destas, especificamente da curva de Hilbert, como método de acesso para indexar pontos multidimensionais até trinta dimensões. Começo por mapear os pontos para a curva de Hilbert, gerando os seus h-values. Em seguida, desenvolvo três heurísticas para procurar vizinhos aproximadamente mais próximos de um dado ponto de pesquisa, com o objectivo de testar o desempenho da curva. Duas heurísticas usam a curva como método de acesso direto e a restante usa a curva como chave secundária combinada com uma variante da B-tree. Estas resultam de um processo iterativo que basicamente consiste no planeamento, concepção e teste da heurística. De acordo com os resultados do teste, a heurística é alterada ou é criada uma nova. Os resultados experimentais com as três heurísticas provam que a curva de Hilbert pode ser usada como método de acesso e que esta consegue funcionar pelo menos em espaços até trinta e seis dimensões.

Palavras-chave: Fractais, Curva de Hilbert Preenchedora de Espaço, Indexação de Baixas Dimensões, Vizinho Aproximadamente Mais Próximo


Abstract

Space-filling curves are computer-generated fractals that can be used to index low-dimensional spaces. There are previous studies using them as an access method in these scenarios, although they are very conservative, applying them only to spaces of up to four dimensions. Additionally, the alternative access methods tend to present worse performance than a linear search when the spaces surpass ten dimensions. Therefore, in the context of my thesis, I study space-filling curves and their properties as well as the challenges presented by multidimensional data. I propose their use, specifically the Hilbert curve, as an access method for indexing multidimensional points up to thirty dimensions. I start by mapping the points to the Hilbert curve, generating their h-values. Then, I develop three heuristics to search for approximate nearest neighbors of a given query point with the aim of testing the performance of the curve. Two of the heuristics use the curve as a direct access method and the other uses the curve for secondary key retrieval combined with a B-tree variant. These result from an iterative process that basically consists of planning, conceiving and testing each heuristic. According to the test results, the heuristic is adjusted or a new one is created. Experimental results with the three heuristics show that the Hilbert curve can be used as an access method, and that it can operate at least in spaces of up to thirty-six dimensions.

Keywords: Fractals, Hilbert Space-Filling Curve, Low-Dimensional Indexing, Approximate Nearest Neighbor


Contents

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi

1 Introduction 1

1.1 Hypothesis and Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Main contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3 Document Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Multidimensional Data 5

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.2 Multidimensional Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.3 Vector Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.4 Multidimensional Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.4.1 Range Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.4.2 Nearest Neighbor Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.5 Multidimensional Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.5.1 R-tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.5.2 Kd-tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3 Fractals 15

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.2 The Hausdorff Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.3 Space-Filling Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.3.1 Z-Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.3.2 Hilbert Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.4 Higher Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.5 Clustering Property . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.6 Fractals: An Access Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30


3.6.1 Secondary Key Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.6.2 Direct Access Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4 Approximate Space-Filling Curve 35

4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4.3 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.4 Mapping System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4.5 Dataset Analysis based on a Linear Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4.6 Experiment 1: Hypercube Zoom Out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4.7 Experiment 2: Enzo Full Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.8 Experiment 3: Enzo Reduced Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

5 Conclusions 55

5.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

Bibliography 62

A Appendix 63

A.1 Chapter: Fractals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

A.2 Chapter: Approximate Space-Filling Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65


List of Tables

4.1 Datasets Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4.2 Neighbor Distance to Query Point per Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.3 HZO Runtime Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

4.4 HZO Approximate Nearest Neighbors Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.5 Enzo FS Runtime Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.6 Enzo FS Approximate Nearest Neighbors Results . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.7 Enzo RS Runtime Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.8 Enzo RS Relative Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

A.1 Gray Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

A.2 Enzo RS Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65


List of Figures

2.1 Examples of Multidimensional Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2 Example of Unit Circles for L1, L2 and L∞ Norms . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.3 R-tree Planar and Directory Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.4 Kd-tree Planar and Directory Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.1 Ratio Similarity Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.2 Ratio Similarity Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.3 Hausdorff Dimension Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.4 Z-curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.5 Z-curve Bit Shuffling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.6 Z-curve Query Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.7 Hilbert Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.8 Hilbert Curve 3D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.9 Hilbert Second Order Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.10 Hilbert Curve VS Z-curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.11 Hilbert Curve Analysis to Higher Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.12 Analysis of the Entry and Exit Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.13 Clustering Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.14 Lawder and King Hilbert Curve Indexing Structure . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.15 Chen and Chang Hilbert Curve Indexing Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.1 Approximate Space-filling Curve Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4.2 The Nearest Neighbor Distribution per Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.3 Linear Search Runtime Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.4 Linear Search Runtime Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.5 Hypercube Zoom Out Algorithm Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4.6 Hypercube Zoom Out Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

4.7 HZO Relative Error Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.8 HZO Algorithm Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.9 Enzo FS Relative Error Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.10 Enzo RS Algorithm Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49


4.11 Enzo RS Runtime Performance of the First Group of Analysis . . . . . . . . . . . . . . . . . . . 50

4.12 Enzo RS Runtime Performance of the Second Group of Analysis . . . . . . . . . . . . . . . . . . 50

4.13 Enzo RS Relative Error Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.14 Runtime Comparison Between Heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4.15 Average Relative Error Comparison Between Heuristics . . . . . . . . . . . . . . . . . . . . . . . 52

4.16 Enzo RS Global Runtime Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.17 Enzo RS Global Average Relative Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53


Chapter 1

Introduction

Throughout the years, several indexes have been proposed to handle multidimensional (or spatial) data, but the most popular is the R-tree. Despite all the effort dedicated to improving it, R-tree performance does not appear to keep up with the increase in dimensions. It is common for multidimensional indexes to suffer from the curse of dimensionality, which means they present a similarity search performance worse than a linear search in spaces that exceed ten dimensions [Weber et al., 1998, Yu, 2002, Mamoulis, 2012]. Even with other indexes with better performance, Mamoulis considers that similarity search in these spaces is a difficult and unsolved problem in its generic form [Mamoulis, 2012].

The approximate nearest neighbor is a similarity search technique that trades the accuracy of query results for reduced response time and memory usage. It is an invaluable search technique in high-dimensional spaces. Commonly, we use feature vectors to simplify and represent data in order to lower the complexity. The features are represented by points in the high-dimensional space, and the closer the points are in space, the more similar the objects are. Images, for instance, are usually represented by feature vectors, which are then used to find approximately similar objects. In this case, an approximate result is an acceptable solution. Multimedia data is only one example among several other applications. The approximate nearest neighbor, as its name suggests, does not guarantee the return of the exact neighbor, and its performance is highly related to the data distribution [Wichert, 2015].

Space-filling curves are also highly related to the data distribution. They are fractals that enable mapping multidimensional points to a single dimension, allowing traditional one-dimensional indexing structures, such as the B+-tree, to be used for indexing multidimensional data. Basically, they consist of a path in multidimensional space that visits each point exactly once without crossing itself. The curve defines a total order among the visited points, making their linear ordering possible. The purpose is to preserve the spatial proximity between points, yielding a good clustering property. Indexes are then built on the reduced space, enabling the resulting data to lie on a linear range of memory pages or disk block addresses [Yu, 2002, Jagadish, 1990, Faloutsos and Lin, 1995, Lin et al., 1994, B.-U. Pagel and Faloutsos, 2000]. Therefore, space-filling curves become very useful in areas such as multidimensional indexing [Mamoulis, 2012, Faloutsos and Roseman, 1989].

In the context of my thesis, I studied the space-filling curves and their properties as well as challenges presented

by multidimensional data. I explored the use of the space-filling curves, specifically the Hilbert curve, as an access


method for indexing multidimensional points. I also tested three heuristics to search for an approximate nearest

neighbor on the Hilbert curve. The main goal was to explore the potential of the Hilbert curve in low dimensional

spaces (up to thirty dimensions).

My thesis can be justified by previous studies indicating that, unless the data are well-clustered or correlations exist between the dimensions, nearest neighbor search may be meaningless in spaces where the number of dimensions is greater than ten [Mamoulis, 2012]. Studies indicate as well that space-filling curves are very useful in the domain of multidimensional indexing due to their good clustering and their preservation of a total order and of spatial proximity [Mamoulis, 2012, Faloutsos and Roseman, 1989].

1.1 Hypothesis and Methodology

In the context of my thesis, I studied space-filling curves as a direct and secondary access method in multidimensional spaces. I chose the Hilbert space-filling curve due to its superior clustering property. Additionally, I applied a similarity search technique on the Hilbert curve, where I opted for the approximate nearest neighbor to speed up the search. Formally, the hypothesis that I tried to validate was that a space-filling curve can operate in a low-dimensional space (up to thirty dimensions), in terms of indexing and search.

Considering this hypothesis, there were some questions that I tried to answer:

• What is the performance of an approximate nearest neighbor search, in a Hilbert space-filling curve, in terms

of relative error and runtime, compared with a basic linear search?

• What are the Hilbert space-filling curve's limitations in terms of usability?

In order to validate my thesis hypothesis and answer the questions above, I built a prototype called the approximate space-filling curve. The system basically comprises two steps: (1) mapping the original space to the Hilbert space-filling curve, generating the Hilbert index for each multidimensional point, and (2) performing the query by applying the chosen heuristic.

In the first step, I started by cleaning the data and removing the duplicated points in order to obtain clear results. Then, for each multidimensional point in the original data space, a Hilbert index key is generated with the help of the library Uzaygezen 0.2¹ from Google Code, which is a space-filling curve library capable of mapping multidimensional data points into one dimension through a compact Hilbert index [Hamilton and Rau-Chaplin, 2008]. In the second step, I applied two heuristics using the Hilbert curve as a direct access method and one using the Hilbert index with a B-tree variant. The three heuristics search for an approximate nearest neighbor of a defined query point on the Hilbert curve. I evaluated the results for the three heuristics and compared them with a simple linear search.
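As a rough illustration of these two steps, the sketch below maps points to one-dimensional keys and answers an approximate nearest neighbor query by inspecting only the entries whose keys fall closest to the query's key. It is a minimal sketch under stated assumptions, not the heuristics evaluated in Chapter 4: the key function uses Z-order bit interleaving as a stand-in for the compact Hilbert index produced by Uzaygezen, and the window width is an arbitrary parameter.

```python
import bisect
import math

def z_order_key(point, bits):
    """Stand-in 1-D key: Z-order bit interleaving. The actual prototype
    uses the compact Hilbert index produced by the Uzaygezen library."""
    key = 0
    for b in reversed(range(bits)):
        for coord in point:
            key = (key << 1) | ((coord >> b) & 1)
    return key

def build_index(points, bits):
    # Step 1: map every point to its 1-D key and keep the list sorted by it.
    return sorted((z_order_key(p, bits), p) for p in points)

def approximate_nn(indexed, query, bits, window=8):
    # Step 2: locate the query's key and examine the neighboring entries
    # on the curve, returning the candidate closest in the original space.
    keys = [k for k, _ in indexed]
    pos = bisect.bisect_left(keys, z_order_key(query, bits))
    candidates = indexed[max(0, pos - window): pos + window]
    return min((p for _, p in candidates),
               key=lambda p: math.dist(p, query))

points = [(3, 1), (5, 6), (2, 7), (7, 2), (4, 4)]
indexed = build_index(points, bits=3)
print(approximate_nn(indexed, (4, 5), bits=3))   # -> (4, 4); the window covers all points here
```

In larger datasets the window covers only a small slice of the curve, which is where the approximation, and the dependence on the curve's clustering property, comes from.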

1.2 Main contributions

The main contributions of this work are summarized below:

¹ https://code.google.com/p/uzaygezen/


• I tested the performance of the Hilbert space-filling curve as a direct access method with two heuristics in up to 36 dimensions. The results indicated that the curve generates too much data, making it difficult to use the curve beyond 4 dimensions.

• I tested the performance of the same curve as a secondary key combined with a B-tree variant in up to 36 dimensions. The results showed that this combination works in the six spaces tested, including the space with 36 dimensions. Compared with the linear method, it is generally faster as the number of dimensions and the order of the curve increase.

1.3 Document Outline

The remaining chapters of this document are structured as follows: chapter 2 introduces the fundamental concepts required to understand the context of the problem, such as multidimensional data, indexing and querying, as well as the alternatives to the proposed fractal access method. Chapter 3 presents the main concepts relating to fractals and reviews some previous work done with space-filling curves in multidimensional spaces in terms of indexing and query execution. Chapter 4 presents the proposed solution and the methodology followed, the experiments and the results. Finally, chapter 5 summarizes this document, the results achieved and the future work in this domain.


Chapter 2

Multidimensional Data

In this chapter, we will see the basic concepts of the context surrounding this thesis proposal as well as the possible alternatives. The first section presents a short introduction to multidimensional data and its challenges. The second section introduces the basic notion of relations in multidimensional data, which influence the choice of the access method. Section 2.3 explains the concept of vector space as the result of applying a feature transformation function in order to perform easier operations on multidimensional data, widely used for multimedia data. This section also shows the common distance metrics used in vector spaces. Section 2.4 introduces the main queries performed in a multidimensional space with the aid of the distance metrics. Section 2.5 then reviews the alternatives to the fractal proposal for indexing low-dimensional data (two to thirty dimensions). Finally, the last section sums up this chapter.

2.1 Introduction

Data is becoming increasingly complex. It contains an extensive set of attributes and sometimes needs to be seen as part of a multidimensional space. Database Management Systems (DBMSs) have undergone several changes in their access methods in order to support this type of data, also known as spatial data. Spatial data is defined by its location, which is limited by a boundary known as the spatial extent. In a DBMS file, spatial data can be represented as point data or region data. For point data, the spatial extent comes down to the location itself, since a point has no area or volume. On the other hand, when the data has a spatial extent, the location refers to the centroid of the region. In two- or three-dimensional space, the boundary corresponds to a closed line or a surface, respectively. To store a data object, we can perform a geometric approximation through points, lines or other geometric figures and store the data as a feature vector. For example, as a point, the vector is stored as a d-tuple where d represents the number of dimensions. The purpose is to perform easier operations on the data, but this can result in data spaces of up to hundreds of dimensions. This spatial data is often called multidimensional when the data are feature vectors and the vector space has more than three dimensions. Despite these differences, both concepts relate to the same subject. For convenience, we will refer to all of it as multidimensional data, since it is the broader concept. The differences, when they exist, will be noted in due course.

The challenges start when applying access methods and query processing techniques. There are several index structures developed to deal with multidimensional data. Some are more suitable for low-dimensional spaces, some for high-dimensional ones. The purpose of this research is to study fractals and understand how their properties can help to optimize multidimensional indexing structures for data of low dimensions (two to thirty). These low-dimensional spaces may result from the application of indexing structures oriented to high-dimensional spaces, or from the application of dimensionality reduction techniques for subsequent indexing. In either situation, the resulting space, despite being of lower dimensions, still requires a large computational effort to perform operations on the data. The main problem is that the majority of the indexes tend to suffer from the curse of dimensionality when the number of dimensions is greater than ten. This translates into a degradation of search performance, as dimensions increase, to worse than a linear search [Berchtold et al., 1996, Weber et al., 1998, Bohm et al., 2001, Clarke et al., 2009, Mamoulis, 2012]. This happens because the volume of the space increases so fast that the data in the space become sparse [Clarke et al., 2009].

The query challenge relates to a special property of multidimensional data, which states that "there is no total ordering in the multidimensional space that preserves spatial proximity" [Mamoulis, 2012]. This means that there is no guarantee that two objects close in space will also be close in the linear order. Therefore, according to Mamoulis, the objects in space cannot be physically clustered into disk pages in a way that provides theoretically optimal performance bounds for multidimensional queries.

2.2 Multidimensional Relations

It is normal to speak of multidimensional data without thinking of it as such. When we want to give our house's location to someone, it is common to use one of two options: either we give the address, which gives a good location of the house, or we give a reference to something that is near it. According to [Mamoulis, 2012], our house can be defined as a spatial object, since it has at least one attribute which characterizes its location. Whether the house is a detached house or an apartment, it occupies an area in two- or three-dimensional space, depending on the chosen representation. Figure 2.1(a) shows this area, which is defined as the geometric extent of the spatial object. However, there is not always a geometric extent. Another example is a large-scale map where our home city is a spatial object which has a location, yet does not have a geometric extent. This is typically represented as a point on the map (see Figure 2.1(b)). Multidimensional data clearly fall into the latter case, since they are represented by points or vectors without geometric extent in spaces usually with more than three dimensions.

Two or more objects with the same semantics define a spatial relation. It can be represented in a table where each row and column correspond to an object and its attributes, respectively [Mamoulis, 2012]. In Figure 2.1(a), if we choose to give a reference to something near our house, for example the Paladares restaurant, we are establishing a spatial relation between two objects in space based on the distance between them. This can be defined using a distance metric, or implicitly through a distance range. This distance can further be classified objectively or subjectively [Mamoulis, 2012]. In Figure 2.1(a), our house is within 5 km from the nearest university. This distance can be considered near if we drive, or far if we walk.

These relations can also be of two other types: directional or topological. The first relates two objects based on a global reference system [Mamoulis, 2012], such as a compass rose. Following our example in Figure 2.1(a), we may say that the University is located southwest of our house. We can also add that our house is right in front


(a) Example of a map that includes our house, the nearest restaurant (Paladares) and the nearest University.

(b) Example of a large-scale map that represents cities as points.

Figure 2.1: Examples of multidimensional data.

of the Paladares restaurant. In this example, we are using a reference system defined by the viewer, in this case, us.

An object can be defined by a given set of points that fill it, defining its interior, and another set that defines its frontier. A topological relation uses the notions of interior and frontier of an object to relate objects. Take, for instance, the statement that our house has a garage that is adjacent to the kitchen. The words "has" and "adjacent" express topological relations, since the first implies that the house contains a garage, and the second reveals a relation between the limits of two objects. There are other topological relations of intersection in addition to these, like inside, equals or overlap. There are also disjoint topological relations, which are another extension of the topological relations in multidimensional spaces. For more information on this topic, please see [Mamoulis, 2012].

2.3 Vector Space

Multidimensional data, especially high-dimensional data, arise mostly from the need to map features of multimedia objects to a high-dimensional space. Also called a feature vector space, it can represent multimedia repositories in a simple way and be used to find similar objects through similar feature vectors. The idea behind it is very simple. Take, for example, your favorite song. You can choose to represent your song as a histogram containing the percentage of each note used in the song. Then, each note will correspond to a dimension and each percentage of the note used in the song will be the value in that dimension. At the end, you will have a d-tuple that can be represented by a point in a multidimensional space. Finally, if you look for the nearest neighboring point to yours in a song repository, this will probably be the song most similar to your favorite song. Accordingly, two songs that map to two nearby points should be more similar than two songs that map to two distant points [Ramakrishnan and Gehrke, 2002].

According to [Bohm et al., 2001], the feature transformation F can be defined as the mapping of a multimedia object Obj into a d-dimensional feature vector

F : \mathrm{Obj} \longrightarrow \mathbb{R}^d \quad (2.1)

The similarity between the two objects can be determined using a distance metric dist_p, where p describes the metric


(a) L1 norm (b) L2 norm (c) L∞ norm

Figure 2.2: Examples of unit circles for the L1, L2 and L∞ norms. Adapted from Wikipedia.

chosen,

\mathrm{dist}_p(obj_1, obj_2) = \| F(obj_1) - F(obj_2) \|_p \quad (2.2)

The L_m norm is usually known as L_p, but is written here with an m to distinguish it from the point p. This norm represents the variety of metrics that we can use to define the distance between two points p and q in an arbitrary space S. Let m ∈ [1,∞[ be a real number. The L_m metric can be defined as follows:

\| p - q \|_m = \left[ \sum |p_i - q_i|^m \right]^{1/m} \quad (2.3)

According to Bohm et al., the Euclidean metric L_2 is the most commonly used function and is defined by substituting the m in Equation 2.3 with 2:

\| p - q \|_2 = \left[ \sum |p_i - q_i|^2 \right]^{1/2} \quad (2.4)

Basically, it consists of the straight-line distance between p and q. A unit circle using this metric is represented in Figure 2.2(b). There are other popular functions like the Manhattan metric L_1, which is defined by substituting the m in Equation 2.3 with 1:

\| p - q \|_1 = \sum |p_i - q_i| \quad (2.5)

This metric is named after a city because distances are conceived as streets in a city: the distance is not a straight line because we assume that we cannot cut through a building. A unit circle using this metric is represented in Figure 2.2(a). Another metric is the maximum norm, or Chebyshev metric L∞, where the m is replaced by the infinity symbol:

\| p - q \|_\infty = \max\{ |p_i - q_i| \} \quad (2.6)

This metric considers only the maximum coordinate difference between the two points and corresponds to the limit of the L_m norms as m grows. Figure 2.2 illustrates examples of unit circles for the three norms presented here. For more information on this topic, please see [Bohm et al., 2001].
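As a small illustration of these three metrics, the following sketch computes the L1, L2 and L∞ distances between two feature vectors; the vectors themselves are arbitrary example values.

```python
def minkowski_distance(p, q, m):
    """L_m distance between two equal-length feature vectors (Equation 2.3)."""
    return sum(abs(pi - qi) ** m for pi, qi in zip(p, q)) ** (1.0 / m)

def chebyshev_distance(p, q):
    """L_inf distance: the maximum coordinate difference (Equation 2.6)."""
    return max(abs(pi - qi) for pi, qi in zip(p, q))

p, q = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski_distance(p, q, 1))   # L1 (Manhattan): 5.0
print(minkowski_distance(p, q, 2))   # L2 (Euclidean): ~3.606
print(chebyshev_distance(p, q))      # L_inf (Chebyshev): 3.0
```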


2.4 Multidimensional Query

Similarly to relational queries, multidimensional queries are the substantial motivation for the optimization of

multidimensional data management. Take again the example of your favorite song presented in the previous

section. After applying a feature transformation, the idea is to find the nearest point to ours in order to discover

our preferred song in a song repository. These searches are called similarity queries and there are mainly two types: range queries and nearest neighbor queries. The first refers to a spatial extent around a point and returns all regions that overlap the query region. On the other hand, nearest neighbor queries aim to find the spatial data objects close to a particular point. There are other types of queries [Mamoulis, 2012], but we will focus on these since they are the most common in d-dimensional vector spaces with d > 3.

According to Bohm et al., all these queries are based on the notion of a distance metric between two points p and q in the vector space S^d with d dimensions. Let m be the metric chosen to compute the distance [Bohm et al., 2001].

2.4.1 Range Query

As the name indicates, the query is related to a range, for example, ”Find all points within a radius of 5”. This

will depend on the distance metric previously defined. Imagine a large multimedia repository and you want to find

your favorite song. The search should retrieve all the songs with the same specifications you defined in the query.

More formally, this type of query is the simplest and searches for all the points p in the vector space S^d that are identical to our query point q.

\mathrm{PointQuery}(S^d, q) = \{ p \in S^d \mid p = q \} \quad (2.7)

Looking now at Figure 2.1(a), what if we want to search for all the restaurants that are less than 5 km away from our house? This question is a range query where the predicate is well defined: less than 5 km away from our house. More formally, this query searches for all the points p in the vector space S^d that are inside a radius r defined by a distance metric m.

\mathrm{RangeQuery}(S^d, q, r, m) = \{ p \in S^d \mid \mathrm{dist}_m(p, q) \leq r \} \quad (2.8)

Point queries are particular cases of range queries with radius r = 0. Typically, range queries describe geometric figures in the vector space according to the metric m chosen. For example, if m is the Euclidean metric (L2), the figure described will be a hypersphere and all the points inside this figure are retrieved by the query. On the other hand, if m is the Chebyshev metric (L∞), the figure will be a hypercube.

The range query and its variants have some disadvantages because the size of the query result set is unknown, which may produce either an empty set or almost the entire space. This happens because the user must specify the radius without knowing the amount of results the query may produce.
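A minimal sketch of Equation 2.8 as a linear scan follows; it assumes the Euclidean metric and an arbitrary example dataset, and is meant only to make the definition concrete.

```python
import math

def range_query(points, q, r):
    """Return every point within Euclidean distance r of the query point q."""
    return [p for p in points if math.dist(p, q) <= r]

points = [(1, 1), (2, 3), (8, 9), (2, 2)]
print(range_query(points, q=(2, 2), r=1.5))   # [(1, 1), (2, 3), (2, 2)]
```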

2.4.2 Nearest Neighbor Query

For example, "Find the nearest point to mine". In Figure 2.1(a), suppose we want the nearest restaurant to our house. This query represents the classic nearest neighbor query, which retrieves the point p with the shortest distance to our query object q, according to the distance metric m. In case of a tie, the query retrieves all the tied points.

\mathrm{NNQuery}(S^d, q, m) = \{ p \in S^d \mid \forall p' \in S^d : \mathrm{dist}_m(p, q) \leq \mathrm{dist}_m(p', q) \} \quad (2.9)

There are a few variants of the nearest neighbor query. We may want to specify the k nearest neighbors the query must retrieve instead of getting a single nearest neighbor, or obtain an approximate nearest neighbor instead of an exact one, or combine both and get an approximate k-nearest neighbor.

The k-nearest neighbor query searches for the k nearest neighbor points p_i, i ∈ [0, k − 1], such that no other point p' is closer than any p_i to our query point q.

\mathrm{kNNQuery}(S^d, q, k, m) = \{ p_i \in S^d \mid \nexists\, p' \in S^d \setminus \{p_i\} \wedge \nexists\, i : \mathrm{dist}_m(p_i, q) > \mathrm{dist}_m(p', q) \} \quad (2.10)

The approximate nearest neighbor query and the approximate k-nearest neighbor query follow the same idea as the previous ones, given a query point q and a number k of neighbors. In these cases, the query retrieves an approximation instead of an exact result, which in high-dimensional spaces can be a significant advantage in terms of computational effort. The heuristic used to find the approximate neighbor varies according to each application. For more information on multidimensional queries, please see [Bohm et al., 2001].
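For reference, an exact k-nearest neighbor query can be answered by the brute-force scan sketched below; the approximate variants studied in this thesis aim precisely to avoid such a full scan. The dataset and k are arbitrary example values.

```python
import math

def knn_query(points, q, k):
    """Exact k nearest neighbors of q under the Euclidean metric (Equation 2.10),
    found by sorting the whole dataset by distance to q."""
    return sorted(points, key=lambda p: math.dist(p, q))[:k]

points = [(1, 1), (2, 3), (8, 9), (2, 2), (0, 5)]
print(knn_query(points, q=(2, 2), k=2))   # [(2, 2), (2, 3)]
```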

2.5 Multidimensional Index

Database Management Systems (DBMSs) have undergone several changes in their access methods in order to support multidimensional data. In order to locate and fetch data while fulfilling minimum performance requirements, structures like indexes are of the utmost importance. Multidimensional indexing turns out to be essential for faster data mining and similarity search compared with traditional indexing structures. In multidimensional space, data can be seen as points or regions according to the application. To organize these points, an important factor must be taken into account over the common space: the spatial relations between them. This factor is considered when organizing the entries in the index structure [Mamoulis, 2012].

There are several structures oriented towards indexing multidimensional data. These vary according to their structure, whether or not they take the data distribution into account, whether they target points or regions, whether they are more suited to high- or low-dimensional spaces, etc. For example, structures such as Kd-trees [Bentley, 1975, Bentley and Friedman, 1979], hB-trees [Lomet and Salzberg, 1990], Grid files [Nievergelt et al., 1984] and Point Quad trees, among others, are oriented to indexing points. Structures such as the Region Quad trees [Finkel and Bentley, 1974], SKD trees [Ooi et al., 1987] or R-trees can handle both points and regions [Ramakrishnan and Gehrke, 2002]. The Subspace-tree [Wichert, 2008] and the LSD-tree are more suited to high dimensions, while the R-tree, the Kd-tree and space-filling curves are more suited to low-dimensional spaces. These lists are quite incomplete due to the wide variety of existing proposals. Although there is no consensus on the best structure for multidimensional indexing, the R-tree is the one most commonly used to benchmark the performance of new proposals. Furthermore, in commercial DBMSs, R-trees are preferred due to their suitability for points and regions and their orientation to low-dimensional space [Yu, 2002, Ramakrishnan and Gehrke, 2002]. The list of alternative access methods to compete with fractals in low-dimensional space (two to thirty dimensions) is quite extensive. Therefore, I chose the R-tree and the



Figure 2.3: R-tree planar and directory representation adapted from [Yu, 2002].

Kd-tree as two possible alternatives to take a closer look at.

2.5.1 R-tree

Since multidimensional indexes are primarily conceived for secondary storage, the notion of a data page is a concern. These access methods perform a hierarchical clustering of the data space, and the clusters are usually covered by page regions. Each data page is assigned a page region, which is a subset of the data space. The page regions vary according to the index and, in the R-tree case, it is called the minimum bounding rectangle (MBR) [Bohm et al., 2001]. At the highest resolution level, each MBR represents a single object and is stored in the lower leaf nodes (see Figure 2.3). These leaves also contain an identifier of a multidimensional point or region that represents some data. The R-tree descends from the B+-tree and therefore preserves the height-balanced property. Thus, at the higher levels we will have clusters of objects and, since they are represented by rectangles, we will also have clusters of rectangles [Yu, 2002, Ramakrishnan and Gehrke, 2002].

To insert a new object into the tree, the search for the appropriate leaf describes a single path traversed from the root to the leaf node. At each level, the base criterion is to find the covering box that needs the least enlargement to include the new object. In the event of a tie, we choose the node whose covering box has the smaller area. This rule is applied until we reach a leaf node. Here, if the node is not full, the insertion is straightforward: the object is inserted, and the box area is enlarged in order to cover it. The enlargement must be propagated to the parent nodes in order for the tree to remain coherent. But if the leaf node is full, the node is split into two, and this can generate a recursive split across the branch or even increase the height of the tree [Yu, 2002, Ramakrishnan and Gehrke, 2002].
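As an illustration of the descent criterion described above, the sketch below computes MBR enlargement and picks the child whose box grows least when accommodating a new point; it is a minimal sketch under simplifying assumptions (MBRs as corner-coordinate pairs, points only), not a full R-tree implementation.

```python
def area(mbr):
    """Area of an MBR given as (min_corner, max_corner) coordinate tuples."""
    lo, hi = mbr
    result = 1.0
    for l, h in zip(lo, hi):
        result *= (h - l)
    return result

def enlarge(mbr, point):
    """Smallest MBR covering both the original MBR and the new point."""
    lo, hi = mbr
    new_lo = tuple(min(l, c) for l, c in zip(lo, point))
    new_hi = tuple(max(h, c) for h, c in zip(hi, point))
    return (new_lo, new_hi)

def choose_subtree(children, point):
    """Pick the child MBR needing the least enlargement; ties broken by smaller area."""
    return min(children,
               key=lambda mbr: (area(enlarge(mbr, point)) - area(mbr), area(mbr)))

children = [((0, 0), (4, 4)), ((5, 5), (9, 9))]
print(choose_subtree(children, (6, 6)))   # ((5, 5), (9, 9)): no enlargement needed
```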

On the other hand, the deletion of an object may alter the upper nodes. If a leaf node underflows, the node is deleted and all its remaining entries are reinserted into the tree. Therefore, a deletion may cause re-adjustments upwards and downwards. In order to find elements that are encompassed by a given query rectangle, all the subtrees



Figure 2.4: Kd-tree planar and directory representation adapted from [Bohm et al., 2001].

that overlap the query are traversed until reaching a leaf node. In the leaves, the MBRs are tested against the query rectangle, and the data is fetched if the intersection is not empty [Yu, 2002].

The R-tree search performance relates to the size of the MBRs. Note that if there are too many overlapping regions, a search will result in intersections with several subtrees of a node. This requires traversing them all, and we must consider that the search occurs in a multidimensional space. Thus, the minimization of both coverage and overlap has a large impact on performance [Yu, 2002, Ramakrishnan and Gehrke, 2002]. Other proposals based on the R-tree have been presented over the years. The R*-tree tries to improve the R-tree by choosing coverage of a square shape rather than the traditional rectangle. This shape allows a reduction in coverage area by reducing the overlap and hence improving the performance of the tree. Like the R*-tree, many other structures have been proposed based on the R-tree, such as the R+-tree, the X-tree and the Hilbert R-tree, among others. However, there is also another problem that haunts these hierarchical indexing structures, known as the curse of dimensionality. It refers to the degradation of the performance of these structures with the increasing number of dimensions. The main underlying cause relates to the volume of the chosen form that covers the points: for a constant radius or edge size, this volume increases exponentially with the number of dimensions [Berchtold et al., 1996, Wichert, 2009]. Nevertheless, the R-tree is the most popular multidimensional access method and has been used as a benchmark to evaluate new structure proposals [Yu, 2002, Ramakrishnan and Gehrke, 2002].

2.5.2 Kd-tree

Throughout the years, several hierarchical indexes have been proposed to handle multidimensional data spaces. These structures evolve from the basic B+-tree index and can be grouped into two classes: indexes based on the R-tree and indexes based on the K-dimensional tree (Kd-tree). Unlike the R-tree, the Kd-tree is unbalanced, point-oriented and memory-oriented (see Figure 2.4). It divides the data space through hyper-rectangles, resulting in mutually disjoint subsets. As in the R-tree, the hyper-rectangles correspond to the page regions, but in this case most of them do not represent any object [Bohm et al., 2001, Wichert, 2009]. If a point is on the left of the hyper-rectangle it will be represented on the left side of the tree, otherwise it will be represented on the right side. The advantage is that the choice of which subtree to search is always unambiguous [Bentley, 1975, Yu, 2002].

Adding an element to this tree follows the same process as in any other tree structure. We start at the root, and we decide between the left and right branch according to whether the element's position is on the left or right side of the data space. Once we find the parent node under which we should be located, we insert the node on the right or left side according to the element's position relative to the parent node. There are some options to remove an element from this tree, but in general it is easier than in the R-tree because all leaf nodes with the same parent form a disjoint hyper-rectangle of the data space. Bohm et al. state that, for this reason, the leaves can be merged without violating the conditions of complete partitioning.
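The insertion just described can be sketched as follows; this is a minimal, purely illustrative kd-tree insert that cycles the splitting dimension by depth, not the memory-page layout discussed by Bohm et al.

```python
class KdNode:
    def __init__(self, point):
        self.point = point          # the stored multidimensional point
        self.left = None            # subtree with a smaller coordinate on the split axis
        self.right = None           # subtree with a greater-or-equal coordinate

def kd_insert(root, point, depth=0):
    """Insert a point, choosing the branch by comparing one coordinate per level."""
    if root is None:
        return KdNode(point)
    axis = depth % len(point)       # cycle through the dimensions
    if point[axis] < root.point[axis]:
        root.left = kd_insert(root.left, point, depth + 1)
    else:
        root.right = kd_insert(root.right, point, depth + 1)
    return root

root = None
for p in [(5, 4), (2, 6), (8, 1), (3, 7)]:
    root = kd_insert(root, p)
print(root.point, root.left.point, root.right.point)   # (5, 4) (2, 6) (8, 1)
```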

There are some disadvantages to having complete disjoint partitions of the data space. Bohm et al. state that the page regions are normally larger than necessary, which can lead to an unbalanced use of the data space, resulting in more accesses than with MBRs. Unlike the Kd-tree, the R-tree only covers the space that contains data. The Kd-tree is unbalanced by definition, in the sense that there are no direct connections between the subtree structures [Yu, 2002]. This makes it impossible to pack contiguous subtrees into directory pages. The Kd-B-tree [Robinson, 1981] results from a combination of the Kd-tree with the B-tree [Comer, 1979] and solves this problem by forcing the splits [Bohm et al., 2001, Mamoulis, 2012]. This culminates in a balanced resulting tree, like the B-tree, overcoming the problem above. Creating complete partitions has its drawbacks: the rectangles can be too big, containing only a cluster of points, or, the reverse, a rectangle may have too many points and become overloaded. Thus, more adjustable rectangles may bring better performance. The hB-tree tries to overcome this problem by decreasing the partition area using holey bricks. However, the hB-tree is not ideal for disk-based operation [Yu, 2002]. Berchtold et al. state that all these index structures are restricted with respect to the data space partitioning. They also suffer from the well-known drawbacks of multidimensional index structures, such as high costs for insert and delete operations and poor support for concurrency control and recovery [Berchtold et al., 1998].

2.6 Summary

This chapter introduced the basic concepts relating to multidimensional data and its challenges within the context of this thesis. After the introduction, we saw the multidimensional relations (section 2.2) that later influence multidimensional indexing. We also saw the vector space (section 2.3), which basically consists of a simple representation of the data, typically multimedia data. The vector space allows searching for similar objects, since they are represented in the same way. This similarity search is performed using distance metrics, also described in that section. In section 2.4, we saw how to perform similarity search queries. Finally, in section 2.5, we saw the most common access methods used in low-dimensional spaces (two to thirty dimensions) as alternatives to the space-filling curves, which will be introduced in the following chapter. However, all these index structures are restricted with respect to the data space partitioning. Additionally, they suffer from the well-known drawbacks of multidimensional index structures, such as high costs for insert and delete operations and poor support for concurrency control and recovery.


Chapter 3

Fractals

In the previous chapter, we saw basic concepts of the context surrounding this thesis proposal as well as the possible alternatives. In this chapter, we will see the fundamental concepts for understanding and reasoning about the thesis proposal. Thus, this chapter starts with a short introduction to fractals, followed by section 3.2, which explains the concept of the Hausdorff dimension. Section 3.3 describes space-filling curves, which are computer-generated fractals. Section 3.4 describes how a space-filling curve can be generated for any dimension. The following section, 3.5, describes a special property of the space-filling curves that can be very useful when they are used as an access method. Section 3.6 describes how the Hilbert curve, which is the most promising space-filling curve, has been used as an access method. Finally, the last section, 3.7, summarizes this chapter.

3.1 Introduction

Benoit B. Mandelbrot, in 1975, introduced the concept of a fractal as an object that can be broken into several parts, each one approximately a reduced copy of the original object [Mandelbrot, 1975]. Examples of fractals are forests, mountains, leaves, galaxies, etc. They all have a self-similarity that repeats at distinct levels of magnitude, with different similarity ratios in different directions. From the traditional geometric point of view, objects are abstracted and based on the classic geometric figures like squares, triangles or others [Frame and Mandelbrot, 2002]. Fractal geometry will make us all see the world differently [Barnsley, 2013].

Clouds are not spheres, mountains are not cones, coastlines are not circles, and bark is not smooth,

nor does lightning travel in a straight line.

Mandelbrot in [Mandelbrot, 1982]

A closer look at a cloud or a mountain reveals a degree of repetition between the parts and the whole. Fractal geometry allows constructing precise models of all kinds of physical structures. According to Frame and Mandelbrot, it allows us to describe the shape of a cloud the way an architect describes a house. It also relies on a special notion of dimension, which I introduce below.


3.2 The Hausdorff Dimension

Usually, when you go to the cinema, you probably say, "I am going to see Big Hero 6 in 3D!". You don't say, "I am going to see Big Hero 6 in 2.7D!". In fractal geometry, it is possible to use a fraction as the value of an object's dimension. This is a classical mathematical topic, the Hausdorff Dimension, that is often forgotten due to a lack of interest or ignorance about its potential [Falconer, 2007, Mandelbrot, 1982].

Fractal geometry relies on a special notion of dimension, the Hausdorff Dimension [Mandelbrot, 1982]. The idea of a line having dimension one and a surface having dimension two may not hold in this domain. For example, in classical geometry two coordinates are required to define a point in the plane. In fractal geometry, a point can be seen as being on a line, thus we need a single coordinate to represent it [Mandelbrot, 1982]. An object can have a dimension bigger than one, expressing an infinite length, and, simultaneously, smaller than two, corresponding to zero area [Falconer, 2007].


Figure 3.1: Ratio similarity example with line segments.

A line can be decomposed into N parts which together have the size of the original (see Figure 3.1). Although the previous statement is obvious, it transmits information that will be understood in more detail below. Imagine a line A of length L and N subsegments B of length s. The union of the N segments B is equal to A. Therefore, the length of B can be derived by dividing the length of A by the number N of segments B, known as the scale factor:

s = \frac{L}{N} \quad (3.1)

Or it can be defined in terms of the similarity ratio:

r = \frac{N}{L} \quad (3.2)


Figure 3.2: Ratio similarity example with squares


In Figure 3.1, the line A has length L = 9 and can be broken into N = 3 segments B. Note that the length of B is obtained by dividing the length of A by the number of segments B, so s = 9/3 = 3, and the similarity ratio corresponds to r = 1/3, since the line segments are 1/3 of the size of the original. Following the same line of thought, it is possible to increase the complexity by thinking now of squares. In Figure 3.2 the similarity ratio corresponds to r = 1/2, since the side length of the inner square corresponds to half the side of the outer square. In Euclidean geometry, the notion of the similarity ratio is not very useful for capturing the self-similarity relationship between large and small objects. Therefore, the Hausdorff Dimension, also known as the Fractal Dimension, gives a more complete notion of dimension and self-similarity. Thus, supposing a figure that can be decomposed into N sub-figures with scale factor s, the Hausdorff Dimension D is obtained through this formula:

D = \frac{\log(N)}{\log(s)} \quad (3.3)

Analyzing dimension 2 for a square, the variation of the two variables is shown in Figure 3.3. N varies according to the number of cuts the outer square suffers. When the side length of the outer square is divided into two (s = 2), this results in two segments along each axis with half of the original size. As the dimension of the square is two, the result is four inner squares; this number is obtained using Equation 3.4, which is a rearrangement of Equation 3.3 and corresponds to the number of elements needed to cover the outer square, in this case 4. The same analogy applies when the side length is divided into three.

N = s^D \quad (3.4)


Figure 3.3: The variation of the variables of the Equation 3.3 s and N along the dimension 2.

Consequently, if a square, which is a figure that has dimension two, contains a curve that covers its entire area, we must conclude that the curve also has dimension two. The Hilbert curve is an example of a curve that has dimension two while still being a line in space. On the other hand, for curves such as the snowflake of Mandelbrot [1982], the Hausdorff Dimension is a fraction, since D = log(4)/log(3) ≈ 1.26 [Mandelbrot, 1982].
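A quick numeric check of Equation 3.3 for the figures discussed above can be sketched as follows; the (N, s) pairs are the ones used in this section.

```python
import math

def hausdorff_dimension(n_parts, scale_factor):
    """Hausdorff (fractal) dimension D = log(N) / log(s) from Equation 3.3."""
    return math.log(n_parts) / math.log(scale_factor)

print(hausdorff_dimension(3, 3))   # line split into 3 parts at scale 3 -> 1.0
print(hausdorff_dimension(4, 2))   # square split into 4 parts at scale 2 -> 2.0
print(hausdorff_dimension(4, 3))   # snowflake segment: 4 parts at scale 3 -> ~1.26
```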


3.3 Space-Filling Curves

A fractal can be found in Nature, as we saw, or can be computer generated thanks to its self-similarity property. A space-filling curve is an example of a computer-generated fractal. Jordan, in 1887, formally defined a curve as a continuous function with endpoints whose domain is the unit interval [0,1] [Moon et al., 2001, Hales, 2007], but the space-filling curve is more than a simple curve. It has special properties. It consists of a path in multidimensional space that visits each point exactly once without crossing itself, hence introducing a total ordering on the points. However, in the previous chapter, we saw that there is no total ordering that fully preserves the spatial proximity between the multidimensional points in order to provide theoretically optimal performance for spatial queries. According to Moon et al., once the problem of the total order is overcome, it is possible to use the traditional one-dimensional indexing structures for multidimensional data with good performance on similarity queries [Moon et al., 2001].

So, the reader must be thinking about how we can represent a space-filling curve. Well, we assume that every space has a finite granularity and that it is possible to see the space as grid cells according to the Hausdorff dimension.

A space-filling curve starts with a basic curve called first-order curve. To obtain the curves of following order, we

replace each vertex of the curve by the previous order curve (see Figure 3.4). The i− 1 order curve can be rotated

or mirrored to form the curve i [Jagadish, 1990]. The curves can vary along the axes, order and dimension. The

first order of a curve in 2-dimensional space has 4 cells (2^(2×1)). The scale factor for the space-filling curves is 2. Therefore, we can rewrite Equation 3.4 to calculate the number of cells for each order o, obtaining

N = 2^(D·o)   (3.5)

According to this equation, the second order curve has 16 cells, the third has 64 cells and so on.

There are several ways to sort the grid, but some of them preserve the locality better than others. The purpose

is to keep the spatial locality, or in other words, points that are close in space should be also close in the linear

order [Yu, 2002, Jagadish, 1990, Mamoulis, 2012, Moon et al., 2001]. So, the reader must now be thinking about how we can represent the points on the curve. Well, each vertex of the curve corresponds to a possible point. Each axis is represented using the binary code and their combination results in the point location. The curves can be used as

a secondary access method combined with a simple B-Tree [Comer, 1979]. Some of the curves use the traditional

binary code and some use the Gray code as we’ll see below.

3.3.1 Z-Curve

The Z-curve is a space-filling curve and was introduced for the first time by Morton in 1966. This curve, as its name

suggests, has the ”Z” (or ”N”) shape as you can see in Figure 3.4. In this, the curve starts on the left lower corner

and ends on the upper right corner, but this can be changed as long as the curve keeps the ”Z” shape. The following

orders are derived by replacing each vertex with the previous-order curve and then connecting the figures.

Bit Shuffling A space-filling curve maps an n-dimensional point to a single dimension. Each space-filling curve

has its own algorithm to do the mapping. The Z-curve performs a bit interleaving of the coordinate values in order

to generate the so-called z-value, which is the value on the curve. Since there are a few algorithms to calculate the z-value [Faloutsos and Roseman, 1989], we select one to present here.

Figure 3.4: Z-curve orders 1, 2, and 3 respectively.

The Z-curve bit shuffling function in two-dimensional space will be represented as follows [Wichert, 2015]:

z = f(x, y) (3.6)

In the Cartesian axis for two dimensions, the point coordinates can be represented by x and y using decimal

notation. In this case, the coordinates are represented using a binary code of maximum length of 2n. Therefore,

the two coordinates can now be expressed as x = x1x2(2) and y = y1y2(2) where the numbers correspond to the

bit position in the binary code representation. As stated before, the Z-curve uses the bit interleaving in order to

generate the z-value, thus the expression 3.6 can be rewritten as

z = x1y1x2y2(2) (3.7)

In Figure 3.5 it is easy to see how the z-value is generated. To obtain the upper right value on the grid, we interleave the x and y bits. Starting with x = 0(2) and y = 1(2), the result will be z = 01(2). The same applies to the other z-values.

Figure 3.5: Z-curve first order bit shuffling.
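As a small illustration of the bit interleaving just described (a sketch written for this text, not code from the thesis), the following Java method interleaves two m-bit coordinates into a z-value; zValue(0, 1, 1) returns 01(2) = 1, as in Figure 3.5, and zValue(1, 2, 2) returns 0110(2) = 6, the query point used next.

// Bit interleaving for the Z-curve in two dimensions: z = x1 y1 x2 y2 ... (Equation 3.7).
static long zValue(long x, long y, int m) {
    long z = 0;
    for (int i = m - 1; i >= 0; i--) {
        z = (z << 1) | ((x >>> i) & 1);   // take the i-th bit of x first
        z = (z << 1) | ((y >>> i) & 1);   // then the i-th bit of y
    }
    return z;
}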

Queries Space-filling curves allow us to perform range queries and, usually, other types of similarity search are transformed into range queries. A square or rectangle is regularly used to represent the query area in the grid.

Take Figure 3.6 as an example. Suppose the query point has the coordinates x = 01(2) and y = 10(2) therefore,

z = 0110(2) or z = 6(10) using the decimal notation. In this figure, the darker gray square is the query point, and


the lighter gray one covers the query area. The latter matches four ranges in the linear mapping:

R1 = {1}, R2 = {3, 4, 5, 6, 7}, R3 = {9}, R4 = {12, 13}

Figure 3.6: Z-curve query example transformed into linear mapping.

These ranges are also called clusters. They represent an agglomerate of points that can be read at once. Another way to analyze the clusters has to do with the number of entry or exit points of the search area. Looking at this example, we can conclude that the Z-curve generated four clusters since it has four entry (or exit) points. This

clustering property of space-filling curves arouses considerable attention due to the benefits it can bring. For

example, a good clustering of multidimensional data on a disk translates into a reduction in disk (or memory)

accesses required for a range query [Moon et al., 2001]. The path that the curve describes is very important in this

domain. The Z-curve has a few long jumps that can be rearranged in order to reduce the number of clusters.

3.3.2 Hilbert Curve

Using a similar technique, the Hilbert Curve is another space-filling curve, based on the Peano Curve, and comes as an improvement [Faloutsos, 1988]. The Peano basic curve has the shape of an upside down "U". The curves of higher order are obtained by replacing the lower vertices with the previous order curve. The upper vertices are also

replaced by the previous order curve but suffering a rotation of 180 degrees. The resulting curves can be seen in

[Faloutsos, 1988]. The Hilbert curve starts with the same Peano basic curve but evolves differently. On the following orders, the top vertices are replaced by the previous order, and the bottom vertices suffer a rotation (see Figure 3.7).

Figure 3.7: Hilbert Curve orders 1, 2, and 3 respectively.

The bottom left vertex is rotated 90 degrees clockwise, and the bottom right rotates 90 degrees counterclockwise.

The Figure 3.7 shows the Hilbert curve orders one, two and three. In this, the curve starts on the lower left corner

and ends on the lower right corner, but this can be changed as long as the curve keeps the ”U” shape. The Hilbert

curve in three dimensions can have different Hamiltonian paths for the first-order curve depending on the defined

entry and exit points of the cube. Figure 3.8 shows an example of this curve in three dimensions.

Figure 3.8: Hilbert Curve in three dimensions from order one to three respectively [Dickau, 2015]

Bit Shuffling Analogous to the z-value, the Hilbert Curve has its own h-value. This curve, as the Peano, uses the Gray code to order its values. The Gray code is a binary code system in which every two adjacent numbers differ in only one digit (see Table A.1). There are several algorithms [Faloutsos and Roseman, 1989] to compute the h-value. However, we will see a simple explanation in order to have a basic comprehension. The Hilbert curve bit shuffling

function for two-dimensional space can be represented as follows [Wichert, 2015]:

h = f(x, y) (3.8)

The two coordinates can be expressed as x = x1x2(2) and y = y1y2(2) where each one is a bit string of length

two. To compute the h-value, we can follow four basic steps:

1. Perform the bit interleaving;

2. Split the resulting bit string and represent them using the decimal notation;

3. Apply the ”0-3 rule”;

4. And finally, concatenate the strings to obtain the h-value.

As a result of the first step we obtain:

h = x1y1x2y2(2) (3.9)

In the second step, we split the previous string into n substrings, where n corresponds to the coordinates' bit length, each substring with a length equal to the original coordinates' length. After splitting, the substrings are converted to the

decimal notation, having the Gray code as reference. Thus,

h = x1y1x2y2(2)
s1 = x1y1(2), s2 = x2y2(2)
d1 = s1(10), d2 = s2(10)
d = [d1, d2]   (3.10)

Figure 3.9: Hilbert second-order curve - location of the h-value 12.

The third step configures a rule that I informally called ”0-3 rule”:

• If 0 is present in d, change every following occurrence of 1 to 3 and vice-versa;

• If 3 is present in d, change every following occurrence of 0 to 2 and vice-versa.

After applying the corresponding changes to array d, its values are converted to binary code and concatenated in order

to obtain the h-value.

d = [d1, d2]
d1 = x1y1(2), d2 = x2y2(2)
h = x1y1x2y2(2)   (3.11)

In order to clarify the explanation, we will see an example of an h-value computation in Figure 3.9. In this, we intend to obtain h = 12 from its coordinates x = 11(2) and y = 01(2). First, (1) we apply the bit interleaving technique using the two coordinates. Then, (2) we split the string into two substrings and convert them to decimal notation having the Gray code as reference (see Table A.1).

h = 1011(2)
s1 = 10(2), s2 = 11(2)
d1 = 3(10), d2 = 2(10)
d = [3, 2]   (3.12)

Afterwards, (3) we apply the "0-3 rule". Since the first number is 3, it is necessary to swap every following 2 for 0 and vice-versa.

d = [3, 2]
d = [3, 0]   (3.13)

Finally, (4) we convert d to binary code and concatenate the substrings.

s1 = 11(2), s2 = 00(2)
h = 1100(2)
h = 12(10)   (3.14)

This algorithm only works up to three dimensions. A broader one will be introduced in the following section 3.4.
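As an illustration of these four steps (and only for the two-dimensional case just described), the following Java sketch applies the interleaving, the Gray-code reading, the "0-3 rule" and the final concatenation exactly as stated; hValue2D(3, 1, 2), i.e. x = 11(2) and y = 01(2) with 2-bit coordinates, reproduces the worked example and returns 12. It is written for this text and is not the general algorithm of section 3.4.

// h-value of a 2-dimensional point with m-bit coordinates, following steps 1-4 above.
static int hValue2D(int x, int y, int m) {
    // Steps 1 and 2: interleave the bits and read each 2-bit pair as a digit,
    // using the 2-bit Gray code (00, 01, 11, 10) as reference.
    int[] grayToDecimal = {0, 1, 3, 2};           // binary value of the pair -> position in the Gray code
    int[] d = new int[m];
    for (int i = 0; i < m; i++) {
        int xi = (x >>> (m - 1 - i)) & 1;
        int yi = (y >>> (m - 1 - i)) & 1;
        d[i] = grayToDecimal[(xi << 1) | yi];
    }
    // Step 3: the "0-3 rule", applied from left to right.
    for (int i = 0; i < m; i++) {
        if (d[i] == 0) {                           // swap every following 1 <-> 3
            for (int j = i + 1; j < m; j++) { if (d[j] == 1) d[j] = 3; else if (d[j] == 3) d[j] = 1; }
        } else if (d[i] == 3) {                    // swap every following 0 <-> 2
            for (int j = i + 1; j < m; j++) { if (d[j] == 0) d[j] = 2; else if (d[j] == 2) d[j] = 0; }
        }
    }
    // Step 4: concatenate the digits as plain binary.
    int h = 0;
    for (int i = 0; i < m; i++) h = (h << 2) | d[i];
    return h;
}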

Queries The queries on the Hilbert Curve are done in the same way as described for the Z-curve, but the resulting clusters are slightly different. Let us analyze the same query done for the Z-curve, now with the Hilbert curve (see Figure 3.10). Suppose the query point has the coordinates x = 01(2) and y = 10(2); therefore, z = 6(10) and h = 7(10). For the same grid with 4x4 cells and for the same query point, the same range query has two different maps. The Hilbert curve maps only two clusters compared to the four generated by the Z-curve. This means that the Hilbert curve has a

better clustering property which translates to a reduction in disk (or memory) accesses required to perform a range

query [Moon et al., 2001].

Figure 3.10: Hilbert Curve clusters generation compared with Z-curve for the same query.

There are several ways to perform similarity search on space-filling curves [Chen and Chang, 2011, Faloutsos,

1988]. Suppose we want to find the nearest gas station to a certain point X. Faloutsos and Roseman suggest that

we follow this algorithm:

1. Calculate the h-value of X;

2. Find the X’s preceding and succeeding points on the Hilbert path until one of the points corresponds to the

gas station;


3. Calculate the distance d from the X location to the gas-station location;

4. And check all the points within d blocks of X.

Note that the fourth step is done because we intend to find the nearest neighbor of X. If an approximate value is

sufficient, the algorithm could end at the third step as soon as it finds the first gas station.
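A rough Java sketch of steps 1 and 2, assuming the data points are stored in an array sorted by their h-values and that a hypothetical predicate isGasStation tells whether the point at a given index is a gas station; startIndex stands for the position of X on the curve (step 1). None of these names come from the cited works. Steps 3 and 4 would then compute the distance d and verify every point within d blocks of X.

// Step 2: walk outward from X along the Hilbert path (preceding and succeeding
// points alternately) until the first gas station is found; returns its index
// in the h-value-sorted array, or -1 if there is none.
static int firstStationOnCurve(int startIndex, int size, java.util.function.IntPredicate isGasStation) {
    for (int step = 1; startIndex - step >= 0 || startIndex + step < size; step++) {
        if (startIndex - step >= 0 && isGasStation.test(startIndex - step)) return startIndex - step;
        if (startIndex + step < size && isGasStation.test(startIndex + step)) return startIndex + step;
    }
    return -1;
}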

3.4 Higher Dimensions

We have seen how the space-filling curves work in 2 dimensions. In this section, we will see the behavior of the

Hilbert curve when the dimensions are greater than 2. The concepts introduced in this section are based on [Hamilton, 2006]. The author takes a geometric approach to the curve in order to extend the concept to higher dimensions. We will focus on the Hilbert curve since it generally presents the best clustering property, which is a subject on which we will focus in the next section. To simplify the explanation, the author chose to use the Hilbert

first-order curve rotated 90 degrees clockwise and then 180 degrees along the horizontal axis (see Figure 3.11).

Figure 3.11: Hilbert curve analysis to higher dimensions.

In order to increase the curve dimensions, we need to generalize its construction and its representation. If we

look closely at the Hilbert first-order curve in two dimensions (see Figure 3.11), we can observe an unfinished square with 2 × 2 vertices. Each one of them represents a single point. On the second-order curve, we have 2^2 × 2^2 vertices due to the recursive construction of the curve. As the curve order increases, we will have 2 × · · · × 2 points in n dimensions corresponding to the hypercube vertices. Each of the 2^n vertices is represented by an n-bit string such as

b = [βn−1 · · ·β0] (3.15)

where βi ∈ B takes 0 (low) or 1 (high) according to its position on the grid. Looking at the Figure 3.11, we can

conclude that b = [00] corresponds to the lower left vertex of the square and b = [11] corresponds to the top

right vertex of the square. Along the curve, all the vertices are immediate neighbors. This means that their binary

representation only changes in one bit. The directions of the curve inside the grid are therefore represented using

the Gray code (see Table A.1). The function gc(i) returns the Gray code value for the ith vertex. The symbols ⊕

and >> represent the exclusive-or and the logical shift right respectively.

gc(i) = i⊕ (i >> 1) (3.16)


Applying this Equation for the four vertices in Figure 3.11 we will have:

gc(0) = 00⊕ (00 >> 1) = 00⊕ 00 = 00(2)

gc(1) = 01⊕ (01 >> 1) = 01⊕ 00 = 01(2)

gc(2) = 10⊕ (10 >> 1) = 10⊕ 01 = 11(2)

gc(3) = 11⊕ (11 >> 1) = 11⊕ 01 = 10(2)

(3.17)

The curve is constructed by replacing each vertex with the previous-order curve. Each replacement is a transformation/rotation that generates a new sub-hypercube. These operations must make sense along the 2^n vertices. By sense, I mean that "every exit point of the curve through one sub-hypercube must be immediately adjacent to the entry point of the next sub-hypercube" [Hamilton, 2006]. The author defines an entry point e(i), where i refers to the ith sub-hypercube.

e(i) = 0, if i = 0
e(i) = gc(2⌊(i − 1)/2⌋), if 0 < i ≤ 2^n − 1   (3.18)

The entry point e(i) and the exit point f(i) are symmetric and the latter is defined as

f(i) = e(2^n − 1 − i) ⊕ 2^(n−1)   (3.19)

Even at the first-order curve, we can look at each vertex as a sub-hypercube. Therefore, each one has an entry and an exit point. The entry and exit points for the Hilbert first-order curve, presented in Figure 3.12, are calculated now:

e(0) = 00(2) f(0) = e(3)⊕ 10 = 01(2)

e(1) = gc(0) = 00(2) f(1) = e(2)⊕ 10 = 10(2)

e(2) = gc(0) = 00(2) f(2) = e(1)⊕ 10 = 10(2)

e(3) = gc(2) = 11(2) f(3) = e(0)⊕ 10 = 10(2)

(3.20)

We can look at Figure 3.12 and observe that, to generate the first vertex/sub-hypercube, the curve enters on the left lower corner (00) and exits at the right lower corner (01). The next vertex is generated by entering on the left lower corner (00) and exiting on the right upper corner (10). And so on. Additionally, we need to know along which coordinate our next neighbor in the curve lies. The directions between the sub-hypercubes are given by the function g(i), where i refers to the ith sub-hypercube.

g(i) = k such that gc(i) ⊕ gc(i + 1) = 2^k   (3.21)

Calculating g(i) to the Hilbert first-order curve, we have that

g(0) = k, 00⊕ 01 = 01 = 1, k = 0

g(1) = k, 01⊕ 11 = 10 = 2, k = 1

g(2) = k, 11⊕ 10 = 01 = 1, k = 0

(3.22)


i    e(i)     f(i)     d(i)    g(i)
0    [00]2    [01]2    0       0
1    [00]2    [10]2    1       1
2    [00]2    [10]2    1       0
3    [11]2    [10]2    0       -

Figure 3.12: Analysis of the entry and exit points of the Hilbert first-order curve in two dimensions. The author refers that x corresponds to the least significant bit and y to the most significant. Adapted from [Hamilton, 2006].

This means that, in Figure 3.12, the next neighbor of the vertex i = 0 is along the x coordinate. In this case, x is

the least significant bit. For i = 1, the next neighbor is along the y coordinate. The author refers that g(i) can also

be calculated as g(i) = tsb(i), where tsb is the number of trailing set bits in the binary representation of i. The g(i) indicates the inter-sub-hypercube direction along the curve. We can also calculate the intra direction d(i) on the sub-hypercubes.

d(i) = 0, if i = 0;
d(i) = g(i − 1) mod n, if i ≡ 0 (mod 2);
d(i) = g(i) mod n, if i ≡ 1 (mod 2),   (3.23)

for 0 ≤ i ≤ 2^n − 1.

Again, for the Hilbert first-order curve, we have

d(0) = 0

d(1) = g(1) mod 2 = 1 mod 2 = 1

d(2) = g(1) mod 2 = 1 mod 2 = 1

d(3) = g(3) mod 2 = 0 mod 2 = 0

(3.24)

All these functions can be calculated for any dimension, allowing us to construct a Hilbert curve of higher dimensions. Now that the basic functions are introduced, we can pass to the main function. The author's main idea is to define a geometric transformation T such that "the ordering of the sub-hypercubes in the Hilbert curve defined by e and d will map to the binary reflected Gray code" [Hamilton, 2006]. The operator b ror i denotes the right bit rotation: it rotates the n bits of b to the right by i places. A more formal definition is given in [Hamilton, 2006].

Te,d(b) = (b ⊕ e) ror (d + 1)   (3.25)

Therefore, we will now do a complete example to simplify and reduce the explanations. Given p = [5, 6] =

[101(2), 110(2)], we want to determine the h-value for p. In this case, n = 2 and m = 3, where n refers to the

dimension and m to the bit precision - 3 bits to represent the value. The result is achieved through a series of m


projections and Gray code calculations. Given p, we can extract an n-bit number lm−1 that will tell us whether the

point p is in the lower or upper half set of points with respect to a given axis.

lm−1 = [bit(pn−1,m− 1) . . . bit(p0,m− 1)] (3.26)

The bit function returns the (m − 1)-th bit of p. The simplified algorithm has two steps:

1. Rotate and reflect the space such that the Gray code ordering corresponds to the binary reflected Gray code,

lm−1 = Te,d(lm−1) (3.27)

2. Determine the index of the associated sub-hypercube,

wm−1 = gc−1(lm−1) (3.28)

Applying this to our example, we will have that:

p = [p0, p1] = [101(2), 110(2)]   (3.29)

We will have m iterations to compute the h-value h. So i represents the iteration and varies from i = m − 1 down to i = 0. For i = 2, we start with the variables e = 00(2), d = 1 and h = 0.

l2 = [bit(p1, 2) bit(p0, 2)] = [11],
T0,1(l2) = (11 ⊕ 00) ror 2 = 11(2),
w2 = gc−1(11) = 10(2),
e(w2) = e(2) = 00(2),
d(w2) = d(2) = 1(10),
h = [w2] = 10(2) = 2(10)   (3.30)

So, for the first iteration, we can locate the point on the sub-hypercube h = 2 of the first-order curve. Since the curve needed to represent this point is a third-order curve, we must perform another two iterations to find the exact location of the point. For i = 1, we have the variables e = 00(2), d = 1(10) and h = 2(10).

l1 = [bit(p1, 1) bit(p0, 1)] = [10],
T0,1(l1) = (10 ⊕ 00) ror 2 = 10(2),
w1 = gc−1(10) = 11(2),
e(w1) = e(3) = 11(2),
d(w1) = d(3) = 0(10),
h = [w2 w1] = 1011(2) = 11(10)   (3.31)


So, for the second iteration, we can locate the point on the sub-hypercube h = 11 of the second-order curve. For the last iteration, i = 0, we have the variables e = 11(2), d = 0(10) and h = 11(10).

l0 = [bit(p1, 0) bit(p0, 0)] = [01],
T3,0(l0) = (01 ⊕ 11) ror 1 = 01(2),
w0 = gc−1(01) = 01(2),
e(w0) = e(1) = 00(2),
d(w0) = d(1) = 1(10),
h = [w2 w1 w0] = 101101(2) = 45(10)   (3.32)

Thus, the h-value for p = [5, 6] is h = 45(10).
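To make the iteration above concrete, the following Java sketch (an illustration written for this text, not Hamilton's or the thesis' actual code) implements gc, e, d and T as defined in this section and applies the simplified per-bit update used in the worked example, namely e ← e(w) and d ← d(w) with the initial values e = 0 and d = 1; Hamilton's full Compact Hilbert algorithm composes these transforms differently, so treat the update rule as an assumption. Running hilbertIndex(new long[]{5, 6}, 3) reproduces h = 45.

public final class HilbertSketch {

    // Gray code, Equation 3.16: gc(i) = i xor (i >> 1).
    static long gc(long i) { return i ^ (i >>> 1); }

    // Inverse Gray code: recovers i from gc(i).
    static long gcInverse(long g) {
        long i = 0;
        for (long v = g; v != 0; v >>>= 1) i ^= v;
        return i;
    }

    // g(i): number of trailing set bits of i (Equation 3.21).
    static int g(long i) { return Long.numberOfTrailingZeros(~i); }

    // Entry point e(i) of the i-th sub-hypercube (Equation 3.18).
    static long e(long i) { return (i == 0) ? 0 : gc(2 * ((i - 1) / 2)); }

    // Intra direction d(i) (Equation 3.23).
    static int d(long i, int n) {
        if (i == 0) return 0;
        return ((i & 1) == 0) ? g(i - 1) % n : g(i) % n;
    }

    // Right rotation of the n low bits of x by r places (the "ror" operator).
    static long rotateRight(long x, int r, int n) {
        r = ((r % n) + n) % n;
        long mask = (1L << n) - 1;
        x &= mask;
        return ((x >>> r) | (x << (n - r))) & mask;
    }

    // T_{e,d}(b) = (b xor e) ror (d + 1) (Equation 3.25).
    static long transform(long b, long e, int d, int n) {
        return rotateRight(b ^ e, d + 1, n);
    }

    // h-value of an n-dimensional point p with m bits per coordinate.
    static long hilbertIndex(long[] p, int m) {
        int n = p.length;
        long h = 0, e = 0;
        int d = 1;                                   // initial values of the worked example
        for (int i = m - 1; i >= 0; i--) {
            long l = 0;                              // l_i = [bit(p_{n-1}, i) ... bit(p_0, i)]
            for (int j = n - 1; j >= 0; j--) l = (l << 1) | ((p[j] >>> i) & 1);
            long w = gcInverse(transform(l, e, d, n));
            e = e(w);                                // simplified update (assumption, see text)
            d = d(w, n);
            h = (h << n) | w;                        // append w_i to the index
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hilbertIndex(new long[]{5, 6}, 3));   // prints 45
    }
}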

3.5 Clustering Property

The clustering property of space-filling curves arouses considerable attention due to the benefits it can bring.

Clustering relates to the preservation of locality between multidimensional objects when mapped to the linear

space. For example, a good clustering of multidimensional data translates into a reduction in disk accesses required

for a range query. Moon et al. conducted a study [Moon et al., 2001] to derive closed-form formulas for a number

of clusters in a particular region. The formulas provide a measure that can be used to predict a disk total access

time.

The analysis is based on Hilbert space-filling curve since it presents better results in preserving the locality. For

this purpose, some assumptions are made to simplify and clarify the lines of this argument. The multidimensional

space is considered to have finite granularity where each point corresponds to a grid cell. The number of factors that

influence the disk accesses are diverse [Moon et al., 2001, Faloutsos and Roseman, 1989]. Therefore, the average

number of clusters in a subspace of a point grid is used as a performance measure of the Hilbert curve. This

subspace is the region of a query and each grid point maps a disk block. This performance measure corresponds

to the number of non-consecutive hits.

The analysis takes two different courses. The first one is the asymptotic analysis of the clustering property

of the Hilbert curve. It focuses on the relation between the growth of the grid space, tending to infinity, and the average

number of clusters in a query subspace region. Through a series of demonstration exercises, the authors reach the

Theorem 1. The formal definition is presented below.

Theorem 1. In a sufficiently large d-dimensional grid space mapped by H_k^d, let S_q be the total surface area of a given rectilinear polyhedral query q. Then,

lim_{k→∞} N_d = S_q / (2d).   (3.33)

In a simple way, this sets an asymptotic solution revealing that the number of clusters is approximately proportional to the hyper-surface area of a d-dimensional polyhedron. It also provides the constant factor of the linear function as being (2d)^(−1), where d represents the number of dimensions and H_k^d corresponds to the Hilbert curve representation in dimension d and order k. This theorem has immediate consequences, therefore an important corollary comes along.

Figure 3.13: Clustering analysis example with different query area shapes: (a) rectangle query area example; (b) square query area example.

Corollary 1. In a sufficiently large d-dimensional grid space mapped by H_k^d, the following properties are satisfied:

1. Given an s1 × s2 × ... × sd hyper-rectangle,

lim_{k→∞} N_d = (1/d) · Σ_{i=1}^{d} [ (1/s_i) · Π_{j=1}^{d} s_j ]   (3.34)

2. Given a hypercube of side length s,

lim_{k→∞} N_d = s^(d−1)   (3.35)

Understanding the true meaning of these formulas is difficult without an example, thus let us observe Figure 3.13. The idea is to calculate the average number of clusters inside a query area. In this figure, we have two examples to test the Corollary 1 formulas. The first example, Figure 3.13(a), is a rectangular area with sides s1 = 2 and s2 = 3. We now apply Equation 3.34 with dimension d = 2.

lim_{k→∞} N_2 = (1/2) · Σ_{i=1}^{2} [ (1/s_i) · Π_{j=1}^{2} s_j ] = (1/2) · [ (1/s1)(s1 · s2) + (1/s2)(s1 · s2) ] = (1/2) · [ (1/2)(6) + (1/3)(6) ] = 5/2 = 2.5   (3.36)

So we can conclude that the average number of clusters is 2.5. In this case, the exact number of clusters is 2 since we have two uninterrupted curves inside the query area. Let us now analyze the square case in Figure 3.13(b). For square query areas, we apply Equation 3.35. Since the dimension is 2 and the square has side 2, we can now substitute.

lim_{k→∞} N_2 = 2^(2−1) = 2   (3.37)
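The two results above can be checked with a few lines of Java implementing Equation 3.34 (Equation 3.35 being the hypercube special case); this is only a verification sketch written for this text, not code from [Moon et al., 2001].

// Average number of clusters for an s1 x ... x sd query hyper-rectangle (Equation 3.34).
static double avgClusters(int[] sides) {
    int d = sides.length;
    double product = 1;
    for (int s : sides) product *= s;        // s1 * s2 * ... * sd
    double sum = 0;
    for (int s : sides) sum += product / s;  // sum of (1/si) * product
    return sum / d;
}
// avgClusters(new int[]{2, 3}) = 2.5 and avgClusters(new int[]{2, 2}) = 2.0, matching the examples.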

In this case the average number of clusters matches the example since we have again two uninterrupted curves inside the query area. The second course of this analysis is related to the exact analysis of the same property in a 2-dimensional space. It is the same idea but for a finite grid space, thus giving a notion of how fast the number of clusters converges to the asymptotic solution. In order to demonstrate the accuracy and correctness of the asymptotic analysis (see Appendix A.1), a simulation experiment is created for range queries of different shapes and sizes. Despite the distinct forms of queries chosen, all of them can be decomposed into rectangles. The simulation is performed only for grids of two and three dimensions due to the extended range of options. Although the paper focuses on the Hilbert curve, the simulation is extended to the Z and Gray-code curves in order to compare performance.

The simulation results show deep similarities between the empirical results and the results obtained from the derived formulas. Theorem 1 and Corollary 1 provide an excellent approach to d-dimensional queries of different shapes. It is further concluded that the Hilbert curve overall outperforms the Z and Gray-code curves. Moon et al. also refer that, assuming the blocks on the disk are ordered according to a Hilbert curve, accessing the minimum bounding rectangles of a d-dimensional query (d ≥ 3), as well as non-rectangular queries, may increase the number of nonconsecutive accesses. The same does not happen for the two-dimensional query case.

Despite the satisfactory results, it has only been tested for 2 and 3 dimensions. It would be interesting to have more

empirical results in higher dimensions. For further details, please see [Moon et al., 2001].

3.6 Fractals: An Access Method

The use of space-filling curves for indexing multidimensional spaces is not something new. Basically, there are

two ways to use them and perform queries on space-filling curves. One hypothesis is to use a variant of B-tree

combined with the curve as a secondary key retrieval. The other alternative is to use the curve directly through the

h-values. In section 3.3, we saw how the queries look like on the grid of the curve. Now, we will see how these

indexes are used in the related work. Since the scope of this work is also focused on similarity search, I will only

present works underlying this matter or showing different ways to use fractals as an access method.

3.6.1 Secondary Key Retrieval

Faloutsos and Roseman propose the use of fractals to achieve good clustering results for secondary key retrieval. The performance of an access method for secondary keys depends on how well the map preserves locality. The authors proceed with the study on the basis that the Hilbert curve shows a better locality-preserving map by avoiding long jumps between points. To make the study more interesting, two other curves are added to compare with the Hilbert curve: the Z-curve and the Gray-code curve. The analysis focuses on range and nearest neighbor queries to test the hypothesis. The main purpose of their study is to find the best locality preserving map. The best is defined based on two measures. The first focuses on the performance of the locality preserving map when using


range queries. Simplifying, it is the average number of disk accesses required for a range query, which is exclusively dependent on the map, revealing how good the clustering property is. The second measure is related to the nearest neighbor query and is called the maximum neighbor distance. Faloutsos and Roseman refer that, for a given locality preserving mapping and a given radius R, the maximum neighbor distance to a point X is the largest Manhattan (L1 metric) distance (in the k-dimensional space) of the points within distance R on the linear curve. The shorter the distance between the points, the better the locality preserving map. In order to make the experiment feasible, the maps are tested for dimensions two, three and four up to the fourth order. Generally, the Hilbert curve presents better results than the other two curves for the two measures applied. This study thus contributes to encouraging the use of fractals as secondary keys, especially the Hilbert curve. However, despite the positive contribution this study makes, it would be interesting to know the performance of fractals beyond four dimensions.

Figure 3.14: Lawder and King indexing structure based on the Hilbert curve - example of indexing a second-order curve. Adapted from [Lawder and King, 2000].

Lawder and King introduce a more developed work based on this proposal. They present a multidimensional index based on the Hilbert curve. The points are partitioned into sections according to the curve, and each section corresponds to a storage page. The h-value of the first point in a section is used as the corresponding page key, giving the pages a coherent ordering. The page keys are then indexed using a B-tree variant. The height of the tree reflects the order of the curve indexed, and the root corresponds to the first-order curve. Each node makes the correspondence between the h-value and the position of the point on the grid (see Figure 3.14). For the second-order curve level, the first ordered pairs are the parents of the lower nodes, and so on. In Figure 3.14 we present an example, adapted from [Lawder and King, 2000], that represents the index of a Hilbert second-order curve. The nodes have a top line and a bottom line. The top one corresponds to the ith vertex of the curve, and the bottom one corresponds to its position on the grid, as we explained in section 3.4. Reading the bottom line of a node from left to right, we describe the curve's path on the grid. So, the first (left) node in the second level of the hierarchy starts on


the left lower corner (00), goes to the right lower corner (01), to the right upper corner (11) and finally to the left upper corner (10). The same is valid for the rest of the nodes. Since the tree can easily become too big, the authors use a state diagram, suggested by [Faloutsos and Roseman, 1989], to represent the four possible nodes and express the tree in a compact mode. However, this only works up to 10 dimensions. Above that, the "memory requirements become prohibitive" [Lawder and King, 2000]. Some modifications are made, but the method only works up to 16 dimensions. The implementation of the tree is compared to the R-tree, revealing that indexing based on the Hilbert curve saved 75% of the time to populate the data store and reduced by 90% the time to perform range queries. The authors focused on points as datum-points (records), and a further investigation considering points as spatial data would also be interesting. For more information on Lawder and King's indexing structure, please see [Lawder and King, 2000].

3.6.2 Direct Access Index

Another interesting indexing structure proposal based on the Hilbert curve is from Hamilton and Rau-Chaplin. They present an algorithm based on [Butz, 1971] and follow the geometric approach to the curve already presented in section 3.4. The authors developed a compact Hilbert index that enables a grid to have different sizes along the dimensions. The index simply converts a certain point p to a compact h-value. The authors extract redundant bits that are not used to calculate the index. This allows a reduction in terms of space and time usage. The index is tested and compared with a regular Hilbert index and Butz's algorithm. The authors conclude that although the compact Hilbert index is more "computationally expensive to derive", it saves significantly more space and reduces the sorting time. The compact Hilbert index is tested up to 120 dimensions [Hamilton and Rau-Chaplin, 2008].

Chen and Chang also choose to use the Hilbert curve using the h-values for a direct access to the points. They

perform a set of operations similar to what we present in section 3.4 in order to access a certain point. Unlike

Hamilton and Rau-Chaplin, they also developed a query technique based on the map. There are two basic steps

to find the nearest neighbor to a query point. First, locate the query point on the curve, then locate the neighbor.

To locate a certain point on the grid, they start by defining the direction sequence of the curve (DSQ) based on

the cardinal points. The direction is fixed starting from left lower vertex (SW) and finishing at the right lower

vertex (SE). The DSQ is represented as (SW, NW, NE, SE). Each of the four cardinal points is then substituted by the h-value at that position (see Figure 3.15). The combination of the h-values with the cardinal points gives us the direction of the curve in each sub-hypercube. The h-value of the query point is converted to a base-four number [d1 · · · dm](4), and each digit indicates the curve's direction as we zoom in on the curve. In other words, the number of digits m refers to the order of the curve. Additionally, as we know, the curve suffers some transformations/rotations along the sub-hypercubes. These transformations are defined here and related to the quaternary number derived as we zoom in:

• C13 - If the number ends with zero, then switch all ones to three and vice-versa;

• C02 - If the number ends with three, then switch all zeros to two and vice-versa;

• C11/22 - If it ends with one or two, remains the same.


Figure 3.15: Chen and Chang indexing structure based on the Hilbert curve - example of locating the h-value 50, marked with X on the grid. Adapted from Chen and Chang [2011].

The authors present an example for a better understanding of the concepts. Suppose we want to find the nearest neighbor to the h-value 50 on the Hilbert third-order curve. First, we need to find the location of the query point on the grid (see Figure 3.15). They start by converting the number 50(10) to 302(4). Each of the digits in (d1 d2 d3)(4) gives us the direction along each sub-hypercube. Therefore, on the first-order curve we have:

d1 = 3, DSQ3 = (0, 1, 2, 3) (3.38)

The digit d1 = 3 points to the number 3 at SE position. This indicates that the point is in the sub-hypercube

located at SE of the grid. Since the first sub-hypercube is found, we must continue locating the point inside this

sub-hypercube. As DSQ3 ends with three, we must apply to the DSQ3 the case C02.

The second order,

d2 = 0, DSQ30 = (2, 1, 0, 3) (3.39)

The digit d2 = 0 points to the number 0 at NE position. This indicates that the point is in the sub-hypercube

located at NE of the first sub-hypercube. Since the second sub-hypercube is found, we must continue locating the

point inside this sub-hypercube. As DSQ30 ends with zero, we must apply to the DSQ30 the case C13.

The third order,

d3 = 2, DSQ302 = (2, 3, 0, 1) (3.40)

Finally, we find the point located at the SW of the second sub-hypercube, marked with an X in Figure 3.15. So, once the query point is located, we can now search for neighbors. The authors store the location and the DSQs used to find it. With this, we know that our query point 50(10) is located at the SW of a hypercube of level three. Therefore, they generate the following neighbors' h-values in order to find out if they exist. Since we know that the query point 50(10) is located at the SW of its sub-hypercube, we know that the other three positions can be generated through the quaternary number used to guide the query point location. So, the values 300(4), 301(4) and 303(4) will give the locations of the possible local neighbors. And, if this search turns out empty, we can go up one level, for example, to 31(4), and restart the process.


The authors tested this algorithm with six distinct spatial datasets and compared it with a previous version that computes the exact location through the binary bits of the coordinates. The latest version showed better results

despite only being tested in two dimensions [Chen and Chang, 2011].

3.7 Summary

This chapter presented the basic concepts relating to fractals, which are the technique I intend to use as an access method in the following chapter. Therefore, in section 3.2, we understood how fractals can have a fraction as a value for dimension instead of an integer. In section 3.3 we saw how the most popular space-filling curves are generated by computers. Section 3.4 explained how the space-filling curves can be generically calculated for any dimension. Although there are several different curves, in section 3.5 we focused on the Hilbert curve, which has a superior clustering property when compared with the others. In section 3.6 we saw that the Hilbert curve is the most promising curve to be used as a multidimensional index. In general, the related work encourages the use of the Hilbert curve as an index method, whether as a secondary key retrieval or as a direct access. Nevertheless, these works are very conservative, testing the index performance in extremely low dimensional spaces, usually up to two or three dimensions.


Chapter 4

Approximate Space-Filling Curve

In the two previous chapters, we saw the basic concepts relating to this thesis' context and proposal. This chapter presents a solution proposal to explore the behavior of fractals, especially the Hilbert space-filling curve, used as an access method for multidimensional data in low-dimensional spaces. The first section explains the motivation behind this work. Section 4.2 describes the methodology chosen to develop these experiments. Section 4.3 describes the origin of the data used during the experiments. Section 4.4 describes the first part of the proposal. Section 4.5 presents a characterization and analysis of the data before running the experiments based on a linear search. The remaining sections 4.6, 4.7 and 4.8 describe the experiments done and the results achieved. The final section sums up the content of this chapter.

4.1 Motivation

Multidimensional data is not an easy subject. In chapter 2, I introduced the basic concepts relating to multidimensional data and its indexing challenges. First, we saw that multidimensional or spatial data has a special property. Ideally, in order to reduce the search time, all the multidimensional data that are related should be linearly stored on disk or in memory. However, this cannot be done because it is impossible to guarantee that every pair of objects close in space will also be close in the linear order. It is impossible for all pairs but not for some, and the space-filling curves, as we saw in chapter 3, provide a total order among the points in space with a superior clustering property.

Another problem relating to multidimensional data refers to its indexing structures. In chapter 2, we also saw two alternatives to index low-dimensional data up to thirty dimensions - the R-tree and the Kd-tree. However, like many other indexes, they present a problem when the number of dimensions increases. When this happens, the volume of the space increases so fast that the data in the space become sparse. The curse of dimensionality appears when the spaces surpass, usually, ten dimensions, making their performance comparable to a linear search or worse [Clarke et al., 2009]. Berchtold et al. refer that they also suffer from the well-known drawbacks of multidimensional index structures, such as high costs for insert and delete operations and poor support for concurrency control and recovery. A possible solution to this problem is to use dimensionality reduction techniques like the space-filling curves [Berchtold et al., 1998, Clarke et al., 2009]. They allow mapping a d-dimensional space to a one-dimensional space in order to reduce the complexity. After this, it is possible to use a B-tree variant to


store the data and take advantage of all the properties of these structures such as fast insert, update and delete

operations, good concurrency control and recovery, easy implementation and re-usage of the B-tree implementa-

tion. Additionally, the space-filling curve can be easily implemented on top of an existing DBMS [Berchtold et al.,

1998].

Previous studies indicate that unless the multidimensional data, typically over ten dimensions, are well-clustered or correlations exist between the dimensions, similarity search like the nearest neighbor in these spaces may be meaningless [Mamoulis, 2012]. Other studies indicate as well that fractals are very useful in the domain of multidimensional indexing due to their properties of good clustering and preservation of the total order and spatial proximity [Mamoulis, 2012, Faloutsos and Roseman, 1989]. In the previous chapter 3, we saw that there is related work using fractals as an access method. However, these works are very conservative, using those access methods to index and search in extremely low dimensional spaces, usually up to three or four dimensions. It would be interesting to explore fractals as an access method up to thirty dimensions. For all the reasons mentioned above, the purpose of this research is to study

the space-filling curves. The goal is to understand how their properties can help to optimize the multidimensional

indexing structures to index low-dimensional data (two to thirty). These spaces may result from the application of

indexing structures oriented to high-dimensional space or result from the application of dimensionality reduction

techniques for subsequent indexing. In either situation, the resulting space, despite being of lower dimensions, still

requires a large computational effort to perform operations on data.

4.2 Methodology

Figure 4.1: Approximate Space-filling Curve Framework (1. Read Data; 2. Generate Space with the Uzaygezen 0.2 library; 3. ANN Query Search; 4. Performance Measures).

My proposal is based on the framework presented in Figure 4.1 and it has two main phases. The Mapping

System and the ANN Heuristic System. In this thesis, we are studying the use of the space-filling curve as an

access method in low-dimensional spaces. It can be used as dimensionality reduction technique, and so it reduces

the number of dimensions of a space by mapping the data from a d-dimensional space to a single one. In order to do

this, I create the Mapping System. It starts by extracting the multidimensional points and generating the respective

h-values with the help of the Uzaygezen 0.2 library. As a result, a space is created according to the Hilbert curve.

Once the space is created, it is necessary to test the curve behavior as an access method. Therefore, I develop at

least two heuristics to search for approximate nearest neighbors in the space. One of them uses the Hilbert curve as

a direct access method and the other, as secondary key retrieval combined with a B-tree variant. This second phase


is called Approximate Nearest Neighbor Heuristic System (ANN Heuristic System) and focuses on creating

heuristics to find the ANN of an arbitrary query point enhancing the Hilbert curve properties. Basically, I test a

heuristic on the Hilbert space and readjust it or create a new one based on its performance measures. In other

words, the ANN Heuristic System is an iterative testing system. In order to test something, it is necessary to define

evaluation metrics. These are defined in terms of runtime performance and distance relative error.

Runtime performance The first is the most basic metric that we can use. As I mentioned before, several index structures present worse performance than a simple linear search when the number of dimensions surpasses ten. For this reason, I decided to take as reference the running time of the linear search. It has a simple algorithm.

It assumes that the first multidimensional point present in the data file is the nearest neighbor to a given query

point. It stores the coordinates of the neighbor and computes the distance to the query point. Then, it visits the

next point in the data file and computes the distance to the query point. If the new point is closer than the stored

nearest neighbor, the point becomes the new nearest neighbor. The verification is repeated for each point present

in the data file. The heuristics runtimes are considered acceptable if they are lower than the linear search runtime.

As the reader could notice, I referred to a distance that must be computed, but I did not say how this distance is

calculated. Therefore, we must also define the distance metric that is used either for the linear search or for the

heuristics. A point in a two-dimensional space can have several nearest neighbors at the same time. If they are all

at a distance of one cell to a central query point, they describe a unit circle in the grid. Well, this is clearly the

L∞ norm since the unit circle is a square (remember section 2.3). Therefore, a nearest neighbor or a set of them is

defined by having the smallest distance in the space to a given query point.
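The following Java sketch summarizes this reference linear search under the L∞ metric just described; it is an illustration of the procedure, not the exact code used in the experiments.

// Linear search: scan every data point and keep the closest one to the query
// under the L-infinity (Chebyshev) distance.
static double[] linearSearch(double[][] data, double[] query) {
    double[] nearest = data[0];
    double best = chebyshev(data[0], query);
    for (int i = 1; i < data.length; i++) {
        double dist = chebyshev(data[i], query);
        if (dist < best) { best = dist; nearest = data[i]; }
    }
    return nearest;
}

// L-infinity norm: the largest coordinate-wise difference.
static double chebyshev(double[] a, double[] b) {
    double d = 0;
    for (int i = 0; i < a.length; i++) d = Math.max(d, Math.abs(a[i] - b[i]));
    return d;
}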

Distance relative error On the other hand, I intend to find an approximate nearest neighbor to a given query.

The word ”approximate” immediately leads us to think that the neighbor, not being the nearest neighbor, is not far

from it. Thus, it is necessary to calculate the approximation error. Consider a space S containing points in R^d and a query point q ∈ R^d. Given c > 0, we say that a point p∗ ∈ S is a c-approximate nearest neighbor of q if:

1. there is a p ∈ S with ||p − q||∞ = r, where r is the distance value for the nearest neighbor;

2. ||p∗ − q||∞ ≤ c · r; therefore p∗ is an approximate nearest neighbor of q and c is the error factor.

If we define c as being 1 + ε, we may see p∗ within the nearest neighbor’s relative error ε

||p∗ − q||∞ ≤ (1 + ε) · ||p− q||∞,

The relative error ε can vary from zero, where it corresponds to the nearest neighbor, to infinity. The values of the

nearest neighbors are previously defined by the linear search and used to calculate the error.
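In code, the relative error of a returned neighbor reduces to a single line (both distances measured with the L∞ norm, the exact one coming from the linear search); this is just an illustrative helper written for this text.

// epsilon = ||p* - q|| / ||p - q|| - 1; zero means the exact nearest neighbor was found.
static double relativeError(double approximateDistance, double exactDistance) {
    return approximateDistance / exactDistance - 1.0;
}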

4.3 Data Description

The recently introduced subspace tree [Wichert, 2009] promises low retrieval complexity of extremely high-

dimensional features. The search in this structure begins in the subspace with the lowest dimension. In this

subspace, the set of all possible similar objects is determined. In the next subspace, additional metric information


that corresponds to a higher dimension is used to reduce this set. This process is repeated. However, for the lowest dimensional spaces, with dimensions from 2 to 36, a linear search has to be performed. In this thesis, I research how to speed up this search using space-filling curves. I perform empirical experiments on the following low-dimensional data: SIFT, GIST and RGB images.

SIFT A scale-invariant feature transform (SIFT) generates local features of an image that are invariant to image

scaling, rotation and illumination. An image is described by several vectors of fixed dimension 128. Each

vector represents an invariant key feature descriptor of the image.

• The vector x of dimensionality 128 is split into 2 distinct windows of dimensionality 64. The mean value is

computed in each window resulting in a 2 dimensional vector.

• The vector x of dimensionality 128 is split into 4 distinct windows of dimensionality 32. The mean value is

computed in each window resulting in a 4 dimensional vector.

GIST Observers recognize a real-world scene at a single glance. The GIST is a corresponding concept of a

descriptor and is not a unique algorithm. It is a low-dimensional representation of a scene that does not require

any segmentation. It is related to the SIFT descriptor. The scaled image of a scene is partitioned into coarse cells,

and each cell is described by a common value. An image pyramid is formed with l levels. A GIST descriptor is

represented by a vector of the dimension

grid× grid× l × orientations (4.1)

For example, the image is scaled to the size 32 × 32 and segmented into a 4 × 4 grid. From the grid orientation,

histograms are extracted on three scales, l = 3. A histogram is defined with each bin covering 18 degrees, which

results in 20 bins (orientations=20). The dimension of the vector that represents the GIST descriptor is 960.

• The vector x of dimensionality 960 is split into 5 distinct windows of dimensionality 192. The mean value

is computed in each window resulting in a 5 dimensional vector.

• The vector x of dimensionality 960 is split into 10 distinct windows of dimensionality 96. The mean value

is computed in each window resulting in a 10 dimensional vector.

• The vector x of dimensionality 960 is split into 30 distinct windows of dimensionality 32. The mean value

is computed in each window resulting in a 30 dimensional vector.

RGB images The database consists of 9.877 web-crawled color images of size 128×96. The images were scaled

to the size of 128 × 96 through a bilinear method resulting in a 12288-dimensional vector space. Each color is

represented by 8 bits. Each of the three bands of size 128 × 96 is tiled with rectangular windows W of size 32 × 32.

The mean value is computed in each window resulting in a

36 = 12× 3 = 4× 3× 3

dimensional vector.
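All three datasets follow the same window-mean reduction scheme. The following Java sketch (an illustration, not the original feature-extraction code) splits a vector into a given number of distinct windows and keeps the mean of each; for example, a 128-dimensional SIFT vector with windows = 4 yields the 4-dimensional variant described above.

// Reduce a vector by splitting it into `windows` equal windows and keeping the
// mean value of each window (assumes the length is divisible by `windows`).
static double[] windowMeans(double[] vector, int windows) {
    int width = vector.length / windows;
    double[] reduced = new double[windows];
    for (int w = 0; w < windows; w++) {
        double sum = 0;
        for (int i = w * width; i < (w + 1) * width; i++) sum += vector[i];
        reduced[w] = sum / width;
    }
    return reduced;
}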


4.4 Mapping System

The subspace tree generated the low-dimensional data that is used by this framework. However, the resulting six

spaces contain duplicated data that could lead to biased conclusions. The reader must remember that, in the next

phase, we will be searching for approximate nearest neighbors. If there are many duplicated neighbors, we could

find them at a zero distance, that is, ourselves. Thus, these spaces require to be clean before being used by the

Mapping System. The resulting data of the subspace tree application are, in fact, multidimensional points. For

each of these spaces, I map each multidimensional point to its one-dimensional h-value with the auxiliary of the

Uzaygezen 0.2 library. It is based on the theory of the Compact Hilbert Indexes [Hamilton, 2006] already presented

in section 3.4 and follows this basic algorithm [Hamilton and Rau-Chaplin, 2008]:

1. Find the cell containing the point of interest;

2. Transform as necessary;

3. Update the cell index value appropriately;

4. Continue until sufficient precision has been attained.

Find the cell containing the point of interest Locate the cell that contains our point of interest by determining

whether it lies in the upper or lower half-plane regarding each dimension. Assuming we are working with a m-

order curve where m also refers to the number of bits necessary to represent a coordinate, we can use the Equation

3.26 defined in section 3.4.

lm−1 = [bit(pn−1,m− 1) . . . bit(p0,m− 1)] (3.26 revisited)

Transform as necessary We zoom in the cell containing the point and rotate and reflect the space such that the

Gray code ordering corresponds to the binary reflected Gray code. The goal is to transform the curve into the

canonical orientation.

lm−1 = Te,d(lm−1) (3.27 revisited)

Update the index value appropriately The orientation of the curve is given only by an entry e and the exit f

through the grid. Once we know the orientation at the current resolution, we can determine in which order the cells

are visited. So, h gives the index of a cell at the current resolution.

h = [wm−1 · · ·w0] (4.2)

Where w is index of the associated cell and defined in section 3.4

wm−1 = gc−1(lm−1) (3.28 revisited)


We continue until sufficient precision has been attained. A full example can be found in section 3.4. Each multi-

dimensional point is mapped to the Hilbert curve allowing that the ANN Heuristic System can perform queries on

this space.

4.5 Dataset Analysis based on a Linear Search

The code is developed in JAVA and tested in a computer with the following properties:

• Windows 8 Professional 64 Bits

• Processor Pentium(R) Dual-Core CPU T4300 @ 2.10GHz

• 4 GB RAM

In order to run the experiments, we must first remove the duplicated data, as explained in section 4.4. The datasets are organized in files per dimension. The dimensions tested are 2, 4, 5, 10, 30 and 36. The files of dimensions 2, 4 and 36 have 10.000 points and the remaining ones have 100.000. The files are analyzed and the duplicated data are removed. After this cleaning process, the files have the number of points presented in Table 4.1, in the column "Data Volume". It is already possible to know the order of the Hilbert curve through the number of bits needed to represent the highest number present in a file. The order of the curve is therefore given, in this same table, in the column labeled "Resolution".

Spaces   Original Volume   Data Volume   Resolution
S2       10 000            264           6
S4       10 000            9 229         7
S5       100 000           98 917        16
S10      100 000           98 917        16
S30      100 000           98 917        16
S36      9 877             9 831         8

Table 4.1: Datasets description

Before running the heuristics experiments, we must define the set of query points for which we intend to find their neighbors. We also need to define the distance values of their nearest neighbors. The query group varies from point P0 to P9; these are the first 10 points extracted from each file after the cleaning process. The linear search, described in section 4.2, defines the distance values of the nearest neighbors for the query group, and these are presented in Table 4.2. In this table, we can observe that the spaces of dimensions 2 and 4 are densely populated since all the neighbors are immediate cell neighbors, and the mean distance is 1. The spaces regarding dimensions 5, 10 and 30 are quite sparse since the mean distances are 332.3, 629.2 and 943.4 respectively. Finally, for the space 36 we can say that it is in between the first two and the rest of the spaces regarding the density of the data. It presents a mean distance of 34.7. In terms of analysis, I grouped the spaces into two sets, {2, 4, 36} and {5, 10, 30}, because the spaces within each set are more similar to each other, as shown above.

Query Point   S2   S4   S5    S10     S30     S36
P0            1    1    341   323     669     35
P1            1    1    325   759     1 197   37
P2            1    1    421   239     691     14
P3            1    1    466   1 516   961     41
P4            1    1    249   1 411   853     7
P5            1    1    169   117     1 599   36
P6            1    1    99    563     734     90
P7            1    1    760   538     831     39
P8            1    1    278   465     545     14
P9            1    1    215   361     1 354   34

Table 4.2: Neighbor distance to query point per dimension

Figure 4.2 presents the nearest neighbor distance to each query point over all spaces. It is interesting to observe how the neighbor distances are distributed over the spaces and through the query points. If we put our query point at the center of the web graphic, we can see how far the nearest neighbor is from it. We can see it in every

dimension and for each one of the query points. Once again, it is clear that the spaces 5, 10 and 30 are sparser than

the rest. The spaces 2 and 4 don’t even appear in the graphic since the distance for the nearest neighbor is 1 for

each query point. On the query points P3 and P4, the space 10 even surpasses the space 30, which has the highest mean distance over all spaces.

Figure 4.2: The nearest neighbor distribution per dimension.

In terms of runtime performance, the linear search results for the first group are shown in Figure 4.3. There are some aspects to consider that influence this performance (see Table 4.1). The first relates to the resolution of each space, which is 6, 7 and 8 for the spaces 2, 4 and 36 respectively. Then, the amount of points to compute is significantly low for dimension 2 (264) and almost equivalent for the remaining spaces (9.229 and 9.831 for dimensions 4 and 36 respectively). Therefore, it is faster to compute the space 2 since it has the lowest resolution, the lowest amount of data and the lowest number of dimensions to compute. The same analogy is valid for the spaces 4 and 36.

Figure 4.3: Linear search runtime performance for dimensions 2, 4 and 36 (0.047, 0.312 and 0.686 seconds, respectively).

The linear search results for the second group are shown in Figure 4.4. In this scenario, the three spaces have the same amount of data and the same space resolution (see Table 4.1). Therefore, the runtime performance is only related to the number of dimensions of each space. The time required to compute the space 10 is almost twice

the time needed to compute the space 5. The same analogy can be done for the space 30, which takes almost three times longer than the space 10 and about four times longer than the space 5.

Figure 4.4: Linear search runtime performance for dimensions 5, 10 and 30 (2.761, 4.587 and 11.091 seconds, respectively).


4.6 Experiment 1: Hypercube Zoom Out

Since the reference values are established, we can now focus on developing the heuristics. As mentioned before,

the ANN Heuristic System works on the space generated by the Mapping System. It is an empirical system that

evolves based on the experimental performances. The idea is to find the ANNs to the query group defined in the

previous section, considering the best properties of the Hilbert curve. After creating the heuristic, I test it on the

Hilbert space and readjust it or create a new heuristic based on the performance measures. The first heuristic idea

has the name Hypercube Zoom Out (HZO) and it tries to explore the following properties of the Hilbert curve:

• The space-filling curve divides the space into four quadrants. Each quadrant is a square that has an entry point and an exit point (remember section 3.3);

• The clustering property of the Hilbert curve is better when the area of analysis tends to a square (remember section 3.5).

Suggestion In order to perform an approximate nearest neighbor search using the Hilbert curve, the search area should be a square, to minimize the number of clusters. To make use of the squares that the curve forms, it is necessary to identify the entry and exit point of each square, as they delimit the search area. The number of points inside a square can vary from one to infinity, depending on the space resolution. Thus, the search must begin at a higher resolution and zoom out until it reaches the lowest resolution; however, it stops as soon as it reaches the first neighbor point inside a square. Since the curve preserves the spatial proximity between points, it is expected that an approximate neighbor found inside a higher-resolution square will, in fact, be closer to the query point, therefore resulting in an efficient computational effort. The reader should notice that this solution generates data, like the proposal of Chen and Chang introduced in section 3.6. Thus, like their proposal, it uses the Hilbert curve as a direct access method.

[Diagram omitted: two 4×4 Hilbert grids with cells numbered 0–15; the left grid marks the search at the highest-resolution hypercube, the right grid the search at the lowest-resolution hypercube.]

Figure 4.5: Hypercube Zoom Out algorithm scheme in a two-dimensional space with resolution 2.

Algorithm Figure 4.5 illustrates how the HZO searches for a neighbor in a two-dimensional space. The left space corresponds to the initial search at the highest-resolution hypercube; the right space corresponds to the search at the lowest-resolution hypercube. The cells {0, 4, 8, 12} are entry points of their respective hypercubes. Similarly, the cells {3, 7, 11, 15} are exit points of the same hypercubes. The cells 0 and 15 are also the entry and exit points, respectively, of the lower-resolution hypercube. A search always flows from an entry point to an exit point of the highest-resolution hypercube. With these basic notions introduced, suppose we are searching for an approximate nearest neighbor to the query point in the cell 10; we stop searching once we find the first point. The HZO starts looking for neighbors at the entry point of the parent hypercube of this cell, which is the cell number 8. The search continues until it reaches a neighbor or the exit point of this hypercube, the cell 11. If it does not find any neighbor, it starts again at the entry point of the next parent hypercube, which is the cell 0. The search continues until it finds a neighbor or reaches the already visited child hypercube {8, 9, 10, 11}. At this point, the search jumps from the cell 7 to the cell 12 and continues until it reaches the exit point, since the hypercube between these cells has already been visited.
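To make the walk concrete, the following is a minimal sketch of the HZO idea in Java for the two-dimensional case. It assumes the dataset is already given as a set of h-values (the output of the Mapping System); the cell population in the main method is hypothetical, and the real heuristic also works for any number of dimensions and handles the visited child hypercubes more carefully than this simplified range bookkeeping.

```java
import java.util.Set;
import java.util.TreeSet;

// Sketch of the Hypercube Zoom Out search over h-values in a 2-D Hilbert space.
public class HypercubeZoomOut {

    /** Returns the h-value of an approximate neighbor, or -1 if no other point exists. */
    static long search(Set<Long> data, long queryH, int order) {
        final int dims = 2;                       // two-dimensional example
        final long cells = 1L << dims;            // 4 cells per hypercube at each level
        long size = cells;
        long visitedLo = queryH, visitedHi = queryH;      // h-range already examined

        while (size <= (1L << (dims * order))) {          // zoom out level by level
            long entry = (queryH / size) * size;          // entry point of the enclosing hypercube
            long exit = entry + size - 1;                 // exit point of the same hypercube
            for (long h = entry; h <= exit; h++) {
                if (h >= visitedLo && h <= visitedHi) continue;  // jump over the child already scanned
                if (data.contains(h)) return h;                  // first occupied cell found
            }
            visitedLo = entry;
            visitedHi = exit;
            size *= cells;                                // parent hypercube at the next lower resolution
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical population, similar in spirit to the left scenario of Figure 4.6.
        Set<Long> data = new TreeSet<>(Set.of(0L, 4L, 9L, 13L));
        System.out.println(search(data, 10L, 2));  // prints 9
    }
}
```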

[Diagram omitted: three 4×4 Hilbert grids with different population densities, all with the cell 10 as the query point.]

Figure 4.6: Hypercube Zoom Out - Three scenarios in two dimensions with the cell 10 as the query point.

Now, let us observe three spaces (Figure 4.6, from left to right), all two-dimensional, and apply the HZO heuristic. In all three, we are looking for an approximate nearest neighbor to the cell 10; the spaces vary in terms of population density. In the left space, the HZO returns the cell 9 as the approximate nearest neighbor, since it starts at the entry point 8 and continues until it finds the cell 9. In the middle space, the neighbor is the cell 4: the search starts at the cell 8 and continues until it reaches the exit point; since it does not find any neighbor, it moves to the parent hypercube, which is the entire space, thus restarting at the entry point 0, and finishes when it reaches the cell 4. In the right space, it does not find any neighbor at the highest resolution either, so it restarts from the entry point 0 of the parent hypercube; the search continues until it reaches the cell 7, then jumps to the cell 12, and the resulting neighbor is the cell 13.

Spaces    Linear Search    HZO
S2        0,047            0,047
S4        0,312            0,289
S5        2,761            n/d
S10       4,587            n/d
S30       11,091           n/d
S36       0,686            n/d

Table 4.3: HZO runtime results in seconds.

Evaluation This heuristic is tested and evaluated on the space generated by the Mapping System, with the data presented in section 4.3 and analyzed in section 4.5. The HZO did not present any result in a time considered acceptable (less than a day) for spaces with more than 4 dimensions. It turned out to produce huge amounts of data, since it generates the h-value of every potential neighbor and verifies whether it exists in the dataset. In the worst-case scenario, the space 30 at curve order 16, it can generate the number of cells existing in the maximum-order curve, which is an astonishing 3,121749 × 10^144 cells. The heuristic requires high computational resources, namely space, and due to this constraint those results could not be computed (n/d). Therefore, only the results for the remaining spaces are presented. Table 4.3 shows the HZO runtime performance compared with the linear search. For the space with the lowest number of dimensions, the runtime is the same as that of the linear search. For the space 4, the HZO is 23 milliseconds faster.

In terms of the distance metric, the results for the approximate nearest neighbors produced by the HZO are presented in Table 4.4. For the space 2, the HZO presents a mean relative error of 0, which means all returned neighbors are exact. For the space 4, the HZO returns 70% exact nearest neighbors and has an average relative error of 0,7.

               HZO Results        Relative Error
Query Point    S2      S4         S2      S4
P0              1       5          0       4
P1              1       1          0       0
P2              1       1          0       0
P3              1       1          0       0
P4              1       1          0       0
P5              1       2          0       1
P6              1       3          0       2
P7              1       1          0       0
P8              1       1          0       0
P9              1       1          0       0
Mean Value      1      1,7         0      0,7

Table 4.4: HZO approximate nearest neighbors results.

In total, twenty queries are performed in the HZO heuristic evaluation for the two spaces observed. In this broader overview, and looking at Figure 4.7, we can see that 85% of the queries return the nearest neighbors. The rest of the returned results are approximate nearest neighbors, where 5% present a relative error ε = 1. This means that the neighbors are 2 cells away from the nearest neighbor, since the error factor is c = 1 + ε. The remaining 10% of the queries return approximate neighbors with ε ∈ [2, 4]; the reasoning with the error factor is the same and, on average, for these c = 4. Globally, the average relative error for the HZO is 0,35, i.e., on average 0,35 cells away from the nearest neighbor.
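For reference, the way the values in Tables 4.2 and 4.4 combine into ε matches the usual definition of the relative error (my reading of the metric introduced in section 4.2, restated here for convenience):

\[ \varepsilon = \frac{d_{\mathrm{ANN}} - d_{\mathrm{NN}}}{d_{\mathrm{NN}}}, \qquad c = 1 + \varepsilon = \frac{d_{\mathrm{ANN}}}{d_{\mathrm{NN}}} \]

For example, for the query point P0 in the space 4, the exact nearest neighbor is at distance 1 (Table 4.2) and the HZO returns a neighbor at distance 5 (Table 4.4), hence ε = (5 − 1)/1 = 4 and c = 5.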

Conclusion In general, we can conclude that the HZO does not work for dimensions higher than 4. When computing this heuristic, it stands out that the Hilbert curve generates much more data than the dataset used to test the heuristic. As the spaces 2 and 4 are densely populated, the data generated by the space-filling curve does not have much impact, since the heuristic generally only moves from one cell to another on the curve to find a neighbor. The problems begin when the number of dimensions increases and the points get sparser. In the space 5, for example, the points are on average 332,3 cells away from the query point, and the smallest hypercube (highest resolution) has 32 cells. At resolution 2, the hypercube jumps to 1.024 cells. If, by chance, the approximate nearest neighbor is not in this hypercube, the jump will be to 32.768 cells, and so on. The number of cells to visit grows exponentially, which means that the data generated by the curve follows the same trend.

[Bar chart omitted: percentage of queries (0% to 90%) per relative error bucket ε ∈ {0, 1, [2-4]}.]

Figure 4.7: HZO relative error distribution over all queries performed.

Overall, we can say that the results for the lower dimensions are acceptable, since the HZO generally presents a good percentage of nearest neighbors and its runtime performance is slightly better than that of the linear search.
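The growth mentioned above can be written down explicitly (a quick counting check using only the figures already quoted in this section): in a d-dimensional space, the hypercube reached after zooming out r times contains

\[ 2^{d \cdot r} \text{ cells}, \qquad \text{e.g. } 2^{5 \cdot 1} = 32, \quad 2^{5 \cdot 2} = 1\,024, \quad 2^{5 \cdot 3} = 32\,768, \]

and, in the worst case of the space 30 at curve order 16, \( 2^{30 \cdot 16} = 2^{480} \approx 3{,}12 \times 10^{144} \) cells, which is the number quoted in the evaluation.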

4.7 Experiment 2: Enzo Full Space

This heuristic proposal tries to overcome the fact that the HZO does not run in acceptable time in spaces with more than 4 dimensions and, at the same time, to reduce the overall ε.

Suggestion The Enzo Full Space (FS) abandons the strategy of the hypercube search area and focuses on the curve path itself. The idea is to launch two search branches that start at the query point location and walk along the curve towards the entry and exit points of the lowest-resolution hypercube, until one of them, or both, finds a neighbor. In theory, the Enzo FS will be faster and will present a lower ε, as we will see below. Like the HZO, the Enzo FS uses the Hilbert curve as a direct access method.

[Diagram omitted: a 4×4 Hilbert grid with the cell 10 as the query point and two arrows, the entry branch and the exit branch, walking along the curve in opposite directions.]

Figure 4.8: Enzo FS algorithm scheme.

Spaces    Linear Search    HZO      Enzo FS
S2        0,047            0,047    0,031
S4        0,312            0,289    0,115
S5        2,761            n/d      n/d
S10       4,587            n/d      n/d
S30       11,091           n/d      n/d
S36       0,686            n/d      n/d

Table 4.5: Enzo FS runtime results in seconds.

Algorithm Figure 4.8 illustrates how the Enzo FS searches for neighbors. The cell 10 represents the query point. To simplify the explanation, I call the branch that searches towards the entry point the entry branch and the other, by analogy, the exit branch. The entry branch starts from the value immediately next to the query point on the curve and walks towards the entry point, verifying at each step whether that h-value exists in the dataset. If it does exist, a neighbor is found, and the branch tries to store the point as the approximate nearest neighbor; it only tries because the other branch may be doing the same. If there is a tie, the decision is made by storing the point closest to the query point. Once the approximate nearest neighbor is set, the branch that wrote it terminates the concurrent branch. There are factors that influence the performance of this heuristic, and they are related to the density of the space and the query point location. In the worst-case scenario, the query point is the entry or the exit point of the lowest-resolution hypercube, and the search performance can be worse than that of the linear search.
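A minimal sketch of the branch walk in Java follows. It assumes, as before, that the dataset is a set of h-values; the real heuristic runs the two branches concurrently and breaks ties by the actual distance to the query point, while this sequential version simply alternates one step per branch and prefers the entry branch on a tie. The population used in the main method is hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

// Sketch of the Enzo FS walk: two branches leave the query h-value and move in
// opposite directions along the curve until one of them hits an occupied cell.
public class EnzoFullSpace {

    static long search(Set<Long> data, long queryH, long maxH) {
        long down = queryH - 1;  // entry branch: walks towards h = 0
        long up = queryH + 1;    // exit branch: walks towards h = maxH
        while (down >= 0 || up <= maxH) {
            if (down >= 0 && data.contains(down)) return down;  // entry branch wins ties in this sketch
            if (up <= maxH && data.contains(up)) return up;
            down--;
            up++;
        }
        return -1;  // the space holds no other point
    }

    public static void main(String[] args) {
        long maxH = 15;  // two-dimensional space with resolution 2: h-values 0..15
        Set<Long> data = new TreeSet<>(Set.of(3L, 7L, 13L));  // hypothetical population
        System.out.println(search(data, 10L, maxH));          // prints 7
    }
}
```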

               Enzo FS Results    Relative Error
Query Point    S2      S4         S2      S4
P0              1       4          0       3
P1              1       1          0       0
P2              1       1          0       0
P3              1       1          0       0
P4              1       1          0       0
P5              1       2          0       1
P6              1       2          0       1
P7              1       1          0       0
P8              1       1          0       0
P9              1       1          0       0
Mean Value      1      1,5         0      0,5

Table 4.6: Enzo FS approximate nearest neighbors results.

Evaluation This heuristic is tested and evaluated in the same context as the previous one and, like the HZO, it does not present results in a time considered acceptable (less than a day) for spaces with more than 4 dimensions, for the same reasons. Therefore, the results for the remaining spaces are now presented. In terms of runtime performance (see Table 4.5), the Enzo FS in the space 2 is 16 ms faster than the linear search and the HZO. In the space 4, the difference is bigger, reaching 197 ms compared with the linear search.

In terms of the distance metric, the results for the approximate nearest neighbors produced by the Enzo FS are presented in Table 4.6. For the space 2, the Enzo FS returns all the nearest neighbors. For the space 4, the Enzo FS presents 70% exact nearest neighbors and an average relative error ε of 0,5. Analyzing this in terms of the error factor c = 1 + ε, we can say that, on average, an approximate nearest neighbor produced by the Enzo FS is 1,5 cells away from the nearest neighbor.

For the two spaces observed, twenty queries are performed in total. In this broader overview, observing Figure 4.9, we can see that 85% of the queries return the nearest neighbors. Of the remaining results, 10% present a relative error ε = 1, which means that the neighbor is 2 cells away from the nearest neighbor. The remaining 5% of the queries return neighbors with ε = 3; the reasoning with the error factor is the same, with c = 4. Globally, the average relative error for the Enzo FS is 0,25, i.e., on average 0,25 cells away from the nearest neighbor.

[Bar chart omitted: percentage of queries (0% to 90%) per relative error value ε ∈ {0, 1, 3}.]

Figure 4.9: Enzo FS relative error distribution over all queries performed.

Conclusion The problems encountered with the HZO are basically the same for the Enzo FS:

• The Hilbert curve generates much more data than the dataset used to test the heuristic;

• Both heuristics work in the lower-dimensional spaces apparently because these are densely populated and, in general, the heuristics only generate a few cells (up to 4) until they find an approximate nearest neighbor.

On the other hand, the Enzo FS presents a 28,6% lower average ε (from 0,35 to 0,25) when compared with the HZO and reduces the maximum value of ε by 25% (from 4 to 3). In terms of runtime, overall the Enzo FS is better than the HZO and, therefore, better than the linear search.

4.8 Experiment 3: Enzo Reduced Space

The Enzo Reduced Space (RS) tries to overcome the fact that neither the HZO nor the Enzo FS worked in spaces with more than 4 dimensions. The goal of this heuristic is to keep reducing ε as well as the runtime.

Suggestion Since the main problem of both previous heuristics is related to the data generated by the Hilbert curve, the idea of the Enzo Reduced Space (RS) is simply to order the dataset according to the curve, without generating any additional data. On top of the curve, it uses a B-tree variant in Java, therefore using the curve as a secondary key retrieval method.

[Diagram omitted: a 4×4 Hilbert grid with the cell 10 as the query point, the populated cells laid out in curve order as a one-dimensional list, and the entry and exit branches walking over that list.]

Figure 4.10: Enzo RS algorithm scheme.

Algorithm Figure 4.10 illustrates how this algorithm works through an example. The cell 10 is the query point and the remaining colored cells are the population of the space. The idea is basically the same as the Enzo FS but, in this case, the branches do not walk along the curve itself; they walk along the existing data ordered by the curve. We insert the query point into the ordered dataset, locate the inserted point with the help of the B-tree, and then locate its neighbors. Since the data is now one-dimensional, we pick the immediate left and right neighbors of the query point. The two points are then compared in terms of their distance to the query point, and the winner is the approximate nearest neighbor.
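The sketch below illustrates the idea in Java. A java.util.TreeMap stands in for the B-tree variant (an assumption made for brevity; the thesis uses a Java B-tree implementation), the keys are the h-values produced by the Mapping System, and the stored values are the original multidimensional points. The coordinates and h-values in the main method are illustrative only.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of the Enzo RS lookup: the dataset is kept ordered by h-value, the query's
// immediate curve neighbors are fetched, and they are compared by real distance.
public class EnzoReducedSpace {

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Returns the approximate nearest neighbor of the query point, or null if the space is empty. */
    static double[] search(TreeMap<Long, double[]> ordered, long queryH, double[] query) {
        Map.Entry<Long, double[]> left = ordered.lowerEntry(queryH);   // entry branch: previous h-value
        Map.Entry<Long, double[]> right = ordered.higherEntry(queryH); // exit branch: next h-value
        if (left == null) return right == null ? null : right.getValue();
        if (right == null) return left.getValue();
        // The curve neighbor that is actually closer to the query point wins.
        return distance(left.getValue(), query) <= distance(right.getValue(), query)
                ? left.getValue() : right.getValue();
    }

    public static void main(String[] args) {
        TreeMap<Long, double[]> ordered = new TreeMap<>();
        ordered.put(7L, new double[]{3, 1});   // hypothetical point with h-value 7
        ordered.put(13L, new double[]{1, 3});  // hypothetical point with h-value 13
        double[] ann = search(ordered, 10L, new double[]{2, 3});  // query with h-value 10
        System.out.println(java.util.Arrays.toString(ann));       // prints [1.0, 3.0]
    }
}
```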

Spaces    Linear Search    HZO      Enzo FS    Enzo RS
S2        0,047            0,047    0,031      0,078
S4        0,312            0,289    0,115      0,67
S5        2,761            n/d      n/d        2,669
S10       4,587            n/d      n/d        3,853
S30       11,091           n/d      n/d        7,256
S36       0,686            n/d      n/d        0,608

Table 4.7: Enzo RS runtime results in seconds.

Evaluation This heuristic is tested and evaluated in the same context as the previous heuristics, and the results are now presented. Concerning the runtime performance (see Table 4.7), the Enzo RS generally has worse performance in the spaces with lower dimensions {2, 4} and better performance in the spaces with higher dimensions {5, 10, 30, 36}, achieving its best result at dimension 30, where it is 3,835 seconds faster than the linear search. Figures 4.11 and 4.12 present the runtime graphics for the space groups {2, 4, 36} and {5, 10, 30}, respectively. The spaces in each graphic are sorted in ascending order of resolution, amount of data and, finally, dimensions. Figure 4.11 presents the analysis for the spaces with less data and a maximum resolution of 8 (space 36). Nevertheless, the Enzo RS is faster at computing the space 36 than the space 4; the main difference between these spaces is the average neighbor distance, which is 1 for the spaces 2 and 4 and 34,7 for the space 36.

[Bar chart omitted: execution time in seconds for dimensions 2, 4 and 36, comparing the linear search with the Enzo RS.]

Figure 4.11: Enzo RS runtime performance compared with the linear search - first group of analysis.

Figure 4.12 presents the analysis for the spaces with a higher amount of data (almost 100.000 points each) but equal resolution. They all have a resolution of 16 and only differ in the number of dimensions and in the average neighbor distance, which is 332,3, 629,2 and 943,4 for the spaces 5, 10 and 30, respectively. Since this group is more homogeneous than the previous one, it seems that the higher the average neighbor distance and the number of dimensions, the faster the Enzo RS becomes relative to the linear search.

[Bar chart omitted: execution time in seconds for dimensions 5, 10 and 30, comparing the linear search with the Enzo RS.]

Figure 4.12: Enzo RS runtime performance compared with the linear search - second group of analysis.

In terms of the distance metric, the results for the approximate nearest neighbors produced by this heuristic are presented in Table A.2, which shows the result for each of the 10 query points in each of the 6 spaces tested. Table 4.8 shows the relative error ε, rounded to two decimal places, for each of the 10 query points in each space. For the space 2, the Enzo RS returns all the nearest neighbors, since ε = 0 for every query. For the spaces 4, 5, 10, 30 and 36, the percentage of exact nearest neighbors drops to 80%, 10%, 20%, 20% and 0%, respectively. Generally speaking, the mean relative error ε seems to grow as the number of dimensions increases.

For the 6 spaces observed, sixty queries are performed in total.

Query Point    S2     S4     S5     S10    S30    S36
P0              0      3     1,46   0,08   1,74   1,49
P1              0      0     0,89   0,61   0,29   0,89
P2              0      0     0,19   1,54   1,59   0,71
P3              0      0     0,24   0      1,78   0,37
P4              0      0     0,23   0      0      1
P5              0      1     0,88   0,83   0      0,92
P6              0      0     2,15   0,73   0,78   0,88
P7              0      0     0      1,02   1,26   1,51
P8              0      0     0,22   0,59   0,81   0,43
P9              0      0     1,39   1,49   0,49   0,91
Mean Value      0      0,40  0,77   0,69   0,87   0,91

Table 4.8: Enzo RS relative error ε from space 2 to 36.

In this broader overview, and looking at Figure 4.13, we can see that 38% of the queries return the nearest neighbors. Among the remaining approximate results, 40% have ε ∈ ]0, 1], 18% fall in ]1, 2] and just 3% in ]2, 3]. Globally, over all sixty queries, the Enzo RS presents an average relative error of 0,61; restricted to the spaces 2 and 4, for comparison with the previous heuristics, it is 0,2.

[Bar chart omitted: percentage of queries (0% to 45%) per relative error bucket ε ∈ {0, ]0-1], ]1-2], ]2-3]}.]

Figure 4.13: Enzo RS relative error distribution over all queries performed.

Conclusion The problem of the huge amounts of data generated by the Hilbert curve is overcome, and this heuristic presents results in all the spaces tested. It even shows better runtime performance than the linear search in dimensions above 4. Its good performance seems to be related to the low density of the space and the high number of dimensions. On the other hand, the Enzo RS presents worse runtime results in the lower-dimensional spaces when compared with the HZO or the Enzo FS. In terms of distance metrics, and looking only at the spaces 2 and 4, the Enzo RS presents the lowest relative error ε when compared with the previous heuristics.

[Bar chart omitted: execution time in seconds for the spaces S2 and S4, comparing the linear search, the Enzo FS, the HZO and the Enzo RS.]

Figure 4.14: Runtime comparison between the heuristics.

4.9 Summary

This chapter described the solution proposed to explore the behavior of fractals, especially the Hilbert space-filling curve, used as an access method for multidimensional data in low-dimensional spaces. Section 4.1 described the motivation for this thesis. Section 4.2 presented my Approximate Space-Filling Curve Framework as the plan to develop, test and evaluate my proposal. Section 4.3 explained that the data used to test my proposal result from applying the Subspace Tree to high-dimensional data. Section 4.4 described how the mapping of the data is done using the Uzagezen 0.2 library. Section 4.5 presented the results of performing similarity search using a linear search. The following sections 4.6, 4.7 and 4.8 presented the heuristic proposals, their tests and their evaluation according to the metrics presented in section 4.2.

[Bar chart omitted: average relative error for the spaces S2 and S4, comparing the Enzo FS, the HZO and the Enzo RS.]

Figure 4.15: Average relative error comparison between the heuristics.

As said before, the HZO (section 4.6) and the Enzo FS (section 4.7) did not compute results in spaces with more than 4 dimensions because they require high computational resources, namely space. For this reason, Figures 4.14 and 4.15 present the overall comparison only in the spaces for which these heuristics produced results. Concerning the runtime performance, Figure 4.14 shows that the Enzo FS presented the lowest runtime when searching for an approximate nearest neighbor. Figure 4.15 shows that the Enzo FS (section 4.7), despite having the lowest runtime, did not present the lowest mean relative error when searching for the approximate nearest neighbor, being surpassed by the Enzo RS.

[Chart omitted: execution time in seconds for the spaces S2, S4, S36, S5, S10 and S30, comparing the Enzo RS with the linear search.]

Figure 4.16: Enzo RS runtime performance along the spaces.

[Chart omitted: Enzo RS average relative error, from 0 to 1, for the spaces S2, S4, S5, S10, S30 and S36.]

Figure 4.17: Enzo RS average relative error along the spaces.

Overall, in these spaces, the Enzo FS presents the best balance between the mean relative error and the runtime performance. On the other hand, the Enzo RS is the only heuristic that worked beyond 4 dimensions. Its global runtime performance can be seen in Figure 4.16; the spaces in the graphic are sorted in ascending order of resolution, amount of data to compute, dimensions and average neighbor distance. It is possible to observe that the Enzo RS is faster than the linear search from the space 36 onwards, although it only starts to stand out from the space 10. The lower the density of the space, the greater the distance between the Enzo RS curve and the linear search curve. The Enzo RS average relative error seems to grow as the number of dimensions increases (Figure 4.17).

Chapter 5

Conclusions

In the context of my MSc thesis, I proposed to study the problem of indexing low-dimensional data, up to thirty dimensions, using the Hilbert space-filling curve as a dimensionality reduction technique that is known to have superior clustering properties. To address the problem, I created a framework named Approximate Space-Filling Curve that structured all the steps of the proposal, and a few experiments were performed with the resulting prototype. This framework had two main phases: the Mapping System and the Approximate Nearest Neighbor Heuristic System (ANN Heuristic System). The first system converted each one of the multidimensional points in the dataset to the Hilbert curve, generating its h-value. In the second system, I developed a heuristic, tested it on the curve and re-adapted it based on its performance. As a result of this process, three heuristics were created: the Hypercube Zoom Out (HZO), the Enzo Full Space (FS) and, finally, the Enzo Reduced Space (RS). The HZO searches for the approximate neighbor starting at the highest-resolution hypercube, excluding the query point itself: it starts at the entry point of the hypercube and ends the first iteration at the exit point of the same hypercube; if it does not find any neighbor, it zooms out to the hypercube of the next resolution, always starting at an entry point and ending at an exit point; once it finds the first neighbor, the heuristic ends. The Enzo FS searches for the approximate neighbor by launching two search branches from the query point towards the entry and exit points of the lowest-resolution hypercube. If both branches find neighbors, they are compared and the one closest to the query point is the approximate nearest neighbor.

Once tested, both the HZO and the Enzo FS did not present any results in a time considered acceptable (less than a day) in spaces with more than 4 dimensions. They turned out to produce huge amounts of data, since they generate the h-value of every potential neighbor and verify whether it exists in the dataset. In the worst-case scenario, the space 30 at curve order 16, they can generate the number of cells existing in the maximum-order curve, which is an astonishing 3,121749 × 10^144 cells. The heuristics required high computational resources, namely space, and due to this constraint those results could not be computed.

Nevertheless, in the spaces 2 and 4, regarding the runtime performance, the HZO revealed itself to be slightly better than a linear search and the Enzo FS slightly better than the HZO, although we must take into account that the results are approximate. In terms of the overall approximate error distance, 85% of the results of both heuristics are the nearest neighbor to the given query point. Globally, the average relative error is 0,35 cells away from the nearest neighbor for the HZO and 0,25 for the Enzo FS. Unlike the previous two, which use the Hilbert curve as a direct


access method, the Enzo RS uses the Hilbert curve as a secondary key retrieval method combined with a B-tree variant. Instead of generating neighbors and verifying whether they exist in the data, it uses the data ordered according to the curve. Using the B-tree variant and the h-value of the query point, the query point's immediate neighbors on the curve are determined. Only later, when writing the report and after testing the heuristics, did I notice that Faloutsos and Roseman had already suggested this heuristic in [Faloutsos and Roseman, 1989]. Their algorithm was therefore presented in Section 3.6 as related work. Nevertheless, they only tested the heuristic up to 4 dimensions.

The Enzo RS, unlike the previous two, presented results for all the dimensions tested (2, 4, 5, 10, 30 and 36). Concerning the runtime performance, it revealed itself to be worse than the linear search in spaces of up to 4 dimensions. Concerning the error of the approximate neighbor distance in these spaces, the Enzo RS presented an average relative error of 0,2. Therefore, in these spaces this heuristic presented the lowest average approximate error but the worst runtime. Beyond these spaces, the Enzo RS ran in less time than the linear search in all dimensions tested, never forgetting that we are talking about approximate neighbors, whereas the linear search always provides the nearest one. The average relative error of the Enzo RS seems to grow along the spaces. Globally, over all the query points tested, the Enzo RS presented an average relative error of 0,61 cells away from the nearest neighbor. Although this relative error is higher than that of the other heuristics, the reader should not forget that the Enzo RS was tested on forty more queries than the previous ones, so the overall relative errors cannot be compared directly.

Considering the results obtained from the experiments, we can conclude that a space-filling curve can operate in a low-dimensional space (up to thirty dimensions), in terms of indexing and searching. However, some observations can be made:

• The Hilbert curve requires large space resources when generating the cell keys needed to apply similarity search techniques. Therefore, the curve seems to be more suited to highly dense spaces when used as a direct access method;

• The Hilbert curve seems to be more suited to sparse spaces when applied to the original dataset combined with a B-tree variant.

5.1 Contributions

The main contributions of this work are summarized below:

Hilbert curve tested as a direct access method up to 36 dimensions. I tested the performance of the Hilbert space-filling curve as a direct access method with two heuristics in spaces of up to 36 dimensions. The results indicated that the curve generates too much data, making it difficult to use beyond 4 dimensions.

Hilbert curve tested as a secondary key retrieval method up to 36 dimensions. I tested the performance of the same curve as a secondary key combined with a B-tree variant in spaces of up to 36 dimensions. The results showed that this combination worked in all six spaces tested, including the space with 36 dimensions. Compared with a linear search, it is generally faster as the number of dimensions and the order of the curve increase, although it provides approximate results.


5.2 Future Work

Despite the interesting results, there are many aspects in which this work could be improved. I suggest the following:

Increase the space Both the HZO and the Enzo FS did not compute their results in spaces with more than four dimensions due to space restrictions. The experiments could be repeated on another computer with more space resources, in order to try to compute the experiments that did not provide results because of this problem.

Extend the group of query points The group of query points could be increased in order to test, for example, 100 points instead of the 10 tested in this work. A larger group could provide a more solid set of results.

Extend the evaluation metrics It would also be interesting to evaluate the heuristics considering other evaluation metrics, such as the number of accesses to disk or memory. It would allow analyzing the clustering property empirically beyond the three dimensions tested in [Moon et al., 2001].

Explore new heuristics The heuristics proposed are simple and had the purpose of testing the Hilbert curve's performance as a low-dimensional index. However, a more complete heuristic could be explored in order to optimize the use of space-filling curves.

Explore new dimensions Since the curve worked up to 36 dimensions, it could be tested in higher dimensions, applied directly to the original data as in the Enzo RS. It would be interesting to know the Hilbert curve's limitations in terms of dimensions.


Bibliography

F. K. B.-U. Pagel and C. Faloutsos. Deflating the dimensionality curse using multiple fractal dimensions. In Proceedings of the 16th International Conference on Data Engineering, ICDE '00, pages 589–, Washington, DC, USA, 2000. IEEE Computer Society. ISBN 0-7695-0506-6. URL http://dl.acm.org/citation.cfm?id=846219.847327.

M. Barnsley. Fractals Everywhere: New Edition. Dover Books on Mathematics. Dover Publications, 2013. ISBN 9780486320342. URL https://books.google.pt/books?id=PbMAAQAAQBAJ.

J. L. Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM, 18(9):509–517, Sept. 1975. ISSN 0001-0782. doi: 10.1145/361002.361007. URL http://doi.acm.org/10.1145/361002.361007.

J. L. Bentley and J. H. Friedman. Data structures for range searching. ACM Comput. Surv., 11(4):397–409, Dec. 1979. ISSN 0360-0300. doi: 10.1145/356789.356797. URL http://doi.acm.org/10.1145/356789.356797.

S. Berchtold, D. A. Keim, and H.-P. Kriegel. The X-tree: An index structure for high-dimensional data. pages 28–39, 1996.

S. Berchtold, C. Bohm, and H.-P. Kriegel. The pyramid-technique: Towards breaking the curse of dimensionality. SIGMOD Rec., 27(2):142–153, June 1998. ISSN 0163-5808. doi: 10.1145/276305.276318. URL http://doi.acm.org/10.1145/276305.276318.

C. Bohm, S. Berchtold, and D. A. Keim. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Comput. Surv., 33(3):322–373, Sept. 2001. ISSN 0360-0300. doi: 10.1145/502807.502809. URL http://doi.acm.org/10.1145/502807.502809.

A. R. Butz. Alternative algorithm for Hilbert's space-filling curve. IEEE Transactions on Computers, 20(4):424–426, 1971.

H.-L. Chen and Y.-I. Chang. All-nearest-neighbors finding based on the Hilbert curve. Expert Syst. Appl., 38(6):7462–7475, June 2011. ISSN 0957-4174. doi: 10.1016/j.eswa.2010.12.077. URL http://dx.doi.org/10.1016/j.eswa.2010.12.077.

B. Clarke, E. Fokoue, and H. Zhang. Principles and Theory for Data Mining and Machine Learning. Springer Series in Statistics. Springer New York, 2009. ISBN 9780387981352. URL https://books.google.pt/books?id=RQHB4_p3bJoC.

D. Comer. Ubiquitous B-tree. ACM Comput. Surv., 11(2):121–137, June 1979. ISSN 0360-0300. doi: 10.1145/356770.356776. URL http://doi.acm.org/10.1145/356770.356776.

R. Dickau. Hilbert and Moore 3D fractal curves, January 2015. URL http://demonstrations.wolfram.com/HilbertAndMoore3DFractalCurves/.

K. Falconer. Fractal Geometry: Mathematical Foundations and Applications. Wiley, 2007. ISBN 9780470848616. URL http://books.google.pt/books?id=xTKvG_j4LvsC.

C. Faloutsos. Gray codes for partial match and range queries. IEEE Trans. Softw. Eng., 14(10):1381–1393, Oct. 1988. ISSN 0098-5589. doi: 10.1109/32.6184. URL http://dx.doi.org/10.1109/32.6184.

C. Faloutsos and K.-I. D. Lin. FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. pages 163–174, 1995.

C. Faloutsos and S. Roseman. Fractals for secondary key retrieval. In Proceedings of the Eighth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS '89, pages 247–252, New York, NY, USA, 1989. ACM. ISBN 0-89791-308-6. doi: 10.1145/73721.73746. URL http://doi.acm.org/10.1145/73721.73746.

R. Finkel and J. Bentley. Quad trees: a data structure for retrieval on composite keys. Acta Informatica, 4:1–9, 1974. ISSN 0001-5903. doi: 10.1007/BF00288933. URL http://dx.doi.org/10.1007/BF00288933.

M. Frame and B. Mandelbrot. Fractals, Graphics, and Mathematics Education. MAA Notes. Mathematical Association of America, 2002. ISBN 9780883851692. URL https://books.google.pt/books?id=Wz7iCaiB2C0C.

T. C. Hales. Jordan's proof of the Jordan curve theorem. Studies in Logic, Grammar and Rhetoric, 10(23):45–60, 2007.

C. Hamilton. Compact Hilbert indices. Technical report, Faculty of Computer Science, 6050 University Ave., Halifax, Nova Scotia, B3H 1W5, Canada, 2006.

C. H. Hamilton and A. Rau-Chaplin. Compact Hilbert indices: Space-filling curves for domains with unequal side lengths. Inf. Process. Lett., 105(5):155–163, Feb. 2008. ISSN 0020-0190. doi: 10.1016/j.ipl.2007.08.034. URL http://dx.doi.org/10.1016/j.ipl.2007.08.034.

K.-I. Lin, H. V. Jagadish, and C. Faloutsos. The TV-tree – an index structure for high-dimensional data. VLDB Journal, 3:517–542, 1994.

H. V. Jagadish. Linear clustering of objects with multiple attributes. SIGMOD Rec., 19(2):332–342, May 1990. ISSN 0163-5808. doi: 10.1145/93605.98742. URL http://doi.acm.org/10.1145/93605.98742.

J. Lawder and P. King. Using space-filling curves for multi-dimensional indexing. In B. Lings and K. Jeffery, editors, Advances in Databases, volume 1832 of Lecture Notes in Computer Science, pages 20–35. Springer Berlin Heidelberg, 2000. ISBN 978-3-540-67743-7. doi: 10.1007/3-540-45033-5_3. URL http://dx.doi.org/10.1007/3-540-45033-5_3.

D. B. Lomet and B. Salzberg. The hB-tree: a multiattribute indexing method with good guaranteed performance. ACM Trans. Database Syst., 15(4):625–658, Dec. 1990. ISSN 0362-5915. doi: 10.1145/99935.99949. URL http://doi.acm.org/10.1145/99935.99949.

N. Mamoulis. Spatial Data Management. Synthesis Lectures on Data Management. Morgan & Claypool, 2012. ISBN 9781608458325. URL http://books.google.pt/books?id=6z5grzUcPhoC.

B. Mandelbrot. The Fractal Geometry of Nature. Times Books, 1982.

B. B. Mandelbrot. Les objets fractals: Forme, hasard et dimension. Flammarion, Paris, 1975.

B. Moon, H. V. Jagadish, C. Faloutsos, and J. H. Saltz. Analysis of the clustering properties of the Hilbert space-filling curve. IEEE Transactions on Knowledge and Data Engineering, 13, 2001.

G. M. Morton. A computer oriented geodetic data base and a new technique in file sequencing. International Business Machines Company, 1966.

J. Nievergelt, H. Hinterberger, and K. C. Sevcik. The grid file: An adaptable, symmetric multikey file structure. ACM Trans. Database Syst., 9(1):38–71, Mar. 1984. ISSN 0362-5915. doi: 10.1145/348.318586. URL http://doi.acm.org/10.1145/348.318586.

Ooi, K. J. Mcdonell, and S. R. Davis. Spatial kd-tree: An indexing mechanism for spatial databases. COMPSAC conf., pages 433–438, 1987.

R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill International Editions: Computer Science Series. McGraw-Hill Companies, Incorporated, 2002. ISBN 9780072465631. URL http://books.google.pt/books?id=JSVhe-WLGZ0C.

J. T. Robinson. The K-D-B-tree: a search structure for large multidimensional dynamic indexes. In Proceedings of the 1981 ACM SIGMOD International Conference on Management of Data, pages 10–18. ACM, 1981.

R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the 24th International Conference on Very Large Data Bases, VLDB '98, pages 194–205, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc. ISBN 1-55860-566-5. URL http://dl.acm.org/citation.cfm?id=645924.671192.

A. Wichert. Content-based image retrieval by hierarchical linear subspace method. J. Intell. Inf. Syst., 31(1):85–107, Aug. 2008. ISSN 0925-9902. doi: 10.1007/s10844-007-0041-4. URL http://dx.doi.org/10.1007/s10844-007-0041-4.

A. Wichert. Subspace tree. In Seventh International Workshop on Content-Based Multimedia Indexing (CBMI '09), pages 38–43, June 2009. doi: 10.1109/CBMI.2009.14.

A. Wichert. Intelligent Big Multimedia Databases. World Scientific, 2015.

C. Yu. High-Dimensional Indexing: Transformational Approaches to High-Dimensional Range and Similarity Searches. Lecture Notes in Computer Science. Springer, 2002. ISBN 9783540441991. URL http://books.google.pt/books?id=CgsFwuSC4q8C.

Appendix A

Appendix

A.1 Chapter: Fractals

Decimal Code    Binary Code    Gray Code
0               000            000
1               001            001
2               010            011
3               011            010
4               100            110
5               101            111
6               110            101
7               111            100

Table A.1: The conversion between Decimal, Binary and Gray code with three bits.

Theorem 2. Given a \(2^{k+n} \times 2^{k+n}\) grid region, the average number of clusters within a \(2^k \times 2^k\) query window is

\[ N_2(k, k+n) = \frac{(2^n - 1)^2 \, 2^{3k} + (2^n - 1) \, 2^k + 2^n}{(2^{k+n} - 2^k + 1)^2} \tag{A.1} \]


A.2 Chapter: Approximate Space-Filling Curve

Query Point    S2    S4     S5      S10     S30     S36
P0              1     4     839     349    1835      87
P1              1     1     613    1223    1541      70
P2              1     1     503     606    1788      24
P3              1     1     576    1516    2669      56
P4              1     1     307    1411     853      14
P5              1     2     318     214    1599      69
P6              1     1     312     974    1308     169
P7              1     1     760    1086    1881      98
P8              1     1     339     740     989      20
P9              1     1     514     898    2022      65
Mean Value      1    1,4   508,1   901,7  1648,5    67,2

Table A.2: Enzo RS results from space 2 to 36.
