
Efficient OLAP Operations for Spatial Data Using Peano Cube

Baoying Wang, Fei Pan, Dongmei Ren, Yue Cui, Qiang Ding, William Perrizo
{baoying.wang, fei.pan, dongmei.ren, yue.cui, qiang.ding, william.perrizo}@ndsu.nodak.edu

Computer Science Department
North Dakota State University
Fargo, ND 58105
Tel: (701) 231-6257
Fax: (701) 231-8255

Abstract

Online Analytical Processing (OLAP) is an important application of data warehouses. With more and more spatial data being collected, such as remotely sensed images, geographical information, digital sky survey data, computer cartography, environmental assessment data and planning data, efficient OLAP for spatial data is in great demand. In this paper, we present a data warehousing architecture, the Peano Cube, to facilitate OLAP operations on spatial data, e.g., slice/dice, rollup, range queries, and nearest neighbor queries. Fast logical operations on Peano Trees (P-trees) are used to accomplish these OLAP operations. A natural distance metric for P-trees, the High Order Basis Bit (HOBBit) metric, is employed to achieve efficient nearest neighbor queries. Predicate P-trees are used to reduce data access by using logical operations to filter out “bit holes” consisting of consecutive 0s. Experiments show that OLAP operations can be executed much faster this way than with traditional data methods.

Keywords: Online Analytical Processing (OLAP), Peano Cube, P-trees, HOBBit metric.

1. Introduction

Data warehouses (DWs) are collections of historical, summarized, non-volatile data, accumulated from transactional databases. They are optimized for Online Analytical Processing (OLAP) and have proven to be valuable for decision-making [29]. A copy of the relevant data is periodically appended to the DWs with a proper timestamp. Knowledge workers can analyze the data without having to wait for updates to be installed.

With more and more spatial data being collected, such as remotely sensed images, geographical information, digital sky survey data, computer cartography, environmental assessment data and planning data, it is important to study online analytical processing of spatial data warehouses.

The data in a warehouse are conceptually modeled as data cubes. Gray et al. introduce the data cube operator as a generalization of the SQL group by operator [23]. The chunk-based multiway array aggregation method for data cube computation in OLAP was proposed in [24]. Typically, OLAP queries are complex and the size of the data warehouse is huge, which can cause the queries to take very long to complete, if executed directly on raw data. This delay is unacceptable in most data warehouse environments, as it severely limits productivity. The usual requirement is query execution times of a few seconds or a few minutes at the most.

Two major approaches have been proposed to efficiently process queries in a data warehouse: speeding up the queries by using index structures, and speeding up queries by operating on compressed data. Many different index structures have been proposed for high-dimensional data [20][21][22]. Most multi-dimensional index structures have an exponential dependence (in space or time or both) upon the number of dimensions. Some special indexing techniques, such as the bitmap index, join index and bitmap join index, have been introduced. The bitmap index uses a bit vector to represent the membership of the tuples in a table, i.e., whether or not the tuples have an attribute with the same value. A significant advantage of bitmap indexes is that complex logical selection operations can be performed very quickly by applying bit-wise AND, OR, and NOT operators. However, bitmap indexes are space inefficient for high-cardinality attributes. They are only suitable for narrow domains, especially for membership functions with 0 or 1 as function values. The problem associated with sparse bitmap indexes is discussed in [25]. In order to overcome this problem, some improved bitmaps, e.g., the encoded bitmap [27], have been proposed. The join index gained popularity from its use in relational database query processing. Join indexing is especially useful for maintaining the relationship between a foreign key and its matching primary keys from the joinable relations.

Patents are pending on the P-tree technology. This work is partially supported by GSA Grant ACT#: K96130308.

Recently, a new lossless data structure, the Peano Count Tree (P-tree), and a natural distance metric for P-trees, the HOBBit metric, have been introduced [30][31]. Peano Count trees are a lossless, quadrant-based compression data structure for spatial data. They include many variations, such as Pure1 P-trees, Pure0 P-trees, Not Pure0 P-trees, Peano Mixed trees, etc. These variations are tree-based bitmap indexes for Peano quadrants. High speed from logical AND, OR, and NOT operations, together with fewer data scans from bit-level indexing and compression, makes this an efficient data structure that is well suited to multi-dimensional spatial data.

In this paper, we present a data warehousing architecture, the Peano Cube, to facilitate OLAP operations on spatial data, e.g., slice/dice, rollup, range queries, and nearest neighbor queries. These OLAP operations can be used to answer questions like: “find all galaxies brighter than magnitude 22”, “find gravitational lens candidates”, or “find other objects like this one.” Fast logical operations on Peano Trees (P-trees) are used to accomplish these OLAP operations. A natural distance metric for P-trees, the High Order Basis Bit (HOBBit) metric, is employed to achieve efficient nearest neighbor queries. Predicate P-trees are used to reduce data access by using logical operations to filter out “bit holes” consisting of consecutive 0s. Experiments show that OLAP operations can be executed much faster this way than with traditional data methods.

This paper is organized as follows. In section 2, we first review Peano count trees (P-trees) and their operations, and then describe a natural distance metric for P-trees, the HOBBit metric. In section 3, we propose a new data warehouse architecture, the Peano Cube, and efficient OLAP operations using HOBBit rings. Finally, we compare our Peano Cube with Microsoft SkyServer experimentally in section 4 and conclude the paper in section 5. The symbols used in this paper are collected in Table 1 and explained later.

Table 1. Symbols and Notations

Symbol    Definition
X         Spatial data, X = {x1, x2, …, xn}
n         Number of attributes
m         Maximal bit length of attributes
r         Radius of HOBBit ring
Pi,j      Basic P-tree for bit j of attribute i
Pi,j’     Complement of Pi,j
bi,j      The jth bit of the ith attribute of x
Pxi,j     Operator P-tree of the jth bit of the ith attribute of x
Pvi,r     Value P-tree within ring r
Px,r      Tuple P-tree within ring r
Qid       Quadrant identification


2. Review of Peano Count Trees and HOBBit Metrics

In this section, we first review Peano count trees (P-trees) and their variations, predicate P-trees, and then describe a natural distance metric for P-trees, the HOBBit metric.

2.1 Peano Count Trees

Peano Count trees are a lossless, quadrant-based compression data structure for spatial data. They include many variations, such as Pure1 P-trees, Pure0 P-trees, Not Pure0 P-trees, Peano Mixed trees, etc. These variations are tree-based bitmap indexes for Peano quadrants. High speed from logical AND, OR, and NOT operations, together with fewer data scans from bit-level indexing and compression, makes this an efficient data structure that is well suited to multi-dimensional spatial data.

2.1.1 Basic P-trees

The basic Peano Count Tree (P-tree) is a tree structure organizing a multi-dimensional relational data set with numerical values. Each attribute is split into separate files, one for each bit position. A basic P-tree, Pi,j, is a P-tree for the jth bit of the ith attribute. For an attribute of m bits, there are m basic P-trees, one for each bit position. The complement of a basic P-tree Pi,j is a P-tree with the complement value for every bit, which is denoted as Pi,j’. Figure 1 shows an example of basic P-tree construction and its complement.

a) 8×8 bSQ file

1 1 1 1 1 1 0 0
1 1 1 1 0 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
0 0 1 1 0 0 0 0
0 1 1 1 0 0 0 0

b) Basic P-tree (1-bit counts, quadrants in Peano order)

Root: 36
Level 1: 16, 7, 13, 0
Level 2 under 7: 2, 0, 4, 1
Level 2 under 13: 4, 4, 1, 4
Leaves under the mixed level-2 nodes: 1100, 0010, 0001

c) Complement P-tree (0-bit counts, quadrants in Peano order)

Root: 28
Level 1: 0, 9, 3, 16
Level 2 under 9: 2, 4, 0, 3
Level 2 under 3: 0, 0, 3, 0
Leaves under the mixed level-2 nodes: 0011, 1101, 1110

Figure 1. Example of P-tree Construction

In Figure 1, the 8×8 bSQ file is on the left and the corresponding P-tree is in the middle. The root count is 36, and the counts at the next level, 16, 7, 13, and 0, are the 1-bit counts for the four major quadrants (in Peano or Z order). Since the first and last quadrants are made up entirely of 1-bits and 0-bits respectively, we do not need sub-trees for these two quadrants. The complement is shown on the right. It provides the 0-bit counts for each quadrant. We identify quadrants using a quadrant identifier, Qid: the string of successive sub-quadrant numbers (0, 1, 2 or 3 in Z or Peano order), separated by “.” (as in IP addresses). Thus, the Qid of the third sub-quadrant of the third major quadrant in Figure 1 (the node with count 1 under 13) is 2.2.

2.1.2 Predicate P-trees

Predicate P-trees are variations of P-trees. There are many predicate P-trees, such as Pure1 P-trees, Pure0 P-trees, Not Pure0 P-trees, Peano Mixed trees, etc. [30]. A node of a predicate P-tree is set to 1 if and only if the condition is true for the quadrant, and to 0 otherwise. In this paper, we exploit Pure1 P-trees (P1-trees), Not Pure0 P-trees (NP0-trees) and Peano Mixed trees (PM-trees), which are tree-based bitmaps for Peano quadrants. Examples of the construction of these three predicate P-trees are shown in Figure 2.

Figure 2. Three Predicate P-trees
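The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `build_count_tree` and `predicate_tree` are hypothetical, the tree is a nested tuple `(value, children)`, and pure quadrants are stored without children. The input is the 8×8 bSQ file of Figure 1.

```python
# 8x8 bSQ file from Figure 1, one list per row.
BITS = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 0],
]


def build_count_tree(bits, row=0, col=0, size=None):
    """Return (count, children) for the square quadrant at (row, col).

    children is None for pure quadrants and single cells; otherwise it
    holds the four sub-quadrants in Peano (Z) order.
    """
    if size is None:
        size = len(bits)
    if size == 1:
        return (bits[row][col], None)
    half = size // 2
    # Z order: top-left, top-right, bottom-left, bottom-right.
    children = [
        build_count_tree(bits, row + dr, col + dc, half)
        for dr, dc in ((0, 0), (0, half), (half, 0), (half, half))
    ]
    count = sum(child[0] for child in children)
    if count == 0 or count == size * size:
        return (count, None)  # pure quadrant: no sub-tree needed
    return (count, children)


def predicate_tree(node, cells, predicate):
    """Map a count tree to a 0/1 predicate tree with the same shape."""
    count, children = node
    flag = int(predicate(count, cells))
    if children is None:
        return (flag, None)
    return (flag, [predicate_tree(c, cells // 4, predicate) for c in children])


def pure1(count, cells):
    return count == cells


def not_pure0(count, cells):
    return count > 0


count_tree = build_count_tree(BITS)
p1_tree = predicate_tree(count_tree, 64, pure1)
np0_tree = predicate_tree(count_tree, 64, not_pure0)
print(count_tree[0], [child[0] for child in count_tree[1]])
```

The printed root count and level-1 counts can be compared directly with the values quoted for Figure 1.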

Logical AND and OR are the most important and most frequently used P-tree logical operations. Predicate P-trees are tree-based bitmaps for Peano quadrants. By using predicate P-tree operations, we filter out the bit holes of consecutive 0s and get the mixed quadrants. Then we load only the mixed quadrants of the Peano sequence into main memory, thus reducing data access. The AND and OR operations using NP0-trees are illustrated in Figure 3.

a). NP0-tree1 b). NP0-tree2 c). AND Result d). OR Result

Figure 3. NP0-tree Operations: AND and OR

In Figure 3, a) and b) are two NP0-trees, and c) and d) are their AND result and OR result, respectively. From c), we know that the bits at positions 17 to 32 and 49 to 64 are pure 0, since the quadrants with Qid 1 and 3 are 0. The quadrants with Qid 0 and 2 are the non-pure0 parts, so we only need to load these two quadrants. The logical operation results calculated in this way are exactly the same as those of ANDing the two bit sequences, but with reduced data access.

The logical operations can be executed on any predicate trees. The operation rules are independent of the predicate of the P-trees, so the operations on other predicate trees are the same. There are several ways to perform P-tree logical operations. The basic way is to perform the operations level by level, starting from the root level. The node operation is commutative. The rules of the operations are summarized in Table 2.

It is natural for the operations on predicate P-trees to make use of parallel techniques, such as disk RAID techniques, MPI, etc. These are natural extensions of the basic P-tree algebra, and the details are beyond the scope of this paper. Another important consideration for database query performance is the fact that Boolean operations, such as AND, OR, and NOT, are extremely fast for P-trees. We treat the P-trees as arrays of bitsets and loop through them [32].

[Tree diagrams: Figure 2 shows the P1-tree, NP0-tree and PM-tree; Figure 3 shows NP0-tree1, NP0-tree2, their AND result and their OR result.]

Table 2. P-tree Logical Operation Rules

Operand 1   Operand 2   AND Result   OR Result
0           0           0            0
0           1           0            1
1           1           1            1
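To make the idea of reduced data access concrete, the following sketch ANDs two 64-bit Peano sequences while touching only the quadrants whose NP0 flags are both 1. It is a simplified, one-level illustration: `QUADRANT`, `np0_flags` and `filtered_and` are hypothetical names, the sample sequences are made up, and the flags are computed on the fly here, whereas a real predicate tree would store them in advance.

```python
QUADRANT = 16  # bits per top-level quadrant of a 64-bit Peano sequence


def np0_flags(bits):
    """One NP0 flag per quadrant: 1 if the quadrant holds any 1-bit."""
    return [int(any(bits[q:q + QUADRANT])) for q in range(0, len(bits), QUADRANT)]


def filtered_and(a, b):
    """AND two bit sequences, skipping quadrants that are pure 0 in either."""
    flags = [fa & fb for fa, fb in zip(np0_flags(a), np0_flags(b))]
    result = []
    loaded = 0  # number of quadrants whose raw bits had to be read
    for i, flag in enumerate(flags):
        lo = i * QUADRANT
        if flag:
            result.extend(x & y for x, y in zip(a[lo:lo + QUADRANT], b[lo:lo + QUADRANT]))
            loaded += 1
        else:
            result.extend([0] * QUADRANT)  # known pure 0 without reading the bits
    return result, flags, loaded


# Made-up sample data: quadrant flags are 0 1 1 0 and 1 0 1 0.
seq_a = [0] * 16 + [1, 0] * 8 + [1, 1, 0, 0] * 4 + [0] * 16
seq_b = [1] * 16 + [0] * 16 + [0, 1, 1, 0] * 4 + [0] * 16

anded, and_flags, quadrants_loaded = filtered_and(seq_a, seq_b)
print(and_flags, quadrants_loaded)
```

Note that an AND of NP0 flags only rules quadrants out: a quadrant flagged 1 in both operands may still AND to all zeros, so its bits must be read to get the exact result.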

By using HOBBit distance, P-tree logical operations can be computed very fast. The algorithm for HOBBit ring based P-tree operations is detailed in the next section.

2.2 HOBBit Metrics

Many distance metrics have been used in clustering algorithms. Representative metrics include Euclidean distance, Manhattan distance, and Max distance. The High Order Basis Bit (HOBBit) metric is a bitwise distance function [31]. It measures distance based on the most significant consecutive bit positions starting from the left. Bits at different positions contribute differently to the similarity measurement. When comparing two values bitwise from left to right, once a difference is found, no further comparisons are needed.

Let Ai be a non-negative fixed-point attribute in data set R(A1, A2, ..., An). Each Ai is represented as a fixed-point binary number x, i.e., x = x(m)x(m-1)...x(1)x(0).x(-1)...x(-n). Let X and Y be two values of x. The HOBBit similarity between X and Y is defined by

$$\max \{\, i \mid x(i) \oplus y(i) = 1 \,\}$$

where x(i) and y(i) are the ith bits of X and Y respectively, and ⊕ denotes XOR. In other words, it is the leftmost position at which X and Y differ. The HOBBit distance between two tuples X and Y is defined by

.

For two values X and Y of a signed fixed-point binary attribute Ai, the HOBBit distance between X and Y is the same as above if X and Y have the same sign. If X and Y have opposite signs, then the distance is

.

Definition 3.2.1. HOBBit Ring. The HOBBit ring of radii r1 and r2 centered at c, R(c, r1, r2), is defined as the set of points x ∈ X whose HOBBit distance d(c, x) from c lies between r1 and r2. Figure 4 shows a diagram of the HOBBit ring R(c, r1, r2) in a spatial data set X.

Figure 4. Diagram of HOBBit Ring R(c, r1, r2)
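As a rough illustration of the leftmost-differing-bit idea, the sketch below works on non-negative Python integers. Several things here are assumptions rather than the paper's definitions: positions are counted from 1 at the least significant bit so that 0 can stand for "no difference", the helper names `hobbit_position` and `in_ring` are hypothetical, and the ring test simply treats its two radii as an inclusive inner and outer bound on a single attribute.

```python
def hobbit_position(x: int, y: int) -> int:
    """Leftmost bit position at which x and y differ.

    Positions are counted from 1 at the least significant bit; the
    result is 0 when x == y. XOR marks the differing bits, and
    int.bit_length gives the position of the highest marked bit.
    """
    return (x ^ y).bit_length()


def in_ring(center: int, value: int, inner: int, outer: int) -> bool:
    """True if value lies in the ring inner <= position <= outer around center."""
    return inner <= hobbit_position(center, value) <= outer


center = 0b1001
ring = sorted(v for v in range(16) if in_ring(center, v, 3, 3))
print(hobbit_position(0b1011, 0b1001), ring)
```

Because the comparison stops at the first differing bit, all values in one ring share the same high-order prefix with the center, which is what makes rings cheap to express as P-tree masks.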


3. OLAP Operation Using Peano Cube

In this section, we present a new data warehouse architecture, Peano Cube, for multi-dimensional spatial data. OLAP operations, such as slice/dice, rollup and nearest neighbor queries, etc., are developed in section 3.2.

3.1 Peano Cube Model

A Data Cube is a visual organization for multi-dimensional data. A data cube allows data to be modeled and viewed in multiple dimensions. It is defined in terms of dimensions and facts. In general terms, dimensions are the perspectives or entities with respect to which an organization wants to keep records. For example, a traffic supervision system may have a spatial data warehouse of remotely sensed images. The data warehouse is used to keep records of the area’s traffic with respect to the dimensions of coordinates and time. These dimensions may allow the traffic department to keep track of traffic by areas and time.

A multidimensional data model is typically organized around a central theme. This theme is represented by a fact table. Facts have numerical measurements. The fact table contains the names of the facts, as well as keys to each of the related dimension tables. Each dimension may have a table associated with it, called a dimension table, which further describes the dimension.

The data warehouse is organized as a star, snowflake or galaxy schema. The star schema model includes a large central Fact table and a set of smaller dimension tables. The snowflake schema is a variant of star in which the dimension tables are normalized. It reduces redundancy and is a better design. The galaxy schema, also called the constellation, includes multiple fact tables that share dimension tables.

In this paper, we propose a new data warehouse architecture, the Peano Cube, to represent multi-dimensional spatial data. For spatial data, the Peano Cube can have many pure1 and pure0 quadrants. Its predicate trees can have very high compression rates, which is essential for efficient data access. Its characteristics include the following:

The Peano Cube is a bitwise cube. Each attribute is split into bits, with one basic Peano Cube for each bit position. For data with m-bit values, we have m basic Peano Cubes.

Peano Cubes take advantage of continuity and sparseness in spatial data, by using the Peano order instead of raster order. They balance between high compression rates and random data access for the sake of efficient OLAP operations, e.g. rollup and rotate. They provide fairness with respect to dimensions.

Peano Cubes use predicate P-trees, which are tree-based bitmap indexes built on quadrants. By means of fast logical AND, OR, and NOT operations on predicate P-trees, Peano Cubes achieve efficient data access, which is very important for massive spatial data sets.

Suppose we have a 3-D data cube representing the crop yield in an area at certain time. It has three dimensions: X, Y, and T (time). Figure 5 shows a fact table of yield, its data cube and Peano cube for every bit. Since yield is a 4-bit value, we split yield into four Peano bit sequence files and then build 4 Peano Cubes correspondingly.

We also build predicate trees, NP0 and P1, based on the Peano sequence files. The processes and algorithms are described in Section 2.
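The two steps described here, ordering cells in Peano order and splitting a 4-bit measure into one bit sequence per bit position, can be sketched as follows. This is an illustration only: `peano_index` and `bit_planes` are hypothetical names, coordinates are taken as 1-based as in the fact table of Figure 5, and X is assumed to vary fastest, which is the order in which the fact table lists its rows.

```python
def peano_index(x, y, t, bits_per_dim=2):
    """0-based Peano (Z-order) position of a cell given 1-based coordinates.

    The bits of x, y and t are interleaved, with x in the lowest position
    of each group of three bits.
    """
    index = 0
    for level in range(bits_per_dim):
        for offset, coord in enumerate((x - 1, y - 1, t - 1)):
            index |= ((coord >> level) & 1) << (3 * level + offset)
    return index


def bit_planes(values, width=4):
    """Split values into `width` bit sequences; planes[0] is the highest bit."""
    return [[(v >> (width - 1 - j)) & 1 for v in values] for j in range(width)]


# Rows 9 to 12 of the fact table: cells (3,1,1), (4,1,1), (3,2,1), (4,2,1).
cells = [(3, 1, 1), (4, 1, 1), (3, 2, 1), (4, 2, 1)]
yields = [15, 4, 1, 12]
positions = [peano_index(*cell) for cell in cells]
planes = bit_planes(yields)
print(positions, planes)
```

Each list in `planes` is the slice of one bit position over these four cells; over the full cube, each such bit sequence would feed one basic Peano Cube.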

(a) Fact Table of Yield in Peano Order

X  Y  T  Yield
1  1  1  15 (1111)
2  1  1  15 (1111)
1  2  1  15 (1111)
2  2  1  15 (1111)
1  1  2  15 (1111)
2  1  2  15 (1111)
1  2  2  15 (1111)
2  2  2  15 (1111)
3  1  1  15 (1111)
4  1  1   4 (0100)
3  2  1   1 (0001)
4  2  1  12 (1100)
3  1  2  12 (1100)
4  1  2   2 (0010)
3  2  2  12 (1100)
4  2  2  12 (1100)
1  3  1  15 (1111)
2  3  1  15 (1111)
1  4  1   2 (0010)
2  4  1   0 (0000)
1  3  2  15 (1111)
2  3  2  15 (1111)
1  4  2   2 (0010)
2  4  2   0 (0000)
3  3  1  12 (1100)
4  3  1  12 (1100)
3  4  1  12 (1100)
4  4  1  12 (1100)
3  3  2  12 (1100)
4  3  2  12 (1100)
3  4  2  12 (1100)
4  4  2  12 (1100)
1  1  3  15 (1111)
2  1  3  15 (1111)
1  2  3  15 (1111)
2  2  3  15 (1111)
1  1  4  15 (1111)
2  1  4  15 (1111)
1  2  4  15 (1111)
2  2  4  15 (1111)
3  1  3  15 (1111)
4  1  3  10 (1010)
3  2  3   1 (0001)
4  2  3  14 (1110)
3  1  4  14 (1110)
4  1  4   2 (0010)
3  2  4  12 (1100)
4  2  4  12 (1100)
1  3  3  15 (1111)
2  3  3  15 (1111)
1  4  3   2 (0010)
2  4  3   0 (0000)
1  3  4  15 (1111)
2  3  4  15 (1111)
1  4  4   2 (0010)
2  4  4   0 (0000)
3  3  3  12 (1100)
4  3  3  12 (1100)
3  4  3  12 (1100)
4  4  3  12 (1100)
3  3  4  12 (1100)
4  3  4  12 (1100)
3  4  4  12 (1100)
4  4  4  12 (1100)

(b) Data Cube of Field Yield (c) Peano Cubes for Each Bit of Yield

Figure 5. A Fact Table, Its Data Cube and Peano Cubes


3.2 OLAP Operation

In this section, we develop the basic OLAP operations, including slice/dice, rollup and the nearest neighbor query, using our proposed Peano Cubes and predicate P-trees. The OLAP operations are the basis for answering spatial data warehouse questions like: “find all galaxies brighter than magnitude 22”, “find gravitational lens candidates”, or “find the total traffic flow during certain time periods from remotely sensed images.”

3.2.1 Slice/Dice

The Slice operation is a selection along one dimension, while the dice operation defines a sub-cube by performing a selection on two or more dimensions. Slice and Dice are very efficiently accomplished using the Peano Cube. Typical select statements may have a number of predicates in their “where” clause that must be combined in a Boolean manner. The predicates may include “=”, “<” and “>”. These predicates lead to two different query scenarios, where “=” clause results in an equal query, and “<” or “>” in a range query.

3.2.1.1 Equality Select Slice

Suppose we have a 3-D data cube representing the yield in a field at a certain time. Its dimensions are X = {0, 1}, Y = {0, 1} and T = {0, 1}. Converting the yield from decimal to binary, we get a relational table. The data cube and its corresponding relational table are shown in Figure 6.

Figure 6. A 3-D Cube Quadrant and Binary Relational Table For 3-D Cube Quadrant

From the table in Figure 6 we build 6 P-trees as shown in Figure 7 according to the P-tree construction rule.

P11 P12 P21 P22 P31 P32 P41 P42 P43

Figure 7. P-trees for 3-D cube Quadrant

Suppose we want to know the yield at the places where Y = 1. The query is “Select * from Peano Cube where Y = 1”. The select process is to get the P-tree mask and then trim the P-trees to represent the selected 2-dimensional part. The detailed process is as follows.

Binary relational table for the 3-D cube quadrant (Figure 6):

X  Y  T  Yield
0  0  0  111
1  0  0  011
0  1  0  001
1  1  0  100
0  0  1  011
1  0  1  010
0  1  1  100
1  1  1  001

Bit sequences of X, Y, T and the three Yield bits (Figure 7):

X:            0 1 0 1 0 1 0 1
Y:            0 0 1 1 0 0 1 1
T:            0 0 0 0 1 1 1 1
Yield bit 1:  1 0 0 1 0 0 1 0
Yield bit 2:  1 1 0 0 1 1 0 0
Yield bit 3:  1 1 1 0 1 0 0 1

Use the mask P-tree PM to generate a new set of P-trees. The new P-tree set is the result of the select. Generating the new P-trees is simple: just trim the nodes corresponding to the positions of pure 0 in PM. Taking P43 as an example, the trimming process is shown in Figure 8.

Figure 8. Trim Process of P-tree based on PM

After trimming the whole P-tree set, we can easily convert the trimmed P-tree set to the table and cube shown in Figure 9.

X  Y  T  Yield
0  1  0  001
1  1  0  100
0  1  1  100
1  1  1  001

Figure 9. Result for “Select * from Cube where Y = 1”
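The equality select can be mimicked on flat bit sequences, which is a simplification of the tree-based trimming described in the text. In this sketch the rows are those of the binary relational table of Figure 6, the bit sequence of Y serves as the mask for Y = 1, and `trim` is a hypothetical helper that keeps only the positions where the mask is 1.

```python
# Rows (X, Y, T, Yield) of the binary relational table in Figure 6.
ROWS = [
    (0, 0, 0, 0b111), (1, 0, 0, 0b011), (0, 1, 0, 0b001), (1, 1, 0, 0b100),
    (0, 0, 1, 0b011), (1, 0, 1, 0b010), (0, 1, 1, 0b100), (1, 1, 1, 0b001),
]

# The bit sequence of the 1-bit attribute Y is itself the mask for "Y = 1".
y_mask = [row[1] for row in ROWS]

# Lowest bit of Yield, one entry per row.
yield_low_bit = [row[3] & 1 for row in ROWS]


def trim(sequence, mask):
    """Keep the entries of sequence whose mask bit is 1."""
    return [item for item, keep in zip(sequence, mask) if keep]


trimmed_bits = trim(yield_low_bit, y_mask)
selected_rows = trim(ROWS, y_mask)
for x, y, t, value in selected_rows:
    print(x, y, t, format(value, "03b"))
```

Trimming every bit sequence with the same mask and reassembling the rows gives the four tuples with Y = 1.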

3.2.1.2 Range Slice

Typical range queries for a spatial data warehouse include questions like: “Find quasars with a line width > 2000 km/s and 2.5 < redshift < 2.7”. We analyze four query cases as follows.

a. A simple case: Select * where Y > 1001. Given that the number of bits for the second attribute Y is fixed at 4, we can easily get the values that are greater than 1001. From Figure 10, we can see that {“Y > 1001”} consists of two subsets, {“Y = 11**”} and {“Y = 101*”}, where * is 1 or 0. Therefore the query clause can be written as “where Y = 11** || Y = 101*”.

Figure 10. Range Query of the Second Attribute Y > 1001

The query is answered with the corresponding P-tree mask PMgt = PM1 || PM2. PM1 is the predicate P-tree for data that start with “11” and PM2 is the one for data that start with “101”. Since Y is the second attribute of the Peano Cube, we have

PM1 = P21 & P22
PM2 = P21 & P’22 & P23

where
P21 is the predicate P-tree for data that have value 1 in the first bit position of the second attribute,
P’22 is the predicate P-tree for data that have value 0 in the second bit position of the second attribute, and
P22 and P23 follow the same rule.

b. A more general case: Select * from Peano Cube where T > 01011001. The retrieval process of this query is shown in Figure 11. Attribute T is the third dimension of the Peano Cube. {T > 01011001} can be divided into a number of subsets: {“1*******”}, {“011*****”}, {“010111**”} and {“0101101*”}, where * is 1 or 0. The retrieval process can be generalized as follows:

Figure 11. Range Query of the Third Dimension T > 01011001

The number of subsets is equal to the number of 0s in the bit string. {T > 01011001} consists of four subsets.

To get the nth subset, copy the bit string from the highest bit up to the nth 0, change that 0 to 1, and fill the remaining positions with the wildcard *. The first subset of {T > 01011001} is {“1*******”}, the second subset is {“011*****”}, and so on.

Generate a mask P-tree for each subset according to its bit pattern and position, as discussed in the previous example. For this example, PM1 = P31, PM2 = P’31 & P32 & P33, PM3 = P’31 & P32 & P’33 & P34 & P35 & P36 and PM4 = P’31 & P32 & P’33 & P34 & P35 & P’36 & P37.

OR all mask P-trees together to get a general mask P-tree. For this example, PMgl = PM1 || PM2 || PM3 || PM4.
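The subset-generation steps above can be sketched directly on bit strings. This is a hedged illustration: `greater_than_patterns`, `matches` and `greater_than_mask` are hypothetical names, and the mask is evaluated here by matching each value against the wildcard patterns one at a time, standing in for the ANDs and ORs of basic P-trees that the text describes.

```python
def greater_than_patterns(threshold: str):
    """Wildcard patterns whose union is {v : v > threshold}.

    One pattern per 0 in the threshold: keep the prefix before that 0,
    flip the 0 to 1, and pad the rest with '*'.
    """
    patterns = []
    for pos, bit in enumerate(threshold):
        if bit == "0":
            patterns.append(threshold[:pos] + "1" + "*" * (len(threshold) - pos - 1))
    return patterns


def matches(value: str, pattern: str) -> bool:
    """True if every non-wildcard position of pattern equals value."""
    return all(p in ("*", v) for v, p in zip(value, pattern))


def greater_than_mask(values, threshold):
    """1 where a value matches any pattern (the OR of the subset masks)."""
    patterns = greater_than_patterns(threshold)
    return [int(any(matches(v, p) for p in patterns)) for v in values]


print(greater_than_patterns("1001"))
print(greater_than_patterns("01011001"))

all_4bit = [format(i, "04b") for i in range(16)]
mask_gt_1001 = greater_than_mask(all_4bit, "1001")
```

The decomposition works because a value exceeds the threshold exactly when, at the first position where the two differ, the value has a 1 and the threshold has a 0.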

c. Combination of Equal Query and Range Query. It is often the case that the “where” clause has “>=”. For example, “select * from Peano Cube where T >= 01011001”. We can easily divide the set {“T >= 01011001”} into two subsets, {“T > 01011001”} and {“T = 01011001”}. Therefore the mask P-tree is PMge = PMgt || PMeq, where PMgt is the mask of the range query and PMeq is the mask of the equal query.

For a “where” clause like “01111011 > T > 01011001”, the set {“01111011 > T > 01011001”} is the intersection of the two sets {“01111011 > T”} and {“T > 01011001”}, so the two masks are ANDed: PMgl = PMgt & PMlt.

d. Complement of Range Query. We have just analyzed “where” clauses with “>” or “>=”, but a query may instead use “<” or “<=”. For example, “select * from Peano Cube where T <= 01011001”. Obviously, the set {“T <= 01011001”} is the complement of {“T > 01011001”}. With the result of the query “Select * from Peano Cube where T > 01011001”, we can easily answer the query “Select * from Peano Cube where T <= 01011001” by complementing the former, i.e., PMle = PM’gt.

The dice operation is similar to the slice operation, except that dice involves more than one dimension while slice is on one dimension. An example of dice is “select * from Peano Cube where T <= 01011001 AND Y > 1001”. We just need to AND the P-tree masks of the slices on dimensions T and Y to get the final P-tree mask for the dice.
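The mask combinations used in cases c and d and in the dice example can be checked with Python integers standing in for bit masks. Everything in this sketch is illustrative: the attribute values are made up, `mask` is a hypothetical helper that sets bit i when row i satisfies a predicate, and the individual masks are obtained by direct comparison rather than from P-trees.

```python
# Made-up attribute values for six rows.
T_VALUES = [0b01011001, 0b01011010, 0b00000001, 0b11110000, 0b01111011, 0b01100000]
Y_VALUES = [0b1001, 0b1010, 0b1111, 0b0000, 0b1100, 0b1000]


def mask(values, predicate):
    """Integer bit mask with bit i set when predicate(values[i]) holds."""
    m = 0
    for i, v in enumerate(values):
        if predicate(v):
            m |= 1 << i
    return m


ALL_ROWS = (1 << len(T_VALUES)) - 1
LOW, HIGH = 0b01011001, 0b01111011

pm_gt = mask(T_VALUES, lambda t: t > LOW)
pm_eq = mask(T_VALUES, lambda t: t == LOW)
pm_lt = mask(T_VALUES, lambda t: t < HIGH)

pm_ge = pm_gt | pm_eq          # T >= LOW: union of ">" and "="
pm_le = ALL_ROWS & ~pm_gt      # T <= LOW: complement of ">"
pm_between = pm_gt & pm_lt     # HIGH > T > LOW: intersection

pm_y_gt = mask(Y_VALUES, lambda y: y > 0b1001)
pm_dice = pm_le & pm_y_gt      # dice: AND of the two slice masks

print(format(pm_ge, "06b"), format(pm_le, "06b"), format(pm_between, "06b"), format(pm_dice, "06b"))
```

The complement is taken relative to `ALL_ROWS` because Python integers are unbounded, so a bare `~` would set bits beyond the six rows.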

3.2.2 Rollup

The rollup operation performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction (climbing all the way up to the top of a concept hierarchy for a dimension at which point the dimension goes away). For example, by climbing up from month to year we can get the amount of traffic by year. Or we can reduce the time dimension by summing up the total traffic along the time dimension.

Rollup can involve more than one dimension at a time. For example, we can rollup along dimension X and Y, and get the total traffic by time T. Rollup is always just a "group by" operation. For data cube by dimension X, Y and T, roll up along T means “group by” X and Y.

A traditional data cube is stored in the form of a multi-dimensional array. Its rollup operation is simply performed along a certain dimension of the array. However, our Peano Cube is stored in Peano order instead of raster order, which takes the most advantage of the continuity and sparseness of spatial data. The comparison of data ordering between the traditional data cube and the Peano Cube is shown in Figure 12. For example, a rollup along the back-pointing coordinate is D1 + D17 + D33 + D49 in the traditional data cube, while in the Peano Cube it is D1 + D5 + D33 + D37.
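The two cell sequences in this example follow from the two orderings of a 4×4×4 cube, as the short sketch below shows. It assumes 0-based coordinates, cell numbers starting at D1, and that the back-pointing coordinate is the third, slowest-varying one; `raster_number` and `peano_number` are hypothetical names.

```python
SIDE = 4  # cells per dimension


def raster_number(x, y, z):
    """1-based cell number in raster (row-major) order, x varying fastest."""
    return 1 + x + SIDE * y + SIDE * SIDE * z


def peano_number(x, y, z):
    """1-based cell number in Peano (Z) order: interleave the bits of x, y, z."""
    index = 0
    for level in range(2):  # two bits per coordinate for SIDE = 4
        for offset, coord in enumerate((x, y, z)):
            index |= ((coord >> level) & 1) << (3 * level + offset)
    return 1 + index


raster_cells = [raster_number(0, 0, z) for z in range(SIDE)]
peano_cells = [peano_number(0, 0, z) for z in range(SIDE)]
print(raster_cells, peano_cells)
```

In raster order the cells along one coordinate are a fixed stride apart, while in Peano order the stride depends on which bit of the coordinate changes.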

(a) (b)

Figure 12. Comparison of data ordering between Traditional Data Cube and Peano Cube

(a)The traditional data cube; (b) Peano Cube

Figure 13 is a 4-D Peano Cube and Figure 14 demonstrates its rollup process along 4 dimensions.


Figure 13. 4-D Peano Cube

Kd = 1:  1+2   3+4   5+6   7+8   9+10   11+12  13+14  15+16
Kd = 2:  1+3   2+4   5+7   6+8   9+11   10+12  13+15  14+16
Kd = 3:  1+5   2+6   3+7   4+8   9+13   10+14  11+15  12+16
Kd = 4:  1+9   2+10  3+11  4+12  5+13   6+14   7+15   8+16

Figure 14. Demonstration of Rollup of 4-dimension Peano Cube (Kd: dimension which rollup is along)

Based on the ordering and characteristics of the Peano Cube, we developed a rollup algorithm for it, shown in Figure 15. With this algorithm, we can compute aggregates for a Peano Cube of any dimensionality along any of its dimensions.

// d – number of dimensions of the data cube
// n – size of the data cube; kd – the dimension to roll up along
// R[] – the result of the rollup; start – the start index of R[]
// val – the value in the data cube at the cursor position
Algorithm rollup (d, n, kd, R[], start)
  Case:
    n = 1:
      R[start] += val
    n = 2^d:
      For i = start TO (start + n/2 - 1) DO
        Skip (i - start) % 2^(kd-1) + 2^kd * …
        R[i] += val
        Skip …
        R[i] += val
    n > 2^d:
      For i = 0 TO 2^(d-1) - 1 DO
        ssc = n / 2^d      // ssc: size_sub_quadrant
        ddm = …            // ddm: domain_of_dimension
        Skip ((i + start) % 2^(kd-1) + 2^kd * …) * ssc
        rollup (d, n/2^d, kd, R[], start + i*ssc/ddm);
        Skip ssc + …
        rollup (d, n/2^d, kd, R[], start + i*ssc/ddm);
  Endcase
End rollup

Figure 15. Rollup Algorithm for Peano Cube
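The pairings listed for Figure 14 can be reproduced with a short sketch. This is not the algorithm of Figure 15: it covers only the single-level case of a cube with 2^d cells, and the index formula is inferred from the Figure 14 pairs rather than taken from the paper. The names `rollup_pairs` and `rollup_once` are hypothetical.

```python
def rollup_pairs(d, kd):
    """Return the 1-based cell pairs that are summed when a Peano-ordered
    cube with 2**d cells is rolled up along dimension kd (1 <= kd <= d)."""
    half = 2 ** (kd - 1)   # distance between the two cells of a pair
    block = 2 ** kd        # cells covered before the pattern repeats
    pairs = []
    for i in range(2 ** (d - 1)):
        # Assumed formula, inferred from the pairs listed in Figure 14
        first = i % half + block * (i // half)
        pairs.append((first + 1, first + half + 1))
    return pairs


def rollup_once(cube, d, kd):
    """Sum each pair of cells; cube is a list in Peano order."""
    return [cube[a - 1] + cube[b - 1] for a, b in rollup_pairs(d, kd)]


for kd in range(1, 5):
    print("Kd =", kd, rollup_pairs(4, kd))
```

For d = 4 the printed pairs match the four rows of Figure 14, for example (1, 3), (2, 4), (5, 7), (6, 8) at the start of the Kd = 2 row.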

The rollup process discussed so far applies to a single Peano Cube. As described in Section 3.1, a Peano Cube is a bitwise data cube: for multidimensional data with n attributes there are n x m Peano Cubes, where m is the number of bits in each attribute. The attribute “Yield” in Figure 6 is composed of three Peano Cubes (see Figure 16).

Figure 16. Peano Cubes for Attribute “Yield”

Suppose we want to aggregate “Yield” along dimension T (pointing down). First we roll up each Peano Cube using the algorithm in Figure 15, obtaining the results S2[ ], S1[ ] and S0[ ] shown in Figure 17. From these per-bit results we obtain the overall rollup result S as follows:

S[i] = S2[i] x 2^2 + S1[i] x 2^1 + S0[i] x 2^0 (i = 1, 2, 3, 4)

For i = 1, S[1] = 1 x 2^2 + 1 x 2^1 + 2 x 2^0 = 8. Similarly, we get S[2] = 7, S[3] = 7 and S[4] = 3.

S2[ ] = {1, 1, 1, 0}
S1[ ] = {1, 1, 1, 1}
S0[ ] = {2, 1, 1, 1}
S[ ] = {8, 7, 7, 3}
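The weighted sum of per-bit rollups can be sketched as follows. The function name `combine_bit_rollups` is hypothetical, and the sample arrays are illustrative values chosen to agree with the worked S[1] computation above (S2 = 1, S1 = 1, S0 = 2 for the first position).

```python
def combine_bit_rollups(bit_rollups):
    """Combine per-bit rollup results into rollups of the full attribute.

    bit_rollups[j] is the rollup of the Peano Cube for bit j, where
    bit 0 is the least significant bit, so it is weighted by 2**j.
    """
    length = len(bit_rollups[0])
    return [
        sum(slice_sums[i] << j for j, slice_sums in enumerate(bit_rollups))
        for i in range(length)
    ]


s0 = [2, 1, 1, 1]   # rollup of the bit-0 cube (illustrative values)
s1 = [1, 1, 1, 1]   # rollup of the bit-1 cube
s2 = [1, 1, 1, 0]   # rollup of the bit-2 cube
print(combine_bit_rollups([s0, s1, s2]))
```

Because each bit cube is rolled up independently, the per-bit sums can exceed 1 (as S0 does here); the shift by j restores each bit's place value before the sums are added.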


Figure 17. Rollup of “Yield” along Dimension T

3.2.2.1 Nearest Neighbor Query

The nearest neighbor query is very important for spatial data warehouses, since it answers questions such as “find all galaxies brighter than magnitude 22” or “find other objects like this one”. The top-k neighborhood query is a generalization of the nearest neighbor query.

Given spatial data R(A1, A2, ..., An), where each Ai represents a spatial attribute or feature of the spatial points, the nearest neighbor query finds the point nearest to a query point x, for example, “select the points that are the nearest neighbors of x”. In the Peano Cube, a top-k neighborhood query is computed by ANDing P-trees over a HOBBit ring R(x, 0, r). Figure 18 shows the nearest neighbors of x within the HOBBit ring R(x, 0, r); here x has 4 nearest neighbors.

Figure 18. The Nearest Neighbors of X

The attribute-P-trees for x within the HOBBit ring R(x, 0, r) are then defined by

Pvi,r = Pxi,1 & Pxi,2 & … & Pxi,r (i = 1, 2, …, n)

The tuple-P-tree for x within the HOBBit ring R(x, 0, r) is then defined by

Px,r = Pv1,r & Pv2,r & … & Pvn,r

The root count of Px,r is the number of nearest neighbors, and the Qids with value 1 in Px,r indicate their positions. Searching for nearest neighbors with a HOBBit ring is very fast, especially for spatial data.
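As a rough illustration of the ANDing step, the sketch below uses plain Python integers as uncompressed bit vectors in place of P-trees. It assumes that Pxi,j marks the points whose j-th high-order bit of attribute i equals the corresponding bit of x, which is only one reading of the definitions above. The helper names `match_mask` and `hobbit_ring`, the sample data, and the fixed bit width are all hypothetical.

```python
def match_mask(values, x, bit, width):
    """Bit vector with bit q set when values[q] agrees with x at the
    bit-th high-order position (bit = 1 is the most significant bit)."""
    shift = width - bit
    mask = 0
    for q, v in enumerate(values):
        if ((v >> shift) & 1) == ((x >> shift) & 1):
            mask |= 1 << q
    return mask


def hobbit_ring(columns, query, r, width):
    """AND the match masks of the r high-order bits of every attribute."""
    n_points = len(columns[0])
    result = (1 << n_points) - 1          # start with every point selected
    for values, x in zip(columns, query):
        for bit in range(1, r + 1):
            result &= match_mask(values, x, bit, width)
    return result


columns = [[5, 4, 1, 7], [2, 3, 6, 2]]    # two 3-bit attributes, four points
mask = hobbit_ring(columns, query=(5, 2), r=2, width=3)
neighbors = [q for q in range(4) if (mask >> q) & 1]
print(bin(mask).count("1"), neighbors)    # count of set bits, then positions
```

The count of set bits plays the role of the root count, and the set positions play the role of the Qids with value 1. A real P-tree would perform the same AND on compressed quadrant trees rather than on flat bit vectors.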

4. Performance Analysis

We have implemented the Peano Cube and conducted a series of experiments to validate our performance analysis of the algorithm. The experiments were implemented in C++ on a 1GHz Pentium PC with 1GB of main memory, running Windows XP. The test data includes an aerial TIFF image (with Red, Green and Blue band reflectance values) and moisture and nitrate maps of the Oakes Irrigation Test Area in North Dakota. The data is prepared in five sizes: 128x128, 256x256, 512x512, 1024x1024 and 2048x2048. The data sets are available at [33].


In this experiment, we compare our Peano Cube with two other frequently used techniques: the simple query evaluation technique (SQ) and the bitmap-based query evaluation algorithm (BBQ). SQ implements OLAP operations by scanning the entire data set each time, whereas BBQ uses a bitmap index structure to carry them out. We now present the experimental results for the OLAP operations described in Section 3. The response time is the average CPU time over 70 runs. The goal is to validate our claim that the Peano Cube scales well with data size.

1) Experiments of Equality Select Slice

In this set of experiments, we evaluate the query: select * from DataCube where yield = 200. The experimental results are shown in Figure 19.

Figure 19. Response Time Comparison for Equality Select Slice
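To make the equality query concrete, the following sketch shows one way a bit-sliced layout could answer yield = 200 with logical operations alone. It is not the SQ, BBQ, or Peano Cube implementation measured in the experiments: uncompressed Python integers stand in for bitmaps, and the function names and sample values are hypothetical.

```python
def build_bit_slices(values, width):
    """slices[j] has bit q set when row q has bit j of its value set."""
    return [
        sum(((v >> j) & 1) << q for q, v in enumerate(values))
        for j in range(width)
    ]


def equality_select(bit_slices, target, n_rows):
    """Return the row positions whose value equals target.

    For each bit of target, AND with the slice if the bit is 1 and
    with the complement of the slice if the bit is 0.
    """
    all_rows = (1 << n_rows) - 1
    result = all_rows
    for j, slice_bits in enumerate(bit_slices):
        if (target >> j) & 1:
            result &= slice_bits
        else:
            result &= all_rows & ~slice_bits
    return [q for q in range(n_rows) if (result >> q) & 1]


yields = [200, 13, 200, 255, 72]          # hypothetical 8-bit yield values
slices = build_bit_slices(yields, 8)
print(equality_select(slices, 200, len(yields)))
```

A full scan would instead compare every row's value with 200; the bit-sliced version touches each slice once regardless of how many rows match.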

5. Conclusion

In this paper, we present a general data warehousing architecture, the Peano Cube, to facilitate OLAP operations such as slice/dice, rollup, range queries, and nearest neighbor queries. The fast logical operations of Peano Trees (P-trees) are used to accomplish these queries. A natural distance metric for P-trees, the HOBBit metric, is employed to achieve efficient nearest neighbor queries. Predicate P-trees are used to find the “bit holes” of 0 quadrants by performing logical operations. Only the mixed chunks need to be loaded, which reduces the amount of data scanned. For spatial data, the mixed chunks make up only a small portion of the data because of the high compression rate of the Peano Cube. Experiments indicate that OLAP operations using the Peano Cube are much faster than those using a traditional data cube.

One direction for future research is to extend the Peano Cube to a parallel data warehouse system. Partitioning a large cube horizontally or vertically into smaller cubes appears particularly promising, since it can improve query performance through parallelism. In addition, our method has potential applications in other areas, such as DNA microarray and medical image analysis.

REFERENCES

1. E.F. Codd, S.B. Codd, and C.T. Salley, Providing OLAP (on-line analytical processing) to user-analysts: An IT mandate. Technical report, 1997.

2. A. Shoshani, OLAP and Statistical Databases: Similarities and Differences, in Proc. ACM PODS '97, 185-196.

3. Panos Vassiliadis, Timos Sellis. A Survey on Logical Models for OLAP Databases Technical Report DWQ—NTUA—701, accepted for publication at SIGMOD RECORD.

4. Torsten Priebe, Günther Pernul. Towards OLAP Security Design - Survey and Research Issues. In Proc. DOLAP'00.

5. R. Agrawal, A. Gupta, S. Sarawagi, Modeling Multidimensional Databases, Research Report, IBM Almaden Research Center, San Jose, California, 1995. Appeared in Proc. ICDE '97.

6. C. Li and X.S. Wang, A data model for supporting on-line analytical processing, in Proc. Conf. on Information and Knowledge Management, November 1996, 81-88.

7. M. Gyssens and L.V.S. Lakshmanan, A foundation for multi-dimensional databases, Proc. VLDB '97, 106-115.

8. L. Cabibbo, R. Torlone, Querying Multidimensional Databases, Proc. 6th DBPL Workshop, 1997, 257-269.


9. M. Golfarelli, D. Maio, S. Rizzi, Conceptual design of data warehouses from E/R schemes, Proc. 31st Hawaii Intl. Conf. on System Sciences, 1998.

10. W. Lehner, Modeling Large Scale OLAP Scenarios, 6th International Conference on Extending Database Technology (EDBT'98), Valencia, Spain, March 23-27, 1998.

11. W. Lehner, J. Albrecht, H. Wedekind, Multidimensional Normal Forms, 10th International Conference on Scientific and Statistical Data Management (SSDBM'98), Capri, Italy, July 1-3, 1998.

12. Torben Bach Pedersen and Christian S. Jensen, Multidimensional Data Modeling for Complex Data. In Proc. ICDE '99.

13. Karl Hahn, Carsten Sapia, Markus Blaschka. Automatically Generating OLAP Schemata from Conceptual Graphical Models. In Proc. DOLAP'00.

14. Andreas Bauer, Wolfgang Hümmer, Wolfgang Lehner. An Alternative Relational OLAP Modeling Approach . In Proc. DAWAK'00.

15. Tapio Niemi, Jyrki Nummenmaa, Peter Thanisch. Functional Dependencies in Controlling Sparsity of OLAP Cubes. In Proc. DAWAK'00.

16. Matteo Golfarelli, Dario Maio, Stefano Rizzi. Applying Vertical Fragmentation Techniques in Logical Design of Multidimensional Databases. In Proc. DAWAK'00.

17. Carlos Hurtado, Alberto Mendelzon. Reasoning about Summarizability in Heterogeneous Multidimensional Schemas. In Proc. ICDT'01.

18. Mauricio Minuto, Alejandro Vaisman. Efficient Intensional Redefinition of Aggregation Hierarchies in Multidimensional Databases. To be presented at DOLAP 2001, Atlanta, Ga., November 2001.

19. Y. Kotidis. A Data Warehousing Architecture for Enabling Service Provisioning Process. In Proc. VLDB'01.

20. Lin K., Jagadish H. V., Faloutsos C.: ‘The TV-Tree: An Index Structure for High-Dimensional Data’, VLDB Journal, Vol. 3, pp. 517-542, 1995.

21. Berchtold S., Keim D., Kriegel H.-P.: ‘The X-Tree: An Index Structure for High-Dimensional Data’, Proc. 22nd Int. Conf. on Very Large Databases (VLDB), 1996, pp. 28-39.

22. Katayama N., Satoh S.: ‘The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries’, Proc. ACM SIGMOD Int. Conf. on Management of Data, 1997.

23. J. Gray, A. Bosworth, A. Layman, H. Pirahesh. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-tab, and Sub-Totals. Data Mining and Knowledge Discovery 1, 29-53, 1997.

24. Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD’97), pages 159-170, Tucson, AZ, May 1997.

25. P. O’Neil, G. Graefe, Multi-Table Joins Through Bitmapped Join Indices, SIGMOD Record 24(3), September 1995.

26. P. Valduriez, Join Indices, ACM TODS, 12(2), June 1987.

27. M.C. Wu, A. Buchmann, "Encoded Bitmap Indexing for Data Warehouses." Proceedings of the 14th International Conference on Data Engineering, Orlando, Florida, USA, Feb. 1998, pp. 220-230.

28. Sihem Amer-Yahia, Theodore Johnson: Optimizing Queries on Compressed Bitmaps. VLDB 2000: 329-338.

29. Codd E., Codd S., Salley C. Providing OLAP (On-Line Analytical Processing) to User-Analysts: An IT Mandate. Technical Report, 1997, available at http://www.arborsoft.com/essbase/wht_ppr/coddps.zip.

30. William Perrizo, Peano Count Tree Technology, Technical Report NDSU-CSOR-TR-01-1, 2001.

31. Maleq Khan, Qin Ding, William Perrizo, k-Nearest Neighbor Classification on Spatial Data Streams Using P-Trees, PAKDD 2002, Springer-Verlag, LNAI 2336, 2002, pp. 517-528.

32. C++ Boost, http://www.boost.org/.

33. TIFF image data sets. Available at http://midas-10cs.ndsu.nodak.edu/data/images/.