
Page 1: A System for Fast Low-Cost Access to Very Large Catalogs

A System for Fast Low-Cost Access to Very Large Catalogs

Michael Riggs, Vashundara Puttagunta, Maitreyee Pasad, Konstantinos Kalpakis, and Jeanne Behnke

CSEE, UMBC and NASA GSFC

URL: http://www.csee.umbc.edu/~kalpakis/monet

Page 2: A System for Fast Low-Cost Access to Very Large Catalogs

Introduction

• Motivation

– Many catalogs in astronomy, a few old ones and most of the new ones, contain data for a very large number of spatial objects

– Fast methods to access these catalogs are needed

– Budget constraints limit the $$ allocated to S/W and H/W to support fast access to the catalogs

– Thus, low-cost solutions that achieve efficient and effective access to very large catalogs are needed

• We demonstrate

– how to achieve fast access to catalog data on a low-cost desktop for a very large catalog using an Object-Relational Database Management System (ORDBMS)

– how to enhance the functionality of the system by extending the ORDBMS with IDL

Page 3: A System for Fast Low-Cost Access to Very Large Catalogs

Test Catalog Used

• We use the USNO-A2.0 catalog (called the Monet catalog here), which has approximately 500 million stars (points)

Image of the Monet catalog found at: http://www.nofs.navy.mil/projects/pmm/universe.html

Page 4: A System for Fast Low-Cost Access to Very Large Catalogs

Schemas Used

• We experiment with a sample from the Monet catalog and the following four approaches (schemas/datablades/indexes)

– Relational only

• Traditional relational datatypes with clustered B-tree indexes on (dec, RA) and (RA, dec) (B schemas)

– Object-Relational

• Datatypes from the Geodetic datablade (created for the EOSDIS project for geo-location) with an R-tree index on geo-location (G schemas)

• Datatypes from the Shapes2 Informix datablade with an R-tree index on points (S schemas)

• Datatypes from our custom-built SimpleShape datablade with an R-tree index on points, built bottom-up using Hilbert coding of the points' coordinates (I schemas)

Page 5: A System for Fast Low-Cost Access to Very Large Catalogs


Types of Queries Tested

• Spatial Window (Online catalog access)

Find all stars within a user-defined bounding rectangle

• Spatial Or-Window (Online catalog access)

Find all the stars within at least one of two different bounding rectangles

• Spatial Self-Join (Catalog mining)

For each star, find all stars within a specified box centered on that star

• Spatial Multi-Join (Catalog Correlations)

– Spatial chain-join

Find a spatial region in a “chain-joined” sequence of tables (catalogs)

– Spatial Star-join

Find a spatial region in a “star-joined” sequence of tables (catalogs)

Page 6: A System for Fast Low-Cost Access to Very Large Catalogs


Our Custom Datablade

• The custom-made SimpleShape datablade extends the Informix Dynamic Server 2000 by providing (a C sketch of the two shape types appears after this list)

– a fixed length opaque type (UDT) storing 4 small floats (16 bytes total)

– full R-tree index support including new bottom-up index building

– full B-tree index support based on 2 dimensional Hilbert Curves

– the SimplePoint UDT holds the coordinates of a 2 dimensional point

– the SimpleBox UDT holds the lower left and upper right SimplePoints of a rectangle to make a box

– input/output functions and constructors
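For concreteness, here is a minimal C sketch of what the two fixed-length opaque types could look like. The slide specifies only a 16-byte UDT holding 4 small floats; the field names and ordering below are illustrative assumptions, not the datablade's actual definition.

    /* Illustrative layout of the SimpleShape opaque types. */
    typedef struct {
        float x;        /* first coordinate (e.g., dec) */
        float y;        /* second coordinate (e.g., ra) */
    } SimplePoint;      /* 8 bytes */

    typedef struct {
        SimplePoint ll; /* lower-left corner */
        SimplePoint ur; /* upper-right corner */
    } SimpleBox;        /* 4 floats = 16 bytes, the fixed-length UDT */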

Page 7: A System for Fast Low-Cost Access to Very Large Catalogs


Format of the USNO-A2.0

• Each star in the catalog comes packaged as three integers.

– (int RA, int DEC, int MAG)

– RA = (RA in decimal hours) * 15 * 3600 * 100

– DEC = ((DEC in decimal degrees) + 90) * 3600 * 100

– MAG = combination of field, blue, red values plus flags

• The catalog data are stored in a database table with schema (coordinates SimplePoint, magnitude integer)

– the actual data in the table are stored in their original (compressed) format as above

– user-defined functions (UDFs) convert the coords and magnitude into meaningful fields (see the conversion sketch below)
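The packing above is easy to invert. Here is a minimal C sketch of the conversions that UDFs such as Ra() and Dec() (next slide) would perform; it follows directly from the formulas on this slide.

    #include <stdio.h>

    /* RA is stored as (decimal hours * 15 * 3600 * 100), i.e., in 0.01-arcsec units. */
    double ra_hours(int ra_packed) {
        return ra_packed / (15.0 * 3600.0 * 100.0);
    }

    /* DEC is stored as ((decimal degrees + 90) * 3600 * 100). */
    double dec_degrees(int dec_packed) {
        return dec_packed / (3600.0 * 100.0) - 90.0;
    }

    int main(void) {
        int ra = 13456789, dec = 32456789;   /* example packed values */
        printf("RA = %.6f h, DEC = %.6f deg\n", ra_hours(ra), dec_degrees(dec));
        return 0;
    }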

Page 8: A System for Fast Low-Cost Access to Very Large Catalogs


Specializing the SimpleShape to the USNO-A2.0 Catalog

• the RA and DEC fields are stored as a point (dec, ra) of type SimplePoint

• the data format is not changed from the packaged Monet data

• data conversion takes place in the DBMS when displaying results

• the MAG field is stored as a single integer field in DBMS directly

• additional utility UDFs are created

– Ra() takes a SimplePoint and returns RA in decimal hours

– Dec() takes a SimplePoint and returns DEC in decimal degrees

– MonetBox() constructs a SimpleBox from RA and DEC in decimal hours and decimal degrees, respectively

– Blue() takes MAG and returns the blue magnitude

– Red() takes MAG and returns the red magnitude

– Field() takes MAG and returns the field of the star (a plausible decoding sketch follows)
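The slides do not spell out the MAG packing. The C sketch below assumes the decimal packing described in the published USNO-A2.0 readme (field, blue magnitude x10, and red magnitude x10 in successive decimal digit groups, with a flag carried in the sign); treat it as one plausible decoding, not the datablade's exact code.

    #include <stdlib.h>

    /* Assumed packing: |MAG| = FFFBBBRRR in decimal digits, where
       FFF = field, BBB = 10 * blue mag, RRR = 10 * red mag. */
    int    field_of(int mag) { return (abs(mag) / 1000000) % 1000; }
    double blue_of(int mag)  { return ((abs(mag) / 1000) % 1000) / 10.0; }
    double red_of(int mag)   { return (abs(mag) % 1000) / 10.0; }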

Page 9: A System for Fast Low-Cost Access to Very Large Catalogs


Custom Data Loader

• Custom High Performance Loader

– Built from the Virtual Table Interface (VTI) of Informix Server

– Maps a binary OS file into a database table

– Only SELECT queries with no WHERE clause can be issued on the table

• Reasons for development

– Lack of Informix High Performance Loader for PC platform

– Bug in regular table load command for PC platform

– USNO-A2.0 is distributed in binary format

– the absence of string conversions makes loading fast (a sketch of scanning the binary format follows)
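As a rough illustration of what the loader has to scan, here is a minimal C sketch that reads one zone file record by record (three 4-byte integers per star, per the format slide). The big-endian byte order is an assumption taken from the catalog's distribution notes, and zone0000.cat is just an example file name.

    #include <stdio.h>
    #include <stdint.h>

    /* Assemble a 4-byte big-endian integer. */
    static uint32_t be32(const unsigned char b[4]) {
        return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16)
             | ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
    }

    int main(void) {
        FILE *f = fopen("zone0000.cat", "rb");  /* example zone file */
        unsigned char rec[12];                  /* RA, DEC, MAG: 3 x 4 bytes */
        if (!f) return 1;
        while (fread(rec, 1, sizeof rec, f) == sizeof rec) {
            uint32_t ra  = be32(rec);
            uint32_t dec = be32(rec + 4);
            uint32_t mag = be32(rec + 8);
            /* ... hand (dec, ra) and mag to the VTI table ... */
            (void)ra; (void)dec; (void)mag;
        }
        fclose(f);
        return 0;
    }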

Page 10: A System for Fast Low-Cost Access to Very Large Catalogs


Indexing for the Monet Data

• R-tree index on coordinates

– built using a new bottom-up algorithm and the 2nd-order Hilbert values of the points

– allows fast window queries on entire catalog

– faster build times than B-tree index

– takes about 1 day to index the entire catalog

• Second order Hilbert values

– space filling curve converts two dimensions into one dimension

– points are sorted based on their Hilbert value and stored adjacently in R-tree leaf pages

– the Hilbert-value code and the bottom-up R-tree index-building algorithm are built into the SimpleShape software (a Hilbert-coding sketch follows)
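To illustrate the Hilbert coding idea, here is the classic xy2d algorithm in C, which maps a 2-D grid point to its Hilbert-curve index; n is the grid side length (a power of two) and log2(n) is the curve order. The SimpleShape datablade's own code is internal, so this is a sketch of the technique rather than the implementation used to order the R-tree leaf pages.

    /* Rotate/flip a quadrant so the curve orientation is correct. */
    static void rot(int n, int *x, int *y, int rx, int ry) {
        if (ry == 0) {
            if (rx == 1) { *x = n - 1 - *x; *y = n - 1 - *y; }
            int t = *x; *x = *y; *y = t;   /* swap x and y */
        }
    }

    /* Hilbert index of grid cell (x, y) on an n x n grid. */
    long xy2d(int n, int x, int y) {
        long d = 0;
        for (int s = n / 2; s > 0; s /= 2) {
            int rx = (x & s) > 0;
            int ry = (y & s) > 0;
            d += (long)s * s * ((3 * rx) ^ ry);
            rot(n, &x, &y, rx, ry);
        }
        return d;
    }

Points sorted by xy2d land near their spatial neighbors, which is what lets the bottom-up build fill adjacent R-tree leaf pages with nearby points.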

Page 11: A System for Fast Low-Cost Access to Very Large Catalogs


Sample SQL Statements for the SimpleShape Datablade

• Sample SQL statements using the SimpleShape datablade

– create table test (product_id int, location SimplePoint)

– insert into test values (1, SimplePoint(2.333, 5.77))

– insert into test values (2, 'point(4.123, 8.234)')

– create table products (product_id int, location_x int, location_y int)

– insert into test select product_id, SimplePoint(location_x, location_y) from products

– select * from test where within (location, 'box(0, 0, 100, 100)')

• Sample SQL statements for the USNO-A2.0 specialization

– create table monet_data(coords SimplePoint, mag int)

– select Ra(coords), Dec(coords), Red(mag), Blue(mag), Field(mag) from monet_data where within (coords, monetbox (-3.22, 22.45, 1.23, 22.78) )

– select (Blue(mag) - Red(mag)) as energy, count(*) as count from monet_data where within (coords, monetbox (-3.22, 22.45, 1.23, 22.78) ) group by energy order by count desc

Page 12: A System for Fast Low-Cost Access to Very Large Catalogs


More Sample SQL Statements

• Custom High Performance Loader

– create table load_data(value1 int, value2 int) using vti_load(file='/tmp/data.bin')

– select * from load_data

– insert into real_table select * from load_data

• Loading Monet Data

– create table monet_all (coords SimplePoint, mag int)

– create table monet_load(ra int, dec int, mag int) using vti_load(file='/tmp/zone0000.cat')

– insert into monet_all select SimplePoint(dec, ra), mag from monet_load

• Indexing Monet Data

– create index monet_all_idx on monet_all (coords SIMPLE_OPS) using rtree

• Selecting Monet Data

– select * from monet_all where within (coords, MonetBox(dec, ra, dec, ra))

Page 13: A System for Fast Low-Cost Access to Very Large Catalogs


System Configuration

• The experiments with our custom datablade were done on a system with the following configuration

– Hardware (costing less than $2K)

• AMD Duron 650MHz Processor

• FIC model AZ11 motherboard w/IDE ATA-66 controller

• 256MB of PC100 RAM

• Two IBM Ultra/100 7200RPM IDE disks model DTLA307045

– Software

• Linux Mandrake version 7.0

• Updated Linux Kernel version 2.4.0-test4

• Informix Dynamic Server 2000 version 9.20.UC1

• Custom datablade and data loading software

Page 14: A System for Fast Low-Cost Access to Very Large Catalogs

System Configuration

• The experiments with the other datablades and the traditional relational solution were done on a system with the following configuration

– Hardware

• Sun UltraSparc 60 with 512 MB memory

• two 8GB and two 4GB SCSI disks

– Software

• Solaris 2.6

• Informix Dynamic Server 9.14 with the Universal DataServer Option

• Geodetic and Shapes2 datablades

Page 15: A System for Fast Low-Cost Access to Very Large Catalogs


Experimental Results

Elapsed time to load data from the Monet catalog:

Schema        100K stars     500M stars
Relational    45.829 secs    65 hrs
Geodetic      233.618 secs   334 hrs
Shapes2       161.714 secs   226 hrs
SimpleShape   27.216 secs    32 hrs

Elapsed time in creating R-tree indexes for the three datablades:

Schema        100K stars   500M stars
Geodetic      448 secs     607 hrs
Shapes2       256 secs     371 hrs
SimpleShape   8 secs       18 hrs

Page 16: A System for Fast Low-Cost Access to Very Large Catalogs


Experimental Results

• Loading and Indexing all the 500M stars

Schema        Load Time   Index Time (B-tree / R-tree)   Size    Index Size (B-tree / R-tree)
Relational    3           6 / -                          13GB    56GB / -
Geodetic      14          90 / 25                        190GB   66GB / 104GB
Shapes2       10          57 / 15                        140GB   86GB / 77GB
SimpleShape   1           - / 1                          15GB    14GB / 20GB

All times are in days.

The numbers for the Relational, Geodetic, and Shapes2 schemas are estimates based on regression from measurements for up to 140K stars.

• Beware: Results are not for the faint of heart!

Page 17: A System for Fast Low-Cost Access to Very Large Catalogs


Experimental Results

[Figure: Loading times for the Monet catalog using the SimpleShape datablade]

Page 18: A System for Fast Low-Cost Access to Very Large Catalogs


Experimental Results

[Figures: Elapsed time and page reads while building an R-tree index for the Monet catalog using the SimpleShape datablade]

Page 19: A System for Fast Low-Cost Access to Very Large Catalogs


Experimental Results

• Number of page reads for various queries and schemas for 60K stars

• The query window(s) used are 0.0666x0.0666 decimal degrees

Query                  Relational   Geodetic   Shapes2   SimpleShape
Spatial Self-Join      4440         1005023    60443     6775
Spatial Or-Window      22           35         44        14
2-Chain Spatial Join   23           49         87        19
3-Star Spatial Join    24           50         43        12

Page 20: A System for Fast Low-Cost Access to Very Large Catalogs


Experimental Results

• Performance of the custom-datablade (schema) for queries against a table with 60K stars

Query                  Elapsed time (secs)   Page reads
Spatial Window         0.071                 38.9
Spatial Self-Join      3072.000              6775.0
Spatial OR Window      0.163                 13.6
2-Chain Spatial Join   2.377                 18.5
3-Chain Spatial Join   9.492                 26.1
4-Chain Spatial Join   44.396                27.6
3-Star Spatial Join    9.532                 11.9
4-Star Spatial Join    44.370                24.1


Page 23: A System for Fast Low-Cost Access to Very Large Catalogs


IDL Extension for the Informix Server

• Powerful data-mining applications can be built by providing an interface between an object-relational DBMS (Informix) and a statistical/scientific-computing package (IDL).

• The IDL extension to Informix allows users to tap the data-analysis and visualization capabilities of IDL through Informix.

• Implementation is based on UDFs and the RPC mechanism.

[Diagram: the Informix DBMS (efficient storage and retrieval, indexing and clustering of data, concurrency control, security) communicates via RPC calls with an IDL server that hosts IDL user functions (data-analysis routines, visualization capabilities)]

Page 24: A System for Fast Low-Cost Access to Very Large Catalogs


• The IDL interface provides two main functions

• idl_exec is a simple user-defined function that executes an IDL script for each row of a table

• idl_agg is a user-defined aggregate function

• Both functions are used together with casting functions that cast the returned pointer to the appropriate data type

Data type returned from IDL   Last IDL statement in script   Data type returned by function   Cast function
Integer                       $RETURN                        pointer to int                   toInt
Double                        $RETURN                        pointer to float                 toFloat
Binary arr                    $BLOB                          LO_Handle                        toBlob
structure                     $RETURN                        pointer to ROW of lvarchars      toRow
2D Array                      $RETURN                        pointer to collection of rows    toList

Computing distance from the origin using idl_exec:

select toFloat( idl_exec( ROW(x, y),
    "r = double( sqrt( ($1)^2 + ($2)^2 ) ); $RETURN r" ) )
from Points;

In the IDL script, $1 and $2 refer to the columns of the table (here x and y) that are passed to idl_exec as arguments. $RETURN is used to return a double; since a double is a float in Informix, toFloat is used as the casting function.

Page 25: A System for Fast Low-Cost Access to Very Large Catalogs


Clustering using idl_agg: we use the CLUST_WTS function in IDL, which computes the weights (the cluster centers) of an m x n array and returns an array of cluster centers. idl_agg, in this case, returns a collection of rows that can be made to look like a virtual table.

select toList( idl_agg( ROW(ra, dec),
    ROW( "count=-1",
         "count=count+1 & if count eq 0 then y=[$1,$2] else y=[[y],[$1,$2]]",
         "w = CLUST_WTS(y, N_CLUSTERS=3); $RETURN w" ) ) )
    :: list(ROW(a float, b float) not null)
from monet_temp;

The three strings in the second ROW argument are IDL scripts: the initialization script, executed once for each group; the iteration script, executed once for every row in the group (here it appends the row's values to the matrix); and the final script, executed once at the end of each group (here it clusters the data using IDL routines).

$RETURN is used to return a collection of rows of two doubles each. Since the return variable is a 2D array, toList casts the pointer to a collection of rows that idl_agg returns into the required list.

Page 26: A System for Fast Low-Cost Access to Very Large Catalogs


Getting a plot using idl_agg: we use an IDL UDF, MONET_JPEG, that creates a plot with the first two arguments (ra and dec) as the axes and the next two (red and blue) determining the color of each point. MONET_JPEG returns the JPEG of the plot in binary. The binary array is returned as a blob by idl_agg.

select toBlob( idl_agg( ROW(ra, dec, blue, red),
    ROW( 'flag=0 & i=long(-1) & temp=dblarr(4,1000)',
         'i=i+1 & if i ge 1000 then begin if flag eq 0 then begin a=temp & flag=1 & end else a=[[a],[temp]] & i=0 & end & temp(*,i)=[$1,$2,$3,$4]',
         'if i gt 0 then begin if flag eq 0 then a=temp(*,0:i) else a=[[a],[temp(*,0:i)]] & end & b=MONET_JPEG(a(1,*),a(0,*),a(2,*),a(3,*)); $BLOB b' ) ) )
from monet_all;

The three strings are again the initialization, iteration, and final scripts; $BLOB is used to return a blob that contains the JPEG image. Since the script ends with $BLOB, idl_agg returns an LO_Handle to the blob, hence toBlob is used for casting.