Temple University – CIS Dept. CIS616 – Principles of Data Management

V. Megalooikonomou

Spatial Access Methods (SAMs)

(based on notes by Silberschatz, Korth, and Sudarshan and notes by C. Faloutsos at CMU)

General Overview

Multimedia Indexing
Spatial Access Methods (SAMs)
• k-d trees
• point quadtrees
• MX-quadtrees
• z-ordering
• R-trees

SAMs - Detailed outline

spatial access methods
• problem definition
• k-d trees
• point quadtrees
• MX-quadtrees
• z-ordering
• R-trees

Spatial Access Methods - problem

Given a collection of geometric objects (points, lines, polygons, ...), organize them on disk to answer spatial queries:
• point queries
• range queries
• k-nn queries
• spatial joins (‘all pairs’ within ε)

SAMs - motivation

Q: applications?

traditional DB; GIS

[Figure: plot with an axis labeled ‘salary’]

CAD/CAM: find elements too close to each other

[Figure: plot with an axis running from day 1 to day 365]

[Figure: plot with axes labeled ‘e.g., avg’ and ‘e.g., std’]

SAMs: solutions

• k-d trees
• point quadtrees
• MX-quadtrees
• z-ordering
• R-trees
• (grid files)

Q: how would you organize, e.g., n-dim points, on disk? (C points per disk page)

k-d trees

• Used to store k-dimensional point data
• Not used to store region data
• A 2-d tree (i.e., for k=2) stores 2-dimensional point data, while a 3-d tree stores 3-dimensional point data, etc.

2-d trees – node structure

Binary trees. Each node has the fields:
• Info: information field
• Xval, Yval: coordinates of the point associated with the node
• Llink, Rlink: pointers to children

Properties (N: node):
• If the level of N is even: for all nodes M in the subtree rooted at N.Llink, M.Xval < N.Xval; for all nodes P in the subtree rooted at N.Rlink, P.Xval >= N.Xval
• If the level of N is odd: similarly, using Yvals

2-d trees – Example

2-d trees: Insertion/Search

To insert a node N into the tree pointed to by T:
• If N and T agree on Xval and Yval, then overwrite T
• Else, branch left if N.Xval < T.Xval, right otherwise (even levels)
• Similarly for odd levels (branching on Yvals)

2-d trees – Example of Insertion

City (Xval, Yval)

Banja Luka (19, 45)

Derventa (40, 50)

Toslic (38, 38)

Tuzla (54, 35)

Sinj (4, 4)

[Figures: splitting of the region by Banja Luka, by Derventa, by Toslic, and by Sinj]
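The insertion rule and the city table above can be traced with a short sketch. This is a minimal illustration, not the course's reference code: the field names follow the slides (Info, Xval, Yval, Llink, Rlink), while the function names `insert` and `search` and the choice of level 0 for the root are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    info: str                          # Info field (here: a city name)
    xval: int
    yval: int
    llink: Optional["Node"] = None
    rlink: Optional["Node"] = None


def goes_left(t: Node, x: int, y: int, level: int) -> bool:
    # Even levels compare Xval, odd levels compare Yval.
    return x < t.xval if level % 2 == 0 else y < t.yval


def insert(t: Optional[Node], info: str, x: int, y: int, level: int = 0) -> Node:
    """Insert (x, y) into the subtree rooted at t and return the subtree root."""
    if t is None:
        return Node(info, x, y)
    if (t.xval, t.yval) == (x, y):
        t.info = info                  # same point: overwrite
        return t
    if goes_left(t, x, y, level):
        t.llink = insert(t.llink, info, x, y, level + 1)
    else:
        t.rlink = insert(t.rlink, info, x, y, level + 1)
    return t


def search(t: Optional[Node], x: int, y: int) -> Optional[Node]:
    """Follow the same branching rule as insert; None if the point is absent."""
    level = 0
    while t is not None and (t.xval, t.yval) != (x, y):
        t = t.llink if goes_left(t, x, y, level) else t.rlink
        level += 1
    return t


cities = [("Banja Luka", 19, 45), ("Derventa", 40, 50), ("Toslic", 38, 38),
          ("Tuzla", 54, 35), ("Sinj", 4, 4)]
root = None
for name, x, y in cities:
    root = insert(root, name, x, y)
```

With this insertion order, Sinj ends up as the left child of Banja Luka, Derventa as its right child, Toslic below Derventa, and Tuzla below Toslic. A different insertion order would give a different tree shape.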

2-d trees: Deletion

Deletion of point (x,y) from T (let N be the node that holds the point):
• If N is a leaf node: easy
• Otherwise, either Tl (left subtree) or Tr (right subtree) is non-empty
• Find a “candidate replacement” node R in Tl or Tr
• Replace all of N’s non-link fields by those of R
• Recursively delete R from Ti (the subtree, Tl or Tr, in which R was found)

Recursion guaranteed to terminate - Why?

2-d trees: Deletion

Finding candidate replacement nodes for deletion: the replacement node R must bear the same spatial relation to all nodes in Tl and Tr as node N does.

2-d trees: Range Queries

Q: Given a point (xc, yc) and a distance r find all points in the 2-d tree that lie within the circle

A: Each node N in a 2-d tree implicitly represents a region R_N. If the circle specified by the query has no intersection with R_N, then there is no point in searching the subtree rooted at node N.
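The pruning idea can be sketched as follows. This is an illustrative version under stated assumptions: the region of each node is tracked as a bounding box passed down the recursion (starting from the whole plane), and the names `region_meets_circle` and `range_query` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    xval: float
    yval: float
    llink: Optional["Node"] = None
    rlink: Optional["Node"] = None


def insert(t, x, y, level=0):
    """Plain 2-d tree insertion (even levels split on x, odd levels on y)."""
    if t is None:
        return Node(x, y)
    if (t.xval, t.yval) == (x, y):
        return t
    go_left = x < t.xval if level % 2 == 0 else y < t.yval
    if go_left:
        t.llink = insert(t.llink, x, y, level + 1)
    else:
        t.rlink = insert(t.rlink, x, y, level + 1)
    return t


def region_meets_circle(xlo, xhi, ylo, yhi, xc, yc, r):
    """True if the box [xlo, xhi] x [ylo, yhi] intersects the query circle."""
    dx = max(xlo - xc, 0.0, xc - xhi)   # horizontal gap between box and center
    dy = max(ylo - yc, 0.0, yc - yhi)   # vertical gap between box and center
    return dx * dx + dy * dy <= r * r


def range_query(n, xc, yc, r, level=0,
                xlo=-math.inf, xhi=math.inf, ylo=-math.inf, yhi=math.inf):
    """All points within distance r of (xc, yc), skipping regions the circle misses."""
    if n is None or not region_meets_circle(xlo, xhi, ylo, yhi, xc, yc, r):
        return []
    found = []
    if (n.xval - xc) ** 2 + (n.yval - yc) ** 2 <= r * r:
        found.append((n.xval, n.yval))
    if level % 2 == 0:                  # N splits its region on x
        found += range_query(n.llink, xc, yc, r, level + 1, xlo, n.xval, ylo, yhi)
        found += range_query(n.rlink, xc, yc, r, level + 1, n.xval, xhi, ylo, yhi)
    else:                               # N splits its region on y
        found += range_query(n.llink, xc, yc, r, level + 1, xlo, xhi, ylo, n.yval)
        found += range_query(n.rlink, xc, yc, r, level + 1, xlo, xhi, n.yval, yhi)
    return found


points = [(19, 45), (40, 50), (38, 38), (54, 35), (4, 4)]
root = None
for x, y in points:
    root = insert(root, x, y)
hits = range_query(root, 40, 40, 15)
```

For the query circle centered at (40, 40) with radius 15, the left subtree of the root (everything with x < 19) is skipped without being visited, since its region is at least 21 units away from the center.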

spatial access methods
• problem definition
• k-d trees
• point quadtrees
• z-ordering
• R-trees

Point Quadtrees

• Represent point data
• Always split regions into 4 parts
• 2-d tree: a node N splits a region into two by drawing one line through the point (N.xval, N.yval)
• Point quadtree: a node N splits a region by drawing a horizontal and a vertical line through the point (N.xval, N.yval)
• Four parts: NW, SW, NE, and SE quadrants

Q: Quadtree nodes have 4 children?

Point Quadtrees

Nodes in point quadtrees represent regions

Point quadtrees - Insertion

City (Xval, Yval)

Banja Luka (19, 45)

Derventa (40, 50)

Toslic (38, 38)

Tuzla (54, 35)

Sinj (4, 4)

[Figures: splitting of the region by Banja Luka, by Derventa, by Toslic, by Tuzla, and by Sinj]
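Inserting the same cities into a point quadtree can be sketched like this. The slides do not say where a point lying exactly on a dividing line goes, so the rule used here (x >= N.xval counts as east, y >= N.yval counts as north) is an assumption, as are the names `QNode`, `quadrant`, and `insert`.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class QNode:
    info: str
    xval: int
    yval: int
    # Up to four children, keyed by quadrant name: "NW", "SW", "NE", "SE".
    children: Dict[str, "QNode"] = field(default_factory=dict)


def quadrant(n: QNode, x: int, y: int) -> str:
    """Quadrant of (x, y) relative to the two lines drawn through node n."""
    ns = "N" if y >= n.yval else "S"    # assumed tie rule: on the line counts as N
    ew = "E" if x >= n.xval else "W"    # assumed tie rule: on the line counts as E
    return ns + ew


def insert(t: Optional[QNode], info: str, x: int, y: int) -> QNode:
    if t is None:
        return QNode(info, x, y)
    if (t.xval, t.yval) == (x, y):
        t.info = info                   # same point: overwrite
        return t
    q = quadrant(t, x, y)
    t.children[q] = insert(t.children.get(q), info, x, y)
    return t


cities = [("Banja Luka", 19, 45), ("Derventa", 40, 50), ("Toslic", 38, 38),
          ("Tuzla", 54, 35), ("Sinj", 4, 4)]
root = None
for name, x, y in cities:
    root = insert(root, name, x, y)
```

Here Derventa falls in the NE quadrant of Banja Luka, Toslic in its SE quadrant, Sinj in its SW quadrant, and Tuzla in the SE quadrant of Toslic. The NW quadrant of the root stays empty, which illustrates that a node need not have all four children.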

Point Quadtrees - Insertion

Point quadtrees: Deletion

Deletion of point (x,y) from T (let N be the node that holds the point):
• If N is a leaf node: easy
• Otherwise, a subtree (N.NW, N.SW, N.NE, N.SE) is non-empty
• Find a “candidate replacement” node R in one of the subtrees such that: every other node R1 in N.NW is to the NW of R, every other node R2 in N.SW is to the SW of R, etc.
• In general, it may not always be possible to find such a replacement node

Q: What happens in the worst case?
A: May require all nodes to be reinserted

Point quadtrees: Range Searches

Each node in a point quadtree represents a region

Do not search regions that do not intersect the circle defined by the query

MX-Quadtrees

Drawbacks of 2-d trees and point quadtrees:
• the shape of the tree depends upon the order in which objects are inserted into the tree
• splits may be uneven, depending upon where the point (N.xval, N.yval) is located inside the region (represented by N)

MX-quadtrees: shape (and height) of tree independent of number of nodes and order of insertion

MX-Quadtrees

Assumption: the map is represented as a grid of size (2^k x 2^k) for some k

When a region gets “split” it splits down the middle

MX-Quadtrees - Insertion

After insertion of A, B, C, and D respectively


MX-Quadtrees - Deletion

Fairly easy. Why?
• All points are represented at the leaf level
• Total time for deletion: O(k)
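A small sketch of an MX-quadtree over a 2^k x 2^k grid shows why both operations take O(k) steps. It assumes integer cell coordinates in [0, 2^k), with k >= 1; the class and method names are hypothetical. Since every region splits down the middle, the path to a cell is fixed by the bits of its coordinates, whatever the insertion order.

```python
class MXNode:
    def __init__(self):
        self.children = {}   # quadrant name ("NW", "SW", "NE", "SE") -> MXNode
        self.info = None     # payload, set only at the leaf level (depth k)


class MXQuadtree:
    def __init__(self, k):
        self.k = k           # the grid is 2^k x 2^k
        self.root = MXNode()

    def _path(self, x, y):
        """Quadrant names from the root down to the 1x1 cell (x, y): k steps."""
        path = []
        for bit in range(self.k - 1, -1, -1):
            ns = "N" if (y >> bit) & 1 else "S"   # upper or lower half
            ew = "E" if (x >> bit) & 1 else "W"   # right or left half
            path.append(ns + ew)
        return path

    def insert(self, x, y, info):
        node = self.root
        for q in self._path(x, y):
            if q not in node.children:
                node.children[q] = MXNode()
            node = node.children[q]
        node.info = info

    def find(self, x, y):
        node = self.root
        for q in self._path(x, y):
            node = node.children.get(q)
            if node is None:
                return None
        return node.info

    def delete(self, x, y):
        """Remove the leaf for (x, y), then prune ancestors left without children."""
        trail = []
        node = self.root
        for q in self._path(x, y):
            child = node.children.get(q)
            if child is None:
                return False             # point not present
            trail.append((node, q))
            node = child
        for parent, q in reversed(trail):
            del parent.children[q]
            if parent.children:          # parent still has other points below it
                break
        return True


tree = MXQuadtree(k=2)
for name, (x, y) in {"A": (0, 3), "B": (3, 3), "C": (1, 0), "D": (2, 1)}.items():
    tree.insert(x, y, name)
```

Deletion never needs a replacement node, because internal nodes hold no points; the only cleanup is removing internal nodes that no longer lead to any leaf.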

MX-Quadtrees – Range Queries

Same as in point quadtrees, with one difference: checking whether a point is in the circle defined by the range query needs to be performed at the leaf level (points are stored at the leaf level)

z-ordering

Hint: reduce the problem to 1-d points (!!)

Q1: why?
A: B-trees!
Q2: how?

z-ordering

Q2: how?
A: assume finite granularity (e.g., 2^32 x 2^32; 4x4 here); z-ordering = bit-shuffling = N-trees = Morton keys = geo-coding = ...
Q2.1: how to map n-d cells to 1-d cells?

z-ordering

Q2.1: how to map n-d cells to 1-d cells?
A: row-wise
Q: is it good?
A: great for ‘x’ axis; bad for ‘y’ axis

z-ordering

Q: How about the ‘snake’ curve?
A: still problems
Q: Why are those curves ‘bad’?
A: no distance preservation (~ clustering)
Q: solution?

z-ordering

Q: solution? (w/ good clustering, and easy to compute, for 2-d and n-d?)
A: z-ordering/bit-shuffling/linear-quadtrees

‘looks’ better:
• few long jumps;
• scoops out the whole quadrant before leaving it
• a.k.a. space filling curves

z-ordering

z-ordering/bit-shuffling/linear-quadtrees

Q: How to generate this curve (z = f(x,y))?
A: 3 (equivalent) answers!

A1: ‘z’ (or ‘N’) shapes, RECURSIVELY

[Figure: order-1, order-2, ..., order-(n+1) curves]

Notice:
• self similar (we’ll see about fractals, soon)
• method is hard to use: z =? f(x,y)

z-ordering

Method #2?

z-ordering

bit-shuffling

[Figure: 4x4 grid with axis labels 00, 01, 10, 11; the bits of the two coordinates are interleaved]

z = (0 1 0 1)_2 = 5

How about the reverse: (x,y) = g(z)?

How about n-d spaces?
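Bit-shuffling, its reverse, and the n-d case can be sketched in a few lines. The bit order is an assumption inferred from the examples in these slides: at each level the x bit comes before the y bit, which makes x = 00, y = 11 give (0101)_2 = 5 and shuffle(11;10) give (1110)_2 = 14. The function names are hypothetical.

```python
def z_encode(x: int, y: int, bits: int) -> int:
    """Interleave the bits of x and y, most significant first, x before y."""
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z


def z_decode(z: int, bits: int):
    """The reverse mapping (x, y) = g(z): un-shuffle the bits."""
    x = y = 0
    for i in range(bits - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)   # x bits sit at odd positions
        y = (y << 1) | ((z >> (2 * i)) & 1)       # y bits sit at even positions
    return x, y


def z_encode_nd(coords, bits: int) -> int:
    """n-d version: take one bit from each coordinate in turn."""
    z = 0
    for i in range(bits - 1, -1, -1):
        for c in coords:
            z = (z << 1) | ((c >> i) & 1)
    return z


z_example = z_encode(0b00, 0b11, bits=2)   # 5
z_drill = z_encode(0b11, 0b10, bits=2)     # 14
```

Both directions cost time linear in the number of bits, which is what makes the method easy to use compared with drawing the curve recursively.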

z-ordering

Method #3?

linear-quadtrees: assign N->1, S->0, etc., and repeat recursively. E.g.: z_gray-cell = WN;WN = (0101)_2 = 5

z-ordering

Drill: z-value of grey cell, with the three methods?

method#1: 14
method#2: shuffle(11;10) = (1110)_2 = 14
method#3: EN;ES = ... = 14
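Method #3 can be checked mechanically. The slides give N->1 and S->0; the remaining assignments W->0 and E->1 are inferred here from the two worked values (WN;WN = 5 and EN;ES = 14), and the function name is hypothetical.

```python
# Bit assigned to each direction letter (W and E inferred from the worked examples).
BIT = {"W": "0", "E": "1", "S": "0", "N": "1"}


def z_from_quadrants(path):
    """z-value of a cell given one two-letter quadrant label per level, e.g. ["EN", "ES"]."""
    bits = "".join(BIT[letter] for label in path for letter in label)
    return int(bits, 2)


gray_cell = z_from_quadrants(["WN", "WN"])    # (0101)_2
drill_cell = z_from_quadrants(["EN", "ES"])   # (1110)_2
```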

z-ordering - Detailed outline

spatial access methods
• z-ordering
  main idea - 3 methods
  use w/ B-trees; algorithms (range, knn queries ...)
  non-point (e.g., region) data
  analysis; variations
• R-trees

z-ordering - usage & algo’s

Q1: How to store on disk?
A: treat z-value as primary key; feed to B-tree

z     cname   etc
5     SF
12    PGH

MAJOR ADVANTAGES w/ B-tree:
• already inside commercial systems (no coding/debugging!)
• concurrency & recovery is ready

Q2: How to answer range queries etc.?
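One simple way to answer a rectangular range query on z-values is sketched below. It is an assumption-laden illustration: a sorted Python list with `bisect` stands in for the B-tree, the grid positions chosen for SF and PGH are simply the cells whose z-values are 5 and 12 on a 4x4 grid, and the query scans the single interval between the z-values of the rectangle's two corners, filtering out false hits. Splitting the query into several tighter z-intervals would scan less, but is not shown.

```python
import bisect


def z_encode(x, y, bits):
    """Bit-shuffling with the x bit before the y bit at each level."""
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z


class ZIndex:
    """Sorted (z, row) storage; a stand-in for a B-tree keyed on the z-value."""

    def __init__(self, bits):
        self.bits = bits
        self.keys = []   # sorted z-values
        self.rows = []   # rows[i] = (x, y, cname) for keys[i]

    def insert(self, x, y, cname):
        z = z_encode(x, y, self.bits)
        i = bisect.bisect_left(self.keys, z)
        self.keys.insert(i, z)
        self.rows.insert(i, (x, y, cname))

    def range_query(self, xlo, ylo, xhi, yhi):
        # z grows with x and with y, so every cell inside the rectangle has a
        # z-value between those of its lower-left and upper-right corners.
        zlo = z_encode(xlo, ylo, self.bits)
        zhi = z_encode(xhi, yhi, self.bits)
        lo = bisect.bisect_left(self.keys, zlo)
        hi = bisect.bisect_right(self.keys, zhi)
        # The interval may also contain cells outside the rectangle: filter them.
        return [row for row in self.rows[lo:hi]
                if xlo <= row[0] <= xhi and ylo <= row[1] <= yhi]


index = ZIndex(bits=2)
index.insert(2, 2, "PGH")   # z = (1100)_2 = 12
index.insert(0, 3, "SF")    # z = (0101)_2 = 5
```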

R-trees

z-ordering - variations

Q: is z-ordering the best we can do?
A: probably not - occasional long ‘jumps’
Q: then?
A1: Gray codes
A2: Hilbert curve! (a.k.a. Hilbert-Peano curve)

‘Looks’ better (never long jumps). How to derive it?

[Figure: order-1, order-2, ..., order-(n+1) Hilbert curves]

Q: function for the Hilbert curve (h = f(x,y))?
A: bit-shuffling, followed by post-processing, to account for rotations. Linear on # bits. See textbook for pointers to code/algorithms (e.g., [Jagadish, 90])

Q: how about Hilbert curve in 3-d? n-d?
A: Exists (and is not unique!). E.g., 3-d, order-1 Hilbert curves (Hamiltonian paths on cube): #1, #2

R-trees ...

z-ordering - analysis

Q: How many pieces (‘quad-tree blocks’) per region?

A: proportional to perimeter (surface etc)

(How long is the coastline, say, of England?

Paradox: The answer changes with the yard-stick -> fractals ...)

Q: Should we decompose a region to full detail (and store in B-tree)?

A: NO! approximation with 1-3 pieces/z-values is best [Orenstein90]

Q: how to measure the ‘goodness’ of a curve?
A: e.g., avg. # of runs, for range queries

[Figure: two range-query examples, labeled ‘4 runs’ and ‘3 runs’]
(#runs ~ #disk accesses on B-tree)
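Counting runs is easy to try out. In the sketch below, a run is a maximal set of consecutive curve values covered by the query. The Hilbert mapping is one common iterative formulation (quadrant digit plus rotation/reflection at each level), not necessarily the one in the cited papers, and the orientation of the curve is an assumption; the function names are hypothetical.

```python
def z_value(x, y, bits):
    """z-ordering by bit-shuffling (x bit before y bit at each level)."""
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z


def hilbert_value(x, y, bits):
    """Position of cell (x, y) along a Hilbert curve on a 2^bits x 2^bits grid."""
    n = 1 << bits
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)    # which quadrant, in curve order
        if ry == 0:                     # rotate/reflect so the sub-curve lines up
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def count_runs(cells, key):
    """Number of maximal runs of consecutive key values among the query cells."""
    values = sorted(key(x, y) for x, y in cells)
    if not values:
        return 0
    return 1 + sum(1 for a, b in zip(values, values[1:]) if b != a + 1)


BITS = 2
query = [(x, y) for x in (1, 2) for y in (1, 2)]   # central 2x2 block of a 4x4 grid
z_runs = count_runs(query, lambda x, y: z_value(x, y, BITS))
h_runs = count_runs(query, lambda x, y: hilbert_value(x, y, BITS))
```

For this particular query the z-values are 3, 6, 9, 12 (4 runs) and the Hilbert values are 2, 7, 8, 13 (3 runs). Averaging `count_runs` over many queries would give the kind of comparison the slide refers to.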

Q: So, is Hilbert really better?
A: 27% fewer runs, for 2-d (similar for

Q: are there formulas for #runs, # of quadtree blocks etc.?
A: Yes ([Jagadish; Moon+ etc.] see textbook)

z-ordering - fun observations

Hilbert and z-ordering curves: “space filling curves”: eventually, they visit every point in n-d space - therefore:

... they show that the plane has as many points as a line (-> headaches for 1900’s mathematics/topology). (fractals, again!)

Observation #2: Hilbert (like) curve for video encoding [Y. Matias+, CRYPTO ‘87]: given a frame, visit its pixels in randomized Hilbert order; compress; and transmit

In general, Hilbert curve is great for preserving distances, clustering, vector quantization etc

Conclusions

z-ordering is a great idea (n-d points -> 1-d points; feed to B-trees)

used by TIGER system and (most probably) by other GIS products

works great with low-dim points

SAMs – Detailed Outline

SAMs - more detailed outline

R-trees
• main idea; file structure
• (algorithms: insertion/split)
• (deletion)
• (search: range, nn, spatial joins)
• variations (packed; hilbert; ...)

R-trees

z-ordering: cuts regions to pieces -> dup. elim.
how could we avoid that?
Idea: Minimum Bounding Rectangles

R-trees

[Guttman 84] Main idea: allow parents to overlap!
=> guaranteed 50% utilization
=> easier insertion/split algorithms.
(only deal with Minimum Bounding Rectangles - MBRs)

R-trees

eg., w/ fanout 4: group nearby rectangles to parent MBRs; each group -> disk page

[Figure: rectangles A through J grouped into parent MBRs P1 through P4]

R-trees - format of nodes

{(MBR; obj-ptr)} for leaf nodes

[Figure: a leaf node; each entry holds x-low, x-high, y-low, y-high and an obj-ptr]

{(MBR; node-ptr)} for non-leaf nodes

[Figure: a non-leaf node (P1 P2 P3 P4); each entry holds x-low, x-high, y-low, y-high and a node-ptr]

R-trees - range search?

[Figure: the example R-tree, parents P1 through P4 over rectangles A through J]

R-trees - range search

Observations:
• every parent node completely covers its ‘children’
• a child MBR may be covered by more than one parent - it is stored under ONLY ONE of them (i.e., no need for dup. elim.)
• a point query may follow multiple branches
• everything works for any dimensionality

R-trees
• main idea; file structure
• algorithms: insertion/split
• deletion
• search: range, nn, spatial joins
• performance analysis
• variations (packed; hilbert; ...)

R-trees - insertion

e.g., rectangle ‘X’

[Figure: the example R-tree (P1 through P4; A through J) with a new rectangle X]

e.g., rectangle ‘Y’: extend suitable parent.

[Figure: the example R-tree (P1 through P4; A through J) with a new rectangle Y]

R-trees - insertion

Q: how to measure ‘suitability’?
A: by increase in area (volume) (more details: later, under ‘performance analysis’)
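The 'increase in area' criterion can be sketched as follows. The `MBR` tuple mirrors the x-low, x-high, y-low, y-high layout of the node format; the function names and the tie-breaking rule (prefer the smaller parent) are assumptions made for illustration.

```python
from typing import NamedTuple, Sequence


class MBR(NamedTuple):
    x_low: float
    x_high: float
    y_low: float
    y_high: float


def area(r: MBR) -> float:
    return (r.x_high - r.x_low) * (r.y_high - r.y_low)


def union(a: MBR, b: MBR) -> MBR:
    """Smallest rectangle covering both a and b."""
    return MBR(min(a.x_low, b.x_low), max(a.x_high, b.x_high),
               min(a.y_low, b.y_low), max(a.y_high, b.y_high))


def enlargement(parent: MBR, new: MBR) -> float:
    """How much the parent's area grows if it has to cover the new rectangle too."""
    return area(union(parent, new)) - area(parent)


def choose_parent(parents: Sequence[MBR], new: MBR) -> int:
    """Index of the most 'suitable' parent: least area increase, ties by smaller area."""
    return min(range(len(parents)),
               key=lambda i: (enlargement(parents[i], new), area(parents[i])))


parents = [MBR(0, 4, 0, 4), MBR(10, 12, 10, 12)]
inside = choose_parent(parents, MBR(1, 2, 1, 2))       # fits in parent 0 as-is
nearby = choose_parent(parents, MBR(11, 12, 13, 14))   # cheaper to extend parent 1
```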

Q: what if there is no room? how to split?

R-trees - insertion

e.g., rectangle ‘W’ - focus on ‘P1’ - how to split?

[Figure: the example R-tree with a new rectangle W that overflows P1]

• (A1: plane sweep, until 50% of rectangles)
• A2: ‘linear’ split
• A3: quadratic split
• A4: exponential split

R-trees - insertion & split

pick two rectangles as ‘seeds’; assign each rectangle ‘R’ to the ‘closest’ ‘seed’

Q: how to measure ‘closeness’?
A: by increase of area (volume)

smart idea: pre-sort rectangles according to delta of closeness (i.e., schedule easiest choices first!)
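A simplified sketch of a seed-based split in this spirit is shown below. It is written from the description above rather than copied from [Guttman 84]: seeds are the pair that would waste the most area under a common MBR, the next rectangle handled is the one with the largest difference between its two enlargements (the 'easiest choice'), and a `min_fill` parameter (an assumed name) keeps either group from ending up too small.

```python
from itertools import combinations
from typing import NamedTuple


class MBR(NamedTuple):
    x_low: float
    x_high: float
    y_low: float
    y_high: float


def area(r):
    return (r.x_high - r.x_low) * (r.y_high - r.y_low)


def union(a, b):
    return MBR(min(a.x_low, b.x_low), max(a.x_high, b.x_high),
               min(a.y_low, b.y_low), max(a.y_high, b.y_high))


def enlargement(cover, r):
    return area(union(cover, r)) - area(cover)


def quadratic_split(rects, min_fill):
    """Split rects into two groups; returns (groups, covers)."""
    def waste(pair):
        a, b = rects[pair[0]], rects[pair[1]]
        return area(union(a, b)) - area(a) - area(b)

    # Seeds: the pair that is worst to keep together.
    i, j = max(combinations(range(len(rects)), 2), key=waste)
    groups = [[rects[i]], [rects[j]]]
    covers = [rects[i], rects[j]]
    rest = [r for k, r in enumerate(rects) if k not in (i, j)]
    while rest:
        # If a group needs all remaining rectangles to reach min_fill, hand them over.
        short = [g for g in (0, 1) if len(groups[g]) + len(rest) <= min_fill]
        if short:
            g = short[0]
            for r in rest:
                groups[g].append(r)
                covers[g] = union(covers[g], r)
            break
        # Easiest choice first: largest gap between the two enlargements.
        r = max(rest, key=lambda c: abs(enlargement(covers[0], c)
                                        - enlargement(covers[1], c)))
        rest.remove(r)
        g = 0 if enlargement(covers[0], r) <= enlargement(covers[1], r) else 1
        groups[g].append(r)
        covers[g] = union(covers[g], r)
    return groups, covers


A, B, C = MBR(0, 1, 0, 1), MBR(1, 2, 0, 1), MBR(0, 1, 1, 2)
D, E = MBR(10, 11, 10, 11), MBR(11, 12, 10, 11)
groups, covers = quadratic_split([A, B, C, D, E], min_fill=2)
```

On this toy input the two far-apart clusters are separated: one group holds A, B, C and the other holds D, E. Choosing the seeds examines all pairs, hence the quadratic cost in the number of rectangles.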

R-trees - insertion - pseudocode

• decide which parent to put new rectangle into (‘closest’ parent)
• if overflow, split to two, using (say,) the quadratic split algorithm
• propagate the split upwards, if necessary
• update the MBRs of the affected parents.

R-trees - insertion - observations

many more split algorithms exist (next!)

R-trees - deletion

delete rectangle
if underflow: temporarily delete all siblings (!); delete the parent node and re-insert them

R-trees - range search

pseudocode:
check the root
for each branch, if its MBR intersects the query rectangle, apply range-search (or print out, if this is a leaf)
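The pseudocode translates almost line for line. The sketch below assumes a minimal in-memory node type (`RNode`, a hypothetical name) whose entries are (MBR, pointer) pairs as in the node format; on disk each node would be a page.

```python
from dataclasses import dataclass, field
from typing import List, NamedTuple, Tuple


class MBR(NamedTuple):
    x_low: float
    x_high: float
    y_low: float
    y_high: float


@dataclass
class RNode:
    is_leaf: bool
    # Leaf: (MBR, object id). Non-leaf: (MBR, child RNode).
    entries: List[Tuple[MBR, object]] = field(default_factory=list)


def intersects(a: MBR, b: MBR) -> bool:
    return (a.x_low <= b.x_high and b.x_low <= a.x_high and
            a.y_low <= b.y_high and b.y_low <= a.y_high)


def range_search(node: RNode, query: MBR, out=None):
    """Collect the ids of all objects whose MBR intersects the query rectangle."""
    if out is None:
        out = []
    for mbr, ptr in node.entries:
        if intersects(mbr, query):
            if node.is_leaf:
                out.append(ptr)                  # report the object
            else:
                range_search(ptr, query, out)    # descend into this branch only
    return out


leaf1 = RNode(True, [(MBR(0, 1, 0, 1), "A"), (MBR(2, 3, 0, 1), "B")])
leaf2 = RNode(True, [(MBR(10, 11, 10, 11), "C")])
root = RNode(False, [(MBR(0, 3, 0, 1), leaf1), (MBR(10, 11, 10, 11), leaf2)])
hits = range_search(root, MBR(0.5, 2.5, 0.5, 0.8))
```

Because parent MBRs may overlap, a query can descend into several branches. The result is a filter step on MBRs; checking the exact geometry of each reported object is left out.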

R-trees - nn search

Q: How? (find near neighbor; refine...)
A1: depth-first search; then, range query
A2: [Roussopoulos+, sigmod95]: priority queue, with promising MBRs, and their best and worst-case distance

main idea: consider only P2 and P4, for illustration

[Figure: a query point with MBRs P2 and P4; the ‘worst of P2’ distance is smaller than the ‘best of P4’ distance]

=> P4 is useless for 1-nn

what is really the worst of, say, P2?
A: the smallest of the two red segments!
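The 'best' and 'worst' distances can be computed directly. The sketch below uses the usual definitions associated with this approach: the best case is the distance from the query point to the nearest point of the MBR, and the worst case is, over all dimensions, the smallest distance to the farthest corner of the nearest face, since every face of an MBR must touch some object. Squared distances are used to avoid square roots; the function names and the coordinates of P2 and P4 are made up for illustration.

```python
def mindist_sq(p, low, high):
    """Best case: squared distance from point p to the nearest point of the MBR."""
    return sum(max(l - c, 0, c - h) ** 2 for c, l, h in zip(p, low, high))


def minmaxdist_sq(p, low, high):
    """Worst case: an object is guaranteed to exist within this squared distance."""
    # Per dimension: coordinate of the nearer face and of the farther face.
    near = [l if c <= (l + h) / 2 else h for c, l, h in zip(p, low, high)]
    far = [l if c >= (l + h) / 2 else h for c, l, h in zip(p, low, high)]
    far_sq = [(c - f) ** 2 for c, f in zip(p, far)]
    total_far = sum(far_sq)
    # For each dimension k: go to the nearest face along k, the farthest elsewhere.
    return min(total_far - far_sq[k] + (p[k] - near[k]) ** 2
               for k in range(len(p)))


q = (0, 0)
p2_low, p2_high = (1, 1), (2, 2)     # hypothetical coordinates for P2
p4_low, p4_high = (5, 5), (6, 6)     # hypothetical coordinates for P4
worst_p2 = minmaxdist_sq(q, p2_low, p2_high)
best_p4 = mindist_sq(q, p4_low, p4_high)
p4_is_useless = best_p4 > worst_p2   # P4 cannot hold the nearest neighbor
```

In 2-d the candidates inside `minmaxdist_sq` are exactly two segments, one per dimension, and the smaller one is kept, which matches the answer on the slide.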

R-trees - nn search

variations: [Hjaltason & Samet] incremental nn:
• build a priority queue
• scan enough of the tree, to make sure you have the k nn
• to find the (k+1)-th, check the queue, and scan some more of the tree
• ‘optimal’ (but, may need too much memory)

R-trees - spatial joins

Spatial joins: find (quickly) all counties intersecting

Assume that they are both organized in R-trees:

for each parent P1 of tree T1
    for each parent P2 of tree T2
        if their MBRs intersect, process them recursively (i.e., check their children)
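The nested loop above can be sketched recursively. This version is an illustration under assumptions: `RNode` is a hypothetical in-memory node type with (MBR, pointer) entries, the two trees may have different heights (the leaf side then waits while the other side descends), and the output is the list of object pairs whose MBRs intersect, without the exact-geometry check.

```python
from dataclasses import dataclass, field
from typing import List, NamedTuple, Tuple


class MBR(NamedTuple):
    x_low: float
    x_high: float
    y_low: float
    y_high: float


@dataclass
class RNode:
    is_leaf: bool
    entries: List[Tuple[MBR, object]] = field(default_factory=list)


def intersects(a, b):
    return (a.x_low <= b.x_high and b.x_low <= a.x_high and
            a.y_low <= b.y_high and b.y_low <= a.y_high)


def spatial_join(n1, n2, out=None):
    """Pairs (id1, id2) of objects from the two trees whose MBRs intersect."""
    if out is None:
        out = []
    for mbr1, ptr1 in n1.entries:
        for mbr2, ptr2 in n2.entries:
            if not intersects(mbr1, mbr2):
                continue                   # nothing below can intersect either
            if n1.is_leaf and n2.is_leaf:
                out.append((ptr1, ptr2))
            elif n1.is_leaf:               # only tree 2 can still descend
                spatial_join(RNode(True, [(mbr1, ptr1)]), ptr2, out)
            elif n2.is_leaf:               # only tree 1 can still descend
                spatial_join(ptr1, RNode(True, [(mbr2, ptr2)]), out)
            else:
                spatial_join(ptr1, ptr2, out)
    return out


# Tree 1: two levels. Tree 2: a single leaf. All ids and coordinates are made up.
leaf = RNode(True, [(MBR(0, 4, 0, 4), "c1"), (MBR(5, 9, 0, 4), "c2")])
t1 = RNode(False, [(MBR(0, 9, 0, 4), leaf)])
t2 = RNode(True, [(MBR(3, 6, 1, 2), "r1"), (MBR(20, 21, 20, 21), "r2")])
pairs = spatial_join(t1, t2)
```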

Improvements - variations:
- [Seeger+, sigmod 92]: do some pre-filtering; do plane-sweeping to avoid N1 * N2 tests for intersection
- [Lo & Ravishankar, sigmod 94]: ‘seeded’ R-trees

(FYI, many more papers on spatial joins, without R-trees: [Koudas+ Sevcik], etc.)

R-trees
• main idea; file structure
• algorithms: insertion/split
• deletion
• search: range, nn, spatial joins
• variations (packed; hilbert; ...)

R-trees - variations

Guttman’s R-trees sparked much follow-up work
• can we do better splits?
• what about static datasets (no ins/del/upd)?
• what about other bounding shapes?

can we do better splits? i.e., defer splits?

A: R*-trees [Kriegel+, SIGMOD90]

defer splits, by forced-reinsert, i.e.: instead of splitting, temporarily delete some entries, shrink overflowing MBR, and re-insert those entries

Which ones to re-insert? How many?
A: 30%

Q: Other ways to defer splits?
A: Push a few keys to the closest sibling node (closest = ??)

R*-trees: Also try to minimize area AND perimeter, in their split.

Performance: higher space utilization; faster than plain R-trees. One of the most successful R-tree variants.

what about static datasets (no ins/del/upd)? Hilbert R-trees

Q: Best way to pack points?
A1: plane-sweep: great for queries on ‘x’; terrible for ‘y’
Q: how to improve?
A: plane-sweep on HILBERT curve! In fact, it can be made dynamic (how?), as well as to handle regions (how?)
A: [Kamel+, VLDB94]
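Packing along the Hilbert curve can be sketched for a static set of points: sort by Hilbert value, cut the sorted list into groups of C points, and give each group its MBR. The Hilbert mapping below is one common iterative formulation and its orientation is an assumption; the function names and the (x_low, x_high, y_low, y_high) tuple layout are chosen for illustration. Only the leaf level is built; upper levels would be packed the same way from the leaf MBRs.

```python
def hilbert_value(x, y, bits):
    """Position of cell (x, y) along a Hilbert curve on a 2^bits x 2^bits grid."""
    n = 1 << bits
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                     # rotate/reflect the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def hilbert_pack(points, capacity, bits):
    """Leaf level of a packed R-tree: list of (mbr, points) with <= capacity points each."""
    ordered = sorted(points, key=lambda p: hilbert_value(p[0], p[1], bits))
    leaves = []
    for start in range(0, len(ordered), capacity):
        group = ordered[start:start + capacity]
        xs = [p[0] for p in group]
        ys = [p[1] for p in group]
        mbr = (min(xs), max(xs), min(ys), max(ys))   # x_low, x_high, y_low, y_high
        leaves.append((mbr, group))
    return leaves


grid = [(x, y) for x in range(4) for y in range(4)]
leaves = hilbert_pack(grid, capacity=4, bits=2)
```

On the full 4x4 grid with C = 4, each leaf covers exactly one quadrant, so the leaf MBRs are square and do not overlap. Sorting on x alone would instead produce four thin vertical strips.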

what about other bounding shapes? (and why?)
A1: arbitrary-orientation lines (cell-tree, [Guenther])
A2: P-trees (polygon trees) (MB polygon: 0, 90, 45, 135 degree lines)
A3: L-shapes; holes (hB-tree)
A4: TV-trees [Lin+, VLDB-Journal 1994]
A5: SR-trees [Katayama+, SIGMOD97] (used in Informedia)

R-trees - conclusions

Popular method; like multi-d B-trees
• guaranteed utilization
• good search times (for low-dim. at least)
• R*-, Hilbert- and SR-trees: still used
• IBM (Informix) ships DataBlade with R-trees

References

Guttman, A. (June 1984). R-Trees: A Dynamic Index Structure for Spatial Searching. Proc. ACM SIGMOD, Boston, Mass.

Jagadish, H. V. (May 23-25, 1990). Linear Clustering of Objects with Multiple Attributes. ACM SIGMOD Conf., Atlantic City, NJ.

Lin, K.-I., H. V. Jagadish, et al. (Oct. 1994). “The TV-tree - An Index Structure for High-dimensional Data.” VLDB Journal 3: 517-542.


Pagel, B., H. Six, et al. (May 1993). Towards an Analysis of Range Query Performance. Proc. of ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), Washington, D.C.

Robinson, J. T. (1981). The k-D-B-Tree: A Search Structure for Large Multidimensional Dynamic Indexes. Proc. ACM SIGMOD.

Roussopoulos, N., S. Kelley, et al. (May 1995). Nearest Neighbor Queries. Proc. of ACM-SIGMOD, San Jose, CA.
