temple university – cis dept. cis616– principles of data management

164
Temple University – CIS Dept. CIS616– Principles of Data Management V. Megalooikonomou Spatial Access Methods (SAMs) (based on notes by Silberchatz,Korth, and Sudarshan and

Upload: adam-carson

Post on 03-Jan-2016

33 views

Category:

Documents


2 download

DESCRIPTION

Temple University – CIS Dept. CIS616– Principles of Data Management. V. Megalooikonomou Spatial Access Methods (SAMs) ( based on notes by Silberchatz,Korth, and Sudarshan and notes by C. Faloutsos at CMU ). General Overview. Multimedia Indexing Spatial Access Methods (SAMs) k-d trees - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Temple University – CIS Dept. CIS616– Principles of Data Management

Temple University – CIS Dept.CIS616– Principles of Data Management

V. Megalooikonomou

Spatial Access Methods (SAMs)

(based on notes by Silberchatz,Korth, and Sudarshan and notes by C. Faloutsos at CMU)

Page 2: Temple University – CIS Dept. CIS616– Principles of Data Management

General Overview

Multimedia Indexing Spatial Access Methods (SAMs)

k-d trees Point Quadtrees MX-Quadtree z-ordering R-trees

Page 3: Temple University – CIS Dept. CIS616– Principles of Data Management

SAMs - Detailed outline

spatial access methods problem dfn k-d trees point quadtrees MX-quadtrees z-ordering R-trees

Page 4: Temple University – CIS Dept. CIS616– Principles of Data Management

Spatial Access Methods - problem

Given a collection of geometric objects (points, lines, polygons, ...)

organize them on disk, to answer spatial queries (like??)

Page 5: Temple University – CIS Dept. CIS616– Principles of Data Management

Spatial Access Methods - problem

Given a collection of geometric objects (points, lines, polygons, ...)

organize them on disk, to answer point queries range queries k-nn queries spatial joins (‘all pairs’ queries)

Page 6: Temple University – CIS Dept. CIS616– Principles of Data Management

Spatial Access Methods - problem

Given a collection of geometric objects (points, lines, polygons, ...)

organize them on disk, to answer point queries range queries k-nn queries spatial joins (‘all pairs’ queries)

Page 7: Temple University – CIS Dept. CIS616– Principles of Data Management

Spatial Access Methods - problem

Given a collection of geometric objects (points, lines, polygons, ...)

organize them on disk, to answer point queries range queries k-nn queries spatial joins (‘all pairs’ queries)

Page 8: Temple University – CIS Dept. CIS616– Principles of Data Management

Spatial Access Methods - problem

Given a collection of geometric objects (points, lines, polygons, ...)

organize them on disk, to answer point queries range queries k-nn queries spatial joins (‘all pairs’ queries)

Page 9: Temple University – CIS Dept. CIS616– Principles of Data Management

Spatial Access Methods - problem

Given a collection of geometric objects (points, lines, polygons, ...)

organize them on disk, to answer point queries range queries k-nn queries spatial joins (‘all pairs’ within ε)

Page 10: Temple University – CIS Dept. CIS616– Principles of Data Management

SAMs - motivation

Q: applications?

Page 11: Temple University – CIS Dept. CIS616– Principles of Data Management

SAMs - motivation

salary

age

traditional DB GIS

Page 12: Temple University – CIS Dept. CIS616– Principles of Data Management

SAMs - motivation

salary

age

traditional DB GIS

Page 13: Temple University – CIS Dept. CIS616– Principles of Data Management

SAMs - motivation

CAD/CAMfind elementstoo closeto each other

Page 14: Temple University – CIS Dept. CIS616– Principles of Data Management

SAMs - motivation

CAD/CAM

Page 15: Temple University – CIS Dept. CIS616– Principles of Data Management

day1 365

day1 365

S1

Sn

F(S1)

F(Sn)

SAMs - motivation

eg, avg

eg,. std

Page 16: Temple University – CIS Dept. CIS616– Principles of Data Management

SAMs: solutions

K-d trees point quadtrees MX-quadtrees z-ordering R-trees (grid files)

Q: how would you organize, e.g., n-dim points, on disk? (C points per disk page)

Page 17: Temple University – CIS Dept. CIS616– Principles of Data Management

SAMs - Detailed outline

spatial access methods problem dfn k-d trees point quadtrees MX-quadtrees z-ordering R-trees

Page 18: Temple University – CIS Dept. CIS616– Principles of Data Management

k-d trees

Used to store k dimensional point data It is not used to store region data A 2-d tree (i.e., for k=2) stores 2-

dimensional point data while a 3-d tree stores 3-dimensional point data, etc.

Page 19: Temple University – CIS Dept. CIS616– Principles of Data Management

2-d trees – node structure

Binary trees Info: information field Xval,Yval: coordinates of a point associated with the

node Llink, Rlink: pointers to children Properties (N: node):

If level N even -> for all nodes M in the subtree rooted at N.Llink: M.Xval < N.Xval for all nodes P in the subtree rooted at N.Rlink: P.Xval >= N.Xval

If level N odd -> Similarly use Yvals

Page 20: Temple University – CIS Dept. CIS616– Principles of Data Management

2-d trees – Example

Page 21: Temple University – CIS Dept. CIS616– Principles of Data Management

2-d trees: Insertion/Search

To insert a node N into the tree pointed by T If N and T agree on Xval, Yval then

overwrite T Else, branch left if N.Xval < T.xval, right

otherwise (even levels) Similarly for odd levels (branching on

Yvals)

Page 22: Temple University – CIS Dept. CIS616– Principles of Data Management

2-d trees – Example of Insertion

City (Xval, Yval)

Banja Luka (19, 45)

Derventa (40, 50)

Toslic (38, 38)

Tuzla (54, 35)

Sinj (4, 4)

Splitting of region by Banja LukaSplitting of region by Derventa

Splitting of region by Toslic Splitting of region by Sinj

Page 23: Temple University – CIS Dept. CIS616– Principles of Data Management

2-d trees: Deletion

Deletion of point (x,y) from T If N is a leaf node easy Otherwise either Tl (left subtree) or Tr

(right subtree) is non-empty Find a “candidate replacement” node R in Tl or

Tr

Replace all of N’s non-link fields by those of R Recursively delete R from Ti

Recursion guaranteed to terminate - Why?

Page 24: Temple University – CIS Dept. CIS616– Principles of Data Management

2-d trees: Deletion

Finding candidate replacement nodes for deletion Replacement node R must bear same

spatial relation to all nodes in Tl and Tr as node N

Page 25: Temple University – CIS Dept. CIS616– Principles of Data Management

2-d trees: Range Queries

Q: Given a point (xc, yc) and a distance r find all points in the 2-d tree that lie within the circle

A: Each node N in a 2-d tree implicitly represents a region RN – If the circle (specified by the query) has no intersection with RN then there is no point in searching the subtree rooted at node N

Page 26: Temple University – CIS Dept. CIS616– Principles of Data Management

SAMs - Detailed outline

spatial access methods problem dfn k-d trees point quadtrees z-ordering R-trees

Page 27: Temple University – CIS Dept. CIS616– Principles of Data Management

Point Quadtrees

Represent point data Always split regions into 4 parts 2-d tree: a node N splits a region into two

by drawing one line through the point (N.xval, N.yval)

Point quadtree: a node N splits a region by drawing a horizontal and a vertical line through the point (N.xval, N.yval)

Four parts: NW, SW, NE, and SE quadrants Q: Quadtree nodes have 4 children?

Page 28: Temple University – CIS Dept. CIS616– Principles of Data Management

Point Quadtrees

Nodes in point quadtrees represent regions

Page 29: Temple University – CIS Dept. CIS616– Principles of Data Management

Point quadtrees - Insertion

City (Xval, Yval)

Banja Luka (19, 45)

Derventa (40, 50)

Toslic (38, 38)

Tuzla (54, 35)

Sinj (4, 4)Splitting of region by Banja LukaSplitting of region by Derventa

Splitting of region by Toslic Splitting of region by SinjSplitting of region by Tuzla

Page 30: Temple University – CIS Dept. CIS616– Principles of Data Management

Point Quadtrees - Insertion

Page 31: Temple University – CIS Dept. CIS616– Principles of Data Management

Point quadtrees: Deletion Deletion of point (x,y) from T

If N is a leaf node easy Otherwise a subtree (N.NW, N.SW, N.NE. N.SE) is

non-empty Find a “candidate replacement” node R in one of the

subtrees such that: Every other node R1 in N.NW is to the NW of R Every other node R2 in N.SW is to the SW of R etc…

Replace all of N’s non-link fields by those of R Recursively delete R from Ti

In general, it may not always be possible to find such as replacement node

Q: What happens in the worst case?

Page 32: Temple University – CIS Dept. CIS616– Principles of Data Management

Point quadtrees: Deletion Deletion of point (x,y) from T

If N is a leaf node easy Otherwise a subtree (N.NW, N.SW, N.NE. N.SE) is non-

empty Find a “candidate replacement” node R in one of the

subtrees such that: Every other node R1 in N.NW is to the NW of R Every other node R2 in N.SW is to the SW of R etc…

Replace all of N’s non-link fields by those of R Recursively delete R from Ti

In general, it may not always be possible to find such as replacement node

Q: What happens in the worst case? May require all nodes to be reinserted

Page 33: Temple University – CIS Dept. CIS616– Principles of Data Management

Point quadtrees: Range Searches

Each node in a point quadtree represents a region

Do not search regions that do not intersect the circle defined by the query

Page 34: Temple University – CIS Dept. CIS616– Principles of Data Management

SAMs - Detailed outline

spatial access methods problem dfn k-d trees point quadtrees MX-quadtrees z-ordering R-trees

Page 35: Temple University – CIS Dept. CIS616– Principles of Data Management

MX-Quadtrees

Drawbacks of 2-d trees, point quadtrees: shape of tree depends upon the order in

which objects are inserted into the tree splits may be uneven depending upon where

the point (N.xval, N.yval) is located inside the region (represented by N)

MX-quadtrees: shape (and height) of tree independent of number of nodes and order of insertion

Page 36: Temple University – CIS Dept. CIS616– Principles of Data Management

MX-Quadtrees

Assumption: the map is represented as a grid of size (2k x 2k) for some k

When a region gets “split” it splits down the middle

Page 37: Temple University – CIS Dept. CIS616– Principles of Data Management

MX-Quadtrees - Insertion

After insertion of A, B, C, and D respectively

Page 38: Temple University – CIS Dept. CIS616– Principles of Data Management

MX-Quadtrees - Insertion

After insertion of A, B, C, and D respectively

Page 39: Temple University – CIS Dept. CIS616– Principles of Data Management

MX-Quadtrees - Deletion

Fairly easy – why? All point are represented at the

leaf level Total time for deletion: O(k)

Page 40: Temple University – CIS Dept. CIS616– Principles of Data Management

MX-Quadtrees –Range Queries

Same as in point quadtrees One difference:

Checking to see if a point is in the circle defined by the range query needs to be performed at the leaf level (points are stored at the leaf level)

Page 41: Temple University – CIS Dept. CIS616– Principles of Data Management

SAMs - Detailed outline

spatial access methods problem dfn k-d trees point quadtrees MX-quadtrees z-ordering R-trees

Page 42: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering

Q: how would you organize, e.g., n-dim points, on disk? (C points per disk page)

Hint: reduce the problem to 1-d points(!!)

Q1: why?A:Q2: how?

Page 43: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering

Q: how would you organize, e.g., n-dim points, on disk? (C points per disk page)

Hint: reduce the problem to 1-d points (!!)

Q1: why?A: B-trees!Q2: how?

Page 44: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering

Q2: how?A: assume finite granularity; z-

ordering = bit-shuffling = N-trees = Morton keys = geo-coding = ...

Page 45: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering

Q2: how?A: assume finite granularity (e.g.,

232x232 ; 4x4 here)Q2.1: how to map n-d cells to 1-d

cells?

Page 46: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering

Q2.1: how to map n-d cells to 1-d cells?

Page 47: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering

Q2.1: how to map n-d cells to 1-d cells?

A: row-wiseQ: is it good?

Page 48: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering

Q: is it good?A: great for ‘x’ axis; bad for ‘y’ axis

Page 49: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering

Q: How about the ‘snake’ curve?

Page 50: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering

Q: How about the ‘snake’ curve?A: still problems:

2^32

2^32

Page 51: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering

Q: Why are those curves ‘bad’?A: no distance preservation (~

clustering)Q: solution?

2^32

2^32

Page 52: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering

Q: solution? (w/ good clustering, and easy to compute, for 2-d and n-d?)

Page 53: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering

Q: solution? (w/ good clustering, and easy to compute, for 2-d and n-d?)

A: z-ordering/bit-shuffling/linear-quadtrees

‘looks’ better:• few long jumps;• scoops out the whole quadrant before leaving it• a.k.a. space filling curves

Page 54: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering

z-ordering/bit-shuffling/linear-quadtrees

Q: How to generate this curve (z = f(x,y) )?

A: 3 (equivalent) answers!

Page 55: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering

z-ordering/bit-shuffling/linear-quadtrees

Q: How to generate this curve (z = f(x,y))?

A1: ‘z’ (or ‘N’) shapes, RECURSIVELY

order-1 order-2... order (n+1)

Page 56: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering

Notice: self similar (we’ll see about

fractals, soon) method is hard to use: z =? f(x,y)

order-1 order-2... order (n+1)

Page 57: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering

z-ordering/bit-shuffling/linear-quadtrees

Q: How to generate this curve (z = f(x,y) )?

A: 3 (equivalent) answers!

Method #2?

Page 58: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering

bit-shuffling

0001

1011

11100100

x

y

x0 0

y1 1

z =( 0 1 0 1 )2 = 5

Page 59: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering

bit-shuffling

0001

1011

11100100

x

y

x0 0

y1 1

z =( 0 1 0 1 )2 = 5

How about the reverse:

(x,y) = g(z) ?

Page 60: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering

bit-shuffling

0001

1011

11100100

x

y

x0 0

y1 1

z =( 0 1 0 1 )2 = 5

How about n-d spaces?

Page 61: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering

z-ordering/bit-shuffling/linear-quadtrees

Q: How to generate this curve (z = f(x,y) )?

A: 3 (equivalent) answers!

Method #3?

Page 62: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering

linear-quadtrees : assign N->1, S->0 e.t.c.

0 1

0

1

00...

01...

10...

11...

W E

N

S

Page 63: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering

... and repeat recursively. Eg.: zgray-cell

= WN;WN = (0101)2 = 5

0 1

0

1

00...

01...

10...

11...

W E

N

S

00 11

Page 64: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering

Drill: z-value of grey cell, with the three methods?

0 1

0

1

W E

N

S

Page 65: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering

Drill: z-value of grey cell, with the three methods?

0 1

0

1

W E

N

S

method#1: 14method#2: shuffle(11;10)= (1110)2 = 14

Page 66: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering

Drill: z-value of grey cell, with the three methods?

0 1

0

1

W E

N

S

method#1: 14method#2: shuffle(11;10)= (1110)2 = 14method#3: EN;ES = ... = 14

Page 67: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering - Detailed outline

spatial access methods z-ordering

main idea - 3 methods use w/ B-trees; algorithms (range, knn

queries ...) non-point (eg., region) data analysis; variations

R-trees

Page 68: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering - usage & algo’s

Q1: How to store on disk?A:Q2: How to answer range queries etc

Page 69: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering - usage & algo’s

Q1: How to store on disk?A: treat z-value as primary key; feed to

B-tree

SF

PGHz cname etc

5 SF12 PGH

Page 70: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering - usage & algo’s

MAJOR ADVANTAGES w/ B-tree: already inside commercial systems (no

coding /debugging!) concurrency & recovery is ready

SF

PGHz cname etc

5 SF12 PGH

Page 71: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering - Detailed outline

spatial access methods z-ordering

main idea - 3 methods use w/ B-trees; algorithms (range, knn

queries ...) non-point (eg., region) data analysis; variations

R-trees

Page 72: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering - variations

Q: is z-ordering the best we can do?

Page 73: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering - variations

Q: is z-ordering the best we can do?A: probably not - occasional long

‘jumps’Q: then?

Page 74: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering - variations

Q: is z-ordering the best we can do?A: probably not - occasional long

‘jumps’Q: then? A1: Gray codes

Page 75: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering - variations

A2: Hilbert curve! (a.k.a. Hilbert-Peano curve)

Page 76: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering - variations

‘Looks’ better (never long jumps). How to derive it?

Page 77: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering - variations

‘Looks’ better (never long jumps). How to derive it?

order-1 order-2 ... order (n+1)

Page 78: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering - variations

Q: function for the Hilbert curve ( h = f(x,y) )?

A: bit-shuffling, followed by post-processing, to account for rotations. Linear on # bits. See textbook, for pointers to

code/algorithms (eg., [Jagadish, 90])

Page 79: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering - variations

Q: how about Hilbert curve in 3-d? n-d?

A: Exists (and is not unique!). Eg., 3-d, order-1 Hilbert curves (Hamiltonian paths on cube)#1 #2

Page 80: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering - Detailed outline

spatial access methods z-ordering

main idea - 3 methods use w/ B-trees; algorithms (range, knn

queries ...) non-point (eg., region) data analysis; variations

R-trees ...

Page 81: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering - analysis

Q: How many pieces (‘quad-tree blocks’) per region?

A: proportional to perimeter (surface etc)

Page 82: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering - analysis

(How long is the coastline, say, of England?

Paradox: The answer changes with the yard-stick -> fractals ...)

Page 83: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering - analysis

Q: Should we decompose a region to full detail (and store in B-tree)?

Page 84: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering - analysis

Q: Should we decompose a region to full detail (and store in B-tree)?

A: NO! approximation with 1-3 pieces/z-values is best [Orenstein90]

Page 85: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering - analysis

Q: how to measure the ‘goodness’ of a curve?

Page 86: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering - analysis

Q: how to measure the ‘goodness’ of a curve?

A: e.g., avg. # of runs, for range queries

4 runs 3 runs(#runs ~ #disk accesses on B-tree)

Page 87: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering - analysis

Q: So, is Hilbert really better?A: 27% fewer runs, for 2-d (similar for

3-d)

Q: are there formulas for #runs, #of quadtree blocks etc?

A: Yes ([Jagadish; Moon+ etc] see textbook)

Page 88: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering - fun observations

Hilbert and z-ordering curves: “space filling curves”: eventually, they visit every point

in n-d space - therefore:

order-1 order-2 ... order (n+1)

Page 89: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering - fun observations

... they show that the plane has as many points as a line (-> headaches for 1900’s mathematics/topology). (fractals, again!)

order-1 order-2 ... order (n+1)

Page 90: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering - fun observations

Observation #2: Hilbert (like) curve for video encoding [Y. Matias+, CRYPTO ‘87]:

Given a frame, visit its pixels in randomized

hilbert order; compress; and transmit

Page 91: Temple University – CIS Dept. CIS616– Principles of Data Management

z-ordering - fun observations

In general, Hilbert curve is great for preserving distances, clustering, vector quantization etc

Page 92: Temple University – CIS Dept. CIS616– Principles of Data Management

Conclusions

z-ordering is a great idea (n-d points -> 1-d points; feed to B-trees)

used by TIGER system and (most probably) by other GIS products

works great with low-dim points

Page 93: Temple University – CIS Dept. CIS616– Principles of Data Management

SAMs – Detailed Outline

spatial access methods problem dfn k-d trees point quadtrees MX-quadtrees z-ordering R-trees

Page 94: Temple University – CIS Dept. CIS616– Principles of Data Management

SAMs - more detailed outline

R-trees main idea; file structure (algorithms: insertion/split) (deletion) (search: range, nn, spatial joins) variations (packed; hilbert;...)

Page 95: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees

z-ordering: cuts regions to pieces -> dup. elim.

how could we avoid that? Idea: Minimum Bounding

Rectangles

Page 96: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees

[Guttman 84] Main idea: allow parents to overlap! => guaranteed 50% utilization => easier insertion/split algorithms. (only deal with Minimum Bounding

Rectangles - MBRs)

Page 97: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees

eg., w/ fanout 4: group nearby rectangles to parent MBRs; each group -> disk page

A

B

C

DE

FG

H

I

J

Page 98: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees

eg., w/ fanout 4:

A

B

C

DE

FG

H

I

J

P1

P2

P3

P4

F GD E

H I JA B C

Page 99: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees

eg., w/ fanout 4:

A

B

C

DE

FG

H

I

J

P1

P2

P3

P4

P1 P2 P3 P4

F GD E

H I JA B C

Page 100: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - format of nodes

{(MBR; obj-ptr)} for leaf nodes

P1 P2 P3 P4

A B C

x-low; x-highy-low; y-high

...

objptr ...

Page 101: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - format of nodes

{(MBR; node-ptr)} for non-leaf nodes

P1 P2 P3 P4

A B C

x-low; x-highy-low; y-high

...

nodeptr ...

Page 102: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - range search?

A

B

C

DE

FG

H

I

J

P1

P2

P3

P4

P1 P2 P3 P4

F GD E

H I JA B C

Page 103: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - range search?

A

B

C

DE

FG

H

I

J

P1

P2

P3

P4

P1 P2 P3 P4

F GD E

H I JA B C

Page 104: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - range search

Observations: every parent node completely covers its

‘children’ a child MBR may be covered by more than

one parent - it is stored under ONLY ONE of them. (i.e., no need for dup. elim.)

a point query may follow multiple branches.

everything works for any dimensionality

Page 105: Temple University – CIS Dept. CIS616– Principles of Data Management

SAMs - more detailed outline

R-trees main idea; file structure algorithms: insertion/split deletion search: range, nn, spatial joins performance analysis variations (packed; hilbert;...)

Page 106: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - insertion

eg., rectangle ‘X’

A

B

C

DE

FG

H

I

J

P1

P2

P3

P4

P1 P2 P3 P4

F GD E

H I JA B CX

Page 107: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - insertion

eg., rectangle ‘X’

A

B

C

DE

FG

H

I

J

P1

P2

P3

P4

P1 P2 P3 P4

F GD E

H I JA B CX

X

Page 108: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - insertion

eg., rectangle ‘Y’

A

B

C

DE

FG

H

I

J

P1

P2

P3

P4

P1 P2 P3 P4

F GD E

H I JA B CY

Page 109: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - insertion

eg., rectangle ‘Y’: extend suitable parent.

A

B

C

DE

FG

H

I

J

P1

P2

P3

P4

P1 P2 P3 P4

F GD E

H I JA B CY

Y

Page 110: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - insertion

eg., rectangle ‘Y’: extend suitable parent.

Q: how to measure ‘suitability’?

Page 111: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - insertion

eg., rectangle ‘Y’: extend suitable parent.

Q: how to measure ‘suitability’? A: by increase in area (volume)

(more details: later, under ‘performance analysis’)

Q: what if there is no room? how to split?

Page 112: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - insertion

eg., rectangle ‘W’

A

B

C

DE

FG

H

I

J

P1

P2

P3

P4

P1 P2 P3 P4

F GD E

H I JA B C

W

K

K

Page 113: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - insertion

eg., rectangle ‘W’ - focus on ‘P1’ - how to split?

A

B

C

P1

W

K

Page 114: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - insertion

eg., rectangle ‘W’ - focus on ‘P1’ - how to split?

A

B

C

P1

W

K • (A1: plane sweep,

until 50% of rectangles)

• A2: ‘linear’ split

• A3: quadratic split

• A4: exponential split

Page 115: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - insertion & split

pick two rectangles as ‘seeds’; assign each rectangle ‘R’ to the

‘closest’ ‘seed’

seed1

seed2

R

Page 116: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - insertion & split

pick two rectangles as ‘seeds’; assign each rectangle ‘R’ to the

‘closest’ ‘seed’ Q: how to measure ‘closeness’?

Page 117: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - insertion & split

pick two rectangles as ‘seeds’; assign each rectangle ‘R’ to the

‘closest’ ‘seed’ Q: how to measure ‘closeness’? A: by increase of area (volume)

Page 118: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - insertion & split

pick two rectangles as ‘seeds’; assign each rectangle ‘R’ to the

‘closest’ ‘seed’

seed1

seed2

R

Page 119: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - insertion & split

pick two rectangles as ‘seeds’; assign each rectangle ‘R’ to the

‘closest’ ‘seed’

seed1

seed2

R

Page 120: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - insertion & split

pick two rectangles as ‘seeds’; assign each rectangle ‘R’ to the

‘closest’ ‘seed’ smart idea: pre-sort rectangles

according to delta of closeness (ie., schedule easiest choices first!)

Page 121: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - insertion - pseudocode

decide which parent to put new rectangle into (‘closest’ parent)

if overflow, split to two, using (say,) the quadratic split algorithm propagate the split upwards, if

necessary update the MBRs of the affected

parents.

Page 122: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - insertion - observations

many more split algorithms exist (next!)

Page 123: Temple University – CIS Dept. CIS616– Principles of Data Management

SAMs - more detailed outline

R-trees main idea; file structure algorithms: insertion/split deletion search: range, nn, spatial joins performance analysis variations (packed; hilbert;...)

Page 124: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - deletion

delete rectangle if underflow

??

Page 125: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - deletion

delete rectangle if underflow

temporarily delete all siblings (!); delete the parent node and re-insert them

Page 126: Temple University – CIS Dept. CIS616– Principles of Data Management

SAMs - more detailed outline

R-trees main idea; file structure algorithms: insertion/split deletion search: range, nn, spatial joins performance analysis variations (packed; hilbert;...)

Page 127: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - range search

pseudocode: check the root for each branch, if its MBR intersects the query

rectangle apply range-search (or print out, if

this is a leaf)

Page 128: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - nn search

A

B

C

DE

FG

H

I

J

P1

P2

P3

P4q

Page 129: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - nn search

Q: How? (find near neighbor; refine...)

A

B

C

DE

FG

H

I

J

P1

P2

P3

P4q

Page 130: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - nn search

A1: depth-first search; then, range query

A

B

C

DE

FG

H

I

J

P1

P2

P3

P4q

Page 131: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - nn search

A1: depth-first search; then, range query

A

B

C

DE

FG

H

I

J

P1

P2

P3

P4q

Page 132: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - nn search

A1: depth-first search; then, range query

A

B

C

DE

FG

H

I

J

P1

P2

P3

P4q

Page 133: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - nn search

A2: [Roussopoulos+, sigmod95]: priority queue, with promising MBRs,

and their best and worst-case distance

main idea:

Page 134: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - nn search

A

B

C

DE

FG

H

I

J

P1

P2

P3

P4q

consider only P2 and P4, for illustration

Page 135: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - nn search

DE

H

J

P2P4q

worst of P2

best of P4=> P4 is useless

for 1-nn

Page 136: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - nn search

DE

P2q

worst of P2

what is really the worst of, say, P2?

Page 137: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - nn search

P2q

what is really the worst of, say, P2? A: the smallest of the two red

segments!

Page 138: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - nn search

variations: [Hjaltason & Samet] incremental nn: build a priority queue scan enough of the tree, to make sure you

have the k nn to find the (k+1)-th, check the queue, and

scan some more of the tree ‘optimal’ (but, may need too much

memory)

Page 139: Temple University – CIS Dept. CIS616– Principles of Data Management

SAMs - more detailed outline

R-trees main idea; file structure algorithms: insertion/split deletion search: range, nn, spatial joins performance analysis variations (packed; hilbert;...)

Page 140: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - spatial joins

Spatial joins: find (quickly) all counties intersecting

lakes

Page 141: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - spatial joins

Assume that they are both organized in R-trees:

Page 142: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - spatial joins

for each parent P1 of tree T1 for each parent P2 of tree T2 if their MBRs intersect, process them recursively (ie., check

their children)

Page 143: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - spatial joins

Improvements - variations:- [Seeger+, sigmod 92]: do some pre-

filtering; do plane-sweeping to avoid N1 * N2 tests for intersection

- [Lo & Ravishankar, sigmod 94]: ‘seeded’ R-trees

(FYI, many more papers on spatial joins, without R-trees: [Koudas+ Sevcik], e.t.c.)

Page 144: Temple University – CIS Dept. CIS616– Principles of Data Management

SAMs - more detailed outline

R-trees main idea; file structure algorithms: insertion/split deletion search: range, nn, spatial joins variations (packed; hilbert;...)

Page 145: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - variations

Guttman’s R-trees sparked much follow-up work

can we do better splits? what about static datasets (no

ins/del/upd)? what about other bounding shapes?

Page 146: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - variations

Guttman’s R-trees sparked much follow-up work

can we do better splits? i.e, defer splits?

Page 147: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - variations

A: R*-trees [Kriegel+, SIGMOD90] defer splits, by forced-reinsert, i.e.:

instead of splitting, temporarily delete some entries, shrink overflowing MBR, and re-insert those entries

Which ones to re-insert? How many?

Page 148: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - variations

A: R*-trees [Kriegel+, SIGMOD90] defer splits, by forced-reinsert, i.e.:

instead of splitting, temporarily delete some entries, shrink overflowing MBR, and re-insert those entries

Which ones to re-insert? How many? A: 30%

Page 149: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - variations

Q: Other ways to defer splits?

Page 150: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - variations

Q: Other ways to defer splits?A: Push a few keys to the closest

sibling node (closest = ??)

Page 151: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - variations

R*-trees: Also try to minimize area AND perimeter, in their split.

Performance: higher space utilization; faster than plain R-trees. One of the most successful R-tree variants.

Page 152: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - variations

Guttman’s R-trees sparked much follow-up work

can we do better splits? what about static datasets (no

ins/del/upd)? Hilbert R-trees

what about other bounding shapes?

Page 153: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - variations

what about static datasets (no ins/del/upd)?

Q: Best way to pack points?

Page 154: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - variations

what about static datasets (no ins/del/upd)?

Q: Best way to pack points? A1: plane-sweep great for queries on ‘x’; terrible for ‘y’

Page 155: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - variations

what about static datasets (no ins/del/upd)?

Q: Best way to pack points? A1: plane-sweep great for queries on ‘x’; bad for ‘y’

Page 156: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - variations

what about static datasets (no ins/del/upd)?

Q: Best way to pack points? A1: plane-sweep great for queries on ‘x’; terrible for ‘y’ Q: how to improve?

Page 157: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - variations

A: plane-sweep on HILBERT curve!

Page 158: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - variations

A: plane-sweep on HILBERT curve! In fact, it can be made dynamic

(how?), as well as to handle regions (how?)

A: [Kamel+, VLDB94]

Page 159: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - variations

Guttman’s R-trees sparked much follow-up work

can we do better splits? what about static datasets (no

ins/del/upd)? what about other bounding shapes?

Page 160: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - variations

what about other bounding shapes? (and why?)

A1: arbitrary-orientation lines (cell-tree, [Guenther]

A2: P-trees (polygon trees) (MB polygon: 0, 90, 45, 135 degree lines)

Page 161: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - variations

A3: L-shapes; holes (hB-tree) A4: TV-trees [Lin+, VLDB-Journal 1994] A5: SR-trees [Katayama+, SIGMOD97]

(used in Informedia)

Page 162: Temple University – CIS Dept. CIS616– Principles of Data Management

R-trees - conclusions

Popular method; like multi-d B-trees guaranteed utilization good search times (for low-dim. at least) R*-, Hilbert- and SR-trees: still used IBM (Informix) ships DataBlade with R-

trees

Page 163: Temple University – CIS Dept. CIS616– Principles of Data Management

References

Guttman, A. (June 1984). R-Trees: A Dynamic Index Structure for Spatial Searching. Proc. ACM SIGMOD, Boston, Mass.

Jagadish, H. V. (May 23-25, 1990). Linear Clustering of Objects with Multiple Attributes. ACM SIGMOD Conf., Atlantic City, NJ.

Lin, K.-I., H. V. Jagadish, et al. (Oct. 1994). “The TV-tree - An Index Structure for High-dimensional Data.” VLDB Journal 3: 517-542.

Page 164: Temple University – CIS Dept. CIS616– Principles of Data Management

References, cont’d

Pagel, B., H. Six, et al. (May 1993). Towards an Analysis of Range Query Performance. Proc. of ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), Washington, D.C.

Robinson, J. T. (1981). The k-D-B-Tree: A Search Structure for Large Multidimensional Dynamic Indexes. Proc. ACM SIGMOD.

Roussopoulos, N., S. Kelley, et al. (May 1995). Nearest Neighbor Queries. Proc. of ACM-SIGMOD, San Jose, CA.