spatial data requires special data structures, similar to b-trees

50

Upload: sabina-sullivan

Post on 21-Dec-2015

229 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Spatial data requires special data structures, similar to B-trees
Page 2: Spatial data requires special data structures, similar to B-trees

Spatial data requires special data structures, similar to B-trees

Page 3: Spatial data requires special data structures, similar to B-trees

Types of spatial data Example of spatial and geometric data –

splines and Voronoi diagrams One-dimensional index – interval trees R-trees T-tree variants

Page 4: Spatial data requires special data structures, similar to B-trees

Point Data◦ Points in a multidimensional space◦ E.g., Raster data such as satellite imagery,

where each pixel stores a measured value◦ E.g., Feature vectors extracted from text

Region Data◦ Objects have spatial extent with location and

boundary◦ DB typically uses geometric approximations

constructed using line segments, polygons, etc., called vector data.

Page 5: Spatial data requires special data structures, similar to B-trees

Different types of sampling are used to collect data

Page 6: Spatial data requires special data structures, similar to B-trees
Page 7: Spatial data requires special data structures, similar to B-trees
Page 8: Spatial data requires special data structures, similar to B-trees
Page 9: Spatial data requires special data structures, similar to B-trees
Page 10: Spatial data requires special data structures, similar to B-trees

Low-pass filter – the value for the cell is computed as average of other cells

High-pass-continuous surface –low pass

Page 11: Spatial data requires special data structures, similar to B-trees

Window size has effect on filtering

Page 12: Spatial data requires special data structures, similar to B-trees
Page 13: Spatial data requires special data structures, similar to B-trees

Spatial Range Queries◦ Find all cities within 50 miles of Madison◦ Query has associated region (location, boundary)◦ Answer includes ovelapping or contained data regions

Nearest-Neighbor Queries◦ Find the 10 cities nearest to Madison◦ Results must be ordered by proximity

Spatial Join Queries◦ Find all cities near a lake◦ Expensive, join condition involves regions and proximity

Page 14: Spatial data requires special data structures, similar to B-trees

Geographic Information Systems (GIS)◦ E.g., ESRI’s ArcInfo; OpenGIS Consortium◦ Geospatial information◦ All classes of spatial queries and data are common

Computer-Aided Design/Manufacturing◦ Store spatial objects such as surface of airplane fuselage◦ Range queries and spatial join queries are common

Multimedia Databases◦ Images, video, text, etc. stored and retrieved by content◦ First converted to feature vector form; high

dimensionality◦ Nearest-neighbor queries are the most common

Page 15: Spatial data requires special data structures, similar to B-trees

B+ trees are fundamentally single-dimensional indexes. When we create a composite search key B+ tree, e.g.,

an index on <age, sal>, we effectively linearize the 2-dimensional space since we sort entries first by age and then by sal.

Consider entries:<11, 80>, <12, 10><12, 20>, <13, 75>

11 12 13

70605040302010

80

B+ treeorder

Page 16: Spatial data requires special data structures, similar to B-trees

A multidimensional index clusters entries so as to exploit “nearness” in multidimensional space.

Keeping track of entries and maintaining a balanced index structure presents a challenge!

Consider entries:<11, 80>, <12, 10><12, 20>, <13, 75>

Spatialclusters

70605040302010

80

B+ treeorder

11 12 13

Page 17: Spatial data requires special data structures, similar to B-trees

Spatial queries (GIS, CAD).◦ Find all hotels within a radius of 5 miles from the

conference venue.◦ Find the city with population 500,000 or more that is

nearest to Kalamazoo, MI.◦ Find all cities that lie on the Nile in Egypt.◦ Find all parts that touch the fuselage (in a plane

design). Multidimensional range queries.

◦ 50 < age < 55 AND 80K < sal < 90K

Page 18: Spatial data requires special data structures, similar to B-trees

Similarity queries (content-based retrieval).◦ Given a face, find the five

most similar faces/expressions.

Page 19: Spatial data requires special data structures, similar to B-trees

Geometric, 1-dimensional tree Interval is defined by (x1,x2) Split at the middle (5), again at the middle

(3,7), again at the middle (2,8) All intervals intersecting a middle point are

stored at the corresponding root.

1 2 3 4 5 6 7 8 9

(4,6) (4,8)

(6,9)

(7.5,8.5)

(2,4)

Page 20: Spatial data requires special data structures, similar to B-trees

Finding intervals – by finding x1, x2 against the nodes

Find interval containing specific value – from the root

Sort intervals within each node of the tree according to their coorsinates

Cost of the “stabbing query”– finding all intervals containing the specified value is O(log n + k), where k is the number of reported intervals.

Page 21: Spatial data requires special data structures, similar to B-trees

Constructs the minimal bounding box (mbb)

Check validity (predicate) on mbb Refinement step verifies if actual

objects satisfy the predicate.

Page 22: Spatial data requires special data structures, similar to B-trees

An index based on spatial location needed.◦ One-dimensional indexes don’t support

multidimensional searching efficiently. ◦ Hash indexes only support point queries; want

to support range queries as well.◦ Must support inserts and deletes gracefully.

Ideally, want to support non-point data as well (e.g., lines, shapes).

The R-tree meets these requirements, and variants are widely used today.

Page 23: Spatial data requires special data structures, similar to B-trees

The R-tree is a tree-structured index that remains balanced on inserts and deletes.

Each key stored in a leaf entry is intuitively a box, or collection of intervals, with one interval per dimension.

Example in 2-D:

X

Y

Root ofR Tree

Leaf level

Page 24: Spatial data requires special data structures, similar to B-trees

Leaf entry = < n-dimensional box, rid >◦ This is Alternative (2), with key value being a box.◦ Box is the tightest bounding box for a data object.

Non-leaf entry = < n-dim box, ptr to child node >◦ Box covers all boxes in child node (in fact, subtree).

All leaves at same distance from root. Nodes can be kept 50% full (except root).

◦ Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Page 25: Spatial data requires special data structures, similar to B-trees

R8R9

R10

R11

R12

R17

R18

R19

R13

R14

R15

R16

R1

R2

R3

R4

R5

R6

R7

Leaf entry

Index entry

Spatial objectapproximated by bounding box R8

Page 26: Spatial data requires special data structures, similar to B-trees

R1 R2

R3 R4 R5 R6 R7

R8 R9 R10 R11R12 R13R14 R15R16 R17R18R19

Page 27: Spatial data requires special data structures, similar to B-trees

Start at root.1. If current node is non-leaf, for each entry <E, ptr>, if box E overlaps Q, search subtree identified by ptr.2. If current node is leaf, for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.

Note: May have to search several subtrees at each node!(In contrast, a B-tree equality search goes to just one leaf.)

Page 28: Spatial data requires special data structures, similar to B-trees

It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.

But why not use convex polygons to approximate query regions more accurately?◦ Will reduce overlap with nodes in tree, and reduce

the number of nodes fetched by avoiding some branches altogether.

◦ Cost of overlap test is higher than bounding box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.

Page 29: Spatial data requires special data structures, similar to B-trees

Start at root and go down to “best-fit” leaf L.◦ Go to child whose box needs least enlargement to

cover B; resolve ties by going to smallest area child.

If best-fit leaf L has space, insert entry and stop. Otherwise, split L into L1 and L2.◦ Adjust entry for L in its parent so that the box now

covers (only) L1.◦ Add an entry (in the parent node of L) for L2. (This

could cause the parent node to recursively split.)

Page 30: Spatial data requires special data structures, similar to B-trees

The entries in node L plus the newly inserted entry must be distributed between L1 and L2.

Goal is to reduce likelihood of both L1 and L2 being searched on subsequent queries.

Idea: Redistribute so as to minimize area of L1 plus area of L2.

Exhaustive algorithm is too slow; quadratic and linear heuristics are popular in research. GOOD SPLIT!

BAD!

Page 31: Spatial data requires special data structures, similar to B-trees

The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:◦ Remove some (say, 30% of the) entries and reinsert them

into the tree. ◦ Could result in all reinserted entries fitting on some existing

pages, avoiding a split. R* trees also use a different heuristic, minimizing box

perimeters rather than box areas during insertion. Another variant, the R+ tree, avoids overlap by

inserting an object into multiple leaves if necessary.◦ Searches now take a single path to a leaf, at cost of

redundancy.

Page 32: Spatial data requires special data structures, similar to B-trees

The Generalized Search Tree (GiST) abstracts the “tree” nature of a class of indexes including B+ trees and R-tree variants.◦ Striking similarities in insert/delete/search and even

concurrency control algorithms make it possible to provide “templates” for these algorithms that can be customized to obtain the many different tree index structures.

◦ B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.

◦ GiST provides an alternative for implementing other tree indexes.

Page 33: Spatial data requires special data structures, similar to B-trees

Typically, high-dimensional datasets are collections of points, not regions.◦ E.g., Feature vectors in multimedia applications.◦ Very sparse

Nearest neighbor queries are common.◦ R-tree becomes worse than sequential scan for most

datasets with more than a dozen dimensions. As dimensionality increases contrast (ratio of

distances between nearest and farthest points) usually decreases; “nearest neighbor” is not meaningful.

Page 34: Spatial data requires special data structures, similar to B-trees

Spatial data management has many applications, including GIS, CAD/CAM, multimedia indexing.◦ Point and region data◦ Overlap/containment and nearest-neighbor

queries Many approaches to indexing spatial data

◦ R-tree approach is widely used in GIS systems◦ Other approaches include Grid Files, Quad trees,

and techniques based on “space-filling” curves.◦ For high-dimensional datasets, unless data has

good “contrast”, nearest-neighbor may not be well-separated

Page 35: Spatial data requires special data structures, similar to B-trees

Deletion consists of searching for the entry to be deleted, removing it, and if the node becomes under-full, deleting the node and then re-inserting the remaining entries.

Overall, works quite well for 2 and 3 D datasets. Several variants (notably, R+ and R* trees) have been proposed; widely used.

Can improve search performance by using a convex polygon to approximate query shape (instead of a bounding box) and testing for polygon-box intersection.

Page 36: Spatial data requires special data structures, similar to B-trees

"Print Gallery," by M.C. Escher. Curious about the blank spot in the middle of Escher’s 1956 lithograph, Hendrik Lenstra set out to learn whether the artist had encountered a mathematical problem he couldn’t solve. ©2002 Cordon Art B.V., Baarn, Holland. All rights reserved.

Page 37: Spatial data requires special data structures, similar to B-trees

Fixed grid: Stored as a 2D array,

each entry contains a link to a list of points (object) stored in a grid.

a,b

Page 38: Spatial data requires special data structures, similar to B-trees

Too many points in one grid cell:Solution A –overflow (linked list)Solution B- Split the cell and increase index!

Page 39: Spatial data requires special data structures, similar to B-trees

Rectangles may share different grid cells Rectangle duplicates are stored Grid cells are of fixed size

Page 40: Spatial data requires special data structures, similar to B-trees

In a grid file, the index is dynamically increased in size when overflow happens.

The space is split by a vertical or a horizontal line, and then further subdivided when overflow happens!

Index is dynamically growing Boundaries of cells of different sizes are

stores, thus point and stabbing queries are easy

Page 41: Spatial data requires special data structures, similar to B-trees

Instead of using an array as an index, use tree!

Quadtree decomposition – cells are indexed by using quaternary B-tree.

All cells are squares, not polygons. Search in a tree is faster!

Page 42: Spatial data requires special data structures, similar to B-trees

First three levels of a quad tree

Page 43: Spatial data requires special data structures, similar to B-trees

Project #32: PICTURE REPRESENTATION USING QUAD TREES, McGill University:

8 x 8 pixel picture

represented in a quad

tree

Page 44: Spatial data requires special data structures, similar to B-trees

Example of a grid file

Page 45: Spatial data requires special data structures, similar to B-trees

B+ index – actual references to rectangles are stored in the leaves, saving more space+ access time

Label nodes according to Z or “pi” order

Page 46: Spatial data requires special data structures, similar to B-trees

Level of detail increases as the number of quadtree decompositions increases!

Decompositions have indexes of a form:00,01,02,03,10,11,12,13, 2,300301 ,302 ,303 ,31 ,32 ,33 ◦ Stores as Bplus tree

Page 47: Spatial data requires special data structures, similar to B-trees
Page 48: Spatial data requires special data structures, similar to B-trees
Page 49: Spatial data requires special data structures, similar to B-trees

What is spatial data structure? What is the difference between grid and

grid file? Explain how z or p ordering works? Define interval trees Provide example of R-tree List R-tree variants How spatial index structure differs from

regular B+ tree?

Page 50: Spatial data requires special data structures, similar to B-trees

Text 1 instructor’s resources McGill University web space Wikepedia (z order images) Face recognition research SPARCS lab project on image processing