database design and implementation 07.multidim

Upload: sushmsn

Post on 10-Apr-2018

216 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/8/2019 Database design and Implementation 07.multidim

    1/201

    Database Systems Implementation, Bongki Moon 1

    Multidimensional Access Methodsfor Point and Spatial Data

    Ramakrishnan&Gehrke: Chap. 28.1-28.3

    Various articles

    Database Systems Implementation, Bongki Moon 2

    Single-Dimensional Indexes

    B+ trees are fundamentally single-dimensionalindexes.

    When we create a composite search key B+ tree,e.g., an index on , we effectively linearizethe 2-dimensional space since we sort entries firstby age and then by sal.

    Consider entries:, ,

    11 12 13

    70605040302010

    80

    B+ treeorder

  • 8/8/2019 Database design and Implementation 07.multidim

    2/20

    2

    Database Systems Implementation, Bongki Moon 3

    Multidimensional Queries

    Multidimensional equality queries (e.g.) age = 50 AND sal = 70K

    Multidimensional range queries 50 < age < 55 AND 80K < sal < 90K

    Similarity queries (NN; content-based retrieval) Given a face, find the five most similar faces.

    Spatial queries (GIS, CAD)

    Find all hotels within 5 miles from the conference venue. Find the city with population 500,000 or more that isnearest to Kalamazoo, MI.

    Find all cities that lie on the Nile in Egypt. Find all parts that touch the fuselage (in a plane design).

    Database Systems Implementation, Bongki Moon 4

    Desired Properties

    A multidimensional index clusters entries so as toexploit nearness in multidimensional space.

    Symmetric: efficient access w.r.t. all indexed

    attributes.

    Consider entries:, ,

    11 12 13

    70605040302010

    80

    Spatialclusters

  • 8/8/2019 Database design and Implementation 07.multidim

    3/20

    3

    Database Systems Implementation, Bongki Moon 5

    Whats the difficulty?

    An index based on spatial location needed.

    Must support range queries as well aspoint queries.

    Must support inserts and deletes gracefully.

    Need to support non-point data as well.

    (e.g.) lines, shapes, polygons, etc.

    Numerous index structures have been proposed. Points: Grid file, k-d-B tree, hB-tree, etc.

    Spatial: R-tree and its variants (R+-, R*-, P-trees).

    Database Systems Implementation, Bongki Moon 6

    K-d Tree [Bentley 1975]

    Structurally, a binary tree. Different (alternate) discriminatorat each level. Not necessarily balanced. Efficient exact-matching, range queries.

    But, main memory data structure.

  • 8/8/2019 Database design and Implementation 07.multidim

    4/20

    4

    Database Systems Implementation, Bongki Moon 7

    Point Quad-tree [Finkel&Bentley 1974]

    Essentially, multidimensional (binary) search tree. Fanout is 2d for a d-dimensional space.

    Like k-d trees, decomposes the space by hyperplanes.

    Searching is similar to binary search tree.

    At each level, all d key values are compared.

    To insert a new data point, first perform a search

    then, insert the new data as a new leaf node,

    its corresponding partition is divided into 2d-waysubspaces.

    Not necessarily balanced.

    Database Systems Implementation, Bongki Moon 8

    Point quad-tree: example

  • 8/8/2019 Database design and Implementation 07.multidim

    5/20

    5

    Database Systems Implementation, Bongki Moon 9

    Grid Files [Nievergelt 1984]

    Dynamic version of multi-attribute hashing.

    Superimpose a d-dimensional orthogonal grid.

    The subspace represented by each grid cell(bucket) may be of different shape and size.

    Structure

    Scales (for x and y dimensions): stay in memory.

    Directory: mapping grid buckets to data blocks.

    Each bucket corresponds to a disk block. But adjacentbuckets can share the same disk block.

    Database Systems Implementation, Bongki Moon 10

    Grid Files (contd)

    For any exact-match query, only 2 disk accesses. One for directory, and the other for a data block.

    Symmetric structure w.r.t. the index attributesallows efficient range queries w.r.t. all attributes.

    Adapts to non-uniform data distributions. More cuts in scales in the dense data regions.

    Typically cuts occur on pre-defined points in scales.

    If an insertion causes overflow in a data block, If the block is associated with several buckets, split.

    O/W, introduce a new split on a scale; Very expensive.

    How about deletion?

  • 8/8/2019 Database design and Implementation 07.multidim

    6/20

    6

    Database Systems Implementation, Bongki Moon 11

    Grid File: example

    Database Systems Implementation, Bongki Moon 12

    Grid File: disadvantages

    Split and Merge are quite complicated.

    If attributes are correlated, large directory and

    many empty buckets; A variant: Rotated grid file. The most serious problem with grid files is the

    size of directory.

    Directory growsfaster than lineareven for uniformlydistributed data set: O(n1+(d-1)/(dB+1)) [Regnier 1984].

    The Curse of Dimensionality!!

  • 8/8/2019 Database design and Implementation 07.multidim

    7/20

    7

    Database Systems Implementation, Bongki Moon 13

    k-d-B tree [Robinson 1981]

    Combines (adaptive) k-d tree and B-tree.

    Dynamic, multi-way, balanced, fixed-size page.

    Unlike B-tree, 50% utilization is not guaranteed.

    Node structures

    Internal node: a collection of .

    Leaf node: .

    Space decomposition

    Space decomposed in a manner similar to k-d tree. Subspaces at the same level are mutually disjoint (no

    overlap).

    Union of subspaces constitutes the complete space of theirparent.

    Database Systems Implementation, Bongki Moon 14

    k-d-B tree: example

  • 8/8/2019 Database design and Implementation 07.multidim

    8/20

    8

    Database Systems Implementation, Bongki Moon 15

    k-d-B tree: Insertion

    If a leaf node overflows, it is split; about half theentries are shifted to the new leaf node.

    Various heuristics for optimal split.

    Like B-tree, the split may be propagated up the tree.

    However, unlike B-tree, split of an internal node mayaffect the space decomposition of its child nodes.

    The hyperplane used for the new split will split all thedescendant nodes as well. Cascading splits!!

    So, impossible to guarantee minimum storage utilization.

    Database Systems Implementation, Bongki Moon 16

    hB-tree [Lomet&Salzberg 1989]

    A variant of the k-d-B tree.

    Unlike k-d-B tree, split may be done by more thanone hyperplane, allowing L-shaped subspaces.

    Structure Each node is a k-d tree, which represents a L-shaped or brick

    with holes (holey brick).

    The leaves of the k-d tree point to the lower-level nodes.

    More than one pointer can reference the same node.

    Split guarantees 1:2 data distribution in the worstcase.

    Splits are localized; avoid cascading splits.

  • 8/8/2019 Database design and Implementation 07.multidim

    9/20

    9

    Database Systems Implementation, Bongki Moon 17

    hB-tree: example

    The node u represents an L-shaped subspaceA+B+E+G. G in turn represents an L-shaped.

    w u

    Database Systems Implementation, Bongki Moon 18

    Summary: Point Data

    Data objects are points in a multidimensional space.

    Popular queries are Exact-match (point) query,

    Range (including partial-match) query, Nearest-neighbor query.

    K-d trees and quad-trees are main memory indexes.

    Grid file, k-d-B tree, hB-tree are for point data. Grid file: superlinear directory growth.

    K-d-B tree: cascading splits.

    hB-tree: 33% guaranteed utilization, duplicate references.

  • 8/8/2019 Database design and Implementation 07.multidim

    10/20

    10

    Database Systems Implementation, Bongki Moon 19

    The R-Tree [Guttman 1984]

    Like B-tree, R-tree remains balanced on insertsand deletes, and guarantees 50% storageutilization.

    Each key is a minimum bounding rectangle(MBR).

    Example in 2-D:

    X

    Y

    Root ofR Tree

    Leaflevel

    Database Systems Implementation, Bongki Moon 20

    Example: 2-D View

    R8R9

    R10

    R11

    R12

    R17

    R18

    R19

    R13

    R14

    R15

    R16

    R1

    R2

    R3

    R4

    R5

    R6

    R7

    Leaf entry

    Index entry

    Spatial objectapproximated bybounding box R8

  • 8/8/2019 Database design and Implementation 07.multidim

    11/20

    11

    Database Systems Implementation, Bongki Moon 21

    Example: Tree Structure

    R1 R2

    R3 R4 R5 R6 R7

    R8 R9 R10 R11 R12 R13 R14 R15 R16 R17 R18 R19

    Database Systems Implementation, Bongki Moon 22

    R-Tree Properties

    Internal node: a set of pairs

    An MBR covers all the MBRs in subtree.

    Leaf node: a set of pairs

    MBR is the tightest bounding box for a data object.

    Nodes can be kept 50% full (except root).

    Root node has at least two entries.

    Each node except root contains between m andMentries.The 50% utilization can be achieved by properly settingm.

    MBRs (even at the same level) are allowed to overlap.

  • 8/8/2019 Database design and Implementation 07.multidim

    12/20

    12

    Database Systems Implementation, Bongki Moon 23

    Intersection (or Range) Query

    Start at root.1. If current node is non-leaf, for each

    entry , if boxE overlaps Q,search subtree identified by ptr.

    2. If current node is leaf, for each entry, if E overlaps Q, rid identifies

    an object that might overlap Q.

    Note: May have to search severalseveral subtreessubtrees at each node!(In contrast, a B-tree equality search goes to just one leaf.)

    Database Systems Implementation, Bongki Moon 24

    Inserting an Entry

    Algorithm ChooseLeaf Start at root and go down to best-fit leaf node L.

    At each level, choose one among nodes whose MBR needs leastenlargement to cover the MBR of the new object.

    Tie-breaking by going to a child node with smallest area.

    Algorithm SplitNode If best-fit leaf L is already full, split L into L1 and L2.

    Algorithm AdjustTree Adjust the MBR of L (or L1) in its parent while ascending to the root.

    If L is split, add an entry for L2 to its parent. (This could cause theparent node to recursively split.)

  • 8/8/2019 Database design and Implementation 07.multidim

    13/20

    13

    Database Systems Implementation, Bongki Moon 25

    Splitting a Node During Insertion

    The entries in node L plus the newly inserted entry mustbe distributed between L1 and L2.

    Goal is to reduce likelihood of both L1 and L2 beingsearched on subsequent queries.

    Two alternative goals to avoid multiple paths from theroot and to prune the search space:

    Minimize coverage (area)

    Minimize overlap

    GOOD SPLIT!

    BAD!

    Database Systems Implementation, Bongki Moon 26

    Guttmans Split Algorithm

    Exhaustive : O(2M) to try all possible groupings.

    Quadratic : O(M2*d)

    First, find a pair (R,S) that maximizes

    area(MBR(R,S)) - area(MBR(R) MBR(S)).

    For the rest of entries,1) Choose an entry E such that |R S| is maximal, where x is the

    area increase required in GroupR or GroupS to include the entry E.

    2) Add E to a group so that coverage will be enlarged by the leastamount. Tie-breaking by adding to a group with smaller area,then with fewer entries.

    3) Repeat this until all the entries are assigned to groups.

  • 8/8/2019 Database design and Implementation 07.multidim

    14/20

    14

    Database Systems Implementation, Bongki Moon 27

    Guttmans Split Algorithm

    Linear : O(M*d)

    First, find a pair (R,S) for each dimension that maximizesdisti(R, S). Then, pick one among the d pairs with themaximum normalized distance. disti(R,S) is the distance between R and S along i-axis.

    Normalized distance is disti(R,S) / max_distancei.

    Then, add entries to groups so that coverage will beenlarged by the least amount.

    Database Systems Implementation, Bongki Moon 28

    Deleting an Entry

    Deletion consists of searching for the entry to bedeleted, removing it.

    If the node becomes underflow (i.e., contains lessthen the minimum entries), re-insert theremaining entries.

  • 8/8/2019 Database design and Implementation 07.multidim

    15/20

    15

    Database Systems Implementation, Bongki Moon 29

    R*-tree [Beckmann et al. 1990]

    Major insight : node splitting is critical for performance.

    Basically, a set of optimizing techniques for R-tree.

    Algorithm ChooseSubtree

    If the current node is a leaf, done.

    If the current node is a parent of leaf nodes, use minimum overlapenlargement for selecting a leaf node, with tie-breaking by least areaenlargement then smallest area. Complexity is O(M2). An O(MlogM)approximate algorithm was suggested.

    If the current node is a parent of internal nodes, use minimum area

    enlargement for selecting an internal node, with tie-breaking bysmallest area. [Same as Guttmans ChooseLeaf algorithm]

    Split algorithm that considers area, margin and overlap. Based on plane-sweeping paradigm; O(d*MlogM).

    Database Systems Implementation, Bongki Moon 30

    R*-tree : Deferred Splitting

    Algorithm Reinsert: a new idea Deferred Splitting

    Dynamic reorganization byforced-reinsert; at the firsttime overflow at any level, reinsert p out of M+1

    entries. The p entries whose centers are farthest from the

    center of the enclosing MBR are chosen for reinsertion.

    Significant performance improvement for point datasets.

  • 8/8/2019 Database design and Implementation 07.multidim

    16/20

    16

    Database Systems Implementation, Bongki Moon 31

    R*-tree : Node Splitting

    Node splitting based on Plane-Sweeping:

    (1) Sort MBRs by xlow coordinates (for each dimension).

    (2) Generate (M-2m+2) 2-way partitions:k=1: m M-m+1k=2: m+1 M-m..k=M-2m+2: M-m+1 m

    (3) For each dimension i, compute Si = margink, where margink is thesum of perimeters of two MBRs (by each 2-way partitioning).

    (4) Choose a dimensionj, such that Sj is minimal.

    (5) For the dimensionj, select one out of (M-2m+2) 2-way partitionssuch that overlap is minimized. Tie-break by the minimum area.

    Database Systems Implementation, Bongki Moon 32

    R*-tree : Performance Evaluation

  • 8/8/2019 Database design and Implementation 07.multidim

    17/20

    17

    Database Systems Implementation, Bongki Moon 33

    Summary on R-Trees

    Overall, works quite well for low dimensional (d < 10)data sets; widely used.

    Other variants: P-tree [Jagadish 1990]: use a convex polygon for MBR.

    SR-tree [Katayama 1997]: use a spherical bounding region.

    X-tree [Berchtold 1996]: proposed for high dimensional data sets.

    Parallel R-tree [Kamel&Faloutsos 1992]: for multi-disk systems.

    Hilbert R-tree [Kamel&Faloutsos 1994]: use the Hilbert space-filling curve for deferred splitting.

    Packed R-tree [Rous. 1985; Kamel 1993]: to bulk-load static data.

    Still, one of the hottest research tools!!!

    Database Systems Implementation, Bongki Moon 34

    Space-filling Curves: background

    Historical background: [Peano 1890]: showed there exists a continuous and

    surjective mapping, f: [0,1] [0,2]2.

    [Hilbert 1891]: demonstrated its existence by ageometric construction.

  • 8/8/2019 Database design and Implementation 07.multidim

    18/20

    18

    Database Systems Implementation, Bongki Moon 35

    Space-filling Curves forMultidimensional Indexing

    Main difficulty in multidimensional indexing: No total order that preserves spatial proximity.

    If total order found, then one-dimensional access methods can beused. UB-Tree (Universal B-Tree): z-ordering with a variant of B+-Tree

    Linear Mappingby a Space-Filling Curve First, partition the m-D space with a grid.

    Map the coordinates of each grid cell into a unique number by anSFC.

    A spatial object is represented by a list of grid cells intersected.

    The way of mapping determines how closely adjacent cells (ordata in the cells) are stored in an index, or a file on disk. [dataclustering]

    Database Systems Implementation, Bongki Moon 36

    Z-Curve by Bit-Interleaving

    Popular curves are: Z-curve (aka, Morton curve), Hilbert, Gray-coded.

    Z-curve (aka Morton curve) Easy to implement, adopted by commercial DBMS. A single common prefix can be used to represent a set

    of grid cells. [If z1 is the prefix of z2, then the region of z1encloses that of z2.]

    Clustering not as good as Hilbert curve.

  • 8/8/2019 Database design and Implementation 07.multidim

    19/20

    19

    Database Systems Implementation, Bongki Moon 37

    Z-curve for spatial objects: example

    Nine regionsapproximate thepolygon: {011100,011110,110100,01

    001*,0110**,1100**,000111,001***,10010*}

    01001*=010010+010011

    0110**=011000+011001+011010+011011

    Database Systems Implementation, Bongki Moon 38

    Data Clustering by SFC

    SFC preserves data locality while mapping.

    Clusteris a set of consecutive grid points.

    A 2x2 range query below retrieves 4 grid points in

    two clusters, which may amount to two disk seeks.

  • 8/8/2019 Database design and Implementation 07.multidim

    20/20

    Database Systems Implementation, Bongki Moon 39

    Clustering Properties of SFC

    Comparative studies show SFCs have different clusteringproperties. [Jagadish 1990] For a 2x2 range query, the avg number of clusters is

    2.65 (z-curve), 2.5 (Gray-coded), 2.0 (Hilbert curve).

    [Rong 1991] For a 2-d rectangular query, the avg number of clustersby z-curve is (perimeter + 2*one-side)/3.

    [Moon 1996] For a d-dimensional query region of any rectilinear

    polyhedron, the avg number of clusters by Hilbert curve is Sq/2d,where Sq is the hyper-surface area of the query.

    For a 2-d rectangular query, the avg number of clusters byHilbert curve isperimeter/4.