partition based spatial – merge join present by: tony tong (09049620) cleo tsang (09049630) june...
TRANSCRIPT
Partition Based Spatial – Merge Join
Present by:Tony Tong (09049620)
Cleo Tsang (09049630)June Yau (09030360)
Chelsie Chan (10104740)Oengus Lam (10104790)
1
Agenda1. Problem2. Definition of Spatial Join3. PBSM Algorithm
3.1 Filter Step3.2 Refinement Step3.3 Number of Partition3.4 Spatial Partitioning Function
4. Performance4.1 Indexed Nested Loops Join4.2 R-tree Based Join Algorithm4.3 Methodology4.4 None of Indices Pre-exist4.5 In the Presence of Pre-existing Index4.6 CPU Costs
5. Conclusion
2
1. Problem
• In a spatial database system, like GIS, join queries objects involve large amount of memory
• Have no pre computing data for datasets• Usually no index for intermediate result • Solve this join problem efficiently
3
2. Definition of Spatial Join
• An operation of combining 2 or more datasets based on their spatial relationship
Q: “Find all pairs of rivers and cities that intersect”
Join Result Pairs:<r1,c1><r1,c2><r2,c5>r2
r1
c1
c2
c3
c4
c5
4
3. PBSM Algorithm
• Partition Based Spatial-Merge Join (PBSM)• PBSM operates in 2 steps
5
Filter Step
Refinement Step
ID River_Name Length
r1 Margaret River 60km
r2 Brisbane River 344km
ID City_Name County
c1 Perth WA
c2 Brisbane QLD
Unique Identifer(OID)
Input R
Input C
3.1 Filter Step• Purpose:
– To find all objects whose MBR intersects the query rectangle
• For each input (R and C), – Creation of Minimum Bounding
Rectangle (MBR)
– Rough Estimation for Search Region
– Key-Pointer Element in New Input (Rkp and Ckp)(OID + MBR)
6
Input Rkp
<r1, mr1><r2, mr2>
r2r1
c1
c2
c3
c4
c5
Input Ckp
<c1, mc1><c2, mc2>
...
Key-Pointer Element
3.1 Filter Step• Spatial Join (1st Scenario)
– Rkp and Ckp fit into main memory
• Plane-Sweeping Technique– Sort by MBR.xl for Each Input
(Rkp and Ckp)
– Select the MBR in Either Input (e.g. Rkp) with Smallest MBR.xl
– Scan along the x-axis from MBR.xl to MBR.xu to check if MBRr ∩ MBRc
7
r2r1
c1
c2
c3
c4
c5
MBR.xlMBR.xu
Start with the first entry r1, sweep a vertical lineCheck if MBRr1 ∩ MBRc2,add (OIDr1, OIDc2) to result setCheck if MBRr1 ∩ MBRc1,add (OIDr1, OIDc1) to result setScan until MBR.xu,start the next entry
3.1 Filter Step• Spatial Join (2nd Scenario)
– Rkp and Ckp do not fit into main memory
• Spatial Partitioning Technique– Size of Each Partition for both Input
(Rkp and Ckp) can fit into memory simultaneously
– Perform Plane-Sweeping Technique for Preliminary Spatial Join in Each Partition
• Result Pair <OIDR, OIDC>
8
r2r1
c1
c2
c3
c4
c5
Partition 2 Partition 3
Partition 0 Partition 1
3.2 Refinement Step• Purpose
– #1: To eliminate duplicates induced by Partitioning
9
r2c4
c5
Partition 2 Partition 3
Partition 0 Partition 1
Result Pairs:Partition 1: <r2,c5>Partition 3: <r2,c5>, <r2,c4>
– #2: To examine the actual R and S tuples & see if the attributes satisfy join condition
3.2 Refinement Step• Procedure
– #1: Sort OID pairs:• Primary Sort Key: OIDR
• Secondary Sort Key: OIDC
– #2: Read R tuples first, then C
3.3 Number of Partitions• Number of Partition P is computed as:-
whereP: Number of partition
R : Cardinality of RC : Cardinality of C
Sizekey-ptr: Size of a key-pointer element (in bytes)M: Size of main memory (in bytes)
11
M
SizeCRP ptrkey*)(
3.4 Spatial Partitioning Function • Non-Uniform Distributed & Clustered
Spatial Features
• By “Regular” Partitioning Method Large differences in size of partitions
12
Partition 0 Partition 1
Partition 2 Partition 3Universe
3.4 Spatial Partitioning Function Step 1: Regular decomposition of universe into
NT tiles, where NT > = PStep 2: Apply Tile-to-Partition Mapping Scheme
Round Robin OR Hashing
13
“Regular” Partitioning Method “Tile-based” Partitioning Method + Round Robin Mapping Scheme
Partition 0 Partition 1
Partition 2 Partition 3
Tile0/Part0 Tile1/Part1 Tile2/Part2 Tile3/Part0
Tile4/Part1 Tile5/Part2
Tile6/Part0 Tile7/Part1
Tile8/Part2 Tile9/Part0 Tile10/Part1 Tile11/Part2
3.4 Spatial Partitioning Function • What is the PERFECT Spatial Partitioning
Function ?
• Considerations:– Number of Tiles– Tile-to-Partition Mapping Scheme
(Round Robin OR Hashing)
• Data set used for investigation:– Tiger Road Data (62.4MB, 456,613 tuples)– Sequoia Polygon Data (21.9MB, 58,115 tuples)
14
It assigns equal number of tuples to each partition
3.4 Spatial Partitioning Function • Observation:
Partitioning Function improves as No. of Tiles increases More uniform distribution
15
The PERFECT Partitioning Function has a coefficient of variation = 0
Spatial Partitioning FunctionAlternatives: Tiger Road Data
3.4 Spatial Partitioning Function • Observation: ↑ No. of Tiles,
↑ Replication Overhead
16
Replication Overhead:Tiger Road Data (16 Partitions)
Replication Overhead: Sequoia Polygon Data(16 Partitions)
Number of Tiles = An integral multiple of Number of Partitions
Tile0/Part0 Tile1/Part1 Tile2/Part2
Tile3/Part0 Tile4/Part1 Tile5/Part2
Tile6/Part0 Tile7/Part1 Tile8/Part2
r1
c1
c2
Scenario:• No. of Tiles = 9• P = 3• Tile-to-Partition Mapping Scheme = Round Robin
The entire column is being mapped to a single partition ↓ ∴ Replications by partitioning, ↓ Replication overheads
3.4 Spatial Partitioning Function • Observation: ↑ No. of Tiles,
↑ Replication Overhead
18
Replication Overhead:Tiger Road Data (16 Partitions)
Replication Overhead: Sequoia Polygon Data(16 Partitions)
Number of Tiles = An integral multiple of Number of Partitions
4. Performance
V.SV.SIndexed
Nested Loops Join
Indexed Nested Loops
Join
PBSM Join(1024 tiles)PBSM Join(1024 tiles) V.SV.S
R-tree Based Join
R-tree Based Join
19
4.1 Index Nested Loops Join
20
Build an index in R (the smaller input)• Reads the extent R• Extracts the key-pointer (<MBR,OID>)
Sort the key-pointer by MBR
Build R-tree for the key-pointer
Scan on C• For each C, fetch each R
4.2 R-tree Based Join Algorithm
21
Build an R-tree index in both R and C
Find MBR with union set which is not null• Visit the roots• Move down the levels until leaf nodes
Find ID pairs with dataset union which is not null
4.3 Methodology
• Database System: Paradise• Machine: Sun SPARC-10/51
– 64 MB of memory– SunOS Release 4.1.3– One Seagate 2GB disk
22
TIGER file
• Road, Hydrography and Rail data of the United States etc…
• 2 join queries– Road with Hydrography– Between the Road and the Rail data
23
Data Type # of Objects Total Size R-tree Size
Road 656,613 62.4 MB 24.0 MB
Hydrography 122,149 25.2 MB 6.5 MB
Rail 16,844 2.4 MB 1.0 MB
Sequoia 2000 Storage Benchmark
• Polygon– Regions of homogeneous landuse characteristics
in California• Islands
– Holes in the polygon data
24
Data Type # of Objects Total Size R-tree Size
Polygons 58,115 21.9 MB 3.0 MB
Islands 21,007 6.2 MB 1.1 MB
4.4 NONE OF INDICES PRE-EXIST
25
TIGER: Join Road with Hydrograhy
26
PBSM is 48-98% faster
than the R-Tree Based; 93-
300% faster than the Idx.
Nested Loops.
TIGER: Join Road with Rail
27
Rail data: 2.4MB (Index: 1.0MB), fits in buffer pool; Idx. Nested
Loops performs
better than R-Tree Based.
Cluster
• Data continuously , i.e. not randomly distributed
• Data are mostly in sequential order in real life
• Less computationally expensive
28
Clustered TIGER: Join Road with Hydrography
29
PBSM is 40% faster than the R-Tree Based; and 60-80%
faster than the Idx. Nested
Loops.
Costs• Index Building Cost
– Cost of extracting the key-pointers from the input– Sorting the key-pointers– Building the index using the sorted key pointers– If Input is clustered
No sorting key-pointers Cost of building index
• Tree Joining Cost• Refinement Step Cost
30
Sequoia Data
31
PBSM is 13-27% faster
than the R-Tree Based; and 17-
114% faster than the Idx.
Nested Loops.
Summary
• PBSM is better than R-tree and the Indexed Nested Loops based algorithm
• When sizes of 2 inputs differ significantly,Indexed Nested Loops is better than the R-tree based algorithm
• All algorithms improve if join inputs are clustered
32
4.5 IN THE PRESENCE OF PRE-EXISTING INDEX
33
When indices pre-exist on
both the inputs, the R-
tree based algorithm has
the best performance
When indices pre-exist on
both the inputs, the R-
tree based algorithm has
the best performance
TIGER: Join Road with Hydrograhy
34
When an index exists only on the smaller input, PBSM
performs best.
When an index exists only on the smaller input, PBSM
performs best.
TIGER: Join Road with Rail
35
4.6 CPU Cost
• Insert a table• CPU cost > I/O cost• System
– CPU intensive– Much less I/O is needed
36
5. Principal Behind
• Divide and Conquer• Optimization on memory size
6. Playback of this presentation
• Efficient PBSM algorithm• Comparison among different algorithm• Performance Analysis• Clustered Data• Indexed Data
38
Question ?
39