partition based spatial – merge join present by: tony tong (09049620) cleo tsang (09049630) june...

Partition Based Spatial – Merge Join

Present by:Tony Tong (09049620)

Cleo Tsang (09049630)June Yau (09030360)

Chelsie Chan (10104740)Oengus Lam (10104790)

1

Agenda1. Problem2. Definition of Spatial Join3. PBSM Algorithm

3.1 Filter Step3.2 Refinement Step3.3 Number of Partition3.4 Spatial Partitioning Function

4. Performance4.1 Indexed Nested Loops Join4.2 R-tree Based Join Algorithm4.3 Methodology4.4 None of Indices Pre-exist4.5 In the Presence of Pre-existing Index4.6 CPU Costs

5. Conclusion

2

1. Problem

• In a spatial database system, like GIS, join queries objects involve large amount of memory

• Have no pre computing data for datasets• Usually no index for intermediate result • Solve this join problem efficiently

3

2. Definition of Spatial Join

• An operation of combining 2 or more datasets based on their spatial relationship

Q: “Find all pairs of rivers and cities that intersect”

Join Result Pairs:<r1,c1><r1,c2><r2,c5>r2

r1

c1

c2

c3

c4

c5

4

3. PBSM Algorithm

• Partition Based Spatial-Merge Join (PBSM)• PBSM operates in 2 steps

5

Filter Step

Refinement Step

ID River_Name Length

r1 Margaret River 60km

r2 Brisbane River 344km

ID City_Name County

c1 Perth WA

c2 Brisbane QLD

Unique Identifer(OID)

Input R

Input C

3.1 Filter Step• Purpose:

– To find all objects whose MBR intersects the query rectangle

• For each input (R and C), – Creation of Minimum Bounding

Rectangle (MBR)

– Rough Estimation for Search Region

– Key-Pointer Element in New Input (Rkp and Ckp)(OID + MBR)

6

Input Rkp

<r1, mr1><r2, mr2>

r2r1

c1

c2

c3

c4

c5

Input Ckp

<c1, mc1><c2, mc2>

...

Key-Pointer Element

3.1 Filter Step• Spatial Join (1st Scenario)

– Rkp and Ckp fit into main memory

• Plane-Sweeping Technique– Sort by MBR.xl for Each Input

(Rkp and Ckp)

– Select the MBR in Either Input (e.g. Rkp) with Smallest MBR.xl

– Scan along the x-axis from MBR.xl to MBR.xu to check if MBRr ∩ MBRc

7

r2r1

c1

c2

c3

c4

c5

MBR.xlMBR.xu

Start with the first entry r1, sweep a vertical lineCheck if MBRr1 ∩ MBRc2,add (OIDr1, OIDc2) to result setCheck if MBRr1 ∩ MBRc1,add (OIDr1, OIDc1) to result setScan until MBR.xu,start the next entry

3.1 Filter Step• Spatial Join (2nd Scenario)

– Rkp and Ckp do not fit into main memory

• Spatial Partitioning Technique– Size of Each Partition for both Input

(Rkp and Ckp) can fit into memory simultaneously

– Perform Plane-Sweeping Technique for Preliminary Spatial Join in Each Partition

• Result Pair <OIDR, OIDC>

8

r2r1

c1

c2

c3

c4

c5

Partition 2 Partition 3


3.2 Refinement Step• Purpose

– #1: To eliminate duplicates induced by Partitioning

9

r2c4

c5



Result Pairs:Partition 1: <r2,c5>Partition 3: <r2,c5>, <r2,c4>

– #2: To examine the actual R and S tuples & see if the attributes satisfy join condition

3.2 Refinement Step• Procedure

– #1: Sort OID pairs:• Primary Sort Key: OIDR

• Secondary Sort Key: OIDC

– #2: Read R tuples first, then C

3.3 Number of Partitions• Number of Partition P is computed as:-

whereP: Number of partition

R : Cardinality of RC : Cardinality of C

Sizekey-ptr: Size of a key-pointer element (in bytes)M: Size of main memory (in bytes)

11

M

SizeCRP ptrkey*)(

3.4 Spatial Partitioning Function • Non-Uniform Distributed & Clustered

Spatial Features

• By “Regular” Partitioning Method Large differences in size of partitions

12


Partition 2 Partition 3Universe

3.4 Spatial Partitioning Function Step 1: Regular decomposition of universe into

NT tiles, where NT ＞＝ PStep 2: Apply Tile-to-Partition Mapping Scheme

Round Robin OR Hashing

13

“Regular” Partitioning Method “Tile-based” Partitioning Method + Round Robin Mapping Scheme



Tile0/Part0 Tile1/Part1 Tile2/Part2 Tile3/Part0

Tile4/Part1 Tile5/Part2

Tile6/Part0 Tile7/Part1

Tile8/Part2 Tile9/Part0 Tile10/Part1 Tile11/Part2

3.4 Spatial Partitioning Function • What is the PERFECT Spatial Partitioning

Function ?

• Considerations:– Number of Tiles– Tile-to-Partition Mapping Scheme

(Round Robin OR Hashing)

• Data set used for investigation:– Tiger Road Data (62.4MB, 456,613 tuples)– Sequoia Polygon Data (21.9MB, 58,115 tuples)

14

It assigns equal number of tuples to each partition

3.4 Spatial Partitioning Function • Observation:

Partitioning Function improves as No. of Tiles increases More uniform distribution

15

The PERFECT Partitioning Function has a coefficient of variation = 0

Spatial Partitioning FunctionAlternatives: Tiger Road Data

3.4 Spatial Partitioning Function • Observation: ↑ No. of Tiles,

↑ Replication Overhead

16

Replication Overhead:Tiger Road Data (16 Partitions)

Replication Overhead: Sequoia Polygon Data(16 Partitions)

Number of Tiles = An integral multiple of Number of Partitions

Tile0/Part0 Tile1/Part1 Tile2/Part2



r1

c1

c2

Scenario:• No. of Tiles = 9• P = 3• Tile-to-Partition Mapping Scheme = Round Robin

The entire column is being mapped to a single partition ↓ ∴ Replications by partitioning, ↓ Replication overheads

3.4 Spatial Partitioning Function • Observation: ↑ No. of Tiles,

↑ Replication Overhead

18

Replication Overhead:Tiger Road Data (16 Partitions)

Replication Overhead: Sequoia Polygon Data(16 Partitions)

Number of Tiles = An integral multiple of Number of Partitions

4. Performance

V.SV.SIndexed

Nested Loops Join

Indexed Nested Loops

Join

PBSM Join(1024 tiles)PBSM Join(1024 tiles) V.SV.S

R-tree Based Join

R-tree Based Join

19

4.1 Index Nested Loops Join

20

Build an index in R (the smaller input)• Reads the extent R• Extracts the key-pointer (<MBR,OID>)

Sort the key-pointer by MBR

Build R-tree for the key-pointer

Scan on C• For each C, fetch each R

4.2 R-tree Based Join Algorithm

21

Build an R-tree index in both R and C

Find MBR with union set which is not null• Visit the roots• Move down the levels until leaf nodes

Find ID pairs with dataset union which is not null

4.3 Methodology

• Database System: Paradise• Machine: Sun SPARC-10/51

– 64 MB of memory– SunOS Release 4.1.3– One Seagate 2GB disk

22

TIGER file

• Road, Hydrography and Rail data of the United States etc…

• 2 join queries– Road with Hydrography– Between the Road and the Rail data

23

Data Type # of Objects Total Size R-tree Size

Road 656,613 62.4 MB 24.0 MB

Hydrography 122,149 25.2 MB 6.5 MB

Rail 16,844 2.4 MB 1.0 MB

Sequoia 2000 Storage Benchmark

• Polygon– Regions of homogeneous landuse characteristics

in California• Islands

– Holes in the polygon data

24

Data Type # of Objects Total Size R-tree Size

Polygons 58,115 21.9 MB 3.0 MB

Islands 21,007 6.2 MB 1.1 MB

4.4 NONE OF INDICES PRE-EXIST

25

TIGER: Join Road with Hydrograhy

26

PBSM is 48-98% faster

than the R-Tree Based; 93-

300% faster than the Idx.

Nested Loops.

TIGER: Join Road with Rail

27

Rail data: 2.4MB (Index: 1.0MB), fits in buffer pool; Idx. Nested

Loops performs

better than R-Tree Based.

Cluster

• Data continuously , i.e. not randomly distributed

• Data are mostly in sequential order in real life

• Less computationally expensive

28

Clustered TIGER: Join Road with Hydrography

29

PBSM is 40% faster than the R-Tree Based; and 60-80%

faster than the Idx. Nested

Loops.

Costs• Index Building Cost

– Cost of extracting the key-pointers from the input– Sorting the key-pointers– Building the index using the sorted key pointers– If Input is clustered

No sorting key-pointers Cost of building index

• Tree Joining Cost• Refinement Step Cost

30

Sequoia Data

31

PBSM is 13-27% faster

than the R-Tree Based; and 17-

114% faster than the Idx.

Nested Loops.

Summary

• PBSM is better than R-tree and the Indexed Nested Loops based algorithm

• When sizes of 2 inputs differ significantly,Indexed Nested Loops is better than the R-tree based algorithm

• All algorithms improve if join inputs are clustered

32

4.5 IN THE PRESENCE OF PRE-EXISTING INDEX

33

When indices pre-exist on

both the inputs, the R-

tree based algorithm has

the best performance

When indices pre-exist on

both the inputs, the R-

tree based algorithm has

the best performance

TIGER: Join Road with Hydrograhy

34

When an index exists only on the smaller input, PBSM

performs best.

When an index exists only on the smaller input, PBSM

performs best.

TIGER: Join Road with Rail

35

4.6 CPU Cost

• Insert a table• CPU cost > I/O cost• System

– CPU intensive– Much less I/O is needed

36

5. Principal Behind

• Divide and Conquer• Optimization on memory size

6. Playback of this presentation

• Efficient PBSM algorithm• Comparison among different algorithm• Performance Analysis• Clustered Data• Indexed Data

38

Question ?

39

partition based spatial – merge join present by: tony tong (09049620) cleo tsang (09049630) june...

Documents

partition slide

number of partition

mbr r mbr c

mbr r1 mbr c1

mbr r1 mbr c2

pbsm algorithm partition

partition result pair

filter step spatial