partition based spatial – merge join present by: tony tong (09049620) cleo tsang (09049630) june...

39
Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790) 1

Upload: bret-burger

Post on 29-Mar-2015

212 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

Partition Based Spatial – Merge Join

Present by:Tony Tong (09049620)

Cleo Tsang (09049630)June Yau (09030360)

Chelsie Chan (10104740)Oengus Lam (10104790)

1

Page 2: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

Agenda1. Problem2. Definition of Spatial Join3. PBSM Algorithm

3.1 Filter Step3.2 Refinement Step3.3 Number of Partition3.4 Spatial Partitioning Function

4. Performance4.1 Indexed Nested Loops Join4.2 R-tree Based Join Algorithm4.3 Methodology4.4 None of Indices Pre-exist4.5 In the Presence of Pre-existing Index4.6 CPU Costs

5. Conclusion

2

Page 3: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

1. Problem

• In a spatial database system, like GIS, join queries objects involve large amount of memory

• Have no pre computing data for datasets• Usually no index for intermediate result • Solve this join problem efficiently

3

Page 4: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

2. Definition of Spatial Join

• An operation of combining 2 or more datasets based on their spatial relationship

Q: “Find all pairs of rivers and cities that intersect”

Join Result Pairs:<r1,c1><r1,c2><r2,c5>r2

r1

c1

c2

c3

c4

c5

4

Page 5: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

3. PBSM Algorithm

• Partition Based Spatial-Merge Join (PBSM)• PBSM operates in 2 steps

5

Filter Step

Refinement Step

ID River_Name Length

r1 Margaret River 60km

r2 Brisbane River 344km

ID City_Name County

c1 Perth WA

c2 Brisbane QLD

Unique Identifer(OID)

Input R

Input C

Page 6: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

3.1 Filter Step• Purpose:

– To find all objects whose MBR intersects the query rectangle

• For each input (R and C), – Creation of Minimum Bounding

Rectangle (MBR)

– Rough Estimation for Search Region

– Key-Pointer Element in New Input (Rkp and Ckp)(OID + MBR)

6

Input Rkp

<r1, mr1><r2, mr2>

r2r1

c1

c2

c3

c4

c5

Input Ckp

<c1, mc1><c2, mc2>

...

Key-Pointer Element

Page 7: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

3.1 Filter Step• Spatial Join (1st Scenario)

– Rkp and Ckp fit into main memory

• Plane-Sweeping Technique– Sort by MBR.xl for Each Input

(Rkp and Ckp)

– Select the MBR in Either Input (e.g. Rkp) with Smallest MBR.xl

– Scan along the x-axis from MBR.xl to MBR.xu to check if MBRr ∩ MBRc

7

r2r1

c1

c2

c3

c4

c5

MBR.xlMBR.xu

Start with the first entry r1, sweep a vertical lineCheck if MBRr1 ∩ MBRc2,add (OIDr1, OIDc2) to result setCheck if MBRr1 ∩ MBRc1,add (OIDr1, OIDc1) to result setScan until MBR.xu,start the next entry

Page 8: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

3.1 Filter Step• Spatial Join (2nd Scenario)

– Rkp and Ckp do not fit into main memory

• Spatial Partitioning Technique– Size of Each Partition for both Input

(Rkp and Ckp) can fit into memory simultaneously

– Perform Plane-Sweeping Technique for Preliminary Spatial Join in Each Partition

• Result Pair <OIDR, OIDC>

8

r2r1

c1

c2

c3

c4

c5

Partition 2 Partition 3

Partition 0 Partition 1

Page 9: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

3.2 Refinement Step• Purpose

– #1: To eliminate duplicates induced by Partitioning

9

r2c4

c5

Partition 2 Partition 3

Partition 0 Partition 1

Result Pairs:Partition 1: <r2,c5>Partition 3: <r2,c5>, <r2,c4>

– #2: To examine the actual R and S tuples & see if the attributes satisfy join condition

Page 10: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

3.2 Refinement Step• Procedure

– #1: Sort OID pairs:• Primary Sort Key: OIDR

• Secondary Sort Key: OIDC

– #2: Read R tuples first, then C

Page 11: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

3.3 Number of Partitions• Number of Partition P is computed as:-

whereP: Number of partition

R : Cardinality of RC : Cardinality of C

Sizekey-ptr: Size of a key-pointer element (in bytes)M: Size of main memory (in bytes)

11

M

SizeCRP ptrkey*)(

Page 12: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

3.4 Spatial Partitioning Function • Non-Uniform Distributed & Clustered

Spatial Features

• By “Regular” Partitioning Method Large differences in size of partitions

12

Partition 0 Partition 1

Partition 2 Partition 3Universe

Page 13: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

3.4 Spatial Partitioning Function Step 1: Regular decomposition of universe into

NT tiles, where NT > = PStep 2: Apply Tile-to-Partition Mapping Scheme

Round Robin OR Hashing

13

“Regular” Partitioning Method “Tile-based” Partitioning Method + Round Robin Mapping Scheme

Partition 0 Partition 1

Partition 2 Partition 3

Tile0/Part0 Tile1/Part1 Tile2/Part2 Tile3/Part0

Tile4/Part1 Tile5/Part2

Tile6/Part0 Tile7/Part1

Tile8/Part2 Tile9/Part0 Tile10/Part1 Tile11/Part2

Page 14: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

3.4 Spatial Partitioning Function • What is the PERFECT Spatial Partitioning

Function ?

• Considerations:– Number of Tiles– Tile-to-Partition Mapping Scheme

(Round Robin OR Hashing)

• Data set used for investigation:– Tiger Road Data (62.4MB, 456,613 tuples)– Sequoia Polygon Data (21.9MB, 58,115 tuples)

14

It assigns equal number of tuples to each partition

Page 15: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

3.4 Spatial Partitioning Function • Observation:

Partitioning Function improves as No. of Tiles increases More uniform distribution

15

The PERFECT Partitioning Function has a coefficient of variation = 0

Spatial Partitioning FunctionAlternatives: Tiger Road Data

Page 16: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

3.4 Spatial Partitioning Function • Observation: ↑ No. of Tiles,

↑ Replication Overhead

16

Replication Overhead:Tiger Road Data (16 Partitions)

Replication Overhead: Sequoia Polygon Data(16 Partitions)

Number of Tiles = An integral multiple of Number of Partitions

Page 17: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

Tile0/Part0 Tile1/Part1 Tile2/Part2

Tile3/Part0 Tile4/Part1 Tile5/Part2

Tile6/Part0 Tile7/Part1 Tile8/Part2

r1

c1

c2

Scenario:• No. of Tiles = 9• P = 3• Tile-to-Partition Mapping Scheme = Round Robin

The entire column is being mapped to a single partition ↓ ∴ Replications by partitioning, ↓ Replication overheads

Page 18: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

3.4 Spatial Partitioning Function • Observation: ↑ No. of Tiles,

↑ Replication Overhead

18

Replication Overhead:Tiger Road Data (16 Partitions)

Replication Overhead: Sequoia Polygon Data(16 Partitions)

Number of Tiles = An integral multiple of Number of Partitions

Page 19: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

4. Performance

V.SV.SIndexed

Nested Loops Join

Indexed Nested Loops

Join

PBSM Join(1024 tiles)PBSM Join(1024 tiles) V.SV.S

R-tree Based Join

R-tree Based Join

19

Page 20: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

4.1 Index Nested Loops Join

20

Build an index in R (the smaller input)• Reads the extent R• Extracts the key-pointer (<MBR,OID>)

Sort the key-pointer by MBR

Build R-tree for the key-pointer

Scan on C• For each C, fetch each R

Page 21: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

4.2 R-tree Based Join Algorithm

21

Build an R-tree index in both R and C

Find MBR with union set which is not null• Visit the roots• Move down the levels until leaf nodes

Find ID pairs with dataset union which is not null

Page 22: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

4.3 Methodology

• Database System: Paradise• Machine: Sun SPARC-10/51

– 64 MB of memory– SunOS Release 4.1.3– One Seagate 2GB disk

22

Page 23: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

TIGER file

• Road, Hydrography and Rail data of the United States etc…

• 2 join queries– Road with Hydrography– Between the Road and the Rail data

23

Data Type # of Objects Total Size R-tree Size

Road 656,613 62.4 MB 24.0 MB

Hydrography 122,149 25.2 MB 6.5 MB

Rail 16,844 2.4 MB 1.0 MB

Page 24: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

Sequoia 2000 Storage Benchmark

• Polygon– Regions of homogeneous landuse characteristics

in California• Islands

– Holes in the polygon data

24

Data Type # of Objects Total Size R-tree Size

Polygons 58,115 21.9 MB 3.0 MB

Islands 21,007 6.2 MB 1.1 MB

Page 25: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

4.4 NONE OF INDICES PRE-EXIST

25

Page 26: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

TIGER: Join Road with Hydrograhy

26

PBSM is 48-98% faster

than the R-Tree Based; 93-

300% faster than the Idx.

Nested Loops.

Page 27: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

TIGER: Join Road with Rail

27

Rail data: 2.4MB (Index: 1.0MB), fits in buffer pool; Idx. Nested

Loops performs

better than R-Tree Based.

Page 28: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

Cluster

• Data continuously , i.e. not randomly distributed

• Data are mostly in sequential order in real life

• Less computationally expensive

28

Page 29: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

Clustered TIGER: Join Road with Hydrography

29

PBSM is 40% faster than the R-Tree Based; and 60-80%

faster than the Idx. Nested

Loops.

Page 30: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

Costs• Index Building Cost

– Cost of extracting the key-pointers from the input– Sorting the key-pointers– Building the index using the sorted key pointers– If Input is clustered

No sorting key-pointers Cost of building index

• Tree Joining Cost• Refinement Step Cost

30

Page 31: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

Sequoia Data

31

PBSM is 13-27% faster

than the R-Tree Based; and 17-

114% faster than the Idx.

Nested Loops.

Page 32: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

Summary

• PBSM is better than R-tree and the Indexed Nested Loops based algorithm

• When sizes of 2 inputs differ significantly,Indexed Nested Loops is better than the R-tree based algorithm

• All algorithms improve if join inputs are clustered

32

Page 33: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

4.5 IN THE PRESENCE OF PRE-EXISTING INDEX

33

Page 34: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

When indices pre-exist on

both the inputs, the R-

tree based algorithm has

the best performance

When indices pre-exist on

both the inputs, the R-

tree based algorithm has

the best performance

TIGER: Join Road with Hydrograhy

34

Page 35: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

When an index exists only on the smaller input, PBSM

performs best.

When an index exists only on the smaller input, PBSM

performs best.

TIGER: Join Road with Rail

35

Page 36: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

4.6 CPU Cost

• Insert a table• CPU cost > I/O cost• System

– CPU intensive– Much less I/O is needed

36

Page 37: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

5. Principal Behind

• Divide and Conquer• Optimization on memory size

Page 38: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

6. Playback of this presentation

• Efficient PBSM algorithm• Comparison among different algorithm• Performance Analysis• Clustered Data• Indexed Data

38

Page 39: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)

Question ?

39