Xiaodan Wang Department of Computer Science Johns Hopkins University Processing Data Intensive Queries in Scientific Database Federations

Upload: ami-sullivan

Post on 04-Jan-2016


Page 1

Xiaodan Wang
Department of Computer Science
Johns Hopkins University

Processing Data Intensive Queries

in Scientific Database Federations

Page 2

Problem

Data avalanche in scientific databases
– Exponential growth in data size (Pan-STARRS)
– Accumulation of data at multiple data sources (clustered and federated databases)

Exploring massive, widely distributed data
– Joins to find correlations across multiple databases
– Queries are data intensive: they transfer large volumes over the network and scan large portions of the data
– Query throughput limits the scale of exploration

Goal: improve overall query throughput, potentially at the expense of individual query performance

Page 3

Target Application

SkyQuery Federation of Astronomy Databases
– Dozens of multi-terabyte databases across three continents
– Queries that perform full database scans lasting hours or days
– Intermediate join results that are hundreds of MBs
– Scalability concerns both in data size and number of sites

Cross-match
– Probabilistic spatial join across multiple databases
– Join results are accumulated, shipped from site to site, and delivered to scientists

Page 4

Cross-Match Workload

A forward-looking analysis shows that the network dominates 90% of performance

A quarter of the cross-match queries execute for minutes to several hours

Page 5

Incorporating Network Structure

Page 6

Network-Aware Join Processing

Capture heterogeneity in global-scale federations
– Metric to exploit high-throughput paths
– Decentralized, local optimizations using aggregate statistics
– Routing at the application layer
– Two-approximate, MST-based solution with extensions that employ semi-joins and explore bushy plans
– Clustering to explore trade-offs with computation cost

Over a ten-fold reduction in network utilization for large joins
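
The MST-based idea can be illustrated with a minimal sketch. It assumes that the cost of a link is the inverse of its measured throughput, so that high-throughput paths are preferred; the slides do not define the metric, and the site names and throughput figures below are hypothetical. Kruskal's algorithm is used here only as a familiar way to build a minimum spanning tree.

```python
def minimum_spanning_tree(nodes, edges):
    """Kruskal's algorithm; edges is a list of (cost, u, v) tuples."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent pointers to the component root, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for cost, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # edge joins two components: keep it
            parent[root_u] = root_v
            tree.append((u, v, cost))
    return tree


# Hypothetical throughput between sites, in MB/s (not taken from the slides).
throughput = {
    ("A", "B"): 100.0,
    ("B", "C"): 80.0,
    ("A", "C"): 5.0,
    ("C", "D"): 40.0,
    ("B", "D"): 2.0,
}
# Assumed metric: cost = 1 / throughput, so fast links are cheap.
edges = [(1.0 / mbps, u, v) for (u, v), mbps in throughput.items()]
mst = minimum_spanning_tree("ABCD", edges)
mst_links = sorted((u, v) for u, v, _ in mst)
print(mst_links)
```

With these numbers the slow A–C and B–D links are left out of the tree, so join results would travel only over the three fastest links.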

Page 7

A Case for Batch Processing

The top ten buckets are accessed by 61% of queries, and reuse occurs close together in time

2% of buckets capture more than half of the workload and should be cached

Page 8

LifeRaft: Data-Driven Batch Processing

Eliminate redundant I/O to improve query throughput

Batch queries that exhibit data sharing

– Pre-process queries to identify data sharing

– Co-schedule queries that access the same data

– Access contentious data first to maximize sharing

– Improves performance two-fold
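
The batching steps above can be sketched as follows. This is an illustration under simplifying assumptions, not LifeRaft's implementation: each query is reduced to the set of bucket ids it touches, and the function name, query names, and bucket ids are hypothetical.

```python
from collections import defaultdict


def batch_by_bucket(queries):
    """Group sub-queries by bucket and order buckets by contention.

    queries maps a query id to the bucket ids it reads. The result is a
    list of (bucket, [query ids]) pairs, most contended bucket first.
    """
    waiting = defaultdict(list)
    for query_id, buckets in queries.items():
        for bucket in buckets:
            waiting[bucket].append(query_id)   # pre-process: who needs what
    # Most waiting queries first; ties broken by bucket id for determinism.
    return sorted(waiting.items(), key=lambda item: (-len(item[1]), item[0]))


queries = {"Q1": [1, 2], "Q2": [2, 3], "Q3": [2, 3, 4]}
schedule = batch_by_bucket(queries)

reads_batched = len(schedule)                               # one read per bucket
reads_unbatched = sum(len(b) for b in queries.values())     # one read per sub-query
print(schedule[0], reads_batched, reads_unbatched)
```

In this toy workload, bucket 2 is read once and serves all three queries, and the total number of bucket reads drops from seven to four.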

Page 9

Discussion

Cache replacement for LifeRaft
– Benefits contentious data regions that experience reuse (cache hit rate for LifeRaft is 40%, compared with 7% for arrival-order processing)
– Evaluate strategies that exploit the I/O behavior of batch workloads (segmented strategy)

Buffering and workload overflow
– Large intermediate join results
– Migrate pairs of workload and bucket

Better support for interactive queries
– Short and selective queries that focus on a small region
– Indefinite queuing times in the presence of batch workloads

Page 10

Discussion (cont.)

Batch processing in a distributed environment
– Network-aware scheduling does not consider computation cost
– Batch processing currently assumes a single-system environment

Federating LifeRaft
– Coordinate execution of queries that join multiple databases
– Batch processing requires databases to buffer results
– Maximize overall batch size while reducing the memory used for buffering and the network cost

Page 11

Exploring Alternative Join Schedules

Page 12

Discussion (cont.)

Explore both join schedules and opportunities for batching simultaneously
– Bushy and semi-join plans increase computation, while clustering decreases computation
– Skew in join workload (i.e., sites close to the end user)
– Quantify trade-offs with computation cost (i.e., number of buckets in batch processing)

Users submit cross-match queries in batches

Applying LifeRaft to other data-intensive, temporal-spatial data such as the Turbulence database

Page 13

Supplementary Slides

Page 14

Cross-Match Queries

Join by increasing cardinality (count *)
– Minimal I/O
– Fewer bytes on the network

[Figure: Query, Mediator, Probe Query, and Result flows; the probed databases report counts of 30, 100, and 800]
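
Ordering the join by increasing cardinality amounts to sorting the sites by the counts returned for the probe query. The sketch below uses the three counts from the slide (30, 100, 800); the site names are hypothetical placeholders.

```python
def join_order(counts):
    """Return site names ordered by increasing probe-query count."""
    return sorted(counts, key=counts.get)


# Counts from the slide; the site names are made up for illustration.
counts = {"site_x": 800, "site_y": 30, "site_z": 100}
order = join_order(counts)
print(order)
```

Starting at the site with the smallest count keeps the intermediate result that is shipped to the next site as small as possible.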

Page 15

Spanning Tree Approximation (STA)

[Figure: graph of eight sites labeled A through H]

Page 16

STA: Find MST

[Figure: the graph of sites A through H with its minimum spanning tree]

Page 17

STA: Join Using Paths on the MST

[Figure: the graph of sites A through H with the join path along the MST numbered in steps 1 through 13]
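
One way to read the 13 numbered steps is as a depth-first walk over the seven edges of a spanning tree on eight sites: every edge is crossed twice except for the final return hop. That reading is an assumption, and the tree below is hypothetical because the actual edges are not recoverable from the transcript. Since no tree edge is crossed more than twice, the walk costs at most twice the weight of the tree.

```python
def tree_walk(tree_edges, root):
    """Depth-first walk over a tree, stopping at the last new node reached."""
    adjacency = {}
    for u, v in tree_edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)

    hops = []

    def visit(node, parent):
        for neighbor in sorted(adjacency.get(node, [])):
            if neighbor != parent:
                hops.append((node, neighbor))   # descend along the edge
                visit(neighbor, node)
                hops.append((neighbor, node))   # come back up

    visit(root, None)

    # Drop the trailing return hops made after the last unvisited node.
    seen = {root}
    last_new = -1
    for index, (_, destination) in enumerate(hops):
        if destination not in seen:
            seen.add(destination)
            last_new = index
    return hops[: last_new + 1]


# Hypothetical spanning tree over the eight sites A..H (seven edges).
tree_edges = [("A", "B"), ("A", "C"), ("B", "D"), ("B", "E"),
              ("C", "F"), ("C", "G"), ("A", "H")]
walk = tree_walk(tree_edges, "A")
visited = {"A"} | {destination for _, destination in walk}
print(len(walk), walk[0], walk[-1])
```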

Page 18

Filter and refine

Partition data into buckets

Page 19

Scheduling Behavior

[Figure: buckets B1 through B8 with the ranges touched by queries Qi, Qj, and Qk]

Sub-divide queries by bucket:
Qi – Qi1, Qi2, Qi3
Qj – Qj3, Qj4, Qj5, Qj6, Qj7, Qj8
Qj – Qj5, Qj6, Qj7, Qj8
Qk

Assumptions:
- Inter-query time of 1 sec
- I/O for each bucket of 1 sec
- Cache size of 2
- Join cost is negligible

Page 20

Arrival order with no sharing

[Figure: timeline of sub-queries executed in arrival order, each reading its own bucket from B1 through B8, with arrival and end markers for Qi, Qj, and Qk]

Completion Times:
Qi – 3 sec
Qj – 8 sec
Qk – 13 sec
Avg – 8 sec
Tp – .2 qry/sec
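
The arrival-order figures can be checked with a few lines of arithmetic. The sketch assumes the stated inter-query time of 1 sec and bucket I/O of 1 sec, three buckets for Qi and six for Qj as listed in the sub-division, and six buckets for Qk; the Qk count is an assumption chosen because the transcript does not list it and six is what reproduces the reported 13 sec.

```python
def arrival_order(arrivals, bucket_counts, io_sec=1):
    """Run queries one at a time in arrival order with no shared reads."""
    clock = 0
    completion = {}
    for query in sorted(arrivals, key=arrivals.get):
        # A query starts when it has arrived and the previous one is done.
        clock = max(clock, arrivals[query]) + bucket_counts[query] * io_sec
        completion[query] = clock - arrivals[query]
    return completion, clock


arrivals = {"Qi": 0, "Qj": 1, "Qk": 2}          # inter-query time of 1 sec
bucket_counts = {"Qi": 3, "Qj": 6, "Qk": 6}     # Qk's count is assumed
completion, makespan = arrival_order(arrivals, bucket_counts)

average = sum(completion.values()) / len(completion)
throughput = len(completion) / makespan
print(completion, average, throughput)
```

Fifteen bucket reads at one second each give a makespan of 15 sec, hence three queries in 15 sec, or .2 queries per second.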

Page 21

Age based scheduling (bias 1)

[Figure: timeline under age-based scheduling, in which sub-queries of Qj and Qk that read the same bucket are executed together, with arrival and end markers for Qi, Qj, and Qk]

Completion Times:
Qi – 3 sec
Qj – 7 sec
Qk – 7 sec
Avg – 5.6 sec
Tp – .33 qry/sec

Page 22

Contention based scheduling (bias 0)

[Figure: timeline under contention-based scheduling, in which sub-queries of different queries that read the same bucket are executed together, with arrival and end markers for Qi, Qj, and Qk]

Completion Times:
Qi – 7 sec
Qj – 5 sec
Qk – 6 sec
Avg – 6 sec (age based: 5.6 sec)
Tp – .38 qry/sec (age based: .33 qry/sec)
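
The .38 qry/sec figure is consistent with every bucket being read exactly once. The sketch below computes that bound under the assumption that the three queries together touch all eight buckets; the bucket set for Qk is a guess, since the transcript does not list it, and the function name is hypothetical.

```python
def shared_io_time(query_buckets, io_sec=1):
    """I/O time if each distinct bucket is read once and shared by all queries."""
    distinct = set().union(*query_buckets.values())
    return len(distinct) * io_sec


query_buckets = {
    "Qi": {1, 2, 3},
    "Qj": {3, 4, 5, 6, 7, 8},
    "Qk": {1, 4, 5, 6, 7, 8},   # assumed; not listed in the transcript
}
io_time = shared_io_time(query_buckets)
throughput = len(query_buckets) / io_time
unshared_reads = sum(len(buckets) for buckets in query_buckets.values())
print(io_time, throughput, unshared_reads)
```

Eight shared reads instead of fifteen unshared ones give 3 / 8 = .375 queries per second, which rounds to the reported .38.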

Page 23

Parameter tuning using trade-off curves

Page 24

Tuning the age bias

Throughput performance gap grows while response time gap is insensitive to saturation

Increasing age bias is more attractive at low saturation
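
To make the role of the bias parameter concrete, here is a minimal sketch of a scheduler that blends the two criteria. The linear blend, the normalization, and all names are assumptions made for illustration; the slides state only that a bias of 1 corresponds to age-based scheduling and a bias of 0 to contention-based scheduling.

```python
def pick_bucket(pending, now, bias):
    """Choose the next bucket to read.

    pending maps a bucket to the arrival times of the sub-queries waiting
    on it. bias=1.0 favors the oldest waiting sub-query; bias=0.0 favors
    the bucket with the most waiting sub-queries.
    """
    max_age = max(now - min(times) for times in pending.values()) or 1
    max_queue = max(len(times) for times in pending.values())

    def score(bucket):
        times = pending[bucket]
        age = (now - min(times)) / max_age        # normalized to [0, 1]
        contention = len(times) / max_queue       # normalized to (0, 1]
        return bias * age + (1 - bias) * contention

    # Iterate in sorted order so ties resolve to the lowest bucket name.
    return max(sorted(pending), key=score)


# Hypothetical state: B2 has one old request, B4 and B8 have two newer ones.
pending = {"B2": [0], "B4": [1, 2], "B8": [1, 2]}
by_age = pick_bucket(pending, now=3, bias=1.0)
by_contention = pick_bucket(pending, now=3, bias=0.0)
print(by_age, by_contention)
```

With a bias of 1 the oldest request wins and B2 is read first; with a bias of 0 the more contended B4 is read first, which delays the older query but serves more sub-queries per read.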