LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases
![Page 1: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases](https://reader034.vdocuments.us/reader034/viewer/2022042616/56815fd7550346895dceda89/html5/thumbnails/1.jpg)
Xiaodan Wang, Randal Burns
Department of Computer Science
Johns Hopkins University
Tanu Malik
Cyber Center
Purdue University
LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases
![Page 2: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases](https://reader034.vdocuments.us/reader034/viewer/2022042616/56815fd7550346895dceda89/html5/thumbnails/2.jpg)
LifeRaft: Data-Driven, Batch Processing
BETTER LUCK NEXT TIME!
![Page 3: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases](https://reader034.vdocuments.us/reader034/viewer/2022042616/56815fd7550346895dceda89/html5/thumbnails/3.jpg)
Problem
(Figure: queries Q1, Q2, Q3, and Q4)
![Page 4: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases](https://reader034.vdocuments.us/reader034/viewer/2022042616/56815fd7550346895dceda89/html5/thumbnails/4.jpg)
Goals
Eliminate redundant I/O to improve query throughput
Batch queries that exhibit data sharing
– Pre-process queries to identify data sharing
– Co-schedule queries that access the same data
– Access contentious data first to maximize sharing
Starvation resistance
– Avoid indefinite queuing times (response time)
– Enforce some constraints on completion order
![Page 5: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases](https://reader034.vdocuments.us/reader034/viewer/2022042616/56815fd7550346895dceda89/html5/thumbnails/5.jpg)
Target Applications
Data intensive scan queries
– Executed against a clustered index
– Clustered and federated databases (e.g. joins that correlate multiple nodes)
Peta-scale astronomy (Pan-STARRS)
– Data are partitioned spatially
– Many queries scan full DB and last hours or days
Cross-match
– Probabilistic spatial join across multiple databases
![Page 6: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases](https://reader034.vdocuments.us/reader034/viewer/2022042616/56815fd7550346895dceda89/html5/thumbnails/6.jpg)
Filter and Refine
Filter queries
– Pre-process queries to determine join buckets
– Buckets B1,…,Bn and queries Q1,…,Qm
– Workload Wij denotes the objects from Qi that overlap Bj
Refinement
– Read buckets one at a time
– Sort-merge join (sort by HTM ID)
– Query-specific predicates applied on output tuples
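The filter step can be illustrated with a minimal sketch. It assumes each bucket covers a contiguous range of HTM IDs and each query arrives as a list of the HTM IDs it needs to join; the slides do not specify either detail. The names `build_workload`, `bucket_starts`, and the sample values are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def build_workload(bucket_starts, queries):
    """Return W where W[(query, j)] lists the objects of that query in bucket Bj.

    bucket_starts: sorted first HTM ID of each bucket (assumed layout);
                   bucket Bj (1-based) starts at bucket_starts[j - 1].
    queries: dict mapping a query name to a list of HTM IDs.
    """
    workload = defaultdict(list)
    for qname, htm_ids in queries.items():
        # Keep each per-bucket list sorted so refinement can sort-merge it.
        for htm_id in sorted(htm_ids):
            j = bisect_right(bucket_starts, htm_id)  # 1-based bucket number
            if j == 0:
                continue  # object precedes the first bucket; skipped here
            workload[(qname, j)].append(htm_id)
    return dict(workload)


# Illustrative values only: four buckets B1..B4 and two queries.
bucket_starts = [0, 100, 200, 300]
queries = {"Q1": [160, 5, 150], "Q2": [155, 310]}
W = build_workload(bucket_starts, queries)
```

Here both queries have work in B2, which is the kind of data sharing the scheduler looks for when it batches queries.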
![Page 7: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases](https://reader034.vdocuments.us/reader034/viewer/2022042616/56815fd7550346895dceda89/html5/thumbnails/7.jpg)
Workload Throughput Metric
Schedule buckets greedily in order of decreasing workload throughput
Exploits data regions that experience contention
May starve requests
– Favors buckets experiencing frequent reuse
– No guarantee a particular bucket or query receives service
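The formula for the metric did not survive in this transcript, so the sketch below uses a stand-in that is only an assumption: the throughput of a bucket is the number of pending objects divided by the time to serve them, where a cached bucket costs no I/O. The function names, `read_cost`, and `join_cost` are hypothetical.

```python
def workload_throughput(pending_objects, cached, read_cost=1.0, join_cost=0.001):
    """Objects joined per second if this bucket is served next (assumed form)."""
    io = 0.0 if cached else read_cost
    return pending_objects / (io + join_cost * pending_objects)


def pick_bucket(pending, cache):
    """Greedy choice: the bucket with the highest workload throughput.

    pending: dict bucket name -> number of pending objects across all queries.
    cache: set of bucket names currently in memory.
    """
    candidates = [b for b, n in pending.items() if n > 0]
    if not candidates:
        return None
    return max(candidates,
               key=lambda b: workload_throughput(pending[b], b in cache))


# Illustrative values only.
pending = {"B1": 30, "B2": 500, "B3": 120}
with_cache = pick_bucket(pending, cache={"B1"})   # cached B1 is cheapest to serve
without_cache = pick_bucket(pending, cache=set())  # most contended bucket wins
```

In this sketch B3 is never chosen while B2 keeps receiving new work, which is the starvation risk the slide points out.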
![Page 8: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases](https://reader034.vdocuments.us/reader034/viewer/2022042616/56815fd7550346895dceda89/html5/thumbnails/8.jpg)
Aged Workload Throughput Metric
Inspired by disk-drive head scheduling
Balance arrival order (low response time) with contention (high throughput)
Adaptive trade-offs based on workload saturation
– Maximize rate at which objects are joined during saturated workloads
– Enforce completion order (queuing times) to prevent indefinite starvation during low saturation
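One way to picture the balance between arrival order and contention is a weighted blend controlled by an age bias. The sketch below is an assumption, not the formula from the talk: it normalizes throughput and age to [0, 1] and mixes them linearly, so that a bias of 1 follows age only and a bias of 0 follows contention only, matching the "bias 1" and "bias 0" labels used on the scheduling slides. All names and values are hypothetical.

```python
def aged_score(throughput, age, bias, max_throughput, max_age):
    """Blend normalized throughput and normalized age (assumed linear mix)."""
    t = throughput / max_throughput if max_throughput else 0.0
    a = age / max_age if max_age else 0.0
    return (1.0 - bias) * t + bias * a


def pick_bucket_aged(buckets, bias):
    """buckets: dict name -> (workload throughput, age in sec of oldest request)."""
    max_t = max(t for t, _ in buckets.values())
    max_a = max(a for _, a in buckets.values())
    return max(buckets,
               key=lambda b: aged_score(*buckets[b], bias, max_t, max_a))


# Illustrative values only: B1 is heavily shared, B5 has waited a long time.
buckets = {"B1": (333.0, 2.0), "B5": (20.0, 60.0)}
contention_choice = pick_bucket_aged(buckets, bias=0.0)
age_choice = pick_bucket_aged(buckets, bias=1.0)
```

With intermediate bias values the choice depends on how the two terms are normalized, which is a design decision of this sketch rather than something stated on the slide.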
![Page 9: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases](https://reader034.vdocuments.us/reader034/viewer/2022042616/56815fd7550346895dceda89/html5/thumbnails/9.jpg)
Scheduling Behavior
Buckets: B1 B2 B3 B4 B5 B6 B7 B8
Queries: Qi, Qj, Qk
Sub-divide queries by bucket:
– Qi – Qi1, Qi2, Qi3
– Qj – Qj3, Qj4, Qj5, Qj6, Qj7, Qj8
– Qj – Qj5, Qj6, Qj7, Qj8
– Qk
Assumptions:
– Inter-query time of 1 sec
– I/O for each bucket of 1 sec
– Cache size of 2
– Join cost is negligible
![Page 10: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases](https://reader034.vdocuments.us/reader034/viewer/2022042616/56815fd7550346895dceda89/html5/thumbnails/10.jpg)
Arrival order with no sharing
Timeline figure (sub-query served by each bucket read): Qi1 (B1), Qi2 (B2), Qi3 (B3), Qj1 (B1), Qj3 (B3), Qj4 (B4), Qj6 (B6), Qj7 (B7), Qj8 (B8), Qk1 (B1), Qk4 (B4), Qk8 (B8), …
Events: Qi Arr, Qj Arr, Qk Arr, Qi End, Qj End, Qk End
Completion Times: Qi – 3 sec, Qj – 8 sec, Qk – 13 sec, Avg – 8 sec
Tp – .2 qry/sec
![Page 11: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases](https://reader034.vdocuments.us/reader034/viewer/2022042616/56815fd7550346895dceda89/html5/thumbnails/11.jpg)
Age based scheduling (bias 1)
Timeline figure (sub-queries served by each bucket read): Qi1 (B1), Qi2 (B2), Qi5 (B5), Qi3 Qj3 (B3), Qj1 Qk1 (B1), Qj4 Qk4 (B4), Qj6 Qk6 (B6), Qj7 Qk7 (B7), Qj8 Qk8 (B8)
Events: Qi Arr, Qj Arr, Qk Arr, Qi End, Qj End, Qk End
Completion Times: Qi – 3 sec, Qj – 7 sec, Qk – 7 sec, Avg – 5.6 sec
Tp – .33 qry/sec
![Page 12: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases](https://reader034.vdocuments.us/reader034/viewer/2022042616/56815fd7550346895dceda89/html5/thumbnails/12.jpg)
Contention based scheduling (bias 0)
Timeline figure (sub-queries served by each bucket read): Qi1 (B1), Qi2 (B2), Qi3 Qj3 (B3), Qk5 (B5), Qj1 Qk1 (B1), Qj4 Qk4 (B4), Qj6 Qk6 (B6), Qj7 Qk7 (B7), Qj8 Qk8 (B8)
Events: Qi Arr, Qj Arr, Qk Arr, Qi End, Qj End, Qk End
Completion Times: Qi – 7 sec, Qj – 5 sec, Qk – 6 sec, Avg – 6 sec (5.6)
Tp – .38 qry/sec (.33)
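The averages and throughputs on the three scheduling slides can be checked with a little arithmetic. The sketch below assumes that completion time is measured from each query's arrival, that Qi arrives at time 0 with Qj and Qk following at 1-second intervals (the stated inter-query time), and that throughput is the number of queries divided by the time at which the last one finishes. The variable and function names are hypothetical.

```python
# Arrival times in seconds, assuming Qi arrives first at t = 0.
ARRIVALS = {"Qi": 0, "Qj": 1, "Qk": 2}

# Completion times in seconds, as listed on the slides.
COMPLETION_TIMES = {
    "arrival order, no sharing": {"Qi": 3, "Qj": 8, "Qk": 13},
    "age based (bias 1)": {"Qi": 3, "Qj": 7, "Qk": 7},
    "contention based (bias 0)": {"Qi": 7, "Qj": 5, "Qk": 6},
}


def summarize(times):
    """Return (average completion time, queries per second) for one schedule."""
    avg = sum(times.values()) / len(times)
    # The schedule ends when the last query finishes.
    makespan = max(ARRIVALS[q] + t for q, t in times.items())
    return avg, len(times) / makespan


SUMMARY = {name: summarize(t) for name, t in COMPLETION_TIMES.items()}
for name, (avg, tp) in SUMMARY.items():
    print(f"{name}: avg {avg:.2f} sec, throughput {tp:.3f} qry/sec")
```

Under these assumptions the three schedules finish after 15, 9, and 8 seconds, giving throughputs of 0.2, about 0.33, and 0.375 qry/sec, in line with the slide values (the slides show .38 for the last one and 5.6 for the age based average of about 5.67).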
![Page 13: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases](https://reader034.vdocuments.us/reader034/viewer/2022042616/56815fd7550346895dceda89/html5/thumbnails/13.jpg)
Throughput Performance
![Page 14: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases](https://reader034.vdocuments.us/reader034/viewer/2022042616/56815fd7550346895dceda89/html5/thumbnails/14.jpg)
Tuning the age bias
Throughput performance gap grows while response time gap is insensitive to saturation
Increasing age bias is more attractive at low saturation
![Page 15: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases](https://reader034.vdocuments.us/reader034/viewer/2022042616/56815fd7550346895dceda89/html5/thumbnails/15.jpg)
Parameter tuning using trade-off curves
![Page 16: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases](https://reader034.vdocuments.us/reader034/viewer/2022042616/56815fd7550346895dceda89/html5/thumbnails/16.jpg)
Discussion
Impact of caching strategies
Workload overflow
– Large intermediate join results
– Migrate pairs of workload and bucket
Beyond completion order
– Higher priority for interactive queries
Batch processing in a clustered environment
P. Agrawal, D. Kifer, and C. Olston. Scheduling Shared Scans of Large Data Files. In VLDB, 2008.
![Page 17: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases](https://reader034.vdocuments.us/reader034/viewer/2022042616/56815fd7550346895dceda89/html5/thumbnails/17.jpg)
WHAT ABOUT US?
![Page 18: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases](https://reader034.vdocuments.us/reader034/viewer/2022042616/56815fd7550346895dceda89/html5/thumbnails/18.jpg)
Filter and refine
Partition data into buckets
![Page 19: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases](https://reader034.vdocuments.us/reader034/viewer/2022042616/56815fd7550346895dceda89/html5/thumbnails/19.jpg)
Average Response Time
![Page 20: LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases](https://reader034.vdocuments.us/reader034/viewer/2022042616/56815fd7550346895dceda89/html5/thumbnails/20.jpg)
Outline
Motivation
– Goals for data-driven, batch scheduling
– Target application (SkyQuery)
LifeRaft scheduler
– Filter and refine queries
– Throughput maximizing metric
– Starvation resistance
– Differences in outcomes
Workload adaptive parameter selection