Karl Schnaitter and Neoklis Polyzotis (UC Santa Cruz)
Serge Abiteboul (INRIA and University of Paris 11)
Tova Milo (University of Tel Aviv)
Automatic Index Selection for Shifting Workloads
On-line Index Selection
A heavily used database needs indexes, but an evolving workload makes choosing them hard:
Popularity shifts over time
Applications come and go
The problem, on-line index selection: maintain a dynamic set of indexes that improves the performance of the current queries
Attempted On-line Solution
Can off-line techniques be adapted? Off-line algorithms need a fixed query load, which is not given in the on-line problem
We could try:
Loop:
  Gather query workload W (no obvious implementation)
  Optimize indexes for W (can be very slow; index creation cost not considered)
Our Solution
COLT: Continuous On-Line Tuning, a system for on-line index selection
Features:
Continuously tracks the query load
Evaluates indexes with a what-if optimizer
Adaptively controls its overhead
Selects indexes based on recent trends
Prototype implementation in PostgreSQL
Problem Assumptions
We make the following assumptions:
Data is in a relational database
Workload consists of SELECT statements
Restricted to single-column indexes
Given a budget for index storage
This problem still has major challenges; lifting these assumptions is future work
System Architecture
[Diagram: COLT sits inside the database system alongside the parser, optimizer, and executor. For each query, COLT observes the query and its plan, sends indexes to profile through the what-if optimizer interface, and receives their estimated benefits. Between epochs, COLT issues CREATE INDEX / DROP INDEX statements.]
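The what-if interface in the diagram can be sketched in code. The following is a minimal, illustrative model (the class and method names are assumptions, not the PostgreSQL prototype's actual API): the optimizer estimates the cost of a query as if a candidate index existed, without materializing it.

```python
# Toy what-if optimizer: queries are represented by the set of columns
# they touch, and a relevant hypothetical index is assumed to halve the
# cost. The halving factor is an illustrative stand-in for a real
# optimizer's cost model.
class WhatIfOptimizer:
    def __init__(self, base_cost):
        self.base_cost = base_cost  # query id -> cost with no extra indexes

    def cost(self, query, columns, hypo_indexes):
        """Estimated cost of `query` assuming `hypo_indexes` existed."""
        c = self.base_cost[query]
        for idx in hypo_indexes:
            if idx in columns:  # index on a referenced column helps
                c *= 0.5
        return c

    def benefit(self, query, columns, idx):
        """Benefit of one index for one query: cost saved vs. no index."""
        return self.base_cost[query] - self.cost(query, columns, [idx])
```

An index on a column the query never touches yields zero benefit, which is why COLT can discard such candidates immediately.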
COLT Internal Organization
[Diagram: COLT maintains recent queries and candidate indexes; it sends indexes to profile to the what-if interface, receives index benefits, and observes each query and its plan.]
Organizing the Queries
Similar queries are grouped into clusters
Each query is placed in a cluster on arrival
Benefit statistics are aggregated per cluster, with confidence intervals over past measurements
[Diagram: example with queries 1-5 assigned to clusters 1 and 2]
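A sketch of this clustering and statistics bookkeeping, under assumptions of my own (the signature function, the normal-approximation interval, and all names are illustrative, not COLT's actual design):

```python
import math
from collections import defaultdict

class Cluster:
    """Running benefit statistics for one query cluster."""
    def __init__(self):
        self.samples = defaultdict(list)  # index -> measured benefits

    def record(self, index, benefit):
        self.samples[index].append(benefit)

    def confidence_interval(self, index, z=1.96):
        """Normal-approximation CI over the past measurements."""
        xs = self.samples[index]
        n = len(xs)
        mean = sum(xs) / n
        if n < 2:
            return (mean, mean)
        var = sum((x - mean) ** 2 for x in xs) / (n - 1)
        half = z * math.sqrt(var / n)
        return (mean - half, mean + half)

def signature(query_columns):
    # Toy clustering key: queries touching the same columns share a cluster.
    return frozenset(query_columns)

clusters = defaultdict(Cluster)

def on_query_arrival(query_columns, index, measured_benefit):
    clusters[signature(query_columns)].record(index, measured_benefit)
```

Wide intervals signal uncertain benefit, which is exactly what the selective-profiling step later uses to decide which indexes deserve what-if calls.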
Organizing the Indexes
Cold Set: relevant but not promising candidates; benefit measured with a crude metric
Hot Set: promising candidates for materialization; profiled accurately with what-if calls
Materialized Set: indexes materialized by COLT; disk storage must fit in the budget; profiled with reverse what-if calls
A new candidate with no merit is discarded
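The three-tier organization can be sketched as a small data structure. This is a hedged illustration, not COLT's implementation: the capacities, the crude-benefit threshold, and the promotion rule are assumptions made for the example.

```python
class IndexTiers:
    """Cold / hot / materialized candidate sets, as in the slide above."""
    def __init__(self, hot_capacity, storage_budget):
        self.cold = {}          # index -> crude benefit estimate
        self.hot = {}           # index -> accurately profiled benefit
        self.materialized = {}  # index -> size on disk
        self.hot_capacity = hot_capacity
        self.storage_budget = storage_budget

    def add_candidate(self, index, crude_benefit):
        # Only indexes with positive crude benefit enter the cold set;
        # a new candidate with no benefit is discarded.
        if crude_benefit > 0:
            self.cold[index] = crude_benefit

    def promote_to_hot(self):
        # Promote the top cold candidates, by crude benefit, into any
        # free slots of the hot set.
        free = max(0, self.hot_capacity - len(self.hot))
        ranked = sorted(self.cold, key=self.cold.get, reverse=True)
        for idx in ranked[:free]:
            self.hot[idx] = self.cold.pop(idx)
```

Hot indexes then compete for materialization under the storage budget, as described in the index-selection slides.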
Key Challenges
Challenge: Selective Profiling. Which hot and materialized indexes are profiled with respect to the current query?
Challenge: Index Selection. How are index candidates selected for the cold, hot, and materialized sets?
Selective Profiling (1/2)
Each query is mapped to its cluster; the indexes relevant to that cluster are candidates for profiling
[Diagram: a query maps to one of clusters C1-C4, which point into the cold, hot, and materialized sets]
The indexes to profile are a random sample, focused on indexes with uncertain benefit in the cluster
Profiling budget: a maximum number of what-if calls per epoch
Selective Profiling (2/2)
The budget is adjusted after each epoch: it is set proportional to the potential of the hot indexes, where potential is based on optimistic assumptions
Result:
Stable workload: profiling is suspended
Shifting workload: profiling intensifies
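The budget adjustment above can be sketched as follows. The proportionality constant and the cap are invented for the example; the slide only states that the budget is proportional to the hot indexes' optimistic potential, not the exact formula.

```python
MAX_CALLS_PER_EPOCH = 20  # hard cap on what-if calls per epoch

def next_profiling_budget(hot_potentials, scale=0.1):
    """Set the next epoch's what-if budget proportional to the total
    optimistic potential of the hot indexes (illustrative formula)."""
    potential = sum(hot_potentials)
    return max(0, min(MAX_CALLS_PER_EPOCH, int(scale * potential)))
```

With a stable, well-profiled workload the potentials shrink toward zero and profiling is effectively suspended; a workload shift raises uncertainty and potential, so profiling intensifies, matching the behavior described above.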
Index Selection (1/3)
How are new candidates chosen? Use a crude benefit metric:
Cost of index access vs. sequential scan
Approximate and cheap to compute
When each query arrives:
Compute the crude benefit of the relevant indexes
Indexes with benefit > 0 become candidates for the cold set
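The crude metric can be sketched with simple textbook cost formulas (these particular formulas are assumptions for illustration; the slide only says the metric compares index access against a sequential scan and is cheap to compute):

```python
import math

def crude_benefit(n_pages, n_rows, selectivity):
    """Approximate benefit of a single-column index for one query,
    using only table statistics, so no optimizer call is needed."""
    seq_scan_cost = n_pages  # read every page once
    # Rough B-tree model: descend the tree, then fetch matching rows.
    index_cost = math.log2(max(n_rows, 2)) + selectivity * n_rows
    return max(0.0, seq_scan_cost - index_cost)
```

A selective predicate on a large table yields a clearly positive benefit, while an unselective one yields zero, so the index is never proposed as a candidate.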
Index Selection (2/3)
How are hot indexes chosen? At the end of each epoch:
Get the crude benefit of the hot and cold indexes
Find the cut-off point for membership in the hot set
The cut-off point is derived from a two-cluster model of the crude benefits: indexes above it become hot, the rest stay cold
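One simple way to realize a two-cluster cut-off is a one-dimensional 2-means over the crude benefit values; this is my illustrative reading, and COLT's exact model may differ:

```python
def two_cluster_cutoff(benefits, iters=20):
    """Split benefit values into a low and a high cluster and return
    the midpoint between the two centroids as the hot/cold cut-off."""
    lo, hi = min(benefits), max(benefits)
    for _ in range(iters):
        mid = (lo + hi) / 2
        low = [b for b in benefits if b <= mid]
        high = [b for b in benefits if b > mid]
        if not low or not high:
            break
        lo = sum(low) / len(low)    # centroid of the cold cluster
        hi = sum(high) / len(high)  # centroid of the hot cluster
    return (lo + hi) / 2

# Example: two low-benefit indexes and two clearly better ones.
benefits = [0.5, 1.0, 1.2, 9.0, 10.0]
cutoff = two_cluster_cutoff(benefits)
hot = [b for b in benefits if b > cutoff]
```

The cut-off adapts to the benefit distribution of the current workload instead of relying on a fixed threshold.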
Index Selection (3/3)
How is the materialized set chosen? At the end of each epoch:
Predict the future benefit of the hot and materialized indexes from their observed benefit over time
The cost of materialization is discounted from the predicted benefit
Select the indexes that maximize total benefit; the winners move from the hot set to the materialized set
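The selection step resembles a knapsack under the storage budget. The greedy density heuristic below is a sketch of that idea under my own assumptions (the slide specifies only that creation cost is discounted and total benefit is maximized, not the algorithm):

```python
def choose_materialized(candidates, budget):
    """Pick indexes to materialize under a storage budget.

    candidates: list of (name, size, predicted_benefit, build_cost).
    Creation cost is discounted from the predicted benefit; indexes are
    then packed greedily by net benefit per unit of storage.
    """
    scored = []
    for name, size, benefit, build_cost in candidates:
        net = benefit - build_cost  # discount the one-time creation cost
        if net > 0:
            scored.append((net / size, net, size, name))
    chosen, used = [], 0
    for _, net, size, name in sorted(scored, reverse=True):
        if used + size <= budget:
            chosen.append(name)
            used += size
    return chosen
```

An index whose build cost exceeds its predicted benefit is never materialized, which captures why index creation cost matters in the on-line setting.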
Performance of COLT
Experimented with a 4-phase workload
COLT adapts to each phase
The off-line algorithm chooses the best static index set
[Chart: execution time per phase (1-4), split into minimum time, COLT extra time, and off-line extra time]
Overhead of Profiling
Overhead peaks at the start of each phase
It decreases once the system is well tuned
Average: fewer than 1 what-if call per query
[Chart: profiling overhead over phases 1-4]
Closing Remarks
On-line index selection is harder than off-line:
Efficiency is a higher priority
We essentially need to guess the future
Index creation cost is an issue
COLT is our solution: it solves a constrained problem and is potentially extensible to other domains, where only some components would change
Thank You
Setting System Parameters
Epoch length = 10
What-if limit = 20, or less if overhead is a serious concern
Averaging window: used when predicting future benefit; indicates the number of epochs that give a good picture of the query distribution; not easy to set
Other On-line Systems
Bruno and Chaudhuri, ICDE 2007: avoids extra optimizer calls completely; heuristics to capture index interaction; very different mechanisms for index selection
Sattler, Schallehn, and Geist, IDEAS 2004: more similarities to COLT; different prediction of future benefit; no control of profiling overhead
Performance with Noise
The worst-case scenario for COLT: concentrated bursts of noise queries
Performance loss in some cases
[Chart: relative performance (20%-100%) under bursts of noise]