Karl Schnaitter and Neoklis Polyzotis (UC Santa Cruz)
Serge Abiteboul (INRIA and University of Paris 11)
Tova Milo (University of Tel Aviv)
Automatic Index Selection for Shifting Workloads
On-line Index Selection
A heavily used database needs indexes, but an evolving workload makes choosing them hard:
Popularity shifts over time
Applications come and go
The problem, on-line index selection: maintain a dynamic set of indexes that improves the performance of the current queries
Attempted On-line Solution
Can off-line techniques be adapted? Off-line algorithms need a fixed query load, which is not given in the on-line problem
We could try:
Loop:
  Gather query workload W (no obvious implementation)
  Optimize indexes for W (can be very slow; index creation cost not considered)
Our Solution
COLT: Continuous On-Line Tuning, a system for on-line index selection
Features:
Continuously tracks the query load
Evaluates indexes with a what-if optimizer
Adaptively controls its overhead
Selects indexes based on recent trends
Prototype implementation in PostgreSQL
Problem Assumptions
We make the following assumptions:
Data is in a relational database
Workload consists of SELECT statements
Restricted to single-column indexes
Given a budget for index storage
This problem still has major challenges; lifting these assumptions is future work
System Architecture
[Diagram: COLT sits inside the database system alongside the parser, optimizer, and executor. For each query, COLT observes the query and its plan, sends indexes to profile through the what-if optimizer interface, and receives their estimated benefits. Between epochs, COLT issues CREATE INDEX / DROP INDEX statements.]
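The what-if interface in the diagram can be sketched in code. The following is a minimal, illustrative model (the class and method names are assumptions, not the PostgreSQL prototype's actual API): the optimizer estimates the cost of a query as if a candidate index existed, without materializing it.

```python
# Toy what-if optimizer: queries are represented by the set of columns
# they touch, and a relevant hypothetical index is assumed to halve the
# cost. The halving factor is an illustrative stand-in for a real
# optimizer's cost model.
class WhatIfOptimizer:
    def __init__(self, base_cost):
        self.base_cost = base_cost  # query id -> cost with no extra indexes

    def cost(self, query, columns, hypo_indexes):
        """Estimated cost of `query` assuming `hypo_indexes` existed."""
        c = self.base_cost[query]
        for idx in hypo_indexes:
            if idx in columns:  # index on a referenced column helps
                c *= 0.5
        return c

    def benefit(self, query, columns, idx):
        """Benefit of one index for one query: cost saved vs. no index."""
        return self.base_cost[query] - self.cost(query, columns, [idx])
```

An index on a column the query never touches yields zero benefit, which is why COLT can discard such candidates immediately.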
COLT Internal Organization
[Diagram: COLT maintains recent queries and candidate indexes; it sends indexes to profile to the what-if interface, receives index benefits, and observes each query and its plan.]
Organizing the Queries
Similar queries are grouped into clusters
Each query is placed in a cluster on arrival
Benefit statistics are aggregated per cluster, with confidence intervals over past measurements
[Diagram: example with queries 1-5 assigned to clusters 1 and 2]
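A sketch of this clustering and statistics bookkeeping, under assumptions of my own (the signature function, the normal-approximation interval, and all names are illustrative, not COLT's actual design):

```python
import math
from collections import defaultdict

class Cluster:
    """Running benefit statistics for one query cluster."""
    def __init__(self):
        self.samples = defaultdict(list)  # index -> measured benefits

    def record(self, index, benefit):
        self.samples[index].append(benefit)

    def confidence_interval(self, index, z=1.96):
        """Normal-approximation CI over the past measurements."""
        xs = self.samples[index]
        n = len(xs)
        mean = sum(xs) / n
        if n < 2:
            return (mean, mean)
        var = sum((x - mean) ** 2 for x in xs) / (n - 1)
        half = z * math.sqrt(var / n)
        return (mean - half, mean + half)

def signature(query_columns):
    # Toy clustering key: queries touching the same columns share a cluster.
    return frozenset(query_columns)

clusters = defaultdict(Cluster)

def on_query_arrival(query_columns, index, measured_benefit):
    clusters[signature(query_columns)].record(index, measured_benefit)
```

Wide intervals signal uncertain benefit, which is exactly what the selective-profiling step later uses to decide which indexes deserve what-if calls.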
Organizing the Indexes
Cold Set: relevant but not promising candidates; benefit measured with a crude metric
Hot Set: promising candidates for materialization; profiled accurately with what-if calls
Materialized Set: indexes materialized by COLT; disk storage must fit in the budget; profiled with reverse what-if calls
A new candidate with no merit is discarded
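The three-tier organization can be sketched as a small data structure. This is a hedged illustration, not COLT's implementation: the capacities, the crude-benefit threshold, and the promotion rule are assumptions made for the example.

```python
class IndexTiers:
    """Cold / hot / materialized candidate sets, as in the slide above."""
    def __init__(self, hot_capacity, storage_budget):
        self.cold = {}          # index -> crude benefit estimate
        self.hot = {}           # index -> accurately profiled benefit
        self.materialized = {}  # index -> size on disk
        self.hot_capacity = hot_capacity
        self.storage_budget = storage_budget

    def add_candidate(self, index, crude_benefit):
        # Only indexes with positive crude benefit enter the cold set;
        # a new candidate with no benefit is discarded.
        if crude_benefit > 0:
            self.cold[index] = crude_benefit

    def promote_to_hot(self):
        # Promote the top cold candidates, by crude benefit, into any
        # free slots of the hot set.
        free = max(0, self.hot_capacity - len(self.hot))
        ranked = sorted(self.cold, key=self.cold.get, reverse=True)
        for idx in ranked[:free]:
            self.hot[idx] = self.cold.pop(idx)
```

Hot indexes then compete for materialization under the storage budget, as described in the index-selection slides.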
Key Challenges
Challenge: Selective Profiling. Which hot and materialized indexes are profiled with respect to the current query?
Challenge: Index Selection. How are index candidates selected for the cold, hot, and materialized sets?
Selective Profiling (1/2)
Each query is mapped to its cluster; the indexes relevant to that cluster are candidates for profiling
[Diagram: a query maps to one of clusters C1-C4, which point into the cold, hot, and materialized sets]
The indexes to profile are a random sample, focused on indexes with uncertain benefit in the cluster
Profiling budget: a maximum number of what-if calls per epoch
Selective Profiling (2/2)
The budget is adjusted after each epoch: it is set proportional to the potential of the hot indexes, where potential is based on optimistic assumptions
Result:
Stable workload: profiling is suspended
Shifting workload: profiling intensifies
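The budget adjustment above can be sketched as follows. The proportionality constant and the cap are invented for the example; the slide only states that the budget is proportional to the hot indexes' optimistic potential, not the exact formula.

```python
MAX_CALLS_PER_EPOCH = 20  # hard cap on what-if calls per epoch

def next_profiling_budget(hot_potentials, scale=0.1):
    """Set the next epoch's what-if budget proportional to the total
    optimistic potential of the hot indexes (illustrative formula)."""
    potential = sum(hot_potentials)
    return max(0, min(MAX_CALLS_PER_EPOCH, int(scale * potential)))
```

With a stable, well-profiled workload the potentials shrink toward zero and profiling is effectively suspended; a workload shift raises uncertainty and potential, so profiling intensifies, matching the behavior described above.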
Index Selection (1/3)
How are new candidates chosen? Use a crude benefit metric:
Cost of index access vs. sequential scan
Approximate and cheap to compute
When each query arrives:
Compute the crude benefit of the relevant indexes
Indexes with benefit > 0 become candidates for the cold set
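The crude metric can be sketched with simple textbook cost formulas (these particular formulas are assumptions for illustration; the slide only says the metric compares index access against a sequential scan and is cheap to compute):

```python
import math

def crude_benefit(n_pages, n_rows, selectivity):
    """Approximate benefit of a single-column index for one query,
    using only table statistics, so no optimizer call is needed."""
    seq_scan_cost = n_pages  # read every page once
    # Rough B-tree model: descend the tree, then fetch matching rows.
    index_cost = math.log2(max(n_rows, 2)) + selectivity * n_rows
    return max(0.0, seq_scan_cost - index_cost)
```

A selective predicate on a large table yields a clearly positive benefit, while an unselective one yields zero, so the index is never proposed as a candidate.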
Index Selection (2/3)
How are hot indexes chosen? At the end of each epoch:
Get the crude benefit of the hot and cold indexes
Find the cut-off point for membership in the hot set
The cut-off point is derived from a two-cluster model of the crude benefits: indexes above it become hot, the rest stay cold
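One simple way to realize a two-cluster cut-off is a one-dimensional 2-means over the crude benefit values; this is my illustrative reading, and COLT's exact model may differ:

```python
def two_cluster_cutoff(benefits, iters=20):
    """Split benefit values into a low and a high cluster and return
    the midpoint between the two centroids as the hot/cold cut-off."""
    lo, hi = min(benefits), max(benefits)
    for _ in range(iters):
        mid = (lo + hi) / 2
        low = [b for b in benefits if b <= mid]
        high = [b for b in benefits if b > mid]
        if not low or not high:
            break
        lo = sum(low) / len(low)    # centroid of the cold cluster
        hi = sum(high) / len(high)  # centroid of the hot cluster
    return (lo + hi) / 2

# Example: two low-benefit indexes and two clearly better ones.
benefits = [0.5, 1.0, 1.2, 9.0, 10.0]
cutoff = two_cluster_cutoff(benefits)
hot = [b for b in benefits if b > cutoff]
```

The cut-off adapts to the benefit distribution of the current workload instead of relying on a fixed threshold.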
Index Selection (3/3)
How is the materialized set chosen? At the end of each epoch:
Predict the future benefit of the hot and materialized indexes from their observed benefit over time
The cost of materialization is discounted from the predicted benefit
Select the indexes that maximize total benefit; the winners move from the hot set to the materialized set
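The selection step resembles a knapsack under the storage budget. The greedy density heuristic below is a sketch of that idea under my own assumptions (the slide specifies only that creation cost is discounted and total benefit is maximized, not the algorithm):

```python
def choose_materialized(candidates, budget):
    """Pick indexes to materialize under a storage budget.

    candidates: list of (name, size, predicted_benefit, build_cost).
    Creation cost is discounted from the predicted benefit; indexes are
    then packed greedily by net benefit per unit of storage.
    """
    scored = []
    for name, size, benefit, build_cost in candidates:
        net = benefit - build_cost  # discount the one-time creation cost
        if net > 0:
            scored.append((net / size, net, size, name))
    chosen, used = [], 0
    for _, net, size, name in sorted(scored, reverse=True):
        if used + size <= budget:
            chosen.append(name)
            used += size
    return chosen
```

An index whose build cost exceeds its predicted benefit is never materialized, which captures why index creation cost matters in the on-line setting.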
Performance of COLT
Experimented with a 4-phase workload
COLT adapts to each phase
The off-line algorithm chooses the best static index set
[Chart: execution time per phase (1-4), split into minimum time, COLT extra time, and off-line extra time]
Overhead of Profiling
Overhead peaks at the start of each phase
It decreases once the system is well tuned
Average: fewer than 1 what-if call per query
[Chart: profiling overhead over phases 1-4]
Closing Remarks
On-line index selection is harder than off-line:
Efficiency is a higher priority
We essentially need to guess the future
Index creation cost is an issue
COLT is our solution: it solves a constrained problem and is potentially extensible to other domains, where only some components would change
Thank You
Setting System Parameters
Epoch length = 10
What-if limit = 20, or less if overhead is a serious concern
Averaging window: used when predicting future benefit; indicates the number of epochs that give a good picture of the query distribution; not easy to set
Other On-line Systems
Bruno and Chaudhuri, ICDE 2007: avoids extra optimizer calls completely; heuristics to capture index interaction; very different mechanisms for index selection
Sattler, Schallehn, and Geist, IDEAS 2004: more similarities to COLT; different prediction of future benefit; no control of profiling overhead
Performance with Noise
The worst-case scenario for COLT: concentrated bursts of noise queries
Performance loss in some cases
[Chart: relative performance (20%-100%) under bursts of noise]