1 auto administration of databases based on clustering mujiba zaman jyotsna surabattula le gruenwald...

1

Auto administration of databases Auto administration of databases based on clusteringbased on clustering

Mujiba Zaman

Jyotsna Surabattula

Le Gruenwald

School of Computer ScienceThe University of Oklahoma

Norman, Oklahoma, 73019, USA{mujiba, jyotsna, ggruenwald}@ou.edu

2

Overview of the PresentationOverview of the Presentation

Motivation for auto-indexing Existing techniques Proposed Index Selection Technique Algorithm for Proposed Index Selection Technique Proposed Re-Indexing Technique Experiments Results Conclusions Future Work

3

Motivation of Auto-indexingMotivation of Auto-indexing

Index selection is an important part of physical database design

For large databases it is difficult for DBA’s to analyze data and find an optimal set of indices

The goal of auto-indexing includes– Analyze workload automatically to identify a good set of

indexes.– Create those indexes.– Automatically evaluate current indexes and re-index if

necessary

4

Existing TechniquesExisting Techniques

The index selection problem (ISP) has been approached in two different

ways to build index selection tools:

External Tools: • Some basic assumptions are made• Cost functions are formulated based on the assumptions• Attempt to minimize the cost function using combinatorial

optimization technique or by some heuristic method

Optimizer Based Tools: • Utilize the query optimizer to give cost estimates for various index

configurations and suggest a configuration with the least cost estimation.

5

Advantages and Disadvantages of Advantages and Disadvantages of existing techniquesexisting techniques

External Tools:

Advantages: Applications using the DBMS will not suffer processing delays

Disadvantages: Disconnected from the optimizer –Index suggested may not be used by the optimizer Becomes obsolete if optimizer changes

6

Advantages and Disadvantages of Advantages and Disadvantages of existing techniquesexisting techniques

Optimizer Based Tools:

Advantages: Suggested indexes will be used by the optimizer

Disadvantages :

Performs expensive operation of optimizer invocation Longer processing time for other applications using the DBMS when indexes are being suggested.

7

Proposed Index Selection TechniqueProposed Index Selection Technique

Combines both the approaches.

Based on the intuition that the attributes that occur more commonly and frequently in a group of similar queries are likely to be useful for indexing.

Use a Data Mining Clustering technique to group queries which are similar in terms of their use of attributes.

Extract Indexable attributes in all the queries in each group as indexes. These indexes can be single-column or multi-column

8

Proposed Index Selection TechniqueProposed Index Selection Technique

For multi-column indexes, the order of the columns is determined by assigning weights to attributes based on where they are used in the queries.

A clustered index is also chosen by assigning weights to the attributes depending on where they occur in the queries.

Extracted indexes are then submitted to the query optimizer for final selection for the given workload.

The indexes not selected by the optimizer are eliminated. The remaining indexes are the final indexes suggested by our tool.

Re-index if total table scan time using current indexes is higher than that using new indexes.

9

Algorithm for proposed techniqueAlgorithm for proposed technique 1. Input is a workload of queries2. Parse the workload to find out all the indexable attributes and build two matrix:

- Query-Attribute matrix (1: presence of an attribute; 0 otherwise)

- Attribute-frequency matrix (frequency of the attribute indicated by the number)

Consider the following queries:

Q1. Select T1.A, T1.B, Sum(T1.C) from T1, T4, where T1.A=T4.K and T1.C=T3.H and T1.B<=30 group by T1.C

Q2. Select Ave(G) from T3 where F like ‘this%’ and G between 10 and 20 order by G

10

Algorithm for proposed techniqueAlgorithm for proposed technique

A corresponding query attribute matrix with several other queries could be as follows:

Table 1 Query Attribute Matrix

Query T1.A T1.B T1.C T2.D T2.E T3.F T3.G T3.H T4.K

Q1 1 1 1 0 0 0 0 1 1

Q2 0 0 0 0 0 1 1 0 0

Q3 0 0 0 0 1 1 1 0 0

Q4 1 1 1 1 0 0 0 1 1

Q5 0 0 1 0 0 1 1 0 0

11


A corresponding attribute frequency matrix could be as follows:

Table 2 Attribute Frequency Matrix


Q1 1 1 3 0 0 0 0 1 1

Q2 0 0 0 0 0 1 3 0 0

Q3 0 0 0 0 2 3 1 0 0

Q4 2 5 1 3 0 0 0 4 4

Q5 0 0 3 0 0 5 2 0 0

Freq 3 6 7 3 2 9 6 5 5

12

Candidate Index SelectionCandidate Index Selection

3. Choose candidate index set

Freq > threshold1

OR

Freq * T > threshold2

T = table size/constant

Table 3. Attribute-frequency matrix

Table size of T1 = 5000Table size of T2 = 2000Table size of T3 = 100Table size of T4 = 20T = table size/100Let threshold1 = 5, threshold2 = 30


Q1 1 1 3 0 0 0 0 1 1

Q2 0 0 0 0 0 1 3 0 0

Q3 0 0 0 0 2 3 1 0 0

Q4 2 5 1 3 0 0 0 4 4

Q5 0 0 3 0 0 5 2 0 0

Freq 3 6 7 3 2 9 6 5 5

Freq*T 150 300 350 60 40 9 6 5 1

13

Non-Clustered Index SelectionNon-Clustered Index Selection

4 a. Assign weights to the candidate attribute set as follows:

Where Clause: 3

Group by/Order by: 2

Aggregate functions: 1

Table 4. Candidate Attribute Set

Query T1.A T1.B T1.C T2.D T2.E T3.F T3.G

Q1 3 3 3+2+1 0 0 0 0

Q2 0 0 0 0 0 3 3+2+1

Q3 0 0 0 0 6 5 4

Q4 2 5 1 3 0 0 0

Q5 0 0 3 0 0 5 2

TotalWeight

5 8 10 3 6 13 12

14


4 b. Order the columns in the candidate attribute set in each table in descending order of their weights for determining non-clustered indexes.

Query T1.C T1.B T1.A T2.E T2.D T3.F T3.G

Q1 6 3 3 0 0 0 0

Q2 0 0 0 0 0 3 6

Q3 0 0 0 6 0 5 4

Q4 1 5 2 0 3 0 0

Q5 3 0 0 0 0 5 2

TotalWeight

10 8 5 6 3 13 12

Table 5. Ordered Candidate Attribute Set

15


4 c. Build a Query-Attribute Matrix with the ordered columns

Table 6. Query Attribute Matrix with ordered candidate index set

Query T1.C T1.B T1.A T2.E T2.D T3.F T3.G

Q1 1 1 1 0 0 0 0

Q2 0 0 0 0 0 1 1

Q3 0 0 0 1 0 1 1

Q4 1 1 1 0 1 0 0

Q5 1 0 0 0 0 1 1

TotalWeight

10 8 5 6 3 13 12

16


5. Use a data mining clustering technique on the Query-Attribute matrix to group similar queries. A possible clustering result is[Q1 Q4] [Q2 Q3 Q5] (Table 6)

6. Extract the index sets for each table from the clusters obtained. These are all the common attributes from all the queries clustered together.

For the cluster [Q1, Q4] Indexes are: (T1.C, T1.B, T1.A) in that orderFor the cluster [Q2 Q3 Q5] Indexes are (T3.F, T3.G) in that order

17

Clustered Index SelectionClustered Index Selection

6. Choose single column clustered index as follows: a. During parsing assign the following weight to the attributes Range queries: 2 Join: 1 Group by/Order by: 1 b. Rank the attributes according to their weight

Query T1.A T1.B T1.C T2.D T2.E T3.F T3.G

Q1 1 2 1 0 0 0 0

Q2 0 0 0 0 0 0 2+1

Q3 0 0 0 0 1 1 1

Q4 1 3 0 2 0 0 0

Q5 0 0 1 0 0 1 1

TotalWeight

2 5 2 2 1 2 5

Rank 1 2 1 2 1 1 2

Table 7. Ranking according to weight for clustered index

18


c. Determine the selectivity of columns and rank them according to the selectivity for each table.

T1.A T1.B T1.C T2.D T2.E T3.F T3.G

Selectivity 1 0.85 0.9 0.88 0.55 0.65 0.82

Rank N/A 1 2 2 1 1 2

Table 8. Ranking according to selectivity for clustered index

19


d. Find the sum of rank with selectivity and rank with weight for each

column.The indexable attribute with the largest sum is suggested as clustered index for the table. If more than one indexable attributes have the highest sum then select the attribute with higher rank in weight.

Column name Rank withselectivity

Rank with weight Sum

T1.A N/A 1 N/A

T1.B 1 2 3

T1.C 2 1 3

T2.D 2 2 4

T2.E 1 1 2

T3.F 1 1 2

T3.G 2 2 4

Table 9. Sum of Ranks with Weight and Selectivity for Clustered Index

20


7. Provide both clustered and non-

clustered indexes to the query optimizer

8. Let the optimizer display the estimated

execution plan to execute the workload

in the database.

9. Select the indexes used by the

optimizer as the final suggested index.

Workload, database information

Identifying candidate indexes

Clustering

Candidate index suggestion and creation

Query optimizer index elimination

Exte

rna

l to th

e O

ptim

ize

r

21

Proposed Re-indexing TechniqueProposed Re-indexing Technique

An auto-indexing tool should be capable of re-indexing whenever the current indexes are no longer good

The DBMS can periodically monitor the cost of total table scan for a particular size of workload

When this value exceeds a limit the DBMS triggers the index selection tool to suggest new set of indexes.

The limit to trigger the tool can be determined from the relationship between increase in table scan cost and performance gain due to re-indexing. This can be set by the DBA.

22


Produce a Chart showing “Re-indexing performance gain vs. Table scan cost” => Guildelines for DBA by doing the following:

Assume the current workload is Wc, current index set is INc, compute total current table scan cost TSc.

Obtain a set of different workload samples over a long history: W2, W3,.. Wn.

Compute total table scan cost for each of workload sample using the current index set: TS2c, TS3c,…, TSnc.

23

Proposed Re-Indexing Technique (Cont.)Proposed Re-Indexing Technique (Cont.)

Run our index selection tool on W2, W3, …, Wn to get the corresponding recommended sets of indexes: IN2r, IN3r,…, INnr.

Compute total table scan cost for each workload using its new recommended set of indexes: TS2r, TS3r,…, TSnr.

Compute the percentage of performance improvement due to reindexing for each workload Wj: [(TSc – TSjr)/TSc)* 100] for j = 2,.., n.

24


Table scan cost vs improvement in re-indexing

0

10

20

30

40

50

60

70

80

0 500 1000 1500 2000

increase in table scan cost (times)

% im

prov

emen

t in

exec

utio

n tim

e du

e to

re-in

dexi

ng

25


If the DBA has the chart in advance, the DBA can set up the limit where he/she wants the DBMS to trigger index selection tool.

The existing index set is then compared with the new index set

Indexes which are part of new but not part of existing set are created, those which are part of existing set and not in new set are dropped and those which intersect remain

The process of dropping and creating indexes in the system follows similar methodology as Oracle’s Automated Index-Rebuild System which can be done either online or offline.

26

ExperimentsExperiments

Performance Metric : average query response time =time taken to execute the workload in minutes divided by the total number of queries in the workload

All experiments are conducted on the system Intel Pentium 4-M, CPU 2.0GHz, 512 MB RAM.

Experiments conducted on TPC-R benchmark with its 22 read-only queries

Experiments conducted on Microsoft SQL Server 2000

27

ExperimentsExperiments

Clustering algorithms used in experiment:

1. MACQueen’s k – means clustering algorithm• Used hamming distance as the distance function

2. KEROUAC (knowledge explicit, rapid, off beat and user-centered algorithm for clustering)

K-means is a well-established data clustering algorithm KEROUAC is a clustering algorithm for practical advantage: it

doesn’t require the final clusters number setting Both algorithms have low computational cost

28

Results With Results With KK-means-means

Chart 1

Number of clusters-k Vs Average query response time

0

0.5

1

1.5

2

2.5

3

3 5 8 11 14 16 18 20 22

Number of clusters-k

Ave

rag

e q

ue

ry r

esp

on

se

tim

e i

n m

inu

tes

threshold2=20

threshold2=60

Microsoft IST

No Index

Frequent Itemset

29

Results with KEROUACResults with KEROUAC

Chart 2

Granularity Factor-α Vs Average query response time

0

0.5

1

1.5

2

2.5

3

3 5 8 10 12 15 20 25

Granularity factor-α

Ave

rag

e q

ue

ry r

esp

on

se

tim

e i

n m

inu

tes

threshold2=20

threshold2=60

threshold2=100

Microsoft IST

No Index

Frequent Itemset

30

Results with Index Suggestion TimeResults with Index Suggestion Time

0123456789

minutes

240 500 730

Workload Size

Comparison of Index Suggestion Time

SQL Server

Our Tool

31

ConclusionsConclusions

Performance of the tool critically depends on the choice of (for KEROUAC) and k (for k-means) and the threshold value.

Increasing (k) will group queries with higher similarity to each other in a cluster.

Increasing (k) beyond some point has no effect (all identical queries are already in the same cluster).

We can therefore achieve desired results by choosing (k) = workload size.

Increasing threshold means more attributes eliminated from consideration

Experiments show that a threshold value about workloadsize/4 works good.

Our tool chooses these parameters to operate on the best performance range.

32

ConclusionsConclusions We compared our results with Microsoft SQL Server’s Index

Selection Tool (An optimizer Based Tool) and also with Frequent Itemsets Mining (An External Tool)

Best performance improvement using k – Means clustering compared with Frequent Itemsets Mining is 71.43 %

Best performance improvement using k – Means clustering compared with Microsoft Index Selection Tool is 16.2 %

Best performance improvement using KEROUAC clustering compared with Frequent Itemsets Mining is 73.26 %

Best performance improvement using KEROUAC clustering compared with Microsoft Index Selection Tool is 21.5 %

The index suggestion time for Microsoft Index Selection Tool was 4 times higher than our tool for a workload size of 240

33

Future WorkFuture Work

Test the dependence of the technique on different clustering algorithms

Test with different sizes of workload

Test with update queries in the workload

Test with index elimination technique

34

Thanks!

Questions?

35

Indexable AttributesIndexable Attributes

Indexable Attributes/Columns: Columns which belong to WHERE, GROUP BY, ORDER BY

clauses Operators {=, <, >, <=, >=, BETWEEN, IN}

Example:

SELECT * FROM table1,table2 WHERE table1.column1 = table2.column1

AND (table2.column2 BETWEEN 0 AND 1000)

the indexable attributes are:table1.column1, table2.column1, table2.column2

36

Selectivity of a columnSelectivity of a column

Selectivity: Selectivity ratio of a column/index

= number of unique values in a column/index of the table divided by total number of rows in that table

SELECT COUNT (DISTINCT (column name)) FROM table name;

If a column/index has high selectivity then it is more useful to the optimizer and has more chances to be picked up by the optimizer while executing a query.

37

Multicolumn IndexesMulticolumn Indexes

Multicolumn indexes:

Column involved in multicolumn index should be joined with AND clause and not with OR clause

Order of the columns in a multicolumn index is important. Order should be based on selectivity and also the first ordered column should be the most used column in queries.

An index (a, b, c) is used by queries involving a, b, c both a and b or a but not in any other combinations

Example:An index (major, minor) is suitable for the following querySELECT name FROM test WHERE major = constantAND minor = constant;

38

Frequent Itemset Mining TechniqueFrequent Itemset Mining Technique

Input is a workload of queries Extract indexable attributes. Create a query attribute matrix.

Attributes

Queries A B C D E

Q1 1 0 1 1 0

Q2 0 1 1 0 1

Q3 1 1 1 0 1

Q4 0 1 0 0 1

Q5 1 1 1 0 1

Q6 0 1 1 0 1

Table 1

Q1: SELECT * FROM T1, T2 WHERE A BETWEEN 1 AND 10 AND C = DQ2: SELECT * FROM T1, T2 WHERE B LIKE ‘%this%’ AND C=5 AND E<100Q3: SELECT * FROM T1, T2 WHERE A=30 AND B>3 GROUP BY C HAVING SUM(E)>2

Q4: SELECT * FROM T1 WHERE B>2 AND E IN (3,2,5)Q5: SELECT * FROM T1, T2 WHERE A=30 AND B>3 GROUP BY C HAVING SUM(E)>2Q6: SELECT * FROM T1, T2 WHERE B>3 GROUP BY C HAVING SUM(E)>2

39

“A closed itemset is a maximal set of items (attributes) that are common to a set of transactions (queries)”

The candidate indexes selected for minimal support greater than or equal to 2/6 in the example are as follows:[1]

{(AC,3/6), (BE, 5/6), (ABCE, 2/6), (BCE, 4/6)}

Attributes

Queries A B C D E

Q1 1 0 1 1 0

Q2 0 1 1 0 1

Q3 1 1 1 0 1

Q4 0 1 0 0 1

Q5 1 1 1 0 1

Q6 0 1 1 0 1

Table 2

Frequent Itemset Mining TechniqueFrequent Itemset Mining Technique

40

SQL-Server Index Selection ToolSQL-Server Index Selection Tool

Workload

Candidate indexselection

ConfigurationEnimeration

Multi-column IndexGeneration

What-if IndexCreation

CostEvaluation

SQL Server

Final Indexes

Figure 1

41


Candidate index selection:

For a given workload W that consists of n queries, n workloads Wi..Wn, each consisting of one query are generated.

For each workload Wi, the set of indexable columns of the query Ii is the starting candidate indexes.

Let Ci be the configuration picked by the index selection tool for Wi

The candidate index set for W is the union of all Ci

42


Configuration enumeration:Problem Pick k indexes from a set of n candidate indexes.

Algorithm

1 Let S=the best m index configuration using the naïve enumeration algorithm. If m=k then exit

2 Pick a new index I such that Cost (S U I) < Cost (S U I’) for any choice of I’ != I

3 If Cost (S U I )>= Cost (S) then exit

else S=S U I

4 If |S| = k then exit

5 Go to 2

43


Cost Evaluation:

Reduce the number of invocations of the optimizer by deriving costs from already evaluated costs.

A cost of a non atomic configuration can be derived from atomic configuration

“A configuration C is atomic for a workload if for some query in the workload there is a possible execution of a query in the workload by the query engine that uses all indexes in C”.

Not every atomic configuration needs to be evaluated for every single query in the workload

44


Multicolumn index generation:

For given K columns K! multicolumn indexes are possible

Iterative approach– First iteration single column indexes are considered– Only the selected single column indexes are input to the two

column indexes in the first iteration– This set of two-column indexes along with single column ones

are input to the third iteration and so on.

45

Clustered IndexClustered Index

Clustered indexes:

“A page allocated to an index is called a data page.”

For tables having clustered index the data rows of each data page are stored in order and the data pages are linked together by doubly-linked list.

For table having no clustered index the data rows are not stored in any particular order. Figure 2

46

Clustered IndexClustered Index

Accessing data with a clustered index:

Figure 3

47

Criteria to choose clustered indexCriteria to choose clustered index

Queries that return large result sets

Columns with a number of duplicate values that are searched frequently

Columns other than primary key that are frequently used in join clauses

Columns searched within a range of values

Columns used in ORDER BY or GROUP BY queries

48

MACQueen’s k – Means Clustering algorithm

1) First k data units are chosen as clusters of one member each.

2) Remaining m – k data units are assigned to the clusters whose centroid is nearest to the data unit under consideration. Centroid is recomputed after every gain in the cluster

3) Iterate through the data set assigning each data unit to its nearest cluster taking the existing cluster centroids as fixed seed points until a certain criteria is reached

49

MACQueen’s k – Means Clustering algorithm

Hamming distance is the number of positions in which two binary words differ

For k = 3; Q1, Q2, Q3 are initial clusters

Hamming distance of Q1 and Q4 is 5Q2 and Q4 is 1Q3 and Q4 is 2

Clusters at step 2 are cluster1: [Q1]cluster2: [Q2, Q4, Q6] cluster3: [Q3,Q5]

Attributes

Queries A B C D E

Q1 1 0 1 1 0

Q2 0 1 1 0 1

Q3 1 1 1 0 1

Q4 0 1 0 0 1

Q5 1 1 1 0 1

Q6 0 1 1 0 1

Table 3

50

KEROUAC: Knowledge Explicit, Rapid, OFF beat and User-centered Algorithm for Clustering

New Condorcet Criterion (NCC) :

NCC(Pz) = Sim(Ei,Ej) + α x Dissim(Ei)

NCC represents the degree of dissimilarity of objects belonging

to the same cluster and the degree of similarity between clusters.

It is desired to be minimum.

α is called the granularity factor

KEROUAC Clustering algorithm

51

KEROUAC Clustering algorithmAttributes

Queries A B C D E

Q1 1 1 0 0 0

Q2 1 0 1 0 0

Q3 0 1 1 0 0

Q4 1 1 1 0 0

Q5 0 0 1 1 1

Q6 0 0 0 1 1

Table 4

1, 2, 3, 4, 5, 6

1, 3, 4 2, 5, 6

According to B

1, 4 3

According to A

2 5, 6

According to D and E

Neighbor Neighbor

• Calculate NCC for all the neighbors and pick the neighbor with least NCC

• Repeat until a reduction in NCC is not possible

Figure 4

52

Proposed index selection algorithmProposed index selection algorithm

Step 1: Input workload

Step 2: Parse the workload to find indexable attributes = > Query-attribute matrix

Step 3: Identify candidate indexable attributes based on frequency of attributes in the workload and threshold1 and threshold2

Step 4: Compute the total weight of each attribute using the clause weight assignment policy and order the attributes in each table in decreasing order of weights in order to identify non-clustered indexes in Step 6 => Ordered query-attribute matrix.

Step 5: Perform a data mining clustering technique on the ordered query-attribute matrix to group similar queries based on their use of attributes.

53

Proposed Index Selection Algorithm Proposed Index Selection Algorithm (Cont.)(Cont.)

Step 6: Extract index sets for each table from the clustering results in Step5.

Step 7: Select a single attribute clustered index for each table by computing the sum of range query’s weight and selectivity ranking for each attribute and selecting the attribute with the highest sum.

Step 8: Provide the query optimizer with the virtual set of indexes chosen in

steps 5 and 7 (including both non-clustered and clustered indexes).

Step 9: Let the optimizer display the estimated execution plan to execute the workload.

Step 10: Select the indexes used by the optimizer as the final suggested index.

1 auto administration of databases based on clustering mujiba zaman jyotsna surabattula le gruenwald...

Documents

optimizer index

suggested indexes

index selection tools

current indexes

final indexes

new indexes

remaining indexes

clustered index