1 auto administration of databases based on clustering mujiba zaman jyotsna surabattula le gruenwald...
TRANSCRIPT
1
Auto administration of databases Auto administration of databases based on clusteringbased on clustering
Mujiba Zaman
Jyotsna Surabattula
Le Gruenwald
School of Computer ScienceThe University of Oklahoma
Norman, Oklahoma, 73019, USA{mujiba, jyotsna, ggruenwald}@ou.edu
2
Overview of the PresentationOverview of the Presentation
Motivation for auto-indexing Existing techniques Proposed Index Selection Technique Algorithm for Proposed Index Selection Technique Proposed Re-Indexing Technique Experiments Results Conclusions Future Work
3
Motivation of Auto-indexingMotivation of Auto-indexing
Index selection is an important part of physical database design
For large databases it is difficult for DBA’s to analyze data and find an optimal set of indices
The goal of auto-indexing includes– Analyze workload automatically to identify a good set of
indexes.– Create those indexes.– Automatically evaluate current indexes and re-index if
necessary
4
Existing TechniquesExisting Techniques
The index selection problem (ISP) has been approached in two different
ways to build index selection tools:
External Tools: • Some basic assumptions are made• Cost functions are formulated based on the assumptions• Attempt to minimize the cost function using combinatorial
optimization technique or by some heuristic method
Optimizer Based Tools: • Utilize the query optimizer to give cost estimates for various index
configurations and suggest a configuration with the least cost estimation.
5
Advantages and Disadvantages of Advantages and Disadvantages of existing techniquesexisting techniques
External Tools:
Advantages: Applications using the DBMS will not suffer processing delays
Disadvantages: Disconnected from the optimizer –Index suggested may not be used by the optimizer Becomes obsolete if optimizer changes
6
Advantages and Disadvantages of Advantages and Disadvantages of existing techniquesexisting techniques
Optimizer Based Tools:
Advantages: Suggested indexes will be used by the optimizer
Disadvantages :
Performs expensive operation of optimizer invocation Longer processing time for other applications using the DBMS when indexes are being suggested.
7
Proposed Index Selection TechniqueProposed Index Selection Technique
Combines both the approaches.
Based on the intuition that the attributes that occur more commonly and frequently in a group of similar queries are likely to be useful for indexing.
Use a Data Mining Clustering technique to group queries which are similar in terms of their use of attributes.
Extract Indexable attributes in all the queries in each group as indexes. These indexes can be single-column or multi-column
8
Proposed Index Selection TechniqueProposed Index Selection Technique
For multi-column indexes, the order of the columns is determined by assigning weights to attributes based on where they are used in the queries.
A clustered index is also chosen by assigning weights to the attributes depending on where they occur in the queries.
Extracted indexes are then submitted to the query optimizer for final selection for the given workload.
The indexes not selected by the optimizer are eliminated. The remaining indexes are the final indexes suggested by our tool.
Re-index if total table scan time using current indexes is higher than that using new indexes.
9
Algorithm for proposed techniqueAlgorithm for proposed technique 1. Input is a workload of queries2. Parse the workload to find out all the indexable attributes and build two matrix:
- Query-Attribute matrix (1: presence of an attribute; 0 otherwise)
- Attribute-frequency matrix (frequency of the attribute indicated by the number)
Consider the following queries:
Q1. Select T1.A, T1.B, Sum(T1.C) from T1, T4, where T1.A=T4.K and T1.C=T3.H and T1.B<=30 group by T1.C
Q2. Select Ave(G) from T3 where F like ‘this%’ and G between 10 and 20 order by G
10
Algorithm for proposed techniqueAlgorithm for proposed technique
A corresponding query attribute matrix with several other queries could be as follows:
Table 1 Query Attribute Matrix
Query T1.A T1.B T1.C T2.D T2.E T3.F T3.G T3.H T4.K
Q1 1 1 1 0 0 0 0 1 1
Q2 0 0 0 0 0 1 1 0 0
Q3 0 0 0 0 1 1 1 0 0
Q4 1 1 1 1 0 0 0 1 1
Q5 0 0 1 0 0 1 1 0 0
11
Algorithm for proposed techniqueAlgorithm for proposed technique
A corresponding attribute frequency matrix could be as follows:
Table 2 Attribute Frequency Matrix
Query T1.A T1.B T1.C T2.D T2.E T3.F T3.G T3.H T4.K
Q1 1 1 3 0 0 0 0 1 1
Q2 0 0 0 0 0 1 3 0 0
Q3 0 0 0 0 2 3 1 0 0
Q4 2 5 1 3 0 0 0 4 4
Q5 0 0 3 0 0 5 2 0 0
Freq 3 6 7 3 2 9 6 5 5
12
Candidate Index SelectionCandidate Index Selection
3. Choose candidate index set
Freq > threshold1
OR
Freq * T > threshold2
T = table size/constant
Table 3. Attribute-frequency matrix
Table size of T1 = 5000Table size of T2 = 2000Table size of T3 = 100Table size of T4 = 20T = table size/100Let threshold1 = 5, threshold2 = 30
Query T1.A T1.B T1.C T2.D T2.E T3.F T3.G T3.H T4.K
Q1 1 1 3 0 0 0 0 1 1
Q2 0 0 0 0 0 1 3 0 0
Q3 0 0 0 0 2 3 1 0 0
Q4 2 5 1 3 0 0 0 4 4
Q5 0 0 3 0 0 5 2 0 0
Freq 3 6 7 3 2 9 6 5 5
Freq*T 150 300 350 60 40 9 6 5 1
13
Non-Clustered Index SelectionNon-Clustered Index Selection
4 a. Assign weights to the candidate attribute set as follows:
Where Clause: 3
Group by/Order by: 2
Aggregate functions: 1
Table 4. Candidate Attribute Set
Query T1.A T1.B T1.C T2.D T2.E T3.F T3.G
Q1 3 3 3+2+1 0 0 0 0
Q2 0 0 0 0 0 3 3+2+1
Q3 0 0 0 0 6 5 4
Q4 2 5 1 3 0 0 0
Q5 0 0 3 0 0 5 2
TotalWeight
5 8 10 3 6 13 12
14
Non-Clustered Index SelectionNon-Clustered Index Selection
4 b. Order the columns in the candidate attribute set in each table in descending order of their weights for determining non-clustered indexes.
Query T1.C T1.B T1.A T2.E T2.D T3.F T3.G
Q1 6 3 3 0 0 0 0
Q2 0 0 0 0 0 3 6
Q3 0 0 0 6 0 5 4
Q4 1 5 2 0 3 0 0
Q5 3 0 0 0 0 5 2
TotalWeight
10 8 5 6 3 13 12
Table 5. Ordered Candidate Attribute Set
15
Non-Clustered Index SelectionNon-Clustered Index Selection
4 c. Build a Query-Attribute Matrix with the ordered columns
Table 6. Query Attribute Matrix with ordered candidate index set
Query T1.C T1.B T1.A T2.E T2.D T3.F T3.G
Q1 1 1 1 0 0 0 0
Q2 0 0 0 0 0 1 1
Q3 0 0 0 1 0 1 1
Q4 1 1 1 0 1 0 0
Q5 1 0 0 0 0 1 1
TotalWeight
10 8 5 6 3 13 12
16
Non-Clustered Index SelectionNon-Clustered Index Selection
5. Use a data mining clustering technique on the Query-Attribute matrix to group similar queries. A possible clustering result is[Q1 Q4] [Q2 Q3 Q5] (Table 6)
6. Extract the index sets for each table from the clusters obtained. These are all the common attributes from all the queries clustered together.
For the cluster [Q1, Q4] Indexes are: (T1.C, T1.B, T1.A) in that orderFor the cluster [Q2 Q3 Q5] Indexes are (T3.F, T3.G) in that order
17
Clustered Index SelectionClustered Index Selection
6. Choose single column clustered index as follows: a. During parsing assign the following weight to the attributes Range queries: 2 Join: 1 Group by/Order by: 1 b. Rank the attributes according to their weight
Query T1.A T1.B T1.C T2.D T2.E T3.F T3.G
Q1 1 2 1 0 0 0 0
Q2 0 0 0 0 0 0 2+1
Q3 0 0 0 0 1 1 1
Q4 1 3 0 2 0 0 0
Q5 0 0 1 0 0 1 1
TotalWeight
2 5 2 2 1 2 5
Rank 1 2 1 2 1 1 2
Table 7. Ranking according to weight for clustered index
18
Clustered Index SelectionClustered Index Selection
c. Determine the selectivity of columns and rank them according to the selectivity for each table.
T1.A T1.B T1.C T2.D T2.E T3.F T3.G
Selectivity 1 0.85 0.9 0.88 0.55 0.65 0.82
Rank N/A 1 2 2 1 1 2
Table 8. Ranking according to selectivity for clustered index
19
Clustered Index SelectionClustered Index Selection
d. Find the sum of rank with selectivity and rank with weight for each
column.The indexable attribute with the largest sum is suggested as clustered index for the table. If more than one indexable attributes have the highest sum then select the attribute with higher rank in weight.
Column name Rank withselectivity
Rank with weight Sum
T1.A N/A 1 N/A
T1.B 1 2 3
T1.C 2 1 3
T2.D 2 2 4
T2.E 1 1 2
T3.F 1 1 2
T3.G 2 2 4
Table 9. Sum of Ranks with Weight and Selectivity for Clustered Index
20
Algorithm for proposed techniqueAlgorithm for proposed technique
7. Provide both clustered and non-
clustered indexes to the query optimizer
8. Let the optimizer display the estimated
execution plan to execute the workload
in the database.
9. Select the indexes used by the
optimizer as the final suggested index.
Workload, database information
Identifying candidate indexes
Clustering
Candidate index suggestion and creation
Query optimizer index elimination
Exte
rna
l to th
e O
ptim
ize
r
21
Proposed Re-indexing TechniqueProposed Re-indexing Technique
An auto-indexing tool should be capable of re-indexing whenever the current indexes are no longer good
The DBMS can periodically monitor the cost of total table scan for a particular size of workload
When this value exceeds a limit the DBMS triggers the index selection tool to suggest new set of indexes.
The limit to trigger the tool can be determined from the relationship between increase in table scan cost and performance gain due to re-indexing. This can be set by the DBA.
22
Proposed Re-indexing TechniqueProposed Re-indexing Technique
Produce a Chart showing “Re-indexing performance gain vs. Table scan cost” => Guildelines for DBA by doing the following:
Assume the current workload is Wc, current index set is INc, compute total current table scan cost TSc.
Obtain a set of different workload samples over a long history: W2, W3,.. Wn.
Compute total table scan cost for each of workload sample using the current index set: TS2c, TS3c,…, TSnc.
23
Proposed Re-Indexing Technique (Cont.)Proposed Re-Indexing Technique (Cont.)
Run our index selection tool on W2, W3, …, Wn to get the corresponding recommended sets of indexes: IN2r, IN3r,…, INnr.
Compute total table scan cost for each workload using its new recommended set of indexes: TS2r, TS3r,…, TSnr.
Compute the percentage of performance improvement due to reindexing for each workload Wj: [(TSc – TSjr)/TSc)* 100] for j = 2,.., n.
24
Proposed Re-indexing TechniqueProposed Re-indexing Technique
Table scan cost vs improvement in re-indexing
0
10
20
30
40
50
60
70
80
0 500 1000 1500 2000
increase in table scan cost (times)
% im
prov
emen
t in
exec
utio
n tim
e du
e to
re-in
dexi
ng
25
Proposed Re-indexing TechniqueProposed Re-indexing Technique
If the DBA has the chart in advance, the DBA can set up the limit where he/she wants the DBMS to trigger index selection tool.
The existing index set is then compared with the new index set
Indexes which are part of new but not part of existing set are created, those which are part of existing set and not in new set are dropped and those which intersect remain
The process of dropping and creating indexes in the system follows similar methodology as Oracle’s Automated Index-Rebuild System which can be done either online or offline.
26
ExperimentsExperiments
Performance Metric : average query response time =time taken to execute the workload in minutes divided by the total number of queries in the workload
All experiments are conducted on the system Intel Pentium 4-M, CPU 2.0GHz, 512 MB RAM.
Experiments conducted on TPC-R benchmark with its 22 read-only queries
Experiments conducted on Microsoft SQL Server 2000
27
ExperimentsExperiments
Clustering algorithms used in experiment:
1. MACQueen’s k – means clustering algorithm• Used hamming distance as the distance function
2. KEROUAC (knowledge explicit, rapid, off beat and user-centered algorithm for clustering)
K-means is a well-established data clustering algorithm KEROUAC is a clustering algorithm for practical advantage: it
doesn’t require the final clusters number setting Both algorithms have low computational cost
28
Results With Results With KK-means-means
Chart 1
Number of clusters-k Vs Average query response time
0
0.5
1
1.5
2
2.5
3
3 5 8 11 14 16 18 20 22
Number of clusters-k
Ave
rag
e q
ue
ry r
esp
on
se
tim
e i
n m
inu
tes
threshold2=20
threshold2=60
Microsoft IST
No Index
Frequent Itemset
29
Results with KEROUACResults with KEROUAC
Chart 2
Granularity Factor-α Vs Average query response time
0
0.5
1
1.5
2
2.5
3
3 5 8 10 12 15 20 25
Granularity factor-α
Ave
rag
e q
ue
ry r
esp
on
se
tim
e i
n m
inu
tes
threshold2=20
threshold2=60
threshold2=100
Microsoft IST
No Index
Frequent Itemset
30
Results with Index Suggestion TimeResults with Index Suggestion Time
0123456789
minutes
240 500 730
Workload Size
Comparison of Index Suggestion Time
SQL Server
Our Tool
31
ConclusionsConclusions
Performance of the tool critically depends on the choice of (for KEROUAC) and k (for k-means) and the threshold value.
Increasing (k) will group queries with higher similarity to each other in a cluster.
Increasing (k) beyond some point has no effect (all identical queries are already in the same cluster).
We can therefore achieve desired results by choosing (k) = workload size.
Increasing threshold means more attributes eliminated from consideration
Experiments show that a threshold value about workloadsize/4 works good.
Our tool chooses these parameters to operate on the best performance range.
32
ConclusionsConclusions We compared our results with Microsoft SQL Server’s Index
Selection Tool (An optimizer Based Tool) and also with Frequent Itemsets Mining (An External Tool)
Best performance improvement using k – Means clustering compared with Frequent Itemsets Mining is 71.43 %
Best performance improvement using k – Means clustering compared with Microsoft Index Selection Tool is 16.2 %
Best performance improvement using KEROUAC clustering compared with Frequent Itemsets Mining is 73.26 %
Best performance improvement using KEROUAC clustering compared with Microsoft Index Selection Tool is 21.5 %
The index suggestion time for Microsoft Index Selection Tool was 4 times higher than our tool for a workload size of 240
33
Future WorkFuture Work
Test the dependence of the technique on different clustering algorithms
Test with different sizes of workload
Test with update queries in the workload
Test with index elimination technique
34
Thanks!
Questions?
35
Indexable AttributesIndexable Attributes
Indexable Attributes/Columns: Columns which belong to WHERE, GROUP BY, ORDER BY
clauses Operators {=, <, >, <=, >=, BETWEEN, IN}
Example:
SELECT * FROM table1,table2 WHERE table1.column1 = table2.column1
AND (table2.column2 BETWEEN 0 AND 1000)
the indexable attributes are:table1.column1, table2.column1, table2.column2
36
Selectivity of a columnSelectivity of a column
Selectivity: Selectivity ratio of a column/index
= number of unique values in a column/index of the table divided by total number of rows in that table
SELECT COUNT (DISTINCT (column name)) FROM table name;
If a column/index has high selectivity then it is more useful to the optimizer and has more chances to be picked up by the optimizer while executing a query.
37
Multicolumn IndexesMulticolumn Indexes
Multicolumn indexes:
Column involved in multicolumn index should be joined with AND clause and not with OR clause
Order of the columns in a multicolumn index is important. Order should be based on selectivity and also the first ordered column should be the most used column in queries.
An index (a, b, c) is used by queries involving a, b, c both a and b or a but not in any other combinations
Example:An index (major, minor) is suitable for the following querySELECT name FROM test WHERE major = constantAND minor = constant;
38
Frequent Itemset Mining TechniqueFrequent Itemset Mining Technique
Input is a workload of queries Extract indexable attributes. Create a query attribute matrix.
Attributes
Queries A B C D E
Q1 1 0 1 1 0
Q2 0 1 1 0 1
Q3 1 1 1 0 1
Q4 0 1 0 0 1
Q5 1 1 1 0 1
Q6 0 1 1 0 1
Table 1
Q1: SELECT * FROM T1, T2 WHERE A BETWEEN 1 AND 10 AND C = DQ2: SELECT * FROM T1, T2 WHERE B LIKE ‘%this%’ AND C=5 AND E<100Q3: SELECT * FROM T1, T2 WHERE A=30 AND B>3 GROUP BY C HAVING SUM(E)>2
Q4: SELECT * FROM T1 WHERE B>2 AND E IN (3,2,5)Q5: SELECT * FROM T1, T2 WHERE A=30 AND B>3 GROUP BY C HAVING SUM(E)>2Q6: SELECT * FROM T1, T2 WHERE B>3 GROUP BY C HAVING SUM(E)>2
39
“A closed itemset is a maximal set of items (attributes) that are common to a set of transactions (queries)”
The candidate indexes selected for minimal support greater than or equal to 2/6 in the example are as follows:[1]
{(AC,3/6), (BE, 5/6), (ABCE, 2/6), (BCE, 4/6)}
Attributes
Queries A B C D E
Q1 1 0 1 1 0
Q2 0 1 1 0 1
Q3 1 1 1 0 1
Q4 0 1 0 0 1
Q5 1 1 1 0 1
Q6 0 1 1 0 1
Table 2
Frequent Itemset Mining TechniqueFrequent Itemset Mining Technique
40
SQL-Server Index Selection ToolSQL-Server Index Selection Tool
Workload
Candidate indexselection
ConfigurationEnimeration
Multi-column IndexGeneration
What-if IndexCreation
CostEvaluation
SQL Server
Final Indexes
Figure 1
41
SQL-Server Index Selection ToolSQL-Server Index Selection Tool
Candidate index selection:
For a given workload W that consists of n queries, n workloads Wi..Wn, each consisting of one query are generated.
For each workload Wi, the set of indexable columns of the query Ii is the starting candidate indexes.
Let Ci be the configuration picked by the index selection tool for Wi
The candidate index set for W is the union of all Ci
42
SQL-Server Index Selection ToolSQL-Server Index Selection Tool
Configuration enumeration:Problem Pick k indexes from a set of n candidate indexes.
Algorithm
1 Let S=the best m index configuration using the naïve enumeration algorithm. If m=k then exit
2 Pick a new index I such that Cost (S U I) < Cost (S U I’) for any choice of I’ != I
3 If Cost (S U I )>= Cost (S) then exit
else S=S U I
4 If |S| = k then exit
5 Go to 2
43
SQL-Server Index Selection ToolSQL-Server Index Selection Tool
Cost Evaluation:
Reduce the number of invocations of the optimizer by deriving costs from already evaluated costs.
A cost of a non atomic configuration can be derived from atomic configuration
“A configuration C is atomic for a workload if for some query in the workload there is a possible execution of a query in the workload by the query engine that uses all indexes in C”.
Not every atomic configuration needs to be evaluated for every single query in the workload
44
SQL-Server Index Selection ToolSQL-Server Index Selection Tool
Multicolumn index generation:
For given K columns K! multicolumn indexes are possible
Iterative approach– First iteration single column indexes are considered– Only the selected single column indexes are input to the two
column indexes in the first iteration– This set of two-column indexes along with single column ones
are input to the third iteration and so on.
45
Clustered IndexClustered Index
Clustered indexes:
“A page allocated to an index is called a data page.”
For tables having clustered index the data rows of each data page are stored in order and the data pages are linked together by doubly-linked list.
For table having no clustered index the data rows are not stored in any particular order. Figure 2
46
Clustered IndexClustered Index
Accessing data with a clustered index:
Figure 3
47
Criteria to choose clustered indexCriteria to choose clustered index
Queries that return large result sets
Columns with a number of duplicate values that are searched frequently
Columns other than primary key that are frequently used in join clauses
Columns searched within a range of values
Columns used in ORDER BY or GROUP BY queries
48
MACQueen’s k – Means Clustering algorithm
1) First k data units are chosen as clusters of one member each.
2) Remaining m – k data units are assigned to the clusters whose centroid is nearest to the data unit under consideration. Centroid is recomputed after every gain in the cluster
3) Iterate through the data set assigning each data unit to its nearest cluster taking the existing cluster centroids as fixed seed points until a certain criteria is reached
49
MACQueen’s k – Means Clustering algorithm
Hamming distance is the number of positions in which two binary words differ
For k = 3; Q1, Q2, Q3 are initial clusters
Hamming distance of Q1 and Q4 is 5Q2 and Q4 is 1Q3 and Q4 is 2
Clusters at step 2 are cluster1: [Q1]cluster2: [Q2, Q4, Q6] cluster3: [Q3,Q5]
Attributes
Queries A B C D E
Q1 1 0 1 1 0
Q2 0 1 1 0 1
Q3 1 1 1 0 1
Q4 0 1 0 0 1
Q5 1 1 1 0 1
Q6 0 1 1 0 1
Table 3
50
KEROUAC: Knowledge Explicit, Rapid, OFF beat and User-centered Algorithm for Clustering
New Condorcet Criterion (NCC) :
NCC(Pz) = Sim(Ei,Ej) + α x Dissim(Ei)
NCC represents the degree of dissimilarity of objects belonging
to the same cluster and the degree of similarity between clusters.
It is desired to be minimum.
α is called the granularity factor
KEROUAC Clustering algorithm
51
KEROUAC Clustering algorithmAttributes
Queries A B C D E
Q1 1 1 0 0 0
Q2 1 0 1 0 0
Q3 0 1 1 0 0
Q4 1 1 1 0 0
Q5 0 0 1 1 1
Q6 0 0 0 1 1
Table 4
1, 2, 3, 4, 5, 6
1, 3, 4 2, 5, 6
According to B
1, 4 3
According to A
2 5, 6
According to D and E
Neighbor Neighbor
• Calculate NCC for all the neighbors and pick the neighbor with least NCC
• Repeat until a reduction in NCC is not possible
Figure 4
52
Proposed index selection algorithmProposed index selection algorithm
Step 1: Input workload
Step 2: Parse the workload to find indexable attributes = > Query-attribute matrix
Step 3: Identify candidate indexable attributes based on frequency of attributes in the workload and threshold1 and threshold2
Step 4: Compute the total weight of each attribute using the clause weight assignment policy and order the attributes in each table in decreasing order of weights in order to identify non-clustered indexes in Step 6 => Ordered query-attribute matrix.
Step 5: Perform a data mining clustering technique on the ordered query-attribute matrix to group similar queries based on their use of attributes.
53
Proposed Index Selection Algorithm Proposed Index Selection Algorithm (Cont.)(Cont.)
Step 6: Extract index sets for each table from the clustering results in Step5.
Step 7: Select a single attribute clustered index for each table by computing the sum of range query’s weight and selectivity ranking for each attribute and selecting the attribute with the highest sum.
Step 8: Provide the query optimizer with the virtual set of indexes chosen in
steps 5 and 7 (including both non-clustered and clustered indexes).
Step 9: Let the optimizer display the estimated execution plan to execute the workload.
Step 10: Select the indexes used by the optimizer as the final suggested index.