Auto administration of databases based on clustering

About This Presentation

Title:

Auto administration of databases based on clustering

Description:

Motivation of Auto-indexing. Index selection is an important part of ... auto-indexing ... optimizer Index suggested may not be used by the optimizer ... – PowerPoint PPT presentation

Number of Views:16

Avg rating:3.0/5.0

Slides: 54

Provided by: jyotsnasu

Learn more at: http://www.cs.ou.edu

Category:

more less

Transcript and Presenter's Notes

Title: Auto administration of databases based on clustering

1
Auto administration of databases based on
clustering

Mujiba Zaman
Jyotsna Surabattula
Le Gruenwald
School of Computer Science
The University of Oklahoma
Norman, Oklahoma, 73019, USA
mujiba, jyotsna, ggruenwald_at_ou.edu

2
Overview of the Presentation

Motivation for auto-indexing
Existing techniques
Proposed Index Selection Technique
Algorithm for Proposed Index Selection Technique
Proposed Re-Indexing Technique
Experiments
Results
Conclusions
Future Work

3
Motivation of Auto-indexing

Index selection is an important part of physical
database design
For large databases it is difficult for DBAs to
analyze data and find an optimal set of indices
The goal of auto-indexing includes
Analyze workload automatically to identify a good
set of indexes.
Create those indexes.
Automatically evaluate current indexes and
re-index if necessary

4
Existing Techniques

The index selection problem (ISP) has been
approached in two different
ways to build index selection tools
External Tools
Some basic assumptions are made
Cost functions are formulated based on the
assumptions
Attempt to minimize the cost function using
combinatorial optimization technique or by some
heuristic method
Optimizer Based Tools
Utilize the query optimizer to give cost
estimates for various index configurations and
suggest a configuration with the least cost
estimation.

5
Advantages and Disadvantages of existing
techniques

External Tools
Advantages
Applications using the DBMS will not suffer
processing delays
Disadvantages
Disconnected from the optimizer Index
suggested may not be used by the optimizer
Becomes obsolete if optimizer changes

6
Advantages and Disadvantages of existing
techniques

Optimizer Based Tools
Advantages
Suggested indexes will be used by the
optimizer
Disadvantages
Performs expensive operation of optimizer
invocation
Longer processing time for other applications
using the DBMS when indexes are being
suggested.

7
Proposed Index Selection Technique

Combines both the approaches.
Based on the intuition that the attributes
that occur more commonly and frequently in a
group of similar queries are likely to be useful
for indexing.
Use a Data Mining Clustering technique to
group queries which are similar in terms of their
use of attributes.
Extract Indexable attributes in all the
queries in each group as indexes.
These indexes can be single-column or
multi-column

8
Proposed Index Selection Technique

For multi-column indexes, the order of the
columns is determined by assigning weights to
attributes based on where they are used in the
queries.
A clustered index is also chosen by assigning
weights to the attributes depending on where they
occur in the queries.
Extracted indexes are then submitted to the query
optimizer for final selection for the given
workload.
The indexes not selected by the optimizer are
eliminated. The remaining indexes are the final
indexes suggested by our tool.
Re-index if total table scan time using current
indexes is higher than that using new indexes.

9
Algorithm for proposed technique

1. Input is a workload of queries
2. Parse the workload to find out all the
indexable attributes
and build two matrix
- Query-Attribute matrix (1 presence of an
attribute 0 otherwise)
Attribute-frequency matrix (frequency of the
attribute indicated by the number)
Consider the following queries
Q1. Select T1.A, T1.B, Sum(T1.C) from T1, T4,
where T1.AT4.K and T1.CT3.H and T1.Blt30 group
by T1.C
Q2. Select Ave(G) from T3 where F like this
and G between 10 and 20 order by G

10
Algorithm for proposed technique

A corresponding query attribute matrix with
several other queries could be as follows
Table 1 Query Attribute Matrix

Query T1.A T1.B T1.C T2.D T2.E T3.F T3.G T3.H T4.K
Q1 1 1 1 0 0 0 0 1 1
Q2 0 0 0 0 0 1 1 0 0
Q3 0 0 0 0 1 1 1 0 0
Q4 1 1 1 1 0 0 0 1 1
Q5 0 0 1 0 0 1 1 0 0
11
Algorithm for proposed technique

A corresponding attribute frequency matrix could
be as follows
Table 2 Attribute Frequency Matrix

Query T1.A T1.B T1.C T2.D T2.E T3.F T3.G T3.H T4.K
Q1 1 1 3 0 0 0 0 1 1
Q2 0 0 0 0 0 1 3 0 0
Q3 0 0 0 0 2 3 1 0 0
Q4 2 5 1 3 0 0 0 4 4
Q5 0 0 3 0 0 5 2 0 0
Freq 3 6 7 3 2 9 6 5 5
12
Candidate Index Selection
Table size of T1 5000 Table size of T2
2000 Table size of T3 100 Table size of T4
20 T table size/100 Let threshold1 5,
threshold2 30
3. Choose candidate index set Freq gt
threshold1 OR Freq T gt threshold2
T table size/constant
Query T1.A T1.B T1.C T2.D T2.E T3.F T3.G T3.H T4.K
Q1 1 1 3 0 0 0 0 1 1
Q2 0 0 0 0 0 1 3 0 0
Q3 0 0 0 0 2 3 1 0 0
Q4 2 5 1 3 0 0 0 4 4
Q5 0 0 3 0 0 5 2 0 0
Freq 3 6 7 3 2 9 6 5 5
FreqT 150 300 350 60 40 9 6 5 1
Table 3. Attribute-frequency matrix
13
Non-Clustered Index Selection

4 a. Assign weights to the candidate attribute
set as follows
Where Clause 3
Group by/Order by 2
Aggregate functions 1

Query T1.A T1.B T1.C T2.D T2.E T3.F T3.G
Q1 3 3 321 0 0 0 0
Q2 0 0 0 0 0 3 321
Q3 0 0 0 0 6 5 4
Q4 2 5 1 3 0 0 0
Q5 0 0 3 0 0 5 2
Total Weight 5 8 10 3 6 13 12
Table 4. Candidate Attribute Set
14
Non-Clustered Index Selection

4 b. Order the columns in the candidate attribute
set in each table in descending order of their
weights for determining non-clustered indexes.

Query T1.C T1.B T1.A T2.E T2.D T3.F T3.G
Q1 6 3 3 0 0 0 0
Q2 0 0 0 0 0 3 6
Q3 0 0 0 6 0 5 4
Q4 1 5 2 0 3 0 0
Q5 3 0 0 0 0 5 2
Total Weight 10 8 5 6 3 13 12
Table 5. Ordered Candidate Attribute Set
15
Non-Clustered Index Selection

4 c. Build a Query-Attribute Matrix with the
ordered columns

Query T1.C T1.B T1.A T2.E T2.D T3.F T3.G
Q1 1 1 1 0 0 0 0
Q2 0 0 0 0 0 1 1
Q3 0 0 0 1 0 1 1
Q4 1 1 1 0 1 0 0
Q5 1 0 0 0 0 1 1
Total Weight 10 8 5 6 3 13 12
Table 6. Query Attribute Matrix with ordered
candidate index set
16
Non-Clustered Index Selection

5. Use a data mining clustering technique on the
Query-Attribute matrix to group similar queries.
A possible clustering result is
Q1 Q4 Q2 Q3 Q5 (Table 6)
6. Extract the index sets for each table from the
clusters obtained.
These are all the common attributes from all
the queries
clustered together.
For the cluster Q1, Q4 Indexes are (T1.C,
T1.B, T1.A) in that order
For the cluster Q2 Q3 Q5 Indexes are (T3.F,
T3.G) in that order

17
Clustered Index Selection

6. Choose single column clustered index as
follows
a. During parsing assign the following weight to
the attributes
Range queries 2 Join 1 Group
by/Order by 1
b. Rank the attributes according to their weight

Query T1.A T1.B T1.C T2.D T2.E T3.F T3.G
Q1 1 2 1 0 0 0 0
Q2 0 0 0 0 0 0 21
Q3 0 0 0 0 1 1 1
Q4 1 3 0 2 0 0 0
Q5 0 0 1 0 0 1 1
Total Weight 2 5 2 2 1 2 5
Rank 1 2 1 2 1 1 2
Table 7. Ranking according to weight for
clustered index
18
Clustered Index Selection

c. Determine the selectivity of columns and rank
them according to the selectivity for each table.

T1.A T1.B T1.C T2.D T2.E T3.F T3.G
Selectivity 1 0.85 0.9 0.88 0.55 0.65 0.82
Rank N/A 1 2 2 1 1 2
Table 8. Ranking according to selectivity for
clustered index
19
Clustered Index Selection

d. Find the sum of rank with selectivity and
rank with weight for each
column.The indexable attribute with the
largest sum is suggested as
clustered index for the table.
If more than one indexable attributes have
the highest sum then
select the attribute with higher rank in
weight.

Column name Rank with selectivity Rank with weight Sum
T1.A N/A 1 N/A
T1.B 1 2 3
T1.C 2 1 3
T2.D 2 2 4
T2.E 1 1 2
T3.F 1 1 2
T3.G 2 2 4
Table 9. Sum of Ranks with Weight and
Selectivity for Clustered Index
20
Algorithm for proposed technique

7. Provide both clustered and non-
clustered indexes to the query optimizer
8. Let the optimizer display the estimated
execution plan to execute the workload
in the database.
9. Select the indexes used by the
optimizer as the final suggested index.

21
Proposed Re-indexing Technique

An auto-indexing tool should be capable of
re-indexing whenever the current indexes are no
longer good
The DBMS can periodically monitor the cost of
total table scan for a particular size of
workload
When this value exceeds a limit the DBMS triggers
the index selection tool to suggest new set of
indexes.
The limit to trigger the tool can be determined
from the relationship between increase in table
scan cost and performance gain due to
re-indexing. This can be set by the DBA.

22
Proposed Re-indexing Technique

Produce a Chart showing Re-indexing performance
gain vs. Table scan cost gt Guildelines for DBA
by doing the following
Assume the current workload is Wc, current index
set is INc, compute total current table scan cost
TSc.
Obtain a set of different workload samples over a
long history W2, W3,.. Wn.
Compute total table scan cost for each of
workload sample using the current index set
TS2c, TS3c,, TSnc.

23
Proposed Re-Indexing Technique (Cont.)

Run our index selection tool on W2, W3, , Wn to
get the corresponding recommended sets of
indexes IN2r, IN3r,, INnr.
Compute total table scan cost for each workload
using its new recommended set of indexes TS2r,
TS3r,, TSnr.
Compute the percentage of performance
improvement due to reindexing for each workload
Wj (TSc TSjr)/TSc) 100 for j 2,.., n.

24
Proposed Re-indexing Technique

25
Proposed Re-indexing Technique

If the DBA has the chart in advance, the DBA can
set up the limit where he/she wants the DBMS to
trigger index selection tool.
The existing index set is then compared with the
new index set
Indexes which are part of new but not part of
existing set are created, those which are part of
existing set and not in new set are dropped and
those which intersect remain
The process of dropping and creating indexes in
the system follows similar methodology as
Oracles Automated Index-Rebuild System which can
be done either online or offline.

26
Experiments

Performance Metric
average query response time
time taken to execute the workload in minutes
divided by the total number of queries in the
workload
All experiments are conducted on the system Intel
Pentium 4-M, CPU 2.0GHz, 512 MB RAM.
Experiments conducted on TPC-R benchmark with its
22 read-only queries
Experiments conducted on Microsoft SQL Server 2000

27
Experiments

Clustering algorithms used in experiment
1. MACQueens k means clustering algorithm
Used hamming distance as the distance function
2. KEROUAC (knowledge explicit, rapid, off beat
and user-centered algorithm for clustering)
K-means is a well-established data clustering
algorithm
KEROUAC is a clustering algorithm for practical
advantage it doesnt require the final clusters
number setting
Both algorithms have low computational cost

28
Results With K-means
Chart 1
29
Results with KEROUAC
Chart 2
30
Results with Index Suggestion Time
31
Conclusions

Performance of the tool critically depends on the
choice of ? (for KEROUAC) and k (for k-means) and
the threshold value.
Increasing ? (k) will group queries with higher
similarity to each other in a cluster.
Increasing ? (k) beyond some point has no effect
(all identical queries are already in the same
cluster).
We can therefore achieve desired results by
choosing ? (k) workload size.
Increasing threshold means more attributes
eliminated from consideration
Experiments show that a threshold value about
workloadsize/4 works good.
Our tool chooses these parameters to operate on
the best performance range.

32
Conclusions

We compared our results with Microsoft SQL
Servers Index Selection Tool (An optimizer Based
Tool) and also with Frequent Itemsets Mining (An
External Tool)
Best performance improvement using k Means
clustering compared with Frequent Itemsets Mining
is 71.43
Best performance improvement using k Means
clustering compared with Microsoft Index
Selection Tool is 16.2
Best performance improvement using KEROUAC
clustering compared with Frequent Itemsets Mining
is 73.26
Best performance improvement using KEROUAC
clustering compared with Microsoft Index
Selection Tool is 21.5
The index suggestion time for Microsoft Index
Selection Tool was 4 times higher than our tool
for a workload size of 240

33
Future Work

Test the dependence of the technique on different
clustering algorithms
Test with different sizes of workload
Test with update queries in the workload
Test with index elimination technique

Thanks!
Questions?

35
Indexable Attributes

Indexable Attributes/Columns
Columns which belong to WHERE, GROUP BY, ORDER BY
clauses
Operators , lt, gt, lt, gt, BETWEEN, IN
Example
SELECT
FROM table1,table2
WHERE table1.column1 table2.column1
AND (table2.column2 BETWEEN 0 AND 1000)
the indexable attributes are
table1.column1, table2.column1, table2.column2

36
Selectivity of a column

Selectivity
Selectivity ratio of a column/index
number of unique values in a column/index of
the table divided by total number of rows in that
table
SELECT COUNT (DISTINCT (column name))
FROM table name
If a column/index has high selectivity then it is
more useful to the optimizer and has more chances
to be picked up by the optimizer while executing
a query.

37
Multicolumn Indexes

Multicolumn indexes
Column involved in multicolumn index should be
joined with AND clause and not with OR clause
Order of the columns in a multicolumn index is
important. Order should be based on selectivity
and also the first ordered column should be the
most used column in queries.
An index (a, b, c) is used by queries involving
a, b, c both a and b or a but not in any other
combinations
Example
An index (major, minor) is suitable for the
following query
SELECT name
FROM test
WHERE major constant
AND minor constant

38
Frequent Itemset Mining Technique

Input is a workload of queries
Extract indexable attributes.
Create a query attribute matrix.

Attributes Attributes Attributes Attributes Attributes Attributes
Queries A B C D E
Q1 1 0 1 1 0
Q2 0 1 1 0 1
Q3 1 1 1 0 1
Q4 0 1 0 0 1
Q5 1 1 1 0 1
Q6 0 1 1 0 1
Q1 SELECT FROM T1, T2 WHERE A BETWEEN 1 AND
10 AND C D Q2 SELECT FROM T1, T2 WHERE B
LIKE this AND C5 AND Elt100 Q3
SELECT FROM T1, T2 WHERE A30 AND Bgt3
GROUP BY C HAVING SUM(E)gt2
Table 1
Q4 SELECT FROM T1 WHERE Bgt2 AND E IN
(3,2,5) Q5 SELECT FROM T1, T2 WHERE A30 AND
Bgt3 GROUP BY C HAVING SUM(E)gt2 Q6 SELECT
FROM T1, T2 WHERE Bgt3 GROUP BY C HAVING
SUM(E)gt2
39
Frequent Itemset Mining Technique

A closed itemset is a maximal set of items
(attributes) that are common to a set of
transactions (queries)
The candidate indexes selected for minimal
support greater than or equal to 2/6 in the
example are as follows1
(AC,3/6), (BE, 5/6), (ABCE, 2/6), (BCE, 4/6)

Attributes Attributes Attributes Attributes Attributes Attributes
Queries A B C D E
Q1 1 0 1 1 0
Q2 0 1 1 0 1
Q3 1 1 1 0 1
Q4 0 1 0 0 1
Q5 1 1 1 0 1
Q6 0 1 1 0 1
Table 2
40
SQL-Server Index Selection Tool
Figure 1
41
SQL-Server Index Selection Tool

Candidate index selection
For a given workload W that consists of n
queries, n workloads Wi..Wn, each consisting of
one query are generated.
For each workload Wi, the set of indexable
columns of the query Ii is the starting candidate
indexes.
Let Ci be the configuration picked by the index
selection tool for Wi
The candidate index set for W is the union of all
Ci

42
SQL-Server Index Selection Tool

Configuration enumeration
Problem
Pick k indexes from a set of n candidate indexes.
Algorithm
1 Let Sthe best m index configuration using the
naïve enumeration algorithm. If mk then exit
2 Pick a new index I such that Cost (S U I) lt
Cost (S U I) for any choice of I ! I
3 If Cost (S U I )gt Cost (S) then exit
else SS U I
4 If S k then exit
5 Go to 2

43
SQL-Server Index Selection Tool

Cost Evaluation
Reduce the number of invocations of the optimizer
by deriving costs from already evaluated costs.
A cost of a non atomic configuration can be
derived from atomic configuration
A configuration C is atomic for a workload if
for some query in the workload there is a
possible execution of a query in the workload by
the query engine that uses all indexes in C.
Not every atomic configuration needs to be
evaluated for every single query in the workload

44
SQL-Server Index Selection Tool

Multicolumn index generation
For given K columns K! multicolumn indexes are
possible
Iterative approach
First iteration single column indexes are
considered
Only the selected single column indexes are input
to the two column indexes in the first iteration
This set of two-column indexes along with single
column ones are input to the third iteration and
so on.

45
Clustered Index

Clustered indexes
A page allocated to an index is called a data
page.
For tables having clustered index the data rows
of each data page are stored in order and the
data pages are linked together by doubly-linked
list.
For table having no clustered index the data rows
are not stored in any particular order.

Figure 2
46
Clustered Index
Accessing data with a clustered index
Figure 3
47
Criteria to choose clustered index

Queries that return large result sets
Columns with a number of duplicate values that
are searched frequently
Columns other than primary key that are
frequently used in join clauses
Columns searched within a range of values
Columns used in ORDER BY or GROUP BY queries

48
MACQueens k Means Clustering algorithm

1) First k data units are chosen as clusters of
one member each.
2) Remaining m k data units are assigned to
the clusters whose centroid is nearest to the
data unit under consideration. Centroid is
recomputed after every gain in the cluster
3) Iterate through the data set assigning each
data unit to its nearest cluster taking the
existing cluster centroids as fixed seed points
until a certain criteria is reached

49
MACQueens k Means Clustering algorithm

Hamming distance is the number of positions in
which two binary words differ
For k 3 Q1, Q2, Q3 are initial clusters
Hamming distance of
Q1 and Q4 is 5
Q2 and Q4 is 1
Q3 and Q4 is 2
Clusters at step 2 are
cluster1 Q1
cluster2 Q2, Q4, Q6
cluster3 Q3,Q5

Attributes Attributes Attributes Attributes Attributes Attributes
Queries A B C D E
Q1 1 0 1 1 0
Q2 0 1 1 0 1
Q3 1 1 1 0 1
Q4 0 1 0 0 1
Q5 1 1 1 0 1
Q6 0 1 1 0 1
Table 3
50
KEROUAC Clustering algorithm
KEROUAC Knowledge Explicit, Rapid, OFF beat and
User-centered Algorithm for Clustering New
Condorcet Criterion (NCC) NCC(Pz) ?
Sim(Ei,Ej) a x ? Dissim(Ei) NCC represents
the degree of dissimilarity of objects
belonging to the same cluster and the degree of
similarity between clusters. It is desired
to be minimum. a is called the granularity factor
51
KEROUAC Clustering algorithm
Attributes Attributes Attributes Attributes Attributes Attributes
Queries A B C D E
Q1 1 1 0 0 0
Q2 1 0 1 0 0
Q3 0 1 1 0 0
Q4 1 1 1 0 0
Q5 0 0 1 1 1
Q6 0 0 0 1 1
Table 4

Calculate NCC for all the neighbors and pick the
neighbor with least NCC
Repeat until a reduction in NCC is not possible

Figure 4
52
Proposed index selection algorithm

Step 1 Input workload
Step 2 Parse the workload to find indexable
attributes gt Query-attribute matrix
Step 3 Identify candidate indexable attributes
based on frequency of attributes in the workload
and threshold1 and threshold2
Step 4 Compute the total weight of each
attribute using the clause weight assignment
policy and order the attributes in each table in
decreasing order of weights in order to identify
non-clustered indexes in Step 6 gt Ordered
query-attribute matrix.
Step 5 Perform a data mining clustering
technique on the ordered query-attribute matrix
to group similar queries based on their use of
attributes.

53
Proposed Index Selection Algorithm (Cont.)

Step 6 Extract index sets for each table from
the clustering results in Step5.
Step 7 Select a single attribute clustered index
for each table by computing the sum of range
querys weight and selectivity ranking for each
attribute and selecting the attribute with the
highest sum.
Step 8 Provide the query optimizer with the
virtual set of indexes chosen in
steps 5 and 7 (including both non-clustered and
clustered indexes).
Step 9 Let the optimizer display the estimated
execution plan to execute the workload.
Step 10 Select the indexes used by the optimizer
as the final suggested index.