Auto administration of databases based on clustering - PowerPoint PPT Presentation

1 / 53
About This Presentation
Title:

Auto administration of databases based on clustering

Description:

Motivation of Auto-indexing. Index selection is an important part of ... auto-indexing ... optimizer Index suggested may not be used by the optimizer ... – PowerPoint PPT presentation

Number of Views:16
Avg rating:3.0/5.0
Slides: 54
Provided by: jyotsnasu
Learn more at: http://www.cs.ou.edu
Category:

less

Transcript and Presenter's Notes

Title: Auto administration of databases based on clustering


1
Auto administration of databases based on
clustering
  • Mujiba Zaman
  • Jyotsna Surabattula
  • Le Gruenwald
  • School of Computer Science
  • The University of Oklahoma
  • Norman, Oklahoma, 73019, USA
  • mujiba, jyotsna, ggruenwald_at_ou.edu

2
Overview of the Presentation
  • Motivation for auto-indexing
  • Existing techniques
  • Proposed Index Selection Technique
  • Algorithm for Proposed Index Selection Technique
  • Proposed Re-Indexing Technique
  • Experiments
  • Results
  • Conclusions
  • Future Work

3
Motivation of Auto-indexing
  • Index selection is an important part of physical
    database design
  • For large databases it is difficult for DBAs to
    analyze data and find an optimal set of indices
  • The goal of auto-indexing includes
  • Analyze workload automatically to identify a good
    set of indexes.
  • Create those indexes.
  • Automatically evaluate current indexes and
    re-index if necessary

4
Existing Techniques
  • The index selection problem (ISP) has been
    approached in two different
  • ways to build index selection tools
  • External Tools
  • Some basic assumptions are made
  • Cost functions are formulated based on the
    assumptions
  • Attempt to minimize the cost function using
    combinatorial optimization technique or by some
    heuristic method
  • Optimizer Based Tools
  • Utilize the query optimizer to give cost
    estimates for various index configurations and
    suggest a configuration with the least cost
    estimation.

5
Advantages and Disadvantages of existing
techniques
  • External Tools
  • Advantages
  • Applications using the DBMS will not suffer
    processing delays
  • Disadvantages
  • Disconnected from the optimizer Index
    suggested may not be used by the optimizer
  • Becomes obsolete if optimizer changes

6
Advantages and Disadvantages of existing
techniques
  • Optimizer Based Tools
  • Advantages
  • Suggested indexes will be used by the
    optimizer
  • Disadvantages
  • Performs expensive operation of optimizer
    invocation
  • Longer processing time for other applications
    using the DBMS when indexes are being
    suggested.

7
Proposed Index Selection Technique
  • Combines both the approaches.
  • Based on the intuition that the attributes
    that occur more commonly and frequently in a
    group of similar queries are likely to be useful
    for indexing.
  • Use a Data Mining Clustering technique to
    group queries which are similar in terms of their
    use of attributes.
  • Extract Indexable attributes in all the
    queries in each group as indexes.
  • These indexes can be single-column or
    multi-column

8
Proposed Index Selection Technique
  • For multi-column indexes, the order of the
    columns is determined by assigning weights to
    attributes based on where they are used in the
    queries.
  • A clustered index is also chosen by assigning
    weights to the attributes depending on where they
    occur in the queries.
  • Extracted indexes are then submitted to the query
    optimizer for final selection for the given
    workload.
  • The indexes not selected by the optimizer are
    eliminated. The remaining indexes are the final
    indexes suggested by our tool.
  • Re-index if total table scan time using current
    indexes is higher than that using new indexes.

9
Algorithm for proposed technique
  • 1. Input is a workload of queries
  • 2. Parse the workload to find out all the
    indexable attributes
  • and build two matrix
  • - Query-Attribute matrix (1 presence of an
    attribute 0 otherwise)
  • Attribute-frequency matrix (frequency of the
    attribute indicated by the number)
  • Consider the following queries
  • Q1. Select T1.A, T1.B, Sum(T1.C) from T1, T4,
    where T1.AT4.K and T1.CT3.H and T1.Blt30 group
    by T1.C
  • Q2. Select Ave(G) from T3 where F like this
    and G between 10 and 20 order by G

10
Algorithm for proposed technique
  • A corresponding query attribute matrix with
    several other queries could be as follows
  • Table 1 Query Attribute Matrix

Query T1.A T1.B T1.C T2.D T2.E T3.F T3.G T3.H T4.K
Q1 1 1 1 0 0 0 0 1 1
Q2 0 0 0 0 0 1 1 0 0
Q3 0 0 0 0 1 1 1 0 0
Q4 1 1 1 1 0 0 0 1 1
Q5 0 0 1 0 0 1 1 0 0
11
Algorithm for proposed technique
  • A corresponding attribute frequency matrix could
    be as follows
  • Table 2 Attribute Frequency Matrix

Query T1.A T1.B T1.C T2.D T2.E T3.F T3.G T3.H T4.K
Q1 1 1 3 0 0 0 0 1 1
Q2 0 0 0 0 0 1 3 0 0
Q3 0 0 0 0 2 3 1 0 0
Q4 2 5 1 3 0 0 0 4 4
Q5 0 0 3 0 0 5 2 0 0
Freq 3 6 7 3 2 9 6 5 5
12
Candidate Index Selection
Table size of T1 5000 Table size of T2
2000 Table size of T3 100 Table size of T4
20 T table size/100 Let threshold1 5,
threshold2 30
3. Choose candidate index set Freq gt
threshold1 OR Freq T gt threshold2
T table size/constant
Query T1.A T1.B T1.C T2.D T2.E T3.F T3.G T3.H T4.K
Q1 1 1 3 0 0 0 0 1 1
Q2 0 0 0 0 0 1 3 0 0
Q3 0 0 0 0 2 3 1 0 0
Q4 2 5 1 3 0 0 0 4 4
Q5 0 0 3 0 0 5 2 0 0
Freq 3 6 7 3 2 9 6 5 5
FreqT 150 300 350 60 40 9 6 5 1
Table 3. Attribute-frequency matrix
13
Non-Clustered Index Selection
  • 4 a. Assign weights to the candidate attribute
    set as follows
  • Where Clause 3
  • Group by/Order by 2
  • Aggregate functions 1

Query T1.A T1.B T1.C T2.D T2.E T3.F T3.G
Q1 3 3 321 0 0 0 0
Q2 0 0 0 0 0 3 321
Q3 0 0 0 0 6 5 4
Q4 2 5 1 3 0 0 0
Q5 0 0 3 0 0 5 2
Total Weight 5 8 10 3 6 13 12
Table 4. Candidate Attribute Set
14
Non-Clustered Index Selection
  • 4 b. Order the columns in the candidate attribute
    set in each table in descending order of their
    weights for determining non-clustered indexes.

Query T1.C T1.B T1.A T2.E T2.D T3.F T3.G
Q1 6 3 3 0 0 0 0
Q2 0 0 0 0 0 3 6
Q3 0 0 0 6 0 5 4
Q4 1 5 2 0 3 0 0
Q5 3 0 0 0 0 5 2
Total Weight 10 8 5 6 3 13 12
Table 5. Ordered Candidate Attribute Set
15
Non-Clustered Index Selection
  • 4 c. Build a Query-Attribute Matrix with the
    ordered columns

Query T1.C T1.B T1.A T2.E T2.D T3.F T3.G
Q1 1 1 1 0 0 0 0
Q2 0 0 0 0 0 1 1
Q3 0 0 0 1 0 1 1
Q4 1 1 1 0 1 0 0
Q5 1 0 0 0 0 1 1
Total Weight 10 8 5 6 3 13 12
Table 6. Query Attribute Matrix with ordered
candidate index set
16
Non-Clustered Index Selection
  • 5. Use a data mining clustering technique on the
    Query-Attribute matrix to group similar queries.
    A possible clustering result is
  • Q1 Q4 Q2 Q3 Q5 (Table 6)
  • 6. Extract the index sets for each table from the
    clusters obtained.
  • These are all the common attributes from all
    the queries
  • clustered together.
  • For the cluster Q1, Q4 Indexes are (T1.C,
    T1.B, T1.A) in that order
  • For the cluster Q2 Q3 Q5 Indexes are (T3.F,
    T3.G) in that order

17
Clustered Index Selection
  • 6. Choose single column clustered index as
    follows
  • a. During parsing assign the following weight to
    the attributes
  • Range queries 2 Join 1 Group
    by/Order by 1
  • b. Rank the attributes according to their weight

Query T1.A T1.B T1.C T2.D T2.E T3.F T3.G
Q1 1 2 1 0 0 0 0
Q2 0 0 0 0 0 0 21
Q3 0 0 0 0 1 1 1
Q4 1 3 0 2 0 0 0
Q5 0 0 1 0 0 1 1
Total Weight 2 5 2 2 1 2 5
Rank 1 2 1 2 1 1 2
Table 7. Ranking according to weight for
clustered index
18
Clustered Index Selection
  • c. Determine the selectivity of columns and rank
    them according to the selectivity for each table.

T1.A T1.B T1.C T2.D T2.E T3.F T3.G
Selectivity 1 0.85 0.9 0.88 0.55 0.65 0.82
Rank N/A 1 2 2 1 1 2
Table 8. Ranking according to selectivity for
clustered index
19
Clustered Index Selection
  • d. Find the sum of rank with selectivity and
    rank with weight for each
  • column.The indexable attribute with the
    largest sum is suggested as
  • clustered index for the table.
  • If more than one indexable attributes have
    the highest sum then
  • select the attribute with higher rank in
    weight.

Column name Rank with selectivity Rank with weight Sum
T1.A N/A 1 N/A
T1.B 1 2 3
T1.C 2 1 3
T2.D 2 2 4
T2.E 1 1 2
T3.F 1 1 2
T3.G 2 2 4
Table 9. Sum of Ranks with Weight and
Selectivity for Clustered Index
20
Algorithm for proposed technique
  • 7. Provide both clustered and non-
  • clustered indexes to the query optimizer
  • 8. Let the optimizer display the estimated
  • execution plan to execute the workload
  • in the database.
  • 9. Select the indexes used by the
  • optimizer as the final suggested index.

21
Proposed Re-indexing Technique
  • An auto-indexing tool should be capable of
    re-indexing whenever the current indexes are no
    longer good
  • The DBMS can periodically monitor the cost of
    total table scan for a particular size of
    workload
  • When this value exceeds a limit the DBMS triggers
    the index selection tool to suggest new set of
    indexes.
  • The limit to trigger the tool can be determined
    from the relationship between increase in table
    scan cost and performance gain due to
    re-indexing. This can be set by the DBA.

22
Proposed Re-indexing Technique
  • Produce a Chart showing Re-indexing performance
    gain vs. Table scan cost gt Guildelines for DBA
    by doing the following
  • Assume the current workload is Wc, current index
    set is INc, compute total current table scan cost
    TSc.
  • Obtain a set of different workload samples over a
    long history W2, W3,.. Wn.
  • Compute total table scan cost for each of
    workload sample using the current index set
    TS2c, TS3c,, TSnc.

23
Proposed Re-Indexing Technique (Cont.)
  • Run our index selection tool on W2, W3, , Wn to
    get the corresponding recommended sets of
    indexes IN2r, IN3r,, INnr.
  • Compute total table scan cost for each workload
    using its new recommended set of indexes TS2r,
    TS3r,, TSnr.
  • Compute the percentage of performance
    improvement due to reindexing for each workload
    Wj (TSc TSjr)/TSc) 100 for j 2,.., n.

24
Proposed Re-indexing Technique

25
Proposed Re-indexing Technique
  • If the DBA has the chart in advance, the DBA can
    set up the limit where he/she wants the DBMS to
    trigger index selection tool.
  • The existing index set is then compared with the
    new index set
  • Indexes which are part of new but not part of
    existing set are created, those which are part of
    existing set and not in new set are dropped and
    those which intersect remain
  • The process of dropping and creating indexes in
    the system follows similar methodology as
    Oracles Automated Index-Rebuild System which can
    be done either online or offline.

26
Experiments
  • Performance Metric
  • average query response time
  • time taken to execute the workload in minutes
    divided by the total number of queries in the
    workload
  • All experiments are conducted on the system Intel
    Pentium 4-M, CPU 2.0GHz, 512 MB RAM.
  • Experiments conducted on TPC-R benchmark with its
    22 read-only queries
  • Experiments conducted on Microsoft SQL Server 2000

27
Experiments
  • Clustering algorithms used in experiment
  • 1. MACQueens k means clustering algorithm
  • Used hamming distance as the distance function
  • 2. KEROUAC (knowledge explicit, rapid, off beat
    and user-centered algorithm for clustering)
  • K-means is a well-established data clustering
    algorithm
  • KEROUAC is a clustering algorithm for practical
    advantage it doesnt require the final clusters
    number setting
  • Both algorithms have low computational cost

28
Results With K-means
Chart 1
29
Results with KEROUAC
Chart 2
30
Results with Index Suggestion Time
31
Conclusions
  • Performance of the tool critically depends on the
    choice of ? (for KEROUAC) and k (for k-means) and
    the threshold value.
  • Increasing ? (k) will group queries with higher
    similarity to each other in a cluster.
  • Increasing ? (k) beyond some point has no effect
    (all identical queries are already in the same
    cluster).
  • We can therefore achieve desired results by
    choosing ? (k) workload size.
  • Increasing threshold means more attributes
    eliminated from consideration
  • Experiments show that a threshold value about
    workloadsize/4 works good.
  • Our tool chooses these parameters to operate on
    the best performance range.

32
Conclusions
  • We compared our results with Microsoft SQL
    Servers Index Selection Tool (An optimizer Based
    Tool) and also with Frequent Itemsets Mining (An
    External Tool)
  • Best performance improvement using k Means
    clustering compared with Frequent Itemsets Mining
    is 71.43
  • Best performance improvement using k Means
    clustering compared with Microsoft Index
    Selection Tool is 16.2
  • Best performance improvement using KEROUAC
    clustering compared with Frequent Itemsets Mining
    is 73.26
  • Best performance improvement using KEROUAC
    clustering compared with Microsoft Index
    Selection Tool is 21.5
  • The index suggestion time for Microsoft Index
    Selection Tool was 4 times higher than our tool
    for a workload size of 240

33
Future Work
  • Test the dependence of the technique on different
    clustering algorithms
  • Test with different sizes of workload
  • Test with update queries in the workload
  • Test with index elimination technique

34
  • Thanks!
  • Questions?

35
Indexable Attributes
  • Indexable Attributes/Columns
  • Columns which belong to WHERE, GROUP BY, ORDER BY
    clauses
  • Operators , lt, gt, lt, gt, BETWEEN, IN
  • Example
  • SELECT
  • FROM table1,table2
  • WHERE table1.column1 table2.column1
  • AND (table2.column2 BETWEEN 0 AND 1000)
  • the indexable attributes are
  • table1.column1, table2.column1, table2.column2

36
Selectivity of a column
  • Selectivity
  • Selectivity ratio of a column/index
  • number of unique values in a column/index of
    the table divided by total number of rows in that
    table
  • SELECT COUNT (DISTINCT (column name))
  • FROM table name
  • If a column/index has high selectivity then it is
    more useful to the optimizer and has more chances
    to be picked up by the optimizer while executing
    a query.

37
Multicolumn Indexes
  • Multicolumn indexes
  • Column involved in multicolumn index should be
    joined with AND clause and not with OR clause
  • Order of the columns in a multicolumn index is
    important. Order should be based on selectivity
    and also the first ordered column should be the
    most used column in queries.
  • An index (a, b, c) is used by queries involving
    a, b, c both a and b or a but not in any other
    combinations
  • Example
  • An index (major, minor) is suitable for the
    following query
  • SELECT name
  • FROM test
  • WHERE major constant
  • AND minor constant

38
Frequent Itemset Mining Technique
  • Input is a workload of queries
  • Extract indexable attributes.
  • Create a query attribute matrix.

Attributes Attributes Attributes Attributes Attributes Attributes
Queries A B C D E
Q1 1 0 1 1 0
Q2 0 1 1 0 1
Q3 1 1 1 0 1
Q4 0 1 0 0 1
Q5 1 1 1 0 1
Q6 0 1 1 0 1
Q1 SELECT FROM T1, T2 WHERE A BETWEEN 1 AND
10 AND C D Q2 SELECT FROM T1, T2 WHERE B
LIKE this AND C5 AND Elt100 Q3
SELECT FROM T1, T2 WHERE A30 AND Bgt3
GROUP BY C HAVING SUM(E)gt2
Table 1
Q4 SELECT FROM T1 WHERE Bgt2 AND E IN
(3,2,5) Q5 SELECT FROM T1, T2 WHERE A30 AND
Bgt3 GROUP BY C HAVING SUM(E)gt2 Q6 SELECT
FROM T1, T2 WHERE Bgt3 GROUP BY C HAVING
SUM(E)gt2
39
Frequent Itemset Mining Technique
  • A closed itemset is a maximal set of items
    (attributes) that are common to a set of
    transactions (queries)
  • The candidate indexes selected for minimal
    support greater than or equal to 2/6 in the
    example are as follows1
  • (AC,3/6), (BE, 5/6), (ABCE, 2/6), (BCE, 4/6)

Attributes Attributes Attributes Attributes Attributes Attributes
Queries A B C D E
Q1 1 0 1 1 0
Q2 0 1 1 0 1
Q3 1 1 1 0 1
Q4 0 1 0 0 1
Q5 1 1 1 0 1
Q6 0 1 1 0 1
Table 2
40
SQL-Server Index Selection Tool
Figure 1
41
SQL-Server Index Selection Tool
  • Candidate index selection
  • For a given workload W that consists of n
    queries, n workloads Wi..Wn, each consisting of
    one query are generated.
  • For each workload Wi, the set of indexable
    columns of the query Ii is the starting candidate
    indexes.
  • Let Ci be the configuration picked by the index
    selection tool for Wi
  • The candidate index set for W is the union of all
    Ci

42
SQL-Server Index Selection Tool
  • Configuration enumeration
  • Problem
  • Pick k indexes from a set of n candidate indexes.
  • Algorithm
  • 1 Let Sthe best m index configuration using the
    naïve enumeration algorithm. If mk then exit
  • 2 Pick a new index I such that Cost (S U I) lt
    Cost (S U I) for any choice of I ! I
  • 3 If Cost (S U I )gt Cost (S) then exit
  • else SS U I
  • 4 If S k then exit
  • 5 Go to 2

43
SQL-Server Index Selection Tool
  • Cost Evaluation
  • Reduce the number of invocations of the optimizer
    by deriving costs from already evaluated costs.
  • A cost of a non atomic configuration can be
    derived from atomic configuration
  • A configuration C is atomic for a workload if
    for some query in the workload there is a
    possible execution of a query in the workload by
    the query engine that uses all indexes in C.
  • Not every atomic configuration needs to be
    evaluated for every single query in the workload

44
SQL-Server Index Selection Tool
  • Multicolumn index generation
  • For given K columns K! multicolumn indexes are
    possible
  • Iterative approach
  • First iteration single column indexes are
    considered
  • Only the selected single column indexes are input
    to the two column indexes in the first iteration
  • This set of two-column indexes along with single
    column ones are input to the third iteration and
    so on.

45
Clustered Index
  • Clustered indexes
  • A page allocated to an index is called a data
    page.
  • For tables having clustered index the data rows
    of each data page are stored in order and the
    data pages are linked together by doubly-linked
    list.
  • For table having no clustered index the data rows
    are not stored in any particular order.

Figure 2
46
Clustered Index
Accessing data with a clustered index
Figure 3
47
Criteria to choose clustered index
  • Queries that return large result sets
  • Columns with a number of duplicate values that
    are searched frequently
  • Columns other than primary key that are
    frequently used in join clauses
  • Columns searched within a range of values
  • Columns used in ORDER BY or GROUP BY queries

48
MACQueens k Means Clustering algorithm
  • 1) First k data units are chosen as clusters of
    one member each.
  • 2) Remaining m k data units are assigned to
    the clusters whose centroid is nearest to the
    data unit under consideration. Centroid is
    recomputed after every gain in the cluster
  • 3) Iterate through the data set assigning each
    data unit to its nearest cluster taking the
    existing cluster centroids as fixed seed points
    until a certain criteria is reached

49
MACQueens k Means Clustering algorithm
  • Hamming distance is the number of positions in
    which two binary words differ
  • For k 3 Q1, Q2, Q3 are initial clusters
  • Hamming distance of
  • Q1 and Q4 is 5
  • Q2 and Q4 is 1
  • Q3 and Q4 is 2
  • Clusters at step 2 are
  • cluster1 Q1
  • cluster2 Q2, Q4, Q6
  • cluster3 Q3,Q5

Attributes Attributes Attributes Attributes Attributes Attributes
Queries A B C D E
Q1 1 0 1 1 0
Q2 0 1 1 0 1
Q3 1 1 1 0 1
Q4 0 1 0 0 1
Q5 1 1 1 0 1
Q6 0 1 1 0 1
Table 3
50
KEROUAC Clustering algorithm
KEROUAC Knowledge Explicit, Rapid, OFF beat and
User-centered Algorithm for Clustering New
Condorcet Criterion (NCC) NCC(Pz) ?
Sim(Ei,Ej) a x ? Dissim(Ei) NCC represents
the degree of dissimilarity of objects
belonging to the same cluster and the degree of
similarity between clusters. It is desired
to be minimum. a is called the granularity factor
51
KEROUAC Clustering algorithm
Attributes Attributes Attributes Attributes Attributes Attributes
Queries A B C D E
Q1 1 1 0 0 0
Q2 1 0 1 0 0
Q3 0 1 1 0 0
Q4 1 1 1 0 0
Q5 0 0 1 1 1
Q6 0 0 0 1 1
Table 4
  • Calculate NCC for all the neighbors and pick the
    neighbor with least NCC
  • Repeat until a reduction in NCC is not possible

Figure 4
52
Proposed index selection algorithm
  • Step 1 Input workload
  • Step 2 Parse the workload to find indexable
    attributes gt Query-attribute matrix
  • Step 3 Identify candidate indexable attributes
    based on frequency of attributes in the workload
    and threshold1 and threshold2
  • Step 4 Compute the total weight of each
    attribute using the clause weight assignment
    policy and order the attributes in each table in
    decreasing order of weights in order to identify
    non-clustered indexes in Step 6 gt Ordered
    query-attribute matrix.
  • Step 5 Perform a data mining clustering
    technique on the ordered query-attribute matrix
    to group similar queries based on their use of
    attributes.

53
Proposed Index Selection Algorithm (Cont.)
  • Step 6 Extract index sets for each table from
    the clustering results in Step5.
  • Step 7 Select a single attribute clustered index
    for each table by computing the sum of range
    querys weight and selectivity ranking for each
    attribute and selecting the attribute with the
    highest sum.
  • Step 8 Provide the query optimizer with the
    virtual set of indexes chosen in
  • steps 5 and 7 (including both non-clustered and
    clustered indexes).
  • Step 9 Let the optimizer display the estimated
    execution plan to execute the workload.
  • Step 10 Select the indexes used by the optimizer
    as the final suggested index.
Write a Comment
User Comments (0)
About PowerShow.com