An Efficient Cost-Driven Selection Tool for Microsoft SQL Server Surajit ChaudhuriVivek Narasayya - PowerPoint PPT Presentation

About This Presentation
Title:

An Efficient Cost-Driven Selection Tool for Microsoft SQL Server Surajit ChaudhuriVivek Narasayya

Description:

An Efficient Cost-Driven Selection Tool for Microsoft SQL Server Surajit Chaudhuri Vivek Narasayya CS632 Course seminar by Iyengar Suresh, CSE MTECH1 – PowerPoint PPT presentation

Number of Views:130
Avg rating:3.0/5.0
Slides: 40
Provided by: acin
Category:

less

Transcript and Presenter's Notes

Title: An Efficient Cost-Driven Selection Tool for Microsoft SQL Server Surajit ChaudhuriVivek Narasayya


1
An Efficient Cost-Driven Selection Tool for
Microsoft SQL ServerSurajit
Chaudhuri Vivek Narasayya
CS632 Course seminar by Iyengar Suresh, CSE
MTECH1
  • Indian Institute of Technology Bombay

2
Introduction
  • Indexes and materialized views are very important
    in physical design of a database.
  • They can significantly improve performance if
    used in the right manner.
  • The tool has been implemented in Microsoft SQL
    Server 7.0 and subsequent versions.
  • The technique developed was a part of the
    AutoAdmin project .

3
AutoAdmin project
  • Make database systems self-tuning and
    self-administering.
  • Enabling databases to track the usage of their
    systems and to gracefully adapt to
    application requirements.
  • Databases actively auto-tunes itself to be
    responsive to application
    needs.
  • Research work Automating index and view
    selections, index merging, 'what-if' indexes,
    automating statistics management for
    Query Optimizers and many more.

4
How do we choose the indexes ?
  • Indexes can be chosen based on
  • Table columns
  • Workload
  • In harmony with the optimizer

5
(No Transcript)
6
Terminologies
Indexable column Columns in a query that are
potentially useful for indexing Admissibl
e index Index that is on one or more (in case
of a multi-column index) indexable
columns of the query. Admissible index
An index that is an admissible index for (for a
workload) one or more queries in the
workload.
7
Cost evaluation
Basic method can is given M configurations and Q
queries in the workload, optimize MQ queries.
Atomic configuration C For some query in the
workload there is a possible execution of the
query by the query engine that uses all indexes
in C. For select/update queries, for a
non-atomic configuration C, there exists an
atomic configuration Ci such that Cost (Q, C)
Min (Cost(Q, Ci)
8
Cost evaluation contd...
For an insert/delete query, for a non-atomic
configuration C The cost can be divided
into a) Cost of selection b) Cost of updating
the table and the indexes used for selection c)
Cost of updating indexes that do not affect the
selection cost.
9
Cost evaluation contd...
Cost evaluation module can identify atomic
configuration heuristically. a) Restriction on
the number of indexes per table b) Restriction
on the number of indexes referenced per
query. This can restrict the search space. The
technique implemented by the authors includes two
indexes per table and also two indexes referenced
per query. This is called single join atomic
configuration.
10
Candidate Index Selection for a given workload
The number of admissible indexes for a given
workload is very large. Determine the best
configuration for each query independently and
then use these configurations for the further
step. Intuition An index which is not a part of
the best configuration for a single query is
unlikely to be a part of best configuration for
the entire workload. This method is called
query-specific-best-configuration candidate index
selection.
11
1. For workload W that consists of n queries,
generate 'n' workloads each consisting of
one query each, where Wi Qi 2. For each
workload Wi, we use the set of indexable columns
Ii of the query in Wi as starting candidate
indexes. 3. Let Ci be the configuration picked
by index selection tool for Wi, i.e., Ci
Enumerate (Ii,Wi). 4. The candidate index set
for W is the union of all Ci's .
Candidate index selection
12
Problem with this approach...
  • Consider a query Q with indexable columns T1.C1
    and T2.C2 such that the best configuration
    for Q is T1.C1 only.
  • Also, there is an insert query in the workload
    on T1.C1 with a high cost.
  • The enumeration phase would not consider T1.C2.
  • So, a general strategy can be consider even the
    second best configuration or the first
    'k' best configurations.

13
  • The candidate index selection does perform well.
    Why ?
  • Indexes that are part of 'next' best
    configuration of a query may appear as the best
    configuration for another query in the workload.
  • Also, most of the indexes in the second best
    configuration do find their way into the best
    configuration.

14
Enumeration
Now we have 'n' candidate indexes and we want to
pick k indexes. A naïve method is to enumerate
all the subsets. However this is not practical
as for realistic values of n and k. (e.g. n40
and k10.)
15
Greedy(m,k) enumeration approach 1. Let S the
best m index configuration using the naive
enumeration algorithm. If m k then exit. 2.
Pick a new index I such that Cost (S U I, W)
lt Cost(S U(I, W) for any choice of I !
I 3. If Cost (S U I) gt Cost(S) then exit
Else S S U (I 4. If size(S) k then exit 5.
Go to 2 A small value of 'm' leads to near
optimal results.
16
Why Greedy algorithm performs well
?? Interaction between indexes. For example,
merge join with two clustered indexes. If one of
the indexed was picked during the 'm' phase of
the above algorithm, its likely that the other
will be picked during the 'k' phase as it will
reduce the overall cost.
17
Multi-column Index Generation
  • Considering all the multi-column indexes is not
    possible.
  • This paper introduces an iterative approach for
    taking into account multi-column indexes of
    increasing width.
  • First step is to find admissible two column
    indexes. This set along with the winning
    one-column indexes becomes the input for the
    second iteration.
  • The strategy followed is for a two-column index
    to be desirable, a single column index on its
    leading column is also desirable.

18
Variants of this strategy
MC-LEAD In (a,b), 'b' need not be an winning
single column index MC-ALL Both a
and b need to be winning single column
indexes.
19
Experimental results
20
(No Transcript)
21
BEST-CONF-1 The winning index BEST-CONF-2 The
first two winning indexes.
22
(No Transcript)
23
(No Transcript)
24
Summary
  • Selecting candidate indexes
  • An algorithm for configuration enumeration
  • Multi-column index
  • Can these ideas be applicable to materialized
    views ??

25
Automated Selection of Materialized Views and
Indexes for SQL Databases Sanjay Agrawal Surajit
Chaudhuri Vivek Narasayya
26
Indexes and Materialized views
  • Indexes can be considered as special materialized
    views.
  • Views have rich structure.
  • Space of potentially interesting materialized
    views for a given workload is very large.
  • The technique proposed deals with automate
    selection of materialized views in presence of
  • a) Workload b) Indexes
  • c) Indexes on materialized views d) In sync
    with the optimizer

27
Indexed Views in SQL server
When a clustered index is created on the view,
SQL Server immediately allocates storage space to
store the results of the view. Indexed View
definition cannot contain the following a)
TOP b) DISTINCT c) MIN, MAX,
COUNT, STDEV, VARIANCE, AVG d) Another view
e) UNION f) Subqueries, outer joins, self
joins g) Cannot include order by in view
definition and some more...
28
joint enumeration
29
Examples
Example 1. Workload consisting of 1000 queries of
the form SELECT l_returnflag,
l_linestatus, SUM(l_quantity) FROM lineitem
WHERE l_shipdate BETWEEN ltDate1gt and ltDate2gt
GROUP BY l_returnflag, l_linestatus (Assume
that each of the 1000 queries has different
constants for ltDate1gt and ltDate2gt ). MV that can
service all 1000 queries SELECT
l_shipdate, l_returnflag, l_linestatus,SUM(l_quant
ity) FROM lineitem GROUP BY
l_shipdate,l_returnflag, l_linestatus
30
Example 2 Workload of 100 queries whose total
cost is 10,000 units. Let T be a table-subset
that occurs in 25 queries whose combined cost is
50 units. Then even if we considered all
syntactically relevant materialized views on T,
the maximum possible benefit of those
materialized views for the workload is 0.5.
Also, the number of rows matters. For example,
tables lineitem and orders may have 6 million and
1.5 million rows respectively, but tables nation
and region may be very small (25 and 5 rows
respectively).
31
Features (tuning mode, performance
benefit)
SQL Server 7.0 Indexes Only, Thorough,
49 SQL Server 2000 Indexes Only, Fast,
37 SQL Server 2000 Indexes Only, Medium,
39 SQL Server 2000 Indexes and Indexed Views,
Thorough, 79
The Index Tuning Wizard in SQL Server 2000
supports the ability to randomly sample queries
from a workload and restrict tuning to the
sampled queries
32
Features
  • Index Usage Report
  • Query Cost Report
  • View-Table Relations Report
  • Workload Analysis Report
  • Maximum columns per index
  • Maximum space for the recommendation

33
Other References
  • AutoAdmin Self-Tuning and Self-Administering
    Databases
  • http//research.microsoft.com/dmx/autoadmin/defaul
    t.asp
  • Index tuning wizard http//www.microsoft.com/techn
    et/prodtechnol/sql/2000/maintain/tunesql.mspx
  • Indexed Views in SQL server http//www.sqlteam.com
    /item.asp?ItemID1015

34
Extra slides
35
Candidate materialized view selection 1) From
the large space of all possible table-subsets
for the workload, we arrive at a smaller set of
interesting table-subsets. 2) Based on these,
a) Propose a set of materialized views for
each query in the workload b)
From this set we select a configuration that is
best for that query. 3) View merging
36
Properties of Merged View
1. All queries answered using either of the
parent views should be answerable using the
merged view. 2. Prevent the merged view from
becoming too large as compared to its parents
based upon some threshold.
37
Metrics used
TS-Cost(T) total cost of all queries in the
workload (for the current database) where
table-subset T occurs. If all queries in the
workload referenced the tables A, B, C and D
together, then using the TS-Cost(T) metric, the
table-subsets T1 A,B would have the same
importance as the table-subset T2
C,D. TS-Weight(T) sum Cost(Qi) (sum of
sizes of tables in T)/ (sum of sizes of all
tables referenced in Qi)) The summation is only
over queries in the workload where T occurs.
38
Then, propose MV on these table sets. For a given
query, take the view with minimum cost.
39
Merging Views
R Materialized views returned by
enumeration While( R gt 1) M' Views
returned by MergeView on each pair of R If
M' return (R-M) R R U M' For each
view in M' remove both its parents End
While Return (R-M)
Write a Comment
User Comments (0)
About PowerShow.com