Title: Frequent Itemsets Mining for Database autoadministration
1Frequent Itemsets Mining for Database
auto-administration
International Database EngineeringApplications
Symposium
2Plan
- Introduction
- State of the art
- Our proposal
- Experiment results
- Conclusion and perspectives
3Introduction
- Database administration
- Definition of conceptual schema
- Definition of files and disc storage
- Data access optimization (creation of indexes and
materialized views) - Minimize and automate the administration
functions - Auto-select indexes
- Using data mining (frequent itemsets)
4State of the art
- Frank et al., 1992 (Georgia Tech)
- Construction of an index graph based on a
workload - Dialogue with the query optimizer ? index cost
- Choice of the least expensive indexes for the
workload - Chaudhuri et al., 1997-2001 (Microsoft)
- Selection of indexable attributes based on a
workload - Construction of mono and multi-attribute
configuration - Dialogue with the query optimizer ? configuration
cost - Choice of the optimal configuration
- Brunel, Rollin et al., 2001 (Lyon 2 - Oklahoma)
- Based on the work of Frank et al.
- First work about the extraction of workload
characteristics
5Log file of transactions
Log file extraction
SQL query syntactic analyzer
Ensemble dattributs candidats
Construction of the Queries-attributes matrix
Queries-attributes matrix
Frequent Itemsets search
Database and workload
Set of frequent itemsets
Candidate indexes
Index construction
6Queries-attributes matrix
- For each query within the workload? extract
attributes present on the clauses - Where
- Group by
- Order By
Xij 1 if the attribute j appears in the clauses
of the query i Xij 0 in the other case
7Example
Q1 Select from T1, T2 where A between 1 and
10 And C D Q2 Select from T1, T2 where B
like sample and C 5 and E lt 100 Q3 Select
from T1, T2 where A 30 and B gt 3 group by C
Having Sum(E) gt 2 Q4 Select from T1 where B
gt 2 and E in (3,2,5) Q5 Select from T1, T2
where A 30 and B gt 3 group by C Having Sum(E) gt
2 Q6 Select from T1, T2 where B gt 3 group by
C Having Sum(E) gt 2
Application of Close with a minimal support
minsup 2/6
8Indexes construction
- Building all indexes
- For each frequent itemset corresponds an index
- This method is feasible when the number of
indexes is relatively small - Building indexes on large tables
- Reduce the number of indexes to build
- The DBA defines whether a table is large or not
9Database experiments
- TPC-R Database
- Transaction Processing Council
- Decision support benchmark TPC-R
- 8 Tables
- Size 1 GB
- Accidentology Data Mart
- BDD research group (ERIC)
- One Fact table Accident and 4 Dimension tables
Place, Date, Condition and PersonResponsible - 65 000 Records
10TPC-R results- All indexes
11TPC-R results - Indexes on large tables
12Data Mart results
13Conclusions
- Use of data mining to determine an index
configuration - Mono et multi-attribute indexes are generated on
the fly - Response time improvements of 20 to 25 for
decision support worklod (TPC-R) - Response time improvements is about 14
14Perspectives
- Improve index selection by designing more
elaborated strategies (cost model, weighting the
workloads queries - Tests on large data warehouses
- Comparison with the Microsoft technique (query
optimizer vs. frequent itemsets) - Extending or coupling our approach with other
performance optimization (materialized views,
physical clustering, ...) - Exploit other data mining algorithms (functional
dependencies or inclusion dependencies)