Title: Mining Approximate Functional Dependencies (AFDs) as Condensed Representations of Association Rules
1. Mining Approximate Functional Dependencies (AFDs) as Condensed Representations of Association Rules
- Master's Thesis Defense
- by Aravind Krishna Kalavagattu
- Committee Members
- Dr. Subbarao Kambhampati (chair)
- Dr. Yi Chen
- Dr. Huan Liu
2. AFDs
- Database Systems
- Well-defined schema and method for querying (SQL)
- Query optimization
- Lately, some systems have started supporting IR-style answering of user queries
- Data Mining
- Discovering useful patterns from data
- Rule learning is a well-researched method for discovering interesting relations between variables in large databases: Association Rules
Rule mining has several applications over databases.
3. Introduction to AFDs
- Approximate Functional Dependencies are rules denoting approximate determinations at the attribute level.
- AFDs are of the form (X → Y), where X and Y are sets of attributes
- X is the determining set and Y is called the dependent set
- Rules with singleton dependent sets are of high interest
- A classic example of an AFD: (Nationality → Language)
- More examples
- Make → Model
- (Job Title, Experience) → Salary
(Nationality → Language) indicates that we can approximately guess the language of a person if we know which country she is from.
4. Introduction (contd.)
- Functional Dependency (FD)
- Given a relation R, a set of attributes X in R is said to functionally determine another attribute Y, also in R, (written X → Y) if and only if each X value is associated with precisely one Y value.
- AFDs can be loosely defined as FDs that approximately hold (there are some exception rows that fail to satisfy the function over the current relation)
- Example: Make → Model (with error 0.3)
- 70% of the tuples satisfy the dependency
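The FD/AFD distinction above can be made concrete with a short check over a toy table. This is an illustrative sketch only (the table contents and function names are hypothetical, not from the thesis): an FD X → Y holds when each X value maps to exactly one Y value, and the g3 error is the smallest fraction of tuples that must be removed for the FD to hold exactly.

```python
from collections import defaultdict

# Hypothetical 8-tuple car table (Make, Model); illustrative, not the thesis's data.
rows = [("Honda", "Accord")] * 3 + [("Honda", "Civic")] * 2 + \
       [("Toyota", "Camry")] * 2 + [("Toyota", "Corolla")]

def holds_as_fd(rows):
    """X -> Y holds exactly when each X value maps to a single Y value."""
    seen = defaultdict(set)
    for x, y in rows:
        seen[x].add(y)
    return all(len(ys) == 1 for ys in seen.values())

def g3_error(rows):
    """g3 error: minimum fraction of tuples to remove so X -> Y becomes an FD."""
    counts = defaultdict(lambda: defaultdict(int))
    for x, y in rows:
        counts[x][y] += 1
    kept = sum(max(ys.values()) for ys in counts.values())
    return 1 - kept / len(rows)

print(holds_as_fd(rows))  # False: Make -> Model has exception rows
print(g3_error(rows))     # 0.375: 3 of the 8 tuples must be removed
```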
5. Applications of AFDs
- Predicting missing values of attributes in relational tables (QPIAD): using the values of the attributes in the determining set of an AFD
- Query optimization (CORDS, BHUNT): maintaining correct selectivity estimates
- Database design (database normalization, efficient storage): similar to the way FDs are used
- Query rewriting (AIMQ, QPIAD, QUIC)
- Example: Model → BodyStyle. Rewrite a query on Model=RAV4 to retrieve tuples with BodyStyle=SUV
6. FD Mining and Implications
- FD mining aims at finding a minimal cover
- The minimum set of FDs from which the entire set of FDs can be generated
- Example: if A → B is an FD, then (A,C → B) is considered redundant
- Can we substitute this by generating only minimal dependencies in the case of AFDs?
- NO, because an AFD (Z → B) may be interesting for the application and we may prefer it to A → B.
- Non-minimal dependencies perform better in QPIAD, QUIC, etc.
Example: AFD (JobTitle, Experience) → Salary vs. (JobTitle → Salary)
7. Performance Concerns
- AFD mining is costly
- The pruning strategies of FDs are not applicable in the case of AFDs.
- For datasets with a large number of attributes, the search space gets worse!
- The method for determining whether a dependency holds or not is costly
- The way to traverse the search space is tricky
- Bottom-up vs. top-down?
8. Quality Concerns
- Before algorithms for discovering AFDs can be developed, AFDs need better interestingness measures
- AFDs used as feature selectors in classification are expected to give good accuracy.
- AFDs used in query rewriting are expected to give a high throughput per query.
- (VIN → Make) vs. (Model → Make)
- (VIN → Make) looks good using the error metric
- But, intuitively (as well as practically), (Model → Make) is a better AFD.
9. Challenges in AFD Mining
- 1. Defining the right interestingness measures
- 2. Performing an efficient traversal of the search space of possible rules
- 3. Employing effective pruning strategies
10. Agenda/Outline
- Introduction
- Related Work
- Provide a new perspective for AFDs
- Roll-ups/condensed representations of association rules
- Define measures for AFDs
- Present the AFDMiner algorithm
- Experimental Results
- Performance
- Quality
12. Related Work
- FD Mining Algorithms
- Aim at finding a minimal cover
- DepMiner, FUN, TANE, FD_Mine
- These do not work well for AFDs
- Metrics do not seem to matter in practice
- No accompanying algorithm to mine AFDs
- Existing approximation measures for AFDs
- Tau, InD metrics
- Grouping association rules: clustering association rules (v1 → u and v2 → u grouped as (v1 ∨ v2) → u)
- No one combines them as AFDs
13. Existing AFD Miners
- Restricted to singleton determining sets
- Work from a sample
- The measures used are not appropriate
- CORDS
- SoftFDs (C1 → C2)
- Uses |C1,C2| / (|C1| |C2|) as the approximation measure
- AIMQ/QPIAD/QUIC
- TANE
- Post-processing over TANE
- Highly inefficient
- Quality of some AFDs is bad
14. Agenda/Outline
- Introduction
- Related Work
- Provide a new perspective for AFDs
- Roll-ups/condensed representations of association rules
- Define measures for AFDs
- Present the AFDMiner algorithm
- Experimental Results
- Performance
- Quality
15. Condensing Association Rules
- Viewing database relations as transactions
- Itemsets: attribute-value pairs
- Association rules
- Between itemsets
- Beer → Diapers
- Here, they are between attribute-value pairs
- AFDs are rules between attributes
- Each AFD corresponds to a lot of association rules sharing the same attributes
- Example association rule: (Toyota, Camry) → Sedan
16. Rolling up Association Rules as AFDs
Make → Model
Honda → Accord
Toyota → Camry
Tata → Maruti800
17. Confidence
- Consider an association rule of the form (α → β)
- Confidence denotes the conditional probability of β (the head) given α (the body).
- Similarly, for an AFD (X → A),
- Confidence should denote the chance of finding the values of A, given the values of X
- Define AFD Confidence in terms of the confidence of association rules
Specifically, by picking the best association rule for every distinct value-combination of the body of the association rule.
18. Confidence
- For the example carDB,
- Confidence = Support(Make=Honda → Model=Accord) + Support(Make=Toyota → Model=Camry) = 3/8 + 2/8 = 5/8
- Interestingly, this is equal to (1 − g3)
- g3 has a natural interpretation as the fraction of tuples with exceptions affecting the dependency.
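The roll-up above can be computed directly. A minimal sketch (the 8-tuple carDB below is hypothetical, chosen only to reproduce the slide's 3/8 + 2/8 = 5/8 numbers): for each distinct value of the body, take the support of the best association rule and sum.

```python
from collections import Counter

# Hypothetical 8-tuple carDB, chosen to reproduce the slide's 3/8 + 2/8 = 5/8.
rows = [("Honda", "Accord")] * 3 + [("Honda", "Civic")] * 2 + \
       [("Toyota", "Camry")] * 2 + [("Toyota", "Corolla")]

def afd_confidence(rows):
    """Confidence(X -> A): for every distinct body value x, take the support
    of the best association rule (X=x -> A=a), then sum; equals 1 - g3."""
    rule_support = Counter(rows)          # support count of each (x, a) rule
    best = {}
    for (x, a), c in rule_support.items():
        best[x] = max(best.get(x, 0), c)
    return sum(best.values()) / len(rows)

print(afd_confidence(rows))                       # 0.625, i.e. 5/8 for Make -> Model
print(afd_confidence([(m, k) for k, m in rows]))  # 1.0: Model -> Make is an FD here
```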
19. Specificity
- For an association rule (α → β),
- Support is the probability with which the conditioning event (i.e., α) occurs
- A rule with high confidence yet low support is a bad rule!
- The presence of a lot of association rules with low supports makes the AFD bad.
- In classification, this affects prediction accuracy.
- For query rewriting tasks, the per-query throughput is lower.
20. Types of AFDs
- 1. Model → Make
- Few branches, uniform distribution
- Good, and might hold universally
- 2. VIN → Make
- Many branches, uniform distribution
- Bad: the confidence of each association rule is high, but the supports are bad
- 3. Model, Location → Price
- Many branches, skewed distribution
- Few association rules with high support and many with low support
21. Specificity
Normalized by the worst-case Specificity, i.e., when X is a key
- The Specificity measure captures our intuition about the different types of AFDs.
- It is based on information entropy
- The higher the Specificity (above a threshold), the worse the AFD!
- Shares similar motivations with the way SplitInfo is defined in decision trees while computing the Information Gain Ratio
- Follows monotonicity
22. Agenda/Outline
- Introduction
- Related Work
- Provide a new perspective for AFDs
- Roll-ups/condensed representations of association rules
- Define measures for AFDs
- Present the AFDMiner algorithm
- Experimental Results
- Performance
- Quality
23. AFD Mining Problem
- Good AFDs are the ones within the desired thresholds of the Confidence and Specificity measures.
- Formally, the AFD mining problem can be stated as follows: given a relation and thresholds minConf and maxSpecificity, find all AFDs with Confidence ≥ minConf and Specificity ≤ maxSpecificity.
24. AFD Mining
- The problem of AFD mining is to learn all AFDs that hold over a given relational table
- Two costs
- 1. The major cost is the combinatoric cost of traversing the search space
- 2. The cost of visiting the data to validate each rule (to compute the interestingness measures)
- The search process for AFDs is exponential in the number of attributes
25. Pruning Strategies
- 1. Pruning by Specificity
- Specificity(Y) ≥ Specificity(X), where Y is a superset of X
- If Specificity(X) > maxSpecificity, we can prune all AFDs with X and its supersets as the determining set
- 2. Pruning (applicable to FDs)
- If (X → A) is an FD, all AFDs of the form (Y → A), with Y a superset of X, can be pruned
- 3. Pruning keys
- Needed for FDs
- But this is subsumed by case 1 in AFDMiner
- Because if Specificity(X) = 1, it means X is a key
26. AFDMiner Algorithm
- The search starts from singleton sets of attributes and works its way to larger attribute sets through the set-containment lattice, level by level.
- When the algorithm is processing a set X, it tests AFDs of the form (X \ A) → A, where A ∈ X.
- Information from previous levels is captured by maintaining RHS candidate sets for each set.
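The levelwise search can be sketched as follows. This is a simplified, illustrative version under stated assumptions: the data, parameter values, and the entropy-based Specificity formula are assumptions, and the real AFDMiner's RHS candidate sets and FD-based pruning are omitted here for brevity, with only Specificity-based pruning shown.

```python
import math
from collections import Counter

def _project(rows, attrs):
    """Values of the given column indices for every row."""
    return [tuple(r[a] for a in attrs) for r in rows]

def confidence(rows, X, A):
    """Confidence(X -> A) = 1 - g3: per distinct X-value, keep the support
    of the best association rule (X=x -> A=a), then sum."""
    pairs = Counter(zip(_project(rows, X), _project(rows, (A,))))
    best = {}
    for (x, a), c in pairs.items():
        best[x] = max(best.get(x, 0), c)
    return sum(best.values()) / len(rows)

def specificity(rows, X):
    """Entropy of X's value distribution, normalized by log N (worst case: key)."""
    n = len(rows)
    counts = Counter(_project(rows, X))
    return -sum((c / n) * math.log(c / n) for c in counts.values()) / math.log(n)

def afd_miner(rows, attrs, min_conf, max_spec, max_len):
    """Levelwise bottom-up search over the set-containment lattice.
    A determining set X with Specificity above max_spec is pruned together
    with all its supersets (Specificity is monotone)."""
    afds = []
    level = [(a,) for a in attrs]
    while level and len(level[0]) <= max_len:
        survivors = []
        for X in level:
            if specificity(rows, X) > max_spec:
                continue                      # prune X and all its supersets
            survivors.append(X)
            for A in (a for a in attrs if a not in X):
                if confidence(rows, X, A) >= min_conf:
                    afds.append((X, A))
        # next level: unions of surviving sets that grow the set by one
        level = sorted({tuple(sorted(set(x) | set(y)))
                        for x in survivors for y in survivors
                        if len(set(x) | set(y)) == len(x) + 1})
    return afds

# Hypothetical carDB; columns: 0 = Make, 1 = Model
rows = ([("Honda", "Accord")] * 3 + [("Honda", "Civic")] * 2 +
        [("Toyota", "Camry")] * 2 + [("Toyota", "Corolla")])
print(afd_miner(rows, attrs=(0, 1), min_conf=0.9, max_spec=1.0, max_len=1))
# [((1,), 0)]  -- only Model -> Make clears minConf = 0.9
```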
27. Traversal of the Search Space
- During the bottom-up breadth-first search, the stopping criteria at a node are:
- The AFD confidence becomes 1, and thus it is an FD (FD-based pruning)
- The Specificity value of X is greater than the given maximum (Specificity-based pruning)
Example: A → C is an FD. Then C is removed from RHS(ABC).
28. Computing Confidence and Specificity
- The methods are based on representing attribute sets by equivalence-class partitions of the set of tuples
- π_X is the collection of equivalence classes of tuples for attribute set X
- Example: π_Make, π_Model, and π_Make∪Model each partition the example carDB's tuples {1, 2, 3, 4, 5, 6, 7, 8} into equivalence classes
- A functional dependency X → A holds iff π_X = π_X∪A
- For the AFD (X → A), Confidence = 1 − g3(X → A)
In this example, Confidence(Model → Make) = 1 and Confidence(Make → Model) = 5/8.
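The partition representation can be sketched over a toy table (the table contents are a hypothetical carDB consistent with the slide's confidence values, and tuples are numbered 1-8): computing π_X and comparing it with π_X∪A decides whether the FD holds.

```python
from collections import defaultdict

# Hypothetical carDB consistent with the slide's confidence values (assumption).
table = {
    "Make":  ["Honda"] * 5 + ["Toyota"] * 3,
    "Model": ["Accord"] * 3 + ["Civic"] * 2 + ["Camry"] * 2 + ["Corolla"],
}

def partition(table, attrs):
    """pi_X: equivalence classes of tuple ids agreeing on all attributes in X."""
    classes = defaultdict(list)
    n = len(next(iter(table.values())))
    for i in range(n):
        classes[tuple(table[a][i] for a in attrs)].append(i + 1)
    return sorted(classes.values())

# An FD X -> A holds iff pi_X equals pi_(X union A)
print(partition(table, ["Make"]) == partition(table, ["Make", "Model"]))   # False
print(partition(table, ["Model"]) == partition(table, ["Model", "Make"])) # True
```

Here Make → Model fails the partition test (it is only an AFD), while Model → Make passes, matching the slide's Confidence(Model → Make) = 1.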
29. Algorithms
30. Agenda/Outline
- Introduction
- Related Work
- Provide a new perspective for AFDs
- Roll-ups/condensed representations of association rules
- Define measures for AFDs
- Present the AFDMiner algorithm
- Experimental Results
- Performance
- Quality
31. Empirical Evaluation
- Experimental Setup
- Datasets
- CensusDB (199,523 tuples, 30 attributes)
- MushroomDB (8,124 tuples, 23 attributes)
- Parameters for AFDMiner
- minConf
- maxSpecificity
- No. of tuples
- No. of attributes
- Max length of the determining set
- The aim of the experiments is to show that the dual-measure approach (AFDMiner, using both Confidence and Specificity) outperforms the single-measure approach (No_Specificity, which uses Confidence alone)
No_Specificity: a modified version of AFDMiner that uses only Confidence but not Specificity. Thus, it generates all AFDs (X → A) with Confidence(X → A) ≥ minConf.
32. Evaluating Quality
- BestAFD
- The highest-confidence AFD among all the AFDs with attribute A as their dependent attribute
- Classification task
- A classifier is run with the determining set of BestAFD as the features
- Used 10-fold cross-validation and computed the average classification accuracy
- Weka toolkit
- Evaluated over CensusDB
33. Evaluating Quality
(Charts: classification accuracy on CensusDB, AFDMiner vs. No_Specificity; choosing minConf)
- Average classification accuracy for all attributes
- minConf = 0.8, maxSpecificity = 0.4
Shows that Specificity is effective in generating better-quality AFDs.
34. Choosing maxSpecificity
(Charts: classification accuracy on CensusDB, varying maxSpecificity)
- Classification accuracy (varying maxSpecificity)
- Threshold too low → good rules are pruned
- Threshold too high → bad rules are not pruned
- The classification accuracy approximately forms a double-elbow-shaped curve.
35. Choosing maxSpecificity
(Charts: time to compute AFDs, varying maxSpecificity; best value marked)
- Time to compute AFDs
- Increases with increasing maxSpecificity
- The rate of change varies
- A good threshold value for Specificity (i.e., maxSpecificity) is the value at the first elbow in the quality graph
36. Query Throughput
(Chart: AFDMiner vs. No_Specificity)
Number of tuples returned for top-10 queries on each distinct determining set (denotes query throughput)
37. Discussion on TANE
- Primarily designed to generate FDs
- A modified version generates approximate dependencies
- Uses the error metric g3 for AFDs
- Bottom-up search in the lattice
- Generates only minimal dependencies
- Pruning applicable to FDs
38. Comparison (AFDMiner vs. TANE)
- TANENOMINP is a modified version of TANE that does not stop with just minimal dependencies.
- minConf is 0.8 (thus, we set g3 to 0.2)
AFDMiner outperforms both approaches, thus strengthening the argument that AFDs with high Confidence and reasonable Specificity are the best.
39. Evaluating Performance
(Charts on CensusDB)
- Time varies linearly with the number of tuples.
- AFDMiner takes less time than No_Specificity.
- Time varies exponentially with the number of attributes.
- AFDMiner completes much faster than No_Specificity.
40. Evaluating Performance
(Charts on CensusDB and MushroomDB)
These experiments show that AFDMiner is fast.
41. Conclusion
- Introduced a novel perspective for AFDs
- Condensed roll-ups of association rules
- Two metrics for AFDs
- Confidence
- Specificity
- Algorithm AFDMiner
- Finds all AFDs with Confidence ≥ minConf and Specificity ≤ maxSpecificity
- Bottom-up, breadth-first search in the set-containment lattice of attributes
- Pruning based on Specificity
- Experiments: AFDMiner generates high-quality AFDs faster
- AFDs with high Confidence and reasonable Specificity
A version of this thesis is currently under review at ICDE '09.
42. Future Directions
- Conditional Functional Dependencies (CFDs)
- Dependencies of the form (ZipCode → City if Country = England)
- i.e., holding true only for certain values of one or more other attributes
- CAFDs are the probabilistic counterpart of CFDs
- CFDs and CAFDs have recently been applied in data cleaning and value prediction, but mining these conditional rules is unexplored.
Intuitively, CFDs are intermediate rules between association rules (value level) and FDs (attribute level). So, we believe that our approach can help in generating them!