Title: Mining Approximate Functional Dependencies (AFDs) as Condensed Representations of Association Rules
1. Mining Approximate Functional Dependencies (AFDs) as Condensed Representations of Association Rules
- Master's Thesis Defense
- by Aravind Krishna Kalavagattu
- Committee Members
- Dr. Subbarao Kambhampati (chair)
- Dr. Yi Chen
- Dr. Huan Liu
2. AFDs
- Database Systems
- Well-defined schema and method for querying (SQL)
- Query optimization
- Lately, some systems have started supporting IR-style answering of user queries
- Data Mining
- Discovering useful patterns from data
- Rule learning is a well-researched method for discovering interesting relations between variables in large databases: Association Rules
Rule mining has several applications over databases.
3. Introduction to AFDs
- Approximate Functional Dependencies are rules denoting approximate determinations at the attribute level.
- AFDs are of the form (X → Y), where X and Y are sets of attributes
- X is the determining set and Y is called the dependent set
- Rules with singleton dependent sets are of high interest
- A classic example of an AFD: (Nationality → Language)
- More examples
- Make → Model
- (Job Title, Experience) → Salary
(Nationality → Language) indicates that we can approximately guess the language of a person if we know which country she is from.
4. Introduction (contd.)
- Functional Dependency (FD)
- Given a relation R, a set of attributes X in R is said to functionally determine another attribute Y, also in R, (written X → Y) if and only if each X value is associated with precisely one Y value.
- AFDs can be loosely defined as FDs that approximately hold (there are some exception rows that fail to satisfy the function over the current relation)
- Example: Make → Model (with error 0.3)
- 70% of the tuples satisfy the dependency
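The FD/AFD distinction above can be made concrete with a short check over a toy table. This is an illustrative sketch only (the table contents and function names are hypothetical, not from the thesis): an FD X → Y holds when each X value maps to exactly one Y value, and the g3 error is the smallest fraction of tuples that must be removed for the FD to hold exactly.

```python
from collections import defaultdict

# Hypothetical 8-tuple car table (Make, Model); illustrative, not the thesis's data.
rows = [("Honda", "Accord")] * 3 + [("Honda", "Civic")] * 2 + \
       [("Toyota", "Camry")] * 2 + [("Toyota", "Corolla")]

def holds_as_fd(rows):
    """X -> Y holds exactly when each X value maps to a single Y value."""
    seen = defaultdict(set)
    for x, y in rows:
        seen[x].add(y)
    return all(len(ys) == 1 for ys in seen.values())

def g3_error(rows):
    """g3 error: minimum fraction of tuples to remove so X -> Y becomes an FD."""
    counts = defaultdict(lambda: defaultdict(int))
    for x, y in rows:
        counts[x][y] += 1
    kept = sum(max(ys.values()) for ys in counts.values())
    return 1 - kept / len(rows)

print(holds_as_fd(rows))  # False: Make -> Model has exception rows
print(g3_error(rows))     # 0.375: 3 of the 8 tuples must be removed
```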
5. Applications of AFDs
- Predicting missing values of attributes in relational tables (QPIAD): using the values of the attributes in the determining set of an AFD
- Query optimization (CORDS, BHUNT): maintaining correct selectivity estimates
- Database design (database normalization, efficient storage): similar to the way FDs are used
- Query rewriting (AIMQ, QPIAD, QUIC)
- Example: Model → BodyStyle. Rewrite a query on Model=RAV4 to retrieve tuples with BodyStyle=SUV
6. FD Mining and Implications
- FD mining aims at finding a minimal cover
- The minimum set of FDs from which the entire set of FDs can be generated
- Example: if A → B is an FD, then (A,C → B) is considered redundant
- Can we substitute this by generating only minimal dependencies in the case of AFDs?
- NO, because an AFD (Z → B) may be interesting for the application and we may prefer it to A → B.
- Non-minimal dependencies perform better in QPIAD, QUIC, etc.
Example: AFD (JobTitle, Experience) → Salary vs. (JobTitle → Salary)
7. Performance Concerns
- AFD mining is costly
- The pruning strategies of FDs are not applicable in the case of AFDs.
- For datasets with a large number of attributes, the search space gets worse!
- The method for determining whether a dependency holds or not is costly
- The way to traverse the search space is tricky
- Bottom-up vs. top-down?
8. Quality Concerns
- Before algorithms for discovering AFDs can be developed, AFDs need better interestingness measures
- AFDs used as feature selectors in classification are expected to give good accuracy.
- AFDs used in query rewriting are expected to give a high throughput per query.
- (VIN → Make) vs. (Model → Make)
- (VIN → Make) looks good using the error metric
- But, intuitively (as well as practically), (Model → Make) is a better AFD.
9. Challenges in AFD Mining
- 1. Defining the right interestingness measures
- 2. Performing an efficient traversal of the search space of possible rules
- 3. Employing effective pruning strategies
10. Agenda/Outline
- Introduction
- Related Work
- Provide a new perspective for AFDs
- Roll-ups/condensed representations of association rules
- Define measures for AFDs
- Present the AFDMiner algorithm
- Experimental Results
- Performance
- Quality
12. Related Work
- FD Mining Algorithms
- Aim at finding a minimal cover
- DepMiner, FUN, TANE, FD_Mine
- These do not work well for AFDs
- Metrics do not seem to matter in practice
- No accompanying algorithm to mine AFDs
- Existing approximation measures for AFDs
- Tau, InD metrics
- Grouping association rules: clustering association rules (v1 → u and v2 → u grouped as (v1 ∨ v2) → u)
- No one combines them as AFDs
13. Existing AFD Miners
- Restricted to singleton determining sets
- Work from a sample
- The measures used are not appropriate
- CORDS
- SoftFDs (C1 → C2)
- Uses |C1,C2| / (|C1| |C2|) as the approximation measure
- AIMQ/QPIAD/QUIC
- TANE
- Post-processing over TANE
- Highly inefficient
- Quality of some AFDs is bad
14. Agenda/Outline
- Introduction
- Related Work
- Provide a new perspective for AFDs
- Roll-ups/condensed representations of association rules
- Define measures for AFDs
- Present the AFDMiner algorithm
- Experimental Results
- Performance
- Quality
15. Condensing Association Rules
- Viewing database relations as transactions
- Itemsets: attribute-value pairs
- Association rules
- Between itemsets
- Beer → Diapers
- Here, they are between attribute-value pairs
- AFDs are rules between attributes
- Each AFD corresponds to a lot of association rules sharing the same attributes
- Example association rule: (Toyota, Camry) → Sedan
16. Rolling up Association Rules as AFDs
Make → Model
Honda → Accord
Toyota → Camry
Tata → Maruti800
17. Confidence
- Consider an association rule of the form (α → β)
- Confidence denotes the conditional probability of β (the head) given α (the body).
- Similarly, for an AFD (X → A),
- Confidence should denote the chance of finding the values of A, given the values of X
- Define AFD Confidence in terms of the confidence of association rules
Specifically, by picking the best association rule for every distinct value-combination of the body of the association rule.
18. Confidence
- For the example carDB,
- Confidence = Support(Make=Honda → Model=Accord) + Support(Make=Toyota → Model=Camry) = 3/8 + 2/8 = 5/8
- Interestingly, this is equal to (1 − g3)
- g3 has a natural interpretation as the fraction of tuples with exceptions affecting the dependency.
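The roll-up above can be computed directly. A minimal sketch (the 8-tuple carDB below is hypothetical, chosen only to reproduce the slide's 3/8 + 2/8 = 5/8 numbers): for each distinct value of the body, take the support of the best association rule and sum.

```python
from collections import Counter

# Hypothetical 8-tuple carDB, chosen to reproduce the slide's 3/8 + 2/8 = 5/8.
rows = [("Honda", "Accord")] * 3 + [("Honda", "Civic")] * 2 + \
       [("Toyota", "Camry")] * 2 + [("Toyota", "Corolla")]

def afd_confidence(rows):
    """Confidence(X -> A): for every distinct body value x, take the support
    of the best association rule (X=x -> A=a), then sum; equals 1 - g3."""
    rule_support = Counter(rows)          # support count of each (x, a) rule
    best = {}
    for (x, a), c in rule_support.items():
        best[x] = max(best.get(x, 0), c)
    return sum(best.values()) / len(rows)

print(afd_confidence(rows))                       # 0.625, i.e. 5/8 for Make -> Model
print(afd_confidence([(m, k) for k, m in rows]))  # 1.0: Model -> Make is an FD here
```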
19. Specificity
- For an association rule (α → β),
- Support is the probability with which the conditioning event (i.e., α) occurs
- A rule with high confidence yet low support is a bad rule!
- The presence of a lot of association rules with low supports makes the AFD bad.
- In classification, this affects prediction accuracy.
- For query rewriting tasks, the per-query throughput is lower.
20. Types of AFDs
- 1. Model → Make
- Few branches, uniform distribution
- Good, and might hold universally
- 2. VIN → Make
- Many branches, uniform distribution
- Bad: the confidence of each association rule is high, but the supports are bad
- 3. Model, Location → Price
- Many branches, skewed distribution
- Few association rules with high support and many with low support
21. Specificity
Normalized by the worst-case Specificity, i.e., when X is a key
- The Specificity measure captures our intuition about the different types of AFDs.
- It is based on information entropy
- The higher the Specificity (above a threshold), the worse the AFD!
- Shares similar motivations with the way SplitInfo is defined in decision trees while computing the Information Gain Ratio
- Follows monotonicity
22. Agenda/Outline
- Introduction
- Related Work
- Provide a new perspective for AFDs
- Roll-ups/condensed representations of association rules
- Define measures for AFDs
- Present the AFDMiner algorithm
- Experimental Results
- Performance
- Quality
23. AFD Mining Problem
- Good AFDs are the ones within the desired thresholds of the Confidence and Specificity measures.
- Formally, the AFD mining problem can be stated as follows: given a relation and thresholds minConf and maxSpecificity, find all AFDs with Confidence ≥ minConf and Specificity ≤ maxSpecificity.
24. AFD Mining
- The problem of AFD mining is to learn all AFDs that hold over a given relational table
- Two costs
- 1. The major cost is the combinatoric cost of traversing the search space
- 2. The cost of visiting the data to validate each rule (to compute the interestingness measures)
- The search process for AFDs is exponential in the number of attributes
25. Pruning Strategies
- 1. Pruning by Specificity
- Specificity(Y) ≥ Specificity(X), where Y is a superset of X
- If Specificity(X) > maxSpecificity, we can prune all AFDs with X and its supersets as the determining set
- 2. Pruning (applicable to FDs)
- If (X → A) is an FD, all AFDs of the form (Y → A), with Y a superset of X, can be pruned
- 3. Pruning keys
- Needed for FDs
- But this is subsumed by case 1 in AFDMiner
- Because if Specificity(X) = 1, it means X is a key
26. AFDMiner Algorithm
- The search starts from singleton sets of attributes and works its way to larger attribute sets through the set-containment lattice, level by level.
- When the algorithm is processing a set X, it tests AFDs of the form (X \ A) → A, where A ∈ X.
- Information from previous levels is captured by maintaining RHS candidate sets for each set.
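The levelwise search can be sketched as follows. This is a simplified, illustrative version under stated assumptions: the data, parameter values, and the entropy-based Specificity formula are assumptions, and the real AFDMiner's RHS candidate sets and FD-based pruning are omitted here for brevity, with only Specificity-based pruning shown.

```python
import math
from collections import Counter

def _project(rows, attrs):
    """Values of the given column indices for every row."""
    return [tuple(r[a] for a in attrs) for r in rows]

def confidence(rows, X, A):
    """Confidence(X -> A) = 1 - g3: per distinct X-value, keep the support
    of the best association rule (X=x -> A=a), then sum."""
    pairs = Counter(zip(_project(rows, X), _project(rows, (A,))))
    best = {}
    for (x, a), c in pairs.items():
        best[x] = max(best.get(x, 0), c)
    return sum(best.values()) / len(rows)

def specificity(rows, X):
    """Entropy of X's value distribution, normalized by log N (worst case: key)."""
    n = len(rows)
    counts = Counter(_project(rows, X))
    return -sum((c / n) * math.log(c / n) for c in counts.values()) / math.log(n)

def afd_miner(rows, attrs, min_conf, max_spec, max_len):
    """Levelwise bottom-up search over the set-containment lattice.
    A determining set X with Specificity above max_spec is pruned together
    with all its supersets (Specificity is monotone)."""
    afds = []
    level = [(a,) for a in attrs]
    while level and len(level[0]) <= max_len:
        survivors = []
        for X in level:
            if specificity(rows, X) > max_spec:
                continue                      # prune X and all its supersets
            survivors.append(X)
            for A in (a for a in attrs if a not in X):
                if confidence(rows, X, A) >= min_conf:
                    afds.append((X, A))
        # next level: unions of surviving sets that grow the set by one
        level = sorted({tuple(sorted(set(x) | set(y)))
                        for x in survivors for y in survivors
                        if len(set(x) | set(y)) == len(x) + 1})
    return afds

# Hypothetical carDB; columns: 0 = Make, 1 = Model
rows = ([("Honda", "Accord")] * 3 + [("Honda", "Civic")] * 2 +
        [("Toyota", "Camry")] * 2 + [("Toyota", "Corolla")])
print(afd_miner(rows, attrs=(0, 1), min_conf=0.9, max_spec=1.0, max_len=1))
# [((1,), 0)]  -- only Model -> Make clears minConf = 0.9
```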
27. Traversal of the Search Space
- During the bottom-up breadth-first search, the stopping criteria at a node are:
- The AFD confidence becomes 1, and thus it is an FD (FD-based pruning)
- The Specificity value of X is greater than the given maximum (Specificity-based pruning)
Example: A → C is an FD. Then C is removed from RHS(ABC).
28. Computing Confidence and Specificity
- The methods are based on representing attribute sets by equivalence-class partitions of the set of tuples
- π_X is the collection of equivalence classes of tuples for attribute set X
- Example: π_Make, π_Model, and π_Make∪Model each partition the example carDB's tuples {1, 2, 3, 4, 5, 6, 7, 8} into equivalence classes
- A functional dependency X → A holds iff π_X = π_X∪A
- For the AFD (X → A), Confidence = 1 − g3(X → A)
In this example, Confidence(Model → Make) = 1 and Confidence(Make → Model) = 5/8.
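The partition representation can be sketched over a toy table (the table contents are a hypothetical carDB consistent with the slide's confidence values, and tuples are numbered 1-8): computing π_X and comparing it with π_X∪A decides whether the FD holds.

```python
from collections import defaultdict

# Hypothetical carDB consistent with the slide's confidence values (assumption).
table = {
    "Make":  ["Honda"] * 5 + ["Toyota"] * 3,
    "Model": ["Accord"] * 3 + ["Civic"] * 2 + ["Camry"] * 2 + ["Corolla"],
}

def partition(table, attrs):
    """pi_X: equivalence classes of tuple ids agreeing on all attributes in X."""
    classes = defaultdict(list)
    n = len(next(iter(table.values())))
    for i in range(n):
        classes[tuple(table[a][i] for a in attrs)].append(i + 1)
    return sorted(classes.values())

# An FD X -> A holds iff pi_X equals pi_(X union A)
print(partition(table, ["Make"]) == partition(table, ["Make", "Model"]))   # False
print(partition(table, ["Model"]) == partition(table, ["Model", "Make"])) # True
```

Here Make → Model fails the partition test (it is only an AFD), while Model → Make passes, matching the slide's Confidence(Model → Make) = 1.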
29. Algorithms
30. Agenda/Outline
- Introduction
- Related Work
- Provide a new perspective for AFDs
- Roll-ups/condensed representations of association rules
- Define measures for AFDs
- Present the AFDMiner algorithm
- Experimental Results
- Performance
- Quality
31. Empirical Evaluation
- Experimental Setup
- Datasets
- CensusDB (199,523 tuples, 30 attributes)
- MushroomDB (8,124 tuples, 23 attributes)
- Parameters for AFDMiner
- minConf
- maxSpecificity
- No. of tuples
- No. of attributes
- Max length of the determining set
- The aim of the experiments is to show that the dual-measure approach (AFDMiner, using both Confidence and Specificity) outperforms the single-measure approach (No_Specificity, which uses Confidence alone)
No_Specificity: a modified version of AFDMiner that uses only Confidence but not Specificity. Thus, it generates all AFDs (X → A) with Confidence(X → A) ≥ minConf.
32. Evaluating Quality
- BestAFD
- The highest-confidence AFD among all the AFDs with attribute A as their dependent attribute
- Classification task
- A classifier is run with the determining set of BestAFD as the features
- Used 10-fold cross-validation and computed the average classification accuracy
- Weka toolkit
- Evaluated over CensusDB
33. Evaluating Quality
(Charts: classification accuracy on CensusDB, AFDMiner vs. No_Specificity; choosing minConf)
- Average classification accuracy for all attributes
- minConf = 0.8, maxSpecificity = 0.4
Shows that Specificity is effective in generating better-quality AFDs.
34. Choosing maxSpecificity
(Charts: classification accuracy on CensusDB, varying maxSpecificity)
- Classification accuracy (varying maxSpecificity)
- Threshold too low → good rules are pruned
- Threshold too high → bad rules are not pruned
- The classification accuracy approximately forms a double-elbow-shaped curve.
35. Choosing maxSpecificity
(Charts: time to compute AFDs, varying maxSpecificity; best value marked)
- Time to compute AFDs
- Increases with increasing maxSpecificity
- The rate of change varies
- A good threshold value for Specificity (i.e., maxSpecificity) is the value at the first elbow in the quality graph
36. Query Throughput
(Chart: AFDMiner vs. No_Specificity)
Number of tuples returned for top-10 queries on each distinct determining set (denotes query throughput)
37. Discussion on TANE
- Primarily designed to generate FDs
- A modified version generates approximate dependencies
- Uses the error metric g3 for AFDs
- Bottom-up search in the lattice
- Generates only minimal dependencies
- Pruning applicable to FDs
38. Comparison (AFDMiner vs. TANE)
- TANENOMINP is a modified version of TANE that does not stop with just minimal dependencies.
- minConf is 0.8 (thus, we set g3 to 0.2)
AFDMiner outperforms both approaches, thus strengthening the argument that AFDs with high Confidence and reasonable Specificity are the best.
39. Evaluating Performance
(Charts on CensusDB)
- Time varies linearly with the number of tuples.
- AFDMiner takes less time than No_Specificity.
- Time varies exponentially with the number of attributes.
- AFDMiner completes much faster than No_Specificity.
40. Evaluating Performance
(Charts on CensusDB and MushroomDB)
These experiments show that AFDMiner is fast.
41. Conclusion
- Introduced a novel perspective for AFDs
- Condensed roll-ups of association rules
- Two metrics for AFDs
- Confidence
- Specificity
- Algorithm AFDMiner
- Finds all AFDs with Confidence ≥ minConf and Specificity ≤ maxSpecificity
- Bottom-up, breadth-first search in the set-containment lattice of attributes
- Pruning based on Specificity
- Experiments: AFDMiner generates high-quality AFDs faster
- AFDs with high Confidence and reasonable Specificity
A version of this thesis is currently under review at ICDE '09.
42. Future Directions
- Conditional Functional Dependencies (CFDs)
- Dependencies of the form (ZipCode → City if Country = England)
- i.e., holding true only for certain values of one or more other attributes
- CAFDs are the probabilistic counterpart of CFDs
- CFDs and CAFDs have recently been applied in data cleaning and value prediction, but mining these conditional rules is unexplored.
Intuitively, CFDs are intermediate rules between association rules (value level) and FDs (attribute level). So, we believe that our approach can help in generating them!