Mining Approximate Functional Dependencies (AFDs) as Condensed Representations of Association Rules - PowerPoint PPT Presentation

1
Mining Approximate Functional Dependencies (AFDs)
as Condensed Representations of Association Rules
  • Master's Thesis Defense
  • by Aravind Krishna Kalavagattu
  • Committee Members
  • Dr. Subbarao Kambhampati (chair)
  • Dr. Yi Chen
  • Dr. Huan Liu

2
AFDs
  • Database Systems
  • Well-defined schema and method for querying (SQL)
  • Query optimization
  • Recently, some systems have started supporting
    IR-style answering of user queries
  • Data mining
  • Discovering useful patterns from data
  • Rule learning is a well-researched method for
    discovering interesting relations between
    variables in large databases
  • Association Rules

Rule mining has several applications over
databases
3
Introduction to AFDs
  • Approximate Functional Dependencies are rules
    denoting approximate determinations at attribute
    level.
  • AFDs are of the form (X ~> Y), where X and Y are
    sets of attributes
  • X is the determining set and Y is the dependent
    set
  • Rules with singleton dependent sets are of high
    interest
  • A classic example of an AFD
  • (Nationality ~> Language)
  • More examples
  • Make ~> Model
  • (Job Title, Experience) ~> Salary

Indicates that we can approximately guess the
language of a person if we know which country she
is from.
4
Introduction (contd..)
  • Functional Dependency (FD)
  • Given a relation R, a set of attributes X in R is
    said to functionally determine another attribute
    Y, also in R (written X → Y), if and only if each
    X value is associated with precisely one Y value.
  • AFDs can be loosely defined as FDs that
    approximately hold (there are some exception rows
    that fail to satisfy the dependency over the
    current relation)
  • Example: Make ~> Model (with error 0.3)
  • 70% of the tuples satisfy the dependency
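The error of an AFD can be computed directly by counting, for each distinct value of the determining set, the tuples outside its most common dependent value. A minimal sketch in Python; the eight-tuple car table is a hypothetical reconstruction (an assumption, chosen so that 5 of 8 tuples satisfy Make ~> Model, matching the confidence example later in the deck):

```python
from collections import Counter, defaultdict

def g3_error(rows, lhs, rhs):
    """g3: the minimum fraction of tuples that must be removed
    so that the FD lhs -> rhs holds exactly on the rest."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[tuple(row[a] for a in lhs)][row[rhs]] += 1
    # Per lhs value, keep the most common rhs value; the rest are exceptions.
    kept = sum(c.most_common(1)[0][1] for c in groups.values())
    return 1 - kept / len(rows)

# Hypothetical carDB: 5 of 8 tuples satisfy Make ~> Model
cars = (
    [{"Make": "Honda", "Model": "Accord"}] * 3
    + [{"Make": "Honda", "Model": "Civic"}]
    + [{"Make": "Toyota", "Model": "Camry"}] * 2
    + [{"Make": "Toyota", "Model": "Corolla"}] * 2
)

print(g3_error(cars, ("Make",), "Model"))   # 0.375, i.e. error 3/8
```

In the slide's terms, an AFD "with error 0.3" would have a `g3_error` of 0.3, i.e. 70% of the tuples satisfying the dependency.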

5
Applications of AFDs
  • Predicting missing values of attributes in
    relational tables (QPIAD), using values of
    attributes in the determining set of an AFD
  • Query optimization (CORDS, BHUNT): maintaining
    correct selectivity estimates
  • Database design (database normalization,
    efficient storage), similar to the way FDs are
    used
  • Query rewriting (AIMQ, QPIAD, QUIC). Example:
    Model ~> BodyStyle; rewrite a query on
    Model=RAV4 to retrieve tuples with BodyStyle=SUV
6
FD Mining and Implications
  • FD Mining aims at finding a minimal cover
  • Minimum set of FDs from which the entire set of
    FDs can be generated
  • Example: if A → B is an FD, then (A,C → B) is
    considered redundant
  • Can we substitute this by generating only minimal
    dependencies in the case of AFDs?
  • NO, because a non-minimal AFD (Z ~> B) may be
    interesting for the application, and we may
    prefer it to (A ~> B).
  • Non-minimal dependencies perform better in QPIAD,
    QUIC, etc.

Example: AFD (JobTitle, Experience) ~> Salary vs.
(JobTitle ~> Salary)
7
Performance Concerns
  • AFD Mining is costly
  • The pruning strategies of FDs are not applicable
    in the case of AFDs.
  • For datasets with a large number of attributes,
    the search space gets worse!
  • The method for determining whether a dependency
    holds or not is costly
  • The way to traverse the search space is tricky
  • Bottom-up vs. top-down?

8
Quality Concerns
  • Before algorithms for discovering AFDs can be
    developed, AFDs need better interestingness
    measures
  • AFDs used as feature selectors in classification
    are expected to give good accuracy.
  • AFDs used in query rewriting are expected to give
    a high throughput per query.
  • (VIN ~> Make) vs. (Model ~> Make)
  • (VIN ~> Make) looks good using the error metric
  • But, intuitively (as well as practically),
    (Model ~> Make) is the better AFD.

9
Challenges in AFD Mining
  • 1. Defining right interestingness measures
  • 2. Performing an efficient traversal in the
    search space of possible rules
  • 3. Employing effective pruning strategies

10
Agenda/Outline
  • Introduction
  • Related Work
  • Provide new perspective for AFDs
  • Roll-ups/condensed representations to association
    rules
  • Define measures for AFDs
  • Present the AFDMiner algorithm
  • Experimental Results
  • Performance
  • Quality

12
Related Work
  • FD Mining algorithms
  • Aim at finding a minimal cover
  • DepMiner, FUN, TANE, FD_Mine

Do not work well for AFDs
  • Metrics do not seem to matter in practice
  • No accompanying algorithm to mine AFDs
  • Existing approximation measures for AFDs
  • Tau, InD metrics

Grouping association rules: clustering association
rules (v1 → u and v2 → u combined as v1 ∨ v2 → u);
no prior work rolls them up into AFDs
13
Existing AFD Miners
  • Restricted to singleton determining sets
  • Work from a sample
  • Measures used are not appropriate
  • CORDS
  • SoftFDs (C1 ~> C2)
  • Uses |C1| / |C1,C2| as the approximation measure
  • AIMQ/QPIAD/QUIC
  • Post-processing over TANE
  • Highly inefficient
  • Quality of some AFDs is bad

14
Agenda/Outline
  • Introduction
  • Related Work
  • Provide new perspective for AFDs
  • Roll-ups/condensed representations to association
    rules
  • Define measures for AFDs
  • Present the AFDMiner algorithm
  • Experimental Results
  • Performance
  • Quality

15
Condensing Association Rules
  • Viewing database relations as transactions
  • Itemsets: attribute-value pairs
  • Association rules
  • Between itemsets
  • Beer → Diapers
  • Here, they are between attribute-value pairs
  • AFDs are rules between attributes
  • Corresponding to a lot of association rules
    sharing the same attributes
  • Example

Example association rule: (Toyota, Camry) → Sedan
16
Rolling up association rules as AFDs
Make ~> Model
Honda → Accord
Toyota → Camry
Tata → Maruti800

17
Confidence
  • Consider an association rule of the form (α → β)
  • Confidence denotes the conditional probability of
    β (head) given α (body).
  • Similarly, for an AFD (X ~> A),
  • Confidence should denote the chance of finding
    the values of A, given values of X
  • Define AFD Confidence in terms of the confidence
    of association rules

Specifically, by picking the best association rule
for every distinct value-combination of the body
of the association rule.
18
Confidence
  • For the example carDB,
  • Confidence = Support(Make=Honda → Model=Accord)
    + Support(Make=Toyota → Model=Camry)
    = 3/8 + 2/8 = 5/8
  • Interestingly, this is equal to (1 - g3)
  • g3 has a natural interpretation as the fraction
    of tuples with exceptions affecting the
    dependency.
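The per-value "best association rule" computation can be written down directly: for every distinct body value, take the support of its single best rule and sum. A sketch, using the same hypothetical carDB (reconstructed to match the 3/8 + 2/8 = 5/8 arithmetic on this slide):

```python
from collections import Counter

def afd_confidence(rows, lhs, rhs):
    """Confidence(lhs ~> rhs): for every distinct value-combination
    of the body, take the support of the single best association
    rule with that body, then sum those supports."""
    pair_support = Counter((tuple(r[a] for a in lhs), r[rhs]) for r in rows)
    best = {}
    for (body, _head), count in pair_support.items():
        best[body] = max(best.get(body, 0), count)
    return sum(best.values()) / len(rows)

cars = (
    [{"Make": "Honda", "Model": "Accord"}] * 3    # support 3/8, best for Honda
    + [{"Make": "Honda", "Model": "Civic"}]
    + [{"Make": "Toyota", "Model": "Camry"}] * 2  # support 2/8, ties for best
    + [{"Make": "Toyota", "Model": "Corolla"}] * 2
)

print(afd_confidence(cars, ("Make",), "Model"))   # 0.625 = 5/8 = 1 - g3
```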

19
Specificity
  • For an association rule (α → β),
  • Support is the probability with which the
    conditioning event (i.e., α) occurs
  • A rule with high confidence yet low support is a
    bad rule!
  • The presence of a lot of association rules with
    low supports makes the AFD bad.
  • In classification, this affects prediction
    accuracy.
  • For query rewriting tasks, per-query throughput
    is low.

20
Types of AFDs
  • 1. Model ~> Make
  • Few branches - uniform distribution
  • Good, and might hold universally
  • 2. VIN ~> Make
  • Many branches - uniform distribution
  • Bad - the confidence of each association rule is
    high, but the supports are bad
  • 3. (Model, Location) ~> Price
  • Many branches - skewed distribution
  • Few association rules with high support and many
    with low support

21
Specificity
Normalized by the worst-case Specificity, i.e.,
when X is a key
  • The Specificity measure captures our intuition of
    the different types of AFDs.
  • It is based on information entropy
  • The higher the Specificity (above a threshold),
    the worse the AFD!
  • Shares similar motivations with the way SplitInfo
    is defined in decision trees while computing
    Information Gain Ratio
  • Follows monotonicity: Specificity never decreases
    as attributes are added to X
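The normalization described on this slide can be sketched as the entropy of the value distribution of X divided by the worst-case entropy log N, which is reached exactly when X is a key. This formulation is an assumption consistent with "Specificity(X) = 1 means X is a key"; the thesis's exact definition may differ in details:

```python
import math
from collections import Counter

def specificity(rows, attrs):
    """Entropy of the value distribution of `attrs`, normalized by
    the worst case log(N) -- every tuple distinct, i.e. attrs is a
    key. Assumes len(rows) > 1."""
    n = len(rows)
    counts = Counter(tuple(r[a] for a in attrs) for r in rows)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(n)

rows = [{"id": i, "parity": i % 2} for i in range(8)]
print(round(specificity(rows, ("id",)), 6))       # 1.0: "id" is a key
print(round(specificity(rows, ("parity",)), 6))   # 0.333333: two uniform branches
```

The "few branches, uniform distribution" case (Model ~> Make) scores low; a key such as VIN scores the maximum 1.0, which is exactly why VIN ~> Make is pruned despite its perfect confidence.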

22
Agenda/Outline
  • Introduction
  • Related Work
  • Provide new perspective for AFDs
  • Roll-ups/condensed representations to association
    rules
  • Define measures for AFDs
  • Present the AFDMiner algorithm
  • Experimental Results
  • Performance
  • Quality

23
AFD Mining Problem
  • Good AFDs are the ones within the desired
    thresholds of the Confidence and Specificity
    measures.
  • Formally, the AFD mining problem can be stated as
    follows: given a relational table and thresholds
    minConf and maxSpecificity, find all AFDs
    (X ~> A) with Confidence > minConf and
    Specificity(X) < maxSpecificity.

24
AFD Mining
  • The problem of AFD Mining is to learn all AFDs
    that hold over a given relational table
  • Two costs
  • 1. The major cost is the combinatorial cost of
    traversing the search space
  • 2. The cost of visiting the data to validate each
    rule
  • (To compute the interestingness measures)
  • The search process for AFDs is exponential in the
    number of attributes

25
Pruning Strategies
  • 1. Pruning by Specificity
  • Specificity(Y) ≥ Specificity(X), where Y is a
    superset of X
  • If Specificity(X) > maxSpecificity, we can prune
    all AFDs with X and its supersets as the
    determining set
  • 2. Pruning by FDs
  • If (X → A) is an FD, all AFDs of the form
    (Y ~> A), with Y a superset of X, can be pruned
  • 3. Pruning keys
  • Needed for FDs
  • But this is subsumed by case 1 in AFDMiner
  • Because if Specificity(X) = 1, it means X is a key

26
AFDMiner algorithm
  • The search starts from singleton sets of
    attributes and works its way to larger attribute
    sets through the set-containment lattice, level
    by level.
  • When the algorithm is processing a set X, it
    tests AFDs of the form ((X \ A) ~> A), where
    A ∈ X.
  • Information from previous levels is captured by
    maintaining RHS candidate sets for each set.
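The level-by-level sweep can be sketched end to end. The helpers below are simplified stand-ins for the thesis's partition-based computations, the candidate handling is naive (every set is recomputed instead of being propagated through RHS candidate sets), and the carDB rows are hypothetical:

```python
import math
from collections import Counter
from itertools import combinations

def _confidence(rows, lhs, rhs):
    # Best association rule per distinct lhs value, supports summed.
    pairs = Counter((tuple(r[a] for a in lhs), r[rhs]) for r in rows)
    best = {}
    for (body, _head), n in pairs.items():
        best[body] = max(best.get(body, 0), n)
    return sum(best.values()) / len(rows)

def _specificity(rows, attrs):
    # Normalized entropy: 1.0 exactly when attrs is a key.
    n = len(rows)
    counts = Counter(tuple(r[a] for a in attrs) for r in rows)
    return -sum(c / n * math.log(c / n) for c in counts.values()) / math.log(n)

def afdminer(rows, attrs, min_conf, max_spec, max_len):
    """Bottom-up, breadth-first sweep of the set-containment lattice."""
    afds = []
    for level in range(1, max_len + 1):
        for x in combinations(attrs, level):
            # Specificity is monotone in X, so a set over the threshold
            # is skipped, and so (implicitly) are all its supersets.
            if _specificity(rows, x) > max_spec:
                continue
            for a in attrs:
                if a in x:
                    continue
                conf = _confidence(rows, x, a)
                if conf >= min_conf:
                    afds.append((x, a, conf))  # conf == 1.0 means an FD
    return afds

cars = (
    [{"Make": "Honda", "Model": "Accord"}] * 3
    + [{"Make": "Honda", "Model": "Civic"}]
    + [{"Make": "Toyota", "Model": "Camry"}] * 2
    + [{"Make": "Toyota", "Model": "Corolla"}] * 2
)

# Finds Model ~> Make (confidence 1.0) but not Make ~> Model (5/8 < 0.9)
print(afdminer(cars, ("Make", "Model"), min_conf=0.9, max_spec=0.7, max_len=1))
```

The real algorithm additionally removes A from the RHS candidate sets of supersets once (X \ A) → A turns out to be an exact FD, which this sketch omits.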

27
Traversal in the Search Space
  • During the bottom-up breadth-first search, the
    stopping criteria at a node are:
  • The AFD confidence becomes 1, and thus it is an
    FD.
  • The Specificity value of X is greater than the
    given maximum.

FD-based pruning
Specificity-based pruning
Example: A → C is an FD; then C is removed from
RHS(ABC)
28
Computing Confidence and Specificity
  • The methods are based on representing attribute
    sets by equivalence-class partitions of the set
    of tuples
  • π_X is the collection of equivalence classes of
    tuples for attribute set X
  • Example: π_Make, π_Model, and π_{Make ∪ Model}
    partition the tuple ids 1-8 of the carDB into
    equivalence classes
  • A functional dependency X → A holds if and only
    if π_X = π_{X ∪ A}
  • For the AFD (X ~> A),
  • Confidence = 1 - g3(X ~> A)

In this example, Confidence(Model ~> Make) = 1 and
Confidence(Make ~> Model) = 5/8
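A sketch of the partition-based checks, over the same hypothetical carDB (the subset matching here is quadratic, whereas the thesis's stripped-partition machinery is far more efficient):

```python
from collections import defaultdict

def partition(rows, attrs):
    """pi_X: group tuple ids into equivalence classes by their X values."""
    classes = defaultdict(set)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].add(i)
    return list(classes.values())

def fd_holds(rows, lhs, rhs):
    """X -> A holds iff pi_X equals pi_{X u A}; since the latter always
    refines the former, comparing class counts suffices."""
    return len(partition(rows, lhs)) == len(partition(rows, lhs + rhs))

def confidence(rows, lhs, rhs):
    """1 - g3: within each class of pi_X, keep only the largest
    refining class of pi_{X u A}; everything else is an exception."""
    coarse = partition(rows, lhs)
    fine = partition(rows, lhs + rhs)
    kept = sum(max(len(f) for f in fine if f <= c) for c in coarse)
    return kept / len(rows)

cars = (
    [{"Make": "Honda", "Model": "Accord"}] * 3
    + [{"Make": "Honda", "Model": "Civic"}]
    + [{"Make": "Toyota", "Model": "Camry"}] * 2
    + [{"Make": "Toyota", "Model": "Corolla"}] * 2
)

print(fd_holds(cars, ("Model",), ("Make",)))    # True: Model -> Make is an FD
print(confidence(cars, ("Make",), ("Model",)))  # 0.625 = 5/8
```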
29
Algorithms
  • Algorithm AFDMiner

30
Agenda/Outline
  • Introduction
  • Related Work
  • Provide new perspective for AFDs
  • Roll-ups/condensed representations to association
    rules
  • Define measures for AFDs
  • Present the AFDMiner algorithm
  • Experimental Results
  • Performance
  • Quality

31
Empirical Evaluation
  • Experimental setup
  • Data sets
  • CensusDB (199,523 tuples, 30 attributes)
  • MushroomDB (8,124 tuples, 23 attributes)
  • Parameters for AFDMiner
  • minConf
  • maxSpecificity
  • No. of tuples
  • No. of attributes
  • MaxLength of the determining set
  • The aim of the experiments is to show that the
    dual-measure approach (AFDMiner, using both
    Confidence and Specificity) outperforms the
    single-measure approach (No_Specificity, which
    uses Confidence alone)

No_Specificity: a modified version of AFDMiner
which uses only Confidence, not Specificity, for
AFDs. Thus, it generates all AFDs (X ~> A) with
Confidence(X ~> A) > minConf
32
Evaluating Quality
  • BestAFD
  • The highest-confidence AFD among all the AFDs
    with attribute A as their dependent attribute
  • Classification task
  • A classifier is run with the determining set of
    BestAFD as features
  • Used 10-fold cross-validation and computed the
    average classification accuracy
  • Weka toolkit
  • Evaluated over CensusDB

33
Evaluating Quality
No_Specificity
CensusDB
  • Average classification accuracy for all
    attributes
  • minConf = 0.8, maxSpecificity = 0.4

Choosing minConf!
Shows that Specificity is effective in generating
better-quality AFDs.
34
Choosing maxSpecificity
MaxSpecificity
CensusDB
  • Classification accuracy (varying maxSpecificity)
  • Threshold too low → good rules are pruned
  • Threshold too high → bad rules are not pruned
  • Classification accuracy approximately forms a
    double-elbow-shaped curve.

35
Choosing maxSpecificity
Best Value
MaxSpecificity
  • Time to compute AFDs
  • Increases with increasing maxSpecificity
  • The rate of change varies
  • A good threshold value for Specificity (i.e.,
    maxSpecificity) is the value at the first elbow
    in the quality graph

36
Query Throughput
No_Specificity
No. of tuples returned for top-10 queries on each
distinct determining set (denotes query
throughput)
37
Discussion on TANE
  • Primarily designed to generate FDs
  • Modified version for generating Approximate
    Dependencies
  • Uses the error metric g3 for AFDs
  • Bottom-up search in the lattice
  • Generates only minimal dependencies
  • Pruning applicable to FDs

38
Comparison (AFDMiner Vs TANE)
  • TANENOMINP is a modified version of TANE that
    does not stop with just minimal dependencies.
  • minConf is 0.8 (thus, we set g3 to 0.2)

AFDMiner outperforms both approaches, thus
strengthening the argument that AFDs with high
Confidence and reasonable Specificity are the
best
39
Evaluating Performance
CensusDB
  • Time varies linearly with the number of tuples.
  • AFDMiner takes less time than No_Specificity.
  • Time varies exponentially with the number of
    attributes.
  • AFDMiner completes much faster than
    No_Specificity

40
Evaluating Performance
CensusDB
MushroomDB
These experiments show that AFDMiner is fast
41
Conclusion
  • Introduced a novel perspective for AFDs
  • Condensed roll-ups of association rules.
  • Two metrics for AFDs
  • Confidence
  • Specificity
  • Algorithm AFDMiner
  • Finds all AFDs with Confidence > minConf and
    Specificity < maxSpecificity
  • Bottom-up, breadth-first search in the
    set-containment lattice of attributes
  • Pruning based on Specificity
  • Experiments: AFDMiner generates high-quality
    AFDs faster.
  • AFDs with high Confidence and reasonable
    Specificity

A version of this thesis is currently under
review at ICDE '09
42
Future Direction
  • Conditional Functional Dependencies (CFDs)
  • Dependencies of the form (ZipCode → City if
    Country = England).
  • i.e., holding true only for certain values of one
    or more other attributes.
  • CAFDs are the probabilistic counterpart of CFDs
  • CFDs and CAFDs have recently been applied in data
    cleaning and value prediction, but mining these
    conditional rules is unexplored.

Intuitively, CFDs are intermediate rules between
association rules (value level) and FDs (attribute
level). So, we believe that our approach can help
in generating them!
43
  • Questions ?