Selecting the Right Interestingness Measure for Association Patterns

Pang-Ning Tan, Vipin Kumar, and Jaideep Srivastava, Department of Computer Science and Engineering, University of Minnesota. Presented by Ahmet Bulut.

Slides: 21
Provided by: Ahme74
Learn more at: https://web.ece.ucsb.edu
Transcript and Presenter's Notes

1
Selecting the Right Interestingness Measure for
Association Patterns
  • Pang-Ning Tan, Vipin Kumar, and Jaideep
    Srivastava
  • Department of Computer Science and Engineering
  • University of Minnesota
  • Presented by Ahmet Bulut

2
Motivation
  • A major data mining problem: analysis of
    relationships among variables
  • Finding sets of binary variables that co-occur
  • Association rule mining (Agrawal et al.)
  • How can we find a suitable metric to capture the
    dependencies between variables (defined in terms
    of contingency tables)?
  • Many metrics provide conflicting information
  • Goal: an automated way of choosing the best metric
    for an application domain

3
(No Transcript)
4
Justification for Conflicts
  • E10 is ranked highest by the I measure but lowest
    according to the φ-coefficient.
  • To explain such conflicts, recognize the intrinsic
    properties of the existing measures

5
Analysis of a Measure
  • The relationship between the gender of a student
    and the grade obtained in a course
  • Scale the number of male students by 2 and the
    number of female students by 10
  • One expects scale invariance in this particular
    application
  • Most measures are sensitive to scaling of rows
    and columns, such as the Gini index, interest,
    mutual information, etc.
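
The scaling argument above can be checked on a small worked example. The sketch below is illustrative only: the table counts are made up, and the layout (rows = grade, columns = gender, cells passed as f11, f10, f01, f00) is an assumption, not taken from the slides.

```python
import math

def interest(f11, f10, f01, f00):
    """Interest factor: N * f11 / (row-1 total * column-1 total)."""
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))

def odds_ratio(f11, f10, f01, f00):
    """Odds ratio: (f11 * f00) / (f10 * f01)."""
    return (f11 * f00) / (f10 * f01)

# Hypothetical counts. Column 1 (f11, f01) = male, column 2 (f10, f00) = female.
original = (30, 20, 40, 10)

# Twice as many male students, ten times as many female students.
scaled = (30 * 2, 20 * 10, 40 * 2, 10 * 10)

print("interest:  ", interest(*original), "->", interest(*scaled))
print("odds ratio:", odds_ratio(*original), "->", odds_ratio(*scaled))
```

On these numbers the interest factor changes after scaling while the odds ratio does not, which is the sensitivity the slide describes.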

6
Solutions to zero-in
  • Support-based pruning: eliminate uncorrelated and
    poorly correlated patterns
  • Table standardization: modify contingency tables
    to have uniform margins
  • Under these conditions, many measures provide
    non-conflicting information
  • Expectations of domain experts: choose the measure
    that agrees with the expectations the most
  • The number of contingency tables, T, is high
  • It is possible to extract a small set, S, of
    contingency tables
  • Find the best measure for S to approximate the
    best measure for T
7
Preliminaries
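  • The similarity between any two measures M1 and
    M2 is defined as the similarity between O_M1(T)
    and O_M2(T), the rankings of the tables T under
    each measure
  • The similarity metric used is the correlation
    coefficient
  • If corr(O_M1(T), O_M2(T)) > threshold, the two
    measures are considered similar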
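
A minimal sketch of this comparison, assuming each measure is a function of the four cell counts (f11, f10, f01, f00) and using Pearson correlation on the rank vectors. The tables and the 0.85 threshold are illustrative assumptions, and the helper names are hypothetical.

```python
import math

def phi(f11, f10, f01, f00):
    """Phi-coefficient of a 2x2 contingency table."""
    denom = math.sqrt((f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00))
    return (f11 * f00 - f01 * f10) / denom

def odds_ratio(f11, f10, f01, f00):
    return (f11 * f00) / (f10 * f01)

def ranks(values):
    """Rank positions (1 = largest value); ties are broken by input order."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def similar(measure1, measure2, tables, threshold=0.85):
    """True if the two measures rank the tables in a correlated way."""
    r1 = ranks([measure1(*t) for t in tables])
    r2 = ranks([measure2(*t) for t in tables])
    return pearson(r1, r2) > threshold

# Hypothetical contingency tables (f11, f10, f01, f00).
tables = [(40, 10, 10, 40), (30, 20, 20, 30), (10, 40, 40, 10),
          (25, 25, 25, 25), (45, 5, 15, 35)]

print(similar(phi, odds_ratio, tables))
```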

8
Desired Properties of a Measure M
9
Properties of a Measure M contd.
  • Denote a 2x2 contingency table as a contingency
    matrix, M
  • An interestingness measure is a matrix operator, O,
    such that
  • O(M) = k, where k is a scalar.
  • For instance, with the φ-coefficient as the
    interestingness measure,
  • k equals a normalized form of the determinant
    operator
  • Det(M) = f11 f00 - f01 f10
  • Statistical independence corresponds to
  • a singular matrix M, whose determinant equals 0.
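
As a small illustration of the determinant view, the sketch below computes Det(M) and the φ-coefficient for two made-up tables. The layout M = [[f11, f10], [f01, f00]] is an assumption consistent with the determinant formula above.

```python
import math

def det(m):
    """Determinant of a 2x2 contingency matrix [[f11, f10], [f01, f00]]."""
    (f11, f10), (f01, f00) = m
    return f11 * f00 - f01 * f10

def phi(m):
    """Phi-coefficient: the determinant normalized by the four marginals."""
    (f11, f10), (f01, f00) = m
    denom = math.sqrt((f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00))
    return det(m) / denom

# Hypothetical tables. In the first, the rows are proportional, so the
# variables are statistically independent and the matrix is singular.
independent = [[20, 30], [20, 30]]
dependent = [[40, 10], [10, 40]]

print(det(independent), phi(independent))
print(det(dependent), phi(dependent))
```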

10
Properties of a Measure M contd.
  • Property 1: Symmetry under variable permutation,
    O(M^T) = O(M)
  • Examples: cosine (IS), interest factor (I), odds
    ratio (α)
  • Property 2: Row/column scaling invariance, with
    R = C = [k1 0; 0 k2]
  • R x M is row scaling and M x R is column scaling
  • If O(RM) = O(M) and O(MR) = O(M), then O is
    row/column scaling invariant
  • Odds ratio (α) satisfies this property, along
    with Yule's Q and Y
  • Property 3: Antisymmetry under row/column
    permutation, with
  • S = [0 1; 1 0]
  • If O(SM) = -O(M), antisymmetric under row
    permutation
  • If O(MS) = -O(M), antisymmetric under column
    permutation
  • Measures that are symmetric under the row and
    column permutation operations make no distinction
    between positive and negative correlations of a
    table
  • For example, the Gini index
  • Property 4: Inversion invariance, with
  • S = [0 1; 1 0]
  • Inversion is row and column permutation together
  • If O(SMS) = O(M), inversion invariant
  • Insight: flip presence with absence and vice
    versa for binary variables.
  • φ-coefficient, odds ratio, and collective strength
    are symmetric binary measures
  • The Jaccard measure is asymmetric
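
The four properties can be probed numerically by applying the operators to a sample table. This is a sketch under assumptions: the layout M = [[f11, f10], [f01, f00]], the sample counts, and the scaling factors in R are all made up. A check that passes on one table does not prove a property; a failed check does disprove it.

```python
import math

def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(m):
    return [[m[0][0], m[1][0]], [m[0][1], m[1][1]]]

def phi(m):
    (f11, f10), (f01, f00) = m
    denom = math.sqrt((f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00))
    return (f11 * f00 - f01 * f10) / denom

def odds_ratio(m):
    (f11, f10), (f01, f00) = m
    return (f11 * f00) / (f10 * f01)

def jaccard(m):
    (f11, f10), (f01, f00) = m
    return f11 / (f11 + f10 + f01)

S = [[0, 1], [1, 0]]      # permutation operator
R = [[2, 0], [0, 10]]     # scaling operator with assumed k1 = 2, k2 = 10
M = [[40, 10], [20, 30]]  # hypothetical contingency matrix

def same(x, y):
    return math.isclose(x, y, rel_tol=1e-9, abs_tol=1e-12)

def check_properties(O, m=M):
    """Evaluate Properties 1-4 for measure O on the single table m."""
    return {
        "P1": same(O(transpose(m)), O(m)),
        "P2": same(O(matmul(R, m)), O(m)) and same(O(matmul(m, R)), O(m)),
        "P3": same(O(matmul(S, m)), -O(m)) and same(O(matmul(m, S)), -O(m)),
        "P4": same(O(matmul(matmul(S, m), S)), O(m)),
    }

for name, O in [("phi", phi), ("odds ratio", odds_ratio), ("Jaccard", jaccard)]:
    print(name, check_properties(O))
```

On this table the raw odds ratio fails the Property 3 check because a permutation inverts it rather than negating it.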

11
Property 4 and Property 5
  • Market Basket analysis requires unequal treatment
    of binary values of a variable
  • A symmetric measure like the one above is not
    suitable
  • Property 5: Null invariance. O(M + C) = O(M),
    where C = [0 0; 0 k] and k is a positive constant
  • For binary variables: adding more records that
    contain neither of the two variables under
    consideration does not change the measure, so
    co-occurrence is emphasized
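
A short sketch of null addition, assuming the layout M = [[f11, f10], [f01, f00]] so that the extra records land in f00. The counts and the value of k are made up for illustration.

```python
import math

def add_nulls(m, k):
    """Return M + C with C = [[0, 0], [0, k]]: k extra records with neither item."""
    return [[m[0][0], m[0][1]], [m[1][0], m[1][1] + k]]

def jaccard(m):
    (f11, f10), (f01, f00) = m
    return f11 / (f11 + f10 + f01)

def cosine(m):
    """Cosine (IS) measure: f11 / sqrt(row-1 total * column-1 total)."""
    (f11, f10), (f01, f00) = m
    return f11 / math.sqrt((f11 + f10) * (f11 + f01))

def phi(m):
    (f11, f10), (f01, f00) = m
    denom = math.sqrt((f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00))
    return (f11 * f00 - f01 * f10) / denom

M = [[40, 10], [20, 30]]       # hypothetical table
M_plus = add_nulls(M, 1000)    # 1000 added records containing neither variable

for name, O in [("Jaccard", jaccard), ("cosine", cosine), ("phi", phi)]:
    print(name, O(M), "->", O(M_plus))
```

Jaccard and cosine ignore f00 entirely, so they are unchanged; the φ-coefficient moves, which is why a symmetric measure is a poor fit when absence carries little information.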

12
Effect of Support Based Pruning
  • Randomly generated synthetic dataset of 10,000
    contingency tables
  • Darker cells indicate correlation > 0.85; lighter
    cells indicate otherwise
  • With tighter bounds on the support of the patterns,
    many measures become correlated

13
Elimination of poorly correlated tables using
Support-based Pruning
  • A minimum support threshold prunes out the
    low-support patterns
  • A maximum support threshold eliminates
    uncorrelated, negatively correlated, and
    positively correlated tables in equal proportion
  • A lower bound on support prunes out the
    negatively correlated or uncorrelated tables.
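
Support-based pruning can be sketched as a simple filter. The assumptions here: the support of a pattern is taken to be f11 / N, tables are tuples (f11, f10, f01, f00), and the thresholds and counts are made up.

```python
def support(table):
    """Fraction of records in which both variables are present (f11 / N)."""
    f11, f10, f01, f00 = table
    return f11 / (f11 + f10 + f01 + f00)

def prune(tables, min_support=0.0, max_support=1.0):
    """Keep only tables whose support lies within [min_support, max_support]."""
    return [t for t in tables if min_support <= support(t) <= max_support]

# Hypothetical tables with supports 0.05, 0.30 and 0.80.
tables = [(5, 45, 45, 5), (30, 20, 20, 30), (80, 5, 5, 10)]

kept = prune(tables, min_support=0.1, max_support=0.5)
print(kept)
```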

14
Table standardization
  • A standardized table gives a visual depiction of
    the joint distribution of two variables after
    elimination of non-uniform marginals
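
One way to standardize a table is iterative proportional fitting (IPF): alternately rescale rows and columns until every marginal equals N/2. The sketch below is a minimal version under assumptions (strictly positive counts, a fixed iteration count, a made-up table); it is not claimed to be the authors' exact procedure.

```python
import math

def standardize(m, iterations=100):
    """Rescale a 2x2 table [[f11, f10], [f01, f00]] to uniform marginals N/2."""
    n = sum(sum(row) for row in m)
    target = n / 2
    t = [[float(v) for v in row] for row in m]
    for _ in range(iterations):
        # Scale each row so that it sums to the target.
        for i in range(2):
            row_sum = t[i][0] + t[i][1]
            t[i] = [v * target / row_sum for v in t[i]]
        # Scale each column so that it sums to the target.
        for j in range(2):
            col_sum = t[0][j] + t[1][j]
            for i in range(2):
                t[i][j] *= target / col_sum
    return t

def odds_ratio(m):
    (f11, f10), (f01, f00) = m
    return (f11 * f00) / (f10 * f01)

M = [[40, 10], [20, 30]]   # hypothetical table, N = 100
T = standardize(M)
print(T, odds_ratio(M), odds_ratio(T))
```

Row and column scaling leave the odds ratio unchanged, so the standardized table keeps the odds ratio of the original.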

15
Implications of standardization
  • The rankings from different measures become
    identical

16
Implications of standardization contd
  • After standardization, a matrix has the form
    [x y; y x], where
  • y = N/2 - x and x = f11
  • Measures that are monotonically increasing
    functions of x (nearly all of the measures are)
    give identical rankings on standardized,
    positively correlated tables
  • Some measures do not satisfy this property
  • Consider the values of x where N/4 < x < N/2
    (the positively correlated range)
  • IPF (iterative proportional fitting) favors the
    odds ratio measure; therefore the final rankings
    agree with the odds ratio rankings before
    standardization
  • Takeaway: different standardization techniques
    may be more appropriate for different application
    domains
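
The monotonicity claim can be checked on standardized tables of the form [x y; y x]. This sketch assumes N = 100 and a handful of sample x values in the range N/4 < x < N/2; the three measures are chosen only as examples.

```python
import math

def standardized(x, n):
    """Cells (f11, f10, f01, f00) of a standardized table with f11 = x."""
    y = n / 2 - x
    return (x, y, y, x)

def phi(f11, f10, f01, f00):
    denom = math.sqrt((f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00))
    return (f11 * f00 - f01 * f10) / denom

def odds_ratio(f11, f10, f01, f00):
    return (f11 * f00) / (f10 * f01)

def jaccard(f11, f10, f01, f00):
    return f11 / (f11 + f10 + f01)

def is_increasing(values):
    return all(a < b for a, b in zip(values, values[1:]))

n = 100
xs = [30, 35, 40, 45]   # sample values with N/4 < x < N/2

monotone = {}
for measure in (phi, odds_ratio, jaccard):
    values = [measure(*standardized(x, n)) for x in xs]
    monotone[measure.__name__] = is_increasing(values)

print(monotone)
```

Since each of these measures increases with x on the sampled range, they all order the sampled tables the same way.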

17
Measure Selection Based on Rankings by Experts
  • Ideally, experts rank all the contingency tables,
    choose the best measure accordingly
  • Laborious task if the number of tables is too
    large
  • Provide a smaller set of tables to decide the
    best measure

18
Table Selection via Disjoint Algorithm
  • Use Disjoint algorithm to choose a subset of
    tables of cardinality k.
  • Rank tables according to various measures
  • Compute the similarity between different measures
  • A good table selection scheme minimizes

19
Experimental Results
20
Conclusions
  • Key properties to consider for selecting the
    right measure
  • No measure is consistently better than others
  • There are situations where most measures provide
    correlated information
  • Choosing the right measure on an unbiased small
    subset of the tables gives a good estimate of the
    ideal solution
  • Future work:
  • Extension to k-way contingency tables
  • Association between mixed data types