Title: Associative Classification of Imbalanced Datasets
1 Associative Classification of Imbalanced Datasets
- Sanjay Chawla
- School of IT
- University of Sydney
2 Overview
- Data Mining Tasks
- Associative Classifiers
- Downside of Support and Confidence
- Mining Rules from Imbalanced Data Sets
- Fisher's Exact Test
- Class Correlation Ratio (CCR)
- Searching and Pruning Strategies
- Experiments
3 Data Mining
- Data Mining research has settled into an equilibrium involving four tasks:
  - Pattern Mining (Association Rules)
  - Classification
  - Clustering
  - Anomaly or Outlier Detection
(Diagram: the four tasks positioned between the DB and ML communities.)
4 Association Rule Mining
- In terms of impact, nothing rivals association rule mining within the data mining community:
  - SIGMOD 93 (4100 citations): Agrawal, Imielinski, Swami
  - VLDB 94 (4900 citations): Agrawal, Srikant
- For comparison:
  - C4.5, 93 (7000 citations): Ross Quinlan
  - Gibbs Sampling, 84 (IEEE PAMI, 5000 citations): Geman & Geman
  - Content Addressable Network (3000 citations): Ratnasamy, Francis, Handley, Karp
5 Association Rules (Agrawal, Imielinski and Swami, SIGMOD 93)
- An implication expression of the form X → Y, where X and Y are itemsets
- Example: {Milk, Diaper} → {Beer}
- Rule Evaluation Metrics (formulas below)
  - Support (s): the fraction of transactions that contain both X and Y
  - Confidence (c): measures how often items in Y appear in transactions that contain X
From Introduction to Data Mining, Tan, Steinbach and Kumar
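These definitions correspond to the standard formulas from Tan, Steinbach and Kumar, where σ denotes the support count and |T| the number of transactions:

    s(X → Y) = σ(X ∪ Y) / |T|
    c(X → Y) = σ(X ∪ Y) / σ(X)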
6 Mining Association Rules
- Two-step approach (see the sketch below)
  - Frequent Itemset Generation: generate all itemsets whose support ≥ minsup
  - Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
- Frequent itemset generation is computationally expensive
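A minimal brute-force sketch of the two steps in Python (illustrative only: the basket data is hypothetical, and a real miner would use Apriori-style pruning instead of enumerating every candidate itemset):

    from itertools import combinations

    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support(itemset, transactions):
        # Fraction of transactions containing every item in itemset
        return sum(itemset <= t for t in transactions) / len(transactions)

    def frequent_itemsets(transactions, minsup):
        # Step 1: all itemsets with support >= minsup (brute force)
        items = sorted(set().union(*transactions))
        return {
            frozenset(cand): support(set(cand), transactions)
            for k in range(1, len(items) + 1)
            for cand in combinations(items, k)
            if support(set(cand), transactions) >= minsup
        }

    def rules(frequent, transactions, minconf):
        # Step 2: binary partitions of each frequent itemset with high confidence
        out = []
        for itemset, s in frequent.items():
            for k in range(1, len(itemset)):
                for lhs in combinations(itemset, k):
                    conf = s / support(set(lhs), transactions)
                    if conf >= minconf:
                        out.append((set(lhs), set(itemset) - set(lhs), s, conf))
        return out

    for lhs, rhs, s, c in rules(frequent_itemsets(transactions, 0.4), transactions, 0.8):
        print(sorted(lhs), "->", sorted(rhs), f"(s={s:.2f}, c={c:.2f})")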
7 Overview
- Data Mining Tasks
- Associative Classifiers
- Downside of Support and Confidence
- Mining Rules from Imbalanced Data Sets
- Fisher's Exact Test
- Class Correlation Ratio (CCR)
- Searching and Pruning Strategies
- Experiments
8 Associative Classifiers
- Most associative classifiers are based on rules discovered using the support-confidence criterion.
- The classifier itself is a collection of rules ranked by their support or confidence.
9 Associative Classifiers (2)

TID  Items                       Gender
1    Bread, Milk                 F
2    Bread, Diaper, Beer, Eggs   M
3    Milk, Diaper, Beer, Coke    M
4    Bread, Milk, Diaper, Beer   M
5    Bread, Milk, Diaper, Coke   F

In a classification task we want to predict the class label (Gender) using the other attributes.
A good (albeit stereotypical) rule is {Beer, Diaper} → Male, whose support is 60% and confidence is 100% (checked in the sketch below).
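A quick check of this rule's metrics on the five transactions above (a minimal sketch; the class label is treated as just another attribute of each transaction):

    transactions = [
        ({"Bread", "Milk"}, "F"),
        ({"Bread", "Diaper", "Beer", "Eggs"}, "M"),
        ({"Milk", "Diaper", "Beer", "Coke"}, "M"),
        ({"Bread", "Milk", "Diaper", "Beer"}, "M"),
        ({"Bread", "Milk", "Diaper", "Coke"}, "F"),
    ]

    antecedent = {"Beer", "Diaper"}
    # Class labels of the transactions that contain the antecedent
    matches = [label for items, label in transactions if antecedent <= items]
    support = matches.count("M") / len(transactions)   # 3/5 = 0.6 (60%)
    confidence = matches.count("M") / len(matches)     # 3/3 = 1.0 (100%)
    print(support, confidence)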
10 Overview
- Data Mining Tasks
- Associative Classifiers
- Downside of Support and Confidence
- Mining Rules from Imbalanced Data Sets
- Fisher's Exact Test
- Class Correlation Ratio (CCR)
- Searching and Pruning Strategies
- Experiments
11 Imbalanced Data Set
- In some application domains, data sets are imbalanced:
  - the proportion of samples from one class is much smaller than that of the other class/classes,
  - and the smaller class is the class of interest.
- Support and confidence are biased toward the majority class, and do not perform well in such cases.
12 Downsides of Support
- Support is biased towards the majority class.
- E.g., with classes {yes, no} and sup(yes) = 90%, minSup > 10% wipes out any rule predicting no.
- Suppose X → no has confidence 1 and support 3%. The rule is discarded if minSup > 3%, even though it perfectly predicts 30% of the instances in the minority class!
13 Downside of Confidence (1)

         C    ¬C     Σ
A       20     5    25
¬A      70     5    75
Σ       90    10   100

Conf(A → C) = 20/25 = 0.8; Support(A → C) = 20/100 = 0.2
Correlation between A and C: corr(A → C) = P(A ∧ C) / (P(A) · P(C)) = 0.2 / (0.25 × 0.9) ≈ 0.89 < 1
Thus, when the data set is imbalanced, a high-support, high-confidence rule does not necessarily imply that the antecedent and the consequent are positively correlated.
14 Downside of Confidence (2)
- It is reasonable to expect that for good rules the antecedent and consequent are not independent!
- Suppose
  - P(Class = Yes) = 0.9
  - P(Class = Yes | X) = 0.9
- Then X → Yes has confidence 0.9, yet X is independent of the class: the rule carries no predictive information.
15 Downsides of Confidence (3)
- Another useful observation:
  - Higher confidence (support) for a rule in the minority class implies higher correlation, and lower correlation in the minority class implies lower confidence; neither of these holds for the majority class.
  - Confidence (support) thus tends to be biased toward the majority class.
16 Overview
- Data Mining Tasks
- Associative Classifiers
- Downside of Support and Confidence
- Mining Rules from Imbalanced Data Sets
- Fisher's Exact Test
- Class Correlation Ratio (CCR)
- Searching and Pruning Strategies
- Experiments
17 Contingency Table
- A 2 × 2 contingency table for X → y, where a is the number of transactions containing both X and y, and so on:

         y    ¬y
X        a     b
¬X       c     d

- We will use the notation [a, b; c, d] to represent this table.
18 Fisher Exact Test
- Given a table [a, b; c, d], the Fisher Exact Test finds the probability (p-value) of obtaining a table at least as extreme as the given one, under the hypothesis that {X, ¬X} and {y, ¬y} are independent.
- The margin sums (Σ rows, Σ cols) are fixed (see the sketch below).
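A sketch of the test using SciPy's standard implementation; alternative="greater" gives the one-sided p-value for positive association. Run on slide 13's table, the p-value comes out large, confirming that the 0.8-confidence rule A → C is not significantly positively associated:

    from scipy.stats import fisher_exact

    # Contingency table [a, b; c, d] for A -> C (slide 13):
    # rows: A present / A absent; columns: class C / class not-C
    table = [[20, 5],
             [70, 5]]

    # One-sided test: probability of a table at least this positively
    # associated, with margins fixed, under independence
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    print(p_value)  # large here, so A -> C fails a 0.01 significance test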
19 Fisher Exact Test (2)
- We only use rules whose p-values are below the desired significance level (e.g., 0.01).
- Rules that pass this test are statistically significant in the positively associated direction (e.g., X → y).
20 Overview
- Data Mining Tasks
- Associative Classifiers
- Downside of Support and Confidence
- Mining Rules from Imbalanced Data Sets
- Fisher's Exact Test
- Class Correlation Ratio (CCR)
- Searching and Pruning Strategies
- Experiments
21 Class Correlation Ratio
- In class correlation, we are interested in rules X → y where X is more positively correlated with y than it is with ¬y.
- The correlation is defined by
  corr(X → y) = P(X ∧ y) / (P(X) · P(y)) = (a · n) / ((a + b)(a + c))
  where |T| = n is the number of transactions.
22 Class Correlation Ratio (2)
- We then use corr() to measure how correlated X is with y compared to ¬y.
- X and y are positively correlated if corr(X → y) > 1, and negatively correlated if corr(X → y) < 1.
23 Class Correlation Ratio (3)
- Based on the correlation corr(), we define the Class Correlation Ratio (CCR):
  CCR(X → y) = corr(X → y) / corr(X → ¬y) = (a · (b + d)) / (b · (a + c))
- The CCR measures how much more positively the antecedent is correlated with the class it predicts (e.g., y) relative to the alternative class (e.g., ¬y). Both quantities are computed in the sketch below.
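A minimal sketch computing corr and CCR directly from the [a, b; c, d] counts, using the formulas as written above:

    def corr(a, b, c, d):
        # corr(X -> y) = P(X and y) / (P(X) * P(y))
        n = a + b + c + d
        return (a * n) / ((a + b) * (a + c))

    def ccr(a, b, c, d):
        # CCR(X -> y) = corr(X -> y) / corr(X -> not-y)
        return (a * (b + d)) / (b * (a + c))

    # Slide 13's table: conf(A -> C) = 0.8, yet both measures are below 1
    print(corr(20, 5, 70, 5))  # ~0.889: A and C are negatively correlated
    print(ccr(20, 5, 70, 5))   # ~0.444: A is more correlated with not-C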
24 Class Correlation Ratio (4)
- We only use rules whose CCR exceeds a desired threshold, so that we never use a rule that is more positively associated with the class it does not predict.
25 The Two Measurements
- We perform the following tests to determine whether a potentially interesting rule is indeed interesting:
  - Check the significance of the rule X → y by performing Fisher's Exact Test.
  - Check whether CCR(X → y) > 1.
- Rules that pass both tests are candidates for the classification task (combined in the sketch below).
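Putting the two tests together (a sketch reusing SciPy's fisher_exact and the CCR formula above; the parameter names are illustrative):

    from scipy.stats import fisher_exact

    def is_candidate(a, b, c, d, alpha=0.01):
        # Keep X -> y only if it is significantly positively associated
        # (Fisher's Exact Test) and CCR(X -> y) > 1
        _, p = fisher_exact([[a, b], [c, d]], alternative="greater")
        ccr = (a * (b + d)) / (b * (a + c))
        return p < alpha and ccr > 1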
26 Overview
- Data Mining Tasks
- Associative Classifiers
- Downside of Support and Confidence
- Mining Rules from Imbalanced Data Sets
- Fisher's Exact Test
- Class Correlation Ratio (CCR)
- Searching and Pruning Strategies
- Experiments
27 Search and Pruning Strategies
- To avoid examining the whole set of possible rules, we use search strategies that make the notion of being potentially interesting anti-monotonic:
  - X → y is considered potentially interesting if and only if all of its generalizations X − {z} → y, z ∈ X, have been found to be potentially interesting.
28 Search and Pruning Strategies (2)
- Under the Aggressive search strategy, a contingency table [a, b; c, d] is used to test the significance of the rule X → y in comparison to one of its generalizations X − z → y.
29 Example
- Suppose we have already determined that the rules (A1 = a1) → 1 and (A2 = a2) → 1 are significant.
- Now we want to test whether X = (A1 = a1) ∧ (A2 = a2) → 1 is significant.
- We carry out a FET and calculate the CCR comparing X with X − {(A1 = a1)} = (A2 = a2) (i.e., z = (A1 = a1)), and X with X − {(A2 = a2)} = (A1 = a1) (i.e., z = (A2 = a2)).
- If the minimum of the p-values is less than the significance level, and the CCR is greater than 1, we keep the rule X → 1; otherwise we discard it (see the sketch below).
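The decision at the end of this example as a tiny sketch (names are illustrative; the p-values are assumed to come from the FETs against each generalization, per the Aggressive strategy):

    def keep_rule(p_values_vs_generalizations, ccr_value, alpha=0.01):
        # Keep X -> y if it is significant against at least one
        # generalization (minimum p-value) and its CCR exceeds 1
        return min(p_values_vs_generalizations) < alpha and ccr_value > 1

    print(keep_rule([0.003, 0.2], ccr_value=1.7))  # True: keep the rule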
30 Ranking Rules
- Strength Score (SS)
- To determine how interesting a rule is, we need a ranking (ordering) of the rules; the ordering is defined by the Strength Score.
31 Overview
- Data Mining Tasks
- Associative Classifiers
- Downside of Support and Confidence
- Mining Rules from Imbalanced Data Sets
- Fisher's Exact Test
- Class Correlation Ratio (CCR)
- Searching and Pruning Strategies
- Experiments
32 Experiments (Balanced Data)
- The preceding approach is referred to as SPARCCC.
- The experiments on balanced data sets show that the average accuracy of SPARCCC compares favourably to CBA and C4.5.
- The table below shows the prediction accuracy on the balanced data sets.
33 Experiments (Imbalanced Data)
- The True Positive Rate (recall/sensitivity) is a better performance measure for imbalanced data sets.
- SPARCCC outperforms other rule-based techniques such as CBA and CCCS.
- The table below shows the True Positive Rate on the minority class for the imbalanced versions of the datasets.
34 References
- Florian Verhein, Sanjay Chawla. Using Significant, Positively Associated and Relatively Class Correlated Rules for Associative Classification of Imbalanced Datasets. In Proceedings of the 2007 IEEE International Conference on Data Mining (ICDM 2007), Omaha, NE, USA, October 28-31, 2007.