Title: Imbalanced Data Set Learning with Synthetic Examples
1. Imbalanced Data Set Learning with Synthetic Examples
Benjamin X. Wang and Nathalie Japkowicz
2. The Class Imbalance Problem I
- Data sets are said to be balanced if there are approximately as many positive examples of the concept as there are negative ones.
- There exist many domains that do not have a balanced data set.
- Examples:
  - Helicopter Gearbox Fault Monitoring
  - Discrimination between Earthquakes and Nuclear Explosions
  - Document Filtering
  - Detection of Oil Spills
  - Detection of Fraudulent Telephone Calls
3. The Class Imbalance Problem II
- The problem with class imbalances is that standard learners are often biased towards the majority class.
- That is because these classifiers attempt to reduce global quantities such as the error rate, without taking the data distribution into consideration.
- As a result, examples from the majority class are well classified, whereas examples from the minority class tend to be misclassified.
4. Some Generalities
- Evaluating the performance of a learning system on a class imbalance problem is not done appropriately with the standard accuracy/error-rate measures; ROC analysis is typically used instead.
- There is a parallel between research on class imbalances and cost-sensitive learning.
- There are four main ways to deal with class imbalances: re-sampling, re-weighting, adjusting the probabilistic estimate, and one-class learning.
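As a toy illustration of why accuracy misleads on imbalanced data (our own example, computed with scikit-learn; none of the numbers come from the slides):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy scores: 8 majority (label 0) and 2 minority (label 1) examples.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.10, 0.20, 0.15, 0.30, 0.25,
                    0.40, 0.35, 0.60, 0.70, 0.55])

# A trivial "always majority" classifier scores 80% accuracy here while
# never detecting a single minority example; ROC analysis exposes this
# by looking at true/false positive rates across all thresholds.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", roc_auc_score(y_true, y_score))
```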
5. Advantages of Re-sampling
- Re-sampling provides a simple way of biasing the generalization process.
- It can do so by:
  - Generating synthetic samples that are biased accordingly
  - Controlling the amount and placement of the new samples
- Note: this type of control can also be achieved by smoothing the classifier's probabilistic estimate (e.g., Zadrozny & Elkan, 2001), but that type of control cannot be as localized as the one achieved with re-sampling techniques.
6. SMOTE: A State-of-the-Art Re-sampling Approach
- SMOTE stands for Synthetic Minority Oversampling Technique.
- It is a technique designed by Chawla, Bowyer, Hall, and Kegelmeyer in 2002.
- It combines informed oversampling of the minority class with random undersampling of the majority class.
- SMOTE currently yields the best results as far as re-sampling and probabilistic-estimate-modifying techniques go (Chawla, 2003).
7. SMOTE's Informed Oversampling Procedure I
- For each minority sample:
  - Find its k nearest minority neighbours
  - Randomly select j of these neighbours
  - Randomly generate synthetic samples along the lines joining the minority sample and its j selected neighbours (j depends on the amount of oversampling desired)
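A minimal NumPy sketch of this generation step, assuming Euclidean distance and brute-force neighbour search; the function and parameter names (`smote_like_oversample`, `n_synthetic`) are ours, not from the slides or the original SMOTE code:

```python
import numpy as np

def smote_like_oversample(minority, k=5, n_synthetic=100, seed=None):
    """Generate synthetic minority samples along lines joining each
    minority sample to a randomly chosen minority nearest neighbour.
    `minority` is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(minority))
        x = minority[i]
        # k nearest minority neighbours, brute force, excluding x itself
        dists = np.linalg.norm(minority - x, axis=1)
        neighbours = minority[np.argsort(dists)[1:k + 1]]
        nn = neighbours[rng.integers(len(neighbours))]
        gap = rng.random()                 # random point on the segment
        synthetic.append(x + gap * (nn - x))
    return np.array(synthetic)
```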
8. SMOTE's Informed vs. Random Oversampling
- Random oversampling (with replacement) of the minority class has the effect of making the decision region for the minority class very specific.
- In a decision tree, it would cause new splits and often lead to overfitting.
- SMOTE's informed oversampling generalizes the decision region for the minority class.
- As a result, larger and less specific regions are learned, thus paying attention to minority-class samples without causing overfitting.
9. SMOTE's Informed Oversampling Procedure II
[Figure: a minority sample, its synthetic samples, and a nearby majority sample. But what if there is a majority sample nearby?]
10. SMOTE's Shortcomings
- Overgeneralization
  - SMOTE's procedure is inherently dangerous, since it blindly generalizes the minority area without regard to the majority class.
  - This strategy is particularly problematic in the case of highly skewed class distributions since, in such cases, the minority class is very sparse with respect to the majority class, thus resulting in a greater chance of class mixture.
- Lack of Flexibility
  - The number of synthetic samples generated by SMOTE is fixed in advance, thus not allowing for any flexibility in the re-balancing rate.
11. SMOTE's Tendency for Overgeneralization
[Figure: synthetic samples generated between minority samples cross into the majority region. Overgeneralization!]
12. Our Proposed Solution
- In order to avoid overgeneralization, we propose to use three techniques:
  - Testing for data sparsity
  - Clustering the minority class
  - 2-class (rather than 1-class) sample generation
- In order to avoid SMOTE's lack of flexibility, we propose one technique:
  - Multiple trials/feedback
- We call our approach the Adaptive Synthetic Minority Oversampling Method (ASMO).
13. ASMO's Strategy I
- Overgeneralization Avoidance I: Testing for data sparsity
  - For each minority sample m, if m's g nearest neighbours are majority samples, then the data set is sparse and ASMO should be used; otherwise, SMOTE can be used. (As a default, we used g = 20.) A minimal sketch of this test follows below.
- Overgeneralization Avoidance II: Clustering
  - We will use k-means or other such clustering systems on the minority class (for now, this step is done, but in a non-standard way).
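A sketch of the sparsity test, under our own reading of the criterion (we interpret "m's g neighbours are majority samples" as all g nearest neighbours being majority; the helper name is ours):

```python
import numpy as np

def is_sparse(m, X, y, g=20):
    """Return True if all g nearest neighbours of minority sample m in
    the full data set X (labels y; 0 = majority, 1 = minority) belong
    to the majority class -- our reading of the slide's criterion for
    switching from SMOTE to ASMO. Assumes m is a row of X."""
    dists = np.linalg.norm(X - m, axis=1)
    nearest = np.argsort(dists)[1:g + 1]   # skip m itself
    return bool(np.all(y[nearest] == 0))
```

For the clustering step, a standard choice would be k-means on the minority samples (e.g., scikit-learn's KMeans), as the slide itself suggests.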
14. ASMO's Strategy II
- Overgeneralization Avoidance III: Synthetic sample generation using two classes
  - Rather than using the k nearest neighbours within the minority class to generate new samples, we use the k nearest neighbours from the opposite (majority) class.
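A sketch of this 2-class generation idea under our own assumptions: the slides do not say where on the minority-to-majority segment the synthetic point is placed, so the 0.5 cap on the interpolation fraction below is our guess (to keep points on the minority side), and the helper name is hypothetical:

```python
import numpy as np

def two_class_generate(x, majority, k=5, n_new=1, seed=None):
    """Generate synthetic samples along lines joining minority sample x
    to its k nearest *majority* neighbours (the slide's 2-class idea).
    Capping the interpolation fraction at 0.5 is our assumption."""
    rng = np.random.default_rng(seed)
    dists = np.linalg.norm(majority - x, axis=1)
    neighbours = majority[np.argsort(dists)[:k]]
    out = []
    for _ in range(n_new):
        nn = neighbours[rng.integers(len(neighbours))]
        gap = rng.random() * 0.5   # stay on the minority side of the segment
        out.append(x + gap * (nn - x))
    return np.array(out)
```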
15. ASMO's Overgeneralization Avoidance: Overview
[Figure: minority samples are clustered, then synthetic samples are generated between each cluster and its nearest majority samples (clustering + 2-class sample generation).]
16. ASMO's Strategy III
- Flexibility Enhancement through Multiple Trials and Feedback
  - For each cluster Ci, iterate through different rates of majority undersampling and synthetic minority generation; keep the best combination as subset Si.
  - Merge the Si's into a single training set S.
  - Apply the classifier to S.
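A schematic of this per-cluster search loop; `resample`, `evaluate`, and the rate grids are caller-supplied placeholders of ours, not names from the slides:

```python
def tune_cluster(cluster, majority, rates_over, rates_under,
                 resample, evaluate):
    """Try each (oversampling, undersampling) rate pair for one minority
    cluster C_i and keep the best-scoring resampled subset S_i.
    `resample(cluster, majority, r_over, r_under)` builds a candidate
    training subset; `evaluate(subset)` scores it, e.g., by validation
    ROC area. Both are supplied by the caller in this sketch."""
    best_score, best_subset = float("-inf"), None
    for r_over in rates_over:            # synthetic-generation rates
        for r_under in rates_under:      # majority undersampling rates
            subset = resample(cluster, majority, r_over, r_under)
            score = evaluate(subset)
            if score > best_score:
                best_score, best_subset = score, subset
    return best_subset

# The best subsets S_i are then merged into one training set S, and the
# classifier is trained on S, as the slide describes.
```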
17. Discussion of our Technique I
- Assumption we made / Justification
  - The problem is decomposable, i.e., optimizing each subset will yield an optimal merged set.
  - As long as the base classifier we use does some kind of local learning (not just global optimization), this assumption should hold.
- Question / Answer
  - Why did we use different oversampling and undersampling rates?
  - It was previously shown that optimal sampling rates are problem-dependent and thus are best set adaptively (Weiss & Provost, 2003; Estabrooks & Japkowicz, 2001).
18. Experiment Setup I
- We tested our system on three different data sets:
  - Lupus (thanks to James Malley of NIH)
    - Minority class: 2.8%
    - Dataset size: 3,839
  - Abalone-5 (UCI)
    - Minority class: 2.75%
    - Dataset size: 4,177
  - Connect-4 (UCI)
    - Minority class: 9.5%
    - Dataset size: 11,258
19. Experiment Setup II
- ASMO was compared to two other techniques:
  - SMOTE
  - O-D, the combination of random over- and down- (under-) sampling. O-D was shown to outperform both random oversampling and random undersampling in preliminary experiments.
- The base classifier in all experiments is an SVM; k-NN was used in the synthetic generation process in order to identify the samples' nearest neighbours (within the minority class, or between the minority and majority class).
- The results are reported in the form of ROC curves on 10-fold cross-validation experiments. A sketch of this protocol follows below.
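A sketch of this evaluation scaffold with scikit-learn (our library choice; the slides name only the SVM base classifier, k-NN, ROC curves, and 10-fold cross-validation). In a faithful replication the re-sampling would be applied to the training folds only, which this scaffold omits:

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.metrics import roc_curve

def cv_roc_points(X, y):
    """10-fold cross-validated ROC curve for an SVM base classifier,
    mirroring the protocol described on this slide."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_predict(SVC(probability=True), X, y,
                               cv=cv, method="predict_proba")[:, 1]
    return roc_curve(y, scores)   # fpr, tpr, thresholds
```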
20. Results on Lupus
21. Results on Abalone-5
22. Results on Connect-4
23. Discussion of the Results
- On every domain, ASMO slightly outperforms both O-D and SMOTE. In the ROC areas where ASMO does not outperform the other two systems, its performance equals theirs.
- ASMO's effect seems to be one of smoothing SMOTE's ROC curve.
- SMOTE's performance is comparatively better in the two domains where the class imbalance is greater (Lupus, Abalone-5). We expect its relative performance to increase as the imbalance grows even more.
24. Summary
- We presented a few modifications to the state-of-the-art re-sampling system, SMOTE.
- These modifications had two goals:
  - To correct SMOTE's tendency to overgeneralize
  - To make SMOTE more flexible
- We observed slightly improved performance on three domains. However, that improvement came at the expense of greater time consumption.
25. Future Work (This was a very preliminary study!)
- To clean up the system (e.g., to use a standard clustering method)
- To test the system more rigorously (e.g., to test for significance using TANGO, a test used in the medical domain)
- To test our system on highly imbalanced data sets, to see if, indeed, our design helps address this particular issue
- To modify the data generation process so as to test biases other than the one proposed by SMOTE