1
Reflections
  • Robert Holte
  • University of Alberta
  • holte@cs.ualberta.ca

2
unbalanced vs. imbalanced
  • Google search for "imbalanced": about 53,800
    results.
  • Google search for "unbalanced": about 465,000
    results.

Shouldn't we favour the minority class???
3
Is FP meaningful?
  • Elkan: individual examples have costs, so the
    number of misclassified positive examples is
    irrelevant (see the cost-sensitive sketch below)
  • Moreover, if the test distribution can differ
    from the training distribution, the FP count measured
    on training data may have no relation to FP later.
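
To make Elkan's point concrete, here is a minimal sketch of
classification with example-specific costs: predict positive exactly
when the expected cost of a "yes" is lower than that of a "no". The
synthetic data, the per-example costs, and the use of scikit-learn's
LogisticRegression as the probability model are all illustrative
assumptions, not anything from the talk.

    # Sketch: cost-sensitive decisions with example-specific costs.
    # Data, costs, and model below are illustrative assumptions.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = (X[:, 0] + rng.normal(size=1000) > 1.5).astype(int)  # few positives

    # Hypothetical per-example costs of a false positive / false negative.
    fp_cost = rng.uniform(1, 5, size=1000)
    fn_cost = rng.uniform(1, 20, size=1000)

    model = LogisticRegression().fit(X, y)
    p = model.predict_proba(X)[:, 1]  # estimated P(positive | x)

    # Predict positive when its expected cost is lower:
    # (1 - p) * fp_cost < p * fn_cost, i.e. p > fp_cost / (fp_cost + fn_cost)
    threshold = fp_cost / (fp_cost + fn_cost)
    y_pred = (p > threshold).astype(int)

With uniform costs the threshold collapses to the usual 0.5; the
per-example variation is exactly why an aggregate FP count can be
uninformative.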

BUT
4
Babies and Bathwater
  • Not every situation involves example-specific
    costs and drifting within-class distributions
  • ROC curves are far better than accuracy
  • and ROC curves are better than AUC or any scalar
    measure (see the sketch below)
  • and cost curves are even better??
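
A small sketch of the point about accuracy versus ROC on a skewed
class distribution; the synthetic 95/5 dataset and the logistic model
are assumptions for illustration only.

    # Sketch: accuracy can look strong on a 95/5 split simply because
    # "always predict the majority" is usually right; the ROC curve
    # shows behaviour across all thresholds, and AUC collapses the
    # whole curve into one scalar.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                              random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores = clf.predict_proba(X_te)[:, 1]

    print("accuracy:", accuracy_score(y_te, scores > 0.5))
    fpr, tpr, thresholds = roc_curve(y_te, scores)  # the full curve
    print("AUC:", roc_auc_score(y_te, scores))      # one-number summary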

5
And the question remains
  • How to select the training examples that
    give the best classifier for your circumstances?
  • (Foster's budgeted-learning problem)

6
Within-class imbalance
  • Elkan: subpopulations in the test distribution are
    not evenly represented in training
  • Other presenters: subpopulations in training are
    not of equal size

7
In Defense of studies of C4.5 and undersampling
  • Foster's opening example (budgeted learning) is
    very common.
  • Undersampling is a common technique (SAS manual)
  • Different algorithms react differently to
    undersampling (see the sketch below)
  • C4.5's reaction is not necessarily intuitive
  • Foster: the appropriate sampling method depends on
    the performance measure
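
A minimal sketch of majority-class undersampling as discussed here.
scikit-learn's DecisionTreeClassifier stands in for C4.5 (an
assumption; C4.5 proper is Quinlan's own program), and balancing to a
1:1 ratio is just one common choice.

    # Sketch: randomly drop majority examples until the classes
    # balance, then train the tree on the reduced set.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def undersample(X, y, rng):
        pos = np.flatnonzero(y == 1)
        neg = np.flatnonzero(y == 0)
        keep = rng.choice(neg, size=len(pos), replace=False)
        idx = np.concatenate([pos, keep])
        return X[idx], y[idx]

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 10))
    y = (X[:, 0] > 1.6).astype(int)  # roughly 5% positives

    X_bal, y_bal = undersample(X, y, rng)
    tree = DecisionTreeClassifier(random_state=0).fit(X_bal, y_bal)

How a given learner reacts to this resampling is an empirical
question, which is the slide's point.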

8
Endless Tweaking?
  • Definitely a danger
  • Overtuning
  • Plethora of results/methods
  • But exploratory research is valid once a clear
    need is established
  • Some papers have presented specific hypotheses
    that can now be tested
  • 1-class SVM outperforms 2-class SVM when ...

9
Size matters
  • Having a small number of examples is a different
    problem than having an imbalance
  • Both cause problems
  • We should be careful to separate them in our
    experiments

10
No problem?
  • Foster: the problem diminishes when datasets get
    large
  • Are some learning algorithms insensitive?
  • Generative models?
  • SVMs? (it seems not, after today)
  • Active learning, progressive sampling

11
More problems?
  • Imbalance detrimental to feature selection
  • Imbalance detrimental to clustering

12
ELKAN: Bogosity about learning with unbalanced
data
  • The goal is yes/no classification.
  • No: ranking, or probability estimation
  • Often, P(c_minority | x) < 0.5 for all examples x
  • Decision trees and C4.5 are well-suited
  • No: model each class separately, then use Bayes'
    rule
  • P(c|x) = P(x|c)P(c) / [P(x|c)P(c) + P(x|¬c)P(¬c)]
  • No: avoid small disjuncts
  • With naïve Bayes, P(x|c) = ∏ᵢ P(xᵢ | c)
  • Under/over-sampling are appropriate
  • No: do cost-based example-specific sampling, then
    bagging (see the sketch below)
  • ROC curves and AUC are important
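
The "cost-based example-specific sampling, then bagging" recipe can
be sketched as cost-proportionate rejection sampling followed by a
vote over the resampled models. The base learner, the cost values,
and the ensemble size below are all assumptions for illustration.

    # Sketch: keep each example with probability cost / max_cost,
    # train one model per resample, then average the ensemble's votes.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def costing(X, y, costs, n_models=10, seed=0):
        rng = np.random.default_rng(seed)
        z = costs.max()
        models = []
        for _ in range(n_models):
            keep = rng.random(len(y)) < costs / z  # rejection sampling
            models.append(DecisionTreeClassifier().fit(X[keep], y[keep]))
        return models

    def vote(models, X):
        return np.mean([m.predict(X) for m in models], axis=0) >= 0.5

    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 8))
    y = (X[:, 0] > 1.5).astype(int)
    costs = np.where(y == 1, 10.0, 1.0)  # assumed: positives cost 10x to miss
    y_pred = vote(costing(X, y, costs), X).astype(int)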