Transcript and Presenter's Notes

Title: What are the real challenges in data mining?


1
What are the real challenges in data mining?
  • Charles Elkan
  • University of California, San Diego
  • August 21, 2003

2
Bogosity about learning with unbalanced data
  • "The goal is yes/no classification."
  • No: ranking or probability estimation is usually what is needed.
  • Often, P(c_minority | x) < 0.5 for all examples x.
  • "Decision trees and C4.5 are well-suited."
  • No: model each class separately, then use Bayes' rule:
  • P(c|x) = P(x|c)P(c) / [P(x|c)P(c) + P(x|not c)P(not c)]
  • No: avoid small disjuncts.
  • With naïve Bayes, P(x|c) = ∏_i P(x_i|c).
  • "Under/over-sampling are appropriate."
  • No: do cost-based, example-specific sampling, then bagging (see the sketch after this list).
  • "ROC curves and AUC are important."
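A minimal sketch of the "cost-based, example-specific sampling, then bagging" alternative, in the spirit of the Zadrozny, Langford, and Abe paper cited on the last slide; the base learner, the cost vector, and all names are placeholders, not part of the original slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def costing(X, y, costs, n_models=10, seed=0):
    """Cost-proportionate rejection sampling followed by bagging.

    Each example is accepted with probability cost / max_cost, so every
    resampled training set reflects the example-specific costs; an
    ordinary cost-blind classifier is then trained on each sample.
    """
    rng = np.random.default_rng(seed)
    z = costs.max()
    models = []
    for _ in range(n_models):
        keep = rng.random(len(y)) < costs / z
        models.append(DecisionTreeClassifier().fit(X[keep], y[keep]))
    return models

def bagged_predict(models, X):
    # Average the per-model votes (bagging) and threshold at 0.5.
    votes = np.mean([m.predict(X) for m in models], axis=0)
    return (votes >= 0.5).astype(int)
```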

3
Learning to predict contact maps
[Figure: 3D protein distance map and binary contact map. (Source: Paolo Frasconi et al.)]
4
Issues in contact map prediction
  1. An ML researcher sees O(n²) non-contacts and O(n) contacts.
  2. But to a biologist, the concept "an example of a non-contact" is far from natural.
  3. Moreover, there is no natural probability distribution defining the population of all proteins.
  4. A statistician sees simply O(n²) distance measures, but s/he finds least-squares regression is useless!

5
For the rooftop detection task
  • We used BUDDS to extract candidate rooftops (i.e., parallelograms) from six large-area images. This processing yielded 17,829 candidates, which an expert labeled as 781 positive examples and 17,048 negative examples of the concept "rooftop".
  • (Source: "Learning When Data Sets are Imbalanced and When Costs are Unequal and Unknown", Marcus Maloof, this workshop.)

6
How to detect faces in real-time?
  • Viola and Jones, CVPR '01
  • Slide a window over the image
  • 45,396 features per window
  • Learn a boosted decision-stump classifier (see the sketch below)
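A rough illustration of a boosted decision-stump classifier of the kind mentioned above; this is only stump boosting on placeholder data, not Viola and Jones's Haar-feature extraction or attentional cascade.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: in the real task each row would hold the
# rectangle-feature values computed for one sliding window.
rng = np.random.default_rng(0)
X_windows = rng.normal(size=(1000, 200))
y_windows = (X_windows[:, 0] + 0.5 * X_windows[:, 1] > 0).astype(int)

stump = DecisionTreeClassifier(max_depth=1)  # a single-split "decision stump"
# Note: the keyword is `base_estimator` in older scikit-learn versions.
clf = AdaBoostClassifier(estimator=stump, n_estimators=200)
clf.fit(X_windows, y_windows)

# Thresholding the real-valued score trades detections against false alarms.
scores = clf.decision_function(X_windows)
```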

7
UCI datasets are small and not highly unbalanced
DATA SET    SIZE    FEATURES   MINORITY FRACTION
PIMA          768      8        0.35
PHONEME      5484      5        0.29
SATIMAGE     6435     36        0.10
MAMMOG.     11183      6        0.02
KRKOPT      28056      6        0.01
(Source: "C4.5 and Imbalanced Data Sets", Nitin Chawla, this workshop.)
8
(No Transcript)
9
Features of the DMEF and similar datasets
  1. At least 10^5 examples and 10^2.5 features.
  2. No single well-defined target class.
  3. Interesting cases have frequency < 0.01.
  4. Much information on costs and benefits, but no overall model of profit/loss.
  5. Different cost matrices for different examples.
  6. Most cost matrix entries are unknown.

10
Example-dependent costs and benefits
predicted \ actual    legitimate    fraudulent
legitimate               0.01x          -x
fraudulent                 -20          -10
  • Observations
  • Loss or profit depends on the transaction size x.
  • Figuring out the full profit/loss model is hard.
  • Opportunity costs are confusing.
  • Creative management transforms costs into
    benefits.
  • How do we account for long-term costs and
    benefits?

11
Correct decisions require correct probabilities
predicted \ actual    legitimate    fraudulent
legitimate               0.01x          -x
fraudulent                 -20          -10
  • Let p = P(legitimate). The optimal decision is to approve iff
  • 0.01x·p + (-x)·(1 - p) > (-20)·p + (-10)·(1 - p)
  • This calculation requires well-calibrated estimates of p.
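A direct transcription of this decision rule in code; the function and variable names are ours, not from the slides.

```python
def approve(p_legitimate: float, x: float) -> bool:
    """Approve iff the expected profit of approving beats denying.

    Cost matrix above: approving a legitimate transaction of size x
    earns 0.01*x, approving a fraudulent one loses x; denying costs 20
    if the transaction was legitimate and 10 if it was fraudulent.
    """
    p = p_legitimate
    expected_if_approve = 0.01 * x * p + (-x) * (1 - p)
    expected_if_deny = (-20) * p + (-10) * (1 - p)
    return expected_if_approve > expected_if_deny

# Example: a $1000 transaction is approved only if p is high enough.
print(approve(0.995, 1000.0))  # True
print(approve(0.95, 1000.0))   # False
```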

12
ROC curves considered harmful (Source: Medical College of Georgia.)
  • "AUC can give a general idea of the quality of the probabilistic estimates produced by the model."
  • No: AUC only evaluates the ranking produced.
  • "Cost curves are equivalent to ROC curves."
  • No: a single point on the ROC curve is optimal only if costs are the same for all examples.
  • Advice: use profit to compare methods (see the sketch below).
  • Issue: when is a difference in profit statistically significant?
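One way to act on that advice, sketched under our own assumptions: the fraud cost matrix from the previous slides serves as the profit model, and a simple bootstrap over the test set gauges whether the profit difference between two methods is significant.

```python
import numpy as np

def total_profit(decisions, actual, x):
    """Profit under the example-dependent fraud cost matrix above.

    decisions: 1 = approve, 0 = deny; actual: 1 = legitimate, 0 = fraudulent;
    x: transaction sizes.
    """
    approve_profit = np.where(actual == 1, 0.01 * x, -x)
    deny_profit = np.where(actual == 1, -20.0, -10.0)
    return np.where(decisions == 1, approve_profit, deny_profit).sum()

def bootstrap_profit_difference(dec_a, dec_b, actual, x, n_boot=10000, seed=0):
    """Bootstrap the profit difference of two methods on the same test set."""
    rng = np.random.default_rng(seed)
    n = len(actual)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)
        diffs[b] = (total_profit(dec_a[idx], actual[idx], x[idx])
                    - total_profit(dec_b[idx], actual[idx], x[idx]))
    # A rough 95% interval for the difference; if it excludes 0,
    # the difference is unlikely to be noise.
    return np.quantile(diffs, [0.025, 0.975])
```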

13
Usually we must learn a model to estimate costs
predicted \ actual    donor        non-donor
solicit               x - 0.68     -0.68
ignore                0             0
  • Cost matrix for soliciting donors
    to a charity.
  • The donation amount x is always unknown for test
    examples, so we must use the training data to
    learn a regression model to predict x.
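A minimal two-stage sketch of this idea; the model types, feature matrix, and synthetic numbers below are placeholders for the historical mailing records. One model estimates the probability of donating, a second model, trained only on donors, estimates the donation amount, and a person is solicited when the expected donation exceeds the $0.68 mailing cost.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

# Placeholder training data standing in for past mailing records.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))
donated = (rng.random(5000) < 0.05).astype(int)                 # ~5% donors
amounts = np.where(donated == 1, rng.gamma(2.0, 10.0, 5000), 0.0)

p_model = LogisticRegression(max_iter=1000).fit(X, donated)     # P(donate | x)
a_model = LinearRegression().fit(X[donated == 1],               # E[amount | donate, x],
                                 amounts[donated == 1])         # fit on donors only

def solicit(X_new):
    p = p_model.predict_proba(X_new)[:, 1]
    amount_hat = a_model.predict(X_new)
    # Expected profit of soliciting is p * amount - 0.68, versus 0 for ignoring.
    return p * amount_hat > 0.68

print(solicit(X[:5]))
```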

14
So, we learn a model to estimate costs
predicted \ actual    donor        non-donor
solicit               x - 0.68     -0.68
ignore                0             0
  • Issue: the subset of the training set with x > 0 is a skewed sample for learning a model to estimate x.
  • Reason: the donation amount x and the probability of donation p are inversely correlated.
  • Hence, the training set contains too few examples of large donations, compared to small ones.

15
(No Transcript)
16
The reject inference problem
  • Let humans make credit grant/deny decisions.
  • Collect data about repay/write-off, but only for
    people to whom credit is granted.
  • Learn a model from this training data.
  • Apply the model to all future applicants.
  • Issue: "all future applicants" is a sample from a different population than "people to whom credit is granted."

17
(No Transcript)
18
Selection bias makes training labels incorrect
  • In the Wisconsin Prognostic Breast Cancer
    Database, average survival time with chemotherapy
    is lower (58.9 months) than without (63.1)!
  • Historical actions are not optimal, but they are
    not chosen randomly either.

(Source: William H. Wolberg, M.D.)
19
Sequences of training sets
  • Use data collected in 2000 to learn a model; apply this model to select within the 2001 population.
  • Use data about the individuals selected in 2001 to learn a new model; apply this model in 2002.
  • And so on.
  • Each time a new model is learned, its training set has been created using a different selection bias.

20
Let's use the word "unbalanced" in the future
  • Google: searching the web for "imbalanced" finds about 53,800 pages.
  • Searching the web for "unbalanced" finds about 465,000 pages.

21
  • C. Elkan. "The Foundations of Cost-Sensitive Learning." IJCAI'01, pp. 973-978.
  • B. Zadrozny and C. Elkan. "Learning and Making Decisions When Costs and Probabilities are Both Unknown." KDD'01, pp. 204-213.
  • B. Zadrozny and C. Elkan. "Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers." ICML'01, pp. 609-616.
  • N. Abe et al. "Empirical Comparison of Various Reinforcement Learning Strategies for Sequential Targeted Marketing." ICDM'02.
  • B. Zadrozny, J. Langford, and N. Abe. "Cost-Sensitive Learning by Cost-Proportionate Example Weighting." ICDM'03.