Title: What are the real challenges in data mining?
1 What are the real challenges in data mining?
- Charles Elkan
- University of California, San Diego
- August 21, 2003
2 Bogosity about learning with unbalanced data
- "The goal is yes/no classification."
  - No: ranking, or probability estimation.
  - Often, P(c_minority | x) < 0.5 for all examples x.
- "Decision trees and C4.5 are well-suited."
  - No: model each class separately, then use Bayes' rule
    P(c | x) = P(x | c) P(c) / [ P(x | c) P(c) + P(x | not c) P(not c) ]
    (a sketch of this recipe follows the list).
  - No: avoid small disjuncts.
  - With naïve Bayes, P(x | c) = Π_i P(x_i | c).
- "Under/over-sampling are appropriate."
  - No: do cost-based, example-specific sampling, then bagging.
- "ROC curves and AUC are important."
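A minimal sketch of the "model each class separately, then apply Bayes' rule" recipe, assuming Gaussian class-conditional densities; the toy data, variable names, and the choice of numpy/scipy are illustrative and not part of the original slide.

```python
# Sketch: model each class separately, then combine with Bayes' rule.
# Assumes Gaussian class-conditional densities; data and names are illustrative.
import numpy as np
from scipy.stats import multivariate_normal

def fit_class_models(X, y):
    """Fit one Gaussian density per class and record the class priors."""
    models = {}
    for c in np.unique(y):
        Xc = X[y == c]
        models[c] = {
            "prior": len(Xc) / len(X),
            "mean": Xc.mean(axis=0),
            "cov": np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1]),
        }
    return models

def posterior(models, x):
    """P(c | x) = P(x | c) P(c) / sum over c' of P(x | c') P(c')."""
    joint = {c: m["prior"] * multivariate_normal.pdf(x, m["mean"], m["cov"])
             for c, m in models.items()}
    z = sum(joint.values())
    return {c: v / z for c, v in joint.items()}

# Toy unbalanced data: 950 majority vs 50 minority examples.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (950, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([0] * 950 + [1] * 50)
models = fit_class_models(X, y)
# Even at the minority-class mean, the minority posterior typically stays below 0.5,
# illustrating the slide's point about P(c_minority | x).
print(posterior(models, np.array([1.5, 1.5])))
```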
3 Learning to predict contact maps
(Figure: 3D protein distance map and the derived binary contact map. Source: Paolo Frasconi et al.)
4 Issues in contact map prediction
- An ML researcher sees O(n^2) non-contacts and O(n) contacts.
- But to a biologist, the concept "an example of a non-contact" is far from natural.
- Moreover, there is no natural probability distribution defining the population of all proteins.
- A statistician sees simply O(n^2) distance measures, but s/he finds least-squares regression is useless!
5 For the rooftop detection task
- We used BUDDS to extract candidate rooftops (i.e. parallelograms) from six large-area images. This processing resulted in 17,829 candidates, which an expert labeled as 781 positive examples and 17,048 negative examples of the concept "rooftop".
- (Source: "Learning When Data Sets are Imbalanced and When Costs are Unequal and Unknown", Marcus Maloof, this workshop.)
6 How to detect faces in real time?
- Viola and Jones, CVPR '01.
- Slide a window over the image.
- 45,396 features per window.
- Learn a boosted decision-stump classifier (a generic sketch follows).
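For concreteness, a generic boosted decision-stump classifier in scikit-learn. This is only a stand-in for the model class Viola and Jones train, not their cascade or their Haar-like feature extraction; the synthetic feature matrix below is purely illustrative.

```python
# Sketch: a boosted decision-stump classifier of the kind Viola & Jones learn.
# X stands in for per-window feature values; in the real system these would be
# Haar-like features computed for each sliding window (not implemented here).
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 100))                    # stand-in for per-window features
y = (X[:, 0] + 0.5 * X[:, 1] > 1.5).astype(int)      # relatively rare "face" windows

stump = DecisionTreeClassifier(max_depth=1)          # one threshold on one feature
# Note: the parameter is named base_estimator in scikit-learn versions before 1.2.
clf = AdaBoostClassifier(estimator=stump, n_estimators=50)
clf.fit(X, y)
print("fraction positive:", y.mean(), "train accuracy:", clf.score(X, y))
```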
7 UCI datasets are small and not highly unbalanced
  DATA SET    SIZE   FEATURES   MINORITY FRACTION
  PIMA          768      8           0.35
  PHONEME      5484      5           0.29
  SATIMAGE     6435     36           0.10
  MAMMOG.     11183      6           0.02
  KRKOPT      28056      6           0.01
- (Source: "C4.5 and Imbalanced Data Sets", Nitin Chawla, this workshop.)
8 (No transcript for this slide.)
9 Features of the DMEF and similar datasets
- At least 10^5 examples and 10^2.5 features.
- No single well-defined target class.
- Interesting cases have frequency < 0.01.
- Much information on costs and benefits, but no overall model of profit/loss.
- Different cost matrices for different examples.
- Most cost matrix entries are unknown.
10 Example-dependent costs and benefits
                         actual legitimate   actual fraudulent
  predict legitimate          0.01x                -x
  predict fraudulent          -20                  -10
- Observations:
  - Loss or profit depends on the transaction size x.
  - Figuring out the full profit/loss model is hard.
  - Opportunity costs are confusing.
  - Creative management transforms costs into benefits.
  - How do we account for long-term costs and benefits?
11 Correct decisions require correct probabilities
                         actual legitimate   actual fraudulent
  predict legitimate          0.01x                -x
  predict fraudulent          -20                  -10
- Let p = P(legitimate). The optimal decision is "approve" iff
  0.01x·p + (1-p)(-x) > (-20)p + (-10)(1-p).
- This calculation requires well-calibrated estimates of p (a worked check of the rule follows).
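A small worked version of the decision rule above, with the entries of the example-dependent benefit matrix hard-coded; the sample (p, x) pairs are invented to show that the same probability can lead to different decisions for different transaction sizes.

```python
# Approve a transaction of size x iff the expected benefit of approving exceeds
# that of denying, given p = P(legitimate). The 0.01x / -x / -20 / -10 entries
# come from the example-dependent benefit matrix on the slide.
def expected_benefit_approve(p, x):
    return p * 0.01 * x + (1 - p) * (-x)

def expected_benefit_deny(p):
    return p * (-20) + (1 - p) * (-10)

def decide(p, x):
    return "approve" if expected_benefit_approve(p, x) > expected_benefit_deny(p) else "deny"

# Same probability, different transaction sizes -> different optimal decisions.
for p, x in [(0.99, 100), (0.90, 100), (0.90, 5000)]:
    print(p, x, decide(p, x))
```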
12 ROC curves considered harmful (Source: Medical College of Georgia.)
- "AUC can give a general idea of the quality of the probabilistic estimates produced by the model."
  - No: AUC only evaluates the ranking produced (checked below).
- "Cost curves are equivalent to ROC curves."
  - No: a single point on the ROC curve is optimal only if costs are the same for all examples.
- Advice: use profit to compare methods.
- Issue: when is a difference statistically significant?
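A quick check of the claim that AUC only evaluates the ranking: squaring probabilities is strictly monotone on [0, 1], so it leaves the ranking, and hence AUC, unchanged while destroying calibration. The data here are synthetic.

```python
# AUC is invariant under any strictly monotone transformation of the scores,
# so it cannot measure how well-calibrated the probability estimates are.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
p_true = rng.uniform(0, 1, 10_000)
y = rng.binomial(1, p_true)              # labels drawn from the true probabilities

print(roc_auc_score(y, p_true))          # well-calibrated scores
print(roc_auc_score(y, p_true ** 2))     # badly calibrated, same ranking, same AUC
```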
13 Usually we must learn a model to estimate costs
                    actual donor   actual non-donor
  solicit             x - 0.68         -0.68
  ignore                  0                0
- Cost matrix for soliciting donors to a charity.
- The donation amount x is always unknown for test examples, so we must use the training data to learn a regression model to predict x.
14 So, we learn a model to estimate costs
                    actual donor   actual non-donor
  solicit             x - 0.68         -0.68
  ignore                  0                0
- Issue: the subset of the training set with x > 0 is a skewed sample for learning a model to estimate x.
- Reason: donation amount x and probability of donation p are inversely correlated.
- Hence, the training set contains too few examples of large donations, compared to small ones. (A two-model sketch follows.)
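One hedged sketch of the two-model approach implied above, in the spirit of the Zadrozny and Elkan KDD'01 paper listed at the end: estimate P(donor) and the donation amount separately, then solicit when the expected benefit exceeds the $0.68 mailing cost. The features, estimators, and synthetic data are assumptions for illustration, and the amount model is fit only on the skewed x > 0 subsample, which is exactly the problem the slide raises.

```python
# Sketch: estimate the donation probability and the donation amount with two
# separate models, then solicit iff p_hat * amount_hat - 0.68 > 0.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)
n = 20_000
F = rng.normal(size=(n, 5))                              # illustrative donor features
p_donate = 1 / (1 + np.exp(-(F[:, 0] - 2)))              # donation is rare
donated = rng.binomial(1, p_donate)
# Amount is inversely related to the donation probability, as the slide notes.
amount = np.where(donated == 1,
                  np.exp(2 - F[:, 0] + 0.3 * rng.normal(size=n)), 0.0)

prob_model = LogisticRegression(max_iter=1000).fit(F, donated)
# The regression is fit only on donors (x > 0): a skewed sample of the population.
amount_model = LinearRegression().fit(F[donated == 1], amount[donated == 1])

p_hat = prob_model.predict_proba(F)[:, 1]
a_hat = amount_model.predict(F)
solicit = p_hat * a_hat - 0.68 > 0                       # expected benefit of mailing
print("fraction solicited:", solicit.mean())
```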
15 (No transcript for this slide.)
16 The reject inference problem
- Let humans make credit grant/deny decisions.
- Collect data about repay/write-off, but only for people to whom credit is granted.
- Learn a model from this training data.
- Apply the model to all future applicants.
- Issue: "all future applicants" is a sample from a different population than "people to whom credit is granted" (illustrated below).
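A toy simulation of the stated issue, with made-up numbers: write-off labels exist only for the lower-risk people who were granted credit, so the training data understate the write-off rate in the full applicant population.

```python
# Reject inference, schematically: the labeled training population (granted) is
# a non-random, safer subset of the population the model will be applied to.
import numpy as np

rng = np.random.default_rng(0)
risk = rng.normal(size=50_000)                        # latent riskiness of applicants
granted = risk < 0.0                                  # humans grant credit only to low-risk people
writeoff = rng.binomial(1, 1 / (1 + np.exp(-(2 * risk - 2))))

print("write-off rate in the training data (granted only):", writeoff[granted].mean())
print("write-off rate over all applicants:", writeoff.mean())
```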
17 (No transcript for this slide.)
18 Selection bias makes training labels incorrect
- In the Wisconsin Prognostic Breast Cancer Database, average survival time with chemotherapy is lower (58.9 months) than without (63.1 months), presumably because sicker patients were more likely to receive chemotherapy!
- Historical actions are not optimal, but they are not chosen randomly either.
- (Source: William H. Wolberg, M.D.)
19 Sequences of training sets
- Use data collected in 2000 to learn a model; apply this model to select inside the 2001 population.
- Use data about the individuals selected in 2001 to learn a new model; apply this model in 2002.
- And so on.
- Each time a new model is learned, its training set has been created using a different selection bias. (A toy simulation follows.)
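A toy simulation of this loop, with invented data and scikit-learn models standing in for whatever was actually used: each year's model is fit only on the individuals that the previous model selected, so every training set carries the previous model's selection bias.

```python
# Each iteration: fit on last year's selected individuals, then use the model
# to select (and hence bias) next year's training set. Everything is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def new_population(n=10_000):
    X = rng.normal(size=(n, 3))
    y = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))   # outcome depends on feature 0
    return X, y

X_sel, y_sel = new_population(2_000)                   # whoever was selected in 2000
for year in (2001, 2002, 2003):
    model = LogisticRegression().fit(X_sel, y_sel)     # outcomes known only for selected people
    X_pop, y_pop = new_population()
    scores = model.predict_proba(X_pop)[:, 1]
    keep = scores > np.quantile(scores, 0.9)           # select the top decile
    X_sel, y_sel = X_pop[keep], y_pop[keep]            # next year's biased training set
    print(year, "mean outcome among the selected:", y_sel.mean())
```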
20 Let's use the word "unbalanced" in the future
- Google: searching the web for "imbalanced" gives about 53,800 results.
- Searching the web for "unbalanced" gives about 465,000 results.
21 References
- C. Elkan. The Foundations of Cost-Sensitive Learning. IJCAI'01, pp. 973-978.
- B. Zadrozny and C. Elkan. Learning and Making Decisions When Costs and Probabilities are Both Unknown. KDD'01, pp. 204-213.
- B. Zadrozny and C. Elkan. Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers. ICML'01, pp. 609-616.
- N. Abe et al. Empirical Comparison of Various Reinforcement Learning Strategies for Sequential Targeted Marketing. ICDM'02.
- B. Zadrozny, J. Langford, and N. Abe. Cost-Sensitive Learning by Cost-Proportionate Example Weighting. ICDM'03.