Profiting from Data Mining

1
Profiting from Data Mining
  • Bob Stine
  • Department of Statistics
  • The Wharton School, Univ of Pennsylvania
  • April 5, 2002
  • www-stat.wharton.upenn.edu/bob

2
Overview
  • Critical stages of data mining process
  • Choosing the right data, people, and problems
  • Modeling
  • Validation
  • Automated modeling
  • Feature creation and selection
  • Exploiting expert knowledge, insights
  • Applications
  • Little detail: biomedical, finding predictive
    risk factors
  • More detail: financial, predicting returns on
    the market
  • Lots of detail: credit, anticipating the onset
    of bankruptcy

3
Predicting Health Risk
  • Who is at risk for a disease?
  • Example: detect osteoporosis without the expense
    of an x-ray
  • Goals
  • Improving public health
  • Savings on medical care
  • Confirm an informal model with data mining
  • Many types of features, interested groups
  • Clinical observations of doctors
  • Laboratory measurements, genetic
  • Self-reported behavior
  • Missing data

4
Predicting the Stock Market
  • Small, hands-on example
  • Goals
  • Better retirement savings?
  • Money for that special vacation? College?
  • Trade-offs: risk vs. return
  • Lots of free data
  • Access to accurate historical time trends, macro
    factors
  • Recent data more useful than older data
  • Simple modeling technique
  • Validation

5
Predicting the Market: Specifics
  • Build a regression model
  • Response is return on the value-weighted S&P
  • Use standard forward/backward stepwise
  • Battery of 12 predictors with interactions
  • Train the model during 1992-1996 (training data)
  • Model captures most of variation in 5 years of
    returns
  • Retain only the most significant features
    (Bonferroni)
  • Predict returns in 1997 (validation data)
  • Another version in Foster, Stine & Waterman

6
Historical patterns?
7
Fitted model predicts...
Exceptional Feb return?
8
What happened?
Training Period
9
Claimed versus Actual Error
[Figure: squared prediction error, claimed versus actual]
10
Over-confidence?
  • Over-fitting
  • Model fits the training data too well: better
    than it can predict the future.
  • Greedy fitting procedure: optimization
    capitalizes on chance
  • Some intuition
  • Coincidences
  • Cancer clusters, the birthday problem
  • Illustration with an auction
  • What is the value of the coins in this jar?
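
To make "optimization capitalizes on chance" concrete, here is a minimal Python sketch. It is not from the talk: the number of candidate predictors and the seed are illustrative assumptions, and 60 observations is chosen only to mirror five years of monthly returns. It searches pure-noise predictors for the one that best matches a pure-noise response, then checks the winner on fresh data.

```python
import math
import random

def corr(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def noise(n, rng):
    """n independent standard normal draws."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

rng = random.Random(2002)   # arbitrary seed, for repeatability
n, p = 60, 500              # assumed: 60 observations, 500 candidate predictors

# Training data: the response and every predictor are independent noise.
y_train = noise(n, rng)
best_train = max(abs(corr(noise(n, rng), y_train)) for _ in range(p))

# Validation: the "winning" predictor is still just noise, so fresh
# draws of it and of the response are unrelated.
best_valid = abs(corr(noise(n, rng), noise(n, rng)))

print(f"best training |r| among {p} noise predictors: {best_train:.2f}")
print(f"same predictor on validation data: {best_valid:.2f}")
```

The best training correlation looks impressive only because it is the maximum of many tries; nothing real was found, so the validation correlation should fall back toward zero.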

11
Auctions and Over-fitting
  • What is the value of these coins?

12
Auctions and Over-fitting
  • Auction jar of coins to a class of MBA students
  • Histogram shows the bids of 30 students
  • Most were suspicious, but a few were not!
  • Actual value is $3.85
  • Known as the Winner's Curse
  • Similar to over-fitting: the best model is like
    the high bidder
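
A small simulation can illustrate the analogy. This is a sketch under stated assumptions, not the classroom data: each of 30 bidders is assumed to bid an unbiased but noisy guess (normal noise with a hypothetical standard deviation of 1.0), with no shading of bids by suspicious bidders.

```python
import random
import statistics

TRUE_VALUE = 3.85     # value of the jar, from the slide
N_BIDDERS = 30        # class size, from the slide
NOISE_SD = 1.0        # assumed spread of individual guesses (hypothetical)
N_AUCTIONS = 2000     # number of simulated auctions (hypothetical)

rng = random.Random(0)
all_bids, winning_bids = [], []
for _ in range(N_AUCTIONS):
    # Each bidder's guess is unbiased: right on average, noisy individually.
    bids = [rng.gauss(TRUE_VALUE, NOISE_SD) for _ in range(N_BIDDERS)]
    all_bids.extend(bids)
    winning_bids.append(max(bids))

mean_bid = statistics.mean(all_bids)
mean_winner = statistics.mean(winning_bids)
print(f"average bid:         {mean_bid:.2f}")
print(f"average winning bid: {mean_winner:.2f}")
```

Even though the average bid is close to the true value, the winner is by construction the most optimistic bidder, so the winning bid overshoots. A model chosen as the best fit among many candidates is in the same position.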

13
Profiting from data mining?
  • Where's the profit in this?
  • Mining the miners vs getting value from your
    data
  • Lost opportunities
  • Importance of domain knowledge
  • Validation as a measure of success
  • Prediction provides an explicit check
  • Does your application predict something?

14
Pitfalls and Role of Management
  • Over-fitting is dominated by other issues
  • Management support
  • Life in silos
  • Coordination across domains
  • Responsibility and reward
  • Accountability
  • Who gets the credit when it succeeds? Who suffers
    if the project is not successful?

15
Specific Potholes
  • Moving targets
  • "Let's try this with something else."
  • Irrational expectations
  • "I could have done better than that."
  • Not with my data
  • "It's our data. You can't use it."
  • "You did not use our data properly."

16
Back to a real application
  • Emphasis on the statistical issues

17
Predicting Bankruptcy
  • Goal
  • Reduce losses stemming from personal bankruptcy
  • Possible strategies
  • If we can identify those with the highest risk of
    bankruptcy, take some action
  • Call them for a friendly chat about
    circumstances
  • Unilaterally reduce credit limit
  • Trade-off
  • Good customers borrow lots of money
  • Bad customers also borrow lots of money

18
Predicting Bankruptcy
  • Needle in a haystack
  • 3,000,000 months of credit-card activity
  • 2244 bankruptcies
  • Simple predictor that "all are OK" looks pretty
    good.
  • What factors anticipate bankruptcy?
  • Spending patterns? Payment history?
  • Demographics? Missing data?
  • Combinations of factors?
  • Cash advance + Las Vegas = problem
  • We consider more than 100,000 predictors!
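
The "needle in a haystack" point is easy to check with the two counts on this slide. The short Python calculation below uses only those figures and shows why raw accuracy is a poor yardstick here.

```python
months = 3_000_000      # account-months of activity, from the slide
bankruptcies = 2244     # bankruptcies observed, from the slide

base_rate = bankruptcies / months
accuracy_all_ok = 1 - base_rate   # "all are OK" is wrong only on the bankruptcies

print(f"bankruptcy rate per account-month: {base_rate:.6f}")
print(f"accuracy of predicting 'all OK':   {accuracy_all_ok:.4%}")
```

A rule that never flags anyone is right more than 99.9% of the time, yet it finds none of the cases that matter.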

19
Modeling: Predictive Models
  • Build the model: identify patterns in training
    data that predict future observations.
  • Which features are real? Coincidental?
  • Evaluate the model: how do you know that it works?
  • During the model construction phase
  • Only incorporate meaningful features
  • After the model is built
  • Validate by predicting new observations

20
Are all prediction errors the same?
  • Symmetry
  • Is over-predicting as costly as under-predicting?
  • Managing inventories and sales
  • Visible costs versus hidden costs
  • Does a false positive cost the same as a false
    negative?
  • Classification in data mining
  • Credit modeling, flagging risky customers
  • False positive: call a good customer bad
  • False negative: fail to identify a bad customer
  • Differential costs for different types of errors
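
One way to act on differential costs is to choose the flagging cutoff from the costs themselves. The sketch below is an illustration, not the talk's procedure: the function names and the two cost figures are hypothetical, and it assumes the model supplies a trustworthy probability of bankruptcy.

```python
def flag_threshold(cost_fp, cost_fn):
    """Probability above which flagging has the lower expected cost.

    Flagging a customer with bankruptcy probability p risks a false
    positive, expected cost (1 - p) * cost_fp. Not flagging risks a
    false negative, expected cost p * cost_fn. Flagging is cheaper
    when p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)

def should_flag(p, cost_fp, cost_fn):
    """True if flagging this customer has the lower expected cost."""
    return p > flag_threshold(cost_fp, cost_fn)

# Hypothetical costs: annoying a good customer costs 50,
# missing a bankruptcy costs 5000.
threshold = flag_threshold(cost_fp=50, cost_fn=5000)
print(f"flag when Pr(bankrupt) > {threshold:.4f}")
```

With equal costs the cutoff is 0.5; when a missed bankruptcy is far more expensive than a false alarm, the cutoff drops well below that.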

21
Building a Predictive Model
  • So many choices
  • Structure: What type of model?
  • Neural net
  • CART, classification tree
  • Additive model or regression spline
  • Identification: Which features to use?
  • Time lags, natural transformations
  • Combinations of other features
  • Search: How does one find these features?
  • Brute force has become cheap.

22
Our Choices
  • Structure
  • Linear regression with nonlinearity via
    interactions
  • All 2-way and some 3-way, 4-way interactions
  • Missing data handled with indicators
  • Identification
  • Conservative standard error
  • Comparison of conservative t-ratio to adaptive
    threshold
  • Search
  • Forward stepwise regression
  • Coming: dynamically changing list of features
  • Good choice affects where you search next.
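
The structural choices above (missing-data indicators plus 2-way interactions) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the helper name, the toy variables, and the decision to fill missing values with the observed mean are all assumptions.

```python
from itertools import combinations

def expand_features(rows, names):
    """rows: list of dicts mapping name -> number, or None if missing."""
    cols = {}
    for name in names:
        observed = [r[name] for r in rows if r[name] is not None]
        fill = sum(observed) / len(observed)   # assumed fill value: observed mean
        cols[name] = [fill if r[name] is None else r[name] for r in rows]
        if len(observed) < len(rows):
            # The indicator lets a regression fit a separate level for missing cases.
            cols[name + "_missing"] = [1.0 if r[name] is None else 0.0 for r in rows]
    base = list(cols)
    # All 2-way interactions among the base columns, indicators included.
    for a, b in combinations(base, 2):
        cols[a + "*" + b] = [u * v for u, v in zip(cols[a], cols[b])]
    return cols

# Hypothetical toy data; the variable names are illustrative only.
rows = [
    {"balance": 1200.0, "cash_advance": 0.0},
    {"balance": None,   "cash_advance": 300.0},
    {"balance": 800.0,  "cash_advance": 100.0},
]
features = expand_features(rows, ["balance", "cash_advance"])
print(sorted(features))
```

Two raw variables already become six candidate features here, which hints at how quickly the candidate list grows once interactions are allowed.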

23
Identifying Predictive Features
  • Classical problem of variable selection
  • Thresholding methods (compare t-ratio to
    threshold)
  • Akaike information criterion (AIC)
  • Bayes information criterion (BIC)
  • Hard thresholding and Bonferroni
  • Arguments for adaptive thresholds
  • Empirical Bayes
  • Information theory
  • Step-up/step-down tests

24
Adaptive Thresholding
  • Threshold changes to conform to attributes of
    data
  • Easier to add features as more are found.
  • Threshold for first predictor
  • Compare conservative t-ratio to Bonferroni.
  • Bonferroni is about Sqrt(2 log p)
  • If something significant is found, continue.
  • Threshold for second predictor
  • Compare t-ratio to reduced threshold
  • New threshold is about Sqrt(2 log(p/2))
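
The thresholds on this slide can be tabulated directly. The Python sketch below assumes the natural logarithm and reads the pattern as sqrt(2 log(p/q)) for the q-th predictor; the function name is hypothetical, and p = 100,000 is borrowed from the bankruptcy example's "more than 100,000 predictors".

```python
import math

def adaptive_threshold(p, q):
    """Approximate cutoff for the t-ratio of the q-th predictor
    when searching among p candidates: sqrt(2 * log(p / q))."""
    return math.sqrt(2 * math.log(p / q))

p = 100_000   # candidate count, taken from the bankruptcy example
for q in (1, 2, 10, 39):
    print(f"q = {q:2d}: threshold = {adaptive_threshold(p, q):.2f}")
```

The cutoff starts near the Bonferroni level and relaxes slowly as predictors are added, which is what makes it easier to add features as more are found.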

25
Adaptive Thresholding: Benefits
  • Easy: as easy and fast as implementing the
    standard criterion that is used in stepwise
    regression.
  • Theory: resulting model provably as good as best
    Bayes model for the problem at hand.
  • Real world: it works! Finds models with real
    signal, and stops when the signal runs out.

26
Bankruptcy Model: Construction
  • Data: reserve 80% for validation
  • Training data
  • 600,000 months
  • 458 bankruptcies
  • Validation data
  • 2,400,000 months
  • 1786 bankruptcies
  • Selection via adaptive thresholding
  • Compare sequence of t-statistics to
    Sqrt(2 log(p/q))
  • Dynamic expansion of feature space
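
The selection step can be sketched as a small forward stepwise search. The Python below is a simplified stand-in, not the authors' implementation: it uses the ordinary least-squares t-ratio rather than the conservative standard error mentioned in the talk, a fixed candidate list rather than a dynamically expanding one, and synthetic data. All function names are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def center(v):
    m = sum(v) / len(v)
    return [u - m for u in v]

def forward_stepwise(columns, y):
    """Greedy forward selection; returns column indices in order of entry."""
    n, p = len(y), len(columns)
    resid = center(y)                       # residuals of the intercept-only fit
    cands = {j: center(c) for j, c in enumerate(columns)}
    selected = []
    while cands and n - len(selected) - 2 > 0:
        q = len(selected) + 1               # position of the next predictor
        rss = dot(resid, resid)
        best_j, best_t = None, 0.0
        for j, x in cands.items():
            xx = dot(x, x)
            if xx < 1e-10:                  # nothing left after adjusting for the model
                continue
            xr = dot(x, resid)
            rss_new = max(rss - xr * xr / xx, 1e-12)   # residual SS if x were added
            sigma = math.sqrt(rss_new / (n - q - 1))
            t = abs(xr) / (math.sqrt(xx) * sigma)      # ordinary OLS t-ratio
            if t > best_t:
                best_j, best_t = j, t
        if best_j is None or best_t < math.sqrt(2 * math.log(p / q)):
            break                           # best candidate fails the cutoff: stop
        x = cands.pop(best_j)
        xx = dot(x, x)
        b = dot(x, resid) / xx
        resid = [r - b * u for r, u in zip(resid, x)]
        # Remove the new predictor's direction from the remaining candidates.
        for k in list(cands):
            c = dot(cands[k], x) / xx
            cands[k] = [z - c * u for z, u in zip(cands[k], x)]
        selected.append(best_j)
    return selected

# Hypothetical data: 50 noise columns, of which columns 0 and 3 drive y.
rng = random.Random(1)
n, p = 200, 50
columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [2 * columns[0][i] + 3 * columns[3][i] + rng.gauss(0, 1) for i in range(n)]
selected = forward_stepwise(columns, y)
print("selected columns, in order of entry:", selected)
```

The two real predictors should enter first. Because the cutoff relaxes as q grows, an occasional noise column may also slip in on a small problem like this one, which is one reason the validation sample remains the final check.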

27
Bankruptcy Model: Preview
  • Predictors
  • Initial search identifies 39
  • Validation SS monotonically falls to 1650
  • Linear fit can do no better than 1735
  • Expanded search of higher interactions finds a
    bit more
  • Nature of predictors comprising the interactions
  • Validation SS drops 10 more
  • Validation: lift chart
  • Top 1000 candidates have 351 bankrupt
  • More validation: calibration
  • Close to actual Pr(bankrupt) for most groups.

28
Bankruptcy Model: Fitting
  • Where should the fitting process be stopped?

29
Bankruptcy Model: Fitting
  • Our adaptive selection procedure stops at a model
    with 39 predictors.

30
Bankruptcy Model: Validation
  • The validation indicates that the fit gets better
    while the model expands. Avoids over-fitting.

31
Bankruptcy Model: Linear?
  • Choosing from linear predictors (no interactions)
    does not match the performance of the full search.

32
Bankruptcy Model: More?
  • Searching higher-order interactions offers modest
    improvement.

33
Lift Chart
  • Measures how well model classifies sought-for
    group
  • Depends on rule used to label customers
  • Very high threshold: lots of lift, but few
    bankrupt customers are found.
  • Lower threshold: lift drops, but finds more
    bankrupt customers.
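
Lift can be computed as the hit rate among the highest-scoring cases divided by the overall rate. The Python sketch below shows one such calculation; the function name and toy scores are hypothetical. It also plugs in the figures quoted earlier in the talk, assuming the top 1000 candidates were drawn from the validation data.

```python
def lift_at_k(scores, labels, k):
    """Lift of the k highest-scoring cases over random selection.

    scores: model scores (higher = riskier); labels: 1 if bankrupt, else 0.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    rate_top = sum(label for _, label in ranked[:k]) / k
    rate_all = sum(labels) / len(labels)
    return rate_top / rate_all

# Figures quoted in the talk: 351 bankrupt among the top 1000 candidates,
# against 1786 bankruptcies in 2,400,000 validation months.
talk_lift = (351 / 1000) / (1786 / 2_400_000)
print(f"lift in the top 1000: about {talk_lift:.0f}x")

# Hypothetical toy scores to exercise the function.
toy_lift = lift_at_k([0.9, 0.8, 0.3, 0.2, 0.1, 0.05], [1, 1, 0, 0, 0, 0], k=2)
print(f"toy lift at k=2: {toy_lift:.1f}")
```

Raising k corresponds to lowering the threshold: more bankrupt customers are captured, but the hit rate in the flagged group, and so the lift, declines.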

34
Generic Lift Chart
[Figure: lift curve for a model compared with random selection]
35
Bankruptcy Model: Lift
  • Much better than diagonal!

36
Calibration
  • Classifier assigns a Prob(BR) rating to a
    customer.
  • Weather forecast
  • Among those classified as 2/10 chance of BR,
    how many are BR?
  • Closer to diagonal is better.
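
A calibration check of this kind can be sketched by grouping cases on their predicted probability and comparing with the observed rate in each group. The Python below is a minimal illustration with hypothetical names and toy data; ten equal-width bins is an assumption, not the talk's choice.

```python
def calibration_table(probs, outcomes, n_bins=10):
    """Group cases by predicted probability and compare to what happened.

    Returns one (mean_predicted, observed_rate, count) tuple per
    non-empty bin, in order of increasing predicted probability.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 goes in the top bin
        bins[idx].append((p, y))
    table = []
    for cases in bins:
        if cases:
            mean_pred = sum(p for p, _ in cases) / len(cases)
            observed = sum(y for _, y in cases) / len(cases)
            table.append((mean_pred, observed, len(cases)))
    return table

# Hypothetical predictions: four cases rated 0.25 and four rated 0.85.
probs = [0.25] * 4 + [0.85] * 4
outcomes = [1, 0, 0, 0, 1, 1, 1, 0]
for mean_pred, observed, count in calibration_table(probs, outcomes):
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  (n={count})")
```

Plotting observed rate against mean predicted probability gives the calibration chart; points below the diagonal mean the classifier over-predicts risk for that group.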

37
Bankruptcy Model: Calibration
  • Over-predicts risk above claimed probability 0.4

38
Summary of Bankruptcy Model
  • Automatic, adaptive selection
  • Finds patterns that predict new observations
  • Predictive, but not easy to explain
  • Dynamic feature set
  • Current research
  • Information theory allows changing search space
  • Finds more structure than direct search could
    find
  • Validation
  • Essential only for judging fit.
  • Better than hand-made models that take years to
    create.

39
So, where's the profit in DM?
  • Automated modeling has become very powerful,
    avoiding problems of over-fitting.
  • Role for expert judgment remains
  • What data to use?
  • Which features to try first?
  • What are the economics of the prediction errors?
  • Collaboration
  • Data sources
  • Data analysis
  • Strategic decisions