Title: Issues in Data Mining Applications Tutorial
1Issues in Data Mining Applications-Tutorial-
How to Make A Decision About Your Own Data Mining
Tool?
Authors
- Nemanja Jovanovic, nemko_at_sezampro.yu
- Valentina Milenkovic, tina_at_eunet.yu
- Prof. Dr. Veljko Milutinovic, vm_at_etf.bg.ac.yu
2Data Mining vs. Knowledge Mining ?
3Evolution Of Data Mining
4Examples of DM projects to stimulate your
imagination
- Here are six examples of how data mining is helping corporations to operate more efficiently and profitably in today's business environment:
  - Targeting a set of consumers who are most likely to respond to a direct mail campaign
  - Predicting the probability of default for consumer loan applications
  - Reducing fabrication flaws in VLSI chips
  - Predicting audience share for television programs
  - Predicting the probability that a cancer patient will respond to radiation therapy
  - Predicting the probability that an offshore oil well is actually going to produce oil
5Comparison of fourteen DM tools
- Evaluated by four undergraduates inexperienced at data mining, a relatively experienced graduate student, and a professional data mining consultant
- Run under MS Windows 95, MS Windows NT, or Macintosh System 7.5
- Use one of four technologies: Decision Trees, Rule Induction, Neural Networks, or Polynomial Networks
- Solve two binary classification problems, a multi-class classification problem, and a noiseless estimation problem
- Price from 75 to 25.000
6Comparison of fourteen DM tools
- The Decision Tree products were: CART, Scenario, See5, S-Plus
- The Rule Induction tools were: WizWhy, DataMind, DMSK
- Neural Networks were built from three programs: NeuroShell2, PcOLPARS, PRW
- The Polynomial Network tools were: ModelQuest Expert, Gnosis, a module of NeuroShell2, KnowledgeMiner
7Criteria for evaluating DM tools
- A list of 20 criteria for evaluating DM tools, put into 4 categories
- Capability measures what a desktop tool can do, and how well it does it
  - Handles missing data
  - Considers misclassification costs
  - Allows data transformations
  - Quality of testing options
  - Has programming language
  - Provides useful output reports
  - Visualisation
8Visualisation
[Table: tool ratings. Legend: excellent capability, good capability, some capability, blank = no capability]
9Criteria for evaluating DM tools
- Learnability/Usability shows how easy a tool is to learn and use
  - Tutorials
  - Wizards
  - Easy to learn
  - User's manual
  - Online help
  - Interface
10Criteria for evaluating DM tools
- Interoperability shows a tool's ability to interface with other computer applications
  - Importing data
  - Exporting data
  - Links to other applications
- Flexibility
  - Model adjustment flexibility
  - Customizable work environment
  - Ability to write or change code
11Data Input Output Model
[Table: tool ratings. Legend: excellent capability, good capability, some capability, blank = no capability]
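Ratings of this kind can be combined into a rough overall ranking. The following is a minimal sketch, assuming a 3/2/1/0 point scale for excellent/good/some/no capability and category weights chosen by the evaluator. The tool names, ratings, and weights are hypothetical placeholders, not results from the evaluation.

```python
# Assumed numeric scale for the four rating levels of the legend.
RATING_POINTS = {"excellent": 3, "good": 2, "some": 1, "none": 0}


def category_score(ratings):
    """Average the points of one tool's ratings within one category."""
    points = [RATING_POINTS[r] for r in ratings]
    return sum(points) / len(points)


def overall_score(tool_ratings, weights):
    """Weighted sum of category averages; the weights express your own priorities."""
    return sum(
        weights[category] * category_score(ratings)
        for category, ratings in tool_ratings.items()
    )


# Hypothetical ratings for two made-up tools, grouped by the four categories.
tools = {
    "ToolA": {
        "capability": ["excellent", "good", "some"],
        "usability": ["good", "good"],
        "interoperability": ["some", "none"],
        "flexibility": ["excellent"],
    },
    "ToolB": {
        "capability": ["good", "good", "good"],
        "usability": ["excellent", "excellent"],
        "interoperability": ["good", "good"],
        "flexibility": ["some"],
    },
}
weights = {"capability": 0.4, "usability": 0.2, "interoperability": 0.2, "flexibility": 0.2}

# Best tool first, according to the chosen weights.
ranking = sorted(tools, key=lambda name: overall_score(tools[name], weights), reverse=True)
```

Changing the weights changes the ranking, which is the point: the best tool depends on which of the four categories matters most for the intended use.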
12A classification of data sets
- Pima Indians Diabetes data set
  - 768 cases of Native American women from the Pima tribe, some of whom are diabetic, most of whom are not
  - 8 attributes plus the binary class variable for diabetes per instance
- Wisconsin Breast Cancer data set
  - 699 instances of breast tumors, some of which are malignant, most of which are benign
  - 10 attributes plus the binary malignancy variable per case
- The Forensic Glass Identification data set
  - 214 instances of glass collected during crime investigations
  - 10 attributes plus the multi-class output variable per instance
- Moon Cannon data set
  - 300 solutions to the equation x = 2v² sin(θ)cos(θ)/g
  - the data were generated without adding noise
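A noiseless data set of the Moon Cannon kind is easy to regenerate. The sketch below assumes the equation is the standard projectile range formula x = 2v² sin(θ)cos(θ)/g, with v the muzzle speed, θ the launch angle, and g the gravitational acceleration. The value of g and the sampling ranges for v and θ are assumptions, since the slides do not state them.

```python
import math
import random

MOON_GRAVITY = 1.62  # m/s^2, assumed value for g


def cannon_range(v, theta, g=MOON_GRAVITY):
    """Range x = 2 v^2 sin(theta) cos(theta) / g, with theta in radians."""
    return 2.0 * v ** 2 * math.sin(theta) * math.cos(theta) / g


def make_dataset(n=300, seed=0):
    """Generate n noiseless (v, theta, x) rows; the sampling ranges are assumptions."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        v = rng.uniform(10.0, 100.0)
        theta = rng.uniform(0.0, math.pi / 2)
        rows.append((v, theta, cannon_range(v, theta)))
    return rows


data = make_dataset()
```

Because no noise is added, a tool that can represent the formula exactly should reach near-zero error on this estimation problem.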
13Evaluation of fourteen DM tools
14Strengths and Weaknesses
- Strengths
  - Ease of use (Scenario, WizWhy...)
  - Data visualisation (S-Plus, MineSet...)
  - Depth of algorithms (tree options) (CART, See5, S-Plus...)
  - Multiple neural network architectures (NeuroShell)
- Weaknesses
  - Difficult file I/O (OLPARS, CART)
  - Limited visualisation (PRW, See5, WizWhy)
  - Narrow analysis path (Scenario)
15How to improve existing DM applications
- The top ten points
- Database integration
  - no more flat files
  - use the millions spent on data warehousing
- Automated model scoring
  - without scoring DM is pretty useless
  - should be integrated with the driving applications
- Exporting models to other applications
  - close the loop between DM and applications that need to use the results (scores)
16How to improve existing DM applications
- Business templates
  - a cross-selling specific application is more valuable than a general modeling tool
- Effort knob
  - it is relevant in a way that tuning parameters are not
- Incorporate financial information
  - the financial information is very important and often available, and should be provided as input to the DM application
17How to improve existing DM applications
- Computed target columns
  - allow the user to interactively create a new target variable
- Time-series data
  - a year's worth of monthly balance information is qualitatively different from twelve distinct non-time-series variables
- Use versus View
  - do not present the full model visually to the user, only the most important levels
- Wizards
  - not necessary, but desirable
  - prevent human error by keeping the user on track
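The point about time-series data can be illustrated with a small sketch: twelve monthly balances carry an ordering, so a trend can be derived from them that twelve unordered columns cannot express. The function name, the choice of a least-squares slope as the derived feature, and the balances below are illustrative assumptions.

```python
def trend_features(monthly_balances):
    """Summarise an ordered series of monthly balances.

    Returns the mean level and the least-squares slope (change per month).
    The slope is only meaningful because the values are ordered in time.
    """
    n = len(monthly_balances)
    months = range(n)
    mean_month = sum(months) / n
    mean_balance = sum(monthly_balances) / n
    covariance = sum(
        (m - mean_month) * (b - mean_balance)
        for m, b in zip(months, monthly_balances)
    )
    variance = sum((m - mean_month) ** 2 for m in months)
    return {"mean": mean_balance, "slope": covariance / variance}


# Two hypothetical customers with the same twelve values in opposite order.
growing = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200]
shrinking = list(reversed(growing))

growing_features = trend_features(growing)
shrinking_features = trend_features(shrinking)
```

Both customers have the same mean balance, but the slopes have opposite signs, which is exactly the information lost when the months are treated as unrelated variables.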
18Potential Applications
- Data mining has many varied fields of application, some of which are listed below.
- Retail/Marketing
  - Identify buying patterns from customers
  - Find associations among customer demographic characteristics
  - Predict response to mailing campaigns
  - Market basket analysis
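Market basket analysis is usually described in terms of the support and confidence of rules such as "customers who buy bread also buy milk". The following is a minimal sketch of those two measures; the baskets are made-up example data and the function names are illustrative.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "milk"})         # bread and milk together
rule_confidence = confidence(baskets, {"bread"}, {"milk"})  # milk given bread
```

Here bread and milk appear together in 2 of 4 baskets, and 2 of the 3 baskets containing bread also contain milk.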
19Potential Applications
- Banking
  - Detect patterns of fraudulent credit card use
  - Identify 'loyal' customers
  - Determine credit card spending by customer groups
  - Find hidden correlations between different financial indicators
  - Identify stock trading rules from historical market data
20Potential Applications
- Insurance and Health Care
  - Claims analysis, i.e., which medical procedures are claimed together
  - Predict which customers will buy new policies
  - Identify behaviour patterns of risky customers
  - Identify fraudulent behaviour
21Potential Applications
- Transportation
  - Determine the distribution schedules among outlets
  - Analyse loading patterns
- Medicine
  - Characterise patient behaviour to predict office visits
  - Identify successful medical therapies for different illnesses
  - Predict the effectiveness of surgical procedures or medical tests
22Potential Applications
- Sport
  - Make the best choice of players in different circumstances
  - Predict the results of relevant matches
  - Produce a better list of seeded players in groups or tournaments
- DM report from an NBA game
  - When Price was Point-Guard, J. Williams missed 0% (0) of his jump field-goal-attempts and made 100% (4) of his jump field-goal-attempts.
  - The total number of such field-goal-attempts was 4.
23DM and Customer Relationship Management
- CRM is a process that manages the interactions between a company and its customers
- Users of CRM software applications are database marketers
- Goals of database marketers are
  - identifying market segments, which requires significant data about prospective customers and their buying behaviors
  - building and executing campaigns
- Tightly integrating the two disciplines presents an opportunity for companies to gain competitive advantage
24DM and Customer Relationship Management
- How Data Mining helps Database Marketing
- Scoring
- The role of Campaign Management Software
- Increasing the customer lifetime value
- Combining Data Mining and Campaign Management
25DM and Customer Relationship Management
- Evaluating the benefits of a Data Mining model
  - Gains chart
  - Profitability chart
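A gains chart ranks customers by model score and shows what share of all responders is captured as more of the ranked file is contacted. The sketch below computes those cumulative shares. The function name, the bin count, and the scores and responses are illustrative assumptions.

```python
def cumulative_gains(scores, responded, n_bins=10):
    """Return the cumulative share of responders captured after each bin.

    Customers are sorted by model score, best first, and cut into n_bins
    groups of (nearly) equal size.
    """
    pairs = sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    ranked = [r for _, r in pairs]
    total = sum(ranked)
    gains = []
    for b in range(1, n_bins + 1):
        cutoff = round(b * len(ranked) / n_bins)
        gains.append(sum(ranked[:cutoff]) / total)
    return gains


# Hypothetical model scores and actual responses (1 = responded) for ten customers.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
responded = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

gains = cumulative_gains(scores, responded, n_bins=5)
```

In this made-up example the top fifth of the ranked file already contains two of the three responders. A model with no predictive value would capture only about one fifth of them at that point.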
26Data Mining Examples
- Bass Brewers: "We've been brewing beer since 1777. With increased competition comes a demand to make faster, better informed decisions."
- Northern Bank: "The information is now more accessible, paperless and timely."
- TSB Group Plc: "We are using Holos because of its flexibility and its excellent multidimensional database."
27Data Mining Examples
- Delphic Universities: "Real value is added to data by multidimensional manipulation (being able to easily compare many different views of the available information in one report) and by modeling."
- Harvard - Holden: "Sybase technology has allowed us to develop an information system that will preserve this legacy into the twenty-first century."
- J.P. Morgan: "The promise of data mining tools like Information Harvester is that they are able to quickly wade through massive amounts of data to identify relationships or trending information that would not have been available without the tool."
28Case study of Breast Cancer Survival Analysis
- Case study of the influence of various patient characteristics on survival rates for breast cancer
- The survival analysis technique employed is Cox Regression (this technique is useful in situations where some of the patients do not die during the observation period)
- A linear regression technique could be used if all patients had died during the observation period
29Case study of Breast Cancer Survival Analysis
- The observation period runs for 133.8 months
- The modeling sample contains 746 patients (50 patients who died during the observation period and 696 who survived beyond the end of the observation period)
- In this example, we are testing only four predictors
  - Age, in years, at the start of the observation period (22 to 88)
  - Pathological tumor size, in centimeters (0.10 to 7.00)
  - Number of positive axillary lymph nodes (0 to 35)
  - Estrogen receptor status (positive vs. negative)
30Case study of Breast Cancer Survival Analysis
- The Cox Regression used a backward stepwise likelihood-ratio variable selection method
- Significance criteria were set at 0.05 for inclusion in the model, and 0.10 for removal from the model
- Printout from the final step of the stepwise regression analysis

_____________________ Variables in the Equation _____________________
Variable        B     S.E.      Wald   df     Sig        R    Exp(B)
AGE        -.0314    .0121    6.7486    1   .0094   -.0893     .9691
PATHSIZE    .3975    .1175   11.4476    1   .0007    .1259    1.4881
LNPOS       .1372    .0361   14.4100    1   .0001    .1443    1.1471
_____________________________________________________________________

- The column labeled "Sig" shows the statistical significance of included variables
- The column labeled "R" shows the degree of unique correlation with the dependent variable
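Two columns of the printout can be recomputed from the others, which helps when reading it: Exp(B) is the exponential of the coefficient B (the hazard ratio per one-unit increase of the predictor), and for a single coefficient the Wald statistic is (B / S.E.) squared. The sketch below uses the rounded B and S.E. values from the printout, so the Wald values agree with the printed ones only approximately.

```python
import math

# (B, S.E.) pairs copied from the printout.
coefficients = {
    "AGE": (-0.0314, 0.0121),
    "PATHSIZE": (0.3975, 0.1175),
    "LNPOS": (0.1372, 0.0361),
}

summary = {}
for name, (b, se) in coefficients.items():
    summary[name] = {
        "hazard_ratio": math.exp(b),  # corresponds to the Exp(B) column
        "wald": (b / se) ** 2,        # Wald chi-square with 1 degree of freedom
    }

for name, values in summary.items():
    print(f"{name:<9} Exp(B) = {values['hazard_ratio']:.4f}   Wald = {values['wald']:.2f}")
```

Read this way, each additional positive lymph node multiplies the estimated hazard by about 1.15 and each additional centimeter of tumor size by about 1.49, while each additional year of age multiplies it by about 0.97.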
31Case study of Breast Cancer Survival Analysis
- Some key things to note are
  - Estrogen status was removed as a predictor because it did not reach the 0.05 significance criterion for inclusion
  - Number of positive axillary lymph nodes was the strongest predictor of survival rates over the course of the observation period (R = .1443 / Sig = .0001), followed by pathological tumor size (R = .1259 / Sig = .0007)
  - Age, although significant, is somewhat less influential than the other two predictors (R = -.0893 / Sig = .0094)
  - Note that both the number of positive axillary lymph nodes and the pathological tumor size are positively correlated with the dependent variable, which means that they are directly associated with more rapid mortality.
  - Age is negatively correlated with the dependent variable (B = -.0314, Exp(B) = .9691), which means that younger age is predictive of somewhat shorter survival.
32Case study of Breast Cancer Survival Analysis
- All patients survive through the 10th month of the observation period
- At the fortieth month, the mortality rate increases and continues at this fairly constant increased rate through the forty-fifth month
- At the forty-fifth month, there is a five-month period without additional mortality
- 11 of the original sample has died
The following chart shows the cumulative survival function during the observation period
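To make the idea of a cumulative survival function concrete, here is a minimal sketch of the Kaplan-Meier estimator. This is a simpler, non-parametric technique than the Cox Regression used in the case study and is named here only for illustration. Like Cox Regression, it handles patients who are still alive when observation ends (censored cases). The follow-up data are hypothetical.

```python
def kaplan_meier(durations, died):
    """Return [(time, survival probability)] at each time with at least one death.

    durations: months of follow-up per patient.
    died: True if the patient died at that time, False if censored
          (still alive when observation of that patient ended).
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        deaths = sum(1 for d, e in zip(durations, died) if d == t and e)
        leaving = sum(1 for d in durations if d == t)
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored cases both leave the risk set
    return curve


# Hypothetical follow-up data for six patients (not from the case study).
months = [12, 40, 40, 45, 60, 60]
died = [False, True, False, True, False, False]

curve = kaplan_meier(months, died)
```

The estimated survival drops only at months where a death occurs. Censored patients do not lower the curve, but they shrink the number at risk for later months.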
33Case study of Breast Cancer Survival Analysis
- Conclusions and Implications
  - The case study presented here is relatively simple, and is for illustrative purposes only.
  - With the addition of more candidate predictors (progesterone receptor status, histologic grade, blood type etc.), an even more powerful model could emerge.
  - By understanding the influence of patient characteristics on mortality rates over time, we are in a better position to estimate survival times for individual patients, and to defend using different or more aggressive therapeutic approaches for some patients.
34Securities Brokerage Case Study
- Predictive market segmentation model designed to identify and profile high-value brokerage customer segments as targets for special marketing communications efforts.
- The dependent variable for this ordinal CHAID model is brokerage account commission dollars during the past 12 months
- We begin by splitting the client's entire customer file into a modeling sample and a validation sample. (Once the model is built using the modeling sample, we apply it to the validation sample to see how well it works on a sample other than the one on which it was built).
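The split into a modeling sample and a validation sample can be sketched as a random partition of the customer file. The 50/50 proportion, the fixed seed, and the customer identifiers below are assumptions; the case study does not state how the split was made.

```python
import random


def split_customers(customers, validation_fraction=0.5, seed=42):
    """Randomly split a customer list into (modeling, validation) samples."""
    shuffled = list(customers)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


customer_ids = list(range(1000))  # hypothetical customer file
modeling, validation = split_customers(customer_ids)
```

The model is then built only on `modeling`, and its segment averages are compared with those observed in `validation`, which the model has never seen.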
35Securities Brokerage Case Study
- The resulting CHAID model has 55 segments.
- However, the results are summarized in the
following comb chart, showing the segment indexes
(indexes of average dollar value)
36Securities Brokerage Case Study
Part of the Gains Chart: Average Annual Brokerage Commission Dollars
- The gains chart provides quantitative detail useful for financial and marketing planning.
- We have highlighted the top 20% of the file in blue
- The top 20% of the file is worth an average of about $334 per account, which is nearly three times the average account value for the entire sample.
...
37Securities Brokerage Case Study
- Using the information in the gains chart, we can better plan our communications/promotion budget.
- In general, the best segments represent customers who are experienced, aggressive, self-directed traders.
- Other decisions that the gains chart and the segmentation rules can help us make:
  - We might wish to conduct some market research among customers in under-performing segments, or among under-performing customers in the better segments
  - We can use the segment definitions to help us identify possible issues and question areas to include in the survey
- Before we try to apply such a model, we perform a validation against a holdout sample, to confirm that it is a good model.
38The End