Title: Issues in Data Mining Applications Tutorial
1Issues in Data Mining Applications: Tutorial
How to Make a Decision About Your Own Data Mining Tool?
Authors
- Nemanja Jovanovic, nemko_at_sezampro.yu
- Valentina Milenkovic, tina_at_eunet.yu
- Prof. Dr. Veljko Milutinovic, vm_at_etf.bg.ac.yu
2Data Mining vs. Knowledge Mining?
3Instead of a foreword
- If you are not able to swim in the ocean of data, you will drown.
4Tutorial Content
This tutorial will guide you through the following sections:
- What does Data Mining really mean?
- Successful Data Mining
- Comparison of fourteen DM tools
- How to improve existing Data Mining applications?
- Potential applications
- Myths and facts about Data Mining
- Two case studies
- The future of DM applications
5Other definitions of Data Mining
- Data mining is the (semi)automatic discovery of patterns, associations, anomalies, and changes in data.
- Data mining, on the other hand, extracts information from a database that the user did not know existed.
- Also, data mining is the search for relationships and global patterns that exist in large databases but are 'hidden' among the vast amount of data.
6The Foundations of Data Mining
- Massive data collection
- Powerful multiprocessor computers
- Data Mining algorithms
[Figure: volume of data, 1970 to 2000]
7Evolution Of Data Mining
8Data Mining context
- Application domain
- Data mining problem type
- Technical aspect
- Data mining tools and techniques
9Data Mining Techniques
- Artificial Neural Networks
- Decision Trees
- Genetic Algorithms
- Rule Induction
- K-Nearest Neighbor (k-NN)
- Data Visualization
[Figure: input pattern and output pattern]
10What type of user do DM tools require?
- Cooperation between business and data mining experts
- The less skill and experience business experts have in modeling and using the tools, the more help they need from data mining experts
- Example of financial analysts
11Examples of DM projects to stimulate your imagination
- Here are six examples of how data mining is helping corporations to operate more efficiently and profitably in today's business environment.
- Targeting a set of consumers who are most likely to respond to a direct mail campaign
- Predicting the probability of default for consumer loan applications
- Reducing fabrication flaws in VLSI chips
- Predicting audience share for television programs
- Predicting the probability that a cancer patient will respond to radiation therapy
- Predicting the probability that an offshore oil well is actually going to produce oil
12Successful Data Mining
- Come up with a precise formulation of the problem you are trying to solve and use the right data
- Have a clearly articulated business problem and then determine whether data mining is the proper solution technology
- Understand and deliver the fundamentals
- Have your technology folks be involved, too
- Visualizing the data mining output in a meaningful way is very important
- Allow the user to interact with the visualization
13Comparison of fourteen DM tools
- Evaluated by four undergraduates inexperienced at data mining, a relatively experienced graduate student, and a professional data mining consultant
- Run under MS Windows 95, MS Windows NT, or Macintosh System 7.5
- Use one of four technologies: Decision Trees, Rule Induction, Neural Networks, or Polynomial Networks
- Solve two binary classification problems, a multi-class classification problem, and a noiseless estimation problem
- Priced from $75 to $25,000
14Comparison of fourteen DM tools
- The Decision Tree products were: CART, Scenario, See5, S-Plus
- The Rule Induction tools were: WizWhy, DataMind, DMSK
- Neural Networks were built from three programs: NeuroShell2, PcOLPARS, PRW
- The Polynomial Network tools were: ModelQuest Expert, Gnosis, a module of NeuroShell2, KnowledgeMiner
15Criteria for evaluating DM tools
- A list of 20 criteria for evaluating DM tools, put into 4 categories
- Capability measures what a desktop tool can do, and how well it does it
- Handles missing data
- Considers misclassification costs
- Allows data transformations
- Quality of testing options
- Has programming language
- Provides useful output reports
- Visualisation
16Visualisation
Rating scale: excellent capability, good capability, some capability, blank = no capability
17Criteria for evaluating DM tools
- Learnability/Usability shows how easy a tool is to learn and use
- Tutorials
- Wizards
- Easy to learn
- User's manual
- Online help
- Interface
18Criteria for evaluating DM tools
- Interoperability shows a tool's ability to interface with other computer applications
- Importing data
- Exporting data
- Links to other applications
- Flexibility
- Model adjustment flexibility
- Customizable work environment
- Ability to write or change code
19Data Input Output Model
Rating scale: excellent capability, good capability, some capability, blank = no capability
20A classification of data sets
- Pima Indians Diabetes data set
- 768 cases of Native American women from the Pima tribe, some of whom are diabetic, most of whom are not
- 8 attributes plus the binary class variable for diabetes per instance
- Wisconsin Breast Cancer data set
- 699 instances of breast tumors, some of which are malignant, most of which are benign
- 10 attributes plus the binary malignancy variable per case
- The Forensic Glass Identification data set
- 214 instances of glass collected during crime investigations
- 10 attributes plus the multi-class output variable per instance
- Moon Cannon data set
- 300 solutions to the equation x = 2 v^2 sin(θ) cos(θ) / g
- the data were generated without adding noise
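A data set of this kind can be regenerated in a few lines. The sketch below assumes the Moon Cannon equation is the standard projectile range formula, with v the muzzle speed, θ the elevation angle, and g the gravitational acceleration. The lunar value of g, the sampling ranges, and the function names are illustrative assumptions, not details from the study.

```python
import math
import random

G_MOON = 1.62  # m/s^2; assumed lunar gravity, not stated in the tutorial


def cannon_range(v, theta, g=G_MOON):
    """Range x = 2 v^2 sin(theta) cos(theta) / g, with theta in radians."""
    return 2.0 * v ** 2 * math.sin(theta) * math.cos(theta) / g


def make_moon_cannon(n=300, seed=0):
    """Generate n noiseless (v, theta, x) cases; sampling ranges are hypothetical."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        v = rng.uniform(10.0, 100.0)           # muzzle speed in m/s (assumed range)
        theta = rng.uniform(0.0, math.pi / 2)  # elevation angle in radians
        rows.append((v, theta, cannon_range(v, theta)))
    return rows


data = make_moon_cannon()
```

Because no noise is added, a tool that recovers the functional form exactly should reach near-zero error on such data, which is what makes it a useful estimation benchmark.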
21Evaluation of fourteen DM tools
22Strengths and Weaknesses
- Strengths
- Ease of use (Scenario, WizWhy...)
- Data visualisation (S-Plus, MineSet...)
- Depth of algorithms (tree options) (CART, See5, S-Plus...)
- Multiple neural network architectures (NeuroShell)
- Weaknesses
- Difficult file I/O (OLPARS, CART)
- Limited visualisation (PRW, See5, WizWhy)
- Narrow analysis path (Scenario)
23How to improve existing DM applications
- The top ten points
- Database integration
- no more flat files
- use the millions spent on data warehousing
- Automated model scoring
- without scoring DM is pretty useless
- should be integrated with the driving applications
- Exporting models to other applications
- close the loop between DM and applications that need to use the results (scores)
24How to improve existing DM applications
- Business templates
- a cross-selling-specific application is more valuable than a general modeling tool
- Effort knob
- it is relevant in a way that tuning parameters are not
- Incorporate financial information
- financial information is very important and often available, and should be provided as input to the DM application
25How to improve existing DM applications
- Computed target columns
- allow the user to interactively create a new target variable
- Time-series data
- a year's worth of monthly balance information is qualitatively different from twelve distinct non-time-series variables
- Use versus View
- do not visually present the full model to the user, only the most important levels
- Wizards
- not necessary, but desirable
- prevent human error by keeping the user on track
26Potential Applications
- Data mining has many varied fields of application, some of which are listed below.
- Retail/Marketing
- Identify buying patterns from customers
- Find associations among customer demographic characteristics
- Predict response to mailing campaigns
- Market basket analysis
27Potential Applications
- Banking
- Detect patterns of fraudulent credit card use
- Identify 'loyal' customers
- Determine credit card spending by customer groups
- Find hidden correlations between different financial indicators
- Identify stock trading rules from historical market data
28Potential Applications
- Insurance and Health Care
- Claims analysis, i.e., which medical procedures are claimed together
- Predict which customers will buy new policies
- Identify behaviour patterns of risky customers
- Identify fraudulent behaviour
29Potential Applications
- Transportation
- Determine the distribution schedules among outlets
- Analyse loading patterns
- Medicine
- Characterise patient behaviour to predict office visits
- Identify successful medical therapies for different illnesses
- Predict the effectiveness of surgical procedures or medical tests
30Potential Applications
- Sport
- To make the best choice about players in different circumstances
- To predict the results of relevant matches
- To make a better list of seeded players in groups or tournaments
- DM report from an NBA game:
- When Price was Point-Guard, J.Williams missed 0% (0) of his jump field-goal-attempts and made 100% (4) of his jump field-goal-attempts.
- The total number of such field-goal-attempts was 4.
31DM and Customer Relationship Management
- CRM is a process that manages the interactions between a company and its customers
- Users of CRM software applications are database marketers
- Goals of database marketers are:
- identifying market segments, which requires significant data about prospective customers and their buying behaviors
- building and executing campaigns
- Tightly integrating the two disciplines presents an opportunity for companies to gain competitive advantage
32DM and Customer Relationship Management
- How Data Mining helps Database Marketing
- Scoring
- The role of Campaign Management Software
- Increasing the customer lifetime value
- Combining Data Mining and Campaign Management
33DM and Customer Relationship Management
- Evaluating the benefits of a Data Mining model
Gains chart
Profitability chart
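The numbers behind a gains chart can be computed directly from model scores and observed outcomes. The sketch below is a minimal illustration that assumes a binary response (1 = responded); the helper name `cumulative_gains` and all scores and outcomes are made up for the example.

```python
def cumulative_gains(scores, responses, n_bins=5):
    """Return the cumulative share of responders captured after each bin,
    with cases ranked from highest to lowest model score."""
    ranked = sorted(zip(scores, responses), key=lambda pair: pair[0], reverse=True)
    total = sum(r for _, r in ranked)
    n = len(ranked)
    gains = []
    for b in range(1, n_bins + 1):
        cutoff = round(b * n / n_bins)  # number of cases contacted so far
        captured = sum(r for _, r in ranked[:cutoff])
        gains.append(captured / total)
    return gains


# Made-up scores and outcomes for ten customers
scores = [0.9, 0.8, 0.75, 0.6, 0.55, 0.4, 0.35, 0.3, 0.2, 0.1]
responses = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
gains = cumulative_gains(scores, responses)
```

In this toy file, contacting the top 20% of customers by score reaches half of all responders, whereas random selection would be expected to reach about 20% of them.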
34Myths and Facts about Data Mining
- Myth: DM produces surprising results that will utterly transform your business.
- Myth: DM techniques are so sophisticated that they can substitute for domain knowledge or for experience in analysis and model building.
- Myth: DM tools automatically find the patterns you are looking for, without being told what to do.
35Myths and Facts about Data Mining
- Myth: Data mining is more effective with more data, so all existing data should be brought into any data-mining effort.
- Myth: Building a DM model on a sample of a database is ineffective, because sampling loses the information in the unused data.
- Myth: Data mining is another fad that will soon fade, allowing us to return to standard business practice.
36Myths and Facts about Data Mining
- Myth: DM is useful only in certain areas, such as marketing, sales, and fraud detection.
- Myth: The methods used in DM are fundamentally different from the older quantitative model-building techniques.
- Myth: Data mining is an extremely complex process.
- Myth: Only massive databases are worth mining.
37Data Mining Examples
- Bass Brewers: "We've been brewing beer since 1777; with increased competition comes a demand to make faster, better informed decisions."
- Northern Bank: "The information is now more accessible, paperless and timely."
- TSB Group Plc: "We are using Holos because of its flexibility and its excellent multidimensional database."
38Data Mining Examples
- Delphic Universities: "Real value is added to data by multidimensional manipulation (being able to easily compare many different views of the available information in one report) and by modeling."
- Harvard - Holden: "Sybase technology has allowed us to develop an information system that will preserve this legacy into the twenty-first century."
- J.P.Morgan: "The promise of data mining tools like Information Harvester is that they are able to quickly wade through massive amounts of data to identify relationships or trending information that would not have been available without the tool."
39Case study of Breast Cancer Survival Analysis
- Case study of the influence of various patient characteristics on survival rates for breast cancer
- The survival analysis technique employed is Cox Regression (this technique is useful in situations where some of the patients do not die during the observation period)
- A linear regression technique could be used if all patients had died during the observation period
40Case study of Breast Cancer Survival Analysis
- The observation period runs for 133.8 months
- The modeling sample contains 746 patients (50 who died during the observation period and 696 who survived beyond the end of the observation period)
- In this example, we are testing only four predictors:
- Age, in years, at the start of the observation period (22 to 88)
- Pathological tumor size, in centimeters (0.10 to 7.00)
- Number of positive axillary lymph nodes (0 to 35)
- Estrogen receptor status (positive vs. negative)
41Case study of Breast Cancer Survival Analysis
- The Cox Regression used a backward stepwise likelihood-ratio variable selection method
- Significance criteria were set at 0.05 for inclusion in the model, and 0.10 for removal from the model
- Printout from the final step of the stepwise regression analysis:
________________ Variables in the Equation ______________
Variable        B    S.E.     Wald  df    Sig       R  Exp(B)
AGE        -.0314   .0121   6.7486   1  .0094  -.0893   .9691
PATHSIZE    .3975   .1175  11.4476   1  .0007   .1259  1.4881
LNPOS       .1372   .0361  14.4100   1  .0001   .1443  1.1471
_______________________________________________________
- The column labeled "Sig" shows the statistical significance of included variables
- The column labeled "R" shows the degree of unique correlation with the dependent variable
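Two of the printout's columns can be recomputed from B and S.E. alone, which is a useful sanity check when reading such output. The sketch below assumes the usual conventions for this kind of printout: the Wald statistic with one degree of freedom is (B / S.E.) squared, and Exp(B) is e raised to B, the multiplicative change in hazard per one-unit increase in the variable. The function names are illustrative.

```python
import math

# (variable, B, S.E.) copied from the printout above
coefficients = [
    ("AGE", -0.0314, 0.0121),
    ("PATHSIZE", 0.3975, 0.1175),
    ("LNPOS", 0.1372, 0.0361),
]


def wald(b, se):
    """Wald statistic for one coefficient: (B / S.E.) squared."""
    return (b / se) ** 2


def hazard_ratio(b):
    """Exp(B): multiplicative change in hazard per one-unit increase."""
    return math.exp(b)


for name, b, se in coefficients:
    print(f"{name:9s} Wald={wald(b, se):8.4f}  Exp(B)={hazard_ratio(b):.4f}")
```

The recomputed Wald values differ slightly from the printed ones because B and S.E. are shown rounded to four decimals. An Exp(B) above 1 indicates a hazard that rises with the variable; a value below 1 indicates a hazard that falls.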
42Case study of Breast Cancer Survival Analysis
- Some key things to note are:
- Estrogen status was removed as a predictor because it did not reach the 0.05 significance criterion for inclusion
- Number of positive axillary lymph nodes was the strongest predictor of survival rates over the course of the observation period (R = .1443, Sig = .0001), followed by pathological tumor size (R = .1259, Sig = .0007)
- Age, although significant, is somewhat less influential than the other two predictors (R = -.0893, Sig = .0094)
- Note that both the number of positive axillary lymph nodes and the pathological tumor size are positively correlated with the dependent variable, which means that they are directly associated with more rapid mortality.
- Age is negatively correlated with the dependent variable, which means that older age is predictive of somewhat longer survival.
43Case study of Breast Cancer Survival Analysis
- All patients survive through the 10th month of the observation period
- At the fortieth month, the mortality rate increases and continues at this fairly constant increased rate through the forty-fifth month
- At the forty-fifth month, there is a five-month period without additional mortality
- 11% of the original sample has died
The following chart shows the cumulative
survival function during the observation period
44Case study of Breast Cancer Survival Analysis
- Conclusions and Implications
- The case study presented here is relatively simple, and is for illustrative purposes only.
- With the addition of more candidate predictors (progesterone receptor status, histologic grade, blood type etc.), an even more powerful model could emerge.
- By understanding the influence of patient characteristics on mortality rates over time, we are in a better position to estimate survival times for individual patients, and to defend using different or more aggressive therapeutic approaches for some patients.
45Securities Brokerage Case Study
- Predictive market segmentation model designed to identify and profile high-value brokerage customer segments as targets for special marketing communications efforts.
- The dependent variable for this ordinal CHAID model is brokerage account commission dollars during the past 12 months
- We begin by splitting the client's entire customer file into a modeling sample and a validation sample. (Once the model is built using the modeling sample, we apply it to the validation sample to see how well it works on a sample other than the one on which it was built.)
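The modeling/validation split described above can be sketched as a seeded random partition. The 50/50 proportion, the function name, and the customer IDs below are assumptions made for illustration; the case study does not state how the file was divided.

```python
import random


def split_customer_file(records, validation_fraction=0.5, seed=42):
    """Randomly split records into (modeling, validation) samples."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


customer_ids = list(range(1000))  # hypothetical customer file
modeling, validation = split_customer_file(customer_ids)
```

Fixing the seed keeps the split reproducible, so the same validation customers are held out each time the model is rebuilt.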
46Securities Brokerage Case Study
- The resulting CHAID model has 55 segments.
- However, the results are summarized in the
following comb chart, showing the segment indexes
(indexes of average dollar value)
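A segment index of this kind is commonly computed as 100 times the segment's average dollar value divided by the average for the whole file, so that 100 means "typical". The sketch below assumes that convention; the function name and the three segments are hypothetical and are not taken from the case study.

```python
def segment_indexes(segments):
    """segments maps a segment name to (number_of_accounts, average_dollars).
    Returns each segment's index, where 100 equals the overall file average."""
    total_accounts = sum(n for n, _ in segments.values())
    total_dollars = sum(n * avg for n, avg in segments.values())
    overall_average = total_dollars / total_accounts
    return {name: 100.0 * avg / overall_average
            for name, (n, avg) in segments.items()}


# Hypothetical segments, not taken from the case study
segments = {"A": (100, 300.0), "B": (300, 120.0), "C": (600, 50.0)}
indexes = segment_indexes(segments)
```

Here segment A indexes at 312.5, meaning its accounts are worth a little over three times the file average of 96 dollars.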
47Securities Brokerage Case Study
Part of the gains chart: Average Annual Brokerage Commission Dollars
- The gains chart provides quantitative detail useful for financial and marketing planning.
- We have highlighted the top 20% of the file in blue
- The top 20% of the file is worth an average of about $334 per account, which is nearly three times the average account value for the entire sample.
...
48Securities Brokerage Case Study
- Using the information in the gains chart, we can better plan our communications/promotion budget.
- In general, the best segments represent customers who are experienced, aggressive, self-directed traders.
- The gains chart and the segmentation rules can help us make other decisions as well:
- We might wish to conduct some market research among customers in under-performing segments, or among under-performing customers in the better segments
- We can use the segment definitions to help us identify possible issues and question areas to include in the survey
- Before we try to apply such a model, we perform a validation against a holdout sample, to confirm that it is a good model.
49The future of DM applications
- Different opinions
- Very little functionality in DB systems to support DM applications
- Data mining, as a vital application, is just one more advance in the on-going research process
- Data mining will not go away
T h e E n d