Title: Hands-on Workshop on Data Mining
1Hands-on Workshop on Data Mining
2PART I
- INTRODUCTION TO DATA MINING
3Introduction
- What is Data?
- Data, Information, Knowledge.
- What is Mining?
4Machine Learning vs. Knowledge Engineering
Machine Learning
Knowledge Engineering (Expert Systems)
Samples
Rules
Learning Systems
System
Decision Making (Rules)
Output (Applying the Rules)
e.g. MYCIN (Medical Diagnosis System)
5Expert Systems (Example)
10Definition
- Data Mining is the process of exploration and
analysis, by automatic or semi-automatic means,
of large quantities of data in order to discover
meaningful patterns, relationships and rules.
- What comes next?
- 5, 7, 10, 14, 19, ____
- Khairul likes 252 but not 422; he likes 900 but not 800; he likes 144 but not 134. Which does he like: 1800 or 1700?
- Which of the figures below the line of drawings best completes the series?
?
25
11Data Mining vs. Other Techniques
Statistics, Query, Reporting, OLAP, ...
Hypothesis-Free
Hypothesis
Not suitable for large databases and data
warehouses within the time limits.
- Why are my discount coupons not attracting the sort of return I was expecting?
- How can I increase the share I have of my customers' total spending on electronic goods?
- How can I get my other stores to match the incredibly successful sales figures of the main branch?
- Volume of TVs sold in one store last month.
- Analyze the price sensitivity of a new line of TVs.
- Compare the sales of various products in different stores over time.
- Hypotheses: the manager knows that there are stores, products, sensitivity and sales figures, and he is checking out the interrelationships.
12Traditional Data Analysis
Hypothesis
Query Language
Graphics Statistics OLAP
Output
Database
13Relationship between Data Mining and Statistics
- Statistics is closest to data mining.
- Many of the analyses now done with data mining, such as predictive models or discovering associations in databases, have long been used in statistics.
14Data Mining for Business Intelligence
- Business intelligence: all of the processes,
techniques and tools that support business
decision-making based on information technology.
The approaches can range from a simple
spreadsheet to a major competitive intelligence
undertaking. Data mining is an important new
component of business intelligence.
15Data Mining and Business Intelligence
Positioning of different business intelligence technologies according to their potential value as a basis for tactical and strategic business decisions
Layers of the pyramid, from top to bottom:
- Making Decisions
- Data Presentation: Visualization Techniques
- Data Mining: Information Discovery
- Data Exploration: OLAP, MDA, Statistical Analysis and Querying and Reporting
- Data Warehouses / Data Marts
- Data Sources: Paper, Files, Information Providers, Database Systems
Roles: Decision Maker, Business Analyst, Data Analyst, Database Administrator
Increasing potential to support business decisions
The value of the information to support decision making increases from the bottom of the pyramid to the top.
16Data Mining Applications
- Fortune (a financial magazine) publishes an annual report of the best 500 companies; 80 of them are using data mining for decision support.
17Data Mining Applications
- The main three business areas where data mining is applied are:
- (1) Market Management
- Target Marketing
- Customer relationship management
- Market basket analysis
- Cross Selling
- (2) Risk Management
- Forecasting
- Customer retention
- Improved underwriting
- Quality control
- Competitive Analysis
- (3) Fraud Management
- Fraud detection
18Market Management Applications
- The organization builds a database of customer product preferences and lifestyles from such sources as credit card transactions, loyalty cards, warranty cards, discount coupons, entries to free prize drawings and customer complaint calls.
- Data mining algorithms then surf through the data looking for clusters of model consumers who all share the same characteristics (for example income, interests and spending habits).
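The slide does not name a clustering algorithm. As a rough illustration of what "looking for clusters" can mean, here is a minimal one-dimensional k-means style sketch. The income figures and the two starting centers are made up for the example, and a real tool would cluster on many fields at once.

```python
def kmeans_1d(values, centers, iterations=10):
    """Assign each value to its nearest center, then move each center to its group's mean."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            groups[nearest].append(value)
        # Keep the old center if a group ended up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Hypothetical yearly incomes (in thousands) for six customers.
incomes = [20, 22, 25, 60, 62, 65]
centers, groups = kmeans_1d(incomes, centers=[20.0, 65.0])
print(groups)
```

With this toy input the customers fall into a lower-income group and a higher-income group; each group could then be profiled on interests and spending habits.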
19Determining customer purchasing patterns over time
- Examples
- The sequence in which they take up financial services as their family grows
- How they change their cars.
- Converting a single bank account to a joint account indicates marriage, which could lead to future opportunities for loans, insurance, study fees....
- By understanding these patterns the organization can advertise just-in-time.
20Improving Catalog Telesales
- The goal is to track the products customers order most frequently and to suggest those products in future orders.
- Some product associations are obvious: cameras and film, radios and batteries...
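One simple way to surface such associations is to count how often two products appear in the same order. The sketch below does only that; the order contents are invented for illustration, and a production system would typically add measures such as support and confidence.

```python
from collections import Counter
from itertools import combinations

# Hypothetical catalog orders, each a set of products bought together.
orders = [
    {"camera", "film"},
    {"radio", "batteries"},
    {"camera", "film", "batteries"},
    {"radio", "batteries", "tape"},
]

# Count every unordered product pair that occurs within one order.
pair_counts = Counter()
for order in orders:
    for pair in combinations(sorted(order), 2):
        pair_counts[pair] += 1


def suggestions(product):
    """Return the products most often ordered together with `product`."""
    related = Counter()
    for (a, b), count in pair_counts.items():
        if a == product:
            related[b] += count
        elif b == product:
            related[a] += count
    return related.most_common()


print(suggestions("camera"))
```

A telesales agent taking an order for a camera could then be prompted with the top entry of `suggestions("camera")`.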
21Loyalty Cards
- To reward your frequent buyers.
- Cardholders get special treatment, such as exclusive discounts on selected items, to encourage them to do more shopping at the shop and make them less likely to visit the competition.
22Risk Management Applications
- Risk associated with insurance or investments.
- Risk associated with competitive threats to the business.
- Risk associated with poor product quality.
- Risk associated with customer attrition (i.e. the loss of customers, especially to competitors; examples are found in the retail, finance and telecommunications fields).
- The idea here is to build a model of a vulnerable customer who shows characteristics typical of one who is likely to leave for a competing company.
23Risk Management Applications
- Example: customer losses may frequently follow a change of address or a recent protracted exchange with an agent of the company.
- One US bank uses such models to predict the loss of customers up to one year in advance.
- Another bank analyzes more than one million credit card account histories to ensure that it is not over-exposed to high rates of attrition.
24Risk Management Applications
- Telecommunication companies have several billion
dollars in uncollectible debts every year. Data
mining can build models that help predict whether
a particular account is likely to be collectible
and is therefore worth going after.
25Forecasting Financial Future
- If changes in financial behavior can be predicted, the organization can adjust its investment strategy and capitalize on the predicted changes.
- Example: the ability to forecast the right price of a future, which is a contract that allows someone to buy something at a certain price on a certain date in the future.
26Pricing Strategy in a Highly Competitive Market
- A chain of gasoline stations used data mining to develop profitable pricing strategies in a very competitive marketplace, by developing a model that helps to determine:
- Appropriate pricing for its products on a day-to-day basis, with a view to maximizing sales and profits.
- Sales volumes and profitability.
- The likely competitive reaction to their price changes.
- The likely profitability of a new station.
27Fraud Management Applications
- Detecting telephone fraud: some of the more important elements (patterns) in building the model are the destination of the call, its duration, and the time of day and week.
- Some sectors suffer more than most, especially those where there are many transactions, such as health care, retail, credit card services and telecommunications.
- The goal is to use historical data to build a model of fraudulent behavior and then use data mining to help identify similar instances of this behavior.
28Detecting inappropriate medical treatments
- An insurance company maintains computerized records of every doctor's consultation in Australia, including details on the diagnosis, prescribed drugs and recommended treatment.
- Using traditional data analysis techniques they noticed a rapid increase in the number of prescribed pathology tests.
- Using data mining they were able to identify which combinations of tests were commonly used, to detect invalid combinations and no longer accept them for benefit payment, and to identify that in many cases a certain test had been used for given symptoms.
29Future Application Areas
- Text Mining: words are analyzed in context, for example the word "memory" used in a medical article or a computer article.
- Web Analytics: to develop insights into users' behavior on the internet. For example, today hypertext links are typically fixed; the site developers have provided the most likely links by trying to second-guess what the user wants to do next. With data mining, historical user browsing patterns can be analyzed to dynamically suggest related sites for users to visit.
30Data Mining is not Magic !
- A class of divorced women
- A data mining system discovered that divorced women have distinctly different shopping patterns from those of either single or married women.
- After analyzing the data they found that the data on marital status was much less accurate than the other data because of cultural norms.
31Data Mining is not Magic !
- Missing the Point
- While preparing to mine a database of hospital patient admission records, they found a strange graph of the temperature. Then they discovered that the nurse was likely to record a temperature of 37°C as either 36.9°C or 37.1°C.
(Graph: Population versus Temperature, 35°C to 38°C)
32Data Mining is not Magic !
- Older and wealthy customers were buying large sedans!!
- People born under the sign of Pisces were most prone to accidents!
- Males with incomes between 50k-65k who subscribe to certain magazines are likely purchasers of a certain product!
- DM just assists business analysis by finding patterns and relationships in the data.
- These patterns and relationships are not necessarily causes of an action.
33Data Mining Approaches
- Classification Studies (Supervised Learning)
- I want to understand what makes customers more likely to stay with or to leave my company. (no hypothesis)
- Clustering Studies (Unsupervised Learning)
- What are the products that are likely to be purchased together? (no hypothesis)
34How do we mine data?
- The process of data mining is described as a process of model building.
- Five main steps to data mining
- 1. Data Preparation
- 2. Defining a study
- Reading the data and building a model
- Understanding the model
- 3. Data Mining
- 4. Analysis of Results.
- 5. Assimilation of Knowledge.
35Effort Required for Each Data Mining Process Step
(Bar chart: Effort, on a scale from 10 to 70, for each step: Defining a study, Data Preparation, Data Mining, Analysis of Results and Knowledge Assimilation)
36PART II
- INTRODUCTION TO CLEMENTINE
- Drug Treatment
- (Exploratory Graphs / C5.0)
37The Problem
- Imagine that you are a medical researcher compiling data for a study.
- You have collected data about a set of patients, all of whom suffered from the same illness.
- During their course of treatment, each patient responded to one of five medications.
- Part of your job is to use data mining to find out which drug might be appropriate for a future patient with the same illness.
38The Data Fields
39Data Reading
- Use the Variable File node.
- Open Drug1n
- Select Read field names from file
- Click the Data tab. In the override column, select cholesterol
- Click the Types tab to learn more about the type of fields in your data. Choose Read Values to view the actual values for each field
- Use the Table node to have a glance at the values
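Outside Clementine, the same reading step can be approximated with Python's standard `csv` module. This is only a sketch: the column names, the comma delimiter, the `drugY`-style value labels and the two sample rows are assumptions for illustration, not the contents of the actual Drug1n file.

```python
import csv
import io

# Hypothetical excerpt of a delimited text file whose first line holds the field names.
sample = (
    "BP,Cholesterol,Na,K,Drug\n"
    "HIGH,NORMAL,0.79,0.03,drugY\n"
    "LOW,HIGH,0.74,0.06,drugC\n"
)

# DictReader takes the field names from the first line, like "Read field names from file".
reader = csv.DictReader(io.StringIO(sample))
records = []
for row in reader:
    record = dict(row)
    # Everything is read as text, so the numeric fields are converted explicitly.
    record["Na"] = float(record["Na"])
    record["K"] = float(record["K"])
    records.append(record)

# A quick look at the values, comparable to a glance at a table.
for record in records:
    print(record)
```

For a real file, `io.StringIO(sample)` would be replaced by an opened file object.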
40Exploring the Dataset
- Use the Distribution node to explore the data
- Select Drug as the target field
- Click Execute
- The resulting graph shows that patients responded to drug Y most often and to drugs B and C least often.
- Use the Data Audit node for a quick glance at distributions and histograms for all fields at once.
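What the Distribution node reports is, in essence, a frequency count of one categorical field. A minimal sketch of that count follows; the list of responses is invented and only mimics the shape described above (drug Y most common, drugs B and C least common).

```python
from collections import Counter

# Hypothetical Drug values, one per patient record.
drugs = [
    "drugY", "drugX", "drugY", "drugA", "drugY", "drugX",
    "drugB", "drugY", "drugC", "drugX", "drugA", "drugY",
]

distribution = Counter(drugs)

# Print a crude text bar chart, most frequent category first.
for drug, count in distribution.most_common():
    share = count / len(drugs)
    print(f"{drug:6s} {'#' * count} {share:.0%}")
```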
41What factors might influence Drug?
- As a researcher, you know that the concentrations of sodium and potassium in the blood are important factors.
- Since these are both numeric values, create a scatterplot of sodium versus potassium, using the drug categories as a color overlay.
- Use the Plot node, double click to edit.
- Na vs. K
- Overlay color: Drug
- The plot clearly shows a threshold above which the correct drug is always drug Y and below which the correct drug is never drug Y. This threshold is a ratio: the ratio of sodium (Na) to potassium (K).
42What factors might influence Drug?
- Use a Web graph if many of the data fields are categorical
- A Web graph maps associations between different categories
- Select BP and Drug. Then, Execute
- It appears that drug Y is associated with all three levels of blood pressure.
43What factors might influence Drug?
- To focus on the other drugs, use Hide and Replan.
- After hiding drug Y:
- Only drugs A and B are associated with high blood pressure.
- Only drugs C and X are associated with low blood pressure.
- Normal blood pressure is associated only with drug X.
- At this point, though, you still don't know how to choose between drugs A and B or between drugs C and X, for a given patient. This is where modeling can help.
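The associations a Web graph draws can be thought of as counts of category pairs. The sketch below tabulates (BP, Drug) pairs and then hides drug Y; the records and the `drugY`-style labels are made up so that the result mirrors the findings listed above.

```python
from collections import Counter

# Hypothetical (blood pressure, drug) pairs, one per patient.
records = [
    ("HIGH", "drugA"), ("HIGH", "drugB"), ("HIGH", "drugY"),
    ("LOW", "drugC"), ("LOW", "drugX"), ("LOW", "drugY"),
    ("NORMAL", "drugX"), ("NORMAL", "drugY"),
]


def links(pairs, hidden=()):
    """Count co-occurrences of BP level and drug, skipping hidden drugs."""
    return Counter((bp, drug) for bp, drug in pairs if drug not in hidden)


# Equivalent in spirit to hiding drug Y in the graph.
visible = links(records, hidden={"drugY"})

drugs_by_bp = {}
for (bp, drug), count in visible.items():
    drugs_by_bp.setdefault(bp, set()).add(drug)

for bp in sorted(drugs_by_bp):
    print(bp, sorted(drugs_by_bp[bp]))
```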
44Deriving a New Field
- Since the ratio of sodium to potassium seems to predict when to use drug Y, you can derive a field that contains the value of this ratio for each record. This field might be useful later when you build a model to predict when to use each of the five drugs.
- Insert a Derive node, and edit
- Name: Na_to_K
- Ratio: enter Na/K for the formula, or use the Expression Builder
- Check the distribution of the derived field using a Histogram node; specify Na_to_K as the field to be plotted and Drug as the overlay field.
- Result: when the Na_to_K value is about 15 or above, drug Y is the drug of choice
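The Derive step amounts to computing Na/K for every record. A minimal sketch, assuming records are held as dictionaries; the four sample records are invented, and the cut-off of 15 is the approximate value quoted on the slide.

```python
# Hypothetical patient records with sodium (Na) and potassium (K) levels.
records = [
    {"Na": 0.79, "K": 0.03, "Drug": "drugY"},
    {"Na": 0.74, "K": 0.06, "Drug": "drugX"},
    {"Na": 0.70, "K": 0.04, "Drug": "drugY"},
    {"Na": 0.56, "K": 0.07, "Drug": "drugA"},
]


def derive_na_to_k(record):
    """Return a copy of the record with the derived Na_to_K field added."""
    derived = dict(record)
    derived["Na_to_K"] = record["Na"] / record["K"]
    return derived


derived_records = [derive_na_to_k(r) for r in records]

# Check the rule of thumb: ratio of about 15 or above goes with drug Y.
THRESHOLD = 15
rule_holds = all(
    (r["Na_to_K"] >= THRESHOLD) == (r["Drug"] == "drugY") for r in derived_records
)
print(rule_holds)
```

On real data the rule would be checked visually with the histogram overlay rather than asserted exactly, since the threshold is only approximate.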
45Building a Model
- By exploring and manipulating the data, you have been able to form some hypotheses.
- The ratio of sodium to potassium in the blood seems to affect the choice of drug, as does blood pressure.
- But you cannot fully explain all of the relationships yet.
- This is where modeling will likely provide some answers.
- In this case, you will try to fit the data using a rule-building model, C5.0.
46Building a Model
- Since we have a new derived field, Na_to_K, we can filter out the original fields, Na and K, so that they are not used twice in the modeling algorithm.
- Use a Filter node
- Click the arrows next to Na and K.
- Red Xs appear over the arrows to indicate that the fields are now filtered out.
- Connect a Type node to the Filter node, which allows you to indicate the types of fields that you are using and how they are used to predict the outcomes.
- Set the direction for the Drug field to Out (i.e. to be predicted), and the direction of the other fields to In.
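In plain code, the Filter and Type steps correspond to dropping two fields and labeling each remaining field as an input or the target. The sketch below assumes dictionary records; the sample record and the helper name `filter_fields` are hypothetical.

```python
def filter_fields(record, drop):
    """Return a copy of the record without the fields named in `drop`."""
    return {name: value for name, value in record.items() if name not in drop}


# Hypothetical record that already carries the derived Na_to_K field.
record = {
    "BP": "HIGH", "Cholesterol": "NORMAL",
    "Na": 0.79, "K": 0.03, "Na_to_K": 26.33, "Drug": "drugY",
}

# Filter step: Na and K are removed so they are not used twice.
filtered = filter_fields(record, drop={"Na", "K"})

# Type step: Drug is the field to predict (Out), everything else is an input (In).
directions = {name: ("Out" if name == "Drug" else "In") for name in filtered}
inputs = [name for name, direction in directions.items() if direction == "In"]

print(inputs, "->", "Drug")
```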
47Building a Model
- To estimate the model, attach a C5.0 node to the Type node. Then execute.
- Browse the created model.
- Rule Browser
- Viewer: Decision Tree
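C5.0 itself is part of the tool and does considerably more than can be shown here. As a rough idea of how an entropy-based tree learner can choose a split, the sketch below scores candidate thresholds on Na_to_K by information gain. The six rows and three candidate thresholds are invented, and plain information gain is a simplification standing in for the actual C5.0 criterion.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, threshold):
    """Entropy reduction from splitting rows into ratio < threshold and ratio >= threshold."""
    parent = [drug for _, drug in rows]
    left = [drug for ratio, drug in rows if ratio < threshold]
    right = [drug for ratio, drug in rows if ratio >= threshold]
    weighted = sum(
        len(part) / len(parent) * entropy(part) for part in (left, right) if part
    )
    return entropy(parent) - weighted


# Hypothetical (Na_to_K, Drug) rows.
rows = [
    (8.0, "drugX"), (10.5, "drugA"), (12.0, "drugX"),
    (16.0, "drugY"), (19.5, "drugY"), (25.0, "drugY"),
]

candidates = [10.0, 15.0, 20.0]
best = max(candidates, key=lambda t: information_gain(rows, t))
print(best, round(information_gain(rows, best), 3))
```

On this toy data the threshold of 15 separates drug Y from the rest and scores highest, which is the kind of rule the Rule Browser would display as the first split.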
48The Accuracy
- To assess the accuracy of the model, connect the Analysis node to the C5.0 Model node, which is connected to the Type node
- The Analysis node output shows that with this artificial data set, the model correctly predicted the choice of drug for almost every record in the data set.
- With a real data set you are unlikely to see 100% accuracy, but you can use the Analysis node to help determine whether the model is acceptably accurate for your particular application.
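The core figure an accuracy report gives is the share of records where the predicted drug equals the actual drug, often alongside a table of which classes were confused. A minimal sketch of both, using made-up actual and predicted values:

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of records where the prediction matches the actual value."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def confusion(actual, predicted):
    """Count each (actual, predicted) combination."""
    return Counter(zip(actual, predicted))


# Hypothetical outcomes for five patients; the last one is misclassified.
actual = ["drugY", "drugX", "drugA", "drugY", "drugC"]
predicted = ["drugY", "drugX", "drugA", "drugY", "drugX"]

print(f"accuracy: {accuracy(actual, predicted):.0%}")
for (a, p), count in sorted(confusion(actual, predicted).items()):
    print(f"actual={a} predicted={p} count={count}")
```

Whether a given accuracy is acceptable depends on the application; measuring it on records that were not used to build the model gives a more honest estimate.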