Handson Workshop on Data Mining - PowerPoint PPT Presentation

Slides: 49
Provided by: ahmed8

Transcript and Presenter's Notes

Title: Handson Workshop on Data Mining


1
Hands-on Workshop on Data Mining
  • Ahmed M. Zeki
  • 8 Sep 2007

2
PART I
  • INTRODUCTION TO DATA MINING

3
Introduction
  • What is Data?
  • Data, Information, Knowledge.
  • What is Mining?

4
Machine Learning vs. Knowledge Engineering
  • Machine Learning: Samples -> Learning Systems ->
    Decision Making (Rules)
  • Knowledge Engineering (Expert Systems): Rules ->
    System -> Output (Applying the Rules), e.g. MYCIN
    (Medical Diagnosis System)

5
Expert Systems (Example)
6
(No Transcript)
7
(No Transcript)
8
(No Transcript)
9
(No Transcript)
10
Definition
  • Data Mining is the process of exploration and
    analysis, by automatic or semi-automatic means,
    of large quantities of data in order to discover
    meaningful patterns, relationships and rules.
  • What comes next?
  • 5, 7, 10, 14, 19, ____
  • Khairul likes 252 but not 422; he likes 900 but
    not 800; he likes 144 but not 134. Which does he
    like: 1800 or 1700?
  • Which of the figures below the line of drawings
    best completes the series?

  • ?

25
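
The two number puzzles above can be worked through mechanically. The sketch below is only an illustration: it assumes the sequence grows by steadily increasing differences, and it tests a small, hypothetical set of candidate rules (chosen here, not taken from the slides) against Khairul's likes and dislikes.

```python
import math

def next_term(seq):
    """Extend a sequence whose differences grow by a constant step."""
    diffs = [b - a for a, b in zip(seq, seq[1:])]
    steps = [b - a for a, b in zip(diffs, diffs[1:])]
    if len(set(steps)) != 1:
        raise ValueError("differences do not grow by a constant step")
    return seq[-1] + diffs[-1] + steps[0]

# Hypothetical candidate rules; a rule is kept only if it accepts every
# liked number and rejects every disliked number.
CANDIDATE_RULES = {
    "even": lambda n: n % 2 == 0,
    "perfect square": lambda n: math.isqrt(n) ** 2 == n,
    "divisible by 3": lambda n: n % 3 == 0,
    "divisible by 9": lambda n: n % 9 == 0,
}

def consistent_rules(liked, disliked):
    return [
        name
        for name, rule in CANDIDATE_RULES.items()
        if all(rule(n) for n in liked) and not any(rule(n) for n in disliked)
    ]

sequence_answer = next_term([5, 7, 10, 14, 19])
rules = consistent_rules(liked=[252, 900, 144], disliked=[422, 800, 134])
choices = {
    name: [n for n in (1800, 1700) if CANDIDATE_RULES[name](n)]
    for name in rules
}

print(sequence_answer)  # 25
print(rules)            # ['divisible by 3', 'divisible by 9']
print(choices)          # both surviving rules pick 1800
```

More than one rule fits the examples, which is typical of pattern discovery: the data narrows the candidates but does not always single out one explanation.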
11
Data Mining vs. Other Techniques
  • Data Mining (hypothesis-free) addresses questions
    such as:
      • Why are my discount coupons not attracting the
        sort of return I was expecting?
      • How can I increase the share I have of my
        customers' total spending on electronic goods?
      • How can I get my other stores to match the
        incredibly successful sales figures of the main
        branch?
  • Statistics, Query, Reporting, OLAP, ...
    (hypothesis-driven; not suitable for large
    databases and data warehouses within the time
    limits) address questions such as:
      • Volume of TVs sold in one store last month.
      • Analyze the price sensitivity of a new line of
        TVs.
      • Compare the sales of various products in
        different stores over time.
  • Hypotheses: the manager knows that there are
    stores, products, sensitivity and sales figures,
    and he is checking out the interrelationships.

12
Traditional Data Analysis
Hypothesis
Query Language
Graphics, Statistics, OLAP
Output
Database
13
Relationship between Data Mining and Statistics
  • Statistics is closest to data mining.
  • Many of the analyses now done with data mining,
    such as building predictive models or discovering
    associations in databases, have long been done in
    statistics.

14
Data Mining for Business Intelligence
  • Business intelligence: all of the processes,
    techniques and tools that support business
    decision-making based on information technology.
    The approaches can range from a simple
    spreadsheet to a major competitive intelligence
    undertaking. Data mining is an important new
    component of business intelligence.

15
Data Mining and Business Intelligence
Positioning of different business intelligence
according to their potential value as a basis for
tactical and strategic business decisions
  • Pyramid levels, from top to bottom:
      • Making Decisions
      • Data Presentation: Visualization Techniques
      • Data Mining: Information Discovery
      • Data Exploration: OLAP, MDA, Statistical
        Analysis and Querying and Reporting
      • Data Warehouses / Data Marts
      • Data Sources: Paper, Files, Information
        Providers, Database Systems
  • Roles, from top to bottom: Decision Maker,
    Business Analyst, Data Analyst, Database
    Administrator
  • Vertical axis: increasing potential to support
    business decisions
  • The value of the information to support decision
    making increases from the bottom of the pyramid
    to the top.
16
Data Mining Applications
  • According to Fortune (a financial magazine), in
    its annual report of the best 500 companies, 80
    of them are using data mining for decision
    support.

17
Data Mining Applications
  • The three main business areas where data mining
    is applied are:
  • (1) Market Management
      • Target marketing
      • Customer relationship management
      • Market basket analysis
      • Cross selling
  • (2) Risk Management
      • Forecasting
      • Customer retention
      • Improved underwriting
      • Quality control
      • Competitive analysis
  • (3) Fraud Management
      • Fraud detection

18
Market Management Applications
  • The organization builds a database of customer
    product preferences and lifestyles from such
    sources as credit card transactions, loyalty
    cards, warranty cards, discount coupons, entries
    to free prize drawings and customer complaint
    calls.
  • Data mining algorithms then search through the
    data looking for clusters of model consumers who
    all share the same characteristics (for example,
    income, interests and spending habits).

19
Determining customer purchasing patterns over time
  • Examples:
  • The sequence in which customers take up financial
    services as their family grows.
  • How they change their cars.
  • Converting a single bank account to a joint
    account indicates marriage, which could lead to
    future opportunities to offer loans, insurance,
    study fees, ...
  • By understanding these patterns the organization
    can advertise just-in-time.

20
Improving Catalog Telesales
  • The goal is to track the products that customers
    order most frequently and to suggest those
    products in future orders.
  • Some product associations are obvious: camera
    and film, radio and batteries, ...
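
As a rough illustration of tracking which products are ordered together, the sketch below counts product pairs across orders and suggests the most frequent partner of a given product. The orders and the helper names (`pair_counts`, `suggest`) are made up for this example; real market basket analysis would usually add measures such as support and confidence.

```python
from collections import Counter
from itertools import combinations

# Made-up orders for illustration only; not data from the workshop.
orders = [
    {"camera", "film"},
    {"camera", "film", "batteries"},
    {"radio", "batteries"},
    {"radio", "batteries", "film"},
    {"camera", "film"},
]

def pair_counts(orders):
    """Count how often each unordered pair of products shares an order."""
    counts = Counter()
    for order in orders:
        for pair in combinations(sorted(order), 2):
            counts[pair] += 1
    return counts

def suggest(product, counts):
    """Return the product most often ordered with `product`, or None."""
    partners = Counter()
    for (a, b), n in counts.items():
        if a == product:
            partners[b] += n
        elif b == product:
            partners[a] += n
    return partners.most_common(1)[0][0] if partners else None

counts = pair_counts(orders)
print(counts[("camera", "film")])  # 3
print(suggest("radio", counts))    # batteries
print(suggest("camera", counts))   # film
```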

21
Loyalty Cards
  • To reward your frequent buyers.
  • Cardholders get special treatment, such as
    exclusive discounts on selected items, which
    encourages them to do more shopping at the shop
    and makes them less likely to visit the
    competition.

22
Risk Management Applications
  • Risk associated with insurance or investments.
  • Business risk arising from competitive threats.
  • Risk associated with poor product quality.
  • Risk associated with customer attrition, i.e. the
    loss of customers, especially to competitors
    (examples are found in the retail, finance and
    telecommunications fields).
  • The idea here is to build a model of a vulnerable
    customer who shows characteristics typical of one
    who is likely to leave for a competing company.

23
Risk Management Applications
  • Example: customer losses may frequently follow a
    change of address or a recent protracted exchange
    with an agent of the company.
  • One US bank uses such models to predict the loss
    of customers up to one year in advance.
  • Another bank analyzes more than one million
    credit card account histories to ensure that it
    is not over-exposed to high rates of attrition.

24
Risk Management Applications
  • Telecommunication companies have several billion
    dollars in uncollectible debts every year. Data
    mining can build models that help predict whether
    a particular account is likely to be collectible
    and is therefore worth going after.

25
Forecasting Financial Future
  • If changes in financial behavior can be
    predicted, the organization can adjust its
    investment strategy and capitalize on the
    predicted changes.
  • Example: the ability to forecast the right price
    of a future, which is a contract that allows
    someone to buy something at a certain price on a
    certain date in the future.

26
Pricing Strategy in a Highly Competitive Market
  • A chain of gasoline stations used data mining to
    develop profitable pricing strategies in a very
    competitive marketplace, by developing a model
    that helps to determine:
  • Appropriate pricing for its products on a
    day-to-day basis, with a view to maximizing sales
    and profits.
  • Sales volumes and profitability.
  • The likely competitive reaction to their price
    changes.
  • The likely profitability of a new station.

27
Fraud Management Applications
  • Detecting telephone fraud: some of the more
    important elements (patterns) in building the
    model are the destination of the call, its
    duration, and the time of day and week.
  • Some sectors suffer more than most, especially
    those with many transactions, such as health
    care, retail, credit card services and
    telecommunications.
  • The goal is to use historical data to build a
    model of fraudulent behavior and then use data
    mining to help identify similar instances of this
    behavior.

28
Detecting inappropriate medical treatments
  • An insurance company maintains computerized
    records of every doctor's consultation in
    Australia, including details on the diagnosis,
    prescribed drugs and recommended treatment.
  • Using traditional data analysis techniques, they
    noticed a rapid increase in the number of
    prescribed pathology tests.
  • Using data mining, they were able to identify
    which combinations of tests were commonly used,
    to detect invalid combinations and no longer
    accept them for benefit payment, and to identify
    that in many cases a certain test had been used
    for given symptoms.

29
Future Application Areas
  • Text Mining: words are analyzed in context, for
    example the word "memory" used in a medical
    article or a computer article.
  • Web Analytics: to develop insights into users'
    behavior on the internet. For example, today's
    hypertext links are typically fixed; the site
    developers have provided the most likely links by
    trying to second-guess what the user wants to do
    next. With data mining, historical user browsing
    patterns can be analyzed to dynamically suggest
    related sites for users to visit.

30
Data Mining is not Magic !
  • A class of divorced women
  • A data mining system discovered that divorced
    women have distinctly different shopping patterns
    from those of either single or married women.
  • After analyzing the data, they found that the
    data on marital status was much less accurate
    than the other data because of cultural norms.

31
Data Mining is not Magic !
  • Missing the Point
  • While preparing to mine a database of hospital
    patient admission records, they found a strange
    graph of the temperature. They then discovered
    that the nurse was likely to record a temperature
    of 37°C as either 36.9°C or 37.1°C.

[Figure: population plotted against temperature, with the temperature axis marked at 35°, 36°, 37° and 38°]
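
A recording habit like this presumably shows up as a gap in the histogram at exactly 37.0°C. The sketch below is one simple way to flag such a gap; the readings, the `ratio` parameter and the function name are assumptions made for illustration, not part of the workshop material.

```python
from collections import Counter

# Made-up readings in tenths of a degree Celsius (370 means 37.0°C).
readings = [368, 369, 369, 369, 369, 370, 371, 371, 371, 371, 372, 372, 373]

def suspicious_dips(values, ratio=0.5):
    """Return values whose count is below `ratio` times both neighbours' counts."""
    counts = Counter(values)
    dips = []
    for v in sorted(counts):
        left = counts.get(v - 1, 0)
        right = counts.get(v + 1, 0)
        # Only compare when both neighbouring values were actually observed.
        if left and right and counts[v] < ratio * min(left, right):
            dips.append(v)
    return dips

dips = suspicious_dips(readings)
print([d / 10 for d in dips])  # [37.0]
```

Storing the readings as integer tenths avoids floating-point keys when looking up neighbouring values.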
32
Data Mining is not Magic !
  • Older and wealthy customers were buying large
    sedans!!
  • People born under the sign of Pisces were most
    prone to accidents!
  • Males with incomes between 50k and 65k who
    subscribe to certain magazines are likely
    purchasers of a certain product!
  • DM just assists business analysis by finding
    patterns and relationships in the data.
  • These patterns and relationships are not
    necessarily causes of an action.

33
Data Mining Approaches
  • Classification Studies (Supervised Learning)
  • "I want to understand what makes customers more
    likely to stay with or to leave my company."
    (no hypothesis)
  • Clustering Studies (Unsupervised Learning)
  • "What are the products that are likely to be
    purchased together?" (no hypothesis)

34
How do we mine data?
  • The process of data mining is described as a
    process of model building.
  • Five main steps to data mining:
  • 1. Data Preparation
  • 2. Defining a study
  • Reading the data and building a model
  • Understanding the model
  • 3. Data Mining
  • 4. Analysis of Results.
  • 5. Assimilation of Knowledge.

35
Effort Required for Each Data Mining Process Step
[Bar chart: effort (axis marked from 10 to 70) for each step: Defining a study, Data Preparation, Data Mining, Analysis of Results and Knowledge Assimilation]
36
PART II
  • INTRODUCTION TO CLEMENTINE
  • Drug Treatment
  • (Exploratory Graphs / C5.0)

37
The Problem
  • Imagine that you are a medical researcher
    compiling data for a study.
  • You have collected data about a set of patients,
    all of whom suffered from the same illness.
  • During their course of treatment, each patient
    responded to one of five medications.
  • Part of your job is to use data mining to find
    out which drug might be appropriate for a future
    patient with the same illness.

38
The Data Fields
39
Data Reading
  • Use the Variable File node.
  • Open Drug1n
  • Select Read field names from file
  • Click the Data tab. In the override column,
    select cholesterol
  • Click the Types tab to learn more about the
    type of fields in your data. Choose Read Values
    to view the actual values for each field
  • Use the Table Node to have a glance at the
    values

40
Exploring the Dataset
  • Use the Distribution node to explore the data
  • Select Drug as the target field
  • Click Execute
  • The resulting graph shows that patients responded
    to drug Y most often and to drugs B and C least
    often.
  • Use the Data Audit node for a quick glance at
    distributions and histograms for all fields at
    once.
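
Outside Clementine, the same kind of frequency view can be sketched in a few lines. The records below are made up for illustration and only mimic the shape described in the slide (drug Y most frequent, drugs B and C least); the `distribution` helper is likewise an assumption, not a Clementine function.

```python
from collections import Counter

# Made-up records; only the field name Drug follows the slides.
records = (
    [{"Drug": "Y"}] * 4
    + [{"Drug": "X"}] * 3
    + [{"Drug": "A"}] * 2
    + [{"Drug": "B"}]
    + [{"Drug": "C"}]
)

def distribution(records, field):
    """Map each value of `field` to (count, share), most frequent first."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return {value: (n, n / total) for value, n in counts.most_common()}

dist = distribution(records, "Drug")
for value, (n, share) in dist.items():
    # A crude text bar chart, one '#' per record.
    print(f"{value} {'#' * n} {share:.0%}")
```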

41
What factors might influence Drug?
  • As a researcher, you know that the concentrations
    of sodium and potassium in the blood are
    important factors.
  • Since these are both numeric values, create a
    scatterplot of sodium versus potassium, using the
    drug categories as a color overlay.
  • Use the Plot node; double-click it to edit.
  • Plot: Na vs. K
  • Overlay color: Drug
  • The plot clearly shows a threshold above which
    the correct drug is always drug Y and below which
    the correct drug is never drug Y. This threshold
    is a ratio: the ratio of sodium (Na) to
    potassium (K).

42
What factors might influence Drug?
  • Use web graph if many of the data fields are
    categorical
  • Web graph maps associations between different
    categories
  • Select BP and Drug, then click Execute.
  • It appears that drug Y is associated with all
    three levels of blood pressure.

43
What factors might influence Drug?
  • To focus on the other drugs, use Hide and
    Replan.
  • After hiding drug Y:
  • Only drugs A and B are associated with high blood
    pressure.
  • Only drugs C and X are associated with low blood
    pressure.
  • Normal blood pressure is associated only with
    drug X.
  • At this point, though, you still don't know how
    to choose between drugs A and B or between drugs
    C and X, for a given patient. This is where
    modeling can help.

44
Deriving a New Field
  • Since the ratio of sodium to potassium seems to
    predict when to use drug Y, you can derive a
    field that contains the value of this ratio for
    each record. This field might be useful later
    when you build a model to predict when to use
    each of the five drugs.
  • Insert a Derive node and edit it:
  • Name: Na_to_K
  • Formula: enter Na/K, or use the Expression
    Builder.
  • Check the distribution of the derived field using
    a Histogram node; specify Na_to_K as the field to
    be plotted and Drug as the overlay field.
  • Result: when the Na_to_K value is about 15 or
    above, drug Y is the drug of choice.
45
Building a Model
  • By exploring and manipulating the data, you have
    been able to form some hypotheses.
  • The ratio of sodium to potassium in the blood
    seems to affect the choice of drug, as does blood
    pressure.
  • But you cannot fully explain all of the
    relationships yet.
  • This is where modeling will likely provide some
    answers.
  • In this case, you will try to fit the data using
    a rule-building model, C5.0.

46
Building a Model
  • Since we have a new derived field, Na_to_K, we
    can filter out the original fields, Na and K, so
    that they are not used twice in the modeling
    algorithm.
  • Use Filter node
  • Click the arrows next to Na and K.
  • Red Xs appear over the arrows to indicate that
    the fields are now filtered out.
  • Connect a Type node to the Filter node which
    allows you to indicate the types of fields that
    you are using and how they are used to predict
    the outcomes.
  • Set the direction of the Drug field to Out
    (i.e. to be predicted) and the direction of all
    other fields to In.
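
In code terms, the Filter and Type steps correspond to dropping columns and then separating the inputs from the field to be predicted. The sketch below assumes dictionary records with made-up values; `filter_fields` and `split_directions` are hypothetical helper names, not Clementine functions.

```python
# Made-up records; field names follow the slides.
records = [
    {"BP": "HIGH", "Cholesterol": "NORMAL", "Na": 0.75, "K": 0.03,
     "Na_to_K": 25.0, "Drug": "Y"},
    {"BP": "LOW", "Cholesterol": "HIGH", "Na": 0.56, "K": 0.07,
     "Na_to_K": 8.0, "Drug": "C"},
]

def filter_fields(records, drop):
    """Return copies of the records without the fields named in `drop`."""
    return [{k: v for k, v in r.items() if k not in drop} for r in records]

def split_directions(records, target):
    """Separate the In fields (inputs) from the Out field (target)."""
    inputs = [{k: v for k, v in r.items() if k != target} for r in records]
    outputs = [r[target] for r in records]
    return inputs, outputs

filtered = filter_fields(records, drop={"Na", "K"})
inputs, outputs = split_directions(filtered, target="Drug")
print(sorted(inputs[0]))  # ['BP', 'Cholesterol', 'Na_to_K']
print(outputs)            # ['Y', 'C']
```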

47
Building a Model
  • To estimate the model, attach a C5.0 node to
    the Type node. Then execute.
  • Browse the created model.
  • Rule Browser
  • Viewer Decision Tree
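
C5.0 itself is a proprietary algorithm, so it is not reproduced here. As a rough sketch of the information-gain idea that decision-tree learners in this family build on (its predecessor C4.5 normalizes this quantity into a gain ratio), the code below compares two candidate splits on made-up records. The threshold of 15 and the field names come from the slides; everything else is an assumption.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )

def information_gain(rows, target, test):
    """Entropy reduction from splitting `rows` by the boolean `test`."""
    groups = {True: [], False: []}
    for row in rows:
        groups[bool(test(row))].append(row[target])
    labels = [row[target] for row in rows]
    remainder = sum(
        len(g) / len(labels) * entropy(g) for g in groups.values() if g
    )
    return entropy(labels) - remainder

# Made-up records for illustration only.
rows = [
    {"Na_to_K": 25.0, "BP": "HIGH", "Drug": "Y"},
    {"Na_to_K": 19.0, "BP": "LOW", "Drug": "Y"},
    {"Na_to_K": 17.0, "BP": "NORMAL", "Drug": "Y"},
    {"Na_to_K": 30.0, "BP": "HIGH", "Drug": "Y"},
    {"Na_to_K": 9.0, "BP": "HIGH", "Drug": "A"},
    {"Na_to_K": 11.0, "BP": "HIGH", "Drug": "A"},
    {"Na_to_K": 8.0, "BP": "NORMAL", "Drug": "X"},
    {"Na_to_K": 12.0, "BP": "NORMAL", "Drug": "X"},
]

gain_ratio_split = information_gain(rows, "Drug", lambda r: r["Na_to_K"] >= 15)
gain_bp_split = information_gain(rows, "Drug", lambda r: r["BP"] == "HIGH")
print(gain_ratio_split)  # 1.0 bit on this toy data
print(gain_bp_split)     # 0.5 bit on this toy data
```

On this toy data the Na_to_K split removes more uncertainty than the blood pressure split, so a tree learner of this kind would test Na_to_K first.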

48
The Accuracy
  • To assess the accuracy of the model, connect an
    Analysis node to the C5.0 model node, which is
    connected to the Type node.
  • The Analysis node output shows that with this
    artificial data set, the model correctly
    predicted the choice of drug for almost every
    record in the data set.
  • With a real data set you are unlikely to see 100%
    accuracy, but you can use the Analysis node to
    help determine whether the model is acceptably
    accurate for your particular application.
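
The core of such an accuracy check is a comparison of predicted and actual values. The sketch below uses made-up labels and hypothetical helper names; it does not reproduce the Analysis node's actual output.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of records whose predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

def confusion(actual, predicted):
    """Count each (actual, predicted) pair."""
    return Counter(zip(actual, predicted))

# Made-up labels for illustration only.
actual = ["Y", "Y", "X", "A", "B", "C", "X", "Y"]
predicted = ["Y", "Y", "X", "A", "A", "C", "X", "Y"]

score = accuracy(actual, predicted)
matrix = confusion(actual, predicted)
print(f"{score:.1%}")      # 87.5%
print(matrix[("B", "A")])  # 1 record of drug B predicted as drug A
```

Accuracy measured on the same records used to build the model tends to be optimistic, so a separate set of held-out records gives a more realistic estimate.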