Chapter 4: An Excel-based Data Mining Tool (iData Analyzer)

1
Chapter 4: An Excel-based Data Mining Tool (iData Analyzer)
Jason C. H. Chen, Ph.D., Professor of MIS
School of Business Administration, Gonzaga University
Spokane, WA 99223
chen_at_jepson.gonzaga.edu
2
Objectives
  • This chapter introduces the iData Analyzer (iDA)
    and shows how to use two of the learner models
    contained in your iDA data mining software.
  • Section 4.1 gives an overview of the iDA model for
    knowledge discovery.
  • Section 4.2 introduces ESX, an exemplar-based data
    mining tool capable of both supervised learning
    and unsupervised clustering.
  • The chapter also covers, among other topics, how
    to represent datasets and how to use ESX to
    perform unsupervised clustering and build
    supervised learning models.

3
4.1 The iData Analyzer
  • iDA provides support for the business or
    technical analyst by offering a visual learning
    environment, an integrated tool set, and data
    mining process support.
  • iDA consists of the following components:
  • Preprocessor
  • Heuristic agent (for large datasets)
  • ESX
  • Neural Network
  • Rule Maker
  • Report Generator

See p.107 and Appendix A-2 for installation
instructions.
4
Limitations
  • The commercial version of iDA is limited by the
    size of a single MS Excel spreadsheet, i.e., up
    to 65,536 rows and 256 columns.
  • The iDA input format uses the first three rows of
    a spreadsheet to house information about
    individual attributes.
  • Up to 65,533 data instances in attribute-value
    format can therefore be mined.
  • The student version allows a maximum of 7,000
    data instances (i.e., 7,003 rows).
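
The row arithmetic behind these limits can be checked with a short sketch. The constant and function names are illustrative only and are not part of iDA.

```python
# Sketch of the row arithmetic behind the limits above.
# Names are illustrative; they are not part of iDA.
HEADER_ROWS = 3  # the first three rows describe the attributes

def max_instances(sheet_rows: int) -> int:
    """Rows left for data instances after the attribute header rows."""
    return sheet_rows - HEADER_ROWS

def rows_needed(instances: int) -> int:
    """Spreadsheet rows occupied by a dataset with this many instances."""
    return instances + HEADER_ROWS

print(max_instances(65_536))  # commercial version: 65533
print(rows_needed(7_000))     # student version: 7003
```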

After completing the installation, if the security
setting is high, change it to medium and click OK.
5
Figure 4.1 The iDA system architecture
6
(No Transcript)
7
4.2 ESX: A Multipurpose Tool for Data Mining
  • ESX can help create target data, find
    irregularities in data, perform data mining, and
    offer insight about the practical value of
    discovered knowledge.
  • Features of the ESX learner model:
  • It supports both supervised learning and
    unsupervised clustering
  • It supports an automated method for dealing with
    missing attribute values
  • It does not make statistical assumptions about
    the nature of data to be processed
  • It can point out inconsistencies and unusual
    values in data

8
Figure 4.3 An ESX concept hierarchy
9
4.3 iDAV Format for Data Mining
   
 
Second row: C = categorical, R = real-valued
Third row: attribute usage (see Table 4.2)
Table 4.2 Values for Attribute Usage
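
As a rough illustration of this layout, the sketch below reads the three header rows from a list of spreadsheet rows. It assumes the first row holds attribute names, and the usage code I (input) is also an assumption; only the codes C, R, O, and D are named in these slides. The sample data row is made up.

```python
# Minimal sketch of reading iDAV-style header rows (not iDA's own parser).
# Assumptions: row 1 holds attribute names; usage code "I" means input.
# Only the codes C, R, O and D are named in these slides.
from typing import NamedTuple

class Attribute(NamedTuple):
    name: str
    kind: str   # "C" = categorical, "R" = real-valued
    usage: str  # e.g. "O" = output, "D" = display-only

def read_header(rows):
    """Build one Attribute per column from the first three rows."""
    names, kinds, usages = rows[0], rows[1], rows[2]
    for kind in kinds:
        if kind not in ("C", "R"):
            raise ValueError(f"unknown attribute type: {kind!r}")
    return [Attribute(n, k, u) for n, k, u in zip(names, kinds, usages)]

sheet = [
    ["Sex", "Age", "Life Ins Promo"],  # attribute names (assumed row 1)
    ["C", "R", "C"],                   # second row: attribute types
    ["I", "I", "O"],                   # third row: attribute usage
    ["Female", "35", "Yes"],           # made-up data instance
]
attributes = read_header(sheet)
instances = sheet[3:]  # data instances start on the fourth row
print(attributes[1])
```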
 
 
10
Table 4.1 Credit Card Promotion Database iDAV
Format
11
4.4 A Five-step Approach for Unsupervised Clustering
  • Step 1: Enter the Data to Be Mined
  • Step 2: Perform a Data Mining Session
  • Step 3: Read and Interpret Summary Results
  • Step 4: Read and Interpret Individual Class
    Results
  • Step 5: Visualize Individual Class Rules

12
Step 1: Enter the Data to Be Mined
13
Step 2: Perform a Data Mining Session
14
Figure 4.5 Unsupervised settings for ESX
(step 4, p.116)
Value for instance similarity: a value closer to
100 encourages the formation of new clusters; a
value closer to 0 favors placing new instances in
existing clusters.
Real-valued tolerance: this setting helps determine
the similarity criteria for real-valued
attributes. A setting of 1.0 is usually
appropriate.
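
To see why a higher similarity value tends to produce more clusters, consider the toy sketch below. It is a simple threshold-based incremental clustering, not ESX's actual algorithm; the similarity function (percentage of matching attribute values) and the data are assumptions made purely for illustration.

```python
# Toy threshold clustering; NOT the ESX algorithm. Similarity and data are
# assumptions chosen only to show the effect of the similarity setting.

def similarity(a, b):
    """Percentage (0-100) of attribute positions where a and b agree."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def cluster(instances, threshold):
    """Put each instance in the first cluster whose first member is at
    least `threshold` similar to it; otherwise start a new cluster."""
    clusters = []
    for inst in instances:
        for members in clusters:
            if similarity(inst, members[0]) >= threshold:
                members.append(inst)
                break
        else:
            clusters.append([inst])
    return clusters

data = [
    ("Female", "Yes", "No"),
    ("Female", "Yes", "Yes"),
    ("Male", "No", "No"),
    ("Male", "Yes", "No"),
]
print(len(cluster(data, threshold=90)))  # strict setting: more clusters
print(len(cluster(data, threshold=30)))  # lenient setting: fewer clusters
```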
15
6. A message box indicates that eight clusters
were formed.
This tells us the data has been successfully mined.
16
6, 7 (p.116): As a general rule, an unsupervised
clustering of more than five or six clusters is
likely to be less than optimal.
17
8 and 9: Repeat steps 1-4. For step 5, set the
similarity value to 55.
18
Re-rule feature
  • Covering set rules: RuleMaker will generate a set
    of best-defining rules for each class.
  • Minimum rule correctness (50-100): if set to 80,
    the rules generated must have an error rate less
    than or equal to 20%.
  • Minimum rule coverage (10-100): if set to 10,
    RuleMaker will generate rules that cover 10% or
    more of the instances in each class.
  • Attribute significance (start with 80-90): values
    close to 100 allow RuleMaker to consider only
    those attribute values most highly predictive of
    class membership for rule generation.
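
How the correctness and coverage thresholds interact can be sketched as a simple filter. This only illustrates the settings as described above and is not RuleMaker's implementation; the Rule type and the candidate rules are hypothetical.

```python
# Illustrative filter for the thresholds above; not RuleMaker's code.
from dataclasses import dataclass

@dataclass
class Rule:
    text: str
    accuracy: float  # percent of covered instances that are in the class
    coverage: float  # percent of class instances the rule covers

def keep_rule(rule, min_correctness=80.0, min_coverage=10.0):
    """A rule passes if its error rate is at most 100 - min_correctness
    and it covers at least min_coverage percent of its class."""
    error_rate = 100.0 - rule.accuracy
    return (error_rate <= 100.0 - min_correctness
            and rule.coverage >= min_coverage)

candidates = [  # hypothetical rules
    Rule("rule A", accuracy=100.0, coverage=66.67),
    Rule("rule B", accuracy=75.0, coverage=40.0),
    Rule("rule C", accuracy=90.0, coverage=5.0),
]
kept = [r.text for r in candidates if keep_rule(r)]
print(kept)
```

With the defaults shown, rule B fails on correctness (25% error) and rule C fails on coverage (5%).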
19
10 (p.117): Set the minimum rule coverage to 30.
20
A Production Rule for the Credit Card Promotion
Database
  • IF Sex = Female & 19 <= Age <= 43
  • THEN Life Insurance Promotion = Yes
  • Rule Accuracy: 100.00%
  • Rule Coverage: 66.67%

Question: Can we assume that two-thirds of all
females in the specified age range will take
advantage of the promotion?
  • Rule accuracy is a between-class measure.
  • Rule coverage is a within-class measure.
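
The two measures can be computed as in the sketch below. It assumes the usual definitions (accuracy: the share of instances matching the rule's conditions that belong to the rule's class; coverage: the share of the class's instances that match the conditions), treats the age bounds as inclusive, and uses made-up instances rather than the actual credit card promotion database.

```python
# Sketch of rule accuracy and coverage on made-up instances
# (not the actual credit card promotion data).

def rule_matches(inst):
    """Antecedent: Sex = Female and 19 <= Age <= 43 (bounds assumed inclusive)."""
    return inst["sex"] == "Female" and 19 <= inst["age"] <= 43

def accuracy_and_coverage(instances, target="Yes"):
    matched = [i for i in instances if rule_matches(i)]
    in_class = [i for i in instances if i["promo"] == target]
    hits = [i for i in matched if i["promo"] == target]
    accuracy = 100.0 * len(hits) / len(matched)   # between-class measure
    coverage = 100.0 * len(hits) / len(in_class)  # within-class measure
    return accuracy, coverage

instances = [
    {"sex": "Female", "age": 35, "promo": "Yes"},
    {"sex": "Female", "age": 27, "promo": "Yes"},
    {"sex": "Male",   "age": 40, "promo": "Yes"},
    {"sex": "Male",   "age": 55, "promo": "No"},
    {"sex": "Female", "age": 60, "promo": "No"},
]
acc, cov = accuracy_and_coverage(instances)
print(f"accuracy {acc:.2f}, coverage {cov:.2f}")
```

In this toy data, a coverage of 66.67% says that two-thirds of the instances in class Yes satisfy the rule's conditions. It says nothing about what share of females in the age range accept the promotion; that is what accuracy describes.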

21
Output Reports: Unsupervised Clustering
  • RES SUM: This sheet contains summary statistics
    about attribute values and offers several
    heuristics to help us determine the quality of a
    data mining session.
  • RES CLS: This sheet has information about the
    clusters formed as a result of an unsupervised
    mining session.
  • RUL TYP: Instances are listed by their cluster
    number. The typicality of instance i is the
    average similarity of i to the other members of
    its cluster.
  • RES RUL: The production rules generated for each
    cluster are contained in this sheet.
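
The typicality definition above can be sketched as follows. The similarity function used here (fraction of matching attribute values) is an assumption, since the slides do not give ESX's own similarity measure, and the instances are made up.

```python
# Sketch of typicality: average similarity of an instance to the other
# members of its cluster. The similarity function is an assumption
# (fraction of matching attribute values), not ESX's own measure.

def similarity(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def typicality(index, cluster):
    """Average similarity of cluster[index] to every other member."""
    others = [m for j, m in enumerate(cluster) if j != index]
    if not others:
        return 1.0  # convention chosen here for a single-member cluster
    return sum(similarity(cluster[index], m) for m in others) / len(others)

cluster = [  # made-up categorical instances
    ("Female", "Yes", "No"),
    ("Female", "Yes", "Yes"),
    ("Male", "No", "Yes"),
]
scores = [typicality(i, cluster) for i in range(len(cluster))]
most_typical = max(range(len(cluster)), key=scores.__getitem__)
print(scores, most_typical)
```

The instance with the highest score is the most prototypical member of the cluster; the lowest-scoring instance is the best outlier candidate.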

22
10 (p.117): Set the minimum rule coverage to 30.
23
Figure 4.7 Rules for the credit card promotion
database
24
Step 3: Read and Interpret Summary Results
(p.117) (Sheet1 RES SUM)
  • Class Resemblance Scores (RES)
  • Domain Resemblance Score
  • Domain Predictability

25
Step 3: Read and Interpret Summary Results (p.119)
In general, the within-class RES scores should be
higher than the domain RES score. This should hold
for most of the classes.
26
Figure 4.9 Step 3: Read and Interpret Summary
Results (cont.)
27
Figure 4.9 Statistics for numerical attributes
and common categorical attribute values. Step 3:
Read and Interpret Summary Results (cont.)
28
Step 4: Read and Interpret Individual Class
Results (p.121) (Sheet1 RES CLS)
  • Typicality is defined as the average similarity
    of an instance to all other members of its
    cluster or class.
  • Class predictability is a within-class measure:
    the percent of class instances having a
    particular value for a categorical attribute.
  • Class predictiveness is a between-class measure:
    the probability that an instance resides in a
    specified class given that the instance has the
    value for the chosen attribute.
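
Predictability and predictiveness can be computed as in the sketch below, which follows the definitions above. The (class, value) pairs are hypothetical.

```python
# Sketch of class predictability and predictiveness for one categorical
# attribute value, using made-up (class, value) pairs.

def predictability(pairs, cls, value):
    """Within-class: fraction of instances in `cls` having `value`."""
    in_class = [v for c, v in pairs if c == cls]
    return in_class.count(value) / len(in_class)

def predictiveness(pairs, cls, value):
    """Between-class: P(class = cls | attribute = value)."""
    classes_with_value = [c for c, v in pairs if v == value]
    return classes_with_value.count(cls) / len(classes_with_value)

# (class, credit card insurance) pairs; hypothetical data
pairs = [
    ("Yes", "Yes"), ("Yes", "Yes"), ("Yes", "No"), ("Yes", "No"), ("Yes", "No"),
    ("No", "No"), ("No", "No"), ("No", "No"),
]
print(predictability(pairs, "Yes", "Yes"))  # 2 of the 5 class members
print(predictiveness(pairs, "Yes", "Yes"))  # all "Yes" values fall in class Yes
```

A predictiveness of 1.0 means the attribute value is sufficient for class membership, and a predictability of 1.0 means it is necessary.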

29
Figure 4.10 Class 3 Summary Results
30
Figure 4.11 Necessary and sufficient attribute
values for Class 3
31
Step 5: Visualize Individual Class Rules
IF Life Ins Promo = Yes THEN Class = 3
Rule accuracy: 77.78%, rule coverage: 100.00%
32
4.5 A Six-Step Approach for Supervised Learning
  • Step 1: Choose an Output Attribute
  • Launch a fresh life insurance promotion
  • Step 2: Perform the Mining Session
  • Step 3: Read and Interpret Summary Results
  • Step 4: Read and Interpret Test Set Results
  • Step 5: Read and Interpret Class Results
  • Step 6: Visualize and Interpret Class Rules

33
Step 2: Perform the Mining Session
Filename: CreditCardPromotion-supervised.xls
O = output, D = display-only
34
Step 2 (4): Select the number of instances for
training and a real-valued tolerance setting
(p.127)
35
Step 3: Read and Interpret Summary Results
The domain statistics for categorical attributes
tell us that 80% of the training instances
represent individuals without credit card
insurance.
36
Step 3: Read and Interpret Summary Results
(cont.)
37
Step 4: Read and Interpret Test Set Results
38
Step 5: Read and Interpret Results for
Individual Classes (p.130)
39
Sheet1 RUL TYP
In class Yes (Life Ins. Promo), instances with
Credit Card Ins = Yes make up 40% (2/5).
40
Step 6: Visualize and Interpret Class Rules
(p.130)
41
4.6 Techniques for Generating Rules
  1. Define the scope of the rules.
  2. Choose the instances.
  3. Set the minimum rule correctness.
  4. Define the minimum rule coverage.
  5. Choose an attribute significance value.

42
4.7 Instance Typicality
Typicality scores can be used to:
  • Identify prototypical and outlier instances.
  • Select a best set of training instances.
  • Compute individual instance classification
    confidence scores.

43
Figure 4.13 Instance Typicality
44
4.8 Special Considerations and Features
  • Avoid Mining Delays
  • The Quick Mine Feature
  • Supervised: with more than 2000 training set
    instances, you will be asked whether to use the
    quick mine feature.
  • Unsupervised: with more than 2000 data instances,
    ESX is given a random selection of 500 instances.
  • Erroneous and Missing Data
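
The unsupervised quick mine behaviour described above amounts to random subsampling, which can be sketched as follows. How iDA actually draws its sample is not stated in the slides, so the use of random.sample and all names here are assumptions.

```python
# Sketch of the unsupervised quick-mine idea: when there are more than
# 2000 instances, mine a random selection of 500. How iDA actually draws
# the sample is not stated in the slides; random.sample is an assumption.
import random

QUICK_MINE_LIMIT = 2000
QUICK_MINE_SAMPLE = 500

def select_for_mining(instances, quick_mine=True, seed=None):
    """Return the instances to mine, subsampling large datasets."""
    if quick_mine and len(instances) > QUICK_MINE_LIMIT:
        rng = random.Random(seed)
        return rng.sample(instances, QUICK_MINE_SAMPLE)
    return list(instances)

data = list(range(7000))  # stand-in for 7,000 data instances
subset = select_for_mining(data, seed=0)
print(len(subset))
```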

45
Homework
  • Use ESX (and iDA) to perform a supervised data
    mining session using the CardiologyCategorical.xls
    data file.
  • Save the output file as
    CardiologyCategorical-supervised.xls
  • Lab4 (p.141)
  • Turn in:
  • 1. Spreadsheet file
    (CardiologyCategorical-supervised.xls) that
    contains the outcome of the data mining session
  • 2. Word file that includes (and explains) answers
    to all questions (a. through n.)