Title: Chapter 4 An Excel-based Data Mining Tool (iData Analyzer)
Jason C. H. Chen, Ph.D.
Professor of MIS
School of Business Administration
Gonzaga University
Spokane, WA 99223
chen_at_jepson.gonzaga.edu
Objectives
- This chapter introduces the iData Analyzer (iDA) and shows how to use two of the learner models contained in the iDA suite of data mining tools.
- Section 4.1 gives an overview of the iDA model for knowledge discovery.
- Section 4.2 introduces ESX, an exemplar-based data mining tool capable of both supervised learning and unsupervised clustering.
- The chapter also covers how to represent datasets and how to use ESX to perform unsupervised clustering and build supervised learner models, among other topics.
4.1 The iData Analyzer
- iDA supports the business or technical analyst by offering a visual learning environment, an integrated tool set, and data mining process support.
- iDA consists of the following components:
- Preprocessor
- Heuristic agent (for large datasets)
- ESX
- Neural Network
- Rule Maker
- Report Generator
See p.107 and Appendix A-2 for installation instructions.
Limitations
- The commercial version of iDA is bounded by the size of a single MS Excel spreadsheet, i.e., up to 65,536 rows and 256 columns.
- The iDA input format uses the first three rows of a spreadsheet to house information about individual attributes.
- Up to 65,533 data instances in attribute-value format can therefore be mined.
- The student version allows a maximum of 7,000 data instances (i.e., 7,003 rows).
After completing the installation, if the security setting is High, change it to Medium and click OK.
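The instance limits above follow from reserving three header rows. A minimal sketch of that arithmetic (the constant and function names are illustrative and are not part of iDA):

```python
# Illustrative arithmetic only; these names are not part of iDA.
EXCEL_MAX_ROWS = 65_536   # rows in a single Excel worksheet, as stated in the text
HEADER_ROWS = 3           # the first three rows describe the attributes

def max_instances(total_rows: int, header_rows: int = HEADER_ROWS) -> int:
    """Number of data instances that fit once the header rows are reserved."""
    return total_rows - header_rows

def rows_needed(instances: int, header_rows: int = HEADER_ROWS) -> int:
    """Spreadsheet rows used by a given number of instances."""
    return instances + header_rows

print(max_instances(EXCEL_MAX_ROWS))  # 65533
print(rows_needed(7_000))             # 7003
```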
Figure 4.1 The iDA system architecture
4.2 ESX: A Multipurpose Tool for Data Mining
- ESX can help create target data, find irregularities in data, perform data mining, and offer insight about the practical value of discovered knowledge.
- Features of the ESX learner model:
- It supports both supervised learning and unsupervised clustering.
- It supports an automated method for dealing with missing attribute values.
- It does not make statistical assumptions about the nature of the data to be processed.
- It can point out inconsistencies and unusual values in data.
Figure 4.3 An ESX concept hierarchy
4.3 iDAV Format for Data Mining
- Second row: C = categorical, R = real-valued
- Third row: attribute usage (see Table 4.2 below)
Table 4.2 Values for Attribute Usage
Table 4.1 Credit Card Promotion Database: iDAV Format
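The header layout can be sketched as a list of rows with a small consistency check. This is only an illustration: the data values are made up, the function name is hypothetical, and placing attribute names in the first row is an assumption (the slides specify only the second and third rows). The usage codes O (output) and D (display-only) appear later in this chapter; I (input) and U (unused) are assumed here, and Table 4.2 remains the authority.

```python
# A sketch of an iDAV-style sheet as a list of rows. Values are made up.
# Assumption: row 1 holds attribute names; the text only specifies rows 2 and 3.
VALID_TYPES = {"C", "R"}            # C = categorical, R = real-valued
VALID_USAGE = {"I", "U", "D", "O"}  # O = output, D = display-only (from the text);
                                    # I = input and U = unused are assumed here

sheet = [
    ["Sex", "Age", "Life Ins Promo"],   # row 1: attribute names (assumed)
    ["C",   "R",   "C"],                # row 2: attribute types
    ["I",   "I",   "O"],                # row 3: attribute usage
    ["Female", 35, "Yes"],              # row 4 onward: data instances
    ["Male",   52, "No"],
]

def check_idav(rows):
    """Return a list of problems found in the header rows and row lengths."""
    if len(rows) < 4:
        return ["need three header rows plus at least one instance"]
    problems = []
    names, types, usage = rows[0], rows[1], rows[2]
    if not (len(names) == len(types) == len(usage)):
        problems.append("header rows differ in length")
    problems += [f"bad type {t!r}" for t in types if t not in VALID_TYPES]
    problems += [f"bad usage {u!r}" for u in usage if u not in VALID_USAGE]
    for i, row in enumerate(rows[3:], start=4):
        if len(row) != len(names):
            problems.append(f"row {i} has {len(row)} values, expected {len(names)}")
    return problems

print(check_idav(sheet))  # an empty list means no problems were found
```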
4.4 A Five-step Approach for Unsupervised Clustering
- Step 1: Enter the Data to be Mined
- Step 2: Perform a Data Mining Session
- Step 3: Read and Interpret Summary Results
- Step 4: Read and Interpret Individual Class Results
- Step 5: Visualize Individual Class Rules
Step 1: Enter the Data to be Mined
Step 2: Perform a Data Mining Session
Figure 4.5 Unsupervised settings for ESX (4, p.116)
Value for instance similarity: a value closer to 100 encourages the formation of new clusters; a value closer to 0 favors placing new instances in existing clusters.
The real-valued tolerance setting helps determine the similarity criteria for real-valued attributes. A setting of 1.0 is usually appropriate.
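The direction of the similarity setting can be illustrated with a toy incremental clustering. This is not the ESX algorithm, which the slides do not describe: the similarity measure, the use of cluster means, and the function names are all assumptions chosen only to show that a higher threshold produces more clusters.

```python
# Not ESX: a simplified incremental clustering used only to show how a
# similarity threshold (0-100) changes the number of clusters formed.

def similarity(a: float, b: float, value_range: float) -> float:
    """Similarity on a 0-100 scale; 100 means identical values (assumed measure)."""
    return 100.0 * (1.0 - abs(a - b) / value_range)

def cluster(values, threshold: float):
    """Assign each value to the most similar cluster, or start a new one."""
    value_range = (max(values) - min(values)) or 1.0
    clusters = []  # each cluster is a list of values
    for v in values:
        best, best_sim = None, -1.0
        for c in clusters:
            centre = sum(c) / len(c)
            s = similarity(v, centre, value_range)
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim >= threshold:
            best.append(v)          # similar enough: join an existing cluster
        else:
            clusters.append([v])    # otherwise form a new cluster
    return clusters

ages = [19, 21, 23, 35, 37, 40, 55, 58, 60]  # made-up values
for t in (30, 65, 95):
    print(t, len(cluster(ages, t)))
```

With these made-up ages, the cluster count never decreases as the threshold rises, which mirrors the behavior described for the instance similarity setting.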
6. A message box indicates that eight clusters were formed.
This tells us the data has been successfully mined.
6, 7 (p.116): As a general rule, an unsupervised clustering of more than five or six clusters is likely to be less than optimal.
8 and 9: Repeat steps 1-4. For step 5, set the similarity value to 55.
Re-rule feature
Covering set rules: RuleMaker will generate a set of best-defining rules for each class.
Minimum correctness (50-100): if set to 80, the rules generated must have an error rate less than or equal to 20%.
Minimum coverage (10-100): if set to 10, RuleMaker will generate rules that cover 10% or more of the instances in each class.
Attribute significance (start with 80-90): values close to 100 allow RuleMaker to consider only those attribute values most highly predictive of class membership for rule generation.
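How the minimum correctness and minimum coverage settings constrain rules can be sketched as a simple filter. RuleMaker's internals are not shown in the slides, so the `Rule` record, the `acceptable` function, and the three example rules below are hypothetical.

```python
# Hypothetical rule records; RuleMaker's internal representation is not shown in the text.
from dataclasses import dataclass

@dataclass
class Rule:
    text: str
    accuracy: float   # percent of covered instances that belong to the class
    coverage: float   # percent of class instances covered by the rule

def acceptable(rule: Rule, min_correctness: float = 80.0, min_coverage: float = 10.0) -> bool:
    """True if the error rate is at most 100 - min_correctness and the rule
    covers at least min_coverage percent of the class."""
    error_rate = 100.0 - rule.accuracy
    return error_rate <= 100.0 - min_correctness and rule.coverage >= min_coverage

rules = [  # made-up rules and scores
    Rule("IF Sex = Female THEN Life Ins Promo = Yes", accuracy=85.0, coverage=40.0),
    Rule("IF Age > 50 THEN Life Ins Promo = Yes", accuracy=60.0, coverage=25.0),
    Rule("IF Age < 20 THEN Life Ins Promo = Yes", accuracy=100.0, coverage=5.0),
]
kept = [r.text for r in rules if acceptable(r)]
print(kept)
```

With the defaults of 80 and 10, the second rule fails on correctness and the third on coverage.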
10 (p.117): Set the minimum rule coverage to 30.
A Production Rule for the Credit Card Promotion Database
- IF Sex = Female & 19 <= Age <= 43
- THEN Life Insurance Promotion = Yes
- Rule Accuracy: 100.00%
- Rule Coverage: 66.67%
Question: Can we assume that two-thirds of all
females in the specified age range will take
advantage of the promotion?
- Rule accuracy is a between-class measure.
- Rule coverage is a within-class measure.
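Computing both measures shows why the answer to the question is no. Accuracy looks at the instances that satisfy the rule's condition and asks what share belong to the class; coverage looks at the class and asks what share of its instances satisfy the condition. A coverage of 66.67% therefore describes two-thirds of the people who accepted the promotion, not two-thirds of females aged 19 to 43. The instances below are made up and chosen only to reproduce the quoted figures; they are not the chapter's dataset.

```python
# Made-up instances chosen so the rule reproduces the figures quoted above.
instances = [
    # (sex, age, life_ins_promo)
    ("Female", 27, "Yes"), ("Female", 35, "Yes"), ("Female", 43, "Yes"),
    ("Female", 19, "Yes"), ("Female", 29, "Yes"), ("Female", 38, "Yes"),
    ("Male", 42, "Yes"), ("Male", 33, "Yes"), ("Female", 55, "Yes"),
    ("Male", 40, "No"), ("Male", 45, "No"), ("Female", 48, "No"),
]

def condition(sex, age):
    """Antecedent of the rule: Sex = Female and 19 <= Age <= 43."""
    return sex == "Female" and 19 <= age <= 43

covered = [i for i in instances if condition(i[0], i[1])]
in_class = [i for i in instances if i[2] == "Yes"]
covered_in_class = [i for i in covered if i[2] == "Yes"]

accuracy = 100.0 * len(covered_in_class) / len(covered)    # between-class measure
coverage = 100.0 * len(covered_in_class) / len(in_class)   # within-class measure
print(f"accuracy={accuracy:.2f} coverage={coverage:.2f}")
```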
Output Reports: Unsupervised Clustering
- RES SUM: This sheet contains summary statistics about attribute values and offers several heuristics to help us determine the quality of a data mining session.
- RES CLS: This sheet has information about the clusters formed as a result of an unsupervised mining session.
- RUL TYP: Instances are listed by their cluster number. The typicality of instance i is the average similarity of i to the other members of its cluster.
- RES RUL: The production rules generated for each cluster are contained in this sheet.
Figure 4.7 Rules for the credit card promotion database
Step 3: Read and Interpret Summary Results (p.117) (Sheet1 RES SUM)
- Class Resemblance Scores (RES)
- Domain Resemblance Score
- Domain Predictability
Step 3: Read and Interpret Summary Results (p.119)
In general, the within-class resemblance (RES) scores should be higher than the domain resemblance score. This should hold for most of the classes.
Step 3: Read and Interpret Summary Results (cont.)
Figure 4.9 Statistics for numerical attributes and common categorical attribute values
Step 4: Read and Interpret Individual Class Results (p.121) (Sheet1 RES CLS)
- Typicality is the average similarity of an instance to all other members of its cluster or class.
- Class predictability is a within-class measure: the percent of class instances having a particular value for a categorical attribute.
- Class predictiveness is a between-class measure: the probability that an instance resides in a specified class given that the instance has the chosen value for the attribute.
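The two class measures are conditional proportions taken in opposite directions, which a short sketch makes concrete. The (class, attribute value) pairs and function names below are made up for illustration.

```python
# Made-up (class, attribute value) pairs; not taken from the chapter's dataset.
pairs = [
    ("Yes", "Yes"), ("Yes", "Yes"), ("Yes", "No"), ("Yes", "No"), ("Yes", "No"),
    ("No", "No"), ("No", "No"), ("No", "No"), ("No", "No"), ("No", "Yes"),
]

def predictability(pairs, cls, value):
    """Share of the class's instances that have the value (within-class)."""
    in_class = [v for c, v in pairs if c == cls]
    return sum(v == value for v in in_class) / len(in_class)

def predictiveness(pairs, cls, value):
    """Share of instances with the value that belong to the class (between-class)."""
    with_value = [c for c, v in pairs if v == value]
    return sum(c == cls for c in with_value) / len(with_value)

print(predictability(pairs, "Yes", "Yes"))   # 2 of the 5 class members
print(predictiveness(pairs, "Yes", "Yes"))   # 2 of the 3 instances with the value
```

Under these definitions, a predictability of 1.0 means every class member has the value (the value is necessary for membership), and a predictiveness of 1.0 means the value occurs only in that class (the value is sufficient).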
Figure 4.10 Class 3 Summary Results
Figure 4.11 Necessary and sufficient attribute values for Class 3
Step 5: Visualize Individual Class Rules
IF Life Ins Promo = Yes THEN Class = 3
Rule accuracy: 77.78%; rule coverage: 100.00%
4.5 A Six-Step Approach for Supervised Learning
- Step 1: Choose an Output Attribute
  - Launch a fresh life insurance promotion
- Step 2: Perform the Mining Session
- Step 3: Read and Interpret Summary Results
- Step 4: Read and Interpret Test Set Results
- Step 5: Read and Interpret Class Results
- Step 6: Visualize and Interpret Class Rules
Step 2: Perform the Mining Session
Filename: CreditCardPromotion-supervised.xls
Attribute usage: O = output, D = display-only
Step 2 (4): Select the number of instances for training and a real-valued tolerance setting (p.127)
Step 3: Read and Interpret Summary Results
Domain statistics for categorical attributes tell us that 80% of the training instances represent individuals without credit card insurance.
Step 3: Read and Interpret Summary Results (cont.)
Step 4: Read and Interpret Test Set Results
Step 5: Read and Interpret Results for Individual Classes (p.130)
Sheet1 RUL TYP
In class Yes (Life Ins Promo), 40% (2/5) of the instances have Credit Card Ins = Yes.
Step 6: Visualize and Interpret Class Rules (p.130)
Re-rule feature
Covering set rules: RuleMaker will generate a set of best-defining rules for each class.
Minimum correctness (50-100): if set to 80, the rules generated must have an error rate less than or equal to 20%.
Minimum coverage (10-100): if set to 10, RuleMaker will generate rules that cover 10% or more of the instances in each class.
Attribute significance (start with 80-90): values close to 100 allow RuleMaker to consider only those attribute values most highly predictive of class membership for rule generation.
4.6 Techniques for Generating Rules
- Define the scope of the rules.
- Choose the instances.
- Set the minimum rule correctness.
- Define the minimum rule coverage.
- Choose an attribute significance value.
4.7 Instance Typicality
Typicality scores can be used to:
- Identify prototypical and outlier instances.
- Select a best set of training instances.
- Compute individual instance classification confidence scores.
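Typicality, defined earlier as the average similarity of an instance to the other members of its cluster, can be sketched in a few lines. The similarity measure here (share of matching categorical attributes) is an assumption, since the slides do not give the measure ESX uses, and the instances are made up.

```python
# Assumed similarity: share of categorical attributes on which two instances agree.
# The measure ESX itself uses is not given in the text.

def similarity(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def typicality(cluster):
    """Average similarity of each instance to the other members of its cluster."""
    scores = []
    for i, inst in enumerate(cluster):
        others = [o for j, o in enumerate(cluster) if j != i]
        scores.append(sum(similarity(inst, o) for o in others) / len(others))
    return scores

cluster = [  # made-up instances: (sex, magazine promo, watch promo)
    ("Female", "Yes", "No"),
    ("Female", "Yes", "No"),
    ("Female", "No", "No"),
    ("Male", "Yes", "Yes"),
]
scores = typicality(cluster)
most = cluster[scores.index(max(scores))]    # candidate prototype
least = cluster[scores.index(min(scores))]   # candidate outlier
print("most typical:", most)
print("least typical:", least)
```

Instances with the highest scores are candidates for prototypes, and those with the lowest scores are candidates for outliers.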
Figure 4.13 Instance Typicality
4.8 Special Considerations and Features
- Avoid Mining Delays
- The Quick Mine Feature
- Supervised: with more than 2000 training set instances, you are asked whether to use the quick mine feature.
- Unsupervised: with more than 2000 data instances, ESX is given a random selection of 500 instances.
- Erroneous and Missing Data
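The unsupervised quick mine behavior amounts to random subsampling once a size limit is exceeded. A minimal sketch, assuming simple random sampling without replacement; how iDA actually draws the sample is not described, and the function name and `seed` parameter are illustrative.

```python
import random

QUICK_MINE_LIMIT = 2000   # threshold quoted in the text
SAMPLE_SIZE = 500         # sample size quoted for unsupervised sessions

def quick_mine_sample(instances, seed=None):
    """Return a random subset when the dataset exceeds the limit, else all instances."""
    if len(instances) <= QUICK_MINE_LIMIT:
        return list(instances)
    rng = random.Random(seed)   # seed is an illustrative parameter for repeatability
    return rng.sample(list(instances), SAMPLE_SIZE)

data = list(range(5000))        # stand-in for 5,000 data instances
subset = quick_mine_sample(data, seed=1)
print(len(subset))              # 500
```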
Homework
- Use ESX (and iDA) to perform a supervised data mining session using the CardiologyCategorical.xls data file.
- Save the output file as CardiologyCategorical-supervised.xls.
- Lab4 (p.141)
- Turn in:
- 1. Spreadsheet file (CardiologyCategorical-supervised.xls) that contains the outcome of the data mining session
- 2. Word file that includes (and explains) answers to all questions (a. through n.)