Chapter 4 An Excel-based Data Mining Tool (iData Analyzer) - PowerPoint PPT Presentation

Slides: 46

Provided by: chen127

Transcript and Presenter's Notes

1
Chapter 4: An Excel-based Data Mining Tool (iData Analyzer)
Jason C. H. Chen, Ph.D.
Professor of MIS
School of Business Administration
Gonzaga University
Spokane, WA 99223
chen_at_jepson.gonzaga.edu
2
Objectives

This chapter introduces the iData Analyzer (iDA) and shows how to use two of
the learner models contained in your iDA data mining software.
Section 4.1 gives an overview of the iDA model for knowledge discovery.
Section 4.2 introduces ESX, an exemplar-based data mining tool capable of
both supervised learning and unsupervised clustering.
The chapter also covers how to represent datasets and how to use ESX to
perform unsupervised clustering and build supervised learning models,
among other topics.

3
4.1 The iData Analyzer

iDA provides support for the business or
technical analyst by offering a visual learning
environment, an integrated tool set, and data
mining process support.
iDA consists of the following components:
Preprocessor
Heuristic agent (for large datasets)
ESX
Neural Network
Rule Maker
Report Generator

See p.107 and Appendix A-2 for installation instructions.
4
Limitations

The commercial version of iDA is bounded by the size of a single MS Excel
spreadsheet, i.e., up to 65,536 rows and 256 columns.
The iDA input format uses the first three rows of a spreadsheet to house
information about individual attributes.
Up to 65,533 data instances in attribute-value format can be mined.
The student version allows a maximum of 7,000 data instances (i.e., 7,003
rows).
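
The row limits above follow from simple arithmetic, sketched here in Python. The constant and function names are illustrative assumptions and are not part of iDA.

```python
# Illustrative names only; none of these identifiers come from iDA.
EXCEL_MAX_ROWS = 65_536   # rows in a single MS Excel spreadsheet (per the slide)
IDAV_HEADER_ROWS = 3      # the first three rows describe the attributes


def max_instances(total_rows: int, header_rows: int = IDAV_HEADER_ROWS) -> int:
    """Rows left for data instances once the header rows are set aside."""
    return total_rows - header_rows


def rows_needed(instances: int, header_rows: int = IDAV_HEADER_ROWS) -> int:
    """Spreadsheet rows required to hold the given number of instances."""
    return instances + header_rows


print(max_instances(EXCEL_MAX_ROWS))  # commercial version instance limit
print(rows_needed(7_000))             # sheet size at the student version limit
```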

After completing the installation, if the security setting is High, change
it to Medium and click OK.
5
Figure 4.1 The iDA system architecture
6
(No Transcript)
7
4.2 ESX: A Multipurpose Tool for Data Mining

ESX can help create target data, find irregularities in data, perform data
mining, and offer insight about the practical value of discovered knowledge.
Features of the ESX learner model:
It supports both supervised learning and unsupervised clustering.
It supports an automated method for dealing with missing attribute values.
It does not make statistical assumptions about the nature of the data to be
processed.
It can point out inconsistencies and unusual values in data.

8
Figure 4.3 An ESX concept hierarchy
9
4.3 iDAV Format for Data Mining

Second row: C = categorical, R = real-valued
Third row: attribute usage (see Table 4.2 below)
Table 4.2 Values for Attribute Usage
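
A rough Python sketch of checking the three header rows before mining. Several things here are assumptions: that the first row holds attribute names, and that the usage codes include I and U (only O and D appear in this transcript, and Table 4.2 itself is not reproduced). The function name and the sample sheet are hypothetical.

```python
TYPE_CODES = {"C", "R"}  # second row: C = categorical, R = real-valued


def check_idav_header(rows, usage_codes):
    """Return a list of problems found in the three header rows."""
    if len(rows) < 3:
        return ["expected at least three header rows"]
    names, types, usage = rows[0], rows[1], rows[2]
    problems = []
    if not (len(names) == len(types) == len(usage)):
        problems.append("header rows have different lengths")
    for col, code in enumerate(types):
        if code not in TYPE_CODES:
            problems.append(f"column {col}: unknown type code {code!r}")
    for col, code in enumerate(usage):
        if code not in usage_codes:
            problems.append(f"column {col}: unknown usage code {code!r}")
    return problems


# Assumed usage codes: O and D come from the slides, I and U are guesses.
ASSUMED_USAGE = {"I", "U", "D", "O"}

# Hypothetical sheet: names, types, usage, then one data instance.
sheet = [
    ["Income Range", "Sex", "Age", "Life Ins Promo"],
    ["C", "C", "R", "C"],
    ["I", "I", "I", "O"],
    ["40-50K", "Male", 45, "No"],
]
print(check_idav_header(sheet, ASSUMED_USAGE))
```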

10
Table 4.1 Credit Card Promotion Database iDAV
Format
11
4.4 A Five-step Approach for Unsupervised Clustering

Step 1: Enter the Data to be Mined
Step 2: Perform a Data Mining Session
Step 3: Read and Interpret Summary Results
Step 4: Read and Interpret Individual Class Results
Step 5: Visualize Individual Class Rules

12
Step 1 Enter The Data To Be Mined
13
Step 2 Perform A Data Mining Session
14
Figure 4.5 Unsupervised settings for ESX (step 4, p.116)
Value for instance similarity: a value closer to 100 encourages the
formation of new clusters; a value closer to 0 favors placing new instances
into existing clusters.
The real-valued tolerance setting helps determine the similarity criteria
for real-valued attributes. A setting of 1.0 is usually appropriate.
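
To see why a higher similarity value tends to produce more clusters, here is a minimal Python sketch of a generic threshold-based ("leader") clustering. It is not the ESX algorithm, whose internals the slides do not describe. The similarity measure (percent of matching attribute values) and the toy data are assumptions.

```python
def similarity(a, b):
    """Percent of attribute positions on which two instances agree."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def leader_cluster(instances, threshold):
    """Assign each instance to the most similar cluster leader, or start
    a new cluster when no leader reaches the threshold (0-100)."""
    clusters = []  # each cluster is a list; its first element is the leader
    for inst in instances:
        best, best_sim = None, -1.0
        for cluster in clusters:
            sim = similarity(inst, cluster[0])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best.append(inst)
        else:
            clusters.append([inst])
    return clusters


# Hypothetical categorical instances (sex, attribute 2, attribute 3).
data = [
    ("F", "Yes", "No"),
    ("F", "Yes", "Yes"),
    ("M", "No", "No"),
    ("M", "No", "Yes"),
    ("F", "No", "No"),
]
for threshold in (0, 60, 100):
    print(threshold, len(leader_cluster(data, threshold)))
```

Running the loop shows the cluster count growing as the threshold moves from 0 toward 100, which mirrors the behavior described for the instance similarity setting.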
15
6. A message box indicates that eight clusters were formed.
This tells us the data has been successfully mined.
16
6, 7 (p.116): As a general rule, an unsupervised clustering of more than
five or six clusters is likely to be less than optimal.
17
8 and 9: Repeat steps 1-4. For step 5, set the similarity value to 55.
18
Re-rule feature
Covering set rules: RuleMaker will generate a set of best-defining rules
for each class.
Minimum correctness rule (50-100): if set to 80, the rules generated must
have an error rate less than or equal to 20%.
Minimum coverage (10-100): if set to 10, RuleMaker will generate rules that
cover 10% or more of the instances in each class.
Attribute significance (start with 80-90): values close to 100 will allow
RuleMaker to consider only those attribute values most highly predictive of
class membership for rule generation.
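
The two thresholds above can be read as a simple acceptance test on a candidate rule. The Python sketch below illustrates that reading only. The function name and defaults are hypothetical, and it does not reproduce how RuleMaker actually searches for rules.

```python
def rule_passes(accuracy, coverage, min_correctness=80, min_coverage=10):
    """Check a rule against both thresholds; all values are percentages.

    A minimum correctness of 80 means the rule's error rate may not
    exceed 100 - 80 = 20 percent.
    """
    max_error = 100 - min_correctness
    error_rate = 100 - accuracy
    return error_rate <= max_error and coverage >= min_coverage


print(rule_passes(accuracy=100.0, coverage=66.67))
print(rule_passes(accuracy=77.78, coverage=100.0))
print(rule_passes(accuracy=77.78, coverage=100.0, min_correctness=50))
```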
19
10 (p.117): Set minimum rule coverage at 30.
Minimum correctness rule (50-100): if set to 80, the rules generated must
have an error rate less than or equal to 20%.
Minimum coverage (10-100): if set to 10, RuleMaker will generate rules that
cover 10% or more of the instances in each class.
Attribute significance: values close to 100 will allow RuleMaker to
consider only those attribute values most highly predictive of class
membership for rule generation.
20
A Production Rule for the Credit Card Promotion Database

IF Sex = Female & 19 < Age < 43
THEN Life Insurance Promotion = Yes
Rule Accuracy: 100.00%
Rule Coverage: 66.67%

Question: Can we assume that two-thirds of all females in the specified age
range will take advantage of the promotion?

Rule accuracy is a between-class measure.
Rule coverage is a within-class measure.
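
A small Python sketch of how the two measures differ. It assumes accuracy is the share of instances matching the rule's IF part that fall in the target class, and coverage is the share of target-class instances that match the IF part. The five records are made up for illustration and are not the credit card promotion database.

```python
def rule_stats(records, antecedent, target, target_value):
    """Return (accuracy, coverage) of a rule as percentages."""
    matched = [r for r in records if antecedent(r)]
    in_class = [r for r in records if r[target] == target_value]
    hits = [r for r in matched if r[target] == target_value]
    accuracy = 100.0 * len(hits) / len(matched) if matched else 0.0
    coverage = 100.0 * len(hits) / len(in_class) if in_class else 0.0
    return round(accuracy, 2), round(coverage, 2)


# Hypothetical records.
records = [
    {"sex": "Female", "age": 27, "life_ins": "Yes"},
    {"sex": "Female", "age": 35, "life_ins": "Yes"},
    {"sex": "Male", "age": 40, "life_ins": "Yes"},
    {"sex": "Male", "age": 29, "life_ins": "No"},
    {"sex": "Female", "age": 55, "life_ins": "No"},
]


def female_19_to_43(r):
    """IF part of the rule: Sex = Female and 19 < Age < 43."""
    return r["sex"] == "Female" and 19 < r["age"] < 43


print(rule_stats(records, female_19_to_43, "life_ins", "Yes"))
```

On this toy data, coverage is computed over the three instances in the Yes class, two of which match the rule. It describes the class, not the share of matching females who accept the promotion, which is the distinction the question above is probing.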

21
Output Reports: Unsupervised Clustering

RES SUM: This sheet contains summary statistics about attribute values and
offers several heuristics to help us determine the quality of a data mining
session.
RES CLS: This sheet has information about the clusters formed as a result
of an unsupervised mining session.
RUL TYP: Instances are listed by their cluster number. The typicality of
instance i is the average similarity of i to the other members of its
cluster.
RES RUL: The production rules generated for each cluster are contained in
this sheet.

22
10 (p.117) Set minimum rule coverage at 30
23
Figure 4.7 Rules for the credit card promotion
database
24
Step 3: Read and Interpret Summary Results (p.117) (Sheet1 RES SUM)

Class Resemblance Scores (RES)
Domain Resemblance Score
Domain Predictability

25
Step 3: Read and Interpret Summary Results (p.119)
In general, the within-class RES scores should be higher than the domain
RES score. This should hold for most of the classes.
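
The comparison can be sketched in Python under a loud assumption: the slides do not define how a resemblance score is computed, so the code below treats it as the average pairwise similarity among a group of instances, with similarity taken as the fraction of matching attribute values. The clusters are hypothetical.

```python
from itertools import combinations


def similarity(a, b):
    """Fraction of attribute positions on which two instances agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def resemblance(instances):
    """Average pairwise similarity (an assumed stand-in for a RES score)."""
    pairs = list(combinations(instances, 2))
    if not pairs:
        return 1.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical clustering result: cluster label -> member instances.
clusters = {
    1: [("F", "Yes"), ("F", "Yes"), ("F", "No")],
    2: [("M", "No"), ("M", "No")],
}
domain = [inst for members in clusters.values() for inst in members]
domain_res = resemblance(domain)

for label, members in clusters.items():
    class_res = resemblance(members)
    print(label, round(class_res, 3), class_res > domain_res)
```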
26
Figure 4.9 - Step 3 Read and Interpret Summary
Results (cont.)
27
Step 3: Read and Interpret Summary Results (cont.)
Figure 4.9 Statistics for numerical attributes and common categorical
attribute values
28
Step 4: Read and Interpret Individual Class Results (p.121) (Sheet1 RES CLS)

Typicality is defined as the average similarity of an instance to all other
members of its cluster or class.
Class predictability is a within-class measure: the percent of class
instances having a particular value for a categorical attribute.
Class predictiveness is a between-class measure: the probability that an
instance resides in a specified class, given that the instance has the
value for the chosen attribute.
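
The two definitions can be written as short Python functions. This is a sketch over hypothetical records, and the function and field names are assumptions made for illustration.

```python
def class_predictability(records, class_attr, class_value, attr, value):
    """Within-class: share of the class's instances having attr == value."""
    in_class = [r for r in records if r[class_attr] == class_value]
    if not in_class:
        return 0.0
    return sum(r[attr] == value for r in in_class) / len(in_class)


def class_predictiveness(records, class_attr, class_value, attr, value):
    """Between-class: share of instances with attr == value that are in the class."""
    with_value = [r for r in records if r[attr] == value]
    if not with_value:
        return 0.0
    return sum(r[class_attr] == class_value for r in with_value) / len(with_value)


# Hypothetical data: 5 instances in class Yes, 3 in class No.
records = (
    [{"promo": "Yes", "cc_ins": "Yes"}] * 2
    + [{"promo": "Yes", "cc_ins": "No"}] * 3
    + [{"promo": "No", "cc_ins": "Yes"}] * 1
    + [{"promo": "No", "cc_ins": "No"}] * 2
)

print(class_predictability(records, "promo", "Yes", "cc_ins", "Yes"))
print(class_predictiveness(records, "promo", "Yes", "cc_ins", "Yes"))
```

Predictability divides by the size of the class, while predictiveness divides by the number of instances sharing the attribute value, so the two can differ widely for the same attribute value.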

29
Figure 4.10 Class 3 Summary Results
30
Figure 4.11 Necessary and sufficient attribute
values for Class 3
31
Step 5: Visualize Individual Class Rules
IF Life Ins Promo = Yes THEN Class = 3
Rule accuracy: 77.78%
Rule coverage: 100.00%
32
4.5 A Six-Step Approach for Supervised Learning

Step 1: Choose an Output Attribute
Scenario: launch a fresh life insurance promotion
Step 2: Perform the Mining Session
Step 3: Read and Interpret Summary Results
Step 4: Read and Interpret Test Set Results
Step 5: Read and Interpret Class Results
Step 6: Visualize and Interpret Class Rules

33
Step 2: Perform the Mining Session
Filename: CreditCardPromotion-supervised.xls
Attribute usage: O = output, D = display-only
34
Step 2(4) Select the number of instances for
training and a real-valued tolerance setting
(p.127)
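
Choosing the number of training instances amounts to splitting the data into a training set and a test set. The Python sketch below assumes the first n instances are used for training and the remainder for testing. The slide does not say how iDA picks them, and the function name is hypothetical.

```python
def split_instances(instances, n_train):
    """Use the first n_train instances for training and the rest for testing."""
    if not 0 <= n_train <= len(instances):
        raise ValueError("n_train must be between 0 and the number of instances")
    return instances[:n_train], instances[n_train:]


# Hypothetical example: 15 instances, 10 of them used for training.
instances = list(range(15))
train, test = split_instances(instances, 10)
print(len(train), len(test))
```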
35
Step 3: Read and Interpret Summary Results
Domain statistics for categorical attributes tell us that 80% of the
training instances represent individuals without credit card insurance.
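
A domain statistic of this kind is a value frequency over the training instances. Here is a minimal Python sketch using made-up values chosen to give the same 80% figure. It is not taken from the actual training data.

```python
from collections import Counter


def value_percentages(values):
    """Percent of instances taking each value of a categorical attribute."""
    counts = Counter(values)
    total = len(values)
    return {value: 100.0 * count / total for value, count in counts.items()}


# Hypothetical credit card insurance column for 10 training instances.
credit_card_ins = ["No"] * 8 + ["Yes"] * 2
print(value_percentages(credit_card_ins))
```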
36
Step 3 Read and Interpret Summary Results
(cont.)
37
Step 4 - Read and Interpret Test Set Results
38
Step 5 - Read and Interpret Results for
Individual Classes (p.130)
39
Sheet1 RUL TYP
In class Yes (Life Ins Promo), the share of instances with Credit Card
Ins = Yes is 40% (2/5).
40
Step 6: Visualize and Interpret Class Rules (p.130)
Re-rule feature
Covering set rules: RuleMaker will generate a set of best-defining rules
for each class.
Minimum correctness rule (50-100): if set to 80, the rules generated must
have an error rate less than or equal to 20%.
Minimum coverage (10-100): if set to 10, RuleMaker will generate rules that
cover 10% or more of the instances in each class.
Attribute significance (start with 80-90): values close to 100 will allow
RuleMaker to consider only those attribute values most highly predictive of
class membership for rule generation.
41
4.6 Techniques for Generating Rules

Define the scope of the rules.
Choose the instances.
Set the minimum rule correctness.
Define the minimum rule coverage.
Choose an attribute significance value.

42
4.7 Instance Typicality

Typicality scores can be used to:
Identify prototypical and outlier instances.
Select a best set of training instances.
Compute individual instance classification confidence scores.
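
Typicality, defined earlier as the average similarity of an instance to the other members of its cluster, can be sketched in Python as follows. The similarity measure (fraction of matching attribute values) and the cluster are assumptions for illustration and may differ from what ESX computes.

```python
def similarity(a, b):
    """Fraction of attribute positions on which two instances agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def typicality(index, cluster):
    """Average similarity of cluster[index] to the other cluster members."""
    others = [inst for i, inst in enumerate(cluster) if i != index]
    if not others:
        return 1.0
    return sum(similarity(cluster[index], o) for o in others) / len(others)


# Hypothetical cluster of four categorical instances.
cluster = [
    ("F", "Yes", "No"),
    ("F", "Yes", "No"),
    ("F", "No", "No"),
    ("M", "No", "Yes"),
]
scores = [typicality(i, cluster) for i in range(len(cluster))]

# Rank from most to least typical; the last entry is the outlier candidate.
ranked = sorted(range(len(cluster)), key=lambda i: scores[i], reverse=True)
print([(i, round(scores[i], 3)) for i in ranked])
```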

43
Figure 4.13 Instance Typicality
44
4.8 Special Considerations and Features

Avoid Mining Delays
The Quick Mine Feature
Supervised: with more than 2000 training set instances, you will be asked
whether to use the quick mine feature.
Unsupervised: with more than 2000 data instances, ESX is given a random
selection of 500 instances.
Erroneous and Missing Data
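
The quick mine idea for unsupervised clustering can be sketched in Python as a random subsample taken once the data passes a size limit. The constants come from the slide. The function name and the use of uniform sampling without replacement are assumptions, since the slides do not say how iDA draws the selection.

```python
import random

QUICK_MINE_TRIGGER = 2000  # more instances than this triggers quick mine
QUICK_MINE_SAMPLE = 500    # size of the random selection given to ESX


def quick_mine_sample(instances, rng=None):
    """Return all instances, or a random selection of 500 for large data."""
    rng = rng or random.Random()
    if len(instances) <= QUICK_MINE_TRIGGER:
        return list(instances)
    return rng.sample(instances, QUICK_MINE_SAMPLE)


# Hypothetical dataset of 3000 instance ids; a seeded generator keeps
# the example repeatable.
big = list(range(3000))
sample = quick_mine_sample(big, random.Random(0))
print(len(sample))
```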

45
Homework

Use ESX (and iDA) to perform a supervised data mining session using the
CardiologyCategorical.xls data file.
Save the output file as CardiologyCategorical-supervised.xls
Lab4 (p.141)
Turn in:
1. Spreadsheet file (CardiologyCategorical-supervised.xls) that contains
the outcome of the data mining session
2. Word file that includes (and explains) answers to all questions (a
through n)