Title: Data preparation: Selection, Preprocessing, and Transformation
1 Data preparation: Selection, Preprocessing, and Transformation
- Literature
- I.H. Witten and E. Frank, Data Mining, chapters 2 and 7
2 Fayyad's KDD Methodology
3 Contents
- Data Selection
- Data Preprocessing
- Data Transformation
4 Data Selection
- Goal: understanding the data
- Explore the data:
- possible attributes
- their values
- distribution, outliers
5 Getting to know the data
- Simple visualization tools are very useful for identifying problems (a small sketch follows this list)
- Nominal attributes: histograms (is the distribution consistent with background knowledge?)
- Numeric attributes: graphs (any obvious outliers?)
- 2-D and 3-D visualizations show dependencies
- Domain experts need to be consulted
- Too much data to inspect? Take a sample!
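A minimal sketch of this kind of inspection in pandas/matplotlib, assuming a hypothetical weather.csv with the attributes used later in these slides:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("weather.csv")   # hypothetical file name

    # Nominal attribute: histogram of value counts -- is the distribution
    # consistent with background knowledge?
    df["outlook"].value_counts().plot(kind="bar", title="outlook")
    plt.show()

    # Numeric attribute: summary statistics and a box plot -- any obvious outliers?
    print(df["temperature"].describe())
    df["temperature"].plot(kind="box", title="temperature")
    plt.show()

    # Too much data to inspect? Take a random sample.
    sample = df.sample(n=min(1000, len(df)), random_state=0)
    print(sample.head())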
6 Data preprocessing
- Problem: different data sources (e.g. sales department, customer billing department, ...)
- Differences in styles of record keeping, conventions, time periods, data aggregation, primary keys, errors
- Data must be assembled, integrated, cleaned up
- Data warehouse: a consistent point of access
- External data may be required ("overlay data")
- Critical: type and level of data aggregation
7 Data Preprocessing
- Choose a data structure (table, tree, or set of tables)
- Choose attributes with enough information
- Decide on a first representation of the attributes (numeric or nominal)
- Decide on missing values
- Decide on inaccurate data (cleansing)
8 Attribute types used in practice
- Most schemes accommodate just two levels of measurement: nominal and ordinal
- Nominal attributes are also called categorical, enumerated, or discrete
- But "enumerated" and "discrete" imply order
- Special case: dichotomy (boolean attribute)
- Ordinal attributes are also called numeric, or continuous
- But "continuous" implies mathematical continuity
9 The ARFF format
- ARFF file for the weather data with some numeric features:
- @relation weather
- @attribute outlook { sunny, overcast, rainy }
- @attribute temperature numeric
- @attribute humidity numeric
- @attribute windy { true, false }
- @attribute play? { yes, no }
- @data
- sunny, 85, 85, false, no
- sunny, 80, 90, true, no
- overcast, 83, 86, false, yes
- ...
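Outside WEKA, such a file can be read with SciPy's ARFF loader; a sketch, assuming the file is saved as weather.arff (hypothetical name):

    from scipy.io import arff
    import pandas as pd

    # loadarff returns a NumPy structured array plus the attribute metadata
    data, meta = arff.loadarff("weather.arff")
    df = pd.DataFrame(data)

    # Nominal values come back as bytes; decode them to plain strings
    for col in df.select_dtypes([object]).columns:
        df[col] = df[col].str.decode("utf-8")

    print(meta)        # attribute names and declared types
    print(df.head())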
10 Attribute types
- ARFF supports numeric and nominal attributes
- Interpretation depends on the learning scheme
- Numeric attributes are interpreted as
- ordinal scales if less-than and greater-than comparisons are used
- ratio scales if distance calculations are performed
- (normalization/standardization may be required)
- Instance-based schemes define a distance between nominal values (0 if the values are equal, 1 otherwise); a sketch follows below
- Integers: nominal, ordinal, or ratio scale?
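A minimal sketch of such a mixed distance, with numeric attributes normalized by their range and nominal attributes compared by equality (attribute names and ranges are illustrative):

    def mixed_distance(a, b, numeric_ranges):
        """Distance between two instances given as dicts of attribute -> value.

        numeric_ranges maps each numeric attribute to its (min, max) in the data;
        all other attributes are treated as nominal (0 if equal, 1 otherwise).
        """
        total = 0.0
        for attr, va in a.items():
            vb = b[attr]
            if attr in numeric_ranges:
                lo, hi = numeric_ranges[attr]
                total += ((va - vb) / (hi - lo)) ** 2   # normalized numeric difference
            else:
                total += 0.0 if va == vb else 1.0       # nominal: 0/1 mismatch
        return total ** 0.5

    ranges = {"temperature": (64, 85), "humidity": (65, 96)}
    x = {"outlook": "sunny", "temperature": 85, "humidity": 85, "windy": "false"}
    y = {"outlook": "overcast", "temperature": 83, "humidity": 86, "windy": "false"}
    print(mixed_distance(x, y, ranges))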
11 Nominal vs. ordinal
- Attribute age treated as nominal:
- If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
- If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
- Attribute age treated as ordinal (e.g. young < pre-presbyopic < presbyopic):
- If age ≤ pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
12 Missing values
- Frequently indicated by out-of-range entries
- Types: unknown, unrecorded, irrelevant
- Reasons: malfunctioning equipment, changes in experimental design, collation of different datasets, measurement not possible
- A missing value may have significance in itself (e.g. a missing test in a medical examination)
- Most schemes assume that this is not the case → "missing" may need to be coded as an additional value (see the sketch below)
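A sketch of that coding in pandas (column names and values are illustrative):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "outlook": ["sunny", np.nan, "rainy", "overcast"],
        "humidity": [85, 90, np.nan, 86],
    })

    # Nominal attribute: make "missing" a value of its own
    df["outlook"] = df["outlook"].fillna("missing")

    # Numeric attribute: impute, but keep an indicator so the fact that the
    # value was missing is not lost (it may be significant in itself)
    df["humidity_missing"] = df["humidity"].isna()
    df["humidity"] = df["humidity"].fillna(df["humidity"].median())

    print(df)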
13 Inaccurate values
- Reason: the data has not been collected for mining it
- Result: errors and omissions that don't affect the original purpose of the data (e.g. age of customer)
- Typographical errors in nominal attributes → values need to be checked for consistency
- Typographical and measurement errors in numeric attributes → outliers need to be identified
- Errors may be deliberate (e.g. wrong zip codes)
- Other problems: duplicates, stale data (a few routine checks are sketched below)
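A sketch of such routine checks in pandas; the file, column names, and the 1.5×IQR outlier rule are illustrative choices, not part of the slides:

    import pandas as pd

    df = pd.read_csv("customers.csv")   # hypothetical file

    # Nominal attribute: list distinct values to spot inconsistent spellings
    print(df["country"].value_counts())

    # Numeric attribute: flag values far outside the typical range as outlier candidates
    q1, q3 = df["age"].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
    print(outliers)

    # Exact duplicates
    print(df[df.duplicated()])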
14 Transformation: Attribute selection
- Adding a random (i.e. irrelevant) attribute can significantly degrade C4.5's performance
- Problem: attribute selection is based on smaller and smaller amounts of data as the tree is built
- IBL is also very susceptible to irrelevant attributes
- The number of training instances required increases exponentially with the number of irrelevant attributes
- Naïve Bayes doesn't have this problem
- Relevant attributes can also be harmful
15 Scheme-independent selection
- Filter approach: assessment based on general characteristics of the data
- One method: find a subset of attributes that is sufficient to separate all the instances
- Another method: use a different learning scheme (e.g. C4.5, 1R) to select attributes
- IBL-based attribute weighting techniques can also be used (but they can't find redundant attributes)
- CFS uses correlation-based evaluation of attribute subsets (a simpler filter is sketched below)
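CFS itself is not shown here; as a stand-in for the filter idea, this sketch ranks attributes by mutual information with the class, independently of any particular learning scheme (file and column names are illustrative):

    import pandas as pd
    from sklearn.feature_selection import mutual_info_classif

    df = pd.read_csv("weather.csv")                 # hypothetical file
    X = pd.get_dummies(df.drop(columns=["play"]))   # one-hot encode nominal attributes
    y = df["play"]

    # Score each attribute by mutual information with the class (a filter criterion)
    scores = mutual_info_classif(X, y, random_state=0)
    for name, score in sorted(zip(X.columns, scores), key=lambda t: -t[1]):
        print(f"{name}: {score:.3f}")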
16 Attribute subsets for weather data
17 Searching the attribute space
- The number of possible attribute subsets is exponential in the number of attributes
- Common greedy approaches: forward selection and backward elimination
- More sophisticated strategies:
- Bidirectional search
- Best-first search: can find the optimum solution
- Beam search: approximation to best-first search
- Genetic algorithms
18 Scheme-specific selection
- Wrapper approach: attribute selection implemented as a wrapper around the learning scheme
- Evaluation criterion: cross-validation performance
- Time consuming: adds a factor of k^2 even for greedy approaches with k attributes
- Linearity in k requires prior ranking of the attributes
- Scheme-specific attribute selection is essential for learning decision tables
- Can be done efficiently for DTs and Naïve Bayes (a wrapper sketch follows below)
19 Discretizing numeric attributes
- Can be used to avoid making the normality assumption in Naïve Bayes and clustering
- A simple discretization scheme is used in 1R
- C4.5 performs local discretization
- Global discretization can be advantageous because it is based on more data
- The learner can be applied to the discretized attribute, or
- it can be applied to binary attributes coding the cut points of the discretized attribute
20 Unsupervised discretization
- Unsupervised discretization generates intervals without looking at class labels
- The only possible way when clustering
- Two main strategies (sketched below):
- Equal-interval binning
- Equal-frequency binning (also called histogram equalization)
- Inferior to supervised schemes in classification tasks
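Both strategies are available in scikit-learn's KBinsDiscretizer; a sketch on the temperature values of the weather data (the number of bins is an illustrative choice):

    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer

    temps = np.array([[64], [65], [68], [69], [70], [71], [72], [72],
                      [75], [75], [80], [81], [83], [85]])

    # Equal-interval binning: bins of equal width over the attribute's range
    equal_width = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
    print(equal_width.fit_transform(temps).ravel())

    # Equal-frequency binning: each bin receives (roughly) the same number of instances
    equal_freq = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
    print(equal_freq.fit_transform(temps).ravel())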
21 Entropy-based discretization
- Supervised method that builds a decision tree with pre-pruning on the attribute being discretized
- Entropy is used as the splitting criterion (finding one split is sketched below)
- MDLP is used as the stopping criterion
- State-of-the-art discretization method
- Application of MDLP:
- The "theory" is the splitting point (log2(N-1) bits) plus the class distribution in each subset
- The description length before and after adding the splitting point is compared
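A sketch of the entropy-based choice of a single cut point, applied to the temperature attribute of the weather data; the full method applies this recursively and stops according to the MDL criterion on slide 23:

    import numpy as np

    def entropy(labels):
        """Class entropy (in bits) of a sequence of class labels."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    def best_entropy_split(values, labels):
        """Cut point minimizing the weighted entropy of the two resulting subsets."""
        order = np.argsort(values)
        values, labels = np.asarray(values)[order], np.asarray(labels)[order]
        best_cut, best_e = None, float("inf")
        for i in range(1, len(values)):
            if values[i] == values[i - 1]:
                continue                  # only cut between distinct values
            left, right = labels[:i], labels[i:]
            e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            if e < best_e:
                best_e, best_cut = e, (values[i - 1] + values[i]) / 2
        return best_cut, best_e

    temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
    play  = ["yes", "no", "yes", "yes", "yes", "no", "no", "yes",
             "yes", "yes", "no", "yes", "yes", "no"]
    print(best_entropy_split(temps, play))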
22 Example: temperature attribute
23 Formula for MDLP
- N instances
- k classes and entropy E in the original set
- k1 classes and entropy E1 in the first subset
- k2 classes and entropy E2 in the second subset
- The criterion (written out below) does not result in any discretization intervals for the temperature attribute
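The slide does not reproduce the formula itself; for reference, the MDL-based stopping criterion of Fayyad and Irani, as given in Witten & Frank: a split is accepted only if

    \mathrm{gain} > \frac{\log_2(N-1)}{N} + \frac{\log_2(3^k - 2) - k E + k_1 E_1 + k_2 E_2}{N}

The left-hand side is the information gain of the candidate split; the right-hand side is the per-instance cost of encoding the split point and the extra class information it introduces.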
24 Other discretization methods
- The top-down procedure can be replaced by a bottom-up method
- MDLP can be replaced by a chi-squared test
- Dynamic programming can be used to find the optimum k-way split for a given additive criterion
- Requires time quadratic in the number of instances if entropy is used as the criterion
- Can be done in linear time if error rate is used as the evaluation criterion
25 Transformation
- WEKA provides a lot of filters that can help you transform and select your attributes!
- Use them to build a promising model for the caravan data!