preprocessing: an example - PowerPoint PPT Presentation
Provided by: CHMat
1
pre-processing: an example
  • the iris data set
  • a commonly used data set in machine learning and statistics
  • the task is to classify an iris flower into one of three species based on four attributes
  • attributes (all continuous)
    • sepal length (cm)
    • sepal width (cm)
    • petal length (cm)
    • petal width (cm)
  • species (the target class)
    • setosa
    • versicolour
    • virginica
  • note
    • the sepal is one of the small, green, leaf-like outer parts of a flower
    • the petal is one of the brightly coloured inner parts of the flower
  • 150 examples (50 in each class)

2
pre-processing an example
  • 150 examples (50 in each class)
  • what does the raw data look like?

    example   sepal length   sepal width   petal length   petal width   class (species)
    -----------------------------------------------------------------------------------
    1         6.3            3.4           5.6            2.4           virginica
    2         5.7            2.6           3.5            1.0           versicolour
    ...       ...            ...           ...            ...           ...
    149       5.0            3.4           1.5            0.2           setosa
    150       5.7            2.8           4.1            1.3           versicolour

3
pre-processing an example
  • looking at the distribution of the data
  • For example, sepal length
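Inspecting an attribute's distribution can be sketched with a simple binned count. This is only an illustration: the sample values below are hypothetical, not the full iris data.

```python
from collections import Counter

# A few illustrative sepal-length values (hypothetical sample, not the full iris data).
sepal_lengths = [6.3, 5.7, 5.0, 5.7, 4.9, 6.1, 7.2, 5.8, 6.4, 5.1]

# Bin each value into 0.5 cm intervals to inspect the distribution.
def bin_of(value, width=0.5):
    lower = (value // width) * width
    return round(lower, 1)

histogram = Counter(bin_of(v) for v in sepal_lengths)
for lower in sorted(histogram):
    print(f"{lower:.1f}-{lower + 0.5:.1f} cm: {'#' * histogram[lower]}")
```

Each row of output is a crude text histogram bar; in practice a plotting library would be used instead.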

4
pre-processing an example
  • linear scaling of input data
  • e.g. for sepal length
    • raw data range: 4.3 cm (min) to 7.9 cm (max)
    • scale to 0 (min) and 1 (max), i.e. 4.3 becomes 0 and 7.9 becomes 1
  • scaled_value = (raw_value - minimum_raw_value) / (maximum_raw_value - minimum_raw_value)
  • or
  • scaled_value = (raw_value - minimum_raw_value) / range_of_raw_values
  • for example 1, where sepal length = 6.3:
    • scaled_value = (6.3 - 4.3) / (7.9 - 4.3) = 2.0 / 3.6 ≈ 0.56
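The scaling formula above can be sketched as a small Python function, using the sepal-length range given on the slide:

```python
def min_max_scale(raw_value, min_raw, max_raw):
    """Linearly scale raw_value so that min_raw maps to 0 and max_raw maps to 1."""
    return (raw_value - min_raw) / (max_raw - min_raw)

# Sepal length from example 1, using the slide's range of 4.3-7.9 cm.
scaled = min_max_scale(6.3, 4.3, 7.9)
print(round(scaled, 2))  # 0.56
```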

5
pre-processing an example
  • Pre-processing the target
  • three target classes (setosa, versicolour and
    virginica)
  • the numeric target values will depend on the activation function used by the output neurons (assume a logistic function, whose output lies between 0 and 1)
  • using one output unit
    • e.g. 0.1 represents setosa, 0.5 represents versicolour, 0.9 represents virginica
  • or using three output units
    • 0.1 0.1 0.9 for setosa
    • 0.1 0.9 0.1 for versicolour
    • 0.9 0.1 0.1 for virginica
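The two target encodings from the slide can be written down directly as lookup tables:

```python
# Two ways to encode the three iris species as network targets,
# following the slide: a single output unit, or three output units.

single_output = {"setosa": 0.1, "versicolour": 0.5, "virginica": 0.9}

three_outputs = {
    "setosa":      [0.1, 0.1, 0.9],
    "versicolour": [0.1, 0.9, 0.1],
    "virginica":   [0.9, 0.1, 0.1],
}

print(three_outputs["virginica"])  # [0.9, 0.1, 0.1]
```

Targets of 0.1 and 0.9 (rather than 0 and 1) are used because a logistic output only reaches 0 or 1 asymptotically.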

6
pre-processing an example
  • so the pre-processed examples might look like this (using linear scaling for the inputs and three output units for the target)
    (a table of scaled example values, etc., was shown on the slide)
  • 4 input units
  • 3 output units
7
Data pre-processing (cont'd)
  • circular/periodic data
  • values are repeated periodically
  • examples
  • days of the week
  • months of the year
  • seasons
  • consider the representation of the seasons of the
    year
  • we could use a single input
    • e.g. 0 for summer, 0.3 for autumn, 0.6 for winter, 1 for spring
  • however the cyclic nature (e.g. that spring and summer are adjacent) is not preserved in this representation
  • or ...
  • we could use two inputs to represent the season
    • e.g. 1 1 for summer
    • 1 0 for autumn
    • 0 1 for winter
    • and 0 0 for spring
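A quick check makes the problem with the single-input encoding concrete: spring and summer are neighbours in the yearly cycle, yet their encoded distance is the largest of all adjacent pairs.

```python
# Single-input season encoding from the slide.
single_input = {"summer": 0.0, "autumn": 0.3, "winter": 0.6, "spring": 1.0}

# Distances between seasons that are adjacent in the yearly cycle.
adjacent_pairs = [("summer", "autumn"), ("autumn", "winter"),
                  ("winter", "spring"), ("spring", "summer")]
for a, b in adjacent_pairs:
    print(a, b, abs(single_input[a] - single_input[b]))
```

The spring-summer pair comes out at distance 1.0, three times larger than the other adjacent pairs, even though all four pairs are equally close in the cycle.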

8
Data pre-processing (cont'd)
  • circular/periodic data
  • or ...
  • we could use four inputs to represent the season
  • e.g. 1 0 0 0 for summer
  • 0 1 0 0 for autumn
  • 0 0 1 0 for winter
  • and 0 0 0 1 for spring
  • you could 'fuzzify' the inputs, i.e. encode the degree to which the time of the year is seen to be in each of the seasons
  • the time of the year could be represented using four inputs, each giving the seasonal degree of that time
    • e.g. March could be represented as 0.36 0.64 0 0 (0.36 in summer, 0.64 in autumn, 0 in winter and 0 in spring)
    • April as 0 1 0 0 (0 in summer, 1 in autumn, 0 in winter and 0 in spring) and
    • May as 0 0.36 0.64 0 (0 in summer, 0.36 in autumn, 0.64 in winter and 0 in spring)
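The fuzzified encodings above can be collected into a lookup table. Only the three months given on the slide are shown; the remaining months would follow the same pattern (the slide does not specify them, so they are omitted rather than guessed).

```python
# Fuzzified seasonal encodings taken directly from the slide
# (four inputs, in order: summer, autumn, winter, spring).
fuzzy_season = {
    "March": [0.36, 0.64, 0.0, 0.0],
    "April": [0.0, 1.0, 0.0, 0.0],
    "May":   [0.0, 0.36, 0.64, 0.0],
}

# Each encoding expresses partial membership in at most two adjacent
# seasons, and the degrees for any month sum to 1.
for month, degrees in fuzzy_season.items():
    print(month, degrees, sum(degrees))
```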

9
Data pre-processing (cont'd)
  • Missing Data
  • i.e. some examples in the data set have
    attributes for which the value is missing
  • what can you do?
  • options
  • 1. don't use these examples. This is not recommended unless there are only a few examples with missing data relative to the total number of available examples (e.g. under 10%)
  • 2. substitute a value for the missing value; some possibilities are
    • use the maximum
    • use the minimum
    • use the median (or the mode, i.e. the most common value across the data set for that attribute)
    • use a typical value for that output class
    • determine the 'closest' example and use its value for the attribute
  • 3. accept that the value is missing and use an extra input to indicate that it is missing (applicable for discrete inputs)
    • e.g. gender, where
      • 1 0 0 represents male (or 1 0)
      • 0 1 0 represents female (or 0 1)
      • 0 0 1 represents missing (or 0 0)
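Option 2 can be sketched in a few lines: fill each missing value with the median of the observed values for that attribute. The sample values are illustrative, not the iris data.

```python
from statistics import median

observed = [6.3, 5.7, 5.0, 5.7, 4.9, 6.1]  # hypothetical known sepal lengths
examples = [6.3, None, 5.0, 5.7]            # None marks a missing value

# Substitute the median of the observed values wherever a value is missing.
filled = [v if v is not None else median(observed) for v in examples]
print(filled)  # [6.3, 5.7, 5.0, 5.7]
```

Swapping `median` for `max`, `min`, or a class-conditional typical value gives the other substitution strategies listed above.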

10
Preparing the training and testing sets: an overview
(the slide showed a flow diagram: the data set is split into training data and testing data; the training data, together with pre-processing decisions, is used to train a multi-layer perceptron and gives the training set performance, while the testing data gives the testing performance, i.e. generalisation)
11
Preparing the training and testing sets
  • Training and Testing Sets
    • training set: used to train the network (i.e. to adjust the weights)
    • testing set: used to test the generalisation capability of the network during training; it is not used to adjust weights but simply gives an indication of performance
  • how should examples be allocated to each of these sets?
    • data is randomly allocated to the training set and the testing set
    • however, the class distribution of the data should be preserved in both: the examples for each of the output (target) classes should be allocated proportionally in each set
    • e.g. if 70% of the total data set are examples from class A, then 70% of the training set should be from class A and 70% of the testing set should be from class A
  • there are no strict rules for the number of examples allocated to each set
    • possibilities (training / testing): 60% / 40%, or 70% / 30%
    • remember: the number of training examples should exceed the number of testing examples
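The stratified 70/30 allocation described above can be sketched as follows; the class sizes match the iris data, but the split function itself is an illustration, not a named library routine.

```python
import random

def stratified_split(examples, labels, train_fraction=0.7, seed=0):
    """Randomly split examples into training/testing sets, preserving
    the proportion of each class in both sets."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    train, test = [], []
    for y, xs in by_class.items():
        rng.shuffle(xs)
        cut = int(round(train_fraction * len(xs)))
        train += [(x, y) for x in xs[:cut]]
        test += [(x, y) for x in xs[cut:]]
    return train, test

# 150 examples, 50 per class, as in the iris data set.
examples = list(range(150))
labels = ["setosa"] * 50 + ["versicolour"] * 50 + ["virginica"] * 50
train, test = stratified_split(examples, labels)
print(len(train), len(test))  # 105 45
```

Each class contributes exactly 35 training and 15 testing examples, so the one-third/one-third/one-third class distribution is preserved in both sets.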

12
Preparing the training and testing sets
  • Cross-validation
  • this is a more reliable measure of generalisation than creating a single training set and testing set
  • for example, 10-fold cross validation:
    (the slide showed the data set divided into ten equal subsets, Set 1 to Set 10; in each fold, one subset provides the testing examples and the remaining nine provide the training examples, with a different subset held out in each fold; in fold 1, for instance, Set 2 is the testing set)
13
Preparing the training and testing sets
  • Cross-validation: comments
    • ensures that all examples are at some time used for training and testing
    • each example is used only once for testing
    • minimises bias in the data sets
    • results in many more experiments, i.e. it is time-consuming
    • cross-validation can give a performance estimate on unseen examples for a network trained on the entire data set
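The 10-fold scheme can be sketched as an index generator; this is an illustration of the idea, assuming the data set size divides evenly by the number of folds.

```python
def k_fold_splits(n_examples, k=10):
    """Cut n_examples indices into k equal folds; each fold in turn
    supplies the testing set while the rest form the training set."""
    indices = list(range(n_examples))
    fold_size = n_examples // k
    folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

# 150 iris examples, 10 folds of 15 examples each.
splits = k_fold_splits(150, 10)
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 135 15
```

Note that every index appears in exactly one testing fold, matching the comment above that each example is used only once for testing.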