preprocessing: an example - PowerPoint PPT Presentation
Provided by: CHMat
1
pre-processing: an example
  • the iris data set
  • a commonly used data set in machine learning and statistics
  • the task is to classify an iris flower into one of three species based on four attributes
  • attributes (all continuous)
    • sepal length (cm)
    • sepal width (cm)
    • petal length (cm)
    • petal width (cm)
  • species (the target class)
    • setosa
    • versicolour
    • virginica
  • note
    • the sepal is one of the small, green, leaf-like outer parts of a flower
    • the petal is one of the brightly coloured inner parts of the flower
  • 150 examples (50 in each class)

2
pre-processing an example
  • 150 examples (50 in each class)
  • what does the raw data look like?

    example   sepal length   sepal width   petal length   petal width   class (species)
    -----------------------------------------------------------------------------------
    1         6.3            3.4           5.6            2.4           virginica
    2         5.7            2.6           3.5            1.0           versicolour
    ...       ...            ...           ...            ...           ...
    149       5.0            3.4           1.5            0.2           setosa
    150       5.7            2.8           4.1            1.3           versicolour

3
pre-processing an example
  • looking at the distribution of the data
  • For example, sepal length
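Inspecting an attribute's distribution can be sketched with a simple binned count. This is only an illustration: the sample values below are hypothetical, not the full iris data.

```python
from collections import Counter

# A few illustrative sepal-length values (hypothetical sample, not the full iris data).
sepal_lengths = [6.3, 5.7, 5.0, 5.7, 4.9, 6.1, 7.2, 5.8, 6.4, 5.1]

# Bin each value into 0.5 cm intervals to inspect the distribution.
def bin_of(value, width=0.5):
    lower = (value // width) * width
    return round(lower, 1)

histogram = Counter(bin_of(v) for v in sepal_lengths)
for lower in sorted(histogram):
    print(f"{lower:.1f}-{lower + 0.5:.1f} cm: {'#' * histogram[lower]}")
```

Each row of output is a crude text histogram bar; in practice a plotting library would be used instead.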

4
pre-processing an example
  • linear scaling of input data
  • e.g. for sepal length
    • raw data range: 4.3 cm (min) to 7.9 cm (max)
    • scale to 0 (min) and 1 (max), i.e. 4.3 becomes 0 and 7.9 becomes 1
  • scaled_value = (raw_value - minimum_raw_value) / (maximum_raw_value - minimum_raw_value)
  • or
  • scaled_value = (raw_value - minimum_raw_value) / range_of_raw_values
  • for example 1, where sepal length = 6.3:
    • scaled_value = (6.3 - 4.3) / (7.9 - 4.3) = 2.0 / 3.6 ≈ 0.56
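The scaling formula above can be sketched as a small Python function, using the sepal-length range given on the slide:

```python
def min_max_scale(raw_value, min_raw, max_raw):
    """Linearly scale raw_value so that min_raw maps to 0 and max_raw maps to 1."""
    return (raw_value - min_raw) / (max_raw - min_raw)

# Sepal length from example 1, using the slide's range of 4.3-7.9 cm.
scaled = min_max_scale(6.3, 4.3, 7.9)
print(round(scaled, 2))  # 0.56
```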

5
pre-processing an example
  • Pre-processing the target
  • three target classes (setosa, versicolour and
    virginica)
  • the numeric target values will depend on the activation function used by the output neurons (assume a logistic function, whose output lies between 0 and 1)
  • using one output unit
    • e.g. 0.1 represents setosa, 0.5 represents versicolour, 0.9 represents virginica
  • or using three output units
    • 0.1 0.1 0.9 for setosa
    • 0.1 0.9 0.1 for versicolour
    • 0.9 0.1 0.1 for virginica
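The two target encodings from the slide can be written down directly as lookup tables:

```python
# Two ways to encode the three iris species as network targets,
# following the slide: a single output unit, or three output units.

single_output = {"setosa": 0.1, "versicolour": 0.5, "virginica": 0.9}

three_outputs = {
    "setosa":      [0.1, 0.1, 0.9],
    "versicolour": [0.1, 0.9, 0.1],
    "virginica":   [0.9, 0.1, 0.1],
}

print(three_outputs["virginica"])  # [0.9, 0.1, 0.1]
```

Targets of 0.1 and 0.9 (rather than 0 and 1) are used because a logistic output only reaches 0 or 1 asymptotically.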

6
pre-processing an example
  • so the pre-processed examples might look like this (using linear scaling for the inputs and three output units for the target)
    (a table of scaled example values, etc., was shown on the slide)
  • 4 input units
  • 3 output units
7
Data pre-processing (cont'd)
  • circular/periodic data
  • values are repeated periodically
  • examples
  • days of the week
  • months of the year
  • seasons
  • consider the representation of the seasons of the
    year
  • we could use a single input
    • e.g. 0 for summer, 0.3 for autumn, 0.6 for winter, 1 for spring
  • however the cyclic nature (e.g. that spring and summer are adjacent) is not preserved in this representation
  • or ...
  • we could use two inputs to represent the season
    • e.g. 1 1 for summer
    • 1 0 for autumn
    • 0 1 for winter
    • and 0 0 for spring
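A quick check makes the problem with the single-input encoding concrete: spring and summer are neighbours in the yearly cycle, yet their encoded distance is the largest of all adjacent pairs.

```python
# Single-input season encoding from the slide.
single_input = {"summer": 0.0, "autumn": 0.3, "winter": 0.6, "spring": 1.0}

# Distances between seasons that are adjacent in the yearly cycle.
adjacent_pairs = [("summer", "autumn"), ("autumn", "winter"),
                  ("winter", "spring"), ("spring", "summer")]
for a, b in adjacent_pairs:
    print(a, b, abs(single_input[a] - single_input[b]))
```

The spring-summer pair comes out at distance 1.0, three times larger than the other adjacent pairs, even though all four pairs are equally close in the cycle.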

8
Data pre-processing (cont'd)
  • circular/periodic data
  • or ...
  • we could use four inputs to represent the season
  • e.g. 1 0 0 0 for summer
  • 0 1 0 0 for autumn
  • 0 0 1 0 for winter
  • and 0 0 0 1 for spring
  • you could 'fuzzify' the inputs, i.e. encode the degree to which the time of the year is seen to be in each of the seasons
  • the time of the year could be represented using four inputs, each giving the seasonal degree of that time
    • e.g. March could be represented as 0.36 0.64 0 0 (0.36 in summer, 0.64 in autumn, 0 in winter and 0 in spring)
    • April as 0 1 0 0 (0 in summer, 1 in autumn, 0 in winter and 0 in spring) and
    • May as 0 0.36 0.64 0 (0 in summer, 0.36 in autumn, 0.64 in winter and 0 in spring)
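The fuzzified encodings above can be collected into a lookup table. Only the three months given on the slide are shown; the remaining months would follow the same pattern (the slide does not specify them, so they are omitted rather than guessed).

```python
# Fuzzified seasonal encodings taken directly from the slide
# (four inputs, in order: summer, autumn, winter, spring).
fuzzy_season = {
    "March": [0.36, 0.64, 0.0, 0.0],
    "April": [0.0, 1.0, 0.0, 0.0],
    "May":   [0.0, 0.36, 0.64, 0.0],
}

# Each encoding expresses partial membership in at most two adjacent
# seasons, and the degrees for any month sum to 1.
for month, degrees in fuzzy_season.items():
    print(month, degrees, sum(degrees))
```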

9
Data pre-processing (cont'd)
  • Missing Data
  • i.e. some examples in the data set have
    attributes for which the value is missing
  • what can you do?
  • options
  • 1. don't use these examples. This is not recommended unless there are only a few examples with missing data relative to the total number of available examples (e.g. under 10%)
  • 2. substitute a value for the missing value; some possibilities are
    • use the maximum
    • use the minimum
    • use the median (or the mode, i.e. the most common value across the data set for that attribute)
    • use a typical value for that output class
    • determine the 'closest' example and use its value for the attribute
  • 3. accept that the value is missing and use an extra input to indicate that it is missing (applicable for discrete inputs)
    • e.g. gender, where
      • 1 0 0 represents male (or 1 0)
      • 0 1 0 represents female (or 0 1)
      • 0 0 1 represents missing (or 0 0)
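Option 2 can be sketched in a few lines: fill each missing value with the median of the observed values for that attribute. The sample values are illustrative, not the iris data.

```python
from statistics import median

observed = [6.3, 5.7, 5.0, 5.7, 4.9, 6.1]  # hypothetical known sepal lengths
examples = [6.3, None, 5.0, 5.7]            # None marks a missing value

# Substitute the median of the observed values wherever a value is missing.
filled = [v if v is not None else median(observed) for v in examples]
print(filled)  # [6.3, 5.7, 5.0, 5.7]
```

Swapping `median` for `max`, `min`, or a class-conditional typical value gives the other substitution strategies listed above.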

10
Preparing the training and testing sets: an overview
(the slide showed a flow diagram: the data set is split into training data and testing data; the training data, together with pre-processing decisions, is used to train a multi-layer perceptron and gives the training set performance, while the testing data gives the testing performance, i.e. generalisation)
11
Preparing the training and testing sets
  • Training and Testing Sets
    • training set: used to train the network (i.e. to adjust the weights)
    • testing set: used to test the generalisation capability of the network during training; it is not used to adjust weights but simply gives an indication of performance
  • how should examples be allocated to each of these sets?
    • data is randomly allocated to the training set and the testing set
    • however, the class distribution of the data should be preserved in both: the examples for each of the output (target) classes should be allocated proportionally in each set
    • e.g. if 70% of the total data set are examples from class A, then 70% of the training set should be from class A and 70% of the testing set should be from class A
  • there are no strict rules for the number of examples allocated to each set
    • possibilities (training / testing): 60% / 40%, or 70% / 30%
    • remember: the number of training examples should exceed the number of testing examples
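The stratified 70/30 allocation described above can be sketched as follows; the class sizes match the iris data, but the split function itself is an illustration, not a named library routine.

```python
import random

def stratified_split(examples, labels, train_fraction=0.7, seed=0):
    """Randomly split examples into training/testing sets, preserving
    the proportion of each class in both sets."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    train, test = [], []
    for y, xs in by_class.items():
        rng.shuffle(xs)
        cut = int(round(train_fraction * len(xs)))
        train += [(x, y) for x in xs[:cut]]
        test += [(x, y) for x in xs[cut:]]
    return train, test

# 150 examples, 50 per class, as in the iris data set.
examples = list(range(150))
labels = ["setosa"] * 50 + ["versicolour"] * 50 + ["virginica"] * 50
train, test = stratified_split(examples, labels)
print(len(train), len(test))  # 105 45
```

Each class contributes exactly 35 training and 15 testing examples, so the one-third/one-third/one-third class distribution is preserved in both sets.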

12
Preparing the training and testing sets
  • Cross-validation
  • this is a more reliable measure of generalisation than creating a single training set and testing set
  • for example, 10-fold cross validation:
    (the slide showed the data set divided into ten equal subsets, Set 1 to Set 10; in each fold, one subset provides the testing examples and the remaining nine provide the training examples, with a different subset held out in each fold; in fold 1, for instance, Set 2 is the testing set)
13
Preparing the training and testing sets
  • Cross-validation: comments
    • ensures that all examples are at some time used for training and testing
    • each example is used only once for testing
    • minimises bias in the data sets
    • results in many more experiments, i.e. it is time-consuming
    • cross-validation can give a performance estimate on unseen examples for a network trained on the entire data set
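The 10-fold scheme can be sketched as an index generator; this is an illustration of the idea, assuming the data set size divides evenly by the number of folds.

```python
def k_fold_splits(n_examples, k=10):
    """Cut n_examples indices into k equal folds; each fold in turn
    supplies the testing set while the rest form the training set."""
    indices = list(range(n_examples))
    fold_size = n_examples // k
    folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

# 150 iris examples, 10 folds of 15 examples each.
splits = k_fold_splits(150, 10)
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 135 15
```

Note that every index appears in exactly one testing fold, matching the comment above that each example is used only once for testing.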