Title: Creating Data Representations
- Another way of representing n-ary data in a neural network is to use one neuron per feature, but scale the (analog) value to indicate the degree to which the feature is present.
- Good examples:
  - the brightness of a pixel in an input image
  - the distance between a robot and an obstacle
- Poor examples:
  - the letter (1 to 26) of a word
  - the type (1 to 6) of a chess piece
- This can be explained as follows:
- The way NNs work (both biological and artificial ones) is that each neuron represents the presence or absence of a particular feature.
- Activations 0 and 1 indicate absence or presence of that feature, respectively, and in analog networks, intermediate values indicate the extent to which a feature is present.
- Consequently, a small change in one input value leads to only a small change in the network's activation pattern.
- Therefore, it is appropriate to represent a non-binary feature by a single analog input value only if this value is scaled, i.e., if it represents the degree to which the feature is present.
- This is the case for the brightness of a pixel or the output of a distance sensor (feature: obstacle proximity).
- It is not the case for letters or chess pieces.
- For example, assigning values to individual letters (a = 0, b = 0.04, c = 0.08, ..., z = 1) implies that a and b are in some way more similar to each other than are a and z (see the sketch below).
- Obviously, in most contexts, this is not a reasonable assumption.
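A minimal Python sketch contrasting the two encodings; the helper names are ours, and the one-hot code is the standard alternative for such unordered features:

```python
# Scalar encoding: imposes a spurious similarity ordering on the letters.
def letter_scalar(c):
    return (ord(c) - ord('a')) / 25.0   # a -> 0.0, b -> 0.04, ..., z -> 1.0

# One-hot encoding: one neuron per letter, all letters equally dissimilar.
def letter_one_hot(c):
    vec = [0.0] * 26
    vec[ord(c) - ord('a')] = 1.0
    return vec

# Under the scalar code, |a - b| = 0.04 but |a - z| = 1.0, so a looks
# "closer" to b than to z; under the one-hot code, every pair of
# distinct letters is equally far apart.
print(letter_scalar('b'), letter_one_hot('b')[:3])
```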
- It is also important to notice that in artificial (not natural!), completely connected networks, the order of features that you specify for your input vectors does not influence the outcome.
- For the network's performance, it is not necessary to represent, for example, similar features in neighboring input units.
- All units are treated equally; the neighborhood of two neurons does not imply to the network that they represent similar features.
- Of course, once you have specified a particular order, you cannot change it anymore during training or testing.
- If you wanted to represent the state of each square on the tic-tac-toe board by one analog value, which would be the better way to do this (see the sketch below)?
- Option 1: <empty> = 0, X = 0.5, O = 1. Not a good scale! It goes from neutral to friendly and then hostile.
- Option 2: X = 0, <empty> = 0.5, O = 1. A more natural scale! It goes from friendly to neutral and then hostile.
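A minimal sketch of the second (preferred) encoding; the dictionary and function names are ours:

```python
# The friendly-to-hostile scale from Option 2:
# X (own piece) -> 0.0, empty -> 0.5, O (opponent) -> 1.0
SQUARE_VALUE = {'X': 0.0, ' ': 0.5, 'O': 1.0}

def encode_board(board):
    """Map a 3x3 board (a list of 9 characters) to 9 analog inputs."""
    return [SQUARE_VALUE[square] for square in board]

board = ['X', 'O', ' ',
         ' ', 'X', ' ',
         'O', ' ', ' ']
print(encode_board(board))  # [0.0, 1.0, 0.5, 0.5, 0.0, 0.5, 1.0, 0.5, 0.5]
```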
Representing Time
- So far, we have only considered static data, that is, data that do not change over time.
- How can we format temporal data to feed them into an ANN in order to detect spatiotemporal patterns or even predict future states of a system?
- The basic idea is to treat time as another input dimension.
- Instead of just feeding the current data (time t0) into our network, we expand the input vectors to contain n data vectors measured at t0, t0 - Δt, t0 - 2Δt, t0 - 3Δt, ..., t0 - (n - 1)Δt (see the sketch below).
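A minimal sketch of this windowing, assuming the data are stored in a list indexed by time step (so Δt is one index step):

```python
def time_window(series, n, t0):
    """Input vector containing the n values at t0, t0 - Δt, ..., t0 - (n-1)Δt."""
    return [series[t0 - k] for k in range(n)]

values = [0.2, 0.3, 0.25, 0.4, 0.5, 0.45, 0.6, 0.7]
print(time_window(values, n=4, t0=6))  # values at time steps 6, 5, 4, 3
```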
- For example, suppose we want to predict stock prices based on their past values (although other factors also play a role):
[Figure: stock price curve over time; the seven most recent values, at t0 - 6Δt through t0, serve as the network inputs]
- In this case, our input vector would include seven components, each of them indicating the stock value at a particular point in time.
- These stock values have to be normalized, e.g., divided by 1,000 if that is the estimated maximum value that could occur (see the sketch below).
- Then there would be a hidden layer, whose size depends on the complexity of the task.
- And there could be exactly one output neuron, indicating the stock price after the following time interval (to be multiplied by 1,000).
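A small sketch of how such exemplars could be assembled; the function name is ours, and the 1,000 maximum follows the slide:

```python
MAX_PRICE = 1000.0  # assumed maximum possible price, used for normalization

def make_exemplars(prices, n_inputs=7):
    """Turn a list of past prices into (input vector, target) pairs:
    the n_inputs most recent normalized prices predict the next one."""
    pairs = []
    for t in range(n_inputs, len(prices)):
        inputs = [p / MAX_PRICE for p in prices[t - n_inputs:t]]
        target = prices[t] / MAX_PRICE   # to be multiplied by 1,000 again
        pairs.append((inputs, target))
    return pairs

print(make_exemplars([510, 520, 505, 530, 540, 535, 550, 560])[0])
```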
- For example, a backpropagation network could do this task.
- It would be trained with many stock price samples that were recorded in the past, so that the price for time t0 + Δt is already known.
- This price at time t0 + Δt would be the desired output value of the network and would be used to apply the BPN learning rule.
- Afterwards, if past stock prices indeed allow the prediction of future ones, the network will be able to give some reasonable stock price predictions (see the training sketch below).
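As a hypothetical illustration, scikit-learn's MLPRegressor can stand in for a hand-written backpropagation network; the prices below are invented and already normalized:

```python
from sklearn.neural_network import MLPRegressor

# Invented, already-normalized stock prices (values in [0, 1]).
prices = [0.32, 0.35, 0.33, 0.40, 0.42, 0.41, 0.45, 0.47,
          0.50, 0.49, 0.52, 0.55, 0.53, 0.58, 0.60]

# Each exemplar: seven past prices as input, the next price as target.
X = [prices[t - 7:t] for t in range(7, len(prices))]
y = [prices[t] for t in range(7, len(prices))]

net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000, random_state=0)
net.fit(X, y)
print(net.predict([prices[-7:]]))  # predicted (normalized) price at t0 + Δt
```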
- Another example:
- Let us assume that we want to build a very simple surveillance system.
- We receive bitmap images at constant time intervals and want to determine, for each quadrant of the image, whether there is any motion visible in it, and what the direction of this motion is.
- Let us assume that each image consists of 10 by 10 grayscale pixels with values from 0 to 255.
- Let us further assume that we only want to determine one of the four directions N, E, S, and W.
- As said before, it makes sense to represent the brightness of each pixel by an individual analog value.
- We normalize these values by dividing them by 255.
- Consequently, if we were only interested in individual images, we would feed the network with input vectors of size 100.
- Let us assume that two successive images are sufficient to detect motion.
- Then at each point in time, we would like to feed the network with the current image and the previous image that we received from the camera.
- We can simply concatenate the vectors representing these two images, resulting in a 200-dimensional input vector (see the sketch below).
- Therefore, our network would have 200 input neurons and a certain number of hidden units.
- With regard to the output, would it be a good idea to represent the direction (N, E, S, or W) by a single analog value?
- No, these values do not represent a scale, so this would make the network computations unnecessarily complicated.
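A minimal sketch of this input encoding; the frame order (previous first, then current) is an arbitrary but fixed choice:

```python
def encode_frames(previous, current):
    """Concatenate two 10x10 grayscale frames (pixel values 0-255) into
    one 200-dimensional input vector, normalized to [0, 1]."""
    flat = [p / 255.0 for row in previous for p in row]
    flat += [p / 255.0 for row in current for p in row]
    return flat  # 100 values for the previous frame, then 100 for the current

# previous and current would each be 10 lists of 10 pixel values.
```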
- Better solution:
- 16 output neurons with the following interpretation: one output neuron for each combination of a quadrant (Q1, Q2, Q3, Q4) and a motion direction (N, E, S, W).
This way, the network can, in a straightforward way, indicate the direction of motion in each quadrant. Each output value could specify the amount (or speed?) of the corresponding type of motion (see the index sketch below).
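A sketch of one possible neuron numbering; the slide does not fix the order, so the Q1-N, Q1-E, ..., Q4-W convention here is an assumption:

```python
QUADRANTS = ['Q1', 'Q2', 'Q3', 'Q4']
DIRECTIONS = ['N', 'E', 'S', 'W']

def output_index(quadrant, direction):
    """Index of the output neuron for a (quadrant, direction) pair,
    assuming neurons are ordered Q1-N, Q1-E, ..., Q4-W."""
    return QUADRANTS.index(quadrant) * 4 + DIRECTIONS.index(direction)

print(output_index('Q3', 'E'))  # 9
```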
Exemplar Analysis
- When building a neural network application, we must make sure that we choose an appropriate set of exemplars (training data):
- The entire problem space must be covered.
- There must be no inconsistencies (contradictions) in the data.
- We must be able to correct such problems without compromising the effectiveness of the network.
Ensuring Coverage
- For many applications, we do not just want our network to classify any kind of possible input.
- Instead, we want our network to recognize whether an input belongs to any of the given classes or whether it is "garbage" that cannot be classified.
- To achieve this, we train our network with both classifiable and garbage data ("null" patterns).
- For the null patterns, the network is supposed to produce a zero output, or a designated "null" neuron is activated.
- In many cases, we use a 1:1 ratio for this training, that is, we use as many null patterns as there are actual data samples (see the sketch below).
- We have to make sure that all of these exemplars taken together cover the entire input space.
- If it is certain that the network will never be presented with garbage data, then we do not need to use null patterns for training.
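A minimal sketch of the 1:1 scheme; using uniformly random input vectors as null patterns is a crude simplification here (real null patterns should be chosen so that, together with the data, they cover the input space):

```python
import random

def add_null_patterns(exemplars, input_size, null_output):
    """Append one 'garbage' pattern per real exemplar (a 1:1 ratio).
    Uniform random inputs are a crude stand-in for proper null patterns."""
    nulls = [([random.random() for _ in range(input_size)], null_output)
             for _ in exemplars]
    return exemplars + nulls

# Usage: add_null_patterns(data, input_size=4, null_output=[0, 0, 0, 0])
```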
Ensuring Consistency
- Sometimes there may be conflicting exemplars in our training set.
- A conflict occurs when two or more identical input patterns are associated with different outputs.
- Why is this problematic?
- Assume a BPN with a training set including the exemplars (a, b) and (a, c).
- Whenever the exemplar (a, b) is chosen, the network adjusts its weights to produce an output for a that is closer to b.
- Whenever (a, c) is chosen, the network changes its weights for an output closer to c, thereby unlearning the adaptation for (a, b).
- In the end, the network will associate input a with an output that is between b and c but is neither exactly b nor c, so the network error caused by these exemplars will not decrease.
- For many applications, this is undesirable.
- To identify such conflicts, we can apply a (binary) search algorithm to our set of exemplars (see the sketch after this list).
- How can we resolve an identified conflict?
- Of course, the easiest way is to eliminate the conflicting exemplars from the training set.
- However, this reduces the amount of training data that is given to the network.
- Eliminating exemplars is the best way to go if it is found that these exemplars represent invalid data, for example, inaccurate measurements.
- In general, however, other methods of conflict resolution are preferable.
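A sketch of conflict detection; a dictionary lookup replaces the sorted binary search mentioned above and finds the same conflicts:

```python
def find_conflicts(exemplars):
    """Return the input patterns that appear with more than one output."""
    seen = {}
    for inputs, output in exemplars:
        seen.setdefault(tuple(inputs), set()).add(tuple(output))
    return {i: outs for i, outs in seen.items() if len(outs) > 1}

exemplars = [((0, 0, 1, 1), (0, 1, 0, 1)),
             ((0, 0, 1, 1), (0, 0, 1, 0)),
             ((1, 0, 0, 0), (1, 1, 1, 1))]
print(find_conflicts(exemplars))  # input (0, 0, 1, 1) has two outputs
```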
- Another method combines the conflicting patterns (see the sketch below).
- For example, if we have the exemplars
- (0011, 0101), (0011, 0010),
- we can replace them with the following single exemplar:
- (0011, 0111).
- The way we compute the output vector of the new exemplar based on the two original output vectors depends on the current task.
- It should be the value that is most similar (in terms of the external interpretation) to the original two values.
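A sketch of one such combination rule; the component-wise OR used here reproduces the slide's example, but the right rule depends on the task:

```python
def combine_conflict(output_a, output_b):
    """Merge two conflicting binary output vectors by component-wise OR."""
    return [max(a, b) for a, b in zip(output_a, output_b)]

print(combine_conflict([0, 1, 0, 1], [0, 0, 1, 0]))  # [0, 1, 1, 1]
```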
- Alternatively, we can alter the representation scheme (see the sketch below).
- Let us assume that the conflicting measurements were taken at different times or places.
- In that case, we can just expand all the input vectors, and the additional values specify the time or place of measurement.
- For example, the exemplars
- (0011, 0101), (0011, 0010)
- could be replaced by the following ones:
- (100011, 0101), (010011, 0010).
- One advantage of altering the representation scheme is that this method cannot create any new conflicts.
- Expanding the input vectors cannot make two or more of them identical if they were not identical before.
Training and Performance Evaluation
- How many samples should be used for training?
- Heuristic: at least 5 to 10 times as many samples as there are weights in the network.
- Formula (Baum & Haussler, 1989): P ≥ W / (1 - a), where P is the number of samples, W is the number of weights to be trained, and a is the desired accuracy (e.g., the proportion of correctly classified samples).
- For a = 0.9, the formula reproduces the 10-times-the-weights heuristic (see the helper below).
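A small helper for this estimate; the formula is the simplified Baum-Haussler bound stated above:

```python
def required_samples(n_weights, accuracy):
    """Baum-Haussler estimate: P >= W / (1 - a)."""
    return n_weights / (1.0 - accuracy)

print(required_samples(n_weights=250, accuracy=0.9))  # 2500.0
```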
- What learning rate η should we choose?
- The problems that arise when η is too small or too big are similar to those for the Adaline.
- Unfortunately, the optimal value of η depends entirely on the application.
- Values between 0.1 and 0.9 are typical for most applications.
- Often, η is initially set to a large value and is decreased during the learning process (see the sketch below).
- This leads to better convergence of learning and also decreases the likelihood of getting stuck in a local error minimum at an early learning stage.
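A minimal sketch of such a schedule; the linear decay and the parameter values are just one simple choice:

```python
def learning_rate(epoch, eta_start=0.9, eta_end=0.1, decay_epochs=1000):
    """Decrease η linearly from a large initial value to a small final
    value over the first decay_epochs epochs."""
    if epoch >= decay_epochs:
        return eta_end
    return eta_start + (eta_end - eta_start) * epoch / decay_epochs

print(learning_rate(0), learning_rate(500), learning_rate(2000))  # 0.9 0.5 0.1
```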