Perceptrons, Neural Networks and Data Mining - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Perceptrons, Neural Networks and Data Mining


1
Perceptrons, Neural Networks and Data Mining
  • By Jim Shine
  • 10 October 2003

2
History of Perceptrons and Multi-layer Perceptrons
Developed outside of statistics; some of the original motivation was to mimic brain function.
  • 1943: McCulloch and Pitts introduce the idea of summing weighted inputs to produce an output, y(i) = sum(w(i)x(i)).
  • 1949: Hebb proposes that weights can increase with learning when both input and output fire together; the weight increase is dw(i,j) = K x(i) y(j).
  • 1950s-1960s: 2-layer networks (the original perceptron) and equivalent work by Widrow (adaptive linear neurons/elements, ADALINE): y(i) = f(sum(w(i)x(i))), a strictly linear separator (f usually a threshold function).
  • 1969: Minsky and Papert, Perceptrons, exposed flaws.
  • 1974: Werbos develops the gradient descent concept later used in multi-layer perceptrons.
3
Example of an early perceptron: 4 inputs, one output
y(i) = f(sum(w(i)x(i)))
Weight updates: w_new = w_old + K x(i)
[Diagram: inputs x1-x4 each feed, via a weight, into a single output node with activation f]
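As a rough illustration, a minimal Python sketch of this 4-input perceptron, assuming a hard-threshold activation, a learning rate of 0.1, a toy input, and the usual error factor in the update (the slide writes the increment simply as K x(i)):

import numpy as np

def threshold(s):
    return 1.0 if s >= 0.0 else 0.0        # f: hard threshold activation

def predict(w, x):
    return threshold(np.dot(w, x))         # y = f(sum(w(i) x(i)))

K = 0.1                                    # learning rate (assumed value)
w = np.zeros(4)                            # weights for the 4 inputs
x = np.array([1.0, 0.0, -1.0, 0.5])        # one toy input vector
target = 0.0                               # toy target for this input

y = predict(w, x)
error = target - y
w = w + K * error * x                      # weight update driven by the error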
4
History of Perceptrons continued
1986: Rumelhart and McClelland, Parallel Distributed Processing; multi-layer perceptrons with gradient descent correction, the same error correction as independently described by Werbos:
y(k) = f(sum_j(w(k,j)v(j))) = f(sum_j(w(k,j) g(sum_i(w(i,j)x(i))))).
The functions f and g are now nonlinear, with the logistic/sigmoid most popular. This approach does very well (for its time) at classification.
Late 1980s, early 1990s: neural networks become a very hot topic. Since the early 1990s the topic has cooled: similarities with statistical approaches have been noticed, and excessive brain-modeling claims have been reduced.
5
Neural Networks/Perceptrons in more detail
Nested layers of functions: y = f(v) = f(g(x)), where f and g are functions of weighted sums of the different x(i) and v(j) respectively.
Weights are given initial settings, then iteratively adjusted based on the error of the predicted y: first the v-y weights, then the x-v weights (backpropagation).
Lots of software around; it is easy to write a basic program, harder to optimize it. Ripley has a library for S-Plus, and other code exists in Matlab.
6
Example of a multi-layer perceptron (5-3-2), one hidden layer. FEEDFORWARD (data goes left to right):
y(k) = f(sum_j(w(k,j) g(sum_i(w(i,j)x(i)))))
Hidden layer: g(sum) = v. Output layer: f(sum) = y.
First set of weights (input to hidden): 11, 12, 13, 21, 22, 23, 31, 32, 33, 41, 42, 43, 51, 52, 53.
Second set of weights (hidden to output): 11, 12, 21, 22, 31, 32.
[Diagram: inputs x1-x5 feed three hidden nodes g, which feed two output nodes f]
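A minimal numpy sketch of this 5-3-2 feedforward pass, assuming random illustrative weights and the logistic function for both g and f:

import numpy as np

def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 5))   # first set of weights  w(i,j): 5 inputs -> 3 hidden
W2 = rng.normal(size=(2, 3))   # second set of weights w(k,j): 3 hidden -> 2 outputs

x = rng.normal(size=5)         # one input vector x1..x5
v = logistic(W1 @ x)           # hidden layer: v(j) = g(sum_i w(i,j) x(i))
y = logistic(W2 @ v)           # output layer: y(k) = f(sum_j w(k,j) v(j))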
7
Error correction
If there is no difference between the output and the target, the weights are not changed. If there is a difference (an error), then the weights are changed.
A common choice for f (and g) is the sigmoid or logistic function, f(x) = 1/(1 + e^(-x)), because the derivative is f'(x) = f(x)(1 - f(x)), which is computationally very efficient. Other nonlinear functions such as tanh are sometimes used as well.
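A small Python check of this derivative identity (the finite-difference comparison is only for illustration):

import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))        # logistic function

def f_prime(x):
    fx = f(x)
    return fx * (1.0 - fx)                 # no extra exponentials once f(x) is known

x = 0.7
numerical = (f(x + 1e-6) - f(x - 1e-6)) / 2e-6
print(f_prime(x), numerical)               # the two values agree closely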
8
Error correction: differences between actual and target output are used to correct the weights, the second set first and then the first set (weights change right to left, propagating back through the network).
dw (second set) = K f'(sum) error
dw (first set) = K g'(x(i)) sum(second-set changes · second-set weights)
[Diagram: the same 5-3-2 network, with corrections flowing from the outputs back toward the inputs]
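A sketch of one backpropagation step for the 5-3-2 network of slide 6, assuming logistic units, a comparison of output against target, and an illustrative learning rate; the delta terms play the role of the f'(sum) and g' factors above:

import numpy as np

def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 5))          # first set:  input -> hidden weights
W2 = rng.normal(size=(2, 3))          # second set: hidden -> output weights
K = 0.1                                # learning rate (assumed value)

x = rng.normal(size=5)
target = np.array([0.0, 1.0])          # toy target

v = logistic(W1 @ x)                   # hidden activations
y = logistic(W2 @ v)                   # outputs

error = target - y                     # difference between target and output
delta_out = error * y * (1 - y)        # f'(sum) * error, for the second set
delta_hid = (W2.T @ delta_out) * v * (1 - v)   # propagate back through the second set

W2 += K * np.outer(delta_out, v)       # correct the second set of weights first
W1 += K * np.outer(delta_hid, x)       # then the first set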
9
Algorithm for training the weights
  • Initial weights are usually set randomly.
  • Weights are adjusted after each cycle of training data (batch) or, in some cases, after each data point (online).
  • Stopping criteria: several possible. Usually stop when weight changes drop below a certain threshold; can also stop after a certain number of iterations.
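A sketch of the batch version of this training loop, assuming toy data, the logistic 5-3-2 network of slide 6, and illustrative values for the learning rate, threshold, and iteration cap:

import numpy as np

def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                           # toy batch of 20 training points
T = rng.integers(0, 2, size=(20, 2)).astype(float)     # toy targets

W1 = rng.normal(size=(3, 5)) * 0.1                     # initial weights set randomly
W2 = rng.normal(size=(2, 3)) * 0.1
K, tol, max_iter = 0.1, 1e-4, 1000

for it in range(max_iter):                 # stop after a fixed number of iterations...
    dW1 = np.zeros_like(W1)
    dW2 = np.zeros_like(W2)
    for x, t in zip(X, T):                 # accumulate updates over the whole batch
        v = logistic(W1 @ x)
        y = logistic(W2 @ v)
        d_out = (t - y) * y * (1 - y)
        d_hid = (W2.T @ d_out) * v * (1 - v)
        dW2 += K * np.outer(d_out, v)
        dW1 += K * np.outer(d_hid, x)
    W1 += dW1
    W2 += dW2
    if max(np.abs(dW1).max(), np.abs(dW2).max()) < tol:
        break                              # ...or when weight changes drop below a threshold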
10
Other issues
  • Optimal size of layers and nodes
  • Regularization, e.g. weight decay
  • Alternative error correction approaches such as conjugate gradient, cascade correlation, quickprop, etc.
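As a rough sketch of the weight-decay item above, assuming a small illustrative decay constant: each update shrinks the weights slightly toward zero in addition to applying the error-driven change.

import numpy as np

def update_with_decay(W, dW, K=0.1, decay=1e-3):
    """Apply a batch update dW plus a weight-decay penalty (illustrative values)."""
    return W + dW - K * decay * W

W = np.ones((2, 3))
dW = np.zeros((2, 3))
W = update_with_decay(W, dW)   # weights shrink slightly even with no error signal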
11
Other, less well-known neural network paradigms: adaptive resonance and Kohonen self-organizing maps (unsupervised learning/clustering).
12
Parallels to other statistical/data mining approaches
  • Radial basis functions: y(k) = sum_j(w(k,j) theta(f(x(j)))), where theta is a basis function and f(x(j)) is a function of the inputs related to distance.
  • Projection pursuit regression
  • Generalized linear models, generalized additive models, MARS
References in the rest of the book: PP regression, pp. 195, 395-96; gradient descent search, p. 253; section 10.3 on perceptrons; section 11.4 on artificial neural networks.
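A minimal sketch of the radial basis function output above, assuming a Gaussian basis function theta, Euclidean distance to a set of centres, and random illustrative weights:

import numpy as np

def theta(d2, width=1.0):
    return np.exp(-d2 / (2.0 * width ** 2))   # Gaussian basis function

rng = np.random.default_rng(0)
centres = rng.normal(size=(4, 5))              # 4 basis centres in 5-d input space
W = rng.normal(size=(2, 4))                    # weights w(k,j): 4 bases -> 2 outputs

x = rng.normal(size=5)
d2 = ((centres - x) ** 2).sum(axis=1)          # squared distances to each centre
phi = theta(d2)                                # basis activations
y = W @ phi                                    # y(k) = sum_j w(k,j) phi(j)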