Title: Minimal Neural Networks
1. Minimal Neural Networks
Support vector machines and Bayesian learning for neural networks
Peter Andras (andrasp_at_ieee.org)
2. Bayesian neural networks I.
The Bayes rule
Let's consider a model of a system and an observation of the system (an event). The a posteriori probability that the model is correct, after the observation of the event, is proportional to the product of the a priori probability that the model is correct and the probability of the event conditioned on the correctness of the model.
Mathematically:
P(H_θ | D) ∝ P(D | H_θ) · P(H_θ)
where θ is the parameter of the model H_θ and D is the observed event.
3. Bayesian neural networks II.
Best model: the model with the highest a posteriori probability of correctness.
Model selection by optimizing the formula
θ* = argmax_θ P(D | H_θ) · P(H_θ)
4. Bayesian neural networks III.
Application to neural networks
g_θ is the function represented by the neural network, where θ is the vector of all parameters of the network.
D = {(x_i, y_i), i = 1, …, n} is the observed event (the training data).
We suppose a normal distribution for the data conditioned on the validity of a model, i.e., the observed values y_i are normally distributed around g_θ(x_i) if θ is the correct parameter vector.
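Under this assumption the likelihood of the data can be written out as follows (a sketch; the noise variance σ² is a symbol introduced here, not taken from the slides):
P(D \mid H_\theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - g_\theta(x_i))^2}{2\sigma^2}\right)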
5. Bayesian neural networks IV.
By carrying out the calculation (taking the negative logarithm of the a posteriori probability) we get a sum-of-squared-errors term plus a term coming from the prior, and the new formula for optimization is the minimization of this combined error (see the sketch below).
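A sketch of the resulting optimization formula, assuming the Gaussian likelihood above and a prior of the form P(H_θ) ∝ exp(−λ c(θ)) (σ, λ and c(θ) are illustrative symbols, not taken from the slides):
\theta^{*} = \arg\min_{\theta} \; \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - g_\theta(x_i))^2 + \lambda\, c(\theta)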
6. Bayesian neural networks V.
The equivalence of regularization and Bayesian model selection
Regularization formula: squared-error term plus a regularization term.
Bayesian optimization formula: squared-error term plus a term derived from the prior.
Equivalence: the regularization term corresponds to the negative logarithm of the prior; both represent a priori information about the correct solution.
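Written side by side (a sketch; Ω denotes a generic regularizer, a symbol chosen here):
\sum_{i=1}^{n} (y_i - g_\theta(x_i))^2 + \lambda\, \Omega(\theta)
\qquad \text{vs.} \qquad
\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - g_\theta(x_i))^2 - \log P(H_\theta)
The regularization term λ·Ω(θ) plays the role of −log P(H_θ).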
7. Bayesian neural networks VI.
Bayesian pruning by regularization
Gauss pruning
Laplace pruning
Cauchy pruning
N is the number of components of the θ vectors.
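The penalty terms usually associated with these three pruning schemes are, as a sketch (constants and exact notation are not taken from the slides):
\text{Gauss: } \lambda \sum_{j=1}^{N} \theta_j^2 \qquad
\text{Laplace: } \lambda \sum_{j=1}^{N} |\theta_j| \qquad
\text{Cauchy: } \lambda \sum_{j=1}^{N} \log(1 + \theta_j^2)
These correspond to Gaussian, Laplacian and Cauchy priors on the parameters; parameters pushed towards zero can then be pruned from the network.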
8. Support vector machines - SVM I.
Linearly separable classes
- many separators
- there is an optimal separator
9. Support vector machines - SVM II.
How to find the optimal separator?
- support vectors
- overspecification
Property: with one less support vector we get a new optimal separator.
10. Support vector machines - SVM III.
We look for minimal and robust separators. These are minimal and robust models of the data. The full data set is equivalent to the set of support vectors with respect to the specification of the minimal robust model.
11. Support vector machines - SVM IV.
Mathematical problem formulation I.
We represent the separator as a pair (w, b), where w is a vector and b is a scalar.
We look for w and b such that they satisfy
y_i (w^T x_i + b) ≥ 1 for all i.
The support vectors are those x_i for which this inequality is in fact an equality.
12. Support vector machines - SVM V.
Mathematical problem formulation II.
The distances from the origin of the hyper-planes through the support vectors, w^T x + b = 1 and w^T x + b = -1, are |1 - b| / ||w|| and |1 + b| / ||w||.
The distance between the two planes is 2 / ||w||.
13. Support vector machines - SVM VI.
Mathematical problem formulation III.
Optimal separator: the distance between the two hyper-planes is maximal.
Optimization: minimize (1/2) ||w||^2
with the restrictions that y_i (w^T x_i + b) ≥ 1, or in other form y_i (w^T x_i + b) - 1 ≥ 0, for all i.
14. Support vector machines - SVM VII.
Mathematical problem formulation IV.
Complete optimization formula, using Lagrange
multipliers
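The primal Lagrangian of this problem, written out as a sketch (the multipliers α_i ≥ 0 follow the standard notation; the slide's own symbols are not preserved):
L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right]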
15. Support vector machines - SVM VIII.
Mathematical problem formulation V.
Writing the optimality conditions for w and b we get
w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0.
The dual problem is obtained by substituting these back into the Lagrangian (sketched below).
The support vectors are those x_i for which α_i is strictly positive.
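The resulting dual problem, as a sketch consistent with the primal above:
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, x_i^T x_j
\quad \text{subject to} \quad \alpha_i \ge 0, \;\; \sum_{i=1}^{n} \alpha_i y_i = 0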
16. Support vector machines - SVM IX.
Graphical interpretation
We search for the point where a hyper-ellipsoid (a level set of the quadratic dual objective) touches the positive orthant of the α space.
17. Support vector machines - SVM X.
How to solve the support vector problem?
Optimization with respect to the α-s:
- gradient method
- Newton and quasi-Newton methods
We get as result
- the support vectors
- the optimal linear separator
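As an illustration (not from the original slides), a minimal runnable sketch that solves the linear support vector problem with scikit-learn and reads off the support vectors and the optimal separator (w, b); the toy data and all names below are chosen for the example:

import numpy as np
from sklearn.svm import SVC

# Two linearly separable point clouds (toy data for the sketch)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2.0, size=(20, 2)),
               rng.normal(loc=+2.0, size=(20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

# A linear SVM with a large C approximates the hard-margin problem
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("support vectors:", clf.support_vectors_)  # the x_i with alpha_i > 0
print("w =", clf.coef_[0])                       # normal vector of the separator
print("b =", clf.intercept_[0])                  # offset of the separator

scikit-learn solves the dual internally with an SMO-type solver, so the optimization over the α-s is hidden behind fit().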
18. Support vector machines - SVM XI.
Implications for artificial neural networks
- robust perceptron (low sensitivity to noise)
- minimal linear classification neural network
19. Support vector machines - SVM XII.
What can we do if the boundary is nonlinear?
Idea: transform the data vectors to a space where the separator is linear.
20. Support vector machines - SVM XIII.
The transformation is often made to an infinite dimensional space, usually a function space. Example: x → cos(u^T x).
21. Support vector machines - SVM XIV.
The new optimization formulas are the same as before, with the transformed vectors Φ(x_i) in place of the x_i, so the dual now contains the products Φ(x_i)^T Φ(x_j).
22. Support vector machines - SVM XV.
How to handle the products of the transformed vectors?
Idea: use a transformation that fits the Mercer theorem.
Mercer theorem: let K be a continuous, symmetric kernel; then K has a decomposition
K(x, z) = Φ(x)^T Φ(z),
where Φ maps the data into H and H is a function space,
if and only if
∫∫ K(x, z) f(x) f(z) dx dz ≥ 0
for each square-integrable function f.
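A small numerical illustration (not from the slides): on any finite sample a Mercer kernel must produce a positive semi-definite Gram matrix, which can be checked directly. The Gaussian kernel and the random sample below are chosen here as an example:

import numpy as np

def gaussian_kernel(x, z, gamma=0.5):
    # K(x, z) = exp(-gamma * ||x - z||^2), a standard Mercer kernel
    return np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))                                   # 30 sample points in R^3
K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])  # Gram matrix

eigvals = np.linalg.eigvalsh(K)               # K is symmetric, so eigvalsh
print("smallest eigenvalue:", eigvals.min())  # non-negative up to round-off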
23. Support vector machines - SVM XVI.
Optimization formula with a transformation that fits the Mercer theorem: the dual problem is the same as before, with the kernel values K(x_i, x_j) in place of the products x_i^T x_j.
The form of the solution is a kernel expansion over the support vectors (sketched below);
b is determined from an equation valid for a support vector.
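A sketch of the kernelized formulas, consistent with the linear case above (x_s denotes any support vector; the notation is chosen here, not copied from the slides):
\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j),
\qquad \alpha_i \ge 0, \;\; \sum_i \alpha_i y_i = 0
g(x) = \operatorname{sign}\left( \sum_i \alpha_i y_i\, K(x_i, x) + b \right),
\qquad
y_s \left( \sum_i \alpha_i y_i\, K(x_i, x_s) + b \right) = 1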
24. Support vector machines - SVM XVII.
Examples of transformations and kernels
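Standard examples of such transformation/kernel pairs (filled in here as a sketch; the specific examples of the original slide are not preserved):
- Φ(x) = x (identity map): K(x, z) = x^T z
- for x in R^2, Φ(x) = (x_1^2, sqrt(2)·x_1·x_2, x_2^2): K(x, z) = (x^T z)^2
- Φ mapping into a Gaussian reproducing-kernel Hilbert space: K(x, z) = exp(-||x - z||^2 / (2σ^2))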
25. Support vector machines - SVM XVIII.
Other typical kernels
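Listed here as a sketch of the standard choices (the original slide's own list is not preserved):
- polynomial kernel: K(x, z) = (x^T z + 1)^d
- Gaussian (RBF) kernel: K(x, z) = exp(-||x - z||^2 / (2σ^2))
- sigmoid kernel: K(x, z) = tanh(κ x^T z + c), a Mercer kernel only for certain κ and c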
26. Support vector machines - SVM XIX.
Summary of main ideas
- look for a minimal complexity classification
- transform the data to another space where the class boundaries are linear
- use Mercer kernels
27. Support vector machines - SVM XX.
Practical issues
- the global optimization doesn't work with large amounts of data → sequential optimization with chunks of the data (a sketch follows below)
- the resulting models are minimal complexity models; they are insensitive to noise and keep the generalization ability of the more complex models
- applications: character recognition, economic forecasting
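As an illustration of chunk-wise training only (this is stochastic gradient descent on the hinge loss, not the chunking algorithm referred to above), a minimal sketch with scikit-learn's SGDClassifier; the data and all names below are chosen for the example:

import numpy as np
from sklearn.linear_model import SGDClassifier

# Large toy data set, shuffled, for sequential training
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=-2.0, size=(5000, 2)),
               rng.normal(loc=+2.0, size=(5000, 2))])
y = np.array([-1] * 5000 + [+1] * 5000)
perm = rng.permutation(len(y))
X, y = X[perm], y[perm]

# Hinge loss gives a linear SVM objective, optimized one chunk at a time
clf = SGDClassifier(loss="hinge")
for start in range(0, len(y), 1000):   # 1000-sample chunks
    chunk = slice(start, start + 1000)
    clf.partial_fit(X[chunk], y[chunk], classes=np.array([-1, 1]))

print("w =", clf.coef_[0], "b =", clf.intercept_[0])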
28. Regularization neural networks
General optimization vs. optimization over the grid
The regularization operator T specifies the grid:
- we look for functions that satisfy ||Tg||^2 = 0
- in the relaxed case the regularization operator is incorporated as a penalty term in the error function
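In the relaxed case the error function to be minimized takes the standard form (a sketch; the regularization parameter λ is a symbol introduced here):
E(g) = \sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda\, \|Tg\|^2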