Title: Sieci neuronowe (Neural Networks)
1. Neural Networks: Model-Independent Data Analysis?
- K. M. Graczyk
- IFT, Uniwersytet Wroclawski
- Poland
2. Why Neural Networks?
- Inspired by C. Giunti (Torino)
- PDFs by Neural Networks
- Papers of Forte et al. (JHEP 0205 (2002) 062; JHEP 0503 (2005) 080; JHEP 0703 (2007) 039; Nucl. Phys. B809 (2009) 1-63)
- A kind of model-independent way of fitting data and computing the associated uncertainty
- Learn, Implement, Publish (the LIP rule)
- Cooperation with R. Sulej (IPJ, Warszawa) and P. Plonski (Politechnika Warszawska): NetMaker
- GrANNet: my own C++ library
3. Road Map
- Artificial Neural Networks (NN) idea
- Feed Forward NN
- PDFs by NN
- Bayesian statistics
- Bayesian approach to NN
- GrANNet
4. Inspired by Nature
The human brain consists of around 10^11 neurons, which are highly interconnected with around 10^15 connections.
5. Applications
- Function approximation, or regression analysis, including time series prediction, fitness approximation and modeling.
- Classification, including pattern and sequence recognition, novelty detection and sequential decision making.
- Data processing, including filtering, clustering, blind source separation and compression.
- Robotics, including directing manipulators, computer numerical control.
6. Artificial Neural Network
The simplest example: with linear activation functions the whole network reduces to a matrix (see the sketch below).
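This is the standard observation behind the slide: composing purely linear layers collapses into a single linear map,

\[
\mathbf{y} = W_2 \left( W_1 \mathbf{x} \right) = \left( W_2 W_1 \right) \mathbf{x} \equiv W \mathbf{x},
\]

so stacking linear layers adds no expressive power beyond one matrix W; nonlinear activations are what make deeper networks useful.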
7. Threshold
8. Activation functions
- Heaviside step function θ(x) → a 0 or 1 signal
- sigmoid function
- tanh()
- linear
[Figure: activation response curves; in one region the signal is amplified, in another it is weakened.]
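For reference, the standard definitions of the activations listed above (a math note, with the convention θ(x) = 1 for x ≥ 0):

\[
\theta(x) = \begin{cases} 1 & x \ge 0 \\ 0 & x < 0 \end{cases}, \qquad
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad
f(x) = x .
\]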
9. Architecture
- A 3-layer network, two hidden layers: e.g. the 1-2-1-1 architecture, 9 parameters in total (counting the bias weights)
- Bias neurons (emitting a constant signal of one) are used instead of thresholds
[Figure: the network response F(x) as a function of the input x; hidden units use the symmetric sigmoid activation function, the output unit a linear one.]
10. Neural Networks: Function Approximation
- The universal approximation theorem for neural networks states that every continuous function that maps intervals of real numbers to some output interval of real numbers can be approximated arbitrarily closely by a multi-layer perceptron with just one hidden layer. This result holds only for restricted classes of activation functions, e.g. for the sigmoidal functions. (Wikipedia.org)
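In formula form, the theorem concerns approximants of the one-hidden-layer type (standard notation, not from the slide):

\[
F(x) = \sum_{j=1}^{M} c_j \, \sigma\!\left( w_j x + b_j \right),
\]

which can approximate any continuous function on a compact interval arbitrarily well for a sufficiently large number of hidden units M.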
11. A map from one vector space to another
12. Supervised Learning
- Propose the error function: in principle any continuous function which has a global minimum
- Motivated by statistics: the standard error function, chi2, etc.
- Consider a set of data
- Train a given NN by showing it the data → minimize the error function (see the formulas below)
- Back-propagation algorithms: an iterative procedure which fixes the weights
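For concreteness, the chi2 error mentioned above and the simplest iterative update (standard formulas):

\[
\chi^2(w) = \sum_{i=1}^{N_{dat}} \frac{\left( y_i - F(x_i; w) \right)^2}{\sigma_i^2},
\qquad
w^{(t+1)} = w^{(t)} - \eta \, \nabla_w \chi^2\!\left( w^{(t)} \right),
\]

where η is the learning rate; back-propagation is the standard way of computing the gradient layer by layer.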
13. Learning Algorithms
- Gradient algorithms:
- Gradient descent
- RPROP (Riedmiller, Braun)
- Conjugate gradients
- Algorithms which look at the curvature:
- QuickProp (Fahlman)
- Levenberg-Marquardt (Hessian)
- Newton's method (Hessian)
- Monte Carlo algorithms (based on Markov chain methods)
A sketch of the RPROP update follows below.
14. Overfitting
- More complex models describe the data better, but lose generality: the bias-variance trade-off
- Overfitting → large values of the weights
- Compare with the test set (must be about twice as large as the training set)
- Regularization → an additional penalty term in the error function, controlled by the decay rate (see the formula below)
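The standard weight-decay form of this penalty, consistent with the decay rate α used in the Bayesian part below:

\[
\widetilde{E}(w) = E_D(w) + \alpha E_W(w), \qquad
E_W(w) = \frac{1}{2} \sum_{i=1}^{W} w_i^2 ,
\]

where E_D is the data error (e.g. chi2/2) and the decay rate α penalizes large weights.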
15. What about physics?
Problems: some general constraints; a model-independent analysis; a statistical model → data → uncertainty of the predictions.
16. Fitting data with Artificial Neural Networks
- The goal of network training is not to learn an exact representation of the training data itself, but rather to build a statistical model of the process which generates the data. (C. Bishop, Neural Networks for Pattern Recognition)
17. Parton Distribution Functions with NN
18. Parton Distribution Functions: S. Forte, L. Garrido, J. I. Latorre and A. Piccione, JHEP 0205 (2002) 062
- A kind of model-independent analysis of the data
- Construction of the probability density P[F] in the space of the structure functions F(x, Q²)
- In practice, only one neural network architecture
- Probability density in the space of parameters of one particular NN
But in reality Forte et al. did:
19. The idea comes from W. T. Giele and S. Keller
Train N_rep neural networks, one for each set of N_dat pseudo-data (Monte Carlo replicas of the original measurements).
The N_rep trained neural networks → provide a representation of the probability measure in the space of the structure functions.
20. Uncertainty and correlation
[The slide showed the replica-ensemble estimators of the uncertainty and the correlation; they are reconstructed below.]
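The standard Monte Carlo replica estimators implied by the slide (usual definitions, reconstructed):

\[
\langle F \rangle = \frac{1}{N_{rep}} \sum_{k=1}^{N_{rep}} F^{(k)}, \qquad
\sigma_F^2 = \frac{1}{N_{rep}} \sum_{k=1}^{N_{rep}} \left( F^{(k)} - \langle F \rangle \right)^2 ,
\]

\[
\rho\left( F_i, F_j \right) = \frac{\langle F_i F_j \rangle - \langle F_i \rangle \langle F_j \rangle}{\sigma_{F_i} \, \sigma_{F_j}} ,
\]

i.e. averages over the ensemble of trained networks replace averages over the unknown probability measure.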
21. 10, 100 and 1000 replicas
22. Length of training
[Panels: short, long enough, and too long training compared on 30 data points; training too long leads to overfitting.]
24. My criticism
- The simultaneous use of artificial data and the chi2 error function overestimates the uncertainty?
- Other NN architectures are not discussed
- Problems with overfitting (a test set is needed)
- A relatively simple approach compared with present techniques in NN computing
- The uncertainty of the model predictions should be generated by the probability distribution obtained for the model, rather than by the data itself
25. GrANNet. Why?
- I stole some ideas from FANN
- C++ library, easy to use
- User-defined error function (any you wish)
- Easy access to the units and their weights
- Several ways of initializing a network of a given architecture
- Bayesian learning
- Main objects: the classes NeuralNetwork and Unit
- Learning algorithms so far: QuickProp, Rprop+, Rprop-, iRprop-, iRprop+
- Network response uncertainty (based on the Hessian)
- Some simple restarting and stopping solutions
26. Structure of GrANNet
- Libraries:
- Unit class
- Neural_Network class
- Activation (activation and error function structures)
- Learning algorithms: RProp+, RProp-, iRProp+, iRProp-, QuickProp, BackProp
- generatormt
- TNT inverse matrix package
(an illustrative sketch of the Unit/Neural_Network pair follows below)
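The slides name the Unit and Neural_Network classes but show no code; here is a self-contained sketch of how such a pair might fit together. It is hypothetical: every member name used below (w, net_input, layers, response) is invented for illustration and need not match the real GrANNet API.

#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical Unit: w[0] is the bias weight (fed by the constant "signal
// one" bias neuron), w[i+1] multiplies the i-th input.
struct Unit {
    std::vector<double> w;
    double net_input(const std::vector<double>& in) const {
        double a = w[0];                                   // bias contribution
        for (std::size_t i = 0; i < in.size(); ++i) a += w[i + 1] * in[i];
        return a;
    }
};

double sigmoid_sym(double a) { return std::tanh(a); }      // symmetric sigmoid

// Hypothetical Neural_Network: hidden layers use the symmetric sigmoid,
// the output layer is linear, as on the architecture slide.
struct Neural_Network {
    std::vector<std::vector<Unit>> layers;
    double response(double x) const {
        std::vector<double> signal{x};
        for (std::size_t l = 0; l < layers.size(); ++l) {
            std::vector<double> next;
            for (const Unit& u : layers[l]) {
                const double a = u.net_input(signal);
                next.push_back(l + 1 < layers.size() ? sigmoid_sym(a) : a);
            }
            signal = next;
        }
        return signal[0];
    }
};

int main() {
    Neural_Network net;
    // a 1-2-1 toy network with hand-set weights, showing the "easy access
    // to units and their weights" promised above
    net.layers = {{Unit{{0.1, 0.5}}, Unit{{-0.2, 0.7}}},   // hidden layer
                  {Unit{{0.0, 1.0, -1.0}}}};               // linear output
    std::printf("F(0.25) = %f\n", net.response(0.25));
    return 0;
}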
27. Bayesian Approach
- "Common sense reduced to calculations" (Laplace)
28. Bayesian Framework for BackProp NN (MacKay, Bishop)
- Objective criteria for comparing alternative network solutions, in particular with different architectures
- Objective criteria for setting the decay rate α
- Objective choice of the regularizing function E_W
- Comparison with test data is not required.
29. Notation and Conventions
30. Model Classification
- A collection of models: H1, H2, ..., Hk
- We believe that the models are classified by prior probabilities P(H1), P(H2), ..., P(Hk) (summing to 1)
- After observing data D → Bayes' rule (below) →
- Usually at the beginning P(H1) = P(H2) = ... = P(Hk)
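The Bayes rule referred to above (the formula itself appeared on the slide as an image):

\[
P(H_i \mid D) = \frac{P(D \mid H_i) \, P(H_i)}{\sum_k P(D \mid H_k) \, P(H_k)} ,
\]

so after the data arrive the models are re-ranked by their evidences P(D|Hi).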
31. Single Model Statistics
- Assume that model Hi is the correct one
- The neural network A with weights w is considered
- Task 1: assuming some prior probability of w, after including the data, construct the posterior
- Task 2: consider the space of hypotheses and construct the evidence for them (both tasks are summarized by the formulas below)
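In formulas, using standard Bayesian notation matching MacKay's framework:

\[
\text{Task 1:} \quad P(w \mid D, H_i) = \frac{P(D \mid w, H_i) \, P(w \mid H_i)}{P(D \mid H_i)},
\qquad
\text{Task 2:} \quad P(D \mid H_i) = \int \! dw \; P(D \mid w, H_i) \, P(w \mid H_i) ,
\]

where the normalization of Task 1 is exactly the evidence needed in Task 2.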
32. Hierarchy
33. Constructing prior and posterior functions
The weight distribution!
[Figure: prior, likelihood, and posterior probability densities in weight space; the posterior is peaked at w0. The standard forms are reconstructed below.]
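The standard functional forms behind these curves (MacKay's conventions; with the chi2 error, E_D = χ²/2):

\[
P(w \mid \alpha, H) = \frac{e^{-\alpha E_W(w)}}{Z_W(\alpha)}, \qquad
P(D \mid w, H) = \frac{e^{-E_D(w)}}{Z_D}, \qquad
P(w \mid D, \alpha, H) = \frac{e^{-\left( E_D(w) + \alpha E_W(w) \right)}}{Z_M(\alpha)} ,
\]

so the posterior combines the data error and the weight-decay penalty in a single exponent.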
34. Computing the Posterior
[The slide showed the Gaussian approximation of the posterior: the Hessian of the error determines the covariance matrix; see the reconstruction below.]
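The usual Laplace (Gaussian) approximation implied here: expanding the total error around the most probable weights w_MP,

\[
E(w) \approx E(w_{MP}) + \tfrac{1}{2} \left( w - w_{MP} \right)^{T} A \left( w - w_{MP} \right),
\qquad
A = \nabla \nabla E \big|_{w_{MP}} ,
\]

so the posterior is approximately Gaussian around w_MP with covariance matrix A^{-1}, where A is the Hessian.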
35. How to fix the proper α?
- Two ideas:
- Evidence approximation (MacKay):
- find w_MP
- find α_MP
- (valid if the evidence is sharply peaked!)
- Hierarchical:
- perform the integrals over α analytically
36. Getting α_MP
The effective number of well-determined parameters; an iterative procedure during the training (see below).
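MacKay's standard result for the evidence-approximation update (reconstructed; λ_i are the eigenvalues of the Hessian of the data error):

\[
\gamma = \sum_{i=1}^{W} \frac{\lambda_i}{\lambda_i + \alpha}, \qquad
\alpha_{MP} = \frac{\gamma}{2 E_W(w_{MP})} ,
\]

where γ counts the well-determined parameters; in practice α is re-estimated from this formula repeatedly as the training proceeds.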
37. Bayesian Model Comparison: the Occam Factor
- The log of the Occam factor ↔ the amount of information we gain after the data have arrived
- A small Occam factor ↔ complex models: the posterior occupies only a small fraction of the large accessible prior phase space
- A large Occam factor ↔ simple models: a small accessible prior phase space, so the posterior fills a larger fraction of it
- Evidence ≈ best-fit likelihood × Occam factor (see below)
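The decomposition alluded to by the last line (MacKay's standard formula):

\[
P(D \mid H_i) \approx \underbrace{P(D \mid w_{MP}, H_i)}_{\text{best-fit likelihood}} \times \underbrace{P(w_{MP} \mid H_i) \, \Delta w}_{\text{Occam factor}},
\qquad
\text{Occam factor} = \frac{\Delta w_{posterior}}{\Delta w_{prior}} ,
\]

where Δw_posterior is the width of the posterior peak and Δw_prior the accessible prior range.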
38. Evidence
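A reconstruction of the log-evidence this slide presumably showed (MacKay's formula for a network with W weights, in the conventions used above):

\[
\ln P(D \mid \alpha, H) = -\alpha E_W^{MP} - E_D^{MP} - \tfrac{1}{2} \ln \det A + \tfrac{W}{2} \ln \alpha + \text{const} ,
\]

and for a network with M hidden units the evidence is further multiplied by the permutation-symmetry factor 2^M M!, which matters when comparing different architectures.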
40. The 1-3-1 network is preferred by the data
42. The 1-3-1 network seems to be preferred by the data