1. How to win big by thinking straight about relatively trivial problems
Tony Bell, University of California at Berkeley
2. Density Estimation
Make the model like the reality by minimising the Kullback-Leibler divergence, by gradient descent in a parameter of the model.
THIS RESULT IS COMPLETELY GENERAL.
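The slide's equation images did not survive extraction. A standard reconstruction of the objective, writing p for the data density and q_w for the model with parameter w (the subscript is my notation, not the slide's):

```latex
D_{KL}(p \,\|\, q_w) = \int p(x)\,\log\frac{p(x)}{q_w(x)}\,dx ,
\qquad
\Delta w \;\propto\; -\frac{\partial}{\partial w} D_{KL}(p \,\|\, q_w)
\;=\; \Big\langle \frac{\partial \log q_w(x)}{\partial w} \Big\rangle_{p}
```

The second equality holds because p does not depend on w, so minimising KL is the same as gradient ascent on the expected log-likelihood of the data under the model.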
3. The passive case (∂p/∂w = 0)
For a general model distribution written in the energy-based form, with an energy and a partition function (the zeroth moment), the gradient evaluates in the simple Boltzmann-like form:
learn on the data while awake,
unlearn on samples from the model while asleep.
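The equations on this slide were images; a reconstruction consistent with the wake/sleep reading of the two terms (notation q_w, E_w is mine):

```latex
q_w(x) = \frac{e^{-E_w(x)}}{Z(w)}, \qquad Z(w) = \int e^{-E_w(x)}\,dx ,
\qquad
\Delta w \;\propto\; \Big\langle \frac{\partial \log q_w}{\partial w} \Big\rangle_{p}
= -\underbrace{\Big\langle \frac{\partial E_w}{\partial w}\Big\rangle_{p}}_{\text{awake: data}}
  \;+\; \underbrace{\Big\langle \frac{\partial E_w}{\partial w}\Big\rangle_{q_w}}_{\text{asleep: model}}
```

The model term arises from differentiating log Z(w), since ∂ log Z/∂w = −⟨∂E_w/∂w⟩ under the model distribution.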
4. The single-layer case
Many problems are solved by modelling in the transformed space.
Learning rule (natural gradient), for a non-loopy hypergraph.
The score function is the important quantity.
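As an illustration of the single-layer rule, here is a minimal sketch of square ICA with the natural-gradient update ΔW ∝ (I − ⟨φ(u)uᵀ⟩)W, where u = Wx is the transformed space and φ is the score function. The tanh score (suited to sparse, super-Gaussian sources) is my illustrative choice; the slide does not specify a density:

```python
import numpy as np

def ica_natural_gradient(x, n_iter=200, lr=0.1, seed=0):
    """Square ICA by natural-gradient ascent on the log-likelihood.

    x: (n_sources, n_samples) mixed data.
    Uses phi(u) = tanh(u) as the score function (an assumption,
    appropriate for sparse / super-Gaussian sources).
    """
    rng = np.random.default_rng(seed)
    n, T = x.shape
    W = np.eye(n) + 0.01 * rng.standard_normal((n, n))
    for _ in range(n_iter):
        u = W @ x                       # transformed space u = Wx
        phi = np.tanh(u)                # score function phi(u)
        # natural-gradient rule: dW ∝ (I - <phi(u) u^T>) W
        dW = (np.eye(n) - (phi @ u.T) / T) @ W
        W += lr * dW
    return W

# toy demo: unmix two sparse (Laplacian) sources
rng = np.random.default_rng(1)
s = rng.laplace(size=(2, 5000))
A = np.array([[1.0, 0.6], [0.4, 1.0]])   # mixing matrix
W = ica_natural_gradient(A @ s)
P = W @ A                                # should approach permutation * scale
```

If separation succeeds, each row of P = WA has a single dominant entry, i.e. each recovered component is one source up to scale.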
5. Conditional Density Modelling
To model a conditional density, use the corresponding rules.
This little-known fact has hardly ever been exploited. It can be used in place of regression everywhere.
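A hedged reconstruction of the conditional rules (the slide's equations are lost): condition the energy-based model on x, so the partition function becomes x-dependent, and the Boltzmann-like gradient holds with the model average taken conditionally:

```latex
q_w(y \,|\, x) = \frac{e^{-E_w(y,x)}}{Z_w(x)}, \qquad Z_w(x) = \int e^{-E_w(y,x)}\,dy ,
\qquad
\Delta w \;\propto\;
 -\Big\langle \frac{\partial E_w}{\partial w}\Big\rangle_{p(x,y)}
 \;+\; \Big\langle \Big\langle \frac{\partial E_w}{\partial w}\Big\rangle_{q_w(y|x)} \Big\rangle_{p(x)}
```

This is the same learn/unlearn structure as the unconditional case, with the "asleep" average drawn from the model's conditional at each observed x.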
6. Independent Components, Subspaces and Vectors
ICA
ISA
IVA
DCA
(i.e. the score function is hard to get at because of Z)
7. IVA used for audio separation in a real room
8. Score functions derived from sparse factorial and radial densities
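Two standard examples of such score functions (a reconstruction; the slide's formulas were images). A sparse factorial (Laplacian) density gives a score that acts per component, while a radial density gives a score that couples the components only through the vector's amplitude:

```latex
\varphi(u) = -\frac{\partial \log q(u)}{\partial u}: \qquad
q(u) \propto e^{-\sum_i |u_i|} \;\Rightarrow\; \varphi_i(u) = \operatorname{sign}(u_i),
\qquad
q(u) \propto e^{-\|u\|} \;\Rightarrow\; \varphi_i(u) = \frac{u_i}{\|u\|}.
```

The radial form is the one relevant to IVA, where u is the vector of frequency components belonging to one source.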
9. Results on real-room source separation
10. Why does IVA work on this problem?
Because the score function, and thus the learning, is only sensitive to the amplitude of the complex vectors, which represent correlations of amplitudes of frequency components associated with a single speaker. Arbitrary dependencies can exist between the phases of this vector. Thus all phase structure (i.e. higher-order statistical structure) is confined within each vector and removed between vectors.
- It's a simple trick, just relaxing the independence assumptions in a way that fits speech. But we can do much more:
- build conditional models across frequency components
- make models for data that is even more structured:
  - video is time x space x colour
  - many experiments are time x sensor x task-condition x trial
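The amplitude-only sensitivity claimed on slide 10 can be checked numerically: under a radial density q(u) ∝ e^(−‖u‖) over one source's complex frequency vector (the model form assumed on slide 8), the log-likelihood is unchanged when every frequency bin is rotated by an arbitrary, independent phase:

```python
import numpy as np

rng = np.random.default_rng(0)
F = 8                                    # number of frequency bins (arbitrary)
u = rng.standard_normal(F) + 1j * rng.standard_normal(F)  # one source's complex vector

def log_density(u):
    # radial model over the complex vector: depends only on the
    # amplitudes |u_f|, through the norm -- all phase structure is free
    return -np.sqrt(np.sum(np.abs(u) ** 2))

# rotate every frequency bin by an arbitrary independent phase
phases = np.exp(1j * rng.uniform(0, 2 * np.pi, F))
same = np.isclose(log_density(u), log_density(u * phases))  # True: likelihood unchanged
```

Since the learning signal is built from this density, arbitrary phase dependencies within the vector are never penalised, which is exactly the structure speech exhibits across frequencies.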
19. The big picture
Behind this effort is an attempt to explore something called The Levels Hypothesis: the idea that in biology, in the brain, in nature, there is a kind of density estimation taking place across scales. To explore this idea, we have a twofold strategy:
1. EMPIRICAL/DATA ANALYSIS: build algorithms that can probe the EEG across scales, i.e. across frequencies.
2. THEORETICAL: formalise mathematically the learning process in such systems.
20. A Multi-Level View of Learning
[Figure: levels arranged along an axis of increasing timescale, with STDP marked at one level]
LEARNING at a LEVEL is CHANGE IN INTERACTIONS between its UNITS, implemented by INTERACTIONS at the LEVEL beneath, and by extension resulting in CHANGE IN LEARNING at the LEVEL above.
Interactions are fast; learning is slow.
Separation of timescales allows INTERACTIONS at one LEVEL to be LEARNING at the LEVEL above.
21. Infomax between Levels (e.g. synapses density-estimate spikes) versus Infomax between Layers (e.g. V1 density-estimates the Retina)
[Figure: two levels, 1 and 2 — all neural spikes (t) at one level, all synaptic readout (y) at the other, connected through synapses and dendrites]
Between Levels:
- overcomplete
- includes all feedback
- information flows between levels
- arbitrary dependencies
- models input and intrinsic activity
Between Layers:
- square (in the ICA formalism)
- feedforward
- information flows within a level
- predicts independent activity
- only models outside input
The pdf of all spike times; the pdf of all synaptic readouts.
If we can make this pdf uniform, then we have a model constructed from all synaptic and dendritic causality.
22. Formalisation of the problem
p is the data distribution, q is the model distribution, w is a synaptic weight, and I(y,t) is the spike–synapse mutual information.
IF we were doing classical Infomax, THEN we would use the gradient (1): changing one's model to fit the world.
BUT if one's actions can change the data, THEN an extra term appears, giving the gradient (2): changing the world to fit the model, as well as changing one's model to fit the world.
Therefore (2) must be easier than (1). This is what we are now researching.
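A plausible reconstruction of the two gradients (the slide's equation images are lost; the objective L(w), an expected log-likelihood, is my assumption, with symbols as defined above). In the classical case the data distribution p is fixed; in the active case it depends on w, written p_w:

```latex
L(w) = \int p_w(t)\,\log q_w(t)\,dt ,
\qquad
\text{(1)}\quad \frac{\partial L}{\partial w}
 = \int p(t)\,\frac{\partial \log q_w(t)}{\partial w}\,dt ,
```
```latex
\text{(2)}\quad \frac{\partial L}{\partial w}
 = \underbrace{\int p_w(t)\,\frac{\partial \log q_w(t)}{\partial w}\,dt}_{\text{model fits world}}
 \;+\; \underbrace{\int \frac{\partial p_w(t)}{\partial w}\,\log q_w(t)\,dt}_{\text{world fits model}} .
```

The extra term in (2) is precisely the "change the world to fit the model" contribution: it rewards actions that steer the data toward regions the model already assigns high likelihood.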