Transcript and Presenter's Notes

Title: How to win big by thinking straight about relatively trivial problems


1
How to win big by thinking straight about relatively trivial problems
Tony Bell, University of California at Berkeley
2
Density Estimation
Make the model like the reality by minimising the Kullback-Leibler divergence, by gradient descent in a parameter of the model.
THIS RESULT IS COMPLETELY GENERAL.
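A hedged reconstruction of the two quantities this slide refers to, in the notation of the final slide (p is the data density, q the model density with parameter w): the Kullback-Leibler divergence and its gradient, which is the completely general result,

    D_{\mathrm{KL}}(p \,\|\, q_w) = \int p(x)\,\log\frac{p(x)}{q_w(x)}\,dx
    \qquad
    \frac{\partial}{\partial w} D_{\mathrm{KL}}(p \,\|\, q_w)
      = -\,\mathbb{E}_{x\sim p}\!\left[\frac{\partial}{\partial w}\log q_w(x)\right]

Descending this gradient with respect to w moves the model density toward the data density.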
3
The passive case ( 0)
For a general model distribution written in the energy-based form, with an energy and a partition function (or zeroth moment...), the gradient evaluates in the simple Boltzmann-like form:
learn on data while awake,
unlearn on data while asleep.
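A hedged sketch of the energy-based form and the Boltzmann-like gradient named above, with E_w the energy and Z_w the partition function:

    q_w(x) = \frac{e^{-E_w(x)}}{Z_w},
    \qquad
    Z_w = \int e^{-E_w(x)}\,dx

    \frac{\partial}{\partial w} D_{\mathrm{KL}}(p \,\|\, q_w)
      = \mathbb{E}_{x\sim p}\!\left[\frac{\partial E_w(x)}{\partial w}\right]
      - \mathbb{E}_{x\sim q_w}\!\left[\frac{\partial E_w(x)}{\partial w}\right]

The first expectation is over the data (the awake phase); the second is over the model's own samples (the asleep phase).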
4
The single-layer case
Many problems are solved by modeling in the transformed space.
Learning Rule (Natural Gradient), for a non-loopy hypergraph.
The Score Function is the important quantity.
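A hedged reconstruction of the single-layer quantities, assuming the transformed space is u = Wx: the natural-gradient learning rule and the score function it is built from,

    u = Wx,
    \qquad
    \Delta W \propto \left(I - \varphi(u)\,u^{\top}\right) W,
    \qquad
    \varphi_i(u) = -\frac{\partial}{\partial u_i}\log q(u)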
5
Conditional Density Modeling
To model the conditional density, use the rules (a hedged reading is given below).
This little-known fact has hardly ever been exploited. It can be used instead of regression everywhere.
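One hedged reading of "the rules", assuming the conditional model is defined from the joint and marginal models as q_w(x|y) = q_w(x,y)/q_w(y): the gradient for the conditional fit is the joint term minus the marginal term,

    -\frac{\partial}{\partial w}\,\mathbb{E}_{p(x,y)}\big[\log q_w(x \mid y)\big]
      = -\mathbb{E}_{p(x,y)}\!\left[\frac{\partial}{\partial w}\log q_w(x,y)\right]
      + \mathbb{E}_{p(y)}\!\left[\frac{\partial}{\partial w}\log q_w(y)\right]

so a conditional density model can be trained with exactly the same machinery as an unconditional one, which is why it can stand in for regression.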
6
Independent Components, Subspaces and Vectors
ICA (Independent Component Analysis)
ISA (Independent Subspace Analysis)
IVA (Independent Vector Analysis)
DCA
(i.e. score function hard to get at due to Z)
7
IVA used for audio separation in a real room
8

Score functions derived from sparse factorial and radial densities
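A hedged example of a radial score function of the kind shown here, assuming a spherically symmetric Laplacian-type prior q(s) \propto \exp(-\|s\|_2) over the vector s of frequency components of one source:

    \varphi_f(s) = \frac{s_f}{\|s\|_2},
    \qquad
    \|s\|_2 = \Big(\sum_{f'} |s_{f'}|^2\Big)^{1/2}

The score depends only on the overall amplitude of the vector, not on the relative phases of its components, which is the property used on slide 10.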
9
Results on real-room source separation
10
Why does IVA work on this problem?
Because the score function, and thus the learning, is only sensitive to the amplitude of the complex vectors, representing correlations of amplitudes of frequency components associated with a single speaker. Arbitrary dependencies can exist between the phases of this vector. Thus all phase (i.e. higher-order statistical) structure is confined within the vector and removed between vectors.
  • It's a simple trick, just relaxing the independence assumptions in a way that fits speech. But we can do much more:
  • build conditional models across frequency components
  • make models for data that is even more structured:
    • Video is time x space x colour
    • Many experiments are time x sensor x task-condition x trial
A minimal code sketch of the amplitude-only score and the resulting update follows below.
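A minimal numerical sketch, not the presenter's code: one natural-gradient IVA update in the STFT domain using the amplitude-only (radial) score from slide 8. All names, array shapes, and the learning rate are illustrative assumptions.

    import numpy as np

    def iva_natural_gradient_step(W, X, lr=0.1, eps=1e-8):
        # W: (F, N, N) complex unmixing matrices, one per frequency bin
        # X: (F, N, T) complex STFT mixtures (bin, microphone, frame)
        F, N, T = X.shape
        Y = np.einsum('fnm,fmt->fnt', W, X)   # per-bin unmixed sources
        # Radial score: each source is scored by the amplitude of its whole
        # frequency vector, coupling the bins while ignoring their phases.
        amp = np.sqrt((np.abs(Y) ** 2).sum(axis=0, keepdims=True)) + eps
        Phi = Y / amp
        eye = np.eye(N)
        for f in range(F):
            # natural-gradient rule: dW_f proportional to (I - E[phi(y) y^H]) W_f
            cov = Phi[f] @ Y[f].conj().T / T
            W[f] = W[f] + lr * (eye - cov) @ W[f]
        return W, Y

In practice the step would be iterated over the recording until the unmixed vectors stop changing; only the amplitude coupling across bins distinguishes this from running ICA independently in each frequency bin.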

11-18
(No Transcript: image-only slides)
19
The big picture.
Behind this effort is an attempt to explore something called The Levels Hypothesis: the idea that in biology, in the brain, in nature, there is a kind of density estimation taking place across scales. To explore this idea, we have a twofold strategy:
1. EMPIRICAL/DATA ANALYSIS: build algorithms that can probe the EEG across scales, i.e. across frequencies.
2. THEORETICAL: formalise mathematically the learning process in such systems.
20
A Multi-Level View of Learning
( STDP)
Increasing Timescale
LEARNING at a LEVEL is CHANGE IN INTERACTIONS
between its UNITS, implemented by INTERACTIONS at
the LEVEL beneath, and by extension resulting in
CHANGE IN LEARNING at the LEVEL above.
Interactions: fast. Learning: slow.
Separation of timescales allows INTERACTIONS at
one LEVEL to be LEARNING at the LEVEL above.
21
Infomax between Levels (e.g. synapses density-estimate spikes):
  • overcomplete
  • includes all feedback
  • information flows between levels
  • arbitrary dependencies
  • models input and intrinsic activity

Infomax between Layers (e.g. V1 density-estimates Retina):
  • square (in ICA formalism)
  • feedforward
  • information flows within a level
  • predicts independent activity
  • only models outside input

Diagram labels: t, all neural spikes; y, all synaptic readout; synapses and dendrites map t to y; the pdf of all spike times; the pdf of all synaptic readouts.

If we can make this pdf uniform, then we have a model constructed from all synaptic and dendritic causality.
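A hedged note on the last claim, using the standard equivalence between Infomax and density estimation (the specific scalar form below is an assumption, not taken from the slide): if a readout y is produced from its input u through a squashing function g chosen as the cumulative distribution function of the model density, then

    p(y) = \frac{p(u)}{|g'(u)|}

so p(y) is uniform exactly when |g'(u)| matches the true density of u, i.e. when the model built into the synapses and dendrites matches the density of the spikes they read out.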
22
Formalisation of the problem
p is the data distribution, q is the model distribution, w is a synaptic weight, and I(y,t) is the spike-synapse mutual information.
IF
THEN, if we were doing classical Infomax, we would use the gradient
(1)
BUT if one's actions can change the data, THEN an extra term appears
(2)
changing one's model to fit the world, as well as changing the world to fit the model.
Therefore (2) must be easier than (1). This is what we are now researching.
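A hedged guess at the structure of gradients (1) and (2), reconstructed only from the labels above, so the exact form is an assumption: classical Infomax holds the data distribution fixed, while the active case adds the term that flows through the data's dependence on w,

    (1) \quad \Delta w \propto \left.\frac{\partial I(y,t)}{\partial w}\right|_{\text{data fixed}}

    (2) \quad \Delta w \propto \left.\frac{\partial I(y,t)}{\partial w}\right|_{\text{data fixed}}
        + \left.\frac{\partial I(y,t)}{\partial w}\right|_{\text{through the data}}

The first term changes one's model to fit the world; the second changes the world to fit the model.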