1
Information Theoretic Learning: Finding structure in data ...
  • Jose Principe and Sudhir Rao
  • University of Florida
  • principe@cnel.ufl.edu
  • www.cnel.ufl.edu

2
Outline
  • Structure
  • Connection to Learning
  • Learning Structure: the old view
  • A new framework
  • Applications

3
Structure
  • Patterns / regularities; interdependence between subsystems
  • Contrast: amorphous / chaotic data, white noise
4
Connection to Learning
5
Types of Learning
  • Supervised Learning: data + a desired signal / teacher
  • Reinforcement Learning: data + rewards / punishments
  • Unsupervised Learning: only the data

6
Unsupervised Learning
  • What can be done with only the data?

Examples and their first principles:
  • Preserve maximum information: auto-associative memory, ART, PCA, Linsker's infomax rule
  • Extract independent features: Barlow's minimum redundancy principle, ICA, etc.
  • Learn the probability distribution: Gaussian mixture models, EM algorithm, parametric density estimation
7
Connection to Self Organization
"If cell 1 is one of the cells providing input to cell 2, and if cell 1's activity tends to be high whenever cell 2's activity is high, then the future contributions that the firing of cell 1 makes to the firing of cell 2 should increase."
- Donald Hebb, 1949, neuropsychologist
What is the purpose?
[Diagram: cells A and C provide input to cell B; increase the weight w_B in proportion to the activity of B and C]
"Does the Hebb-type algorithm cause a developing perceptual network to optimize some property that is deeply connected with the mature network's functioning as an information processing system?"
- Linsker, 1988
8
Linsker's Infomax principle
[Diagram: linear network with inputs X1, ..., XL, weights w1, ..., wL, additive noise, and output Y]
Under Gaussian assumptions and uncorrelated noise, the rate for the linear network is the Gaussian channel rate
    I(X, Y) = (1/2) log( (w'Cw + s_n^2) / s_n^2 ),
where C is the input covariance matrix and s_n^2 is the noise variance.
Maximize the rate → maximize the Shannon rate I(X, Y) → a Hebbian rule!
9
Barlow's redundancy principle
Independent features → no redundancy → ICA!
Converting an M-dimensional problem into N one-dimensional problems:
  • N conditional probabilities required for an event V: P(V | feature i)
  • 2^M conditional probabilities required for an event V: P(V | stimuli)
10
Summary 1
  • Global objective function (example: Infomax): extracting the desired signal from the data itself
  • Self-organizing rule (example: Hebbian rule): revealing the structure through the interaction of the data points
  • Unsupervised learning (example: PCA): discovering structure in data
11
Questions
  • Can we go beyond these preprocessing stages?
  • Can we create a global cost function which extracts goal-oriented structures from the data?
  • Can we derive a self-organizing principle from such a cost function?

A big YES!!!
12
What is Information Theoretic Learning?
  • ITL is a methodology to adapt linear or nonlinear
    systems using criteria based on the information
    descriptors of entropy and divergence.
  • Its centerpiece is a non-parametric estimator of entropy that:
  • Does not require an explicit estimation of the pdf
  • Uses the Parzen window method, which is known to be consistent and efficient
  • Is smooth
  • Is readily integrated into conventional gradient-descent learning
  • Provides a link to kernel learning and SVMs
  • Allows an extension to random processes

13
ITL is a different way of thinking about data quantification
  • Moment expansions, and in particular second-order moments, are still today the workhorse of statistics. We automatically translate deep concepts (e.g. similarity, Hebb's postulate of learning) into 2nd-order statistical equivalents.
  • ITL replaces 2nd-order moments with a geometric statistical interpretation of data in probability spaces:
  • Variance by entropy
  • Correlation by correntropy
  • Mean square error (MSE) by minimum error entropy (MEE)
  • Distances in data space by distances in probability spaces

14
Information Theoretic Learning: Entropy
  • Entropy quantifies the degree of uncertainty in a r.v. Claude Shannon defined entropy as
    H_S(X) = - ∫ p_X(x) log p_X(x) dx    (a sum over the alphabet in the discrete case)

Not all random variables (r.v.) are equally random!
15
Information Theoretic Learning: Renyi's Entropy
  • Renyi's entropy of order α is defined through the α-norm of the pdf:
    H_α(X) = 1/(1-α) log ∫ p_X^α(x) dx,    α > 0, α ≠ 1

Renyi's entropy equals Shannon's as α → 1.
16
Information Theoretic Learning: Parzen windowing
  • Given only samples {x_1, ..., x_N} drawn from the distribution, the pdf is estimated as
    p̂_X(x) = (1/N) Σ_i κ_σ(x - x_i)
  • Convergence: the estimator is asymptotically unbiased and consistent provided the kernel size shrinks (σ → 0) while Nσ → ∞.
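As an aside (not on the original slides), a minimal Python sketch of the Parzen estimator with a Gaussian kernel; the kernel size sigma is the single free parameter the later slides keep referring to:

```python
import numpy as np

def gaussian_kernel(u, sigma):
    """1-D Gaussian kernel G_sigma(u)."""
    return np.exp(-u**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def parzen_pdf(x, samples, sigma):
    """Parzen estimate p_hat(x) = (1/N) * sum_i G_sigma(x - x_i)."""
    x = np.atleast_1d(x)
    return gaussian_kernel(x[:, None] - samples[None, :], sigma).mean(axis=1)

# Example: estimate the pdf of a standard Gaussian from 200 samples.
rng = np.random.default_rng(0)
samples = rng.normal(size=200)
grid = np.linspace(-4.0, 4.0, 101)
p_hat = parzen_pdf(grid, samples, sigma=0.3)
```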

17
Information Theoretic Learning: Renyi's Quadratic Entropy
  • Order-2 entropy with Gaussian kernels has a closed-form sample estimator:
    H_2(X) = -log V_2(X),    V_2(X) = (1/N²) Σ_i Σ_j G_{σ√2}(x_i - x_j)
  • Pairwise interactions between samples: O(N²)
  • The information potential V_2(X) provides a potential field over the space of the samples, parameterized by the kernel size σ.

Principe, Fisher, Xu, Unsupervised Adaptive Filtering (S. Haykin, ed.), Wiley, 2000.
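A minimal Python sketch of this estimator (my illustration of the formula above; the σ√2 kernel size comes from convolving two σ-sized Gaussian kernels):

```python
import numpy as np

def information_potential(X, sigma):
    """V_2(X) = (1/N^2) * sum_ij G_{sigma*sqrt(2)}(x_i - x_j), X of shape (N, d)."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]                      # treat a 1-D array as N scalar samples
    d = X.shape[1]
    s2 = 2.0 * sigma**2                     # variance of the sqrt(2)*sigma Gaussian
    sq = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
    G = np.exp(-sq / (2 * s2)) / (2 * np.pi * s2) ** (d / 2)
    return G.mean()                         # O(N^2) pairwise interactions

def renyi_quadratic_entropy(X, sigma):
    """H_2(X) = -log V_2(X), estimated by Parzen windowing."""
    return -np.log(information_potential(X, sigma))
```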
18
Information Theoretic Learning Information Force
  • In adaptation, samples become information particles that interact through information forces.
  • Information potential: V_2(X) = (1/N²) Σ_i Σ_j G_{σ√2}(x_i - x_j)
  • Information force: F(x_j) = ∂V_2(X)/∂x_j, the net effect that all other samples exert on sample x_j

Principe, Fisher, Xu, Unsupervised Adaptive Filtering (S. Haykin, ed.), Wiley, 2000.
Erdogmus, Principe, Hild, Natural Computing, 2002.
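An illustrative sketch (mine, not the authors' code) that differentiates the V_2 estimator above to obtain the force acting on each sample:

```python
import numpy as np

def information_forces(X, sigma):
    """F_k = dV_2/dx_k for the Gaussian-kernel information potential.
    Moving samples along +F_k increases V_2, i.e. decreases Renyi's H_2."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    N, d = X.shape
    s2 = 2.0 * sigma**2
    diff = X[:, None, :] - X[None, :, :]                 # x_k - x_j
    G = np.exp(-(diff**2).sum(-1) / (2 * s2)) / (2 * np.pi * s2) ** (d / 2)
    # both occurrences of x_k in the double sum contribute, hence the 1/sigma^2 factor
    return -(G[:, :, None] * diff).sum(axis=1) / (N**2 * sigma**2)
```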
19
What will happen if we allow the particles to move under the influence of these forces?
[Figure: information forces within a dataset, arising from H(X)]
20
Information Theoretic Learning Backpropagation
of Information Forces
  • Information forces become the injected error to
    the dual or adjoint network that determines the
    weight updates for adaptation.

21
Information Theoretic Learning: Quadratic divergence measures
  • Kullback-Leibler divergence: D_KL(p, q) = ∫ p(x) log( p(x)/q(x) ) dx
  • Renyi's divergence: D_α(p, q) = 1/(α-1) log ∫ p^α(x) q^(1-α)(x) dx
  • Euclidean distance between pdfs: D_ED(p, q) = ∫ ( p(x) - q(x) )² dx
  • Cauchy-Schwarz divergence: D_CS(p, q) = -log [ ( ∫ p(x) q(x) dx )² / ( ∫ p²(x) dx ∫ q²(x) dx ) ]
  • Mutual information is a special case (divergence between the joint and the product of the marginals)

Principe, Fisher, Xu, Unsupervised Adaptive Filtering (S. Haykin, ed.), Wiley, 2000.
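Both quadratic divergences can be estimated directly from two sample sets using the same pairwise Gaussian sums. A hedged Python sketch (my illustration):

```python
import numpy as np

def _pairwise_gaussian_mean(A, B, sigma):
    """(1/(Na*Nb)) * sum_ij G_{sigma*sqrt(2)}(a_i - b_j); A, B of shape (N, d)."""
    A, B = (np.asarray(Z, dtype=float) for Z in (A, B))
    A = A[:, None] if A.ndim == 1 else A
    B = B[:, None] if B.ndim == 1 else B
    d = A.shape[1]
    s2 = 2.0 * sigma**2
    sq = ((A[:, None, :] - B[None, :, :])**2).sum(-1)
    return (np.exp(-sq / (2 * s2)) / (2 * np.pi * s2) ** (d / 2)).mean()

def euclidean_divergence(X, Y, sigma):
    """D_ED(p, q) = int (p - q)^2 dx, via Parzen plug-in estimates."""
    return (_pairwise_gaussian_mean(X, X, sigma)
            - 2 * _pairwise_gaussian_mean(X, Y, sigma)
            + _pairwise_gaussian_mean(Y, Y, sigma))

def cauchy_schwarz_divergence(X, Y, sigma):
    """D_CS(p, q) = -log[(int p q)^2 / (int p^2 * int q^2)]; zero iff p = q."""
    vxy = _pairwise_gaussian_mean(X, Y, sigma)
    vxx = _pairwise_gaussian_mean(X, X, sigma)
    vyy = _pairwise_gaussian_mean(Y, Y, sigma)
    return -np.log(vxy**2 / (vxx * vyy))
```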
22
Information Theoretic Learning Unifying
criterion for learning from samples
23
Training ADALINE sample by sample: the stochastic information gradient (SIG)
  • Theorem: the expected value of the stochastic information gradient (SIG) is the gradient of Shannon's entropy estimated from the samples using Parzen windowing.
  • For the Gaussian kernel and M = 1 (a single past sample), the update is built from the difference of consecutive errors and inputs (see the sketch below).
  • The form is the same as for LMS, except that entropy learning works with differences of samples.
  • The SIG works implicitly with the L1 norm of the error.

Erdogmus, Principe, Hild, IEEE Signal Processing
Letters, 2003.
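A hedged reconstruction of such a sample-by-sample update in Python. I am assuming (my reading of "LMS with differences", not a formula taken from the slide) that minimizing the error entropy with a Gaussian kernel and M = 1 gives the step w ← w + (η/σ²)(e_k - e_{k-1})(x_k - x_{k-1}):

```python
import numpy as np

def train_adaline_sig(X, d, eta=0.05, sigma=1.0, epochs=20, seed=0):
    """ADALINE trained with the stochastic information gradient (SIG),
    Gaussian kernel, M = 1: LMS form, but with *differences* of consecutive
    errors and inputs. X has shape (N, p), d has shape (N,)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])
    for _ in range(epochs):
        e_prev, x_prev = None, None
        for x_k, d_k in zip(X, d):
            e_k = d_k - w @ x_k
            if e_prev is not None:
                w += (eta / sigma**2) * (e_k - e_prev) * (x_k - x_prev)
            e_prev, x_prev = e_k, x_k
    return w

# Note: entropy is blind to the error mean, so a bias would have to be set
# afterwards (e.g. so that the mean error over the training set is zero).
```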
24
SIG Hebbian updates
  • In a linear network the Hebbian update is Δw = η y x.
  • The update maximizing Shannon output entropy with the SIG instead uses differences of consecutive outputs and inputs.
  • Which is more powerful and biologically plausible?

Experiment: 50 samples of a 2-D distribution where the x axis is uniform, the y axis is Gaussian, and the sample covariance matrix is the identity. Hebbian updates converge to an arbitrary direction, but the SIG consistently found the 90-degree direction! (A sketch of a comparable setup follows below.)
Erdogmus, Principe, Hild, IEEE Signal Processing
Letters, 2003.
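A hedged sketch of a comparable experiment (my reconstruction; the slide gives neither the exact SIG update nor the normalization). Assumptions: the SIG ascent step for output entropy is Δw = (η/σ²)(y_k - y_{k-1})(x_k - x_{k-1}), and both rules renormalize w to unit length after every step so that only the direction matters:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
# x axis uniform, y axis Gaussian, both with unit variance
X = np.column_stack([rng.uniform(-np.sqrt(3), np.sqrt(3), N), rng.normal(size=N)])

def run(rule, eta=0.1, sigma=1.0, epochs=200):
    w = rng.normal(size=2)
    w /= np.linalg.norm(w)
    y_prev, x_prev = None, None
    for _ in range(epochs):
        for x in X:
            y = w @ x
            if rule == "hebb":
                w += eta * y * x
            elif y_prev is not None:              # SIG, maximizing output entropy
                w += (eta / sigma**2) * (y - y_prev) * (x - x_prev)
            y_prev, x_prev = y, x
            w /= np.linalg.norm(w)                # keep only the direction
    return np.degrees(np.arctan2(w[1], w[0])) % 180.0

print("Hebb direction:", run("hebb"), "SIG direction:", run("sig"))
# With unit variance in every direction, only the SIG has a reason to prefer
# the Gaussian (90-degree) axis: the Gaussian maximizes entropy at fixed variance.
```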
25
ITL - Applications
www.cnel.ufl.edu → the ITL page has examples and Matlab code
26
Renyi's cross entropy
Let X and Y be two r.v.s with iid samples {x_i}, i = 1..N_x, and {y_j}, j = 1..N_y. Renyi's quadratic cross entropy is given by
    H_2(X; Y) = -log ∫ p_X(z) p_Y(z) dz
Using Parzen estimates for the pdfs gives
    H_2(X; Y) = -log [ (1/(N_x N_y)) Σ_i Σ_j G_{σ√2}(x_i - y_j) ]
27
Cross information potential and cross information force
  • Cross information potential: V(X; Y) = (1/(N_x N_y)) Σ_i Σ_j G_{σ√2}(x_i - y_j)
  • Cross information force: the force between particles of two datasets, F(x_i) = ∂V(X; Y)/∂x_i
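An illustrative Python sketch (mine) of the cross information potential and of the force it exerts on the particles of one dataset when the other dataset is held fixed:

```python
import numpy as np

def _as_2d(Z):
    Z = np.asarray(Z, dtype=float)
    return Z[:, None] if Z.ndim == 1 else Z

def cross_information_potential(X, Y, sigma):
    """V(X; Y) = (1/(Nx*Ny)) * sum_ij G_{sigma*sqrt(2)}(x_i - y_j)."""
    X, Y = _as_2d(X), _as_2d(Y)
    d = X.shape[1]
    s2 = 2.0 * sigma**2
    sq = ((X[:, None, :] - Y[None, :, :])**2).sum(-1)
    return (np.exp(-sq / (2 * s2)) / (2 * np.pi * s2) ** (d / 2)).mean()

def cross_information_force(X, Y, sigma):
    """F_i = dV(X; Y)/dx_i: attraction of each particle x_i toward the fixed set Y."""
    X, Y = _as_2d(X), _as_2d(Y)
    Nx, d = X.shape
    Ny = Y.shape[0]
    s2 = 2.0 * sigma**2
    diff = X[:, None, :] - Y[None, :, :]
    G = np.exp(-(diff**2).sum(-1) / (2 * s2)) / (2 * np.pi * s2) ** (d / 2)
    return -(G[:, :, None] * diff).sum(axis=1) / (Nx * Ny * s2)
```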
28
[Figure: cross information force between two datasets, arising from H(X; Y)]
29
Cauchy-Schwarz Divergence
A measure of similarity between two datasets:
    D_CS(X; Y) = -log [ ( ∫ p_X p_Y )² / ( ∫ p_X² ∫ p_Y² ) ] = 2 H_2(X; Y) - H_2(X) - H_2(Y)
D_CS(X; Y) = 0 if and only if the two probability density functions are the same.
30
A New ITL Framework: Information Theoretic Mean Shift
STATEMENT
Consider a dataset Xo with N iid samples. We wish to find a new dataset X which captures interesting structures of the original dataset Xo.
FORMULATION
Cost = redundancy-reduction term + similarity-measure term, in a weighted combination:
    J(X) = H(X) + λ D_CS(X; Xo),  minimized over X
31
Information Theoretic Mean Shift
Form 1
This cost looks like a reaction-diffusion equation: the entropy term implements diffusion, and the Cauchy-Schwarz term implements attraction to the original data.
32
Analogy
The weighting parameter λ squeezes the information flow through a bottleneck, extracting different levels of structure in the data.
  • We can also visualize λ as a slope parameter. The previous methods used only λ = 1 or λ = 0.

33
Self organizing rule
Rewriting the cost function in terms of information potentials, differentiating with respect to x_k, k = 1, 2, ..., N, and rearranging gives a
Fixed Point Update!!
Each sample moves to a weighted average of the current samples X and of the original samples Xo (see the sketch below).
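A hedged Python sketch of what the fixed-point iteration could look like, assuming the cost J(X) = H_2(X) + λ D_CS(X; Xo) written with Gaussian information potentials (my derivation of the update from ∂J/∂x_k = 0; the Gaussian normalization constants cancel in the ratio):

```python
import numpy as np

def itms(X0, lam, sigma, n_iter=50):
    """Information theoretic mean shift: move a copy X of the data X0 (shape (N, d))
    by the fixed-point update derived from J(X) = H_2(X) + lam * D_CS(X; X0)."""
    X0 = np.asarray(X0, dtype=float)
    X = X0.copy()                                   # initialize X = Xo
    s2 = 2.0 * sigma**2                             # variance of the sqrt(2)*sigma kernel
    for _ in range(n_iter):
        Kxx = np.exp(-((X[:, None] - X[None, :])**2).sum(-1) / (2 * s2))
        Kx0 = np.exp(-((X[:, None] - X0[None, :])**2).sum(-1) / (2 * s2))
        a = (1.0 - lam) / Kxx.sum()                 # entropy term (within-X interactions)
        b = lam / Kx0.sum()                         # Cauchy-Schwarz attraction toward Xo
        num = a * (Kxx @ X) + b * (Kx0 @ X0)
        den = a * Kxx.sum(axis=1, keepdims=True) + b * Kx0.sum(axis=1, keepdims=True)
        X = num / den                               # weighted-average (mean-shift-like) step
    return X
```

With λ = 0 only the within-X term survives (the set blurs onto itself); with λ = 1 only the attraction to Xo survives; larger λ pulls X back toward the original data, as the following slides summarize.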
34
An Example
Crescent shaped Dataset
35
Effect of λ
36
Summary 2
Starting with the data:
  • λ → ∞ : back to the data
  • λ ≈ 1 : the modes
  • λ = 0 : a single point
37
Applications- Clustering
Statement: segment the data into different groups such that samples belonging to the same group are closer to each other than samples of different groups.
The idea: mode-finding ability → clustering.
38
Mean Shift: a review
Modes are the stationary points of the fixed-point equation
    x ← Σ_j G_σ(x - x_j) x_j / Σ_j G_σ(x - x_j)
39
Two variants: GBMS and GMS
  • Gaussian Mean Shift (GMS): two datasets, X and Xo; initialize X = Xo and move X over the fixed Parzen density of Xo.
  • Gaussian Blurring Mean Shift (GBMS): a single dataset X, initialized as X = Xo, which is itself blurred (updated) at every iteration.
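A compact Python sketch of the two variants (my illustration; the kernel size σ is a free parameter):

```python
import numpy as np

def gms(X0, sigma, n_iter=100):
    """Gaussian mean shift: X moves over the fixed Parzen density of X0,
    so every point climbs to a mode of that density. X0 has shape (N, d)."""
    X0 = np.asarray(X0, dtype=float)
    X = X0.copy()                                   # initialize X = Xo
    for _ in range(n_iter):
        K = np.exp(-((X[:, None] - X0[None, :])**2).sum(-1) / (2 * sigma**2))
        X = (K @ X0) / K.sum(axis=1, keepdims=True)
    return X

def gbms(X0, sigma, n_iter=10):
    """Gaussian blurring mean shift: the single dataset is itself blurred at
    every iteration; run long enough, it collapses toward a single point."""
    X = np.asarray(X0, dtype=float).copy()
    for _ in range(n_iter):
        K = np.exp(-((X[:, None] - X[None, :])**2).sum(-1) / (2 * sigma**2))
        X = (K @ X) / K.sum(axis=1, keepdims=True)
    return X
```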
40
Connection to ITMS
  • λ = 0 : only the entropy term remains and the ITMS fixed point reduces to GBMS
  • λ = 1 : only the Cauchy-Schwarz attraction to Xo remains and the fixed point reduces to GMS
41
Applications- Clustering
10 random Gaussian clusters and their pdf plot
42
GBMS result
GMS result
43
Image segmentation
44
GBMS
GMS
45
Applications- Principal Curves
  • Non-linear extension of PCA.
  • "Self-consistent" smooth curves which pass through the "middle" of a d-dimensional probability distribution or data cloud.

A new definition (Erdogmus et al.): a point is an element of the d-dimensional principal set, denoted P_d, iff the gradient of the pdf at that point is orthogonal to at least (n - d) eigenvectors of the Hessian of the pdf and the point is a strict local maximum in the subspace spanned by those eigenvectors.
46
PC continued
  • P_0 is the 0-dimensional principal set, corresponding to the modes of the data; P_1 is the 1-dimensional principal curve; P_2 is the 2-dimensional principal surface, and so on.
  • Hierarchical structure: P_0 ⊂ P_1 ⊂ P_2 ⊂ ...
  • ITMS satisfies this definition (experimentally).
  • Gives the principal curve for an intermediate value of λ (between the mode-finding and data-preserving regimes).

47
Principal curve of spiral data passing through
the modes
48
Denoising
Chain of Ring Dataset
49
Applications -Vector Quantization
  • Limiting case of ITMS (λ → ∞).
  • D_CS(X; Xo) can be seen as a distortion measure between X and Xo.
  • Initialize X with far fewer points than Xo (see the sketch below).
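A hedged sketch (mine) of this limiting case: the same ITMS-style fixed-point update, but X initialized as a small random subset of Xo and a large finite λ standing in for λ → ∞:

```python
import numpy as np

def itvq(X0, n_codewords, sigma, lam=20.0, n_iter=100, seed=0):
    """Vector quantization as the large-lambda limit of ITMS: minimize (mostly)
    the Cauchy-Schwarz distortion D_CS(X; X0) with far fewer points in X."""
    rng = np.random.default_rng(seed)
    X0 = np.asarray(X0, dtype=float)
    X = X0[rng.choice(len(X0), size=n_codewords, replace=False)].copy()
    s2 = 2.0 * sigma**2
    for _ in range(n_iter):
        Kxx = np.exp(-((X[:, None] - X[None, :])**2).sum(-1) / (2 * s2))
        Kx0 = np.exp(-((X[:, None] - X0[None, :])**2).sum(-1) / (2 * s2))
        a = (1.0 - lam) / Kxx.sum()      # for lam > 1 this acts as repulsion among codewords
        b = lam / Kx0.sum()              # attraction of the codewords toward the data
        num = a * (Kxx @ X) + b * (Kx0 @ X0)
        den = a * Kxx.sum(axis=1, keepdims=True) + b * Kx0.sum(axis=1, keepdims=True)
        X = num / den
    return X

# Note: with a negative 'a' the denominator can in principle become small;
# a moderate lam and a reasonable sigma keep the iteration well behaved in practice.
```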

50
Comparison
ITVQ
LBG
51
Unsupervised learning tasks: choose a point in a 2D space!
  • λ → different tasks: clustering, principal curves, vector quantization
  • σ → different scales
52
Conclusions
  • Goal-oriented structure extraction goes beyond preprocessing stages and helps us extract abstract representations from the data.
  • A common framework binds these interesting structures together as different levels of information extraction from the data. ITMS achieves this and can be used for:
  • Clustering
  • Principal curves
  • Vector quantization, and more ...

53
What's Next?
54
Correntropy: A new generalized similarity measure
  • Correlation is one of the most widely used functions in signal processing and pattern recognition.
  • But correlation only quantifies similarity fully if the random variables are Gaussian distributed.
  • Can we define a new function that measures similarity but is not restricted to second-order statistics?
  • Use the ITL framework.

55
Correntropy: A new generalized similarity measure
  • Define the correntropy of a random process x_t as V(t, s) = E[ κ_σ(x_t - x_s) ].
  • We can easily estimate correntropy using kernels.
  • The name correntropy comes from the fact that the average over the lags (or the dimensions) is the information potential (the argument of Renyi's quadratic entropy).
  • For a strictly stationary and ergodic random process,
    V̂(τ) = (1/(N - τ)) Σ_{n = τ+1}^{N} κ_σ(x_n - x_{n-τ})

Santamaria I., Pokharel P., Principe J., "Generalized Correlation Function: Definition, Properties and Application to Blind Equalization", IEEE Trans. Signal Processing, vol. 54, no. 6, pp. 2187-2197, 2006.
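An illustrative Python estimator of correntropy over lags (my sketch, Gaussian kernel, following the stationary/ergodic estimator above):

```python
import numpy as np

def correntropy(x, max_lag, sigma):
    """V_hat(tau) = (1/(N - tau)) * sum_n G_sigma(x_n - x_{n-tau}), tau = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    V = np.empty(max_lag + 1)
    for tau in range(max_lag + 1):
        diff = x[tau:] - x[:N - tau]
        V[tau] = np.mean(np.exp(-diff**2 / (2 * sigma**2))) / (np.sqrt(2 * np.pi) * sigma)
    return V

# Averaging V_hat(tau) over the lags gives an estimate of the information
# potential, the argument of Renyi's quadratic entropy, hence the name.
```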
56
Correntropy: A new generalized similarity measure
  • What does it look like? The sine wave example [figure]
57
Correntropy: A new generalized similarity measure
  • Properties of correntropy:
  • It has a maximum at the origin (equal to κ_σ(0), i.e. 1/(σ√(2π)) for the Gaussian kernel)
  • It is a symmetric positive function
  • Its mean value is the information potential
  • Correntropy includes higher-order moments of the data
  • The matrix whose elements are the correntropy at different lags is Toeplitz

58
Correntropy: A new generalized similarity measure
  • Correntropy as a cost function versus MSE (see the sketch below).
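A hedged sketch (mine, not from the slides) contrasting the two costs on a linear filter trained sample by sample: MSE gives the LMS update, while maximizing the correntropy between desired and output (the maximum correntropy criterion) weights each step by a Gaussian of the error, so large outlier errors barely move the weights:

```python
import numpy as np

def train_linear(X, d, rule="mcc", eta=0.1, sigma=1.0, epochs=20, seed=0):
    """Sample-by-sample training of y = w @ x under MSE (LMS) or the
    maximum correntropy criterion (MCC). X: (N, p), d: (N,)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])
    for _ in range(epochs):
        for x_k, d_k in zip(X, d):
            e_k = d_k - w @ x_k
            if rule == "mse":
                w += eta * e_k * x_k                              # LMS
            else:
                g = np.exp(-e_k**2 / (2 * sigma**2))              # Gaussian of the error
                w += (eta / sigma**2) * g * e_k * x_k             # MCC: outliers are attenuated
    return w
```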

59
Correntropy: A new generalized similarity measure
  • Correntropy induces a metric (CIM) in the sample space, defined by
    CIM(x, y) = ( κ_σ(0) - V(x, y) )^(1/2)
  • Therefore correntropy can be used as an alternative similarity criterion in the space of samples (see the sketch below).

Liu W., Pokharel P., Principe J., "Correntropy: Properties and Applications in Non-Gaussian Signal Processing", accepted in IEEE Trans. Signal Processing.
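A small Python sketch of the induced metric (my illustration), with V(x, y) the sample correntropy between the two vectors:

```python
import numpy as np

def cim(x, y, sigma):
    """Correntropy induced metric: sqrt( G_sigma(0) - (1/N) * sum_i G_sigma(x_i - y_i) ).
    Behaves like an L2 distance for small differences and saturates for large ones."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    g0 = 1.0 / (np.sqrt(2 * np.pi) * sigma)
    v = np.mean(np.exp(-(x - y)**2 / (2 * sigma**2))) / (np.sqrt(2 * np.pi) * sigma)
    return np.sqrt(g0 - v)
```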
60
RKHS induced by Correntropy: Definition
  • For a stochastic process {X_t, t ∈ T}, with T being an index set, the correntropy defined as
    V_X(t, s) = E[ κ(X_t - X_s) ]
    is symmetric and positive definite.
  • Thus it induces a new RKHS, denoted the VRKHS (H_V). There is a kernel mapping Φ such that
    V_X(t, s) = < Φ(t), Φ(s) >_{H_V}
  • Any symmetric non-negative definite kernel is the covariance kernel of a random function and vice versa (Parzen).
  • Therefore, given a random function {X_t, t ∈ T} there exists another random function {f_t, t ∈ T} such that
    E[ f_t f_s ] = V_X(t, s)

61
RKHS induced by Correntropy: Definition
  • This RKHS seems very appropriate for nonlinear signal processing.
  • In this space we can compute, using linear algorithms, systems that are nonlinear in the input space, such as:
  • Matched filters
  • Wiener filters
  • Principal Component Analysis
  • Solve constrained optimization problems
  • Do adaptive filtering and controls