Multiscale Analysis of Digital Data Bases (Matrices)

1
Multiscale Analysis of Digital Data Bases
(Matrices)
  • R. Coifman, M. Gavish
  • Department of Mathematics
  • Yale University

2
Our goal here is to describe the evolution
of a stream of ideas in Harmonic Analysis, ideas
which in the past have mostly been applied to
the analysis and extraction of information from
physical data, and which are now increasingly
applied to organize and extract information and
knowledge from any set of digital documents, from
text to music to questionnaires.
3
Consider the problem of unraveling the geometric
structure in a matrix. We view the columns or
the rows as collections of points in high
dimension whose geometry we need to define.
The matrix on the left is a permutation, in rows
and columns, of the matrix below it. The
challenge is to unravel the various simple
submatrices.
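A toy version of this challenge can be sketched in a few lines, assuming the hidden matrix is exactly constant on each block (real data is noisy and needs a learned geometry instead). The helper name `unscramble` is hypothetical; it simply sorts rows and then columns lexicographically so that identical rows and columns become adjacent.

```python
import numpy as np

def unscramble(M):
    """Reorder rows, then columns, so identical rows/columns sit next to each other.

    Only works for exactly block-constant matrices (a simplifying assumption).
    """
    row_order = np.lexsort(M.T[::-1])      # primary sort key: first column
    M_rows = M[row_order]
    col_order = np.lexsort(M_rows[::-1])   # primary sort key: first row
    return M_rows[:, col_order], row_order, col_order

# A 2x2 arrangement of constant blocks, scrambled by random permutations.
blocks = np.kron(np.array([[1, 2], [3, 4]]), np.ones((3, 4), dtype=int))
rng = np.random.default_rng(0)
scrambled = blocks[rng.permutation(6)][:, rng.permutation(8)]
recovered, _, _ = unscramble(scrambled)
```

Exact sorting breaks down as soon as entries are perturbed, which is why an affinity-based organization of rows and columns is needed for general matrices.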
4
We now describe a general methodology to extend
these ideas to general matrices, whether the
goal is organization of a data base to extract
knowledge, or to build a basis relative to which
a matrix is efficiently represented. We
illustrate the outcome of such organization on
the MMPI (Minnesota Multiphasic Personality
Inventory) questionnaire. The Tensor Haar Basis
enables filtering out anomalous responses, and
provides detailed analysis (pun intended).
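To make the filtering idea concrete, here is a minimal sketch under strong assumptions: rows and columns are already ordered along balanced binary trees with 2**k leaves, so the classical dyadic Haar basis can stand in for a data-driven one. The names `haar_matrix` and `tensor_haar_filter` are hypothetical, and keeping the largest coefficients is only one possible filtering rule.

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar basis (as rows) on n = 2**k points of a balanced binary tree."""
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2)
    coarse = np.kron(h, [1.0, 1.0])               # coarser functions, stretched to n points
    fine = np.kron(np.eye(n // 2), [1.0, -1.0])   # finest-scale differences
    return np.vstack([coarse, fine]) / np.sqrt(2.0)

def tensor_haar_filter(M, keep):
    """Expand M in the tensor product of row and column Haar bases,
    keep the `keep` largest coefficients (ties may keep more), and reconstruct."""
    Hr, Hc = haar_matrix(M.shape[0]), haar_matrix(M.shape[1])
    coeffs = Hr @ M @ Hc.T
    threshold = np.sort(np.abs(coeffs), axis=None)[-keep]
    coeffs[np.abs(coeffs) < threshold] = 0.0
    return Hr.T @ coeffs @ Hc

# Block-structured "questionnaire" with one anomalous response.
clean = np.kron(np.array([[1.0, 2.0], [3.0, 4.0]]), np.ones((4, 4)))
noisy = clean.copy()
noisy[0, 0] += 5.0
filtered = tensor_haar_filter(noisy, keep=3)
```

A matrix that is smooth with respect to both trees is captured by a few coefficients, while an isolated anomaly spreads its energy over many small ones, so thresholding suppresses it.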
5
(No Transcript)
6
(No Transcript)
7
(No Transcript)
8
(No Transcript)
9

Hierarchical organization of columns, people
(demographics), and rows, questions
(topics). This dual folder-building structure is
mutually supportive, and opens the door to
systematic analysis.
10
(No Transcript)
11
Demographics of the population at two scales of
profile similarity
12
The next slide represents a similar organization
in the vocabulary of a body of Science News
documents. The vocabulary is grouped by
functional usage within the documents. The
geometry of the vocabulary is presented in such a
way that the Euclidean distance in the display
represents the affinity of the words as measured
by the documents.
13
(No Transcript)
14
(No Transcript)
15
(No Transcript)
16
(No Transcript)
17
Learning rows and columns
18

The texture image above was treated like a
questionnaire; the folders at two scales are the
texture segments. Specifically, to each pixel we
associate the 8x8 patch around it; we then
compute the log of the absolute value of its
Fourier transform as a list of 64 questions.
(Joint work with Ali Haddad.)
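The feature construction just described can be sketched as follows. This is an illustration, not the authors' code: the function name `patch_questions`, the reflect padding at the image border, and the small constant `eps` (added to avoid taking the log of zero) are all assumptions.

```python
import numpy as np

def patch_questions(image, size=8, eps=1e-8):
    """Return an (H, W, size*size) array: for each pixel, the log of the absolute
    value of the 2-D Fourier transform of the size x size patch around it."""
    before = size // 2
    after = size - before - 1
    padded = np.pad(image.astype(float), ((before, after), (before, after)), mode="reflect")
    h, w = image.shape
    answers = np.empty((h, w, size * size))
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + size, j:j + size]
            answers[i, j] = np.log(np.abs(np.fft.fft2(patch)) + eps).ravel()
    return answers

# Each pixel of a 10x10 test image now "answers" 64 questions.
image = np.arange(100, dtype=float).reshape(10, 10)
answers = patch_questions(image)
```

Flattening the first two axes gives a pixels-by-questions matrix that can be organized like any other questionnaire.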
19
A hyperspectral image in which each pixel is
asked 128 spectral questions.
20
Multiscale Geometric Analysis on Digital data
clouds



Over the last few years we have seen a flurry of
activity in the machine learning community to
develop tools for information extraction, data
mining, search, denoising, etc. We claim that much
of this work can be reformulated as signal
processing of functions defined on point clouds
in dimension d (digital documents), where the
dimension d could be very large, in the millions.
It turns out that the main ideas and power of
multiscale Harmonic Analysis can easily be
translated into this context through the
introduction of data-driven scaling and
organizational mechanisms. To achieve this goal
we need to learn how to organize sensor outputs
or digital documents into scalable geometries,
where similar documents are linked by weights
reflecting their affinity.
21
(No Transcript)
22
Conventional nearest neighbor search compared
with a diffusion search. The data is a pathology
slide; each pixel is a digital document (spectrum
below for each class).
23
Diffusion Geometry
Diffusions between A and B have to go through the
bottleneck, while C is easily reachable from B.
The Markov matrix defining a diffusion could be
given by a kernel, or by inference (infection)
between neighboring nodes. The diffusion
distance accounts for the preponderance of
inference links of length t. The shortest path
between A and C is roughly the same as between B
and C. The diffusion distance between A and C,
however, is larger, since diffusion occurs
through a bottleneck.
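A small numerical illustration of this effect, as a sketch under stated assumptions: the graph is two 6-node cliques joined by a single bridge edge (a hypothetical stand-in for the figure), and the diffusion distance is taken as the weighted L2 distance between rows of the t-step Markov matrix, with weights given by the inverse stationary distribution.

```python
import numpy as np

def diffusion_distance(W, t):
    """Pairwise diffusion distances at time t for a symmetric affinity matrix W."""
    degree = W.sum(axis=1)
    P = W / degree[:, None]                  # Markov matrix of the random walk
    pi = degree / degree.sum()               # stationary distribution
    Pt = np.linalg.matrix_power(P, t)
    diff = Pt[:, None, :] - Pt[None, :, :]
    return np.sqrt((diff ** 2 / pi).sum(axis=2))

# Two cliques (nodes 0-5 and 6-11) joined by the bridge edge 5-6.
W = np.zeros((12, 12))
W[:6, :6] = 1.0
W[6:, 6:] = 1.0
np.fill_diagonal(W, 0.0)
W[5, 6] = W[6, 5] = 1.0

A, C, B = 5, 6, 7          # A and C face each other across the bridge; B shares C's clique
D = diffusion_distance(W, t=4)
```

Both A and B are direct neighbors of C, so their shortest-path distances to C agree, yet `D[A, C]` comes out noticeably larger than `D[B, C]` because almost all paths from A to C are funneled through the single bridge.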
24
A simple empirical diffusion matrix A can be
constructed as follows. Starting from normalized
data, we soft truncate the covariance matrix;
A is a renormalized Markov version of this
matrix. The eigenvectors of this matrix provide
a local nonlinear principal component analysis
of the data; their entries are the diffusion
coordinates. These are also the eigenfunctions
of the discrete graph Laplace operator.
The map given by the diffusion coordinates is a
diffusion embedding (at time t) into Euclidean
space.
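The construction can be sketched as follows. The formulas on this slide did not survive in the transcript, so the exponential soft truncation `exp(-(1 - <x_i, x_j>) / eps)`, the parameter `eps`, and the function name `diffusion_map` are assumptions chosen for illustration rather than the authors' exact definitions.

```python
import numpy as np

def diffusion_map(X, eps=0.5, n_coords=2, t=1):
    """Embed the rows of X using the leading non-trivial eigenvectors of a Markov matrix."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)    # normalize each data point
    K = np.exp(-(1.0 - X @ X.T) / eps)                  # soft-truncated correlations (assumed form)
    d = K.sum(axis=1)
    S = K / np.sqrt(np.outer(d, d))                     # symmetric conjugate of P = D^-1 K
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]                      # eigenvalues in decreasing order
    vals, vecs = vals[order], vecs[:, order]
    psi = vecs / np.sqrt(d)[:, None]                    # right eigenvectors of P
    coords = vals[1:n_coords + 1] ** t * psi[:, 1:n_coords + 1]
    return coords, vals

# Two groups of points on the unit circle.
angles = np.concatenate([np.linspace(0.0, 0.3, 10), np.linspace(2.0, 2.3, 10)])
X = np.column_stack([np.cos(angles), np.sin(angles)])
coords, vals = diffusion_map(X)
```

The top eigenvector is constant and is skipped; scaling the remaining eigenvectors by the eigenvalues raised to the power t gives the time-t embedding, in which Euclidean distance approximates diffusion distance.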
25
Another similar construction for empirical data
26
Observe that in general any positive kernel with
spectrum as above can give rise to a natural
orthogonal basis as well as a natural multiscale
analysis.
27
The first two eigenfunctions organize the small
images, which were provided in random order, in
fact assembling the 3D puzzle.
28
(No Transcript)
29
The long term diffusion of heterogeneous material
is remapped below. The left side has a higher
proportion of heat-conducting material, thereby
reducing the diffusion distance among points;
the bottleneck increases that distance.
30
Diffusion map into 3-d of the same heterogeneous
graph. The distance between two points measures
the diffusion between them.
31
Diffusion as a search mechanism. Starting with a
few labeled points in two classes, the points
are identified by the preponderance of
evidence. (Szummer, Slonim, Tishby)
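One simple way to read "preponderance of evidence" is sketched below. It is a simplified illustration, not the cited authors' method: each point runs a t-step random walk and is assigned to the class whose labeled points it is most likely to reach. The Gaussian kernel, `sigma`, `t`, and the name `diffuse_labels` are assumptions.

```python
import numpy as np

def diffuse_labels(X, labeled, sigma=0.5, t=8):
    """Classify every row of X from a few labeled indices via t steps of diffusion.

    labeled: dict mapping row index -> class id (0, 1, ...).
    """
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-sq_dist / (2.0 * sigma ** 2))
    P = K / K.sum(axis=1, keepdims=True)            # Markov matrix
    n_classes = max(labeled.values()) + 1
    Y = np.zeros((X.shape[0], n_classes))
    for index, cls in labeled.items():
        Y[index, cls] = 1.0
    evidence = np.linalg.matrix_power(P, t) @ Y     # chance of landing on each class's labels
    return evidence.argmax(axis=1)

# Two well-separated groups on a line, one labeled point in each.
x = np.concatenate([np.linspace(0.0, 1.0, 10), np.linspace(4.0, 5.0, 10)])
predicted = diffuse_labels(x[:, None], {0: 0, 19: 1})
```

Because the walk rarely crosses the gap between the groups, every point inherits the label of the labeled point in its own group.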
32
The image on the left is projected into the three
dimensional space spanned by the eigenvectors 5,
8, 10, which are active on the scarf.
The image above is viewed as a data base of all
sub-images of size 5x5; natural structures are
discovered through projections on various
subspaces.
33
The multiscale organization algorithm proceeds as
follows. Start with a disjoint partition of the
graph into clusters of diameter between 1 and 2
relative to the distance at scale 1. Consider
the new graph formed by letting the elements of
the partition be the vertices. Using the distance
between sets and affinity between sets described
above, we repeat.
34
On this graph we partition again into clusters of
diameter between 1 and 2 relative to the set
distance (we double the time scale) and redefine
the affinity between clusters of clusters using
the previously defined affinity between
subclusters. Iterate until only disjoint clusters
are left. Another approximate version of this
algorithm is to embed the data using a diffusion
map into Euclidean space and pull back a
Euclidean-based version of the above.
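The bottom-up procedure can be sketched as follows, with several assumptions made explicit: clusters are grown greedily as balls of the current radius around a seed (so their diameter is at most twice the radius, but the lower bound is not enforced), the distance between sets is taken to be the average pairwise distance, and the loop stops when a single cluster remains. The function names are hypothetical.

```python
import numpy as np

def partition(D, radius):
    """Greedy disjoint partition: each cluster is a ball of the given radius around a seed."""
    labels = -np.ones(D.shape[0], dtype=int)
    next_label = 0
    for seed in range(D.shape[0]):
        if labels[seed] == -1:
            members = (labels == -1) & (D[seed] <= radius)
            labels[members] = next_label
            next_label += 1
    return labels

def set_distance(D, labels):
    """Average pairwise distance between clusters (an assumed choice of set distance)."""
    n_clusters = labels.max() + 1
    M = np.zeros((n_clusters, labels.size))
    M[labels, np.arange(labels.size)] = 1.0
    M /= M.sum(axis=1, keepdims=True)
    Dc = M @ D @ M.T
    np.fill_diagonal(Dc, 0.0)
    return Dc

def multiscale_tree(D, radius=1.0):
    """Return one label array per level, coarsening until a single cluster is left."""
    levels = []
    while D.shape[0] > 1:
        labels = partition(D, radius)
        levels.append(labels)
        D = set_distance(D, labels)     # new graph: clusters become vertices
        radius *= 2.0                   # double the scale
    return levels

points = np.array([0.0, 0.5, 1.0, 10.0, 10.5, 30.0])
levels = multiscale_tree(np.abs(points[:, None] - points[None, :]))
```

Each entry of `levels` maps the vertices of one level to their parent folder at the next, which is the tree structure that a Haar-type basis can then be built on.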
35
4 Gaussian Clouds
36
A simple example: black disk on white background
  • Above are represented the first 4 prolates in the
    image space (image domain vs. prolate value).
  • Prolates 1 and 2 capture the ratio of black
    pixels over white pixels.
  • Prolates 3 and 4 capture the angle θ.
  • Locally, 2 prolates are sufficient to describe
    the data.

37
If a set in high dimensions can be parametrized
by, say, the unit square in 2 dimensions, such a
parametrization will define an induced metric on
the square. For example, the set of images of
8x8 squares below is naturally parametrized by
the average and the orientation of the edge. Their
distance in 64-d is roughly the square root of
the usual metric.
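The square-root behavior can be checked on a simplified family, assuming binary 8x8 images of a vertical edge (orientation held fixed) whose only parameter is the average gray level. Moving the edge by k columns changes 8k pixels, so the distance in 64 dimensions is sqrt(8k) while the parameter changes by k/8. The helper names are hypothetical.

```python
import numpy as np

def edge_image(k, size=8):
    """size x size image of a vertical edge: first k columns black (0), the rest white (1)."""
    image = np.ones((size, size))
    image[:, :k] = 0.0
    return image

def compare(k1, k2):
    """Return (distance between the averages, Euclidean distance in 64 dimensions)."""
    a, b = edge_image(k1), edge_image(k2)
    parameter_distance = abs(a.mean() - b.mean())
    ambient_distance = np.linalg.norm(a - b)
    return parameter_distance, ambient_distance

parameter_distance, ambient_distance = compare(2, 6)
```

For this family the ambient distance equals 8 times the square root of the parameter distance, so nearby parameters are pushed apart in 64 dimensions.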
38
(No Transcript)
39
(No Transcript)
40
(No Transcript)
41
(No Transcript)
42
References
[1] E. Stein, Topics in Harmonic Analysis Related to the Littlewood-Paley Theory, Princeton University Press, 1970.
[2] R. Coifman and G. Weiss, Analyse Harmonique Noncommutative sur Certains Espaces Homogenes, Springer-Verlag, 1971.
[3] R. Coifman and G. Weiss, Extensions of Hardy spaces and their use in analysis, Bull. of the A.M.S., 83, 4, 1977, 569-645.
[4] M. Belkin and P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, Advances in Neural Information Processing Systems 14 (NIPS 2001), p. 585, 2001.
[5] M. Belkin and P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation, 6, 1373-1396, 2003.
[6] R. R. Coifman, S. Lafon, A. Lee, M. Maggioni, B. Nadler, F. Warner, and S. Zucker, Geometric diffusions as a tool for harmonic analysis and structure definition of data. Part I: Diffusion maps, Proc. of Nat. Acad. Sci., 7426-7431, 2005.
[7] R. R. Coifman and S. Lafon, Diffusion maps, Applied and Computational Harmonic Analysis, 21, 5-30, 2006.
[8] R. R. Coifman, B. Nadler, S. Lafon, and I. G. Kevrekidis, Diffusion maps, spectral clustering and reaction coordinates of dynamical systems, Applied and Computational Harmonic Analysis, 21, 113-127, 2006.
[9] R. R. Coifman, M. Maggioni, S. W. Zucker, and I. G. Kevrekidis, Geometric diffusions for the analysis of data from sensor networks, Current Opinion in Neurobiology, 15, 576-584, 2005.
[10] J. Ham, D. D. Lee, S. Mika, and B. Scholkopf, A kernel view of the dimensionality reduction of manifolds, in Proceedings of the XXI Conference on Machine Learning, Banff, Canada, 2004.