Title: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map
- Daniel X. Pape
- Community Architectures for Network Information
Systems - dpape_at_canis.uiuc.edu
- www.canis.uiuc.edu
- CSNA98
- 6/18/98
Slide 2: Overview
- Self-Organizing Map (SOM) Algorithm
- U-Matrix Algorithm for SOM Visualization
- SOM Navigation Application
- Document Representation and Collection Examples
- Problems and Optimizations
- Future Work
Slide 3: Basic SOM Algorithm
- Input
  - Number (n) of feature vectors (x)
  - Format: vector name: a, b, c, d
  - Examples:
    - 1: 0.1, 0.2, 0.3, 0.4
    - 2: 0.2, 0.3, 0.3, 0.2
Slide 4: Basic SOM Algorithm
- Output
  - Neural network map of (M) nodes
  - Each node has an associated weight vector (m) of the same dimensionality as the input feature vectors
  - Examples:
    - m1: 0.1, 0.2, 0.3, 0.4
    - m2: 0.2, 0.3, 0.3, 0.2
Slide 5: Basic SOM Algorithm
- Output (cont.)
- Nodes laid out in a grid
Slide 6: Basic SOM Algorithm
- Other Parameters
- Number of timesteps (T)
- Learning Rate (eta)
Slide 7: Basic SOM Algorithm

    SOM():
        foreach timestep t:
            foreach feature vector x:
                wnode = find_winning_node(x)
                update_local_neighborhood(wnode, x)

    find_winning_node(x):
        foreach node n:
            compute the distance of its weight vector m to x
        return the node with the smallest distance

    update_local_neighborhood(wnode, x):
        foreach node n in the neighborhood of wnode:
            m = m + eta * (x - m)
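The pseudocode above can be sketched as runnable Python. This is a minimal dense implementation; the grid size, learning rate, and fixed square neighborhood radius are illustrative assumptions (real SOM implementations typically shrink eta and the radius over time):

```python
import numpy as np

def train_som(data, rows=8, cols=8, timesteps=20, eta=0.5, radius=2, seed=0):
    """Minimal SOM training loop following the pseudocode above.

    data: (n, d) array of feature vectors.
    Returns a (rows, cols, d) array of weight vectors.
    """
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, data.shape[1]))
    for _t in range(timesteps):
        for x in data:
            # find_winning_node: node whose weight vector is closest to x
            dists = np.linalg.norm(weights - x, axis=2)
            wr, wc = np.unravel_index(np.argmin(dists), dists.shape)
            # update_local_neighborhood: pull nearby weights toward x
            for r in range(max(0, wr - radius), min(rows, wr + radius + 1)):
                for c in range(max(0, wc - radius), min(cols, wc + radius + 1)):
                    weights[r, c] += eta * (x - weights[r, c])
    return weights
```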
Slide 8: U-Matrix Visualization
- Provides a simple way to visualize cluster boundaries on the map
- Simple algorithm:
  - for each node in the map, compute the average of the distances between its weight vector and those of its immediate neighbors
  - this average distance is a measure of how similar a node is to its neighbors
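With the weights stored as a (rows, cols, d) array, the algorithm above is only a few lines of Python. This sketch assumes a 4-connected rectangular grid; the original map may have used a hexagonal or 8-connected neighborhood:

```python
import numpy as np

def u_matrix(weights):
    """Average distance from each node's weight vector to those of its
    immediate (4-connected) neighbors, as described on the slide."""
    rows, cols, _ = weights.shape
    um = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(np.linalg.norm(weights[r, c] - weights[nr, nc]))
            um[r, c] = sum(dists) / len(dists)
    return um
```

High values mark boundaries between clusters; low values mark cluster interiors, which is exactly what the greyscale or terrain rendering on the next slides visualizes.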
Slide 9: U-Matrix Visualization
- Interpretation
  - the U-Matrix measurements can be encoded as greyscale values in an image, or as altitudes on a terrain
  - the result is a landscape representing the document space: the valleys (dark areas) are the clusters of data, and the mountains (light areas) are the boundaries between the clusters
Slide 10: U-Matrix Visualization
- Example
  - dataset of random three-dimensional points, arranged in four obvious clusters
Slide 11: U-Matrix Visualization
Four (color-coded) clusters of three-dimensional
points
Slide 12: U-Matrix Visualization
Oblique projection of a terrain derived from the
U-Matrix
Slide 13: U-Matrix Visualization
Terrain for a real document collection
Slide 14: Current Labeling Procedure
- Feature vectors are encoded as 0s and 1s
- Weight vectors have real values from 0 to 1
- Sort each weight vector's dimensions by element value
  - the dimension with the greatest value gives the best noun phrase for that node
- Aggregate nodes with the same best noun phrase into groups
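The labeling rule above amounts to an argmax per node followed by a group-by. A sketch, with an illustrative phrase list and grouping container:

```python
import numpy as np
from collections import defaultdict

def label_nodes(weights, phrases):
    """Assign each node its best noun phrase (the dimension with the greatest
    weight value) and aggregate nodes sharing a label into groups."""
    groups = defaultdict(list)
    rows, cols, _ = weights.shape
    for r in range(rows):
        for c in range(cols):
            best_dim = int(np.argmax(weights[r, c]))
            groups[phrases[best_dim]].append((r, c))
    return dict(groups)
```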
Slide 15: U-Matrix Navigation
- 3D Space-Flight
- Hierarchical Navigation
Slide 16: Document Data
- Noun phrases extracted
- Set of unique noun phrases computed
  - each noun phrase becomes a dimension of the data set
- Each document is represented by a binary vector, with a 1 or a 0 denoting the presence or absence of each noun phrase
Slide 17: Document Data
- Example
  - 10 total noun phrases:
    - alexander, king, macedonians, darius, philip, horse, soldiers, battle, army, death
  - each element of the feature vector will be a 1 or a 0
  - 1: 1, 1, 0, 0, 1, 1, 0, 0, 0, 0
  - 2: 0, 1, 0, 1, 0, 0, 1, 1, 1, 1
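The encoding in this example is straightforward to reproduce; a sketch, using the ten-phrase vocabulary above:

```python
def binary_vector(doc_phrases, vocabulary):
    """1/0 feature vector marking which vocabulary noun phrases a document
    contains, in vocabulary order."""
    present = set(doc_phrases)
    return [1 if phrase in present else 0 for phrase in vocabulary]
```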
Slide 18: Document Collection Examples
Slide 19: Problems
- As document sets get larger, the feature vectors get longer and use more memory
- Execution time grows to unrealistic lengths
Slide 20: Solutions?
- Need algorithm refinements for sparse feature vectors
- Need a faster way to do the find_winning_node() computation
- Need a better way to do the update_local_neighborhood() computation
Slide 21: Sparse Vector Optimization
- Intelligent support for sparse feature vectors
- saves on memory usage
- greatly improves speed of the weight vector
update computation
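For binary feature vectors it suffices to store the indices of the 1s; the distance needed by find_winning_node() can then be computed without materializing the dense vector. A sketch of one such refinement (not necessarily the one the authors used):

```python
import numpy as np

def sq_distance_sparse(m, one_indices):
    """Squared Euclidean distance between a dense weight vector m and a
    binary feature vector given only by the set of its 1-valued indices."""
    m = np.asarray(m, dtype=float)
    idx = list(one_indices)
    # ||m - x||^2 = ||m||^2 - 2 * sum_{i in ones} m[i] + |ones|
    return float(m @ m - 2.0 * m[idx].sum() + len(idx))
```

The cost is proportional to the number of 1s (plus the ||m||^2 term, which can be cached per node), rather than to the full vocabulary size.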
Slide 22: Faster find_winning_node()
- SOM weight vectors become partially ordered very
quickly
Slide 23: Faster find_winning_node()
U-Matrix Visualization of an Initial, Unordered
SOM
Slide 24: Faster find_winning_node()
Partially Ordered SOM after 5 timesteps
Slide 25: Faster find_winning_node()
- Don't do a global search for the winner
- Start the search from the last known winner position
- Pro:
  - usually finds the new winner very quickly
- Con:
  - this local search can sometimes get stuck in a local minimum
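The local search described above can be sketched as a hill-climb over the grid. This assumes the same (rows, cols, d) weight array as earlier; as the slide notes, it can stop at a local minimum when the map is not yet ordered:

```python
import numpy as np

def find_winner_local(weights, x, start):
    """Hill-climb from `start`: repeatedly move to the neighbor whose weight
    vector is closer to x, stopping when no neighbor improves."""
    rows, cols, _ = weights.shape
    r, c = start
    while True:
        best = (r, c)
        best_d = np.linalg.norm(weights[r, c] - x)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                d = np.linalg.norm(weights[nr, nc] - x)
                if d < best_d:
                    best, best_d = (nr, nc), d
        if best == (r, c):
            return best
        r, c = best
```

On a partially ordered map the walk is short, so each lookup touches only a handful of nodes instead of the whole grid.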
Slide 26: Better Neighborhood Update
- Nodes get told to update quite often
- A node's weight vector is made public only during a find_winning_node() search
- With the local find_winning_node() search, a lazy neighborhood weight vector update can be performed
Slide 27: Better Neighborhood Update
- Cache update requests
  - each node stores the winning node and feature vector for each update request
- The node performs the update computations called for by the stored requests only when asked for its weight vector
- The number of requests can possibly be reduced by averaging the feature vectors in the cache
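One way to realize this caching scheme, as a sketch: the names are illustrative, the winning-node bookkeeping is omitted for brevity, and a single update against the cache average is assumed to approximate the sequence of individual updates:

```python
import numpy as np

class LazyNode:
    """SOM node that caches update requests and applies them only when its
    weight vector is actually read."""

    def __init__(self, weight, eta=0.5):
        self._weight = np.asarray(weight, dtype=float)
        self._cache = []   # feature vectors from pending update requests
        self.eta = eta

    def request_update(self, fv):
        # Just record the request; no computation happens yet.
        self._cache.append(np.asarray(fv, dtype=float))

    def weight(self):
        # Reduce the cached requests by averaging, then apply one update.
        if self._cache:
            avg = np.mean(self._cache, axis=0)
            self._weight = self._weight + self.eta * (avg - self._weight)
            self._cache = []
        return self._weight
```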
Slide 28: New Execution Times
Slide 29: Future Work
- Parallelization
- Label Problem
Slide 30: Label Problem
- The current procedure is not very good
- Cluster boundaries
- Term selection
Slide 31: Cluster Boundaries
- Image processing
- Geometric
Slide 32: Cluster Boundaries
Slide 33: Term Selection
- Too many unique noun phrases
- Too many dimensions in the feature vector data
- Knee of frequency curve