Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map (presentation transcript)

1
Visualization and Navigation of Document
Information Spaces Using a Self-Organizing Map
  • Daniel X. Pape
  • Community Architectures for Network Information
    Systems
  • dpape_at_canis.uiuc.edu
  • www.canis.uiuc.edu
  • CSNA98
  • 6/18/98

2
Overview
  • Self-Organizing Map (SOM) Algorithm
  • U-Matrix Algorithm for SOM Visualization
  • SOM Navigation Application
  • Document Representation and Collection Examples
  • Problems and Optimizations
  • Future Work

3
Basic SOM Algorithm
  • Input: a number (n) of feature vectors (x)
  • Format: vector name: a, b, c, d
  • Examples:
    • 1: 0.1, 0.2, 0.3, 0.4
    • 2: 0.2, 0.3, 0.3, 0.2

4
Basic SOM Algorithm
  • Output: a neural-network map of (M) nodes
  • Each node has an associated weight vector (m) of
    the same dimensionality as the input feature
    vectors
  • Examples:
    • m1: 0.1, 0.2, 0.3, 0.4
    • m2: 0.2, 0.3, 0.3, 0.2

5
Basic SOM Algorithm
  • Output (cont.)
  • Nodes laid out in a grid

6
Basic SOM Algorithm
  • Other Parameters
  • Number of timesteps (T)
  • Learning Rate (eta)

7
Basic SOM Algorithm
SOM()
  foreach timestep t
    foreach feature vector fv
      wnode = find_winning_node(fv)
      update_local_neighborhood(wnode)

find_winning_node(fv)
  foreach node n
    compute distance of its weight vector m to fv
  return the node with the smallest distance

update_local_neighborhood(wnode)
  foreach node n in the neighborhood of wnode
    m = m + eta * (x - m)
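
The following is a minimal runnable sketch of this pseudocode in Python/NumPy. It assumes a rectangular grid, Euclidean distance, and a fixed square neighborhood; the talk does not pin down these choices, and in practice eta and the neighborhood radius usually shrink over the timesteps:

import numpy as np

def som(features, rows, cols, T, eta=0.1, radius=2):
    _, dim = features.shape
    rng = np.random.default_rng(0)
    weights = rng.random((rows, cols, dim))  # one weight vector m per node

    for t in range(T):
        for x in features:
            # find_winning_node: the node whose weight vector is closest to x
            dists = np.linalg.norm(weights - x, axis=2)
            wr, wc = np.unravel_index(np.argmin(dists), dists.shape)
            # update_local_neighborhood: m = m + eta * (x - m) near the winner
            r0, r1 = max(0, wr - radius), min(rows, wr + radius + 1)
            c0, c1 = max(0, wc - radius), min(cols, wc + radius + 1)
            weights[r0:r1, c0:c1] += eta * (x - weights[r0:r1, c0:c1])
    return weights
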
8
U-Matrix Visualization
  • Provides a simple way to visualize cluster
    boundaries on the map
  • Simple algorithm: for each node in the map,
    compute the average of the distances between its
    weight vector and those of its immediate neighbors
  • This average distance is a measure of the
    similarity between a node and its neighbors
    (a sketch follows below)
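
A sketch of this computation in Python, reusing the (rows, cols, dim) weight array from the SOM sketch above and assuming 4-connected grid neighbors:

import numpy as np

def u_matrix(weights):
    rows, cols, _ = weights.shape
    um = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            # average distance from this node's weight vector to its neighbors
            neighbors = [(r + dr, c + dc)
                         for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                         if 0 <= r + dr < rows and 0 <= c + dc < cols]
            um[r, c] = np.mean([np.linalg.norm(weights[r, c] - weights[nr, nc])
                                for nr, nc in neighbors])
    return um  # high values mark cluster boundaries, low values interiors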

9
U-Matrix Visualization
  • Interpretation
  • one can encode the U-Matrix measurements as
    greyscale values in an image, or as altitudes on
    a terrain
  • the result is a landscape that represents the
    document space: the valleys (dark areas) are the
    clusters of data, and the mountains (light areas)
    are the boundaries between the clusters

10
U-Matrix Visualization
  • Example
  • a dataset of random three-dimensional points,
    arranged in four obvious clusters

11
U-Matrix Visualization
Four (color-coded) clusters of three-dimensional
points
12
U-Matrix Visualization
Oblique projection of a terrain derived from the
U-Matrix
13
U-Matrix Visualization
Terrain for a real document collection
14
Current Labeling Procedure
  • Feature vectors are encoded as 0s and 1s
  • Weight vectors have real values from 0 to 1
  • Sort each weight vector's dimensions by element
    value
  • the dimension with the greatest value gives the
    best noun phrase for that node
  • Aggregate nodes with the same best noun phrase
    into groups (see the sketch below)
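
A sketch of this labeling step in Python, assuming a noun_phrases list that names each weight-vector dimension (the data layout is illustrative, not taken from the talk):

import numpy as np
from collections import defaultdict

def label_nodes(weights, noun_phrases):
    rows, cols, _ = weights.shape
    groups = defaultdict(list)
    for r in range(rows):
        for c in range(cols):
            best = int(np.argmax(weights[r, c]))  # dimension with greatest value
            groups[noun_phrases[best]].append((r, c))
    return groups  # best noun phrase -> grid positions sharing that label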

15
U-Matrix Navigation
  • 3D Space-Flight
  • Hierarchical Navigation

16
Document Data
  • Noun phrases are extracted from each document
  • The set of unique noun phrases is computed
  • each noun phrase becomes a dimension of the data
    set
  • Each document is represented by a binary vector,
    with a 1 or a 0 denoting the presence or absence
    of each noun phrase

17
Document Data
  • Example: 10 total noun phrases
  • alexander, king, macedonians, darius, philip,
    horse, soldiers, battle, army, death
  • each element of the feature vector is a 1 or a 0
    (see the sketch below):
    • document 1: 1, 1, 0, 0, 1, 1, 0, 0, 0, 0
    • document 2: 0, 1, 0, 1, 0, 0, 1, 1, 1, 1
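
A sketch of this encoding in Python, assuming each document has already been reduced to its set of noun phrases (the extraction step itself is not shown); it reproduces the two example vectors above:

def binary_vectors(doc_phrases, vocabulary):
    # doc_phrases: one set of noun phrases per document
    # vocabulary: ordered list of all unique noun phrases
    return [[1 if phrase in doc else 0 for phrase in vocabulary]
            for doc in doc_phrases]

docs = [{"alexander", "king", "philip", "horse"},
        {"king", "darius", "soldiers", "battle", "army", "death"}]
vocab = ["alexander", "king", "macedonians", "darius", "philip",
         "horse", "soldiers", "battle", "army", "death"]
print(binary_vectors(docs, vocab))
# [[1, 1, 0, 0, 1, 1, 0, 0, 0, 0], [0, 1, 0, 1, 0, 0, 1, 1, 1, 1]]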

18
Document Collection Examples
19
Problems
  • As document sets get larger, the feature vectors
    get longer and use more memory
  • Execution time grows unrealistically long

20
Solutions?
  • Need algorithm refinements for sparse feature
    vectors
  • Need a faster way to do the find_winning_node()
    computation
  • Need a better way to do the
    update_local_neighborhood() computation

21
Sparse Vector Optimization
  • Intelligent support for sparse feature vectors
  • saves on memory usage
  • greatly improves the speed of the weight vector
    update computation (see the sketch below)
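
One way to exploit this sparsity, sketched as an assumption rather than the author's actual implementation: store each binary vector as the index set of its 1s, precompute ||m||^2 per node, and expand ||x - m||^2 so the distance computation touches only the nonzero dimensions:

import numpy as np

def sparse_sq_distance(ones, m, m_sq_norm):
    # ones: indices where the binary feature vector x is 1
    # m_sq_norm: precomputed squared norm of this node's weight vector
    # ||x - m||^2 = ||x||^2 - 2*x.m + ||m||^2
    #             = ||m||^2 + sum over i in ones of (1 - 2 * m[i])
    return m_sq_norm + np.sum(1.0 - 2.0 * m[list(ones)])

The update m = m + eta * (x - m) can avoid touching every dimension in a similar way, for example by storing each weight vector as a scalar times a dense vector, folding the (1 - eta) factor into the scalar, and adjusting only the dimensions where x is 1.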

22
Faster find_winning_node()
  • SOM weight vectors become partially ordered very
    quickly

23
Faster find_winning_node()
U-Matrix Visualization of an Initial, Unordered
SOM
24
Faster find_winning_node()
Partially Ordered SOM after 5 timesteps
25
Faster find_winning_node()
  • Don't do a global search for the winner
  • Start the search from the last known winner's
    position (sketched below)
  • Pro
  • usually finds a new winner very quickly
  • Con
  • this new search for a winner can sometimes get
    stuck in a local minimum
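
A sketch of such a local search in Python, assuming a simple hill-climbing walk over the grid (the talk does not specify the exact search strategy):

import numpy as np

def find_winner_local(weights, x, start):
    # Walk from the last known winner toward any neighbor closer to x,
    # stopping when no neighbor improves (this is the local-minimum risk).
    rows, cols, _ = weights.shape
    r, c = start
    best = np.linalg.norm(weights[r, c] - x)
    improved = True
    while improved:
        improved = False
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                d = np.linalg.norm(weights[nr, nc] - x)
                if d < best:
                    best, r, c, improved = d, nr, nc, True
    return r, c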

26
Better Neighborhood Update
  • Nodes are told to update quite often
  • A node's weight vector is read only during a
    find_winning_node() search
  • With the local find_winning_node() search, a lazy
    neighborhood weight vector update can be performed

27
Better Neighborhood Update
  • Cache update requests
  • each node stores the winning node and feature
    vector for each update request
  • The node performs the update computations called
    for by the stored requests only when it is asked
    for its weight vector (see the sketch below)
  • The number of requests can be reduced by
    averaging the feature vectors in the cache
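
A sketch of this lazy scheme in Python; the class name and structure are illustrative, not taken from the talk:

import numpy as np

class LazyNode:
    def __init__(self, dim, rng):
        self._m = rng.random(dim)
        self._pending = []  # cached update requests: (eta, feature vector)

    def request_update(self, eta, x):
        self._pending.append((eta, x))  # just cache it; no computation yet

    @property
    def m(self):
        # Flush the cache only when the weight vector is actually read.
        # Averaging the cached x values first would reduce this to one update.
        for eta, x in self._pending:
            self._m += eta * (x - self._m)
        self._pending.clear()
        return self._m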

28
New Execution Times
29
Future Work
  • Parallelization
  • Label Problem

30
Label Problem
  • The current labeling procedure is not very good
  • Cluster boundaries
  • Term selection

31
Cluster Boundaries
  • Image processing
  • Geometric

32
Cluster Boundaries
  • Image processing example

33
Term Selection
  • Too many unique noun phrases
  • Too many dimensions in the feature vector data
  • Select terms near the knee of the frequency curve
    (one possible reading is sketched below)
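
One plausible reading of the frequency-curve idea, sketched with hypothetical cutoffs (the talk does not define where the knee lies): rank noun phrases by document frequency and keep only a mid-frequency band, dropping phrases too rare to cluster on and too common to discriminate:

from collections import Counter

def select_terms(doc_phrases, min_df=2, max_df_ratio=0.5):
    # Keep noun phrases occurring in at least min_df documents but in
    # no more than max_df_ratio of them (both cutoffs are hypothetical).
    df = Counter(p for doc in doc_phrases for p in set(doc))
    limit = max_df_ratio * len(doc_phrases)
    return [p for p, n in df.items() if min_df <= n <= limit]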