1
Self Organization, Classification, and Matching of Hierarchical Data
23 October 2002
Diego Sona
2
Outline
  • Problem Description (Distributed Personal
    Knowledge)
  • Idea of a System
  • Bootstrap and Refinement
  • Introduction to Self Organizing Maps (SOMs)
  • A Solution for Bootstrap
  • Taxonomic SOM
  • Sketch of Supervised Learning Matching

3
Distributed Personal Knowledge
  • Ideally, each user (re-)organizes their own
    knowledge base according to personal preferences
  • In a long-term scenario, huge knowledge bases
    will be spread among many smaller, locally
    organized knowledge bases
  • For example, the Web directory can be seen as a
    collection of specialized and/or competing
    directories

4
Needs for Personal KM
  • Tools supporting the user while creating,
    (re-)organizing, and maintaining (huge)
    structured collections of data.
  • Tools that help and improve the search for and
    the exchange of information between distributed
    (structured) knowledge bases

5
Supporting Directory Creation
  • Directory creation is performed in two steps
  • Bootstrap phase
    • Automatic annotation of labeled taxonomies with
      flat sets of data, helping the user design their
      own data structures
    • The user can then remove wrongly distributed
      documents
  • Refinement phase
    • Learn the correct document/concept relations,
      according to the user corrections performed
      during the bootstrap phase

6
The Bootstrap
  • The user searches for an arrangement of a given
    set of documents in a labeled taxonomy
  • The data arrangement should be performed
    automatically
  • When?
    • At the beginning, when the user organizes their
      own flat data into structured taxonomies
    • At run time, when the user changes their
      taxonomies (structure and/or labels) and wants
      to populate the new taxonomies with the old
      documents

7
The Bootstrap Hypothesis
  • The user has
    • an unorganized (flat) personal knowledge base
    • a labeled taxonomy specified using a descriptive
      language (e.g. CTXML)
  • The user wants a tool that automatically organizes
    the data according to the chosen taxonomy

[Diagram: the flat document collection and the labeled
taxonomy are fed to the Bootstrap Tool, which returns the
annotated taxonomy]
8
The User Correction
  • The user removes wrongly assigned documents from
    the taxonomy
  • The number of removed documents should be minimal

[Diagram: the Primates taxonomy (Primates → Apes →
Chimpanzees, Gorillas; Primates → Monkey) shown before and
after the user removes wrongly assigned documents]
9
The Refinement
  • Given an annotated taxonomy, the user needs a
    tool for automatic assignment of new documents to
    the correct concept.
  • The tool is trained with a labeled taxonomy and
    then used to classify new documents into the
    correct node (concept)

[Diagram: the annotated taxonomy trains the Refinement
Tool; the trained Refinement Tool then assigns a new
document (e.g. "human") to the correct node]
10
System Schema
[Diagram of the system schema: the flat documents and the
Primates taxonomy enter the Bootstrap Tool; after the user
correction, the annotated taxonomy trains the Refinement
Tool, which produces the Taxonomy Model]
11
Run Time Advantages
  • Locally
    • Easy maintenance of taxonomies by automatic
      assignment of documents to concepts (within the
      correct context), helping the user categorize
      new documents
  • Globally
    • Matching between concepts, performed using
      labels, documents, and structural information

12
Introduction to Self Organizing Maps (SOMs)
  • Unsupervised algorithm for data clustering and
    projection
  • Based on a regular, low-dimensional (usually 2-D)
    lattice of k neurons (models)
  • Models are represented by their coordinates and
    described by their weights w1,...,wn
  • A SOM is a function mapping the input space onto a
    discrete output display
  • It projects high-dimensional input vectors onto
    the grid, maintaining similarity relationships
    among the data (see the sketch below)

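An illustrative Python sketch of the structure just described: a lattice of k neurons, each with grid coordinates and a weight vector, plus the mapping from an input vector to its best-matching unit on the grid. The class and names below are assumptions for illustration, not from the presentation.

```python
import numpy as np

class SOM:
    """Regular 2-D lattice of k = rows * cols neurons (models)."""
    def __init__(self, rows, cols, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Each model has fixed grid coordinates and a weight vector w_1..w_n.
        self.coords = np.array([(r, c) for r in range(rows)
                                for c in range(cols)], dtype=float)
        self.weights = rng.random((rows * cols, dim))

    def best_matching_unit(self, x):
        # Project an input vector onto the grid: the index of the model
        # closest to x in the input space.
        return int(np.argmin(np.linalg.norm(self.weights - x, axis=1)))
```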
13
SOM Usage
  • Pattern recognition / vector quantization
    • the pattern is transformed into the closest
      codebook vector
  • Data compression
    • the pattern is transformed into the index of the
      winner unit
  • Projection and exploration
    • the data distribution is visualized in a
      lower-dimensional space

14
SOM Learning
  • The learning algorithm creates a mapping that
    preserves the topological relationships among the
    input vectors
  • Initial weights are chosen randomly and
    iteratively updated
  • A sample vector x is chosen randomly from the data
    set and provided to the SOM as input
  • Learning is composed of two interleaved stages
    (see the sketch below)
    • Competitive stage -- the winner unit is chosen
      on the basis of the distances between the
      pattern and the models
    • Cooperative stage -- the weights of the winner
      unit and of its neighboring units are moved
      closer to the input vector

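The two stages could be sketched as follows, reusing the SOM class from the sketch above; the learning-rate and neighborhood-radius schedules are simple placeholder choices, not those of the original work.

```python
import numpy as np

def train(som, data, epochs=10, alpha0=0.5, sigma0=3.0, seed=0):
    """Interleave the competitive and cooperative stages over random samples."""
    rng = np.random.default_rng(seed)
    steps = epochs * len(data)
    for t in range(steps):
        x = data[rng.integers(len(data))]      # random sample from the data set
        frac = t / steps
        alpha = alpha0 * (1.0 - frac)          # decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 0.5    # shrinking neighborhood radius
        # Competitive stage: choose the winner by distance to the models.
        c = som.best_matching_unit(x)
        # Cooperative stage: pull the winner and its neighbors toward x,
        # weighted by a neighborhood function (see the next slide).
        d = np.linalg.norm(som.coords - som.coords[c], axis=1)
        h = np.exp(-d ** 2 / (2 * sigma ** 2))
        som.weights += alpha * h[:, None] * (x - som.weights)
```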
15
Propagation of Training Information
  • The propagation of information during training is
    controlled by the neighborhood function h_c,i(t),
    where c is the winner unit and i any other unit,
    for example:
  • Gaussian
  • Kronecker
  • etc.

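In the notation of the sketches above, with d the grid distance between the winner c and unit i, the two named kernels could take these standard textbook forms (the presentation does not show its exact formulas):

```python
import numpy as np

def h_gaussian(d, sigma):
    """Smooth propagation: nearby units learn almost as much as the winner."""
    return np.exp(-d ** 2 / (2 * sigma ** 2))

def h_kronecker(d, radius=0.0):
    """Hard cutoff: only the winner (or units within `radius`) is updated."""
    return (d <= radius).astype(float)
```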
16
Example of Word Clustering
17
Hierarchical SOMs
  • SOMs can be organized hierarchically

18
WebSOM
  • Used to self-organize massive document collections
    (European, U.S., and Japanese patents)
  • Based on a growing approach
  • Documents are represented as vectors (weighted
    frequencies of words in a vocabulary)
  • The large vocabulary is reduced (manually or
    automatically) by selecting a subset of
    representative words (Latent Semantic Indexing,
    random projection, a special SOM)

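A sketch of such a document encoding: IDF-weighted word frequencies over a vocabulary, followed by random projection, one of the reduction options listed above. The weighting scheme and the target dimension are illustrative assumptions, not WebSOM's actual parameters.

```python
import numpy as np
from collections import Counter

def encode(tokens, vocab_index, idf):
    """Vector of IDF-weighted word frequencies, L2-normalized."""
    v = np.zeros(len(vocab_index))
    for word, count in Counter(tokens).items():
        if word in vocab_index:
            v[vocab_index[word]] = count * idf.get(word, 1.0)
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def random_projection(doc_matrix, target_dim=300, seed=0):
    """Reduce the vocabulary dimension by projecting onto a random subspace."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(doc_matrix.shape[1], target_dim)) / np.sqrt(target_dim)
    return doc_matrix @ R
```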
19
Example of WebSOM Result
20
Back to Bootstrap
  • The proposal is to use SOMs for the self
    organization of documents in a given labeled
    taxonomy
  • Instead of a standard SOM (lattice), we build a
    SOM whose topology equals the given taxonomy (see
    the sketch below)
[Diagram: a SOM whose lattice is replaced by the taxonomy
tree Primates → Apes (→ Chimpanzees, Gorillas) and Monkey]
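One way to read "topology equal to the taxonomy" in code: one unit per taxonomy node, with the grid distance of the earlier sketches replaced by the hop distance on the tree. The taxonomy is the slide's example; the implementation details are assumptions.

```python
import numpy as np
from collections import deque

# The taxonomy from the slide's example.
edges = {
    "Primates": ["Apes", "Monkey"],
    "Apes": ["Chimpanzees", "Gorillas"],
    "Monkey": [], "Chimpanzees": [], "Gorillas": [],
}

def tree_distances(edges):
    """All-pairs hop counts over the undirected taxonomy tree (BFS per node)."""
    adj = {n: list(children) for n, children in edges.items()}
    for n, children in edges.items():
        for child in children:
            adj[child].append(n)
    nodes = list(adj)
    idx = {n: i for i, n in enumerate(nodes)}
    D = np.full((len(nodes), len(nodes)), np.inf)
    for s in nodes:
        D[idx[s], idx[s]] = 0.0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if np.isinf(D[idx[s], idx[v]]):
                    D[idx[s], idx[v]] = D[idx[s], idx[u]] + 1
                    queue.append(v)
    return nodes, D  # D[i, j] replaces the lattice distance during training
```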
21
Taxonomic SOM
  • Encode all documents in the data set as
    fixed-size, normalized vectors (frequencies of
    words in a vocabulary)
  • Initial weights for the nodes (models) are chosen
    randomly, forcing the presence of the labels
    (e.g. at maximum frequency)
  • Start learning, iteratively updating the weights;
    the weights associated with the model labels are
    not allowed to change (see the sketch below)

[Diagram: each input pattern is compared with the node
codebooks; the node labels are fixed components of the
codebooks]
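A minimal sketch of these steps, reusing the tree distances from the previous sketch. The boolean mask that pins the label weights is an illustrative device; the presentation does not specify the actual mechanism.

```python
import numpy as np

def init_weights(n_nodes, vocab_size, labels, seed=0):
    """Random models, with each node's label terms forced to the maximum."""
    rng = np.random.default_rng(seed)
    W = rng.random((n_nodes, vocab_size))
    frozen = np.zeros_like(W, dtype=bool)
    for node, word_ids in labels.items():   # labels: node index -> vocab indices
        W[node, word_ids] = 1.0              # force presence of the label terms
        frozen[node, word_ids] = True
    return W, frozen

def taxsom_step(W, frozen, x, winner, tree_dist, alpha, sigma):
    """One update over the taxonomy; label components never change."""
    h = np.exp(-tree_dist[winner] ** 2 / (2 * sigma ** 2))  # tree neighborhood
    delta = alpha * h[:, None] * (x - W)
    delta[frozen] = 0.0                      # pinned label weights
    return W + delta
```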
22
Methodology Advantages
  • Unlike standard categorization methods, TaxSOM
    should use structural information during
    clustering
  • It should also use the labels describing the
    concepts (nodes)
  • It allows deriving a matching measure between
    concepts

23
Future Work
  • Automatic rejection of unclassifiable patterns
    (e.g. a similarity threshold; see the sketch
    below)
  • Controlled propagation of information among the
    units (symmetric, asymmetric, Gaussian, Kronecker,
    step, etc.)
  • Encoding of documents into vector representations
    and vocabulary reduction (frequencies, weighted
    frequencies, Latent Semantic Indexing, etc.)
  • How many units to use for each concept? (growing
    SOMs, one-sized nodes, etc.)
  • Forcing of labels into the model (by means of
    repeated documents, the codebook, etc.)
  • Possible extension and/or normalization of the
    node labels (WordNet)

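For the first item, rejection by a similarity threshold could be as simple as the following sketch; the cutoff value is an arbitrary placeholder.

```python
import numpy as np

def classify_or_reject(W, x, threshold=0.3):
    """Assign x to the most similar model, or reject it as unclassifiable."""
    sims = (W @ x) / (np.linalg.norm(W, axis=1) * np.linalg.norm(x) + 1e-12)
    best = int(np.argmax(sims))
    return best if sims[best] >= threshold else None  # None = rejected
```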
24
Future Work (2)
  • Test it