1
Self Organization, Classification, and Matching of Hierarchical Data
23 October 2002
Diego Sona
2
Outline
  • Problem Description (Distributed Personal
    Knowledge)
  • Idea of a System
  • Bootstrap and Refinement
  • Introduction to Self Organizing Maps (SOMs)
  • A Solution for Bootstrap
  • Taxonomic SOM
  • Sketch of Supervised Learning Matching

3
Distributed Personal Knowledge
  • Ideally, each user (re-)organizes their own
    knowledge base according to personal preferences
  • In a long-term scenario, huge knowledge bases
    will be spread among many smaller, locally
    organized knowledge bases
  • For example, the Web directory can be seen as a
    collection of specialized and/or competing
    directories

4
Needs for Personal KM
  • Tools supporting the user while creating,
    (re-)organizing, and maintaining (huge)
    structured collections of data.
  • Tools that help and improve the search for and
    the exchange of information between distributed
    (structured) knowledge bases

5
Supporting Directory Creation
  • Directory creation is performed in two steps
  • Bootstrap phase
    • Automatic annotation of labeled taxonomies with
      flat sets of data, helping the user design their
      own data structures
    • The user can then remove wrongly distributed
      documents
  • Refinement phase
    • Learn the correct document/concept relations,
      according to the user corrections performed
      during the bootstrap phase

6
The Bootstrap
  • The user searches for an arrangement of a given
    set of documents in a labeled taxonomy
  • The data arrangement should be performed
    automatically
  • When?
    • At the beginning, when the user organizes their
      own flat data into structured taxonomies
    • At run time, when the user changes their
      taxonomies (structure and/or labels) and wants
      to populate the new taxonomies with the old
      documents

7
The Bootstrap Hypothesis
  • The user has
    • an unorganized (flat) personal knowledge base
    • a labeled taxonomy specified using a descriptive
      language (e.g. CTXML)
  • The user wants a tool that automatically organizes
    the data according to the chosen taxonomy

[Diagram: the flat document collection and the labeled
taxonomy are fed to the Bootstrap Tool, which returns the
annotated taxonomy]
8
The User Correction
  • The user removes wrongly assigned documents from
    the taxonomy
  • The number of removed documents should be minimal

[Diagram: the Primates taxonomy (Primates → Apes →
Chimpanzees, Gorillas; Primates → Monkey) shown before and
after the user removes wrongly assigned documents]
9
The Refinement
  • Given an annotated taxonomy, the user needs a
    tool for automatic assignment of new documents to
    the correct concept.
  • The tool is trained with a labeled taxonomy and
    then used to classify new documents into the
    correct node (concept)

[Diagram: the annotated taxonomy trains the Refinement
Tool; the trained Refinement Tool then assigns a new
document (e.g. "human") to the correct node]
10
System Schema
[Diagram of the system schema: the flat documents and the
Primates taxonomy enter the Bootstrap Tool; after the user
correction, the annotated taxonomy trains the Refinement
Tool, which produces the Taxonomy Model]
11
Run Time Advantages
  • Locally
    • Easy maintenance of taxonomies by automatic
      assignment of documents to concepts (within the
      correct context), helping the user categorize
      new documents
  • Globally
    • Matching between concepts, performed using
      labels, documents, and structural information

12
Introduction to Self Organizing Maps (SOMs)
  • Unsupervised algorithm for data clustering and
    projection
  • Based on a regular, low-dimensional (usually 2-D)
    lattice of k neurons (models)
  • Models are represented by their coordinates and
    described by their weights w1,...,wn
  • A SOM is a function mapping the input space onto a
    discrete output display
  • It projects high-dimensional input vectors onto
    the grid, maintaining similarity relationships
    among the data (see the sketch below)

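An illustrative Python sketch of the structure just described: a lattice of k neurons, each with grid coordinates and a weight vector, plus the mapping from an input vector to its best-matching unit on the grid. The class and names below are assumptions for illustration, not from the presentation.

```python
import numpy as np

class SOM:
    """Regular 2-D lattice of k = rows * cols neurons (models)."""
    def __init__(self, rows, cols, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Each model has fixed grid coordinates and a weight vector w_1..w_n.
        self.coords = np.array([(r, c) for r in range(rows)
                                for c in range(cols)], dtype=float)
        self.weights = rng.random((rows * cols, dim))

    def best_matching_unit(self, x):
        # Project an input vector onto the grid: the index of the model
        # closest to x in the input space.
        return int(np.argmin(np.linalg.norm(self.weights - x, axis=1)))
```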
13
SOM Usage
  • Pattern recognition / vector quantization
    • the pattern is transformed into the closest
      codebook vector
  • Data compression
    • the pattern is transformed into the index of the
      winner unit
  • Projection and exploration
    • the data distribution is visualized in a
      lower-dimensional space

14
SOM Learning
  • The learning algorithm creates a mapping that
    preserves the topological relationships among the
    input vectors
  • Initial weights are chosen randomly and
    iteratively updated
  • A sample vector x is chosen randomly from the data
    set and provided to the SOM as input
  • Learning is composed of two interleaved stages
    (see the sketch below)
    • Competitive stage -- the winner unit is chosen
      on the basis of the distances between the
      pattern and the models
    • Cooperative stage -- the weights of the winner
      unit and of its neighboring units are moved
      closer to the input vector

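The two stages could be sketched as follows, reusing the SOM class from the sketch above; the learning-rate and neighborhood-radius schedules are simple placeholder choices, not those of the original work.

```python
import numpy as np

def train(som, data, epochs=10, alpha0=0.5, sigma0=3.0, seed=0):
    """Interleave the competitive and cooperative stages over random samples."""
    rng = np.random.default_rng(seed)
    steps = epochs * len(data)
    for t in range(steps):
        x = data[rng.integers(len(data))]      # random sample from the data set
        frac = t / steps
        alpha = alpha0 * (1.0 - frac)          # decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 0.5    # shrinking neighborhood radius
        # Competitive stage: choose the winner by distance to the models.
        c = som.best_matching_unit(x)
        # Cooperative stage: pull the winner and its neighbors toward x,
        # weighted by a neighborhood function (see the next slide).
        d = np.linalg.norm(som.coords - som.coords[c], axis=1)
        h = np.exp(-d ** 2 / (2 * sigma ** 2))
        som.weights += alpha * h[:, None] * (x - som.weights)
```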
15
Propagation of Training Information
  • The propagation of information during training is
    controlled by the neighborhood function h_c,i(t),
    where c is the winner unit and i any other unit,
    for example:
  • Gaussian
  • Kronecker
  • etc.

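In the notation of the sketches above, with d the grid distance between the winner c and unit i, the two named kernels could take these standard textbook forms (the presentation does not show its exact formulas):

```python
import numpy as np

def h_gaussian(d, sigma):
    """Smooth propagation: nearby units learn almost as much as the winner."""
    return np.exp(-d ** 2 / (2 * sigma ** 2))

def h_kronecker(d, radius=0.0):
    """Hard cutoff: only the winner (or units within `radius`) is updated."""
    return (d <= radius).astype(float)
```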
16
Example of Word Clustering
17
Hierarchical SOMs
  • SOMs can be organized hierarchically

18
WebSOM
  • Used to self-organize massive document collections
    (European, U.S., and Japanese patents)
  • Based on a growing approach
  • Documents are represented as vectors (weighted
    frequencies of words in a vocabulary)
  • The large vocabulary is reduced (manually or
    automatically) by selecting a subset of
    representative words (Latent Semantic Indexing,
    random projection, a special SOM)

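A sketch of such a document encoding: IDF-weighted word frequencies over a vocabulary, followed by random projection, one of the reduction options listed above. The weighting scheme and the target dimension are illustrative assumptions, not WebSOM's actual parameters.

```python
import numpy as np
from collections import Counter

def encode(tokens, vocab_index, idf):
    """Vector of IDF-weighted word frequencies, L2-normalized."""
    v = np.zeros(len(vocab_index))
    for word, count in Counter(tokens).items():
        if word in vocab_index:
            v[vocab_index[word]] = count * idf.get(word, 1.0)
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def random_projection(doc_matrix, target_dim=300, seed=0):
    """Reduce the vocabulary dimension by projecting onto a random subspace."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(doc_matrix.shape[1], target_dim)) / np.sqrt(target_dim)
    return doc_matrix @ R
```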
19
Example of WebSOM Result
20
Back to Bootstrap
  • The proposal is to use SOMs for the self
    organization of documents in a given labeled
    taxonomy
  • Instead of a standard SOM (lattice), we build a
    SOM whose topology equals the given taxonomy (see
    the sketch below)
[Diagram: a SOM whose lattice is replaced by the taxonomy
tree Primates → Apes (→ Chimpanzees, Gorillas) and Monkey]
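One way to read "topology equal to the taxonomy" in code: one unit per taxonomy node, with the grid distance of the earlier sketches replaced by the hop distance on the tree. The taxonomy is the slide's example; the implementation details are assumptions.

```python
import numpy as np
from collections import deque

# The taxonomy from the slide's example.
edges = {
    "Primates": ["Apes", "Monkey"],
    "Apes": ["Chimpanzees", "Gorillas"],
    "Monkey": [], "Chimpanzees": [], "Gorillas": [],
}

def tree_distances(edges):
    """All-pairs hop counts over the undirected taxonomy tree (BFS per node)."""
    adj = {n: list(children) for n, children in edges.items()}
    for n, children in edges.items():
        for child in children:
            adj[child].append(n)
    nodes = list(adj)
    idx = {n: i for i, n in enumerate(nodes)}
    D = np.full((len(nodes), len(nodes)), np.inf)
    for s in nodes:
        D[idx[s], idx[s]] = 0.0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if np.isinf(D[idx[s], idx[v]]):
                    D[idx[s], idx[v]] = D[idx[s], idx[u]] + 1
                    queue.append(v)
    return nodes, D  # D[i, j] replaces the lattice distance during training
```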
21
Taxonomic SOM
  • Encode all documents in the data set as
    fixed-size, normalized vectors (frequencies of
    words in a vocabulary)
  • Initial weights for the nodes (models) are chosen
    randomly, forcing the presence of the labels
    (e.g. at maximum frequency)
  • Start learning, iteratively updating the weights;
    the weights associated with the model labels are
    not allowed to change (see the sketch below)

[Diagram: each input pattern is compared with the node
codebooks; the node labels are fixed components of the
codebooks]
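A minimal sketch of these steps, reusing the tree distances from the previous sketch. The boolean mask that pins the label weights is an illustrative device; the presentation does not specify the actual mechanism.

```python
import numpy as np

def init_weights(n_nodes, vocab_size, labels, seed=0):
    """Random models, with each node's label terms forced to the maximum."""
    rng = np.random.default_rng(seed)
    W = rng.random((n_nodes, vocab_size))
    frozen = np.zeros_like(W, dtype=bool)
    for node, word_ids in labels.items():   # labels: node index -> vocab indices
        W[node, word_ids] = 1.0              # force presence of the label terms
        frozen[node, word_ids] = True
    return W, frozen

def taxsom_step(W, frozen, x, winner, tree_dist, alpha, sigma):
    """One update over the taxonomy; label components never change."""
    h = np.exp(-tree_dist[winner] ** 2 / (2 * sigma ** 2))  # tree neighborhood
    delta = alpha * h[:, None] * (x - W)
    delta[frozen] = 0.0                      # pinned label weights
    return W + delta
```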
22
Methodology Advantages
  • Unlike standard categorization methods, TaxSOM
    should use structural information during
    clustering
  • It should also use the labels describing the
    concepts (nodes)
  • It allows deriving a matching measure between
    concepts

23
Future Work
  • Automatic rejection of unclassifiable patterns
    (e.g. a similarity threshold; see the sketch
    below)
  • Controlled propagation of information among the
    units (symmetric, asymmetric, Gaussian, Kronecker,
    step, etc.)
  • Encoding of documents into vector representations
    and vocabulary reduction (frequencies, weighted
    frequencies, Latent Semantic Indexing, etc.)
  • How many units to use for each concept? (growing
    SOMs, one-sized nodes, etc.)
  • Forcing of labels into the model (by means of
    repeated documents, the codebook, etc.)
  • Possible extension and/or normalization of the
    node labels (WordNet)

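For the first item, rejection by a similarity threshold could be as simple as the following sketch; the cutoff value is an arbitrary placeholder.

```python
import numpy as np

def classify_or_reject(W, x, threshold=0.3):
    """Assign x to the most similar model, or reject it as unclassifiable."""
    sims = (W @ x) / (np.linalg.norm(W, axis=1) * np.linalg.norm(x) + 1e-12)
    best = int(np.argmax(sims))
    return best if sims[best] >= threshold else None  # None = rejected
```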
24
Future Work (2)
  • Test it