Acclimatizing Taxonomic Semantics for Hierarchical Content Categorization


1
Acclimatizing Taxonomic Semantics for
Hierarchical Content Categorization
  • Lei Tang, Jianping Zhang and Huan Liu

2
Taxonomies and Hierarchical Models
  • Web pages can be organized as a tree-structured
    taxonomy (Yahoo!, Google directory)
  • Parental-control Web filters block children's
    access to undesirable web sites.
  • Parents want accurate content categorization at
    different granularities.
  • Service providers value seeing the decision path
    behind each blocking/non-blocking decision, for
    fine-tuning.
  • Hierarchical model: exploit the taxonomy in the
    classification strategy or the loss function.
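
One common way to exploit a taxonomy as a classification strategy is top-down routing: a local classifier at each internal node picks a child, and the sequence of choices is the decision path. The sketch below is an illustration under assumptions, not the authors' implementation; the `KEYWORDS` table, the `keyword_chooser` stand-in classifier, and the toy tree are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical keyword lists standing in for trained local classifiers.
KEYWORDS = {
    "Kids": {"game", "toy"},
    "Adult": {"casino", "beer"},
    "Games": {"game"},
    "Toys": {"toy"},
}


def keyword_chooser(parent: str, options: List[str], doc: str) -> str:
    """Pick the child whose keyword list overlaps the document most."""
    words = set(doc.lower().split())
    return max(options, key=lambda c: len(KEYWORDS.get(c, set()) & words))


def classify_top_down(
    doc: str,
    children: Dict[str, List[str]],
    choose_child: Callable[[str, List[str], str], str],
    root: str = "root",
) -> List[str]:
    """Route a document from the root to a leaf and return the decision path."""
    path = [root]
    while children.get(path[-1]):  # stop once the current node has no children
        path.append(choose_child(path[-1], children[path[-1]], doc))
    return path


tree = {"root": ["Kids", "Adult"], "Kids": ["Games", "Toys"]}
path = classify_top_down("a new toy for children", tree, keyword_chooser)
```

The returned path lists every node visited, which is the kind of trace a service provider could inspect when tuning a filter.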

3
Quality of Taxonomy
  • Most hierarchical models use a predefined
    taxonomy, typically semantically sound.
  • A librarian is often employed to construct the
    semantic taxonomy.
  • Is a semantically sound taxonomy always good?
  • Subjectivity can result in different taxonomies
  • Semantics change for specific data

4
A Motivating Example
5
A Bayesian View
  • Stagnant nature of the predefined taxonomy (prior
    knowledge)
  • Dynamic change of semantics reflected in the data
  • The two are inconsistent, which motivates a
    data-driven taxonomy

6
Start from Scratch - Clustering
  • Throw away the predefined taxonomy and cluster
    the labeled data instead.
  • Two categories: divisive (top-down) or agglomerative (bottom-up)
  • Usually requires human experts to specify
    parameters such as the maximum height of the tree,
    the number of nodes in each branch, etc.
  • Difficult to specify parameters without looking
    at the data

7
Optimal Hierarchy
  • Optimal hierarchy
  • How to estimate the likelihood?
  • A hierarchical model's performance and the
    likelihood are positively related.
  • Use the hierarchical model's performance statistics
    on a validation set to gauge the likelihood.
  • A brute-force approach that enumerates all
    taxonomies is infeasible.
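
The formula for the optimal hierarchy did not survive in this transcript. One plausible reading, given the likelihood-based bullets above, is a maximum-likelihood choice over hierarchies. The symbols D (the labeled data) and H_opt are assumptions introduced here for illustration, not notation confirmed by the slides:

```latex
H_{\mathrm{opt}} = \arg\max_{H} \; p(D \mid H)
```

Under this reading, validation performance of the hierarchical model serves as a proxy for p(D | H), since the likelihood itself is not computed directly.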

8
Constrained Optimal Hierarchy
  • Predefined taxonomy can help.
  • Assumption: the optimal hierarchy lies in the
    neighborhood of the predefined taxonomy H0.
  • Constrained optimal hierarchy H for H0:
  • H results from a series of elementary
    operations that adjust H0 until no likelihood
    increase is observed.

9
Elementary Operations
Promote
(All the leaf nodes remain unchanged)
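
The diagram for this slide is not preserved. The sketch below shows one way the three operations named in this deck (Promote, plus Demote and Merge from later slides) could act on a tree while keeping the leaf set fixed. The exact semantics are assumptions: Promote moves a node under its grandparent, Demote moves a node under an internal sibling, and Merge groups two siblings under a new super node. The `Node` class and the example categories are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity-based equality, so list.remove finds the exact node
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def leaves(node: Node) -> List[str]:
    """Sorted names of all leaf categories under `node`."""
    if not node.children:
        return [node.name]
    return sorted(name for c in node.children for name in leaves(c))


def promote(node: Node) -> None:
    """Assumed semantics: move `node` up one level, under its grandparent."""
    parent = node.parent
    if parent is None or parent.parent is None:
        raise ValueError("node has no grandparent to move to")
    grand = parent.parent
    parent.children.remove(node)
    grand.add(node)
    if not parent.children:  # assumption: drop an internal node left empty
        grand.children.remove(parent)
        parent.parent = None


def demote(node: Node, sibling: Node) -> None:
    """Assumed semantics: push `node` down to become a child of a sibling."""
    if node is sibling or node.parent is None or sibling.parent is not node.parent:
        raise ValueError("demote needs two distinct siblings")
    if not sibling.children:
        raise ValueError("sibling must be internal so that leaves stay leaves")
    node.parent.children.remove(node)
    sibling.add(node)


def merge(a: Node, b: Node) -> Node:
    """Assumed semantics: group two siblings under a new super node."""
    if a is b or a.parent is None or a.parent is not b.parent:
        raise ValueError("merge needs two distinct siblings")
    parent = a.parent
    merged = Node(f"{a.name}+{b.name}")
    parent.children.remove(a)
    parent.children.remove(b)
    parent.add(merged)
    merged.add(a)
    merged.add(b)
    return merged


# Hypothetical toy taxonomy.
root = Node("root")
arts = root.add(Node("Arts"))
science = root.add(Node("Science"))
arts.add(Node("Music"))
arts.add(Node("Painting"))
science.add(Node("Physics"))
astrology = science.add(Node("Astrology"))
```

Calling `promote(astrology)` lifts Astrology to the top level, `demote(astrology, arts)` then places it under Arts, and `merge(arts, science)` groups the two branches; `leaves(root)` returns the same list before and after each step.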
10
Search in Hierarchy Space
  • Given a predefined taxonomy, find its best
    constrained optimal hierarchy.
  • Search in the hierarchy space.

11
Finding Best COH
  • Greedy Search
  • At each step, follow the track with the largest
    likelihood increase to search for the best hierarchy.

12
Framework (a wrapper approach)
  • Given H0, training data T, and validation data V
  • Generate neighbor hierarchies for H0
  • For each neighbor hierarchy, train hierarchical
    classification models on T
  • Evaluate the hierarchical classifiers on V
  • Pick the best neighbor hierarchy as the new H0
  • Repeat from the neighbor-generation step until
    no improvement
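
The wrapper loop above can be sketched as a generic hill climb. This is a minimal illustration, not the authors' code: `neighbors` stands in for applying the elementary operations, and `score` stands in for "train hierarchical classifiers on T, then evaluate on V". Both are hypothetical callables supplied by the caller.

```python
from typing import Callable, Iterable, Tuple, TypeVar

H = TypeVar("H")


def greedy_hierarchy_search(
    h0: H,
    neighbors: Callable[[H], Iterable[H]],
    score: Callable[[H], float],
) -> Tuple[H, float]:
    """Move to the best-scoring neighbor until no neighbor improves the score."""
    best, best_score = h0, score(h0)
    while True:
        candidates = [(score(h), h) for h in neighbors(best)]
        if not candidates:
            return best, best_score
        top_score, top = max(candidates, key=lambda pair: pair[0])
        if top_score <= best_score:  # no improvement: stop
            return best, best_score
        best, best_score = top, top_score


# Toy stand-in: "hierarchies" are integers and the score peaks at 3.
best, best_score = greedy_hierarchy_search(
    0,
    neighbors=lambda x: [x - 1, x + 1],
    score=lambda x: -(x - 3) ** 2,
)
```

Because every candidate requires retraining and re-evaluating the classifiers, the cost of each step grows with the number of neighbors, which is why the later slides restrict how neighbors are generated.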

13
Hierarchy Neighbors
  • Elementary operations can be applied to any node
    in the tree.
  • The number of neighbors of a hierarchy could be huge.
  • Most operations end up being evaluated repeatedly.

14
Finding Neighbors
  • Check nodes one by one rather than all the nodes
    at the same time in each search step.
  • Merge and Demote only consider the node most
    similar to the current one.
  • Nodes at higher levels affect classification
    more.
  • Top-down traversal: generate neighbors by
    applying all possible elementary operations to
    the shallowest nodes first.
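
The two heuristics above can be sketched as a breadth-first node ordering plus a nearest-sibling lookup. The slides do not say how similarity is measured; cosine similarity over per-category vectors is an assumption made here, and the tree and vectors are hypothetical toy data.

```python
import math
from collections import deque
from typing import Dict, List, Optional, Sequence


def top_down_order(children: Dict[str, List[str]], root: str) -> List[str]:
    """Return non-root nodes in breadth-first order, shallowest first."""
    order: List[str] = []
    queue = deque(children.get(root, []))
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children.get(node, []))
    return order


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar_sibling(
    node: str, siblings: List[str], vectors: Dict[str, Sequence[float]]
) -> Optional[str]:
    """Pick the single sibling to try Merge or Demote with (assumed: cosine)."""
    others = [s for s in siblings if s != node]
    if not others:
        return None
    return max(others, key=lambda s: cosine(vectors[node], vectors[s]))


tree = {
    "root": ["Arts", "Science", "Culture"],
    "Arts": ["Music", "Painting"],
    "Science": ["Physics"],
}
vectors = {"Arts": [1.0, 0.0], "Science": [0.0, 1.0], "Culture": [0.9, 0.1]}
order = top_down_order(tree, "root")
partner = most_similar_sibling("Arts", tree["root"], vectors)
```

Restricting Merge and Demote to one partner per node keeps the number of candidate hierarchies per step linear in the number of nodes rather than quadratic.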

15
Further consideration
  • Two types of top-down traversal:
  • Use only the Promote operation to generate
    neighbors
  • Use only the Demote and Merge operations to
    generate neighbors
  • Repeat the two-traversal procedure until no
    improvement.

If a node is improperly placed under a parent, we
need to promote it first.
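
The two-traversal procedure can be sketched as an outer loop that runs a Promote-only pass followed by a Demote/Merge-only pass and stops when a round brings no gain. This is an illustration under assumptions: `promote_pass`, `demote_merge_pass`, and `score` are hypothetical callables, each pass is assumed to return the hierarchy it ends on, and the `max_rounds` cap is an added convenience.

```python
from typing import Callable, Optional, Tuple, TypeVar

H = TypeVar("H")


def two_traversal_search(
    h0: H,
    promote_pass: Callable[[H], H],
    demote_merge_pass: Callable[[H], H],
    score: Callable[[H], float],
    max_rounds: Optional[int] = None,
) -> Tuple[H, float]:
    """Alternate a Promote-only pass and a Demote/Merge-only pass."""
    best, best_score = h0, score(h0)
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        # Promote first, so misplaced nodes are lifted before being re-homed.
        candidate = demote_merge_pass(promote_pass(best))
        candidate_score = score(candidate)
        if candidate_score <= best_score:  # no improvement: stop
            break
        best, best_score = candidate, candidate_score
        rounds += 1
    return best, best_score


# Toy stand-in: "hierarchies" are integers and the score peaks at 6.
def toy_promote(x: int) -> int:
    return x + 2 if x < 5 else x


def toy_demote_merge(x: int) -> int:
    return x + 1 if x < 6 else x


def toy_score(x: int) -> float:
    return -abs(x - 6)


full = two_traversal_search(0, toy_promote, toy_demote_merge, toy_score)
single = two_traversal_search(0, toy_promote, toy_demote_merge, toy_score, max_rounds=1)
```

Setting `max_rounds=1` corresponds to running the two traversals only once, the variant the "Robust Method" slide proposes to limit over-fitting.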
16
Experiment Setting
  • 10-fold cross validation
  • Naïve Bayes Classifier (Multinomial)
  • Use information gain to select features
  • Due to the scarcity of documents in each class,
    we use training data to validate the likelihood
    of a hierarchy.

17
Data Sets
  • Data: Soc and Kids
  • Human-labeled web pages with a predefined taxonomy

18
Results on Soc
19
Results on Kids
20
Over-fitting?
  • As we optimize the hierarchy based only on
    training data, it's possible to over-fit the
    data.

21
Robust Method
  • Instead of multiple traversals (iterations), run
    the two-traversal procedure just once.

22
Conclusions
  • A semantically sound taxonomy does not necessarily
    lead to the intended good classification performance.
  • Given a predefined taxonomy, we can adapt it
    into a data-driven taxonomy for more accurate
    classification.
  • The taxonomy generated by our method outperforms
    both the human-constructed taxonomy and a taxonomy
    generated from scratch.

23
Future work
  • This is initial work on combining noisy prior
    knowledge with data.
  • How can we implement an efficient filter model that
    finds a good taxonomy by exploiting the
    predefined taxonomy?
  • Feature selection could reduce the difference
    between taxonomies. How can the taxonomy
    information be used for feature selection?

24
Questions?
Thanks!