Acclimatizing Taxonomic Semantics for Hierarchical Content Categorization presentation

1
Acclimatizing Taxonomic Semantics for
Hierarchical Content Categorization

---Lei Tang, Jianping Zhang and Huan Liu

2
Taxonomies and Hierarchical Models

Web pages can be organized as a tree-structured
taxonomy (Yahoo!, Google directory)
Parental-control Web filters block children's
access to undesirable Web sites.
Parents want accurate content categorization at
different granularities.
Service providers value the decision path showing
how each blocking/non-blocking decision is made,
for fine-tuning.
Hierarchical models exploit the taxonomy in the
classification strategy or the loss function.
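
To make the "decision path" idea concrete, here is a minimal sketch of a top-down classification strategy. The toy taxonomy, the category names, and the keyword-counting scorer are all assumptions for illustration; a real system would plug in a trained classifier at each internal node.

```python
from typing import Callable, Dict, List

# Hypothetical toy taxonomy: each internal node maps to its children.
TAXONOMY: Dict[str, List[str]] = {
    "root": ["Sports", "Games"],
    "Sports": ["Soccer", "Tennis"],
}


def classify_top_down(doc: str,
                      taxonomy: Dict[str, List[str]],
                      score: Callable[[str, str], float],
                      root: str = "root") -> List[str]:
    """Descend from the root, picking the best-scoring child at each level.

    Returns the whole decision path, which ends at a leaf category.
    """
    path = [root]
    node = root
    while node in taxonomy:  # internal node: keep descending
        node = max(taxonomy[node], key=lambda child: score(doc, child))
        path.append(node)
    return path


def keyword_score(doc: str, category: str) -> float:
    """Stand-in scorer (an assumption): count mentions of the category name."""
    return float(doc.lower().count(category.lower()))


path = classify_top_down("Soccer results from the sports desk",
                         TAXONOMY, keyword_score)
print(" -> ".join(path))
```

The returned path is what a service provider could inspect when fine-tuning a blocking decision.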

3
Quality of Taxonomy

Most hierarchical models use a predefined
taxonomy, typically semantically sound.
A librarian is often employed to construct the
semantic taxonomy.
Is a semantically sound taxonomy always good?
Subjectivity can result in different taxonomies
Semantics change for specific data

4
A Motivating Example
5
A Bayesian View
Stagnant nature of the predefined taxonomy (prior
knowledge)

Dynamic change of semantics reflected in the data

The two can be inconsistent, which motivates a
data-driven taxonomy.
6
Start from Scratch - Clustering

Throw away the predefined taxonomy information and
cluster based on the labeled data.
Two categories: divisive or hierarchical
Usually requires human experts to specify some
parameters, such as the maximum height of the tree
or the number of nodes in each branch.
It is difficult to specify these parameters without
looking at the data.

7
Optimal Hierarchy

Optimal hierarchy
How to estimate the likelihood?
A hierarchical model's performance and the
likelihood are positively related.
Use the hierarchical model's performance statistics
on a validation set to gauge the likelihood.
A brute-force approach that enumerates all
taxonomies is infeasible.

8
Constrained Optimal Hierarchy

Predefined taxonomy can help.
Assumption: the optimal hierarchy lies in the
neighborhood of the predefined taxonomy H0.
Constrained optimal hierarchy H for H0:
H results from a series of elementary
operations that adjust H0 until no likelihood
increase is observed.

9
Elementary Operations
Promote, Demote, Merge
(All the leaf nodes remain unchanged)
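
The slide does not spell out the operations, so the sketch below is one plausible reading of the Promote, Demote and Merge operations named in this talk, not the authors' exact definitions. It assumes Promote moves a node up to its grandparent, Demote pushes a node under an internal sibling, and Merge groups two siblings under a new super node. The parent-map representation and all node names are illustrative.

```python
from typing import Dict, Optional, Set

Tree = Dict[str, Optional[str]]  # node -> parent; the root maps to None


def leaves(tree: Tree) -> Set[str]:
    """Nodes that are nobody's parent."""
    return set(tree) - set(tree.values())


def promote(tree: Tree, node: str) -> Tree:
    """Assumed Promote: `node` becomes a child of its grandparent."""
    parent = tree[node]
    if parent is None or tree[parent] is None:
        raise ValueError("node has no grandparent to be promoted to")
    new = dict(tree)
    new[node] = tree[parent]
    if parent not in new.values():  # drop an internal node left childless
        del new[parent]
    return new


def demote(tree: Tree, node: str, sibling: str) -> Tree:
    """Assumed Demote: `node` becomes a child of an internal `sibling`."""
    if node == sibling or tree[node] != tree[sibling]:
        raise ValueError("demote needs two distinct siblings")
    if sibling not in tree.values():
        raise ValueError("sibling must be internal so that leaves stay leaves")
    new = dict(tree)
    new[node] = sibling
    return new


def merge(tree: Tree, a: str, b: str, super_node: str) -> Tree:
    """Assumed Merge: siblings `a` and `b` move under a new `super_node`."""
    if a == b or tree[a] != tree[b]:
        raise ValueError("merge needs two distinct siblings")
    if super_node in tree:
        raise ValueError("super_node name is already in use")
    new = dict(tree)
    new[super_node] = tree[a]
    new[a] = new[b] = super_node
    return new


# Hypothetical predefined taxonomy H0 with five leaf classes.
h0: Tree = {"root": None, "A": "root", "B": "root",
            "a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}

h1 = promote(h0, "a3")            # a3 now hangs directly under root
h2 = demote(h1, "a3", "B")        # a3 pushed under its sibling B
h3 = merge(h0, "a1", "a2", "A12")  # a1 and a2 grouped under A12
```

Each function returns a new tree and leaves its input untouched, which makes it easy to generate several candidate hierarchies from the same starting point. In all three results the set of leaf classes equals that of `h0`.
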
10
Search in Hierarchy Space

Given a predefined taxonomy, find its best
constrained optimal hierarchy.
Search in the hierarchy space.

11
Finding Best COH

Greedy Search
Follow the track with the largest likelihood increase
at each step to search for the best hierarchy.

12
Framework (a wrapper approach)

Given H0, training data T, and validation data V:
1. Generate neighbor hierarchies for H0.
2. For each neighbor hierarchy, train hierarchical
   classification models on T.
3. Evaluate the hierarchical classifiers on V.
4. Pick the best neighbor hierarchy as the new H0.
5. Repeat from step 1 until no improvement.
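
The wrapper loop above can be sketched as a generic greedy search. This is a minimal illustration, not the authors' implementation: `neighbors` and `evaluate` are hypothetical callables, where `evaluate` stands in for "train on T, then measure performance on V". The toy demo uses integers as stand-in hierarchies purely so the loop can be run.

```python
from typing import Callable, Iterable, Tuple, TypeVar

H = TypeVar("H")


def greedy_hierarchy_search(h0: H,
                            neighbors: Callable[[H], Iterable[H]],
                            evaluate: Callable[[H], float]) -> Tuple[H, float]:
    """Move to the best-scoring neighbor until no neighbor improves."""
    best, best_score = h0, evaluate(h0)
    while True:
        scored = [(evaluate(h), h) for h in neighbors(best)]
        if not scored:
            return best, best_score
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:  # no improvement: stop
            return best, best_score
        best, best_score = top, top_score


# Toy demo (an assumption): "hierarchies" are integers, and the score
# peaks at 3, so the search should climb from 0 to 3 and stop there.
result = greedy_hierarchy_search(
    0,
    neighbors=lambda h: [h - 1, h + 1],
    evaluate=lambda h: -(h - 3) ** 2,
)
print(result)
```

Because every candidate requires training and evaluating a full hierarchical model, the cost of one step is proportional to the number of neighbors generated.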

13
Hierarchy Neighbors

Elementary operations can be applied to any node
in the tree.
The number of neighbors of a hierarchy can be huge.
Most operations are evaluated repeatedly.

[Figure: example hierarchies H1 and H2]
14
Finding Neighbors

Check nodes one by one rather than all the nodes
at the same time in each search step.
Merge and Demote only consider the node most
similar to the current one.
Nodes at higher levels affect classification
more.
Top-down traversal: generate neighbors by
applying all possible elementary operations to
the shallowest node first.
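
One simple way to visit the shallowest nodes first is a breadth-first traversal. The sketch below only produces the visiting order; the child-map representation and node names are assumptions for illustration.

```python
from collections import deque
from typing import Dict, List


def nodes_top_down(children: Dict[str, List[str]], root: str) -> List[str]:
    """Return all non-root nodes, shallowest first (breadth-first order)."""
    order: List[str] = []
    queue = deque(children.get(root, []))
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children.get(node, []))  # deeper nodes wait their turn
    return order


# Hypothetical taxonomy used only to show the ordering.
children = {"root": ["A", "B"], "A": ["a1", "a2"], "B": ["b1"]}
order = nodes_top_down(children, "root")
print(order)
```

A search step would then check the nodes in this order, one by one, generating neighbors only for the node currently under consideration.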

15
Further consideration

Two types of top-down traversal:
1. Use only the Promote operation to generate
   neighbors.
2. Use only the Demote and Merge operations to
   generate neighbors.
Repeat the two-traversal procedure until no
improvement.

If a node is improperly placed under a parent, we
need to promote it first.
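
The two-traversal procedure can be sketched as an outer loop around two passes. Here `promote_pass` and `demote_merge_pass` are hypothetical callables that each perform one top-down traversal and return the adjusted hierarchy, and `evaluate` is a stand-in for the likelihood estimate. The optional `max_rounds` cap is an addition for illustration. The toy demo again uses integers in place of hierarchies.

```python
from typing import Callable, Optional, Tuple, TypeVar

H = TypeVar("H")


def two_traversal_search(h0: H,
                         promote_pass: Callable[[H], H],
                         demote_merge_pass: Callable[[H], H],
                         evaluate: Callable[[H], float],
                         max_rounds: Optional[int] = None) -> Tuple[H, float, int]:
    """Alternate a Promote-only pass and a Demote/Merge-only pass.

    Stops when a round brings no improvement or `max_rounds` is reached.
    Returns the best hierarchy, its score, and the number of rounds run.
    """
    best, best_score = h0, evaluate(h0)
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        candidate = demote_merge_pass(promote_pass(best))  # promote first
        score = evaluate(candidate)
        rounds += 1
        if score <= best_score:
            break
        best, best_score = candidate, score
    return best, best_score, rounds


# Toy demo (an assumption): each pass nudges an integer towards 4.
def step(h: int) -> int:
    return h + 1 if h < 4 else h


full = two_traversal_search(0, step, step, evaluate=float)
once = two_traversal_search(0, step, step, evaluate=float, max_rounds=1)
```

Setting `max_rounds=1` runs the pair of traversals a single time instead of iterating to convergence.
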
16
Experiment Setting

10-fold cross validation
Naïve Bayes Classifier (Multinomial)
Use information gain to select features
Due to the scarcity of documents in each class,
we use training data to validate the likelihood
of a hierarchy.
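
For reference, information gain for a binary "term present / term absent" feature can be computed as below. This is a minimal pure-Python sketch of the standard definition, not the authors' code; the tiny corpus and its labels are made up for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence, Set


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(docs: List[Set[str]], labels: List[str], term: str) -> float:
    """Entropy reduction from splitting documents on presence of `term`."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = sum(len(part) / n * entropy(part)
                      for part in (present, absent) if part)
    return entropy(labels) - conditional


# Made-up corpus: each document is a set of tokens.
docs = [{"goal", "match"}, {"goal", "team"}, {"toy", "game"}, {"toy", "match"}]
labels = ["sports", "sports", "toys", "toys"]

ig_goal = information_gain(docs, labels, "goal")    # separates the classes
ig_match = information_gain(docs, labels, "match")  # tells us nothing
```

Features would then be ranked by this score and the top-ranked ones kept for the Naïve Bayes classifier.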

17
Data Sets

Data: Soc and Kids
Human-labeled web pages with a predefined taxonomy

18
Results on Soc
19
Results on Kids
20
Over-fitting?

Because we optimize the hierarchy based only on
training data, it's possible to over-fit the
data.

21
Robust Method

Instead of multiple traversals (iterations), just
do the 2-traversals procedure once.

22
Conclusions

A semantically sound taxonomy does not necessarily
lead to the intended good classification performance.
Given a predefined taxonomy, we can adapt it
into a data-driven taxonomy for more accurate
classification.
The taxonomy generated by our method outperforms
the human-constructed taxonomy and the taxonomy
generated from scratch.

23
Future work

This is initial work on combining noisy prior
knowledge with data.
How can we implement an efficient filter model that
finds a good taxonomy by exploiting the
predefined taxonomy?
Feature selection could alleviate the difference
between taxonomies. How can the taxonomy
information be used for feature selection?

24
Questions?
Thanks!
