Hierarchical Text Classification - PowerPoint PPT Presentation

Provided by: ntu84

1

Hierarchical Text Classification
Ashwin K Pulijala
Susan Gauch
Department of Electrical Engineering and Computer
Science
University of Kansas
Lawrence
KS-66049

2
Presentation Outline

Introduction
Goal
Text Classification
Hierarchical Classification Advantages
Types of Category Structures
Hierarchical Classification Approaches
System Architecture
Experiments and Evaluation
Conclusion
Future Work

3
Introduction

KeyConcept: a conceptual search engine.
Indexes documents using a combination of keyword
and concept.
Retrieves documents based on a combination of
keyword and conceptual matching.
Finds information that is more relevant.
Classification is automatic but uses a flat
classifier.
Categories are selected from a hierarchical
arrangement of concepts.

4
Goal

Explore the use of hierarchical structure for
classifying web content.
Construct hierarchical classifiers.
Compare performance of flat classifier with
hierarchical classifier.

5
Text Classification

Two-step process: training the classifier and
classifying new documents.
Training Phase
Classifier is fed with documents that have been
classified manually.
Learns about the features (vocabulary) of the
various categories into which new documents can
be classified.

6
Text Classification (contd.)

Classification Phase
Classifier assigns categories to new documents
based on the similarity of the features of input
document and of the categories that it learned
during training.
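
The two phases can be sketched as follows. This is a minimal illustration, not the KeyConcept classifier: the slides only say that categories are assigned by feature similarity, so the raw term counts, the cosine measure, and the names train, cosine and classify are all assumptions made for the example.

```python
import math
from collections import Counter

def train(labeled_docs):
    """Training phase: build one term-count vector per category
    from manually classified (category, text) pairs."""
    vectors = {}
    for category, text in labeled_docs:
        vectors.setdefault(category, Counter()).update(text.lower().split())
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(vectors, text):
    """Classification phase: pick the category whose learned vector
    is most similar to the new document."""
    doc = Counter(text.lower().split())
    return max(vectors, key=lambda c: cosine(doc, vectors[c]))

# Toy training data (hypothetical, for illustration only).
training = [
    ("Sports", "football match goal team"),
    ("Sports", "tennis match serve"),
    ("Computers", "python code compiler"),
    ("Computers", "linux kernel code"),
]
model = train(training)
label = classify(model, "the team scored a goal")
```

With this toy data the new document shares vocabulary only with the Sports category, so that category is selected.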

7
Flat Classification vs. Hierarchical
Classification
8
Hierarchical Classification Advantages

Increase accuracy by exploiting the relationship
among the categories.
Effectively deal with very large problems.
Decompose into a smaller set of problems and
solve each problem more accurately by focusing
only on a small set of categories.
Construction of specialized classifiers.

9
Types of Category Structures

Virtual category tree.
Category tree.
Virtual directed acyclic category graph.
Directed acyclic category graph.

10
Hierarchical Classification Approaches

Big bang approach.
Document is classified using a single step.
Top-down level based approach.
Classifiers are built at each level of the
category tree.
Non-performing classifiers at higher levels lead
to poor performance of the lower level
classifiers.
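
A top-down, level-based classifier can be sketched as a walk down the category tree, with one local decision per level. The Node class, the keyword-overlap scoring and the toy tree below are hypothetical stand-ins for trained classifiers; they only illustrate the control flow.

```python
class Node:
    """One category in the tree, with a toy keyword set standing in
    for whatever the classifier at this point has learned."""
    def __init__(self, name, keywords=(), children=()):
        self.name = name
        self.keywords = set(keywords)
        self.children = list(children)

def pick_child(node, words):
    # Local classifier (assumed): choose the child with the largest keyword overlap.
    return max(node.children, key=lambda c: len(c.keywords & words))

def classify_top_down(root, text):
    """Descend one level at a time until a leaf category is reached."""
    words = set(text.lower().split())
    path, node = [], root
    while node.children:
        node = pick_child(node, words)
        path.append(node.name)
    return path

# Hypothetical two-level tree.
root = Node("Top", children=[
    Node("Sports", ["match", "team", "goal", "serve"], [
        Node("Soccer", ["goal", "football"]),
        Node("Tennis", ["serve", "racket"]),
    ]),
    Node("Computers", ["code", "kernel", "compiler"], [
        Node("Programming", ["python", "compiler"]),
        Node("Operating Systems", ["linux", "kernel"]),
    ]),
])
path = classify_top_down(root, "the team scored a late goal in the football match")
```

Because each step only considers the children of the node chosen at the previous level, a wrong choice near the top cannot be corrected further down, which is the weakness noted on this slide.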

11
System Architecture
12
Text Classification: Our Approach

The weight for a given term is given as
$tc_{ij} = tf_{ij} \cdot idf_i$   (1)
$idf_i = \log\left(\frac{\text{number of documents in } D}{\text{number of documents in } D \text{ that contain } t_i}\right)$   (2)
where
$D$ = the collection of super-documents.
$t_i$ = the $i$th term in the vocabulary.
$d_j$ = the $j$th super-document.
$tf_{ij}$ = the number of occurrences of $t_i$ in $d_j$.
Normalized weight
$ntc_{ij} = tc_{ij} / \text{vector-length}_j$   (3)
where
$\text{vector-length}_j = \sum_i tc_{ij}$   (4)
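
The weighting scheme on this slide can be sketched in a few lines. The function name term_weights, the whitespace tokenization and the toy super-documents are assumptions; the vector length is taken here as the plain sum of a super-document's term weights, which is a reading of the symbol that did not survive on the slide rather than a confirmed detail.

```python
import math
from collections import Counter

def term_weights(super_documents):
    """Compute normalized tf*idf weights per super-document.

    super_documents: dict mapping a concept name to the concatenated
    text of its training documents (the 'super-document').
    """
    tf = {name: Counter(text.lower().split())
          for name, text in super_documents.items()}
    n_docs = len(tf)
    # Number of super-documents that contain each term.
    df = Counter(term for counts in tf.values() for term in counts)
    idf = {t: math.log(n_docs / df[t]) for t in df}

    ntc = {}
    for name, counts in tf.items():
        tc = {t: counts[t] * idf[t] for t in counts}      # tf * idf
        length = sum(tc.values())                         # assumed: sum of weights
        ntc[name] = {t: (w / length if length else 0.0)   # normalized weight
                     for t, w in tc.items()}
    return ntc

# Toy super-documents (hypothetical).
weights = term_weights({
    "Sports": "match team match goal",
    "Computers": "code kernel code match",
})
```

A term such as "match" that occurs in every super-document gets an idf of zero and therefore contributes nothing to any concept's vector.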

13
Observations Using a Flat Classifier

The following observations were made using a flat
classifier:
Using 20-40 documents per concept allows reaching
a good compromise between training data and
classifier precision.
Classifier precision is independent of the number
of concepts.
Classifier precision does not depend on the type
of concepts.
The flat classifier has an exact-match precision of
around 51%.

14
Experiments and Evaluation

The goals of the experiments were to:
Evaluate and tune the classifiers at each level.
Determine the optimum number of documents to use
for training the concepts.
Determine the maximum classifier precision at
each level.
The ODP hierarchy was used as the source of the
training data for all the experiments.
Test data: randomly selected level-3 documents.

15
Baseline Setup And Results

All the concepts from levels 1, 2 and 3 with at
least 20 documents (total 3183 concepts).
20 randomly selected documents from each concept
used for training.
750 randomly selected level-3 documents used for
testing.
Accuracy: 48.2%.

16
Experiments and Evaluation

1 classifier at level 1, 15 at level 2, 356 at
level 3.
Documents from the parent, its children (and
grandchildren) are put in the same pool to select from.
Parameters we tune: depth and number of documents.
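
The pooling step and its two tuning parameters can be sketched as below. The dictionary-based tree, the function names pool_documents and sample_training_set, and the fixed random seed are assumptions made for the illustration; the slides do not describe how the pool was implemented.

```python
import random

def pool_documents(tree, docs, concept, depth):
    """Collect documents from `concept` and from its descendants
    down to `depth` levels below it (depth 0 = the concept only)."""
    pool = list(docs.get(concept, []))
    if depth > 0:
        for child in tree.get(concept, []):
            pool.extend(pool_documents(tree, docs, child, depth - 1))
    return pool

def sample_training_set(tree, docs, concept, depth, n_docs, seed=0):
    """Randomly select up to n_docs training documents from the pool."""
    pool = pool_documents(tree, docs, concept, depth)
    rng = random.Random(seed)
    return rng.sample(pool, min(n_docs, len(pool)))

# Hypothetical category tree and document ids.
tree = {"Sports": ["Soccer", "Tennis"], "Soccer": ["Leagues"]}
docs = {
    "Sports": ["s1"],
    "Soccer": ["c1", "c2"],
    "Tennis": ["t1"],
    "Leagues": ["l1"],
}
parent_only = pool_documents(tree, docs, "Sports", depth=0)
with_children = pool_documents(tree, docs, "Sports", depth=1)
with_grandchildren = pool_documents(tree, docs, "Sports", depth=2)
training_set = sample_training_set(tree, docs, "Sports", depth=2, n_docs=3)
```

Raising the depth enlarges the pool with vocabulary from more specific subcategories, while the document count caps how much of that pool is actually used for training.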

17
Experiments and Evaluation

Trained the classifiers with a single document
selected from each class (and subclass) and with
level-4 training data included.
On average, level-1 classes were trained with
approximately 7078 documents, level-2 classifiers
with 7063 and level-3 with 6707.
The same set of documents as used for the baseline
was used for testing.

18
Experiments and Evaluation
19
Experiments and Evaluation
20
Experiments and Evaluation
21
Experiments and Evaluation

Observations
79.5% of documents had an exact match at level 1,
71.3% at level 2 and 70.1% at level 3.
The number of top words used to categorize
documents at each level varied directly with the
number of concepts trained at that level: 3 words
for level 1, 5 words for level 2 and 19 words for
level 3.
The exact-match precision of the hierarchical
classifier increased significantly (t-test value
2.56E-06), by 45.4%, when compared to the flat
classifier.
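
The 45.4 figure can be checked with a short calculation, assuming it is the relative gain of the level-3 exact-match rate (70.1) over the baseline flat accuracy (48.2) reported on the earlier slide. The function name relative_gain is introduced only for this check.

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100

flat_accuracy = 48.2          # baseline flat classifier (slide 15)
hierarchical_accuracy = 70.1  # exact match at level 3 (this slide)

gain = round(relative_gain(flat_accuracy, hierarchical_accuracy), 1)
absolute_gain = round(hierarchical_accuracy - flat_accuracy, 1)
```

Under that assumption the relative gain rounds to 45.4, which corresponds to an absolute gain of 21.9 percentage points.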

22
Flat Vs Hierarchical Classification
23
Conclusions

Explored the use of hierarchical structure for
classifying web content.
Built hierarchical classifiers, which increase
the classifier precision by focusing only on a
small set of categories.
Compared the performance of flat classification
with hierarchical classification.

24
Future Work

Classifier precision can be further improved by:
Reducing the number of features needed to
discriminate between the categories within the
same top-level category.
Introducing some recovery mechanism to remedy the
classification errors made early in the
hierarchy.
Selecting documents closest to the centroid for
training instead of random selection.
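
The last idea, choosing training documents near the category centroid, might look like the sketch below. It is only one possible reading of the proposal: the raw term-count vectors, the cosine measure and the names centroid, cosine and closest_to_centroid are assumptions, and nothing here was evaluated in the presentation.

```python
import math
from collections import Counter

def centroid(vectors):
    """Average term-count vector of a category's candidate documents."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: c / len(vectors) for t, c in total.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors (assumed measure)."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def closest_to_centroid(texts, k):
    """Return the k documents most similar to the category centroid."""
    vectors = [Counter(t.lower().split()) for t in texts]
    center = centroid(vectors)
    ranked = sorted(range(len(texts)),
                    key=lambda i: cosine(vectors[i], center),
                    reverse=True)
    return [texts[i] for i in ranked[:k]]

# Hypothetical candidate documents for one category, with one outlier.
candidates = [
    "goal team match",
    "team match referee",
    "goal match stadium",
    "recipe flour sugar",
]
selected = closest_to_centroid(candidates, k=3)
```

In this toy category the off-topic document is the least similar to the centroid, so it is the one left out of the training set.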
