Title: Hierarchical Text Classification
1 Hierarchical Text Classification
- Ashwin K Pulijala
- Susan Gauch
- Department of Electrical Engineering and Computer Science
- University of Kansas
- Lawrence
- KS-66049
2 Presentation Outline
- Introduction
- Goal
- Text Classification
- Hierarchical Classification Advantages
- Types of Category Structures
- Hierarchical Classification Approaches
- System Architecture
- Experiments and Evaluation
- Conclusion
- Future Work
3 Introduction
- KeyConcept: a conceptual search engine.
- Indexes documents using a combination of keywords and concepts.
- Retrieves documents based on a combination of keyword and conceptual matching.
- Finds information that is more relevant.
- Classification is automatic but uses a flat classifier.
- Categories are selected from a hierarchical arrangement of concepts.
4 Goal
- Explore the use of hierarchical structure for classifying web content.
- Construct hierarchical classifiers.
- Compare the performance of a flat classifier with that of a hierarchical classifier.
5 Text Classification
- Two-step process: training the classifier and classifying new documents.
- Training phase
- The classifier is fed documents that have been classified manually.
- It learns the features (vocabulary) of the various categories into which new documents can be classified.
6 Text Classification (contd.)
- Classification phase
- The classifier assigns categories to new documents based on the similarity between the features of the input document and those of the categories it learned during training.
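As a rough illustration of the two phases, the sketch below builds one term-count profile per category from manually labelled documents and then assigns a new document to the most similar profile. The whitespace tokenizer, the cosine similarity measure and the names `train` and `classify` are assumptions made for this example, not details taken from the slides.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def train(labelled_docs):
    # Training phase: merge the vocabulary of each category's documents.
    profiles = {}
    for category, text in labelled_docs:
        profiles.setdefault(category, Counter()).update(tokenize(text))
    return profiles


def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(profiles, text):
    # Classification phase: pick the category with the most similar profile.
    doc = Counter(tokenize(text))
    return max(profiles, key=lambda c: cosine(doc, profiles[c]))


profiles = train([
    ("Sports", "football match goal team"),
    ("Sports", "tennis match player"),
    ("Computers", "software program code compiler"),
])
print(classify(profiles, "compiler code error"))  # Computers
```

Raw term counts are used here only to keep the example short; any term weighting could be substituted in `train` without changing the two-phase structure.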
7 Flat Classification vs. Hierarchical Classification
8 Hierarchical Classification Advantages
- Increase accuracy by exploiting the relationships among the categories.
- Deal effectively with very large problems.
- Decompose a large problem into a set of smaller problems and solve each one more accurately by focusing only on a small set of categories.
- Construction of specialized classifiers.
9 Types of Category Structures
- Virtual category tree.
- Category tree.
- Virtual directed acyclic category graph.
- Directed acyclic graph
10 Hierarchical Classification Approaches
- Big-bang approach.
- The document is classified in a single step.
- Top-down level-based approach.
- Classifiers are built at each level of the category tree.
- Poorly performing classifiers at higher levels lead to poor performance of the lower-level classifiers.
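The top-down level-based approach can be sketched as a walk down the category tree, with one local decision per level. In the example below the tree `TREE`, its keyword sets and the keyword-overlap function `local_classify` are all hypothetical stand-ins for trained per-level classifiers; only the descent logic is the point.

```python
# Hypothetical category tree: each node maps a child name to
# (keywords for that child, subtree below that child).
TREE = {
    "Computers": ({"software", "code", "hardware"}, {
        "Programming": ({"code", "compiler"}, {}),
        "Hardware": ({"cpu", "hardware"}, {}),
    }),
    "Sports": ({"match", "team", "ball"}, {
        "Football": ({"goal", "football"}, {}),
        "Tennis": ({"racket", "tennis"}, {}),
    }),
}


def local_classify(children, words):
    # Stand-in for a trained classifier: pick the child sharing the most words.
    return max(children, key=lambda name: len(children[name][0] & words))


def classify_top_down(tree, text):
    # Walk from the root, choosing one child per level until a leaf is reached.
    words = set(text.lower().split())
    path, children = [], tree
    while children:
        choice = local_classify(children, words)
        path.append(choice)
        children = children[choice][1]
    return path


print(classify_top_down(TREE, "the compiler rejected my code"))
# ['Computers', 'Programming']
```

Because each level only chooses among the children of the previously selected node, a wrong choice at level 1 cannot be corrected further down: in this toy tree, a text such as "football goal code software" is sent to "Computers" first and can then never reach "Football".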
11 System Architecture
12 Text Classification: Our Approach
- The weight of a given term is given by
- $tc_{ij} = tf_{ij} \cdot idf_i$ (1)
- $idf_i = \log(\text{number of documents in } D \,/\, \text{number of documents in } D \text{ that contain } t_i)$ (2)
- $D$: the collection of super-documents.
- $t_i$: the $i$th term in the vocabulary.
- $d_j$: the $j$th super-document.
- $tf_{ij}$: the number of occurrences of $t_i$ in $d_j$.
- Normalized weight
- $ntc_{ij} = tc_{ij} / \text{vector-length}_j$ (3)
- where $\text{vector-length}_j = \sum_i tc_{ij}$ (4)
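A minimal sketch of the weighting scheme above, assuming each super-document is already available as a list of terms. The function name `normalized_weights` and the input layout are hypothetical, and reading the vector length as a plain sum of the term weights is an assumption based on the slide's definition.

```python
import math
from collections import Counter


def normalized_weights(super_docs):
    # super_docs: {concept name: list of terms in its super-document}
    n_docs = len(super_docs)
    tf = {name: Counter(terms) for name, terms in super_docs.items()}

    # Document frequency: number of super-documents containing each term.
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())

    weights = {}
    for name, counts in tf.items():
        # tc_ij = tf_ij * idf_i, with idf_i = log(|D| / df_i)
        tc = {t: n * math.log(n_docs / df[t]) for t, n in counts.items()}
        # vector-length_j, taken here as the sum of the term weights (assumption)
        length = sum(tc.values())
        # ntc_ij = tc_ij / vector-length_j
        weights[name] = {t: (w / length if length else 0.0) for t, w in tc.items()}
    return weights


w = normalized_weights({
    "Sports": ["match", "goal", "match"],
    "Computers": ["code", "match"],
})
print(w["Sports"])
```

In this toy input "match" occurs in every super-document, so its idf is log(1) = 0 and it receives zero weight, while the terms unique to one concept carry all of that concept's normalized weight.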
13 Observations Using a Flat Classifier
- The following observations were made using a flat classifier:
- Using 20-40 documents per concept gives a good compromise between the amount of training data and classifier precision.
- Classifier precision is independent of the number of concepts.
- Classifier precision does not depend on the type of concepts.
- The flat classifier has an exact-match precision of around 51%.
14 Experiments and Evaluation
- The goals of the experiments were to:
- Evaluate and tune the classifiers at each level.
- Determine the optimum number of documents to use for training the concepts.
- Determine the maximum classifier precision at each level.
- The ODP hierarchy was used as the source of the training data for all the experiments.
- Test data: randomly selected level-3 documents.
15 Baseline Setup and Results
- All the concepts from levels 1, 2 and 3 with at least 20 documents (3183 concepts in total).
- 20 randomly selected documents from each concept used for training.
- 750 randomly selected level-3 documents used for testing.
- Accuracy: 48.2%.
16 Experiments and Evaluation
- 1 classifier at level 1, 15 at level 2, 356 at level 3.
- Documents from the parent and its children (and grandchildren) are put in the same pool to select from.
- Parameters we tune: depth and number of documents.
17 Experiments and Evaluation
- Trained the classifiers with a single document selected from each class (and subclass) and with level-4 training data included.
- On average, level-1 classes were trained with approximately 7078 documents, level-2 classifiers with 7063 and level-3 with 6707.
- The same set of documents used for the baseline was used for testing.
21 Experiments and Evaluation
- Observations
- 79.5% of documents had an exact match at level 1, 71.3% at level 2 and 70.1% at level 3.
- The number of top words used for categorizing the documents at each level varied directly with the number of concepts trained at that level: 3 words for level 1, 5 words for level 2 and 19 words for level 3.
- The exact-match precision of the hierarchical classifier increased significantly (t-test value 2.56E-06), by 45.4%, compared to the flat classifier.
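The 45.4 figure appears to be a relative improvement: it matches the level-3 exact-match result (70.1) measured against the baseline accuracy (48.2). Treating those two numbers as the ones being compared is an assumption; the check itself is simple arithmetic, and the helper name `relative_improvement` is made up for this example.

```python
def relative_improvement(baseline, improved):
    # Relative change of `improved` over `baseline`, in percent.
    return (improved - baseline) / baseline * 100.0


flat_accuracy = 48.2        # baseline (flat) accuracy reported in the slides
hierarchical_level3 = 70.1  # exact match at level 3 reported in the slides
print(round(relative_improvement(flat_accuracy, hierarchical_level3), 1))  # 45.4
```

In absolute terms the same comparison is a gain of 21.9 percentage points.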
22 Flat vs. Hierarchical Classification
23 Conclusions
- Explored the use of hierarchical structure for classifying web content.
- Built hierarchical classifiers, which increase classifier precision by focusing only on a small set of categories.
- Compared the performance of flat classification with hierarchical classification.
24 Future Work
- Classifier precision can be further improved by:
- Reducing the number of features needed to discriminate between the categories within the same top-level category.
- Introducing a recovery mechanism to remedy classification errors made early in the hierarchy.
- Selecting the documents closest to the centroid for training instead of selecting them at random.
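The last idea, choosing training documents near the centroid of a concept, could look roughly like the sketch below. It is only one possible reading: the use of raw term counts, cosine similarity and the function name `closest_to_centroid` are assumptions, not something the slides specify.

```python
import math
from collections import Counter


def centroid(vectors):
    # Average of sparse term-count vectors.
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: c / len(vectors) for t, c in total.items()}


def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_to_centroid(docs, k):
    # Rank a concept's documents by similarity to their centroid, keep the top k.
    vectors = [Counter(d.lower().split()) for d in docs]
    center = centroid(vectors)
    ranked = sorted(range(len(docs)),
                    key=lambda i: cosine(vectors[i], center),
                    reverse=True)
    return [docs[i] for i in ranked[:k]]


docs = ["football match goal", "football team match", "stock market prices"]
print(closest_to_centroid(docs, 2))
```

With this toy concept, the off-topic third document shares no terms with the other two, so it ends up furthest from the centroid and is left out of the training selection.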