Hierarchical Text Classification


1
  • Hierarchical Text Classification
  • Ashwin K Pulijala
  • Susan Gauch
  • Department of Electrical Engineering and Computer
    Science
  • University of Kansas
  • Lawrence, KS 66049

2
Presentation Outline
  • Introduction
  • Goal
  • Text Classification
  • Hierarchical Classification Advantages
  • Types of Category Structures
  • Hierarchical Classification Approaches
  • System Architecture
  • Experiments and Evaluation
  • Conclusion
  • Future Work

3
Introduction
  • KeyConcept: a conceptual search engine.
  • Indexes documents using a combination of keywords
    and concepts.
  • Retrieves documents based on a combination of
    keyword and conceptual matching.
  • Finds information that is more relevant.
  • Classification is automatic but uses a flat
    classifier.
  • Categories are selected from a hierarchical
    arrangement of concepts.

4
Goal
  • Explore the use of hierarchical structure for
    classifying web content.
  • Construct hierarchical classifiers.
  • Compare the performance of a flat classifier with
    that of a hierarchical classifier.

5
Text Classification
  • Two-step process: training the classifier and
    classifying new documents.
  • Training phase:
  • The classifier is fed documents that have been
    classified manually.
  • It learns the features (vocabulary) of the
    various categories into which new documents can
    be classified.

6
Text Classification (contd.)
  • Classification phase:
  • The classifier assigns categories to new documents
    based on the similarity between the features of
    the input document and those of the categories it
    learned during training.
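
The two phases can be illustrated with a minimal sketch. It is not the KeyConcept classifier: it assumes whitespace tokenization and a simple term-frequency overlap score, and the names `train`, `classify` and the toy documents are hypothetical.

```python
from collections import Counter

def train(labeled_docs):
    """Training phase: build one term-count profile per category."""
    profiles = {}
    for category, text in labeled_docs:
        profiles.setdefault(category, Counter()).update(text.lower().split())
    return profiles

def classify(profiles, text):
    """Classification phase: pick the category whose profile overlaps most."""
    words = Counter(text.lower().split())

    def score(category):
        profile = profiles[category]
        total = sum(profile.values())
        # Weighted overlap between document terms and the category vocabulary.
        return sum(count * profile[w] / total for w, count in words.items())

    return max(profiles, key=score)

# Toy, manually labeled training data (hypothetical).
training_data = [
    ("Arts", "guitar album concert"),
    ("Arts", "film actor"),
    ("Science", "quantum physics lab"),
    ("Science", "cell biology lab"),
]
profiles = train(training_data)
predicted = classify(profiles, "a new physics lab")
print(predicted)
```

Any similarity measure could replace the overlap score; the point is only that classification compares a new document's features with what was learned per category.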

7
Flat Classification vs. Hierarchical Classification

8
Hierarchical Classification Advantages
  • Increases accuracy by exploiting the relationships
    among the categories.
  • Deals effectively with very large problems.
  • Decomposes a large problem into a set of smaller
    problems and solves each one more accurately by
    focusing on only a small set of categories.
  • Allows construction of specialized classifiers.

9
Types of Category Structures
  • Virtual category tree.
  • Category tree.
  • Virtual directed acyclic category graph.
  • Directed acyclic category graph.

10
Hierarchical Classification Approaches
  • Big-bang approach:
  • The document is classified in a single step.
  • Top-down level-based approach:
  • Classifiers are built at each level of the
    category tree.
  • Poorly performing classifiers at higher levels
    lead to poor performance of the lower-level
    classifiers, because their errors propagate down
    the tree.
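
The top-down level-based approach can be sketched as a walk down the category tree, with one local decision per level. This is a minimal illustration under assumptions: the `Node` class, the keyword-matching stand-in classifier and the toy tree are all hypothetical, not the system described in these slides.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class Node:
    """A category with an optional local classifier over its children."""
    name: str
    children: Dict[str, "Node"] = field(default_factory=dict)
    # Hypothetical local classifier: maps a document to one child name.
    classify: Optional[Callable[[str], str]] = None

def keyword_classifier(keywords):
    """Toy stand-in: choose the child with the most keyword hits."""
    def classify(doc):
        words = doc.lower().split()
        return max(keywords, key=lambda c: sum(w in keywords[c] for w in words))
    return classify

def top_down(root: Node, doc: str) -> List[str]:
    """Descend one level at a time; each choice restricts the next one."""
    path, node = [], root
    while node.children:
        node = node.children[node.classify(doc)]
        path.append(node.name)
    return path

arts = Node("Arts", {"Music": Node("Music"), "Movies": Node("Movies")},
            keyword_classifier({"Music": {"guitar", "album"},
                                "Movies": {"film", "actor"}}))
science = Node("Science", {"Physics": Node("Physics"), "Biology": Node("Biology")},
               keyword_classifier({"Physics": {"quantum"}, "Biology": {"cell"}}))
root = Node("Top", {"Arts": arts, "Science": science},
            keyword_classifier({"Arts": {"guitar", "album", "film", "actor"},
                                "Science": {"quantum", "cell"}}))

path = top_down(root, "new guitar album review")
print(path)
```

Because the loop never backtracks, a wrong choice at the top level confines every later decision to the wrong subtree, which is the weakness noted above.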

11
System Architecture

12
Text Classification: Our Approach
  • The weight of a given term is
    $tc_{ij} = tf_{ij} \cdot idf_i$    (1)
  • $idf_i = \log\left(\frac{\text{number of documents in } D}{\text{number of documents in } D \text{ that contain } t_i}\right)$    (2)
  • $D$: the collection of super-documents.
  • $t_i$: the $i$th term in the vocabulary.
  • $d_j$: the $j$th super-document.
  • $tf_{ij}$: the number of occurrences of $t_i$ in $d_j$.    (3)
  • Normalized weight:
    $ntc_{ij} = tc_{ij} / \text{vector-length}_j$    (4)
  • where $\text{vector-length}_j = \sum_i tc_{ij}$
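
As a worked illustration of the term weighting above, here is a small sketch that computes the normalized weights for a toy collection. The function name `normalized_weights`, the toy super-documents and the use of the natural logarithm are assumptions; the slides do not state the log base, and the base cancels out in the normalized weight anyway.

```python
import math
from collections import Counter

def normalized_weights(super_docs):
    """Return ntc_ij for every term i in every super-document j."""
    n_docs = len(super_docs)
    # Number of super-documents in D that contain each term.
    doc_freq = Counter()
    for tokens in super_docs.values():
        doc_freq.update(set(tokens))

    weights = {}
    for name, tokens in super_docs.items():
        tf = Counter(tokens)
        # tc_ij = tf_ij * idf_i, with idf_i = log(|D| / df_i)
        tc = {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}
        length = sum(tc.values())  # vector-length_j = sum_i tc_ij
        weights[name] = {t: (w / length if length else 0.0) for t, w in tc.items()}
    return weights

# Toy super-documents, one per concept (hypothetical data).
toy = {
    "Arts": "music film music painting".split(),
    "Science": "physics biology film".split(),
    "Sports": "soccer tennis soccer".split(),
}
w = normalized_weights(toy)
for term, value in sorted(w["Arts"].items()):
    print(term, round(value, 3))
```

In the toy data, "film" appears in two of the three super-documents, so it receives a lower weight within "Arts" than "music" or "painting", which are unique to that concept.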

13
Observations Using a Flat Classifier
  • The following observations were made using a flat
    classifier:
  • Using 20-40 documents per concept gives a good
    compromise between the amount of training data
    and classifier precision.
  • Classifier precision is independent of the number
    of concepts.
  • Classifier precision does not depend on the type
    of concepts.
  • The flat classifier has an exact-match precision
    of around 51%.

14
Experiments and Evaluation
  • The goals of the experiments were to:
  • Evaluate and tune the classifiers at each level.
  • Determine the optimum number of documents to use
    for training the concepts.
  • Determine the maximum classifier precision at
    each level.
  • The ODP hierarchy was used as the source of
    training data for all the experiments.
  • Test data: randomly selected level-3 documents.

15
Baseline Setup And Results
  • All concepts from levels 1, 2 and 3 with at
    least 20 documents (3183 concepts in total).
  • 20 randomly selected documents from each concept
    were used for training.
  • 750 randomly selected level-3 documents were used
    for testing.
  • Accuracy: 48.2%.

16
Experiments and Evaluation
  • 1 classifier at level 1, 15 at level 2, 356 at
    level 3.
  • Documents from the parent, its children (and
    grandchildren) are put in the same pool to select
    from.
  • Parameters we tune: depth and number of documents.
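
The pooling step and its two tuning parameters can be sketched as follows. This is an assumed reading of the slide, not the authors' code: `training_pool`, `sample_training_docs`, the dictionary-based tree and the toy document IDs are all hypothetical.

```python
import random

def training_pool(tree, docs, concept, depth):
    """Collect documents of `concept` and of its descendants up to `depth` levels down."""
    pool = list(docs.get(concept, []))
    if depth > 0:
        for child in tree.get(concept, []):
            pool.extend(training_pool(tree, docs, child, depth - 1))
    return pool

def sample_training_docs(tree, docs, concept, depth, n_docs, seed=0):
    """Randomly select up to `n_docs` documents from the pooled set."""
    pool = training_pool(tree, docs, concept, depth)
    rng = random.Random(seed)
    return rng.sample(pool, min(n_docs, len(pool)))

# Toy hierarchy: Arts -> Music -> Jazz (hypothetical document IDs).
tree = {"Arts": ["Music"], "Music": ["Jazz"]}
docs = {"Arts": ["a1", "a2"], "Music": ["m1"], "Jazz": ["j1", "j2"]}

for depth in (0, 1, 2):
    print(depth, training_pool(tree, docs, "Arts", depth))
selected = sample_training_docs(tree, docs, "Arts", depth=2, n_docs=3)
```

Here depth 0 uses only the concept's own documents, depth 1 adds its children, and depth 2 adds its grandchildren.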

17
Experiments and Evaluation
  • Trained the classifiers with a single document
    selected from each class (and subclass) and with
    level-4 training data included.
  • On average, level-1 classes were trained with
    approximately 7078 documents, level-2 classifiers
    with 7063 and level-3 with 6707.
  • The same set of documents used for the baseline
    was used for testing.

18
Experiments and Evaluation

19
Experiments and Evaluation

20
Experiments and Evaluation

21
Experiments and Evaluation
  • Observations:
  • 79.5% of documents had an exact match at level 1,
    71.3% at level 2 and 70.1% at level 3.
  • The number of top words used to categorize
    documents at each level varied directly with the
    number of concepts trained at that level: 3
    words for level 1, 5 words for level 2 and
    19 words for level 3.
  • The exact-match precision of the hierarchical
    classifier increased significantly (t-test
    p-value 2.56E-06), by 45.4%, compared to the flat
    classifier.
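
The 45.4 figure appears to be a relative improvement. The check below assumes it compares the flat baseline accuracy (48.2) with the hierarchical exact-match rate at level 3 (70.1), both read as percentages; the slides do not state this pairing explicitly.

```python
flat = 48.2          # baseline flat-classifier accuracy, in percent
hierarchical = 70.1  # hierarchical exact match at level 3, in percent

absolute_gain = hierarchical - flat          # in percentage points
relative_gain = 100 * absolute_gain / flat   # relative to the flat baseline

print(f"absolute: {absolute_gain:.1f} points, relative: {relative_gain:.1f}%")
```

Under that assumption the absolute gain is 21.9 percentage points, and dividing by the baseline gives the 45.4 relative increase reported.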

22
Flat vs. Hierarchical Classification

23
Conclusions
  • Explored the use of hierarchical structure for
    classifying web content.
  • Built hierarchical classifiers, which increase
    the classifier precision by focusing only on a
    small set of categories.
  • Compared the performance of flat classification
    with that of hierarchical classification.

24
Future Work
  • Classifier precision can be further improved by:
  • Reducing the number of features needed to
    discriminate between the categories within the
    same top-level category.
  • Introducing some recovery mechanism to remedy the
    classification errors made early in the
    hierarchy.
  • Selecting documents closest to the centroid for
    training instead of random selection.
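
The last idea, choosing training documents near the centroid rather than at random, might look like the sketch below. It assumes sparse term-weight vectors and cosine similarity as the closeness measure; the slides name neither, and `closest_to_centroid` and the toy vectors are hypothetical.

```python
import math

def closest_to_centroid(vectors, k):
    """Return the names of the k documents most cosine-similar to the centroid."""
    terms = sorted({t for v in vectors.values() for t in v})
    # Centroid: the mean weight of each term over all documents in the concept.
    centroid = {t: sum(v.get(t, 0.0) for v in vectors.values()) / len(vectors)
                for t in terms}

    def cosine(a, b):
        dot = sum(a.get(t, 0.0) * b.get(t, 0.0) for t in terms)
        norm_a = math.sqrt(sum(x * x for x in a.values()))
        norm_b = math.sqrt(sum(x * x for x in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    ranked = sorted(vectors, key=lambda name: cosine(vectors[name], centroid),
                    reverse=True)
    return ranked[:k]

# Toy document vectors for one concept (hypothetical weights).
concept_docs = {
    "d1": {"guitar": 2.0, "album": 1.0},
    "d2": {"guitar": 1.0, "album": 1.0},
    "d3": {"recipe": 3.0},
}
chosen = closest_to_centroid(concept_docs, k=2)
print(chosen)
```

The off-topic document `d3` sits farthest from the centroid and is left out, which is the intended effect of replacing random selection.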