Integrate Text Clustering Features in Text Categorization System - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Integrate Text Clustering Features in Text Categorization System


1
Integrate Text Clustering Features in Text
Categorization System
  • ZHOU Qiang, ZHENG Yabin
  • Computer Science and Artificial Intelligence
    Division National Laboratory for Information
    Science and Technology
  • Dept. of Computer Science and technology
  • Tsinghua University, Beijing 100084

2
Outline
  • Background researches
  • Basic algorithms
  • The Integration algorithm
  • Experimental results
  • Conclusions

3
Automatic Text Categorization
  • Task description:
  • To build software tools capable of classifying
    text documents under predefined categories or
    topic codes.
  • Dominant techniques: machine learning and
    statistical models
  • Rocchio method [2]
  • Naïve Bayes model [3]
  • K-NN algorithm [4]
  • Decision tree model [5]
  • Neural network [6]
  • Support Vector Machine (SVM) [7]
  • A comparison study by Yiming Yang [9]
  • Among the above ML models, SVM and K-NN show the
    best categorization performance
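The SVM and K-NN classifiers named above can be sketched with off-the-shelf tools; this is a minimal illustration using scikit-learn, not the systems compared in [9], and the toy corpus and labels are invented:

```python
# Minimal sketch of SVM and K-NN text categorization with scikit-learn.
# The toy corpus and labels below are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

docs = ["stock market rises today",
        "team wins the match",
        "new bank interest rate",
        "player scores the goal"]
labels = ["economy", "sports", "economy", "sports"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)          # TF-IDF document vectors

svm = LinearSVC().fit(X, labels)
knn = KNeighborsClassifier(n_neighbors=1).fit(X, labels)

test = vec.transform(["the match ended with a goal"])
print(svm.predict(test)[0], knn.predict(test)[0])
```

Both classifiers consume the same TF-IDF vectors, which mirrors the slide's point that the real contrast between ML models lies in how they aggregate features, not in the features themselves.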

4
A New Idea for ATC
  • Two key techniques for a typical ATC system
  • How to select suitable, highly discriminative
    features to represent the topical characteristics
    of a document?
  • How to aggregate these features to form several
    suitable categories for the document?
  • Our method: integrate supportive features
    extracted from a text clustering model to improve
    the classification performance of an ATC system
  • Text categorization algorithm: FIFA
  • Developed by Dr. Jingbo Zhu at Northeast Univ.,
    China
  • Text clustering algorithm: MPCA
  • Developed by Dr. Wray Buntine at HIIT, Finland

5
Basic Algorithm 1: FIFA
  • A supervised text categorization algorithm
  • Basic functions: Features Identification and
    Features Aggregation
  • Based on a large-scale knowledge base with about
    400,000 entries manually annotated with detailed
    topic and domain information.

(Diagram: Chinese texts → FIFA → topic tags)
6
Output of FIFA
  • Gives 10 topic tags and their weights for each
    document
  • The top-3 topic tags are reliable
  • Lower-weighted tags may be noise
  • About 900 topic tags are used in the FIFA
    algorithm
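The top-3 filtering described above can be sketched as follows; the tag names and weights are invented, since the slides do not show FIFA's actual output format:

```python
# Hypothetical FIFA-style output: (tag, weight) pairs for one document.
# Keep only the top-3 tags; lower-weighted tags are treated as noise.
fifa_tags = [("sports", 0.41), ("education", 0.18), ("arts", 0.12),
             ("economy", 0.08), ("military", 0.05), ("traffic", 0.04),
             ("politics", 0.04), ("medicine", 0.03), ("computer", 0.02),
             ("philosophy", 0.02)]

top3 = sorted(fifa_tags, key=lambda tw: tw[1], reverse=True)[:3]
print([tag for tag, _ in top3])   # → ['sports', 'education', 'arts']
```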

7
Basic Algorithm 2: MPCA
  • An unsupervised text clustering algorithm
  • Basic function: a multinomial variant of
    discrete Principal Component Analysis

(Diagram: text corpora and cluster number N (N ≥ 2) → MPCA → clusters)
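The slides do not show MPCA's internals; as a stand-in, a closely related topic model (scikit-learn's LatentDirichletAllocation, not the authors' MPCA implementation) produces the same kind of document-by-cluster output:

```python
# Stand-in sketch: MPCA is a multinomial variant of discrete PCA, closely
# related to topic models such as LDA. This is NOT the authors' MPCA code.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["stock bank market rate", "goal match player team",
        "bank economy market", "team goal sports match"]

X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)  # N = 2
dbc = lda.fit_transform(X)   # rows: documents, columns: cluster probabilities

print(dbc.shape)             # (4 documents, 2 clusters); each row sums to 1
```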
8
Output of MPCA
  • Gives several clusters and all the documents in
    them, along with their probabilities
  • The documents with the highest probabilities are
    reliable
  • The documents with the lowest probabilities may
    be useless

9
The Integration Algorithm
  • Topic pre-selection
  • Document pre-selection
  • Label the cluster with suitable topic tags
  • Feedback topic information of clusters
  • Information combination

10
Step 1: Topic pre-selection
  • Goal: to generate a good DBT
  • DBT: a Document-By-Topic matrix
  • Each element (DBT)ij represents the probability
    that document Di has topic Tj
  • Method: remove noise information based on the
    following heuristics
  • If a document has a salient topic tag, its other
    tags can be regarded as noise information.
  • If a topic appears in almost all the documents
    or in only a small part of them, it can be
    regarded as a noisy topic.
  • Thresholds were set for the above removal
    operations
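The two heuristics can be sketched numerically; the matrix and both thresholds below are illustrative assumptions, not the paper's values:

```python
import numpy as np

# Toy Document-By-Topic matrix: 3 documents x 3 topics (illustrative values).
DBT = np.array([[0.8, 0.1, 0.1],
                [0.4, 0.4, 0.2],
                [0.1, 0.7, 0.2]])

# Heuristic 1: if a document has a salient topic, zero out its other topics.
SALIENT = 0.6                       # assumed salience threshold
for i in range(len(DBT)):
    if DBT[i].max() >= SALIENT:
        top = DBT[i].argmax()
        keep = DBT[i, top]
        DBT[i] = 0.0
        DBT[i, top] = keep

# Heuristic 2: drop topics occurring in almost all or almost no documents.
doc_frac = (DBT > 0).mean(axis=0)             # fraction of documents per topic
noisy = (doc_frac > 0.9) | (doc_frac < 0.1)   # assumed frequency bounds
DBT[:, noisy] = 0.0

print(DBT)
```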

11
Step 2: Document pre-selection
  • Goal: to generate a good DBC
  • DBC: a Document-By-Cluster matrix
  • Each element (DBC)ij represents the probability
    that document Di belongs to cluster Cj
  • Method
  • Remove the documents with lower probabilities in
    a cluster.
  • The threshold is set to the average of the
    probabilities in the cluster.
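This per-cluster average threshold can be sketched as follows (the matrix values are illustrative):

```python
import numpy as np

# Toy Document-By-Cluster matrix: 3 documents x 2 clusters (illustrative).
DBC = np.array([[0.9, 0.1],
                [0.6, 0.4],
                [0.2, 0.8]])

# Per the slide, the threshold is each cluster's average probability:
# documents below a cluster's average are removed from that cluster.
avg = DBC.mean(axis=0)                         # per-cluster averages
DBC_filtered = np.where(DBC >= avg, DBC, 0.0)

print(DBC_filtered)
```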

12
Step 3: Label the clusters
  • Goal: to compute a CBT
  • CBT: a Cluster-By-Topic matrix
  • Each element (CBT)ij represents the probability
    of cluster Ci having topic Tj
  • Computational formula: CBT = DBCᵀ × DBT
  • Where
  • DBC is a Document-By-Cluster matrix
  • DBT is a Document-By-Topic matrix
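One way to compute the CBT from the DBC and DBT defined on this slide, assuming a row-normalized matrix product (all values illustrative):

```python
import numpy as np

# Illustrative inputs: 2 documents x 2 clusters, 2 documents x 2 topics.
DBC = np.array([[0.9, 0.1],
                [0.2, 0.8]])
DBT = np.array([[0.7, 0.3],
                [0.1, 0.9]])

# CBT: each cluster's topic distribution is the membership-weighted sum of
# its documents' topic distributions, normalized so each row sums to 1.
CBT = DBC.T @ DBT
CBT /= CBT.sum(axis=1, keepdims=True)

print(CBT.shape)   # (2 clusters, 2 topics)
```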

13
Step 4: Topic information feedback
  • Goal: to compute an FDBT
  • FDBT: a Feedback-Document-By-Topic matrix
  • Each element (FDBT)ij represents the feedback
    probability of document Di with topic Tj
  • Computational formula: FDBT = DBC × CBT
  • Where
  • DBC is a Document-By-Cluster matrix
  • CBT is a Cluster-By-Topic matrix
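A sketch of this step, assuming the feedback is the product of the two matrices defined on this slide (illustrative values):

```python
import numpy as np

# Illustrative matrices: 2 documents x 2 clusters, 2 clusters x 2 topics.
DBC = np.array([[0.9, 0.1],
                [0.2, 0.8]])
CBT = np.array([[0.6, 0.4],
                [0.2, 0.8]])

# FDBT: each document inherits its clusters' topic distributions,
# weighted by its cluster membership probabilities.
FDBT = DBC @ CBT
print(FDBT)
```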

14
Step 5: Information combination
  • Goal: to integrate the following information
  • The original FIFA topic tags
  • The feedback topic tags from labeled clusters
  • Computational formula: DBT' = α × DBT + β × FDBT
  • Where
  • DBT is a Document-By-Topic matrix
  • FDBT is a Feedback-Document-By-Topic matrix
  • α and β are the parameters for tuning the
    weight between the original and feedback
    information.
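The combination can be sketched as a weighted sum; the weights and topic rows below are illustrative, not the paper's settings:

```python
import numpy as np

# One document's original (DBT) and feedback (FDBT) topic rows (illustrative).
DBT  = np.array([0.7, 0.3])
FDBT = np.array([0.5, 0.5])

# Weighted combination of original and feedback topic information.
alpha, beta = 0.7, 0.3
combined = alpha * DBT + beta * FDBT
print(combined)   # [0.64 0.36]
```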

15
Outline
  • Background researches
  • Basic algorithms
  • The Integration algorithm
  • Experimental results
  • Conclusions

16
Experiment 1: 863 data set
  • Uses the Chinese library classification system
  • 37 domain tags and 885 topic tags
  • Since it was mainly designed for human experts,
    there are many ambiguous boundaries between
    different tags
  • It may not be suitable for ATC.
  • 3,600 articles were manually annotated with
    different domain tags
  • Automatic mapping from FIFA's topic tags to the
    domain tags used in the 863 tag set
  • 885 topic tags → 37 domain tags
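The 885 → 37 mapping can be sketched as a simple lookup table; the entries below are invented for illustration, since the slides do not show the real mapping:

```python
# Hypothetical fragment of the topic-tag -> domain-tag mapping
# (illustrative entries; the real table maps 885 tags to 37 domains).
topic_to_domain = {
    "stock market": "economics",
    "football": "sports",
    "oil painting": "arts",
}

def map_tags(topic_tags):
    """Map fine-grained FIFA topic tags to coarse 863 domain tags."""
    return [topic_to_domain.get(t, "unknown") for t in topic_tags]

print(map_tags(["football", "oil painting"]))   # → ['sports', 'arts']
```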

17
Experimental result (1): 863 data set
  • Initial FIFA vs. result after topic pre-selection

18
Experimental result (2): 863 data set
  • Result after topic pre-selection vs. result
    after feedback and information combination

19
863 data set: some examples
  • Correct domain tag: philosophy (哲学)

20
863 data set: some examples
(Table: FIFA output vs. integration algorithm output)
21
863 data set: some examples
  • Ambiguous tags: Agriculture (农业), Business (商业)

22
863 data set: some examples
(Table: FIFA output vs. integration algorithm output)
23
Experimental result analysis
  • ATC precision improvement
  • After topic pre-selection: 46.4 → 54.9 (Top-1
    tag)
  • After feedback and combination: 54.9 → 57.4
  • The result is not very good and falls below our
    expectations, due to the following reasons
  • FIFA uses a very large topic tag set (about 900
    tags)
  • It is a big challenge for an ATC system
  • Even for a human expert, it is not an easy task
  • MPCA is sensitive to its cluster number N
  • We set N = 20 in our experiment
  • There is an information gap between FIFA and MPCA
  • Question: can the new algorithm perform better on
    a smaller tag set?

24
Experiment 2: FD data set
  • Uses a topic tag set with 9 tags
  • computer, traffic, education, economy, military,
    sports, medicine, arts and politics
  • Data set: 2,615 articles extracted from the
    Internet and manually annotated with the above 9
    topic tags
  • Training set: 450 articles
  • Test set: 2,165 articles
  • Processing procedure: stages 2–4 of our
    algorithm
  • Use the MPCA algorithm to cluster all 2,615
    documents into 9 clusters
  • Label them with suitable tags based on the
    correct tag information of the training documents
    among them
  • Feed back the cluster tag to all the test
    documents in the cluster.

25
Experimental result: Precision
26
Experimental result: Recall
27
Experimental result: F1-measure
28
Experimental result: Overall
(Chart: overall comparison; INT-ALG denotes the integration algorithm)
29
Conclusions
  • We proposed a new algorithm to integrate text
    clustering features into a text categorization
    system
  • The integrated algorithm showed a modest
    performance improvement on the larger topic tag
    set
  • Its overall performance lies between SVM and K-NN
    on the smaller topic tag set
  • New techniques will be explored to further
    improve the performance of the integration
    algorithm in the future

30
  • Question and Answer