TOWARDS PRACTICAL GENRE CLASSIFICATION FOR THE WEB
George Ferizis and Peter Bailey
CSIRO ICT Centre
Introduction
Many classification methods apply statistical
methods to a set of features obtained from the
data to obtain a function that can differentiate
between classes. Genre classification usually
follows this method, using either term frequency
features or features obtained by Part-Of-Speech
(POS) tagging the documents. While features
obtained from POS tagging have resulted in good
accuracy, POS tagging systems are too slow for
time-critical applications.
Experiment
  • The approximated POS features were compared to
    two other sets of features for the genre
    classification problem:
      • POS features
      • Term frequency features
  • Two experiments were run to compare these
    features:
      • A comparison of the throughput of each method
      • A comparison of the classification accuracy of
        each method
  • The genres that were used during classification
    were:
      • Newspaper editorial
      • Newspaper reportage
      • Scientific articles
      • Speeches

Table 1: The time spent in each phase during the
classification of 1000 documents.
Results
A comparison of the confusion matrices for the
POS features and the approximated POS features
shows that both confuse similar genres with each
other (table 3). Each confusion matrix shows the
percentage of documents of genre A (corresponding
to the row) that are classified as genre B
(corresponding to the column). The values in each
row add up to 100%.
Table 3: The confusion matrices for the POS
feature approach (darker triangular cells) and
the approximating approach (lighter triangular
cells). The matrices show that both methods
confused documents between genres in a similar
way, although with different magnitudes of
confusion.
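The row-percentage layout described for the confusion matrices can be sketched as follows. This is only an illustration of the normalisation, not the authors' code: the function name, the short genre labels and the toy label lists are all made up for the example, and no figures from Table 3 are reproduced.

```python
from collections import Counter

# Short labels for the four genres named on the poster (label spelling is an assumption).
GENRES = ["editorial", "reportage", "scientific", "speech"]


def confusion_percentages(true_labels, predicted_labels, genres=GENRES):
    """Return {true_genre: {predicted_genre: percent}}.

    Each row with at least one document sums to 100; empty rows are all zero.
    """
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = {}
    for actual in genres:
        row_total = sum(counts[(actual, predicted)] for predicted in genres)
        matrix[actual] = {
            predicted: (100.0 * counts[(actual, predicted)] / row_total
                        if row_total else 0.0)
            for predicted in genres
        }
    return matrix


# Toy labels, invented purely for illustration (not data from the poster).
true_labels = ["editorial", "editorial", "editorial", "editorial",
               "reportage", "reportage"]
predicted_labels = ["editorial", "editorial", "editorial", "reportage",
                    "reportage", "editorial"]

matrix = confusion_percentages(true_labels, predicted_labels)
for genre, row in matrix.items():
    print(genre, row)
```

Normalising by row rather than by the whole matrix makes genres with different numbers of test documents directly comparable, which is what allows the two approaches to be laid side by side in one table.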
A comparison of the number of documents that each
method classifies per second shows that the term
frequency and approximation approaches are two
orders of magnitude quicker than the POS approach
(table 2).
Table 1 shows the amount of time that is spent in
each phase during the classification of 1000
documents. 97% of the time spent classifying can
be attributed to the POS tagger. The results in
table 1 also show that it would take over 5 days
to classify a corpus containing 1,000,000
documents.
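The scaling from a 1000-document batch to a 1,000,000-document corpus is simple arithmetic, sketched below. The transcript does not carry the measured values from the tables, so the sketch works only from the stated bounds ("over 5 days" for POS tagging, "under 2 hours" for the approximating approach) and assumes that classification time grows linearly with corpus size. All names are hypothetical.

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR
CORPUS_SIZE = 1_000_000
BATCH_SIZE = 1000


def seconds_for_batch(corpus_seconds, corpus_size, batch_size):
    """Scale a corpus-level time down to one batch, assuming linear cost."""
    return corpus_seconds * batch_size / corpus_size


def docs_per_second(corpus_size, corpus_seconds):
    """Throughput implied by classifying corpus_size documents in corpus_seconds."""
    return corpus_size / corpus_seconds


# "Over 5 days" for the corpus means each 1000-document batch takes more than this.
pos_batch_seconds = seconds_for_batch(5 * SECONDS_PER_DAY, CORPUS_SIZE, BATCH_SIZE)

# Throughput bounds implied by the two stated corpus times.
pos_rate_upper_bound = docs_per_second(CORPUS_SIZE, 5 * SECONDS_PER_DAY)
approx_rate_lower_bound = docs_per_second(CORPUS_SIZE, 2 * SECONDS_PER_HOUR)

print(f"POS tagging: more than {pos_batch_seconds:.0f} s per 1000 documents")
print(f"POS tagging: fewer than {pos_rate_upper_bound:.2f} documents per second")
print(f"Approximation: more than {approx_rate_lower_bound:.1f} documents per second")
```

These are bounds derived from rounded statements, so they should be read as a sanity check on the reported figures, not as a replacement for the measured throughput in the tables.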
Conclusions
  • POS tagging is too slow for collections with
    millions of documents.
  • Approximating some POS tags reduces the time
    that is required to extract classification
    features from a corpus by two orders of
    magnitude.
  • Approximating the POS tags that are used as
    features results in a loss of 1-2% in
    classification accuracy.
  • The accuracy of classification when using
    approximated POS tags as features is still higher
    than using term frequency features.

Method
  • Since POS tagging is such a slow process, some POS
    features that are critical to the performance of
    the classifier are approximated using
    heuristics. These features are:
      • Adverbs
      • Present participles
      • Personal pronouns
      • A restricted set of determiners
  • The classifier also uses other simple features
    that can be determined quickly from the text in
    the document, such as average word length, the
    number of long words in the document, and average
    sentence length.

Table 2: The number of documents classified per
second by each method, including the time each
method requires to generate and analyse the
necessary features. Two orders of magnitude of
improvement can be gained by approximating POS
features. This reduces the time required to
classify a corpus of 1,000,000 documents from
over 5 days to under 2 hours.
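The poster does not say which heuristics approximate the POS features, so the following is only a guess at what such a fast, tagger-free extractor might look like. Every rule here is an assumption made for illustration: the "-ly" suffix for adverbs, the "-ing" suffix for present participles, the two word lists, and the 7-character threshold for a long word. The function name is hypothetical as well.

```python
import re

# Hypothetical word lists; the poster does not list the words it uses.
PERSONAL_PRONOUNS = {"i", "me", "we", "us", "you", "he", "him",
                     "she", "her", "it", "they", "them"}
DETERMINERS = {"the", "a", "an", "this", "that", "these", "those"}
LONG_WORD_LENGTH = 7  # assumed threshold for a "long" word


def approximate_features(text):
    """Compute cheap surface features without running a POS tagger."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    n = len(words) or 1  # avoid division by zero on empty input
    return {
        # Suffix rules are crude stand-ins for the real POS tags.
        "adverbs": sum(w.endswith("ly") for w in words) / n,
        "present_participles": sum(w.endswith("ing") and len(w) > 4
                                   for w in words) / n,
        # Closed word classes can be matched by lookup.
        "personal_pronouns": sum(w in PERSONAL_PRONOUNS for w in words) / n,
        "determiners": sum(w in DETERMINERS for w in words) / n,
        # Simple text statistics named on the poster.
        "avg_word_length": sum(len(w) for w in words) / n,
        "long_words": sum(len(w) >= LONG_WORD_LENGTH for w in words),
        "avg_sentence_length": len(words) / (len(sentences) or 1),
    }


features = approximate_features("We are quickly testing the system. It works.")
for name, value in features.items():
    print(name, value)
```

Rules of this kind will misfire on words such as "family" or "string", which is one plausible reason an approximation would cost some accuracy relative to real POS tags while needing only a single pass over the text.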

Using features that are derived from
approximating POS tags has similar accuracy to
actually using the POS tags as features. These
features are also more accurate than using
features from a term frequency approach (figure
1).
Authors: George Ferizis (george.ferizis_at_csiro.au),
Peter Bailey (peter.bailey_at_csiro.au)
Figure 1: The number of documents classified per
second by each method. This includes the time
each method requires to generate and analyse the
necessary features. Two orders of magnitude of
improvement can be gained by approximating POS
features.