Title: TOWARDS PRACTICAL GENRE CLASSIFICATION FOR THE WEB
George Ferizis and Peter Bailey
CSIRO ICT Centre
Introduction
Experiment
A comparison of the confusion matrices for the
POS features and the approximated POS features
shows that both approaches confuse similar genres
with each other (Table 3). Each cell of a
confusion matrix gives the percentage of
documents of genre A (the row) that are
classified as genre B (the column). The values in
each row sum to 100.
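The row normalisation described above can be sketched as follows. This is a minimal illustration, not the code used for the poster: the function name, the short genre labels and the toy predictions are all made up for the example.

```python
from collections import Counter

def confusion_percentages(true_labels, predicted_labels, genres):
    """Return a row-normalised confusion matrix as a list of rows.

    Row i holds the percentage of documents whose true genre is genres[i]
    that were classified as each genre in `genres` (the columns).
    """
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = []
    for actual in genres:
        row_total = sum(counts[(actual, predicted)] for predicted in genres)
        row = [
            100.0 * counts[(actual, predicted)] / row_total if row_total else 0.0
            for predicted in genres
        ]
        matrix.append(row)
    return matrix

# Toy data, invented purely for illustration (not results from the poster).
genres = ["editorial", "reportage", "scientific", "speech"]
true_labels = ["editorial"] * 4 + ["reportage"] * 4 + ["scientific"] * 2 + ["speech"] * 2
predicted_labels = [
    "editorial", "editorial", "editorial", "reportage",
    "reportage", "reportage", "editorial", "editorial",
    "scientific", "scientific",
    "speech", "editorial",
]

matrix = confusion_percentages(true_labels, predicted_labels, genres)
for genre, row in zip(genres, matrix):
    print(genre, row)
```

Normalising by row rather than by the overall total keeps genres with few documents comparable to genres with many.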
Many classification methods apply statistical
techniques to a set of features extracted from
the data to obtain a function that can
differentiate between classes. Genre
classification usually follows this approach,
using either term frequency features or features
obtained by Part-Of-Speech (POS) tagging the
documents. While POS-derived features have given
good accuracy, POS tagging systems are too slow
for time-critical applications.
These features were compared to two other sets of
features for the genre classification problem:
- POS features
- Term frequency features

Two experiments were run to compare these
features:
- A comparison of the throughput of each method
- A comparison of the classification accuracy of
each method

The genres that were used during classification
were:
- Newspaper editorial
- Newspaper reportage
- Scientific articles
- Speeches
Table 1 The time spent in each phase during the
classification of 1000 documents
Results
Table 3 The confusion matrices for the POS
feature approach (darker triangular cells) and
the approximating approach (lighter triangular
cells). The matrices show that both methods
confused documents between genres in a similar
way, although with different magnitudes of
confusion.
A comparison of the number of documents that each
method classifies per second shows that the term
frequency and approximation approaches are two
orders of magnitude faster than the POS approach
(Table 2).
Table 1 shows the amount of time that is spent in
each phase during the classification of 1000
documents. 97% of the time spent classifying can
be attributed to the POS tagger. The results in
Table 1 also show that it would take over 5 days
to classify a corpus containing 1,000,000
documents.
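The extrapolation from throughput to corpus time is simple arithmetic, sketched below. The function names are hypothetical, and the only inputs are the corpus size and the 5-day figure quoted above. The measured per-phase timings from Table 1 are not reproduced here, so the computed rate is only the upper bound implied by "over 5 days".

```python
SECONDS_PER_DAY = 24 * 60 * 60

def corpus_time_days(n_docs, docs_per_second):
    """Days needed to classify n_docs at a constant throughput."""
    return n_docs / docs_per_second / SECONDS_PER_DAY

def rate_for_duration(n_docs, days):
    """Throughput (documents per second) that finishes n_docs in exactly `days`."""
    return n_docs / (days * SECONDS_PER_DAY)

# Taking over 5 days for 1,000,000 documents means the POS approach
# runs at less than this rate (roughly 2.3 documents per second).
pos_rate_upper_bound = rate_for_duration(1_000_000, 5)
print(f"{pos_rate_upper_bound:.2f} docs/s")

# Sanity check: converting the rate back recovers the 5-day figure.
print(f"{corpus_time_days(1_000_000, pos_rate_upper_bound):.1f} days")
```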
Conclusions
- POS tagging is too slow for collections with
millions of documents.
- Approximating some POS tags reduces the time
that is required to extract classification
features from a corpus by two orders of
magnitude.
- Approximating the POS tags that are used as
features results in a loss of 1-2% in
classification accuracy.
- The accuracy of classification when using
approximated POS tags as features is still higher
than when using term frequency features.
Method
Table 2 The table shows the number of documents
classified per second by each method, including
the time each method requires to generate and
analyse the necessary features. Two orders of
magnitude of improvement can be gained by
approximating POS features. This reduces the time
required to classify a corpus of 1,000,000
documents from over 5 days to under 2 hours.
Since POS tagging is such a slow process, some POS
features that are critical to the performance of
the classifier are approximated using
heuristics. These features are:
- Adverbs
- Present participles
- Personal pronouns
- A restricted set of determiners

The classifier also uses other simple features
that can be determined quickly from the text of
the document, such as average word length, the
number of long words in the document and average
sentence length.
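The poster does not state which heuristics were used, so the following is only a sketch of what such approximations might look like. The suffix rules ("-ly" for adverbs, "-ing" for present participles), the word lists and the long-word threshold are all assumptions made for illustration, not the authors' method.

```python
import re

# Assumed word lists: the poster does not say which pronouns or
# determiners were included in its restricted sets.
PERSONAL_PRONOUNS = {"i", "you", "he", "she", "it", "we", "they",
                     "me", "him", "her", "us", "them"}
DETERMINERS = {"the", "a", "an", "this", "that", "these", "those"}
LONG_WORD_LENGTH = 7  # assumed threshold for a "long" word

def approximate_features(text):
    """Extract cheap surface features without running a POS tagger."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    n_words = len(words)
    return {
        # Suffix heuristics (assumed): crude stand-ins for POS tags.
        "adverbs": sum(w.endswith("ly") for w in words),
        "present_participles": sum(w.endswith("ing") and len(w) > 4 for w in words),
        # Closed-class words can be counted by simple set lookup.
        "personal_pronouns": sum(w in PERSONAL_PRONOUNS for w in words),
        "determiners": sum(w in DETERMINERS for w in words),
        # Simple text statistics named in the poster.
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "long_words": sum(len(w) >= LONG_WORD_LENGTH for w in words),
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
    }

features = approximate_features("She was quickly running home. They saw the dog.")
print(features)
```

Suffix rules like these misfire on words such as "family" or "thing". That kind of error is the price of avoiding the tagger, and the reported accuracy loss suggests the trade can be worthwhile.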
Features derived from approximated POS tags give
accuracy similar to that of features from actual
POS tags. They are also more accurate than
features from a term frequency approach (Figure
1).
Authors
George Ferizis george.ferizis_at_csiro.au
Peter Bailey peter.bailey_at_csiro.au
Figure 1 The number of documents classified per
second by each method. This includes the time
each method requires to generate and analyse the
necessary features. Two orders of magnitude of
improvement can be gained by approximating POS
features.