Title: Text Classification
1. Text Classification
- Chapter 2 of Learning to Classify Text Using Support Vector Machines by Thorsten Joachims, Kluwer, 2002.
2. Text Classification (TC) Definition
- Infer a classification rule from a sample of labelled training documents (the training set) so that it classifies new examples (the test set) with high accuracy.
- Using the ModApte split, the ratio of training documents to test documents is about 3:1.
3. Three settings
- Binary setting (simplest). Only two classes, e.g. relevant vs. non-relevant in IR, spam vs. legitimate in spam filters.
- Multi-class setting, e.g. email routing at a service hotline to one out of ten customer representatives. Can be reduced to binary tasks using the one-against-the-rest strategy (see the sketch below).
- Multi-label setting, e.g. semantic topic identifiers for indexing news articles. An article can be in one, many, or no categories. Can also be split into a set of binary classification tasks.
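A minimal sketch of the one-against-the-rest reduction. The helper names train_binary and score are placeholders of our own for whatever binary learner is plugged in; they are not from the slides.

    # One-against-the-rest: train one binary classifier per class, where the
    # positives are the documents of that class and the negatives are the rest.
    def one_vs_rest_train(docs, labels, classes, train_binary):
        return {c: train_binary(docs, [1 if y == c else 0 for y in labels])
                for c in classes}

    def one_vs_rest_predict(doc, models, score):
        # Assign the class whose binary classifier gives the highest score.
        return max(models, key=lambda c: score(models[c], doc))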
4. Representing text as example vectors
- The basic building blocks for representing text will be called indexing terms.
- Word-based terms are the most common. Very effective in IR, even though words such as "bank" have more than one meaning.
- Advantage of simplicity: split the input text into words by white space.
- Assume the ordering of words is irrelevant: the bag-of-words model. Only the frequency of each word in the document is recorded.
- The bag-of-words model ensures that each document is represented by a vector of fixed dimensionality. Each component of the vector represents the value (e.g. the frequency of that word in that document, TF) of one attribute (see the sketch below).
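A minimal bag-of-words sketch along these lines. The vocabulary and example document are invented; tokenisation is just a whitespace split, as on the slide.

    def bag_of_words(doc, vocabulary):
        # Fixed dimensionality: one component per vocabulary term, holding the
        # term frequency (TF) of that term in the document.
        tokens = doc.lower().split()
        return [tokens.count(term) for term in vocabulary]

    vocab = ["bank", "money", "river"]
    print(bag_of_words("The bank lent money to the bank", vocab))   # [2, 1, 0]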
6. Other levels of text representation
- More sophisticated representations than the bag-of-words have not yet shown consistent and substantial improvements.
- Sub-word level, e.g. character n-grams, which are robust against spelling errors (see the sketch below). See Kjell's neural network.
- Multi-word level. May use syntactic phrase indexing such as noun phrases (e.g. adjective-noun pairs) or co-occurrence patterns (e.g. "speed limit").
- Semantic level. Latent Semantic Indexing (LSI) aims to automatically generate semantic categories based on a bag-of-words representation. Another approach would make use of thesauri.
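A small sketch of character n-gram extraction (the example words are ours), showing why a misspelling still shares most of its trigrams with the correct spelling:

    def char_ngrams(text, n=3):
        # Slide a window of length n over the characters of the text.
        text = text.lower()
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    correct = set(char_ngrams("classification"))
    misspelt = set(char_ngrams("classiffication"))
    print(len(correct & misspelt), "of", len(correct), "trigrams shared")   # 11 of 12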
7. Feature Selection
- To remove irrelevant or inappropriate attributes from the representation.
- Advantages are protection against over-fitting, and increased computational efficiency with fewer dimensions to work with.
- The 2 most common strategies:
- a) Feature subset selection: use a subset of the original features.
- b) Feature construction: new features are introduced by combining original features.
8. Feature subset selection techniques
- Stopword elimination (removes high-frequency words).
- Document frequency thresholding (remove infrequent words, e.g. those occurring fewer than m times in the training corpus); see the sketch below.
- Mutual information.
- Chi-squared test (X²).
- But an appropriate learning algorithm should be able to detect irrelevant features as part of the learning process.
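A sketch of document frequency thresholding: keep only terms that appear in at least m training documents. The threshold m and the toy documents are illustrative.

    from collections import Counter

    def df_threshold(tokenised_docs, m=2):
        df = Counter()
        for doc in tokenised_docs:
            df.update(set(doc))            # count each term once per document
        return {t for t, n in df.items() if n >= m}

    docs = [["cheap", "money", "now"], ["money", "market"], ["cheap", "money"]]
    print(df_threshold(docs, m=2))          # {'cheap', 'money'}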
9. Mutual Information
- We consider the association between a term t and a category c. How often do they occur together, compared with how common the term is and how common membership of the category is?
- A is the number of times t occurs in c.
- B is the number of times t occurs outside c.
- C is the number of times t does not occur in c.
- D is the number of times t does not occur outside c.
- N = A + B + C + D.
- MI(t,c) = log( A·N / ((A+C)(A+B)) )
- If MI > 0 then there is a positive association between t and c.
- If MI = 0 there is no association between t and c.
- If MI < 0 then t and c are in complementary distribution.
- Units of MI are bits of information.
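A direct computation of this formula from the A, B, C, D counts above (the toy counts are invented; base-2 log, so the result is in bits):

    from math import log2

    def mutual_information(A, B, C, D):
        N = A + B + C + D
        return log2((A * N) / ((A + C) * (A + B)))

    # t appears mostly inside c, so MI comes out positive (about 0.58 bits).
    print(mutual_information(A=30, B=10, C=20, D=40))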
10. Chi-squared measure (X²)
- X²(t,c) = N·(AD - CB)² / ((A+C)(B+D)(A+B)(C+D))
- E.g. X² for words in US as opposed to UK English (1990s): percent 485.2, U 383.3, toward 327.0, program 324.4, Bush 319.1, Clinton 316.8, President 273.2, programs 262.0, American 224.9, S 222.0.
- These feature subset selection methods do not allow for dependencies between words, e.g. "click here".
- See Yang and Pedersen (1997), A Comparative Study on Feature Selection in Text Categorization.
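The same toy contingency counts plugged into this formula, reusing the A, B, C, D definitions from the previous slide:

    def chi_squared(A, B, C, D):
        N = A + B + C + D
        return (N * (A * D - C * B) ** 2) / ((A + C) * (B + D) * (A + B) * (C + D))

    print(chi_squared(A=30, B=10, C=20, D=40))   # about 16.7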
11. Term Weighting
- A soft form of feature selection.
- Does not remove attributes, but adjusts their relative influence.
- Three components:
- Document component (e.g. binary: present in document = 1, absent = 0; or term frequency (TF)).
- Collection component (e.g. inverse document frequency: log(N / DF)).
- Normalisation component, so that large and small documents can be compared on the same scale, e.g. 1 / sqrt(sum of xj²).
- The final weight is found by multiplying the 3 components (see the sketch below).
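A minimal TFxIDF-with-normalisation sketch combining the three components. The function and argument names are ours; doc_freq is assumed to map each term to its document frequency in the training collection.

    from math import log, sqrt

    def tfidf_vector(doc_tokens, vocabulary, doc_freq, num_docs):
        tf = [doc_tokens.count(t) for t in vocabulary]                   # document component
        idf = [log(num_docs / doc_freq[t]) if doc_freq.get(t) else 0.0   # collection component
               for t in vocabulary]
        weights = [f * i for f, i in zip(tf, idf)]
        norm = sqrt(sum(w * w for w in weights)) or 1.0                  # normalisation component
        return [w / norm for w in weights]

    vocab = ["cheap", "money", "market"]
    dfs = {"cheap": 20, "money": 50, "market": 10}
    print(tfidf_vector(["cheap", "money", "money"], vocab, dfs, num_docs=100))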
12. Feature Construction
- The new features should represent most of the information in the original representation while minimising the number of attributes.
- Examples of techniques are:
- Stemming.
- Thesauri: group words into semantic categories, e.g. synonyms can be placed in equivalence classes.
- Latent Semantic Indexing.
- Term clustering.
13. Learning Methods
- Naïve Bayes classifier
- Rocchio algorithm
- K-nearest neighbours
- Decision tree classifier
- Neural Nets
- Support Vector Machines
14. Naïve Bayesian Model (1)
- Spam filter example from Sahami et al.
- Odds(Rel | x) = Odds(Rel) × Pr(x | Rel) / Pr(x | NRel)
- Pr(cheap v1agra NOW! | spam) ≈ Pr(cheap | spam) × Pr(v1agra | spam) × Pr(NOW! | spam)
- Only classify as spam if the odds are greater than 100:1 on (see the sketch below).
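A toy version of this odds calculation. The probability tables and smoothing value below are invented for illustration; they are not taken from Sahami et al.

    def spam_odds(words, prior_odds, p_word_given_spam, p_word_given_legit):
        # Multiply per-word likelihood ratios into the prior odds.
        odds = prior_odds
        for w in words:
            odds *= p_word_given_spam.get(w, 1e-4) / p_word_given_legit.get(w, 1e-4)
        return odds

    p_spam  = {"cheap": 0.05, "v1agra": 0.03, "now!": 0.04}
    p_legit = {"cheap": 0.002, "v1agra": 0.0001, "now!": 0.005}
    odds = spam_odds(["cheap", "v1agra", "now!"], prior_odds=0.25,
                     p_word_given_spam=p_spam, p_word_given_legit=p_legit)
    print(odds, odds > 100)   # classify as spam only if the odds exceed 100:1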
15. Naïve Bayesian Model (2)
- Sahami et al. use word indicators, and also the following non-word indicators:
- Phrases: "free money", "only $", "over 21".
- Punctuation: !!!!
- Domain name of sender: .edu is less likely to be spam than .com.
- Junk mail is more likely to be sent at night than legitimate mail.
- Is the recipient an individual user or a mailing list?
16. Our Work on the Enron Corpus: The PERC (George Ke)
- Find a centroid ci for each category Ci.
- For each test document x:
- Find the k nearest neighbouring training documents to x.
- The similarity between x and each such training document dj is added to the similarity between x and ci, where Ci is the category of dj.
- Sort the similarity scores sim(x, Ci) in descending order.
- The decision to assign x to Ci can be made using various thresholding strategies (a sketch follows below).
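A rough sketch of that scoring step. Cosine similarity, the data layout, and the default k are our assumptions; the actual PERC implementation may differ.

    from math import sqrt

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def perc_scores(x, centroids, training_docs, k=5):
        # Start from the similarity between x and each category centroid ci...
        scores = {cat: cosine(x, c) for cat, c in centroids.items()}
        # ...then add the similarity of each of the k nearest training documents
        # to the score of that document's own category.
        neighbours = sorted(((cosine(x, d), cat) for d, cat in training_docs),
                            reverse=True)[:k]
        for sim, cat in neighbours:
            scores[cat] += sim
        # Categories ranked by score; a thresholding strategy decides assignment.
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)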
17. Rationale for the PERC Hybrid Approach
- The centroid method overcomes data sparseness: emails tend to be short.
- kNN allows the topic of a folder to drift over time. Considering the vector space locally allows matching against features which are currently dominant.
18. Kjell: A Stylometric Multi-Layer Perceptron
19. Performance Measures (PM)
- PMs used for evaluating TC are often different from those optimised by the learning algorithms.
- Loss-based measures (error rate and cost models).
- Precision- and recall-based measures.
21. Error Rate and Asymmetric Cost
- Error Rate is defined as the probability of the classification rule predicting the wrong class:
- Err = (f+- + f-+) / (f++ + f+- + f-+ + f--), where the first subscript of each count is the predicted class and the second is the true class.
- Problem: negative examples tend to outnumber positive examples. So if we always guess "not in category", it seems that we have a very low error rate.
- For many applications, predicting a positive example correctly is of higher utility than predicting a negative example correctly.
- We can incorporate this into the performance measure using a cost (or, inversely, utility) matrix:
- Err = (C++ f++ + C+- f+- + C-+ f-+ + C-- f--) / (f++ + f+- + f-+ + f--)
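A direct computation of both quantities. The contingency counts and the cost matrix below are invented for illustration (here false negatives are ten times as costly as false positives):

    def error_rate(f_pp, f_pn, f_np, f_nn):
        return (f_pn + f_np) / (f_pp + f_pn + f_np + f_nn)

    def cost_weighted_error(f_pp, f_pn, f_np, f_nn, C):
        total = f_pp + f_pn + f_np + f_nn
        return (C["++"] * f_pp + C["+-"] * f_pn + C["-+"] * f_np + C["--"] * f_nn) / total

    costs = {"++": 0, "+-": 1, "-+": 10, "--": 0}
    print(error_rate(40, 5, 15, 940))                         # 0.02
    print(cost_weighted_error(40, 5, 15, 940, costs))         # 0.155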
22. Precision and Recall
- The Recall of a classification rule is the probability that a document that should be in the category is classified correctly:
- R = f++ / (f++ + f-+)
- Precision is the probability that a document classified into a category is indeed classified correctly:
- P = f++ / (f++ + f+-)
- F = 2PR / (P + R) if P and R are equally important.
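The same contingency counts as in the error rate example, run through these three formulas:

    def precision(f_pp, f_pn):
        return f_pp / (f_pp + f_pn)

    def recall(f_pp, f_np):
        return f_pp / (f_pp + f_np)

    def f_measure(p, r):
        return 2 * p * r / (p + r)

    p, r = precision(40, 5), recall(40, 15)
    print(p, r, f_measure(p, r))   # roughly 0.89, 0.73, 0.80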
23. Micro- and macro-averaging
- Often it is useful to compute the average performance of a learning algorithm over multiple training/test sets or multiple classification tasks.
- In particular, for the multi-label setting one is usually interested in how well all the labels can be predicted, not only a single one.
- This leads to the question of how the results of m binary tasks can be averaged to get a single performance value.
- Macro-averaging: the performance measure (e.g. R or P) is computed separately for each of the m experiments. The average is computed as the arithmetic mean of the measure over all experiments.
- Micro-averaging: instead, average the contingency tables found for each of the m experiments to produce f++(ave), f+-(ave), f-+(ave), f--(ave). For recall, this implies:
- R(micro) = f++(ave) / (f++(ave) + f-+(ave))
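A sketch contrasting the two averages for recall. Each task is summarised by its contingency counts (f++, f+-, f-+, f--); the two toy tables (one frequent class, one rare class) are invented to show how micro-averaging is dominated by the frequent class.

    def macro_recall(tables):
        return sum(f_pp / (f_pp + f_np) for f_pp, _, f_np, _ in tables) / len(tables)

    def micro_recall(tables):
        f_pp = sum(t[0] for t in tables)
        f_np = sum(t[2] for t in tables)
        return f_pp / (f_pp + f_np)

    tables = [(90, 10, 10, 890), (5, 2, 45, 948)]
    print(macro_recall(tables))   # 0.5
    print(micro_recall(tables))   # about 0.63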