Title: Text Classification
1. Text Classification
- Chapter 2 of Learning to Classify Text Using Support Vector Machines by Thorsten Joachims, Kluwer, 2002.
2. Text Classification (TC) Definition
- Infer a classification rule from a sample of labelled training documents (the training set) so that it classifies new examples (the test set) with high accuracy.
- Using the ModApte split, the ratio of training documents to test documents is about 3:1.
3. Three settings
- Binary setting (simplest). Only two classes, e.g. relevant vs. non-relevant in IR, spam vs. legitimate in spam filters.
- Multi-class setting, e.g. email routing at a service hotline to one out of ten customer representatives. Can be reduced to binary tasks using the one-against-the-rest strategy (see the sketch below).
- Multi-label setting, e.g. semantic topic identifiers for indexing news articles. An article can be in one, many, or no categories. Can also be split into a set of binary classification tasks.
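A minimal sketch of the one-against-the-rest reduction. The helper names train_binary and score are placeholders of our own for whatever binary learner is plugged in; they are not from the slides.

    # One-against-the-rest: train one binary classifier per class, where the
    # positives are the documents of that class and the negatives are the rest.
    def one_vs_rest_train(docs, labels, classes, train_binary):
        return {c: train_binary(docs, [1 if y == c else 0 for y in labels])
                for c in classes}

    def one_vs_rest_predict(doc, models, score):
        # Assign the class whose binary classifier gives the highest score.
        return max(models, key=lambda c: score(models[c], doc))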
4. Representing text as example vectors
- The basic building blocks for representing text will be called indexing terms.
- Word-based terms are the most common. Very effective in IR, even though words such as "bank" have more than one meaning.
- Advantage of simplicity: split the input text into words by white space.
- Assume the ordering of words is irrelevant: the bag-of-words model. Only the frequency of each word in the document is recorded.
- The bag-of-words model ensures that each document is represented by a vector of fixed dimensionality. Each component of the vector represents the value (e.g. the frequency of that word in that document, TF) of one attribute (see the sketch below).
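A minimal bag-of-words sketch along these lines. The vocabulary and example document are invented; tokenisation is just a whitespace split, as on the slide.

    def bag_of_words(doc, vocabulary):
        # Fixed dimensionality: one component per vocabulary term, holding the
        # term frequency (TF) of that term in the document.
        tokens = doc.lower().split()
        return [tokens.count(term) for term in vocabulary]

    vocab = ["bank", "money", "river"]
    print(bag_of_words("The bank lent money to the bank", vocab))   # [2, 1, 0]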
6. Other levels of text representation
- More sophisticated representations than the bag-of-words have not yet shown consistent and substantial improvements.
- Sub-word level, e.g. character n-grams, which are robust against spelling errors (see the sketch below). See Kjell's neural network.
- Multi-word level. May use syntactic phrase indexing such as noun phrases (e.g. adjective-noun pairs) or co-occurrence patterns (e.g. "speed limit").
- Semantic level. Latent Semantic Indexing (LSI) aims to automatically generate semantic categories based on a bag-of-words representation. Another approach would make use of thesauri.
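A small sketch of character n-gram extraction (the example words are ours), showing why a misspelling still shares most of its trigrams with the correct spelling:

    def char_ngrams(text, n=3):
        # Slide a window of length n over the characters of the text.
        text = text.lower()
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    correct = set(char_ngrams("classification"))
    misspelt = set(char_ngrams("classiffication"))
    print(len(correct & misspelt), "of", len(correct), "trigrams shared")   # 11 of 12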
7. Feature Selection
- To remove irrelevant or inappropriate attributes from the representation.
- Advantages are protection against over-fitting, and increased computational efficiency with fewer dimensions to work with.
- The 2 most common strategies:
- a) Feature subset selection: use a subset of the original features.
- b) Feature construction: new features are introduced by combining original features.
8. Feature subset selection techniques
- Stopword elimination (removes high-frequency words).
- Document frequency thresholding (remove infrequent words, e.g. those occurring fewer than m times in the training corpus); see the sketch below.
- Mutual information.
- Chi-squared test (X²).
- But an appropriate learning algorithm should be able to detect irrelevant features as part of the learning process.
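A sketch of document frequency thresholding: keep only terms that appear in at least m training documents. The threshold m and the toy documents are illustrative.

    from collections import Counter

    def df_threshold(tokenised_docs, m=2):
        df = Counter()
        for doc in tokenised_docs:
            df.update(set(doc))            # count each term once per document
        return {t for t, n in df.items() if n >= m}

    docs = [["cheap", "money", "now"], ["money", "market"], ["cheap", "money"]]
    print(df_threshold(docs, m=2))          # {'cheap', 'money'}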
9. Mutual Information
- We consider the association between a term t and a category c. How often do they occur together, compared with how common the term is and how common membership of the category is?
- A is the number of times t occurs in c.
- B is the number of times t occurs outside c.
- C is the number of times t does not occur in c.
- D is the number of times t does not occur outside c.
- N = A + B + C + D.
- MI(t,c) = log( A·N / ((A+C)(A+B)) )
- If MI > 0 then there is a positive association between t and c.
- If MI = 0 there is no association between t and c.
- If MI < 0 then t and c are in complementary distribution.
- Units of MI are bits of information.
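A direct computation of this formula from the A, B, C, D counts above (the toy counts are invented; base-2 log, so the result is in bits):

    from math import log2

    def mutual_information(A, B, C, D):
        N = A + B + C + D
        return log2((A * N) / ((A + C) * (A + B)))

    # t appears mostly inside c, so MI comes out positive (about 0.58 bits).
    print(mutual_information(A=30, B=10, C=20, D=40))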
10. Chi-squared measure (X²)
- X²(t,c) = N·(AD - CB)² / ((A+C)(B+D)(A+B)(C+D))
- E.g. X² for words in US as opposed to UK English (1990s): percent 485.2, U 383.3, toward 327.0, program 324.4, Bush 319.1, Clinton 316.8, President 273.2, programs 262.0, American 224.9, S 222.0.
- These feature subset selection methods do not allow for dependencies between words, e.g. "click here".
- See Yang and Pedersen (1997), A Comparative Study on Feature Selection in Text Categorization.
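The same toy contingency counts plugged into this formula, reusing the A, B, C, D definitions from the previous slide:

    def chi_squared(A, B, C, D):
        N = A + B + C + D
        return (N * (A * D - C * B) ** 2) / ((A + C) * (B + D) * (A + B) * (C + D))

    print(chi_squared(A=30, B=10, C=20, D=40))   # about 16.7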
11. Term Weighting
- A soft form of feature selection.
- Does not remove attributes, but adjusts their relative influence.
- Three components:
- Document component (e.g. binary: present in document = 1, absent = 0; or term frequency (TF)).
- Collection component (e.g. inverse document frequency: log(N / DF)).
- Normalisation component, so that large and small documents can be compared on the same scale, e.g. 1 / sqrt(sum of xj²).
- The final weight is found by multiplying the 3 components (see the sketch below).
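A minimal TFxIDF-with-normalisation sketch combining the three components. The function and argument names are ours; doc_freq is assumed to map each term to its document frequency in the training collection.

    from math import log, sqrt

    def tfidf_vector(doc_tokens, vocabulary, doc_freq, num_docs):
        tf = [doc_tokens.count(t) for t in vocabulary]                   # document component
        idf = [log(num_docs / doc_freq[t]) if doc_freq.get(t) else 0.0   # collection component
               for t in vocabulary]
        weights = [f * i for f, i in zip(tf, idf)]
        norm = sqrt(sum(w * w for w in weights)) or 1.0                  # normalisation component
        return [w / norm for w in weights]

    vocab = ["cheap", "money", "market"]
    dfs = {"cheap": 20, "money": 50, "market": 10}
    print(tfidf_vector(["cheap", "money", "money"], vocab, dfs, num_docs=100))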
12. Feature Construction
- The new features should represent most of the information in the original representation while minimising the number of attributes.
- Examples of techniques are:
- Stemming.
- Thesauri: group words into semantic categories, e.g. synonyms can be placed in equivalence classes.
- Latent Semantic Indexing.
- Term clustering.
13. Learning Methods
- Naïve Bayes classifier
- Rocchio algorithm
- K-nearest neighbours
- Decision tree classifier
- Neural Nets
- Support Vector Machines
14. Naïve Bayesian Model (1)
- Spam filter example from Sahami et al.
- Odds(Rel | x) = Odds(Rel) × Pr(x | Rel) / Pr(x | NRel)
- Pr(cheap v1agra NOW! | spam) ≈ Pr(cheap | spam) × Pr(v1agra | spam) × Pr(NOW! | spam)
- Only classify as spam if the odds are greater than 100:1 on (see the sketch below).
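A toy version of this odds calculation. The probability tables and smoothing value below are invented for illustration; they are not taken from Sahami et al.

    def spam_odds(words, prior_odds, p_word_given_spam, p_word_given_legit):
        # Multiply per-word likelihood ratios into the prior odds.
        odds = prior_odds
        for w in words:
            odds *= p_word_given_spam.get(w, 1e-4) / p_word_given_legit.get(w, 1e-4)
        return odds

    p_spam  = {"cheap": 0.05, "v1agra": 0.03, "now!": 0.04}
    p_legit = {"cheap": 0.002, "v1agra": 0.0001, "now!": 0.005}
    odds = spam_odds(["cheap", "v1agra", "now!"], prior_odds=0.25,
                     p_word_given_spam=p_spam, p_word_given_legit=p_legit)
    print(odds, odds > 100)   # classify as spam only if the odds exceed 100:1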
15. Naïve Bayesian Model (2)
- Sahami et al. use word indicators, and also the following non-word indicators:
- Phrases: "free money", "only $", "over 21".
- Punctuation: !!!!
- Domain name of sender: .edu is less likely to be spam than .com.
- Junk mail is more likely to be sent at night than legitimate mail.
- Is the recipient an individual user or a mailing list?
16. Our Work on the Enron Corpus: The PERC (George Ke)
- Find a centroid ci for each category Ci.
- For each test document x:
- Find the k nearest neighbouring training documents to x.
- The similarity between x and each such training document dj is added to the similarity between x and ci, where Ci is the category of dj.
- Sort the similarity scores sim(x, Ci) in descending order.
- The decision to assign x to Ci can be made using various thresholding strategies (a sketch follows below).
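A rough sketch of that scoring step. Cosine similarity, the data layout, and the default k are our assumptions; the actual PERC implementation may differ.

    from math import sqrt

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def perc_scores(x, centroids, training_docs, k=5):
        # Start from the similarity between x and each category centroid ci...
        scores = {cat: cosine(x, c) for cat, c in centroids.items()}
        # ...then add the similarity of each of the k nearest training documents
        # to the score of that document's own category.
        neighbours = sorted(((cosine(x, d), cat) for d, cat in training_docs),
                            reverse=True)[:k]
        for sim, cat in neighbours:
            scores[cat] += sim
        # Categories ranked by score; a thresholding strategy decides assignment.
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)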
17. Rationale for the PERC Hybrid Approach
- The centroid method overcomes data sparseness: emails tend to be short.
- kNN allows the topic of a folder to drift over time. Considering the vector space locally allows matching against features which are currently dominant.
18. Kjell: A Stylometric Multi-Layer Perceptron
19. Performance Measures (PM)
- PMs used for evaluating TC are often different from those optimised by the learning algorithms.
- Loss-based measures (error rate and cost models).
- Precision- and recall-based measures.
21. Error Rate and Asymmetric Cost
- Error Rate is defined as the probability of the classification rule predicting the wrong class:
- Err = (f+- + f-+) / (f++ + f+- + f-+ + f--), where the first subscript of each count is the predicted class and the second is the true class.
- Problem: negative examples tend to outnumber positive examples. So if we always guess "not in category", it seems that we have a very low error rate.
- For many applications, predicting a positive example correctly is of higher utility than predicting a negative example correctly.
- We can incorporate this into the performance measure using a cost (or, inversely, utility) matrix:
- Err = (C++ f++ + C+- f+- + C-+ f-+ + C-- f--) / (f++ + f+- + f-+ + f--)
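A direct computation of both quantities. The contingency counts and the cost matrix below are invented for illustration (here false negatives are ten times as costly as false positives):

    def error_rate(f_pp, f_pn, f_np, f_nn):
        return (f_pn + f_np) / (f_pp + f_pn + f_np + f_nn)

    def cost_weighted_error(f_pp, f_pn, f_np, f_nn, C):
        total = f_pp + f_pn + f_np + f_nn
        return (C["++"] * f_pp + C["+-"] * f_pn + C["-+"] * f_np + C["--"] * f_nn) / total

    costs = {"++": 0, "+-": 1, "-+": 10, "--": 0}
    print(error_rate(40, 5, 15, 940))                         # 0.02
    print(cost_weighted_error(40, 5, 15, 940, costs))         # 0.155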
22. Precision and Recall
- The Recall of a classification rule is the probability that a document that should be in the category is classified correctly:
- R = f++ / (f++ + f-+)
- Precision is the probability that a document classified into a category is indeed classified correctly:
- P = f++ / (f++ + f+-)
- F = 2PR / (P + R) if P and R are equally important.
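The same contingency counts as in the error rate example, run through these three formulas:

    def precision(f_pp, f_pn):
        return f_pp / (f_pp + f_pn)

    def recall(f_pp, f_np):
        return f_pp / (f_pp + f_np)

    def f_measure(p, r):
        return 2 * p * r / (p + r)

    p, r = precision(40, 5), recall(40, 15)
    print(p, r, f_measure(p, r))   # roughly 0.89, 0.73, 0.80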
23. Micro- and macro-averaging
- Often it is useful to compute the average performance of a learning algorithm over multiple training/test sets or multiple classification tasks.
- In particular, for the multi-label setting one is usually interested in how well all the labels can be predicted, not only a single one.
- This leads to the question of how the results of m binary tasks can be averaged to get a single performance value.
- Macro-averaging: the performance measure (e.g. R or P) is computed separately for each of the m experiments. The average is computed as the arithmetic mean of the measure over all experiments.
- Micro-averaging: instead, average the contingency tables found for each of the m experiments to produce f++(ave), f+-(ave), f-+(ave), f--(ave). For recall, this implies:
- R(micro) = f++(ave) / (f++(ave) + f-+(ave))
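A sketch contrasting the two averages for recall. Each task is summarised by its contingency counts (f++, f+-, f-+, f--); the two toy tables (one frequent class, one rare class) are invented to show how micro-averaging is dominated by the frequent class.

    def macro_recall(tables):
        return sum(f_pp / (f_pp + f_np) for f_pp, _, f_np, _ in tables) / len(tables)

    def micro_recall(tables):
        f_pp = sum(t[0] for t in tables)
        f_np = sum(t[2] for t in tables)
        return f_pp / (f_pp + f_np)

    tables = [(90, 10, 10, 890), (5, 2, 45, 948)]
    print(macro_recall(tables))   # 0.5
    print(micro_recall(tables))   # about 0.63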