Title: Text Categorization
1. Text Categorization
- Assigning documents to a fixed set of categories
- Applications
- Web pages
- Recommending pages
- Yahoo-like classification hierarchies
- Categorizing bookmarks
- Newsgroup Messages / News Feeds / Micro-blog Posts
- Recommending messages, posts, tweets, etc.
- Message filtering
- News articles
- Personalized news
- Email messages
- Routing
- Folderizing
- Spam filtering
2. Learning for Text Categorization
- Text categorization is an application of classification
- Typical learning algorithms:
- Bayesian (naïve)
- Neural network
- Relevance Feedback (Rocchio)
- Nearest Neighbor
- Support Vector Machines (SVM)
3. Nearest-Neighbor Learning Algorithm
- Learning is just storing the representations of the training examples in the data set D
- Testing instance x:
- Compute the similarity between x and all examples in D
- Assign x the category of the most similar examples in D
- Does not explicitly compute a generalization or category prototypes (i.e., no modeling)
- Also called:
- Case-based
- Memory-based
- Lazy learning
4. K Nearest-Neighbor
- Using only the closest example to determine the categorization is subject to errors due to:
- A single atypical example
- Noise (i.e., error) in the category label of a single training example
- A more robust alternative is to find the k most-similar examples and return the majority category of these k examples (see the sketch after this list)
- The value of k is typically odd to avoid ties; 3 and 5 are most common
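A minimal sketch of this k-nearest-neighbor decision rule, with the similarity function left pluggable (cosine similarity over tf x idf vectors, defined on the following slides, is the usual choice for text); the function and variable names are illustrative:

    from collections import Counter

    def knn_classify(test_instance, training_data, similarity, k=3):
        """training_data: list of (instance, category) pairs; k is usually odd to avoid ties."""
        # Score every stored training example against the test instance.
        scored = sorted(((similarity(test_instance, x), c) for x, c in training_data),
                        key=lambda pair: pair[0], reverse=True)
        # Majority vote over the categories of the k most similar examples.
        votes = Counter(c for _, c in scored[:k])
        return votes.most_common(1)[0][0]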
5. Similarity Metrics
- The nearest-neighbor method depends on a similarity (or distance) metric
- The simplest for a continuous m-dimensional instance space is Euclidean distance
- The simplest for an m-dimensional binary instance space is Hamming distance (the number of feature values that differ)
- For text, cosine similarity of TF-IDF weighted vectors is typically most effective (see the sketch below)
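A minimal sketch of these three metrics over simple numeric vectors, written directly from the definitions above (function names are illustrative):

    import math

    def euclidean_distance(x, y):
        # Distance in a continuous m-dimensional instance space.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def hamming_distance(x, y):
        # Number of feature values that differ (binary instance space).
        return sum(a != b for a, b in zip(x, y))

    def cosine_similarity(x, y):
        # Similarity of two term-weight vectors, e.g. tf x idf weights.
        dot = sum(a * b for a, b in zip(x, y))
        norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
        return dot / norm if norm else 0.0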
6. Basic Automatic Text Processing
- Parse documents to recognize structure and meta-data
- e.g., title, date, other fields, HTML tags, etc.
- Scan for word tokens
- lexical analysis to recognize keywords, numbers, special characters, etc.
- Stopword removal
- common words such as "the", "and", and "or" which are not semantically meaningful in a document
- Stem words
- morphological processing to group word variants (e.g., compute, computer, computing, and computes can all be represented by the single stem comput in the index); a simplified sketch of these steps follows this list
- Assign weights to words
- using frequency within documents and across documents
- Store index
- stored in a Term-Document matrix (inverted index) which stores each document as a vector of keyword weights
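A simplified sketch of this pipeline; the stopword list and suffix-stripping stemmer are toy placeholders (a real system would use a full stopword list and a proper stemmer such as Porter's):

    import re
    from collections import Counter

    STOPWORDS = {"the", "and", "or", "a", "an", "of", "to", "in"}  # tiny illustrative list

    def tokenize(text):
        # Lexical analysis: lower-case the text and keep alphanumeric word tokens.
        return re.findall(r"[a-z0-9]+", text.lower())

    def stem(token):
        # Crude suffix stripping as a stand-in for real morphological processing,
        # so that compute / computes / computer / computing map to the stem "comput".
        for suffix in ("ing", "es", "er", "ed", "e", "s"):
            if token.endswith(suffix) and len(token) - len(suffix) >= 4:
                return token[: -len(suffix)]
        return token

    def term_frequencies(text):
        # Raw term frequencies for one document after stopword removal and stemming.
        tokens = [stem(t) for t in tokenize(text) if t not in STOPWORDS]
        return Counter(tokens)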
7. tf x idf Weights
- tf x idf measure
- term frequency (tf)
- inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution
- Recall the Zipf distribution
- Want to weight terms highly if they are
- frequent in relevant documents BUT
- infrequent in the collection as a whole
- Goal: assign a tf x idf weight to each term in each document
8. tf x idf
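- The weight of term i in document j combines both factors; the form used in the example on slide 10 is:
  w_ij = tf_ij x idf_i = tf_ij x log2(N / df_i)
  where tf_ij is the frequency of term i in document j, N is the number of documents in the collection, and df_i is the number of documents containing term i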
9. Inverse Document Frequency
- IDF provides high values for rare words and low
values for common words
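- idf_i = log2(N / df_i): a term that occurs in every document gets idf = log2(1) = 0, while a rare term gets a large idf
- For example, with N = 6 documents, a term with df = 2 has idf = log2(6/2) = 1.58, and a term with df = 4 has idf = log2(6/4) = 0.58 (as in the example below)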
10. tf x idf Example
     Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6   df   idf = log2(N/df)
T1     0      2      4      0      1      0      3         1.00
T2     1      3      0      0      0      2      3         1.00
T3     0      1      0      2      0      0      2         1.58
T4     3      0      1      5      4      0      4         0.58
T5     0      4      0      0      0      1      2         1.58
T6     2      7      2      1      3      0      5         0.26
T7     1      0      0      5      5      1      4         0.58
T8     0      1      1      0      0      3      3         1.00
The initial Term x Doc matrix (inverted index)
Documents represented as vectors of words
     Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6
T1    0.00   2.00   4.00   0.00   1.00   0.00
T2    1.00   3.00   0.00   0.00   0.00   2.00
T3    0.00   1.58   0.00   3.17   0.00   0.00
T4    1.74   0.00   0.58   2.90   2.32   0.00
T5    0.00   6.34   0.00   0.00   0.00   1.58
T6    0.53   1.84   0.53   0.26   0.79   0.00
T7    0.58   0.00   0.00   2.92   2.92   0.58
T8    0.00   1.00   1.00   0.00   0.00   3.00
The tf x idf weighted Term x Doc matrix (recomputed by the sketch below)
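A small sketch that recomputes the weighted matrix from the raw counts above, using idf = log2(N/df); the printed values match the table up to rounding:

    import math

    # Raw term frequencies from the initial Term x Doc matrix (rows T1..T8, 6 documents).
    counts = [
        [0, 2, 4, 0, 1, 0],
        [1, 3, 0, 0, 0, 2],
        [0, 1, 0, 2, 0, 0],
        [3, 0, 1, 5, 4, 0],
        [0, 4, 0, 0, 0, 1],
        [2, 7, 2, 1, 3, 0],
        [1, 0, 0, 5, 5, 1],
        [0, 1, 1, 0, 0, 3],
    ]

    N = len(counts[0])  # number of documents
    for i, row in enumerate(counts, start=1):
        df = sum(1 for tf in row if tf > 0)          # document frequency of term Ti
        idf = math.log2(N / df)                      # idf = log2(N / df)
        weights = [round(tf * idf, 2) for tf in row]
        print(f"T{i}", weights)                      # tf x idf weighted row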
11. K Nearest Neighbor for Text
Training:
  For each training example <x, c(x)> ∈ D
    Compute the corresponding TF-IDF vector, d_x, for document x

Test instance y:
  Compute the TF-IDF vector d for document y
  For each <x, c(x)> ∈ D
    Let s_x = cosSim(d, d_x)
  Sort the examples x in D by decreasing value of s_x
  Let N be the first k examples in D (the most similar neighbors)
  Return the majority class of the examples in N
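A compact, runnable sketch of this procedure, representing each TF-IDF vector as a Python dict from term to weight; the training set D is assumed to be a list of (vector, category) pairs, and the function names are illustrative:

    import math
    from collections import Counter

    def cos_sim(d, dx):
        # Cosine similarity between two tf x idf vectors stored as {term: weight} dicts.
        dot = sum(w * dx.get(t, 0.0) for t, w in d.items())
        norm = math.sqrt(sum(w * w for w in d.values())) * math.sqrt(sum(w * w for w in dx.values()))
        return dot / norm if norm else 0.0

    def knn_text_classify(d, D, k=3):
        # D: list of (tf x idf vector, category) pairs for the training documents.
        sims = sorted(((cos_sim(d, dx), c) for dx, c in D), key=lambda p: p[0], reverse=True)
        neighbors = [c for _, c in sims[:k]]            # the k most similar neighbors
        return Counter(neighbors).most_common(1)[0][0]  # majority class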