Title: Text Categorization
1. Text Categorization
- Assigning documents to a fixed set of categories
- Applications
- Web pages
- Recommending pages
- Yahoo-like classification hierarchies
- Categorizing bookmarks
- Newsgroup Messages / News Feeds / Micro-blog Posts
- Recommending messages, posts, tweets, etc.
- Message filtering
- News articles
- Personalized news
- Email messages
- Routing
- Folderizing
- Spam filtering
2. Learning for Text Categorization
- Text categorization is an application of classification
- Typical learning algorithms:
- Bayesian (naïve)
- Neural network
- Relevance Feedback (Rocchio)
- Nearest Neighbor
- Support Vector Machines (SVM)
3. Nearest-Neighbor Learning Algorithm
- Learning is just storing the representations of the training examples in the data set D
- Testing instance x:
- Compute the similarity between x and all examples in D
- Assign x the category of the most similar examples in D
- Does not explicitly compute a generalization or category prototypes (i.e., no modeling)
- Also called:
- Case-based
- Memory-based
- Lazy learning
4. K Nearest-Neighbor
- Using only the closest example to determine the categorization is subject to errors due to:
- A single atypical example
- Noise (i.e., error) in the category label of a single training example
- A more robust alternative is to find the k most-similar examples and return the majority category of these k examples (see the sketch after this list)
- The value of k is typically odd to avoid ties; 3 and 5 are most common
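A minimal sketch of this k-nearest-neighbor decision rule, with the similarity function left pluggable (cosine similarity over tf x idf vectors, defined on the following slides, is the usual choice for text); the function and variable names are illustrative:

    from collections import Counter

    def knn_classify(test_instance, training_data, similarity, k=3):
        """training_data: list of (instance, category) pairs; k is usually odd to avoid ties."""
        # Score every stored training example against the test instance.
        scored = sorted(((similarity(test_instance, x), c) for x, c in training_data),
                        key=lambda pair: pair[0], reverse=True)
        # Majority vote over the categories of the k most similar examples.
        votes = Counter(c for _, c in scored[:k])
        return votes.most_common(1)[0][0]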
5. Similarity Metrics
- The nearest-neighbor method depends on a similarity (or distance) metric
- The simplest for a continuous m-dimensional instance space is Euclidean distance
- The simplest for an m-dimensional binary instance space is Hamming distance (the number of feature values that differ)
- For text, cosine similarity of TF-IDF weighted vectors is typically most effective (see the sketch below)
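A minimal sketch of these three metrics over simple numeric vectors, written directly from the definitions above (function names are illustrative):

    import math

    def euclidean_distance(x, y):
        # Distance in a continuous m-dimensional instance space.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def hamming_distance(x, y):
        # Number of feature values that differ (binary instance space).
        return sum(a != b for a, b in zip(x, y))

    def cosine_similarity(x, y):
        # Similarity of two term-weight vectors, e.g. tf x idf weights.
        dot = sum(a * b for a, b in zip(x, y))
        norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
        return dot / norm if norm else 0.0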
6. Basic Automatic Text Processing
- Parse documents to recognize structure and meta-data
- e.g., title, date, other fields, HTML tags, etc.
- Scan for word tokens
- lexical analysis to recognize keywords, numbers, special characters, etc.
- Stopword removal
- common words such as "the", "and", and "or" which are not semantically meaningful in a document
- Stem words
- morphological processing to group word variants (e.g., compute, computer, computing, and computes can all be represented by the single stem comput in the index); a simplified sketch of these steps follows this list
- Assign weights to words
- using frequency within documents and across documents
- Store index
- stored in a Term-Document matrix (inverted index) which stores each document as a vector of keyword weights
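A simplified sketch of this pipeline; the stopword list and suffix-stripping stemmer are toy placeholders (a real system would use a full stopword list and a proper stemmer such as Porter's):

    import re
    from collections import Counter

    STOPWORDS = {"the", "and", "or", "a", "an", "of", "to", "in"}  # tiny illustrative list

    def tokenize(text):
        # Lexical analysis: lower-case the text and keep alphanumeric word tokens.
        return re.findall(r"[a-z0-9]+", text.lower())

    def stem(token):
        # Crude suffix stripping as a stand-in for real morphological processing,
        # so that compute / computes / computer / computing map to the stem "comput".
        for suffix in ("ing", "es", "er", "ed", "e", "s"):
            if token.endswith(suffix) and len(token) - len(suffix) >= 4:
                return token[: -len(suffix)]
        return token

    def term_frequencies(text):
        # Raw term frequencies for one document after stopword removal and stemming.
        tokens = [stem(t) for t in tokenize(text) if t not in STOPWORDS]
        return Counter(tokens)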
7. tf x idf Weights
- tf x idf measure
- term frequency (tf)
- inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution
- Recall the Zipf distribution
- Want to weight terms highly if they are
- frequent in relevant documents BUT
- infrequent in the collection as a whole
- Goal: assign a tf x idf weight to each term in each document
8. tf x idf
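- The weight of term i in document j combines both factors; the form used in the example on slide 10 is:
  w_ij = tf_ij x idf_i = tf_ij x log2(N / df_i)
  where tf_ij is the frequency of term i in document j, N is the number of documents in the collection, and df_i is the number of documents containing term i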
9. Inverse Document Frequency
- IDF provides high values for rare words and low
values for common words
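- idf_i = log2(N / df_i): a term that occurs in every document gets idf = log2(1) = 0, while a rare term gets a large idf
- For example, with N = 6 documents, a term with df = 2 has idf = log2(6/2) = 1.58, and a term with df = 4 has idf = log2(6/4) = 0.58 (as in the example below)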
10. tf x idf Example
     Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6   df   idf = log2(N/df)
T1     0      2      4      0      1      0      3         1.00
T2     1      3      0      0      0      2      3         1.00
T3     0      1      0      2      0      0      2         1.58
T4     3      0      1      5      4      0      4         0.58
T5     0      4      0      0      0      1      2         1.58
T6     2      7      2      1      3      0      5         0.26
T7     1      0      0      5      5      1      4         0.58
T8     0      1      1      0      0      3      3         1.00
The initial Term x Doc matrix (inverted index)
Documents represented as vectors of words
     Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6
T1    0.00   2.00   4.00   0.00   1.00   0.00
T2    1.00   3.00   0.00   0.00   0.00   2.00
T3    0.00   1.58   0.00   3.17   0.00   0.00
T4    1.74   0.00   0.58   2.90   2.32   0.00
T5    0.00   6.34   0.00   0.00   0.00   1.58
T6    0.53   1.84   0.53   0.26   0.79   0.00
T7    0.58   0.00   0.00   2.92   2.92   0.58
T8    0.00   1.00   1.00   0.00   0.00   3.00
The tf x idf weighted Term x Doc matrix (recomputed by the sketch below)
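A small sketch that recomputes the weighted matrix from the raw counts above, using idf = log2(N/df); the printed values match the table up to rounding:

    import math

    # Raw term frequencies from the initial Term x Doc matrix (rows T1..T8, 6 documents).
    counts = [
        [0, 2, 4, 0, 1, 0],
        [1, 3, 0, 0, 0, 2],
        [0, 1, 0, 2, 0, 0],
        [3, 0, 1, 5, 4, 0],
        [0, 4, 0, 0, 0, 1],
        [2, 7, 2, 1, 3, 0],
        [1, 0, 0, 5, 5, 1],
        [0, 1, 1, 0, 0, 3],
    ]

    N = len(counts[0])  # number of documents
    for i, row in enumerate(counts, start=1):
        df = sum(1 for tf in row if tf > 0)          # document frequency of term Ti
        idf = math.log2(N / df)                      # idf = log2(N / df)
        weights = [round(tf * idf, 2) for tf in row]
        print(f"T{i}", weights)                      # tf x idf weighted row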
11. K Nearest Neighbor for Text
Training:
  For each training example <x, c(x)> ∈ D
    Compute the corresponding TF-IDF vector, d_x, for document x

Test instance y:
  Compute the TF-IDF vector d for document y
  For each <x, c(x)> ∈ D
    Let s_x = cosSim(d, d_x)
  Sort the examples x in D by decreasing value of s_x
  Let N be the first k examples in D (the most similar neighbors)
  Return the majority class of the examples in N
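A compact, runnable sketch of this procedure, representing each TF-IDF vector as a Python dict from term to weight; the training set D is assumed to be a list of (vector, category) pairs, and the function names are illustrative:

    import math
    from collections import Counter

    def cos_sim(d, dx):
        # Cosine similarity between two tf x idf vectors stored as {term: weight} dicts.
        dot = sum(w * dx.get(t, 0.0) for t, w in d.items())
        norm = math.sqrt(sum(w * w for w in d.values())) * math.sqrt(sum(w * w for w in dx.values()))
        return dot / norm if norm else 0.0

    def knn_text_classify(d, D, k=3):
        # D: list of (tf x idf vector, category) pairs for the training documents.
        sims = sorted(((cos_sim(d, dx), c) for dx, c in D), key=lambda p: p[0], reverse=True)
        neighbors = [c for _, c in sims[:k]]            # the k most similar neighbors
        return Counter(neighbors).most_common(1)[0][0]  # majority class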