1
Text Categorization
  • Assigning documents to a fixed set of categories
  • Applications
    • Web pages
      • Recommending pages
      • Yahoo-like classification hierarchies
      • Categorizing bookmarks
    • Newsgroup messages / news feeds / micro-blog posts
      • Recommending messages, posts, tweets, etc.
      • Message filtering
    • News articles
      • Personalized news
    • Email messages
      • Routing
      • Folderizing
      • Spam filtering

2
Learning for Text Categorization
  • Text categorization is an application of
    classification
  • Typical learning algorithms:
    • Naïve Bayes
    • Neural networks
    • Relevance feedback (Rocchio)
    • Nearest neighbor
    • Support vector machines (SVM)

3
Nearest-Neighbor Learning Algorithm
  • Learning is just storing the representations of
    the training examples in the data set D
  • Testing instance x:
    • Compute the similarity between x and all
      examples in D
    • Assign x the category of the most similar
      example in D
  • Does not explicitly compute a generalization or
    category prototypes (i.e., no modeling)
  • Also called:
    • Case-based learning
    • Memory-based learning
    • Lazy learning
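A minimal sketch of this memory-based scheme in Python (the function names
and the pluggable sim argument are illustrative, not from the slides):

  def train(examples):
      # "Learning" is just storing the labeled training examples.
      return list(examples)              # D: list of (vector, category) pairs

  def classify_1nn(D, x, sim):
      # Compare x against every stored example; no model or prototype is built.
      _, category = max(D, key=lambda pair: sim(x, pair[0]))
      return category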

4
K Nearest-Neighbor
  • Using only the single closest example to determine the
    categorization is subject to errors due to:
    • a single atypical example
    • noise (i.e., an error) in the category label of a
      single training example
  • A more robust alternative is to find the k
    most-similar examples and return the majority
    category of these k examples
  • The value of k is typically odd to avoid ties; 3 and
    5 are the most common choices
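A small illustration of why the majority vote helps (a toy one-dimensional
instance space; the data and the deliberately mislabeled point at 2.1 are
made up for the example):

  from collections import Counter

  # Toy 1-D instance space: (value, category). The point at 2.1 is mislabeled "B".
  D = [(1.0, "A"), (1.2, "A"), (2.0, "A"), (2.1, "B"), (5.0, "B"), (5.2, "B")]

  def knn(x, k):
      ranked = sorted(D, key=lambda p: abs(x - p[0]))          # nearest first
      return Counter(cat for _, cat in ranked[:k]).most_common(1)[0][0]

  print(knn(2.06, 1))  # "B": the single noisy neighbor decides the label
  print(knn(2.06, 3))  # "A": the majority of the 3 nearest outvotes the noise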

5
Similarity Metrics
  • The nearest-neighbor method depends on a similarity
    (or distance) metric
  • The simplest choice for a continuous m-dimensional
    instance space is Euclidean distance
  • The simplest choice for an m-dimensional binary
    instance space is Hamming distance (the number of
    feature values that differ)
  • For text, cosine similarity of TF-IDF weighted
    vectors is typically most effective
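The three metrics written out for dense vectors (a minimal sketch; the
cosine version assumes neither vector is all zeros):

  import math

  def euclidean(a, b):
      # Distance between points in a continuous m-dimensional space.
      return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

  def hamming(a, b):
      # Number of feature values that differ between two binary vectors.
      return sum(x != y for x, y in zip(a, b))

  def cosine(a, b):
      # Cosine of the angle between two (e.g., TF-IDF) vectors.
      dot = sum(x * y for x, y in zip(a, b))
      norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
      return dot / norm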

6
Basic Automatic Text Processing
  • Parse documents to recognize structure and metadata
    • e.g., title, date, other fields, HTML tags, etc.
  • Scan for word tokens
    • lexical analysis to recognize keywords, numbers,
      special characters, etc.
  • Stopword removal
    • remove common words such as "the", "and", and "or",
      which are not semantically meaningful in a document
  • Stem words
    • morphological processing to group word variants
      (e.g., "compute", "computer", "computing", and
      "computes" can be represented by the single stem
      "comput" in the index)
  • Assign weights to words
    • using frequency within documents and across documents
  • Store the index
    • a Term-Document matrix (inverted index) that stores
      each document as a vector of keyword weights
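A toy version of the token/stopword/stem steps in Python (the stopword list
is abbreviated, and crude_stem is a deliberately naive stand-in for a real
morphological stemmer such as Porter's algorithm):

  import re
  from collections import Counter

  STOPWORDS = {"the", "and", "or", "a", "an", "of", "to", "in"}

  def crude_stem(word):
      # Naive suffix stripping so variants like "computing"/"computes"
      # share the stem "comput"; real systems use morphological stemmers.
      for suffix in ("ing", "ers", "er", "es", "s", "ed"):
          if word.endswith(suffix) and len(word) - len(suffix) >= 4:
              return word[: -len(suffix)]
      return word

  def tokenize(text):
      tokens = re.findall(r"[a-z0-9]+", text.lower())    # lexical analysis
      return [crude_stem(t) for t in tokens if t not in STOPWORDS]

  print(Counter(tokenize("Computers and computing: the computer computes.")))
  # Counter({'comput': 4}) -- term frequencies, ready for weighting and indexing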

7
tf x idf Weights
  • The tf x idf measure combines:
    • term frequency (tf)
    • inverse document frequency (idf), a way to deal
      with the problems of the Zipf distribution
  • Recall the Zipf distribution
  • We want to weight terms highly if they are:
    • frequent in relevant documents, BUT
    • infrequent in the collection as a whole
  • Goal: assign a tf x idf weight to each term in
    each document

8
tf x idf
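Written out (matching the worked example on slide 10, which computes idf as
log2(N/df)):

  weight(i, j) = tf_ij x idf_i = tf_ij x log2(N / df_i)

where tf_ij is the frequency of term i in document j, df_i is the number of
documents in the collection that contain term i, and N is the total number
of documents in the collection.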
9
Inverse Document Frequency
  • IDF provides high values for rare words and low
    values for common words
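Concretely:

  idf_i = log2(N / df_i)

A term that occurs in every document gets idf = log2(N/N) = 0, while a term
that occurs in only one document gets the maximum value, log2(N); with the
N = 6 collection on the next slide, that maximum is log2(6) ≈ 2.58.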

10
tf x idf Example
The initial Term x Doc matrix (inverted index), with documents represented
as vectors of raw term frequencies:

       Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6    df   idf = log2(N/df)
  T1     0      2      4      0      1      0       3        1.00
  T2     1      3      0      0      0      2       3        1.00
  T3     0      1      0      2      0      0       2        1.58
  T4     3      0      1      5      4      0       4        0.58
  T5     0      4      0      0      0      1       2        1.58
  T6     2      7      2      1      3      0       5        0.26
  T7     1      0      0      5      5      1       4        0.58
  T8     0      1      1      0      0      3       3        1.00

The tf x idf Term x Doc matrix (each entry is the term frequency multiplied
by the term's idf):

       Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6
  T1    0.00   2.00   4.00   0.00   1.00   0.00
  T2    1.00   3.00   0.00   0.00   0.00   2.00
  T3    0.00   1.58   0.00   3.17   0.00   0.00
  T4    1.74   0.00   0.58   2.90   2.32   0.00
  T5    0.00   6.34   0.00   0.00   0.00   1.58
  T6    0.53   1.84   0.53   0.26   0.79   0.00
  T7    0.58   0.00   0.00   2.92   2.92   0.58
  T8    0.00   1.00   1.00   0.00   0.00   3.00
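A short Python check that recomputes the second matrix from the first (the
values match the slide up to small rounding differences, since the slide
rounds idf to two decimals in some rows):

  import math

  # Raw term frequencies from the slide: rows = T1..T8, columns = Doc 1..Doc 6.
  tf = [
      [0, 2, 4, 0, 1, 0],
      [1, 3, 0, 0, 0, 2],
      [0, 1, 0, 2, 0, 0],
      [3, 0, 1, 5, 4, 0],
      [0, 4, 0, 0, 0, 1],
      [2, 7, 2, 1, 3, 0],
      [1, 0, 0, 5, 5, 1],
      [0, 1, 1, 0, 0, 3],
  ]
  N = len(tf[0])                                   # 6 documents

  for i, row in enumerate(tf, start=1):
      df = sum(1 for count in row if count > 0)    # docs containing term i
      idf = math.log2(N / df)
      weights = [round(count * idf, 2) for count in row]
      print(f"T{i}: df={df}, idf={idf:.2f}, tf x idf={weights}")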
11
K Nearest Neighbor for Text
Training:
  For each training example <x, c(x)> ∈ D
    Compute the corresponding TF-IDF vector, d_x, for document x

Test instance y:
  Compute the TF-IDF vector d for document y
  For each <x, c(x)> ∈ D
    Let s_x = cosSim(d, d_x)
  Sort the examples x in D by decreasing value of s_x
  Let N be the first k examples in D (the k most similar neighbors)
  Return the majority class of the examples in N
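The same procedure in Python, assuming documents have already been converted
to TF-IDF vectors stored as sparse dicts mapping term to weight (the
function names are illustrative):

  import math
  from collections import Counter

  def cos_sim(a, b):
      # Cosine similarity between two sparse term->weight vectors.
      dot = sum(w * b.get(t, 0.0) for t, w in a.items())
      na = math.sqrt(sum(w * w for w in a.values()))
      nb = math.sqrt(sum(w * w for w in b.values()))
      return dot / (na * nb) if na and nb else 0.0

  def knn_classify(D, d, k=3):
      # D: list of (tfidf_vector, class_label) training pairs; d: test vector.
      ranked = sorted(D, key=lambda pair: cos_sim(d, pair[0]), reverse=True)
      top_k = [label for _, label in ranked[:k]]   # the k most similar neighbors
      return Counter(top_k).most_common(1)[0][0]   # majority class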