Text Classification - PowerPoint PPT Presentation

5-Jan-2004

1
Text Classification
  • Topics in Machine Learning (WS2003/04)
  • Presented by Gulraj Singh Mrok and Eugenie Giesbrecht

2
Problem
3
Solution: SVM, a linear classifier
[Figure: two classes labeled GOOD GUYS and BAD GUYS]
4

Non-Linear Classifiers
5
Kernel
  • A function that returns the value of the dot
    product between the images of the two arguments
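
As a minimal sketch of this definition, the degree-2 polynomial kernel on 2-D vectors can be checked against an explicit feature map. The choice of kernel and the names `phi` and `poly2_kernel` are illustrative assumptions, not taken from the slides.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map (the 'image') for the degree-2 polynomial kernel on 2-D input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(x, z):
    """Kernel computed directly in input space, without building phi."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(z))   # dot product between the images
implicit = poly2_kernel(x, z)    # same value via the kernel function
print(explicit, implicit)
```

Both routes give the same number (up to floating-point rounding), which is the point of a kernel: the dot product in feature space is obtained without constructing the images.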

6
Problem
SVM relies on kernels for classification.
We know kernels only for vectors.
But what if we have text?
"I am a long text string."
How can we use an SVM to classify text documents and strings?
7
Solution steps
1. Convert documents and strings to a vector space representation.
   Conversion algorithms: BOW and SSK.
   Example input: "I am a long text string."
2. Develop a kernel to discover the feature space.
3. Run the SVM classifier.
8
Document Domains
9
What is Text Categorization?
"I am an IT document": term "Information" 5, term "Company" 3
"I am a Survey document": term "Information" 2, term "Company" 4
10
[Figure: document vectors plotted on axes Company and Information, with IT at (3, 5) and Survey at (4, 2)]
Distance: inner product of vectors
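
A minimal sketch of comparing the two document vectors by their inner product. It assumes the coordinates are ordered (Company, Information), which matches the term counts on the previous slide. The cosine normalization is an added illustration, not something stated on the slide.

```python
import math

# Term-count vectors, assumed order: (Company, Information)
it_doc = (3, 5)
survey_doc = (4, 2)

def inner(u, v):
    """Inner product of two document vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Inner product normalized by vector lengths, so document size does not dominate."""
    return inner(u, v) / (math.sqrt(inner(u, u)) * math.sqrt(inner(v, v)))

print(inner(it_doc, survey_doc))   # 3*4 + 5*2 = 22
print(cosine(it_doc, survey_doc))
```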
11
Method 1: Bag of Words, Document Vectors
  • "Nova" occurs 10 times in text A
  • "Galaxy" occurs 5 times in text A
  • "Heat" occurs 3 times in text A
  • (Blank means 0 occurrences.)
12
Method 1: Bag of Words, Document Vectors
  • "Hollywood" occurs 7 times in text I
  • "Film" occurs 5 times in text I
  • "Diet" occurs 1 time in text I
  • "Fur" occurs 3 times in text I
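
A minimal sketch of how such a document vector could be built: count the words, then read the counts off in a fixed vocabulary order. The tokenizer, the sample text, and the vocabulary are assumptions chosen for illustration.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def to_vector(bag, vocabulary):
    """One count per vocabulary term; terms absent from the document get 0."""
    return [bag[term] for term in vocabulary]

text = "Nova nova galaxy heat nova"          # hypothetical document
vocabulary = ["nova", "galaxy", "heat", "film"]

bag = bag_of_words(text)
vector = to_vector(bag, vocabulary)
print(vector)
```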
13
Method 1: Bag of Words, Stemmer
A stemmer removes inflection information.
  • Input: 1. Computer 2. Computation 3. Compute
  • Result: Comput
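
A toy sketch of the idea, assuming a hand-picked suffix list that happens to cover the three example words. This is not a real stemming algorithm such as Porter's; `SUFFIXES` and `toy_stem` are hypothetical names.

```python
# Hypothetical suffix list, longest first, chosen only for this example
SUFFIXES = ("ation", "er", "e")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three leading characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w

stems = [toy_stem(w) for w in ("Computer", "Computation", "Compute")]
print(stems)
```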

14
Method 1: Bag of Words, Document Vectors
15
Method 1: Weighting vector for each word
Weight = (frequency of the term in the document) × (ratio of total documents to documents containing that term) / κ, where κ is a normalization factor.
$K(x, z) = \langle \phi(x) \cdot \phi(z) \rangle$
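
A minimal sketch of this weighting and the resulting kernel. Two details are assumptions: the document ratio is log-scaled, as in the usual TF-IDF formulation, and κ is taken to be the Euclidean length of the weight vector. The toy documents are made up.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, normalized weight vectors)."""
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    # Number of documents containing each term
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        # Assumption: log of the document ratio, as in standard TF-IDF
        raw = [tf[t] * math.log(n / df[t]) for t in vocab]
        # Assumption: kappa is the Euclidean norm (1.0 guards an all-zero vector)
        kappa = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / kappa for w in raw])
    return vocab, vectors

def kernel(x, z):
    """K(x, z) = <phi(x) . phi(z)> on the weighted document vectors."""
    return sum(a * b for a, b in zip(x, z))

docs = [
    ["information", "information", "company"],
    ["company", "survey", "survey"],
    ["information", "film"],
]
vocab, vecs = tfidf_vectors(docs)
print(kernel(vecs[0], vecs[1]), kernel(vecs[1], vecs[2]))
```

Documents that share no terms get a kernel value of 0, and every non-empty document has a kernel value of 1 with itself because of the normalization.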
16
Disadvantages of Method 1
  • 1. Not good for large documents

17
Disadvantages of Method 1
  • 2. Loss of word order
  • 3. Only terms and their frequencies are known
18
Method 2 for text classification
Documents as a sequence of symbols, i.e. text strings
19
Bag of Words
  • Map a document into a bag of words
  • Bag: a set with repetitions allowed
  • We lose all the word order information
  • We lose inflection information

20
String Representations
  • Documents as symbol sequences: symbols can be
    letters, syllables, words, etc.
  • The feature space is generated by the set of all
    (possibly non-contiguous) substrings of k symbols.
  • The more substrings two documents have in common,
    the more similar they are considered.
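
A minimal sketch of this feature space for short strings: enumerate every length-k subsequence, contiguous or not, and intersect the sets of two strings. The words "cat" and "car" are assumed examples. Gaps are not weighted here, so this shows only which features are shared, not the full kernel value.

```python
from itertools import combinations

def k_subsequences(s, k):
    """Set of all length-k subsequences of s; characters need not be adjacent."""
    return {"".join(s[i] for i in idx) for idx in combinations(range(len(s)), k)}

def shared_subsequences(s, t, k):
    """Features the two strings have in common."""
    return k_subsequences(s, k) & k_subsequences(t, k)

print(sorted(k_subsequences("cat", 2)))
print(shared_subsequences("cat", "car", 2))
```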

21
Example: k = 2
22
Example Kernel
23
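SSK: String Subsequence Kernel
Let's build an SSK now with an example.
  • We define a finite alphabet $\Sigma$, e.g. the characters of "support vector machine is cool".
  • A string is a finite sequence of characters from $\Sigma$, e.g. $s$ = "port vector", $t$ = "cool".
  • Length: $|s| = 11$, $|t| = 4$.
  • Concatenation: $ts$ = "coolport vector".
  • $\Sigma^n$ is the set of all finite strings of length $n$, e.g. $\Sigma^3 \ni$ sup, upp, ppo, ..., ool and $\Sigma^2 \ni$ su, up, pp, ..., ol.
  • Substring: $s(i:j)$, e.g. $s(6:11)$ = "vector", with $1 \le i \le |s|$, i.e. $1 \le i \le 11$.
  • $\Sigma^* = \Sigma^0 \cup \Sigma^1 \cup \Sigma^2 \cup \dots$ (a very big set).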
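
The same notions can be tried out directly on Python strings. This is only an illustrative mapping: Python slices are 0-based and exclude the end index, so the word "vector", which occupies positions 6 to 11 of "port vector" in 1-based inclusive counting, is the slice `s[5:11]`.

```python
text = "support vector machine is cool"
alphabet = sorted(set(text))       # the finite alphabet: distinct characters of the text

s = "port vector"
t = "cool"

length_s, length_t = len(s), len(t)   # |s| and |t|
ts = t + s                            # concatenation ts
sub = s[5:11]                         # positions 6..11 in 1-based inclusive notation

# Number of strings of length n over the alphabet is |alphabet| ** n
size_sigma_3 = len(alphabet) ** 3

print(length_s, length_t, ts, sub, size_sigma_3)
```

The last value shows why the union of all these sets is "very big": the number of candidate strings grows exponentially with the length n.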
24
In short
  • Feature space
  • Kernel

25
Complexity
  • Big Oh notation
  • Difficult to compute

26
Recursive computation of the subsequence Kernel
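
A sketch of one way to compute the subsequence kernel recursively, assuming the commonly published recursion with a decay factor (here `lam`) that penalizes the total span of each matched subsequence. The function names, the memoized structure, and the test strings are illustrative choices. This simple version favors readability over the fastest known running time.

```python
import math
from functools import lru_cache

def ssk(s, t, n, lam):
    """Subsequence kernel of order n between strings s and t with decay lam."""

    @lru_cache(maxsize=None)
    def k_prime(i, ls, lt):
        # Auxiliary kernel K'_i on the prefixes s[:ls] and t[:lt]
        if i == 0:
            return 1.0
        if min(ls, lt) < i:
            return 0.0
        x = s[ls - 1]                          # last character of the s prefix
        total = lam * k_prime(i, ls - 1, lt)
        for j in range(lt):                    # every occurrence of x in the t prefix
            if t[j] == x:
                total += k_prime(i - 1, ls - 1, j) * lam ** (lt - j + 1)
        return total

    @lru_cache(maxsize=None)
    def k(ls, lt):
        # Kernel K_n on the prefixes s[:ls] and t[:lt]
        if min(ls, lt) < n:
            return 0.0
        x = s[ls - 1]
        total = k(ls - 1, lt)
        for j in range(lt):
            if t[j] == x:
                total += k_prime(n - 1, ls - 1, j) * lam ** 2
        return total

    return k(len(s), len(t))

def ssk_normalized(s, t, n, lam):
    """Kernel value scaled so that identical strings score 1."""
    return ssk(s, t, n, lam) / math.sqrt(ssk(s, s, n, lam) * ssk(t, t, n, lam))

lam = 0.5
print(ssk("cat", "car", 2, lam), ssk_normalized("cat", "car", 2, lam))
```

For "cat" and "car" with n = 2 the only shared subsequence is "ca", contiguous in both strings, so the unnormalized value is `lam ** 4`.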
27
N-grams: used to reduce the number of text subsequences.
E.g. "hello hellgo"
  • 2-grams: he, el, ll, lo, o_, _h, he, el, ll,
    lg, go
  • 3-grams: hel, ell, llo, lo_, o_h, _he, hel, ell,
    llg, lgo
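
The n-gram lists above can be reproduced with a short sliding-window sketch. The function name `char_ngrams` and the `space` parameter are assumptions. The underscore stands for the space character, as on the slide.

```python
def char_ngrams(text, n, space="_"):
    """All contiguous character n-grams of text, with spaces shown as `space`."""
    s = text.replace(" ", space)
    return [s[i:i + n] for i in range(len(s) - n + 1)]

bigrams = char_ngrams("hello hellgo", 2)
trigrams = char_ngrams("hello hellgo", 3)
print(bigrams)
print(trigrams)
```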

28
Conclusion
  • SSK: an effective alternative to BOW
  • NB: excellent results on smaller datasets,
    less encouraging on the full Reuters dataset

29
Questions???