Title: Text Classification
1 Text Classification
- Topics in Machine Learning (WS2003/04)
- Presented by
- Gulraj Singh Mrok, Eugenie Giesbrecht
3 Solution: SVM linear classifier
4 Non-Linear Classifiers
- Kernel: a function that returns the value of the dot
product between the images of its two arguments, $K(x, z) = \langle \phi(x), \phi(z) \rangle$
An SVM relies on kernels for classification.
We know kernels only for vectors.
But what if we have text?
"I am a long text string."
How can an SVM be used to classify text
documents and strings?
7 Solution steps
1. Convert documents and strings to a vector
space representation.
Conversion algorithms: BOW (bag of words), SSK (string subsequence kernel)
Example input: "I am a long text string."
2. Develop a kernel that implicitly defines the feature space.
3. Run the SVM classifier.
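The three steps can be sketched in a few lines. This is a minimal sketch, assuming a plain bag-of-words conversion and a linear kernel; the helper names `to_vector`, `linear_kernel` and `gram_matrix` and the two extra sample sentences are hypothetical. Step 3 is only indicated: the resulting Gram matrix is the kind of input an SVM with a precomputed kernel accepts (for example scikit-learn's `SVC(kernel="precomputed")`).

```python
import re
from collections import Counter


def to_vector(text):
    # Step 1: map a document to a sparse term -> count vector.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def linear_kernel(x, z):
    # Step 2: inner product of two sparse count vectors.
    return sum(count * z[term] for term, count in x.items())


def gram_matrix(docs, kernel):
    # Kernel value for every pair of documents.
    vectors = [to_vector(d) for d in docs]
    return [[kernel(a, b) for b in vectors] for a in vectors]


docs = [
    "I am a long text string.",
    "I am a short text.",          # hypothetical extra document
    "Kernels compare vectors.",    # hypothetical extra document
]
G = gram_matrix(docs, linear_kernel)
# Step 3 (not shown): train an SVM on G together with the class labels.
```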
8 Document Domains
9 What is Text Categorization?
"I am an IT document": term "Information" occurs 5 times, term
"Company" occurs 3 times
"I am a Survey document": term "Information" occurs 2 times, term
"Company" occurs 4 times
10 [Figure: documents plotted as points in term space with coordinates (Company, Information); IT document at (3, 5), Survey document at (4, 2)]
Distance: inner product of vectors
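As a worked example of the inner product for the two document vectors (3, 5) and (4, 2), here is a minimal sketch. The cosine similarity is added as an assumption about how such a score is commonly normalised; the names `inner_product` and `cosine` are illustrative.

```python
import math


def inner_product(x, z):
    # Sum of coordinate-wise products.
    return sum(a * b for a, b in zip(x, z))


def cosine(x, z):
    # Inner product divided by the two vector lengths.
    return inner_product(x, z) / math.sqrt(inner_product(x, x) * inner_product(z, z))


it_doc = (3, 5)      # IT document
survey_doc = (4, 2)  # Survey document

similarity = inner_product(it_doc, survey_doc)  # 3*4 + 5*2 = 22
normalised = cosine(it_doc, survey_doc)
```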
11 Method 1: Bag of Words, Document Vectors
- "Nova" occurs 10 times in text A
- "Galaxy" occurs 5 times in text A
- "Heat" occurs 3 times in text A
(Blank means 0 occurrences.)
12 Method 1: Bag of Words, Document Vectors
- "Hollywood" occurs 7 times in text I
- "Film" occurs 5 times in text I
- "Diet" occurs 1 time in text I
- "Fur" occurs 3 times in text I
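Document vectors of this kind can be built by counting terms over a shared vocabulary. The following is a minimal sketch; the two short sample texts are hypothetical stand-ins (not the texts A and I from the slides), and `bag_of_words` and `to_dense` are illustrative names.

```python
import re
from collections import Counter


def bag_of_words(text):
    # Count how often each lower-cased word occurs.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def to_dense(bag, vocabulary):
    # One entry per vocabulary term; absent terms count as 0.
    return [bag[term] for term in vocabulary]


text_a = "Nova galaxy nova heat"         # hypothetical sample
text_i = "Hollywood film diet fur film"  # hypothetical sample

bag_a, bag_i = bag_of_words(text_a), bag_of_words(text_i)
vocabulary = sorted(set(bag_a) | set(bag_i))
vec_a = to_dense(bag_a, vocabulary)
vec_i = to_dense(bag_i, vocabulary)
```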
13 Method 1: Bag of Words, Stemmer
Stemmer removes inflection information
1. Computer
2. Computation
3. Compute
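To illustrate how the three words above can collapse to one term, here is a toy suffix stripper. It is a deliberately simplified sketch and not a real stemming algorithm (such as Porter's); the suffix list and the name `toy_stem` are assumptions chosen only to cover this example.

```python
# Illustrative suffix list, longest first; a real stemmer uses many more rules.
SUFFIXES = ("ation", "er", "e")


def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the first matching suffix, keeping at least 3 characters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = {toy_stem(w) for w in ("Computer", "Computation", "Compute")}
```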
14 Method 1: Bag of Words, Document Vectors
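15 Method 1: Weighting vector for each word
Weight of term $t_i$ in document $d$: (frequency of the term in the document) × (ratio of the total number of documents $N$ to the number of documents containing that term); $\kappa$ is a normalization factor.
$\phi_i(d) = \frac{1}{\kappa} \, \mathrm{tf}(t_i, d) \cdot \frac{N}{\mathrm{df}(t_i)}$
$K(x, z) = \langle \phi(x), \phi(z) \rangle$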
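The weighting and the resulting kernel can be sketched as follows. This is a minimal sketch under stated assumptions: the ratio of document counts is passed through a logarithm (a common tf-idf variant), the normalization factor is taken to be the Euclidean length of the weight vector, and the three sample documents and the names `tfidf_vectors` and `kernel` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    bags = [Counter(d.lower().split()) for d in docs]
    n_docs = len(bags)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for bag in bags for term in bag)
    vectors = []
    for bag in bags:
        # Assumption: log of the document ratio as the inverse-frequency part.
        weights = {t: tf * math.log(n_docs / df[t]) for t, tf in bag.items()}
        # Assumption: normalization factor = Euclidean length of the vector.
        kappa = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / kappa for t, w in weights.items()})
    return vectors


def kernel(x, z):
    # K(x, z) = <phi(x), phi(z)> on sparse weight vectors.
    return sum(w * z.get(t, 0.0) for t, w in x.items())


docs = [
    "information company information",  # hypothetical samples
    "company survey",
    "information report",
]
v0, v1, v2 = tfidf_vectors(docs)
```

With this normalization every non-empty document has kernel value 1 with itself, so document length no longer dominates the score.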
16 Disadvantages of Method 1
- 1. Not good for large documents
17 Disadvantages of Method 1
- 2. Loss of word order
- 3. Only terms and their frequencies are known
18 Method 2 for text classification
Documents as sequences of symbols, i.e. text
19 Bag of Words
- Map a document into a bag of words
- Bag = set with repetitions allowed
- We lose:
- - all the word order information
- - inflection information
20 String Representations
- Documents as symbol sequences: symbols can be
letters, syllables, words, etc.
- The feature space: generated by the set of all
(non-contiguous) substrings of k symbols.
- The more substrings two documents have in common,
the more similar they are considered.
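The feature set described above can be enumerated directly for short strings. This is a minimal sketch; the words "cat" and "car" and the name `subsequences` are illustrative assumptions, and the approach is only practical for very short inputs because the number of subsequences grows quickly.

```python
from itertools import combinations


def subsequences(s, k):
    # All length-k subsequences of s, keeping symbol order;
    # the chosen symbols do not have to be adjacent.
    return {"".join(chars) for chars in combinations(s, k)}


features_cat = subsequences("cat", 2)
features_car = subsequences("car", 2)
shared = features_cat & features_car
```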
21 Example: k = 2
22 Example: Kernel
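23 SSK: String Subsequence Kernel
Let's build an SSK kernel with an example.
We define a finite alphabet $\Sigma$.
Example text: "support vector machine is cool"
A string is a finite sequence of characters from $\Sigma$, e.g. the strings $s$ and $t$:
$s$ = "port vector", $t$ = "cool"
Length of $s$: $|s| = 11$; length of $t$: $|t| = 4$
Concatenation: $ts$ = "coolport vector"
$\Sigma^n$: set of all finite strings of length $n$
e.g. $\Sigma^3$ contains sup, upp, ppo, ..., ool and $\Sigma^2$ contains su, up,
pp, ..., ol
Substring $s[i:j]$: $s[6:11]$ = "vector"
$1 \le i \le |s|$, i.e. $1 \le i \le 11$
$\Sigma^* = \bigcup_{n=0}^{\infty} \Sigma^n = \Sigma^0 \cup \Sigma^1 \cup \Sigma^2 \cup \dots$ (a very big set)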
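The string notation can be checked with ordinary string operations. This is a small sketch, assuming 1-based, inclusive positions for substrings; the helper name `substring` is hypothetical.

```python
s = "port vector"
t = "cool"

length_s = len(s)  # |s|
length_t = len(t)  # |t|
ts = t + s         # concatenation of t and s


def substring(string, i, j):
    # Substring from position i to position j, 1-based and inclusive.
    return string[i - 1 : j]


# Locate "vector" inside s and express its span in 1-based positions.
start = s.index("vector") + 1
end = start + len("vector") - 1
piece = substring(s, start, end)
```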
24 In short
- Big-O notation
- Difficult to compute
26 Recursive computation of the subsequence kernel
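A recursive computation can be sketched as follows. This is a minimal sketch, assuming the commonly used formulation of the subsequence kernel in which a decay factor (called `lam` here) penalises the total span of each matched subsequence; the function names `ssk` and `ssk_normalised` and the memoisation via `lru_cache` are illustrative choices, and the code is meant for short strings only.

```python
import math
from functools import lru_cache


def ssk(s, t, n, lam):
    """Unnormalised subsequence kernel K_n(s, t) with decay factor lam."""

    @lru_cache(maxsize=None)
    def k_prime(i, ls, lt):
        # Auxiliary kernel K'_i on the prefixes s[:ls] and t[:lt].
        if i == 0:
            return 1.0
        if min(ls, lt) < i:
            return 0.0
        x = s[ls - 1]  # last symbol of the prefix of s
        total = lam * k_prime(i, ls - 1, lt)
        for j in range(1, lt + 1):  # 1-based positions in t
            if t[j - 1] == x:
                total += k_prime(i - 1, ls - 1, j - 1) * lam ** (lt - j + 2)
        return total

    @lru_cache(maxsize=None)
    def k(ls, lt):
        # K_n on the prefixes s[:ls] and t[:lt].
        if min(ls, lt) < n:
            return 0.0
        x = s[ls - 1]
        total = k(ls - 1, lt)
        for j in range(1, lt + 1):
            if t[j - 1] == x:
                total += k_prime(n - 1, ls - 1, j - 1) * lam ** 2
        return total

    return k(len(s), len(t))


def ssk_normalised(s, t, n, lam):
    # Divide by the self-similarities so that K(s, s) = 1.
    denom = math.sqrt(ssk(s, s, n, lam) * ssk(t, t, n, lam))
    return ssk(s, t, n, lam) / denom if denom else 0.0


value = ssk("cat", "car", 2, 0.5)
```

For "cat" and "car" with n = 2 the only shared subsequence is "ca", which spans two symbols in each word, so the kernel value is lam to the fourth power.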
27 N-grams: to reduce the number of text subsequences
E.g. "hello hellgo" (with _ marking the space):
- 2-grams: he, el, ll, lo, o_, _h, he, el, ll,
lg, go
- 3-grams: hel, ell, llo, lo_, o_h, _he, hel, ell,
llg, lgo
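The n-gram lists above can be reproduced with a sliding window. This is a minimal sketch; the name `char_ngrams` is hypothetical, and replacing spaces with an underscore simply mirrors the notation on the slide.

```python
def char_ngrams(text, n):
    # Mark spaces with "_" and slide a window of width n over the text.
    text = text.replace(" ", "_")
    return [text[i : i + n] for i in range(len(text) - n + 1)]


bigrams = char_ngrams("hello hellgo", 2)
trigrams = char_ngrams("hello hellgo", 3)
```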
- SSK: an effective alternative to BOW
- NB:
- - excellent results on smaller datasets
- - less encouraging on the full Reuters dataset