Title: Text Classification
1. Text Classification
- Topics in Machine Learning (WS 2003/04)
- Presented by Gulraj Singh Mrok and Eugenie Giesbrecht
2. Problem
3. Solution: SVM Linear Classifier
- (Figure: a linear decision boundary separating the "good guys" from the "bad guys".)
4. Non-Linear Classifiers
5. Kernel
- A function that returns the value of the dot product between the images of its two arguments.
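A minimal illustration of this definition: the degree-2 polynomial kernel returns the dot product of the images of its two arguments without ever constructing the feature space explicitly. The feature map `phi` below is the standard one for this kernel; the input vectors are made up for the example.

```python
import math

def poly2_kernel(x, z):
    # K(x, z) = (x . z)^2 for 2-D inputs, computed directly.
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    # Explicit feature map whose dot product equals the kernel:
    # phi(x) = (x1^2, x2^2, sqrt(2) * x1 * x2)
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

x, z = (1.0, 2.0), (3.0, 1.0)
implicit = poly2_kernel(x, z)                            # kernel trick
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))    # explicit images
print(implicit, explicit)  # both approximately 25.0
```

The point of the kernel trick is that `poly2_kernel` never builds the 3-dimensional images, yet agrees with the explicit computation.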
6. Problem
- SVMs rely on kernels for classification.
- We know kernels only for vectors.
- But what if we have text? ("I am a long text string.")
- How can we use an SVM to classify text documents and strings?
7. Solution Steps
- 1. Convert documents and strings ("I am a long text string.") to a vector space representation. Conversion algorithms: BOW, SSK.
- 2. Develop a kernel over the feature space.
- 3. Run the SVM classifier.
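The three steps above can be sketched end to end. To keep the sketch dependency-free, a perceptron stands in for the SVM in step 3 (an assumption of this example; in practice one would use an actual SVM implementation with a linear kernel). The toy documents and labels are made up.

```python
# Step 1: bag-of-words vector space representation.
def bow(doc, vocab):
    words = doc.lower().split()
    return [words.count(t) for t in vocab]

docs = ["information systems company network",
        "survey company market questionnaire",
        "network information software",
        "market survey respondents"]
labels = [1, -1, 1, -1]  # 1 = IT, -1 = Survey

vocab = sorted({w for d in docs for w in d.lower().split()})
X = [bow(d, vocab) for d in docs]

# Steps 2-3: a linear classifier sign(w . x), trained with
# perceptron updates (a stand-in for the SVM).
w = [0.0] * len(vocab)
for _ in range(20):
    for x, y in zip(X, labels):
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
            w = [wi + y * xi for wi, xi in zip(w, x)]

test_vec = bow("company information network", vocab)
score = sum(wi * xi for wi, xi in zip(w, test_vec))
print("IT" if score > 0 else "Survey")  # IT
```

The pipeline shape is the same with a real SVM: vectorize, define the (here linear) kernel, fit, predict.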
8. Document Domains
9. What is Text Categorization?
- An IT document: term "Information" occurs 5 times, term "Company" 3 times.
- A Survey document: term "Information" occurs 2 times, term "Company" 4 times.
10. (Figure: document vectors in the term space spanned by "Company" (x-axis) and "Information" (y-axis): the IT document at (3, 5), the Survey document at (4, 2). Similarity is measured by distance or by the inner product of the vectors.)
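The inner product the figure mentions, worked for the two document vectors from the slide:

```python
# Document vectors over the axes (Company, Information):
it, survey = (3, 5), (4, 2)

# Inner product as a similarity measure between the two documents.
dot = it[0] * survey[0] + it[1] * survey[1]
print(dot)  # 3*4 + 5*2 = 22
```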
11. Method 1: Bag of Words, Document Vectors
- "Nova" occurs 10 times in text A; "Galaxy" occurs 5 times in text A; "Heat" occurs 3 times in text A. (Blank means 0 occurrences.)
12. Method 1: Bag of Words, Document Vectors
- "Hollywood" occurs 7 times in text I; "Film" occurs 5 times in text I; "Diet" occurs 1 time in text I; "Fur" occurs 3 times in text I.
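Counting term occurrences like this is a one-liner with a standard multiset; the texts below are synthesized to reproduce the slide's counts:

```python
from collections import Counter

def bow_counts(text):
    # Bag of words: term -> number of occurrences.
    return Counter(text.lower().split())

# Text A from the slide: nova x10, galaxy x5, heat x3.
a = bow_counts("nova " * 10 + "galaxy " * 5 + "heat " * 3)
print(a["nova"], a["galaxy"], a["heat"], a["film"])  # 10 5 3 0
```

A `Counter` returns 0 for absent terms, matching the slide's "blank means 0 occurrences".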
13. Method 1: Bag of Words, Stemmer
- A stemmer removes inflection information, so that e.g. 1. Computer, 2. Computation, 3. Compute all map to the same stem.
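A crude suffix-stripping sketch of the idea (this is NOT a full stemmer; Porter's algorithm is the standard choice). It reduces the slide's three examples to the common stem "comput":

```python
def crude_stem(word):
    # Strip one known suffix, keeping at least a 3-letter stem.
    for suffix in ("ation", "ations", "er", "ers", "e", "ing"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("computer", "computation", "compute"):
    print(w, "->", crude_stem(w))  # all three -> comput
```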
14. Method 1: Bag of Words, Document Vectors
15. Method 1: A Weighting Vector for Each Word
- weight(t, d) = tf(t, d) × (N / n_t) / k, where tf(t, d) is the frequency of term t in document d, N is the total number of documents, n_t is the number of documents containing t, and k is a normalization factor.
- K(x, z) = ⟨φ(x), φ(z)⟩
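The slide's weighting is term frequency times the ratio of corpus size to document frequency. The sketch below uses the common log-of-ratio (tf-idf) variant and omits the normalization factor k; the tiny corpus is made up.

```python
import math

docs = [["information", "company", "information"],
        ["company", "market"],
        ["information", "software"]]
N = len(docs)

def df(term):
    # Number of documents containing the term.
    return sum(term in d for d in docs)

def weight(term, doc):
    # tf * log(N / df): frequent in the document,
    # rare in the corpus -> high weight.
    return doc.count(term) * math.log(N / df(term))

d0 = docs[0]
print(round(weight("information", d0), 3))  # tf=2, df=2 -> 2*ln(3/2)
print(round(weight("company", d0), 3))      # tf=1, df=2 -> ln(3/2)
```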
16. Disadvantages of Method 1
- 1. Not good for large documents.
17. Disadvantages of Method 1 (continued)
- 2. Word order is lost.
- 3. Only terms and their frequencies are known.
18. Method 2 for Text Classification
- Documents as sequences of symbols, i.e. text strings.
19. Bag of Words
- Map a document into a bag of words.
- Bag: a set with repetitions allowed.
- We lose: all the word order information; inflection information.
20. String Representations
- Documents as symbol sequences; symbols can be letters, syllables, words, etc.
- The feature space is generated by the set of all (non-contiguous) substrings of k symbols.
- The more substrings two documents have in common, the more similar they are considered.
21. Example: k = 2
22. Example: Kernel
23. SSK: String Subsequence Kernel
- Let's build an SSK kernel with an example.
- We define a finite alphabet Σ, e.g. from "support vector machine is cool".
- A string is a finite sequence of characters from Σ, e.g. s = "port vector", t = "cool".
- Length: |s| = 11, |t| = 4.
- Concatenation: ts = "coolport vector".
- Σⁿ is the set of all finite strings of length n, e.g. Σ³ = {sup, upp, ppo, ..., ool} or Σ² = {su, up, pp, ..., ol}.
- s[i:j] denotes a substring, e.g. s[5:10] = "vector", with 1 ≤ i ≤ |s| = 11.
- Σ* = Σ⁰ ∪ Σ¹ ∪ Σ² ∪ ... (a very big set).
24. In Short
25. Complexity
- Big-O notation
- Difficult to compute
26. Recursive Computation of the Subsequence Kernel
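A sketch of the recursive SSK computation, following the standard formulation of Lodhi et al. (the decay factor `LAM`, which penalizes gaps in a subsequence, is an illustrative choice; the slides fix no value). Memoization keeps the recursion tractable for short strings.

```python
from functools import lru_cache

LAM = 0.5  # gap-decay factor, 0 < LAM <= 1

@lru_cache(maxsize=None)
def kprime(i, s, t):
    # Auxiliary kernel over length-i subsequence prefixes.
    if i == 0:
        return 1.0
    if min(len(s), len(t)) < i:
        return 0.0
    x, s0 = s[-1], s[:-1]
    total = LAM * kprime(i, s0, t)
    for j in range(len(t)):
        if t[j] == x:
            total += kprime(i - 1, s0, t[:j]) * LAM ** (len(t) - j + 1)
    return total

@lru_cache(maxsize=None)
def ssk(n, s, t):
    # Kernel value: decay-weighted count of common length-n subsequences.
    if min(len(s), len(t)) < n:
        return 0.0
    x, s0 = s[-1], s[:-1]
    total = ssk(n, s0, t)
    for j in range(len(t)):
        if t[j] == x:
            total += kprime(n - 1, s0, t[:j]) * LAM ** 2
    return total

# "cat" and "car" share one length-2 subsequence, "ca", contiguous
# in both strings, so K_2 = LAM**4.
print(ssk(2, "cat", "car"))  # 0.0625
```

In practice the kernel is normalized, K̂(s, t) = K(s, t) / sqrt(K(s, s) K(t, t)), so that document length does not dominate.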
27. N-grams: restrict to contiguous text subsequences. E.g. for the string "hello hellgo" (with "_" marking the space):
- 2-grams: he, el, ll, lo, o_, _h, he, el, ll, lg, go
- 3-grams: hel, ell, llo, lo_, o_h, _he, hel, ell, llg, lgo
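Extracting character n-grams is a simple sliding window; the sketch below reproduces the slide's lists for "hello hellgo":

```python
def ngrams(text, n):
    # All contiguous length-n substrings of text.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngrams("hello hellgo", 2))  # he, el, ll, lo, 'o ', ' h', he, el, ll, lg, go
print(ngrams("hello hellgo", 3))  # hel, ell, llo, 'lo ', 'o h', ' he', hel, ell, llg, lgo
```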
28. Conclusion
- SSK is an effective alternative to BOW.
- NB: excellent results on smaller datasets; less encouraging on the full Reuters dataset.
29. Questions?