1
IFT6255: Information Retrieval
A synthesis, analysis and comparison of text classification algorithms
Ligen Wang, Jing Bai
2
Overview
  • Definition of text classification
  • Important processes in classification
  • Classification algorithms
  • Advantages and disadvantages of algorithms
  • Performance comparison of algorithms
  • Conclusion

3
Text Classification
  • Text classification (text categorization)
  • assign documents to one or more predefined
    categories

  • Documents → {class1, class2, ..., classn}

4
Illustration of Text Classification
(Figure: example documents assigned to the categories Science, Sport and Art)
5
Applications of Text Classification
  • Organize web pages into hierarchies
  • Domain-specific information extraction
  • Sort email into different folders
  • Find interests of users
  • Etc.

6
Text Classification Framework
Documents → Preprocessing → Indexing → Feature selection → Applying classification algorithms → Performance measure
7
Preprocessing
  • Preprocessing
  • transform documents into a suitable
    representation for the classification task
  • Remove HTML or other tags
  • Remove stopwords
  • Perform word stemming (remove suffixes)
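Below is a minimal preprocessing sketch in plain Python. The regex tag-stripper, the tiny stopword list and the crude suffix stripper are illustrative stand-ins (a real system would use a full stopword list and a proper stemmer such as Porter's), not the pipeline used by the presenters.

```python
import re

# Illustrative stopword list; a real system would use a much larger one.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to", "is", "are"}

def naive_stem(word):
    """Crude suffix removal standing in for a real stemmer (e.g. Porter)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(document):
    """Strip HTML tags, lowercase, drop stopwords, and stem the rest."""
    text = re.sub(r"<[^>]+>", " ", document)      # remove HTML or other tags
    tokens = re.findall(r"[a-z]+", text.lower())  # tokenize
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("<p>Classifying documents is <b>interesting</b></p>"))
# ['classify', 'document', 'interest']
```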

8
Indexing
  • Indexing by different weighting schemes
  • Boolean weighting
  • Word frequency weighting
  • tf-idf weighting
  • ltc weighting
  • Entropy weighting
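As an illustration of one of these schemes, here is a small tf-idf weighting sketch over tokenized documents; the unsmoothed idf = log(N/df) variant used here is an assumption, since the slide does not fix a formula.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document,
    using tf * idf with idf = log(N / df)."""
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

docs = [["wheat", "corn", "price"], ["wheat", "export"], ["football", "score"]]
for vec in tfidf_vectors(docs):
    print(vec)
```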

9
Feature Selection
  • Feature selection:
    remove non-informative terms from documents
  • → improve classification effectiveness
  • → reduce computational complexity

10
Different Feature Selection Methods
  • Document Frequency Thresholding (DF)
  • Information Gain (IG)
  • χ² statistic (CHI)
  • Mutual Information (MI)
  • Term Strength (TS)
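To make one of these selectors concrete, here is a sketch of the χ² (CHI) statistic for a term/class pair, computed from the standard 2x2 contingency table; the variable names and toy data are ours.

```python
def chi_square(docs, labels, term, cls):
    """χ²(term, cls) from the 2x2 contingency table of
    term presence vs. class membership."""
    A = B = C = D = 0
    for doc, label in zip(docs, labels):
        present = term in doc
        if present and label == cls:       A += 1  # term present, in class
        elif present and label != cls:     B += 1  # term present, other class
        elif not present and label == cls: C += 1  # term absent, in class
        else:                              D += 1  # term absent, other class
    n = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return 0.0 if denom == 0 else n * (A * D - C * B) ** 2 / denom

docs = [{"wheat", "price"}, {"wheat", "export"}, {"football", "score"}]
labels = ["grain", "grain", "sport"]
print(chi_square(docs, labels, "wheat", "grain"))  # 3.0 on this toy set
```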

11
Classification Algorithms
  • Rocchio's algorithm
  • K-Nearest-Neighbor algorithm (KNN)
  • Decision Tree algorithm (DT)
  • Naive Bayes algorithm (NB)
  • Artificial Neural Network (ANN)
  • Support Vector Machine (SVM)
  • Voting algorithms

12
Rocchio's Algorithm
  • Build a prototype vector for each class
  • prototype vector = average vector over all
    training document vectors that belong to class ci
  • Calculate similarity between test document and
    each of prototype vectors
  • Assign test document to the class with maximum
    similarity
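A minimal Rocchio sketch under simplifying assumptions: the prototype is the plain centroid of each class (ignoring the α/β weighting mentioned on the next slide), documents are tf-idf dictionaries, and similarity is cosine.

```python
import math
from collections import defaultdict

def add_into(acc, vec):
    for t, w in vec.items():
        acc[t] += w

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return 0.0 if nu == 0 or nv == 0 else dot / (nu * nv)

def train_rocchio(doc_vectors, labels):
    """Prototype vector of each class = average of its training vectors."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for vec, label in zip(doc_vectors, labels):
        add_into(sums[label], vec)
        counts[label] += 1
    return {c: {t: w / counts[c] for t, w in s.items()} for c, s in sums.items()}

def classify(prototypes, vec):
    """Assign the document to the class with the most similar prototype."""
    return max(prototypes, key=lambda c: cosine(prototypes[c], vec))

prototypes = train_rocchio(
    [{"wheat": 1.0, "price": 0.5}, {"corn": 1.0}, {"football": 1.0}],
    ["grain", "grain", "sport"])
print(classify(prototypes, {"wheat": 0.8}))  # grain
```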

13
Analysis of Rocchio's Algorithm
  • Advantages
  • Easy to implement
  • Very fast learner
  • Relevance feedback mechanism
  • Disadvantages
  • Low classification accuracy
  • Linear combination too simple for classification
  • Constants α and β are set empirically

14
K-Nearest-Neighbor Algorithm
  • Principle: points (documents) that are close in
    the space belong to the same class

15
K-Nearest-Neighbor Algorithm
  • Calculate similarity between the test document and
    each training document
  • Select the k nearest neighbors of the test document
    among the training examples
  • Assign the test document to the class that contains
    most of the neighbors
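A sketch of the k-NN procedure above, with cosine similarity and a simple majority vote; k = 3 and the toy vectors are arbitrary illustrative choices.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return 0.0 if nu == 0 or nv == 0 else dot / (nu * nv)

def knn_classify(train_vectors, train_labels, test_vector, k=3):
    # similarity between the test document and every training document
    sims = [(cosine(test_vector, vec), label)
            for vec, label in zip(train_vectors, train_labels)]
    # keep the k nearest neighbours and take the majority class
    neighbours = sorted(sims, reverse=True)[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [{"wheat": 1.0}, {"corn": 1.0, "wheat": 0.3},
         {"football": 1.0}, {"score": 1.0}]
labels = ["grain", "grain", "sport", "sport"]
print(knn_classify(train, labels, {"wheat": 0.7, "corn": 0.2}, k=3))  # grain
```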

16
Analysis of KNN Algorithm
  • Advantages
  • Effective
  • Non-parametric
  • More local characteristics of the document are
    considered, compared with Rocchio
  • Disadvantages
  • Classification time is long
  • Difficult to find optimal value of k

17
Decision Tree Algorithm
  • A decision tree is built over the documents
  • The root node contains all documents
  • Each internal node is a subset of documents
    separated according to one attribute
  • Each arc is labeled with a predicate that can be
    applied to the attribute at the parent node
  • Each leaf node is labeled with a class

18
Decision Tree Algorithm
  • Recursive partition procedure from root node
  • Set of documents separated into subsets according
    to an attribute
  • Use the most discriminative attribute first
  • Pruning to deal with overfitting
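A sketch of the "most discriminative attribute first" step only, scoring boolean term-presence attributes by information gain; it is not a full tree builder, and the toy data is ours.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """Gain from splitting the document set on presence/absence of `term`."""
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without_t = [l for d, l in zip(docs, labels) if term not in d]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part)
                    for part in (with_t, without_t) if part)
    return entropy(labels) - remainder

docs = [{"wheat", "price"}, {"wheat"}, {"football"}, {"score", "football"}]
labels = ["grain", "grain", "sport", "sport"]
terms = set().union(*docs)
best = max(terms, key=lambda t: information_gain(docs, labels, t))
print(best)  # 'wheat' or 'football' (both split this toy set perfectly)
```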

19
Analysis of Decision Tree Algorithm
  • Advantages
  • Easy to understand
  • Easy to generate rules
  • Reduce problem complexity
  • Disadvantages
  • Training time is relatively expensive
  • A document is only connected with one branch
  • Once a mistake is made at a higher level, the
    whole subtree below it is wrong
  • Does not handle continuous variables well
  • May suffer from overfitting

20
Naïve Bayes Algorithm
  • Estimate the probability of each class for a
    document
  • Compute the posterior probability (Bayes rule)
  • Assumption of word independence

21
Naïve Bayes Algorithm
  • P(ci) = (number of training documents in class ci) /
    (total number of training documents)
  • P(dj|ci) = Πk P(wk|ci), assuming the words wk of dj
    are independent given the class
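A minimal multinomial Naive Bayes sketch of the estimates above. The add-one (Laplace) smoothing is our addition so that unseen words do not zero out the product; it is not stated on the slide.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """docs: list of token lists. Returns class priors and per-class word counts."""
    class_docs = Counter(labels)            # for P(ci)
    word_counts = defaultdict(Counter)      # for P(wk | ci)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    vocab = {w for doc in docs for w in doc}
    return class_docs, word_counts, vocab

def classify_nb(model, doc):
    class_docs, word_counts, vocab = model
    n_docs, v = sum(class_docs.values()), len(vocab)
    best, best_logp = None, -math.inf
    for c, n_c in class_docs.items():
        total_words = sum(word_counts[c].values())
        # log P(ci) + sum_k log P(wk | ci), with add-one smoothing
        logp = math.log(n_c / n_docs)
        logp += sum(math.log((word_counts[c][w] + 1) / (total_words + v))
                    for w in doc)
        if logp > best_logp:
            best, best_logp = c, logp
    return best

model = train_nb([["wheat", "price"], ["wheat", "export"], ["football", "score"]],
                 ["grain", "grain", "sport"])
print(classify_nb(model, ["wheat", "wheat", "price"]))  # grain
```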

22
Analysis of Naïve Bayes Algorithm
  • Advantages
  • Work well on numeric and textual data
  • Easy to implement and computationally cheap
    compared with other algorithms
  • Disadvantages
  • The conditional independence assumption is violated
    by real-world data; performs very poorly when
    features are highly correlated
  • Does not consider frequency of word occurrences

23
Basic Neuron Model In A Feedforward Network
  • Inputs xi arrive through pre-synaptic connections
  • Synaptic efficacy is modeled using real weights
    wi
  • The response of the neuron is a nonlinear
    function f of its weighted inputs

24
Inputs To Neurons
  • Arise from other neurons or from outside the
    network
  • Nodes whose inputs arise outside the network are
    called input nodes and simply copy values
  • An input may excite or inhibit the response of
    the neuron to which it is applied, depending upon
    the weight of the connection

25
Weights
  • Represent synaptic efficacy and may be excitatory
    or inhibitory
  • Normally, positive weights are considered as
    excitatory while negative weights are thought of
    as inhibitory
  • Learning is the process of modifying the weights
    in order to produce a network that performs some
    function

26
Output
  • The response function is normally nonlinear
  • Examples include
  • Sigmoid
  • Piecewise linear

27
Backpropagation Preparation
  • Training Set: a collection of input-output
    patterns that are used to train the network
  • Testing Set: a collection of input-output patterns
    that are used to assess network performance
  • Learning Rate (η): a scalar parameter, analogous to
    step size in numerical integration, used to set
    the rate of adjustments

28
Network Error
  • Total-Sum-Squared-Error (TSSE)
  • Root-Mean-Squared-Error (RMSE)

29
A Pseudo-Code Algorithm
  • Randomly choose the initial weights
  • While error is too large
  • For each training pattern
  • Apply the inputs to the network
  • Calculate the output for every neuron from the
    input layer, through the hidden layer(s), to the
    output layer
  • Calculate the error at the outputs
  • Use the output error to compute error signals for
    pre-output layers
  • Use the error signals to compute weight
    adjustments
  • Apply the weight adjustments
  • Periodically evaluate the network performance

30
Apply Inputs From A Pattern
  • Apply the value of each input parameter to each
    input node
  • Input nodes compute only the identity function

31
Calculate Outputs For Each Neuron Based On The
Pattern
  • The output from neuron j for pattern p is
    Opj = f(netpj), where netpj = Σk Wjk Opk
  • k ranges over the input indices and Wjk is the
    weight on the connection from input k to neuron j
  • f is the nonlinear response function (e.g. the
    sigmoid f(x) = 1 / (1 + e^(-x)))

32
Calculate The Error Signal For Each Output Neuron
  • The output neuron error signal δpj is given by
    δpj = (Tpj - Opj) Opj (1 - Opj)
  • Tpj is the target value of output neuron j for
    pattern p
  • Opj is the actual output value of output neuron j
    for pattern p

33
Calculate The Error Signal For Each Hidden Neuron
  • The hidden neuron error signal δpj is given by
    δpj = Opj (1 - Opj) Σk δpk Wkj
  • where δpk is the error signal of a post-synaptic
    neuron k and Wkj is the weight of the connection
    from hidden neuron j to the post-synaptic neuron k

34
Calculate And Apply Weight Adjustments
  • Compute weight adjustments ΔWji by
    ΔWji = η δpj Opi
  • Apply weight adjustments according to
    Wji = Wji + ΔWji
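A compact NumPy sketch tying slides 30-34 together for a single hidden layer of sigmoid units. The network size, learning rate, number of epochs and toy XOR data are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: XOR, 2 inputs -> 1 output (illustrative choice)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def with_bias(A):
    """Append a constant-1 column so biases are ordinary weights."""
    return np.hstack([A, np.ones((A.shape[0], 1))])

eta = 0.5                          # learning rate
W1 = rng.normal(0, 1, (3, 4))      # (2 inputs + bias) -> 4 hidden units
W2 = rng.normal(0, 1, (5, 1))      # (4 hidden + bias) -> 1 output unit

for epoch in range(10000):
    # Forward pass: O_pj = f(sum_k W_jk O_pk)
    H = sigmoid(with_bias(X) @ W1)
    O = sigmoid(with_bias(H) @ W2)
    # Output error signal: delta_pj = (T_pj - O_pj) O_pj (1 - O_pj)
    delta_out = (T - O) * O * (1 - O)
    # Hidden error signal: delta_pj = O_pj (1 - O_pj) sum_k delta_pk W_kj
    delta_hidden = H * (1 - H) * (delta_out @ W2[:4].T)
    # Weight adjustments: dW_ji = eta * delta_pj * O_pi, applied additively
    W2 += eta * with_bias(H).T @ delta_out
    W1 += eta * with_bias(X).T @ delta_hidden

print(np.round(O, 2))   # should be close to [[0], [1], [1], [0]]
```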

35
Analysis of ANN Algorithm
  • Advantages
  • Produce good results in complex domains
  • Suitable for both discrete and continuous data
    (especially better for the continuous domain)
  • Testing is very fast
  • Disadvantages
  • Training is relatively slow
  • Learned results are more difficult for users to
    interpret than learned rules (compared with DT)
  • Empirical Risk Minimization (ERM) makes ANN try
    to minimize training error, which may lead to
    overfitting

36
Support Vector Machines
  • Main idea of SVMs
  • Find the linear separating hyperplane which
    maximizes the margin, i.e., the optimal separating
    hyperplane (OSH)
  • Non-linearly separable case
  • Kernel function and Hilbert space

37
SVM classification
Maximizing the margin is equivalent to
  minimizing (1/2)||w||² subject to yi (w · xi + b) ≥ 1 for all i
Introducing Lagrange multipliers αi ≥ 0, the Lagrangian is
  L(w, b, α) = (1/2)||w||² - Σi αi [ yi (w · xi + b) - 1 ]
Dual problem
  maximize W(α) = Σi αi - (1/2) Σi Σj αi αj yi yj (xi · xj)
subject to
  αi ≥ 0 and Σi αi yi = 0
The solution is given by
  w = Σi αi yi xi
The problem of classifying a new data point x is
now simply solved by looking at the sign of
  f(x) = w · x + b = Σi αi yi (xi · x) + b
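A brief sketch of this decision rule using scikit-learn (an assumed dependency, not mentioned in the slides): SVC with a linear kernel solves the dual problem above, and the sign of decision_function gives the predicted class.

```python
import numpy as np
from sklearn.svm import SVC   # assumes scikit-learn is installed

# Toy 2-D, linearly separable data standing in for document vectors
X = np.array([[2.0, 2.0], [1.5, 2.5], [2.5, 1.8],          # class +1
              [-1.0, -1.5], [-2.0, -0.5], [-1.5, -2.0]])   # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1.0)   # use kernel="rbf" for non-separable cases
clf.fit(X, y)

x_new = np.array([[1.0, 1.0]])
score = clf.decision_function(x_new)    # w . x + b
print(score, np.sign(score))            # positive -> class +1
print(clf.predict(x_new))               # [1]
```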
38
Analysis of SVM Algorithm
  • Advantages
  • Compared with ANN, SVM captures the inherent
    characteristics of the data better
  • Embeds the Structural Risk Minimization (SRM)
    principle, which minimizes an upper bound on the
    generalization error (better than the Empirical
    Risk Minimization principle)
  • Ability to learn can be independent of the
    dimensionality of the feature space
  • Finds a global minimum (vs. local minima for ANN)
  • Disadvantages
  • Parameter tuning
  • Kernel selection

39
Voting Algorithm
  • Principle: use multiple evidence (multiple poor
    classifiers → a single good classifier)
  • Generate some base classifiers
  • Combine them to make the final decision

40
Bagging Algorithm
  • Use multiple versions of the training set D of size
    N, each created by resampling N examples from D
    with replacement (bootstrap)
  • Each of these data sets is used to train a base
    classifier; the final classification decision is
    made by majority voting of these classifiers
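A bagging sketch: bootstrap resampling plus majority voting, with scikit-learn's DecisionTreeClassifier as an assumed base learner (scikit-learn also ships a ready-made BaggingClassifier).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier   # assumed base classifier

def bagging_fit(X, y, n_classifiers=10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    classifiers = []
    for _ in range(n_classifiers):
        # resample N examples from D with replacement (bootstrap)
        idx = rng.integers(0, n, size=n)
        classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return classifiers

def bagging_predict(classifiers, X):
    # majority vote over the base classifiers
    votes = np.array([clf.predict(X) for clf in classifiers])
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Toy data: label 1 iff the two features agree in sign
X = np.array([[1, 1], [2, 3], [-1, -2], [-3, -1],
              [1, -1], [-2, 1], [2, -3], [-1, 3]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
ensemble = bagging_fit(X, y)
print(bagging_predict(ensemble, np.array([[3, 2], [2, -2]])))  # expected [1 0]
```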

41
Adaboost
  • Main idea
  • The main idea of this algorithm is to maintain a
    distribution or set of weights over the training
    set. Initially, all weights are set equally, but
    in each iteration the weights of incorrectly
    classified examples are increased so that the
    base classifier is forced to focus on the hard
    examples in the training set. For those correctly
    classified examples, their weights are decreased
    so that they are less important in the next
    iteration.
  • Why ensembles can improve performance
  • Uncorrelated errors made by the individual
    classifiers can be removed by voting.
  • Our hypothesis space H may not contain the true
    function f. Instead, H may include several
    equally good approximations to f. By taking
    weighted combinations of these approximations, we
    may be able to represent classifiers that lie
    outside of H.

42
Adaboost algorithm
Given m examples (x1, y1), ..., (xm, ym) where xi ∈ X, yi ∈ {-1, +1}
Initialize D1(i) = 1/m for all i = 1..m
  • For t = 1, ..., T
  • Train base classifier using distribution Dt
  • Get a hypothesis ht : X → {-1, +1}
    with error εt = Σ(i: ht(xi) ≠ yi) Dt(i)
  • Choose αt = (1/2) ln((1 - εt) / εt)
  • Update Dt+1(i) = Dt(i) exp(-αt yi ht(xi)) / Zt
    where Zt is a normalization factor (chosen so that
    Dt+1 will be a distribution)
Output the final hypothesis H(x) = sign(Σt αt ht(x))
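A sketch of the algorithm above using one-dimensional decision stumps as base classifiers; the stump learner and the toy data are our own illustrative choices.

```python
import numpy as np

def train_stump(X, y, D):
    """Best threshold classifier on a single feature, weighted by distribution D."""
    best = None
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for sign in (1, -1):
                pred = np.where(X[:, f] >= thr, sign, -sign)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    err, f, thr, sign = best
    return err, (lambda x, f=f, thr=thr, sign=sign:
                 np.where(x[:, f] >= thr, sign, -sign))

def adaboost(X, y, T=10):
    m = len(y)
    D = np.full(m, 1.0 / m)                 # initialize D1(i) = 1/m
    hypotheses = []
    for _ in range(T):
        err, h = train_stump(X, y, D)       # train base classifier using Dt
        err = max(err, 1e-10)               # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)
        D = D * np.exp(-alpha * y * h(X))   # raise weight of misclassified examples
        D /= D.sum()                        # Zt: renormalize to a distribution
        hypotheses.append((alpha, h))
    # final hypothesis: H(x) = sign(sum_t alpha_t h_t(x))
    return lambda x: np.sign(sum(a * h(x) for a, h in hypotheses))

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1, 1])          # not separable by a single stump
H = adaboost(X, y, T=20)
print(H(X))                                  # expected: [ 1.  1. -1. -1.  1.  1.]
```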
43
Analysis of Voting Algorithms
  • Advantages
  • Surprisingly effective
  • Robust to noise
  • Decreases the overfitting effect
  • Disadvantage
  • Requires more computation and memory

44
Performance Measure
  • Performance of an algorithm
  • Training time
  • Testing time
  • Classification accuracy
  • Precision, Recall
  • Micro-average / Macro-average
  • Precision-recall breakeven point
  • Goal: high classification quality and
    computational efficiency
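A short sketch contrasting micro- and macro-averaged precision and recall; the per-class counts are made-up toy numbers.

```python
def precision_recall(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

# Hypothetical per-class counts: (true positives, false positives, false negatives)
counts = {"grain": (50, 10, 5), "sport": (30, 5, 10), "art": (2, 1, 20)}

# Macro-average: compute precision/recall per class, then average
# (every class weighs equally, so rare classes matter a lot)
per_class = [precision_recall(*c) for c in counts.values()]
macro_p = sum(p for p, _ in per_class) / len(per_class)
macro_r = sum(r for _, r in per_class) / len(per_class)

# Micro-average: pool the counts over all classes, then compute once
# (dominated by the frequent classes)
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_p, micro_r = precision_recall(tp, fp, fn)

print(f"macro P={macro_p:.3f} R={macro_r:.3f}")
print(f"micro P={micro_p:.3f} R={micro_r:.3f}")
```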

45
Comparison Based on Six Classifiers
  • Classification accuracy of six classifiers
    (Reuters-21578 collection)

                Dumais      Joachims    Weiss       Yang
    Training    9603        9603        9603        7789
    Test        3299        3299        3299        3309
    Topics      118         90          95          93
    Indexing    Boolean     tfc         Frequency   ltc
    Selection   MI          IG          -           χ²
    Measure     Breakeven   Micro-avg.  Breakeven   Breakeven
    Rocchio     61.7        79.9        78.7        75
    NB          75.2        72          73.4        71
    KNN         N/A         82.3        86.3        85
    DT          N/A         79.4        78.9        79
    SVM         87          86          86.3        N/A
    Voting      N/A         N/A         87.8        N/A
46
Analysis of Results
  • SVM, Voting and KNN showed good performance
  • DT, NB and Rocchio showed relatively poor
    performance

47
Comparison Based on Feature Selection
  • Classification accuracy: NB vs. KNN vs. SVM
    (Reuters collection)

    # of features   NB             KNN            SVM
    10              48.66 ± 0.10   57.31 ± 0.2    60.78 ± 0.17
    20              52.28 ± 0.15   62.57 ± 0.16   73.67 ± 0.11
    40              59.19 ± 0.15   68.39 ± 0.13   77.07 ± 0.14
    50              60.32 ± 0.14   74.22 ± 0.11   79.02 ± 0.13
    75              66.18 ± 0.19   76.41 ± 0.11   83.0  ± 0.10
    100             77.9  ± 0.19   80.2  ± 0.09   84.3  ± 0.12
    200             78.26 ± 0.15   82.5  ± 0.09   86.94 ± 0.11
    500             80.80 ± 0.12   82.19 ± 0.08   86.59 ± 0.10
    1000            80.88 ± 0.11   82.91 ± 0.07   86.31 ± 0.08
    5000            79.26 ± 0.07   82.97 ± 0.06   86.57 ± 0.04
48
Analysis of Results
  • Accuracy improves as the number of features
    increases, up to a certain level
  • At approximately 500-1000 features, accuracy
    reaches its peak and begins to decline
  • SVM obtains the best performance

49
Comparison Based on Training Time (1)
  • Training time: SVM vs. NB (# of features = 100)

    # of documents   Training Time for SVM   Training Time for NB
    9603             5                       8
    19206            15                      25
    28809            27                      60
    38412            32                      120
    48015            40                      340
    57618            50                      410
    67221            65                      498
    76824            78                      600
    86427            100                     630
50
Comparison Based on Training Time (2)
  • Training time: SVM vs. NB (# of features
    increasing)

    # of features   Training Time for SVM   Training Time for NB
    20              2.2                     3
    50              3                       5
    100             3.1                     11
    200             3.3                     22
    300             3.5                     27
    500             4.1                     35
51
Analysis of Results
  • Table 1:
  • Training time of SVM is less than that of NB as
    the number of documents grows
  • Table 2:
  • Training time of SVM increases slowly with the
    number of features
  • Training time of NB increases more quickly

52
Conclusion
  • Different algorithms perform differently
    depending on data collections
  • Some algorithms (e.g. Rocchio) do not perform
    well
  • None of them appears to be globally superior to
    the others; however, SVM and Voting are good
    choices when all factors are considered