Title: Refinement Method of Post-processing and Training for Improvement of Automated Classification

1. Topic: Refinement Method of Post-processing and Training for Improvement of Automated Classification
- Yun Jeong Choi
- Dept. of Computer Science & Engineering
- Ewha Womans University
2. Contents
- Background
  - Text Classification Framework
  - Classification Problem
- Related Work
- Motivation
  - Problem Area
  - Approach Strategy
- Proposed Method: Refinement Text Classification System
  - Part 1: Reinforcement Training Method
  - Part 2: Post-Processing Method for Assigning and Feedback Analysis
- Experimental Results
- Discussion & Next Plan
3. Background: Text Classification
- Objective: assign a new document to one of the predefined categories using the generated classifier or rule
- Given:
  - a document dj ∈ D, where D is the domain of documents
  - a description of an instance x ∈ X, where X is the instance space (feature vectors); e.g., how to represent text documents
  - a fixed set of categories C = {c1, c2, ..., cn}
- Determine:
  - the category of x: c(x) ∈ C, where c(x) is a categorization function (classifier) whose domain is X and whose range is C
- We want to:
  - build a classifier that assigns these values
  - i.e., assign a Boolean value to each pair <dj, ci>
[Pipeline diagram: New Documents -> Preprocessing -> Indexing -> Classification Algorithm (Classifier) -> Assign]
4. Background: Data Classification vs. Text Classification
[Diagram: a data classification example (supervised prediction for target marketing, with input variables, roles, a target, and cases such as annual income) contrasted with text classification]
5. Background: Representing Documents
- Usually, an example is represented as a series of feature-value pairs. The features can be arbitrarily abstract (as long as they are easily computable) or very simple.
- For example, the features could be the set of all words, and the values (weighting scores) their numbers of occurrences in a particular document.
- Learning is the process of modifying the weights.
[Representation as a bag of words: Computer:1, Machine:1, AI:1, Information:2, Korea:1, U.S.:1, ...]
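The bag-of-words representation above can be sketched in a few lines of Python; the sample terms echo the slide, and the whitespace tokenizer is a deliberately naive assumption:

```python
from collections import Counter

def bag_of_words(text):
    """Represent a document as {term: number of occurrences} pairs,
    using a deliberately naive whitespace tokenizer."""
    return Counter(text.lower().split())

features = bag_of_words("Computer Machine AI Information Information Korea U.S.")
```

Learning then amounts to adjusting the weights attached to these term features.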
6. Background: Text Classification Framework
[Diagram. Training: Training Documents -> Preprocessing -> Indexing -> Feature Selection -> Training. Classifying: New Documents -> Preprocessing -> Indexing -> Classification Algorithm (Classifier) -> Assign]
7. Related Work (1/3)
- Approach to improving performance based on classification algorithms
- In a single classification model:
  - weighting scheme in feature extraction and selection
  - indexing method
  - probabilistic computation
  - Problems: the vector-space model (considering only term frequency) cannot consider semantics; the assign method is simple
- In several classification models:
  - competition / voting: select the better one among weak classifiers
  - ensemble framework: combination of several classifiers
  - cooperative models and classifiers
- Regression based on Least Squares Fit (1991)
- Nearest Neighbor Classification (1992)
- Bayesian Probabilistic Models (1992)
- Symbolic Rule Induction (1994)
- Neural Networks (1995)
- Rocchio approach (traditional IR, 1996)
- Support Vector Machines (1997)
- Maximum Entropy (1999)
- Hidden Markov Models (1999)
- Error-Correcting Output Coding (1999) ...
8. Related Work (2/3)
- Approach to improving performance based on learning algorithms
- Construction methods for the sample data set and training set
  - selection problem for sample documents
- Ensemble structure
  - bootstrapping and bagging algorithms
  - AdaBoost
- Active learning
- Lijuan Cai, Thomas Hofmann, ACM SIGIR 2003: Text Categorization by Boosting Automatically Extracted Concepts
- Ioannis Tsochantaridis, ICML 2004: Support Vector Machine Learning for Interdependent and Structured Output Spaces
Combine several weak classifiers -> make a strong classifier (choose the stronger classifier)
9. Related Work (3/3)
- Approach to improving performance with assigning methods for complex documents
- Assigning methods: the Wimbledon (championship tournament) method
  - Yunqing Xia, Wei Liu, NLDB 2005: Email Categorization with Tournament Methods
- Modified algorithms for specific domain problems
  - Haoran Wu, Tong Heng Phang, SIGKDD 2002: A Refinement Approach to Handling Model Misfit in Text Categorization
  - anti-spam classifiers
- Optimization and combination methods
  - trial and error -> feedback
  - evaluation and feedback in a data mining strategy
10. Motivation: Some Difficulties in Classification
- Human experts make different decisions about the same situation.
  - At a hospital: error risk (the businessman's risk) when similar symptoms suggest different diseases
  - Vehicle classification: a subjective matter; a meaningless argument, as many vehicles fall between classes or even outside all of them
[Diagram: data/documents range from very easy cases (for machine and human) to hard ones, e.g. disease diagnostics, movies, vehicles]
11. Problem Definition (A/2)
- Problem A: Characteristics of Documents
  - Most documents have high complexity and uncertainty in their contents, with multiple concepts and features
  - complex reports, spam letters, new materials, biological literature, etc.
  - Training errors may happen even when we have an efficient learning method and a human expert
  - high cost in training time -> inaccuracy
- Problem B: Classification Method / Performance Measure
  - Not important: what kind of classifier was used?
  - Classifiers and algorithms have a limited capacity for analysis when dealing with documents as a bag of words
  - The accuracy is high only when the data fits the model well (a specific model for a specific domain area)
  - Assignment is done by a simple ranking method
  - Not important: how many documents were used for training?
  - Important: how informative were the documents that were selected or constructed?
12. Approach (1/A): Border Problem in Uncertain Data
- To find the optimal boundary
  - a classifier makes rules for classification: a line or an area
- Goal: more accuracy in an automatic process
- But data size and complexity grow ever faster, which makes finding the optimal line more difficult!
- In this (red) area, complex data which have multiple concepts, or no meaning, are located
  - finding the decision rule or line
13. Strategy (1/A): Hierarchical Structure in Class Definition
- Define the target category scheme using a hierarchical scheme
  - final target categories: C1, C2
  - intermediate category: X
  - subcategories of the target categories: C11, C12, C21, C22
[Diagram: a document routed through the category hierarchy]
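The hierarchical scheme could be encoded as a simple mapping; the category names follow the slide, while the helper `to_target` and the `U` (undecided) fallback are illustrative assumptions:

```python
# Category names follow the slide; the helper name to_target and the
# "U" fallback are illustrative assumptions.
HIERARCHY = {
    "C11": "C1", "C12": "C1",   # subcategories of final target C1
    "C21": "C2", "C22": "C2",   # subcategories of final target C2
    "X1": "X", "X2": "X",       # intermediate (noise-absorbing) categories
}

def to_target(subcategory):
    """Map a predicted subcategory up to its final target category."""
    return HIERARCHY.get(subcategory, "U")  # U = undecided
```

A classifier trained on the fine-grained subcategories can then report at the coarser target level.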
14. Training Algorithms
- Not just how many documents, but how informative the selected or constructed documents were!
- Training algorithms deal with:
  - selection
  - combination
  - the number of training documents
[Diagram: simple random selection vs. active sample selection (containing uncertain samples)]
15. Active Sample Selection vs. Random Selection
16. Problem Definition (B/2)
- Problem A: Characteristics of Documents
  - Most documents have high complexity and uncertainty in their contents, with multiple concepts and features
  - complex reports, spam letters, new materials, biological literature, etc.
  - Training errors may happen even when we have an efficient learning method and a human expert
  - high cost in training time -> inaccuracy
- Problem B: Classification Method
  - Not important: what kind of classifier was used
  - Classifiers and algorithms have a limited capacity for analysis when dealing with documents as a bag of words
  - The accuracy is high only when the data fits the model well (a specific model for a specific domain area)
  - The assign method, by a ranking system, is too simple
17. Approach (1/B): Assign Process
- Assign process in text classification
  - Is the 1st candidate category always right?
  - Analyze the ranking order and scores:

Doc            | Rank 1 | Rank 2 | Rank 3 | Rank 4 | Rank 5 | Target (C1, C2, U, X)
1 (category)   | C21    | C11    | X      | C22    | C12    | C2 -> U
1 (score)      | 29     | 28     | 17     | 15     | 10     |
2 (category)   | C22    | C11    | X      | C21    | C12    | C2
2 (score)      | 97     | 1      | 1      | 1      | 0      |
18. Strategy (B): Post-Processing for Assignment
- Goals in each step
  - Step 1: classify the 1st candidate category of a document
    - minimum threshold: the gap between 1st and 2nd by rank score (probability score)
  - Step 2: classify the cases left undecided in Step 1
    - compute a distance between X (the intermediate category) and the other categories
  - Step 3: gather common / uncommon cases from Step 1 and Step 2
    - the ranking order from text classification represents the document: <category, score> vs. <term, frequency>
  - Step 4: analyze the experience from Steps 1-3 -> feedback to the training process
    - modify the classification rule
    - add / eliminate training documents
- Classify reliable cases vs. unreliable ones
19. Strategy (1/B): Post-Processing for Assignment
- Goal: overcome the problems and limitations of the traditional method with a mining approach focused on risk-minimization analysis.

Assign a category to documents using the initial scores:
- Input: document Di; candidate category list Li, normalized and sorted in descending order
- Step 1: for i = 0 to N (number of input documents)
    if (Di.size >= min_support) and ((Li[1].score >= min_value) or (Li[1].score - Li[2].score >= diff_value)) then
      assign Di to the 1st candidate category
    else
      assign Di to U (undecided)
- Step 2: for n = 0 to N (number of documents left unassigned in Step 1)
    for each target category cn:
      calculate the distance between the pivot category X and cn
    assign Di to the closer side cn
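The Step 1 rule above can be sketched as runnable Python. The threshold values and the comparison directions are assumptions (the slide's operators were lost in extraction); the candidate list is taken as `(category, score)` pairs sorted in descending order:

```python
def assign_step1(doc_size, candidates,
                 min_support=10, min_value=0.5, diff_value=0.2):
    """Step 1: accept the top-ranked candidate only when the document is
    big enough AND its top score is either high on its own or clearly
    ahead of the runner-up; otherwise mark the document U (undecided).
    Threshold values and comparison directions are assumptions; the
    slide's operators were lost in extraction."""
    (c1, s1), (_, s2) = candidates[0], candidates[1]
    if doc_size >= min_support and (s1 >= min_value or s1 - s2 >= diff_value):
        return c1
    return "U"
```

On the example rows of the next slide, a document with size 726.33 and top score .98 would be accepted directly, while one with top score .39 against a runner-up of .20 would fall to U and go on to Step 2.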
20. Strategy (2/B): Post-Processing Recomputation
- Make another training data set from the candidate lists of the documents
- Perform data classification with these uncommon patterns

Di | Rank 1 (wm 0.02) | Rank 2 (wm 0.15) | Rank 3 (wm 0.25) | Rank 4 (wm 0.31) | Rank 5 (wm 0.35) | Di.size | Step1 | Step2 | Step3 | Actual class
1 | C21 .98 | C11 .01 | X1 .01 | X2 .01 | C12 .00 | 726.33 | C21 -> C2 | - | C2 | C2
2 | C21 .39 | C12 .20 | X2 .17 | C11 .13 | X1 .10 | 31.6 | U | C2 | C2 | C2
3 | X2 .29 | C11 .28 | C12 .17 | C21 .15 | X1 .01 | 514.42 | U | C1 | C1 | C1
4 | X1 .28 | C21 .23 | X2 .17 | C11 .16 | C12 .15 | 287.12 | U | C2 | C2 | C2
21. Multi-Training Strategy
- where Cn.process:
  - n = feedback time (0, 1, 2, ...)
  - process = step1, step2, step3
  - Cn.process = 1 when actual class = predicted class
  - Cn.process = 0 when actual class ≠ predicted class
- Training data
  - document set used in text classification (Cn.step1, Cn.step2)
    - examples: <document, class>
  - common/uncommon cases for data classification in step 1 and step 2 (Cn.step3)
    - examples: <order of candidate classes in a document, actual class>
  - the results of the above classification of each document, using reinforcement learning
    - examples: <<state, action>, reward>
- Evaluation function for estimating accuracy
  - fault detection and error correction

Results from each step (feedback time 1):
Location | Input Data | Cn.step1 | Cn.step2 | Cn.step3 | Cn+1.step1 | Cn+1.step2 | Cn+1.step3
d1  | X | 1 | 1 | 1 | - | - | Good
d2  | X | 0 | 1 | 1 | - | - | Good
d3  | X | 0 | 0 | 1 | - | - | Poor
d4  | 1 | 1 | 1 | 1 | - | - | Fair
d5  | 1 | 0 | 1 | 1 | - | - | Fair
d6  | 1 | 1 | 0 | 1 | - | - | Poor
d7  | 1 | 0 | 0 | 1 | - | - | Poor
d8  | 0 | 1 | 1 | 1 | - | - | Good
d9  | 0 | 0 | 1 | 1 | - | - | Good
d10 | 0 | 1 | 0 | 1 | - | - | Poor
22. Approach (2/B): Error Correction and Feedback
- Reinforcement learning
  - learning from interaction with the environment
  - delayed reward
  - trial and error
- Learning by error correction
  - prediction -> prediction -> prediction -> conservation: experience
- Transition model: how actions influence states
- Reward R: immediate value of a state-action transition
- Policy π: maps states to actions
[Diagram: agent with policy interacting with the environment; a numbered grid of example state-action values]
23. Strategy (3/B): Post-Processing Evaluation & Feedback
- Evaluation matrix for relevance feedback
- Goal: estimate accuracy, efficiency, over-fitting
- Action: add/eliminate documents -> modify the class definition

Results from each step (feedback time 1):
Location | Input Data | Cn.step1 | Cn.step2 | Cn.step3 | Cn+1.step1 | Cn+1.step2 | Cn+1.step3
d1  | X | 1 | 1 | 1 | - | - | Good
d2  | X | 0 | 1 | 1 | - | - | Good
d3  | X | 0 | 0 | 1 | - | - | Poor
d4  | 1 | 1 | 1 | 1 | - | - | Fair
d5  | 1 | 0 | 1 | 1 | - | - | Fair
d6  | 1 | 1 | 0 | 1 | - | - | Poor
d7  | 1 | 0 | 0 | 1 | - | - | Poor
d8  | 0 | 1 | 1 | 1 | - | - | Good
d9  | 0 | 0 | 1 | 1 | - | - | Good
d10 | 0 | 1 | 0 | 1 | - | - | Poor

Good(di) = 1 if ((Cn.step1 = X) and (Cn+1.step1 = True)) or (Cn+1.step2 = True); 0 otherwise
Fair(di) = 1 if (Cn+1.step1 = True); 0 otherwise
Poor(di) = 1 if (Cn+1.step1 = False) or (Cn+1.step2 ≠ Cn+2); 0 otherwise
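The Good / Fair / Poor labeling could be sketched as below. The exact predicates are a simplified reading of the slide's partially garbled conditions, so treat them as assumptions rather than the authors' precise definitions:

```python
def evaluate(cn_step1, cn1_step1, cn1_step2):
    """Label a document Good / Fair / Poor from its step outcomes across
    one feedback round (True = the step predicted the actual class,
    "X" = the document sat in the intermediate category).  The exact
    predicates are a simplified reading of the slide's conditions."""
    if (cn_step1 == "X" and cn1_step1) or cn1_step2:
        return "Good"   # recovered by the intermediate category or by step 2
    if cn1_step1:
        return "Fair"   # step 1 already correct, nothing gained by feedback
    return "Poor"       # still wrong after feedback
```

The function mirrors the table above: documents rescued by the X category or Step 2 count as Good, unchanged correct ones as Fair, the rest as Poor.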
24. Summary: Concept of the Proposed Method
- We don't care about, and don't depend on:
  - the type of classifier and its performance
    - e.g., the functional definition of the similarity measure: COS, Euclidean, kernel functions, ...
  - how many extracted features we consider
  - These didn't solve the complexity-and-uncertainty problem
- We deal with:
  - definition of classes based on hierarchical classification
  - construction of the training set for active learning
  - post-processing analysis for improving the traditional assign method (a simple ranking system)
  - feedback to the training process and schema by reinforcement training
25. Proposed Method: Concepts
[Diagram: New Documents -> Preprocessing -> Indexing -> Classification Algorithm (Classifier) -> Assign; performance measure]
- How about classification power? Add or eliminate documents in a class.
- Assign problem
  - post-processing analysis for assignment
  - improving the simple ranking system by order of score
  - feedback to the training process by reinforcement training (error correction)
- Multi-training strategy
  - re-classification using patterns of the text-classification results
- Preprocessing
  - define the class scheme: target category vs. intermediate category vs. subcategory
  - hierarchical construction of classes
26. Contribution
- More accurate determination in a segment
  - an important basis for uncertain data (similar concepts and features)
  - reduces error risk
- Proposed method for automated text classification
  - main objective: provide high performance and reliability in the system while reducing human intervention
- Can be used for many applications
  - other classification problems, IR, and bioinformatics, as well as data classification / text classification
- Improves class quality
  - class quality: accuracy
  - stability against training errors
- Reduces learning cost
  - over-fitting problem: stop learning, disuse learning
27. Experimental Results using the Proposed Method: Newsgroup Data, Biological Data
28. Implementation of the Refinement Text/Data Mining System
- Part 1: Hierarchical Construction in the Training Method
- Part 2: Post-Processing by Reinforcement Learning
29. Experimental Method on the Newsgroup Data
- Characteristics of the 20 Newsgroups dataset
  - 5 groups: Computer, Science, Recreation, Talk and Misc.
  - There are partially multi-labeled documents, and each class has more than two subclasses.
- We use a subset of the Computer group
  - C = {comp.graphics, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x}
  - 200 training documents for each target category, 50 documents for the intermediate category, adding noise of about 10% of the total
- We use
  - Accuracy = True Cases / N
  - The test set does not contain training data
30. Experimental Conditions and Evaluation
- Conditions
  - same classifiers: NB, SVM
  - with and without post-processing
  - with and without noisy data in the training samples
- Defined categories and experimental conditions:

ID | Target category C | # docs for intermediate category X | # subcategories | Correct docs | Incorrect docs (10%) | Base classifier | Post-processing / noise
E1 | c1,c2,c3,c4 | 0 | 0 | 800 | 0 | SVM | without post-processing, without noise
E2 | c1,c2,c3,c4 | 0 | 0 | 800 | 80 | SVM | without post-processing, with noise
E3 | c1,c2,c3,c4 | 0 | 0 | 800 | 0 | NB | without post-processing, without noise
E4 | c1,c2,c3,c4 | 0 | 0 | 800 | 80 | NB | without post-processing, with noise
E5 | c1,c2,c3,c4 | 150 | 2 | 800 | 0 | SVM | with post-processing, without noise
E6 | c1,c2,c3,c4 | 150 | 2 | 800 | 80 | SVM | with post-processing, with noise
E7 | c1,c2,c3,c4 | 150 | 2 | 800 | 0 | NB | with post-processing, without noise
E8 | c1,c2,c3,c4 | 150 | 2 | 800 | 80 | NB | with post-processing, with noise
31. Experimental Results on Newsgroups
Yun Jeong Choi, Seung Soo Park, "Refinement Method of Post-processing and Training for Improvement of Automated Text Classification", ICCSA 2006, LNCS 3981, pp. 298-309, 2006
- Comparison of predictive power
  - existing method (E1, E2, E3, E4) vs. proposed method (E5, E6, E7, E8)
  - correct documents vs. 10% incorrect documents
[Accuracy charts]
32. Biological Texts: Background and Results
33. Problem Definition
- Ambiguous entities -> multiplicity
  - genes, enzymes, and their transcripts often share the same name
  - the task of annotation -> identifying and classifying the terms
  - in documents, we have to treat them as uncertain documents
  - there are too many topics and keywords
- Classification problem
  - automated classification is to classify free texts into predefined categories
  - goal: finding the optimal decision line or surface while reducing manual processing
    - low training cost
    - high accuracy
34. Examples of Ambiguous Entities
- p130 mediates TGF-beta-induced cell-cycle arrest in Rb mutant HT-3 cells.
- The INK4alpha/ARF locus encodes p14(ARF) and p16(INK4alpha), which function to arrest the cell cycle through the p53 and RB pathways, respectively.
- Many tumor types are associated with genetic changes in the retinoblastoma pathway, leading to hyperactivation of cyclin-dependent kinases and incorrect progression through the cell cycle.
- The Y79 and WERI-Rb1 retinoblastoma cells, as well as MCF7 breast cancer epithelial cells, all of which express T-channel current and mRNA for T-channel subunits, are inhibited by pimozide and mibefradil (with IC(50) = 8 and 5 microM for pimozide and mibefradil, respectively).
35. Experiments
- Data: RB-related documents
  - collected about 20,000 abstracts from PubMed
  - selected 100 documents for verifying our system
  - these documents have many ambiguous features and high classification error
- Defined categories and experimental conditions
  - most documents are connected with protein (P), gene (G) and cancer (C)
  - then do one-against-one classification for P, G, D

Target category (C) | Candidate category (S) | Intermediate category (X) | Correct docs | Incorrect docs (10%) | Total (300 + 318)
Protein | P1 | | 30 | 5 | 60 (36)
Protein | P2 | | 30 | 1 | 60 (36)
| | X1 | 60 | 0 | 60
Gene | G1 | | 30 | 3 | 60 (36)
Gene | G2 | | 30 | 3 | 60 (36)
| | X2 | 60 | 0 | 60
Disease, Cancer | D1 | | 30 | 6 | 60 (36)
Disease, Cancer | D2 | | 30 | 0 | 60 (36)
36. Experimental Results
Yun Jeong Choi, Seung Soo Park, "Efficient Classification Method for Complex Biological Literature using Text and Data Mining Combination", IDEAL 2006, LNCS 4224, pp. 688-696, 2006

Result table: comparison between the existing method and the RTPost method, with only correct documents

Method | Accuracy | Protein predictive power | Gene predictive power | Disease predictive power | Misclassification rate
Naïve Bayesian | 0.69 | 51% | 82% | 74% | 31%
SVM | 0.74 | 64% | 83% | 76% | 29%
RTPost (with Naïve Bayesian) | 0.89 | 81% | 94% | 92% | 11%
RTPost (with SVM) | 0.91 | 88% | 91% | 94% | 8%

Result table: comparison between the existing method and the RTPost method, with training containing incorrect documents

Method | Accuracy | Protein predictive power | Gene predictive power | Disease predictive power | Misclassification rate
Naïve Bayesian | 0.45 | 52% | 65% | 17% | 55%
SVM | 0.47 | 54% | 61% | 26% | 64%
RTPost (with Naïve Bayesian) | 0.85 | 84% | 92% | 75% | 15%
RTPost (with SVM) | 0.87 | 87% | 91% | 81% | 11%
37. Summary
- Many effective techniques have been proposed to classify these documents, but they are not adequate for classifying complex documents.
- Proposed method
  - a refinement classification system
  - it can be easily adapted to deal with other data or other data mining algorithms
  - developed in a component-based style using the BOW toolkit and the C language
- In the experiments, the proposed method was very successful!
  - we compared the accuracy and the stability under actual conditions
  - it does not depend on the classification algorithms and techniques
- In the future, we'll simplify the effectiveness function without raising the running costs of the entire process.
38. Publications
- International journals & conferences
  - Yun Jeong Choi, Seung Soo Park, "Efficient Classification Method for Complex Biological Literature using Text and Data Mining Combination", IDEAL 2006, LNCS 4224, pp. 688-696, 2006.
  - Yun Jeong Choi, Seung Soo Park, "Refinement Method of Post-processing and Training for Improvement of Automated Text Classification", ICCSA 2006, LNCS 3981, pp. 298-309, 2006.
  - Yun Jeong Choi, Seung Soo Park, "Automated Classification of PubMed texts for Disambiguated Annotation using Text and Data Mining", in Proceedings of the 2005 International Joint Conference of InCoB, AASBi and KSBI, BIOINFO 2005, pp. 101-106, 2005.
  - Yun Jeong Choi, Seung Soo Park, Min Kyung Kim, Hyun Seok Park, "Sense Disambiguation in PubMed Abstracts using Text and Data Mining", International Conference on Genome Informatics, GIW 2002, Vol. 13, pp. 578-579, 2002.
- Domestic journals
  - [Korean title], Vol. 12, No. 7, pp. 811-822, 2005.
  - [Korean title], Vol. 23, No. 3, pp. 33-46, 2002.
- Domestic conferences
  - [Korean title on anomaly detection and misuse detection], KCC, 2006.
  - [Korean title], Vol. 28, No. 01, pp. 247-249, 2001.04.
39. Discussion and Next Plan
- Application areas
  - Text classification is very closely connected to information retrieval systems and filtering systems
  - rich tagging systems
  - text mining, data mining
- Dynamic classification for the ever-changing needs of users
  - reflection of various viewpoints (angles) vs. a tagging system
- Traditional definition
  - <document di, class ci>: a disjoint mapping
  - a limitation of disjoint classification: must a document that has multiple concepts be assigned only one class?
  - exactly one result vs. liberal, enriched results
- Support for rich tagging systems
  - ranking information: <category, probability score>
- Efficient evaluation function
  - definition of Good / Fair / Poor cases
41. References
- Basic
  - Mitchell, T. Machine Learning. McGraw-Hill, 1997.
  - Hearst, M.A. Trends and controversies: support vector machines. IEEE Intelligent Systems, July/August 1998, pages 18-28.
  - Yang, Y., Slattery, S., Ghani, R. A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, 18(2/3):219-241, 2002.
  - Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47, 2002. http://www.math.unipd.it/fabseb60/Publications/ACMCS02.pdf
- Classification algorithms, feature extraction
  - Yang, Y. and Liu, X. A re-examination of text categorization methods. In Proceedings of ACM SIGIR, 1999.
  - Dumais, S. and Chen, H. Hierarchical classification of web content. In Proceedings of the 23rd ACM SIGIR Conference, pages 256-263, 2000.
  - Yiming Yang, Shinjae Yoo, Jian Zhang and Bryan Kisiel. Robustness of adaptive filtering methods in a cross-benchmark evaluation. In the 28th Annual International ACM SIGIR Conference (SIGIR 2005), Brazil, 2005.
  - Lewis, D. D. An evaluation of phrasal and clustered representations on a text categorization task. In Proceedings of the 15th ACM SIGIR Conference, pages 37-50, 1992.
  - Yiming Yang, Thomas Ault, Thomas Pierce. Combining multiple learning strategies for effective cross validation. The Seventeenth International Conference on Machine Learning (ICML'00), pp. 1167-1182, 2000.
  - Bekkerman, R., El-Yaniv, R., Tishby, N., and Winter, Y. Distributional word clusters vs. words for text categorization. J. Mach. Learn. Res. 3 (Mar. 2003), 1183-1208. http://www.cs.technion.ac.il/ronb/papers/jmlr.pdf
  - N. Slonim and N. Tishby. The power of word clusters for text classification. In 23rd European Colloquium on Information Retrieval Research, 2001. http://www.cs.huji.ac.il/noamm/publications/ECIR2001.ps.gz
  - Landauer, T. K., Foltz, P. W., Laham, D. Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284, 1998. http://lsa.colorado.edu/papers/dp1.LSAintro.pdf
  - Fan Li, Yiming Yang. Using recursive classification to discover predictive features. ACM SAC 2005.
42. References
- Learning methods
  - Joachims, T. Text categorization with support vector machines: learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning, pages 137-142, 1998.
  - Lijuan Cai, Thomas Hofmann. Text categorization by boosting automatically extracted concepts. 26th Annual International ACM SIGIR Conference, 2003.
  - Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, Yasemin Altun. Support vector machine learning for interdependent and structured output spaces. International Conference on Machine Learning (ICML), 2004.
- Optimizing methods
  - Lewis, D.D. Evaluating and optimizing autonomous text classification systems. In Proceedings of ACM SIGIR, 1995.
  - Özgür, L., Güngör, T., and Gürgen, F. Adaptive anti-spam filtering for agglutinative languages: a special case for Turkish. Pattern Recogn. Lett. 25, 16 (Dec. 2004), 1819-1831.
  - Yiming Yang, Thomas Ault, Thomas Pierce and Charles W. Lattimer. Improving text categorization methods for event tracking. Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'00), pp. 65-72, 2000.
43. (No transcript)
44. Appendix: Stemming, Information Retrieval, Classification Process, Expert Systems, Learning Algorithms, Voting, AdaBoost, Reinforcement Learning, Performance Measures
45. Motivation: A Complex Case
[Diagram: a linear decision boundary vs. a non-linear boundary]
46. Experimental Results
- Comparison of predictive power
  - existing method vs. proposed method
  - correct documents vs. 10% incorrect documents
- When the training data contains 10% incorrect documents:
  - accuracy, existing method vs. proposed method
47. Knowledge Engineering Process
48. Proposed Method: Concepts
[Diagram. Training: Training Documents -> Preprocessing -> Indexing -> Feature Selection -> Training. Classifying: New Documents -> Preprocessing -> Indexing -> Classification Algorithm (Classifier) -> Assign; performance measure]
49. [Korean diagram: the text classification pipeline, with feature selection, indexing, a text classifier, and the assign step]
50. Pre-processing: Stemming
51. Information Retrieval
52. Weights
- How should weights be assigned to terms?
- A very simple method: the weight of a term is its document frequency
- But different words normally have different frequencies
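The very simple weighting scheme above can be sketched in Python; counting each term at most once per document is what distinguishes document frequency from raw term frequency:

```python
from collections import Counter

def document_frequency(docs):
    """A very simple weighting: the weight of a term is the number of
    documents it appears in (its document frequency)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # count each document at most once
    return df
```

More refined schemes (e.g. tf-idf) would then correct for the fact that different words have very different base frequencies.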
53. Related Work: Expert Systems
54. Classification Types
- Single-label vs. multi-label
  - exactly 1 category assigned to each document vs. 0 to |C|
- Binary vs. multi-way classification
  - binary: a special case of single-label; dj ∈ D is assigned either to ci or to its complement (e.g. spam / non-spam)
- Document-pivoted (DPC) vs. category-pivoted (CPC)
  - given dj ∈ D, we want to find all the ci ∈ C under which it should be classified (document-pivoted)
    - DPC is suitable when documents become available at different moments in time, e.g. filtering e-mail
  - given ci ∈ C, we want to find all the dj ∈ D that should be classified under it (category-pivoted)
    - CPC is suitable when new categories are likely to be added to C
55. Some More Types
- Hard categorization vs. ranking categorization
  - making a hard decision about the categories that a document should be classified under
  - or ranking the categories based on their estimated appropriateness and allowing the choice of category to be made, e.g., by a human expert
  - ranking categorization may lead to interactive categorization systems
  - may be useful in critical applications
- Hierarchical vs. flat
  - just like in clustering: in flat classification the relations between classes are undefined, in hierarchical classification the classes are ordered
56. Learning Algorithms
- Learning algorithms
  - error-correction learning
  - Boltzmann learning
  - Thorndike's law of effect
  - Hebbian learning
  - competitive learning
- Learning paradigms
  - supervised learning
  - reinforcement learning
  - self-organized (unsupervised) learning
57. Voting Algorithms
- Principle: using multiple evidence (multiple poor classifiers > a single good classifier)
- Generate some base classifiers
- Combine them to make the final decision
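The combination step can be sketched as a plain majority vote over the base classifiers' outputs:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine base-classifier outputs by simple majority voting;
    ties resolve to the earliest-seen label."""
    return Counter(predictions).most_common(1)[0][0]
```

Weighted variants (as in boosting) replace the raw count with a per-classifier weight.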
58. Bagging Algorithm
- Use multiple versions of a training set D of size N, each created by resampling N examples from D with the bootstrap
- Each of the data sets is used to train a base classifier; the final classification decision is made by the majority vote of these classifiers
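The bootstrap resampling step can be sketched as follows (the `seed` parameter is an addition for reproducibility, not part of the algorithm):

```python
import random

def bootstrap_sets(data, n_sets, seed=0):
    """Create n_sets training sets, each by resampling len(data)
    examples from data with replacement (the bootstrap)."""
    rng = random.Random(seed)  # seeded only for reproducibility
    return [[rng.choice(data) for _ in data] for _ in range(n_sets)]
```

Each resampled set would then train one base classifier, and their outputs are combined by majority vote.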
59. AdaBoost
- Main idea
  - The main idea of this algorithm is to maintain a distribution, or set of weights, over the training set. Initially, all weights are set equally, but in each iteration the weights of incorrectly classified examples are increased so that the base classifier is forced to focus on the hard examples in the training set. For correctly classified examples, the weights are decreased so that they are less important in the next iteration.
- Why ensembles can improve performance
  - Uncorrelated errors made by the individual classifiers can be removed by voting.
  - Our hypothesis space H may not contain the true function f. Instead, H may include several equally good approximations to f. By taking weighted combinations of these approximations, we may be able to represent classifiers that lie outside of H.
60. AdaBoost Algorithm
Given m examples (x_1, y_1), ..., (x_m, y_m) where x_i ∈ X and y_i ∈ {-1, +1}.
Initialize D_1(i) = 1/m for all i = 1, ..., m.
For t = 1, ..., T:
- Train a base classifier h_t using distribution D_t, with error ε_t = Pr_{i ~ D_t}[h_t(x_i) ≠ y_i].
- Choose α_t = (1/2) ln((1 - ε_t) / ε_t).
- Update D_{t+1}(i) = D_t(i) exp(-α_t y_i h_t(x_i)) / Z_t, where Z_t is a normalization factor (chosen so that D_{t+1} will be a distribution).
Output the final hypothesis H(x) = sign(Σ_t α_t h_t(x)).
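The loop above can be sketched end to end in Python. One-dimensional threshold stumps are an assumed base classifier (any weak learner would do), and the epsilon guard for a perfect stump is an implementation detail, not part of the algorithm:

```python
import math

def train_stump(xs, ys, w):
    """Pick the 1-D threshold stump (threshold, polarity) with the lowest
    weighted error; labels are in {-1, +1}."""
    best = (float("inf"), None, None)
    for thr in xs:
        for pol in (1, -1):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if pol * (1 if xi >= thr else -1) != yi)
            if err < best[0]:
                best = (err, thr, pol)
    return best

def adaboost(xs, ys, T=5):
    """AdaBoost over threshold stumps: equal initial weights, raise
    weights of misclassified examples, renormalize with Z_t."""
    m = len(xs)
    w = [1.0 / m] * m
    ensemble = []  # (alpha_t, threshold, polarity) per round
    for _ in range(T):
        err, thr, pol = train_stump(xs, ys, w)
        err = max(err, 1e-10)  # guard against log(0) on a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, pol))
        # D_{t+1}(i) = D_t(i) * exp(-alpha * y_i * h_t(x_i)) / Z_t
        w = [wi * math.exp(-alpha * yi * pol * (1 if xi >= thr else -1))
             for xi, yi, wi in zip(xs, ys, w)]
        z = sum(w)  # normalization factor Z_t
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    """Final hypothesis H(x) = sign(sum_t alpha_t * h_t(x))."""
    s = sum(a * p * (1 if x >= t else -1) for a, t, p in ensemble)
    return 1 if s >= 0 else -1

ens = adaboost([0, 1, 2, 3, 4, 5], [-1, -1, -1, 1, 1, 1], T=3)
```

On this toy data a single stump at threshold 3 already separates the classes; on harder data, later rounds concentrate on the examples the earlier stumps got wrong.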
61. Analysis of Voting Algorithms
- Advantages
  - surprisingly effective
  - robust to noise
  - decrease the over-fitting effect
- Disadvantages
  - require more computation and memory
62. Decision Tree Learning
- Learn a sequence of tests on features, typically using top-down, greedy search
- At each stage, choose the unused feature with the highest information gain (feature/class mutual information)
- Binary (yes/no) or continuous decisions
[Diagram: a tree splitting first on f1 / !f1, then on f7 / !f7]
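The greedy criterion above, information gain for a boolean feature, can be sketched as parent entropy minus the weighted entropy of the two child subsets (the `(feature_dict, label)` example format is an assumption for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a non-empty label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, feature):
    """Information gain of a boolean feature: parent entropy minus the
    weighted entropy of the two child subsets.
    examples: list of (feature_dict, label) pairs."""
    labels = [y for _, y in examples]
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for f, y in examples if f.get(feature, False) == value]
        if subset:
            gain -= len(subset) / len(examples) * entropy(subset)
    return gain

examples = [({"f1": True}, "pos"), ({"f1": True}, "pos"),
            ({"f1": False}, "neg"), ({"f1": False}, "neg")]
```

Here `f1` separates the classes perfectly (gain of one full bit), while an absent feature like `f7` gains nothing, so the greedy search would split on `f1` first.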
63. Performance Measures
- Performance of the algorithm
  - training time
  - testing time
- Classification accuracy
  - precision, recall
  - micro-average / macro-average
  - break-even point (precision = recall)
- Goal: high classification quality and computational efficiency
64. Elements of Reinforcement Learning
[Diagram: agent with policy interacting with the environment]
- Transition model: how actions influence states
- Reward R: immediate value of a state-action transition
- Policy π: maps states to actions
65. Artificial Life & Brain
[Korean slide] Three principles of artificial life:
- development, e.g. L-systems, cellular automata
- learning, e.g. reinforcement learning, classifier systems
- evolution, e.g. evolutionary algorithms, co-evolution, DNA coding
66. Classification Performance Measures
- Given n test documents and m classes in consideration, a classifier makes n × m binary decisions. A two-by-two contingency table can be computed for each class:

             | truly YES | truly NO
  system YES |     a     |    b
  system NO  |     c     |    d

- Recall = a/(a+c), where a+c > 0 (otherwise undefined).
  - Did we find all of those that belonged in the class?
- Precision = a/(a+b), where a+b > 0 (otherwise undefined).
  - Of the times we predicted it was in the class, how often were we correct?
- Accuracy = (a + d) / n
  - When one class is overwhelmingly in the majority, this may not paint an accurate picture.
- Others: miss, false alarm (fallout), error, F-measure, break-even point, ...
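The three measures fall directly out of the contingency table; returning `None` for the undefined cases mirrors the caveats above:

```python
def metrics(a, b, c, d):
    """Per-class measures from the 2x2 contingency table:
    a = system YES / truly YES,  b = system YES / truly NO,
    c = system NO  / truly YES,  d = system NO  / truly NO."""
    n = a + b + c + d
    recall = a / (a + c) if a + c else None     # undefined when a + c = 0
    precision = a / (a + b) if a + b else None  # undefined when a + b = 0
    accuracy = (a + d) / n
    return recall, precision, accuracy
```

Note the majority-class caveat: with a = b = c = 0 the accuracy is a perfect 1.0 even though recall and precision are both undefined.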
67. Data Mining Approach
[Korean diagram]
68. [Korean title]: Background, Results
69. [Korean planning notes]
70Related Work Main
- Ranking systems (see references)
- Feature extraction (feature vectors)
- Boosting-based methods
- References
71Misclassification
- Misclassification: where misclassified documents come from and how to handle them
- Sampling-based approaches
- Learning by error correction
72Motivation
- Most documents have highly complex content, with multiple topics and features
- Complex reports, news material, biological literature
- Various kinds of algorithms have been proposed for this problem; however, the results are not satisfactory
73Background(1)
- Classification problem
- Objective: classify a new object into one of the predefined categories using the generated classifier or rule
- Definition: D (documents), C (categories)
- Border problem in complex data
- Automated classification problem: manual classification is costly and error-prone, motivating automated methods
74Motivation Problem Definition
- Classification problem
- Goal: find an optimal decision line or surface while reducing manual intervention
- Low training cost
- High accuracy
- In real-world problems
- Representation of a document: a bag of words
- Errors in training sets/samples may occur even though we have efficient learning methods and human experts
- Most documents have highly complex content with multiple topics and features
- Complex reports, spam letters, news material, biological literature
- Various kinds of algorithms and methods have been proposed for this problem; however, the results are not satisfactory
- Accuracy is high when the data fits the model well
- Assignment method by a simple ranking system
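The "bag of words" representation named above can be sketched in a few lines (the example documents are invented):

```python
# Illustrative sketch: a document reduces to word counts, so word order
# (and much of the meaning) is discarded -- the root of the border problem.
from collections import Counter

def bag_of_words(doc):
    return Counter(doc.lower().split())

d1 = bag_of_words("cheap offer cheap loan")
d2 = bag_of_words("loan offer cheap cheap")
print(d1 == d2)   # True: different word order, identical bag of words
```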
75Problem Definition
- Maeng et al., SIGIR 1998; Navarro and Baeza-Yates, SIGIR 1995
- Model for querying and indexing
- Need to analyze similarity between documents for mining (classification and clustering)
- Yi and Sanderson, SIGKDD 2000
- Target: uniform structure from one database or website
- Problem: a classification model that does not consider text semantics
- may not deal with documents with various concepts
- cannot identify similar meanings
76Background Data Classification vs. Text Classification
(Example table: records 1-4 with features Age, Income, Car and target Class)
77Approach Type of Misclassification
- Training errors causing misclassification (Brownell)
- Distortion
- Incompleteness
- Uncertainty
- Absence
- Vagueness
- Probability
- Ambiguity
- Fuzziness
- Non-specificity
78Motivation
- Recent documents
- Similar symptoms -> different diseases; similar feature vectors -> different categories
- Multiple concepts
- Ranking system
79- Similar symptoms -> different diseases: documents with similar feature vectors can belong to different categories
- Example: an article on the "Musso Sports" fits several categories at once
- RV/SUV
- Sports
- Existing decision rules cannot resolve such overlaps well
80Category Overlap Example
(Diagram: the Musso Sports example spans the RV and SUV categories)
81Proposed System
82Proposed Method
Step 2: Text Mining
83Refinement Method using Text/Data Mining System
- Proposed system
- Objective: maximize accuracy, minimize training cost
- Methodology: automated text classification based on a knowledge discovery process using reinforcement learning and post-processing
- Target data: a set of documents which have multiple concepts (uncertain, complex documents)
84Implementation of Refinement Text/Data Mining System
- Part 1: Hierarchical Construction in Training Method
- Part 2: Post-Processing by Reinforcement Learning
85Part 1 Reinforcement Training Method Progress
- Define the target category scheme
- Definition 1: C = {c1, c2, ..., cn} is the set of final target categories, where any ci and cj are mutually disjoint.
- Definition 2: SCn = {cn1, cn2, ..., cnk} is the set of subcategories of target category cn, where the cnj are mutually disjoint.
- Definition 3: X = {x1, x2, ..., xn-1} is the set of intermediate categories.
- Data located around the decision boundary belongs to X; unclassified documents are also denoted by X.
Organizing method of training data
Documents or Sentences
86Part 2 Post-Processing - Assignment
- Goal
- Overcome the problems and limitations of the traditional method with a mining approach focused on risk-minimization analysis.

Assigning a category to documents using the initial score:
- Input: document Di; candidate category list Li, normalized and re-sorted in descending order
- Step 1: for i = 0 to N (number of input documents)
-   if (Di.size >= min_support) and (Li[1].score >= min_value)
-     and (Li[1].score - Li[2].score >= diff_value) then
-       assign Di to Li[1]
-   else assign Di to X
- Step 2: for each document Di unassigned in Step 1
-   for each target category (k = 0 to M)
-     calculate the distance between P and cnk
-   assign Di to the closer side cn
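A minimal runnable reading of Step 1 (the three thresholds are invented illustrative values; the slide leaves them unspecified):

```python
# Sketch of the Step 1 assignment rule: assign the top-ranked candidate
# only when the document is large enough, the top score is high enough,
# and the gap to the runner-up is wide enough; otherwise defer to X.
MIN_SUPPORT, MIN_VALUE, DIFF_VALUE = 50, 0.5, 0.2   # illustrative thresholds

def step1_assign(doc_size, candidates):
    """candidates: list of (category, score), sorted by score descending."""
    (c1, s1), (_, s2) = candidates[0], candidates[1]
    if doc_size >= MIN_SUPPORT and s1 >= MIN_VALUE and (s1 - s2) >= DIFF_VALUE:
        return c1       # confident: assign the top-ranked category
    return "X"          # uncertain: defer to Step 2

print(step1_assign(726, [("C21", 0.98), ("C11", 0.01)]))  # C21
print(step1_assign(31,  [("C21", 0.39), ("C12", 0.20)]))  # X
```

With these thresholds the two calls reproduce rows 1 and 2 of the recomputation table: the confident document is assigned C21, while the low-score, small-gap document falls to X for Step 2.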
87Part 2 Post-Processing Recomputation
- Make another training set from the candidate lists of documents
- Perform data mining analysis with these uncommon patterns

Di  Rank 1 (wm .02)  Rank 2 (wm .15)  Rank 3 (wm .25)  Rank 4 (wm .31)  Rank 5 (wm .35)  Di.size  Step 1     Step 2  Assign  Actual Class
1   C21 .98          C11 .01          X1 .01           X2 .01           C12 .00          726.33   C21 -> C2  -       C2      C2
2   C21 .39          C12 .20          X2 .17           C11 .13          X1 .10           31.6     X          C2      C2      C2
3   X2 .29           C11 .28          C12 .17          C21 .15          X1 .01           514.42   X          C1      C1      C1
4   X1 .28           C21 .23          X2 .17           C11 .16          C12 .15          287.12   X          C2      C2      C2
88Part 2 Post-Processing Evaluation and Feedback
- Evaluation matrix for effectiveness by variance of results
- A: Cn1.step1 = X
- B: Cn1.step1 = True
- C: (Cn11.step1 = True) and (Cn1.step2 = True)
- D: (Cn1.step1 = False) and (Cn1.step2 -> Cn2)

Results from each step (feedback time 1):
Location  Input Data  Cn1.step1  Cn1.step2  Cn2  Cn11.step1  Cn11.step2  Evaluation
d1        X           1          1          1    -           -           Good
d2        X           0          1          1    -           -           Good
d3        X           0          0          1    -           -           Poor
d4        1           1          1          1    -           -           Fair
d5        1           0          1          1    -           -           Fair
d6        1           1          0          1    -           -           Poor
d7        1           0          0          1    -           -           Poor
d8        0           1          1          1    -           -           Good
d9        0           0          1          1    -           -           Good
d10       0           1          0          1    -           -           Poor

where Cne.process: n = 0, 1, 2 is the feedback time; e is the type of input data (1 = documents, 2 = candidate lists of documents); process is step1 or step2
89Evaluation Methodology and Performance Measure
- Performance goal: high classification quality and computational efficiency
- Training time, testing time
- Classification accuracy
- Precision, recall
- Micro-average / macro-average
- Precision/recall break-even point
- We use
- Measure: Accuracy = True Cases / N
- The test set does not contain training data
- Conditions
- Same classifiers: NB, SVM
- With and without post-processing
- With and without noisy data in the training samples
90Contribution
- Improvement of automated text classification using a refinement training and post-processing method
- Related areas: construction methods for training datasets, machine learning, text mining, data mining, information retrieval
- Does not depend on particular classification or feature selection algorithms
- Main objective
- Provide high performance and reliability while reducing human intervention
- Application areas
- Data that is difficult to classify
- Anti-spam systems
91Classification Learning
(Diagram: a classifier assigns incoming documents to category1-category4 in the full category schema)
92Background Classification Example
93Comparison Based on Six Classifiers
- Classification accuracy of six classifiers (Reuters-21578 collection)
- SVM, Voting, and KNN showed good performance; DT, NB, and Rocchio showed relatively poor performance

Setup       1          2          3          4
Author      Dumais     Joachims   Weiss      Yang
Training    9603       9603       9603       7789
Test        3299       3299       3299       3309
Topics      118        90         95         93
Indexing    Boolean    tfc        Frequency  ltc
Selection   MI         IG         -          χ2
Measure     Breakeven  Microavg.  Breakeven  Breakeven

Classifier  1          2          3          4
Rocchio     61.7       79.9       78.7       75
NB          75.2       72         73.4       71
KNN         N/A        82.3       86.3       85
DT          N/A        79.4       78.9       79
SVM         87         86         86.3       N/A
Voting      N/A        N/A        87.8       N/A
94Comparison Based on Feature Selection
- Classification accuracy: NB vs. KNN vs. SVM (Reuters collection)

# of features   NB            KNN           SVM
10              48.66 ± 0.10  57.31 ± 0.20  60.78 ± 0.17
20              52.28 ± 0.15  62.57 ± 0.16  73.67 ± 0.11
40              59.19 ± 0.15  68.39 ± 0.13  77.07 ± 0.14
50              60.32 ± 0.14  74.22 ± 0.11  79.02 ± 0.13
75              66.18 ± 0.19  76.41 ± 0.11  83.00 ± 0.10
100             77.90 ± 0.19  80.20 ± 0.09  84.30 ± 0.12
200             78.26 ± 0.15  82.50 ± 0.09  86.94 ± 0.11
500             80.80 ± 0.12  82.19 ± 0.08  86.59 ± 0.10
1000            80.88 ± 0.11  82.91 ± 0.07  86.31 ± 0.08
5000            79.26 ± 0.07  82.97 ± 0.06  86.57 ± 0.04
95The Main Approaches to Automated Text Classification
- The machine learning approach
- Build a classifier for class ci by observing the properties of the set of documents manually pre-classified under ci (learning)
- Major drawback: learning cost (human experts, learning algorithms)
- i.e., a human expert is needed to define the concepts in each different domain (expert systems)
- The knowledge engineering approach
- Needs a large set of rules: if <...> then <category>
- Rules are manually constructed
- Major drawback: the knowledge acquisition bottleneck
- i.e., how do we deal with new categories, different domains, etc.?
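The knowledge-engineering approach amounts to hand-written term rules; a minimal sketch (the rules and categories are invented):

```python
# Illustrative sketch of manually constructed "if terms then category"
# rules: a rule fires only when every one of its terms is present.
RULES = [
    ({"wheat", "grain"}, "agriculture"),
    ({"dollar", "rate"}, "finance"),
]

def rule_classify(doc, default="unknown"):
    words = set(doc.lower().split())
    for terms, category in RULES:
        if terms <= words:        # all rule terms appear in the document
            return category
    return default

print(rule_classify("wheat and grain prices rise"))  # agriculture
print(rule_classify("a new grain silo"))             # unknown: rule needs both terms
```

Every new category or domain means writing and maintaining more such rules by hand, which is exactly the knowledge acquisition bottleneck named above.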
96Uncertainty
- Uncertainty problem
- The true value is unknown
- Too complex to compute before making a decision
- A characteristic of real-world applications
- Sources of uncertainty
- Cannot be explained by a deterministic model
- e.g., decay of radioactive substances
- Not well understood
- e.g., disease transmission mechanisms
- Too complex to compute
- e.g., classification of complex documents
97Approach(2/A) Selection Problem in Training
- Training errors causing misclassification (Brownell)
- Distortion
- Incompleteness
- Absence
- Uncertainty
- Confidence
- Randomness
- Vagueness
- Which side will come up if I toss a coin?
- How confident are you in your decision?