Title: Refinement Method of Post-processing and Training for Improvement of Automated Classification

1. Topic: Refinement Method of Post-processing and Training for Improvement of Automated Classification
- Yun Jeong Choi
- Dept. of Computer Science & Engineering
- Ewha Womans University
2. Contents
- Background
  - Text Classification Framework
  - Classification Problem
- Related Work
- Motivation
  - Problem Area
  - Approach Strategy
- Proposed Method: Refinement Text Classification System
  - Part 1: Reinforcement Training Method
  - Part 2: Post-Processing Method for Assigning and Feedback Analysis
- Experimental Results
- Discussion & Next Plan
3. Background: Text Classification
- Objective: assign a new document to one of the predefined categories using the generated classifier or rule
- Given:
  - a document dj ∈ D, where D is the domain of documents
  - a description of an instance x ∈ X, where X is the instance space (feature vectors); e.g., how to represent text documents
  - a fixed set of categories C = {c1, c2, ..., cn}
- Determine:
  - the category of x: c(x) ∈ C, where c(x) is a categorization function (classifier) whose domain is X and whose range is C
- We want to:
  - build a classifier that assigns these values
  - i.e., assign a Boolean value to each pair <dj, ci>
[Pipeline diagram: New Documents -> Preprocessing -> Indexing -> Classification Algorithm (Classifier) -> Assign]
4. Background: Data Classification vs. Text Classification
[Diagram: a data classification example (supervised prediction for target marketing, with input variables, roles, a target, and cases such as annual income) contrasted with text classification]
5. Background: Representing Documents
- Usually, an example is represented as a series of feature-value pairs. The features can be arbitrarily abstract (as long as they are easily computable) or very simple.
- For example, the features could be the set of all words, and the values (weighting scores) their numbers of occurrences in a particular document.
- Learning is the process of modifying the weights.
[Representation as a bag of words: Computer:1, Machine:1, AI:1, Information:2, Korea:1, U.S.:1, ...]
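The bag-of-words representation above can be sketched in a few lines of Python; the sample terms echo the slide, and the whitespace tokenizer is a deliberately naive assumption:

```python
from collections import Counter

def bag_of_words(text):
    """Represent a document as {term: number of occurrences} pairs,
    using a deliberately naive whitespace tokenizer."""
    return Counter(text.lower().split())

features = bag_of_words("Computer Machine AI Information Information Korea U.S.")
```

Learning then amounts to adjusting the weights attached to these term features.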
6. Background: Text Classification Framework
[Diagram. Training: Training Documents -> Preprocessing -> Indexing -> Feature Selection -> Training. Classifying: New Documents -> Preprocessing -> Indexing -> Classification Algorithm (Classifier) -> Assign]
7. Related Work (1/3)
- Approach to improving performance based on classification algorithms
- In a single classification model:
  - weighting scheme in feature extraction and selection
  - indexing method
  - probabilistic computation
  - Problems: the vector-space model (considering only term frequency) cannot consider semantics; the assign method is simple
- In several classification models:
  - competition / voting: select the better one among weak classifiers
  - ensemble framework: combination of several classifiers
  - cooperative models and classifiers
- Regression based on Least Squares Fit (1991)
- Nearest Neighbor Classification (1992)
- Bayesian Probabilistic Models (1992)
- Symbolic Rule Induction (1994)
- Neural Networks (1995)
- Rocchio approach (traditional IR, 1996)
- Support Vector Machines (1997)
- Maximum Entropy (1999)
- Hidden Markov Models (1999)
- Error-Correcting Output Coding (1999) ...
8. Related Work (2/3)
- Approach to improving performance based on learning algorithms
- Construction methods for the sample data set and training set
  - selection problem for sample documents
- Ensemble structure
  - bootstrapping and bagging algorithms
  - AdaBoost
- Active learning
- Lijuan Cai, Thomas Hofmann, ACM SIGIR 2003: Text Categorization by Boosting Automatically Extracted Concepts
- Ioannis Tsochantaridis, ICML 2004: Support Vector Machine Learning for Interdependent and Structured Output Spaces
Combine several weak classifiers -> make a strong classifier (choose the stronger classifier)
9. Related Work (3/3)
- Approach to improving performance with assigning methods for complex documents
- Assigning methods: the Wimbledon (championship tournament) method
  - Yunqing Xia, Wei Liu, NLDB 2005: Email Categorization with Tournament Methods
- Modified algorithms for specific domain problems
  - Haoran Wu, Tong Heng Phang, SIGKDD 2002: A Refinement Approach to Handling Model Misfit in Text Categorization
  - anti-spam classifiers
- Optimization and combination methods
  - trial and error -> feedback
  - evaluation and feedback in a data mining strategy
10. Motivation: Some Difficulties in Classification
- Human experts make different decisions about the same situation.
  - At a hospital: error risk (the businessman's risk) when similar symptoms suggest different diseases
  - Vehicle classification: a subjective matter; a meaningless argument, as many vehicles fall between classes or even outside all of them
[Diagram: data/documents range from very easy cases (for machine and human) to hard ones, e.g. disease diagnostics, movies, vehicles]
11. Problem Definition (A/2)
- Problem A: Characteristics of Documents
  - Most documents have high complexity and uncertainty in their contents, with multiple concepts and features
  - complex reports, spam letters, new materials, biological literature, etc.
  - Training errors may happen even when we have an efficient learning method and a human expert
  - high cost in training time -> inaccuracy
- Problem B: Classification Method / Performance Measure
  - Not important: what kind of classifier was used?
  - Classifiers and algorithms have a limited capacity for analysis when dealing with documents as a bag of words
  - The accuracy is high only when the data fits the model well (a specific model for a specific domain area)
  - Assignment is done by a simple ranking method
  - Not important: how many documents were used for training?
  - Important: how informative were the documents that were selected or constructed?
12. Approach (1/A): Border Problem in Uncertain Data
- To find the optimal boundary
  - a classifier makes rules for classification: a line or an area
- Goal: more accuracy in an automatic process
- But data size and complexity grow ever faster, which makes finding the optimal line more difficult!
- In this (red) area, complex data which have multiple concepts, or no meaning, are located
  - finding the decision rule or line
13. Strategy (1/A): Hierarchical Structure in Class Definition
- Define the target category scheme using a hierarchical scheme
  - final target categories: C1, C2
  - intermediate category: X
  - subcategories of the target categories: C11, C12, C21, C22
[Diagram: a document routed through the category hierarchy]
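The hierarchical scheme could be encoded as a simple mapping; the category names follow the slide, while the helper `to_target` and the `U` (undecided) fallback are illustrative assumptions:

```python
# Category names follow the slide; the helper name to_target and the
# "U" fallback are illustrative assumptions.
HIERARCHY = {
    "C11": "C1", "C12": "C1",   # subcategories of final target C1
    "C21": "C2", "C22": "C2",   # subcategories of final target C2
    "X1": "X", "X2": "X",       # intermediate (noise-absorbing) categories
}

def to_target(subcategory):
    """Map a predicted subcategory up to its final target category."""
    return HIERARCHY.get(subcategory, "U")  # U = undecided
```

A classifier trained on the fine-grained subcategories can then report at the coarser target level.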
14. Training Algorithms
- Not just how many documents, but how informative the selected or constructed documents were!
- Training algorithms deal with:
  - selection
  - combination
  - the number of training documents
[Diagram: simple random selection vs. active sample selection (containing uncertain samples)]
15. Active Sample Selection vs. Random Selection
16. Problem Definition (B/2)
- Problem A: Characteristics of Documents
  - Most documents have high complexity and uncertainty in their contents, with multiple concepts and features
  - complex reports, spam letters, new materials, biological literature, etc.
  - Training errors may happen even when we have an efficient learning method and a human expert
  - high cost in training time -> inaccuracy
- Problem B: Classification Method
  - Not important: what kind of classifier was used
  - Classifiers and algorithms have a limited capacity for analysis when dealing with documents as a bag of words
  - The accuracy is high only when the data fits the model well (a specific model for a specific domain area)
  - The assign method, by a ranking system, is too simple
17. Approach (1/B): Assign Process
- Assign process in text classification
  - Is the 1st candidate category always right?
  - Analyze the ranking order and scores:

Doc            | Rank 1 | Rank 2 | Rank 3 | Rank 4 | Rank 5 | Target (C1, C2, U, X)
1 (category)   | C21    | C11    | X      | C22    | C12    | C2 -> U
1 (score)      | 29     | 28     | 17     | 15     | 10     |
2 (category)   | C22    | C11    | X      | C21    | C12    | C2
2 (score)      | 97     | 1      | 1      | 1      | 0      |
18. Strategy (B): Post-Processing for Assignment
- Goals in each step
  - Step 1: classify the 1st candidate category of a document
    - minimum threshold: the gap between 1st and 2nd by rank score (probability score)
  - Step 2: classify the cases left undecided in Step 1
    - compute a distance between X (the intermediate category) and the other categories
  - Step 3: gather common / uncommon cases from Step 1 and Step 2
    - the ranking order from text classification represents the document: <category, score> vs. <term, frequency>
  - Step 4: analyze the experience from Steps 1-3 -> feedback to the training process
    - modify the classification rule
    - add / eliminate training documents
- Classify reliable cases vs. unreliable ones
19. Strategy (1/B): Post-Processing for Assignment
- Goal: overcome the problems and limitations of the traditional method with a mining approach focused on risk-minimization analysis.

Assign a category to documents using the initial scores:
- Input: document Di; candidate category list Li, normalized and sorted in descending order
- Step 1: for i = 0 to N (number of input documents)
    if (Di.size >= min_support) and ((Li[1].score >= min_value) or (Li[1].score - Li[2].score >= diff_value)) then
      assign Di to the 1st candidate category
    else
      assign Di to U (undecided)
- Step 2: for n = 0 to N (number of documents left unassigned in Step 1)
    for each target category cn:
      calculate the distance between the pivot category X and cn
    assign Di to the closer side cn
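The Step 1 rule above can be sketched as runnable Python. The threshold values and the comparison directions are assumptions (the slide's operators were lost in extraction); the candidate list is taken as `(category, score)` pairs sorted in descending order:

```python
def assign_step1(doc_size, candidates,
                 min_support=10, min_value=0.5, diff_value=0.2):
    """Step 1: accept the top-ranked candidate only when the document is
    big enough AND its top score is either high on its own or clearly
    ahead of the runner-up; otherwise mark the document U (undecided).
    Threshold values and comparison directions are assumptions; the
    slide's operators were lost in extraction."""
    (c1, s1), (_, s2) = candidates[0], candidates[1]
    if doc_size >= min_support and (s1 >= min_value or s1 - s2 >= diff_value):
        return c1
    return "U"
```

On the example rows of the next slide, a document with size 726.33 and top score .98 would be accepted directly, while one with top score .39 against a runner-up of .20 would fall to U and go on to Step 2.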
20. Strategy (2/B): Post-Processing Recomputation
- Make another training data set from the candidate lists of the documents
- Perform data classification with these uncommon patterns

Di | Rank 1 (wm 0.02) | Rank 2 (wm 0.15) | Rank 3 (wm 0.25) | Rank 4 (wm 0.31) | Rank 5 (wm 0.35) | Di.size | Step1 | Step2 | Step3 | Actual class
1 | C21 .98 | C11 .01 | X1 .01 | X2 .01 | C12 .00 | 726.33 | C21 -> C2 | - | C2 | C2
2 | C21 .39 | C12 .20 | X2 .17 | C11 .13 | X1 .10 | 31.6 | U | C2 | C2 | C2
3 | X2 .29 | C11 .28 | C12 .17 | C21 .15 | X1 .01 | 514.42 | U | C1 | C1 | C1
4 | X1 .28 | C21 .23 | X2 .17 | C11 .16 | C12 .15 | 287.12 | U | C2 | C2 | C2
21. Multi-Training Strategy
- where Cn.process:
  - n = feedback time (0, 1, 2, ...)
  - process = step1, step2, step3
  - Cn.process = 1 when actual class = predicted class
  - Cn.process = 0 when actual class ≠ predicted class
- Training data
  - document set used in text classification (Cn.step1, Cn.step2)
    - examples: <document, class>
  - common/uncommon cases for data classification in step 1 and step 2 (Cn.step3)
    - examples: <order of candidate classes in a document, actual class>
  - the results of the above classification of each document, using reinforcement learning
    - examples: <<state, action>, reward>
- Evaluation function for estimating accuracy
  - fault detection and error correction

Results from each step (feedback time 1):
Location | Input Data | Cn.step1 | Cn.step2 | Cn.step3 | Cn+1.step1 | Cn+1.step2 | Cn+1.step3
d1  | X | 1 | 1 | 1 | - | - | Good
d2  | X | 0 | 1 | 1 | - | - | Good
d3  | X | 0 | 0 | 1 | - | - | Poor
d4  | 1 | 1 | 1 | 1 | - | - | Fair
d5  | 1 | 0 | 1 | 1 | - | - | Fair
d6  | 1 | 1 | 0 | 1 | - | - | Poor
d7  | 1 | 0 | 0 | 1 | - | - | Poor
d8  | 0 | 1 | 1 | 1 | - | - | Good
d9  | 0 | 0 | 1 | 1 | - | - | Good
d10 | 0 | 1 | 0 | 1 | - | - | Poor
22. Approach (2/B): Error Correction and Feedback
- Reinforcement learning
  - learning from interaction with the environment
  - delayed reward
  - trial and error
- Learning by error correction
  - prediction -> prediction -> prediction -> conservation: experience
- Transition model: how actions influence states
- Reward R: immediate value of a state-action transition
- Policy π: maps states to actions
[Diagram: agent with policy interacting with the environment; a numbered grid of example state-action values]
23. Strategy (3/B): Post-Processing Evaluation & Feedback
- Evaluation matrix for relevance feedback
- Goal: estimate accuracy, efficiency, over-fitting
- Action: add/eliminate documents -> modify the class definition

Results from each step (feedback time 1):
Location | Input Data | Cn.step1 | Cn.step2 | Cn.step3 | Cn+1.step1 | Cn+1.step2 | Cn+1.step3
d1  | X | 1 | 1 | 1 | - | - | Good
d2  | X | 0 | 1 | 1 | - | - | Good
d3  | X | 0 | 0 | 1 | - | - | Poor
d4  | 1 | 1 | 1 | 1 | - | - | Fair
d5  | 1 | 0 | 1 | 1 | - | - | Fair
d6  | 1 | 1 | 0 | 1 | - | - | Poor
d7  | 1 | 0 | 0 | 1 | - | - | Poor
d8  | 0 | 1 | 1 | 1 | - | - | Good
d9  | 0 | 0 | 1 | 1 | - | - | Good
d10 | 0 | 1 | 0 | 1 | - | - | Poor

Good(di) = 1 if ((Cn.step1 = X) and (Cn+1.step1 = True)) or (Cn+1.step2 = True); 0 otherwise
Fair(di) = 1 if (Cn+1.step1 = True); 0 otherwise
Poor(di) = 1 if (Cn+1.step1 = False) or (Cn+1.step2 ≠ Cn+2); 0 otherwise
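The Good / Fair / Poor labeling could be sketched as below. The exact predicates are a simplified reading of the slide's partially garbled conditions, so treat them as assumptions rather than the authors' precise definitions:

```python
def evaluate(cn_step1, cn1_step1, cn1_step2):
    """Label a document Good / Fair / Poor from its step outcomes across
    one feedback round (True = the step predicted the actual class,
    "X" = the document sat in the intermediate category).  The exact
    predicates are a simplified reading of the slide's conditions."""
    if (cn_step1 == "X" and cn1_step1) or cn1_step2:
        return "Good"   # recovered by the intermediate category or by step 2
    if cn1_step1:
        return "Fair"   # step 1 already correct, nothing gained by feedback
    return "Poor"       # still wrong after feedback
```

The function mirrors the table above: documents rescued by the X category or Step 2 count as Good, unchanged correct ones as Fair, the rest as Poor.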
24. Summary: Concept of the Proposed Method
- We don't care about, and don't depend on:
  - the type of classifier and its performance
    - e.g., the functional definition of the similarity measure: COS, Euclidean, kernel functions, ...
  - how many extracted features we consider
  - These didn't solve the complexity-and-uncertainty problem
- We deal with:
  - definition of classes based on hierarchical classification
  - construction of the training set for active learning
  - post-processing analysis for improving the traditional assign method (a simple ranking system)
  - feedback to the training process and schema by reinforcement training
25. Proposed Method: Concepts
[Diagram: New Documents -> Preprocessing -> Indexing -> Classification Algorithm (Classifier) -> Assign; performance measure]
- How about classification power? Add or eliminate documents in a class.
- Assign problem
  - post-processing analysis for assignment
  - improving the simple ranking system by order of score
  - feedback to the training process by reinforcement training (error correction)
- Multi-training strategy
  - re-classification using patterns of the text-classification results
- Preprocessing
  - define the class scheme: target category vs. intermediate category vs. subcategory
  - hierarchical construction of classes
26. Contribution
- More accurate determination in a segment
  - an important basis for uncertain data (similar concepts and features)
  - reduces error risk
- Proposed method for automated text classification
  - main objective: provide high performance and reliability in the system while reducing human intervention
- Can be used for many applications
  - other classification problems, IR, and bioinformatics, as well as data classification / text classification
- Improves class quality
  - class quality: accuracy
  - stability against training errors
- Reduces learning cost
  - over-fitting problem: stop learning, disuse learning
27. Experimental Results using the Proposed Method: Newsgroup Data, Biological Data
28. Implementation of the Refinement Text/Data Mining System
- Part 1: Hierarchical Construction in the Training Method
- Part 2: Post-Processing by Reinforcement Learning
29. Experimental Method on the Newsgroup Data
- Characteristics of the 20 Newsgroups dataset
  - 5 groups: Computer, Science, Recreation, Talk and Misc.
  - There are partially multi-labeled documents, and each class has more than two subclasses.
- We use a subset of the Computer group
  - C = {comp.graphics, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x}
  - 200 training documents for each target category, 50 documents for the intermediate category, adding noise of about 10% of the total
- We use
  - Accuracy = True Cases / N
  - The test set does not contain training data
30. Experimental Conditions and Evaluation
- Conditions
  - same classifiers: NB, SVM
  - with and without post-processing
  - with and without noisy data in the training samples
- Defined categories and experimental conditions:

ID | Target category C | # docs for intermediate category X | # subcategories | Correct docs | Incorrect docs (10%) | Base classifier | Post-processing / noise
E1 | c1,c2,c3,c4 | 0 | 0 | 800 | 0 | SVM | without post-processing, without noise
E2 | c1,c2,c3,c4 | 0 | 0 | 800 | 80 | SVM | without post-processing, with noise
E3 | c1,c2,c3,c4 | 0 | 0 | 800 | 0 | NB | without post-processing, without noise
E4 | c1,c2,c3,c4 | 0 | 0 | 800 | 80 | NB | without post-processing, with noise
E5 | c1,c2,c3,c4 | 150 | 2 | 800 | 0 | SVM | with post-processing, without noise
E6 | c1,c2,c3,c4 | 150 | 2 | 800 | 80 | SVM | with post-processing, with noise
E7 | c1,c2,c3,c4 | 150 | 2 | 800 | 0 | NB | with post-processing, without noise
E8 | c1,c2,c3,c4 | 150 | 2 | 800 | 80 | NB | with post-processing, with noise
31. Experimental Results on Newsgroups
Yun Jeong Choi, Seung Soo Park, "Refinement Method of Post-processing and Training for Improvement of Automated Text Classification", ICCSA 2006, LNCS 3981, pp. 298-309, 2006
- Comparison of predictive power
  - existing method (E1, E2, E3, E4) vs. proposed method (E5, E6, E7, E8)
  - correct documents vs. 10% incorrect documents
[Accuracy charts]
32. Biological Texts: Background and Results
33. Problem Definition
- Ambiguous entities -> multiplicity
  - genes, enzymes, and their transcripts often share the same name
  - the task of annotation -> identifying and classifying the terms
  - in documents, we have to treat them as uncertain documents
  - there are too many topics and keywords
- Classification problem
  - automated classification is to classify free texts into predefined categories
  - goal: finding the optimal decision line or surface while reducing manual processing
    - low training cost
    - high accuracy
34. Examples of Ambiguous Entities
- p130 mediates TGF-beta-induced cell-cycle arrest in Rb mutant HT-3 cells.
- The INK4alpha/ARF locus encodes p14(ARF) and p16(INK4alpha), which function to arrest the cell cycle through the p53 and RB pathways, respectively.
- Many tumor types are associated with genetic changes in the retinoblastoma pathway, leading to hyperactivation of cyclin-dependent kinases and incorrect progression through the cell cycle.
- The Y79 and WERI-Rb1 retinoblastoma cells, as well as MCF7 breast cancer epithelial cells, all of which express T-channel current and mRNA for T-channel subunits, are inhibited by pimozide and mibefradil (with IC(50) = 8 and 5 microM for pimozide and mibefradil, respectively).
35. Experiments
- Data: RB-related documents
  - collected about 20,000 abstracts from PubMed
  - selected 100 documents for verifying our system
  - these documents have many ambiguous features and high classification error
- Defined categories and experimental conditions
  - most documents are connected with protein (P), gene (G) and cancer (C)
  - then do one-against-one classification for P, G, D

Target category (C) | Candidate category (S) | Intermediate category (X) | Correct docs | Incorrect docs (10%) | Total (300 + 318)
Protein | P1 | | 30 | 5 | 60 (36)
Protein | P2 | | 30 | 1 | 60 (36)
| | X1 | 60 | 0 | 60
Gene | G1 | | 30 | 3 | 60 (36)
Gene | G2 | | 30 | 3 | 60 (36)
| | X2 | 60 | 0 | 60
Disease, Cancer | D1 | | 30 | 6 | 60 (36)
Disease, Cancer | D2 | | 30 | 0 | 60 (36)
36. Experimental Results
Yun Jeong Choi, Seung Soo Park, "Efficient Classification Method for Complex Biological Literature using Text and Data Mining Combination", IDEAL 2006, LNCS 4224, pp. 688-696, 2006

Result table: comparison between the existing method and the RTPost method, with only correct documents

Method | Accuracy | Protein predictive power | Gene predictive power | Disease predictive power | Misclassification rate
Naïve Bayesian | 0.69 | 51% | 82% | 74% | 31%
SVM | 0.74 | 64% | 83% | 76% | 29%
RTPost (with Naïve Bayesian) | 0.89 | 81% | 94% | 92% | 11%
RTPost (with SVM) | 0.91 | 88% | 91% | 94% | 8%

Result table: comparison between the existing method and the RTPost method, with training containing incorrect documents

Method | Accuracy | Protein predictive power | Gene predictive power | Disease predictive power | Misclassification rate
Naïve Bayesian | 0.45 | 52% | 65% | 17% | 55%
SVM | 0.47 | 54% | 61% | 26% | 64%
RTPost (with Naïve Bayesian) | 0.85 | 84% | 92% | 75% | 15%
RTPost (with SVM) | 0.87 | 87% | 91% | 81% | 11%
37. Summary
- Many effective techniques have been proposed to classify these documents, but they are not adequate for classifying complex documents.
- Proposed method
  - a refinement classification system
  - it can be easily adapted to deal with other data or other data mining algorithms
  - developed in a component-based style using the BOW toolkit and the C language
- In the experiments, the proposed method was very successful!
  - we compared the accuracy and the stability under actual conditions
  - it does not depend on the classification algorithms and techniques
- In the future, we'll simplify the effectiveness function without raising the running costs of the entire process.
38. Publications
- International journals & conferences
  - Yun Jeong Choi, Seung Soo Park, "Efficient Classification Method for Complex Biological Literature using Text and Data Mining Combination", IDEAL 2006, LNCS 4224, pp. 688-696, 2006.
  - Yun Jeong Choi, Seung Soo Park, "Refinement Method of Post-processing and Training for Improvement of Automated Text Classification", ICCSA 2006, LNCS 3981, pp. 298-309, 2006.
  - Yun Jeong Choi, Seung Soo Park, "Automated Classification of PubMed texts for Disambiguated Annotation using Text and Data Mining", in Proceedings of the 2005 International Joint Conference of InCoB, AASBi and KSBI, BIOINFO 2005, pp. 101-106, 2005.
  - Yun Jeong Choi, Seung Soo Park, Min Kyung Kim, Hyun Seok Park, "Sense Disambiguation in PubMed Abstracts using Text and Data Mining", International Conference on Genome Informatics, GIW 2002, Vol. 13, pp. 578-579, 2002.
- Domestic journals
  - [Korean title], Vol. 12, No. 7, pp. 811-822, 2005.
  - [Korean title], Vol. 23, No. 3, pp. 33-46, 2002.
- Domestic conferences
  - [Korean title on anomaly detection and misuse detection], KCC, 2006.
  - [Korean title], Vol. 28, No. 01, pp. 247-249, 2001.04.
39. Discussion and Next Plan
- Application areas
  - Text classification is very closely connected to information retrieval systems and filtering systems
  - rich tagging systems
  - text mining, data mining
- Dynamic classification for the ever-changing needs of users
  - reflection of various viewpoints (angles) vs. a tagging system
- Traditional definition
  - <document di, class ci>: a disjoint mapping
  - a limitation of disjoint classification: must a document that has multiple concepts be assigned only one class?
  - exactly one result vs. liberal, enriched results
- Support for rich tagging systems
  - ranking information: <category, probability score>
- Efficient evaluation function
  - definition of Good / Fair / Poor cases
41. References
- Basic
  - Mitchell, T. Machine Learning. McGraw-Hill, 1997.
  - Hearst, M.A. Trends and controversies: support vector machines. IEEE Intelligent Systems, July/August 1998, pages 18-28.
  - Yang, Y., Slattery, S., Ghani, R. A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, 18(2/3):219-241, 2002.
  - Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47, 2002. http://www.math.unipd.it/fabseb60/Publications/ACMCS02.pdf
- Classification algorithms, feature extraction
  - Yang, Y. and Liu, X. A re-examination of text categorization methods. In Proceedings of ACM SIGIR, 1999.
  - Dumais, S. and Chen, H. Hierarchical classification of web content. In Proceedings of the 23rd ACM SIGIR Conference, pages 256-263, 2000.
  - Yiming Yang, Shinjae Yoo, Jian Zhang and Bryan Kisiel. Robustness of adaptive filtering methods in a cross-benchmark evaluation. In the 28th Annual International ACM SIGIR Conference (SIGIR 2005), Brazil, 2005.
  - Lewis, D. D. An evaluation of phrasal and clustered representations on a text categorization task. In Proceedings of the 15th ACM SIGIR Conference, pages 37-50, 1992.
  - Yiming Yang, Thomas Ault, Thomas Pierce. Combining multiple learning strategies for effective cross validation. The Seventeenth International Conference on Machine Learning (ICML'00), pp. 1167-1182, 2000.
  - Bekkerman, R., El-Yaniv, R., Tishby, N., and Winter, Y. Distributional word clusters vs. words for text categorization. J. Mach. Learn. Res. 3 (Mar. 2003), 1183-1208. http://www.cs.technion.ac.il/ronb/papers/jmlr.pdf
  - N. Slonim and N. Tishby. The power of word clusters for text classification. In 23rd European Colloquium on Information Retrieval Research, 2001. http://www.cs.huji.ac.il/noamm/publications/ECIR2001.ps.gz
  - Landauer, T. K., Foltz, P. W., Laham, D. Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284, 1998. http://lsa.colorado.edu/papers/dp1.LSAintro.pdf
  - Fan Li, Yiming Yang. Using recursive classification to discover predictive features. ACM SAC 2005.
42. References
- Learning methods
  - Joachims, T. Text categorization with support vector machines: learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning, pages 137-142, 1998.
  - Lijuan Cai, Thomas Hofmann. Text categorization by boosting automatically extracted concepts. 26th Annual International ACM SIGIR Conference, 2003.
  - Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, Yasemin Altun. Support vector machine learning for interdependent and structured output spaces. International Conference on Machine Learning (ICML), 2004.
- Optimizing methods
  - Lewis, D.D. Evaluating and optimizing autonomous text classification systems. In Proceedings of ACM SIGIR, 1995.
  - Özgür, L., Güngör, T., and Gürgen, F. Adaptive anti-spam filtering for agglutinative languages: a special case for Turkish. Pattern Recogn. Lett. 25, 16 (Dec. 2004), 1819-1831.
  - Yiming Yang, Thomas Ault, Thomas Pierce and Charles W. Lattimer. Improving text categorization methods for event tracking. Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'00), pp. 65-72, 2000.
43. (No transcript)
44. Appendix: Stemming, Information Retrieval, Classification Process, Expert Systems, Learning Algorithms, Voting, AdaBoost, Reinforcement Learning, Performance Measures
45. Motivation: A Complex Case
[Diagram: a linear decision boundary vs. a non-linear boundary]
46. Experimental Results
- Comparison of predictive power
  - existing method vs. proposed method
  - correct documents vs. 10% incorrect documents
- When the training data contains 10% incorrect documents:
  - accuracy, existing method vs. proposed method
47. Knowledge Engineering Process
48. Proposed Method: Concepts
[Diagram. Training: Training Documents -> Preprocessing -> Indexing -> Feature Selection -> Training. Classifying: New Documents -> Preprocessing -> Indexing -> Classification Algorithm (Classifier) -> Assign; performance measure]
49. [Korean diagram: the text classification pipeline, with feature selection, indexing, a text classifier, and the assign step]
50. Pre-processing: Stemming
51. Information Retrieval
52. Weights
- How should weights be assigned to terms?
- A very simple method: the weight of a term is its document frequency
- But different words normally have different frequencies
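The very simple weighting scheme above can be sketched in Python; counting each term at most once per document is what distinguishes document frequency from raw term frequency:

```python
from collections import Counter

def document_frequency(docs):
    """A very simple weighting: the weight of a term is the number of
    documents it appears in (its document frequency)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # count each document at most once
    return df
```

More refined schemes (e.g. tf-idf) would then correct for the fact that different words have very different base frequencies.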
53. Related Work: Expert Systems
54. Classification Types
- Single-label vs. multi-label
  - exactly 1 category assigned to each document vs. 0 to |C|
- Binary vs. multi-way classification
  - binary: a special case of single-label; dj ∈ D is assigned either to ci or to its complement (e.g. spam / non-spam)
- Document-pivoted (DPC) vs. category-pivoted (CPC)
  - given dj ∈ D, we want to find all the ci ∈ C under which it should be classified (document-pivoted)
    - DPC is suitable when documents become available at different moments in time, e.g. filtering e-mail
  - given ci ∈ C, we want to find all the dj ∈ D that should be classified under it (category-pivoted)
    - CPC is suitable when new categories are likely to be added to C
55. Some More Types
- Hard categorization vs. ranking categorization
  - making a hard decision about the categories that a document should be classified under
  - or ranking the categories based on their estimated appropriateness and allowing the choice of category to be made, e.g., by a human expert
  - ranking categorization may lead to interactive categorization systems
  - may be useful in critical applications
- Hierarchical vs. flat
  - just like in clustering: in flat classification the relations between classes are undefined, in hierarchical classification the classes are ordered
56. Learning Algorithms
- Learning algorithms
  - error-correction learning
  - Boltzmann learning
  - Thorndike's law of effect
  - Hebbian learning
  - competitive learning
- Learning paradigms
  - supervised learning
  - reinforcement learning
  - self-organized (unsupervised) learning
57. Voting Algorithms
- Principle: using multiple evidence (multiple poor classifiers > a single good classifier)
- Generate some base classifiers
- Combine them to make the final decision
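The combination step can be sketched as a plain majority vote over the base classifiers' outputs:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine base-classifier outputs by simple majority voting;
    ties resolve to the earliest-seen label."""
    return Counter(predictions).most_common(1)[0][0]
```

Weighted variants (as in boosting) replace the raw count with a per-classifier weight.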
58. Bagging Algorithm
- Use multiple versions of a training set D of size N, each created by resampling N examples from D with the bootstrap
- Each of the data sets is used to train a base classifier; the final classification decision is made by the majority vote of these classifiers
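The bootstrap resampling step can be sketched as follows (the `seed` parameter is an addition for reproducibility, not part of the algorithm):

```python
import random

def bootstrap_sets(data, n_sets, seed=0):
    """Create n_sets training sets, each by resampling len(data)
    examples from data with replacement (the bootstrap)."""
    rng = random.Random(seed)  # seeded only for reproducibility
    return [[rng.choice(data) for _ in data] for _ in range(n_sets)]
```

Each resampled set would then train one base classifier, and their outputs are combined by majority vote.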
59. AdaBoost
- Main idea
  - The main idea of this algorithm is to maintain a distribution, or set of weights, over the training set. Initially, all weights are set equally, but in each iteration the weights of incorrectly classified examples are increased so that the base classifier is forced to focus on the hard examples in the training set. For correctly classified examples, the weights are decreased so that they are less important in the next iteration.
- Why ensembles can improve performance
  - Uncorrelated errors made by the individual classifiers can be removed by voting.
  - Our hypothesis space H may not contain the true function f. Instead, H may include several equally good approximations to f. By taking weighted combinations of these approximations, we may be able to represent classifiers that lie outside of H.
60. AdaBoost Algorithm
Given m examples (x_1, y_1), ..., (x_m, y_m) where x_i ∈ X and y_i ∈ {-1, +1}.
Initialize D_1(i) = 1/m for all i = 1, ..., m.
For t = 1, ..., T:
- Train a base classifier h_t using distribution D_t, with error ε_t = Pr_{i ~ D_t}[h_t(x_i) ≠ y_i].
- Choose α_t = (1/2) ln((1 - ε_t) / ε_t).
- Update D_{t+1}(i) = D_t(i) exp(-α_t y_i h_t(x_i)) / Z_t, where Z_t is a normalization factor (chosen so that D_{t+1} will be a distribution).
Output the final hypothesis H(x) = sign(Σ_t α_t h_t(x)).
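The loop above can be sketched end to end in Python. One-dimensional threshold stumps are an assumed base classifier (any weak learner would do), and the epsilon guard for a perfect stump is an implementation detail, not part of the algorithm:

```python
import math

def train_stump(xs, ys, w):
    """Pick the 1-D threshold stump (threshold, polarity) with the lowest
    weighted error; labels are in {-1, +1}."""
    best = (float("inf"), None, None)
    for thr in xs:
        for pol in (1, -1):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if pol * (1 if xi >= thr else -1) != yi)
            if err < best[0]:
                best = (err, thr, pol)
    return best

def adaboost(xs, ys, T=5):
    """AdaBoost over threshold stumps: equal initial weights, raise
    weights of misclassified examples, renormalize with Z_t."""
    m = len(xs)
    w = [1.0 / m] * m
    ensemble = []  # (alpha_t, threshold, polarity) per round
    for _ in range(T):
        err, thr, pol = train_stump(xs, ys, w)
        err = max(err, 1e-10)  # guard against log(0) on a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, pol))
        # D_{t+1}(i) = D_t(i) * exp(-alpha * y_i * h_t(x_i)) / Z_t
        w = [wi * math.exp(-alpha * yi * pol * (1 if xi >= thr else -1))
             for xi, yi, wi in zip(xs, ys, w)]
        z = sum(w)  # normalization factor Z_t
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    """Final hypothesis H(x) = sign(sum_t alpha_t * h_t(x))."""
    s = sum(a * p * (1 if x >= t else -1) for a, t, p in ensemble)
    return 1 if s >= 0 else -1

ens = adaboost([0, 1, 2, 3, 4, 5], [-1, -1, -1, 1, 1, 1], T=3)
```

On this toy data a single stump at threshold 3 already separates the classes; on harder data, later rounds concentrate on the examples the earlier stumps got wrong.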
61. Analysis of Voting Algorithms
- Advantages
  - surprisingly effective
  - robust to noise
  - decrease the over-fitting effect
- Disadvantages
  - require more computation and memory
62. Decision Tree Learning
- Learn a sequence of tests on features, typically using top-down, greedy search
- At each stage, choose the unused feature with the highest information gain (feature/class mutual information)
- Binary (yes/no) or continuous decisions
[Diagram: a tree splitting first on f1 / !f1, then on f7 / !f7]
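The greedy criterion above, information gain for a boolean feature, can be sketched as parent entropy minus the weighted entropy of the two child subsets (the `(feature_dict, label)` example format is an assumption for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a non-empty label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, feature):
    """Information gain of a boolean feature: parent entropy minus the
    weighted entropy of the two child subsets.
    examples: list of (feature_dict, label) pairs."""
    labels = [y for _, y in examples]
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for f, y in examples if f.get(feature, False) == value]
        if subset:
            gain -= len(subset) / len(examples) * entropy(subset)
    return gain

examples = [({"f1": True}, "pos"), ({"f1": True}, "pos"),
            ({"f1": False}, "neg"), ({"f1": False}, "neg")]
```

Here `f1` separates the classes perfectly (gain of one full bit), while an absent feature like `f7` gains nothing, so the greedy search would split on `f1` first.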
63. Performance Measures
- Performance of the algorithm
  - training time
  - testing time
- Classification accuracy
  - precision, recall
  - micro-average / macro-average
  - break-even point (precision = recall)
- Goal: high classification quality and computational efficiency
64. Elements of Reinforcement Learning
[Diagram: agent with policy interacting with the environment]
- Transition model: how actions influence states
- Reward R: immediate value of a state-action transition
- Policy π: maps states to actions
65. Artificial Life & Brain
[Korean slide] Three principles of artificial life:
- development, e.g. L-systems, cellular automata
- learning, e.g. reinforcement learning, classifier systems
- evolution, e.g. evolutionary algorithms, co-evolution, DNA coding
66. Classification Performance Measures
- Given n test documents and m classes in consideration, a classifier makes n × m binary decisions. A two-by-two contingency table can be computed for each class:

             | truly YES | truly NO
  system YES |     a     |    b
  system NO  |     c     |    d

- Recall = a/(a+c), where a+c > 0 (otherwise undefined).
  - Did we find all of those that belonged in the class?
- Precision = a/(a+b), where a+b > 0 (otherwise undefined).
  - Of the times we predicted it was in the class, how often were we correct?
- Accuracy = (a + d) / n
  - When one class is overwhelmingly in the majority, this may not paint an accurate picture.
- Others: miss, false alarm (fallout), error, F-measure, break-even point, ...
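The three measures fall directly out of the contingency table; returning `None` for the undefined cases mirrors the caveats above:

```python
def metrics(a, b, c, d):
    """Per-class measures from the 2x2 contingency table:
    a = system YES / truly YES,  b = system YES / truly NO,
    c = system NO  / truly YES,  d = system NO  / truly NO."""
    n = a + b + c + d
    recall = a / (a + c) if a + c else None     # undefined when a + c = 0
    precision = a / (a + b) if a + b else None  # undefined when a + b = 0
    accuracy = (a + d) / n
    return recall, precision, accuracy
```

Note the majority-class caveat: with a = b = c = 0 the accuracy is a perfect 1.0 even though recall and precision are both undefined.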
67. Data Mining Approach
[Korean diagram]
68. [Korean title]: Background, Results
69. [Korean planning notes]
70Related Work Main
- Ranking systems (see references)
- Feature extraction (feature vectors)
- Boosting-based methods
- References
71Misclassification
- Misclassification: where misclassified documents come from and how to handle them
- Sampling-based approaches
- Learning by error correction
72Motivation
- Most documents have highly complex content, with multiple topics and features
- Complex reports, news material, biological literature
- Various kinds of algorithms have been proposed for this problem; however, the results are not satisfactory
73Background(1)
- Classification problem
- Objective: classify a new object into one of the predefined categories using the generated classifier or rule
- Definition: D (documents), C (categories)
- Border problem in complex data
- Automated classification problem: manual classification is costly and error-prone, motivating automated methods
74Motivation Problem Definition
- Classification problem
- Goal: find an optimal decision line or surface while reducing manual intervention
- Low training cost
- High accuracy
- In real-world problems
- Representation of a document: a bag of words
- Errors in training sets/samples may occur even though we have efficient learning methods and human experts
- Most documents have highly complex content with multiple topics and features
- Complex reports, spam letters, news material, biological literature
- Various kinds of algorithms and methods have been proposed for this problem; however, the results are not satisfactory
- Accuracy is high when the data fits the model well
- Assignment method by a simple ranking system
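The "bag of words" representation named above can be sketched in a few lines (the example documents are invented):

```python
# Illustrative sketch: a document reduces to word counts, so word order
# (and much of the meaning) is discarded -- the root of the border problem.
from collections import Counter

def bag_of_words(doc):
    return Counter(doc.lower().split())

d1 = bag_of_words("cheap offer cheap loan")
d2 = bag_of_words("loan offer cheap cheap")
print(d1 == d2)   # True: different word order, identical bag of words
```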
75Problem Definition
- Maeng et al., SIGIR 1998; Navarro and Baeza-Yates, SIGIR 1995
- Model for querying and indexing
- Need to analyze similarity between documents for mining (classification and clustering)
- Yi and Sanderson, SIGKDD 2000
- Target: uniform structure from one database or website
- Problem: a classification model that does not consider text semantics
- may not deal with documents with various concepts
- cannot identify similar meanings
76Background Data Classification vs. Text Classification
(Example table: records 1-4 with features Age, Income, Car and target Class)
77Approach Type of Misclassification
- Training errors causing misclassification (Brownell)
- Distortion
- Incompleteness
- Uncertainty
- Absence
- Vagueness
- Probability
- Ambiguity
- Fuzziness
- Non-specificity
78Motivation
- Recent documents
- Similar symptoms -> different diseases; similar feature vectors -> different categories
- Multiple concepts
- Ranking system
79- Similar symptoms -> different diseases: documents with similar feature vectors can belong to different categories
- Example: an article on the "Musso Sports" fits several categories at once
- RV/SUV
- Sports
- Existing decision rules cannot resolve such overlaps well
80Category Overlap Example
(Diagram: the Musso Sports example spans the RV and SUV categories)
81Proposed System
82Proposed Method
Step 2: Text Mining
83Refinement Method using Text/Data Mining System
- Proposed system
- Objective: maximize accuracy, minimize training cost
- Methodology: automated text classification based on a knowledge discovery process using reinforcement learning and post-processing
- Target data: a set of documents which have multiple concepts (uncertain, complex documents)
84Implementation of Refinement Text/Data Mining System
- Part 1: Hierarchical Construction in Training Method
- Part 2: Post-Processing by Reinforcement Learning
85Part 1 Reinforcement Training Method Progress
- Define the target category scheme
- Definition 1: C = {c1, c2, ..., cn} is the set of final target categories, where any ci and cj are mutually disjoint.
- Definition 2: SCn = {cn1, cn2, ..., cnk} is the set of subcategories of target category cn, where the cnj are mutually disjoint.
- Definition 3: X = {x1, x2, ..., xn-1} is the set of intermediate categories.
- Data located around the decision boundary belongs to X; unclassified documents are also denoted by X.
Organizing method of training data
Documents or Sentences
86Part 2 Post-Processing - Assignment
- Goal
- Overcome the problems and limitations of the traditional method with a mining approach focused on risk-minimization analysis.

Assigning a category to documents using the initial score:
- Input: document Di; candidate category list Li, normalized and re-sorted in descending order
- Step 1: for i = 0 to N (number of input documents)
-   if (Di.size >= min_support) and (Li[1].score >= min_value)
-     and (Li[1].score - Li[2].score >= diff_value) then
-       assign Di to Li[1]
-   else assign Di to X
- Step 2: for each document Di unassigned in Step 1
-   for each target category (k = 0 to M)
-     calculate the distance between P and cnk
-   assign Di to the closer side cn
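A minimal runnable reading of Step 1 (the three thresholds are invented illustrative values; the slide leaves them unspecified):

```python
# Sketch of the Step 1 assignment rule: assign the top-ranked candidate
# only when the document is large enough, the top score is high enough,
# and the gap to the runner-up is wide enough; otherwise defer to X.
MIN_SUPPORT, MIN_VALUE, DIFF_VALUE = 50, 0.5, 0.2   # illustrative thresholds

def step1_assign(doc_size, candidates):
    """candidates: list of (category, score), sorted by score descending."""
    (c1, s1), (_, s2) = candidates[0], candidates[1]
    if doc_size >= MIN_SUPPORT and s1 >= MIN_VALUE and (s1 - s2) >= DIFF_VALUE:
        return c1       # confident: assign the top-ranked category
    return "X"          # uncertain: defer to Step 2

print(step1_assign(726, [("C21", 0.98), ("C11", 0.01)]))  # C21
print(step1_assign(31,  [("C21", 0.39), ("C12", 0.20)]))  # X
```

With these thresholds the two calls reproduce rows 1 and 2 of the recomputation table: the confident document is assigned C21, while the low-score, small-gap document falls to X for Step 2.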
87Part 2 Post-Processing Recomputation
- Make another training set from the candidate lists of documents
- Perform data mining analysis with these uncommon patterns

Di  Rank 1 (wm .02)  Rank 2 (wm .15)  Rank 3 (wm .25)  Rank 4 (wm .31)  Rank 5 (wm .35)  Di.size  Step 1     Step 2  Assign  Actual Class
1   C21 .98          C11 .01          X1 .01           X2 .01           C12 .00          726.33   C21 -> C2  -       C2      C2
2   C21 .39          C12 .20          X2 .17           C11 .13          X1 .10           31.6     X          C2      C2      C2
3   X2 .29           C11 .28          C12 .17          C21 .15          X1 .01           514.42   X          C1      C1      C1
4   X1 .28           C21 .23          X2 .17           C11 .16          C12 .15          287.12   X          C2      C2      C2
88Part 2 Post-Processing Evaluation and Feedback
- Evaluation matrix for effectiveness by variance of results
- A: Cn1.step1 = X
- B: Cn1.step1 = True
- C: (Cn11.step1 = True) and (Cn1.step2 = True)
- D: (Cn1.step1 = False) and (Cn1.step2 -> Cn2)

Results from each step (feedback time 1):
Location  Input Data  Cn1.step1  Cn1.step2  Cn2  Cn11.step1  Cn11.step2  Evaluation
d1        X           1          1          1    -           -           Good
d2        X           0          1          1    -           -           Good
d3        X           0          0          1    -           -           Poor
d4        1           1          1          1    -           -           Fair
d5        1           0          1          1    -           -           Fair
d6        1           1          0          1    -           -           Poor
d7        1           0          0          1    -           -           Poor
d8        0           1          1          1    -           -           Good
d9        0           0          1          1    -           -           Good
d10       0           1          0          1    -           -           Poor

where Cne.process: n = 0, 1, 2 is the feedback time; e is the type of input data (1 = documents, 2 = candidate lists of documents); process is step1 or step2
89Evaluation Methodology and Performance Measure
- Performance goal: high classification quality and computational efficiency
- Training time, testing time
- Classification accuracy
- Precision, recall
- Micro-average / macro-average
- Precision/recall break-even point
- We use
- Measure: Accuracy = True Cases / N
- The test set does not contain training data
- Conditions
- Same classifiers: NB, SVM
- With and without post-processing
- With and without noisy data in the training samples
90Contribution
- Improvement of automated text classification using a refinement training and post-processing method
- Related areas: construction methods for training datasets, machine learning, text mining, data mining, information retrieval
- Does not depend on particular classification or feature selection algorithms
- Main objective
- Provide high performance and reliability while reducing human intervention
- Application areas
- Data that is difficult to classify
- Anti-spam systems
91Classification Learning
(Diagram: a classifier assigns incoming documents to category1-category4 in the full category schema)
92Background Classification Example
93Comparison Based on Six Classifiers
- Classification accuracy of six classifiers (Reuters-21578 collection)
- SVM, Voting, and KNN showed good performance; DT, NB, and Rocchio showed relatively poor performance

Setup       1          2          3          4
Author      Dumais     Joachims   Weiss      Yang
Training    9603       9603       9603       7789
Test        3299       3299       3299       3309
Topics      118        90         95         93
Indexing    Boolean    tfc        Frequency  ltc
Selection   MI         IG         -          χ2
Measure     Breakeven  Microavg.  Breakeven  Breakeven

Classifier  1          2          3          4
Rocchio     61.7       79.9       78.7       75
NB          75.2       72         73.4       71
KNN         N/A        82.3       86.3       85
DT          N/A        79.4       78.9       79
SVM         87         86         86.3       N/A
Voting      N/A        N/A        87.8       N/A
94Comparison Based on Feature Selection
- Classification accuracy: NB vs. KNN vs. SVM (Reuters collection)

# of features   NB            KNN           SVM
10              48.66 ± 0.10  57.31 ± 0.20  60.78 ± 0.17
20              52.28 ± 0.15  62.57 ± 0.16  73.67 ± 0.11
40              59.19 ± 0.15  68.39 ± 0.13  77.07 ± 0.14
50              60.32 ± 0.14  74.22 ± 0.11  79.02 ± 0.13
75              66.18 ± 0.19  76.41 ± 0.11  83.00 ± 0.10
100             77.90 ± 0.19  80.20 ± 0.09  84.30 ± 0.12
200             78.26 ± 0.15  82.50 ± 0.09  86.94 ± 0.11
500             80.80 ± 0.12  82.19 ± 0.08  86.59 ± 0.10
1000            80.88 ± 0.11  82.91 ± 0.07  86.31 ± 0.08
5000            79.26 ± 0.07  82.97 ± 0.06  86.57 ± 0.04
95The Main Approaches to Automated Text Classification
- The machine learning approach
- Build a classifier for class ci by observing the properties of the set of documents manually pre-classified under ci (learning)
- Major drawback: learning cost (human experts, learning algorithms)
- i.e., a human expert is needed to define the concepts in each different domain (expert systems)
- The knowledge engineering approach
- Needs a large set of rules: if <...> then <category>
- Rules are manually constructed
- Major drawback: the knowledge acquisition bottleneck
- i.e., how do we deal with new categories, different domains, etc.?
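The knowledge-engineering approach amounts to hand-written term rules; a minimal sketch (the rules and categories are invented):

```python
# Illustrative sketch of manually constructed "if terms then category"
# rules: a rule fires only when every one of its terms is present.
RULES = [
    ({"wheat", "grain"}, "agriculture"),
    ({"dollar", "rate"}, "finance"),
]

def rule_classify(doc, default="unknown"):
    words = set(doc.lower().split())
    for terms, category in RULES:
        if terms <= words:        # all rule terms appear in the document
            return category
    return default

print(rule_classify("wheat and grain prices rise"))  # agriculture
print(rule_classify("a new grain silo"))             # unknown: rule needs both terms
```

Every new category or domain means writing and maintaining more such rules by hand, which is exactly the knowledge acquisition bottleneck named above.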
96Uncertainty
- Uncertainty problem
- The true value is unknown
- Too complex to compute before making a decision
- A characteristic of real-world applications
- Sources of uncertainty
- Cannot be explained by a deterministic model
- e.g., decay of radioactive substances
- Not well understood
- e.g., disease transmission mechanisms
- Too complex to compute
- e.g., classification of complex documents
97Approach(2/A) Selection Problem in Training
- Training errors causing misclassification (Brownell)
- Distortion
- Incompleteness
- Absence
- Uncertainty
- Confidence
- Randomness
- Vagueness
- Which side will come up if I toss a coin?
- How confident are you in your decision?