Topic: Refinement Method of Post-processing and Training for Improvement of Automated Classification

Transcript and Presenter's Notes
1
Topic: Refinement Method of Post-processing and
Training for Improvement of Automated
Classification
  • Yun Jeong Choi
  • Dept. of Computer Science & Engineering
  • Ewha Womans University

2
Contents
  • Background
  • Text Classification Framework
  • Classification Problem
  • Related Work
  • Motivation
  • Problem Area
  • Approach Strategy
  • Proposed Method
  • Refinement Text Classification System
  • Part 1: Reinforcement Training Method
  • Part 2: Post-Processing Method for Assigning
    and Feedback Analysis
  • Experimental Results
  • Discussion & Next Plan

3
Background: Text Classification
  • Objective: Assign a new document to one of the
    predefined categories using the generated
    classifier or rule
  • Given
  • a document dj ∈ D, where D is the domain of documents
  • a description of an instance, x ∈ X, where X is the
    instance space (feature vectors)
  • E.g., how to represent text documents
  • a fixed set of categories C = {c1, c2, ..., cn}
  • Determine
  • The category of x: c(x) ∈ C, where c(x) is a
    categorization function (classifier)
  • whose domain is X and whose range is C
  • We want to
  • Build a classifier that assigns these values
  • assign a Boolean value to the pair <dj, ci>

(Diagram: new documents → preprocessing → indexing → classification algorithm (classifier) → assign.)
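As a toy illustration of the categorization function c: X → C defined above, here is a minimal sketch; the training data, the bag-of-words representation, and the overlap heuristic are all illustrative assumptions, not the method proposed in this work.

```python
# A toy categorization function c: X -> C. Everything here is illustrative.
from collections import Counter

categories = ["c1", "c2"]
train_docs = [("cell cycle arrest in tumor cells", "c1"),
              ("stock market quarterly report", "c2")]

def to_features(text):
    """Represent a document as a bag-of-words feature vector."""
    return Counter(text.lower().split())

def classify(doc):
    """c(x): pick the category whose training words overlap most with doc."""
    x = to_features(doc)
    def overlap(cat):
        words = Counter()
        for text, c in train_docs:
            if c == cat:
                words.update(to_features(text))
        return sum(min(x[w], words[w]) for w in x)
    return max(categories, key=overlap)

print(classify("report on the cell cycle"))  # -> c1
```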
4
Background: Data Classification vs. Text
Classification
(Figure: data classification as supervised prediction, e.g. target marketing with input variables such as Annual Income, a role/target assignment per variable, and cases; contrasted with text classification.)
5
Background: Representing Documents
  • Usually, an example is represented as a series of
    feature-value pairs. The features can be
    arbitrarily abstract (as long as they are easily
    computable) or very simple.
  • For example, the features could be the set of all
    words, and the values (weighting scores) their
    numbers of occurrences in a particular document.
  • Learning is the process of modifying the weights.

(Figure: bag-of-words representation of a document, e.g. Computer:1, Machine:1, AI:1, Information:2, Korea:1, U.S.:1, ...)
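A minimal sketch of this feature-value representation (word counts as values; the naive whitespace tokenization is an assumption):

```python
# Bag-of-words: map a document to {feature: value} pairs (word -> count).
from collections import Counter

def bag_of_words(text):
    return Counter(text.lower().split())

doc = "Computer Machine AI Information Information Korea"
print(bag_of_words(doc))
# Counter({'information': 2, 'computer': 1, 'machine': 1, 'ai': 1, 'korea': 1})
```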
6
Background: Text Classification Framework
(Diagram: training path: training documents → preprocessing → indexing → feature selection → classifier; classification path: new documents → preprocessing → indexing → classifier → assign.)
7
Related Work (1/3)
  • Approach to improving performance based on
    classification algorithms
  • In a Classification model
  • Weighting scheme in feature extraction and
    selection
  • Indexing method
  • Probabilistic computation
  • Problem
  • Vector-space model (considering only term
    frequency)
  • Cannot consider semantics
  • Simple assign method
  • In several classification model
  • Competition & Voting
  • Select the better one among weak classifiers
  • Ensemble framework
  • Combination of some classifiers
  • Cooperative models and classifiers
  • Regression based on Least Squares Fit (1991)
  • Nearest Neighbor Classification (1992)
  • Bayesian Probabilistic Models (1992)
  • Symbolic Rule Induction (1994)
  • Neural Networks (1995)
  • Rocchio approach (traditional IR, 1996)
  • Support Vector Machines (1997)
  • Maximum Entropy (1999)
  • Hidden Markov Models (1999)
  • Error-Correcting Output Coding (1999)...

8
Related Work (2/3)
  • Approach to improving performance based on
    learning algorithms
  • Constructing method for Sample data set and
    training set
  • Selection Problem for sample documents
  • Ensemble structure
  • Bootstrapping and Bagging Algorithms
  • Adaboost
  • Active Learning
  • Lijuan Cai, Thomas Hofmann, ACM SIGIR 2003:
  • Text Categorization by Boosting Automatically
    Extracted Concepts
  • Ioannis Tsochantaridis, ICML 2004:
  • Support Vector Machine Learning for
    Interdependent and Structured Output Spaces

Combine several weak classifiers → make a strong
classifier (or choose the stronger classifier)
9
Related Work (3/3)
  • Approach to improving performance with assigning
    methods for complex documents
  • Assigning method: Wimbledon Championship
    Tournament method
  • Yunqing Xia, Wei Liu, NLDB, 2005:
  • Email Categorization with Tournament Methods
  • Modifying algorithms for specific domain problems
  • Haoran Wu, Tong Heng Phang, SIGKDD, 2002:
  • A Refinement Approach to Handling Model Misfit in
    Text Categorization
  • Anti-Spam Classifier
  • Optimization and combination methods
  • Trial and Error → Feedback
  • Evaluation and Feedback in a Data Mining Strategy

10
Motivation: Some Difficulties in Classification
  • Human experts make different decisions about the
    same situation.
  • At a hospital: error risk (businessman's risk)
  • Vehicle classification: a subjective subject!
  • A meaningless argument, as many vehicles fall
    between classes or even outside all of them

(Figure: data/documents ranging from very easy cases for machine and human to hard ones: disease diagnostics with similar symptoms, movies, vehicles.)
11
Problem Definition (A/2)
  • Problem A: Characteristics of Documents
  • Most documents have high complexity & uncertainty
    in contents, with
  • multiple concepts and features
  • Complex reports, spam letters, news materials,
    biological literature, etc.
  • Training errors may happen even while we have
    efficient learning methods and human experts
  • High cost in training time → inaccuracy
  • Problem B: Classification Method / Performance
    Measure
  • Not important: what kind of classifier was used
  • Classifiers and algorithms have limited power of
    analysis when dealing with documents as a bag of
    words.
  • Accuracy is high only when the data fits the model
    well (a specific model for a specific domain area)
  • Assignment by a simple ranking method
  • Not important: how many documents were used for
    training
  • Important: how informative the selected or
    constructed documents were

12
Approach (1/A): Border Problem in Uncertain Data
  • To find the optimal boundary
  • The classifier makes rules for classification
  • A line or an area
  • Goal
  • More accurate
  • Automatic process
  • But data size and complexity grow fast, which
    makes finding the optimal line more
    difficult!
  • In this (red) area lie complex data which have
    multiple concepts or no meaning
  • Finding a decision rule or line

13
Strategy (1/A): Hierarchical Structure in Class
Definition
  • Define the target category scheme using a hierarchical
    scheme
  • Final target categories: C1, C2
  • Intermediate categories: X
  • Subcategories of the target categories: C11, C12,
    C21, C22

(Figure: hierarchical category scheme over a document.)
14
Training Algorithms
  • How many documents?
  • How informative documents were selected or
    constructed!
  • Training Algorithms deal with
  • Selection
  • Combination
  • Number of training documents

(Figure: simple random selection vs. active sample selection, which contains uncertainty samples.)
15
Active Sample Selection vs. Random Selection
16
Problem Definition (B/2)
  • Problem A: Characteristics of Documents
  • Most documents have high complexity & uncertainty
    in contents, with
  • multiple concepts and features
  • Complex reports, spam letters, news materials,
    biological literature, etc.
  • Training errors may happen even while we have
    efficient learning methods and human experts
  • High cost in training time → inaccuracy
  • Problem B: Classification Method
  • Not important: what kind of classifier was used
  • Classifiers and algorithms have limited power of
    analysis when dealing with documents as a bag of
    words.
  • Accuracy is high only when the data fits the model
    well (a specific model for a specific domain area)
  • Too simple an assignment method, based on a ranking system

17
Approach (1/B): Assign Process
  • The assign process in text classification
  • Is the 1st candidate category always right?
  • Analyze ranking order and score

Doc | Rank 1 | Rank 2 | Rank 3 | Rank 4 | Rank 5 | Target (C1, C2, U, X)
1 | C21 (29) | C11 (28) | X (17) | C22 (15) | C12 (10) | C2 → U
2 | C22 (97) | C11 (1) | X (1) | C21 (1) | C12 (0) | C2
18
Strategy (B): Post-Processing for Assignment
  • Goals in each step
  • Step 1: Classify the 1st candidate category of a
    document
  • Minimum threshold: the gap between 1st and 2nd by
    rank score (probability score)
  • Step 2: Classify the undecided cases from Step 1
  • Compute the distance between X (intermediate
    category) and the other categories
  • Step 3: Gather common / uncommon cases from Steps 1
    and 2
  • Ranking order from text classification → a new
    representation of the document:
  • <category, score> vs. <term, frequency>
  • Step 4: Analyze the experience from Steps 1-3
    → feedback to the training process
  • Modify the classification rule
  • Add / eliminate training documents

Classify reliable ones vs. unreliable ones
19
Strategy (1/B): Post-Processing for Assignment
  • Goal
  • Overcome the problems and limitations of the
    traditional method with a mining approach
    focused on risk-minimization analysis.

Assign a category to documents using the initial
score (a code sketch follows the pseudocode)
  • Input: document Di, candidate category list Li,
    normalized and re-sorted in descending order
  • Step 1: for i = 0 to N (number of input
    documents)
  • if (Di.size >= min_support) and ((Li[1].score >=
    min_value)
  • or (Li[1].score - Li[2].score >= diff_value)) then
  • assign Di to the 1st candidate category
  • else assign Di to U (undecided)
  • Step 2: for n = 0 to N (number of unassigned
    documents in Step 1)
  • for k = 0 to K (number of target categories)
  • calculate the distance between the
    pivot category X and cnk
  • assign Di to the closer side cn
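A runnable sketch of this two-step assignment. The threshold values and the C21 → C2 parent mapping are illustrative assumptions, and Step 2 below substitutes a simple per-target score aggregation for the pivot-distance computation, which the slides leave unspecified.

```python
# Two-step post-processing assignment (sketch; thresholds are assumptions).
MIN_SUPPORT, MIN_VALUE, DIFF_VALUE = 10, 0.5, 0.3

def step1(doc_size, ranked):
    """ranked: [(category, score), ...] sorted by descending score."""
    (top_cat, top), (_, second) = ranked[0], ranked[1]
    if doc_size >= MIN_SUPPORT and (top >= MIN_VALUE or top - second >= DIFF_VALUE):
        return top_cat        # confident: accept the 1st candidate
    return "U"                # undecided: defer to Step 2

def step2(ranked, targets=("C1", "C2")):
    """Aggregate subcategory scores per target and pick the stronger side.
    The C21 -> C2 parent mapping is an assumption based on the slides' naming."""
    total = {t: 0.0 for t in targets}
    for cat, score in ranked:
        parent = cat[:2]
        if parent in total:
            total[parent] += score
    return max(targets, key=total.get)

ranked = [("C21", 0.39), ("C12", 0.20), ("X2", 0.17), ("C11", 0.13), ("X1", 0.10)]
label = step1(doc_size=31, ranked=ranked)
if label == "U":
    label = step2(ranked)
print(label)  # -> C2 (matches row 2 of the table on the next slide)
```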

20
Strategy (2/B): Post-Processing Recomputation
  • Make another training data set from the candidate
    lists of documents
  • Perform data classification with these uncommon
    patterns

Di | 1 (wm=0.02) | 2 (wm=0.15) | 3 (wm=0.25) | 4 (wm=0.31) | 5 (wm=0.35) | Di.size | Step1 | Step2 | Step3 | Actual class
1 | C21 .98 | C11 .01 | X1 .01 | X2 .01 | C12 .00 | 726.33 | C21→C2 | - | C2 | C2
2 | C21 .39 | C12 .20 | X2 .17 | C11 .13 | X1 .10 | 31.6 | U | C2 | C2 | C2
3 | X2 .29 | C11 .28 | C12 .17 | C21 .15 | X1 .01 | 514.42 | U | C1 | C1 | C1
4 | X1 .28 | C21 .23 | X2 .17 | C11 .16 | C12 .15 | 287.12 | U | C2 | C2 | C2
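A sketch of how each candidate list could be turned into an example <order of candidate classes in a document, actual class> for this second-stage classification; the feature layout is an assumption:

```python
# Build second-stage training examples from ranked candidate lists (sketch).
rows = [
    ([("C21", .39), ("C12", .20), ("X2", .17), ("C11", .13), ("X1", .10)], "C2"),
    ([("X2", .29), ("C11", .28), ("C12", .17), ("C21", .15), ("X1", .01)], "C1"),
]

def to_example(candidates, actual):
    """<an order of candidate classes in a document, actual class>"""
    features = {f"rank{r}_{cat}": score
                for r, (cat, score) in enumerate(candidates, start=1)}
    return features, actual

meta_train = [to_example(c, y) for c, y in rows]
print(meta_train[0][0])  # {'rank1_C21': 0.39, 'rank2_C12': 0.2, ...}
```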
21
Multi-Training Strategy
  • where Cn.process:
  • n = feedback time (0, 1, 2, ...)
  • process = step1, step2, step3
  • Cn.process = 1 when actual class = predicted class
  • Cn.process = 0 when actual class ≠ predicted class
  • Training data
  • Document set used in text classification:
    Cn.step1, Cn.step2
  • Examples: <document, class>
  • Common/uncommon cases for data classification in
    step1, step2: Cn.step3
  • Examples: <an order of candidate classes in a
    document, actual class>
  • Results of the above classification of each
    document, used with Reinforcement Learning
  • Examples: <<state, action>, reward>
  • Evaluation function for estimating accuracy
  • Fault detection and error correction

Target evaluation per document (results from each step, feedback time 1):

Location | Input | Cn.step1 | Cn.step2 | Cn.step3 | Cn+1.step1 | Cn+1.step2 | Evaluation
d1  | X | 1 | 1 | 1 | - | - | Good
d2  | X | 0 | 1 | 1 | - | - | Good
d3  | X | 0 | 0 | 1 | - | - | Poor
d4  | 1 | 1 | 1 | 1 | - | - | Fair
d5  | 1 | 0 | 1 | 1 | - | - | Fair
d6  | 1 | 1 | 0 | 1 | - | - | Poor
d7  | 1 | 0 | 0 | 1 | - | - | Poor
d8  | 0 | 1 | 1 | 1 | - | - | Good
d9  | 0 | 0 | 1 | 1 | - | - | Good
d10 | 0 | 1 | 0 | 1 | - | - | Poor

22
Approach (2/B): Error Correction and Feedback
  • Reinforcement Learning
  • Learning from interaction with the environment
  • Delayed reward
  • Trial and error
  • Learning by error correction
  • Prediction → Prediction → Prediction → Conservation
    of experience
  • Transition model: how actions influence states
  • Reward R: immediate value of a state-action
    transition
  • Policy π: maps states to actions

(Diagram: agent-environment loop governed by the policy, with numbered state-action transitions.)
Environment
23
Strategy(3/B) Post-Processing Evaluation
Feedback
  • Evaluation matrix for relevance feedback
  • Goal Estimate Accuracy , Efficiency,
    Over-fitting.
  • Action Add/Eliminate documents -gt Modify class
    definition

(Table: the same per-document step results as on slide 21.)

Good(di) = 1 if ((Cn,1.step1 = X) and (Cn+1,1.step1 = True))
             or (Cn,1.step2 = True); 0 otherwise
Fair(di) = 1 if (Cn,1.step1 = True) and ...; 0 otherwise
Poor(di) = 1 if (Cn,1.step1 = False) and (Cn,1.step2 ≠ Cn,2); 0 otherwise
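These predicates as a small code sketch. The transcript truncates the Fair condition, so completing it with the step-1 test alone is an assumption, as is reading the lost connective in Good as an "or":

```python
# Good/Fair/Poor evaluation predicates (sketch; see assumptions above).
# r maps (feedback_time, field) -> outcome; "X" marks an undecided step-1 result.
def good(r, n):
    return (r[(n, "step1")] == "X" and r[(n + 1, "step1")] is True) \
           or r[(n, "step2")] is True

def fair(r, n):
    # the slide's Fair condition is truncated; step-1 correctness alone is assumed
    return r[(n, "step1")] is True and not good(r, n)

def poor(r, n):
    return r[(n, "step1")] is False and r[(n, "step2")] != r[(n, "actual")]

r = {(0, "step1"): "X", (0, "step2"): True, (0, "actual"): "C2",
     (1, "step1"): True}
print(good(r, 0), fair(r, 0), poor(r, 0))  # True False False
```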
24
Summary: Concept of the Proposed Method
  • We don't care about and don't depend on
  • The type of classifier and its performance
  • E.g., the functional definition of the similarity
    measure: COS, Euclidean, kernel functions, ...
  • How many extracted features we consider
  • These didn't solve the complexity and uncertainty problem
  • We deal with
  • Definition of classes based on hierarchical
    classification
  • Construction of the training set for active learning
  • Post-processing analysis for improving the
    traditional assignment method based on a simple
    ranking system
  • Feedback to the training process and schema by
    reinforcement training

25
Proposed Method: Concepts
(Diagram: new documents → preprocessing → indexing → classifier → assign, with a performance measure attached.)
How about classification power? Add or
eliminate documents in a class.
  • Assign problem
  • Post-processing analysis for assignment
  • Improving the simple ranking system by order of
    score
  • Feedback to the training process by reinforcement
    training (error correction)
  • Multi-training strategy
  • Re-classification using patterns of
    text-classification results
  • Preprocessing
  • Define the class scheme
  • Target category vs. intermediate category
    vs. sub-category
  • Hierarchical construction of classes

26
Contribution
  • More accurate determination in a segment
  • An important basis for uncertain data with similar
    concepts and features
  • Reduces error risk
  • Proposed method for automated text classification
  • Main objective: provide high performance and
    reliability in the system while reducing human
    intervention
  • Can be used for many applications
  • Other classification problems, IR, bioinformatics,
    as well as data classification / text
    classification
  • Improves class quality
  • Class quality: accuracy
  • Stability against training error
  • Reduced learning cost
  • Over-fitting problem: stop learning, disuse
    learning

27
Experimental Results using the Proposed Method
- Newsgroup Data
- Biological Data
28
Implementation of Refinement Text/Data Mining
System
  • Part 1: Hierarchical Construction in the Training
    Method
  • Part 2: Post-Processing by Reinforcement Learning

29
Experimental Method on Newsgroup Data
  • Characteristics of the 20 Newsgroups dataset
  • 5 groups: Computer, Science, Recreation, Talk, and
    Misc.
  • There are partially multi-labeled documents, and
  • each class has more than two subclasses.
  • We use a subset of the Computer group
  • C = {comp.graphics, comp.sys.ibm.pc.hardware,
    comp.sys.mac.hardware, comp.windows.x}
  • 200 training documents for each target category,
    50 documents for the intermediate category, adding
    noise of about 10% of the total.
  • We use
  • Accuracy = true cases / N
  • The test set did not contain training data

30
Experimental Condition and Evaluation
  • Condition
  • Same classifiers: NB, SVM
  • With and without post-processing
  • With and without noisy data in the training samples
  • Definition of categories and experimental conditions

Exp | Target category (C) | # docs for intermediate category (X) | # of subcategories | Correct docs | Incorrect docs (10%) | Base classifier | Method
E1 | c1,c2,c3,c4 | 0 | 0 | 800 | 0 | SVM | WO post-processing, WO noise
E2 | c1,c2,c3,c4 | 0 | 0 | 800 | 80 | SVM | WO post-processing, W noise
E3 | c1,c2,c3,c4 | 0 | 0 | 800 | 0 | NB | WO post-processing, WO noise
E4 | c1,c2,c3,c4 | 0 | 0 | 800 | 80 | NB | WO post-processing, W noise
E5 | c1,c2,c3,c4 | 150 | 2 | 800 | 0 | SVM | W post-processing, WO noise
E6 | c1,c2,c3,c4 | 150 | 2 | 800 | 80 | SVM | W post-processing, W noise
E7 | c1,c2,c3,c4 | 150 | 2 | 800 | 0 | NB | W post-processing, WO noise
E8 | c1,c2,c3,c4 | 150 | 2 | 800 | 80 | NB | W post-processing, W noise
(W = with, WO = without)
31
Experimental Results on Newsgroups
Yun Jeong Choi, Seung Soo Park,
"Refinement Method of Post-processing and
Training for Improvement of Automated Text
Classification",
ICCSA 2006, LNCS 3981, pp. 298-309, 2006
  • Comparison of predictive power
  • Existing method (E1, E2, E3, E4) vs. proposed
    method (E5, E6, E7, E8)
  • Correct documents vs. 10% incorrect documents

(Charts: accuracy under each experimental condition.)
32
Biological Texts
- Background
- Results
33
Problem Definition
  • Ambiguous entity → multiplicity
  • Genes, enzymes, and their transcripts often share
    the same name
  • The task of annotation → identifying and
    classifying the terms
  • Such documents have to be treated as uncertain
    documents
  • There are too many topics and keywords
  • Classification problem
  • Automated classification is to classify free
    texts into predefined categories
  • Goal: finding an optimal decision line or surface
    while reducing manual processing
  • Low training cost
  • High accuracy

34
Examples of Ambiguous Entities
  • p130 mediates TGF-beta-induced cell-cycle
    arrest in Rb mutant HT-3 cells.
  • The INK4alpha/ARF locus encodes p14(ARF) and
    p16(INK4alpha), which function to arrest the cell
    cycle through the p53 and RB pathways,
    respectively.
  • Many tumor types are associated with genetic
    changes in the retinoblastoma pathway, leading to
    hyperactivation of cyclin-dependent kinases and
    incorrect progression through the cell cycle.
  • The Y79 and WERI-Rb1 retinoblastoma cells, as
    well as MCF7 breast cancer epithelial cells, all
    of which express T-channel current and mRNA for
    T-channel subunits, are inhibited by pimozide and
    mibefradil (with IC(50) = 8 and 5 microM for
    pimozide and mibefradil, respectively).

35
Experiments
  • Data: RB-related documents
  • Collected about 20000 abstracts from PubMed.
  • Selected 100 documents for verifying our system
  • These documents have many ambiguous features and
    high classification error
  • Definition of categories and experimental conditions
  • Most documents are connected with protein (P),
    gene (G), and cancer (C).
  • C vs. others, then do the one-against-one
    classification among P, G, and D.

Target category (C) | Candidate category (S) | Intermediate category (X) | Correct docs | Incorrect docs (10%) | Total
Protein         | P1 |    | 30 | 5 | 60(36)
Protein         | P2 |    | 30 | 1 | 60(36)
                |    | X1 | 60 | 0 | 60
Gene            | G1 |    | 30 | 3 | 60(36)
Gene            | G2 |    | 30 | 3 | 60(36)
                |    | X2 | 60 | 0 | 60
Disease, Cancer | D1 |    | 30 | 6 | 60(36)
Disease, Cancer | D2 |    | 30 | 0 | 60(36)
(Number of training documents, correct + incorrect; totals 300, 318)
36
Experimental Results
Yun Jeong Choi, Seung Soo Park,
"Efficient Classification Method for Complex
Biological Literature using Text and
Data Mining Combination", IDEAL 2006,
LNCS 4224, pp. 688-696, 2006

Result table: comparison between the existing method
and the RTPost method with only correct documents

Method | Accuracy | Protein predict power | Gene predict power | Disease predict power | Misclassification rate
Naïve Bayesian | 0.69 | 51% | 82% | 74% | 31%
SVM | 0.74 | 64% | 83% | 76% | 29%
RTPost (with Naïve Bayesian) | 0.89 | 81% | 94% | 92% | 11%
RTPost (with SVM) | 0.91 | 88% | 91% | 94% | 8%

Result table: comparison between the existing method
and the RTPost method with training data containing incorrect documents

Method | Accuracy | Protein predict power | Gene predict power | Disease predict power | Misclassification rate
Naïve Bayesian | 0.45 | 52% | 65% | 17% | 55%
SVM | 0.47 | 54% | 61% | 26% | 64%
RTPost (with Naïve Bayesian) | 0.85 | 84% | 92% | 75% | 15%
RTPost (with SVM) | 0.87 | 87% | 91% | 81% | 11%
37
Summary
  • Many effective techniques have been proposed to
    classify these documents. But
  • they are not proper for classifying complex documents
  • Proposed method
  • A refinement classification system
  • It can be easily adapted to deal with other data
    or other data mining algorithms.
  • Developed in a component-based style using the
    BOW toolkit and the C language.
  • In experiments, the proposed method was very
    successful!
  • We compared accuracy and stability under
    actual conditions.
  • It does not depend on classification algorithms and
    techniques
  • In the future, we'll simplify the effectiveness
    function without raising the running costs of the
    entire process.

38
Publication
  • International Journal & Conference
  • Yun Jeong Choi, Seung Soo Park, "Efficient
    Classification Method for Complex Biological
    Literature using Text and Data Mining
    Combination", IDEAL 2006, LNCS 4224, pp. 688-696,
    2006.
  • Yun Jeong Choi, Seung Soo Park, "Refinement
    Method of Post-processing and Training for
    Improvement of Automated Text Classification",
    ICCSA 2006, LNCS 3981, pp. 298-309, 2006.
  • Yun Jeong Choi, Seung Soo Park, "Automated
    Classification of PubMed texts for Disambiguated
    Annotation using Text and Data Mining", in
    Proceedings of the 2005 International Joint
    Conference of InCoB, AASBi and KSBI, BIOINFO
    2005, pp. 101-106, 2005.
  • Yun Jeong Choi, Seung Soo Park, Min Kyung Kim,
    Hyun Seok Park, "Sense Disambiguation in PubMed
    Abstracts using Text and Data Mining",
    International Conference on Genome Informatics,
    GIW 2002, Vol. 13, pp. 578-579, 2002.
  • Domestic Journal
  • (Korean-language citation; title not recoverable
    from the transcript), Vol. 12, No. 7, pp. 811-822, 2005.
  • (Korean-language citation; title not recoverable
    from the transcript), Vol. 23, No. 3, pp. 33-46, 2002.
  • Domestic Conference
  • (Korean-language citation on anomaly detection and
    misuse detection; title not recoverable from the
    transcript), KCC, 2006.
  • (Korean-language citation; title not recoverable
    from the transcript), Vol. 28, No. 01, pp. 0247-0249, 2001.04.
39
Discussion and Next Plan
  • Application area
  • Text classification is very closely connected to
    information retrieval systems and filtering systems
  • Rich tagging systems
  • Text mining, data mining
  • Dynamic classification for the ever-changing
    needs of users
  • Reflection of various viewpoints (angles) vs. the
    tagging system
  • Traditional definition
  • <document di, class ci>: a disjoint mapping
  • A limitation of disjoint classification: must a
    document which has multiple concepts be
    assigned only one class?
  • Exactly one result vs. liberal, enriched
    results
  • Support for rich tagging systems
  • Ranking information: <category, probability
    score>
  • An efficient evaluation function
  • Definition of Good / Fair / Poor cases

40
(Duplicate of slide 39.)

41
References
  • Basic
  • Mitchell, T. Machine Learning. McGraw-Hill, 1997.
  • Hearst, M.A. Trends and controversies: support
    vector machines. In IEEE Intelligent Systems,
    July/August 1998, pages 18-28.
  • Yang, Y., Slattery, S., Ghani, R. A study of
    approaches to hypertext categorization. Journal
    of Intelligent Information Systems,
    18(2/3):219-241, 2002.
  • Fabrizio Sebastiani, Machine Learning in
    Automated Text Categorization, ACM Computing
    Surveys, 34(1):1-47, 2002.
    http://www.math.unipd.it/~fabseb60/Publications/ACMCS02.pdf
  • Classification algorithms, Feature Extraction
  • Yang, Y. and Liu, X. A re-examination of text
    categorization methods. In Proceedings of ACM
    SIGIR, 1999.
  • Dumais, S. and Chen, H. Hierarchical
    classification of web content. In Proceedings of
    the 23rd ACM SIGIR Conference, pages 256-263,
    2000.
  • Yiming Yang, Shinjae Yoo, Jian Zhang and Bryan
    Kisiel. Robustness of Adaptive Filtering Methods
    in a Cross-benchmark Evaluation. In the 28th
    Annual International ACM SIGIR Conference (SIGIR
    2005), Brazil, 2005.
  • Lewis, D. D. An evaluation of phrasal and
    clustered representations on a text
    categorization task. In Proceedings of the 15th
    ACM SIGIR Conference, pages 37-50, 1992.
  • Yiming Yang, Thomas Ault, Thomas Pierce.
    Combining multiple learning strategies for
    effective cross validation. The
    Seventeenth International Conference on Machine
    Learning (ICML'00), pp. 1167-1182, 2000.
  • Bekkerman, R., El-Yaniv, R., Tishby, N., and
    Winter, Y. 2003. Distributional word clusters vs.
    words for text categorization. J. Mach. Learn.
    Res. 3 (Mar. 2003), 1183-1208.
    http://www.cs.technion.ac.il/~ronb/papers/jmlr.pdf
  • N. Slonim and N. Tishby. The power of word
    clusters for text classification. In 23rd
    European Colloquium on Information Retrieval
    Research, 2001.
    http://www.cs.huji.ac.il/~noamm/publications/ECIR2001.ps.gz
  • Landauer, T. K., Foltz, P. W., Laham, D.
    Introduction to Latent Semantic Analysis.
    Discourse Processes, 25, 259-284, 1998.
    http://lsa.colorado.edu/papers/dp1.LSAintro.pdf
  • Fan Li, Yiming Yang. Using recursive
    classification to discover predictive features.
    ACM SAC 2005.

42
Reference
  • Learning methods
  • Joachims, T. Text categorization with support
    vector machines: learning with many relevant
    features. In Proceedings of the 10th European
    Conference on Machine Learning, pages 137-142,
    1998.
  • Lijuan Cai, Thomas Hofmann. Text Categorization
    by Boosting Automatically Extracted Concepts.
    26th Annual International ACM SIGIR Conference,
    2003.
  • Ioannis Tsochantaridis, Thomas Hofmann, Thorsten
    Joachims, Yasemin Altun. Support Vector Machine
    Learning for Interdependent and Structured Output
    Spaces. International Conference on Machine
    Learning (ICML), 2004.
  • Optimizing methods
  • Lewis, D.D. Evaluating and optimizing autonomous
    text classification systems. In Proceedings of
    ACM SIGIR, 1995.
  • Özgür, L., Güngör, T., and Gürgen, F. 2004.
    Adaptive anti-spam filtering for agglutinative
    languages: a special case for Turkish. Pattern
    Recogn. Lett. 25, 16 (Dec. 2004), 1819-1831.
  • Yiming Yang, Thomas Ault, Thomas Pierce and
    Charles W. Lattimer. Improving text categorization
    methods for event tracking. Proceedings
    of the ACM SIGIR Conference on Research and
    Development in Information Retrieval (SIGIR'00),
    pp. 65-72, 2000.

43
(No Transcript)
44
(Korean-language heading; topics listed:)
- Stemming
- Information Retrieval
- Classification Process
- Expert System
- Learning Algorithms
- Voting, Adaboost
- Reinforcement Learning
- Performance Measure
45
Motivation: Complex Case
(Figure: a linear decision boundary vs. a non-linear boundary.)
46
Experimental Results
  • Comparison of predictive power
  • Existing method vs. proposed method
  • Correct documents vs. 10% incorrect documents

When the training data contains 10% incorrect
documents:
  • Accuracy: existing method vs. proposed method

47
Knowledge Engineering Process
48
Proposed Method: Concepts
(Diagram: full framework. Training path: training documents → preprocessing → indexing → feature selection → classifier. Classification path: new documents → preprocessing → indexing → classifier → assign, with a performance measure feeding back to training.)
49
(Korean-language diagram comparing the two pipelines; recoverable labels: Feature Selection, Indexing, Text Classifier, Assign.)
50
Pre-processing: Stemming
51
Information Retrieval
52
Weights
  • How should weights be assigned to terms?
  • A very simple method: the weight of a term is its
    document frequency
  • But different words normally have different
    frequencies
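A common refinement of such weighting is tf-idf; a minimal sketch of the standard formulation (not specific to this presentation):

```python
# tf-idf weighting: term frequency damped by how many documents contain the term.
import math
from collections import Counter

docs = ["the cell cycle", "the stock market", "cell cycle arrest"]

def tfidf(term, doc, corpus):
    tf = Counter(doc.split())[term]
    df = sum(1 for d in corpus if term in d.split())
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

print(tfidf("cell", docs[0], docs))   # in 2 of 3 docs -> lower weight
print(tfidf("stock", docs[1], docs))  # in 1 of 3 docs -> higher weight
```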

53
Related Work: Expert System
54
Classification types
  • Single-label vs. multi-label
  • exactly 1 category assigned to each document vs.
    0 to |C|
  • Binary vs. multi-way classification
  • binary: a special case of single-label; dj ∈ D is
    assigned either to ci or to its complement
    (e.g. spam / non-spam)
  • Document-pivoted (DPC) vs. category-pivoted (CPC)
  • given dj ∈ D, we want to find all the ci ∈ C
    under which it should be classified
    (document-pivoted)
  • DPC is suitable when documents become available
    at different moments in time, e.g. filtering
    e-mail
  • given ci ∈ C, we want to find all the dj ∈ D
    that should be classified under it
    (category-pivoted)
  • CPC is suitable when new categories are likely to
    be added to C

55
Some more types
  • Hard categorisation vs. ranking categorisation
  • making a hard decision about the categories
    that a document should be classified under
  • or, ranking the categories based on their
    estimated appropriateness and allowing the choice
    of category to be made, e.g., by a human expert
  • ranking categorisation may lead to interactive
    categorisation systems
  • may be useful in critical applications
  • Hierarchical vs. flat
  • just like in clustering: in flat classification the
    relations between classes are undefined; in
    hierarchical, the classes are ordered

56
Learning Algorithms
  • Learning algorithms
  • Error-correction learning
  • Boltzmann learning
  • Thorndike's law of effect
  • Hebbian learning
  • Competitive learning
  • Learning paradigms
  • Supervised learning
  • Reinforcement learning
  • Self-organized (unsupervised) learning

57
Voting Algorithm
  • Principle: using multiple evidence (multiple poor
    classifiers > a single good classifier)
  • Generate some base classifiers
  • Combine them to make the final decision

58
Bagging Algorithm
  • Use multiple versions of a training set D of size
    N, each created by resampling N examples from D
    with bootstrap
  • Each of the data sets is used to train a base
    classifier; the final classification decision is
    made by majority voting of these classifiers
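A minimal bagging sketch under these definitions; the stub base learner and the toy dataset are illustrative assumptions:

```python
# Bagging (sketch): bootstrap-resample D, train base classifiers, majority-vote.
import random
from collections import Counter

def train_base(sample):
    """Stub base learner: memorize the majority class of its bootstrap sample."""
    majority = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda x: majority

def bagging(D, n_classifiers=5):
    models = []
    for _ in range(n_classifiers):
        boot = [random.choice(D) for _ in range(len(D))]  # resample N with replacement
        models.append(train_base(boot))
    def predict(x):
        return Counter(m(x) for m in models).most_common(1)[0][0]
    return predict

D = [(1, "spam"), (2, "spam"), (3, "ham"), (4, "ham"), (5, "spam")]
print(bagging(D)(42))  # most likely "spam" given the class skew
```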

59
Adaboost
  • Main idea
  • The main idea of this algorithm is to maintain a
    distribution or set of weights over the training
    set. Initially, all weights are set equally, but
    in each iteration the weights of incorrectly
    classified examples are increased so that the
    base classifier is forced to focus on the hard
    examples in the training set. For those correctly
    classified examples, their weights are decreased
    so that they are less important in the next
    iteration.
  • Why ensembles can improve performance
  • Uncorrelated errors made by the individual
    classifiers can be removed by voting.
  • Our hypothesis space H may not contain the true
    function f. Instead, H may include several
    equally good approximations to f. By taking
    weighted combinations of these approximations, we
    may be able to represent classifiers that lie
    outside of H.

60
Adaboost algorithm
Given m examples (x1, y1), ..., (xm, ym), where xi ∈ X and yi ∈ {-1, +1}.
Initialize D1(i) = 1/m for all i = 1..m.
  • For t = 1, ..., T:
  • Train the base classifier using distribution Dt
  • Get a hypothesis ht : X → {-1, +1} with error
    εt = Pr_{i ~ Dt}[ht(xi) ≠ yi]
  • Choose αt = (1/2) ln((1 - εt) / εt)
  • Update D_{t+1}(i) = Dt(i) exp(-αt yi ht(xi)) / Zt,
    where Zt is a normalization factor (chosen so that
    D_{t+1} will be a distribution).
Output the final hypothesis H(x) = sign(Σ_{t=1..T} αt ht(x)).
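The same algorithm as a compact, runnable sketch, using one-feature threshold stumps as base classifiers; the data and the stump family are illustrative:

```python
# AdaBoost sketch following the algorithm above (threshold-stump base learners).
import math

X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [+1, +1, +1, -1, -1, +1]

def best_stump(D):
    """Pick the threshold/sign stump with the lowest weighted error."""
    best = None
    for thr in X:
        for sign in (+1, -1):
            h = lambda x, t=thr, s=sign: s if x < t else -s
            err = sum(d for d, xi, yi in zip(D, X, y) if h(xi) != yi)
            if best is None or err < best[0]:
                best = (err, h)
    return best

D = [1.0 / len(X)] * len(X)      # D1(i) = 1/m
ensemble = []
for t in range(5):
    err, h = best_stump(D)
    err = max(err, 1e-10)        # guard against a perfect stump
    alpha = 0.5 * math.log((1 - err) / err)
    ensemble.append((alpha, h))
    D = [d * math.exp(-alpha * yi * h(xi)) for d, xi, yi in zip(D, X, y)]
    Z = sum(D)                   # normalization factor Zt
    D = [d / Z for d in D]

H = lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
print([H(xi) for xi in X])       # fits the training labels after boosting
```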
61
Analysis of Voting Algorithms
  • Advantage
  • Surprisingly effective
  • Robust to noise
  • Decrease the overfitting effect
  • Disadvantage
  • Require more calculation and memory

62
Decision Tree Learning
  • Learn a sequence of tests on features, typically
    using top-down, greedy search
  • At each stage choose the unused feature with
    highest Information Gain (feature/class MI)
  • Binary (yes/no) or continuous decisions

(Figure: a small decision tree branching on features f1 and f7.)
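A sketch of the Information Gain criterion mentioned above (feature/class mutual information); the tiny dataset and feature names are illustrative:

```python
# Information Gain: IG(f) = H(class) - sum_v p(f=v) * H(class | f=v).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, feature):
    labels = [cls for _, cls in rows]
    gain = entropy(labels)
    for value in {feats[feature] for feats, _ in rows}:
        subset = [cls for feats, cls in rows if feats[feature] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

rows = [({"f1": 1, "f7": 0}, "A"), ({"f1": 1, "f7": 1}, "A"),
        ({"f1": 0, "f7": 0}, "B"), ({"f1": 0, "f7": 1}, "B")]
print(info_gain(rows, "f1"), info_gain(rows, "f7"))  # 1.0 0.0
```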
63
Performance Measure
  • Performance of algorithm
  • Training time
  • Testing time
  • Classification accuracy
  • Precision, Recall
  • Micro-average / Macro-average
  • Breakeven: precision = recall
  • Goal: high classification quality and
    computational efficiency

64
Elements of Reinforcement Learning
(Diagram: agent-policy-environment loop.)
  • Transition model: how actions influence states
  • Reward R: immediate value of a state-action
    transition
  • Policy π: maps states to actions

65
Artificial Life & Brain (Korean-language slide; recoverable terms below)
  • Development: L-system, cellular automata
  • Learning: reinforcement learning, classifier system
  • Evolution: evolutionary algorithm, co-evolution, DNA coding
66
Classification Performance Measures
  • Given n test documents and m classes in
    consideration, a classifier makes n × m binary
    decisions. A two-by-two contingency table can be
    computed for each class.
  • Recall = a/(a+c), where a+c > 0 (otherwise
    undefined).
  • Did we find all of those that belonged in the
    class?
  • Precision = a/(a+b), where a+b > 0 (otherwise undefined).
  • Of the times we predicted it was in the class, how
    often were we correct?
  • Accuracy = (a + d) / n
  • When one class is overwhelmingly in the
    majority, this may not paint an accurate picture.
  • Others: miss, false alarm (fallout), error,
    F-measure, break-even point, ...

           | truly YES | truly NO
system YES |     a     |    b
system NO  |     c     |    d
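These measures as a small sketch over the contingency table counts (the example counts are illustrative):

```python
# Recall, precision, and accuracy from the two-by-two contingency table.
def metrics(a, b, c, d):
    n = a + b + c + d
    recall = a / (a + c) if a + c else None      # found all that belong?
    precision = a / (a + b) if a + b else None   # right when we said YES?
    accuracy = (a + d) / n
    return recall, precision, accuracy

print(metrics(a=40, b=10, c=20, d=30))  # (0.666..., 0.8, 0.7)
```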

67
Data Mining Approach
(Korean-language diagram; labels not recoverable from the transcript.)
68
(Korean-language section heading)
- Background
- Results
69
(Korean-language presenter notes; not recoverable from the transcript.)
70
Related Work: Main
(Korean-language slide; recoverable terms: references, ranking system, feature extraction vector, best team, Boost.)

71
(Korean-language slide on misclassification, sampling, and learning by error correction; full text not recoverable from the transcript.)
72
(Motivation)
(Korean-language bullets; not recoverable from the transcript.)
  • Most documents have high complexity in
    contents, with multiple topics and features.
  • Complex reports, news materials, biological
    literature
  • So, various kinds of algorithms have been
    proposed for this problem. However, the results
    are not satisfactory.

73
Background (1)
  • Classification (problem)
  • Objective: classify a new object into one of
    the predefined categories using the generated
    classifier or rule
  • Definition: D, C
  • Border problem in complex data
  • Automated classification problem
(Remaining Korean-language bullets not recoverable from the transcript.)

74
Motivation: Problem Definition
  • Classification problem
  • Goal: finding an optimal decision line or surface
    while reducing manual intervention
  • Low training cost
  • High accuracy
  • In real-world problems
  • Representation of a document: a bag of words
  • Errors in training sets/samples may occur
    even while we have efficient learning methods and
    human experts
  • Most documents have high complexity in
    contents, with multiple topics and features.
  • Complex reports, spam letters, news materials,
    biological literature...
  • Various kinds of algorithms and methods have been
    proposed for this problem. However, the results
    are not satisfactory.
  • Accuracy is high only when the data fits the model well.
  • Assignment by a simple ranking system

75
Problem Definition
  • Maeng et al., SIGIR 1998; Navarro and Baeza-Yates,
    SIGIR 1995
  • Model for querying and indexing
  • Need to analyze similarity between documents for
    mining (classification and clustering)
  • Yi & Sanderson, SIGKDD 2000
  • Target: uniform structure from one database or
    website.
  • Problem
  • A classification model which doesn't consider text
    semantics
  • may not deal with documents with various
    concepts
  • cannot identify similar meanings

76
Background: Data Classification vs. Text
Classification
(Figure: example data table with columns ID, Age, Income, Car, Class; the features and the target class are marked.)
77
Approach: Types of Misclassification
(Taxonomy diagram, after Brownell: training errors causing misclassification: distortion, incompleteness, uncertainty, absence, vagueness, probability, ambiguity, fuzziness, non-specificity.)
78
Motivation
  • Recent documents
  • Similar symptoms → different diseases; similar
    feature vectors → different categories
  • Multiple concepts

Ranking System
79
(Korean-language slide; recoverable fragments: similar symptoms → different diseases; existing decision rules; the Musso Sports as a vehicle that falls between classes such as truck, RV/SUV, and Sports.)
80
(Korean-language diagram: vehicle classification example with categories such as car, truck, RV, and SUV.)
81
Proposed System
82
Proposed Method (Korean label not recoverable)
(Diagram; Step 2: Text Mining.)
83
Refinement Method using a Text/Data Mining System
  • Proposed system
  • Objective: maximize accuracy, minimize training
    cost
  • Methodology: automated text classification based
    on a knowledge discovery process using
    reinforcement learning and post-processing
  • Target data: a set of documents which have
    multiple concepts (uncertain & complex documents)

84
Implementation of Refinement Text/Data Mining
System
  • Part 1: Hierarchical Construction in the Training
    Method
  • Part 2: Post-Processing by Reinforcement Learning

85
Part 1: Reinforcement Training Method (Progress)
  • Define the target category scheme
  • Definition 1: C = {c1, c2, ..., cn} is the set of
    final target categories, where ci and cj are
    mutually disjoint.
  • Definition 2: SCn = {cn1, cn2, ..., cnk} is
    the set of subcategories of a target category,
    where the cnj are disjoint.
  • Definition 3: X = {x1, x2, ..., xn-1} is the set of
    intermediate categories.
  • The data located around the decision boundary belong
    to X. Also, unclassified documents are denoted by
    X. (A data-structure sketch follows below.)

Organizing method of training data
Documents or Sentences
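A small sketch of this hierarchical scheme as a data structure, with the parent lookup the later assignment steps need; the concrete labels mirror the slides' C1/C2 example and are illustrative:

```python
# Hierarchical category scheme from Definitions 1-3 (labels are illustrative).
scheme = {
    "targets": ["C1", "C2"],                 # Definition 1: final categories
    "subcategories": {"C1": ["C11", "C12"],  # Definition 2: per-target subsets
                      "C2": ["C21", "C22"]},
    "intermediate": ["X1", "X2"],            # Definition 3: boundary classes
}

def parent(label):
    """Map a subcategory or intermediate label back to its target (or X)."""
    for target, subs in scheme["subcategories"].items():
        if label == target or label in subs:
            return target
    return "X" if label in scheme["intermediate"] else None

print(parent("C21"), parent("X1"))  # C2 X
```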
86
Part 2: Post-Processing (Assignment)
  • Goal
  • Overcome the problems and limitations of the
    traditional method with a mining approach
    focused on risk-minimization analysis.

Assign a category to documents using the initial
score
  • Input: document Di, candidate category list Li,
    normalized and re-sorted in descending order
  • Step 1: for i = 0 to N (number of input
    documents)
  • if (Di.size >= min_support) and ((Li[1].score >=
    min_value)
  • or (Li[1].score - Li[2].score >= diff_value)) then
  • assign Di to Li[1]
  • else assign Di to X
  • Step 2: for n = 0 to N (number of unassigned
    documents in Step 1)
  • for k = 0 to K (number of target categories)
  • calculate the distance between the pivot category P
    and cnk
  • assign Di to the closer side cn

87
Part 2: Post-Processing (Recomputation)
  • Make another training data set from the candidate
    lists of documents
  • Perform data mining analysis with these uncommon
    patterns

Di | 1 (wm=0.02) | 2 (wm=0.15) | 3 (wm=0.25) | 4 (wm=0.31) | 5 (wm=0.35) | Di.size | Step1 | Step2 | Assign | Actual class
1 | C21 .98 | C11 .01 | X1 .01 | X2 .01 | C12 .00 | 726.33 | C21→C2 | - | C2 | C2
2 | C21 .39 | C12 .20 | X2 .17 | C11 .13 | X1 .10 | 31.6 | X | C2 | C2 | C2
3 | X2 .29 | C11 .28 | C12 .17 | C21 .15 | X1 .01 | 514.42 | X | C1 | C1 | C1
4 | X1 .28 | C21 .23 | X2 .17 | C11 .16 | C12 .15 | 287.12 | X | C2 | C2 | C2
88
Part 2: Post-Processing (Evaluation & Feedback)
  • Evaluation matrix for effectiveness, by variance
    of results

Conditions:
A: (Cn,1.step1 = X)
B: (Cn,1.step1 = True)
C: (Cn+1,1.step1 = True) or (Cn,1.step2 = True)
D: (Cn,1.step1 = False) and (Cn,1.step2 ≠ Cn,2)

Location | Input | Cn,1.step1 | Cn,1.step2 | Cn,2 | Cn+1,1.step1 | Cn+1,1.step2 | Cn+1,2 (eval)
d1  | X | 1 | 1 | 1 | - | - | Good
d2  | X | 0 | 1 | 1 | - | - | Good
d3  | X | 0 | 0 | 1 | - | - | Poor
d4  | 1 | 1 | 1 | 1 | - | - | Fair
d5  | 1 | 0 | 1 | 1 | - | - | Fair
d6  | 1 | 1 | 0 | 1 | - | - | Poor
d7  | 1 | 0 | 0 | 1 | - | - | Poor
d8  | 0 | 1 | 1 | 1 | - | - | Good
d9  | 0 | 0 | 1 | 1 | - | - | Good
d10 | 0 | 1 | 0 | 1 | - | - | Poor

where Cn,e.process: n = 0, 1, 2, ... is the feedback time; e is the type of
input data (1 = documents, 2 = candidate lists of documents); process =
step1, step2
89
Evaluation Methodology and Performance Measure
  • Performance goal: high classification quality
    and computational efficiency
  • Training time, testing time
  • Classification accuracy
  • Precision, recall
  • Micro-average / macro-average
  • Breakeven: precision = recall
  • We use
  • Measure
  • Accuracy = true cases / N
  • The test set does not contain training data
  • Condition
  • Same classifiers: NB, SVM
  • With and without post-processing
  • With and without noisy data in the training samples

90
Contribution
  • Improvement of automated text classification using
    refinement training and a post-processing method
  • Related areas: construction methods for training
    datasets, machine learning, text mining, data
    mining, information retrieval
  • Does not depend on classification algorithms or
    selection algorithms
  • Main objective
  • Provide high performance and reliability in the
    system while reducing human intervention
  • Application area
  • Data areas difficult to classify
  • Anti-spam systems

91
Classification Learning
(Diagram: under the full category schema, a classifier assigns documents to category1 ... category4.)
92
Background: Classification Example
93
Comparison Based on Six Classifiers
  • Classification accuracy of six classifiers
    (Reuters-21578 collection)
  • SVM, Voting and KNN showed good performance;
    DT, NB and Rocchio showed relatively poor
    performance

Setup / classifier | 1 (Dumais) | 2 (Joachims) | 3 (Weiss) | 4 (Yang)
Training  | 9603    | 9603 | 9603      | 7789
Test      | 3299    | 3299 | 3299      | 3309
Topics    | 118     | 90   | 95        | 93
Indexing  | Boolean | tfc  | Frequency | ltc
Selection | MI      | IG   | -         | χ2
Measure   | Breakeven | Microavg. | Breakeven | Breakeven
Rocchio   | 61.7    | 79.9 | 78.7      | 75
NB        | 75.2    | 72   | 73.4      | 71
KNN       | N/A     | 82.3 | 86.3      | 85
DT        | N/A     | 79.4 | 78.9      | 79
SVM       | 87      | 86   | 86.3      | N/A
Voting    | N/A     | N/A  | 87.8      | N/A
94
Comparison Based on Feature Selection
  • Classification accuracy: NB vs. KNN vs. SVM
    (Reuters collection)

# of features | NB | KNN | SVM
10   | 48.66 ± 0.10 | 57.31 ± 0.2  | 60.78 ± 0.17
20   | 52.28 ± 0.15 | 62.57 ± 0.16 | 73.67 ± 0.11
40   | 59.19 ± 0.15 | 68.39 ± 0.13 | 77.07 ± 0.14
50   | 60.32 ± 0.14 | 74.22 ± 0.11 | 79.02 ± 0.13
75   | 66.18 ± 0.19 | 76.41 ± 0.11 | 83.0 ± 0.10
100  | 77.9 ± 0.19  | 80.2 ± 0.09  | 84.3 ± 0.12
200  | 78.26 ± 0.15 | 82.5 ± 0.09  | 86.94 ± 0.11
500  | 80.80 ± 0.12 | 82.19 ± 0.08 | 86.59 ± 0.10
1000 | 80.88 ± 0.11 | 82.91 ± 0.07 | 86.31 ± 0.08
5000 | 79.26 ± 0.07 | 82.97 ± 0.06 | 86.57 ± 0.04
95
The Main Approaches to Automated Text Classification
  • The machine learning approach
  • build a classifier for class ci by observing the
    properties of the set of documents manually
    pre-classified under ci (learning)
  • major drawback: learning cost (human experts,
    learning algorithms)
  • i.e., we need a human expert to define the concepts
    in each different domain (expert system)
  • The knowledge engineering approach
  • needs a large set of rules: if <...> then <category>
  • rules manually constructed
  • major drawback: the knowledge acquisition bottleneck
  • i.e., how do we deal with new categories,
    different domains, etc.?

96
Uncertainty
  • The uncertainty problem
  • The true value is unknown
  • Too complex to compute prior to making a decision
  • A characteristic of real-world applications
  • Sources of uncertainty
  • Cannot be explained by a deterministic model
  • decay of radioactive substances
  • Not understood well
  • disease transmission mechanisms
  • Too complex to compute
  • classification of complex documents

97
Approach (2/A): Selection Problem in Training
(Taxonomy diagram, after Brownell: training errors causing misclassification: distortion, incompleteness, absence, uncertainty, confidence, randomness, vagueness.)
  • Which side will be up if I toss a coin?
  • Is that right?
  • How confident are you in your decision?