Comments from Pre-submission Presentation
1
Comments from Pre-submission Presentation
  • Q: Check why kNN is so much lower than SVM on
    the Reuters and 20 Newsgroups corpora. (-10)
  • A: Refer to the following four references:
    [Joachims 98], [Debole 03 STM], [Dumais 98
    Inductive], [Yang 99 Re-examination]

2
[Joachims 98], [Debole 03], [Dumais 98]: Results on
the Reuters Corpus

[Joachims 98]
               Bayes  Rocchio  C4.5   kNN   SVM (linear)  SVM (poly)  SVM (rbf)
Micro-BEP (%)  69.84  79.14    77.78  82.5  84.2          86          86

[Debole 03]
          kNN   SVM (linear)
Micro-F1  85.4  92.0

[Dumais 98]
           NBayes  DT    SVM (linear)
Micro-BEP  81.5    88.4  92.0
3
[Yang 99 Re-examination] Significance Test
  • Micro-level analysis (s-test):
    SVM > kNN >> LLSF, NNet >> NB
  • Macro-level analysis:
    SVM, kNN, LLSF >> NB, NNet
  • Error-rate based comparison:
    SVM, kNN > LLSF > NNet >> NB
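The paired comparisons above come from significance testing over matched decisions of two classifiers. As a minimal sketch of the idea (the function name and interface are illustrative, not Yang's exact s-test procedure), a two-sided sign test on win/loss counts with ties dropped looks like this:

```python
from math import comb

def sign_test(wins_a, wins_b):
    """Two-sided sign test on paired comparisons (ties dropped):
    p-value for the null hypothesis that systems A and B win
    equally often. Illustrative sketch of the kind of paired test
    the s-test applies to per-decision outcomes."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # P(X <= k) for X ~ Binomial(n, 1/2), doubled for a two-sided test
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)
```

A small p-value (e.g. below 0.05) licenses a ">" in the ranking; larger gaps in wins give ">>"-style confidence.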

4
Comments from Pre-submission Presentation
  • 2. Explain the relation between BEP and F1 in
    Chap 7
  • Add a reference

5
Breakeven point (1)
  • BEP, first proposed by Lewis1992. Later, he
    himself pointed out that BEP is not a good
    effectiveness measure, because
  • 1. there may be no parameter setting that yields
    the breakeven in this case the final BEP value,
    obtained by interpolation, is artificial
  • 2. to have PR is not necessarily desirable, and
    it is not clear that a system that achieves high
    BEP can be tuned to score high on other
    effectiveness measure.

6
Breakeven point (2)
  • [Yang 1999 Re-examination] also noted that when,
    for no value of the parameters, P and R are close
    enough, the interpolated breakeven may not be a
    reliable indicator of effectiveness.
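The interpolation step being criticised can be made concrete. A common approximation (one of several schemes in use; the function below is an illustrative sketch, not a specific author's definition) is to take the parameter setting where precision and recall are closest and average them:

```python
def interpolated_bep(pr_points):
    """Approximate the breakeven point from (precision, recall)
    pairs observed at different parameter settings. When no setting
    gives P == R exactly, average P and R at the closest pair --
    exactly the interpolation step that Lewis and Yang criticise
    as potentially artificial."""
    p, r = min(pr_points, key=lambda pr: abs(pr[0] - pr[1]))
    return (p + r) / 2
```

When the closest pair is still far apart (say P = 0.9, R = 0.3), the averaged value describes no operating point the system can actually reach, which is the unreliability noted above.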

7
Comments from Pre-submission Presentation
  • 3. Adding more qualitative analysis would be
    better

8
Analysis and Proposal: Empirical Observation
Comparison of the idf, rf and chi2 values of four
features in two categories of the Reuters Corpus

            Category 00_acq          Category 03_earn
feature     idf    rf     chi2      idf    rf     chi2
Acquir      3.553  4.368  850.66    3.553  1.074  81.50
Stake       4.201  2.975  303.94    4.201  1.082  31.26
Payout      4.999  1      10.87     4.999  7.820  44.68
dividend    3.567  1.033  46.63     3.567  4.408  295.46
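To make the three columns comparable, here is a minimal sketch of how each value is computed from document counts. The chi2 and idf forms are standard; the rf form is an assumption following the tf-rf literature, so treat it as illustrative:

```python
from math import log, log2

def idf(n_docs, df):
    # idf depends only on the whole collection, which is why the idf
    # column is identical for categories 00_acq and 03_earn above
    return log(n_docs / df)

def chi2(a, b, c, d):
    """Chi-square value from the 2x2 term/category table:
    a = positive docs with the term, b = positive docs without it,
    c = negative docs with the term, d = negative docs without it."""
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / den if den else 0.0

def rf(a, c):
    # relevance frequency (assumed form, after the tf-rf literature):
    # grows as the term concentrates in the positive category
    return log2(2 + a / max(1, c))
```

Under this rf form, a term that never occurs in the positive category gets rf = log2(2) = 1, matching the floor value seen for Payout in category 00_acq.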
9
Comments from Pre-submission Presentation
  • 4. Chap 7: remove Joachims' results; using a
    quotation is fine

10
Comments from Pre-submission Presentation
  • 5. Tone down "best" claims
  • -> "to our knowledge (experience, understanding)"
  • Pay attention to this usage when giving
    presentations

11
Introduction: Other Text Representations
  • Word senses (meanings) [Kehagias 2001]
  • the same word assumes different meanings in
    different contexts
  • Term clustering [Lewis 1992]
  • group words with a high degree of pairwise
    semantic relatedness
  • Semantic and syntactic representation [Scott &
    Matwin 1999]
  • relationships between words, i.e. phrases,
    synonyms and hypernyms

12
Introduction: Other Text Representations
  • Latent Semantic Indexing [Deerwester 1990]
  • a feature reconstruction technique
  • Combination approach [Peng 2003]
  • combine two types of indexing terms, i.e. words
    and 3-grams
  • In general, these higher-level representations
    did not show good performance in most cases
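Latent Semantic Indexing can be sketched in a few lines: it reconstructs features by keeping only the largest singular directions of the term-document matrix. The data below is a toy matrix invented for illustration, not from the Reuters experiments:

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents);
# purely illustrative data.
A = np.array([
    [2., 0., 1., 0.],
    [1., 1., 0., 0.],
    [0., 2., 0., 1.],
    [0., 0., 1., 2.],
])

# LSI truncates the SVD to the k largest singular values, projecting
# documents into a k-dimensional "latent semantic" space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_latent = (np.diag(s[:k]) @ Vt[:k]).T  # each row: one document in k dims
```

Documents that share no surface terms can still end up close in the latent space, which is the appeal of the technique; the slides note it nevertheless underperformed in most reported cases.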

13
Literature Review: Knowledge-based Representation
  • Theme Topic Mixture Model, a graphical model
    [Keller 2004]
  • Using keywords from summarization [Li 2003]

14
Literature Review: 2. How to weight a term
(feature)
  • [Salton 1988] elaborated three considerations:
  • 1. term occurrences closely represent the content
    of a document;
  • 2. other factors with discriminating power pick
    out relevant documents from irrelevant ones;
  • 3. consider the effect of document length.

15
Literature Review: 2. How to weight a term
(feature)
  • 1. Term Frequency Factor
  • Binary representation (1 for present, 0 for
    absent)
  • Term frequency (tf): the number of times a term
    occurs in a document
  • log(tf): a log operation to scale down the effect
    of unfavorably high term frequencies
  • Inverse term frequency (ITF)
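The first three term-frequency factors can be sketched as one illustrative function (ITF is omitted here, since its exact form varies in the literature; the function name and interface are assumptions for this example):

```python
from math import log

def tf_weight(tf, scheme="log"):
    """Term-frequency factors from the list above."""
    if scheme == "binary":
        return 1 if tf > 0 else 0            # present / absent
    if scheme == "raw":
        return tf                            # raw occurrence count
    if scheme == "log":
        return 1 + log(tf) if tf > 0 else 0  # damps very frequent terms
    raise ValueError(scheme)
```

The log scheme keeps single occurrences at weight 1 while a term occurring 100 times gets roughly 5.6 rather than 100, which is the intended damping.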

16
Literature Review: 2. How to weight a term
(feature)
  • 2. Collection Frequency Factor
  • idf: the most commonly used factor
  • Probabilistic idf, a.k.a. term relevance weight
  • Feature selection metrics: chi2, information
    gain, gain ratio, odds ratio, etc.

17
Literature Review: 2. How to weight a term
(feature)
  • 3. Normalization Factor
  • Combine the above two factors by multiplication
  • To eliminate the length effect, we use cosine
    normalization to limit the term weighting range
    to (0, 1)
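The three steps above (term-frequency factor x collection-frequency factor, then cosine normalization) can be sketched as follows; the function name and dictionary-based interface are illustrative choices, not a specific system's API:

```python
from math import log, sqrt

def tfidf_cosine(doc_tf, df, n_docs):
    """Multiply the term-frequency and collection-frequency (idf)
    factors, then cosine-normalize so every document vector has
    unit length, cancelling the document-length effect."""
    w = {t: tf * log(n_docs / df[t]) for t, tf in doc_tf.items()}
    norm = sqrt(sum(v * v for v in w.values())) or 1.0
    return {t: v / norm for t, v in w.items()}
```

After normalization the squared weights sum to 1, so each individual weight falls in the stated (0, 1) range regardless of how long the document is.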