1
Sorting Spam with K-Nearest Neighbor and
Hyperspace Classifiers
  • William Yerazunis1
  • Fidelis Assis2
  • Christian Siefkes3
  • Shalendra Chhabra1,4
  • 1 Mitsubishi Electric Research Labs- Cambridge
    MA
  • 2 Empresa Brasileira de Telecomunicações
    Embratel, Rio de Janeiro, RJ Brazil
  • 3 Database and Information Systems Group, Freie
    Universität Berlin,
  • Berlin-Brandenburg Graduate School
    in Distributed Information Systems
  • 4 Computer Science and Engineering,
    University of California, Riverside CA

2
Bayesian is Great. Why Worry?
  • Typical Spam Filters are linear classifiers
  • Consider the checkerboard problem
  • Markovian requires the nonlinear features to be
    textually near each other
  • can't be sure that will work forever, because
    spammers are clever.
  • Winnow is just a different weighting and a
    different chain rule

3
Bayesian is Great. Why Worry?
  • Bayesian is only a linear classifier
  • Consider the checkerboard problem
  • Markovian requires the nonlinear features to be
    textually near each other
  • can't be sure of that; spammers are clever
  • Winnow is just a different weighting
  • KNNs are a very different kind of classifier

4
Typical Linear Separation
5
Typical Linear Separation
6
Typical Linear Separation
7
Nonlinear Decision Surfaces
Nonlinear decision surfaces require tremendous
amounts of data.
8
Nonlinear Decision and KNN / Hyperspace
Nonlinear decision surfaces require tremendous
amounts of data.
9
KNNs have been around
  • Earliest found reference
  • E. Fix and J. Hodges, Discriminatory Analysis
    Nonparametric Discrimination Consistency
    Properties

10
KNNs have been around
  • Earliest found reference
  • E. Fix and J. Hodges, Discriminatory Analysis
    Nonparametric Discrimination Consistency
    Properties
  • In 1951 !

11
KNNs have been around
  • Earliest found reference
  • E. Fix and J. Hodges, Discriminatory Analysis
    Nonparametric Discrimination Consistency
    Properties
  • In 1951 !
  • Interesting theorem: Cover and Hart (1967)
  • A KNN's error rate is within a factor of 2 of
    the optimal Bayesian classifier's

12
KNNs in one slide!
  • Start with bunch of known things and one unknown
    thing.
  • Find the K known things most similar to the
    unknown thing.
  • Count how many of the K known things are in each
    class.
  • The unknown thing is of the same class as the
    majority of the K known things.
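The four steps above, as a minimal sketch (the toy feature sets and the set-overlap similarity below are illustrative, not the paper's actual metric):

```python
from collections import Counter

def knn_classify(unknown, known, k=3):
    """Classify `unknown` by majority vote of its k nearest known examples.

    `known` is a list of (feature_set, label) pairs; similarity here is
    simple set overlap (more shared features = nearer), one of many
    possible metrics.
    """
    # Sort known examples by similarity to the unknown, most similar first.
    ranked = sorted(known, key=lambda kl: len(unknown & kl[0]), reverse=True)
    # Count class labels among the k nearest and take the majority.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy corpus: each message is a set of token features.
known = [({"cheap", "pills", "buy"}, "spam"),
         ({"meeting", "agenda"}, "ham"),
         ({"buy", "now", "pills"}, "spam"),
         ({"lunch", "agenda", "meeting"}, "ham")]

print(knn_classify({"buy", "pills", "now"}, known, k=3))  # spam
```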

13
Issues with Standard KNNs
  • How big is the neighborhood K?
  • How do you weight your neighbors?
  • Equal-vote?
  • Some falloff in weight?
  • Nearby interaction (the Parzen window)?
  • How do you train?
  • Everything? That gets big...
  • And SLOW.

14
Issues with Standard KNNs
  • How big is the neighborhood?
  • We will test with K = 3, 7, 21, and K = corpus
  • How do we weight the neighbors?
  • We will try equal-weighting, similarity,
    Euclidean distance, and combinations thereof.
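The weighting options above can be treated as interchangeable scoring functions over the chosen neighbors; a sketch (the exact falloff forms are illustrative, not the paper's):

```python
from collections import defaultdict

def weighted_vote(neighbors, weight_fn):
    """Tally per-class scores over (distance, similarity, label) tuples
    and return the class with the highest total."""
    scores = defaultdict(float)
    for dist, sim, label in neighbors:
        scores[label] += weight_fn(dist, sim)
    return max(scores, key=scores.get)

# Three of the weightings under test (illustrative forms):
equal   = lambda dist, sim: 1.0                  # equal-vote
by_sim  = lambda dist, sim: sim                  # weight by similarity
by_dist = lambda dist, sim: 1.0 / (dist + 1e-9)  # distance falloff

neighbors = [(1.0, 5, "spam"), (2.0, 3, "spam"), (0.5, 9, "ham")]
print(weighted_vote(neighbors, equal))    # spam (2 votes to 1)
print(weighted_vote(neighbors, by_dist))  # ham  (1/0.5 beats 1/1 + 1/2)
```

The point of the experiments is precisely that swapping `weight_fn` changes accuracy without touching the rest of the classifier.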

15
Issues with Standard KNNs
  • How do we train?
  • To compare with a good Markov classifier we need
    to use TOE (Train Only Errors)
  • This is good in that it really speeds up
    classification and keeps the database small.
  • This is bad in that it violates the Cover and
    Hart assumptions, so the quality limit theorem no
    longer applies.
  • BUT we will train multiple passes to see if an
    asymptote appears.

16
Issues with Standard KNNs
  • We found that the bad KNNs mimic Cover and Hart
    behavior: they insert basically everything into a
    bloated database, sometimes more than once!
  • The more accurate KNNs inserted fewer examples
    into their database.

17
How do we compare KNNs?
  • Use the TREC 2005 SA dataset.
  • 10-fold validation: train on 90%, test on 10%,
    repeat for each successive 10% (but remember to
    clear memory!)
  • Run 5 passes (find the asymptote)
  • Compare it versus the OSB Markovian tested at
    TREC 2005.
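The fold protocol above, as a sketch (the `train`/`classify` interface is hypothetical; a fresh classifier per fold is the "clear memory" step):

```python
def ten_fold(messages, labels, make_classifier):
    """Rotate a 10% held-out slice through the corpus and return
    overall accuracy; a fresh classifier is built for every fold."""
    n = len(messages)
    fold = n // 10
    correct = 0
    for i in range(10):
        test_idx = set(range(i * fold, (i + 1) * fold))
        clf = make_classifier()          # clear memory between folds
        for j in range(n):
            if j not in test_idx:        # train on the other 90%
                clf.train(messages[j], labels[j])
        correct += sum(clf.classify(messages[j]) == labels[j]
                       for j in test_idx)
    return correct / (10 * fold)
```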

18
What do we use as features?
  • Use the OSB feature set. This combines nearby
    words to make short phrases; the phrases are what
    are matched.
  • Example: "this is an example" yields
  • this is
  • this <skip> an
  • this <skip> <skip> example
  • These features are the measurements we classify
    against
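The OSB expansion can be sketched as below. The slide lists only the pairs anchored at the first token; the full feature set anchors at every token, and the 4-token window used here is an assumption (CRM114's actual window size may differ):

```python
def osb_features(text, window=4):
    """Orthogonal Sparse Bigram features: pair each token with each of
    the next `window - 1` tokens, marking skipped positions with <skip>."""
    toks = text.split()
    feats = []
    for i, head in enumerate(toks):
        for gap in range(1, window):
            if i + gap < len(toks):
                feats.append(" ".join(
                    [head] + ["<skip>"] * (gap - 1) + [toks[i + gap]]))
    return feats

print(osb_features("this is an example"))
# ['this is', 'this <skip> an', 'this <skip> <skip> example',
#  'is an', 'is <skip> example', 'an example']
```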

19
Test 1: Equal-Weight Voting KNN with K = 3, 7,
and 21
Asymptotic accuracy 93%, 93%, and 94% (good acc
98%, spam acc 80% for K = 3 and 7; 96% and 90%
for K = 21). Time: 50-75 milliseconds/message
20
Test 2: Weight by Hamming^(-1/2) KNN with K = 7 and
21
Asymptotic accuracy 94% and 92% (good acc 98%,
spam acc 85% for K = 7; 98% and 79% for
K = 21). Time: 60 milliseconds/message
21
Test 3: Weight by Hamming^(-1/2) KNN with K =
corpus
Asymptotic accuracy 97.8%. Good accuracy
98.2%. Spam accuracy 96.9%. Time: 32 msec/message
22
Test 4: Weight by N-dimensional radiation
model (a.k.a. Hyperspace)
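The slides do not spell out the radiation-model formula, so the sketch below is one plausible reading and an assumption, not the paper's implementation: every known message "radiates" toward its class with intensity that grows with shared features and falls off with distance, and the entire corpus votes rather than only K neighbors.

```python
def hyperspace_score(unknown, known):
    """Sum per-class 'radiation' from every known message and return the
    winning class. Intensity is (shared features)^2 over a distance term;
    this exact form is illustrative."""
    scores = {}
    for feats, label in known:
        shared = len(unknown & feats)
        # Symmetric difference as a squared-distance proxy; +1 avoids /0.
        dist2 = len(unknown ^ feats) + 1
        scores[label] = scores.get(label, 0.0) + shared * shared / dist2
    return max(scores, key=scores.get)
```

Because the falloff is steep, distant junk contributes almost nothing; this is why accurate variants can afford to keep the database small.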
23
Test 4: Hyperspace weight, K = corpus, d = 1,
2, 3
Asymptotic accuracy 99.3%. Good accuracy 99.64%,
99.66%, and 99.59%. Spam accuracy 98.7%, 98.4%,
98.5%. Time: 32, 22, and 22 milliseconds/message
24
Test 5: Compare vs. Markov OSB (thin threshold)
Asymptotic accuracy 99.1%. Good accuracy 99.6%,
Spam accuracy 97.9%. Time: 31 msec/message
25
Test 6: Compare vs. Markov OSB (thick threshold,
10.0 pR)
  • Thick Threshold means:
  • Test it first.
  • If it is wrong, train it.
  • If it was right, but only by less than the
    threshold thickness, train it anyway!
  • 10.0 pR units is roughly the range between 10%
    and 90% certainty.
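The three bullets above reduce to a short training rule; a sketch (the classifier interface and the signed pR confidence score are hypothetical):

```python
def thick_threshold_train(clf, message, true_label, threshold=10.0):
    """Test first; train on errors, and also on correct answers that won
    by less than `threshold` pR units (the 'thick threshold')."""
    label, pr = clf.classify(message)   # pr: signed confidence in pR units
    if label != true_label or abs(pr) < threshold:
        clf.train(message, true_label)
```

Compared with plain TOE, this trains extra borderline examples, which is what buys the accuracy improvement reported on the next slide.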

26
Test 6: Compare vs. Markov OSB (thick threshold,
10.0 pR)
Asymptotic accuracy 99.5%. Good accuracy 99.6%,
Spam accuracy 99.3%. Time: 19 msec/message
27
Conclusions
  • Small-K KNNs are not very good for sorting spam.

28
Conclusions
  • Small-K KNNs are not very good for sorting spam.
  • K = corpus KNNs with distance weighting are
    reasonable.

29
Conclusions
  • Small-K KNNs are not very good for sorting spam.
  • K = corpus KNNs with distance weighting are
    reasonable.
  • K = corpus KNNs with hyperspace weighting are
    pretty good.

30
Conclusions
  • Small-K KNNs are not very good for sorting spam.
  • K = corpus KNNs with distance weighting are
    reasonable.
  • K = corpus KNNs with hyperspace weighting are
    pretty good.
  • But thick-threshold trained Markovs seem to be
    more accurate, especially in single-pass training.

31
Thank you! Questions?
  • Full source is available at
  • http://crm114.sourceforge.net
  • (licensed under the GPL)