Title: Sorting Spam with K-Nearest Neighbor and Hyperspace Classifiers
- William Yerazunis (1)
- Fidelis Assis (2)
- Christian Siefkes (3)
- Shalendra Chhabra (1, 4)
- (1) Mitsubishi Electric Research Labs, Cambridge, MA
- (2) Empresa Brasileira de Telecomunicações (Embratel), Rio de Janeiro, RJ, Brazil
- (3) Database and Information Systems Group, Freie Universität Berlin / Berlin-Brandenburg Graduate School in Distributed Information Systems
- (4) Computer Science and Engineering, University of California, Riverside, CA
Bayesian is Great. Why Worry?
- Typical spam filters are linear classifiers.
- Consider the checkerboard problem.
- Markovian requires the nonlinear features to be textually near each other - we can't be sure that will work forever, because spammers are clever.
- Winnow is just a different weighting and a different chain rule.
Bayesian is Great. Why Worry?
- Bayesian is only a linear classifier.
- Consider the checkerboard problem (sketched below).
- Markovian requires the nonlinear features to be textually near each other - we can't be sure of that; spammers are clever.
- Winnow is just a different weighting.
- KNNs are a very different kind of classifier.
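As a concrete illustration of the checkerboard problem (my sketch, not part of the talk): a linear classifier such as logistic regression stays near chance accuracy on checkerboard-labeled points, while a small-K KNN fits the pattern easily. The scikit-learn classifiers and the specific parameters are assumptions chosen only to make the point.

```python
# Illustrative sketch (not from the talk): a linear classifier cannot fit a
# checkerboard pattern, while a K-nearest-neighbor classifier can.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=(4000, 2))        # random points in a 4x4 square
y = X.astype(int).sum(axis=1) % 2            # checkerboard labels: parity of the cell

X_train, y_train = X[:3000], y[:3000]
X_test, y_test = X[3000:], y[3000:]

linear = LogisticRegression().fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)

print("linear accuracy:", linear.score(X_test, y_test))   # near 0.5 (chance)
print("knn accuracy:   ", knn.score(X_test, y_test))      # close to 1.0
```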
Typical Linear Separation
(series of figures illustrating a linear decision boundary)
Nonlinear Decision Surfaces
Nonlinear decision surfaces require tremendous amounts of data.
Nonlinear Decision and KNN / Hyperspace
Nonlinear decision surfaces require tremendous amounts of data.
KNNs have been around
- Earliest found reference: E. Fix and J. Hodges, "Discriminatory Analysis: Nonparametric Discrimination: Consistency Properties" - in 1951!
- Interesting theorem from Cover and Hart (1967): the KNN error rate is within a factor of 2 of the optimal Bayes classifier.
KNNs in one slide!
- Start with a bunch of known things and one unknown thing.
- Find the K known things most similar to the unknown thing.
- Count how many of the K known things are in each class.
- The unknown thing is of the same class as the majority of the K known things (see the sketch below).
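A minimal Python sketch of the four steps above; the set-valued features, the symmetric-difference distance, and the toy data are placeholder assumptions, not the CRM114 implementation.

```python
# Minimal sketch of the KNN procedure described above. The feature sets and the
# distance measure are stand-ins; the talk uses OSB features and Hamming-style
# distances between feature sets.
from collections import Counter

def knn_classify(unknown, known, k=7):
    """known: list of (feature_set, label); unknown: a feature set (a Python set)."""
    # 1. Measure how far the unknown thing is from every known thing.
    def distance(a, b):
        return len(a ^ b)                 # symmetric difference = Hamming-like distance
    # 2. Find the K known things most similar to the unknown thing.
    neighbors = sorted(known, key=lambda kv: distance(unknown, kv[0]))[:k]
    # 3. Count how many of the K neighbors fall in each class.
    votes = Counter(label for _, label in neighbors)
    # 4. The unknown thing takes the class of the majority of its neighbors.
    return votes.most_common(1)[0][0]

# Hypothetical toy example:
known = [({"buy", "viagra", "now"}, "spam"), ({"meeting", "at", "noon"}, "good"),
         ({"cheap", "viagra"}, "spam"), ({"lunch", "at", "noon"}, "good")]
print(knn_classify({"viagra", "now"}, known, k=3))   # -> "spam"
```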
Issues with Standard KNNs
- How big is the neighborhood K?
- How do you weight your neighbors?
  - Equal vote?
  - Some falloff in weight?
  - Nearby interaction - the Parzen window?
- How do you train?
  - Everything? That gets big...
  - And SLOW.
Issues with Standard KNNs
- How big is the neighborhood?
  - We will test with K = 3, 7, 21, and the whole corpus.
- How do we weight the neighbors?
  - We will try equal weighting, similarity, Euclidean distance, and combinations thereof (sketched below).
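A hedged sketch of what the weighting options above might look like; the exact formulas used in the experiments are not given on the slide, so these functions are illustrative assumptions.

```python
# Illustrative neighbor-weighting schemes (assumed forms, not the slide's formulas).
def equal_weight(shared, distance):
    return 1.0                               # every neighbor gets one full vote

def similarity_weight(shared, distance):
    return float(shared)                     # weight by number of shared features

def inverse_distance_weight(shared, distance):
    return 1.0 / (1.0 + distance)            # weight falls off with (Hamming) distance

def weighted_vote(neighbors, weight_fn):
    """neighbors: list of (shared_feature_count, distance, label) tuples."""
    scores = {}
    for shared, distance, label in neighbors:
        scores[label] = scores.get(label, 0.0) + weight_fn(shared, distance)
    return max(scores, key=scores.get)
```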
Issues with Standard KNNs
- How do we train?
  - To compare with a good Markov classifier we need to use TOE (Train Only Errors), sketched below.
  - This is good in that it really speeds up classification and keeps the database small.
  - This is bad in that it violates the Cover and Hart assumptions, so the quality-limit theorem no longer applies.
  - BUT we will train multiple passes to see if an asymptote appears.
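A sketch of the TOE regime described above, run for several passes so an accuracy asymptote can show up; classify() and train() are placeholders for the real KNN classifier.

```python
# Sketch of TOE (Train Only Errors): only misclassified messages are added to the
# database, and the corpus is replayed for several passes to look for an asymptote.
def run_toe(messages, classify, train, passes=5):
    for p in range(passes):
        errors = 0
        for text, true_label in messages:
            predicted = classify(text)
            if predicted != true_label:
                train(text, true_label)      # learn only the mistakes
                errors += 1
        print(f"pass {p + 1}: {errors} training events, "
              f"accuracy {1 - errors / len(messages):.3f}")
```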
Issues with Standard KNNs
- We found the bad KNNs mimic Cover and Hart behavior - they insert basically everything into a bloated database, sometimes more than once!
- The more accurate KNNs inserted fewer examples into their database.
How do we compare KNNs?
- Use the TREC 2005 SA dataset.
- 10-fold validation: train on 90%, test on 10%, repeat for each successive 10% (but remember to clear memory!) - sketched below.
- Run 5 passes (to find the asymptote).
- Compare against the OSB Markovian tested at TREC 2005.
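A sketch of the evaluation protocol above, assuming a make_classifier() factory so the classifier's memory really is cleared between folds; train_if_needed() and classify() are hypothetical placeholders for the real filter.

```python
# Sketch of 10-fold validation with memory cleared per fold and 5 training passes.
def ten_fold(messages, make_classifier, passes=5):
    n = len(messages)
    fold = n // 10
    results = []
    for i in range(10):
        test = messages[i * fold:(i + 1) * fold]
        train = messages[:i * fold] + messages[(i + 1) * fold:]
        clf = make_classifier()                      # fresh classifier: memory cleared
        for _ in range(passes):
            for text, label in train:
                clf.train_if_needed(text, label)     # e.g. TOE or thick-threshold training
        correct = sum(clf.classify(text) == label for text, label in test)
        results.append(correct / len(test))
    return sum(results) / len(results)
```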
What do we use as features?
- Use the OSB feature set. This combines nearby words to make short phrases; the phrases are what are matched.
- Example: "this is an example" yields
  - "this is"
  - "this <skip> an"
  - "this <skip> <skip> example"
- These features are the measurements we classify against (see the sketch below).
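A sketch of OSB (orthogonal sparse bigram) feature generation that reproduces the example above: each word is paired with each of the next few words, with <skip> markers recording the gap. The window size of 4 tokens is an assumption; CRM114's actual tokenization may differ in detail.

```python
# Sketch of OSB feature generation matching the slide's example.
def osb_features(text, window=4):
    words = text.split()
    features = []
    for i, w in enumerate(words):
        for gap in range(1, window):
            if i + gap >= len(words):
                break
            # pair word i with word i+gap, marking the skipped positions
            features.append(" ".join([w] + ["<skip>"] * (gap - 1) + [words[i + gap]]))
    return features

print(osb_features("this is an example"))
# ['this is', 'this <skip> an', 'this <skip> <skip> example',
#  'is an', 'is <skip> example', 'an example']
```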
Test 1: Equal-Weight Voting KNN with K = 3, 7, and 21
Asymptotic accuracy: 93%, 93%, and 94% (good accuracy 98%, spam accuracy 80% for K = 3 and 7; 96% and 90% for K = 21). Time: 50-75 milliseconds/message.
Test 2: Weight by Hamming^(-1/2), KNN with K = 7 and 21
Asymptotic accuracy: 94% and 92% (good accuracy 98%, spam accuracy 85% for K = 7; 98% and 79% for K = 21). Time: 60 milliseconds/message.
Test 3: Weight by Hamming^(-1/2), KNN with K = corpus
Asymptotic accuracy: 97.8%. Good accuracy: 98.2%. Spam accuracy: 96.9%. Time: 32 msec/message.
Test 4: Weight by N-dimensional radiation model (a.k.a. Hyperspace)
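The slide names the weighting but not the formula, so the following is only my reading of a "radiation model": each known document contributes a vote that falls off with the square of its feature-space distance, the way radiated intensity falls off as 1/r². This is an assumption for illustration, not the CRM114 hyperspace formula.

```python
# Hedged sketch of a radiation-style weighting with K = corpus: every known
# document votes, weighted by an inverse-square falloff in feature space.
# (Assumed form; not the exact CRM114 formula.)
def hyperspace_score(unknown, known_docs):
    """unknown: feature set; known_docs: list of (feature_set, label)."""
    scores = {}
    for features, label in known_docs:
        # For set-valued (binary) features, squared Euclidean distance equals the
        # size of the symmetric difference.
        distance_sq = len(unknown ^ features)
        scores[label] = scores.get(label, 0.0) + 1.0 / (1.0 + distance_sq)
    return max(scores, key=scores.get)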
Test 4: Hyperspace weight, K = corpus, d = 1, 2, 3
Asymptotic accuracy: 99.3%. Good accuracy: 99.64%, 99.66%, and 99.59%. Spam accuracy: 98.7%, 98.4%, and 98.5%. Time: 32, 22, and 22 milliseconds/message.
Test 5: Compare vs. Markov OSB (thin threshold)
Asymptotic accuracy: 99.1%. Good accuracy: 99.6%. Spam accuracy: 97.9%. Time: 31 msec/message.
Test 6: Compare vs. Markov OSB (thick threshold = 10.0 pR)
- Thick threshold means:
  - Test it first.
  - If it is wrong, train it.
  - If it was right, but only by less than the threshold thickness, train it anyway! (Sketched below.)
- 10.0 pR units is roughly the range between 10% and 90% certainty.
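A sketch of one thick-threshold training step as described above; the score() function returning a pR-like margin, and its sign convention, are assumptions standing in for the real classifier.

```python
# Sketch of thick-threshold training: train on every error, and also on correct
# decisions whose score margin falls inside the threshold band.
def thick_threshold_step(text, true_label, score, train, threshold=10.0):
    pr = score(text)                          # assumed convention: positive = good, negative = spam
    predicted = "good" if pr >= 0 else "spam"
    if predicted != true_label:
        train(text, true_label)               # wrong: always train
    elif abs(pr) < threshold:
        train(text, true_label)               # right, but not by enough: train anyway
```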
Test 6: Compare vs. Markov OSB (thick threshold = 10.0 pR)
Asymptotic accuracy: 99.5%. Good accuracy: 99.6%. Spam accuracy: 99.3%. Time: 19 msec/message.
Conclusions
- Small-K KNNs are not very good for sorting spam.
- K = corpus KNNs with distance weighting are reasonable.
- K = corpus KNNs with hyperspace weighting are pretty good.
- But thick-threshold-trained Markovs seem to be more accurate, especially in single-pass training.
Thank you! Questions?
- Full source is available at http://crm114.sourceforge.net (licensed under the GPL).