Title: Sorting Spam with K-Nearest Neighbor and Hyperspace Classifiers
- William Yerazunis (1)
- Fidelis Assis (2)
- Christian Siefkes (3)
- Shalendra Chhabra (1, 4)
- (1) Mitsubishi Electric Research Labs, Cambridge, MA
- (2) Empresa Brasileira de Telecomunicações (Embratel), Rio de Janeiro, RJ, Brazil
- (3) Database and Information Systems Group, Freie Universität Berlin / Berlin-Brandenburg Graduate School in Distributed Information Systems
- (4) Computer Science and Engineering, University of California, Riverside, CA
Bayesian is Great. Why Worry?
- Typical spam filters are linear classifiers.
- Consider the checkerboard problem.
- Markovian requires the nonlinear features to be textually near each other - we can't be sure that will work forever, because spammers are clever.
- Winnow is just a different weighting and a different chain rule.
Bayesian is Great. Why Worry?
- Bayesian is only a linear classifier.
- Consider the checkerboard problem (sketched below).
- Markovian requires the nonlinear features to be textually near each other - we can't be sure of that; spammers are clever.
- Winnow is just a different weighting.
- KNNs are a very different kind of classifier.
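As a concrete illustration of the checkerboard problem (my sketch, not part of the talk): a linear classifier such as logistic regression stays near chance accuracy on checkerboard-labeled points, while a small-K KNN fits the pattern easily. The scikit-learn classifiers and the specific parameters are assumptions chosen only to make the point.

```python
# Illustrative sketch (not from the talk): a linear classifier cannot fit a
# checkerboard pattern, while a K-nearest-neighbor classifier can.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=(4000, 2))        # random points in a 4x4 square
y = X.astype(int).sum(axis=1) % 2            # checkerboard labels: parity of the cell

X_train, y_train = X[:3000], y[:3000]
X_test, y_test = X[3000:], y[3000:]

linear = LogisticRegression().fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)

print("linear accuracy:", linear.score(X_test, y_test))   # near 0.5 (chance)
print("knn accuracy:   ", knn.score(X_test, y_test))      # close to 1.0
```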
Typical Linear Separation
(series of figures illustrating a linear decision boundary)
Nonlinear Decision Surfaces
Nonlinear decision surfaces require tremendous amounts of data.
Nonlinear Decision and KNN / Hyperspace
Nonlinear decision surfaces require tremendous amounts of data.
KNNs have been around
- Earliest found reference: E. Fix and J. Hodges, "Discriminatory Analysis: Nonparametric Discrimination: Consistency Properties" - in 1951!
- Interesting theorem from Cover and Hart (1967): the KNN error rate is within a factor of 2 of the optimal Bayes classifier.
KNNs in one slide!
- Start with a bunch of known things and one unknown thing.
- Find the K known things most similar to the unknown thing.
- Count how many of the K known things are in each class.
- The unknown thing is of the same class as the majority of the K known things (see the sketch below).
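A minimal Python sketch of the four steps above; the set-valued features, the symmetric-difference distance, and the toy data are placeholder assumptions, not the CRM114 implementation.

```python
# Minimal sketch of the KNN procedure described above. The feature sets and the
# distance measure are stand-ins; the talk uses OSB features and Hamming-style
# distances between feature sets.
from collections import Counter

def knn_classify(unknown, known, k=7):
    """known: list of (feature_set, label); unknown: a feature set (a Python set)."""
    # 1. Measure how far the unknown thing is from every known thing.
    def distance(a, b):
        return len(a ^ b)                 # symmetric difference = Hamming-like distance
    # 2. Find the K known things most similar to the unknown thing.
    neighbors = sorted(known, key=lambda kv: distance(unknown, kv[0]))[:k]
    # 3. Count how many of the K neighbors fall in each class.
    votes = Counter(label for _, label in neighbors)
    # 4. The unknown thing takes the class of the majority of its neighbors.
    return votes.most_common(1)[0][0]

# Hypothetical toy example:
known = [({"buy", "viagra", "now"}, "spam"), ({"meeting", "at", "noon"}, "good"),
         ({"cheap", "viagra"}, "spam"), ({"lunch", "at", "noon"}, "good")]
print(knn_classify({"viagra", "now"}, known, k=3))   # -> "spam"
```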
Issues with Standard KNNs
- How big is the neighborhood K?
- How do you weight your neighbors?
  - Equal vote?
  - Some falloff in weight?
  - Nearby interaction - the Parzen window?
- How do you train?
  - Everything? That gets big...
  - And SLOW.
Issues with Standard KNNs
- How big is the neighborhood?
  - We will test with K = 3, 7, 21, and the whole corpus.
- How do we weight the neighbors?
  - We will try equal weighting, similarity, Euclidean distance, and combinations thereof (sketched below).
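A hedged sketch of what the weighting options above might look like; the exact formulas used in the experiments are not given on the slide, so these functions are illustrative assumptions.

```python
# Illustrative neighbor-weighting schemes (assumed forms, not the slide's formulas).
def equal_weight(shared, distance):
    return 1.0                               # every neighbor gets one full vote

def similarity_weight(shared, distance):
    return float(shared)                     # weight by number of shared features

def inverse_distance_weight(shared, distance):
    return 1.0 / (1.0 + distance)            # weight falls off with (Hamming) distance

def weighted_vote(neighbors, weight_fn):
    """neighbors: list of (shared_feature_count, distance, label) tuples."""
    scores = {}
    for shared, distance, label in neighbors:
        scores[label] = scores.get(label, 0.0) + weight_fn(shared, distance)
    return max(scores, key=scores.get)
```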
Issues with Standard KNNs
- How do we train?
  - To compare with a good Markov classifier we need to use TOE (Train Only Errors), sketched below.
  - This is good in that it really speeds up classification and keeps the database small.
  - This is bad in that it violates the Cover and Hart assumptions, so the quality-limit theorem no longer applies.
  - BUT we will train multiple passes to see if an asymptote appears.
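A sketch of the TOE regime described above, run for several passes so an accuracy asymptote can show up; classify() and train() are placeholders for the real KNN classifier.

```python
# Sketch of TOE (Train Only Errors): only misclassified messages are added to the
# database, and the corpus is replayed for several passes to look for an asymptote.
def run_toe(messages, classify, train, passes=5):
    for p in range(passes):
        errors = 0
        for text, true_label in messages:
            predicted = classify(text)
            if predicted != true_label:
                train(text, true_label)      # learn only the mistakes
                errors += 1
        print(f"pass {p + 1}: {errors} training events, "
              f"accuracy {1 - errors / len(messages):.3f}")
```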
Issues with Standard KNNs
- We found the bad KNNs mimic Cover and Hart behavior - they insert basically everything into a bloated database, sometimes more than once!
- The more accurate KNNs inserted fewer examples into their database.
How do we compare KNNs?
- Use the TREC 2005 SA dataset.
- 10-fold validation: train on 90%, test on 10%, repeat for each successive 10% (but remember to clear memory!) - sketched below.
- Run 5 passes (to find the asymptote).
- Compare against the OSB Markovian tested at TREC 2005.
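A sketch of the evaluation protocol above, assuming a make_classifier() factory so the classifier's memory really is cleared between folds; train_if_needed() and classify() are hypothetical placeholders for the real filter.

```python
# Sketch of 10-fold validation with memory cleared per fold and 5 training passes.
def ten_fold(messages, make_classifier, passes=5):
    n = len(messages)
    fold = n // 10
    results = []
    for i in range(10):
        test = messages[i * fold:(i + 1) * fold]
        train = messages[:i * fold] + messages[(i + 1) * fold:]
        clf = make_classifier()                      # fresh classifier: memory cleared
        for _ in range(passes):
            for text, label in train:
                clf.train_if_needed(text, label)     # e.g. TOE or thick-threshold training
        correct = sum(clf.classify(text) == label for text, label in test)
        results.append(correct / len(test))
    return sum(results) / len(results)
```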
What do we use as features?
- Use the OSB feature set. This combines nearby words to make short phrases; the phrases are what are matched.
- Example: "this is an example" yields
  - "this is"
  - "this <skip> an"
  - "this <skip> <skip> example"
- These features are the measurements we classify against (see the sketch below).
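A sketch of OSB (orthogonal sparse bigram) feature generation that reproduces the example above: each word is paired with each of the next few words, with <skip> markers recording the gap. The window size of 4 tokens is an assumption; CRM114's actual tokenization may differ in detail.

```python
# Sketch of OSB feature generation matching the slide's example.
def osb_features(text, window=4):
    words = text.split()
    features = []
    for i, w in enumerate(words):
        for gap in range(1, window):
            if i + gap >= len(words):
                break
            # pair word i with word i+gap, marking the skipped positions
            features.append(" ".join([w] + ["<skip>"] * (gap - 1) + [words[i + gap]]))
    return features

print(osb_features("this is an example"))
# ['this is', 'this <skip> an', 'this <skip> <skip> example',
#  'is an', 'is <skip> example', 'an example']
```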
Test 1: Equal-Weight Voting KNN with K = 3, 7, and 21
Asymptotic accuracy: 93%, 93%, and 94% (good accuracy 98%, spam accuracy 80% for K = 3 and 7; 96% and 90% for K = 21). Time: 50-75 milliseconds/message.
Test 2: Weight by Hamming^(-1/2), KNN with K = 7 and 21
Asymptotic accuracy: 94% and 92% (good accuracy 98%, spam accuracy 85% for K = 7; 98% and 79% for K = 21). Time: 60 milliseconds/message.
Test 3: Weight by Hamming^(-1/2), KNN with K = corpus
Asymptotic accuracy: 97.8%. Good accuracy: 98.2%. Spam accuracy: 96.9%. Time: 32 msec/message.
Test 4: Weight by N-dimensional radiation model (a.k.a. Hyperspace)
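The slide names the weighting but not the formula, so the following is only my reading of a "radiation model": each known document contributes a vote that falls off with the square of its feature-space distance, the way radiated intensity falls off as 1/r². This is an assumption for illustration, not the CRM114 hyperspace formula.

```python
# Hedged sketch of a radiation-style weighting with K = corpus: every known
# document votes, weighted by an inverse-square falloff in feature space.
# (Assumed form; not the exact CRM114 formula.)
def hyperspace_score(unknown, known_docs):
    """unknown: feature set; known_docs: list of (feature_set, label)."""
    scores = {}
    for features, label in known_docs:
        # For set-valued (binary) features, squared Euclidean distance equals the
        # size of the symmetric difference.
        distance_sq = len(unknown ^ features)
        scores[label] = scores.get(label, 0.0) + 1.0 / (1.0 + distance_sq)
    return max(scores, key=scores.get)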
Test 4: Hyperspace weight, K = corpus, d = 1, 2, 3
Asymptotic accuracy: 99.3%. Good accuracy: 99.64%, 99.66%, and 99.59%. Spam accuracy: 98.7%, 98.4%, and 98.5%. Time: 32, 22, and 22 milliseconds/message.
Test 5: Compare vs. Markov OSB (thin threshold)
Asymptotic accuracy: 99.1%. Good accuracy: 99.6%. Spam accuracy: 97.9%. Time: 31 msec/message.
Test 6: Compare vs. Markov OSB (thick threshold = 10.0 pR)
- Thick threshold means:
  - Test it first.
  - If it is wrong, train it.
  - If it was right, but only by less than the threshold thickness, train it anyway! (Sketched below.)
- 10.0 pR units is roughly the range between 10% and 90% certainty.
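A sketch of one thick-threshold training step as described above; the score() function returning a pR-like margin, and its sign convention, are assumptions standing in for the real classifier.

```python
# Sketch of thick-threshold training: train on every error, and also on correct
# decisions whose score margin falls inside the threshold band.
def thick_threshold_step(text, true_label, score, train, threshold=10.0):
    pr = score(text)                          # assumed convention: positive = good, negative = spam
    predicted = "good" if pr >= 0 else "spam"
    if predicted != true_label:
        train(text, true_label)               # wrong: always train
    elif abs(pr) < threshold:
        train(text, true_label)               # right, but not by enough: train anyway
```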
Test 6: Compare vs. Markov OSB (thick threshold = 10.0 pR)
Asymptotic accuracy: 99.5%. Good accuracy: 99.6%. Spam accuracy: 99.3%. Time: 19 msec/message.
Conclusions
- Small-K KNNs are not very good for sorting spam.
- K = corpus KNNs with distance weighting are reasonable.
- K = corpus KNNs with hyperspace weighting are pretty good.
- But thick-threshold-trained Markovs seem to be more accurate, especially in single-pass training.
Thank you! Questions?
- Full source is available at http://crm114.sourceforge.net (licensed under the GPL).