Email and Spam - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Email and Spam

1
Email and Spam
Joshua Goodman, Microsoft Corporation
http://research.microsoft.com/joshuago
(Slides stolen from lots of people,
especially Geoff Hulten)
2
Source: Pew Internet & American Life Project
3
Email addiction
Source: AOL Email Addiction Survey

41% check email first thing in the morning
23% have checked in bed in their pajamas

26% say they haven't gone more than two to three
days without checking their email.

4
Overview

Email
Most important application
Great research problems for people working on
Machine Learning / Data Mining
Spam
Techniques spammers use
Solutions to Spam
Fun problems you find building real systems

5
Part 1: Email

A sample of interesting machine learning / data
mining email problems
Finding what's important
Priorities
Organizing mail
Auto foldering
Auto tagging
Finding what's interesting
Automatic search
Contact finding

6
Priorities (Eric Horvitz, Andy Jacobs, David
Hovel, etc.)

Automatically determines how important your email
is
Send to your cell phone
Different sound/toast
Uses machine learning
Sent directly to you?
From your manager?
Uses future tense?
Future dates?

Personal request: "will you", "can you"
Importance: "is important", "is critical"
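
The cues on this slide can be turned into simple yes/no features for a learned model. A minimal sketch, assuming a plain-text body and already-parsed addresses. The function name, feature names, and the future-tense word list are hypothetical, and the "future dates" cue is left out because it needs a date parser.

```python
import re

# Cue phrases quoted on the slide.
REQUEST_CUES = ("will you", "can you")
IMPORTANCE_CUES = ("is important", "is critical")
# Rough stand-in for "uses future tense" (an assumption, not from the slide).
FUTURE_TENSE = re.compile(r"\b(will|shall|going to)\b")


def priority_features(body, to_addrs, sender, user, manager):
    """Return boolean features that a classifier could learn weights for."""
    text = body.lower()
    return {
        "sent_directly": to_addrs == [user],
        "from_manager": sender == manager,
        "future_tense": bool(FUTURE_TENSE.search(text)),
        "personal_request": any(cue in text for cue in REQUEST_CUES),
        "importance_phrase": any(cue in text for cue in IMPORTANCE_CUES),
    }


features = priority_features(
    "Can you send the report? It is critical.",
    to_addrs=["me@example.com"],
    sender="boss@example.com",
    user="me@example.com",
    manager="boss@example.com",
)
```

The features only describe the message. How much each one matters is what the machine learning step would decide.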

7
Auto Foldering (Jake Brutlag and Chris Meek)

Use machine learning to figure out automatically
what folder mail goes in.
Interesting text classification problem
Folders contain as few as three entries
Data changes over time

8
Automatic Tagging for Email (Arun C. Surendran,
John C. Platt and Erin Renshaw)

Automatically tag email messages to enrich search,
organization and navigation.
How it works
Put messages into clusters
Naming clusters is hard
Use domain-dependent filtering (remove common
intranet words)
Use noun phrases from subjects
Words do not have to occur
in all messages in cluster

college NCAA basketball
NBA Basketball bball
March Madness NCAA tournament

A message about NCAA
basketball might also get the tag college if
that keyword occurs in most other messages in the
cluster.
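
The last point (a tag does not have to occur in every message of the cluster) can be sketched as a majority vote over the cluster. This assumes the clustering and keyword extraction are already done. The function name and the 50% cutoff are hypothetical, and for simplicity the vote counts every message in the cluster rather than "most other messages".

```python
from collections import Counter


def cluster_tags(cluster, min_fraction=0.5):
    """Return keywords that occur in more than min_fraction of the messages."""
    counts = Counter(word for message in cluster for word in set(message))
    needed = min_fraction * len(cluster)
    return sorted(word for word, count in counts.items() if count > needed)


# Each message is represented by its set of keywords (made-up example data).
cluster = [
    {"college", "ncaa", "basketball"},
    {"college", "basketball", "tournament"},
    {"ncaa", "basketball"},
    {"college", "march", "madness", "basketball"},
]
tags = cluster_tags(cluster)
```

Here the third message never mentions "college", but it still inherits that tag because most of its cluster does.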

9
Automatic Search (Joshua Goodman and Vitor
Carvalho)

Automatically show users useful search results
Examined over 20 factors
Automatically train machine learning system to
weight them.
Frequency of keywords in Internet Search query
logs (MSN) is third most helpful feature (after
TF and IDF)
Helped solve lots of linguistic problems
Almost everything in query logs is a meaningful
phrase
Much easier to port to multiple languages

10
Contact Finding (T. Kristjansson, A. Culotta, P.
Viola and A. McCallum)

Automatically find contact information in an
email message.
Machine learning method: train it by showing
examples

Propagates corrections
If you fix first name, makes a new guess for last
name.
Extensions to CRFs
Constrained decoding
Confidence estimation

Recent even better results with discriminatively
trained CFGs (Viola and Narasimhan)

11
Other Interesting Email Research

All of the research I've just shown you is from
Microsoft Research
Main reason: much easier to steal slides from
colleagues with nearby offices
Why do people in MSR spend so much time working
on email problems?
CALO Project
Cognitive Assistant that Learns and Organizes
DARPA funded project led by SRI, with 22
organizations participating
Main way you deal with your automated assistant
is through email.
RADAR Project
Primarily at CMU (11 research groups) (DARPA
funded)
Cognitive assistant that can do tasks like space
planning, automated web master, etc.
Primary interface to the assistant is through
email

12
Part 2: Spam

SPAM is the number one problem for email systems
Estimates: from about 71% to 87% of mail is spam
If you stop 90% of the spam, over a billion spam
a day will get past filters worldwide, and 20% of
your inbox will be spam.
Overview
Techniques spammers use
Solutions to Spam
And some of the interesting research problems
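
The inbox figure above follows from simple arithmetic. A small sketch, assuming all good mail reaches the inbox and the filter catches a fixed share of spam (the function name is hypothetical):

```python
def inbox_spam_fraction(spam_share, catch_rate):
    """Fraction of delivered mail that is spam after filtering."""
    missed_spam = spam_share * (1 - catch_rate)
    good_mail = 1 - spam_share
    return missed_spam / (missed_spam + good_mail)


low = inbox_spam_fraction(0.71, 0.90)   # low end of the spam estimates
high = inbox_spam_fraction(0.87, 0.90)  # high end of the spam estimates
```

With 71% spam, a 90% catch rate leaves the inbox roughly 20% spam. With 87% spam, the same filter leaves it roughly 40% spam.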

13
Techniques spammers use

A few examples of tricks spammers use to get past
spam filters
Most spam filters have text classification as
main or important part, often with linear models
(e.g. Naïve Bayes, etc.)

14
The Hitchhiker Chaffer

Content Chaff
Random passages from the Hitchhiker's Guide
Footers from valid mail

"This must be Thursday," said Arthur to himself,
sinking low over his beer, "I never could get the
hang of Thursdays."
Express yourself with MSN Messenger 6.0
15
Hitchhiker Chaffer's Later Work

Can use hidden text, e.g. white on white or many
other tricks
User sees only spammy text
Spam filter sees everything, including good words.
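
Why chaff works against a linear text classifier can be shown with a toy score. A minimal sketch in which every weight is invented for illustration. A real filter learns its weights from labeled mail.

```python
# Hypothetical per-word weights: positive means spammy, negative means good.
WEIGHTS = {
    "viagra": 3.0, "free": 1.5,
    "thursday": -1.0, "beer": -0.8, "arthur": -1.2, "messenger": -1.0,
}


def spam_score(words, weights=WEIGHTS):
    """Linear score: the sum of the weights of the words the filter sees."""
    return sum(weights.get(word, 0.0) for word in words)


visible = ["viagra", "free"]                         # what the user reads
hidden = ["thursday", "beer", "arthur", "messenger"]  # white-on-white chaff
plain_score = spam_score(visible)
chaffed_score = spam_score(visible + hidden)
```

The hidden good words pull the score down even though the user never sees them, which is why a filter usually has to reason about what is actually rendered.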

16
Hitchhiker Chaffer's Later Work

Can use hidden text, e.g. white on white or many
other tricks

Also included a number of unusual statements made
by candidates during, On display? I eventually
had to go down to the cellar to find them.
http://join.msn.com/?Pagefeatures/es
17
Secret Decoder Ring

Looks easy
Is it?

Viagra Proven sexual aid to enhance
performance
18
Secret Decoder Ring Dude

Character Encoding
HTML word breaking

Phar&#109;acy Prod&#117;c<!LZJ>t<!LG>s
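
Both tricks on this slide can be undone before classification by rendering the text the way a mail client would. A minimal sketch using only the Python standard library. The sample string is a hypothetical example of numeric character references plus junk tags that split a word, and the regex is a rough stand-in for a real HTML parser.

```python
import html
import re

# Matches anything that looks like a tag, including junk such as <!LZJ>.
TAG = re.compile(r"<[^>]*>")


def visible_text(raw_html):
    """Strip tags, then decode character references such as &#109;."""
    without_tags = TAG.sub("", raw_html)
    return html.unescape(without_tags)


decoded = visible_text("Phar&#109;acy Prod&#117;c<!LZJ>t<!LG>s")
```

After this step the filter sees the same words the user does.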
19
Diploma Guy

Word Obscuring

Dplmoia Pragorm Caerte a mroe prosoeprus
20
Diploma Guy

Word Obscuring

Dipmloa Paogrrm Cterae a more presporous
21
Diploma Guy

Word Obscuring

Dimlpoa Pgorram Cearte a more poosperrus
22
Diploma Guy

Word Obscuring

Dpmloia Pragorm Caetre a more prorpeosus
23
Diploma Guy

Word Obscuring

Dplmoia Pragorm Carete a mroe prorpseous
24
More of Diploma Guy

Diploma Guy is good at what he does
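
One possible counter to this kind of obscuring, sketched here as an assumption and not as what any shipped filter does: the scrambled words keep their first and last letters and only shuffle the middle, so sorting the middle letters gives a key that survives the shuffle. The function names are hypothetical.

```python
def scramble_key(word):
    """Key that is unchanged when a word's inner letters are permuted."""
    w = word.lower()
    if len(w) <= 3:
        return w
    return w[0] + "".join(sorted(w[1:-1])) + w[-1]


def looks_like(obscured, target):
    """True if obscured could be target with its inner letters shuffled."""
    return scramble_key(obscured) == scramble_key(target)


variants = ["Dplmoia", "Dipmloa", "Dimlpoa", "Dpmloia"]
all_match = all(looks_like(v, "Diploma") for v in variants)
```

The same key matches the "Program", "Create" and "prosperous" variants on the slides. It does not help when the spammer also drops or substitutes letters.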

25
Trends in Spam Exploits (Hulten et al.)
26
Solutions to Spam

Filtering
Machine Learning
Matching/Fuzzy Hashing
(Blackhole Lists (IP addresses))
Postage
Turing Tests, Money, Computation
(Disposable Email Addresses)
Smart Proof

27
Filtering Technique: Machine Learning

Learn spam versus good
Problem: need source of training data
Get users to volunteer GOOD and SPAM
Over 100,000 volunteers on Hotmail, over 50,000
new labeled examples/day.
Use standard text classification features, but
also email/spam features
Time of day, number of recipients, etc.
But spammers are adapting to machine learning too
Images, different words, misspellings, etc.

28
Finding Cool Problems by Building Systems

Fun problems we found when we shipped adaptation
for a spam filter
Fun problems we found when we worried about
losing good mail.

29
What Happened When we Shipped an Adaptive Spam
Filter

The first spam filter we shipped was adaptive
If user corrected mistakes, we improved the
filter.
What to do if the user does not correct mistakes?
We assumed the filter was correct
For users who rarely fixed mistakes, this led to
catastrophically bad results: the filter got
worse and worse and worse

30
Threshold Drift: Conservative Threshold Setting
Separator: 50/50 mark
We are conservative in our filtering. For
instance, maybe we need to be 96% certain that
mail is spam before we classify as spam
Conservative Threshold: 96% sure
31
Threshold Drift: Lots of Spam Classified as Good
Separator: 50/50 mark
Conservative Threshold: 96% sure
32
Threshold Drift: New Separator Parallel to Old
Old Separator
Old Conservative Threshold: 96% sure
New Conservative Threshold: 96% sure
33
Threshold Drift: New Separator Parallel to Old
Old Separator
Old Conservative Threshold: 96% sure
New Conservative Threshold: 96% sure
34
Adaptation with partial user feedback is hard

Users may correct all errors, or only all spam,
all good, 50% spam, 10% spam, no errors, etc.
Need to work no matter what the user correction
rate is
Great problem that you find when you try to build
a real system
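
The failure mode can be made concrete by looking at the labels an adaptive filter ends up training on. A minimal sketch, assuming the filter trusts its own decision whenever the user does not correct it. The function name and data are made up, and corrections are all-or-nothing here instead of a rate.

```python
def training_labels(scores, truth, threshold, fixes_missed_spam, fixes_lost_good):
    """Labels (True = spam) the filter would learn from after user feedback."""
    labels = []
    for score, is_spam in zip(scores, truth):
        predicted_spam = score >= threshold
        if predicted_spam == is_spam:
            labels.append(is_spam)            # filter was right
        elif is_spam and fixes_missed_spam:
            labels.append(True)               # user reported the missed spam
        elif not is_spam and fixes_lost_good:
            labels.append(False)              # user rescued the good mail
        else:
            labels.append(predicted_spam)     # uncorrected mistake is trusted
    return labels


scores = [0.99, 0.90, 0.70, 0.20, 0.05]      # filter's spam probabilities
truth = [True, True, True, False, False]     # what the mail really is
silent_user = training_labels(scores, truth, 0.96, False, False)
careful_user = training_labels(scores, truth, 0.96, True, True)
```

With the conservative 96% threshold and a silent user, the two spam messages scored 0.90 and 0.70 are fed back as good mail, which is what drags the separator toward the threshold.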

35
Fun problems we found when we worried about
losing good mail

Most machine learning focuses on accuracy
Assumes all errors equally bad
For spam (and most other problems) cost of
deleting good mail much higher than cost of spam
in inbox
Some research on optimizing area under the curve
so you get good performance everywhere
Almost no research on how to optimize for a
specific point.

(Figure: tradeoff curve. One axis runs from 0 (no missed spam) to 1 (all spam missed); the other from 0 (no good caught) to 1 (all good caught).)
36
Our technique (Scott Yih and Joshua Goodman)
Spam mail
Good mail

First, learn a model on all training data (e.g.
linear classifier)
Pick the subset of the data in the region you
care about
Find all messages, good and spam, that are more
than, say, 50% likely to be spam according to the
first model
Train a new model on only this data
At test time, use both models
Works substantially better than other techniques:
at the desired low false positive rate, reduces
spam by 20-40% compared to normal techniques.
Can make exciting progress even in a well-explored
area like text classification when you build a
system.
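
The two-stage recipe above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the tiny logistic regression is a stand-in for "a linear classifier", the data is invented, and the slide does not say how the two models are combined at test time, so this sketch simply lets the second model score anything the first model finds spam-like.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(examples, epochs=200, lr=0.5):
    """Tiny logistic regression on (features, label) pairs, label 1 = spam."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in examples:
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            step = lr * (p - y)
            weights = [w - step * xi for w, xi in zip(weights, x)]
            bias -= step
    return lambda x: sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_two_stage(examples, cutoff=0.5):
    """Train on everything, then retrain on the spam-like region only."""
    first = train_logistic(examples)
    region = [(x, y) for x, y in examples if first(x) > cutoff]
    # Fall back to the first model if the region lacks one of the classes.
    has_both_classes = len({y for _, y in region}) == 2
    second = train_logistic(region) if has_both_classes else first

    def spam_probability(x):
        p = first(x)
        return second(x) if p > cutoff else p  # assumed combination rule

    return spam_probability


# Made-up 2-feature data: (spammy-word rate, good-word rate), label 1 = spam.
examples = [
    ([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.8, 0.3], 1),
    ([0.7, 0.6], 0), ([0.2, 0.9], 0), ([0.1, 1.0], 0),
]
score = train_two_stage(examples)
```

The second model spends all of its capacity on the region near the decision that matters, which is the point of the technique.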

37
Conclusion (1/2)

Building systems is a great way to find
interesting and important new problems
Sometimes leads to fundamental research

Adaptation with partial feedback
Learning for specific tradeoffs
Contact finding with CRFs (constrained CRF
decoding, confidence estimation)
Email is the most important application (and
there's lots of machine learning problems)
Email is why my wife's grandmother bought a
computer

38
Conclusion (2/2)

Tons of exciting research
Email
Priorities, Auto-folder, Auto-Tag, Automatic
Search

Spam
Still haven't solved it; can keep improving
New problems like phishing
Conference on Email and Anti-Spam
You just missed it (two weeks ago), so start
writing papers for next year.

39
Disposable Email Addresses

You have one address for each sender
JOSHUAGO1895422_at_microsoft.com
All go to same mailbox
If I give you my address, and you send me spam, I
just delete the address
How do new senders get an address?
If I send mail to 3 people, which address is it
From?
Hard to remember!
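
The scheme on this slide is easy to prototype, which makes its usability problems easier to see. A minimal sketch. The class and method names are hypothetical, example.com is a placeholder domain, and the 7-digit suffix only mirrors the look of the address on the slide.

```python
import secrets


class DisposableAddresses:
    """One mailbox, many throwaway addresses (hypothetical helper)."""

    def __init__(self, user, domain):
        self.user = user
        self.domain = domain
        self.issued = {}  # address -> sender it was given to

    def issue(self, sender):
        # Draw 7-digit suffixes until we find one that is not in use.
        while True:
            suffix = secrets.randbelow(10**7)
            address = f"{self.user}{suffix:07d}@{self.domain}"
            if address not in self.issued:
                self.issued[address] = sender
                return address

    def accepts(self, address):
        return address in self.issued

    def revoke(self, address):
        # Deleting the address cuts off whoever it was given (or leaked) to.
        self.issued.pop(address, None)


box = DisposableAddresses("joshuago", "example.com")
for_alice = box.issue("alice")
for_bob = box.issue("bob")
box.revoke(for_alice)
```

The code does nothing for the open questions on the slide: a new sender still has no address to write to, and a message to three people still needs a single From address.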

40
My Favorite Solution

If we could get everyone at Hotmail to never
answer any spam, spammers would just give up
sending to Hotmail.
So, when new Hotmail users sign up, send them 100
really tempting ads
If they answer any of them, terminate account

41
My Favorite Solution

If we could get everyone at Hotmail to never
answer any spam, spammers would just give up
sending to Hotmail.
So, when new Hotmail users sign up, send them 100
really tempting ads
If they answer any of them, terminate account
Hotmail management refuses to consider this.

42
I tried to ship a grammar checker

Eric Brill gave a keynote in ???
Processing Natural Language without Natural Language Processing
All you need is lots of data
You can build a grammar checker with very simple
machine learning.
Solve common grammar problems like their/they're,
etc.
Makes NLP sound really boring and problems seem
easy.
Grammar checking is actually a very interesting
problem

43
Why grammar checking is interesting (and hard)
after all

Product groups already had good solutions for
English
Wanted Brazilian Portuguese
There's tons of well-edited data for English
Try finding data for Brazilian Portuguese, etc.
"There's no data like more data" only applies if
there is more data
English is uninflected, but most languages have
strong inflection
If you don't morphologically analyze, the
vocabulary is effectively huge, multiplying the
data sparsity problem

44
What else went wrong

Top priority: agreement (singular/plural, gender)
Traditional ML approach to grammar checking
(confusable word pairs) is local, no structure
Works well for > 90% of test instances, because
most agreement is local.
People doesn't make mistakes when the subject and
verb is next to each other
People who make a mistake is most likely to do so
when the subject and verb is far apart.
Need grammar, or some other powerful technique
No Brazilian Portuguese treebank
Grammar checking is a great problem for NLP
Trying to build a real system helps us find
problems we didn't even know we had.

45
Blackhole Lists
MSN blocks e-mail from rival ISPs
By Stefanie Olsen, Staff Writer, CNET News.com
February 28, 2003, 2:34 PM PT

Microsoft's MSN said its e-mail
services had blocked some incoming messages from
rival Internet service providers earlier this
week, after their networks were mistakenly banned
as sources of junk mail. The Redmond, Wash.,
company, which has nearly 120 million e-mail
customers through its Hotmail and MSN Internet
services, confirmed Friday it had wrongly placed
a group of Internet protocol addresses from AOL
Time Warner's RoadRunner broadband service and
EarthLink on its "blocklist" of known spammers
whose mail should be barred from customer
in-boxes. Once notified of the error by the two
ISPs, MSN moved the IP addresses "over to a safe
list immediately," according to a Microsoft
spokeswoman.

Lists of IP addresses that send spam
Open relays, Open proxies, DSL/Cable lines, etc
Easy to make mistakes
Open relays, DSL, Cable send good and spam
Who makes the lists?
Some list-makers very aggressive
Some list-makers too slow
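
A blocklist check with a safe-list override, like the fix described in the article above, can be sketched with Python's standard ipaddress module. The function name is hypothetical and the networks are reserved documentation ranges, not real ISP addresses.

```python
import ipaddress


def is_blocked(ip, blocklist, safelist):
    """True if ip falls in a blocked network and is not on the safe list."""
    addr = ipaddress.ip_address(ip)
    if any(addr in ipaddress.ip_network(net) for net in safelist):
        return False  # the safe list wins, so a bad listing can be undone fast
    return any(addr in ipaddress.ip_network(net) for net in blocklist)


blocklist = ["192.0.2.0/24", "198.51.100.0/24"]
safelist = ["198.51.100.0/24"]  # a range that was listed by mistake

spammer = is_blocked("192.0.2.7", blocklist, safelist)
rescued = is_blocked("198.51.100.9", blocklist, safelist)
unlisted = is_blocked("203.0.113.5", blocklist, safelist)
```

Because a whole network is listed at once, a single bad entry blocks every sender in that range, good and spam alike.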

46
Nigerian Chatter

tatyanaatkins: want to make money?
joshuagood9: how?
tatyanaatkins: have run a textile company and get pay in cheques and money orders
joshuagood9: how do I make money?
tatyanaatkins: i gt my clients to send them to u while u cash em and remove your pay then sen the rest to me
joshuagood9: Why don't you cash them yourself?
tatyanaatkins: because presently i am traveling around and this come in at a rate faster than i can
tatyanaatkins: need assistance in catching up
tatyanaatkins: if u wish i can send u the letter of incoporation
joshuagood9: yes, email it to me
joshuagood9: joshuagood9_at_yahoo.com
tatyanaatkins: hold on
joshuagood9: you are in nigeria?
tatyanaatkins: yes
tatyanaatkins: that's where the factory is
joshuagood9: how much will you pay me
tatyanaatkins: u get up to 200 dollars every delivery
joshuagood9: what is in a delivery? how do I get the money to you?
tatyanaatkins: i get the clients to send them to u
joshuagood9: and then what?
tatyanaatkins: u cash it and send via western union
joshuagood9: sounds easy
tatyanaatkins: yeah
joshuagood9: why do you pay me so much money?
joshuagood9: how many do I have to cash? Is one "delivery" one check? or a lot?
tatyanaatkins: cos people have eloped with my money n the past
joshuagood9: why will you trust me?
tatyanaatkins: so i have decided to pay good so we all can be satisfied
joshuagood9: that makes sense
joshuagood9: Let me call you on the phone, and we can talk about it
tatyanaatkins: ok
joshuagood9: what is your number?
tatyanaatkins: 2340833830119
joshuagood9: oh, that's international
joshuagood9: I'm at work now
joshuagood9: I'll have to call you later, from home
tatyanaatkins: ok
tatyanaatkins: are u interested?
joshuagood9: of course!
tatyanaatkins: so i'll send u your letter
joshuagood9: my letter?
tatyanaatkins: of employment
joshuagood9: oh, ok

47
Implicit Query (S. T. Dumais, E. Cutrell, R.
Sarin and E. Horvitz)
People search
Related email
Documents
Intranet
48
Chat Bot

joshuagood9: hi there
superchristina: hey there how u doin?
joshuagood9: are you a bot?
superchristina: im not a bot are u? lol
joshuagood9: are you a bot?
superchristina: i hate bots lol
joshuagood9: asl?
superchristina: im 21 f usa and u?
joshuagood9: I am fine, thank you
superchristina: right on asl?... im 20 f usa
joshuagood9: 74/M, WA
superchristina: nice age
joshuagood9: thank you
superchristina: yw sweety..could u do me a favor..check out my homepage and my profile see if my cam works? brb
