Title: Email and Spam
1 Email and Spam
Joshua Goodman, Microsoft Corporation
http://research.microsoft.com/joshuago
(Slides stolen from lots of people, especially Geoff Hulten)
2 Source: Pew Internet & American Life Project
3 Email addiction
Source: AOL Email Addiction Survey
- 41% check email first thing in the morning
- 23% have checked in bed in their pajamas
- 26% say they haven't gone more than two to three days without checking their email.
4 Overview
- Email
  - Most important application
  - Great research problems for people working on Machine Learning / Data Mining
- Spam
  - Techniques spammers use
  - Solutions to Spam
  - Fun problems you find building real systems
5 Part 1: Email
- A sample of interesting machine learning / data mining email problems
- Finding what's important
  - Priorities
- Organizing mail
  - Auto foldering
  - Auto tagging
- Finding what's interesting
  - Automatic search
  - Contact finding
6 Priorities (Eric Horvitz, Andy Jacobs, David Hovel, etc.)
- Automatically determines how important your email is
  - Send to your cell phone
  - Different sound/toast
- Uses machine learning
  - Sent directly to you?
  - From your manager?
  - Uses future tense?
  - Future dates?
  - Personal request: "will you", "can you"
  - Importance: "is important", "is critical"
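The cues listed on this slide can be read as boolean features for a classifier. The sketch below is only an illustration of that idea, not the Priorities system itself: the message layout (a dict with `from`, `to`, `body`), the regular expressions, and the function names are all assumptions made for the example.

```python
import re
from datetime import date

ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def mentions_future_date(text, today):
    """True if the text contains a valid ISO date later than `today`."""
    for match in ISO_DATE.findall(text):
        try:
            if date.fromisoformat(match) > today:
                return True
        except ValueError:
            continue  # skip strings like 2031-13-45 that are not real dates
    return False

def priority_features(message, user, manager, today):
    """Map one message (hypothetical dict layout) to boolean features."""
    body = message["body"].lower()
    return {
        "sent_directly": user in message["to"],
        "from_manager": message["from"] == manager,
        "future_tense": bool(re.search(r"\b(will|shall|going to)\b", body)),
        "future_date": mentions_future_date(body, today),
        "personal_request": bool(re.search(r"\b(will|can) you\b", body)),
        "importance": bool(re.search(r"\bis (important|critical)\b", body)),
    }

urgent = {
    "from": "boss@example.com",
    "to": ["me@example.com"],
    "body": "Can you send the report? It is critical and will be reviewed on 2031-01-15.",
}
casual = {
    "from": "friend@example.com",
    "to": ["list@example.com"],
    "body": "Lunch was great yesterday.",
}
today = date(2030, 1, 1)
urgent_features = priority_features(urgent, "me@example.com", "boss@example.com", today)
casual_features = priority_features(casual, "me@example.com", "boss@example.com", today)
```

A learned model would then weight these features; hand-written rules like the ones above only stand in for whatever feature extraction the real system used.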
7 Auto Foldering (Jake Brutlag and Chris Meek)
- Use machine learning to figure out automatically what folder mail goes in.
- Interesting text classification problem
  - Folders contain as few as three entries
  - Data changes over time
8 Automatic Tagging for Email (Arun C. Surendran, John C. Platt and Erin Renshaw)
- Automatically tag email messages to enrich search, organization and navigation.
- How it works
  - Put messages into clusters
  - Naming clusters is hard
    - Use domain-dependent filtering (remove common intranet words)
    - Use noun phrases from subjects
  - Words do not have to occur in all messages in cluster
[Figure: example cluster tags]
college NCAA basketball
NBA Basketball bball
March Madness NCAA tournament
- A message about NCAA basketball might also get the tag "college" if that keyword occurs in most other messages in the cluster.
9 Automatic Search (Joshua Goodman and Vitor Carvalho)
- Automatically show users useful search results
- Examined over 20 factors
  - Automatically train machine learning system to weight them.
- Frequency of keywords in Internet Search query logs (MSN) is third most helpful feature (after TF and IDF)
  - Helped solve lots of linguistic problems
  - Almost everything in query logs is a meaningful phrase
  - Much easier to port to multiple languages
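To make the three strongest features concrete, here is a small sketch that scores candidate keywords in a message by term frequency, inverse document frequency, and frequency in a query log. The weights are set by hand purely for illustration (the slide says the real system learned them), and `keyword_scores`, the toy corpus, and the query-log counts are all hypothetical.

```python
import math
from collections import Counter

def keyword_scores(message_tokens, corpus, query_log_counts, weights=(1.0, 1.0, 0.5)):
    """Score each term by a weighted sum of TF, IDF and log query-log frequency."""
    w_tf, w_idf, w_log = weights
    term_counts = Counter(message_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in term_counts.items():
        doc_freq = sum(1 for doc in corpus if term in doc)
        idf = math.log((1 + n_docs) / (1 + doc_freq))  # smoothed IDF
        log_freq = math.log1p(query_log_counts.get(term, 0))
        scores[term] = w_tf * count + w_idf * idf + w_log * log_freq
    return scores

# Toy data: each corpus document is a set of tokens.
corpus = [
    {"the", "meeting"},
    {"the", "ncaa", "tournament"},
    {"the", "budget"},
]
query_log_counts = {"ncaa": 1000, "tournament": 500}
scores = keyword_scores(["the", "ncaa", "tournament", "the"], corpus, query_log_counts)
best = max(scores, key=scores.get)
```

In this toy setup the common word "the" has high TF but zero IDF and no query-log support, so the terms people actually search for rise to the top.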
10 Contact Finding (T. Kristjansson, A. Culotta, P. Viola and A. McCallum)
- Automatically find contact information in an email message.
- Machine learning method: train it by showing examples
- Propagates corrections
  - If you fix first name, makes a new guess for last name.
- Extensions to CRFs
  - Constrained decoding
  - Confidence estimation
- Recent even better results with discriminatively trained CFGs (Viola and Narasimhan)
11 Other Interesting Email Research
- All of the research I've just shown you is from Microsoft Research
  - Main reason: much easier to steal slides from colleagues with nearby offices
  - Why do people in MSR spend so much time working on email problems?
- CALO Project
  - Cognitive Assistant that Learns and Organizes: DARPA-funded project led by SRI, with 22 organizations participating
  - Main way you deal with your automated assistant is through email.
- RADAR Project
  - Primarily at CMU (11 research groups) (DARPA funded)
  - Cognitive assistant that can do tasks like space planning, automated web master, etc.
  - Primary interface to the assistant is through email
12 Part 2: Spam
- SPAM is the number one problem for email systems
  - Estimates: from about 71% to 87% of mail is spam
  - If you stop 90% of the spam, over a billion spam a day will get past filters worldwide, and 20% of your inbox will be spam.
- Overview
  - Techniques spammers use
  - Solutions to Spam
  - And some of the interesting research problems
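The "20% of your inbox" figure follows from simple arithmetic. The sketch below works it out under the assumption that the filter loses no good mail; the function name is made up for the example.

```python
def inbox_spam_share(spam_fraction, catch_rate):
    """Share of delivered mail that is spam, assuming no good mail is lost."""
    surviving_spam = spam_fraction * (1 - catch_rate)
    good_mail = 1 - spam_fraction
    return surviving_spam / (surviving_spam + good_mail)

low = inbox_spam_share(0.71, 0.90)   # low end of the spam estimates
high = inbox_spam_share(0.87, 0.90)  # high end of the spam estimates
print(f"{low:.1%} to {high:.1%} of the inbox is spam")
```

At the 71% estimate, 7.1 spam messages survive for every 29 good ones, which is roughly the 20% quoted on the slide; at the 87% estimate the share is about twice that.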
13 Techniques spammers use
- A few examples of tricks spammers use to get past spam filters
- Most spam filters have text classification as main or important part, often with linear models (e.g. Naïve Bayes, etc.)
14 The Hitchhiker Chaffer
- Content Chaff
  - Random passages from the Hitchhiker's Guide
  - Footers from valid mail
"This must be Thursday," said Arthur to himself, sinking low over his beer, "I never could get the hang of Thursdays."
Express yourself with MSN Messenger 6.0
15 Hitchhiker Chaffer's Later Work
- Can use hidden text, e.g. white on white or many other tricks
  - User sees only spammy text
  - Spam filter sees everything, including good words.
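Why chaff works against a linear text classifier can be shown with a toy model: every word adds its weight to the score, so enough "good" words cancel the spammy ones. The weights, bias, and messages below are invented for illustration and do not come from any real filter.

```python
import math

# Hypothetical per-word log-odds weights: positive means spammy.
WEIGHTS = {
    "viagra": 4.0, "free": 1.5,
    "thursday": -1.2, "arthur": -1.5, "beer": -0.8, "messenger": -1.0,
}
BIAS = -1.0

def spam_probability(text):
    """Linear score over the words, squashed to a probability."""
    score = BIAS + sum(WEIGHTS.get(word, 0.0) for word in text.lower().split())
    return 1.0 / (1.0 + math.exp(-score))

plain = "free viagra"
chaffed = plain + " thursday arthur beer messenger"  # hidden chaff appended

p_plain = spam_probability(plain)
p_chaffed = spam_probability(chaffed)
```

With a conservative 96% threshold, the plain message would be caught while the chaffed one slips through, even though the user only ever sees the spammy part.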
16 Hitchhiker Chaffer's Later Work
- Can use hidden text, e.g. white on white or many other tricks
Also included a number of unusual statements made by candidates during, On display? I eventually had to go down to the cellar to find them.
http://join.msn.com/?Pagefeatures/es
17 Secret Decoder Ring
Viagra Proven sexual aid to enhance performance
18 Secret Decoder Ring Dude
- Character Encoding
- HTML word breaking
Phar&#109;acy Prod&#117;c<!LZJ>t<!LG>s
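Both tricks named on this slide (character encoding and HTML word breaking) can be undone before the text reaches the classifier. The following is a minimal sketch using only Python's standard `html` and `re` modules. The function name `visible_text` is hypothetical, the sample string is a reconstruction of what the slide appears to have shown, and a real filter would need a proper HTML parser rather than a regular expression.

```python
import html
import re

TAG = re.compile(r"<[^>]*>")  # anything between angle brackets, including bogus tags

def visible_text(raw_html):
    """Approximate what the user sees: drop tags, then decode character references."""
    without_tags = TAG.sub("", raw_html)
    return html.unescape(without_tags)

obfuscated = "Phar&#109;acy Prod&#117;c<!LZJ>t<!LG>s"
decoded = visible_text(obfuscated)
print(decoded)
```

Tags are stripped before entities are decoded so that an escaped `&lt;` in the message is not mistaken for the start of a tag.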
19 Diploma Guy
Dplmoia Pragorm Caerte a mroe prosoeprus
20 Diploma Guy
Dipmloa Paogrrm Cterae a more presporous
21 Diploma Guy
Dimlpoa Pgorram Cearte a more poosperrus
22 Diploma Guy
Dpmloia Pragorm Caetre a more prorpeosus
23 Diploma Guy
Dplmoia Pragorm Carete a mroe prorpseous
24 More of Diploma Guy
- Diploma Guy is good at what he does
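Every Diploma Guy variant keeps the first and last letter of each word and shuffles the rest, which defeats exact word matching. One possible countermeasure, sketched here as an assumption rather than anything the slides describe, is to map each word to a key that ignores the order of its interior letters. The function names are made up for the example.

```python
import re

def scramble_key(word):
    """First letter + sorted interior letters + last letter, lowercased."""
    word = word.lower()
    if len(word) <= 3:
        return word  # too short to have a scrambled interior
    return word[0] + "".join(sorted(word[1:-1])) + word[-1]

def canonical(text):
    """Replace every alphabetic word in the text by its scramble key."""
    return " ".join(scramble_key(w) for w in re.findall(r"[a-z]+", text.lower()))

target = canonical("Diploma Program Create a more prosperous")
variants = [
    "Dplmoia Pragorm Caerte a mroe prosoeprus",
    "Dipmloa Paogrrm Cterae a more presporous",
    "Dimlpoa Pgorram Cearte a more poosperrus",
    "Dpmloia Pragorm Caetre a more prorpeosus",
    "Dplmoia Pragorm Carete a mroe prorpseous",
]
matches = [canonical(v) == target for v in variants]
```

The key also collides for unrelated anagram-like words, so in practice it would be one feature among many rather than a rule on its own.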
25 Trends in Spam Exploits (Hulten et al.)
26 Solutions to Spam
- Filtering
  - Machine Learning
  - Matching/Fuzzy Hashing
  - (Blackhole Lists (IP addresses))
- Postage
  - Turing Tests, Money, Computation
- (Disposable Email Addresses)
- Smart Proof
27 Filtering Technique: Machine Learning
- Learn spam versus good
- Problem: need source of training data
  - Get users to volunteer GOOD and SPAM
  - Over 100,000 volunteers on Hotmail, over 50,000 new labeled examples/day.
- Use standard text classification features, but also email/spam features
  - Time of day, number of recipients, etc.
- But spammers are adapting to machine learning too
  - Images, different words, misspellings, etc.
28 Finding Cool Problems by Building Systems
- Fun problems we found when we shipped adaptation for a spam filter
- Fun problems we found when we worried about losing good mail.
29 What Happened When we Shipped an Adaptive Spam Filter
- The first spam filter we shipped was adaptive
  - If user corrected mistakes, we improved the filter.
- What to do if the user does not correct mistakes?
  - We assumed the filter was correct
  - For users who rarely fixed mistakes, this led to catastrophically bad results: the filter got worse and worse and worse
30 Threshold Drift: Conservative Threshold Setting
[Figure labels: Separator (50/50 mark); Conservative Threshold (96% sure)]
We are conservative in our filtering. For instance, maybe we need to be 96% certain that mail is spam before we classify as spam
31 Threshold Drift: Lots of Spam Classified as Good
[Figure labels: Separator (50/50 mark); Conservative Threshold (96% sure)]
32 Threshold Drift: New Separator Parallel to Old
[Figure labels: Old Separator; Old Conservative Threshold (96% sure); New Conservative Threshold (96% sure)]
33 Threshold Drift: New Separator Parallel to Old
[Figure labels: Old Separator; Old Conservative Threshold (96% sure); New Conservative Threshold (96% sure)]
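The drift on these slides can be reproduced with a one-dimensional toy. In this sketch, which is an illustration under strong assumptions and not the shipped filter, mail is a number on a line (true spam is anything above 0), the "model" is the midpoint between the means of the two labeled groups, and the conservative threshold sits a fixed margin above that separator. The filter then retrains on its own decisions, as if the user never corrected anything.

```python
def mean(values):
    return sum(values) / len(values)

def self_train(mail, separator, margin, rounds):
    """Retrain repeatedly on the filter's own labels; return the separator history."""
    history = [separator]
    for _ in range(rounds):
        threshold = separator + margin            # conservative threshold
        spam = [x for x in mail if x > threshold]
        good = [x for x in mail if x <= threshold]  # includes uncaught spam
        if not spam or not good:
            break
        separator = (mean(spam) + mean(good)) / 2   # new separator from own labels
        history.append(separator)
    return history

def missed_spam(mail, threshold):
    """True spam (x > 0) that the threshold lets into the inbox."""
    return sum(1 for x in mail if 0 < x <= threshold)

mail = [i / 10 for i in range(-20, 21)]  # evenly spread from -2.0 to 2.0
margin = 0.5
history = self_train(mail, separator=0.0, margin=margin, rounds=5)
missed_before = missed_spam(mail, history[0] + margin)
missed_after = missed_spam(mail, history[-1] + margin)
```

Because spam below the conservative threshold is treated as good mail, each retraining round pulls the separator toward the spam side, and the threshold moves with it, so more spam is missed after adaptation than before.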
34 Adaptation with partial user feedback is hard
- Users may correct all errors, or only all spam, all good, 50% spam, 10% spam, no errors, etc.
- Need to work no matter what the user correction rate is
- Great problem that you find when you try to build a real system
35 Fun problems we found when we worried about losing good mail
- Most machine learning focuses on accuracy
  - Assumes all errors equally bad
- For spam (and most other problems) cost of deleting good mail much higher than cost of spam in inbox
- Some research on optimizing area under the curve so you get good performance everywhere
- Almost no research on how to optimize for a specific point.
[Figure: curve of missed spam, from 0 (No missed spam) to 1 (All spam missed), against good mail caught, from 0 (No good caught) to 1 (All good caught)]
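Operating at "a specific point" usually means fixing the share of good mail you are willing to lose and then seeing how much spam you catch there. A small sketch of picking that point from held-out scores; the scores and function names are invented for the example.

```python
def threshold_for_false_positive_rate(good_scores, max_fp_rate):
    """Lowest threshold such that at most max_fp_rate of good mail scores above it."""
    ranked = sorted(good_scores, reverse=True)
    allowed = int(max_fp_rate * len(ranked))  # good messages we may lose
    if allowed >= len(ranked):
        return float("-inf")
    return ranked[allowed]  # mail is called spam only if score > threshold

def fraction_above(scores, threshold):
    return sum(1 for s in scores if s > threshold) / len(scores)

# Hypothetical spam-probability scores on held-out mail.
good_scores = [0.01, 0.02, 0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.90, 0.97]
spam_scores = [0.55, 0.80, 0.92, 0.95, 0.99]

threshold = threshold_for_false_positive_rate(good_scores, max_fp_rate=0.10)
good_lost = fraction_above(good_scores, threshold)
spam_caught = fraction_above(spam_scores, threshold)
```

Accuracy would count the lost good message and a missed spam as equal errors; fixing the false positive rate first makes the asymmetry explicit.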
36 Our technique (Scott Yih and Joshua Goodman)
[Figure legend: Spam mail; Good mail]
- First, learn a model on all training data (e.g. linear classifier)
- Pick the subset of the data in the region you care about
  - Find all messages, good and spam, that are more than, say, 50% likely to be spam according to the first model
- Train a new model on only this data
- At test time, use both models
- Works substantially better than other techniques: at the desired low false positive rate, reduces spam by 20-40% compared to normal techniques.
- Can make exciting progress even in well-explored area like text classification when you build a system.
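The steps on this slide can be sketched as a two-stage cascade. This is a toy reading of the idea, not the authors' implementation: the tiny logistic regression, the single made-up feature, and in particular the way the two models are combined at test time (first model as a gate, second model for the final score) are all assumptions.

```python
import math

def predict(weights, x):
    """Logistic model; weights[0] is the bias."""
    z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
    z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, epochs=500, rate=0.5):
    """Plain stochastic gradient ascent on the log-likelihood."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            error = y - predict(weights, x)
            weights[0] += rate * error
            for j, v in enumerate(x):
                weights[j + 1] += rate * error * v
    return weights

def train_two_stage(rows, labels, cutoff=0.5):
    first = train_logistic(rows, labels)  # stage 1: all training data
    region = [(x, y) for x, y in zip(rows, labels) if predict(first, x) > cutoff]
    if not region:
        return first, first
    # Stage 2: only the messages the first model finds spam-like.
    second = train_logistic([x for x, _ in region], [y for _, y in region])
    return first, second

def spam_score(first, second, x, cutoff=0.5):
    """Assumed combination: gate with the first model, score with the second."""
    p_first = predict(first, x)
    return p_first if p_first <= cutoff else predict(second, x)

# One hypothetical feature per message; label 1 = spam, 0 = good.
rows = [[0.0], [0.1], [0.2], [0.3], [0.6], [0.5], [0.7], [0.8], [0.9], [1.0]]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
first, second = train_two_stage(rows, labels)
clearly_good = spam_score(first, second, [0.0])
clearly_spam = spam_score(first, second, [1.0])
```

The second model spends all of its capacity on the hard region near the conservative threshold, which is where the false positive rate is decided.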
37 Conclusion (1/2)
- Building systems is a great way to find interesting and important new problems
- Sometimes leads to fundamental research
  - Adaptation with partial feedback
  - Learning for specific tradeoffs
  - Contact finding with CRFs (constrained CRF decoding, confidence estimation)
- Email is the most important application (and there's lots of machine learning problems)
  - Email is why my wife's grandmother bought a computer
38 Conclusion (2/2)
- Tons of exciting research
- Email
  - Priorities, Auto-folder, Auto-Tag, Automatic Search
- Spam
  - Still haven't solved it, can keep improving
  - New problems like phishing
- Conference on Email and Anti-Spam
  - You just missed it (two weeks ago), so start writing papers for next year.
39 Disposable Email Addresses
- You have one address for each sender
  - JOSHUAGO1895422_at_microsoft.com
  - All go to same mailbox
- If I give you my address, and you send me spam, I just delete the address
- How do new senders get an address?
- If I send mail to 3 people, which address is it From?
- Hard to remember!
40 My Favorite Solution
- If we could get everyone at Hotmail to never answer any spam, spammers would just give up sending to Hotmail.
- So, when new Hotmail users sign up, send them 100 really tempting ads
- If they answer any of them, terminate account
41 My Favorite Solution
- If we could get everyone at Hotmail to never answer any spam, spammers would just give up sending to Hotmail.
- So, when new Hotmail users sign up, send them 100 really tempting ads
- If they answer any of them, terminate account
- Hotmail management refuses to consider this.
42 I tried to ship a grammar checker
- Eric Brill gave a keynote in ???
  - "Processing Natural Language without Natural Language Processing"
  - All you need is lots of data
  - You can build a grammar checker with very simple machine learning.
  - Solve common grammar problems like their/they're, etc.
- Makes NLP sound really boring and problems seem easy.
- Grammar checking is actually a very interesting problem
43 Why grammar checking is interesting (and hard) after all
- Product groups already had good solutions for English
  - Wanted Brazilian Portuguese
- There's tons of well-edited data for English
  - Try finding data for Brazilian Portuguese, etc.
  - "There's no data like more data" only applies if there is more data
- English is uninflected, but most languages have strong inflection
  - If you don't morphologically analyze, the vocabulary is effectively huge, multiplying the data sparsity problem
44 What else went wrong
- Top priority: agreement (singular/plural, gender)
- Traditional ML approach to grammar checking (confusable word pairs) is local, no structure
  - Works well for > 90% of test instances, because most agreement is local.
  - People doesn't make mistakes when the subject and verb is next to each other
  - People who make a mistake is most likely to do so when the subject and verb is far apart.
- Need grammar, or some other powerful technique
  - No Brazilian Portuguese treebank
- Grammar checking is a great problem for NLP
- Trying to build a real system helps us find problems we didn't even know we had.
45 Blackhole Lists
MSN blocks e-mail from rival ISPs
By Stefanie Olsen, Staff Writer, CNET News.com, February 28, 2003, 2:34 PM PT
Microsoft's MSN said its e-mail
services had blocked some incoming messages from
rival Internet service providers earlier this
week, after their networks were mistakenly banned
as sources of junk mail. The Redmond, Wash.,
company, which has nearly 120 million e-mail
customers through its Hotmail and MSN Internet
services, confirmed Friday it had wrongly placed
a group of Internet protocol addresses from AOL
Time Warner's RoadRunner broadband service and
EarthLink on its "blocklist" of known spammers
whose mail should be barred from customer
in-boxes. Once notified of the error by the two
ISPs, MSN moved the IP addresses "over to a safe
list immediately," according to a Microsoft
spokeswoman.
- Lists of IP addresses that send spam
  - Open relays, Open proxies, DSL/Cable lines, etc
- Easy to make mistakes
  - Open relays, DSL, Cable send good and spam
- Who makes the lists?
  - Some list-makers very aggressive
  - Some list-makers too slow
46 Nigerian Chatter
tatyanaatkins: want to make money?
joshuagood9: how?
tatyanaatkins: have run a textile company and get pay in cheques and money orders
joshuagood9: how do I make money?
tatyanaatkins: i gt my clients to send them to u while u cash em and remove your pay then sen the rest to me
joshuagood9: Why don't you cash them yourself?
tatyanaatkins: because presently i am traveling around and this come in at a rate faster than i can
tatyanaatkins: need assistance in catching up
tatyanaatkins: if u wish i can send u the letter of incoporation
joshuagood9: yes, email it to me
joshuagood9: joshuagood9_at_yahoo.com
tatyanaatkins: hold on
joshuagood9: you are in nigeria?
tatyanaatkins: yes
tatyanaatkins: that's where the factory is
joshuagood9: how much will you pay me
tatyanaatkins: u get up to 200 dollars every delivery
joshuagood9: what is in a delivery? how do I get the money to you?
tatyanaatkins: i get the clients to send them to u
joshuagood9: and then what?
tatyanaatkins: u cash it and send via western union
joshuagood9: sounds easy
tatyanaatkins: yeah
joshuagood9: why do you pay me so much money?
joshuagood9: how many do I have to cash? Is one "delivery" one check? or a lot?
tatyanaatkins: cos people have eloped with my money n the past
joshuagood9: why will you trust me?
tatyanaatkins: so i have decided to pay good so we all can be satisfied
joshuagood9: that makes sense
joshuagood9: Let me call you on the phone, and we can talk about it
tatyanaatkins: ok
joshuagood9: what is your number?
tatyanaatkins: 2340833830119
joshuagood9: oh, that's international
joshuagood9: Im at work now
joshuagood9: I'll have to call you later, from home
tatyanaatkins: ok
tatyanaatkins: are u interested?
joshuagood9: of course!
tatyanaatkins: so i'll send u your letter
joshuagood9: my letter?
tatyanaatkins: of employment
joshuagood9: oh, ok
47 Implicit Query (S. T. Dumais, E. Cutrell, R. Sarin and E. Horvitz)
[Screenshot labels: People search; Related email; Documents; Intranet]
48 Chat Bot
joshuagood9: hi there
superchristina: hey there how u doin?
joshuagood9: are you a bot?
superchristina: im not a bot are u? lol
joshuagood9: are you a bot?
superchristina: i hate bots lol
joshuagood9: asl?
superchristina: im 21 f usa and u?
joshuagood9: I am fine, thank you
superchristina: right on asl?... im 20 f usa
joshuagood9: 74/M, WA
superchristina: nice age
joshuagood9: thank you
superchristina: yw sweety..could u do me a favor..check out my homepage and my profile see if my cam works? brb