Email and Spam - PowerPoint PPT Presentation

Slides: 50
Provided by: deanslmic
Transcript and Presenter's Notes

Title: Email and Spam


1
Email and Spam
Joshua Goodman, Microsoft Corporation
http://research.microsoft.com/joshuago
(Slides stolen from lots of people,
especially Geoff Hulten)
2
Source: Pew Internet & American Life Project
3
Email addiction
Source: AOL Email Addiction Survey
  • 41% check email first thing in the morning
  • 23% have checked in bed in their pajamas
  • 26% say they haven't gone more than two to three
    days without checking their email.

4
Overview
  • Email
  • Most important application
  • Great research problems for people working on
    Machine Learning / Data Mining
  • Spam
  • Techniques spammers use
  • Solutions to Spam
  • Fun problems you find building real systems

5
Part 1: Email
  • A sample of interesting machine learning / data
    mining email problems
  • Finding what's important
  • Priorities
  • Organizing mail
  • Auto foldering
  • Auto tagging
  • Finding what's interesting
  • Automatic search
  • Contact finding

6
Priorities (Eric Horvitz, Andy Jacobs, David
Hovel, etc.)
  • Automatically determines how important your email
    is
  • Send to your cell phone
  • Different sound/toast
  • Uses machine learning
  • Sent directly to you?
  • From your manager?
  • Uses future tense?
  • Future dates?
  • Personal request: "will you", "can you"
  • Importance: "is important", "is critical"

7
Auto Foldering (Jake Brutlag and Chris Meek)
  • Use machine learning to figure out automatically
    what folder mail goes in.
  • Interesting text classification problem
  • Folders contain as few as three entries
  • Data changes over time

8
Automatic Tagging for Email (Arun C. Surendran,
John C. Platt and Erin Renshaw)
  • Automatically tag email messages to enrich search,
    organization, and navigation.
  • How it works
  • Put messages into clusters
  • Naming clusters is hard
  • Use domain-dependent filtering (remove common
    intranet words)
  • Use noun phrases from subjects
  • Words do not have to occur in all messages in
    the cluster

[Figure: example tags shown for a cluster: college, NCAA, basketball,
NBA, Basketball, bball, March Madness, NCAA tournament]
  • A message about NCAA basketball might also get
    the tag "college" if that keyword occurs in most
    other messages in the cluster.

9
Automatic Search (Joshua Goodman and Vitor
Carvalho)
  • Automatically show users useful search results
  • Examined over 20 factors
  • Automatically train machine learning system to
    weight them.
  • Frequency of keywords in Internet search query
    logs (MSN) is the third most helpful feature
    (after TF and IDF)
  • Helped solve lots of linguistic problems
  • Almost everything in query logs is a meaningful
    phrase
  • Much easier to port to multiple languages

10
Contact Finding (T. Kristjansson, A. Culotta, P.
Viola and A. McCallum)
  • Automatically find contact information in an
    email message.
  • Machine learning method: train it by showing
    examples
  • Propagates corrections
  • If you fix first name, makes a new guess for last
    name.
  • Extensions to CRFs
  • Constrained decoding
  • Confidence estimation
  • Recently, even better results with
    discriminatively trained CFGs (Viola and
    Narasimhan)

11
Other Interesting Email Research
  • All of the research I've just shown you is from
    Microsoft Research
  • Main reason: it is much easier to steal slides
    from colleagues with nearby offices
  • Why do people in MSR spend so much time working
    on email problems?
  • CALO Project
  • Cognitive Assistant that Learns and Organizes:
    DARPA-funded project led by SRI, with 22
    organizations participating
  • Main way you deal with your automated assistant
    is through email.
  • RADAR Project
  • Primarily at CMU (11 research groups) (DARPA
    funded)
  • Cognitive assistant that can do tasks like space
    planning, automated web master, etc.
  • Primary interface to the assistant is through
    email

12
Part 2: Spam
  • Spam is the number one problem for email systems
  • Estimates range from about 71% to 87% of mail
    being spam
  • Even if you stop 90% of the spam, over a billion
    spam messages a day will get past filters
    worldwide, and about 20% of your inbox will be
    spam.
  • Overview
  • Techniques spammers use
  • Solutions to Spam
  • And some of the interesting research problems
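
The inbox figure on this slide follows from simple arithmetic. The sketch below is illustrative only: the function name `inbox_spam_fraction` is made up here, and it assumes the filter never deletes good mail.

```python
def inbox_spam_fraction(spam_share, catch_rate):
    """Fraction of the inbox that is spam after filtering.

    spam_share: fraction of all incoming mail that is spam (0..1)
    catch_rate: fraction of spam the filter stops (0..1)
    Assumption: the filter lets all good mail through.
    """
    spam_delivered = spam_share * (1 - catch_rate)
    good_delivered = 1 - spam_share
    return spam_delivered / (spam_delivered + good_delivered)


# Low and high ends of the slide's spam estimates, with a 90% filter.
low = inbox_spam_fraction(0.71, 0.90)
high = inbox_spam_fraction(0.87, 0.90)
print(f"{low:.1%} to {high:.1%} of the inbox is still spam")
```

With 71% of mail being spam, roughly one delivered message in five is still spam, which matches the slide's figure; at 87% the share is about twice that.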

13
Techniques spammers use
  • A few examples of tricks spammers use to get past
    spam filters
  • Most spam filters have text classification as
    main or important part, often with linear models
    (e.g. Naïve Bayes, etc.)

14
The Hitchhiker Chaffer
  • Content Chaff
  • Random passages from the Hitchhiker's Guide
  • Footers from valid mail

"This must be Thursday," said Arthur to himself,
sinking low over his beer, "I never could get the
hang of Thursdays."
Express yourself with MSN Messenger 6.0
15
Hitchhiker Chaffer's Later Work
  • Can use hidden text, e.g. white on white or many
    other tricks
  • User sees only spammy text
  • Spam filter sees everything, including good words.
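
A toy example can show why chaff works against a linear text classifier. This is a minimal sketch, not the filter discussed in the talk: the training sentences, the function names, and the equal-prior assumption are all invented for illustration. It uses a Naive Bayes style per-word log-odds weight with add-one smoothing.

```python
import math
from collections import Counter


def train_word_weights(spam_docs, good_docs):
    """Per-word log-odds weights (Naive Bayes with add-one smoothing)."""
    spam_counts = Counter(w for d in spam_docs for w in d.lower().split())
    good_counts = Counter(w for d in good_docs for w in d.lower().split())
    vocab = set(spam_counts) | set(good_counts)
    n_spam = sum(spam_counts.values()) + len(vocab)
    n_good = sum(good_counts.values()) + len(vocab)
    return {
        w: math.log((spam_counts[w] + 1) / n_spam)
        - math.log((good_counts[w] + 1) / n_good)
        for w in vocab
    }


def spam_score(weights, text):
    """Sum of word weights; above 0 leans spam (equal priors assumed)."""
    return sum(weights.get(w, 0.0) for w in text.lower().split())


# Tiny made-up training set.
weights = train_word_weights(
    spam_docs=["buy cheap pills now", "cheap pills cheap prices"],
    good_docs=["this must be thursday said arthur", "meeting moved to thursday"],
)

plain = spam_score(weights, "buy cheap pills")
chaffed = spam_score(weights, "buy cheap pills this must be thursday said arthur")
print(plain, chaffed)
```

Because the score is a plain sum, every "good" word the spammer appends pulls the total down, whether or not the user can see that text.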

16
Hitchhiker Chaffer's Later Work
  • Can use hidden text, e.g. white on white or many
    other tricks

Also included a number of unusual statements made
by candidates during, On display? I eventually
had to go down to the cellar to find them.
http://join.msn.com/?Pagefeatures/es
17
Secret Decoder Ring
  • Looks easy
  • Is it?

Viagra Proven sexual aid to enhance
performance
18
Secret Decoder Ring Dude
  • Character Encoding
  • HTML word breaking

Phar&#109;acy Prod&#117;c<!LZJ>t<!LG>s
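
One way a filter can undo these two tricks is to normalize the HTML before extracting words. The sketch below is a simplified illustration, not the approach from the talk: the sample string is a reconstruction of the slide's example (numeric character references plus junk tags inside words), and the helper name is hypothetical.

```python
import html
import re

# Matches anything that looks like an HTML tag or comment, e.g. <!LZJ>.
TAG_OR_COMMENT = re.compile(r"<[^>]*>")


def normalize_html_text(raw):
    """Return the text a reader would see, roughly.

    Tags are stripped first so that word-breaking markup disappears,
    then character references such as &#109; are decoded.
    """
    without_tags = TAG_OR_COMMENT.sub("", raw)
    return html.unescape(without_tags)


obscured = "Phar&#109;acy Prod&#117;c<!LZJ>t<!LG>s"
print(normalize_html_text(obscured))
```

A real mail filter would need a proper HTML parser, since a regular expression like this one does not handle scripts, styles, or malformed markup.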
19
Diploma Guy
  • Word Obscuring

Dplmoia Pragorm Caerte a mroe prosoeprus
20
Diploma Guy
  • Word Obscuring

Dipmloa Paogrrm Cterae a more presporous
21
Diploma Guy
  • Word Obscuring

Dimlpoa Pgorram Cearte a more poosperrus
22
Diploma Guy
  • Word Obscuring

Dpmloia Pragorm Caetre a more prorpeosus
23
Diploma Guy
  • Word Obscuring

Dplmoia Pragorm Carete a mroe prorpseous
24
More of Diploma Guy
  • Diploma Guy is good at what he does
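
All of the Diploma Guy variants keep each word's first and last letter and shuffle only the interior. One possible countermeasure, sketched here as an assumption rather than anything the talk describes, is to map every word to a key that ignores interior order and look that key up in a known vocabulary. The vocabulary and function names are invented for the example.

```python
def scramble_key(word):
    """Key that is identical for all interior-letter shuffles of a word."""
    w = word.lower()
    if len(w) <= 3:
        return w  # nothing to shuffle in very short words
    return w[0] + "".join(sorted(w[1:-1])) + w[-1]


def build_index(vocabulary):
    """Map scramble keys to known words (later words win on collisions)."""
    return {scramble_key(w): w for w in vocabulary}


def unscramble(text, index):
    """Replace each token with the known word sharing its key, if any."""
    return " ".join(index.get(scramble_key(tok), tok) for tok in text.split())


index = build_index(["diploma", "program", "create", "more", "prosperous"])
print(unscramble("Dplmoia Pragorm Caerte a mroe prosoeprus", index))
print(unscramble("Dipmloa Paogrrm Cterae a more presporous", index))
```

Distinct real words can share a key, so a production system would need to resolve collisions, for example by word frequency.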

25
Trends in Spam Exploits (Hulten et al.)
26
Solutions to Spam
  • Filtering
  • Machine Learning
  • Matching/Fuzzy Hashing
  • (Blackhole Lists (IP addresses))
  • Postage
  • Turing Tests, Money, Computation
  • (Disposable Email Addresses)
  • Smart Proof

27
Filtering Technique: Machine Learning
  • Learn spam versus good
  • Problem: need a source of training data
  • Get users to volunteer GOOD and SPAM
  • Over 100,000 volunteers on Hotmail, over 50,000
    new labeled examples/day.
  • Use standard text classification features, but
    also email/spam features
  • Time of day, number of recipients, etc.
  • But spammers are adapting to machine learning too
  • Images, different words, misspellings, etc.
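
To make "text features plus email features" concrete, here is a minimal sketch using Python's standard `email` package. The feature names and the sample message are invented for illustration, and the code assumes a plain, non-multipart message with a valid Date header.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime


def extract_features(raw_message):
    """Bag-of-words features plus two email-specific features."""
    msg = message_from_string(raw_message)
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    sent = parsedate_to_datetime(msg["Date"])
    body = msg.get_payload()  # a string for non-multipart messages

    features = {"word:" + w: 1 for w in body.lower().split()}
    features["num_recipients"] = len(recipients)
    features["hour_sent"] = sent.hour  # hour in the sender's stated time zone
    return features


RAW = (
    "From: sender@example.com\n"
    "To: a@example.com, b@example.com\n"
    "Cc: c@example.com\n"
    "Date: Tue, 01 Aug 2006 03:15:00 -0700\n"
    "Subject: hello\n"
    "\n"
    "Cheap pills cheap prices\n"
)
print(extract_features(RAW))
```

The resulting dictionary can be fed to any classifier that accepts sparse named features.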

28
Finding Cool Problems by Building Systems
  • Fun problems we found when we shipped adaptation
    for a spam filter
  • Fun problems we found when we worried about
    losing good mail.

29
What Happened When We Shipped an Adaptive Spam
Filter
  • The first spam filter we shipped was adaptive
  • If the user corrected mistakes, we improved the
    filter.
  • What to do if the user does not correct mistakes?
  • We assumed the filter was correct
  • For users who rarely fixed mistakes, this led to
    catastrophically bad results: the filter got
    worse and worse and worse

30
Threshold Drift: Conservative Threshold Setting
Separator: 50/50 mark
We are conservative in our filtering. For
instance, maybe we need to be 96% certain that
mail is spam before we classify it as spam.
Conservative Threshold: 96% sure
31
Threshold Drift: Lots of Spam Classified as Good
Separator: 50/50 mark
Conservative Threshold: 96% sure
32
Threshold Drift: New Separator Parallel to Old
Old Separator
Old Conservative Threshold: 96% sure
New Conservative Threshold: 96% sure
33
Threshold Drift: New Separator Parallel to Old
Old Separator
Old Conservative Threshold: 96% sure
New Conservative Threshold: 96% sure
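
The drift can be reproduced in a one-dimensional toy model. This is only a sketch under simplifying assumptions that are not from the talk: each message is a single "spamminess" number, the separator is the midpoint of the two class means, and the conservative threshold is a fixed margin above the separator (standing in for "96% sure"). The user never corrects anything, so the filter retrains on its own labels.

```python
def mean(values):
    return sum(values) / len(values)


def fit_threshold(good, spam, margin):
    """Separator at the midpoint of class means, shifted to be conservative."""
    separator = (mean(good) + mean(spam)) / 2
    return separator + margin


def self_train(good, spam, margin, rounds):
    """Retrain repeatedly on the filter's own labels (no user corrections)."""
    mail = good + spam
    threshold = fit_threshold(good, spam, margin)
    history = [threshold]
    for _ in range(rounds):
        # Anything at or below the conservative threshold is assumed good,
        # including spam that sits between the separator and the threshold.
        labeled_spam = [x for x in mail if x > threshold]
        labeled_good = [x for x in mail if x <= threshold]
        threshold = fit_threshold(labeled_good, labeled_spam, margin)
        history.append(threshold)
    return history


history = self_train(good=[0, 1, 2, 3, 4], spam=[6, 7, 8, 9, 10],
                     margin=2, rounds=3)
print(history)
```

Each round, spam that slipped under the conservative threshold is counted as good mail, which pulls the separator (and with it the threshold) further toward the spam side.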
34
Adaptation with partial user feedback is hard
  • Users may correct all errors, or only all spam,
    all good, 50% spam, 10% spam, no errors, etc.
  • Need to work no matter what the user correction
    rate is
  • Great problem that you find when you try to build
    a real system

35
Fun problems we found when we worried about
losing good mail
  • Most machine learning focuses on accuracy
  • Assumes all errors are equally bad
  • For spam (and most other problems), the cost of
    deleting good mail is much higher than the cost
    of spam in the inbox
  • Some research on optimizing area under the curve
    so you get good performance everywhere
  • Almost no research on how to optimize for a
    specific point.

[Figure: curve of missed spam versus good mail caught. One axis runs
from 0 (no missed spam) to 1 (all spam missed); the other runs from
0 (no good caught) to 1 (all good caught).]
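
"Optimizing for a specific point" means fixing the tolerated rate of good mail caught and asking how much spam is missed there. The sketch below is illustrative only: the scores are invented, and the function name is hypothetical. It picks the lowest score threshold that keeps the false positive rate within a budget and reports the missed-spam rate at that point.

```python
def operating_point(good_scores, spam_scores, max_fp_rate):
    """Lowest threshold whose false positive rate is within budget.

    A message is treated as spam when its score >= threshold.
    Returns (threshold, false_positive_rate, missed_spam_rate).
    """
    candidates = sorted(set(good_scores) | set(spam_scores)) + [float("inf")]
    for t in candidates:
        fp_rate = sum(s >= t for s in good_scores) / len(good_scores)
        if fp_rate <= max_fp_rate:
            missed = sum(s < t for s in spam_scores) / len(spam_scores)
            return t, fp_rate, missed


# Made-up classifier scores (higher means more spam-like).
good = [0.01, 0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.60, 0.70, 0.97]
spam = [0.45, 0.65, 0.80, 0.90, 0.95, 0.98, 0.99, 0.99, 0.99, 1.00]

print(operating_point(good, spam, max_fp_rate=0.10))
print(operating_point(good, spam, max_fp_rate=0.0))
```

Tightening the budget from 10% to 0% good mail caught raises the missed-spam rate sharply in this toy data, which is the region of the curve the slide cares about.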
36
Our technique (Scott Yih and Joshua Goodman)
[Figure legend: spam mail, good mail]
  • First, learn a model on all training data (e.g.
    a linear classifier)
  • Pick the subset of the data in the region you
    care about
  • Find all messages, good and spam, that are more
    than, say, 50% likely to be spam according to the
    first model
  • Train a new model on only this data
  • At test time, use both models
  • Works substantially better than other techniques:
    at the desired low false positive rate, it
    reduces spam by 20-40% compared to normal
    techniques.
  • Can make exciting progress even in a
    well-explored area like text classification when
    you build a system.
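
The steps on this slide can be sketched as follows. This is a toy illustration, not the authors' implementation: a nearest-centroid classifier stands in for the linear model, "more than 50% likely to be spam" is approximated by a positive score, the data is invented, and the rule for combining the two models at test time (both must agree) is an assumption, since the slide only says both are used.

```python
def train_centroid_model(examples):
    """examples: list of (feature_tuple, is_spam). Returns a scoring function.

    The score is positive when a message is closer to the spam centroid
    than to the good centroid, which gives a linear decision boundary.
    """
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    spam_c = centroid([x for x, y in examples if y])
    good_c = centroid([x for x, y in examples if not y])

    def score(x):
        d_good = sum((a - b) ** 2 for a, b in zip(x, good_c))
        d_spam = sum((a - b) ** 2 for a, b in zip(x, spam_c))
        return d_good - d_spam

    return score


def train_two_stage(examples):
    stage1 = train_centroid_model(examples)
    # Keep only the region we care about: messages stage 1 leans spam on.
    region = [(x, y) for x, y in examples if stage1(x) > 0]
    stage2 = train_centroid_model(region)

    def is_spam(x):
        # Assumed combination rule: both models must lean spam.
        return stage1(x) > 0 and stage2(x) > 0

    return is_spam, region


# One made-up "spamminess" feature per message.
good = [((v,), False) for v in (0, 1, 2, 3, 4, 5, 6.5, 7)]
spam = [((v,), True) for v in (4.5, 6, 7.5, 8, 9, 10, 11, 12)]
is_spam, region = train_two_stage(good + spam)
print(len(region), is_spam((7.0,)), is_spam((9.0,)))
```

In this toy data the first model alone would flag the good message at 7.0; the second model, trained only on the borderline region, lets it through at the cost of missing some spam near the boundary.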

37
Conclusion (1/2)
  • Building systems is a great way to find
    interesting and important new problems
  • Sometimes leads to fundamental research
  • Adaptation with partial feedback
  • Learning for specific tradeoffs
  • Contact finding with CRFs (constrained CRF
    decoding, confidence estimation)
  • Email is the most important application (and
    there are lots of machine learning problems)
  • Email is why my wife's grandmother bought a
    computer

38
Conclusion (2/2)
  • Tons of exciting research
  • Email
  • Priorities, Auto-folder, Auto-Tag, Automatic
    Search
  • Spam
  • Still haven't solved it; can keep improving
  • New problems like phishing
  • Conference on Email and Anti-Spam
  • You just missed it (two weeks ago), so start
    writing papers for next year.

39
Disposable Email Addresses
  • You have one address for each sender
  • JOSHUAGO1895422_at_microsoft.com
  • All go to same mailbox
  • If I give you my address, and you send me spam, I
    just delete the address
  • How do new senders get an address?
  • If I send mail to 3 people, which address is it
    From?
  • Hard to remember!
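
The bookkeeping behind disposable addresses is small. The sketch below is a hypothetical illustration (class name, numbering scheme, and `example.com` domain are all invented): each sender gets its own alias, every alias delivers to the same mailbox, and an alias that attracts spam is revoked.

```python
class DisposableAddressBook:
    """Tracks which per-sender aliases are still accepted."""

    def __init__(self, user, domain):
        self.user = user
        self.domain = domain
        self._next_id = 1
        self._active = {}  # alias -> sender it was issued to

    def issue(self, sender):
        """Create a fresh alias for one sender."""
        alias = f"{self.user}{self._next_id}@{self.domain}"
        self._next_id += 1
        self._active[alias] = sender
        return alias

    def revoke(self, alias):
        """Stop accepting mail for an alias that started receiving spam."""
        self._active.pop(alias, None)

    def accepts(self, alias):
        return alias in self._active


book = DisposableAddressBook("joshuago", "example.com")
shop = book.issue("online-shop")
friend = book.issue("friend")
book.revoke(shop)  # the shop leaked the address to spammers
print(book.accepts(shop), book.accepts(friend))
```

The sketch does not address the open questions on the slide, such as how a brand-new sender obtains an alias or which alias to use when writing to several people at once.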

40
My Favorite Solution
  • If we could get everyone at Hotmail to never
    answer any spam, spammers would just give up
    sending to Hotmail.
  • So, when new Hotmail users sign up, send them 100
    really tempting ads
  • If they answer any of them, terminate account

41
My Favorite Solution
  • If we could get everyone at Hotmail to never
    answer any spam, spammers would just give up
    sending to Hotmail.
  • So, when new Hotmail users sign up, send them 100
    really tempting ads
  • If they answer any of them, terminate account
  • Hotmail management refuses to consider this.

42
I tried to ship a grammar checker
  • Eric Brill gave a keynote in ???
  • "Processing Natural Language without Natural
    Language Processing"
  • All you need is lots of data
  • You can build a grammar checker with very simple
    machine learning.
  • Solve common grammar problems like their/they're,
    etc.
  • Makes NLP sound really boring and problems seem
    easy.
  • Grammar checking is actually a very interesting
    problem

43
Why grammar checking is interesting (and hard)
after all
  • Product groups already had good solutions for
    English
  • Wanted Brazilian Portuguese
  • There's tons of well-edited data for English
  • Try finding data for Brazilian Portuguese, etc.
  • "There's no data like more data" only applies if
    there is more data
  • English is largely uninflected, but most languages
    have strong inflection
  • If you don't morphologically analyze, the
    vocabulary is effectively huge, multiplying the
    data sparsity problem

44
What else went wrong
  • Top priority: agreement (singular/plural, gender)
  • Traditional ML approach to grammar checking
    (confusable word pairs) is local, no structure
  • Works well for > 90% of test instances, because
    most agreement is local.
  • People doesn't make mistakes when the subject and
    verb is next to each other
  • People who make a mistake is most likely to do so
    when the subject and verb is far apart.
  • Need grammar, or some other powerful technique
  • No Brazilian Portuguese treebank
  • Grammar checking is a great problem for NLP
  • Trying to build a real system helps us find
    problems we didn't even know we had.

45
Blackhole Lists
MSN blocks e-mail from rival ISPs
By Stefanie Olsen, Staff Writer, CNET News.com
February 28, 2003, 2:34 PM PT
Microsoft's MSN said its e-mail
services had blocked some incoming messages from
rival Internet service providers earlier this
week, after their networks were mistakenly banned
as sources of junk mail. The Redmond, Wash.,
company, which has nearly 120 million e-mail
customers through its Hotmail and MSN Internet
services, confirmed Friday it had wrongly placed
a group of Internet protocol addresses from AOL
Time Warner's RoadRunner broadband service and
EarthLink on its "blocklist" of known spammers
whose mail should be barred from customer
in-boxes. Once notified of the error by the two
ISPs, MSN moved the IP addresses "over to a safe
list immediately," according to a Microsoft
spokeswoman.
  • Lists of IP addresses that send spam
  • Open relays, Open proxies, DSL/Cable lines, etc
  • Easy to make mistakes
  • Open relays, DSL, Cable send good and spam
  • Who makes the lists?
  • Some list-makers very aggressive
  • Some list-makers too slow
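
A blackhole list lookup reduces to checking whether the sending IP falls inside any listed network. The sketch below uses Python's standard `ipaddress` module; the function name is invented, the addresses come from ranges reserved for documentation, and giving the safe list precedence is an assumption modeled on the "safe list" mentioned in the article above.

```python
import ipaddress


def is_blocked(ip, blocklist, safelist=()):
    """True if ip is in a blocklisted network and not in a safelisted one."""
    addr = ipaddress.ip_address(ip)
    if any(addr in ipaddress.ip_network(net) for net in safelist):
        return False  # safe list overrides the blocklist
    return any(addr in ipaddress.ip_network(net) for net in blocklist)


blocklist = ["192.0.2.0/24", "198.51.100.7/32"]
safelist = ["192.0.2.0/25"]  # e.g. an ISP range that was listed by mistake

print(is_blocked("192.0.2.55", blocklist))            # listed network
print(is_blocked("192.0.2.55", blocklist, safelist))  # rescued by safe list
print(is_blocked("203.0.113.9", blocklist))           # not listed
```

Blocking whole networks is what makes mistakes easy: one entry covers every customer behind that range, good senders and spammers alike.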

46
Nigerian Chatter
tatyanaatkins: want to make money?
joshuagood9: how?
tatyanaatkins: have run a textile company and get pay in cheques and money orders
joshuagood9: how do I make money?
tatyanaatkins: i gt my clients to send them to u while u cash em and remove your pay then sen the rest to me
joshuagood9: Why don't you cash them yourself?
tatyanaatkins: because presently i am traveling around and this come in at a rate faster than i can
tatyanaatkins: need assistance in catching up
tatyanaatkins: if u wish i can send u the letter of incoporation
joshuagood9: yes, email it to me
joshuagood9: joshuagood9_at_yahoo.com
tatyanaatkins: hold on
joshuagood9: you are in nigeria?
tatyanaatkins: yes
tatyanaatkins: that's where the factory is
joshuagood9: how much will you pay me
tatyanaatkins: u get up to 200 dollars every delivery
joshuagood9: what is in a delivery? how do I get the money to you?
tatyanaatkins: i get the clients to send them to u
joshuagood9: and then what?
tatyanaatkins: u cash it and send via western union
joshuagood9: sounds easy
tatyanaatkins: yeah
joshuagood9: why do you pay me so much money?
joshuagood9: how many do I have to cash? Is one "delivery" one check? or a lot?
tatyanaatkins: cos people have eloped with my money n the past
joshuagood9: why will you trust me?
tatyanaatkins: so i have decided to pay good so we all can be satisfied
joshuagood9: that makes sense
joshuagood9: Let me call you on the phone, and we can talk about it
tatyanaatkins: ok
joshuagood9: what is your number?
tatyanaatkins: 2340833830119
joshuagood9: oh, that's international
joshuagood9: I'm at work now
joshuagood9: I'll have to call you later, from home
tatyanaatkins: ok
tatyanaatkins: are u interested?
joshuagood9: of course!
tatyanaatkins: so i'll send u your letter
joshuagood9: my letter?
tatyanaatkins: of employment
joshuagood9: oh, ok

47
Implicit Query (S. T. Dumais, E. Cutrell, R.
Sarin and E. Horvitz)
[Figure: panels labeled People search, Related email, Documents,
Intranet]
48
Chat Bot
joshuagood9: hi there
superchristina: hey there how u doin?
joshuagood9: are you a bot?
superchristina: im not a bot are u? lol
joshuagood9: are you a bot?
superchristina: i hate bots lol
joshuagood9: asl?
superchristina: im 21 f usa and u?
joshuagood9: I am fine, thank you
superchristina: right on asl?...  im 20 f usa
joshuagood9: 74/M, WA
superchristina: nice age
joshuagood9: thank you
superchristina: yw sweety..could u do me a favor..check out my homepage and my profile see if my cam works? brb

49
(No Transcript)