The State of the Spam July 2004 - PowerPoint PPT Presentation

Learn more at: https://www.usenix.org
Transcript and Presenter's Notes

Title: The State of the Spam July 2004


1
The State of the Spam, July 2004
  • Eric Allman
  • Sendmail, Inc.

2
Full Disclosure
  • These are not the exact slides I presented at
    USENIX ATC on July 2, 2004. I have taken the
    opportunity to fix one or two minor mistakes,
    fill out spam data for June 2004, and add a few
    minor points. Other than that, these are the
    same material.

3
My Personal View
  • Everything in this talk is my personal view
  • Doesn't represent Sendmail, Inc.
  • Doesn't represent Yahoo!
  • Doesn't represent Microsoft
  • Doesn't represent Bob's Sushi Bar and Bait Shop
  • It's really a grab bag of observations and
    thoughts
  • Focus on technology
  • A little data, a little math, a lot of arm waving
  • A whole bunch of opinion
  • Lots of overlap between various scheme
    classifications

4
The Traditional Anti-Spam Talk
  • What is spam? (three slides)
  • Spam is bad (two slides)
  • No, it's really really bad (two slides)
  • It's getting worse (three slides)
  • Did I mention how bad it is? (one slide)
  • If we don't fix spam, all of London will be
    buried under 25 feet of spam by 2104 (one or two
    slides)
  • It's really really really really bad (one slide)
  • The sky is falling (one slide)
  • Buy my product and all your problems will be
    solved (17 slides)

5
Statistics and Claims
  • There are approximately 80 world-class spammers
    who pay on the order of US$100K/month for
    bandwidth and servers
  • 200 spam operations account for 90% of all spam
    (Spamhaus, June 2004)
  • Spam costs 300 µ¢ (3×10⁻⁶ US$) per message (David
    Harris, LinuxPro, October 2003)
  • AOL rejects 80% of all incoming mail as spam; 80%
    of the remainder is never opened (probably spam),
    for 96% total spam (Carl Hutzler, AOL, personal
    communication, 2004)
  • Email volume increased 600% in 2003, all of it
    spam (MAAWG, April 2004)
  • Zombie machines generate 90% of all spam (AOL,
    via MSNBC, June 2004); at least 40% (Atlanta
    Journal-Constitution, June 2004); four fifths
    (Sandvine, via The Register, June 2004): pick your
    number
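The 96% figure in the AOL bullet follows from combining the two 80% shares. A small sketch of that arithmetic (the variable names are only illustrative):

```python
# Shares taken from the AOL bullet above.
rejected_at_gateway = 0.80    # share of incoming mail rejected as spam
unopened_of_remainder = 0.80  # share of the accepted mail that is never opened

# Rejected mail plus the unopened part of what got through.
total_presumed_spam = rejected_at_gateway + (1 - rejected_at_gateway) * unopened_of_remainder
print(f"{total_presumed_spam:.0%}")  # 96%
```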

6
Eric's Personal and Business Spam By Month
Gap in data due to operator error
7
Technology Summary
  • Increasing breadth and depth of technologies
  • Often combined to create hybrids
  • Realtime Blackhole Lists (RBLs)
  • Lots of collateral damage
  • Content filtering
  • Heuristics
  • Fingerprinting and collaboration
  • Machine learning (ML)
  • Challenge-Response (C-R)
  • Nearly always combined with other schemes
  • Traffic analysis
  • Identity (sender authentication, source
    filtering)
  • IP Address based
  • Cryptography based
  • Economic Schemes

We Are Here
8
Realtime IP Blackhole Lists
  • DNS-based lists of known spamming IP addresses
  • Tend to have a lot of collateral damage
  • Operators vary from quite clean to very shady
  • Increasingly used as input to richer algorithms
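As a rough illustration of how a DNS-based list is consulted: the client reverses the octets of the connecting IPv4 address, appends the list's zone, and looks the name up. By convention an address record in the answer means "listed" and a nonexistent name means "not listed". The sketch below only builds the query name; the zone `dnsbl.example.org` is a placeholder, not a real list.

```python
import ipaddress

def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name that a blocklist lookup for an IPv4 address would query."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")  # also validates the address
    return ".".join(reversed(octets)) + "." + zone

# Hypothetical zone name, used only for illustration.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
# 99.2.0.192.dnsbl.example.org
```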

9
Heuristic Filters
  • Smart people observe what spammers are doing and
    devise means to detect and counter them
  • Often based on header weirdness, excessive use of
    HTML tags and attributes, encoding tricks
  • Sometimes based on content (e.g., dollar signs,
    exclamation points)
  • Leaves the good guys (that's us!) in a reactive
    mode
  • Decreasing in effectiveness as spammers use more
    tricks to avoid content filters
  • Generally an unwinnable arms race

10
Fingerprinting and Collaboration
  • Central clearing house of known spam
  • Stores fingerprints, not full message
  • Fingerprints sometimes fuzzy to avoid
    snowflaking
  • Often collaborative (community based)
  • Extremely low false positive rate
  • Problem: reactive (by definition, some people
    have to receive the spam)
  • Claim: updates every 5 minutes OK, but
    preferably 15 seconds
  • Problem: reputation of submitters
  • Open Source: Vipul's Razor
  • Commercial: Brightmail, SpamNet (Cloudmark)
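A minimal sketch of what a "fuzzy" fingerprint might look like: normalize away the parts spammers randomize per recipient, then hash what is left. This is an assumption-laden illustration, not the algorithm used by Razor, Brightmail, or SpamNet; the normalization rules are invented for the example.

```python
import hashlib
import re

def fuzzy_fingerprint(body: str) -> str:
    """Hash a normalized form of the body so trivial per-message variations collapse."""
    text = body.lower()
    text = re.sub(r"[^a-z\s]", "", text)  # drop digits and punctuation (assumed rule)
    text = " ".join(text.split())         # collapse runs of whitespace
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Two "snowflaked" copies of the same spam map to one fingerprint.
a = "Buy CHEAP pills now!!!  Offer #48213"
b = "buy cheap pills now! offer #90417"
print(fuzzy_fingerprint(a) == fuzzy_fingerprint(b))  # True
```

Only the fingerprint would be sent to the clearing house, which is why the full message never has to leave the recipient's site.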

11
Machine Learning Filters
  • Let the computer figure out the interesting stuff
  • Several flavors; best known is Naïve Bayesian
    Classifiers
  • Neural Networks, Support Vector Machines are others
  • Sensitive to feature extraction
  • Need two piles of training data
  • Spam and not-spam
  • Some algorithms generalize to N piles
  • Use machine-learning algorithm to learn patterns
    in the two sets
  • Tend to work very well for individuals, less well
    at the server level
  • Our spam may be similar, but our legitimate mail
    varies a lot

12
Bayes Theorem
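The equation on this slide did not survive in the transcript. For reference, the standard statement of Bayes' theorem, written here for a spam class $S$ and an observed word $w$ (the symbols are chosen for this note, not taken from the slide), is:

```latex
P(S \mid w) = \frac{P(w \mid S)\,P(S)}{P(w \mid S)\,P(S) + P(w \mid \lnot S)\,P(\lnot S)}
```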
13
Naïve Bayesian Classifier
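The slide body is missing from the transcript. As a hedged sketch of the general technique (not necessarily what was presented), a naïve Bayesian classifier treats words as independent given the class, learns word counts from the two piles of training data, and combines them with Bayes' theorem. Whitespace tokenization, add-one smoothing, and the toy training set below are all assumptions made for the example.

```python
import math
from collections import Counter

def train(messages):
    """messages: list of (text, is_spam). Returns per-class word counts and message counts."""
    words = {True: Counter(), False: Counter()}
    docs = {True: 0, False: 0}
    for text, is_spam in messages:
        docs[is_spam] += 1
        words[is_spam].update(text.lower().split())
    return words, docs

def spam_probability(text, words, docs):
    """P(spam | text) under the naive independence assumption, with add-one smoothing."""
    vocab = set(words[True]) | set(words[False])
    log_score = {}
    for label in (True, False):
        total = sum(words[label].values())
        score = math.log(docs[label] / (docs[True] + docs[False]))  # class prior
        for w in text.lower().split():
            score += math.log((words[label][w] + 1) / (total + len(vocab)))
        log_score[label] = score
    # Convert the two log scores back into a normalized probability.
    top = max(log_score.values())
    weight = {k: math.exp(v - top) for k, v in log_score.items()}
    return weight[True] / (weight[True] + weight[False])

# Toy training piles, invented for illustration.
training = [
    ("cheap pills online", True),
    ("win cash now", True),
    ("meeting agenda attached", False),
    ("lunch at noon", False),
]
words, docs = train(training)
print(spam_probability("cheap cash", words, docs) > 0.5)     # True
print(spam_probability("lunch meeting", words, docs) < 0.5)  # True
```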
14
Challenge-Response
  • Heavily allow- (white-)list-based
  • If sender isn't on recipient's allow-list, send a
    challenge
  • Challenges can be simple response, human puzzles,
    computational challenges, etc.
  • Successful response to the challenge puts the
    sender on the recipient's allow-list
  • Recipient can permanently block- (black-)list
    senders
  • Challenges arbitrarily complex
  • Simple response
  • Human puzzle such as visual detection (e.g., read
    a word in an image or answer a simple question
    about an image)
  • Computational puzzles such as HashCash
  • Some evidence spammers can subvert human puzzles
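The allow-list flow described on this slide can be sketched as a small state machine. The class and method names are hypothetical, and the challenge itself (simple reply, human puzzle, or computational puzzle) is left abstract.

```python
class ChallengeResponseGate:
    """Toy per-recipient gate: deliver, reject, or hold mail pending a challenge."""

    def __init__(self):
        self.allow = set()   # senders who answered a challenge (or were added by hand)
        self.block = set()   # senders the recipient has permanently blocked
        self.pending = {}    # sender -> messages held until the challenge is answered

    def receive(self, sender, message):
        if sender in self.block:
            return "reject"
        if sender in self.allow:
            return "deliver"
        self.pending.setdefault(sender, []).append(message)
        return "challenge"

    def challenge_answered(self, sender):
        """A successful response allow-lists the sender and releases held mail."""
        self.allow.add(sender)
        return self.pending.pop(sender, [])

gate = ChallengeResponseGate()
print(gate.receive("alice@example.com", "hi"))       # challenge
print(gate.challenge_answered("alice@example.com"))  # ['hi']
print(gate.receive("alice@example.com", "again"))    # deliver
```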

15
Traffic-Based Filtering
  • Observe traffic patterns
  • Large number of simultaneous connections from one
    IP block is suspicious
  • Extremely rapid rate of connections is suspicious
  • Many possible algorithms
  • Often slows down (not rejects) high traffic
    spikes
  • Variant: Greylisting
  • Tempfail initial connection from unfamiliar
    senders (keeps triple of sender IP address,
    sender email address, recipient email address)
  • Spam engines that don't have queues drop the
    message
  • Really a heuristic
  • But see "Email Prioritization: Reducing Delays on
    Legitimate Mail Caused by Junk Mail" (this
    conference)
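The greylisting variant above can be sketched as a lookup keyed on the (sender IP, sender address, recipient address) triple. The 300-second minimum delay and the class name are assumptions for the example; real deployments also expire old entries.

```python
class Greylist:
    """Tempfail the first attempt for an unfamiliar triple; accept a later retry."""

    def __init__(self, min_delay=300):
        self.min_delay = min_delay  # seconds a sender must wait before retrying (assumed)
        self.first_seen = {}        # (ip, mail_from, rcpt_to) -> time of first attempt

    def check(self, ip, mail_from, rcpt_to, now):
        key = (ip, mail_from, rcpt_to)
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "accept"
        return "tempfail"  # would be an SMTP 4xx reply; a queueing MTA retries later

g = Greylist()
triple = ("192.0.2.1", "a@example.com", "b@example.org")
print(g.check(*triple, now=0))    # tempfail
print(g.check(*triple, now=60))   # tempfail (retried too soon)
print(g.check(*triple, now=600))  # accept
```

A sender without a retry queue never makes the later attempt, which is the behavior the slide relies on.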

16
Identity-Based Filtering
  • Almost always require authentication
  • Exception: Realtime Blackhole Lists (RBLs)
  • Allow- and Block- lists (White/Black)
  • Almost always combined with C-R
  • Accreditation and Reputation

17
Allow-lists/Block-lists
  • Basic to nearly all anti-spam technologies
  • Allow-lists allow users to specifically get
    messages that might be classified as spam (e.g.,
    newsletters)
  • Block-lists usable for blocking obnoxious senders
    even if they aren't sending UCE
  • Allow-lists can lower receiver costs (avoid
    expensive algorithms)
  • Two philosophies
  • All not explicitly illegal gets through (default
    to accept)
  • All not explicitly legal gets blocked (default to
    deny)
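The two philosophies differ only in what happens to a sender on neither list. A minimal sketch, with a hypothetical `default_accept` flag standing in for the policy choice:

```python
def disposition(sender, allow, block, default_accept):
    """Explicit lists win; the default policy decides everything else."""
    if sender in block:
        return "block"
    if sender in allow:
        return "accept"
    return "accept" if default_accept else "block"

allow = {"newsletter@example.com"}
block = {"pest@example.net"}
print(disposition("stranger@example.org", allow, block, default_accept=True))   # accept
print(disposition("stranger@example.org", allow, block, default_accept=False))  # block
```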

18
Sender Authentication
  • Problem: anyone can claim to be anyone they want
  • Authentication directly addresses fraud and
    phishing
  • Authentication is not an anti-spam solution in
    and of itself, but is essential for
    identity-based algorithms
  • Already have SMTP AUTH and TLS
  • Both are MTA-to-MTA, not end-to-end
  • Requires advance arrangements
  • Absolutely essential for identity-based
    algorithms
  • Per-user authentication: PGP, S/MIME
  • Historic difficulties with key management (PKI)
  • Per-domain authentication may be good enough for
    spam
  • Assign responsibility to someone who can do
    something about it
  • Should really be thought of as non-repudiability

19
IP-Address Based Authentication
  • Looks at domain of sender, gets list of IP
    addresses of outgoing mail servers for that domain
  • If actual connection IP address is on the list,
    probably good
  • Broken by forwarding, MX records, aliasing, etc.
  • Specifically, breaks store-and-forward model of
    email
  • Proposals to fix, some very ugly
  • Current schemes store information in DNS TXT
    records
  • Concerns about resilience and capacity of DNS
  • No reason they couldn't be modified to use
    another protocol
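The check described on this slide reduces to "is the connecting IP address in the set the sender's domain published?". The sketch below uses a hard-coded dictionary as a stand-in for the DNS lookup; the domain and address ranges are documentation examples, not real policy data.

```python
import ipaddress

# Hypothetical published policy; a real scheme would fetch this from DNS.
PUBLISHED = {"example.com": ["192.0.2.0/28", "198.51.100.25/32"]}

def connection_authorized(mail_domain, client_ip, published=PUBLISHED):
    """True if client_ip falls inside any network the domain lists as an outgoing server."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in ipaddress.ip_network(net) for net in published.get(mail_domain, []))

print(connection_authorized("example.com", "192.0.2.5"))    # True
print(connection_authorized("example.com", "203.0.113.7"))  # False
```

The second call also shows the forwarding problem: a legitimate forwarder connecting from 203.0.113.7 fails the check even though the message really did originate at example.com.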

20
SPF
  • Sender Permitted From, now Sender Policy Framework
  • Meng Weng Wong
  • Validates MAIL FROM address
  • Pros
  • Very low cost for senders, fairly low cost for
    recipients
  • Compact policy language
  • Cons
  • Not an identity that users normally see; limited
    help for phishing
  • Hard (at least ugly) to fix forwarding problem
  • Example
  • Many SPF records already published
  • Unclear how many recipients are checking them

v=spf1 a mx -all
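To show how compact the policy language is, here is a rough sketch that splits the record `v=spf1 a mx -all` into mechanisms and their results. It handles only plain mechanisms with an optional qualifier; modifiers, macros, and the DNS lookups a real checker performs are deliberately left out.

```python
QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}

def parse_spf(record):
    """Return (mechanism, result-if-matched) pairs for a simple SPF version 1 record."""
    terms = record.split()
    if not terms or terms[0] != "v=spf1":
        raise ValueError("not an SPF version 1 record")
    parsed = []
    for term in terms[1:]:
        qualifier = "+"  # a mechanism with no qualifier defaults to pass
        if term[0] in QUALIFIERS:
            qualifier, term = term[0], term[1:]
        parsed.append((term, QUALIFIERS[qualifier]))
    return parsed

print(parse_spf("v=spf1 a mx -all"))
# [('a', 'pass'), ('mx', 'pass'), ('all', 'fail')]
```

Read left to right: mail from the domain's own address records or its MX hosts passes, and everything else fails.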
21
Caller Id For Email
  • Microsoft proposal
  • Validates message header
  • Resent-Sender, Resent-From, Sender, From
  • Pros
  • Low cost for sender, fairly low cost for
    recipients
  • Addresses phishing head on
  • Relatively easy to solve forwarding problem
  • Cons
  • Policy language is XML, which may be too verbose
  • It's from Microsoft → FUD
  • Example

<ep xmlns='http://ms.net/1'> <out>
<m><a></a></m><m><mx/></m> </out> </ep>
22
Sender Id (SPF + Caller Id)
  • Proposed, in development
  • From MARID WG of IETF
  • Validates message header or envelope
  • Requires SMTP extension
  • Uses either SPF or Caller Id syntax (in
    discussion)
  • Pros
  • Addresses phishing
  • Cons
  • Requires software updates at both ends
  • Unclear
  • What policy language to use?
  • Does it make forwarding even harder?

MAIL FROM:<> SUBMITTER=sender@example.com
23
Cryptographic Domain Authentication
  • Sign messages using per-domain key
  • Distribute public cert using DNS (or other
    mechanism)
  • Doesn't have aliasing/forwarding problems
  • Loses if message is modified
  • 8 → 7 bit translations
  • NL → CRLF translations
  • Blank lines added at end
  • Proposals
  • Yahoo DomainKeys
  • DNS distribution of public RSA keys
  • Signs entire message, including header
  • Verisign
  • Variant of S/MIME, not yet public
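A small illustration of why "loses if message is modified" matters, and of one common mitigation: normalizing the message before hashing. This is not DomainKeys' actual canonicalization, and it stops at the digest; a real scheme would sign that digest with the domain's private key and publish the public key.

```python
import hashlib

def digest(body: bytes) -> str:
    return hashlib.sha256(body).hexdigest()

def canonicalize(body: bytes) -> bytes:
    """Assumed normalization: CRLF line endings and no trailing blank lines."""
    lines = body.replace(b"\r\n", b"\n").split(b"\n")
    while lines and lines[-1] == b"":
        lines.pop()
    return b"\r\n".join(lines) + b"\r\n"

sent = b"Hello\nWorld\n"               # as composed by the sender
relayed = b"Hello\r\nWorld\r\n\r\n"    # after NL to CRLF translation and an added blank line

print(digest(sent) == digest(relayed))                              # False
print(digest(canonicalize(sent)) == digest(canonicalize(relayed)))  # True
```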

24
Accreditation
  • Third-party vouching
  • Generally carrot and stick, e.g., Bonding
  • Forward-looking (I promise to ...)
  • Sender pays
  • Can have rich information set
  • What kind of mail does sender send?
  • What opt in/out policy does sender use?

25
Reputation
  • What has been prior behavior?
  • Backward-looking (to date, this site has ...)
  • Another collaborative service, with all the
    associated pluses and minuses
  • Easy to acquire negative information; hard to
    directly acquire positive information
  • Can be sender-pays or recipient-pays
  • Should be sender-pays; recipient already pays
    too much
  • RBLs are essentially negative reputation services

26
Economic Schemes (1)
  • Generally shift costs from recipient to sender
  • Very small cost doesn't hurt usual sender
    (perhaps 100/day), but does hurt bulk senders
    (millions/day)
  • Not necessarily cash-based
  • See my rant in ACM Queue
  • ePostage
  • Attach eCash to each message (might be credit)
  • Recipient probably gets to decide whether to
    collect
  • Who gets the money?
  • Variant: bid for attention schemes
  • Research: SHRED (Spam Harassment Reduction
    through Economic Disincentives)
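The asymmetry between the usual sender and the bulk sender is easy to see with numbers. The per-message price below is purely hypothetical; the slides do not name one.

```python
COST_PER_MESSAGE = 0.01  # US$ per message; an assumed figure for illustration only

def daily_cost(messages_per_day, cost_per_message=COST_PER_MESSAGE):
    return messages_per_day * cost_per_message

for label, per_day in [("usual sender", 100), ("bulk sender", 1_000_000)]:
    print(f"{label}: ${daily_cost(per_day):,.2f}/day")
# usual sender: $1.00/day
# bulk sender: $10,000.00/day
```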

27
Economic Schemes (2)
  • Bonding (essentially accreditation)
  • Post a bond with a third party
  • Lose the bond if you spam
  • Commercial: Habeas
  • HashCash
  • Similar: pay with CPU/memory/disk bandwidth
  • Difficult to compute, easy to verify
  • Research: Penny Black (Microsoft)
  • Combined schemes
  • Pay only on challenge
  • Problems
  • What about (free) mailing lists?
  • What about purchase acknowledgements?
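"Difficult to compute, easy to verify" can be sketched with a hash-based proof of work in the spirit of HashCash. The stamp format here (`resource:nonce`) is a simplification invented for the example and is not the real Hashcash stamp format; the 16-bit difficulty is likewise an assumption.

```python
import hashlib
from itertools import count

def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def mint(resource: str, bits: int = 16) -> str:
    """Sender's side: try nonces until the hash has enough leading zero bits."""
    for nonce in count():
        stamp = f"{resource}:{nonce}"
        if leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits:
            return stamp

def verify(stamp: str, resource: str, bits: int = 16) -> bool:
    """Recipient's side: a single hash checks the work."""
    return (stamp.startswith(resource + ":")
            and leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits)

stamp = mint("rcpt@example.com")
print(verify(stamp, "rcpt@example.com"))  # True
```

Each extra bit of difficulty doubles the sender's expected work while the recipient's cost stays at one hash. As the Zombies slide notes, this only deters senders who pay for their own CPU time.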

28
Zombies
  • Machines controlled by an evil entity without
    permission of the owners
  • Strongly associated with home machines on cable
    modems and DSL
  • Estimate: multiple zombie clusters of > 100,000
    machines
  • Estimate: zombie clusters are the world's largest
    supercomputer clusters
  • Report: Comcast blocks port 25, Internet spam
    traffic dropped 35%
  • MAJOR PROBLEM: Zombies break many algorithms
  • Computational puzzles
  • Perhaps crypto-based if key easily accessible

29
Predictions
  • Spam will never go away (think cockroaches)
  • Authentication will help, but won't solve the
    problem by itself
  • Spam will be manageable within two to three years
  • Legislation will scare away bit players, not
    large commercial spammers
  • Authentication will help legislation enforcement

30
The State of the Spam, July 2004
  • Eric Allman
  • Sendmail, Inc.

31
Evaluating Anti-Spam Systems: Terminology
  • True/False Positive/Negative
  • Generally abbreviated TP, FP, TN, FN
  • Great sensitivity to False Positives (FPs) (can
    be very expensive)
  • Accuracy
  • = (TP + TN) / (TP + FP + TN + FN)
  • Intuitively, % of messages correctly classified
  • Precision
  • = TP / (TP + FP)
  • Intuitively, % of tagged messages correctly
    classified
  • Recall
  • = TP / (TP + FN)
  • Intuitively, % of actual positives detected
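The three measures are straightforward to compute from the four counts. The counts in the usage example are made up for illustration, with "positive" meaning "tagged as spam".

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, precision, and recall from true/false positive/negative counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Hypothetical filter run over 1,000 messages, 100 of them spam.
result = metrics(tp=90, fp=5, tn=895, fn=10)
print({name: round(value, 3) for name, value in result.items()})
# {'accuracy': 0.985, 'precision': 0.947, 'recall': 0.9}
```

Note that the 5 false positives barely move accuracy even though, as the slide says, each one can be very expensive.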

32
International Legislation
  • USA: CAN-SPAM Act recently signed
  • Strongly backed by DMA
  • Opt out, overrides state laws, limited private
    right of action (PROA)
  • Prohibits fraud and harvesting; requires working
    opt out
  • Have been some prosecutions
  • EU: Directive on Privacy and Electronic
    Communications
  • Article 13(1) (directive covers much more than
    spam)
  • Opt in, consent based
  • Member states must do their own legislation;
    results vary
  • Korea: much like USA
  • Australia: Spam Act 2003
  • Opt in
  • Prohibits fraud, harvesting; requires working opt
    out