The State of the Spam July 2004

Slides: 33

Provided by: Erica127

Learn more at: https://www.usenix.org

Transcript and Presenter's Notes

1
The State of the Spam
July 2004

Eric Allman
Sendmail, Inc.

2
Full Disclosure

These are not the exact slides I presented at
USENIX ATC on July 2, 2004. I have taken the
opportunity to fix one or two minor mistakes,
fill out spam data for June 2004, and add a few
minor points. Other than that, this is the
same material.

3
My Personal View

Everything in this talk is my personal view
Doesn't represent Sendmail, Inc.
Doesn't represent Yahoo!
Doesn't represent Microsoft
Doesn't represent Bob's Sushi Bar and Bait Shop
It's really a grab bag of observations and
thoughts
Focus on technology
A little data, a little math, a lot of arm waving
A whole bunch of opinion
Lots of overlap between various scheme
classifications

4
The Traditional Anti-Spam Talk

What is spam? (three slides)
Spam is bad (two slides)
No, it's really really bad (two slides)
It's getting worse (three slides)
Did I mention how bad it is? (one slide)
If we don't fix spam, all of London will be
buried under 25 feet of spam by 2104 (one-two
slides)
It's really really really really bad (one slide)
The sky is falling (one slide)
Buy my product and all your problems will be
solved (17 slides)

5
Statistics and Claims

There are approximately 80 world-class spammers
who pay on the order of US$100K/month for
bandwidth and servers
200 spam operations account for 90% of all spam
[Spamhaus, June 2004]
Spam costs 300 µ¢ (3×10⁻⁶ US$) per message [David
Harris, LinuxPro, October 2003]
AOL rejects 80% of all incoming mail as spam; 80%
of the remainder is never opened (probably spam), for
96% total spam [Carl Hutzler, AOL, personal
communication, 2004]
Email volume increased 600% in 2003, all of it
spam [MAAWG, April 2004]
Zombie machines generate 90% of all spam [AOL,
via MSNBC, June 2004] (at least 40%, Atlanta
Journal-Constitution, June 2004) (four fifths,
Sandvine, via The Register, June 2004): pick your
number
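
The 96% total in the AOL claim can be checked with a line of arithmetic, assuming (as the slide does) that all unopened mail among the delivered 20% is spam:

```latex
\[
  \underbrace{0.80}_{\text{rejected as spam}}
  + \underbrace{(1 - 0.80) \times 0.80}_{\text{delivered but never opened}}
  = 0.80 + 0.16 = 0.96
\]
```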

6
Eric's Personal and Business Spam By Month
Gap in data due to operator error
7
Technology Summary

Increasing breadth and depth of technologies
Often combined to create hybrids
Realtime Blackhole Lists (RBLs)
Lots of collateral damage
Content filtering
Heuristics
Fingerprinting and collaboration
Machine learning (ML)
Challenge-Response (C-R)
Nearly always combined with other schemes
Traffic analysis
Identity (sender authentication, source
filtering)
IP Address based
Cryptography based
Economic Schemes

We Are Here
8
Realtime IP Blackhole Lists

DNS-based lists of known spamming IP addresses
Tend to have a lot of collateral damage
Operators vary from quite clean to very shady
Increasingly used as input to richer algorithms
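
A DNS-based list is typically queried by reversing the octets of the connecting IPv4 address and appending the list's zone; an answer means "listed". The sketch below assumes that convention. The zone `dnsbl.example.org` is a hypothetical placeholder, not a real list.

```python
import ipaddress
import socket

def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the lookup name: reversed IPv4 octets followed by the list's zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone

def is_listed(ip: str, zone: str) -> bool:
    """Return True if the list answers with an address record for this IP.

    Any resolver failure (including a timeout) is treated as "not listed",
    which is a simplifying assumption of this sketch.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        return False

# Hypothetical zone; 192.0.2.99 is a documentation-only address.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

Because the answer is only "listed or not" for a whole address, a shared or reassigned IP drags its innocent users along, which is one source of the collateral damage noted above.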

9
Heuristic Filters

Smart people observe what spammers are doing and
devise means to detect and counter them
Often based on header weirdness, excessive use of
HTML tags and attributes, encoding tricks
Sometimes based on content (e.g., dollar signs,
exclamation points)
Leaves the good guys (that's us!) in a reactive
mode
Decreasing in effectiveness as spammers use more
tricks to avoid content filters
Generally an unwinnable arms race

10
Fingerprinting and Collaboration

Central clearing house of known spam
Stores fingerprints, not full message
Fingerprints sometimes fuzzy to avoid
snowflaking
Often collaborative (community based)
Extremely low false positive rate
Problem: reactive (by definition, some people
have to receive the spam)
Claim: updates every 5 minutes OK, but
preferably 15 seconds
Problem: reputation of submitters
Open Source: Vipul's Razor
Commercial: Brightmail, SpamNet (Cloudmark)

11
Machine Learning Filters

Let the computer figure out the interesting stuff
Several flavors; best known is Naïve Bayesian
Classifiers
Neural Networks, Support Vector Machines are others
Sensitive to feature extraction
Need two piles of training data
Spam and not-spam
Some algorithms generalize to N piles
Use machine-learning algorithm to learn patterns
in the two sets
Tend to work very well for individuals, less well
at the server level
Our spam may be similar, but our legitimate mail
varies a lot

12
Bayes' Theorem
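
The slide body did not survive in this transcript. For reference, the standard statement of the theorem in spam terms is given below; the symbols are assumptions of this note, with S the event "message is spam" and W the event "message contains a given word".

```latex
\[
  P(S \mid W) =
  \frac{P(W \mid S)\,P(S)}
       {P(W \mid S)\,P(S) + P(W \mid \neg S)\,P(\neg S)}
\]
```
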
13
Naïve Bayesian Classifier
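
The slide body did not survive in this transcript. As a stand-in, here is a minimal sketch of a naïve Bayesian classifier over the two training piles (spam and not-spam). Whitespace tokenization, add-one smoothing, and the class and method names are all assumptions of this sketch, not the presenter's method.

```python
import math
from collections import Counter

class NaiveBayes:
    """Two-pile word-count classifier; both piles need at least one message."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.msg_counts = {"spam": 0, "ham": 0}

    def train(self, label, text):
        self.msg_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def log_score(self, label, text):
        counts = self.word_counts[label]
        total = sum(counts.values())
        vocab = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        # Log prior: fraction of training messages in this pile.
        score = math.log(self.msg_counts[label] / sum(self.msg_counts.values()))
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the product.
            score += math.log((counts[word] + 1) / (total + vocab))
        return score

    def classify(self, text):
        return max(("spam", "ham"), key=lambda label: self.log_score(label, text))

nb = NaiveBayes()
nb.train("spam", "cheap pills buy now")
nb.train("spam", "buy cheap watches now")
nb.train("ham", "meeting agenda for monday")
nb.train("ham", "lunch on monday")
print(nb.classify("buy cheap pills"))
```

The "naïve" part is the loop: each word contributes independently, as if words in a message were unrelated to one another.
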
14
Challenge-Response

Heavily allow- (white-)list-based
If sender isn't on recipient's allow-list, send a
challenge
Challenges can be simple response, human puzzles,
computational challenges, etc.
Successful response to the challenge puts the
sender on the recipient's allow-list
Recipient can permanently block- (black-)list
senders
Challenges arbitrarily complex
Simple response
Human puzzle such as visual detection (e.g., read
a word in an image or answer a simple question
about an image)
Computational puzzles such as HashCash
Some evidence spammers can subvert human puzzles
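
A computational puzzle in the HashCash style asks the sender to find a value whose hash has a required number of leading zero bits: costly to find, cheap to verify. The sketch below is a simplification; the `resource:nonce` stamp layout and the function names are assumptions here and do not follow the real Hashcash stamp format.

```python
import hashlib
from itertools import count

def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the front of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def mint(resource: str, bits: int) -> str:
    """Sender's side: brute-force a nonce until the hash is small enough."""
    for nonce in count():
        stamp = f"{resource}:{nonce}"
        if leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits:
            return stamp

def verify(stamp: str, resource: str, bits: int) -> bool:
    """Recipient's side: a single hash checks the work."""
    return (stamp.startswith(resource + ":")
            and leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits)

# 12 bits means roughly 4,096 hash attempts on average for the sender.
stamp = mint("recipient@example.com", 12)
print(stamp, verify(stamp, "recipient@example.com", 12))
```

Each extra required bit doubles the sender's expected work while the recipient's cost stays at one hash.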

15
Traffic-Based Filtering

Observe traffic patterns
Large number of simultaneous connections from one
IP block is suspicious
Extremely rapid rate of connections is suspicious
Many possible algorithms
Often slows down (not rejects) high traffic
spikes
Variant: Greylisting
Tempfail initial connection from unfamiliar
senders (keeps triple of sender IP address,
sender email address, recipient email address)
Spam engines that don't have queues drop the
message
Really a heuristic
But see "Email Prioritization: Reducing Delays on
Legitimate Mail Caused by Junk Mail" (this
conference)
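
The greylisting variant can be sketched in a few lines: remember when each (sender IP, sender address, recipient address) triple was first seen, and temp-fail until a retry arrives after a minimum delay. The 300-second delay, the class name, and the injectable clock are assumptions of this sketch; 451 and 250 are used as the SMTP temporary-failure and success reply codes.

```python
import time

class Greylist:
    def __init__(self, min_delay=300.0, clock=time.time):
        self.min_delay = min_delay   # seconds a new triple must wait (assumed value)
        self.clock = clock           # injectable so the logic can be tested
        self.first_seen = {}         # triple -> time of first attempt

    def check(self, sender_ip, mail_from, rcpt_to):
        """Return 451 (try again later) or 250 (accept) for this attempt."""
        triple = (sender_ip, mail_from.lower(), rcpt_to.lower())
        now = self.clock()
        first = self.first_seen.setdefault(triple, now)
        if now - first < self.min_delay:
            return 451
        return 250

# Simulated clock: first attempt and an early retry are deferred, a later retry passes.
t = [0.0]
g = Greylist(min_delay=300.0, clock=lambda: t[0])
first = g.check("192.0.2.1", "a@example.com", "b@example.org")
t[0] = 60.0
early = g.check("192.0.2.1", "a@example.com", "b@example.org")
t[0] = 301.0
late = g.check("192.0.2.1", "a@example.com", "b@example.org")
print(first, early, late)
```

A sender with a real queue retries and gets through; the cost to legitimate mail is the delay on first contact.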

16
Identity-Based Filtering

Almost always require authentication
Exception: Realtime Blackhole Lists (RBLs)
Allow- and Block- lists (White/Black)
Almost always combined with C-R
Accreditation and Reputation

17
Allow-lists/Block-lists

Basic to nearly all anti-spam technologies
Allow-lists allow users to specifically get
messages that might be classified as spam (e.g.,
newsletters)
Block-lists usable for blocking obnoxious senders
even if they aren't sending UCE
Allow-lists can lower receiver costs (avoid
expensive algorithms)
Two philosophies
All not explicitly illegal gets through (default
to accept)
All not explicitly legal gets blocked (default to
deny)
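
The two philosophies differ only in what happens to a sender on neither list. A minimal sketch follows; the function name, the three outcomes, and the rule that the block-list wins over the allow-list are assumptions made for illustration.

```python
def decide(sender, allow_list, block_list, default_accept):
    """Return 'accept' (skip further checks), 'filter' (run other checks), or 'reject'."""
    if sender in block_list:
        return "reject"
    if sender in allow_list:
        return "accept"          # allow-listed mail avoids the expensive algorithms
    # Unknown sender: default-to-accept passes it on, default-to-deny stops it.
    return "filter" if default_accept else "reject"

allow = {"newsletter@example.com"}
block = {"obnoxious@example.net"}
print(decide("stranger@example.org", allow, block, default_accept=True))
print(decide("stranger@example.org", allow, block, default_accept=False))
```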

18
Sender Authentication

Problem: anyone can claim to be anyone they want
Authentication directly addresses fraud and
phishing
Authentication is not an anti-spam solution in
and of itself, but is essential for
identity-based algorithms
Already have SMTP AUTH and TLS
Both are MTA-to-MTA, not end-to-end
Requires advance arrangements
Absolutely essential for identity-based
algorithms
Per-user authentication: PGP, S/MIME
Historic difficulties with key management (PKI)
Per-domain authentication may be good enough for
spam
Assign responsibility to someone who can do
something about it
Should really be thought of as non-repudiability

19
IP-Address Based Authentication

Looks at domain of sender, gets list of outgoing
mail server IP addresses for that domain
If actual connection IP address on the list,
probably good
Broken by forwarding, MX records, aliasing, etc.
Specifically, breaks store-and-forward model of
email
Proposals to fix, some very ugly
Current schemes store information in DNS TXT
records
Concerns about resilience and capacity of DNS
No reason they couldn't be modified to use
another protocol

20
SPF

Sender Permitted From / Sender Policy Framework
Meng Weng Wong
Validates MAIL FROM address
Pros
Very low cost for senders, fairly low cost for
recipients
Compact policy language
Cons
Not an identity that users normally see; limited
help for phishing
Hard (at least ugly) to fix forwarding problem
Example: v=spf1 a mx -all
Many SPF records already published
Unclear how many recipients are checking them
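
A record such as `v=spf1 a mx -all` says: accept the domain's own address records and its MX hosts, fail everything else. The sketch below evaluates only that tiny subset (bare `a`, `mx`, and `all`, with the `+ - ~ ?` qualifiers). DNS lookups are replaced by pre-resolved address sets passed in by the caller; that, the function name, and skipping unsupported terms are assumptions of this sketch, not how a conforming checker behaves.

```python
def check_spf(record, client_ip, a_addrs, mx_addrs):
    """Evaluate a small SPF subset against the connecting IP address."""
    terms = record.split()
    if not terms or terms[0] != "v=spf1":
        return "none"
    results = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}
    for term in terms[1:]:
        qualifier = "+"                      # default qualifier is pass
        if term[0] in results:
            qualifier, term = term[0], term[1:]
        if term == "a":
            matched = client_ip in a_addrs   # domain's own address records
        elif term == "mx":
            matched = client_ip in mx_addrs  # addresses of the domain's MX hosts
        elif term == "all":
            matched = True
        else:
            continue                         # unsupported in this sketch
        if matched:
            return results[qualifier]
    return "neutral"

record = "v=spf1 a mx -all"
a_addrs = {"192.0.2.10"}     # hypothetical, documentation-only addresses
mx_addrs = {"192.0.2.25"}
print(check_spf(record, "192.0.2.25", a_addrs, mx_addrs))
print(check_spf(record, "203.0.113.7", a_addrs, mx_addrs))
```

The second call shows the forwarding problem in miniature: a legitimate forwarder connecting from an unlisted address looks exactly like a forger.
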
21
Caller Id For Email

Microsoft proposal
Validates message header
Resent-Sender, Resent-From, Sender, From
Pros
Low cost for sender, fairly low cost for
recipients
Addresses phishing head on
Relatively easy to solve forwarding problem
Cons
Policy language is XML, which may be too verbose
It's from Microsoft → FUD
Example

<ep xmlns='http://ms.net/1'> <out>
<m><a></a></m><m><mx/></m> </out> </ep>
22
Sender Id (SPF + Caller Id)

Proposed, in development
From MARID WG of IETF
Validates message header or envelope
Requires SMTP extension
Uses either SPF or Caller Id syntax (in
discussion)
Pros
Addresses phishing
Cons
Requires software updates at both ends
Unclear
What policy language to use?
Does it make forwarding even harder?

MAIL FROM:<> SUBMITTER=sender@example.com
23
Cryptographic Domain Authentication

Sign messages using per-domain key
Distribute public cert using DNS (or other
mechanism)
Doesn't have aliasing/forwarding problems
Loses if message is modified
8 → 7 bit translations
NL → CRLF translations
Blank lines added at end
Proposals
Yahoo DomainKeys
DNS distribution of public RSA keys
Signs entire message, including header
Verisign
Variant of S/MIME, not yet public
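
One way to soften the "loses if message is modified" problem is to canonicalize the body before hashing, so that harmless transport changes do not alter what gets signed. The sketch below normalizes line endings and drops trailing blank lines, then computes the digest that a per-domain private key would sign. The function names and the choice of SHA-256 are assumptions; this is not the algorithm of any proposal listed above, and it does not address 8-to-7-bit re-encoding.

```python
import hashlib

def canonical_body(body: bytes) -> bytes:
    """Normalize NL/CRLF differences and strip blank lines added at the end."""
    lines = body.replace(b"\r\n", b"\n").split(b"\n")
    while lines and lines[-1] == b"":
        lines.pop()
    return b"".join(line + b"\r\n" for line in lines)

def body_digest(body: bytes) -> str:
    """Digest of the canonical form; signing it needs a crypto library (omitted)."""
    return hashlib.sha256(canonical_body(body)).hexdigest()

as_sent = b"Hello\nworld\n"
as_received = b"Hello\r\nworld\r\n\r\n\r\n"   # CRLF translation plus trailing blank lines
print(body_digest(as_sent) == body_digest(as_received))
```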

24
Accreditation

Third-party vouching
Generally carrot and stick, e.g., Bonding
Forward-looking ("I promise to ...")
Sender pays
Can have rich information set
What kind of mail does sender send?
What opt in/out policy does sender use?

25
Reputation

What has been prior behavior?
Backward-looking ("to date, this site has ...")
Another collaborative service, with all the
associated pluses and minuses
Easy to acquire negative information; hard to
directly acquire positive information
Can be sender-pays or recipient-pays
Should be sender-pays; recipient already pays
too much
RBLs are essentially negative reputation services

26
Economic Schemes (1)

Generally shift costs from recipient to sender
Very small cost doesn't hurt usual sender
(perhaps 100/day), but does hurt bulk senders
(millions/day)
Not necessarily cash-based
See my rant in ACM Queue
ePostage
Attach eCash to each message (might be credit)
Recipient probably gets to decide whether to
collect
Who gets the money?
Variant: bid-for-attention schemes
Research: SHRED (Spam Harassment Reduction
through Economic Disincentives)
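
The asymmetry behind sender-pays schemes is simple multiplication. The per-message price below (a tenth of a cent) is a hypothetical figure chosen only to illustrate the scale; the talk does not name one.

```python
def daily_cost(messages_per_day: int, cost_per_message: float) -> float:
    """Total daily cost to a sender at a fixed per-message price."""
    return messages_per_day * cost_per_message

COST = 0.001  # hypothetical price in US dollars per message

usual = daily_cost(100, COST)         # usual sender, about 100 messages/day
bulk = daily_cost(1_000_000, COST)    # bulk sender, a million messages/day
print(f"usual sender: ${usual:.2f}/day, bulk sender: ${bulk:,.2f}/day")
```

The same arithmetic applies when the "cost" is CPU time or some other non-cash resource.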

27
Economic Schemes (2)