Title: The State of the Spam July 2004
1The State of the Spam July 2004
- Eric Allman
- Sendmail, Inc.
2Full Disclosure
- These are not the exact slides I presented at
USENIX ATC on July 2, 2004. I have taken the
opportunity to fix one or two minor mistakes,
fill out spam data for June 2004, and add a few
minor points. Other than that, this is the
same material.
3My Personal View
- Everything in this talk is my personal view
- Doesn't represent Sendmail, Inc.
- Doesn't represent Yahoo!
- Doesn't represent Microsoft
- Doesn't represent Bob's Sushi Bar and Bait Shop
- It's really a grab bag of observations and
thoughts
- Focus on technology
- A little data, a little math, a lot of arm waving
- A whole bunch of opinion
- Lots of overlap between various scheme
classifications
4The Traditional Anti-Spam Talk
- What is spam? (three slides)
- Spam is bad (two slides)
- No, it's really really bad (two slides)
- It's getting worse (three slides)
- Did I mention how bad it is? (one slide)
- If we don't fix spam, all of London will be
buried under 25 feet of spam by 2104 (one-two
slides)
- It's really really really really bad (one slide)
- The sky is falling (one slide)
- Buy my product and all your problems will be
solved (17 slides)
5Statistics and Claims
- There are approximately 80 world class spammers
who pay on the order of US$100K/month for
bandwidth and servers
- 200 spam operations account for 90% of all spam
[Spamhaus, June 2004]
- Spam costs 300 microcents (3×10^-6 US$) per message
[David Harris, LinuxPro, October 2003]
- AOL rejects 80% of all incoming mail as spam; 80%
of the remainder is never opened (probably spam), for
96% total spam [Carl Hutzler, AOL, personal
communication, 2004]
- Email volume increased 600% in 2003, all of it
spam [MAAWG, April 2004]
- Zombie machines generate 90% of all spam [AOL,
via MSNBC, June 2004] (at least 40% [Atlanta
Journal-Constitution, June 2004]) (four fifths
[Sandvine, via The Register, June 2004]); pick your
number
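The 96% figure in the AOL claim follows from combining the two 80% numbers. A minimal sketch of that arithmetic, assuming the second 80% applies only to mail that survived the first rejection (the function name is illustrative):

```python
from fractions import Fraction

def total_spam_fraction(rejected: Fraction, unopened_of_rest: Fraction) -> Fraction:
    """Fraction of all incoming mail presumed to be spam."""
    delivered = 1 - rejected                      # mail that got past the first filter
    return rejected + delivered * unopened_of_rest

total = total_spam_fraction(Fraction(80, 100), Fraction(80, 100))
print(f"{float(total):.0%}")  # prints 96%
```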
6Eric's Personal and Business Spam By Month
Gap in data due to operator error
7Technology Summary
- Increasing breadth and depth of technologies
- Often combined to create hybrids
- Realtime Blackhole Lists (RBLs)
- Lots of collateral damage
- Content filtering
- Heuristics
- Fingerprinting and collaboration
- Machine learning (ML)
- Challenge-Response (C-R)
- Nearly always combined with other schemes
- Traffic analysis
- Identity (sender authentication and source
filtering)
- IP address based
- Cryptography based
- Economic Schemes
We Are Here
8Realtime IP Blackhole Lists
- DNS-based lists of known spamming IP addresses
- Tend to have a lot of collateral damage
- Operators vary from quite clean to very shady
- Increasingly used as input to richer algorithms
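A DNS-based list is usually queried by reversing the octets of the connecting IPv4 address and looking that name up under the list's zone. A rough sketch of that convention follows; the zone `dnsbl.example.org` is a placeholder, not a real list, and `is_listed` is an illustrative helper that needs network access.

```python
import ipaddress
import socket

def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address in a DNS-based list."""
    addr = ipaddress.IPv4Address(ip)              # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"

def is_listed(ip: str, zone: str) -> bool:
    """True if the list returns an address record for this IP (needs network)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:                       # no such name: not listed
        return False

print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
# prints 99.2.0.192.dnsbl.example.org
```

The boolean result is the kind of signal that can be fed into a richer scoring algorithm rather than used as an outright reject.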
9Heuristic Filters
- Smart people observe what spammers are doing and
devise means to detect and counter them
- Often based on header weirdness, excessive use of
HTML tags and attributes, encoding tricks
- Sometimes based on content (e.g., dollar signs,
exclamation points)
- Leaves the good guys (that's us!) in a reactive
mode
- Decreasing in effectiveness as spammers use more
tricks to avoid content filters
- Generally an unwinnable arms race
10Fingerprinting and Collaboration
- Central clearing house of known spam
- Stores fingerprints, not full message
- Fingerprints sometimes fuzzy to avoid
snowflaking
- Often collaborative (community based)
- Extremely low false positive rate
- Problem: reactive (by definition, some people
have to receive the spam)
- Claim: updates every 5 minutes OK, but
preferably 15 seconds
- Problem: reputation of submitters
- Open Source: Vipul's Razor
- Commercial: Brightmail, SpamNet (Cloudmark)
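One way to picture a fuzzy fingerprint: normalize away the parts spammers randomize, then hash what is left. The sketch below is only an illustration of that idea; the normalization rules are assumptions and this is not the signature scheme used by Vipul's Razor or any commercial product.

```python
import hashlib
import re

def fuzzy_fingerprint(body: str) -> str:
    """Hash a message body after stripping case, digits, and punctuation."""
    # Anything that is not a letter (digits, punctuation, whitespace runs)
    # collapses to a single space, so per-recipient noise does not change the hash.
    normalized = re.sub(r"[^a-z]+", " ", body.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# The clearing house stores only fingerprints, never the full message.
known_spam = {fuzzy_fingerprint("Buy CHEAP pills now!!! id=48213")}

incoming = "buy cheap pills now... id=99"
print(fuzzy_fingerprint(incoming) in known_spam)  # prints True
```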
11Machine Learning Filters
- Let the computer figure out the interesting stuff
- Several flavors; best known is Naïve Bayesian
Classifiers
- Neural Networks, Support Vector Machines are others
- Sensitive to feature extraction
- Need two piles of training data
- Spam and not-spam
- Some algorithms generalize to N piles
- Use machine-learning algorithm to learn patterns
in the two sets
- Tend to work very well for individuals, less well
at the server level
- Our spam may be similar, but our legitimate mail
varies a lot
12Bayes' Theorem
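For reference, this is the standard statement of Bayes' theorem as it is usually applied to spam filtering. The notation is an assumption made here: S means "message is spam", H means "message is not spam", and w is an observed feature such as a word.

```latex
P(S \mid w) = \frac{P(w \mid S)\, P(S)}{P(w \mid S)\, P(S) + P(w \mid H)\, P(H)}
```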
13Naïve Bayesian Classifier
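A minimal sketch of a naive Bayesian classifier over two piles of training mail. It assumes whitespace tokenization and add-one (Laplace) smoothing, both choices made here for illustration; real filters put much more work into feature extraction.

```python
import math
from collections import Counter

class NaiveBayes:
    """Two-pile word-count classifier; train both piles before classifying."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.msg_counts = {"spam": 0, "ham": 0}

    def train(self, label: str, text: str) -> None:
        self.word_counts[label].update(text.lower().split())
        self.msg_counts[label] += 1

    def _log_score(self, label: str, words: list) -> float:
        counts = self.word_counts[label]
        total = sum(counts.values())
        vocab = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        # log prior, then one log likelihood per word ("naive": words are
        # treated as independent given the class)
        score = math.log(self.msg_counts[label] / sum(self.msg_counts.values()))
        for w in words:
            score += math.log((counts[w] + 1) / (total + vocab))  # add-one smoothing
        return score

    def classify(self, text: str) -> str:
        words = text.lower().split()
        return max(("spam", "ham"), key=lambda label: self._log_score(label, words))

nb = NaiveBayes()
nb.train("spam", "cheap pills buy now")
nb.train("spam", "buy cheap watches")
nb.train("ham", "meeting agenda for monday")
nb.train("ham", "lunch on monday")
print(nb.classify("buy cheap pills"))  # prints spam
```

Working in log space avoids underflow when many small probabilities are multiplied together.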
14Challenge-Response
- Heavily allow- (white-) list based
- If sender isn't on recipient's allow-list, send a
challenge
- Challenges can be simple response, human puzzles,
computational challenges, etc.
- Successful response to the challenge puts the
sender on the recipient's allow-list
- Recipient can permanently block- (black-) list
senders
- Challenges arbitrarily complex
- Simple response
- Human puzzle such as visual detection (e.g., read
a word in an image or answer a simple question
about an image)
- Computational puzzles such as HashCash
- Some evidence spammers can subvert human puzzles
15Traffic-Based Filtering
- Observe traffic patterns
- Large number of simultaneous connections from one
IP block is suspicious
- Extremely rapid rate of connections is suspicious
- Many possible algorithms
- Often slows down (not rejects) high traffic
spikes
- Variant: Greylisting
- Tempfail initial connection from unfamiliar
senders (keeps triple of sender IP address,
sender email address, recipient email address)
- Spam engines that don't have queues drop the
message
- Really a heuristic
- But see "Email Prioritization: Reducing Delays on
Legitimate Mail Caused by Junk Mail" (this
conference)
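The greylisting variant can be sketched in a few lines: remember when each (sender IP, sender address, recipient address) triple was first seen, and tempfail until a minimum delay has passed. The 300-second delay and the class name are assumptions for illustration; 451 and 250 are the usual SMTP "try again later" and "OK" reply codes.

```python
import time

class Greylist:
    """Tempfail the first attempt from each unfamiliar triple."""

    def __init__(self, min_delay: float = 300.0, clock=time.time):
        self.min_delay = min_delay    # assumed retry window, in seconds
        self.clock = clock            # injectable so the logic can be tested
        self.first_seen = {}          # triple -> time of first attempt

    def check(self, sender_ip: str, mail_from: str, rcpt_to: str) -> int:
        """Return 451 (temporary failure) or 250 (accept)."""
        triple = (sender_ip, mail_from.lower(), rcpt_to.lower())
        now = self.clock()
        first = self.first_seen.setdefault(triple, now)
        if now - first < self.min_delay:
            return 451                # a real MTA queues the message and retries
        return 250
```

A sender with a proper queue retries after the delay and gets through; a spam engine that never retries simply drops the message.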
16Identity-Based Filtering
- Almost always require authentication
- Exception: Realtime Blackhole Lists (RBLs)
- Allow- and Block-lists (White/Black)
- Almost always combined with C-R
- Accreditation and Reputation
17Allow-lists/Block-lists
- Basic to nearly all anti-spam technologies
- Allow-lists allow users to specifically get
messages that might be classified as spam (e.g.,
newsletters)
- Block-lists usable for blocking obnoxious senders
even if they aren't sending UCE
- Allow-lists can lower receiver costs (avoid
expensive algorithms)
- Two philosophies
- All not explicitly illegal gets through (default
to accept)
- All not explicitly legal gets blocked (default to
deny)
18Sender Authentication
- Problem: anyone can claim to be anyone they want
- Authentication directly addresses fraud and
phishing
- Authentication is not an anti-spam solution in
and of itself, but is essential for
identity-based algorithms
- Already have SMTP AUTH and TLS
- Both are MTA-to-MTA, not end-to-end
- Requires advance arrangements
- Absolutely essential for identity-based
algorithms
- Per-user authentication: PGP, S/MIME
- Historic difficulties with key management (PKI)
- Per-domain authentication may be good enough for
spam
- Assign responsibility to someone who can do
something about it
- Should really be thought of as non-repudiability
19IP-Address Based Authentication
- Looks at domain of sender, gets list of outgoing
mail server IP addresses for that domain
- If actual connection IP address is on the list,
probably good
- Broken by forwarding, MX records, aliasing, etc.
- Specifically, breaks store-and-forward model of
email
- Proposals to fix, some very ugly
- Current schemes store information in DNS TXT
records
- Concerns about resilience and capacity of DNS
- No reason they couldn't be modified to use
another protocol
20SPF
- Sender Permitted From / Sender Policy Framework
- Meng Weng Wong
- Validates MAIL FROM address
- Pros
- Very low cost for senders, fairly low cost for
recipients
- Compact policy language
- Cons
- Not an identity that users normally see; limited
help for phishing
- Hard (at least ugly) to fix forwarding problem
- Example
v=spf1 a mx -all
- Many SPF records already published
- Unclear how many recipients are checking them
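To show how compact the policy language is, here is a rough sketch that splits an SPF record such as the example on this slide into its mechanisms and their results. It handles only the qualifier prefixes (`+`, `-`, `~`, `?`) and does no DNS lookups; modifiers such as `redirect=` are not interpreted, and the function name is illustrative.

```python
QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}

def parse_spf(record: str) -> list:
    """Split an SPF v1 record into (mechanism, result-if-matched) pairs."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        raise ValueError("not an SPF version 1 record")
    parsed = []
    for term in terms[1:]:
        if term[0] in QUALIFIERS:
            qualifier, mechanism = term[0], term[1:]
        else:
            qualifier, mechanism = "+", term      # "+" (pass) is the default
        parsed.append((mechanism, QUALIFIERS[qualifier]))
    return parsed

print(parse_spf("v=spf1 a mx -all"))
# prints [('a', 'pass'), ('mx', 'pass'), ('all', 'fail')]
```

Read left to right: hosts matching the domain's A or MX records pass, and everything else fails.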
21Caller Id For Email
- Microsoft proposal
- Validates message header
- Resent-Sender, Resent-From, Sender, From
- Pros
- Low cost for sender, fairly low cost for
recipients
- Addresses phishing head on
- Relatively easy to solve forwarding problem
- Cons
- Policy language is XML, which may be too verbose
- It's from Microsoft → FUD
- Example
<ep xmlns='http://ms.net/1'> <out>
<m><a></a></m><m><mx/></m> </out> </ep>
22Sender Id (SPF + Caller Id)
- Proposed, in development
- From MARID WG of IETF
- Validates message header or envelope
- Requires SMTP extension
- Uses either SPF or Caller Id syntax (in
discussion)
- Pros
- Addresses phishing
- Cons
- Requires software updates at both ends
- Unclear
- What policy language to use?
- Does it make forwarding even harder?
MAIL FROM:<> SUBMITTER=sender@example.com
23Cryptographic Domain Authentication
- Sign messages using per-domain key
- Distribute public cert using DNS (or other
mechanism)
- Doesn't have aliasing/forwarding problems
- Loses if message is modified
- 8 → 7 bit translations
- NL → CRLF translations
- Blank lines added at end
- Proposals
- Yahoo DomainKeys
- DNS distribution of public RSA keys
- Signs entire message, including header
- Verisign
- Variant of S/MIME, not yet public
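Signature schemes can tolerate some in-transit changes by canonicalizing the message before hashing. The sketch below normalizes line endings and trailing blank lines, two of the modifications listed above, then hashes the result. It is an assumption-laden illustration, not the DomainKeys algorithm: a real scheme would sign the digest with the domain's private key (omitted here because the Python standard library has no RSA signing), and 8-to-7-bit re-encoding is not handled.

```python
import hashlib

def canonicalize_body(body: bytes) -> bytes:
    """Normalize line endings to CRLF and drop blank lines at the end."""
    lines = body.replace(b"\r\n", b"\n").split(b"\n")
    while lines and lines[-1] == b"":
        lines.pop()                               # trailing empty lines are ignored
    return b"".join(line + b"\r\n" for line in lines)

def body_digest(body: bytes) -> str:
    """Digest that a per-domain key would sign (the signing step is not shown)."""
    return hashlib.sha256(canonicalize_body(body)).hexdigest()

sent = b"Hello\nWorld\n"
received = b"Hello\r\nWorld\r\n\r\n\r\n"          # relay changed NL to CRLF, added blanks
print(body_digest(sent) == body_digest(received))  # prints True
```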
24Accreditation
- Third-party vouching
- Generally carrot and stick, e.g., Bonding
- Forward-looking ("I promise to ...")
- Sender pays
- Can have rich information set
- What kind of mail does sender send?
- What opt in/out policy does sender use?
25Reputation
- What has been prior behavior?
- Backward-looking ("to date, this site has ...")
- Another collaborative service, with all the
associated pluses and minuses
- Easy to acquire negative information; hard to
directly acquire positive information
- Can be sender-pays or recipient-pays
- Should be sender-pays: recipient already pays
too much
- RBLs are essentially negative reputation services
26Economic Schemes (1)
- Generally shift costs from recipient to sender
- Very small cost doesn't hurt usual sender
(perhaps 100/day), but does hurt bulk senders
(millions/day)
- Not necessarily cash-based
- See my rant in ACM Queue
- ePostage
- Attach eCash to each message (might be credit)
- Recipient probably gets to decide whether to
collect
- Who gets the money?
- Variant: "bid for attention" schemes
- Research: SHRED (Spam Harassment Reduction
through Economic Disincentives)
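The cost-shifting argument is easy to check with arithmetic. The sketch below assumes a purely hypothetical price of one cent per message and compares a usual sender with a bulk sender; the price and the bulk volume are illustrative numbers, not figures from any proposal.

```python
def daily_cost_cents(messages_per_day: int, cents_per_message: int = 1) -> int:
    """Total daily sending cost in cents at a flat per-message price."""
    return messages_per_day * cents_per_message

for label, volume in [("usual sender", 100), ("bulk sender", 5_000_000)]:
    dollars = daily_cost_cents(volume) / 100
    print(f"{label}: {volume:,} messages/day -> ${dollars:,.2f}/day")
# prints:
# usual sender: 100 messages/day -> $1.00/day
# bulk sender: 5,000,000 messages/day -> $50,000.00/day
```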
27Economic Schemes (2)
- Bonding (essentially accreditation)
- Post a bond with a third party
- Lose the bond if you spam
- Commercial: Habeas
- HashCash
- Similar: pay with CPU/memory/disk bandwidth
- Difficult to compute, easy to verify
- Research: Penny Black (Microsoft)
- Combined schemes
- Pay only on challenge
- Problems
- What about (free) mailing lists?
- What about purchase acknowledgements?
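The "difficult to compute, easy to verify" property of HashCash-style payment can be sketched as a partial hash preimage search: the sender hunts for a counter that gives the hash a required number of leading zero bits, and the recipient checks it with a single hash. The stamp layout `resource:counter` is a simplification assumed here, not the real HashCash stamp format.

```python
import hashlib
from itertools import count

def leading_zero_bits(digest: bytes) -> int:
    """Count zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def mint(resource: str, bits: int) -> str:
    """Sender side: expected work doubles with each extra bit."""
    for counter in count():
        stamp = f"{resource}:{counter}"
        if leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits:
            return stamp

def verify(stamp: str, resource: str, bits: int) -> bool:
    """Recipient side: one hash, regardless of how hard minting was."""
    digest = hashlib.sha1(stamp.encode()).digest()
    return stamp.startswith(resource + ":") and leading_zero_bits(digest) >= bits

stamp = mint("bob@example.com", 12)               # roughly 4,000 hashes on average
print(verify(stamp, "bob@example.com", 12))       # prints True
```

Binding the stamp to the recipient address keeps one piece of work from being reused across a mailing run.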
28Zombies
- Machines controlled by an evil entity without
permission of the owners
- Strongly associated with home machines on cable
modems and DSL
- Estimate: multiple zombie clusters of > 100,000
machines
- Estimate: zombie clusters are the world's largest
supercomputer clusters
- Report: Comcast blocks port 25, Internet spam
traffic dropped 35%
- MAJOR PROBLEM: Zombies break many algorithms
- Computational puzzles
- Perhaps crypto-based if key easily accessible
29Predictions
- Spam will never go away (think cockroaches)
- Authentication will help, but won't solve the
problem by itself
- Spam will be manageable within two to three years
- Legislation will scare away bit players, not
large commercial spammers
- Authentication will help legislation enforcement
30The State of the Spam July 2004
- Eric Allman
- Sendmail, Inc.
31Evaluating Anti-Spam Systems: Terminology
- True/False Positive/Negative
- Generally abbreviated TP, FP, TN, FN
- Great sensitivity to False Positives (FPs) (can
be very expensive)
- Accuracy
- = (TP + TN) / (TP + FP + TN + FN)
- Intuitively, % of messages correctly classified
- Precision
- = TP / (TP + FP)
- Intuitively, % of tagged messages correctly
classified
- Recall
- = TP / (TP + FN)
- Intuitively, % of actual positives detected
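The three measures translate directly into code. In this small sketch, "positive" means "tagged as spam", and the counts in the usage lines are made-up numbers for illustration only.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Share of all messages classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp: int, fp: int) -> float:
    """Share of tagged messages that really were spam."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Share of actual spam that was tagged."""
    return tp / (tp + fn)

# Hypothetical run: 100 spam and 100 legitimate messages.
tp, fp, tn, fn = 90, 1, 99, 10
print(f"accuracy={accuracy(tp, fp, tn, fn):.3f}")  # prints accuracy=0.945
print(f"precision={precision(tp, fp):.3f}")        # prints precision=0.989
print(f"recall={recall(tp, fn):.3f}")              # prints recall=0.900
```

A single false positive barely moves accuracy, which is why precision is the number to watch when false positives are expensive.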
32International Legislation
- USA: CAN-SPAM Act recently signed
- Strongly backed by DMA
- Opt out, overrides states, limited PROA
- Prohibits fraud and harvesting; requires working
opt out
- Have been some prosecutions
- EU: Directive on Privacy and Electronic
Communications
- Article 13(1) (directive covers much more than
spam)
- Opt in (consent based)
- Member states must do their own legislation;
results vary
- Korea: much like USA
- Australia: Spam Act 2003
- Opt in
- Prohibits fraud, harvesting; requires working opt
out