Title: Bayesian Filtering Anti-Phishing Toolbar Benefits
1Bayesian Filtering Anti-Phishing Toolbar Benefits
- P. Likarish, E. Jung, D. Dunbar, T. E. Hansen, and J.-P. Hourcade
- 12/04/07
- presented by EJ Jung
2Phishing
3Why study phishing?
- Identity Theft
- One of the fastest-growing crimes
- 15 million Americans/year, 2.8 billion dollars
Gartner, Inc. 2007 press release.
http://www.gartner.com/it/page.jsp?id=501912,
March 2007 Phishing report. http://apwg.org
4Phishing leads into malware
Phishing report. Trojans and keyloggers.
http://apwg.org
5Phishing and botnet into black market (Franklin et al., 2007)
6... and into national security threat
- FBI director Robert Mueller says:
- Younis Tsouli, and his colleagues stole thousands
of credit card accounts through phishing schemes.
They ran up charges of more than $3 million for
items they thought fellow extremists might need,
from night vision goggles to GPS devices.
- A botnet is the Swiss Army knife of hackers
7Phishing attack
8Anti-Phishing Tools
- Client or server side?
- server side protection is limited
- server-client cooperation
- hash of system
- Client side is more common
- web browser toolbar
- password management
9Early Efforts
- Largely heuristics-based
- Set of rules developed by experts
- Still used by most anti-phishing tools
- Examples
- IE7 phishing filter
- SpoofGuard
10SpoofGuard
- IE6 toolbar
- Developed by Chou, Ledesma, Teraguchi, Boneh,
Mitchell at Stanford
- Heuristics + whitelist
N. Chou, R. Ledesma, Y. Teraguchi, D. Boneh, and
J. C. Mitchell. Client-side defense against
web-based identity theft. In NDSS '04:
Proceedings of the 11th Annual Network and
Distributed System Security Symposium, February
2004.
11Stateless Heuristics
- URL check
- Suspicious URLs: @, IP address, hex encoding
- Image check
- Hashed image database
- Image hashing
- Produces same hash for similar images
- Link check
- Fails if > 1/4 of links fail URL check
- Password check
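The URL check and link check above can be sketched in a few lines. This is only an illustration of the three URL signals named on the slide (@, IP address, hex encoding) and the 1/4 threshold; the function names, the regular expressions, and the choice to look for hex escapes only in the host part are assumptions, not SpoofGuard's actual code.

```python
import re
from urllib.parse import urlsplit

# Hypothetical patterns for two of the slide's URL signals.
_IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")
_HEX_ESCAPE = re.compile(r"%[0-9a-fA-F]{2}")


def url_is_suspicious(url: str) -> bool:
    """Return True if the URL shows any of the three signals from the slide."""
    parts = urlsplit(url)
    netloc = parts.netloc
    host = parts.hostname or ""
    has_at = "@" in netloc                       # e.g. http://bank.com@evil.example/
    is_ip = bool(_IPV4.match(host))              # raw IP address instead of a name
    has_hex = bool(_HEX_ESCAPE.search(netloc))   # assumption: only the host part is checked
    return has_at or is_ip or has_hex


def link_check_fails(links: list[str]) -> bool:
    """Link check: fail if more than 1/4 of the page's links fail the URL check."""
    if not links:
        return False
    bad = sum(url_is_suspicious(link) for link in links)
    return bad / len(links) > 0.25
```

Both checks are stateless: they look only at the current page and need no browsing history.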
12Stateful Heuristics
- Domain check
- Hamming distance to known domains
- Referrals
- From email site?
- May require DNS lookup
- Image-domain association
- Extension of hashed image heuristic
- <image, URL> tuples
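As a rough illustration of the domain check, the sketch below flags a domain that is within a small Hamming distance of a known domain without being identical to one. The function names, the threshold of 2, and the decision to skip domains of different length are assumptions made for the example; the slides do not give these details.

```python
from typing import Iterable, Optional


def hamming_distance(a: str, b: str) -> Optional[int]:
    """Number of differing positions; None if the lengths differ."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def domain_check_fails(domain: str, known_domains: Iterable[str],
                       max_distance: int = 2) -> bool:
    """True if the domain is a near miss of a known domain (hypothetical threshold)."""
    domain = domain.lower()
    known = {k.lower() for k in known_domains}
    if domain in known:
        return False  # exact match with a known domain is not suspicious
    for candidate in known:
        distance = hamming_distance(domain, candidate)
        if distance is not None and distance <= max_distance:
            return True
    return False
```

The check is stateful because `known_domains` has to come from somewhere, for example the user's browsing history.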
13Scoring
TSS = Total Spoof Score
Ex.: P1 = URL check (0 if page passes, 1 if it
fails), w1 = 0.2
Source: N. Chou, R. Ledesma, Y. Teraguchi, D.
Boneh, and J. C. Mitchell. Client-side defense
against web-based identity theft. In NDSS '04:
Proceedings of the 11th Annual Network and
Distributed System Security Symposium, February
2004.
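The scoring idea can be illustrated with a plain weighted sum of the individual check results. This is a simplifying assumption: the slide's formula is not reproduced here, and the published SpoofGuard formula may contain additional terms. Only the URL-check weight of 0.2 comes from the slide; the function name and the dictionary layout are hypothetical.

```python
def total_spoof_score(results: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted sum of check results, where each P_i is 0 (pass) or 1 (fail).

    Assumption: a linear combination only; not necessarily the paper's full formula.
    """
    return sum(weights[name] * outcome for name, outcome in results.items())


# Usage: the "url" weight is from the slide; the other weights are made up.
example_weights = {"url": 0.2, "image": 0.3, "link": 0.1}
example_results = {"url": 1, "image": 0, "link": 1}
score = total_spoof_score(example_results, example_weights)
```

A page would then be flagged when the score exceeds some threshold chosen by the tool or the user.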
14Drawbacks to Heuristics
- Difficult to develop accurate rules
- Large number of false positives and negatives
- Heuristics don't evolve; phishing sites do.
M. Sahami, S. Dumais, D. Heckerman, and E.
Horvitz. A Bayesian approach to filtering junk
e-mail. In AAAI Workshop on Learning for Text
Categorization, July 1998.
Y. Zhang, J. I. Hong, and L. F. Cranor. CANTINA:
a content-based approach to detecting phishing
web sites. In WWW '07: Proceedings of the 16th
international conference on World Wide Web, pages
639-648, New York, NY, USA, 2007. ACM Press.
15Next: Blacklist/Whitelist
- 2004-current
- Largely blacklist-based
- rely on phishing site reports
- still used by most anti-phishing tools
- Examples
- IE7 phishing filter
- Firefox 2 phishing protection (Google
safe-browsing)
- Netcraft Toolbar
Netcraft Ltd. http://toolbar.netcraft.com
16Drawbacks to Blacklist/Whitelist
- Need reliable and timely sources for reports
- Window of vulnerability
- after site launch before being blacklisted
- avg lifetime of a phishing site: 3 days
- avg lifetime after blacklisted: 22 hours
- cost of undoing identity theft: priceless
- adapt classification methods
- CANTINA, B-APT
Y. Zhang, J. I. Hong, and L. F. Cranor. CANTINA:
a content-based approach to detecting phishing
web sites. In WWW '07: Proceedings of the 16th
international conference on World Wide Web, pages
639-648, New York, NY, USA, 2007. ACM Press.
17CANTINA
- Technique
- TF-IDF + Robust Hyperlinks
- Domain name
- Heuristics
- Y. Zhang, J. I. Hong, and L. F. Cranor. CANTINA:
a content-based approach to detecting phishing
web sites. In WWW '07: Proceedings of the 16th
international conference on World Wide Web, pages
639-648, New York, NY, USA, 2007. ACM Press.
18TF-IDF
- Text classification technique
- Information retrieval
- Term Frequency-Inverse Document Frequency
- Importance of a word in a document in a given
corpus
- Document: website
- Corpus: English language
19TF-IDF, contd.
- Source for equations: http://en.wikipedia.org/wiki/Tf-idf
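For reference, the standard definition multiplies a term's frequency in the document (count of the term divided by the number of terms in the document) by the log of the corpus size divided by the number of documents containing the term. The sketch below follows that textbook form; the function name and the guard for terms missing from the corpus are assumptions for the example.

```python
import math
from collections import Counter


def tf_idf(document: list[str], corpus: list[list[str]]) -> dict[str, float]:
    """Score every term of `document` against `corpus` (a list of token lists)."""
    counts = Counter(document)
    total_terms = len(document)
    scores = {}
    for term, count in counts.items():
        tf = count / total_terms
        docs_with_term = sum(1 for doc in corpus if term in doc)
        # Assumption: a term absent from the corpus is treated as appearing once,
        # which avoids division by zero.
        idf = math.log(len(corpus) / max(1, docs_with_term))
        scores[term] = tf * idf
    return scores
```

A word that appears in every document of the corpus gets a score of zero, while a word that is frequent on the page but rare in the corpus scores high.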
20Robust Hyperlinks
- Phelps and Wilensky
- TF-IDF on all words on page
- Lexical signature
- 5 words with highest TF-IDF scores
- Almost uniquely identifies a page among 1,000,000,000 pages
21TF-IDF Hyperlinks in CANTINA
- Calculate lexical signature
- Google search on signature
- If domain name is within top 30 hits, site is
legitimate
- Otherwise, it is phishing
- Results
- 94% true positives, 30% false positives
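The decision procedure on this slide can be sketched as follows. The `search` argument is a hypothetical stand-in for the Google query (any callable that returns result URLs for a query string), and comparing exact host names is an assumption about how "domain name within the top 30 hits" is tested; neither is taken from CANTINA's code.

```python
from typing import Callable
from urllib.parse import urlsplit


def lexical_signature(scores: dict[str, float], size: int = 5) -> list[str]:
    """The `size` terms with the highest TF-IDF scores (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:size]]


def looks_legitimate(page_url: str, scores: dict[str, float],
                     search: Callable[[str], list[str]], top_n: int = 30) -> bool:
    """True if the page's own host appears among the top search hits."""
    query = " ".join(lexical_signature(scores))
    page_host = (urlsplit(page_url).hostname or "").lower()
    hits = search(query)[:top_n]
    return any((urlsplit(hit).hostname or "").lower() == page_host for hit in hits)
```

A phishing page that copies a legitimate site's text produces the same signature, so the search returns the legitimate site's domain rather than the phishing page's own.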
22Improving on TF-IDF
- Add domain name to Google search
- 97%
- 30%
- TF-IDF + Zero results-Means-Phishing + domain
name
- 97% t.p., 10% f.p.
→ 67% t.p.
→ 10% f.p.
23Adding heuristics to CANTINA
- Heuristics from SpoofGuard and other sources
- Trade-off
- Reduces true positive accuracy
- 97% → 89% t.p.
- Reduces false positive rate
- 10% → 1% f.p.
24Drawbacks to CANTINA
- Relies on outside sources for information
- Google
- Requires heuristics to reduce false positives
- Reduces accuracy
- Language-specific
- Different corpus for each foreign language
- Difficulties with East Asian languages
- Unacceptable false positive rate
- Misclassifications undermine user confidence in
the tool
25CANTINA vs. Netcraft
- classification + heuristics vs. blacklist + heuristics
- True positives
- CANTINA 97% (or 89%)
- Netcraft toolbar 97%
- (SpoofGuard 91%)
- False positives
- CANTINA 6% (or 1%)
- Netcraft toolbar 0%
- (SpoofGuard 48%)
26B-APT: Bayesian Anti-Phishing Toolbar
- Firefox browser toolbar
- will extend to other browsers
- goals: detect, communicate, and educate
- Bayesian filtering + whitelist
- similar to spam filtering
- different from spam filtering
- phishing sites mirror legitimate sites
- hard to find training set (inbox vs. blacklist
database)
- comprehensive whitelist
- Innovative UI
- no known effective security indicators for
warning user of phishing sites (Dhamija, 2006;
Wu, 2007)
27Bayesian classification
- Bayes' law on conditional probability
- Pros
- easy to compute
- training and tailoring
- Cons
- assumes independence among words
- Bayesian poisoning
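A minimal naive Bayes sketch shows how the pros and cons above arise. By Bayes' law, P(phish | words) is proportional to P(phish) times the product of P(word | phish) over the words, and that product is where the independence assumption enters. The class name, the labels, and the use of add-one (Laplace) smoothing are assumptions for illustration; the slides do not describe B-APT's actual implementation.

```python
import math
from collections import Counter


class NaiveBayes:
    """Two-class token classifier; both classes must be trained before scoring."""

    def __init__(self) -> None:
        self.token_counts = {"phish": Counter(), "legit": Counter()}
        self.doc_counts = {"phish": 0, "legit": 0}

    def train(self, tokens: list[str], label: str) -> None:
        self.token_counts[label].update(tokens)
        self.doc_counts[label] += 1

    def _log_score(self, tokens: list[str], label: str) -> float:
        counts = self.token_counts[label]
        total = sum(counts.values())
        vocab = len(set(self.token_counts["phish"]) | set(self.token_counts["legit"]))
        # log prior + sum of log likelihoods (words treated as independent)
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for token in tokens:
            # add-one smoothing (an assumption) so unseen tokens do not zero the product
            score += math.log((counts[token] + 1) / (total + vocab))
        return score

    def phishing_probability(self, tokens: list[str]) -> float:
        log_phish = self._log_score(tokens, "phish")
        log_legit = self._log_score(tokens, "legit")
        top = max(log_phish, log_legit)  # subtract the max for numerical stability
        e_phish = math.exp(log_phish - top)
        e_legit = math.exp(log_legit - top)
        return e_phish / (e_phish + e_legit)
```

Training is just counting, which is why the approach is cheap to compute and easy to retrain or tailor. The same property enables Bayesian poisoning: an attacker who pads a page with tokens typical of legitimate sites shifts the product toward the legitimate class.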
28Implementation details
- Training on phishing pages and legitimate pages
- Phishtrack: HTML of phishing pages
- 1200 phishing sites, 160 unique sites
- Alexa: top 500 most popular websites
- same KBs of phishing sites (17k vs 64k tokens)
http://www.dslreports.com/phishtrack
http://www.alexa.com/
29B-APT detecting phishing sites
Anti-phishing tools tested on 60 phishing sites
30B-APT detecting legitimate sites
Anti-phishing tools tested on 60 legitimate sites
31Summary
- Classification heuristics do well
- B-APT has no false negatives, some false positives
- working on communicating false positives
- detect, communicate, and educate
- Use of any toolbar is better than none
- the lowest detection rate was 42% (IE7)
- blacklist-based ones get better as time passes
(Zhang, 2007)
- Beware of malware
- Badware.org with Google