CANTINA: A Content-Based Approach to Detecting Phishing Web Sites

Slides: 30

Provided by: jason203

Learn more at: http://www.cs.cmu.edu

1
CANTINA: A Content-Based Approach to Detecting Phishing Web Sites
Yue Zhang, University of Pittsburgh
Jason I. Hong and Lorrie F. Cranor, Carnegie Mellon University
2
Phishing email
Subject: eBay Urgent Notification From Billing Department
We regret to inform you that your eBay account could be suspended if you don't update your account information.
3
Phishing is a Plague on the Internet

Estimated 3.5 million people have fallen for phishing
Estimated to cost $1-2.8 billion a year (and growing)
9,255 unique phishing sites reported in June 2006
Easier (and safer) to phish than rob a bank

4
Strategies to Counter Phishing

Make it invisible
Taking down phishing web pages
Filtering out phishing email
Detecting phishing web pages (SpoofGuard, etc.)
Provide better user interfaces
Extended certificate verification
Anti-phishing toolbars (SpoofGuard, eBay, Netcraft, etc.)
Train the users
Embedded training (Kumaraguru et al., CHI 2007)
Games (Sheng et al., SOUPS 2007)

5
Two Ways of Detecting Phishing Pages

Human-verified Blacklists
No false positives, easy to implement, robust to new attacks
But tedious, slow to update, and not comprehensive
Only one toolbar found more than 60% of phishing sites (Egelman et al., NDSS 2007)
Heuristics
Fast to find new phishing sites (zero-day)
But false positives, may be fragile to new attacks
Not much work in this area
Our work contributes to the understanding of heuristics

6
Our Solution: CANTINA

CANTINA uses a simple content-based approach
Examines content of a web page and creates a fingerprint
Sends that fingerprint as a query to a search engine
Sees if the web page in question is in the top search results
If so, then we label it legitimate
Otherwise, we label it phishing
Nice properties
Fast
Scales well
No maintenance by us (done by search engines)
Highly accurate

7
Talk Overview

Problem Statement and Overview
Using Robust Hyperlinks for Fingerprinting
CANTINA Iteration 1
CANTINA Iteration 2
Conclusions

8
How Robust Hyperlinks Work

Developed by Phelps and Wilensky to solve the "404 not found" problem (D-Lib Magazine, 2000)
Add lexical signature to URLs
If link doesn't work, then feed signature to search engine
Ex. http://abc.com/page.html?sig=word1+word2+...+word5
How to generate useful signatures?
Term Frequency / Inverse Document Frequency (TF-IDF)
Their informal evaluation found using top five words as scored by TF-IDF was surprisingly effective
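
The signature idea on this slide can be sketched in Python. This is a minimal illustration, not Phelps and Wilensky's implementation: the function names, the tiny stand-in corpus, the choice to count the page itself as one document when computing IDF, and the `sig` query parameter are all assumptions made for the example.

```python
import math
import re
from collections import Counter
from urllib.parse import urlencode


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page_text, corpus, k=5):
    """Return the k words of page_text with the highest TF-IDF score.

    corpus is a list of other documents (strings) used to estimate
    document frequency. It is a stand-in for a web-scale corpus.
    """
    words = tokenize(page_text)
    tf = Counter(words)
    doc_sets = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(doc_sets) + 1  # assumption: the page counts as one document
    scores = {}
    for word, count in tf.items():
        # The page itself contains the word, so document frequency is >= 1.
        df = 1 + sum(1 for doc in doc_sets if word in doc)
        idf = math.log(n_docs / df)
        scores[word] = (count / len(words)) * idf
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:k]


def signed_url(url, signature):
    """Append the signature to the URL as a hypothetical `sig` parameter."""
    return url + "?" + urlencode({"sig": " ".join(signature)})


# Toy usage with a hypothetical page and corpus.
page = "ebay ebay ebay sign in user account help forgot password ebay user"
corpus = [
    "sign in to your account",
    "help with your password",
    "the account page",
]
signature = lexical_signature(page, corpus)
print(signature)
print(signed_url("http://abc.com/page.html", signature))
```

Words that are frequent on the page but rare elsewhere (here, the brand name) rise to the top, which is what makes the signature useful as a search query when the link itself stops working.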

9
Adapting TF-IDF for Anti-Phishing

Can same basic approach be used for anti-phishing?
Scammers often directly copy legitimate web pages or include keywords like name of legitimate organization

Fake
10
Adapting TF-IDF for Anti-Phishing

Can same basic approach be used for anti-phishing?
Scammers often directly copy legitimate web pages or include keywords like name of legitimate organization

Real
11
Adapting TF-IDF for Anti-Phishing

Can same basic approach be used for anti-phishing?
Scammers often directly copy legitimate web pages or include keywords like name of legitimate organization
With Google, phishing site should have low PageRank
APWG states that phishing sites stay alive an average of 4.5 days
Few sites link to phishing sites
Hence, phishing sites unlikely to be in top search results
Hypothesis
CANTINA will be able to discriminate between legitimate and phishing sites quite well

12
How CANTINA Works (Iteration 1)

Given a web page, calculate TF-IDF score for each word in that page
Take five words with highest TF-IDF weights
Feed these five words into a search engine (Google)
If domain name of current web page is in top N search results, we consider it legitimate
N = 30 worked well
No improvement by increasing N
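
The decision step above can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' code: `search` is a hypothetical callable standing in for a search engine (it takes a query string and returns an ordered list of result URLs), and "domain name" is simplified to an exact host-name comparison after dropping a leading `www.`.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Return the lowercased host name of a URL, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def looks_legitimate(page_url, signature, search, n=30):
    """Label a page legitimate if its domain is among the top n results.

    signature: the five highest-weighted TF-IDF words for the page.
    search: hypothetical callable, query string -> ordered list of URLs.
    n: how many top results to inspect (the slides report N = 30).
    """
    results = search(" ".join(signature))[:n]
    page_domain = domain_of(page_url)
    # An empty host (malformed URL) is never treated as a match.
    return bool(page_domain) and any(
        domain_of(result) == page_domain for result in results
    )


# Toy usage with a canned result list in place of a real search engine.
def fake_search(query):
    return ["http://www.ebay.com/", "https://signin.ebay.com/help"]


words = ["ebay", "user", "sign", "help", "forgot"]
real = looks_legitimate("https://signin.ebay.com/ws/eBayISAPI.dll", words, fake_search)
fake = looks_legitimate("http://ebay.example.net/login", words, fake_search)
print(real, fake)
```

Passing the search function in as a parameter keeps the sketch independent of any particular search API and makes the check easy to exercise with canned results.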

13
Fake
eBay, user, sign, help, forgot
14
Real
eBay, user, sign, help, forgot
15
(No Transcript)
16
(No Transcript)
17
Evaluating Effectiveness of CANTINA