Title: Phishing Webpage Detection
1. Phishing Webpage Detection
- Jau-Yuan Chen
- COMS E6125 WHIM
- March 24, 2009
2. What is Phishing?
- Source: "Phishing Activity Trends Report," APWG, December 2008
- APWG: Anti-Phishing Working Group
- (Definition)
- Phishing is a criminal mechanism employing both social engineering and technical subterfuge to steal consumers' personal identity data and financial account credentials.
- Social-engineering schemes use spoofed e-mails purporting to be from legitimate businesses and agencies to lead consumers to counterfeit websites designed to trick recipients into divulging financial data such as usernames and passwords.
- Technical-subterfuge schemes plant crimeware onto PCs to steal credentials directly, often using systems to intercept consumers' online account usernames and passwords, and to corrupt local navigational infrastructures to misdirect consumers to counterfeit websites (or authentic websites through phisher-controlled proxies used to monitor and intercept consumers' keystrokes).
3. Severity of the Phishing Problem
- The number of crimeware-spreading sites infecting PCs with password-stealing crimeware reached an all-time high of 31,173 in December 2008.
- Unique phishing reports submitted to APWG recorded a yearly high of 34,758 in December 2008.
- In 2007 (a survey by Gartner, Inc.):
  - more than $3.2 billion was lost to phishing attacks in the US
  - 3.6 million adults lost money in phishing attacks
4. WHY PHISHING PAGE DETECTION?
5. eBay?
It's difficult to distinguish these pages!
6. Most Targeted Industry
7. Current Anti-phishing Solutions
- text-based page analysis
- URL analysis
- HTML parsing
- keyword extraction
- however, phishers can easily avoid detection by using non-HTML components, such as
  - images,
  - Flash,
  - ActiveX, etc.
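To illustrate why text-based analysis is easy to evade, here is a minimal sketch of keyword extraction using Python's standard `html.parser`. The keyword list, the `keyword_hits` helper, and both sample pages are hypothetical and only serve the illustration; real text-based detectors are considerably more elaborate.

```python
from html.parser import HTMLParser

# Hypothetical keyword list; a real detector would use a richer vocabulary.
SUSPICIOUS_KEYWORDS = {"password", "verify", "account", "login"}


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.words.extend(data.lower().split())


def keyword_hits(html):
    """Return the number of distinct suspicious keywords in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    cleaned = {w.strip(".,:;!?\"'()") for w in parser.words}
    return len(SUSPICIOUS_KEYWORDS & cleaned)


# Same visual intent, two encodings: plain text vs. a single image.
text_page = "<html><body><h1>Verify your account</h1><p>Enter your password:</p></body></html>"
image_page = '<html><body><img src="login_form.png"></body></html>'

print(keyword_hits(text_page))   # 3
print(keyword_hits(image_page))  # 0
```

The image-only page produces no text nodes at all, so a purely text-based check has nothing to match against even though the rendered page could look identical to the user.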
8. Image-based Anti-phishing Scheme
focus on "what you see", not "how the page is composed"!
J.-Y. Chen and K.-T. Chen, "A Robust Local Feature-based Scheme for Phishing Page Detection and Discrimination," Web 2.0 Trust 2008.
K.-T. Chen, J.-Y. Chen, C.-R. Huang, and C.-S. Chen, "Fighting Phishing with Discriminative Keypoint Features of Webpages," IEEE Internet Computing, to appear.
9. Page Matching
10. Page Scoring
effective grids
11. Page Classification
- naïve Bayesian classifier with 10-fold cross-validation
- training data
  - a pre-stored phishing page set and a legitimate page set
- phishing page set (positive data set)
  - comparisons between phishing pages and their target pages
- legitimate page set (negative data set)
  - comparisons between legitimate pages of different sites
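The classification step can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the talk: it assumes a Gaussian naïve Bayes variant, a single similarity-score feature per page comparison, and synthetic data standing in for the positive and negative comparison sets. All function names are hypothetical.

```python
import math
import random


def fit_gaussian_nb(X, y):
    """Estimate per-class prior, feature means, and feature variances."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9  # floor avoids zero variance
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict(model, x):
    """Return the label with the highest log posterior for feature vector x."""
    def log_posterior(params):
        prior, means, variances = params
        total = math.log(prior)
        for v, m, var in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return total
    return max(model, key=lambda label: log_posterior(model[label]))


def cross_validate(X, y, folds=10, seed=0):
    """Shuffle, split into `folds` parts (not stratified), return mean accuracy."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    accuracies = []
    for k in range(folds):
        test_idx = set(indices[k::folds])
        train_X = [X[i] for i in indices if i not in test_idx]
        train_y = [y[i] for i in indices if i not in test_idx]
        model = fit_gaussian_nb(train_X, train_y)
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / folds


# Synthetic stand-in data (assumption): phishing-vs-target comparisons score
# high, comparisons between unrelated legitimate pages score low.
rng = random.Random(42)
X = [[rng.gauss(0.8, 0.1)] for _ in range(100)] + [[rng.gauss(0.2, 0.1)] for _ in range(100)]
y = [1] * 100 + [0] * 100

accuracy = cross_validate(X, y)
print(f"10-fold accuracy: {accuracy:.3f}")
```

Each fold trains on nine tenths of the comparisons and tests on the remaining tenth, so every comparison is used for testing exactly once.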
12. Performance Evaluation
13. Data description
- phishing pages: 2,058 pages on 74 sites
  - source: http://www.phishtank.com, http://www.antiphishing.org
  - records of top 5 phishing target sites are more than half of our records
- potential target pages: 300 vulnerable pages
  - source: http://www.ciphertrust.com/resources/statistics/
- pre-stored data set
  - positive: 2,058 comparisons
  - negative: 44,000 comparisons
Domain               Number of Records
eBay                 701
PayPal               632
Marshall & Ilsley    138
Charter One          116
Bank of America      51
14. Earth Mover's Distance (EMD) based Scheme
- Fu et al., IEEE Trans. on Dependable and Secure Computing, 2006
  - the first image-based phishing detection approach
- EMD is used to evaluate the distance between two signatures
- Signature (S)
  - the frequency and the centroid of each color used
- Weight (p, q)
  - a linear combination of the Euclidean distance between colors and the distance between their centroids
- Visual similarity degree (VSD)
  - VSD = 1 - EMD^a, where a is the amplifier for the EMD value
- pros: simple and fast
- cons: only suitable for basic phishing cases
  - it tends to fail if phishing pages and the official ones are only partially similar
  - however, phishing pages are usually partially different from their targets!
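The pieces of the EMD-based scheme listed above can be sketched as follows. This is an illustrative outline under stated assumptions, not Fu et al.'s code: the pixel representation, the normalization constants `max_color` and `max_centroid`, and all function names are hypothetical, and the EMD solver itself (a transportation problem over the two signatures) is deliberately left out, so `visual_similarity` takes an already computed EMD value.

```python
import math
from collections import defaultdict


def degrade(color, cdf=32):
    """Quantize each RGB channel into buckets of width `cdf` (color degrading)."""
    return tuple(c // cdf for c in color)


def signature(pixels, cdf=32, max_colors=20):
    """Build a signature: for each degraded color, its centroid and frequency.

    `pixels` is an iterable of (x, y, (r, g, b)) tuples.
    Returns a list of (color, centroid, frequency), most frequent first.
    """
    sums = defaultdict(lambda: [0, 0, 0])  # color -> [count, sum_x, sum_y]
    total = 0
    for x, y, rgb in pixels:
        entry = sums[degrade(rgb, cdf)]
        entry[0] += 1
        entry[1] += x
        entry[2] += y
        total += 1
    sig = [
        (color, (sx / n, sy / n), n / total)
        for color, (n, sx, sy) in sums.items()
    ]
    sig.sort(key=lambda item: item[2], reverse=True)
    return sig[:max_colors]


def ground_distance(a, b, p=0.5, q=0.5, max_color=1.0, max_centroid=1.0):
    """Weighted sum of color distance and centroid distance between two entries.

    The normalization by `max_color` / `max_centroid` is an assumption here.
    """
    color_dist = math.dist(a[0], b[0]) / max_color
    centroid_dist = math.dist(a[1], b[1]) / max_centroid
    return p * color_dist + q * centroid_dist


def visual_similarity(emd, a=0.5):
    """Map an EMD value in [0, 1] to a similarity degree: VSD = 1 - EMD^a."""
    return 1.0 - emd ** a


# Tiny 2x2 "image": three near-white pixels and one black pixel.
pixels = [
    (0, 0, (255, 255, 255)),
    (1, 0, (250, 250, 250)),
    (0, 1, (0, 0, 0)),
    (1, 1, (255, 255, 255)),
]
sig = signature(pixels)
print(sig[0])                   # dominant color entry: degraded white, frequency 0.75
print(visual_similarity(0.25))  # 0.5
```

Because the signature summarizes color usage over the whole page, a page that differs from its target in only part of the layout can shift the global distance noticeably, which is consistent with the limitation noted on this slide.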
15. Parameter Settings
- CCH (Contrast Context Histogram) settings
  - levels to describe salient points (L): 4
  - Euclidean distance between two salient points (Dist): 7 pixels
  - input image size: original webpage resolution (mostly 800 × 600)
  - k-means parameter (k): 4
  - naïve Bayesian classifier
- EMD settings
  - we follow the suggestion in Fu et al.'s previous work
  - input image size: 100 × 100 (Lanczos3 resampling algorithm)
  - color degrading factor (CDF): 32
  - amplifier for the EMD value (a): 0.5
  - the number of colors used for the signature (Ss): 20
  - the weight for the color distance (p): 0.5
  - the weight for the color centroid distance (q): 0.5
  - naïve Bayesian classifier is used instead of per-page threshold
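For reproducing the experiments, the settings above could be captured in a small configuration object. The sketch below is one possible layout; the class and field names are hypothetical, and only the values come from this slide.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CCHSettings:
    levels: int = 4              # L: levels to describe salient points
    point_distance_px: int = 7   # Dist: Euclidean distance between salient points
    kmeans_k: int = 4            # k: k-means parameter
    # Input images are used at the original webpage resolution (mostly 800 x 600).


@dataclass(frozen=True)
class EMDSettings:
    image_size: Tuple[int, int] = (100, 100)  # after Lanczos3 resampling
    color_degrading_factor: int = 32          # CDF
    amplifier: float = 0.5                    # a
    signature_colors: int = 20                # number of colors in the signature
    color_weight: float = 0.5                 # p
    centroid_weight: float = 0.5              # q


cch = CCHSettings()
emd = EMDSettings()
print(cch)
print(emd)
```

Freezing the dataclasses keeps a run's parameters from being modified by accident after the experiment has started.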
16. Top 5 Phishing Target Sites
- AUC
  - CCH: 0.998
  - EMD: 0.956
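For readers unfamiliar with the metric, AUC (area under the ROC curve) can be computed directly from classifier scores as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The sketch below uses that pairwise formulation; the scores and labels are made-up illustration data, not the experimental results.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    Counts the fraction of (positive, negative) pairs in which the positive
    example has the higher score; ties count as one half.
    """
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up scores: 1 = phishing comparison, 0 = legitimate comparison.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
result = auc(scores, labels)
print(result)  # 8 of 9 pairs are ordered correctly, i.e. 8/9
```

An AUC of 1.0 means every phishing comparison is ranked above every legitimate one, and 0.5 corresponds to random ranking.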
17. Impact of Image Size on Computation Time
18. Conclusions
- We proposed an image-based phishing detection technique with local features.
- Our experimental results show that we have
  - a successful phishing recognition rate of over 96%, and
  - less than 0.30 second per phishing identification on average.
- Our experiments show that local features are more suitable than global information for phishing page detection.
19. Thank you!