Title: AntiPhish Lessons Learnt
1. AntiPhish Lessons Learnt
- André Bergholz
- Fraunhofer IAIS, St. Augustin
- Workshop on CyberSecurity and Intelligence Informatics (CSI-KDD), June 28th, 2009
2. Phishing
- E-mail fraud
- Send an official-looking email
- Include a web link or form
- Ask for confidential information, e.g., password, account details
- Attacker uses the information to withdraw money, enter computer systems, etc.
3. Phishing Target Sites
- Target customers of banks and online payment services
- Obtain sensitive data from U.S. taxpayers via fake IRS emails
- Identity theft on social network sites, e.g., myspace.com
- Recently more non-financial brands were attacked, including social networking, VoIP, and numerous large web-based email providers
http://www.antiphishing.org/
4. Phishing Techniques
- Upward trend in the number of phishing emails sent
- Massive increase in phishing sites over the past years
- Increasing sophistication
- Link manipulation, URL spelling
- Website address manipulation
- Evolution of phishing methods from shotgun-style email
- Image phishing
- Spear phishing (targeted)
- Voice-over-IP phishing
- Whaling: high-profile people
5. Phishing Damage
- Gartner (The War on Phishing Is Far From Over, 2009): 5 million US consumers affected between 09/2007 and 2008 (39.8% increase); average loss per consumer $351 (60% decrease); total loss 1.8 billion dollars
- Top three most attacked countries: USA, UK, Italy (RSA Online Fraud Report, 2009)
- 90% of internet users are fooled by good phishing websites (Dhamija et al., SIGCHI 2006)
- For the individual phisher a low-skill, low-income business (Herley and Florencio, New Security Paradigms Workshop, 2008)
6. Approaches against Phishing
- Network- and encryption-based countermeasures: email authentication, two-factor authentication, mobile TANs, etc.
- Blacklisting and whitelisting: lists of phishing sites and legitimate sites
- Content-based filtering for websites and emails
- Typical formulations urging the user to enter confidential information
- Design elements, trademarks, and logos of known brands (only relatively few brands are attacked)
- Spoofed sender addresses and URLs
- Invisible content inserted to fool automatic filtering approaches
- Images containing the message text
7. EU Project AntiPhish
- Period: 01/2006 to 06/2009
- Develop content-based phishing filters
- Use realistic email corpora
- Deploy in realistic workflows
- Trainable and adaptive filters → adapt to new phishing attacks → anticipate attacks
8. Agenda
- Email Classification based on Advanced Text Mining
- Hidden Salting and Anticipating Evasion
- Real-Life AntiPhish Deployment
- Conclusions
9. Phishing Filtering as a Classification Problem
- Task: automatically classify emails based on content
- Use email features relevant for detecting phishing
- Training data: emails labeled with the classes ham, spam, phishing
- Train a classifier
- Apply it to new emails
10. Message Preprocessing
Standardized email data file (flat representation) → structured representation including embedded images and attachments
11. Basic Features
- Can be derived directly from the email itself, i.e., do not require information about specific websites
- Structural features (4): number of body parts (total, discrete, composite, alternative)
- Link features (8): number of links (total, internal, external, with IP numbers, deceptive, image), number of dots, action-word links
- Element features (4): HTML, scripts, JavaScript, forms
- Spam filter features (2): SpamAssassin (untrained) score and classification
- Word list features (9): indicator words, e.g., account, update, confirm, verify, secur, notif, log, click, inconvenien
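A few of the link and element features above can be sketched as follows; this is an illustrative helper using regular expressions, not the AntiPhish implementation, and all names are made up for the example:

```python
import re

def basic_link_features(html_body: str) -> dict:
    """Hypothetical extraction of a handful of the basic features:
    link counts, links with raw IP hosts, dot counts, form presence."""
    links = re.findall(r'href="([^"]+)"', html_body)
    return {
        "num_links": len(links),
        # Links whose host is a raw IP number are a classic phishing sign.
        "num_ip_links": sum(
            1 for u in links
            if re.match(r'https?://\d{1,3}(?:\.\d{1,3}){3}', u)),
        "max_dots": max((u.count(".") for u in links), default=0),
        "has_form": "<form" in html_body.lower(),
    }

feats = basic_link_features(
    '<a href="http://192.168.0.1/login">click</a>'
    '<a href="http://www.example.com">info</a><form>')
```

In a real filter these counts would become one entry each in the feature vector handed to the classifier.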
12. Dynamic Markov Chains
- Operate on the bit representation of the natural-language text of the email
- Model a bit sequence as a stationary and ergodic Markov source with limited memory
  0101001010010010111010100101101001010100101010011101001010101010101
- Incrementally build such an automaton / Markov chain to model the training sequences
- Train one DMC for each of the classes (i.e., ham, spam, phishing); for a new email, look which model fits best
- Has been successfully applied to spam classification (Bratko et al., JMLR 2006)
13. Dynamic Markov Chains: Details
- States: two probabilities representing the likelihood that the source emits 1 or 0 as the next symbol
- Prediction: move through the automaton, add up likelihoods
- Training (incremental): states are cloned when reached via a frequently used transition
- Model size reduction: use training examples that the model cannot already classify well enough (after some initial training; see also uncertainty sampling in active learning)
- Features: expected cross entropies of a message for either model (ham and phishing), Boolean membership indicators
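The "one model per class, pick the best fit" idea can be sketched with a simplified fixed-order bit-level Markov model: Laplace-smoothed context counts stand in for DMC's incremental state cloning, and the per-bit cross entropy plays the role of the feature described above. All names and training strings are illustrative:

```python
import math
from collections import defaultdict

def to_bits(text: str):
    """Bit representation of the text, as on slide 12."""
    return [int(b) for ch in text.encode() for b in f"{ch:08b}"]

class BitMarkov:
    """Fixed-order bit Markov model: a simplified stand-in for a DMC."""
    def __init__(self, order=8):
        self.order = order
        self.counts = defaultdict(lambda: [1, 1])  # Laplace smoothing

    def train(self, text):
        bits = to_bits(text)
        for i in range(self.order, len(bits)):
            self.counts[tuple(bits[i - self.order:i])][bits[i]] += 1

    def cross_entropy(self, text):
        """Average bits needed to encode the text under this model;
        the class whose model yields the lowest value 'fits best'."""
        bits, h, n = to_bits(text), 0.0, 0
        for i in range(self.order, len(bits)):
            c = self.counts[tuple(bits[i - self.order:i])]
            h -= math.log2(c[bits[i]] / (c[0] + c[1]))
            n += 1
        return h / max(n, 1)

ham = BitMarkov(); ham.train("meeting agenda attached, see you tomorrow")
phish = BitMarkov(); phish.train("verify your account password immediately")
msg = "please verify your password"
label = "phish" if phish.cross_entropy(msg) < ham.cross_entropy(msg) else "ham"
```

A real DMC grows its automaton by cloning states along frequent transitions instead of fixing the context length in advance.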
14. Latent Topic Models
- Analyze the co-occurrence of words
- Similar to word clustering: specify the number of topics in advance
- Common methods: LDA, PLSA
- Probabilistic latent semantic analysis: models the probability of each co-occurrence as a mixture of conditionally independent multinomial distributions
- Latent Dirichlet allocation: generative Bayesian version with a Dirichlet prior
- Document: mixture of various topics
15. Latent Topic Models: Class-Specific
- Class-Topic Model (CLTOM): extension of LDA that incorporates class information
- LDA: uniform per-document topic Dirichlet prior α, uniform per-topic word Dirichlet prior β
- CLTOM: class-specific per-document topic Dirichlet prior α_c
- Training using EM / mean-field approximation
- Features: probabilities for each topic
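Using per-document topic probabilities as features can be sketched with plain LDA from scikit-learn (not the class-specific CLTOM variant); the toy corpus below is invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-in for the email collection; the number of topics is
# fixed in advance, as the slide notes.
docs = [
    "account password verify bank login",
    "bank account update confirm password",
    "meeting schedule agenda project deadline",
    "project meeting notes deadline agenda",
]
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
# Each row is a document's topic mixture; these probabilities
# become features for the downstream classifier.
theta = lda.fit_transform(X)
```

CLTOM differs in that the per-document topic prior depends on the class label, so topics specialize toward ham or phishing vocabulary.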
16. Latent Topic Model Topics
[Figure: words of each topic sorted by probability, annotated with their relevance for phishing]
17. Feature Processing and Selection
- Feature processing
  - Scaling: guarantees that all features have values within the same range
  - Normalization: sets the length of the feature vectors to one, which is adequate for inner-product-based classifiers
- Feature selection
  - Goal: select a subset of relevant features
  - Abstract search in a state space (Kohavi and John, AI Journal 1997)
  - Operates on an independent validation set
  - Best-first search strategy: expands the current subset by the node with the highest estimated performance, stores additional nodes to overcome local maxima
  - Compound operators: combine the set of best-performing children
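The two processing steps above can be sketched in a few lines of stdlib Python (min-max scaling followed by unit-length normalization); the function name and data are illustrative:

```python
import math

def scale_and_normalize(rows):
    """Min-max scale each feature to [0, 1], then normalize each
    feature vector to unit length, as suited to inner-product-based
    classifiers (slide 17)."""
    lo = [min(col) for col in zip(*rows)]
    hi = [max(col) for col in zip(*rows)]
    scaled = [[(v - l) / (h - l) if h > l else 0.0
               for v, l, h in zip(row, lo, hi)] for row in rows]
    out = []
    for row in scaled:
        n = math.sqrt(sum(v * v for v in row)) or 1.0  # avoid /0
        out.append([v / n for v in row])
    return out

vecs = scale_and_normalize([[1, 10, 0], [3, 20, 5], [2, 15, 10]])
```

Note the order matters: scaling first keeps any single large-valued feature from dominating the direction of the normalized vector.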
18. Evaluation Method and Test Corpus
- Standard method: 10-fold cross-validation
- Criteria: precision, recall, F-measure, false positive rate, false negative rate; accuracy for comparison with related work
- Note: errors are not of equal importance
- Test corpus: assembled by Fette et al., WWW 2007
- Ham emails: SpamAssassin corpus
- Phishing emails: collected by Nazario
- Total size: 7808 emails, 6951 ham (89%) and 857 phishing (11%)
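The criteria listed above follow directly from a confusion matrix; this small helper (hypothetical, with phishing as the positive class) makes the definitions concrete:

```python
def classification_metrics(tp, fp, fn, tn):
    """Slide 18 criteria from a confusion matrix.
    tp: phishing caught, fp: ham wrongly flagged,
    fn: phishing missed, tn: ham passed through."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)   # lost ham emails
    fnr = fn / (fn + tp)   # missed phishing emails
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, fpr, fnr, accuracy

# Invented counts for illustration only.
p, r, f1, fpr, fnr, acc = classification_metrics(80, 10, 20, 890)
```

The FPR/FNR split matters because, as the slide notes, the two error types are not equally important: a lost ham email is usually worse than a missed phishing email.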
19. Overall Result
[Chart: missed phishing emails (FNR) vs. lost ham emails (FPR)]
- FPR reduced by 92%, FNR by 64%
- Statistically significant difference from Fette et al. '07 at less than 1% error probability
- Feature selection: better results with fewer features and less training data (20% reserved for validation)
20. Agenda
- Email Classification based on Advanced Text Mining
- Hidden Salting and Anticipating Evasion
- Real-Life AntiPhish Deployment
- Conclusions
21. Salting
- Salting: intentional addition or distortion of content to evade automatic filtering
- Can be applied to any medium (e.g., text, images, audio) and to any content genre (e.g., emails, web pages, MMS messages)
- Visible salting: additional text, images containing random pixels, etc.
- Hidden salting: not perceivable by the user (e.g., text in invisible color, text behind objects, reading-order manipulation)
22. [Rendering pipeline diagram: email source text → internal representation → drawing canvas → end user]

HTML source:
<html> <head> </head> <body> <h1>A story</h1> <p> Once there was ... </p> </body> </html>

Internal representation (DOM tree nodes): <html>, <head>, <body>, <h1>, <p>, <p>, <em>, <a>

Rendered text as seen by the end user:
"A story. Once there was a noble prince. He lived in a fancy castle. Read more ..."
23. Hidden Salting Simulation
- We tap into the rendering process to detect hidden content, i.e., manifestations of salting
- Intercept requests for drawing text primitives
- Build an internal representation of the characters, i.e., a list of attributed glyphs in compositional order
- Test for glyph visibility:
- Clipping: the glyph is drawn within the physical bounds of the drawing clip
- Concealment: the glyph is not concealed by other glyphs or shapes
- Font color: the glyph's fill color contrasts well with the background color
- Glyph size: the glyph's size and shape are sufficiently large
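Two of the visibility tests above (font color and glyph size) can be sketched as simple predicates; the luminance formula, thresholds, and function names are assumptions made for this example, not the project's actual values:

```python
def luminance(rgb):
    """Standard perceptual luminance weighting of an (r, g, b) triple."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

def glyph_visible(fill_rgb, bg_rgb, size_pt, min_contrast=32, min_size=4):
    """A glyph counts as visible only if its fill color contrasts
    with the background AND it is sufficiently large (slide 23)."""
    contrast_ok = abs(luminance(fill_rgb) - luminance(bg_rgb)) >= min_contrast
    return contrast_ok and size_pt >= min_size

# White text on a white background: a classic hidden-salting trick.
hidden = not glyph_visible((255, 255, 255), (255, 255, 255), 10)
```

The real system applies all four tests (clipping and concealment too) per glyph before a character is admitted to the simulated perceived text.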
24. Hidden Salting Simulation (cont.)
- We feed the intercepted, visible text into a cognitive hidden-salting simulation model, which returns the simulated perceived text
- Reading order: detected based on a layout characteristic, where we expect that glyphs of parallel lines are aligned
- Compliance of the text with the language-specific distributions of character n-grams, common words, and word lengths
- For details see De Beer and Moens, Tech. Report KU Leuven 2007
25. Evasion Detection
- Cat-and-mouse game: spammers are developing tricks, filter developers are adapting their filters
- So far: hidden salting simulation model
- Closing the loop: identifying email messages that are likely to make the hidden salting simulation system fail
- Method: compare the simulated perceived text as generated by our hidden salting simulation system with the message text as obtained by applying OCR to the rendered email message
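The comparison step can be sketched with a crude stand-in for the project's robust text distance measures: lowercase both texts, strip non-letters (OCR output is noisy), and take a sequence similarity; names and sample strings are illustrative:

```python
import difflib
import re

def robust_similarity(simulated: str, ocr: str) -> float:
    """Similarity in [0, 1] between the simulated perceived text and
    the OCR text of the rendered email (slide 25). A low score flags
    a message that likely evaded the salting simulation."""
    def clean(s):
        return re.sub(r'[^a-z ]', '', s.lower())
    return difflib.SequenceMatcher(None, clean(simulated),
                                   clean(ocr)).ratio()

sim_text = "Your home refinance loan is approved"
ocr_text = "INNOCENT TEXT TO TRICK FILTER Your home refinance loan is approved"
score = robust_similarity(sim_text, ocr_text)
```

In the real pipeline this score would feed the one-class classifier described on slide 28 rather than being thresholded directly.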
26. Evasion Detection Approach
27. Example

HTML source:
<html> <body>
<font color="ffffff">INNOCENT TEXT TO TRICK FILTER</font>
<p>Your home refinance loan is approved!<br></p><br>
<p>To get your approved amount <a href="http://www.mortgagepower3.com/">go here</a>.</p> <br><br><br>
<p>To be excluded from further notices <a href="http://www.mortgagepower3.com/remove.html">go here</a>.</p>
</body>
<font color="ffffff">1gate5297gdqK6-498jyxl3033RafD3-195RTcz6485obQU9-615LOLg9l49</font>
</html>

Hidden salting simulation (simulated perceived text):
INNOCENT TEXT TO TRICK FILTER Your home refinance loan is approved! To get your approved amount go here. To be excluded from further notices go here.

OCR text (email on screen):
Your _ome refinance loan is approve_! To get your approve_ amount _o_o _ere. To De exclu_e_ from furt_er notices __o _ere.

→ Detect difference
28. Evaluation Method
- Method: simulate the detection of a new salting trick by disabling the detection of one of the known tricks
- Classifier: one-class SVM
- Training set (one class): class of "normal" emails, i.e., emails that contain no or only known salting tricks
- Test set: emails both with and without the disabled ("new") salting trick
- Features: robust text distance measures
- The classifier marks outliers, i.e., emails that are not in the one class, which indicates that they may contain a previously unseen salting trick
- The classifier produces a real-valued output; we automatically compute the cutting threshold by reapplying the classifier to the training set
- OCR engines: gocr, ocrad
29. Test Data
- 6951 ham, 2154 spam, and 4559 phishing messages from the SpamAssassin and Nazario corpora
- Considered tricks: font color, font size
- Training set: 800 messages without the trick
- Test set: 100 messages with / 300 messages without the trick
30. Overall Result
31. Agenda
- Email Classification based on Advanced Text Mining
- Hidden Salting and Anticipating Evasion
- Real-Life AntiPhish Deployment
- Conclusions
32. Filtering a Real-Life Email Stream: Challenges
- Fixed scenario with fixed parameters
- Data
  - From a present real-life stream
  - Mostly English and Italian
  - (Almost) unskewed
- All data is unlabeled; not easy to eliminate spam
- Very strict privacy regulations
- Experiments: almost online
33. General Deployment Approach
- Start: initial AntiPhish model M0
- For every day t ∈ {1, ..., n}:
  - Capture a set of emails S_t, sent in real time through spam filters
  - Select a test subset T_t ⊂ S_t for evaluation of the current AntiPhish model M_{t-1}
  - Select a subset A_t ⊂ S_t of emails that are difficult to classify, to be used for active learning
  - Obtain labels for the sets T_t and A_t
  - Evaluate the current model M_{t-1} on the set T_t
  - Add the set A_t to the training set, train the new model M_t
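The daily loop above can be sketched as a skeleton in which every callable is a hypothetical stand-in for the real component:

```python
def deploy(model, stream, days, select_test, select_active,
           get_labels, evaluate, retrain):
    """Daily deployment loop from slide 33.
    stream(t)            -> emails S_t captured on day t
    select_test(S)       -> test subset T_t
    select_active(S, m)  -> hard-to-classify subset A_t
    get_labels(emails)   -> labels for T_t and A_t
    evaluate(m, T, lab)  -> evaluation of M_{t-1} on T_t
    retrain(m, A, lab)   -> new model M_t"""
    results = []
    for t in range(1, days + 1):
        S = stream(t)
        T = select_test(S)
        A = select_active(S, model)
        labels = get_labels(T + A)
        results.append(evaluate(model, T, labels))
        model = retrain(model, A, labels)
    return model, results
```

The key property is that evaluation always uses the previous day's model M_{t-1} before A_t is folded into the training set.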
34. Details
- The AntiPhish system is evaluated on arbitrarily collected emails
- Deployment period: n = 20 days
- Used features: unigram, DMC, semantic topics with k = 25 topics, link, and lexical features
- Every day a total of |T_t ∪ A_t| = 750 emails are selected
- An email is classified as non-ham if and only if it is considered non-ham with a probability of at least 95%
35. Stratified Evaluation
- T_t: stratified sample of its underlying base set S_t
- Idea: better represent interesting emails
- Two buckets: emails that are difficult or easy to classify
- Basic procedure: oversample the difficult emails, but give them a lower weight in the evaluation
- More specifically: let S_t = S_t(u) ∪ S_t(c); we want to sample k1 and k2 emails, respectively
- We use a probability of p = 95% (for non-ham) as the certainty threshold
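The oversample-then-reweight idea can be sketched as follows; weighting each sampled email by its bucket's true share of the stream divided by the sample size keeps the weighted evaluation unbiased. This is a sketch of the principle, not the exact AntiPhish procedure:

```python
import random

def stratified_sample(uncertain, certain, k1, k2):
    """Sample k1 difficult (uncertain) and k2 easy (certain) emails
    (slide 35). Returns (email, weight) pairs whose weights sum to 1,
    so difficult emails are oversampled but down-weighted."""
    n = len(uncertain) + len(certain)
    w_u = len(uncertain) / n / k1   # weight per sampled uncertain email
    w_c = len(certain) / n / k2    # weight per sampled certain email
    return ([(e, w_u) for e in random.sample(uncertain, k1)]
            + [(e, w_c) for e in random.sample(certain, k2)])

random.seed(0)
# 50 difficult vs. 950 easy emails; sample 25 from each bucket.
pairs = stratified_sample(list(range(50)), list(range(50, 1000)), 25, 25)
```

With these weights, each difficult email counts for 0.05/25 of the estimate even though difficult emails make up half the sample.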
36. Active Learning
[Diagram: email stream + previous training set → current model → uncertain / certain emails → random oversampling / undersampling → new training set → training → new model]
- Set of additional training emails per day A_t, |A_t| = 500
- 400 top-ranked emails from S_t having the lowest classification confidence
- ... plus 100 emails randomly selected from the rest of S_t
- Minimization of duplicates among the 400 uncertain emails: ignore duplicates
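The selection rule above (least-confident emails plus a random remainder, skipping duplicates) can be sketched at a reduced scale; the function name, toy emails, and confidence scores are invented for the example:

```python
import random

def select_active(emails, confidence, n_uncertain=4, n_random=1):
    """Slide 36 selection, scaled down from 400+100: take the
    n_uncertain least-confident distinct emails, then add n_random
    emails drawn from the rest of the stream."""
    ranked = sorted(emails, key=confidence)  # lowest confidence first
    uncertain, seen = [], set()
    for e in ranked:
        if e not in seen:            # ignore duplicate messages
            uncertain.append(e)
            seen.add(e)
        if len(uncertain) == n_uncertain:
            break
    rest = [e for e in emails if e not in seen]
    return uncertain + random.sample(rest, min(n_random, len(rest)))

random.seed(1)
batch = select_active(
    ["a", "b", "b", "c", "d", "e"],
    {"a": 0.2, "b": 0.1, "c": 0.9, "d": 0.8, "e": 0.7}.get)
```

Mixing in the random tail keeps the training set from drifting toward only borderline cases, which is why the real system adds 100 random emails to the 400 uncertain ones.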
37. Initial Dataset
- Initial dataset: six days of 750 messages each
- Total: 4489 messages
- Ham: 1514 (34%)
- Phishing: 1342 (30%)
- Spam: 1633 (36%)
- Non-ham: 2975 (66%)
- Time period for the experiment: the subsequent 20 days
38. Additional Training Data Through Active Learning
39. Test Data and Evaluation
- 250 messages per day
- k1 = k2 = 125 difficult and easy messages
- Sometimes fewer, because not enough difficult emails were found
- Evaluation
  - False positive rate: proportion of lost ham emails among all ham emails
  - False negative rate: proportion of missed non-ham emails among all non-ham emails
40. Test Data
41. Baseline Result
[Chart: ham classified as non-ham (FPR); non-ham classified as ham (FNR)]
- Average FNR: 7.09%
42. Results for Selected Thresholds
[Chart: threshold in % on the predicted probability of non-ham]
43. Effect of Active Learning
[Charts: ham classified as non-ham; non-ham classified as ham]
- Three different fixed models:
- Initial model M0
- Model after five days of active learning: M5
- Model after ten days of active learning: M10
44. Effect of Active Learning
45. Spam Filter Vote as a Feature
46. Phishing vs. Ham Classification
- Phishing vs. ham (instead of non-ham vs. ham)
- FPR 0%, FNR 7.62%
47. Identifying Potential Phishing in Spam
- Second real-life application
- Anti-spam operations use spam traps to gather the latest spam samples so that these can be better defended against
- The ability to separate out the phishing leads to a quicker defence against such fraudulent activity
[Diagram: honeypot network → spam + phishing → phishing classifier → regular spam / phishing → updated signatures → fast update of spam filter]
48. Related Laboratory Experiment
- Labeled data: phishing and regular spam from a probe network
- Training: 53 phishing vs. 1060 regular spam per week
- Test: 75 phishing vs. 1443 regular spam per week (on average)
- Duration: June to November 2008 (26 weeks)
- System parameters
  - Features: DMC, semantic topics with 10 topics, unigram, word list, DMC-link
  - Threshold: neutral (50%)
- Evaluation: sliding-window strategy
  - Each week is filtered by a classifier trained on the previous N = 4 weeks
- Result
  - FPR (spam classified as phishing): 0.18%
  - FNR (phishing classified as spam): 4.89%
49. Sliding Window, Training on N = 4 Weeks
[Chart: phishing classified as spam (FNR); spam classified as phishing (FPR)]
50. Agenda
- Email Classification based on Advanced Text Mining
- Hidden Salting and Anticipating Evasion
- Real-Life AntiPhish Deployment
- Conclusions
51. Conclusions: Lessons Learnt
- Phishing: a multi-billion-dollar activity
- AntiPhish: phishing prevention through content-based email filtering
- Advanced text-mining features boost performance: dynamic Markov chains, latent topic models
- Most of these techniques are language-independent
- Anticipatory learning: detecting new filter evasion techniques requires high-speed, high-quality OCR
- Real-life deployment
  - Active learning keeps filters up to date
  - Combination with spam filters improves performance through incorporation of current blacklist information
  - Identifying phishing in a honeypot network permits prioritization in spam-filter updating