Title: AntiPhish Lessons Learnt
1. AntiPhish Lessons Learnt
- André Bergholz
- Fraunhofer IAIS, St. Augustin
- Workshop on CyberSecurity and Intelligence Informatics (CSI-KDD), June 28th, 2009
2. Phishing
- E-mail fraud
- Send an official-looking email
- Include a web link or form
- Ask for confidential information, e.g., password, account details
- Attacker uses the information to withdraw money, enter computer systems, etc.
3. Phishing Target Sites
- Target customers of banks and online payment services
- Obtain sensitive data from U.S. taxpayers via fake IRS emails
- Identity theft on social network sites, e.g., myspace.com
- Recently more non-financial brands were attacked, including social networking, VoIP, and numerous large web-based email providers
http://www.antiphishing.org/
4. Phishing Techniques
- Upward trend in the number of phishing emails sent
- Massive increase in phishing sites over the past years
- Increasing sophistication
- Link manipulation, URL spelling
- Website address manipulation
- Evolution of phishing methods from shotgun-style email
- Image phishing
- Spear phishing (targeted)
- Voice-over-IP phishing
- Whaling: high-profile people
5. Phishing Damage
- Gartner (The War on Phishing Is Far From Over, 2009): 5 million US consumers affected between 09/2007 and 2008 (39.8% increase); average loss per consumer $351 (60% decrease); total loss 1.8 billion dollars
- Top three most attacked countries: USA, UK, Italy (RSA Online Fraud Report, 2009)
- 90% of internet users are fooled by good phishing websites (Dhamija et al., SIGCHI 2006)
- For the individual phisher a low-skill, low-income business (Herley and Florencio, New Security Paradigms Workshop, 2008)
6. Approaches against Phishing
- Network- and encryption-based countermeasures: email authentication, two-factor authentication, mobile TANs, etc.
- Blacklisting and whitelisting: lists of phishing sites and legitimate sites
- Content-based filtering for websites and emails
- Typical formulations urging the user to enter confidential information
- Design elements, trademarks, and logos of known brands (only relatively few brands are attacked)
- Spoofed sender addresses and URLs
- Invisible content inserted to fool automatic filtering approaches
- Images containing the message text
7. EU Project AntiPhish
- Period: 01/2006 to 06/2009
- Develop content-based phishing filters
- Use realistic email corpora
- Deploy in realistic workflows
- Trainable and adaptive filters → adapt to new phishing attacks → anticipate attacks
8. Agenda
- Email Classification based on Advanced Text Mining
- Hidden Salting and Anticipating Evasion
- Real-Life AntiPhish Deployment
- Conclusions
9. Phishing Filtering as a Classification Problem
- Task: automatically classify emails based on content
- Use email features relevant for detecting phishing
- Training data: emails labeled with the classes ham, spam, phishing
- Train a classifier
- Apply it to new emails
10. Message Preprocessing
Standardized email data file (flat representation) → structured representation including embedded images and attachments
11. Basic Features
- Can be derived directly from the email itself, i.e., do not require information about specific websites
- Structural features (4): number of body parts (total, discrete, composite, alternative)
- Link features (8): number of links (total, internal, external, with IP numbers, deceptive, image), number of dots, action-word links
- Element features (4): HTML, scripts, JavaScript, forms
- Spam filter features (2): SpamAssassin (untrained) score and classification
- Word list features (9): indicator words, e.g., account, update, confirm, verify, secur, notif, log, click, inconvenien
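A few of the link and element features above can be sketched as follows; this is an illustrative helper using regular expressions, not the AntiPhish implementation, and all names are made up for the example:

```python
import re

def basic_link_features(html_body: str) -> dict:
    """Hypothetical extraction of a handful of the basic features:
    link counts, links with raw IP hosts, dot counts, form presence."""
    links = re.findall(r'href="([^"]+)"', html_body)
    return {
        "num_links": len(links),
        # Links whose host is a raw IP number are a classic phishing sign.
        "num_ip_links": sum(
            1 for u in links
            if re.match(r'https?://\d{1,3}(?:\.\d{1,3}){3}', u)),
        "max_dots": max((u.count(".") for u in links), default=0),
        "has_form": "<form" in html_body.lower(),
    }

feats = basic_link_features(
    '<a href="http://192.168.0.1/login">click</a>'
    '<a href="http://www.example.com">info</a><form>')
```

In a real filter these counts would become one entry each in the feature vector handed to the classifier.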
12. Dynamic Markov Chains
- Operate on the bit representation of the natural-language text of the email
- Model a bit sequence as a stationary and ergodic Markov source with limited memory
  0101001010010010111010100101101001010100101010011101001010101010101
- Incrementally build such an automaton / Markov chain to model the training sequences
- Train one DMC for each of the classes (i.e., ham, spam, phishing); for a new email, look which model fits best
- Has been successfully applied to spam classification (Bratko et al., JMLR 2006)
13. Dynamic Markov Chains: Details
- States: two probabilities representing the likelihood that the source emits 1 or 0 as the next symbol
- Prediction: move through the automaton, add up likelihoods
- Training (incremental): states are cloned when reached via a frequently used transition
- Model size reduction: use training examples that the model cannot already classify well enough (after some initial training; see also uncertainty sampling in active learning)
- Features: expected cross entropies of a message for either model (ham and phishing), Boolean membership indicators
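The "one model per class, pick the best fit" idea can be sketched with a simplified fixed-order bit-level Markov model: Laplace-smoothed context counts stand in for DMC's incremental state cloning, and the per-bit cross entropy plays the role of the feature described above. All names and training strings are illustrative:

```python
import math
from collections import defaultdict

def to_bits(text: str):
    """Bit representation of the text, as on slide 12."""
    return [int(b) for ch in text.encode() for b in f"{ch:08b}"]

class BitMarkov:
    """Fixed-order bit Markov model: a simplified stand-in for a DMC."""
    def __init__(self, order=8):
        self.order = order
        self.counts = defaultdict(lambda: [1, 1])  # Laplace smoothing

    def train(self, text):
        bits = to_bits(text)
        for i in range(self.order, len(bits)):
            self.counts[tuple(bits[i - self.order:i])][bits[i]] += 1

    def cross_entropy(self, text):
        """Average bits needed to encode the text under this model;
        the class whose model yields the lowest value 'fits best'."""
        bits, h, n = to_bits(text), 0.0, 0
        for i in range(self.order, len(bits)):
            c = self.counts[tuple(bits[i - self.order:i])]
            h -= math.log2(c[bits[i]] / (c[0] + c[1]))
            n += 1
        return h / max(n, 1)

ham = BitMarkov(); ham.train("meeting agenda attached, see you tomorrow")
phish = BitMarkov(); phish.train("verify your account password immediately")
msg = "please verify your password"
label = "phish" if phish.cross_entropy(msg) < ham.cross_entropy(msg) else "ham"
```

A real DMC grows its automaton by cloning states along frequent transitions instead of fixing the context length in advance.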
14. Latent Topic Models
- Analyze the co-occurrence of words
- Similar to word clustering: specify the number of topics in advance
- Common methods: LDA, PLSA
- Probabilistic latent semantic analysis: models the probability of each co-occurrence as a mixture of conditionally independent multinomial distributions
- Latent Dirichlet allocation: generative Bayesian version with a Dirichlet prior
- Document: mixture of various topics
15. Latent Topic Models: Class-Specific
- Class-Topic Model (CLTOM): extension of LDA that incorporates class information
- LDA: uniform per-document topic Dirichlet prior α, uniform per-topic word Dirichlet prior β
- CLTOM: class-specific per-document topic Dirichlet prior α_c
- Training using EM / mean-field approximation
- Features: probabilities for each topic
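Using per-document topic probabilities as features can be sketched with plain LDA from scikit-learn (not the class-specific CLTOM variant); the toy corpus below is invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-in for the email collection; the number of topics is
# fixed in advance, as the slide notes.
docs = [
    "account password verify bank login",
    "bank account update confirm password",
    "meeting schedule agenda project deadline",
    "project meeting notes deadline agenda",
]
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
# Each row is a document's topic mixture; these probabilities
# become features for the downstream classifier.
theta = lda.fit_transform(X)
```

CLTOM differs in that the per-document topic prior depends on the class label, so topics specialize toward ham or phishing vocabulary.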
16. Latent Topic Model Topics
[Figure: words of each topic sorted by probability, annotated with their relevance for phishing]
17. Feature Processing and Selection
- Feature processing
  - Scaling: guarantees that all features have values within the same range
  - Normalization: sets the length of the feature vectors to one, which is adequate for inner-product-based classifiers
- Feature selection
  - Goal: select a subset of relevant features
  - Abstract search in a state space (Kohavi and John, AI Journal 1997)
  - Operates on an independent validation set
  - Best-first search strategy: expands the current subset by the node with the highest estimated performance, stores additional nodes to overcome local maxima
  - Compound operators: combine the set of best-performing children
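The two processing steps above can be sketched in a few lines of stdlib Python (min-max scaling followed by unit-length normalization); the function name and data are illustrative:

```python
import math

def scale_and_normalize(rows):
    """Min-max scale each feature to [0, 1], then normalize each
    feature vector to unit length, as suited to inner-product-based
    classifiers (slide 17)."""
    lo = [min(col) for col in zip(*rows)]
    hi = [max(col) for col in zip(*rows)]
    scaled = [[(v - l) / (h - l) if h > l else 0.0
               for v, l, h in zip(row, lo, hi)] for row in rows]
    out = []
    for row in scaled:
        n = math.sqrt(sum(v * v for v in row)) or 1.0  # avoid /0
        out.append([v / n for v in row])
    return out

vecs = scale_and_normalize([[1, 10, 0], [3, 20, 5], [2, 15, 10]])
```

Note the order matters: scaling first keeps any single large-valued feature from dominating the direction of the normalized vector.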
18. Evaluation Method and Test Corpus
- Standard method: 10-fold cross-validation
- Criteria: precision, recall, F-measure, false positive rate, false negative rate; accuracy for comparison with related work
- Note: errors are not of equal importance
- Test corpus: assembled by Fette et al., WWW 2007
- Ham emails: SpamAssassin corpus
- Phishing emails: collected by Nazario
- Total size: 7808 emails, 6951 ham (89%) and 857 phishing (11%)
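The criteria listed above follow directly from a confusion matrix; this small helper (hypothetical, with phishing as the positive class) makes the definitions concrete:

```python
def classification_metrics(tp, fp, fn, tn):
    """Slide 18 criteria from a confusion matrix.
    tp: phishing caught, fp: ham wrongly flagged,
    fn: phishing missed, tn: ham passed through."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)   # lost ham emails
    fnr = fn / (fn + tp)   # missed phishing emails
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, fpr, fnr, accuracy

# Invented counts for illustration only.
p, r, f1, fpr, fnr, acc = classification_metrics(80, 10, 20, 890)
```

The FPR/FNR split matters because, as the slide notes, the two error types are not equally important: a lost ham email is usually worse than a missed phishing email.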
19. Overall Result
[Chart: missed phishing emails (FNR) vs. lost ham emails (FPR)]
- FPR reduced by 92%, FNR by 64%
- Statistically significant difference from Fette et al. '07 at less than 1% error probability
- Feature selection: better results with fewer features and less training data (20% reserved for validation)
20. Agenda
- Email Classification based on Advanced Text Mining
- Hidden Salting and Anticipating Evasion
- Real-Life AntiPhish Deployment
- Conclusions
21. Salting
- Salting: intentional addition or distortion of content to evade automatic filtering
- Can be applied to any medium (e.g., text, images, audio) and to any content genre (e.g., emails, web pages, MMS messages)
- Visible salting: additional text, images containing random pixels, etc.
- Hidden salting: not perceivable by the user (e.g., text in invisible color, text behind objects, reading-order manipulation)
22. [Rendering pipeline diagram: email source text → internal representation → drawing canvas → end user]

HTML source:
<html> <head> </head> <body> <h1>A story</h1> <p> Once there was ... </p> </body> </html>

Internal representation (DOM tree nodes): <html>, <head>, <body>, <h1>, <p>, <p>, <em>, <a>

Rendered text as seen by the end user:
"A story. Once there was a noble prince. He lived in a fancy castle. Read more ..."
23. Hidden Salting Simulation
- We tap into the rendering process to detect hidden content, i.e., manifestations of salting
- Intercept requests for drawing text primitives
- Build an internal representation of the characters, i.e., a list of attributed glyphs in compositional order
- Test for glyph visibility:
- Clipping: the glyph is drawn within the physical bounds of the drawing clip
- Concealment: the glyph is not concealed by other glyphs or shapes
- Font color: the glyph's fill color contrasts well with the background color
- Glyph size: the glyph's size and shape are sufficiently large
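Two of the visibility tests above (font color and glyph size) can be sketched as simple predicates; the luminance formula, thresholds, and function names are assumptions made for this example, not the project's actual values:

```python
def luminance(rgb):
    """Standard perceptual luminance weighting of an (r, g, b) triple."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

def glyph_visible(fill_rgb, bg_rgb, size_pt, min_contrast=32, min_size=4):
    """A glyph counts as visible only if its fill color contrasts
    with the background AND it is sufficiently large (slide 23)."""
    contrast_ok = abs(luminance(fill_rgb) - luminance(bg_rgb)) >= min_contrast
    return contrast_ok and size_pt >= min_size

# White text on a white background: a classic hidden-salting trick.
hidden = not glyph_visible((255, 255, 255), (255, 255, 255), 10)
```

The real system applies all four tests (clipping and concealment too) per glyph before a character is admitted to the simulated perceived text.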
24. Hidden Salting Simulation (cont.)
- We feed the intercepted, visible text into a cognitive hidden-salting simulation model, which returns the simulated perceived text
- Reading order: detected based on a layout characteristic, where we expect that glyphs of parallel lines are aligned
- Compliance of the text with the language-specific distributions of character n-grams, common words, and word lengths
- For details see De Beer and Moens, Tech. Report KU Leuven 2007
25. Evasion Detection
- Cat-and-mouse game: spammers are developing tricks, filter developers are adapting their filters
- So far: hidden salting simulation model
- Closing the loop: identifying email messages that are likely to make the hidden salting simulation system fail
- Method: compare the simulated perceived text as generated by our hidden salting simulation system with the message text as obtained by applying OCR to the rendered email message
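The comparison step can be sketched with a crude stand-in for the project's robust text distance measures: lowercase both texts, strip non-letters (OCR output is noisy), and take a sequence similarity; names and sample strings are illustrative:

```python
import difflib
import re

def robust_similarity(simulated: str, ocr: str) -> float:
    """Similarity in [0, 1] between the simulated perceived text and
    the OCR text of the rendered email (slide 25). A low score flags
    a message that likely evaded the salting simulation."""
    def clean(s):
        return re.sub(r'[^a-z ]', '', s.lower())
    return difflib.SequenceMatcher(None, clean(simulated),
                                   clean(ocr)).ratio()

sim_text = "Your home refinance loan is approved"
ocr_text = "INNOCENT TEXT TO TRICK FILTER Your home refinance loan is approved"
score = robust_similarity(sim_text, ocr_text)
```

In the real pipeline this score would feed the one-class classifier described on slide 28 rather than being thresholded directly.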
26. Evasion Detection Approach
27. Example

HTML source:
<html> <body>
<font color="ffffff">INNOCENT TEXT TO TRICK FILTER</font>
<p>Your home refinance loan is approved!<br></p><br>
<p>To get your approved amount <a href="http://www.mortgagepower3.com/">go here</a>.</p> <br><br><br>
<p>To be excluded from further notices <a href="http://www.mortgagepower3.com/remove.html">go here</a>.</p>
</body>
<font color="ffffff">1gate5297gdqK6-498jyxl3033RafD3-195RTcz6485obQU9-615LOLg9l49</font>
</html>

Hidden salting simulation (simulated perceived text):
INNOCENT TEXT TO TRICK FILTER Your home refinance loan is approved! To get your approved amount go here. To be excluded from further notices go here.

OCR text (email on screen):
Your _ome refinance loan is approve_! To get your approve_ amount _o_o _ere. To De exclu_e_ from furt_er notices __o _ere.

→ Detect difference
28. Evaluation Method
- Method: simulate the detection of a new salting trick by disabling the detection of one of the known tricks
- Classifier: one-class SVM
- Training set (one class): class of "normal" emails, i.e., emails that contain no or only known salting tricks
- Test set: emails both with and without the disabled ("new") salting trick
- Features: robust text distance measures
- The classifier marks outliers, i.e., emails that are not in the one class, which indicates that they may contain a previously unseen salting trick
- The classifier produces a real-valued output; we automatically compute the cutting threshold by reapplying the classifier to the training set
- OCR engines: gocr, ocrad
29. Test Data
- 6951 ham, 2154 spam, and 4559 phishing messages from the SpamAssassin and Nazario corpora
- Considered tricks: font color, font size
- Training set: 800 messages without the trick
- Test set: 100 messages with / 300 messages without the trick
30. Overall Result
31. Agenda
- Email Classification based on Advanced Text Mining
- Hidden Salting and Anticipating Evasion
- Real-Life AntiPhish Deployment
- Conclusions
32. Filtering a Real-Life Email Stream: Challenges
- Fixed scenario with fixed parameters
- Data
  - From a present real-life stream
  - Mostly English and Italian
  - (Almost) unskewed
- All data is unlabeled; not easy to eliminate spam
- Very strict privacy regulations
- Experiments: almost online
33. General Deployment Approach
- Start: initial AntiPhish model M0
- For every day t ∈ {1, ..., n}:
  - Capture a set of emails S_t, sent in real time through spam filters
  - Select a test subset T_t ⊂ S_t for evaluation of the current AntiPhish model M_{t-1}
  - Select a subset A_t ⊂ S_t of emails that are difficult to classify, to be used for active learning
  - Obtain labels for the sets T_t and A_t
  - Evaluate the current model M_{t-1} on the set T_t
  - Add the set A_t to the training set, train the new model M_t
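The daily loop above can be sketched as a skeleton in which every callable is a hypothetical stand-in for the real component:

```python
def deploy(model, stream, days, select_test, select_active,
           get_labels, evaluate, retrain):
    """Daily deployment loop from slide 33.
    stream(t)            -> emails S_t captured on day t
    select_test(S)       -> test subset T_t
    select_active(S, m)  -> hard-to-classify subset A_t
    get_labels(emails)   -> labels for T_t and A_t
    evaluate(m, T, lab)  -> evaluation of M_{t-1} on T_t
    retrain(m, A, lab)   -> new model M_t"""
    results = []
    for t in range(1, days + 1):
        S = stream(t)
        T = select_test(S)
        A = select_active(S, model)
        labels = get_labels(T + A)
        results.append(evaluate(model, T, labels))
        model = retrain(model, A, labels)
    return model, results
```

The key property is that evaluation always uses the previous day's model M_{t-1} before A_t is folded into the training set.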
34. Details
- The AntiPhish system is evaluated on arbitrarily collected emails
- Deployment period: n = 20 days
- Used features: unigram, DMC, semantic topics with k = 25 topics, link, and lexical features
- Every day a total of |T_t ∪ A_t| = 750 emails are selected
- An email is classified as non-ham if and only if it is considered non-ham with a probability of at least 95%
35. Stratified Evaluation
- T_t: stratified sample of its underlying base set S_t
- Idea: better represent interesting emails
- Two buckets: emails that are difficult or easy to classify
- Basic procedure: oversample the difficult emails, but give them a lower weight in the evaluation
- More specifically: let S_t = S_t(u) ∪ S_t(c); we want to sample k1 and k2 emails, respectively
- We use a probability of p = 95% (for non-ham) as the certainty threshold
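The oversample-then-reweight idea can be sketched as follows; weighting each sampled email by its bucket's true share of the stream divided by the sample size keeps the weighted evaluation unbiased. This is a sketch of the principle, not the exact AntiPhish procedure:

```python
import random

def stratified_sample(uncertain, certain, k1, k2):
    """Sample k1 difficult (uncertain) and k2 easy (certain) emails
    (slide 35). Returns (email, weight) pairs whose weights sum to 1,
    so difficult emails are oversampled but down-weighted."""
    n = len(uncertain) + len(certain)
    w_u = len(uncertain) / n / k1   # weight per sampled uncertain email
    w_c = len(certain) / n / k2    # weight per sampled certain email
    return ([(e, w_u) for e in random.sample(uncertain, k1)]
            + [(e, w_c) for e in random.sample(certain, k2)])

random.seed(0)
# 50 difficult vs. 950 easy emails; sample 25 from each bucket.
pairs = stratified_sample(list(range(50)), list(range(50, 1000)), 25, 25)
```

With these weights, each difficult email counts for 0.05/25 of the estimate even though difficult emails make up half the sample.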
36. Active Learning
[Diagram: email stream + previous training set → current model → uncertain / certain emails → random oversampling / undersampling → new training set → training → new model]
- Set of additional training emails per day A_t, |A_t| = 500
- 400 top-ranked emails from S_t having the lowest classification confidence
- ... plus 100 emails randomly selected from the rest of S_t
- Minimization of duplicates among the 400 uncertain emails: ignore duplicates
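The selection rule above (least-confident emails plus a random remainder, skipping duplicates) can be sketched at a reduced scale; the function name, toy emails, and confidence scores are invented for the example:

```python
import random

def select_active(emails, confidence, n_uncertain=4, n_random=1):
    """Slide 36 selection, scaled down from 400+100: take the
    n_uncertain least-confident distinct emails, then add n_random
    emails drawn from the rest of the stream."""
    ranked = sorted(emails, key=confidence)  # lowest confidence first
    uncertain, seen = [], set()
    for e in ranked:
        if e not in seen:            # ignore duplicate messages
            uncertain.append(e)
            seen.add(e)
        if len(uncertain) == n_uncertain:
            break
    rest = [e for e in emails if e not in seen]
    return uncertain + random.sample(rest, min(n_random, len(rest)))

random.seed(1)
batch = select_active(
    ["a", "b", "b", "c", "d", "e"],
    {"a": 0.2, "b": 0.1, "c": 0.9, "d": 0.8, "e": 0.7}.get)
```

Mixing in the random tail keeps the training set from drifting toward only borderline cases, which is why the real system adds 100 random emails to the 400 uncertain ones.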
37. Initial Dataset
- Initial dataset: six days of 750 messages each
- Total: 4489 messages
- Ham: 1514 (34%)
- Phishing: 1342 (30%)
- Spam: 1633 (36%)
- Non-ham: 2975 (66%)
- Time period for the experiment: the subsequent 20 days
38. Additional Training Data Through Active Learning
39. Test Data and Evaluation
- 250 messages per day
- k1 = k2 = 125 difficult and easy messages
- Sometimes fewer, because not enough difficult emails were found
- Evaluation
  - False positive rate: proportion of lost ham emails among all ham emails
  - False negative rate: proportion of missed non-ham emails among all non-ham emails
40. Test Data
41. Baseline Result
[Chart: ham classified as non-ham (FPR); non-ham classified as ham (FNR)]
- Average FNR: 7.09%
42. Results for Selected Thresholds
[Chart: threshold in % on the predicted probability of non-ham]
43. Effect of Active Learning
[Charts: ham classified as non-ham; non-ham classified as ham]
- Three different fixed models:
- Initial model M0
- Model after five days of active learning: M5
- Model after ten days of active learning: M10
44. Effect of Active Learning
45. Spam Filter Vote as a Feature
46. Phishing vs. Ham Classification
- Phishing vs. ham (instead of non-ham vs. ham)
- FPR 0%, FNR 7.62%
47. Identifying Potential Phishing in Spam
- Second real-life application
- Anti-spam operations use spam traps to gather the latest spam samples so that these can be better defended against
- The ability to separate out the phishing leads to a quicker defence against such fraudulent activity
[Diagram: honeypot network → spam + phishing → phishing classifier → regular spam / phishing → updated signatures → fast update of spam filter]
48. Related Laboratory Experiment
- Labeled data: phishing and regular spam from a probe network
- Training: 53 phishing vs. 1060 regular spam per week
- Test: 75 phishing vs. 1443 regular spam per week (on average)
- Duration: June to November 2008 (26 weeks)
- System parameters
  - Features: DMC, semantic topics with 10 topics, unigram, word list, DMC-link
  - Threshold: neutral (50%)
- Evaluation: sliding-window strategy
  - Each week is filtered by a classifier trained on the previous N = 4 weeks
- Result
  - FPR (spam classified as phishing): 0.18%
  - FNR (phishing classified as spam): 4.89%
49. Sliding Window, Training on N = 4 Weeks
[Chart: phishing classified as spam (FNR); spam classified as phishing (FPR)]
50. Agenda
- Email Classification based on Advanced Text Mining
- Hidden Salting and Anticipating Evasion
- Real-Life AntiPhish Deployment
- Conclusions
51. Conclusions: Lessons Learnt
- Phishing: a multi-billion-dollar activity
- AntiPhish: phishing prevention through content-based email filtering
- Advanced text-mining features boost performance: dynamic Markov chains, latent topic models
- Most of these techniques are language-independent
- Anticipatory learning: detecting new filter evasion techniques requires high-speed, high-quality OCR
- Real-life deployment
  - Active learning keeps filters up to date
  - Combination with spam filters improves performance through incorporation of current blacklist information
  - Identifying phishing in a honeypot network permits prioritization in spam-filter updating