Identifying Suspicious URLs: An Application of Large-Scale Online Learning

Slides: 25

Provided by: Just48

Learn more at: https://cseweb.ucsd.edu

Transcript and Presenter's Notes

1
Identifying Suspicious URLs: An Application of Large-Scale Online Learning

Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker
Computer Science and Engineering
UC San Diego
Presentation for ICML 2009
June 15, 2009

2
Detecting Malicious Web Sites

Safe URL?
Web exploit?
Spam-advertised site?
Phishing site?

URL = Uniform Resource Locator
http://www.cs.mcgill.ca/icml2009/abstracts.html
http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
http://fblight.com
http://mail.ru
Predict what is safe without committing to risky actions

3
Problem in a Nutshell

URL features to identify malicious Web sites
Different classes of URLs
Benign, spam, phishing, exploits, scams...
For now, distinguish benign vs. malicious

facebook.com
fblight.com
4
Today's Talk

Problem
Approach
Learning to detect malicious URLs
Challenges: scale and non-stationarity
Evaluations
Need for large, fresh training sets
Online learning
Conclusion

5
State of the Practice

Current approaches
Blacklists
Learning on hand-tuned features
Limitations
Cannot learn from newest examples quickly
Cannot quickly adapt to newest features
Arms race: fast feedback cycle is critical

More automated approach?
6
Live URL Classification System
[Diagram labels: Label, Example, Hypothesis]
7
Live Training Feed

Malicious URLs (spamming and phishing)
6,000-7,500 per day from Web mail provider
Benign URLs
From Yahoo Web directory
Total of 20,000 URLs per day
Live collection since Jan. 5, 2009
Months of data
Two million examples after 100 days

8
Feature vector construction
URL: http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
WHOIS registration: 3/25/2009
Hosted from: 208.78.240.0/22
IP hosted in: San Mateo
Connection speed: T1
Has DNS PTR record? Yes
Registrant: Chad ...
Feature vector: _ _ 0 0 0 1 1 1 1 0 1 1
[Figure labels: Host-based, Lexical, Real-valued; 60 features, 1.8 million, 1.1 million; GROWING; Day 100]
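
A minimal Python sketch of how a URL and its host lookups might be turned into a sparse binary feature vector whose dimensionality grows as new tokens appear. The tokenization rule, the `FeatureIndex` class, and the `host_info` key names are assumptions made for illustration; only the example URL and the lookup values come from the slide, and the real-valued features are left out.

```python
import re
from urllib.parse import urlparse


def lexical_features(url):
    """Binary bag-of-words features from hostname and path tokens."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    features = {}
    for token in host.split("."):
        if token:
            features["host_token=" + token] = 1.0
    # Split the path on common URL delimiters (an assumed rule).
    for token in re.split(r"[/?.=&_-]+", parsed.path):
        if token:
            features["path_token=" + token] = 1.0
    return features


def host_features(host_info):
    """One binary feature per (property, value) pair from host lookups."""
    return {"{}={}".format(key, value): 1.0 for key, value in host_info.items()}


class FeatureIndex:
    """Gives each feature name a new column index the first time it is seen."""

    def __init__(self):
        self.index = {}

    def vectorize(self, named_features):
        sparse = {}
        for name, value in named_features.items():
            column = self.index.setdefault(name, len(self.index))
            sparse[column] = value
        return sparse


url = "http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll"
host_info = {  # values from the slide; key names are hypothetical
    "whois_registration": "3/25/2009",
    "ip_prefix": "208.78.240.0/22",
    "ip_location": "San Mateo",
    "connection_speed": "T1",
    "has_dns_ptr": "yes",
}

index = FeatureIndex()
named = {**lexical_features(url), **host_features(host_info)}
vector = index.vectorize(named)

# A second URL adds previously unseen tokens, so the feature space grows.
index.vectorize(lexical_features("http://fblight.com"))
```

Because each distinct token or lookup value becomes its own column, the number of features keeps growing as the feed delivers new URLs.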
9
Live URL Classification System
Online learning
10
Practical Challenges of ML in Systems

Industrial concerns
Scale: millions of examples, features
Non-stationarity: examples change over time (arms race w/ criminals)
Pivotal decision: batch or online?

11
Batch vs. Online Learning

Batch/offline learning
SVM, logistic regression, decision trees, etc.
Multiple passes over data
No incremental updates
Potentially high memory and processing overhead

Online learning
Perceptron-style algorithms
Single pass over data
Incremental updates
Low memory and processing overhead

Online learning addresses scale and non-stationarity
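
The single-pass, incremental style can be sketched as a predict-then-update loop. This is an illustrative harness, not the system described in the talk: `run_online` and the `MajorityLearner` stand-in are hypothetical names, and labels are assumed to be +1 (malicious) or -1 (benign).

```python
def run_online(stream, learner):
    """Single pass: predict each example first, then learn from its label."""
    mistakes = 0
    seen = 0
    for features, label in stream:
        prediction = learner.predict(features)
        if prediction != label:
            mistakes += 1
        # Incremental update: the example can be discarded afterwards.
        learner.update(features, label)
        seen += 1
    return mistakes, seen


class MajorityLearner:
    """Trivial stand-in learner: predicts the label seen most often so far."""

    def __init__(self):
        self.balance = 0  # (count of +1 labels) - (count of -1 labels)

    def predict(self, features):
        return 1 if self.balance >= 0 else -1

    def update(self, features, label):
        self.balance += label


# Toy stream of (sparse features, label) pairs; all values are made up.
stream = [
    ({"a": 1.0}, -1),
    ({"b": 1.0}, -1),
    ({"a": 1.0}, 1),
    ({"c": 1.0}, -1),
]
mistakes, seen = run_online(iter(stream), MajorityLearner())
```

Any learner exposing `predict` and `update` can be dropped into the same loop, and memory use stays independent of how many examples have been processed.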
12
Evaluations

Online learning for URL reputation
Need for large, fresh training sets
Comparing online algorithms

13
Need lots of fresh training data?
SVM trained once
SVM retrained daily

Fresh data helps

14
Need lots of fresh training data?
SVM trained once on 2 weeks
SVM w/ 2-week sliding window

Fresh data helps
More data helps

15
Which online algorithm?

Perceptron
Stochastic Gradient Descent for Logistic Regression
Confidence-Weighted Learning

16
Perceptron
Rosenblatt, 1958

Convergence result
Number of mistakes <= (radius / margin)^2

Update on each mistake
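
A minimal sketch of the Perceptron on sparse binary features, where the weights change only when an example is misclassified (w <- w + y x). The feature names and the toy examples are hypothetical, and labels are assumed to be +1 (malicious) or -1 (benign).

```python
class Perceptron:
    def __init__(self):
        self.weights = {}   # sparse weight vector: feature name -> weight
        self.mistakes = 0

    def score(self, x):
        return sum(self.weights.get(f, 0.0) * v for f, v in x.items())

    def predict(self, x):
        return 1 if self.score(x) > 0 else -1

    def update(self, x, y):
        # Update only on a mistake (non-positive margin): w <- w + y * x
        if y * self.score(x) <= 0:
            self.mistakes += 1
            for f, v in x.items():
                self.weights[f] = self.weights.get(f, 0.0) + y * v


# Toy data; feature names are made up for illustration.
examples = [
    ({"tok=ebayisapi": 1.0, "tld=mobi": 1.0}, 1),
    ({"tok=icml2009": 1.0, "tld=ca": 1.0}, -1),
    ({"tok=login": 1.0, "tld=mobi": 1.0}, 1),
    ({"tok=abstracts": 1.0, "tld=ca": 1.0}, -1),
]

model = Perceptron()
for x, y in examples:   # single pass over the data
    model.update(x, y)

predictions = [model.predict(x) for x, _ in examples]
```

Every mistake moves the weights by the same fixed amount, regardless of how wrong the prediction was.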

17
Logistic Regression with SGD
Bottou, 1998

Log likelihood

Update for every example

Update is proportional to the prediction error
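
A sketch of logistic regression trained by stochastic gradient ascent on the log likelihood, assuming the standard per-example update w <- w + rate * (target - sigmoid(w . x)) * x. The learning rate value, the feature names, and the toy data are assumptions; labels are +1 (malicious) or -1 (benign) and are mapped to targets 1 and 0.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class LogisticSGD:
    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate  # assumed value, not from the talk
        self.weights = {}

    def probability(self, x):
        z = sum(self.weights.get(f, 0.0) * v for f, v in x.items())
        return sigmoid(z)

    def predict(self, x):
        return 1 if self.probability(x) > 0.5 else -1

    def update(self, x, y):
        target = (y + 1) / 2                  # map {-1, +1} to {0, 1}
        error = target - self.probability(x)  # size of the update scales with this
        for f, v in x.items():
            step = self.learning_rate * error * v
            self.weights[f] = self.weights.get(f, 0.0) + step
        return error


examples = [
    ({"tok=ebayisapi": 1.0, "tld=mobi": 1.0}, 1),
    ({"tok=icml2009": 1.0, "tld=ca": 1.0}, -1),
    ({"tok=login": 1.0, "tld=mobi": 1.0}, 1),
    ({"tok=abstracts": 1.0, "tld=ca": 1.0}, -1),
]

model = LogisticSGD()
errors = [model.update(x, y) for x, y in examples]  # single pass
predictions = [model.predict(x) for x, _ in examples]
```

Unlike the Perceptron, every example triggers an update, and the step shrinks as the predicted probability approaches the target.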
18
Confidence-Weighted Learning
Dredze et al., 2008; Crammer et al., 2009

Maintain Gaussian distribution over weight vector

Constrained problem

Closed-form update

Treat features differently
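
A sketch of a confidence-weighted learner with a diagonal covariance, assuming the variance-style closed-form update from Dredze et al. (2008). The diagonal covariance is an approximation, and the confidence parameter `eta`, the initial variance, the feature names, and the toy data are all assumptions rather than settings from the talk.

```python
import math
from statistics import NormalDist


class ConfidenceWeighted:
    """Keeps a mean and a variance per feature (diagonal Gaussian over weights)."""

    def __init__(self, eta=0.9, initial_variance=1.0):
        self.phi = NormalDist().inv_cdf(eta)  # phi = Phi^{-1}(eta), needs eta > 0.5
        self.initial_variance = initial_variance
        self.mean = {}
        self.variance = {}

    def margin(self, x):
        return sum(self.mean.get(f, 0.0) * v for f, v in x.items())

    def predict(self, x):
        return 1 if self.margin(x) > 0 else -1

    def update(self, x, y):
        phi = self.phi
        m = y * self.margin(x)                      # signed margin of the mean
        v = sum(self.variance.get(f, self.initial_variance) * val * val
                for f, val in x.items())            # variance of the margin
        if v == 0.0:
            return 0.0
        b = 1.0 + 2.0 * phi * m
        disc = b * b - 8.0 * phi * (m - phi * v)    # always non-negative
        alpha = max(0.0, (-b + math.sqrt(disc)) / (4.0 * phi * v))
        if alpha == 0.0:
            return 0.0                              # constraint already satisfied
        for f, val in x.items():
            var = self.variance.get(f, self.initial_variance)
            # High-variance (rarely seen) features receive larger mean updates.
            self.mean[f] = self.mean.get(f, 0.0) + alpha * y * var * val
            # Each update shrinks the variance of the features it touches.
            self.variance[f] = 1.0 / (1.0 / var + 2.0 * alpha * phi * val * val)
        return alpha


examples = [
    ({"tok=ebayisapi": 1.0, "tld=mobi": 1.0}, 1),
    ({"tok=icml2009": 1.0, "tld=ca": 1.0}, -1),
    ({"tok=login": 1.0, "tld=mobi": 1.0}, 1),
    ({"tok=abstracts": 1.0, "tld=ca": 1.0}, -1),
]

model = ConfidenceWeighted()
for x, y in examples:   # single pass
    model.update(x, y)

predictions = [model.predict(x) for x, _ in examples]
```

After the pass, a feature seen twice (`tld=mobi`) has a smaller variance than one seen once (`tok=login`), which is how the learner treats frequent and rare features differently.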
19
Which online algorithms?
Perceptron
20
Which online algorithms?
Perceptron
LR w/ SGD

Proportional update helps

21
Which online algorithms?
Perceptron
LR w/ SGD
Confidence-Weighted

Proportional update helps
Per-feature confidence really helps

22
Batch...
Batch

Fresh data helps
More data helps

23
Batch vs. Online
Batch
Confidence-Weighted

Fresh data helps
More data helps
Online matches batch

24
Conclusion

Detecting malicious URLs
Relevant real-world problem
Successful application of online learning
Confidence-Weighted vs. Batch
As accurate
More adaptive
Fewer resources
Future work
Scaling up for deployment
