Title: Intelligent Detection of Malicious Script Code
- CS194, 2007-08
- Benson Luk
- Eyal Reuveni
- Kamron Farrokh
- Advisor: Adnan Darwiche
- Sponsored by Symantec
Outline for Project
- Phase I: Setup
  - Set up machine for testing environment
  - Ensure that whitelist is clean
- Phase II: Crawling
  - Modify crawler to output only necessary data. This means:
    - Grab only necessary information from web crawling results
    - Listen in on Internet Explorer's JavaScript interpreter and output relevant behavior
- Phase III: Database
  - Research and develop an effective structure for storing data and link it to the web crawler
- Phase IV: Analysis
  - Research and develop an effective algorithm for learning from massive amounts of data
Completed Tasks: First Quarter
- Phase I
  - Configured machine with Norton Antivirus and the Heritrix web crawler
    - The web crawler will be used to grab additional URLs, and Norton Antivirus will be used to verify that a URL has not launched an attack
  - Created a Python script to ensure that visited sites are clean
    - Captures Norton's web attack logs before and after loading a site in Internet Explorer, then compares the logs for new entries and signals whether or not a site's data should be discarded
- Phase II
  - Configured Heritrix to run specific crawls that target a set of domains and output minimal information
    - The purpose is to gather as many URLs with scripts as possible for a large sample base
  - Created a parser for Heritrix logs to filter out irrelevant websites
    - For example, we omit URLs that point to images, since they will not contain scripts
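The image-filtering step above can be sketched in Python, the language the project already uses for its scripts. This is a minimal illustration rather than the project's actual parser: it assumes the crawl results have already been reduced to a plain list of URLs, and the names `IMAGE_EXTENSIONS`, `keep_url`, and `filter_urls` are hypothetical.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical set of extensions treated as images; the original
# parser's list is not given in the slides.
IMAGE_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".bmp", ".ico"}

def keep_url(url):
    """Return True if the URL path does not end in a known image extension."""
    path = urlparse(url).path          # drops the query string and fragment
    ext = posixpath.splitext(path)[1].lower()
    return ext not in IMAGE_EXTENSIONS

def filter_urls(urls):
    """Drop URLs that point to images, preserving the input order."""
    return [u for u in urls if keep_url(u)]

if __name__ == "__main__":
    sample = [
        "http://example.com/index.html",
        "http://example.com/logo.GIF",
        "http://example.com/photo.jpg?size=large",
        "http://example.com/app.js",
    ]
    for url in filter_urls(sample):
        print(url)
```

Filtering on the parsed path rather than the raw string keeps a query string such as `?size=large` from hiding the extension.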
Completed Tasks: Second Quarter
- Phase I
  - Whitelist: integrated a Symantec component to check whether a visited site is malicious, so all of the data we gather is from clean sources
  - Hard drive: installed a 750 GB hard drive
Completed Tasks: Second Quarter
- Phase II
  - Crawling: we ran a shallow crawl with 200 domains as seeds, and that is the current base of our data. The result was 18,500 URLs, which we run through our Script Listening component
Completed Tasks: Second Quarter
- Phase II
  - Script Listening: received a customizable tool from Symantec that listens to the JavaScript interpreter in Internet Explorer
    - We modified it to output the information we need
    - GUID -> DISPID -> ArgType -> ArgVal
Completed Tasks: Second Quarter
DISPID (function) | GUID (object)                        | # of Args | Arg Type | Arg Value
1030              | 3050f55f-98b5-11cf-bb82-00aa00bdce0b | 1         | BSTR     | 130
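A record like the one above could be read into a small Python structure as follows. The slides do not give the tool's on-disk format, so this sketch assumes one tab-separated line per call, with fields in the column order shown (DISPID, GUID, argument count, then one type and one value per argument). `CallRecord` and `parse_record` are hypothetical names.

```python
from typing import List, NamedTuple, Tuple

class CallRecord(NamedTuple):
    dispid: int                   # function identifier
    guid: str                     # object identifier
    args: List[Tuple[str, str]]   # (type, value) pairs; values kept as text

def parse_record(line: str) -> CallRecord:
    """Parse one tab-separated line in the assumed layout."""
    fields = line.rstrip("\n").split("\t")
    dispid = int(fields[0])
    guid = fields[1].lower()
    n_args = int(fields[2])
    rest = fields[3:]
    if len(rest) != 2 * n_args:
        raise ValueError(
            f"expected {n_args} (type, value) pairs, got {len(rest)} fields"
        )
    # Pair up alternating type and value fields.
    args = [(rest[i], rest[i + 1]) for i in range(0, len(rest), 2)]
    return CallRecord(dispid, guid, args)

if __name__ == "__main__":
    sample = "1030\t3050f55f-98b5-11cf-bb82-00aa00bdce0b\t1\tBSTR\t130"
    print(parse_record(sample))
```

Argument values are kept as strings here because their interpretation depends on the argument type.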
Completed Tasks: Second Quarter
- Phase III
  - The amount of data we have gathered is too large to use in a database. The plain text file is 4 GB (50 million function calls), and querying such a database is too slow on the computer we have.
  - Instead, we are storing the data as a text file and operating on it with Python scripts.
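One way to work on a multi-gigabyte text file from Python is to stream it line by line and keep only running counts in memory. The sketch below is illustrative, not the project's script: it assumes tab-separated lines that begin with DISPID and GUID, and the file name `calls.txt` in the comment is hypothetical.

```python
from collections import Counter
from typing import Iterable

def count_calls(lines: Iterable[str]) -> Counter:
    """Count calls per (GUID, DISPID) pair, reading one line at a time."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        dispid, guid = fields[0], fields[1]
        counts[(guid, dispid)] += 1
    return counts

def summarize(counts: Counter) -> dict:
    """Compute totals: calls, distinct GUIDs, and distinct GUID-DISPID pairs."""
    return {
        "function_calls": sum(counts.values()),
        "distinct_guids": len({guid for guid, _ in counts}),
        "guid_dispid_pairs": len(counts),
    }

if __name__ == "__main__":
    # A file object is an iterable of lines, so a large file could be
    # streamed with:  with open("calls.txt") as f: count_calls(f)
    sample = [
        "1030\tguid-a\t1\tBSTR\t130\n",
        "1030\tguid-a\t0\n",
        "1103\tguid-b\t1\tI4\t5\n",
    ]
    print(summarize(count_calls(sample)))
```

Memory use grows with the number of distinct GUID-DISPID pairs rather than with the number of calls, which is what makes a single pass over the file practical.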
Results and Findings: Second Quarter
- Phase IV
  - We have analyzed data from our first two result sets
    - Crawl with 5 initial seeds
      - 3,476,348 function calls
      - 109 distinct GUIDs, 7,364 GUID-DispID pairs
    - Crawl with 15 initial seeds
      - 3,706,454 function calls
      - 95 distinct GUIDs, 5,575 GUID-DispID pairs
  - Looked at the most common functions, the most common int-argument functions, and the distribution of the argument values for these functions
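The argument-value distributions mentioned above could be tallied along these lines. This is a sketch under stated assumptions: records are already parsed into `(guid, dispid, args)` tuples, an "int-argument function" is taken to mean a call with exactly one integer argument, and integer arguments are tagged `"I4"`. None of these details are confirmed by the slides, and all names are hypothetical.

```python
from collections import Counter, defaultdict

INT_TYPE = "I4"  # assumed tag for integer arguments; the tool's label is not given

def int_arg_distributions(records):
    """Map each (guid, dispid) to a Counter of its integer argument values.

    Only calls with exactly one argument of type INT_TYPE are included.
    """
    dists = defaultdict(Counter)
    for guid, dispid, args in records:
        if len(args) == 1 and args[0][0] == INT_TYPE:
            dists[(guid, dispid)][int(args[0][1])] += 1
    return dists

def most_common_functions(dists, n=3):
    """Rank int-argument functions by their total number of calls."""
    totals = Counter({key: sum(c.values()) for key, c in dists.items()})
    return totals.most_common(n)

if __name__ == "__main__":
    sample = [
        ("g1", 1103, [("I4", "5")]),
        ("g1", 1103, [("I4", "7")]),
        ("g2", 1013, [("I4", "0")]),
        ("g2", 1030, [("BSTR", "130")]),  # not an int argument, ignored
    ]
    dists = int_arg_distributions(sample)
    for key, total in most_common_functions(dists):
        print(key, total, dict(dists[key]))
```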
Results and Findings: Second Quarter
- Function 1
  - GUID: 3050f55d-98b5-11cf-bb82-00aa00bdce0b
  - GUID object name: DispHTMLWindow2
  - DispID: 1103
  - Most popular int-argument function in both result sets
  - Mostly random distribution, but with signs of regularity
  - Results from the two sets show significant differences
Results and Findings: Second Quarter
- Function 2
  - GUID: 3050f55f-98b5-11cf-bb82-00aa00bdce0b
  - GUID object name: DispHTMLDocument
  - DispID: 1013
  - Second most popular int-argument function in both result sets
  - Shows a regular distribution with distinct characteristics
  - Results from the two sets show significant differences
Results and Findings: Second Quarter
- Function 3
  - GUID: 3050f51b-98b5-11cf-bb82-00aa00bdce0b
  - GUID object name: DispHTMLIFrame
  - DispID: -2147418107
  - Third most popular int-argument function in the 1st result set, 95th most popular in the 2nd result set
  - Shows a random distribution with distinct characteristics
  - Results are dramatically different between data sets
    - All arguments in the 2nd result set are 0
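A popularity comparison of this kind (third in one result set, 95th in the other) amounts to ranking functions by call count in each set and looking at the rank change. The sketch below assumes each result set has already been reduced to a mapping from function to call count; the function names and the sample counts are hypothetical.

```python
from collections import Counter

def ranks(counts):
    """Return {function: rank}, where rank 1 is the most frequently called."""
    # Ties are broken by the function's string form so the order is stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    return {key: i + 1 for i, (key, _) in enumerate(ordered)}

def rank_shifts(counts_a, counts_b):
    """List (function, rank_in_a, rank_in_b) for functions in both sets,
    with the largest absolute rank change first."""
    ra, rb = ranks(counts_a), ranks(counts_b)
    shared = ra.keys() & rb.keys()
    return sorted(
        ((k, ra[k], rb[k]) for k in shared),
        key=lambda t: (-abs(t[1] - t[2]), str(t[0])),
    )

if __name__ == "__main__":
    set_a = Counter({"f1": 10, "f2": 8, "f3": 5})
    set_b = Counter({"f1": 9, "f2": 1, "f3": 4})
    for func, rank_a, rank_b in rank_shifts(set_a, set_b):
        print(func, rank_a, rank_b)
```

Functions that appear in only one set are left out here; reporting them separately would be a reasonable extension.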
Results and Findings: Second Quarter
- Found significant differences between the data sets in both the frequencies of specific functions and the arguments of specific functions
  - Suspect that the differences result from biases due to the small number of original seeds (5 and 15)
- Ran a much broader crawl (200 seeds) in hopes of getting more general, unbiased results
  - Just from partial results of this crawl (roughly 8,000 websites), we have so far found:
    - A much larger average number of calls to our listener per website
    - A large percentage of function calls that take 0 arguments
  - Will post complete results once the crawl is finished
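The two summary figures mentioned above (average calls per website and the share of zero-argument calls) could be computed in one pass. This sketch assumes each logged call can be associated with the site that produced it, which the slides do not confirm; `crawl_summary` and the `(site, n_args)` record shape are hypothetical.

```python
from collections import Counter

def crawl_summary(records):
    """Summarize an iterable of (site, n_args) tuples, one per function call."""
    calls_per_site = Counter()
    zero_arg_calls = 0
    total = 0
    for site, n_args in records:
        calls_per_site[site] += 1
        total += 1
        if n_args == 0:
            zero_arg_calls += 1
    if total == 0:
        return {"avg_calls_per_site": 0.0, "zero_arg_fraction": 0.0}
    return {
        # Averages over sites that produced at least one call.
        "avg_calls_per_site": total / len(calls_per_site),
        "zero_arg_fraction": zero_arg_calls / total,
    }

if __name__ == "__main__":
    sample = [("a.example", 0), ("a.example", 1), ("b.example", 0), ("a.example", 0)]
    print(crawl_summary(sample))
```

Sites that triggered no calls at all never appear in the records, so they would have to be counted separately to get an average over every crawled site.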
Direction for Next Quarter
- Further analyze the gathered data for patterns
- Compare trends in normal data to what occurs in malicious scripts