Link Analysis of Evolving Graphs

1
Link Analysis of Evolving Graphs

Prasanna Desikan
Department of Mathematics, Statistics and
Computer Science
University of Wisconsin-Stout.
desikanp_at_uwstout.edu

Joint work with Advisor Prof. Jaideep
Srivastava, Department of Computer Science,
University of Minnesota
2
Outline

Definition of Link Analysis
Link Analysis applied on the Web
Evolving graphs
Motivation
Modeling evolving graphs
Applications and Challenges
Detection of machines sending E-Mail Spam
Issue Addressed: Developing time-aware relevancy measures
Computation of Metrics as Graphs Evolve
Issue Addressed: Computational Challenges
Ongoing Work

3
Link Analysis Definition

Link analysis is
The study of network (graph) structures that
emerge in many applications
To identify interesting network properties
Which lead to a better understanding of the
impact of connectedness in the application
And enable us to better manage the application

4
Sample Applications

Web graph.
Node: Web page. Link: Hyperlink.
Applications: Web search, Web communities, Web page classification.
Computer networks.
Node: IP, Port, AS. Link: Connection.
Applications: Network intrusion detection based on connections between the source of an attack and its destination.
Citation index of research papers.
Node: Paper. Link: Citation.
Applications: Citation prediction, research communities.
Social and Knowledge networks.
Node: People/Projects. Link: Social connection, shared knowledge.
Applications: Identifying experts, information flow.

5
Link Analysis on the Web

Link Analysis on the Web is used for a wide
variety of purposes, ranging from ranking pages
returned from a web search engine to identifying
Web communities.

6
Google's PageRank
Key idea: Rank of a web page depends on the rank of the web pages pointing to it
7
Hubs and Authorities

Key ideas
Hubs and authorities are fans and centers in
a bipartite core of a web graph
A good hub page is one that points to many good
authority pages
A good authority page is one that is pointed to
by many good hub pages

Link Analysis for Evolving Graphs

9
What do evolving graphs capture?

Graphs that model real-world phenomena change over time as the phenomena they model change.
To help us understand such phenomena better, there is a need to study the evolution of the graphs that represent them.
Typical examples include the Web graph, network connections, social connections, e-mail communications, telephone calls, etc.
Such graphs are large, and it is computationally expensive to measure interesting properties on them.

Goal: To develop computational techniques that allow the evolution of large-scale graphs to be analyzed.
10
Modeling a Phenomenon as Sequence of Graphs

Consider a real world phenomenon that can be
viewed as a set of objects that interact with
each other over time.

Let the set of interactions during the time period from $t_0$ to $t_b$ be captured as a graph at instance $t_b$, denoted by $G_b$.
The real-world phenomenon can thus be captured as a set of graphs, $SG$, ordered sequentially over time as $n$ graphs at $n$ time instances: $SG = \{G_1, G_2, \ldots, G_n\}$
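
The construction of the sequence can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `(time, source, target)` record format and the function name `build_snapshots` are hypothetical, the slide is read literally so that each graph accumulates every interaction from t0 up to its instance tb, and the right end of the period is assumed to be inclusive.

```python
def build_snapshots(interactions, t0, instances):
    """Return one edge set per time instance t_b, in order.

    interactions: list of (time, source, target) tuples (hypothetical format).
    instances:    increasing list of time instances t_1 < t_2 < ... < t_n.
    G_b holds every edge observed at a time t with t0 < t <= t_b (assumed bounds).
    """
    snapshots = []
    for tb in instances:
        edges = {(u, v) for (t, u, v) in interactions if t0 < t <= tb}
        snapshots.append(edges)
    return snapshots


# Toy interaction log: a, b and c exchange messages at times 1, 2, 5 and 9.
log = [(1, "a", "b"), (2, "b", "c"), (5, "a", "c"), (9, "c", "a")]
sg = build_snapshots(log, t0=0, instances=[3, 6, 10])
```

A sliding-window reading, in which each graph holds only the interactions since the previous instance, would change just the lower bound in the comprehension.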
11
Modeling the Change with time

Given $SG = \{G_1, G_2, \ldots, G_n\}$
Let the graph change from to , in a
time period . The change in the interaction
between the objects can be captured by

12
Applications and Challenges

Detection of machines sending E-Mail Spam
Issue Addressed: Developing time-aware relevancy measures
Computation of metrics as Graphs evolve
Issue Addressed: Computational Challenges

13
A. Problems Addressed

Detection of Email Spamming Machines
Determine Anomalous Behavior for a given time
period.
Mine information from evolution of anomalous
behavior of machines.

Primary Technique: Link Analysis
14
Application Link Analysis for Network Security

Network flow data can be modeled as a graph with machines as nodes and connections as edges. Such graphs can be used to:
Identify nodes (machines) and edges (connections) that are anomalous in behavior.
Identify nodes that are likely sources of attack or that are vulnerable over a period of time.
Identify communities of machines involved in normal as well as anomalous connections.
Study the temporal behavior of these graphs to detect sudden changes.

Hence, there is a need to develop models for detecting anomalous behavior and for studying its evolution.
15
A. Motivation

Email: A file transfer from one machine to another, initiated by the sender.
Use: Most popular form of communication.
Abuse: Spam!
Why is Spam a concern?
Annoying mailboxes. More than half the email
received is spam!
High network bandwidth consumption
Machines held hostage and security compromised
Corporate companies spend 10bn/yr fighting spam

16
A. Motivation

Sources of Email spam
An individual user spreading spam.
A machine sending spam on its own.
An outside source using a machine as a relay to
send spam.
Spam can be fought at different levels:

High Privacy Intrusion

Identify email as spam based on content, e.g. spam filters.
Identify individual users responsible for spam.
Identify machines sending spam and used as relays (focus of this work).
17
A. Email Architecture
18
A. Privacy Issues with Email Data
19
A. Related Work for Spam Detection

Collaborative filtering approaches that analyze the content - SpamNet, Cloudmark (San Francisco, CA).
Classification-based approaches that use heuristics or rules - SpamAssassin.
Bayesian approaches to classify e-mails as spam - MSN8.
Behavior-based techniques using user profiles - E-Mail Mining Toolkit.
Detection of spam trojans using behavior-based techniques coupled with signature-based techniques - Sandvine Incorporated.

Signature-based methods and methods that look at user profiles are privacy-intrusive and fail to detect novel attacks.
20
A. Detecting Anomalous Email Behavior

Email servers send and receive mails from each
other and these interactions form a community.
These Email servers can be modeled as Fans and
Centers in a bipartite graph.
Fans are machines that send mails
Centers are machines that receive mails.
Email servers serve as both good fans and
centers.

The behavior of Email servers can thus be
captured by the HITS algorithm.
21
A. HITS Algorithm

HITS Algorithm
Let $a$ be the vector of authority scores and $h$ be the vector of hub scores
$a = [1, 1, \ldots, 1]$, $h = [1, 1, \ldots, 1]$
do
$a \leftarrow A^T h$
$h \leftarrow A a$
Normalize $a$ and $h$
while $a$ and $h$ have not converged (the change is still above a convergence threshold)
return $a$, $h$
The vectors $a$ and $h$ represent the authority and hub weights
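
The iteration above can be sketched in plain Python. This is a minimal illustration rather than the presenter's implementation: the edge-list input, the L2 normalization, and the names `hits`, `normalize`, `tol` and `max_iter` are assumptions made for the example.

```python
import math


def normalize(scores):
    """Scale a score dictionary to unit L2 norm (left unchanged if all zero)."""
    norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
    return {n: s / norm for n, s in scores.items()}


def hits(edges, tol=1e-10, max_iter=1000):
    """edges: iterable of (u, v) pairs meaning u points to v.

    Returns (authority, hub) dictionaries keyed by node.
    """
    edges = list(edges)
    nodes = {n for edge in edges for n in edge}
    auth = dict.fromkeys(nodes, 1.0)
    hub = dict.fromkeys(nodes, 1.0)
    for _ in range(max_iter):
        # a = A^T h: a node's authority sums the hub scores of nodes pointing to it
        new_auth = dict.fromkeys(nodes, 0.0)
        for u, v in edges:
            new_auth[v] += hub[u]
        # h = A a: a node's hub score sums the authority of the nodes it points to
        new_hub = dict.fromkeys(nodes, 0.0)
        for u, v in edges:
            new_hub[u] += new_auth[v]
        new_auth, new_hub = normalize(new_auth), normalize(new_hub)
        change = sum(abs(new_auth[n] - auth[n]) + abs(new_hub[n] - hub[n])
                     for n in nodes)
        auth, hub = new_auth, new_hub
        if change < tol:  # convergence threshold reached
            break
    return auth, hub


# Toy bipartite example: fans f1..f3 send to centers c1 and c2.
toy_edges = [("f1", "c1"), ("f1", "c2"), ("f2", "c1"), ("f2", "c2"), ("f3", "c1")]
authority, hub_score = hits(toy_edges)
```

In the toy graph, c1 is pointed to by every fan and so ends up with the larger authority score, while f3, which reaches only one center, ends up with the smallest non-zero hub score.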

22
A. Identifying spamming machines
23
A. Identifying spamming machines

Sequence of steps
Pre-process the netflow data and construct the graph for e-mail connections.
Graphs can also be constructed for patterns that represent other kinds of services, such as ftp.
A node can be an IP, an AS, a port, or any combination, depending on the problem. We do our analysis at the IP level.
Run the HITS algorithm on the generated graph.
The nodes with the top hub and authority scores represent typical e-mail servers.
Remove the edges from the top k hubs to the top k authorities.
These connections correspond to normal e-mail traffic between regular mail servers that have high hub and authority scores.
Run the HITS algorithm on the resultant graph.
A simple outdegree count also works well on the resultant graph.
The new scores are the Perpetrator Scores.
Spamming machines obtain a high rank compared to other e-mail servers.
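
The steps above can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes that the Perpetrator Score is the hub score from the second HITS run (the slides only say "the new scores"), it uses a fixed number of iterations for brevity, and the names `hits`, `perpetrator_scores` and the toy hosts are invented for the example.

```python
import math


def hits(edges, iters=100):
    """Compact HITS on (u, v) edge pairs; returns (authority, hub) dicts."""
    nodes = {n for edge in edges for n in edge}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iters):
        auth = dict.fromkeys(nodes, 0.0)
        for u, v in edges:
            auth[v] += hub[u]
        hub = dict.fromkeys(nodes, 0.0)
        for u, v in edges:
            hub[u] += auth[v]
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return auth, hub


def perpetrator_scores(edges, k):
    """Remove top-k-hub to top-k-authority edges, then rerun HITS."""
    edges = set(edges)
    auth, hub = hits(edges)
    top_hubs = set(sorted(hub, key=hub.get, reverse=True)[:k])
    top_auths = set(sorted(auth, key=auth.get, reverse=True)[:k])
    # Drop the traffic between regular mail servers.
    residual = {(u, v) for (u, v) in edges
                if not (u in top_hubs and v in top_auths)}
    _, residual_hub = hits(residual)
    # Assumption: the residual hub score plays the role of the Perpetrator Score.
    return {n: residual_hub.get(n, 0.0) for n in hub}


# Toy data: four mail servers exchange mail with each other; host "s" sends
# to one server and to three ordinary hosts; server m1 also mails host x1.
servers = ["m1", "m2", "m3", "m4"]
flows = [(a, b) for a in servers for b in servers if a != b]
flows += [("m1", "x1"), ("s", "m1"), ("s", "x1"), ("s", "x2"), ("s", "x3")]
pscore = perpetrator_scores(flows, k=len(servers))
top_suspect = max(pscore, key=pscore.get)
```

Once the server-to-server core is removed, host `s` dominates the residual graph and receives the highest score in this toy example.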

24
A. Mining Information from Anomalous Rank
Evolution

Link graphs based on email connections change
rapidly across time.
This evolving behavior helps in identifying
Machines that suddenly behave as Email servers
Possible spam (initiates)
Machines that send low volume spam continuously
Email servers going down.

25
A. Metrics of Interest

Perpetrator Rank
For each node, determine its Perpetrator Rank (PR) based on its Perpetrator Score (PScore).
Perpetrator Height
The height is a measure of how far a node is from an infinitely low ranked node. For a node $i$ at time $t$, its height is
$PHeight_{it} = \log_2(1 + 1/PR)$
Rate of change in the rank of a node over time
$v = \Delta PHeight / \Delta t$
Rank Energy
$\text{Rank Energy} = \text{Weight} \times v^2$
The changing behavior of the above metrics is a good way to detect sudden changes in the e-mail behavior of machines.
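
As a small worked sketch of these metrics in Python: the formulas are read as log2(1 + 1/PR) for the height and Weight times v squared for the energy, which is a reconstruction of the garbled slide text, and the slides do not define Weight, so the `weight` argument below is a hypothetical parameter supplied by the caller.

```python
import math


def perpetrator_height(rank):
    """Height of a node with Perpetrator Rank `rank` (1 is the top rank).

    An infinitely low ranked node (rank -> infinity) has height 0.
    """
    return math.log2(1 + 1 / rank)


def height_velocity(rank_before, rank_after, dt):
    """Rate of change of the height over a time step of length dt."""
    return (perpetrator_height(rank_after) - perpetrator_height(rank_before)) / dt


def rank_energy(weight, velocity):
    """Rank energy under the assumed reading Weight * v^2."""
    return weight * velocity ** 2


# A machine that jumps from rank 100 to rank 1 within one time step.
v = height_velocity(rank_before=100, rank_after=1, dt=1.0)
energy = rank_energy(weight=1.0, velocity=v)
```

A machine that suddenly starts behaving like a top-ranked sender therefore shows a large positive velocity and a correspondingly large energy.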

26
A. Experiments and Results

Data Source: Netflow data
Detecting Anomalous Behavior
10 min window from 07:10 to 07:20 hrs on June 17th, 2004
Analyzing Rank Evolution
3 hour window from 7 am to 10 am on July 21st, 2004
Steps Involved
Data preprocessing: graph construction from Netflow data
Running the modified HITS algorithm
Determine Anomalous Score
Determine rate of change to capture Rank Evolution

27
A. Anomalous Behavior Detection
The experiment was run for a 10 min window from 07:10 to 07:20 hrs on the 17th of June, 2004.
At this time, 134.84.S.44 was known to be sending spam. All of the other hosts were known, good email servers that were sending email.
Total Flows: 856470
Email Flows: 10368
Distinct IPs (Total): 228276
Distinct IPs (Email): 1633
Sorted by Perpetrator Score
Sorted by Outdegree

The rank of the spam-sending machine (134.84.S.44) was pushed higher compared to other good e-mail servers.
The ranks of this IP according to authority and indegree scores were 1563/1633 and 1572/1633 respectively. This indicates that the machine was sending spam and was not a good receiver.

28
A. Analysis of Rank Evolution
Figure: Variation of Spam Hub Score over time for IPs (annotation: mail server possibly sending newsletters; panels: Height Metric, Energy Metric)

Machine found to be affected and sending spam
during the time period 7am to 10am on July 21st
2004 in the CS network
Ranked 1 according to the height metric for the
aggregate time period.
Ranked 3 according to the energy metric

29
Applications and Challenges

Detection of machines sending E-Mail Spam
Issue Addressed: Developing time-aware relevancy measures
Computation of metrics as Graphs evolve
Issue Addressed: Computational Challenges

30
B. Computation of Metrics for Evolving Web Graphs

Goal: Model a sequence of evolving graphs, and develop efficient computational models for determining relevance metrics
Dataset Used: Computer Science website, University of Minnesota

31
Computational Challenges

Analysis of the evolution of large-scale graphs gives rise to two kinds of computational challenges:
Computation on a single large-scale graph: parallel computation by graph partitioning.
Computation on a set of evolving graphs: incremental computation on evolving graphs.

32
Efficient Computation of Large Evolving Graphs

Theory
Understanding the popular and base models for computing relevance measures on large graphs.
First-order Markov models - Google's PageRank
Using the properties of the underlying model to partition the graph into graphs of smaller size, thus decreasing the size of the problem.

33
A. Google's PageRank
Key idea: Rank of a web page depends on the rank of the web pages pointing to it
34
A. The PageRank Algorithm

Input
$A$ is the adjacency matrix such that
$A(p,q) = 0$ if there is no directed edge from $p$ to $q$.
$A(p,q) = 1/\mathrm{OutDegree}(p)$ if there is an edge from $p$ to $q$.
The PageRank Algorithm
Set $PR \leftarrow [r_1, r_2, \ldots, r_N]$, where $r_i$ is some initial rank of page $i$, and $N$ is the number of Web pages in the graph
$d \leftarrow 0.15$, $D \leftarrow [1/N, \ldots, 1/N]^T$
do
$PR_{i+1} \leftarrow A^T PR_i$
$PR_{i+1} \leftarrow (1-d)\, PR_{i+1} + d\, D$
$\delta \leftarrow \lVert PR_{i+1} - PR_i \rVert_1$
while $\delta > \epsilon$, where $\epsilon$ is a small number indicating the convergence threshold
return $PR$.
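
A direct transcription of this pseudocode into Python might look like the sketch below. The adjacency-list input format and the names `pagerank`, `graph` and `eps` are assumptions for the example, and the sketch assumes that every page has at least one out-link, since the pseudocode does not say how dangling pages are handled.

```python
def pagerank(graph, d=0.15, eps=1e-10):
    """graph: {page: [pages it links to]}; every linked page must be a key.

    Iterates PR <- (1 - d) * A^T PR + d * D until the L1 change is at most eps.
    """
    nodes = list(graph)
    n = len(nodes)
    pr = dict.fromkeys(nodes, 1.0 / n)      # initial ranks r_i
    while True:
        new = dict.fromkeys(nodes, d / n)   # the d * D term
        for p, successors in graph.items():
            for q in successors:
                # A(p, q) = 1 / OutDegree(p)
                new[q] += (1 - d) * pr[p] / len(successors)
        delta = sum(abs(new[v] - pr[v]) for v in nodes)
        pr = new
        if delta <= eps:
            return pr


# Toy web graph: a links to b and c, b links to c, c links back to a.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(web)
```

In the toy graph, page c collects rank from both a and b and therefore ends up ranked above a, which in turn ranks above b.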

35
A. Theoretical Overview of Partitioning Approach
36
A Theoretical Approach
Vertex on left partition,
Vertex on the border of left partition from which
there are outgoing edges to the right partition
Vertex on right partition,
37
Computing for different partitions
38
Incremental PR Computation on Evolving Graphs

Motivation
Large graphs that represent various relationship phenomena, such as the World Wide Web, evolve over time.
The portion of the graph that changes over time is very small compared to the size of the whole graph.
Computing metrics based on first-order Markov models for such large graphs at each time instance is expensive.
Problem
Given snapshots of an evolving graph at two consecutive time instances, G1 and G2, compute the PageRank of the graph at the second time instance, G2, in a cost-effective manner.

39
Overall Approach
Principal idea: The PageRank of a page depends only on the pages that point to it and is independent of the pages it points to.
The PageRank of a node remains unaffected by any set of nodes from which there is no directed path to the given node.
40
Methodology

Methodology
Detect the changed portion of the graph.
Partition the graph into a scalable computation partition (P) and a non-scalable computation partition (Q), such that there are no incoming links from partition Q to partition P.
Compute PageRank for the non-scalable partition and scale the PageRanks of the scalable partition.
Merge the rankings of the two independent partitions.
Scaling is done with respect to the number of vertices in partition P, n(P), relative to the total number of nodes in the whole graph G, n(P ∪ Q) = |V|.
The PageRank values of partition P are obtained by simple scaling, with scaling factor n(G1)/n(G2).
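
The methodology can be illustrated with the sketch below, under several assumptions that are not spelled out on the slides: a node counts as "changed" if it is new, deleted, or has a different set of out-links; Q is taken to be the changed nodes plus everything reachable from them in either snapshot; and PageRank uses the un-normalized form PR = (1 - d) A^T PR + d/N with no special handling of dangling pages. The function and variable names are hypothetical.

```python
from collections import deque


def pagerank(graph, d=0.15, eps=1e-12, fixed=None):
    """Power iteration; nodes listed in `fixed` keep the rank given there."""
    fixed = dict(fixed or {})
    n = len(graph)
    free = [v for v in graph if v not in fixed]
    pr = {v: fixed.get(v, 1.0 / n) for v in graph}
    preds = {v: [] for v in free}           # in-links of the nodes being solved
    for p, successors in graph.items():
        for q in successors:
            if q in preds:
                preds[q].append(p)
    while True:
        new = {v: d / n + (1 - d) * sum(pr[p] / len(graph[p]) for p in preds[v])
               for v in free}
        delta = sum(abs(new[v] - pr[v]) for v in free)
        pr.update(new)
        if delta < eps:
            return pr


def incremental_pagerank(g1, g2, pr1, d=0.15):
    """Reuse PageRank of snapshot g1 to compute PageRank of snapshot g2."""
    # Step 1: detect changed nodes (new, deleted, or different out-links).
    changed = {v for v in set(g1) | set(g2)
               if (v in g1) != (v in g2)
               or set(g1.get(v, ())) != set(g2.get(v, ()))}
    # Step 2: Q = changed nodes plus all nodes reachable from them.
    q_part, frontier = set(changed), deque(changed)
    while frontier:
        v = frontier.popleft()
        for w in set(g1.get(v, ())) | set(g2.get(v, ())):
            if w not in q_part:
                q_part.add(w)
                frontier.append(w)
    # Step 3: scale partition P by n(G1)/n(G2); iterate only over Q.
    scale = len(g1) / len(g2)
    scaled_p = {v: pr1[v] * scale for v in g2 if v not in q_part}
    # Step 4: merged ranking for the whole of g2.
    return pagerank(g2, d=d, fixed=scaled_p), {v for v in q_part if v in g2}


g1 = {"a": ["b"], "b": ["a", "c"], "c": ["d"], "d": ["c"]}
g2 = {"a": ["b"], "b": ["a", "c"], "c": ["d", "e"], "d": ["c"], "e": ["d"]}
incremental, q_nodes = incremental_pagerank(g1, g2, pagerank(g1))
full = pagerank(g2)
```

In this toy pair of snapshots only c, d and e are recomputed; the ranks of a and b are obtained by scaling alone and agree with a full recomputation on the second snapshot.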

Figure Incremental Computation of PageRank
41
Results and Discussion
Experimental Results

Conclusions
Significant speed-up in the computation of PageRank using our approach.
Effective partitioning technique for incremental computation of any metric based on a first-order Markov model.

42
Ongoing Work

Developing a parallel computation methodology using graph partitioning for measures such as PageRank.
This is not an approximation method.
The principal idea is to partition the graph into subgraphs such that the computation on each partition can be done independently, reducing the overall computation cost.

43
Questions
44
Biography

Research Interests
Data Mining , Web Mining, Link Analysis
Techniques and Applications.
Education
Ph.D. in Computer Science, Univ. of Minnesota,
Minneapolis (expected 2006).
M.S. in Computer Science, Univ. of Minnesota,
Minneapolis (2004).
B.E. in Mechanical Engineering, Osmania
University, Hyderabad, India (1998).
Related Publications
D. Padmanabhan, P. Desikan, J. Srivastava, K. Riaz, "WICER: A Weighted Inter-Cluster Edge Ranking for Clustered Graphs," IEEE/ACM Conference on Web Intelligence, September 2005.
P. Desikan, N. Pathak, J. Srivastava and V. Kumar, "Incremental PageRank Computation on Evolving Graphs," Poster Paper at the 14th International World Wide Web Conference, May 10-14, 2005, Chiba, Japan.
P. Desikan and J. Srivastava, "Analyzing Network Traffic to Detect E-Mail Spamming Machines," ICDM Workshop on Privacy and Security Aspects of Data Mining, Brighton, UK, November 2004.
P. Desikan and J. Srivastava, "Mining Temporally Evolving Graphs," WebKDD Workshop on Knowledge Discovery in the Web, Seattle, August 2004.
J. Srivastava, P. Desikan and V. Kumar, "Web Mining: Concepts, Applications and Research Directions," Book Chapter in Data Mining: Next Generation Challenges and Future Directions, MIT/AAAI 2004.
P. Desikan, J. Srivastava, V. Kumar, and P. N. Tan, "Hyperlink Analysis: Techniques and Applications," AHPCRC Technical Report TR-2002-0152.
More Information
E-Mail: desikanp_at_uwstout.edu, desikan_at_cs.umn.edu
URL: http://www-users.cs.umn.edu/desikan/