Title: Link Analysis of Evolving Graphs
1Link Analysis of Evolving Graphs
- Prasanna Desikan
- Department of Mathematics, Statistics and Computer Science, University of Wisconsin-Stout
- desikanp_at_uwstout.edu
Joint work with advisor Prof. Jaideep Srivastava, Department of Computer Science, University of Minnesota
2Outline
- Definition of Link Analysis
- Link Analysis applied on the Web
- Evolving graphs
- Motivation
- Modeling evolving graphs
- Applications and Challenges
- Detection of machines sending e-mail spam
- Issue addressed: developing time-aware relevancy measures
- Computation of metrics as graphs evolve
- Issue addressed: computational challenges
- Ongoing Work
3Link Analysis Definition
- Link analysis is
- The study of network (graph) structures that emerge in many applications
- To identify interesting network properties
- Which lead to a better understanding of the impact of connectedness in the application
- And enable us to better manage the application
4Sample Applications
- Web graph.
- Node: Web page. Link: hyperlink.
- Applications: Web search, Web communities, Web page classification.
- Computer networks.
- Node: IP, port, AS. Link: connection.
- Applications: network intrusion detection based on connections between the source of an attack and its destination.
- Citation index of research papers.
- Node: paper. Link: citation.
- Applications: citation prediction, research communities.
- Social and knowledge networks.
- Node: people/projects. Link: social connection, shared knowledge.
- Applications: identifying experts, information flow.
5Link Analysis on the Web
- Link Analysis on the Web is used for a wide
variety of purposes, ranging from ranking pages
returned from a web search engine to identifying
Web communities.
6Google's PageRank
Key idea: The rank of a web page depends on the ranks of the web pages pointing to it.
7Hubs and Authorities
- Key ideas
- Hubs and authorities are fans and centers in a bipartite core of a web graph
- A good hub page is one that points to many good authority pages
- A good authority page is one that is pointed to by many good hub pages
8Link Analysis for Evolving Graphs
9What do evolving graphs capture?
- Graphs that model real-world phenomena change over time as the phenomena they model change.
- To help us understand such phenomena better, there is a need to study the evolution of the graphs that represent them.
- Typical examples include the Web graph, network connections, social connections, e-mail communications, telephone calls, etc.
- Such graphs are large, and it is computationally expensive to measure interesting properties on them.
Goal: To develop computational techniques that allow for analyzing the evolution of large-scale graphs.
10Modeling a Phenomenon as Sequence of Graphs
- Consider a real-world phenomenon that can be viewed as a set of objects that interact with each other over time.
Let the set of interactions during the time period from t0 to tb be captured as a graph at instance tb, denoted by Gb.
The real-world phenomenon can thus be captured as a set of graphs, SG, ordered sequentially over time as n graphs at n time instances: SG = {G1, G2, ..., Gn}
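A minimal sketch of this modeling step, assuming interactions arrive as hypothetical (time, source, target) tuples and that the graph Gb holds every interaction with t0 < time <= tb. The half-open interval and the function name `build_graph_sequence` are assumptions, not taken from the slides.

```python
def build_graph_sequence(interactions, t0, instants):
    """Return one edge set per time instance tb in `instants`.

    Graph Gb contains every (source, target) interaction whose
    timestamp falls in the assumed interval t0 < time <= tb.
    """
    graphs = []
    for tb in instants:
        edges = {(src, dst) for time, src, dst in interactions if t0 < time <= tb}
        graphs.append(edges)
    return graphs


# Illustrative data only: four interactions, two time instances.
interactions = [(1, "a", "b"), (2, "b", "c"), (6, "a", "c"), (7, "c", "a")]
sg = build_graph_sequence(interactions, t0=0, instants=[5, 10])
```

If each graph should instead cover only the period since the previous instance, the lower bound t0 would be replaced by the preceding tb.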
11Modeling the Change with time
- Given SG = {G1, G2, ..., Gn}
- Let the graph change from to , in a
time period . The change in the interaction
between the objects can be captured by
12Applications and Challenges
- Detection of machines sending e-mail spam
- Issue addressed: developing time-aware relevancy measures
- Computation of metrics as graphs evolve
- Issue addressed: computational challenges
13A. Problems Addressed
- Detection of e-mail spamming machines
- Determine anomalous behavior for a given time period.
- Mine information from the evolution of anomalous behavior of machines.
Primary technique: link analysis
14Application Link Analysis for Network Security
- Network flow data can be modeled as a graph with machines as nodes and connections as edges. Such graphs can be used to
- Identify nodes (machines) and edges (connections) that are anomalous in behavior.
- Identify nodes that are likely sources of attack or are vulnerable over a period of time.
- Identify communities of machines involved in normal as well as anomalous connections.
- Study the temporal behavior of these graphs to detect sudden changes.
Hence, there is a need to develop models for detecting anomalous behavior and studying its evolution.
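As a rough illustration of modeling flow data as a graph, the sketch below turns hypothetical flow records of the form (source, destination, destination port) into directed edges. Selecting e-mail traffic by the standard SMTP port 25 is an assumption about the preprocessing, and the record layout and addresses are made up for the example.

```python
from collections import Counter

SMTP_PORT = 25  # standard SMTP port; filtering on it is an assumption of this sketch


def email_graph(flows):
    """Return the set of directed (sender, receiver) edges for e-mail flows."""
    return {(src, dst) for src, dst, dst_port in flows if dst_port == SMTP_PORT}


def out_degree(edges):
    """Count the distinct machines each sender connects to."""
    return Counter(src for src, _ in edges)


# Illustrative flow records only.
flows = [
    ("10.0.0.1", "10.0.0.2", 25),
    ("10.0.0.1", "10.0.0.3", 25),
    ("10.0.0.2", "10.0.0.1", 80),
    ("10.0.0.3", "10.0.0.1", 25),
]
edges = email_graph(flows)
degrees = out_degree(edges)
```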
15A. Motivation
- Email: a file transfer from one machine to another, initiated by the sender.
- Use: most popular form of communication.
- Abuse: spam!
- Why is spam a concern?
- Annoying mailboxes. More than half the email received is spam!
- High network bandwidth consumption
- Machines held hostage and security compromised
- Corporate companies spend 10bn/yr fighting spam
16A. Motivation
- Sources of email spam
- An individual user spreading spam.
- A machine sending spam on its own.
- An outside source using a machine as a relay to send spam.
- Spam can be fought at different levels
High Privacy Intrusion
- Identify email as spam based on content, e.g. spam filters
- Identify individual users responsible for spam.
- Identify machines sending spam and used as relays (focus of this work)
17A. Email Architecture
18A. Privacy Issues with Email Data
19A. Related Work for Spam Detection
- Collaborative filtering approaches developed by analyzing the content: SpamNet, Cloudmark (San Francisco, CA)
- Classification-based approaches that use heuristics or rules: SpamAssassin.
- Bayesian approaches to classify e-mails as spam: MSN8
- Behavior-based techniques using user profiles: E-Mail Mining Toolkit.
- Detection of spam trojans using behavior-based techniques coupled with signature-based techniques: Sandvine Incorporated
Signature-based methods and methods that look at user profiles intrude on privacy and fail to detect novel attacks.
20A. Detecting Anomalous Email Behavior
- Email servers send and receive mail from each other, and these interactions form a community.
- These email servers can be modeled as fans and centers in a bipartite graph.
- Fans are machines that send mail.
- Centers are machines that receive mail.
- Email servers serve as both good fans and good centers.
The behavior of email servers can thus be captured by the HITS algorithm.
21A. HITS Algorithm
- HITS Algorithm
- Let a be the vector of authority scores and h be the vector of hub scores, and let A be the adjacency matrix of the graph
- a = [1, 1, ..., 1], h = [1, 1, ..., 1]
- do
- a = A^T h
- h = A a
- Normalize a and h
- while a and h do not converge (reach a convergence threshold)
- a* = a
- h* = h
- return a*, h*
- The vectors a* and h* represent the authority and hub weights
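The iteration above can be sketched in plain Python as follows. This is a minimal version under stated assumptions: the graph is given as a node list and a list of directed (source, target) edges, vectors are normalized with the L2 norm, and the tolerance `tol` and iteration cap `max_iter` are illustrative choices rather than values from the slides.

```python
import math


def hits(nodes, edges, tol=1e-10, max_iter=1000):
    """Iterate a = A^T h, h = A a with L2 normalization until both settle."""
    auth = {v: 1.0 for v in nodes}
    hub = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        # a = A^T h: a node's authority is the sum of hub scores pointing to it.
        new_auth = {v: 0.0 for v in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # h = A a: a node's hub score is the sum of authorities it points to.
        new_hub = {v: 0.0 for v in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        for vec in (new_auth, new_hub):
            norm = math.sqrt(sum(x * x for x in vec.values())) or 1.0
            for v in vec:
                vec[v] /= norm
        change = sum(abs(new_auth[v] - auth[v]) + abs(new_hub[v] - hub[v]) for v in nodes)
        auth, hub = new_auth, new_hub
        if change < tol:
            break
    return auth, hub


# Illustrative graph: m1 sends to m2 and m3, m2 sends to m3.
nodes = ["m1", "m2", "m3"]
edges = [("m1", "m2"), ("m1", "m3"), ("m2", "m3")]
auth, hub = hits(nodes, edges)
```

In this small example m1 ends up as the strongest hub (fan) and m3 as the strongest authority (center).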
22A. Identifying spamming machines
23A. Identifying spamming machines
- Sequence of steps
- Pre-process the Netflow data and construct the graph for e-mail connections.
- Graphs can be constructed for patterns that represent other kinds of services, such as FTP.
- A node can be an IP, an AS, a port, or any combination, depending on the problem. We do our analysis at the IP level.
- Perform the HITS algorithm on the generated graph.
- The nodes with top hub and authority scores represent typical e-mail servers.
- Remove edges from the top k hubs to the top k authorities.
- These top k connections correspond to normal e-mail traffic between regular mail servers that have high hub and authority scores.
- Perform the HITS algorithm on the resultant graph.
- A simple outdegree also works fine on the resultant graph.
- The new scores are the Perpetrator Scores.
- Spamming machines obtain a high rank compared to other e-mail servers.
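The sequence of steps can be sketched as below. This is an illustration under assumptions, not the authors' implementation: the Perpetrator Score is taken to be the hub score from the second HITS run, the helper names (`hits`, `top_k`, `perpetrator_scores`) are hypothetical, and the toy graph (four mail servers s1..s4, a spammer x, recipients c1..c3) is invented for the example.

```python
import math


def hits(nodes, edges, tol=1e-10, max_iter=1000):
    """Plain HITS with L2 normalization; returns (authority, hub) dicts."""
    auth = {v: 1.0 for v in nodes}
    hub = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        new_auth = {v: 0.0 for v in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        new_hub = {v: 0.0 for v in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        for vec in (new_auth, new_hub):
            norm = math.sqrt(sum(x * x for x in vec.values())) or 1.0
            for v in vec:
                vec[v] /= norm
        change = sum(abs(new_auth[v] - auth[v]) + abs(new_hub[v] - hub[v]) for v in nodes)
        auth, hub = new_auth, new_hub
        if change < tol:
            break
    return auth, hub


def top_k(scores, k):
    """Return the k highest-scoring nodes as a set."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def perpetrator_scores(nodes, edges, k):
    """Run HITS, drop edges from top-k hubs to top-k authorities, run HITS again."""
    auth, hub = hits(nodes, edges)
    top_hubs, top_auths = top_k(hub, k), top_k(auth, k)
    kept = [(s, d) for s, d in edges if not (s in top_hubs and d in top_auths)]
    _, new_hub = hits(nodes, kept)  # assumption: the second-run hub score is the score
    return new_hub


servers = ["s1", "s2", "s3", "s4"]
nodes = servers + ["x", "c1", "c2", "c3"]
edges = [(a, b) for a in servers for b in servers if a != b]      # normal server traffic
edges += [("x", "c1"), ("x", "c2"), ("x", "c3"), ("c1", "s1")]    # spammer and one client
scores = perpetrator_scores(nodes, edges, k=4)
```

With the server-to-server edges removed, the spammer x dominates the second run, which mirrors the claim that spamming machines obtain a high rank.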
24A. Mining Information from Anomalous Rank
Evolution
- Link graphs based on email connections change rapidly over time.
- This evolving behavior helps in identifying
- Machines that suddenly behave as email servers
- Possible spam (initiates)
- Machines that send low-volume spam continuously
- Email servers going down.
25A. Metrics of Interest
- Perpetrator Rank
- For each node, determine its Perpetrator Rank (PR) based on its Perpetrator Score (PScore).
- Perpetrator Height
- The height is a measure of how far a node is from an infinitely low-ranked node. For a node i at time t, its height is
- PHeight_i^t = log2(1 + 1/PR)
- Rate of change in the rank of a node over time
- v = (Δ PHeight) / (Δ t)
- Rank Energy
- Rank Energy = Weight × v^2
- The changing behavior of the above metrics is a good way to detect sudden changes in the e-mail behavior of machines.
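A small sketch of these metrics, assuming the height formula reads log2(1 + 1/PR), so that an infinitely low-ranked node has height 0 and the top-ranked node has height 1, and assuming Rank Energy is the product Weight × v^2. The slides do not define Weight, so it is left as a plain parameter here, and the function names are hypothetical.

```python
import math


def perpetrator_height(rank):
    """Height of a node with Perpetrator Rank `rank` (1 = top rank)."""
    return math.log2(1 + 1 / rank)


def rank_velocity(height_before, height_after, dt):
    """Rate of change of height over a time step dt."""
    return (height_after - height_before) / dt


def rank_energy(weight, velocity):
    """Rank energy as weight times squared velocity (assumed form)."""
    return weight * velocity ** 2


# Illustrative values: a machine jumps from rank 100 to rank 1 in 10 minutes.
h_before = perpetrator_height(100)
h_after = perpetrator_height(1)
v = rank_velocity(h_before, h_after, dt=10.0)
e = rank_energy(1.0, v)
```

A sudden jump toward the top of the ranking gives a large positive velocity, and the squared term makes the energy sensitive to abrupt changes in either direction.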
26A. Experiments and Results
- Data source: Netflow data
- Detecting anomalous behavior
- 10-minute window from 0710 to 0720 hrs on June 17th, 2004
- Analyzing rank evolution
- 3-hour window from 7 am to 10 am on July 21st, 2004
- Steps involved
- Data preprocessing: graph construction from Netflow data
- Running the modified HITS algorithm
- Determine anomalous score
- Determine rate of change to capture rank evolution
27A. Anomalous Behavior Detection
The experiment was run for a 10-minute window from 0710 to 0720 hrs on the 17th of June, 2004.
At this time, 134.84.S.44 was known to be sending spam. All of the other hosts were known, good email servers that were sending email.
Total flows: 856470
Email flows: 10368
Distinct IPs (total): 228276
Distinct IPs (email): 1633
Sorted by Perpetrator Score
Sorted by Outdegree
- The rank of the spam-sending machine (134.84.S.44) was pushed higher compared to other good e-mail servers.
- The ranks of the IP according to authority and indegree scores were 1563/1633 and 1572/1633, respectively. This indicates that the machine was sending spam and was not a good receiver.
28A. Analysis of Rank Evolution
Figure: Variation of spam hub score over time for IPs (annotation: mail server possibly sending newsletters)
Figure panels: Height Metric, Energy Metric
- Machine found to be affected and sending spam during the time period 7 am to 10 am on July 21st, 2004 in the CS network
- Ranked 1 according to the height metric for the aggregate time period.
- Ranked 3 according to the energy metric
29Applications and Challenges
- Detection of machines sending e-mail spam
- Issue addressed: developing time-aware relevancy measures
- Computation of metrics as graphs evolve
- Issue addressed: computational challenges
30B. Computation of Metrics for Evolving Web Graphs
- Goal: Model a sequence of evolving graphs, and develop efficient computational models for determining relevance metrics
- Dataset used
- Computer Science website, University of Minnesota
31Computational Challenges
- Analysis of the evolution of large-scale graphs gives rise to two kinds of computational challenges
- Computation on a single large-scale graph.
- Parallel computation by graph partitioning.
- Computation on a set of evolving graphs
- Incremental computation on evolving graphs.
32Efficient Computation of Large Evolving Graphs
- Theory
- Understanding the popular and base models for computing relevance measures on large graphs.
- First-order Markov models: Google's PageRank
- Using the properties of the underlying model to partition the graph into graphs of smaller size, thus decreasing the size of the problem.
33A. Google's PageRank
Key idea: The rank of a web page depends on the ranks of the web pages pointing to it.
34A. The PageRank Algorithm
- Input
- A is the adjacency matrix such that
- A(p,q) = 0 if there is no directed edge from p to q.
- A(p,q) = 1/OutDegree(p) if there is an edge from p to q
- The PageRank Algorithm
- Set PR ← [r1, r2, ..., rN], where ri is some initial rank of page i, and N is the number of Web pages in the graph
- d ← 0.15; D ← [1/N, ..., 1/N]^T
- do
- PR_{i+1} ← A^T PR_i
- PR_{i+1} ← (1-d) PR_{i+1} + d D
- δ ← ||PR_{i+1} - PR_i||_1
- while δ > ε, where ε is a small number indicating the convergence threshold
- return PR.
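The iteration above can be sketched in plain Python as follows, assuming the graph is given as a node list plus directed (source, target) edges and that every node has at least one outgoing edge, since the pseudocode does not treat dangling nodes. The threshold `eps`, the cap `max_iter`, and the toy graph are illustrative choices.

```python
def pagerank(nodes, edges, d=0.15, eps=1e-12, max_iter=1000):
    """Power iteration: PR <- (1 - d) * A^T PR + d * D, with D uniform (1/N)."""
    n = len(nodes)
    out_degree = {v: 0 for v in nodes}
    for src, _ in edges:
        out_degree[src] += 1
    pr = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        nxt = {v: d / n for v in nodes}
        for src, dst in edges:
            # A(src, dst) = 1 / OutDegree(src)
            nxt[dst] += (1 - d) * pr[src] / out_degree[src]
        delta = sum(abs(nxt[v] - pr[v]) for v in nodes)  # L1 change between iterates
        pr = nxt
        if delta <= eps:
            break
    return pr


# Illustrative graph: a -> b, a -> c, b -> c, c -> a.
nodes = ["a", "b", "c"]
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
pr = pagerank(nodes, edges)
```

Here c collects rank from both a and b, so it ends up ranked highest, followed by a and then b.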
35A. Theoretical Overview of Partitioning Approach
36A Theoretical Approach
Figure legend: vertex on the left partition; vertex on the border of the left partition from which there are outgoing edges to the right partition; vertex on the right partition.
37Computing for different partitions
38Incremental PR Computation on Evolving Graphs
- Motivation
- Large graphs that represent various relationship phenomena, such as the World Wide Web, evolve over time.
- The portion of the graph that changes over time is very small compared to the size of the whole graph.
- Computing metrics based on first-order Markov models for such large graphs at each time instance is expensive.
- Problem
- Given snapshots of an evolving graph at two consecutive time instances, G1 and G2, compute the PageRank of the graph at the second time instance, G2, in a cost-effective manner.
39Overall Approach
Principal idea: The PageRank of a page depends only on the pages that point to it and is independent of the pages it points to.
The PageRank of a node remains unaffected by a set of nodes from which there is no directed path to the given node.
40Methodology
- Methodology
- Detect the changed portion of the graph.
- Partition the graph into a scalable computation partition (P) and a non-scalable computation partition (Q), such that there are no incoming links from partition Q to partition P.
- Compute PageRank for the non-scalable partition and scale the PageRanks of the scalable partition.
- Merge the rankings of the two independent partitions.
- Scaling is done with respect to the number of vertices in partition P, n(P), relative to the total number of nodes in the whole graph G, n(P ∪ Q) = V.
- The PageRank values of partition P are obtained by simple scaling, with scaling factor n(G1)/n(G2).
Figure: Incremental Computation of PageRank
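The partition-and-scale idea can be illustrated with the sketch below. It is a simplified reading, not the authors' implementation: changed vertices are assumed to be new vertices plus the endpoints of added or removed edges, Q is everything reachable from them in G2, and P is the rest. Only the scaling of P is shown; the PageRank of Q is not computed incrementally here, and a full recomputation on G2 is used purely as a check. The scaling is exact for the unnormalized iteration used in this sketch and may not hold exactly if dangling-node rank is redistributed.

```python
from collections import deque


def pagerank(nodes, edges, d=0.15, eps=1e-12, max_iter=1000):
    """Unnormalized power iteration with uniform teleport d / N."""
    n = len(nodes)
    out_degree = {v: 0 for v in nodes}
    for src, _ in edges:
        out_degree[src] += 1
    pr = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        nxt = {v: d / n for v in nodes}
        for src, dst in edges:
            nxt[dst] += (1 - d) * pr[src] / out_degree[src]
        delta = sum(abs(nxt[v] - pr[v]) for v in nodes)
        pr = nxt
        if delta <= eps:
            break
    return pr


def split_partitions(nodes1, edges1, nodes2, edges2):
    """Return (P, Q): Q is reachable from changed vertices in G2, P is the rest."""
    changed = set(nodes2) - set(nodes1)
    for u, v in set(edges1) ^ set(edges2):      # added or removed edges
        changed.update(x for x in (u, v) if x in nodes2)
    successors = {u: [] for u in nodes2}
    for u, v in edges2:
        successors[u].append(v)
    q, queue = set(changed), deque(changed)
    while queue:                                 # breadth-first reachability
        u = queue.popleft()
        for v in successors[u]:
            if v not in q:
                q.add(v)
                queue.append(v)
    return set(nodes2) - q, q


nodes1 = {"a", "b", "c", "d"}
edges1 = {("a", "b"), ("b", "a"), ("b", "c"), ("c", "d"), ("d", "c")}
nodes2 = nodes1 | {"e"}
edges2 = edges1 | {("d", "e")}

p, q = split_partitions(nodes1, edges1, nodes2, edges2)
old_pr = pagerank(nodes1, edges1)
scaled = {v: old_pr[v] * len(nodes1) / len(nodes2) for v in p}  # factor n(G1)/n(G2)
full = pagerank(nodes2, edges2)                                  # check only
```

In this example P = {a, b}, and the scaled values for a and b agree with the full recomputation on G2 up to the convergence tolerance.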
41Results and Discussion
Experimental Results
- Conclusions
- Significant speed-up in the computation of PageRank using our approach
- Effective partitioning technique for incremental computation of any metric based on a first-order Markov model.
42Ongoing Work
- Developing a parallel computation methodology using graph partitioning for measures such as PageRank
- This is not an approximation method.
- The principal idea is to partition the graph into subgraphs such that computation on each partition can be done independently, reducing the overall computation cost.
43Questions
44Biography
- Research Interests
- Data mining, Web mining, link analysis techniques and applications.
- Education
- Ph.D. in Computer Science, Univ. of Minnesota, Minneapolis (expected 2006).
- M.S. in Computer Science, Univ. of Minnesota, Minneapolis (2004).
- B.E. in Mechanical Engineering, Osmania University, Hyderabad, India (1998).
- Related Publications
- D. Padmanabhan, P. Desikan, J. Srivastava, K. Riaz, WICER: A Weighted Inter-Cluster Edge Ranking for Clustered Graphs, IEEE/ACM Conference on Web Intelligence, September 2005.
- P. Desikan, N. Pathak, J. Srivastava and V. Kumar, Incremental PageRank Computation on Evolving Graphs, Poster Paper at 14th International World Wide Web Conference, May 10-14, 2005, Chiba, Japan.
- P. Desikan and J. Srivastava, Analyzing Network Traffic to Detect E-Mail Spamming Machines, ICDM Workshop on Privacy and Security Aspects of Data Mining, Brighton, UK, November 2004.
- P. Desikan and J. Srivastava, Mining Temporally Evolving Graphs, WebKDD Workshop on Knowledge Discovery in the Web, Seattle, August 2004.
- J. Srivastava, P. Desikan and V. Kumar, Web Mining: Concepts, Applications and Research Directions, Book Chapter in Data Mining: Next Generation Challenges and Future Directions, MIT/AAAI 2004.
- P. Desikan, J. Srivastava, V. Kumar, and P. N. Tan, Hyperlink Analysis: Techniques and Applications, AHPCRC Technical Report TR-2002-0152.
- More Information
- E-Mail: desikanp_at_uwstout.edu, desikan_at_cs.umn.edu
- URL: http://www-users.cs.umn.edu/desikan/
45Thank You !