Link Analysis of Evolving Graphs - PowerPoint PPT Presentation


Title: Link Analysis of Evolving Graphs


1
Link Analysis of Evolving Graphs
  • Prasanna Desikan
  • Department of Mathematics, Statistics and
    Computer Science
  • University of Wisconsin-Stout.
  • desikanp_at_uwstout.edu

Joint work with Advisor Prof. Jaideep
Srivastava, Department of Computer Science,
University of Minnesota
2
Outline
  • Definition of Link Analysis
  • Link Analysis applied on the Web
  • Evolving graphs
  • Motivation
  • Modeling evolving graphs
  • Applications and Challenges
  • Detection of machines sending E-Mail Spam
  • Issue addressed: developing time-aware relevancy
    measures
  • Computation of Metrics as Graphs Evolve
  • Issue addressed: computational challenges
  • Ongoing Work

3
Link Analysis Definition
  • Link analysis is
  • The study of network (graph) structures that
    emerge in many applications
  • To identify interesting network properties
  • Which lead to a better understanding of the
    impact of connectedness in the application
  • And enable us to better manage the application

4
Sample Applications
  • Web graph.
  • Node: Web page. Link: Hyperlink.
  • Applications: Web search, Web communities, Web
    page classification.
  • Computer networks.
  • Node: IP, port, AS. Link: Connection.
  • Applications: Network intrusion detection based
    on connections between the source of an attack and
    its destination.
  • Citation index of research papers.
  • Node: Paper. Link: Citation.
  • Applications: Citation prediction, research
    communities.
  • Social and knowledge networks.
  • Node: People/Projects. Link: Social connection,
    shared knowledge.
  • Applications: Identifying experts, information
    flow.

5
Link Analysis on the Web
  • Link Analysis on the Web is used for a wide
    variety of purposes, ranging from ranking pages
    returned from a web search engine to identifying
    Web communities.

6
Google's PageRank
Key idea: The rank of a web page depends on the ranks
of the web pages pointing to it.
7
Hubs and Authorities
  • Key ideas
  • Hubs and authorities are fans and centers in
    a bipartite core of a web graph
  • A good hub page is one that points to many good
    authority pages
  • A good authority page is one that is pointed to
    by many good hub pages

8
  • Link Analysis for Evolving Graphs

9
What do evolving graphs capture?
  • Graphs that model real world phenomena change
    over time as the phenomena they model change.
  • To help us understand such phenomena better,
    there is a need to study the evolution of the
    graphs that represent them.
  • Typical examples include the Web graph, network
    connections, social connections, e-mail
    communications, telephone calls, etc.
  • Such graphs are large, and it is computationally
    expensive to measure interesting properties on
    them.

Goal: To develop computational techniques that
allow for analyzing the evolution of large scale
graphs.
10
Modeling a Phenomenon as Sequence of Graphs
  • Consider a real world phenomenon that can be
    viewed as a set of objects that interact with
    each other over time.

Let the set of interactions during a time period
$(t_0, t_b]$ be captured as a graph at instance $t_b$,
denoted by $G_b$.
The real world phenomenon can thus be captured as
a set of graphs $S_G$, ordered sequentially over time
as $n$ graphs at $n$ time instances:
$S_G = \{G_1, G_2, \ldots, G_n\}$
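A minimal Python sketch of this model, under stated assumptions: interactions are hypothetical (source, destination, timestamp) triples, each graph is stored as a plain set of directed edges, and each period is read as half-open, so that graph b holds every interaction with t0 < t <= tb. The function name and the sample data are illustrative only and do not come from the slides.

```python
def build_graph_sequence(interactions, instants, t0=0.0):
    """Return one edge set per time instance in `instants`.

    Assumption: the graph at instance tb contains every interaction
    whose timestamp t satisfies t0 < t <= tb.
    """
    graphs = []
    for tb in instants:
        edges = {(src, dst) for src, dst, t in interactions if t0 < t <= tb}
        graphs.append(edges)
    return graphs


# Hypothetical interactions: (source, destination, timestamp)
interactions = [("a", "b", 1.0), ("b", "c", 2.5), ("a", "c", 4.0)]
S_G = build_graph_sequence(interactions, instants=[2.0, 3.0, 5.0])
for b, edges in enumerate(S_G, start=1):
    print(f"G{b}:", sorted(edges))
```

Storing each snapshot as an edge set keeps later comparisons between consecutive graphs down to ordinary set operations.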
11
Modeling the Change with time
  • Given $S_G = \{G_1, G_2, \ldots, G_n\}$
  • Let the graph change from to , in a
    time period . The change in the interaction
    between the objects can be captured by

12
Applications and Challenges
  • Detection of machines sending E-Mail Spam
  • Issue addressed: developing time-aware relevancy
    measures
  • Computation of metrics as graphs evolve
  • Issue addressed: computational challenges

13
A. Problems Addressed
  • Detection of Email Spamming Machines
  • Determine Anomalous Behavior for a given time
    period.
  • Mine information from evolution of anomalous
    behavior of machines.

Primary technique: Link analysis
14
Application: Link Analysis for Network Security
  • Network flow data can be modeled as a graph with
    machines as nodes and connections as edges. Such
    graphs can be used to
  • Identify nodes (machines) and edges (connections)
    that are anomalous in behavior.
  • Identify nodes that are likely sources of attack
    or are vulnerable over a period of time.
  • Identify communities of machines involved in
    normal as well as anomalous connections.
  • Study the temporal behavior of these graphs to
    detect sudden changes.

Hence, there is a need to develop models for
detecting anomalous behavior and studying its
evolution.
15
A. Motivation
  • Email: A file transfer from one machine to
    another, initiated by the sender.
  • Use: The most popular form of communication.
  • Abuse: Spam!
  • Why is spam a concern?
  • Annoyance in mailboxes: more than half the email
    received is spam!
  • High network bandwidth consumption
  • Machines held hostage and security compromised
  • Companies spend 10bn/yr fighting spam

16
A. Motivation
  • Sources of Email spam
  • An individual user spreading spam.
  • A machine sending spam on its own.
  • An outside source using a machine as a relay to
    send spam.
  • Spam can be fought at different levels

High Privacy Intrusion
  • Identify email as spam based on content. E.g.
    Spam Filters
  • Identify individual users responsible for spam.
  • Identify machines sending spam and used as relays

Focus of this work
17
A. Email Architecture
18
A. Privacy Issues with Email Data
19
A. Related Work for Spam Detection
  • Collaborative filtering approaches that analyze
    the content: SpamNet, Cloudmark (San Francisco, CA)
  • Classification-based approaches that use
    heuristics or rules: SpamAssassin.
  • Bayesian approaches to classify e-mails as
    spam: MSN8
  • Behavior-based techniques using user profiles:
    E-Mail Mining Toolkit.
  • Detection of spam trojans using behavior-based
    techniques coupled with signature-based
    techniques: Sandvine Incorporated

Signature-based methods and methods that look at
user profiles are privacy intrusive and fail to
detect novel attacks.
20
A. Detecting Anomalous Email Behavior
  • Email servers send and receive mails from each
    other and these interactions form a community.
  • These Email servers can be modeled as Fans and
    Centers in a bipartite graph.
  • Fans are machines that send mails
  • Centers are machines that receive mails.
  • Email servers serve as both good fans and
    centers.

The behavior of Email servers can thus be
captured by the HITS algorithm.
21
A. HITS Algorithm
  • HITS Algorithm
  • Let $a$ be the vector of authority scores and $h$ be
    the vector of hub scores
  • $a = [1, 1, \ldots, 1]$, $h = [1, 1, \ldots, 1]$
  • do
  • $a = A^T h$
  • $h = A a$
  • Normalize $a$ and $h$
  • while $a$ and $h$ have not converged (their change
    is above a convergence threshold)
  • $a^* = a$
  • $h^* = h$
  • return $a^*$, $h^*$
  • The vectors $a^*$ and $h^*$ represent the authority
    and hub weights
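The iteration on this slide can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation: the graph is assumed to be a list of directed (p, q) edges, scores are normalized to unit Euclidean length, and the tolerance, iteration cap, and sample edges are hypothetical choices.

```python
import math


def hits(edges, tol=1e-10, max_iter=1000):
    """Return (authority, hub) score dicts for a directed edge list."""
    nodes = sorted({n for edge in edges for n in edge})
    if not nodes:
        return {}, {}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        # a = A^T h: a node's authority sums the hub scores pointing to it.
        new_auth = {n: 0.0 for n in nodes}
        for p, q in edges:
            new_auth[q] += hub[p]
        # h = A a: a node's hub score sums the authorities it points to.
        new_hub = {n: 0.0 for n in nodes}
        for p, q in edges:
            new_hub[p] += new_auth[q]
        # Normalize both vectors to unit Euclidean length.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        new_auth = {n: v / a_norm for n, v in new_auth.items()}
        new_hub = {n: v / h_norm for n, v in new_hub.items()}
        # Stop once neither vector changes by more than the threshold.
        change = max(
            max(abs(new_auth[n] - auth[n]) for n in nodes),
            max(abs(new_hub[n] - hub[n]) for n in nodes),
        )
        auth, hub = new_auth, new_hub
        if change < tol:
            break
    return auth, hub


# Hypothetical fans f1, f2 pointing to centers c1, c2.
auth, hub = hits([("f1", "c1"), ("f2", "c1"), ("f1", "c2")])
print(sorted(auth.items()), sorted(hub.items()))
```

In this toy graph c1 is pointed to by both fans and f1 points to both centers, so c1 ends up with the larger authority score and f1 with the larger hub score.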

22
A. Identifying spamming machines
23
A. Identifying spamming machines
  • Sequence of steps
  • Pre-process the netflow data and construct the
    graph for e-mail connections.
  • Graphs can also be constructed for patterns that
    represent other kinds of services, such as FTP.
  • A node can be an IP, an AS, a port, or any
    combination, depending on the problem. We do our
    analysis at the IP level.
  • Perform the HITS algorithm on the generated
    graph.
  • The nodes with top hub and authority scores
    represent typical e-mail servers.
  • Remove the edges from the top k hubs to the top k
    authorities.
  • These connections correspond to normal
    e-mail traffic between regular mail servers that
    have high hub and authority scores.
  • Perform the HITS algorithm on the resultant
    graph.
  • A simple outdegree also works fine on the
    resultant graph.
  • The new scores are the Perpetrator Scores.
  • Spamming machines obtain high rank compared to
    other e-mail servers.
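The sequence of steps above can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes the Perpetrator Score is the hub score obtained on the pruned graph, uses a fixed number of HITS iterations, and runs on a hypothetical toy network (mail servers m1 to m4, a spamming host s, and recipients x1 to x3).

```python
import math


def hits(edges, iterations=50):
    """Compact HITS: returns (authority, hub) dicts after a fixed number of steps."""
    nodes = {n for edge in edges for n in edge}
    auth = dict.fromkeys(nodes, 1.0)
    hub = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        auth = dict.fromkeys(nodes, 0.0)
        for p, q in edges:
            auth[q] += hub[p]
        hub = dict.fromkeys(nodes, 0.0)
        for p, q in edges:
            hub[p] += auth[q]
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return auth, hub


def perpetrator_scores(edges, k):
    """Drop edges from the top-k hubs to the top-k authorities, then re-score.

    Assumption: the score reported is the hub score on the pruned graph.
    """
    auth, hub = hits(edges)
    top_hubs = set(sorted(hub, key=hub.get, reverse=True)[:k])
    top_auths = set(sorted(auth, key=auth.get, reverse=True)[:k])
    pruned = [(p, q) for p, q in edges
              if not (p in top_hubs and q in top_auths)]
    _, pruned_hub = hits(pruned)
    return {n: pruned_hub.get(n, 0.0) for n in hub}


# Hypothetical e-mail graph: four servers exchanging mail, one spammer.
servers = ["m1", "m2", "m3", "m4"]
edges = [(p, q) for p in servers for q in servers if p != q]
edges += [("s", "m1"), ("s", "x1"), ("s", "x2"), ("s", "x3")]
scores = perpetrator_scores(edges, k=4)
print(max(scores, key=scores.get))
```

Once the server-to-server traffic is pruned, the only remaining sender in this toy graph is s, so it receives the highest score.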

24
A. Mining Information from Anomalous Rank
Evolution
  • Link graphs based on email connections change
    rapidly across time.
  • This evolving behavior helps in identifying:
  • Machines that suddenly behave as Email servers
  • Possible spam (initiates)
  • Machines that send low volume spam continuously
  • Email servers going down.

25
A. Metrics of Interest
  • Perpetrator Rank
  • For each node, determine its Perpetrator Rank
    (PR) based on its Perpetrator Score (PScore).
  • Perpetrator Height
  • The height is a measure of how far a node is
    from an infinitely low ranked node. For a node $i$
    at time $t$, its height is
  • $PHeight_i^t = \log_2(1 + 1/PR)$
  • Rate of change in the rank of a node over time
  • $v = \Delta PHeight / \Delta t$
  • Rank Energy
  • Rank Energy $= \text{Weight} \times v^2$
  • The changing behavior of the above metrics is a good
    way to detect sudden changes in the e-mail behavior
    of machines.
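A short worked sketch of these metrics, assuming the height is read as log2(1 + 1/PR), so that an infinitely low ranked node has height 0 and the top ranked node has height 1. The slides do not define Weight, so it is passed in as a plain hypothetical parameter; the function names are illustrative.

```python
import math


def perpetrator_height(rank):
    """Height of a node with Perpetrator Rank `rank` (1 is the top rank)."""
    return math.log2(1 + 1 / rank)


def rank_velocity(rank_before, rank_after, dt):
    """Rate of change of height between two observations dt time units apart."""
    return (perpetrator_height(rank_after) - perpetrator_height(rank_before)) / dt


def rank_energy(weight, velocity):
    """Rank energy; `weight` is a hypothetical per-node weight."""
    return weight * velocity ** 2


# A node that jumps from rank 3 to rank 1 within one time unit.
v = rank_velocity(rank_before=3, rank_after=1, dt=1.0)
print(perpetrator_height(1), v, rank_energy(1.0, v))
```

Because the height is logarithmic, a move near the top of the ranking produces a much larger velocity than an equally sized move far down the list.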

26
A. Experiments and Results
  • Data source: Netflow data
  • Detecting anomalous behavior
  • 10-minute window from 07:10 to 07:20 hrs on June
    17th, 2004
  • Analyzing rank evolution
  • 3-hour window from 7 am to 10 am on July 21st,
    2004
  • Steps involved
  • Data preprocessing: graph construction from
    Netflow data
  • Running the modified HITS algorithm
  • Determining the anomalous score
  • Determining the rate of change to capture rank evolution

27
A. Anomalous Behavior Detection
The experiment was run for a 10-minute window from
07:10 to 07:20 hrs on the 17th of June, 2004.
At this time, 134.84.S.44 was known to be sending
spam. All of the other hosts were known, good
email servers that were sending email.
  • Total flows: 856470
  • Email flows: 10368
  • Distinct IPs (total): 228276
  • Distinct IPs (email): 1633
Sorted by Perpetrator Score
Sorted by Outdegree
  • The rank of the spam-sending machine
    (134.84.S.44) was pushed higher compared to other
    good e-mail servers.
  • The ranks of this IP according to authority and
    indegree scores were 1563/1633 and 1572/1633,
    respectively. This indicates that the machine was
    sending spam and was not a good receiver.

28
A. Analysis of Rank Evolution
Figure: Variation of spam hub score over time for IPs
(annotation: mail server possibly sending newsletters).
Panels: Height metric, Energy metric.
  • Machine found to be affected and sending spam
    during the time period 7am to 10am on July 21st
    2004 in the CS network
  • Ranked 1 according to the height metric for the
    aggregate time period.
  • Ranked 3 according to the energy metric

29
Applications and Challenges
  • Detection of machines sending E-Mail Spam
  • Issue addressed: developing time-aware relevancy
    measures
  • Computation of metrics as graphs evolve
  • Issue addressed: computational challenges

30
B. Computation of Metrics for Evolving Web Graphs
  • Goal: Model a sequence of evolving graphs, and
    develop efficient computational models for
    determining relevance metrics
  • Dataset used
  • Computer Science website, University of
    Minnesota

31
Computational Challenges
  • Analysis of evolution of large scale graphs
    gives rise to two kinds of computational
    challenges
  • Computation on a single large scale graph.
  • Parallel computation by graph partitioning.
  • Computation on set of evolving graphs
  • Incremental Computation on evolving graphs.

32
Efficient Computation of Large Evolving Graphs
  • Theory
  • Understanding the popular and base models for
    computing relevance measures on large graphs.
  • First-order Markov models: Google's PageRank
  • Using the properties of the underlying model to
    partition the graph into graphs of smaller size,
    thus decreasing the size of the problem.

33
A. Google's PageRank
Key idea: The rank of a web page depends on the ranks
of the web pages pointing to it.
34
A. The PageRank Algorithm
  • Input
  • $A$ is the adjacency matrix such that
  • $A(p,q) = 0$ if there is no directed edge from $p$
    to $q$.
  • $A(p,q) = 1/\mathrm{OutDegree}(p)$ if there is an edge from
    $p$ to $q$.
  • The PageRank Algorithm
  • Set $PR \leftarrow [r_1, r_2, \ldots, r_N]$, where $r_i$ is some
    initial rank of page $i$, and $N$ is the number of Web
    pages in the graph
  • $d \leftarrow 0.15$; $D \leftarrow [1/N, \ldots, 1/N]^T$
  • do
  • $PR_{i+1} \leftarrow A^T PR_i$
  • $PR_{i+1} \leftarrow (1-d)\, PR_{i+1} + d\, D$
  • $\delta \leftarrow \lVert PR_{i+1} - PR_i \rVert_1$
  • while $\delta \ge \epsilon$, where $\epsilon$ is a small number
    indicating the convergence threshold
  • return $PR$.
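The loop on this slide can be sketched in plain Python as below. It is a minimal sketch under stated assumptions: the graph is a list of directed (p, q) edges, the initial ranks are uniform, convergence is measured with the L1 norm, and nodes without outgoing edges are not treated specially (their rank mass simply leaks), which a production implementation would need to handle. The sample graph is hypothetical.

```python
def pagerank(edges, d=0.15, eps=1e-10, max_iter=1000):
    """Power iteration with PR <- (1 - d) * A^T PR + d * D, where D = [1/N, ...]."""
    nodes = sorted({n for edge in edges for n in edge})
    n_nodes = len(nodes)
    out_degree = {n: 0 for n in nodes}
    for p, _ in edges:
        out_degree[p] += 1
    pr = {n: 1.0 / n_nodes for n in nodes}
    for _ in range(max_iter):
        # A^T PR: every page passes its rank evenly along its out-links.
        new_pr = {n: 0.0 for n in nodes}
        for p, q in edges:
            new_pr[q] += pr[p] / out_degree[p]
        # Mix in the uniform vector D with weight d.
        new_pr = {n: (1 - d) * v + d / n_nodes for n, v in new_pr.items()}
        # L1 distance between successive iterates.
        delta = sum(abs(new_pr[n] - pr[n]) for n in nodes)
        pr = new_pr
        if delta < eps:
            break
    return pr


# Hypothetical graph: a links to b and c, b links to c, c links back to a.
pr = pagerank([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")])
print(pr)
```

Page c is pointed to by both a and b, while b is pointed to only by a (which splits its rank between two links), so c ends up ranked above b.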

35
A. Theoretical Overview of Partitioning Approach
36
A Theoretical Approach
Vertex on left partition,
Vertex on the border of left partition from which
there are outgoing edges to the right partition
Vertex on right partition,
37
Computing for different partitions
38
Incremental PR Computation on Evolving Graphs
  • Motivation
  • Large graphs that represent various relationship
    phenomena, such as the World Wide Web, evolve
    over time.
  • The portion of the graph that changes over time is
    very small compared to the size of the whole graph.
  • Computing metrics based on first-order Markov
    models for such large graphs at each time
    instance is expensive.
  • Problem
  • Given snapshots of an evolving graph at two
    consecutive time instances, G1 and G2, compute the
    PageRank of the graph at the second time instance,
    G2, in a cost-effective manner.

39
Overall Approach
Principal idea: The PageRank of a page depends only
on the pages that point to it and is independent
of the pages it points to.
The PageRank of a node remains unaffected by any set
of nodes from which there is no directed path to the
given node.
40
Methodology
  • Methodology
  • Detect the changed portion of the graph.
  • Partition the graph into a scalable computation
    partition (P) and a non-scalable computation
    partition (Q), such that there are no incoming
    links from partition Q to partition P.
  • Compute PageRank for the non-scalable partition and
    scale the PageRanks of the scalable partition.
  • Merge the rankings of the two independent
    partitions.
  • Scaling is done with respect to the number
    of vertices in partition P, n(P), relative to the total
    number of nodes in the whole graph G, n(P ∪ Q) = |V|.
  • The PageRank values of partition P are obtained
    by simple scaling, with scaling factor
    n(G1)/n(G2).
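The partitioning and scaling steps can be sketched as follows. This is an illustration under assumptions, not the published algorithm: the changed nodes are taken as given (here, nodes that are new or whose out-links changed), Q is built as everything reachable from them so that no link goes from Q into P, and the recomputation of PageRank on Q is left out. Function names and the sample data are hypothetical.

```python
from collections import deque


def partition_graph(new_edges, changed_nodes):
    """Split the nodes of the new graph into (P, Q).

    Q holds the changed nodes and every node reachable from them;
    P holds the rest, so no edge leads from Q into P.
    """
    successors = {}
    nodes = set()
    for p, q in new_edges:
        successors.setdefault(p, []).append(q)
        nodes.update((p, q))
    Q = {n for n in changed_nodes if n in nodes}
    queue = deque(Q)
    while queue:  # breadth-first search from the changed nodes
        node = queue.popleft()
        for nxt in successors.get(node, []):
            if nxt not in Q:
                Q.add(nxt)
                queue.append(nxt)
    return nodes - Q, Q


def scale_partition_ranks(old_pr, P, n_old, n_new):
    """Scale the old PageRank values of partition P by n(G1)/n(G2)."""
    factor = n_old / n_new
    return {n: old_pr[n] * factor for n in P}


# Hypothetical change: G1 has a->b->c; G2 adds node d and the edge c->d.
new_edges = [("a", "b"), ("b", "c"), ("c", "d")]
P, Q = partition_graph(new_edges, changed_nodes={"c", "d"})
old_pr = {"a": 0.2, "b": 0.3, "c": 0.5}  # placeholder values for G1
scaled = scale_partition_ranks(old_pr, P, n_old=3, n_new=4)
print(sorted(P), sorted(Q), scaled)
```

Only the nodes in Q would then need a fresh PageRank computation, which is where the savings come from when the changed portion of the graph is small.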

Figure: Incremental Computation of PageRank
41
Results and Discussion
Experimental Results
  • Conclusions
  • Significant speed-up in the computation of PageRank
    using our approach
  • Effective partitioning technique for incremental
    computation of any metric based on a first-order
    Markov model.

42
Ongoing Work
  • Developing a parallel computation methodology
    using graph partitioning for measures such as
    PageRank
  • This is not an approximation method.
  • The principal idea is to partition the graph into
    subgraphs such that the computation on each
    partition can be done independently, reducing
    the overall computation cost.

43
Questions
44
Biography
  • Research Interests
  • Data Mining , Web Mining, Link Analysis
    Techniques and Applications.
  • Education
  • Ph.D. in Computer Science, Univ. of Minnesota,
    Minneapolis (expected 2006).
  • M.S. in Computer Science, Univ. of Minnesota,
    Minneapolis (2004).
  • B.E. in Mechanical Engineering, Osmania
    University, Hyderabad, India (1998).
  • Related Publications
  • D. Padmanabhan, P. Desikan, J. Srivastava, K.
    Riaz, WICER: A Weighted Inter-Cluster Edge
    Ranking for Clustered Graphs, IEEE/ACM
    Conference on Web Intelligence, September 2005.
  • P. Desikan, N. Pathak, J. Srivastava and V.
    Kumar, Incremental PageRank Computation on
    Evolving Graphs, Poster Paper at 14th
    International World Wide Web Conference, May
    10-14, 2005, Chiba, Japan.
  • P. Desikan and J. Srivastava, Analyzing Network
    Traffic to Detect E-Mail Spamming Machines, ICDM
    Workshop on Privacy and Security Aspects of Data
    Mining, Brighton, UK, November 2004.
  • P. Desikan and J. Srivastava, Mining Temporally
    Evolving Graphs, WebKDD Workshop on Knowledge
    Discovery in the Web, Seattle, August 2004.
  • J. Srivastava, P. Desikan and V. Kumar, Web
    Mining: Concepts, Applications and Research
    Directions, Book Chapter in Data Mining: Next
    Generation Challenges and Future Directions,
    MIT/AAAI, 2004.
  • P. Desikan, J. Srivastava, V. Kumar, and P. N.
    Tan, Hyperlink Analysis: Techniques and
    Applications, AHPCRC Technical Report
    TR-2002-0152.
  • More Information
  • E-Mail: desikanp_at_uwstout.edu, desikan_at_cs.umn.edu
  • URL: http://www-users.cs.umn.edu/desikan/

45
Thank You !