Link Analysis: Current State of the Art - PowerPoint PPT Presentation

About This Presentation
Title:

Link Analysis: Current State of the Art

Slides: 57
Provided by: ron856

Transcript and Presenter's Notes

Title: Link Analysis: Current State of the Art


1
Link Analysis: Current State of the Art
  • Ronen Feldman
  • Computer Science Department
  • Bar-Ilan University, ISRAEL
  • ronenf_at_gmail.com

2
Introduction to Text Mining
3
TM ≠ Search
Find Documents matching the Query
Display Information relevant to the Query
Long lists of documents
Aggregate over entire collection
4
Let Text Mining Do the Legwork for You
Text Mining
Find Material
Read
Understand
Consolidate
Absorb / Act
5
What Is Unique in Text Mining?
  • Feature extraction.
  • Very large number of features that represent each
    of the documents.
  • The need for background knowledge.
  • Even patterns supported by a small number of
    documents may be significant.
  • Huge number of patterns, hence need for
    visualization, interactive exploration.

6
Document Types
  • Structured documents
  • Output from CGI
  • Semi-structured documents
  • Seminar announcements
  • Job listings
  • Ads
  • Free format documents
  • News
  • Scientific papers

7
Text Representations
  • Character Trigrams
  • Words
  • Linguistic Phrases
  • Non-consecutive phrases
  • Frames
  • Scripts
  • Role annotation
  • Parse trees

8
The 100,000 foot Picture
9
Intelligent Auto-Tagging
(c) 2001, Chicago Tribune. Visit the Chicago
Tribune on the Internet at
http://www.chicago.tribune.com/ Distributed by
Knight Ridder/Tribune Information Services. By
Stephen J. Hedges and Cam Simpson.
The Finsbury Park Mosque is the center of
radical Muslim activism in England. Through its
doors have passed at least three of the men now
held on suspicion of terrorist activity in
France, England and Belgium, as well as one
Algerian man in prison in the United States.
The mosque's chief cleric, Abu Hamza al-Masri
lost two hands fighting the Soviet Union in
Afghanistan and he advocates the elimination of
Western influence from Muslim countries. He was
arrested in London in 1999 for his alleged
involvement in a Yemen bomb plot, but was set
free after Yemen failed to produce enough
evidence to have him extradited.
10
Intelligence Article
11
Google's Article
12
Merger
13
Leveraging Content Investment
  • Any type of content
  • Unstructured textual content (current focus)
  • Structured data, audio, video (future)
  • In any format
  • Documents, PDFs, e-mails, articles, etc.
  • Raw or categorized
  • Formal, informal, or a combination

Text Mining
  • From any source
  • WWW, file systems, news feeds, etc.
  • Single source or combined sources

14
Information Extraction
15
Relevant IE Definitions
  • Entity: an object of interest such as a person or
    organization.
  • Attribute: a property of an entity such as its
    name, alias, descriptor, or type.
  • Fact: a relationship held between two or more
    entities, such as the Position of a Person in a
    Company.
  • Event: an activity involving several entities,
    such as a terrorist act, airline crash,
    management change, or new product introduction.
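
The four definitions above can be pictured as a small data model. The sketch below is only an illustration: the class names, field names, and relation labels are assumptions, not part of the presentation or of any particular IE system. The sample values come from the Chicago Tribune excerpt on slide 9.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """An object of interest, plus its attributes as (key, value) pairs."""
    name: str
    type: str                 # e.g. "Person", "Organization" (illustrative labels)
    attributes: tuple = ()    # e.g. (("descriptor", "chief cleric"),)


@dataclass(frozen=True)
class Fact:
    """A relationship held between two or more entities."""
    relation: str
    arguments: tuple          # the entities taking part in the relation


@dataclass(frozen=True)
class Event:
    """An activity involving several entities."""
    kind: str
    participants: tuple
    details: tuple = ()       # free-form (key, value) pairs


# Values taken from the article excerpt on slide 9.
cleric = Entity("Abu Hamza al-Masri", "Person", (("descriptor", "chief cleric"),))
mosque = Entity("Finsbury Park Mosque", "Organization")

position = Fact("position-in-organization", (cleric, mosque))
arrest = Event("arrest", (cleric,), (("place", "London"), ("year", 1999)))
```

Entities and attributes are flat records, while facts and events refer to other records. The accuracy figures on the next slide fall in the same order, from entities down to events.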

16
IE Accuracy by Information Type
Information Type    Accuracy
Entities            90-98%
Attributes          80%
Facts               60-70%
Events              50-60%
17
MUC Conferences
Conference  Year  Topic
MUC 1       1987  Naval Operations
MUC 2       1989  Naval Operations
MUC 3       1991  Terrorist Activity
MUC 4       1992  Terrorist Activity
MUC 5       1993  Joint Ventures and Microelectronics
MUC 6       1995  Management Changes
MUC 7       1997  Space Vehicles and Missile Launches
18
Applications of Information Extraction
  • Routing of Information
  • Infrastructure for IR and for Categorization
    (higher level features)
  • Event Based Summarization.
  • Automatic Creation of Databases and Knowledge
    Bases.

19
Where would IE be useful?
  • Semi-Structured Text
  • Generic documents like News articles.
  • Most of the information in the document is
    centered around a set of easily identifiable
    entities.

20
Approaches for Building IE Systems
  • Knowledge Engineering Approach
  • Rules are crafted by linguists in cooperation
    with domain experts.
  • Most of the work is done by inspecting a set of
    relevant documents.
  • It can take a lot of time to fine-tune the rule
    set.
  • The best results were achieved with KB-based IE
    systems.
  • Skilled/gifted developers are needed.
  • A strong development environment is a MUST!

21
Approaches for Building IE Systems
  • Automatically Trainable Systems
  • The techniques are based on pure statistics and
    almost no linguistic knowledge
  • They are language independent
  • The main input is an annotated corpus
  • Building the rules takes relatively little
    effort; however, creating the annotated corpus is
    extremely laborious.
  • A huge number of training examples is needed in
    order to achieve reasonable accuracy.
  • Hybrid approaches can utilize the user input in
    the development loop.

22
Components of IE System
23
Why is IE Difficult?
  • Different Languages
  • Morphology is very easy in English, much harder
    in German and Hebrew.
  • Identifying word and sentence boundaries is
    fairly easy in European languages, much harder in
    Chinese and Japanese.
  • Some languages use orthography (like English)
    while others (like Hebrew, Arabic, etc.) do not
    have it.
  • Different types of style
  • Scientific papers
  • Newspapers
  • Memos
  • Emails
  • Speech transcripts
  • Type of Document
  • Tables
  • Graphics
  • Small messages vs. Books

24
Link Analysis on Large Textual Networks
  • Social Network Analysis

25
The Kevin Bacon Game
  • The game works as follows: given any actor, find
    a path between the actor and Kevin Bacon that has
    fewer than 6 edges.
  • For instance, Kevin Costner links to Kevin Bacon
    by using one direct link: both were in JFK.
  • Julia Louis-Dreyfus of TV's Seinfeld, however,
    needs two links to make a path: Julia
    Louis-Dreyfus was in Christmas Vacation (1989)
    with Keith MacKechnie. Keith MacKechnie was in We
    Married Margo (2000) with Kevin Bacon.
  • You can play the game by using the following URL:
    http://www.cs.virginia.edu/oracle/.
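
Finding such a path is a shortest-path search on the co-star graph. The sketch below uses a plain breadth-first search over the three films named on this slide. The helper names `add_film` and `shortest_path` are illustrative assumptions, and a real actor database would be far larger.

```python
from collections import deque


def add_film(graph, cast):
    """Link every pair of actors who appeared in the same film."""
    for actor in cast:
        graph.setdefault(actor, set()).update(a for a in cast if a != actor)


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a shortest vertex list or None."""
    if start == goal:
        return [start]
    parent = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in graph.get(current, ()):
            if neighbor not in parent:
                parent[neighbor] = current
                if neighbor == goal:
                    # Walk the parent pointers back to the start.
                    path = [goal]
                    while parent[path[-1]] is not None:
                        path.append(parent[path[-1]])
                    return path[::-1]
                queue.append(neighbor)
    return None


costars = {}
add_film(costars, ["Kevin Costner", "Kevin Bacon"])             # JFK
add_film(costars, ["Julia Louis-Dreyfus", "Keith MacKechnie"])  # Christmas Vacation (1989)
add_film(costars, ["Keith MacKechnie", "Kevin Bacon"])          # We Married Margo (2000)

path = shortest_path(costars, "Julia Louis-Dreyfus", "Kevin Bacon")
links = len(path) - 1  # number of edges on the path
```

Breadth-first search visits actors in order of increasing distance, so the first path that reaches Kevin Bacon uses the fewest links.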

26
The Erdös Number
  • A similar idea is also used in the mathematical
    community and is called the Erdös number of a
    researcher.
  • Paul Erdös (1913-1996) wrote hundreds of
    mathematical research papers in many different
    areas, many in collaboration with others.
  • There is a link between any two mathematicians if
    they co-authored a paper.
  • Paul Erdös is the root of the mathematical
    research network and his Erdös number is 0.
  • Erdös's co-authors have Erdös number 1.
  • People other than Erdös who have written a joint
    paper with someone with Erdös number 1 but not
    with Erdös have Erdös number 2, and so on.
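
The recursive definition above amounts to the breadth-first distance from the root of the co-authorship graph. Here is a minimal sketch, assuming the graph is stored as a dictionary of neighbor sets. The authors "A" to "D" are hypothetical placeholders, and the root is spelled in ASCII only to keep the code portable.

```python
from collections import deque


def erdos_numbers(coauthors, root="Erdos"):
    """Map every author reachable from the root to their Erdos number."""
    numbers = {root: 0}
    queue = deque([root])
    while queue:
        author = queue.popleft()
        for other in coauthors.get(author, ()):
            if other not in numbers:          # first visit is the shortest
                numbers[other] = numbers[author] + 1
                queue.append(other)
    return numbers


# Hypothetical co-authorship graph: A wrote with Erdos, B with A, C with B.
coauthors = {
    "Erdos": {"A"},
    "A": {"Erdos", "B"},
    "B": {"A", "C"},
    "C": {"B"},
    "D": set(),   # no path to the root, so no finite Erdos number
}
numbers = erdos_numbers(coauthors)
```

Authors with no chain of co-authorships to the root, such as "D" here, simply do not appear in the result.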

27
Running Example
28
Hijackers by Flight
Flight 77 (Pentagon)  Flight 11 (WTC 1)   Flight 175 (WTC 2)  Flight 93 (PA)
Khalid Al-Midhar      Satam Al Suqami     Marwan Al-Shehhi    Saeed Alghamdi
Majed Moqed           Waleed M. Alshehri  Fayez Ahmed         Ahmed Alhaznawi
Nawaq Alhamzi         Wail Alshehri       Ahmed Alghamdi      Ahmed Alnami
Salem Alhamzi         Mohamed Atta        Hamza Alghamdi      Ziad Jarrahi
Hani Hanjour          Abdulaziz Alomari   Mohald Alshehri
29
Automatic layout of networks
  • Pretty Graph Drawing

30
Motivation I
  • In order to display large networks on the screen
    we need to use automatic layout algorithms. These
    algorithms display the graphs in an aesthetic way
    without any user intervention.
  • The most commonly used aesthetic criteria are to
    expose symmetries and to make the drawing as
    compact as possible or, alternatively, to fill
    the space available for the drawing.

31
Motivation II
  • Many of the higher-level aesthetic criteria are
    implicit consequences of:
  • minimized number of edge crossings
  • evenly distributed edge length
  • evenly distributed vertex positions on the graph
    area
  • sufficiently large vertex-edge distances
  • sufficiently large angular resolution between
    edges.

32
Disadvantages of the Spring-Based Methods
  • They are computationally expensive and hence
    minimizing the energy function when dealing with
    large graphs is computationally prohibitive.
  • Since all methods rely on heuristics, there is no
    guarantee that the best layout will be found.
  • The methods behave as black boxes and hence it is
    almost impossible to integrate additional
    constraints on the layout (such as fixing the
    positions of certain vertices, or specifying the
    relative ordering of the vertices).
  • Even when the graphs are planar it is quite
    possible that we will get edge crossings.
  • The methods try to optimize just the placement of
    vertices and edges while ignoring the exact shape
    of the vertices or the fact that the vertices may
    have labels.

33
Kamada and Kawai's (KK) Method
34
Fruchterman-Reingold (FR) Method
35
Classic Graph Operations
36
Finding the shortest Path (from Atta)
37
A better Visualization
38
Centrality
39
Degree
  • If the graph is undirected then the degree of a
    vertex $v \in V$ is the number of other vertices
    that are directly connected to it.
  • $\mathrm{degree}(v) = |\{(v_1, v_2) \in E : v_1 = v \text{ or } v_2 = v\}|$
  • If the graph is directed then we can talk about
    in-degree or out-degree. An edge $(v_1, v_2) \in E$
    in the directed graph leads from vertex $v_1$ to
    $v_2$.
  • $\text{In-degree}(v) = |\{(v_1, v) \in E\}|$
  • $\text{Out-degree}(v) = |\{(v, v_2) \in E\}|$
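
The three definitions translate directly into counting over an edge list. The sketch below assumes a simple graph (no self-loops or repeated edges) stored as a list of vertex pairs. The vertex names are hypothetical.

```python
def degree(edges, v):
    """Undirected degree: edges that have v as either endpoint."""
    return sum(1 for (v1, v2) in edges if v1 == v or v2 == v)


def in_degree(edges, v):
    """Directed in-degree: edges (v1, v) leading into v."""
    return sum(1 for (_, v2) in edges if v2 == v)


def out_degree(edges, v):
    """Directed out-degree: edges (v, v2) leading out of v."""
    return sum(1 for (v1, _) in edges if v1 == v)


# Hypothetical undirected graph: each pair is listed once.
undirected = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

# Hypothetical directed graph: (v1, v2) leads from v1 to v2.
directed = [("a", "b"), ("c", "b"), ("b", "d")]

deg_c = degree(undirected, "c")
in_b = in_degree(directed, "b")
out_b = out_degree(directed, "b")
```

In a simple graph, the number of incident edges equals the number of directly connected vertices, so the count matches the verbal definition.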

40
Degree of the Hijackers
41
Closeness Centrality - Motivation
  • Degree centrality measures might be criticized
    because they only take into account the direct
    connections that an entity has, rather than
    indirect connections to all other entities.
  • One entity might be directly connected to a large
    number of entities that might be pretty isolated
    from the network. Such an entity is central only
    in a local neighborhood of the network.

42
Closeness Centrality
  • This measure is based on the calculation of the
    geodesic distance between the entity and all
    other entities in the network.
  • We can either use directed or undirected geodesic
    distances between the entities.
  • The sum of these geodesic distances for each
    entity is the "farness" of the entity from all
    other entities.
  • We can convert this into a measure of closeness
    centrality by taking the reciprocal.
  • In addition, we can normalize the closeness
    measure by dividing it by the closeness measure
    of the most central entity.

43
Closeness Formally
  • Let d(v1, v2) be the minimal distance between v1
    and v2, i.e., the minimal number of vertices that
    we need to pass on the way from v1 to v2.
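
The slide's formula did not survive in this transcript, so the following is only a sketch built from the verbal description on the previous slide: sum the geodesic distances to get the farness, then take the reciprocal. Two choices here are assumptions. Distances are counted in edges (hops), and the optional scaling multiplies by n - 1 rather than dividing by the closeness of the most central entity. The values in the hijacker table that follows appear consistent with that scaling for n = 19 (for example, 18/30 = 0.6).

```python
from collections import deque


def distances_from(graph, source):
    """Hop distances from source to every reachable vertex (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def closeness(graph, v, scaled=True):
    """Reciprocal of farness; optionally scaled by (n - 1)."""
    dist = distances_from(graph, v)
    if len(dist) < len(graph):
        return 0.0                      # some vertex is unreachable
    farness = sum(dist.values())
    if farness == 0:
        return 0.0                      # single-vertex graph
    return (len(graph) - 1) / farness if scaled else 1.0 / farness


# Hypothetical path graph a - b - c - d - e.
path_graph = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
middle = closeness(path_graph, "c")   # farness 1 + 1 + 2 + 2 = 6
end = closeness(path_graph, "a")      # farness 1 + 2 + 3 + 4 = 10
```

The middle vertex of the path scores higher than the end vertices, which matches the intuition that it is closest to everyone else.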

44
Closeness of the Hijackers
Name Closeness
Abdulaziz Alomari 0.6
Ahmed Alghamdi 0.5454545
Ziad Jarrahi 0.5294118
Fayez Ahmed 0.5294118
Mohamed Atta 0.5142857
Majed Moqed 0.5142857
Salem Alhamzi 0.5142857
Hani Hanjour 0.5
Marwan Al-Shehhi 0.4615385
Satam Al Suqami 0.4615385
Waleed M. Alshehri 0.4615385
Wail Alshehri 0.4615385
Hamza Alghamdi 0.45
Khalid Al-Midhar 0.4390244
Mohald Alshehri 0.4390244
Nawaq Alhamzi 0.3673469
Saeed Alghamdi 0.3396226
Ahmed Alnami 0.2571429
Ahmed Alhaznawi 0.2571429
45
Betweenness Centrality
  • Betweenness centrality measures how effectively
    a vertex connects the various parts of the
    network.
  • The main idea behind betweenness centrality is
    that entities that are mediators have more power.
    Entities that are on many geodesic paths between
    other pairs of entities are more powerful since
    they control the flow of information between the
    pairs.

46
Betweenness - Formally
  • Highest Possible Betweenness
  • $g_{jk}$: the number of geodesic paths that
    connect $v_j$ with $v_k$
  • $g_{jk}(v_i)$: the number of geodesic paths that
    connect $v_j$ with $v_k$ and pass via $v_i$.
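
The formula itself is missing from this transcript. A common way to combine the two quantities defined above, assumed here rather than confirmed by the slide, is to sum the ratio g_jk(v_i) / g_jk over all pairs j < k that do not include v_i. The sketch below does this for an undirected graph, counting geodesic paths by breadth-first search. Under that definition the highest possible value is (n - 1)(n - 2) / 2, reached by the center of a star.

```python
from collections import deque
from itertools import combinations


def bfs_counts(graph, source):
    """Return (dist, sigma): hop distances and geodesic path counts."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:   # u precedes w on a geodesic
                sigma[w] += sigma[u]
    return dist, sigma


def betweenness(graph):
    """Sum over pairs j < k of g_jk(v_i) / g_jk, for every vertex v_i."""
    info = {v: bfs_counts(graph, v) for v in graph}
    score = {v: 0.0 for v in graph}
    for j, k in combinations(graph, 2):
        dist_j, sigma_j = info[j]
        if k not in dist_j:
            continue                     # j and k are not connected
        g_jk = sigma_j[k]
        for i in graph:
            if i == j or i == k or i not in dist_j:
                continue
            dist_i, sigma_i = info[i]
            # v_i lies on a geodesic iff the distances add up exactly.
            if k in dist_i and dist_j[i] + dist_i[k] == dist_j[k]:
                score[i] += sigma_j[i] * sigma_i[k] / g_jk
    return score


# Hypothetical star (center "c") and 4-cycle.
star = {"c": {"w", "x", "y", "z"}, "w": {"c"}, "x": {"c"}, "y": {"c"}, "z": {"c"}}
cycle = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}

star_scores = betweenness(star)
cycle_scores = betweenness(cycle)
```

This pairwise version is written for clarity and costs roughly cubic time in the number of vertices. Large networks would call for a faster algorithm.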

47
Betweenness of the Hijackers
48
Eigenvector Centrality
  • The main idea behind eigenvector centrality is
    that entities receiving many communications from
    other well-connected entities will be better and
    more valuable sources of information, and hence
    should be considered central. The eigenvector
    centrality scores correspond to the values of the
    principal eigenvector of the adjacency matrix M.
  • Formally, the vector v satisfies the equation
    $M v = \lambda v$, where $\lambda$ is the
    corresponding eigenvalue and M is the adjacency
    matrix.
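
One simple way to approximate the principal eigenvector is power iteration. The sketch below is an illustration under stated assumptions, not the method used to produce the scores in the presentation. It assumes a connected undirected graph given as a 0/1 adjacency matrix and normalizes the vector to unit Euclidean length. It multiplies by M + I instead of M, which leaves the eigenvectors unchanged but avoids the oscillation that plain power iteration shows on bipartite graphs.

```python
import math


def eigenvector_centrality(adj, iterations=200, tol=1e-12):
    """Power iteration on (M + I); returns (scores, eigenvalue of M)."""
    n = len(adj)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        # w = (M + I) v
        w = [v[i] + sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
        delta = max(abs(a - b) for a, b in zip(v, w))
        v = w
        if delta < tol:
            break
    # Rayleigh quotient v^T M v (v has unit length) estimates the eigenvalue.
    mv = [sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v[i] * mv[i] for i in range(n))
    return v, eigenvalue


# Hypothetical graph: triangle 0-1-2 with a pendant vertex 3 attached to 0.
paw = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
scores, lam = eigenvector_centrality(paw)
```

Vertex 0 has the most neighbors and gets the highest score, while the pendant vertex 3 gets the lowest even though it is attached to the most central vertex.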

49
Eigenvector Centralities of the Hijackers
Name E1
Mohamed Atta 0.518
Marwan Al-Shehhi 0.489
Abdulaziz Alomari 0.296
Ziad Jarrahi 0.246
Fayez Ahmed 0.246
Satam Al Suqami 0.241
Waleed M. Alshehri 0.241
Wail Alshehri 0.241
Salem Alhamzi 0.179
Majed Moqed 0.165
Hani Hanjour 0.151
Khalid Al-Midhar 0.114
Ahmed Alghamdi 0.085
Nawaq Alhamzi 0.064
Mohald Alshehri 0.054
Hamza Alghamdi 0.015
Saeed Alghamdi 0.002
Ahmed Alnami 0
Ahmed Alhaznawi 0
50
Power Centrality
  • Given an adjacency matrix M, the power centrality
    of vertex i (denoted ci), is given by
  • a is used to normalize the score; the
    normalization parameter is automatically selected
    so that the sum of squares of the vertices'
    centralities is equal to the number of vertices
    in the network.
  • b is an attenuation factor that controls the
    effect that the power centralities of the
    neighboring vertices should have on the power
    centrality of the vertex.
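
The formula this slide refers to is missing from the transcript. The description of a and b matches Bonacich's power centrality, so the block below gives that standard definition as an assumption about what the slide showed. The symbols $c_i$, $a$, $b$, and $M$ follow the slide text; $\mathbf{1}$ (the all-ones vector), $I$ (the identity matrix), and $n$ (the number of vertices) are added notation.

```latex
% Power centrality of vertex i (Bonacich's formulation, assumed here):
c_i(a, b) = \sum_{j} \left( a + b\, c_j \right) M_{ij}

% Collecting all vertices into a vector c gives c = a M \mathbf{1} + b M c, hence
c = a \left( I - b M \right)^{-1} M \mathbf{1}

% Normalization described on the slide: a is chosen so that
\sum_{i} c_i^2 = n
```

With b = 0 the score reduces to a scaled degree, since each neighbor contributes the same amount a regardless of its own centrality.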

51
Power - Motivation
  • In a similar way to the eigenvector centrality,
    the power centrality of each vertex is determined
    by the centrality of the vertices it is connected
    to.
  • By specifying a positive or negative value for
    b, the user can control whether being connected
    to powerful vertices has a positive or a
    negative effect on a vertex's score.
  • The rationale for specifying a positive b is that
    if you are connected to powerful colleagues it
    makes you more powerful.
  • On the other hand, the rationale for a negative b
    is that powerful colleagues have many connections
    and hence are not controlled by you, while
    isolated colleagues have no other sources of
    information and hence are pretty much controlled
    by you.

52
Power of the Hijackers
Name                Power (b = 0.99)  Power (b = -0.99)
Mohamed Atta        2.254             2.214
Marwan Al-Shehhi    2.121             0.969
Abdulaziz Alomari   1.296             1.494
Ziad Jarrahi        1.07              1.087
Fayez Ahmed         1.07              1.087
Satam Al Suqami     1.047             0.861
Waleed M. Alshehri  1.047             0.861
Wail Alshehri       1.047             0.861
Salem Alhamzi       0.795             1.153
Majed Moqed         0.73              1.029
Hani Hanjour        0.673             1.334
Khalid Al-Midhar    0.503             0.596
Ahmed Alghamdi      0.38              0.672
Nawaq Alhamzi       0.288             0.574
Mohald Alshehri     0.236             0.467
Hamza Alghamdi      0.07              0.566
Saeed Alghamdi      0.012             0.656
Ahmed Alnami        0.003             0.183
Ahmed Alhaznawi     0.003             0.183
53
Network Centralization
  • In addition to the individual vertex centrality
    measures, we can assign a number between 0 and 1
    that signals the level of centralization of the
    whole network.
  • The network centralization measures are computed
    from the centrality values of the network's
    vertices, and hence for each type of individual
    centrality measure we have an associated
    network centralization measure.
  • A network that is structured like a circle will
    have a network centralization value of 0 (since
    all vertices have the same centrality value),
    while a network that is structured like a star
    will have a network centralization value of 1.
  • We will now provide some of the formulas for the
    different network centralization measures.
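
The formulas on the following slides did not survive in this transcript. As an illustration of the circle and star cases, the sketch below uses Freeman's degree centralization: the sum of the differences between the largest degree and each vertex's degree, divided by the largest value that sum can take, (n - 1)(n - 2). Treating this as the formula behind the NetDegree value reported later is an assumption. The function and variable names are illustrative.

```python
def degree_centralization(graph):
    """Freeman degree centralization of an undirected simple graph."""
    n = len(graph)
    if n < 3:
        raise ValueError("centralization needs at least 3 vertices")
    degrees = [len(neighbors) for neighbors in graph.values()]
    d_max = max(degrees)
    # The star attains the maximum possible sum, (n - 1) * (n - 2).
    return sum(d_max - d for d in degrees) / ((n - 1) * (n - 2))


# Hypothetical 5-vertex star and 5-vertex circle.
star = {"c": {"w", "x", "y", "z"}, "w": {"c"}, "x": {"c"}, "y": {"c"}, "z": {"c"}}
circle = {
    "a": {"b", "e"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c", "e"},
    "e": {"d", "a"},
}

star_value = degree_centralization(star)
circle_value = degree_centralization(circle)
```

The star gives 1 and the circle gives 0, matching the two extremes described above. Any other network falls somewhere in between.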

54
Degree

For the Hijackers network, NetDegree = 0.31
55
Betweenness
For the Hijackers network, NetBet = 0.24
56
Summary Diagram