Link Analysis, Web Structure Mining and Web communities - PowerPoint PPT Presentation

1
Link Analysis, Web Structure Mining and Web
communities
An overview of methods and results
  • Mike Thelwall

2
Contents
  • PageRank: ranking all pages
  • HITS: ranking topic-relevant pages
  • Community Identification Algorithm: identifying
    web communities

3
1. PageRank
  • Assumptions
  • A page with many links to it is more likely to be
    useful than one with few links to it
  • The links from a page that itself is the target
    of many links are likely to be particularly
    important

4
Example
[Figure: link graph with pages labelled X and Y]
X seems to be the most important page, since two
important pages link to it
5
Simple voting model: round 1
[Figure: four-node link graph; node votes 1, 1, 1, 1]
6
Simple voting model: round 2
[Figure: node votes 1.5, 1, 0, 1.5]
7
Simple voting model: round 3
[Figure: node votes 2, 0, 0, 2]
8
Revised voting model: round 1
[Figure: node votes 1, 1, 1, 1]
  • Allocate 1 vote to each node after each voting
    round
  • Remove votes from leaf nodes

9
Revised voting model: round 2
[Figure: node votes 1.5, 2, 1, 1.5]
10
Revised voting model: round 3
[Figure: node votes 2, 2, 1, 2]
The middle node has only one link to it, but the
page linking to it does not share its votes with
any other node
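
The two voting models on slides 5 to 10 can be sketched in a few lines of Python. The link graph is an assumption: the figure did not survive in this transcript, but a four-node graph in which A links to B, and B links to X and Y, reproduces every vote count shown. The names `voting_round`, `LINKS` and `revised` are illustrative only.

```python
# Assumed link graph (not given in the transcript): A -> B, B -> X, B -> Y.
LINKS = {"A": ["B"], "B": ["X", "Y"], "X": [], "Y": []}


def voting_round(links, votes, revised=False):
    """Run one voting round and return the new vote counts.

    Simple model: each page splits its votes equally among the pages it
    links to; a leaf page (no out-links) keeps its votes.
    Revised model: votes held by leaf pages are removed, and every page
    is allocated 1 fresh vote after the round.
    """
    new_votes = {page: 0.0 for page in votes}
    for page, targets in links.items():
        if targets:
            for target in targets:
                new_votes[target] += votes[page] / len(targets)
        elif not revised:
            new_votes[page] += votes[page]  # simple model: leaf keeps its votes
    if revised:
        for page in new_votes:
            new_votes[page] += 1.0
    return new_votes


start = {page: 1.0 for page in LINKS}

simple_2 = voting_round(LINKS, start)
simple_3 = voting_round(LINKS, simple_2)

revised_2 = voting_round(LINKS, start, revised=True)
revised_3 = voting_round(LINKS, revised_2, revised=True)
```

Under this assumed graph, the simple model drains every vote into the two leaf pages, while the revised model settles after one round.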
11
Revised voting model: cycling problem
[Figure: three-node graph; node votes 1, 1, 1]
12
PageRank
  • Pass on only a proportion of each vote along
    links and redistribute the rest
  • If the proportion is < 1 then no cycling will occur
  • Voting can also be performed by a matrix
  • Find the votes from the principal left eigenvector
    of the matrix
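
A minimal sketch of the matrix form, under stated assumptions. Row i of the matrix says how page i's votes are spread in one round: a page with out-links passes the given proportion along its links and spreads the rest evenly over all pages, and a leaf page spreads everything evenly. The principal left eigenvector is then approximated by repeated left-multiplication (power iteration). The four-node graph (A links to B, B links to X and Y), the proportion 0.8 and all function names are assumptions made for illustration.

```python
# Assumed link graph (not given in the transcript): A -> B, B -> X, B -> Y.
LINKS = {"A": ["B"], "B": ["X", "Y"], "X": [], "Y": []}
PAGES = ["A", "B", "X", "Y"]


def transition_matrix(links, pages, proportion=0.8):
    """Build a row-stochastic matrix: row i describes where page i's votes go."""
    n = len(pages)
    index = {page: i for i, page in enumerate(pages)}
    matrix = [[0.0] * n for _ in range(n)]
    for page in pages:
        row = matrix[index[page]]
        targets = links[page]
        if targets:
            for j in range(n):
                row[j] = (1 - proportion) / n  # redistributed share
            for target in targets:
                row[index[target]] += proportion / len(targets)
        else:
            for j in range(n):
                row[j] = 1.0 / n  # leaf: all votes are redistributed
    return matrix


def left_multiply(vector, matrix):
    """Return the row vector `vector` times `matrix`."""
    n = len(vector)
    return [sum(vector[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def principal_left_eigenvector(matrix, total, tol=1e-12, max_iter=1000):
    """Power iteration: repeat v <- v M until v stops changing."""
    n = len(matrix)
    votes = [total / n] * n
    for _ in range(max_iter):
        new_votes = left_multiply(votes, matrix)
        if max(abs(a - b) for a, b in zip(votes, new_votes)) < tol:
            return new_votes
        votes = new_votes
    return votes


M = transition_matrix(LINKS, PAGES)
stable_votes = principal_left_eigenvector(M, total=4.0)
```

Because every row of the matrix sums to 1, the total number of votes in the system stays at 4 from round to round.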

13
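PageRank: round 1, giving votes
[Figure: four nodes with 1 vote each; votes passed along the links: 0.4, 0.8, 0.4]
  • 4 votes in the system: pass on 80% of each vote
    and redistribute the other 20%, plus the votes
    lost from leaf nodes = 2.4 votes. Redistribute
    2.4/4 = 0.6 extra votes to each node

14
PageRank: round 2, receiving votes
[Figure: votes received: 0.6 + 0.4 = 1, 0.6 + 0.8 = 1.4, 0.6, 0.6 + 0.4 = 1]
15
PageRank: round 3, giving votes
[Figure: node votes 1, 1.4, 0.6, 1; votes passed along the links: 1.4 × 0.8 / 2 = 0.56, 0.6 × 0.8 = 0.48, 1.4 × 0.8 / 2 = 0.56]
Lost votes: 0.6 × 0.2 + 1.4 × 0.2 + 1 + 1 = 2.4.
Redistribute 2.4/4 = 0.6 votes to each node
16
PageRank: round 3, receiving votes
[Figure: votes received: 0.6 + 0.56 = 1.16, 0.6 + 0.48 = 1.08, 0.6, 1.16]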
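
The give-and-receive arithmetic on slides 13 to 16 can be checked with a short sketch. The graph (A links to B, B links to X and Y) is an assumption, since the figure is missing from this transcript, but it reproduces the numbers shown. `pagerank_round` and its parameters are illustrative names, not taken from the slides.

```python
# Assumed link graph (not given in the transcript): A -> B, B -> X, B -> Y.
LINKS = {"A": ["B"], "B": ["X", "Y"], "X": [], "Y": []}


def pagerank_round(links, votes, proportion=0.8):
    """One round: give votes along links, then redistribute the lost votes.

    A page with out-links passes `proportion` of its votes, split equally
    among its targets, and loses the rest. A leaf page loses all its votes.
    The lost votes are shared equally among all pages.
    """
    received = {page: 0.0 for page in votes}
    lost = 0.0
    for page, targets in links.items():
        if targets:
            share = votes[page] * proportion / len(targets)
            for target in targets:
                received[target] += share
            lost += votes[page] * (1 - proportion)
        else:
            lost += votes[page]
    bonus = lost / len(votes)
    return {page: received[page] + bonus for page in votes}


start = {page: 1.0 for page in LINKS}
after_first = pagerank_round(LINKS, start)         # compare with slide 14
after_second = pagerank_round(LINKS, after_first)  # compare with slide 16

for page in LINKS:
    print(page, round(after_first[page], 2), round(after_second[page], 2))
```

Rounding is applied only when printing, because the intermediate values are binary floating-point approximations of the decimals on the slides.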
17
PageRank Summary
  • The pages that get the highest PageRank are those
    that are linked to by many pages or by important
    pages
  • Spammers try to exploit this by creating dummy
    sites to link to their main sites

18
2. Kleinberg's HITS
  • Uses link structure, but also uses page content,
    to identify pages that are useful for a coherent
    topic on the web
  • An Authority is a page that is linked to by many
    other pages from the same topic
  • A Hub is a page that links to many pages from the
    same topic

19
Hubs and authorities
[Figure: link graph with one page labelled A (authority) and two labelled H (hub)]
20
The HITS algorithm
  • Another iterative algorithm
  • Each page has a hub value and an authority value
  • Unlike PageRank, it is topic-specific, and needs
    to be recomputed for each user query

21
HITS 1: Finding the Base Set (simplified version)
  • Use a text-based query to obtain the top t
    matching pages
  • Add all pages linked to or linking to the
    matching pages
  • Remove all links between pages within the same
    site
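
The three steps above can be sketched as follows, assuming the text query has already returned its top t pages and that the link data is available as (source, target) URL pairs. Treating a "site" as a URL's host name is an assumption, as are the function name `build_base_set` and the example URLs.

```python
from urllib.parse import urlparse


def build_base_set(matching_pages, links):
    """Return (base_set, kept_links) for a simplified HITS base set.

    matching_pages: URLs returned by the text query (the top t matches).
    links: list of (source_url, target_url) pairs.
    """
    matches = set(matching_pages)
    base_set = set(matches)
    # Add every page that a matching page links to or is linked from.
    for source, target in links:
        if source in matches or target in matches:
            base_set.update((source, target))
    # Keep only links inside the base set that cross site boundaries.
    kept_links = [
        (source, target)
        for source, target in links
        if source in base_set
        and target in base_set
        and urlparse(source).netloc != urlparse(target).netloc
    ]
    return base_set, kept_links


# Hypothetical data for illustration only.
example_links = [
    ("http://a.example/1", "http://b.example/1"),
    ("http://a.example/1", "http://a.example/2"),  # same site: link removed
    ("http://c.example/1", "http://a.example/1"),
    ("http://d.example/1", "http://e.example/1"),  # unrelated to the matches
]
base_set, kept_links = build_base_set(["http://a.example/1"], example_links)
```

Only the links are removed in the last step; a page reached through a same-site link stays in the base set.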

22
HITS 2: Computing Hubs and Authorities
(simplified)
  • (Initialising) Assign each page an equal
    authority weight x and hub weight y
  • (Iterating 1) Add the hub weight of each page to
    the authority weight of each page it links to
  • (Iterating 2) Add the authority weight of each
    page to the hub weight of each page that links
    to it
  • Normalise and repeat until stable

23
[Figure: four pages, each with hub 0.45 and authority 0.45]
24
[Figure: (hub, authority) values after one iteration: (0.45, 0.9), (1.35, 0.9), (0.9, 0.45), (0.45, 0.9)]
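
A minimal sketch of the simplified update, which adds to the existing weights as the slide describes. The link graph is an assumption (A links to B, B links to X and Y), chosen because it reproduces the hub and authority values shown after one iteration. `hits_step` and `normalise` are illustrative names, and unit Euclidean length is an assumed choice of normalisation.

```python
import math

# Assumed link graph (not given in the transcript): A -> B, B -> X, B -> Y.
LINKS = {"A": ["B"], "B": ["X", "Y"], "X": [], "Y": []}


def hits_step(links, hub, auth):
    """One simplified iteration, before normalising.

    Each page's hub weight is added to the authority weight of every page
    it links to, and each page's authority weight is added to the hub
    weight of every page that links to it. Both updates read the old values.
    """
    new_hub = dict(hub)
    new_auth = dict(auth)
    for page, targets in links.items():
        for target in targets:
            new_auth[target] += hub[page]
            new_hub[page] += auth[target]
    return new_hub, new_auth


def normalise(weights):
    """Scale the weights so that their squares sum to 1."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {page: w / norm for page, w in weights.items()}


hub = {page: 0.45 for page in LINKS}
auth = {page: 0.45 for page in LINKS}
hub, auth = hits_step(LINKS, hub, auth)

hub_normalised = normalise(hub)
auth_normalised = normalise(auth)
```

In this assumed graph B ends up with the largest hub weight because it links to two pages, while A, which nothing links to, keeps the lowest authority weight.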
25
HITS 3: Computing Ranks
  • Use Hub and Authority values but return a mixture
    of the top hub values and top authority values
  • This should avoid pages that would rank highly
    for general reasons but are not authoritative for
    the topic

26
3. The Community Identification Algorithm
  • The Community Identification Algorithm operates
    on the link structure of the Web alone
  • It identifies communities: collections of pages
    that tend to link to each other but tend not to
    link to pages outside the community
  • This is useful for topic identification

27
The CIA algorithm
  • Is based upon the maximal flow algorithm
  • Start with one or more relevant pages: the seed
    set
  • Partitions the web into two groups: pages within
    the community of the seed set and pages outside
    it
  • Works by creating an artificial network
  • Is very complicated!
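
A rough sketch of the idea, not the published algorithm. It builds an artificial network with a source (the water tank) joined to the seed pages by unlimited pipes, pipes of capacity k between linked pages, and a pipe of capacity 1 from every non-seed page to a sink (the well). The community is taken to be the pages still reachable from the source once the flow is maximal. Those construction details, the use of Edmonds-Karp for the maximal flow, the example graph and all names are assumptions for illustration.

```python
from collections import deque


def max_flow_min_cut(capacity, source, sink):
    """Edmonds-Karp. Returns (flow, nodes on the source side of a minimum cut)."""
    residual = {u: dict(edges) for u, edges in capacity.items()}
    for u, edges in capacity.items():
        for v in edges:
            residual.setdefault(v, {}).setdefault(u, 0)  # reverse edges

    def reachable_with_parents():
        parents = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parents:
                    parents[v] = u
                    queue.append(v)
        return parents

    flow = 0
    while True:
        parents = reachable_with_parents()
        if sink not in parents:
            return flow, set(parents)
        path, v = [], sink
        while parents[v] is not None:
            path.append((parents[v], v))
            v = parents[v]
        push = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= push
            residual[v][u] += push
        flow += push


def find_community(links, seeds, k):
    """Pages on the seed side of a minimum cut of the artificial network."""
    tank, well = "_tank", "_well"
    pages = set(links) | {q for targets in links.values() for q in targets}
    capacity = {page: {} for page in pages}
    capacity[tank] = {seed: float("inf") for seed in seeds}
    for page, targets in links.items():
        for target in targets:
            capacity[page][target] = k  # links treated as two-way pipes
            capacity[target][page] = k
    for page in pages:
        if page not in seeds:
            capacity[page][well] = 1
    _, source_side = max_flow_min_cut(capacity, tank, well)
    return source_side - {tank}


# Hypothetical graph: a triangle a-b-c joined by one link (c-d) to d, e, f, g.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["d"],
         "d": ["e", "f"], "e": ["f", "g"], "f": ["g"]}

community_k3 = find_community(LINKS, {"a"}, k=3)
community_k1 = find_community(LINKS, {"a"}, k=1)
```

With k = 3 the cheapest cut in this hypothetical graph is the single link between the triangle and the rest, so the community is the triangle. With k = 1 cutting the seed's own two links is cheaper, and the community shrinks to the seed alone.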

28
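CIA example: what is the community around the
given node?
[Figure: link graph with the initial community marked]
29
How much water can flow into the well for any
value of k?
[Figure: flow network from a water tank to a well; labels k, k, k, 1, 1, 1, 1; initial community marked]
30
How much water can flow into the well for k = 1?
[Figure: water tank to well; extracted labels (flow/capacity where given): 1, 1/1, 2, 1, 1, 1/1, 1, 1/1]
Cut through full pipes
31
How much water can flow into the well for k = 2?
[Figure: water tank to well; extracted labels (flow/capacity where given): 0/2, 2/2, 3, 1/2, 0/1, 1/1, 0/1, 1/1]
Cut through full pipes
32
How much water can flow into the well for k = 4?
[Figure: water tank to well; extracted labels (flow/capacity where given): 1/4, 3/4, 4, 1/4, 1/1, 1/1, 1/1, 1/1]
Cut through full pipes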
33
4. Link Algorithms - Overview
  • The success of HITS and PageRank indicates the
    importance of links as a new information source
  • More needs to be known about patterns of linking
  • But there is still little hard evidence that link
    approaches work well for Web IR: academic papers
    report unscientific experiments or inconclusive
    results

34
References
  • Brin, S., & Page, L. (1998). The anatomy of a
    large-scale hypertextual Web search engine.
    Computer Networks and ISDN Systems, 30(1-7),
    107-117.
  • Kleinberg, J. (1999). Authoritative sources in a
    hyperlinked environment. Journal of the ACM,
    46(5), 604-632.
  • Flake, G. W., Lawrence, S., Giles, C. L., &
    Coetzee, F. M. (2002). Self-organization and
    identification of Web communities. IEEE Computer,
    35, 66-71.