Title: Web structure mining / link mining and Web communities Bettina Berendt
1 Web structure mining / link mining and Web communities
Bettina Berendt
Knowledge and the Web, summer semester 2005
http://www.wiwi.hu-berlin.de/berendt/lehre/2005s/kaw/
Last updated 2005-05-04
2 Acknowledgements and how to use these slides
- Some of these slides were taken from the slide set of the Web Mining book by Baldi, Frasconi, and Smyth (http://ibook.ics.uci.edu/Slides/MIW20Chapter205.ppt); thank you for a great book and slides!
- These slides are marked at the bottom left corner.
- I also based the slide layout on that slide set.
- Some figures were taken from the two presented articles (see p. 4).
- Further materials can be found in the directory of this session (http://www.wiwi.hu-berlin.de/berendt/lehre/2005s/kaw/Session4).
- Slides that just carry a title were developed in class and on the blackboard.
- Please feel free to re-use these slides in your own teaching, and please credit their origin.
3 Objectives
- To explore what's in a link and to see what knowledge can therefore be extracted by analysing links
- To calculate the popularity of a site based on link analysis
- To see how linkage defines communities
4 Outline: Theory and applications of link analysis for ...
- Search: ranking of search engine results
- Scientific communities: co-citation analysis and other bibliometrics
  - Chen, C., Carr, L. (1999). Visualizing the evolution of a subject domain: A case study. Proc. IEEE Visualization 1999.
  - An example of a resulting archive: citeseer
- Identification of Web communities
  - Flake, G.W., Lawrence, S., Giles, C.L. (2000). Efficient identification of web communities. Proc. KDD 2000.
- Outlook: Social network analysis
5 The Web is a graph
(Scientific) literature is a graph
6 Recall: Trees (slide from ISI)
- A is the root node
- B is the parent of D and E
- D and E are children of B
- (C,F) is an edge
- 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 are leaves
- A, B, C, D, E, F, G, H, I are internal nodes
- The level (or depth) of E is 2 (number of edges to root)
- The height (or order) of the tree is 4 (max number of edges from root to a leaf node)
- The degree of node B is 2 (number of children)
Based on Tom Blough, Introduction to Programming. http://www.rh.edu/blought/fall02_cish4960/notes/lecture11-12.ppt
7 Graphs (data structure def.)
- Definition: A set of items connected by edges. Each item is called a vertex or node. Formally, a graph is a set of vertices and a binary relation between vertices, adjacency.
- Formal definition: A graph G can be defined as a pair (V, E), where V is a set of vertices and E is a set of edges between the vertices, $E \subseteq \{(u,v) \mid u, v \in V\}$. If the graph is undirected, the adjacency relation defined by the edges is symmetric, or $E \subseteq \{\{u,v\} \mid u, v \in V\}$ (sets of vertices rather than ordered pairs). If the graph does not allow self-loops, adjacency is irreflexive.
- (http://www.nist.gov/dads/HTML/graph.html)
- Note: Edges are also called links (esp. in hypertext graphs like the WWW).
- $\in$ denotes the element-of relation
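As an illustration of the definition above, here is a minimal Python sketch that stores G = (V, E) as an adjacency mapping and checks the two properties the slide mentions (symmetry for undirected graphs, irreflexivity when there are no self-loops). The function names and the example page names are assumptions made for this sketch; they do not come from the slides or from the NIST page.

```python
def make_graph(vertices, edges, directed=True):
    """Build the adjacency relation of G = (V, E) as a dict: vertex -> set of neighbours."""
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        if u not in adjacency or v not in adjacency:
            raise ValueError(f"edge ({u!r}, {v!r}) uses a vertex outside V")
        adjacency[u].add(v)
        if not directed:
            # An undirected edge {u, v} makes the relation hold in both directions.
            adjacency[v].add(u)
    return adjacency


def is_symmetric(adjacency):
    """True if every edge u -> v is matched by an edge v -> u."""
    return all(u in adjacency[v] for u in adjacency for v in adjacency[u])


def is_irreflexive(adjacency):
    """True if no vertex is adjacent to itself (no self-loops)."""
    return all(v not in adjacency[v] for v in adjacency)


# Hypothetical three-page site: links are directed edges.
V = ["home", "about", "contact"]
E = [("home", "about"), ("about", "contact")]
web_graph = make_graph(V, E)
undirected_graph = make_graph(V, E, directed=False)

print(web_graph)
print(is_symmetric(web_graph), is_symmetric(undirected_graph))
print(is_irreflexive(web_graph))
```

A hypertext graph such as the Web is naturally the directed case: a link from "home" to "about" says nothing about a link back.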
8 What's in a link? 1. "This is good."
Basic assumptions of early link analysis
- Hyperlinks contain information about the human judgment of a site
- The more incoming links a site has, the more important it is judged to be
9 Outline of the "link analysis for search engine ranking" part
- Early Approaches to Link Analysis
- Hubs and Authorities HITS
- Page Rank
- Stability
- Probabilistic Link Analysis
- Limitation of Link Analysis
Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine
10 Early Approaches
Bray 1996
- The visibility of a site is measured by the number of other sites pointing to it
- The luminosity of a site is measured by the number of other sites to which it points
- → Limitation: failure to capture the relative importance of different parent (children) sites
Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine
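The two measures on this slide are simple degree counts. The following Python sketch computes them from a list of (source, target) links between sites; the function name, the input format, and the example sites are assumptions of this sketch, not Bray's original procedure.

```python
def visibility_and_luminosity(links):
    """Count, per site, the other sites pointing to it and the other sites it points to."""
    incoming = {}  # site -> set of other sites linking to it
    outgoing = {}  # site -> set of other sites it links to
    for source, target in links:
        for site in (source, target):
            incoming.setdefault(site, set())
            outgoing.setdefault(site, set())
        if source != target:  # the slide counts *other* sites only
            incoming[target].add(source)
            outgoing[source].add(target)
    visibility = {site: len(sources) for site, sources in incoming.items()}
    luminosity = {site: len(targets) for site, targets in outgoing.items()}
    return visibility, luminosity


# Hypothetical link list; the repeated ("a", "c") link and the self-link are counted once / not at all.
links = [("a", "c"), ("b", "c"), ("c", "d"), ("a", "c"), ("d", "d")]
visibility, luminosity = visibility_and_luminosity(links)
print(visibility)
print(luminosity)
```

Both counts treat every linking site alike, which is exactly the limitation noted on the slide: a link from an important parent weighs the same as a link from an obscure one.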
11 Early Approaches
Mark 1988
- To calculate the score S of a document at vertex v:
$$S(v) = s(v) + \frac{1}{|ch[v]|} \sum_{w \in ch[v]} S(w)$$
where v is a vertex in the hypertext graph G = (V, E), S(v) is the global score, s(v) is the score if the document is isolated, and ch[v] is the set of children of the document at vertex v
- Limitations
  - Requires G to be a directed acyclic graph (DAG)
  - If v has a single link to w, S(v) > S(w)
  - If v has a long path to w and s(v) < s(w), then S(v) > S(w) → unreasonable
Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine
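To make the recursion and its limitations concrete, here is a small Python sketch of the score S(v) = s(v) + (1/|ch[v]|) Σ S(w). The function name, the dictionary-based input format, and the example scores are assumptions chosen for illustration. The sketch raises an error on cycles, which reflects the DAG requirement listed above.

```python
def global_scores(children, local_score):
    """Compute S(v) = s(v) + mean of S(w) over the children w of v, for every vertex."""
    scores = {}
    in_progress = set()  # vertices on the current recursion path, used to detect cycles

    def score(v):
        if v in scores:
            return scores[v]
        if v in in_progress:
            raise ValueError("graph contains a cycle; the recursion needs a DAG")
        in_progress.add(v)
        kids = children.get(v, [])
        total = local_score[v]
        if kids:
            total += sum(score(w) for w in kids) / len(kids)
        in_progress.discard(v)
        scores[v] = total
        return total

    for v in local_score:
        score(v)
    return scores


# Hypothetical chain v -> w1 -> w2, where the last document has the best local score.
children = {"v": ["w1"], "w1": ["w2"], "w2": []}
local_score = {"v": 1, "w1": 1, "w2": 5}
S = global_scores(children, local_score)
print(S)
```

In this chain, v ends up with a higher global score than w2 although its local score is lower, simply because it sits at the start of a path leading to w2. This is the "unreasonable" behaviour the slide points out.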
12 HITS - Kleinberg's Algorithm
- HITS: Hypertext Induced Topic Selection
- For each vertex $v \in V$ in a subgraph of interest:
  a(v) - the authority of v
  h(v) - the hubness of v
- A site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less important sites
- Hubness shows the importance of a site. A good hub is a site that links to many authoritative sites
Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine
13 Authority and Hubness
[Figure: nodes 2, 3, and 4 point to node 1; node 1 points to nodes 5, 6, and 7]
$a(1) = h(2) + h(3) + h(4)$
$h(1) = a(5) + a(6) + a(7)$
Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine
14 Authority and Hubness: Convergence
- Recursive dependency:
  - $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
  - $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$
- Using linear algebra, we can prove that a(v) and h(v) converge
Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine
15 HITS Example
Find a base subgraph
- Start with a root set R = {1, 2, 3, 4}
- {1, 2, 3, 4}: nodes relevant to the topic
- Expand the root set R to include all the children and a fixed number of parents of nodes in R
→ A new set S (base subgraph)
Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine
16 HITS Example
Hubs and authorities: two n-dimensional vectors a and h
HubsAuthorities(G)
    $\mathbf{1} \leftarrow [1, \ldots, 1] \in \mathbb{R}^{|V|}$
    $a_0 \leftarrow h_0 \leftarrow \mathbf{1}$
    $t \leftarrow 0$
    repeat
        $t \leftarrow t + 1$
        for each v in V
            do $a_t(v) \leftarrow \sum_{w \in pa[v]} h_{t-1}(w)$
               $h_t(v) \leftarrow \sum_{w \in ch[v]} a_{t-1}(w)$
        $a_t \leftarrow a_t / \|a_t\|$
        $h_t \leftarrow h_t / \|h_t\|$
    until $\|a_t - a_{t-1}\| + \|h_t - h_{t-1}\| < \varepsilon$
    return $(a_t, h_t)$
Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine
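The iteration on this slide can be sketched in plain Python as follows. This is only an illustrative sketch: the function names, the dictionary input format, the use of the Euclidean norm, the `max_iter` safeguard, and the tiny example graph are all assumptions, not part of the original slides. As on the slide, both updates read the vectors of the previous iteration.

```python
import math


def _normalize(vector):
    """Scale a dict-based vector to Euclidean length 1 (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector.values()))
    return dict(vector) if norm == 0 else {v: x / norm for v, x in vector.items()}


def _distance(p, q):
    """Euclidean distance between two dict-based vectors over the same keys."""
    return math.sqrt(sum((p[v] - q[v]) ** 2 for v in p))


def hubs_authorities(children, eps=1e-10, max_iter=1000):
    """Iterate a(v) <- sum of h over parents, h(v) <- sum of a over children, then normalize."""
    nodes = sorted(set(children) | {w for ws in children.values() for w in ws})
    parents = {v: [] for v in nodes}
    for v, ws in children.items():
        for w in ws:
            parents[w].append(v)

    a = {v: 1.0 for v in nodes}
    h = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        new_a = _normalize({v: sum(h[w] for w in parents[v]) for v in nodes})
        new_h = _normalize({v: sum(a[w] for w in children.get(v, [])) for v in nodes})
        change = _distance(new_a, a) + _distance(new_h, h)
        a, h = new_a, new_h
        if change < eps:
            break
    return a, h


# Hypothetical graph: hub1 links to x and y, hub2 links to x only.
children = {"hub1": ["x", "y"], "hub2": ["x"]}
authority, hubness = hubs_authorities(children)
print(authority)
print(hubness)
```

In this example x collects links from both hubs and therefore gets the larger authority weight, while hub1, which links to both authorities, gets the larger hub weight. The example graph was picked so that the iteration settles on a single answer; on graphs where several structures tie for dominance the weights may not settle uniquely, which is why the sketch caps the number of iterations.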
17 HITS Example: Results
[Figure: authority and hubness weights for nodes 1 to 15; panels: Authority, Hubness]
Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine
18 HITS Improvements
Bharat and Henzinger (1998)
- HITS problems
  - A document can contain many identical links to the same document on another host
  - Links are generated automatically (e.g. messages posted on newsgroups)
- Solutions
  - Assign weights to identical multiple edges that are inversely proportional to their multiplicity
  - Prune irrelevant nodes, or regulate the influence of a node with a relevance weight
Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine
19 Markov Chain Notation
- Random surfer model
  - Description of a random walk through the Web graph
  - Described by a transition matrix; the rank of a page is interpreted as the asymptotic probability that a surfer is currently browsing that page
$r_t = M r_{t-1}$, where M is the transition matrix for a first-order Markov chain (stochastic)
Does it converge to some sensible solution (as $t \to \infty$) regardless of the initial ranks?
Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine
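The question on this slide can be explored numerically. The sketch below repeatedly applies r_t = M r_{t-1} to a small, hypothetical three-page graph, starting from two different initial rank vectors. The function name, the column-stochastic convention (M[i][j] is the probability of moving from page j to page i), and the example matrix are assumptions of this sketch; it is plain power iteration, without any additions such as damping.

```python
def power_iteration(M, r0, eps=1e-12, max_iter=10_000):
    """Apply r_t = M r_{t-1} until the ranks change by less than eps (L1 distance)."""
    r = list(r0)
    n = len(r)
    for _ in range(max_iter):
        nxt = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        if sum(abs(x - y) for x, y in zip(nxt, r)) < eps:
            return nxt
        r = nxt
    return r


# Hypothetical Web graph: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
# Column j holds the probabilities of leaving page j, so every column sums to 1.
M = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]

ranks_from_page0 = power_iteration(M, [1.0, 0.0, 0.0])
ranks_from_uniform = power_iteration(M, [1 / 3, 1 / 3, 1 / 3])
print(ranks_from_page0)
print(ranks_from_uniform)
```

For this particular chain both starting points lead to the same ranks, because every page can reach every other page and the walk is not locked into a fixed cycle length. Chains without these properties need not converge to a unique answer, which is what the question on the slide is about.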
20 Limits of Link Analysis
- META tags / invisible text
  - Search engines relying on meta tags in documents are often misled (intentionally) by web developers
- Pay-for-place
  - Search engine bias: organizations pay search engines for page rank
  - Advertisements: organizations pay high-ranking pages for advertising space
    - With a primary effect of increased visibility to end users and a secondary effect of increased respectability due to relevance to a high-ranking page
Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine
21 Limits of Link Analysis
- Stability
  - Adding even a small number of nodes/edges to the graph has a significant impact
- Topic drift (similar to TKC)
  - A top authority may be a hub of pages on a different topic, resulting in increased rank of the authority page
- Content evolution
  - Adding/removing links/content can affect the intuitive authority rank of a page, requiring recalculation of page ranks
Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine
22 Further Reading
- R. Lempel and S. Moran, Rank Stability and Rank Similarity of Link-Based Web Ranking Algorithms in Authority Connected Graphs, submitted to Information Retrieval, special issue on Advances in Mathematics/Formal Methods in Information Retrieval, 2003.
- M. Henzinger, Link Analysis in Web Information Retrieval, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2000.
- L. Getoor, N. Friedman, D. Koller, and A. Pfeffer, in Relational Data Mining, S. Dzeroski and N. Lavrac, Eds., Springer-Verlag, 2001.
Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine
23 What's in a link? 2. "This has something to do with my document / me."
24 Ex.: the citeseer archive
25 Co-citation analysis and bibliographic coupling: basic ideas
26 Matrix Notation
Adjacency matrix A
http://www.kusatro.kyoto-u.com
Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine
27 Co-citation matrices: document similarity
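The slides only name co-citation and bibliographic coupling; the details were developed in class. As a hedged illustration, the sketch below uses the standard bibliometric definitions with a citation adjacency matrix A, where A[i][j] = 1 if document i cites document j. Co-citation counts the documents that cite both i and j (the entries of AᵀA); bibliographic coupling counts the references that i and j share (the entries of AAᵀ). The function name and the example matrix are assumptions of this sketch.

```python
def cocitation_and_coupling(A):
    """Return (co-citation matrix A^T A, bibliographic coupling matrix A A^T)."""
    n = len(A)
    # cocitation[i][j]: number of documents k that cite both i and j
    cocitation = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(n)]
                  for i in range(n)]
    # coupling[i][j]: number of documents k cited by both i and j
    coupling = [[sum(A[i][k] * A[j][k] for k in range(n)) for j in range(n)]
                for i in range(n)]
    return cocitation, coupling


# Hypothetical citation graph over four documents:
# documents 0 and 1 both cite 2 and 3; document 2 cites 3.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
cocitation, coupling = cocitation_and_coupling(A)
print(cocitation)
print(coupling)
```

Either matrix can serve as a document similarity measure: documents 2 and 3 are similar because they are cited together, and documents 0 and 1 are similar because they cite the same sources.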
28 Turning similarity into spatial proximity: an MDS map
29 A clearer visualization: pathfinder networks
30 Visualizing evolution with pathfinder networks
31 Web communities
- A community is a collection of web pages created by individuals or any kind of association that have a common interest in a specific topic, such as fan pages of a baseball team or official pages of PC vendors. Blog communities are another example.
- Formally: Flake et al.'s definition of an ideal community
  - "A Pokemon web site is a site that links to or is linked by more Pokemon sites than non-Pokemon sites."
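The informal definition above can be turned into a simple check: a candidate set of sites is a community if every member has more links (in either direction) to members than to non-members. The Python sketch below does this for a small, hypothetical link list. The function name, the choice to count each directed link separately, and the example sites are assumptions of this sketch; it only verifies a given set and is not the identification method of Flake et al.

```python
def is_community(links, members):
    """True if every member has more links to/from members than to/from non-members."""
    members = set(members)
    inside = {m: 0 for m in members}
    outside = {m: 0 for m in members}
    for source, target in links:
        for site, other in ((source, target), (target, source)):
            if site in members and site != other:
                if other in members:
                    inside[site] += 1
                else:
                    outside[site] += 1
    return all(inside[m] > outside[m] for m in members)


# Hypothetical link list: three fan sites link among themselves, one of them also links out.
links = [
    ("fan1", "fan2"), ("fan2", "fan1"), ("fan2", "fan3"), ("fan3", "fan1"),
    ("fan3", "news"), ("news", "portal"),
]
print(is_community(links, {"fan1", "fan2", "fan3"}))
print(is_community(links, {"fan3", "news"}))
```

The first set qualifies because each fan site has more links inside the set than outside it. The second does not, because fan3 has two links to the other fan sites but only one to the news site.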
32 An example, and how to find communities
33 Approximate communities
- To apply this nice theorem, we would need to have the whole Web on our hard disk!
- Realistically, we crawl a part of the Web, starting with some pages that are in the community we are interested in.
- Questions
  - What is crawling?
  - What pages are retrieved during this crawl?
  - What other assumptions have to be made?
34 Crawling
- Archives are not always given
- Crawling: techniques for assembling archives from the Web
  - Simple: Unix command-line utility wget
  - Sophisticated: WIRE (contains analysis) → next week
- Crawling contains graph search
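The remark that crawling contains graph search can be illustrated with a breadth-first traversal. The sketch below crawls an in-memory dictionary instead of the real Web, so that it runs without network access. The function name, the `get_links` callback (a hypothetical stand-in for "fetch the page and extract its links"), and the depth limit are assumptions of this sketch, not features of wget or WIRE.

```python
from collections import deque


def crawl(seeds, get_links, max_depth):
    """Breadth-first crawl: visit pages in order of link distance from the seed pages."""
    depth = {seed: 0 for seed in seeds}  # page -> number of links from the nearest seed
    queue = deque(seeds)
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        if depth[page] == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in get_links(page):
            if target not in depth:  # each page is scheduled only once
                depth[target] = depth[page] + 1
                queue.append(target)
    return order, depth


# Hypothetical miniature Web, given as page -> list of linked pages.
web = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["a", "d"],
    "c": ["seed"],
    "d": ["e"],
    "e": [],
}
order, depth = crawl(["seed"], lambda page: web.get(page, []), max_depth=2)
print(order)
print(depth)
```

Starting from seed pages inside a community and limiting the depth gives the kind of partial view of the Web that the approximate community approach has to work with.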
35 Focused community crawling
37 What is the virtual sink (the site that is definitely not in the community)?
- In the ideal version
- In the approximate version, use an artificial virtual sink (a theorem ensures correctness even if this is not really at the center of the graph)
38 Repetition for a better result
40 What's in a link? 3. "This is my boss."
- Examples of problems created by such nepotistic links
  - Web: link farms
    - Much work since 2000 - http://www.cse.lehigh.edu/brian/pubs/2000/aaaiws/aaai2000ws.pdf
  - Science / citation analysis
41 Outlook: social network analysis
- Bibliometrics and link mining have their roots in a much older area: social network analysis
- Direct transfer of the link analysis methods we have found: find "opinion leaders" in ciao.de and similar sites
  - → see also viral marketing
- Others: analyse communication patterns, prestige, power, ...
42 Social networks: example