Web structure mining / link mining and Web communities Bettina Berendt - PowerPoint PPT Presentation

1
Web structure mining / link mining and Web communities
Bettina Berendt
Knowledge and the Web, summer semester 2005
http://www.wiwi.hu-berlin.de/berendt/lehre/2005s/kaw/
last updated 2005-05-04
2
Acknowledgements and how to use these slides
  • Some of these slides were taken from the slide
    set of the Web Mining book by Baldi, Frasconi,
    and Smyth
    (http://ibook.ics.uci.edu/Slides/MIW20Chapter205.ppt).
    Thank you for a great book and slides!
  • These slides are marked at the bottom left corner.
  • I also based the slide layout on that slide set.
  • Some figures were taken from the two presented
    articles (see p. 4).
  • Further materials can be found in the directory
    of this session
    (http://www.wiwi.hu-berlin.de/berendt/lehre/2005s/kaw/Session4).
  • Slides that just carry a title were developed in
    class and on the blackboard.
  • Please feel free to re-use these slides in your
    own teaching, and please credit their origin.

3
Objectives
  • To explore what's in a link and to see what
    knowledge can therefore be extracted by analysing
    links
  • To calculate the popularity of a site based on
    link analysis
  • To see how linkage defines communities

4
Outline: Theory and applications of link analysis
for ...
  • Search: ranking of search engine results
  • Scientific communities: co-citation analysis and
    other bibliometrics
  • Chen, C., & Carr, L. (1999). Visualizing the
    evolution of a subject domain: A case study.
    Proc. IEEE Visualization 1999.
  • An example of a resulting archive: citeseer
  • Identification of Web communities
  • Flake, G.W., Lawrence, S., & Giles, C.L. (2000).
    Efficient identification of web communities.
    Proc. KDD 2000.
  • Outlook: Social network analysis

5
The Web is a graph
(Scientific) literature is a graph
6
Recall: Trees (slide from ISI)
  • A is the root node
  • B is the parent of D and E
  • D and E are children of B
  • (C,F) is an edge
  • 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 are leaves
  • A, B, C, D, E, F, G, H, I are internal nodes
  • The level (or depth) of E is 2 (number of edges
    to root)
  • The height (or order) of the tree is 4 (max
    number of edges from root to a leaf node)
  • The degree of node B is 2 (number of children)

Based on Tom Blough, Introduction to Programming.
http://www.rh.edu/blought/fall02_cish4960/notes/lecture11-12.ppt
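The tree vocabulary above can be checked on a small example. The following is a minimal sketch; the tree is hypothetical (it is not the tree from the slide's figure, which is not reproduced in this transcript), and the function names are illustrative.

```python
# Hypothetical tree, NOT the one from the slide's figure: node -> children.
tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"], "D": [], "E": [], "F": []}

def depth(node, root="A"):
    """Level (depth) of a node: number of edges from the root down to it."""
    def search(current, d):
        if current == node:
            return d
        for child in tree[current]:
            found = search(child, d + 1)
            if found is not None:
                return found
        return None
    return search(root, 0)

def height(node="A"):
    """Maximum number of edges from the given node down to a leaf."""
    if not tree[node]:
        return 0
    return 1 + max(height(child) for child in tree[node])

def degree(node):
    """Degree of a node: its number of children."""
    return len(tree[node])

# Leaves are the nodes without children; all others are internal nodes.
leaves = sorted(n for n, children in tree.items() if not children)
```

In this hypothetical tree, `depth("E")` counts the two edges A-B and B-E, and `height()` is the longest root-to-leaf path.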
7
Graphs (data structure def.)
  • Definition: A set of items connected by edges.
    Each item is called a vertex or node. Formally, a
    graph is a set of vertices and a binary relation
    between vertices, adjacency.
  • Formal definition: A graph G can be defined as a
    pair (V, E), where V is a set of vertices and E
    is a set of edges between the vertices,
    $E \subseteq \{(u,v) \mid u, v \in V\}$.
    If the graph is undirected, the
    adjacency relation defined by the edges is
    symmetric, or $E \subseteq \{\{u,v\} \mid u, v \in V\}$ (sets of
    vertices rather than ordered pairs). If the graph
    does not allow self-loops, adjacency is
    irreflexive.
  • (http://www.nist.gov/dads/HTML/graph.html)
  • Note: Edges are also called links (esp. in
    hypertext graphs like the WWW).
  • $\in$ denotes the element-of relation
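As a minimal sketch of this definition, a graph can be stored as a set of vertices and a set of ordered pairs. The vertex names and helper functions below are illustrative assumptions, not part of the slides.

```python
# Hypothetical example graph; vertex names are illustrative only.
V = {"a", "b", "c"}
E = {("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")}

def is_symmetric(edges):
    """True if every edge (u, v) has its reverse (v, u): an undirected graph."""
    return all((v, u) in edges for (u, v) in edges)

def is_irreflexive(edges):
    """True if no vertex is adjacent to itself, i.e. there are no self-loops."""
    return all(u != v for (u, v) in edges)

def adjacency_list(vertices, edges):
    """Map each vertex to the sorted list of vertices it links to."""
    adj = {v: [] for v in vertices}
    for u, v in edges:
        adj[u].append(v)
    return {v: sorted(targets) for v, targets in adj.items()}
```

The Web graph is directed, so in general `is_symmetric` would be false for a set of hyperlinks: a page can link to another page without being linked back.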

8
What's in a link? 1. "This is good."
Basic assumptions of early link analysis
  • Hyperlinks contain information about the human
    judgment of a site
  • The more incoming links to a site, the more it is
    judged important

9
Outline of the part on link analysis for search
engine ranking
  • Early Approaches to Link Analysis
  • Hubs and Authorities: HITS
  • PageRank
  • Stability
  • Probabilistic Link Analysis
  • Limitations of Link Analysis

Modeling the Internet and the Web School of
Information and Computer Science University of
California, Irvine
10
Early Approaches
Bray 1996
  • The visibility of a site is measured by the
    number of other sites pointing to it
  • The luminosity of a site is measured by the
    number of other sites to which it points
  • → Limitation: failure to capture the relative
    importance of different parent (child) sites
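Both measures are plain degree counts, which the following minimal sketch makes explicit. The link list is hypothetical, and counting only distinct other sites is an assumption about how duplicates and self-links should be treated.

```python
from collections import Counter

# Hypothetical site-level links as (source site, target site) pairs.
links = [("A", "C"), ("B", "C"), ("C", "D"), ("A", "D"), ("D", "C")]

def visibility(links):
    """Number of distinct other sites pointing to each site (in-degree)."""
    return Counter(dst for src, dst in set(links) if src != dst)

def luminosity(links):
    """Number of distinct other sites each site points to (out-degree)."""
    return Counter(src for src, dst in set(links) if src != dst)

vis = visibility(links)
lum = luminosity(links)
```

The limitation named on the slide shows up directly: every incoming link adds exactly 1 to the visibility, no matter how visible the linking site itself is.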


11
Early Approaches
Mark 1988
  • To calculate the score S of a document at
    vertex v:

$$S(v) = s(v) + \frac{1}{|ch[v]|} \sum_{w \in ch[v]} S(w)$$

v: a vertex in the hypertext graph G = (V, E)
S(v): the global score
s(v): the score if the document is isolated
ch[v]: the children of the document at vertex v
  • Limitations:
  • - Requires G to be a directed acyclic graph (DAG)
  • - If v has a single link to w, then S(v) > S(w)
  • - If v has a long path to w and s(v) < s(w),
    then still S(v) > S(w)
  • → unreasonable
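A minimal sketch of this score, assuming it is the isolated score plus the mean global score of the children, and using a hypothetical DAG with made-up scores:

```python
from functools import lru_cache

# Hypothetical DAG (vertex -> children) and isolated scores s(v).
children = {"v": ["w"], "w": ["x", "y"], "x": [], "y": []}
s = {"v": 0.1, "w": 0.5, "x": 0.4, "y": 0.2}

@lru_cache(maxsize=None)
def S(v):
    """Global score: own score plus the mean global score of the children."""
    ch = children[v]
    if not ch:
        return s[v]
    return s[v] + sum(S(w) for w in ch) / len(ch)
```

Here v has a single link to w and a much lower isolated score, yet `S("v")` comes out larger than `S("w")`. The recursion also only terminates because the graph is acyclic, which is the first limitation listed.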

12
HITS - Kleinberg's Algorithm
  • HITS: Hypertext Induced Topic Selection
  • For each vertex $v \in V$ in a subgraph of
    interest:

a(v) - the authority of v
h(v) - the hubness of v
  • A site is very authoritative if it receives many
    citations. Citations from important sites weigh
    more than citations from less important sites.
  • Hubness shows the importance of a site. A good
    hub is a site that links to many authoritative
    sites.

13
Authority and Hubness
[Figure: node 1 with incoming links from nodes 2, 3, 4 and outgoing links to nodes 5, 6, 7]
$h(1) = a(5) + a(6) + a(7)$
$a(1) = h(2) + h(3) + h(4)$
14
Authority and Hubness Convergence
  • Recursive dependency:
  • $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
  • $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$
  • Using linear algebra, we can prove that
    a(v) and h(v) converge.
15
HITS Example
Find a base subgraph
  • Start with a root set R = {1, 2, 3, 4}
  • {1, 2, 3, 4}: nodes relevant to
    the topic
  • Expand the root set R to include all the
    children and a fixed number of parents of nodes
    in R

→ A new set S (base subgraph)
16
HITS Example
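Hubs and authorities: two n-dimensional vectors a and h
  • HubsAuthorities(G)
  • $\mathbf{1} \leftarrow [1, \ldots, 1] \in \mathbb{R}^{|V|}$
  • $a_0 \leftarrow h_0 \leftarrow \mathbf{1}$
  • $t \leftarrow 1$
  • repeat
  •   for each $v$ in $V$
  •     do $a_t(v) \leftarrow \sum_{w \in pa[v]} h_{t-1}(w)$
  •        $h_t(v) \leftarrow \sum_{w \in ch[v]} a_{t-1}(w)$
  •   $a_t \leftarrow a_t / \|a_t\|$
  •   $h_t \leftarrow h_t / \|h_t\|$
  •   $t \leftarrow t + 1$
  • until $\|a_t - a_{t-1}\| + \|h_t - h_{t-1}\| < \epsilon$
  • return $(a_t, h_t)$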
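The iteration can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the graph is hypothetical, the Euclidean norm is assumed for normalisation, and the stopping test sums absolute changes, which is one possible reading of the norm in the pseudocode.

```python
import math

def hits(children, eps=1e-8, max_iter=1000):
    """Iterate authority and hub weights until they stop changing.

    children maps EVERY vertex to the list of vertices it links to.
    """
    nodes = list(children)
    parents = {v: [] for v in nodes}
    for v, targets in children.items():
        for w in targets:
            parents[w].append(v)

    def normalize(x):
        # Euclidean norm (an assumption); guard against an all-zero vector.
        norm = math.sqrt(sum(val * val for val in x.values())) or 1.0
        return {v: val / norm for v, val in x.items()}

    a = {v: 1.0 for v in nodes}
    h = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        # Both updates read the values of the previous iteration.
        new_a = normalize({v: sum(h[w] for w in parents[v]) for v in nodes})
        new_h = normalize({v: sum(a[w] for w in children[v]) for v in nodes})
        change = sum(abs(new_a[v] - a[v]) + abs(new_h[v] - h[v]) for v in nodes)
        a, h = new_a, new_h
        if change < eps:
            break
    return a, h

# Hypothetical graph: pages 1, 2, 3 act as hubs for pages 4 and 5.
graph = {1: [4, 5], 2: [4, 5], 3: [4], 4: [], 5: []}
authority, hubness = hits(graph)
```

In this graph page 4 is cited by all three hubs and page 5 by two, so page 4 ends up with the higher authority weight, and hubs 1 and 2 score higher than hub 3, which links to only one authority.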
17
HITS Example Results
[Figure: authority and hubness weights for nodes 1 to 15]
18
HITS Improvements
Bharat and Henzinger (1998)
  • HITS problems:
  • A document can contain many identical links to
    the same document on another host
  • Links are generated automatically (e.g. messages
    posted on newsgroups)
  • Solutions:
  • Assign weights to identical multiple edges that
    are inversely proportional to their multiplicity
  • Prune irrelevant nodes, or regulate the
    influence of a node with a relevance weight

19
Markov Chain Notation
  • Random surfer model
  • Description of a random walk through the Web
    graph
  • Interpreted as a transition matrix with
    asymptotic probability that a surfer is currently
    browsing that page

$r_t = M r_{t-1}$, where M is the transition matrix of a
first-order Markov chain (stochastic)
Does it converge to some sensible solution (as
$t \to \infty$) regardless of the initial ranks?
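The convergence question can be explored numerically. The sketch below repeats the update $r_t = M r_{t-1}$ on a hypothetical 3-page chain. It assumes a column-stochastic M with strictly positive entries, a case in which the limit does not depend on the starting vector; it is not a full PageRank implementation.

```python
def power_iteration(M, r, tol=1e-12, max_iter=10_000):
    """Repeat r_t = M r_{t-1} until the rank vector stops changing."""
    n = len(r)
    for _ in range(max_iter):
        new_r = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        if max(abs(x - y) for x, y in zip(new_r, r)) < tol:
            return new_r
        r = new_r
    return r

# Hypothetical chain: column j holds the probabilities of moving from
# page j to each page i, so every column sums to 1.
M = [
    [0.1, 0.5, 0.3],
    [0.6, 0.2, 0.3],
    [0.3, 0.3, 0.4],
]

ranks_from_page_0 = power_iteration(M, [1.0, 0.0, 0.0])
ranks_from_page_2 = power_iteration(M, [0.0, 0.0, 1.0])
```

Both starting vectors lead to the same rank vector here. For a real Web graph this is not guaranteed without extra assumptions, for example when some pages have no outgoing links.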
20
Limits of Link Analysis
  • META tags / invisible text
  • Search engines relying on meta tags in documents
    are often misled (intentionally) by web
    developers
  • Pay-for-place
  • Search engine bias: organizations pay search
    engines for page rank
  • Advertisements: organizations pay high-ranking
    pages for advertising space
  • With a primary effect of increased visibility to
    end users and a secondary effect of increased
    respectability due to relevance to a high-ranking
    page

21
Limits of Link Analysis
  • Stability
  • Adding even a small number of nodes/edges to the
    graph can have a significant impact
  • Topic drift: similar to TKC
  • A top authority may be a hub of pages on a
    different topic, resulting in increased rank of
    the authority page
  • Content evolution
  • Adding/removing links/content can affect the
    intuitive authority rank of a page, requiring
    recalculation of page ranks

22
Further Reading
  • R. Lempel and S. Moran, Rank Stability and Rank
    Similarity of Link-Based Web Ranking Algorithms
    in Authority Connected Graphs, Submitted to
    Information Retrieval, special issue on Advances
    in Mathematics/Formal Methods in Information
    Retrieval, 2003.
  • M. Henzinger, Link Analysis in Web Information
    Retrieval, Bulletin of the IEEE Computer Society
    Technical Committee on Data Engineering, 2000.
  • L. Getoor, N. Friedman, D. Koller, and A.
    Pfeffer. Relational Data Mining, S. Dzeroski and
    N. Lavrac, Eds., Springer-Verlag, 2001

23
What's in a link? 2. "This has something to do
with my document / me."
24
Example: the citeseer archive
25
Co-citation analysis and bibliographic coupling:
basic ideas
26
Matrix Notation
Adjacency matrix A
http://www.kusatro.kyoto-u.com
27
Co-citation matrices: document similarity
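This slide carries only a title, so the following is a sketch of the standard construction rather than of what was developed in class. It assumes a citation adjacency matrix A with A[i][j] = 1 if document i cites document j; the data are hypothetical. The co-citation counts are then the entries of AᵀA and the bibliographic coupling counts are the entries of AAᵀ.

```python
# Hypothetical citation data: A[i][j] = 1 if document i cites document j.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

def transpose(X):
    """Swap rows and columns."""
    return [list(col) for col in zip(*X)]

def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    cols = transpose(Y)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]

# Entry (j, k): number of documents that cite both j and k.
cocitation = matmul(transpose(A), A)
# Entry (i, k): number of documents cited by both i and k.
coupling = matmul(A, transpose(A))
```

Documents 2 and 3 are cited together by documents 0 and 1, so `cocitation[2][3]` is 2. Documents 0 and 1 share two references, so `coupling[0][1]` is 2. Either matrix can serve as a document similarity measure.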
28
Turning similarity into spatial proximity: an MDS
map
29
A clearer visualization: pathfinder networks
30
Visualizing evolution with pathfinder networks
31
Web communities
  • A community is a collection of web pages created
    by individuals or any kind of association that
    share a common interest in a specific topic, such
    as fan pages of a baseball team or official
    pages of PC vendors. Blog communities are another
    example.
  • Formally: Flake et al.'s definition of an ideal
    community
  • "A Pokemon web site is a site that links to or
    is linked to by more Pokemon sites than non-Pokemon
    sites."
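The definition can be tested directly on a small link set. This is a minimal sketch under stated assumptions: the pages and links are hypothetical, links are counted in both directions, and the strict "more than" follows the wording on the slide.

```python
def is_ideal_community(links, community):
    """Check that every member has more links (in either direction) to
    members than to non-members, following the slide's wording."""
    for page in community:
        neighbours = [dst for src, dst in links if src == page]
        neighbours += [src for src, dst in links if dst == page]
        inside = sum(1 for n in neighbours if n in community)
        outside = len(neighbours) - inside
        if inside <= outside:
            return False
    return True

# Hypothetical links: p1, p2, p3 link among themselves; p3 also links out to x.
links = [("p1", "p2"), ("p2", "p3"), ("p3", "p1"), ("p3", "x"), ("x", "y")]
```

With these links, `{"p1", "p2", "p3"}` passes the test, while `{"p3", "x"}` fails because p3 has two links to pages outside that set and only one inside.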

32
An example, and how to find communities
33
Approximate communities
  • To apply this nice theorem, we would need to have
    the whole Web on our hard disk!
  • Realistically, we crawl a part of the Web
    starting with some pages that are in the
    community we are interested in.
  • Questions:
  • What is crawling?
  • What pages are retrieved during this crawl?
  • What other assumptions have to be made?

34
Crawling
  • Archives are not always given
  • Crawling: techniques for assembling archives
    from the Web
  • Simple: the Unix command-line utility wget
  • Sophisticated: WIRE (contains analysis) → next
    week
  • Crawling contains graph search
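The graph-search core of a crawler can be sketched as a breadth-first search with a depth limit. In this minimal illustration the `get_links` callback and the in-memory `web` dictionary are hypothetical stand-ins for fetching a page and extracting its links; no real HTTP requests are made.

```python
from collections import deque

def crawl(seeds, get_links, max_depth=2):
    """Breadth-first search from the seed pages, up to max_depth links away."""
    seen = set(seeds)
    queue = deque((page, 0) for page in seeds)
    order = []
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in get_links(page):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order

# Hypothetical in-memory "Web" standing in for real page fetches.
web = {"seed": ["a", "b"], "a": ["c"], "b": ["a", "d"], "c": ["e"], "d": [], "e": []}
pages = crawl(["seed"], lambda page: web.get(page, []), max_depth=2)
```

Page e is three links away from the seed, so it is not retrieved with `max_depth=2`. This illustrates why a crawl only ever sees part of the Web.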

35
Focused community crawling
37
What is the virtual sink (= the site that is
definitely not in the community)?
  • In the ideal version
  • In the approximate version, use an artificial
    virtual sink (a theorem ensures correctness even
    if this is not really at the center of the graph)
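One way to make the role of the artificial virtual sink concrete is a small max-flow computation, loosely modelled on the max-flow formulation in the Flake et al. paper cited in the outline. Everything specific here is an assumption for illustration: the capacities (k = number of seeds for crawled links, 1 for edges into the sink, unbounded for edges from the virtual source), the node names, and the example links.

```python
from collections import defaultdict, deque

SOURCE, SINK = "_source_", "_sink_"  # virtual nodes; the names are arbitrary

def community_by_max_flow(links, seeds):
    """Return the pages left on the source side after a max-flow computation."""
    seeds = set(seeds)
    k = len(seeds)
    cap = defaultdict(lambda: defaultdict(int))  # residual capacities
    pages = {p for link in links for p in link} | seeds
    for u, v in links:  # crawled links count in both directions
        cap[u][v] = cap[v][u] = k
    for p in pages:
        if p in seeds:
            cap[SOURCE][p] = float("inf")  # virtual source feeds the seeds
        else:
            cap[p][SINK] = 1  # every other page drains to the virtual sink

    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {SOURCE: None}
        queue = deque([SOURCE])
        while queue:
            u = queue.popleft()
            for v, c in list(cap[u].items()):
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if SINK not in parent:
            # No more flow possible: pages still reachable from the source
            # form the community.
            return set(parent) - {SOURCE}
        path, v = [], SINK
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(cap[a][b] for a, b in path)
        for a, b in path:
            cap[a][b] -= flow
            cap[b][a] += flow

# Hypothetical crawl: seeds s1, s2 and page c1 are tightly linked;
# o1, o2, o3 hang off c1 through a single link.
links = [("s1", "s2"), ("s1", "c1"), ("s2", "c1"), ("c1", "o1"),
         ("o1", "o2"), ("o1", "o3"), ("o2", "o3")]
community = community_by_max_flow(links, {"s1", "s2"})
```

In this example the cheapest cut separates c1 from o1, so the seeds and c1 end up on the source side and o1, o2, o3 on the sink side.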

38
Repetition for a better result
40
What's in a link? 3. "This is my boss."
  • Examples of problems created by such nepotistic
    links
  • Web: link farms
  • Much work since 2000 -
    http://www.cse.lehigh.edu/brian/pubs/2000/aaaiws/aaai2000ws.pdf
  • Science / citation analysis
41
Outlook: social network analysis
  • Bibliometrics and link mining have their roots in
    a much older area: social network analysis
  • Direct transfer of the link analysis methods we
    have seen: find "opinion leaders" in ciao.de and
    similar sites
  • → see also viral marketing
  • Others: analyse communication patterns, prestige,
    power, ...

42
Social networks: example