ICS 278: Data Mining Lecture 15: Mining Web Link Structure


1
ICS 278: Data Mining
Lecture 15: Mining Web Link Structure

2
Web Mining
  • The Web: a potentially enormous data set for data
    mining
  • 3 primary aspects of Web mining:
  • Web page content
  • e.g., clustering Web pages based on their text
  • Web connectivity
  • e.g., characterizing distributions on path
    lengths between pages
  • e.g., determining importance of pages from graph
    structure
  • Web usage
  • e.g., understanding user behavior from Web logs
  • All 3 are interconnected/interdependent
  • e.g., Google (and most search engines) use both
    content and connectivity
  • Today's lecture: Web connectivity

3
The Web Graph
  • $G = (V, E)$
  • $V$ = set of all Web pages
  • $E$ = set of all hyperlinks
  • Number of nodes?
  • Difficult to estimate
  • Crawling the Web is highly non-trivial
  • At least 4.3 billion (Google)
  • Number of edges?
  • $|E| = O(|V|)$
  • i.e., mean number of outlinks per page is a small
    constant
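
A minimal sketch of the representation this implies, assuming a hypothetical four-page crawl (the page names are made up for illustration). Storing only each page's outlinks keeps memory proportional to $|E|$, which is $O(|V|)$ when the mean out-degree is a small constant.

```python
# Hypothetical toy Web graph: page -> list of pages it links to.
outlinks = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["a.html", "c.html"],
}

num_nodes = len(outlinks)                                        # |V|
num_edges = sum(len(targets) for targets in outlinks.values())   # |E|
mean_outdegree = num_edges / num_nodes                           # |E| / |V|

print(num_nodes, num_edges, mean_outdegree)
```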

4
The Web Graph
  • The Web graph is inherently dynamic
  • nodes and edges are continually appearing and
    disappearing
  • Interested in general properties of the Web graph
  • What is the distribution of the number of
    in-links and out-links?
  • What is the distribution of number of pages per
    site?
  • Typically power-laws for many of these
    distributions
  • How far apart are 2 randomly selected pages on
    the Web?
  • What is the average distance between 2 random
    pages?
  • And so on
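
One way to make the distance questions concrete is breadth-first search over the outlink lists. The sketch below assumes a small hypothetical graph and averages only over ordered pairs where the target is reachable; both choices are illustrative, not taken from the lecture.

```python
from collections import deque

def bfs_distances(outlinks, source):
    """Return {page: number of clicks from source} for every reachable page."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        page = queue.popleft()
        for nxt in outlinks.get(page, []):
            if nxt not in dist:
                dist[nxt] = dist[page] + 1
                queue.append(nxt)
    return dist

def mean_distance(outlinks):
    """Average directed distance over ordered pairs (u, v), u != v, v reachable from u."""
    total, pairs = 0, 0
    for u in outlinks:
        for v, d in bfs_distances(outlinks, u).items():
            if v != u:
                total += d
                pairs += 1
    return total / pairs if pairs else float("nan")

# Hypothetical toy graph (node -> outlinks).
toy = {1: [2], 2: [3], 3: [1, 4], 4: [1]}
print(bfs_distances(toy, 1), mean_distance(toy))
```

On the real Web graph this would be run from a sample of start pages rather than from every page.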

5
Social Networks
  • Social networks = graphs
  • $V$ = set of actors (e.g., students in a class)
  • $E$ = set of interactions (e.g., collaborations)
  • Typically small graphs, e.g., $|V|$ = 10 or 50
  • Long history of social network analysis (e.g. at
    UCI)
  • Quantitative data analysis techniques that can
    automatically extract structure or information
    from graphs
  • E.g., who is the most important actor in a
    network?
  • E.g., are there clusters in the network?
  • Comprehensive reference
  • S. Wasserman and K. Faust, Social Network
    Analysis, Cambridge University Press, 1994.

6
Node Importance in Social Networks
  • General idea is that some nodes are more
    important than others in terms of the structure
    of the graph
  • In a directed graph, in-degree may be a useful
    indicator of importance
  • e.g., for a citation network among authors (or
    papers)
  • in-degree = number of citations => importance
  • However:
  • in-degree is only a first-order measure in that
    it implicitly assumes that all edges are of equal
    importance
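
As a small illustration of in-degree as a first-order measure, the sketch below counts citations in a hypothetical edge list (the paper ids are invented). Every citation counts the same, whoever it comes from, which is exactly the limitation noted above.

```python
from collections import Counter

# Hypothetical citation edges: (citing paper, cited paper).
citations = [("p2", "p1"), ("p3", "p1"), ("p4", "p1"), ("p4", "p2"), ("p1", "p5")]

# In-degree of a paper = number of times it appears as the cited end of an edge.
indegree = Counter(cited for _, cited in citations)

# Papers ordered from most cited to least cited.
ranking = [paper for paper, _ in indegree.most_common()]
print(indegree, ranking)
```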

7
Recursive Notions of Node Importance
  • $w_{ij}$ = weight of link from node $i$ to node $j$
  • assume $\sum_j w_{ij} = 1$ and weights are non-negative
  • e.g., default choice: $w_{ij} = 1/\text{outdegree}(i)$
  • more outlinks => less importance attached to each
  • Define $r_j$ = importance of node $j$ in a directed
    graph
  • $r_j = \sum_i w_{ij} r_i$, for $i, j = 1, \ldots, n$
  • Importance of a node is a weighted sum of the
    importance of nodes that point to it
  • Makes intuitive sense
  • Leads to a set of recursive linear equations
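
A minimal sketch of the default weighting, assuming nodes are numbered 0..n-1, every node has at least one outlink, and no link is listed twice. The three-node graph is hypothetical.

```python
def weight_matrix(outlinks, n):
    """Build W with W[i][j] = 1/outdegree(i) for each link i -> j, else 0."""
    W = [[0.0] * n for _ in range(n)]
    for i, targets in outlinks.items():
        for j in targets:
            W[i][j] = 1.0 / len(targets)
    return W

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
W = weight_matrix({0: [1, 2], 1: [2], 2: [0]}, 3)

# Each row should sum to 1, matching the assumption on the weights.
row_sums = [sum(row) for row in W]
print(W, row_sums)
```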

8
Simple Example
[Figure: directed graph on nodes 1, 2, 3, 4]
9
Simple Example
[Figure: the same graph with a weight (1 or 0.5) on each edge]
10
Simple Example
[Figure: the weighted graph alongside its weight matrix W]
11
Matrix-Vector form
  • Recall $r_j$ = importance of node $j$
  • $r_j = \sum_i w_{ij} r_i$, for $i, j = 1, \ldots, n$
  • e.g., $r_2 = 1 \cdot r_1 + 0 \cdot r_2 + 0.5 \cdot r_3 + 0.5 \cdot r_4$
  • = dot product of the $r$ vector
    with column 2 of $W$
  • Let $r$ = $n \times 1$ vector of importance values for
    the $n$ nodes
  • Let $W$ = $n \times n$ matrix of link weights
  • => we can rewrite the importance equations as
  • $r = W^T r$
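
The matrix-vector form can be sketched in a few lines. The weight matrix from the example slide did not survive in this transcript, so the W below is an assumption: it is chosen to agree with the equation for $r_2$ above and with the solution quoted on the "Solution for the Simple Example" slide.

```python
# Assumed weight matrix (row i holds the outlink weights of node i+1).
W = [
    [0.0, 1.0, 0.0, 0.0],   # node 1 -> 2
    [0.5, 0.0, 0.0, 0.5],   # node 2 -> 1, 4
    [0.0, 0.5, 0.0, 0.5],   # node 3 -> 2, 4
    [0.0, 0.5, 0.5, 0.0],   # node 4 -> 2, 3
]

def transpose_times(W, r):
    """Return W^T r: entry j is the dot product of r with column j of W."""
    n = len(W)
    return [sum(W[i][j] * r[i] for i in range(n)) for j in range(n)]

r = [0.25, 0.25, 0.25, 0.25]        # an arbitrary importance vector
new_r = transpose_times(W, r)       # one application of r <- W^T r
print(new_r)
```

Because the rows of W sum to 1, the entries of `new_r` still sum to 1.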

12
Eigenvector Formulation
  • Need to solve the importance equations for
    unknown $r$, with known $W$
  • $r = W^T r$
  • This is a standard eigenvalue problem, i.e.,
  • $A r = \lambda r$ (where $A = W^T$)
  • with eigenvalue $\lambda = 1$
  • and $r$ the eigenvector corresponding to $\lambda = 1$
  • Results from linear algebra tell us that
  • (a) $W$ and $W^T$ have the same eigenvalues
    (though, in general, different eigenvectors)
  • (b) Since $W$ is a stochastic matrix, the largest
    of these eigenvalues is always 1
  • (c) So the importance vector $r$ is the
    eigenvector of $W^T$ corresponding to its largest
    eigenvalue
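
A short derivation of why $\lambda = 1$ is always available and is the largest eigenvalue, using only the assumption that the weights are non-negative and each row of $W$ sums to 1. Here $\mathbf{1}$ denotes the all-ones vector (notation introduced for this sketch).

```latex
% Rows of W sum to 1, so the all-ones vector is a right eigenvector of W:
(W\mathbf{1})_i = \sum_j w_{ij} = 1
\quad\Longrightarrow\quad W\mathbf{1} = 1 \cdot \mathbf{1}.

% W and W^T share a characteristic polynomial, hence the same eigenvalues:
\det(W^T - \lambda I) = \det\big((W - \lambda I)^T\big) = \det(W - \lambda I).

% No eigenvalue exceeds 1 in magnitude: if Wx = \lambda x and |x_i| = \max_j |x_j|,
|\lambda|\,|x_i| = \Big|\sum_j w_{ij} x_j\Big| \le \sum_j w_{ij}\,|x_j| \le |x_i|.
```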

13
Solution for the Simple Example
Solving for the eigenvector of $W^T$ we get r = [0.2, 0.4, 0.133, 0.2667].
Results are quite intuitive.
[Figure: the weighted example graph and its weight matrix W]
14
How can we apply this to the Web?
  • Given a set of Web pages and hyperlinks
  • Weights from each page = 1/(# of outlinks)
  • Solve for the eigenvector ($\lambda = 1$) of the weight
    matrix
  • Problem:
  • Solving an eigenvector equation scales as $O(n^3)$
  • For the entire Web graph, $n > 4.3$ billion (!!)
  • So direct solution is not feasible
  • Can use the power method (iterative):
    $r^{(k+1)} = W^T r^{(k)}$, for $k = 1, 2, \ldots$

15
Power Method for solving for r
  • $r^{(k+1)} = W^T r^{(k)}$
  • Define a suitable starting vector $r^{(1)}$
  • e.g., all entries $1/n$, or all entries
    indegree(node)/$|E|$, etc.
  • Each iteration is a matrix-vector multiplication
    => $O(n^2)$
  • problematic?
  • no: since $W$ is highly sparse (Web pages
    have limited outdegree), each
    iteration is effectively $O(n)$
  • For sparse $W$, the iterations typically converge
    quite quickly
  • rate of convergence depends on the spectral
    gap
  • -> how quickly does $\text{error}(k) \sim (\lambda_2/\lambda_1)^k$
    go to 0 as a function of $k$?
  • -> if $\lambda_2$ is close to 1 ($= \lambda_1$) then
    convergence is slow
  • empirically: Web graph with 300 million
    pages
  • -> 50 iterations to convergence (Brin and Page,
    1998)
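
A sketch of the power method that only touches existing links, so each iteration costs time proportional to the number of edges. The link structure of the four-node example is an assumption (the slide's figure is missing from this transcript); it is chosen so that the result matches the solution quoted on the "Solution for the Simple Example" slide. The stopping rule and tolerance are also illustrative choices.

```python
def power_method(outlinks, n, tol=1e-10, max_iter=1000):
    """Iterate r <- W^T r with w_ij = 1/outdegree(i).

    Nodes are numbered 0..n-1 and each is assumed to have at least one outlink.
    Returns (r, number of iterations used).
    """
    r = [1.0 / n] * n                       # starting vector: all entries 1/n
    for k in range(1, max_iter + 1):
        new_r = [0.0] * n
        for i, targets in outlinks.items():
            share = r[i] / len(targets)     # w_ij * r_i, equal for every outlink of i
            for j in targets:
                new_r[j] += share
        change = sum(abs(a - b) for a, b in zip(new_r, r))
        r = new_r
        if change < tol:                    # stop once r has effectively stopped moving
            return r, k
    return r, max_iter

# Assumed links for the 4-node example, 0-based (node 1 of the slides is index 0).
example = {0: [1], 1: [0, 3], 2: [1, 3], 3: [1, 2]}
r, iterations = power_method(example, 4)
print([round(x, 4) for x in r], iterations)
```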

17
Markov Chain Interpretation
  • $W$ is a stochastic matrix (rows sum to 1) by
    definition
  • => we can interpret $W$ as defining the transition
    probabilities in a Markov chain
  • $w_{ij}$ = probability of transitioning from node $i$ to
    node $j$
  • Markov chain interpretation:
    $r = W^T r$
  • -> these are the equations for the steady-state
    probabilities of a Markov chain
  • page importance <=> steady-state Markov
    probabilities <=> eigenvector

18
The Random Surfer Interpretation
  • Recall that for the Web model, we set
    $w_{ij} = 1/\text{outdegree}(i)$
  • Thus, using $W$ to compute the importance of Web
    pages is equivalent to a model where
  • We have a random surfer who surfs the Web for an
    infinitely long time
  • At each page the surfer randomly selects an
    outlink to the next page
  • importance of a page = fraction of visits the
    surfer makes to that page
  • this is intuitive: pages that have better
    connectivity will be visited more often
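
The random surfer can also be simulated directly. This is a rough sketch: the links of the four-node example are an assumption (the slide's figure is missing here), chosen to agree with the solution quoted on the "Solution for the Simple Example" slide, and the step count and seed are arbitrary.

```python
import random

def random_surfer(outlinks, start, steps, seed=0):
    """Follow a uniformly random outlink at each step; return visit fractions."""
    rng = random.Random(seed)
    visits = {page: 0 for page in outlinks}
    page = start
    for _ in range(steps):
        page = rng.choice(outlinks[page])
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}

# Assumed links for the 4-node example (pages numbered 1..4).
example = {1: [2], 2: [1, 4], 3: [2, 4], 4: [2, 3]}
fractions = random_surfer(example, start=1, steps=200_000)
print(fractions)
```

With enough steps the visit fractions should land near 0.2, 0.4, 0.133, and 0.267, up to sampling noise.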

19
Potential Problems
[Figure: directed graph on nodes 1, 2, 3, 4]
Page 1 is a sink (no outlink).
Pages 3 and 4 together also form a sink (no outlink leaves the pair).
Markov chain theory tells us that no unique steady-state solution exists:
depending on where you start you will end up at 1 or at 3, 4.
The Markov chain is reducible.
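
A small sketch of the problem, under stated assumptions: the figure's exact links are not recoverable from this transcript, so the graph below is hypothetical, and a surfer stuck on page 1 is modeled as a self-loop. The point is only that the long-run distribution depends on the starting page.

```python
def step(outlinks, r):
    """One iteration r <- W^T r with w_ij = 1/outdegree(i)."""
    new_r = {page: 0.0 for page in r}
    for page, targets in outlinks.items():
        for t in targets:
            new_r[t] += r[page] / len(targets)
    return new_r

# Hypothetical links: page 1 is stuck on itself; pages 3 and 4 only link to each other.
links = {1: [1], 2: [1, 3], 3: [4], 4: [3]}

def run(start, steps=50):
    """Apply `steps` iterations starting from the distribution `start`."""
    r = dict(start)
    for _ in range(steps):
        r = step(links, r)
    return r

from_page_2 = run({1: 0.0, 2: 1.0, 3: 0.0, 4: 0.0})
from_page_3 = run({1: 0.0, 2: 0.0, 3: 1.0, 4: 0.0})
print(from_page_2, from_page_3)
```

Starting at page 2 leaves half the probability on page 1 and half in the pair {3, 4}; starting at page 3 leaves all of it in the pair.
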
20
Making the Web Graph Irreducible
  • One simple solution to our problem is to modify
    the Markov chain
  • With probability $\alpha$ the random surfer jumps to any
    random page in the system (with probability of
    $1/n$, conditioned on such a jump)
  • With probability $1 - \alpha$ the random surfer selects an
    outlink (randomly from the set of available
    outlinks)
  • The resulting transition graph is fully connected
    => Markov system is irreducible => steady-state
    solutions exist
  • Typically $\alpha$ is chosen to be between 0.1 and 0.2
    in practice
  • New power iterations can be written as
    $r^{(k+1)} = (1 - \alpha) W^T r^{(k)} + (\alpha/n) \mathbf{1}$,
    where $\mathbf{1}$ is the $n \times 1$ vector of ones
  • Complexity is still $O(n)$ per iteration for sparse
    $W$
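
A sketch of the modified power iteration on a sparse graph. One detail is an assumption not stated in the slides: the rank held by a page with no outlinks is spread uniformly over all pages, so that r keeps summing to 1. The example graph, the value of alpha, and the tolerance are hypothetical.

```python
def pagerank(outlinks, n, alpha=0.15, tol=1e-10, max_iter=1000):
    """Power iteration r <- (1 - alpha) W^T r + (alpha / n) 1.

    Nodes are numbered 0..n-1. Assumption: rank sitting on pages without
    outlinks is redistributed uniformly at each iteration.
    """
    r = [1.0 / n] * n
    for _ in range(max_iter):
        # Total rank currently on pages that have no outlinks.
        dangling = sum(r[i] for i in range(n) if not outlinks.get(i))
        # Every page receives the jump term plus its share of the dangling rank.
        base = alpha / n + (1.0 - alpha) * dangling / n
        new_r = [base] * n
        for i, targets in outlinks.items():
            for j in targets:
                new_r[j] += (1.0 - alpha) * r[i] / len(targets)
        change = sum(abs(a - b) for a, b in zip(new_r, r))
        r = new_r
        if change < tol:
            break
    return r

# Hypothetical graph with a sink (node 0) and a closed pair (nodes 2 and 3).
ranks = pagerank({0: [], 1: [0, 2], 2: [3], 3: [2]}, 4)
print([round(x, 4) for x in ranks])
```

Each iteration visits every edge once, so the cost per iteration stays linear in the size of the sparse graph.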

21
The PageRank Algorithm
  • S. Brin and L. Page, The anatomy of a large-scale
    hypertextual search engine, in Proceedings of the
    7th WWW Conference, 1998.
  • PageRank = the method on the previous slide,
    applied to the entire Web graph
  • Crawl the Web (highly non-trivial!)
  • Store both connectivity and content
  • Calculate (off-line) the pagerank r for each
    Web page using the power iteration method
  • How can this be used to answer Web queries?
  • Terms in the search query are used to limit the
    set of pages of possible interest
  • Pages are then ordered for the user via
    precomputed pageranks
  • The Google search engine combines r with
    text-based measures
  • This was the first demonstration that link
    information could be used for content-based
    search on the Web
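
The two query-time steps (filter by terms, then order by precomputed pagerank) can be sketched as below. The scores, page texts, and the all-terms-must-match rule are hypothetical; how Google actually combines r with text-based measures is not described in these slides.

```python
# Hypothetical precomputed PageRank scores and page texts.
pagerank_score = {"a": 0.40, "b": 0.25, "c": 0.20, "d": 0.15}
page_text = {
    "a": "data mining lecture notes",
    "b": "web link structure and mining",
    "c": "cooking recipes",
    "d": "mining web logs",
}

def search(query):
    """Keep pages containing every query term, then order them by PageRank."""
    terms = query.lower().split()
    hits = [page for page, text in page_text.items()
            if all(term in text.split() for term in terms)]
    return sorted(hits, key=lambda page: pagerank_score[page], reverse=True)

print(search("mining"), search("web mining"))
```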

22
Link Structure helps in Web Search
[Figure from Singhal and Kaszkiel, 2001.] SE1, etc., indicate
different (anonymized) commercial search
engines, all using link structure (e.g.,
PageRank) in their rankings.
23
PageRank architecture at Google
  • Ranking of pages more important than exact values
    of $p_i$
  • Pre-compute and store the PageRank of each page
  • PageRank independent of any query or textual
    content
  • Ranking scheme combines PageRank with textual
    match
  • Unpublished
  • Many empirical parameters, human effort and
    regression testing
  • Criticism: ad hoc coupling and decoupling
    between query relevance and graph importance
  • Massive engineering effort
  • Continually crawling the Web and updating page
    ranks

25
PageRank Limitations
  • "rich get richer" syndrome
  • not as democratic as originally (nobly) claimed
  • certainly not 1 vote per WWW citizen
  • also crawling frequency tends to be based on
    pagerank
  • for detailed grumblings, see www.google-watch.org,
    etc.
  • not query-sensitive
  • random walk same regardless of query topic
  • whereas real random surfer has some topic
    interests
  • non-uniform jumping vector needed
  • would enable personalization (but requires faster
    eigenvector convergence)
  • Topic of ongoing research
  • ad hoc mix of PageRank and keyword match score
  • done in two steps for efficiency, not quality
    motivations

27
HITS Hub and Authority Rankings
  • J. Kleinberg, Authoritative sources in a
    hyperlinked environment, Proceedings of ACM SODA
    Conference, 1998.
  • HITS = Hypertext Induced Topic Selection
  • Every page $u$ has two distinct measures of merit:
    its hub score $h_u$ and its authority score $a_u$
  • Recursive quantitative definitions of hub and
    authority scores
  • Relies on query-time processing
  • To select a base set $V_q$ of pages for query $q$,
    constructed by
  • selecting a sub-graph $R$ from the Web (root set)
    relevant to the query
  • selecting any node $u$ which neighbors any $r \in R$
    via an inbound or outbound edge (expanded set)
  • To deduce hubs and authorities that exist in a
    sub-graph of the Web
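
The base-set construction can be sketched as a single pass over an edge list. The edges and root set here are hypothetical; in practice the root set would come from a text search for the query.

```python
def base_set(edges, root):
    """Return the root set plus every node joined to a root node by an edge."""
    expanded = set(root)
    for u, v in edges:
        # An edge touching the root set pulls in its other endpoint,
        # whether the edge is inbound or outbound.
        if u in root or v in root:
            expanded.update((u, v))
    return expanded

# Hypothetical edge list (u -> v) and root set R.
edges = [(1, 5), (6, 1), (2, 7), (8, 9), (7, 10)]
root = {1, 2}
V_q = base_set(edges, root)
print(sorted(V_q))
```

Nodes 8, 9, and 10 are left out because none of them is directly linked to or from a root node.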

28
Authority and Hubness
[Figure: node 1 with neighboring nodes 2, 3, 4 and 5, 6, 7]
$h(1) = a(5) + a(6) + a(7)$
$a(1) = h(2) + h(3) + h(4)$
29
Authority and Hubness Convergence
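  • Recursive dependency:
  • $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
  • $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$
  • where $pa[v]$ = parents of $v$ (pages linking to $v$)
    and $ch[v]$ = children of $v$ (pages $v$ links to)
  • Using linear algebra, we can prove that
    $a(v)$ and $h(v)$ converge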
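
A minimal sketch of the two updates, alternated for a fixed number of rounds. Normalizing both score vectors to unit Euclidean length after each round is the usual way to keep the values bounded; the edge list and iteration count are hypothetical.

```python
import math

def hits(edges, nodes, iterations=50):
    """Alternate the authority and hub updates, normalizing after each round."""
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # a(v) <- sum of hub scores of the parents of v.
        auth = {v: 0.0 for v in nodes}
        for u, v in edges:
            auth[v] += hub[u]
        # h(v) <- sum of authority scores of the children of v.
        new_hub = {v: 0.0 for v in nodes}
        for u, v in edges:
            new_hub[u] += auth[v]
        hub = new_hub
        # Rescale each score vector to unit Euclidean length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(x * x for x in scores.values())) or 1.0
            for v in scores:
                scores[v] /= norm
    return auth, hub

# Hypothetical graph: pages 1, 2, 3 act as hubs; pages 4, 5 as authorities.
edges = [(1, 4), (2, 4), (3, 4), (1, 5), (2, 5)]
auth, hub = hits(edges, nodes=[1, 2, 3, 4, 5])
print(auth, hub)
```
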
30
HITS Example
Find a base subgraph:
  • Start with a root set $R = \{1, 2, 3, 4\}$
  • $\{1, 2, 3, 4\}$: nodes relevant to
    the topic
  • Expand the root set $R$ to include all the
    children and a fixed number of parents of nodes
    in $R$
  • => a new set $S$ (base subgraph)
31
HITS Example Results
[Figure: authority and hubness weights for nodes 1 through 15]
32
Stability of HITS vs PageRank (5 trials)
[Figure: HITS and PageRank results over 5 trials, each with a randomly deleted 30% of papers]
33
HITS vs PageRank Stability
  • e.g., Ng, Zheng, and Jordan, IJCAI-01 and SIGIR-01
  • HITS can be very sensitive to a change in a small
    fraction of nodes/edges in the link structure
  • PageRank is much more stable, due to random jumps
  • propose HITS as a bidirectional random walk
  • with probability $d$, randomly ($p = 1/n$) jump to a
    node
  • with probability $1 - d$:
  • odd timestep: take random outlink from current
    node
  • even timestep: go backward on random inlink of
    node
  • this HITS variant seems much more stable as $d$
    is increased
  • issue: tuning $d$ ($d = 1$ most stable but useless for
    ranking)
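
One reading of the bidirectional walk described above, as a rough sketch rather than the authors' implementation. It assumes every node has at least one inlink and one outlink, so that no probability is lost; the graph, the value of d, and the iteration count are hypothetical.

```python
def randomized_hits(edges, nodes, d=0.2, iterations=100):
    """Walk that jumps uniformly with probability d, and otherwise alternates
    between following a random outlink and going back along a random inlink.

    Returns (authority, hub) distributions, each summing to 1.
    """
    n = len(nodes)
    out_deg = {v: 0 for v in nodes}
    in_deg = {v: 0 for v in nodes}
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    hub = {v: 1.0 / n for v in nodes}
    auth = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Forward step: from page u, follow a random outlink to v.
        auth = {v: d / n for v in nodes}
        for u, v in edges:
            auth[v] += (1 - d) * hub[u] / out_deg[u]
        # Backward step: from page v, go back along a random inlink to u.
        hub = {v: d / n for v in nodes}
        for u, v in edges:
            hub[u] += (1 - d) * auth[v] / in_deg[v]
    return auth, hub

# Hypothetical graph in which every node has an inlink and an outlink.
edges = [(0, 1), (1, 2), (2, 0), (0, 2), (1, 0)]
auth, hub = randomized_hits(edges, nodes=[0, 1, 2])
flat_auth, _ = randomized_hits(edges, nodes=[0, 1, 2], d=1.0)
print(auth, hub, flat_auth)
```

With d = 1 every score collapses to 1/n, which illustrates why that setting is stable but useless for ranking.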

34
Future Directions
  • Many other possible search algorithms that
    combine link structure and content
  • E.g., Teoma, Vivisimo, etc
  • Personalized search engines
  • Domain-specific search engines
  • Using Google (or other search engine) as a
    database
  • E.g., combining CiteSeer authorship data and text
    from papers, with queries to Google, and
    combining results

35
Recommended Books
http://www.cs.berkeley.edu/soumen/mining-the-web/
http://www.oreilly.com/catalog/googlehks/
http://www.google.com/apis/