Transcript and Presenter's Notes

Title: News and Notes, 2/24


1
News and Notes, 2/24
  • Homework 2 due at the start of Thursday's class
  • New required readings
  • Micromotives and Macrobehavior, chapters 1, 3
    and 4
  • Watts, chapters 7 and 8
  • Midterm Thursday March 4
  • will cover up to and including The Web as
    Network
  • Today's class: short lecture and an in-class
    exercise

2
News and Notes, 2/19
  • Three new required articles in Web as Network
    section
  • Homework 2 distributed last class; due Feb 26
  • Pick up your Homework 1 if you haven't
  • Grading of Homework 1, problem 4.1
  • Midterm March 4
  • MK office hours 10:30-12

3
The Web as Network
  • Networked Life
  • CSE 112
  • Spring 2004
  • Prof. Michael Kearns

4
The Web as Network
  • Consider the web as a network
  • vertices: individual (HTML) pages
  • edges: hyperlinks between pages
  • will view as both a directed and undirected graph
  • What is the structure of this network?
  • connected components
  • degree distributions
  • etc.
  • What does it say about the people building and
    using it?
  • page and link generation
  • visitation statistics
  • What are the algorithmic consequences?
  • web search
  • community identification

5
Graph Structure in the Web (Broder et al. paper)
  • Report on the results of two massive web crawls
  • Executed by AltaVista in May and October 1999
  • Details of the crawls
  • automated script following hyperlinks (URLs) from
    pages found
  • large set of starting points collected over time
  • crawl implemented as breadth-first search (see
    the sketch after this slide)
  • have to deal with spam, infinite paths, timeouts,
    duplicates, etc.
  • May 99 crawl
  • 200 million pages, 1.5 billion links
  • Oct 99 crawl
  • 271 million pages, 2.1 billion links
  • Unaudited, self-reported Sep 03 stats
  • 3 major search engines claim 3 billion pages
    indexed
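A minimal sketch of the kind of breadth-first crawl described above (not the AltaVista crawler itself). The fetch_links(url) helper is a hypothetical stand-in for the real fetching and link-extraction machinery; the visited set handles duplicates, and max_pages stands in for the timeout and spam safeguards the slide mentions.

```python
from collections import deque

def bfs_crawl(seeds, fetch_links, max_pages=1000):
    """Breadth-first crawl from a set of seed URLs.

    fetch_links(url) is assumed to return the URLs hyperlinked from
    the page at url; it is a stand-in for real fetching/parsing.
    """
    visited = set(seeds)       # duplicate elimination
    frontier = deque(seeds)    # FIFO queue => breadth-first order
    edges = []                 # directed (source, target) links found

    while frontier and len(visited) < max_pages:
        page = frontier.popleft()
        for target in fetch_links(page):
            edges.append((page, target))
            if target not in visited:
                visited.add(target)
                frontier.append(target)
    return visited, edges
```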

6
Five Easy Pieces
  • Authors did two kinds of breadth-first search
  • ignoring link direction → weak connectivity
  • only following forward links → strong
    connectivity
  • They then identify five different regions of the
    web (a small reachability sketch follows this slide)
  • strongly connected component (SCC)
  • can reach any page in SCC from any other in
    directed fashion
  • component IN
  • can reach any page in SCC in directed fashion,
    but not reverse
  • component OUT
  • can be reached from any page in SCC, but not
    reverse
  • component TENDRILS
  • weakly connected to all of the above, but cannot
    reach SCC or be reached from SCC in directed
    fashion (e.g. pointed to by IN)
  • SCC + IN + OUT + TENDRILS form the weakly connected
    component (WCC)
  • everything else is called DISC (disconnected from
    the above)
  • here is a visualization of this structure
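A small sketch (not the authors' code) of how these regions could be computed once a directed adjacency structure and the giant SCC are in hand. Here fwd and rev are assumed forward and reverse adjacency dicts over the crawled pages, nodes is the full page set, and scc is the known strongly connected component.

```python
def reachable(adj, start_nodes):
    """Set of nodes reachable from start_nodes by following edges in adj."""
    seen, stack = set(start_nodes), list(start_nodes)
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def five_pieces(nodes, fwd, rev, scc):
    """Classify pages into IN, OUT, TENDRILS, DISC relative to a known SCC."""
    can_reach_scc = reachable(rev, scc)      # pages with a directed path into the SCC
    reached_from_scc = reachable(fwd, scc)   # pages the SCC can reach
    IN = can_reach_scc - reached_from_scc
    OUT = reached_from_scc - can_reach_scc
    # weak connectivity: ignore link direction
    undirected = {u: set(fwd.get(u, ())) | set(rev.get(u, ())) for u in nodes}
    WCC = reachable(undirected, scc)
    TENDRILS = WCC - set(scc) - IN - OUT
    DISC = set(nodes) - WCC
    return IN, OUT, TENDRILS, DISC
```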

7
Size of the Five
  • SCC: 56M pages, 28%
  • IN: 43M pages, 21%
  • OUT: 43M pages, 21%
  • TENDRILS: 44M pages, 22%
  • DISC: 17M pages, 8%
  • WCC: 91% of the web --- the giant component
  • One interpretation of the pieces:
  • SCC: the heart of the web
  • IN: newer sites not yet discovered and linked to
  • OUT: insular pages like corporate web sites

8
Diameter Measurements
  • Directed worst-case diameter of the SCC
  • at least 28
  • Directed worst-case diameter of IN → SCC → OUT
  • at least 503
  • Over 75% of the time, there is no directed path
    between a random start and finish page in the WCC
  • when there is a directed path, average length is
    16
  • Average undirected distance in the WCC is 7
  • Moral
  • web is a small world when we ignore direction
  • otherwise the picture is more complex

9
Degree Distributions
  • They are, of course, heavy-tailed
  • Power law distribution of component sizes as well
  • not what the Erdos-Renyi model would predict
  • Undirected connectivity of web not reliant on
    connectors
  • what happens as we remove high-degree vertices?
    (a small sketch of this experiment follows)
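A pure-Python sketch of the experiment the last bullet suggests: delete the k highest-degree vertices and re-measure the size of the largest undirected component. The adjacency dict adj is an assumed input mapping each page to its neighbors.

```python
def largest_component_size(adj):
    """Size of the largest connected component of an undirected graph."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        size, stack = 0, [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            size += 1
            for v in adj.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best

def remove_hubs(adj, k):
    """Return a copy of adj with the k highest-degree vertices deleted."""
    hubs = set(sorted(adj, key=lambda u: len(adj[u]), reverse=True)[:k])
    return {u: [v for v in adj[u] if v not in hubs]
            for u in adj if u not in hubs}

# e.g. compare largest_component_size(adj) before and after remove_hubs(adj, 100)
```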

10
Beyond Macroscopic Structure
  • Such studies tell us the coarse overall structure
    of the web
  • Use and construction of the web are more
    fine-grained
  • people browse the web for certain information or
    topics
  • people build pages that link to related or
    similar pages
  • How do we quantify and analyze this more detailed
    structure?
  • We'll examine two related examples
  • Kleinberg's hubs and authorities
  • automatic identification of web communities
  • PageRank
  • automatic identification of important pages
  • one of the main criteria used by Google
  • both rely mainly on the link structure of the web
  • both have an algorithm and a theory supporting it

11
Hubs and Authorities
  • Suppose we have a large collection of pages on
    some topic
  • possibly the results of a standard web search
  • Some of these pages are highly relevant, others
    not at all
  • How could we automatically identify the important
    ones?
  • What's a good definition of importance?
  • Kleinberg's idea: there are two kinds of
    important pages
  • authorities: highly relevant pages
  • hubs: pages that point to lots of relevant pages
  • (I had these backwards last time)
  • If you buy this definition, it further stands to
    reason that
  • a good hub should point to lots of good
    authorities
  • a good authority should be pointed to by many
    good hubs
  • this logic is, of course, circular
  • We need some math and an algorithm to sort it out

12
The HITS System (Hyperlink-Induced Topic Search)
  • Given a user-supplied query Q
  • assemble root set S of pages (e.g. first 200
    pages by AltaVista)
  • grow S to base set T by adding all pages linked
    (undirected) to S
  • might bound number of links considered from each
    page in S
  • Now consider directed subgraph induced on just
    pages in T
  • For each page p in T, define its
  • hub weight h(p); initialize all to be 1
  • authority weight a(p); initialize all to be 1
  • Repeat forever
  • a(p) ← sum of h(q) over all pages q → p
  • h(p) ← sum of a(q) over all pages p → q
  • renormalize all the weights
  • This algorithm will always converge!
  • weights computed related to eigenvectors of
    connectivity matrix
  • further substructure revealed by different
    eigenvectors
  • Here are some examples (a small code sketch of the
    update follows this slide)
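A minimal sketch of the iterative update just described (an illustration, not the original HITS implementation). The induced subgraph on T is assumed to be given as a dict links mapping each page to the pages it points to; renormalization uses the L2 norm, matching the eigenvector view.

```python
import math

def hits(links, iterations=50):
    """Iteratively compute hub and authority weights on a directed graph."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    h = {p: 1.0 for p in pages}   # hub weights, initialized to 1
    a = {p: 1.0 for p in pages}   # authority weights, initialized to 1

    for _ in range(iterations):
        # a(p) <- sum of h(q) over all pages q -> p
        a = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                a[p] += h[q]
        # h(p) <- sum of a(q) over all pages p -> q
        h = {p: sum(a[q] for q in links.get(p, ())) for p in pages}
        # renormalize all the weights
        for w in (a, h):
            norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
            for p in w:
                w[p] /= norm
    return h, a
```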

13
The PageRank Algorithm
  • Let's define a measure of page importance we will
    call the rank
  • Notation: for any page p, let
  • N(p) be the number of forward links (pages p
    points to)
  • R(p) be the (to-be-defined) rank of p
  • Idea: important pages distribute their importance
    over their forward links
  • So we might try defining
  • R(p) = sum of R(q)/N(q) over all pages q → p
  • can again define iterative algorithm for
    computing the R(p)
  • if it converges, solution again has an
    eigenvector interpretation
  • problem: cycles accumulate rank but never
    distribute it
  • The fix:
  • R(p) = E(p) + sum of R(q)/N(q) over all pages
    q → p
  • E(p) is some external or exogenous measure of
    importance
  • some technical details omitted here (e.g.
    normalization)
  • Let's play with the PageRank calculator (a small
    iterative sketch follows this slide)
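A minimal iterative sketch of the rank computation described above, with the same technical details (e.g. the exact normalization) simplified away. links is an assumed dict mapping each page to its forward links, and E is the exogenous importance measure, assumed to sum to 1.

```python
def pagerank(links, E, iterations=50):
    """Iterate R(p) = E(p) + sum of R(q)/N(q) over pages q -> p, then renormalize."""
    pages = set(links) | set(E) | {q for targets in links.values() for q in targets}
    R = {p: 1.0 / len(pages) for p in pages}       # uniform starting ranks

    for _ in range(iterations):
        new_R = {p: E.get(p, 0.0) for p in pages}  # exogenous importance
        for q, targets in links.items():
            if targets:
                share = R[q] / len(targets)        # R(q)/N(q)
                for p in targets:
                    new_R[p] += share
        total = sum(new_R.values())                # the omitted normalization detail
        R = {p: v / total for p, v in new_R.items()}
    return R
```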

14
The Random Surfer Model
  • Let's suppose that E(p) sums to 1 (normalized)
  • Then the resulting PageRank solution R(p) will
  • also be normalized
  • can be interpreted as a probability distribution
  • R(p) is the stationary distribution of the
    following process (simulated in the sketch after
    this slide)
  • starting from some random page, just keep
    following random links
  • if stuck in a loop, jump to a random page drawn
    according to E(p)
  • so the surfer periodically gets bored and jumps
    to a new page
  • E(p) can thus be personalized for each surfer
  • An important component of Google's search
    criteria
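A small simulation sketch of the random-surfer process (an illustration, not Google's implementation): follow a random out-link, and whenever the surfer is stuck or "gets bored" jump to a page drawn according to E. The boredom probability is an added assumption; visit frequencies approximate the stationary distribution R(p).

```python
import random
from collections import Counter

def random_surfer(links, E, steps=100_000, boredom=0.15):
    """Simulate the random surfer; return empirical visit frequencies."""
    pages, weights = zip(*E.items())
    page = random.choices(pages, weights)[0]           # start at a page drawn from E
    visits = Counter()

    for _ in range(steps):
        visits[page] += 1
        out = links.get(page, [])
        if not out or random.random() < boredom:
            page = random.choices(pages, weights)[0]   # bored or stuck: jump via E(p)
        else:
            page = random.choice(out)                  # follow a random out-link
    return {p: count / steps for p, count in visits.items()}
```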

15
But What About Content?
  • PageRank and Hubs & Authorities
  • both based purely on link structure
  • often applied to a pre-computed set of pages
    filtered for content
  • So how do (say) search engines do this filtering?
  • This is the domain of information retrieval

16
Basics of Information Retrieval
  • Represent a document as a bag of words
  • for each word in the English language, count the
    number of occurrences
  • so d_i is the number of times the i-th word
    appears in the document
  • usually ignore common words (the, and, of, etc.)
  • usually do some stemming (e.g. washed → wash)
  • vectors are very long (100Ks of dimensions) but
    very sparse
  • need some special representation exploiting
    sparseness (see the sketch after this slide)
  • Note all that we ignore or throw away
  • the order in which the words appear
  • the grammatical structure of sentences (parsing)
  • the sense in which a word is used
  • e.g. firing a gun vs. firing an employee
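A minimal bag-of-words sketch along the lines just described. The stop-word list and the crude suffix stripping are purely illustrative stand-ins for a real stop list and stemmer; the sparse representation is simply a dict from word to count.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "and", "of", "a", "an", "to", "in", "is"}   # illustrative only

def crude_stem(word):
    """Very rough stand-in for real stemming (e.g. 'washed' -> 'wash')."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    """Sparse word-count vector: dict mapping stemmed word -> count."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(w) for w in words if w not in STOP_WORDS)
```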

17
Bag of Words Document Comparison
  • View documents as vectors in a very
    high-dimensional space
  • Can now import geometry and linear algebra
    concepts
  • Similarity between documents d and e:
  • Σ d_i e_i over all words i
  • may normalize d and e first
  • this is their projection onto each other
  • Improve by using TF/IDF weighting of words
  • term frequency --- how frequent is the word in
    this document?
  • inverse document frequency --- how rare is the
    word across all documents?
  • give high weight to words with high TF and high
    IDF (i.e. rare across documents)
  • Search engines:
  • view the query as just another document
  • look for similar documents via the above (see the
    sketch after this slide)
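A sketch of the dot-product comparison and TF/IDF weighting described above, operating on sparse word-count vectors like those from the previous sketch. The log(N/df) form of IDF is one common choice, not necessarily what any particular engine uses; to rank results, the query is vectorized the same way and compared against each document.

```python
import math
from collections import Counter

def tf_idf_vectors(bags):
    """Reweight a list of bag-of-words Counters by term frequency * IDF."""
    n_docs = len(bags)
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())     # number of documents containing each word
    idf = {w: math.log(n_docs / df) for w, df in doc_freq.items()}
    return [{w: tf * idf[w] for w, tf in bag.items()} for bag in bags]

def cosine_similarity(d, e):
    """Normalized dot product: sum of d_i * e_i over shared words."""
    dot = sum(weight * e.get(word, 0.0) for word, weight in d.items())
    norm = math.sqrt(sum(v * v for v in d.values())) \
         * math.sqrt(sum(v * v for v in e.values()))
    return dot / norm if norm else 0.0
```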

18
Marrying Content and Structure
  • So one overall approach is to
  • use traditional IR methods to find documents with
    desired content
  • apply structural methods to elevate authorities,
    high PageRank, etc.
  • For some problems, a more integrated approach
    exists
  • Example: co-training
  • Suppose we want to learn a rule to classify pages
  • e.g. classify as a CS faculty home page or not
  • two sources of info:
  • content on page --- technical terms, educational
    background, etc.
  • links on page --- to other faculty, department
    site, students, etc.
  • first learn a rule using content only from a
    small labeled seed set
  • now use this rule to label further pages
  • now learn a rule using links only from these
    labeled pages
  • repeat (a small sketch of this loop follows)
  • Another example: maintaining a list of company
    names
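A schematic sketch of the co-training loop just outlined. train, predict_confident, content_features, and link_features are hypothetical stand-ins for the base learner, its confidence test, and the two feature views; the point is only to show how the two views take turns labeling pages for each other.

```python
def co_train(seed_pages, seed_labels, unlabeled_pages,
             content_features, link_features,
             train, predict_confident, rounds=10):
    """Alternate a content-based and a link-based classifier, each
    labeling new pages that the other then trains on.

    train(X, y) -> model; predict_confident(model, X) -> list of
    (index, label) pairs the model is confident about. Both are
    hypothetical stand-ins for a real base learner.
    """
    labeled = list(zip(seed_pages, seed_labels))
    pool = list(unlabeled_pages)

    for r in range(rounds):
        # alternate views: content on even rounds, links on odd rounds
        view = content_features if r % 2 == 0 else link_features
        X = [view(page) for page, _ in labeled]
        y = [label for _, label in labeled]
        model = train(X, y)

        # let this view's rule confidently label more pages for the other view
        confident = predict_confident(model, [view(page) for page in pool])
        for i, label in sorted(confident, reverse=True):   # pop from the back first
            labeled.append((pool.pop(i), label))
    return labeled
```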