Transcript and Presenter's Notes

Title: Network Science and the Web


1
Network Science and the Web
  • Networked Life
  • CIS 112
  • Spring 2008
  • Prof. Michael Kearns

2
The Web as Network
  • Consider the web as a network
  • vertices: individual (HTML) pages
  • edges: hyperlinks between pages
  • we will view it as both a directed and an undirected
    graph (see the sketch below)
  • What is the structure of this network?
  • connected components
  • degree distributions
  • etc.
  • What does it say about the people building and
    using it?
  • page and link generation
  • visitation statistics
  • What are the algorithmic consequences?
  • web search
  • community identification
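
A minimal sketch of the two views, using made-up page names rather than anything from the crawls discussed later: the web is stored as a dict from each page to the set of pages it links to, and the undirected view simply ignores link direction.

```python
# Hypothetical pages; vertices are pages, directed edges are hyperlinks.
web = {
    "a.html": {"b.html", "c.html"},   # page a links to b and c
    "b.html": {"c.html"},
    "c.html": {"a.html"},
    "d.html": set(),                  # a page with no outgoing links
}

def undirected_view(graph):
    """Ignore link direction: p and q are neighbors if either links to the other."""
    und = {p: set() for p in graph}
    for p, outs in graph.items():
        for q in outs:
            und.setdefault(p, set()).add(q)
            und.setdefault(q, set()).add(p)
    return und

print(undirected_view(web)["c.html"])   # {'a.html', 'b.html'}
```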

3
Graph Structure in the Web (Broder et al. paper)
  • Report on the results of two massive web crawls
  • Executed by AltaVista in May and October 1999
  • Details of the crawls
  • automated script following hyperlinks (URLs) from
    pages found
  • large set of starting points collected over time
  • crawl implemented as breadth-first search (see the
    sketch below)
  • have to deal with web spam, infinite paths,
    timeouts, duplicates, etc.
  • May 99 crawl
  • 200 million pages, 1.5 billion links
  • Oct 99 crawl
  • 271 million pages, 2.1 billion links
  • Unaudited, self-reported Sep 03 stats
  • 3 major search engines claim > 3 billion pages
    indexed
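
Since the crawls were implemented as breadth-first search, here is a toy sketch of that idea. The `fetch_links` helper is hypothetical (it stands in for downloading a page and extracting its URLs), and the page bound is an assumption of the sketch; a real crawler must additionally handle the spam, timeout, and duplicate issues noted above.

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=1000):
    """Toy breadth-first crawl from a set of starting URLs.
    fetch_links(url) is assumed to return the hyperlink targets of a page."""
    seen = set(seeds)           # pages discovered so far
    frontier = deque(seeds)     # BFS queue
    edges = []                  # hyperlinks observed, as (source, target) pairs
    while frontier:
        page = frontier.popleft()
        for target in fetch_links(page):
            edges.append((page, target))
            if target not in seen and len(seen) < max_pages:
                seen.add(target)
                frontier.append(target)
    return seen, edges
```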

4
Five Easy Pieces
  • Authors did two kinds of breadth-first search
  • ignoring link direction → weak connectivity
  • only following forward links → strong
    connectivity
  • They then identify five different regions of the
    web
  • strongly connected component (SCC)
  • can reach any page in SCC from any other in
    directed fashion
  • component IN
  • can reach any page in SCC in directed fashion,
    but not reverse
  • component OUT
  • can be reached from any page in SCC, but not
    reverse
  • component TENDRILS
  • weakly connected to all of the above, but cannot
    reach SCC or be reached from SCC in directed
    fashion (e.g. pointed to by IN)
  • SCC + IN + OUT + TENDRILS form the weakly connected
    component (WCC)
  • everything else is called DISC (disconnected from
    the above)
  • here is a visualization of this structure (a sketch of
    how to compute the pieces follows)
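
One way to compute the pieces is sketched below. It assumes we already know one page `seed` inside the giant SCC and that the graph fits in memory as a dict from page to the set of pages it links to; this illustrates the definitions, not the procedure Broder et al. actually used at web scale.

```python
from collections import deque

def reachable(graph, start_set):
    """Breadth-first reachability from a set of start vertices."""
    seen, frontier = set(start_set), deque(start_set)
    while frontier:
        v = frontier.popleft()
        for w in graph.get(v, ()):
            if w not in seen:
                seen.add(w)
                frontier.append(w)
    return seen

def five_pieces(graph, seed):
    """Classify pages relative to the SCC containing `seed` (assumed giant)."""
    # reverse graph: who links *to* each page
    reverse = {}
    for v, outs in graph.items():
        reverse.setdefault(v, set())
        for w in outs:
            reverse.setdefault(w, set()).add(v)
    forward = reachable(graph, {seed})      # reachable from the seed (SCC + OUT)
    backward = reachable(reverse, {seed})   # can reach the seed (SCC + IN)
    scc = forward & backward
    in_part, out_part = backward - scc, forward - scc
    # undirected view, used to find the weakly connected component
    undirected = {v: graph.get(v, set()) | reverse.get(v, set())
                  for v in set(graph) | set(reverse)}
    wcc = reachable(undirected, scc)
    tendrils = wcc - scc - in_part - out_part
    disc = set(undirected) - wcc
    return scc, in_part, out_part, tendrils, disc
```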

5
Size of the Five
  • SCC: 56M pages, 28%
  • IN: 43M pages, 21%
  • OUT: 43M pages, 21%
  • TENDRILS: 44M pages, 22%
  • DISC: 17M pages, 8%
  • WCC: > 91% of the web --- the giant component
  • One interpretation of the pieces
  • SCC: the heart of the web
  • IN: newer sites not yet discovered and linked to
  • OUT: insular pages like corporate web sites

6
Diameter Measurements
  • Directed worst-case diameter of the SCC
  • at least 28
  • Directed worst-case diameter of IN → SCC → OUT
  • at least 503
  • Over 75% of the time, there is no directed path
    between a random start and finish page in the WCC
  • when there is a directed path, average length is
    about 16
  • Average undirected distance in the WCC is 7
  • Moral:
  • the web is a small world when we ignore direction
  • otherwise the picture is more complex

7
Degree Distributions
  • They are, of course, heavy-tailed
  • Power law distribution of component size
  • consistent with the Erdős–Rényi model
  • Undirected connectivity of the web is not reliant on
    "connectors"
  • what happens as we remove high-degree vertices? (see
    the sketch below)
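
A small sketch of that experiment, assuming the graph is given as an undirected adjacency dict: delete the k highest-degree vertices and re-measure the largest connected component.

```python
from collections import deque

def largest_component_size(graph, removed=frozenset()):
    """Size of the largest connected component of an undirected graph,
    pretending the vertices in `removed` (and their edges) are gone."""
    seen, best = set(removed), 0
    for start in graph:
        if start in seen:
            continue
        size, frontier = 0, deque([start])
        seen.add(start)
        while frontier:
            v = frontier.popleft()
            size += 1
            for w in graph[v]:
                if w not in seen:
                    seen.add(w)
                    frontier.append(w)
        best = max(best, size)
    return best

def top_connectors(graph, k):
    """The k highest-degree vertices."""
    return set(sorted(graph, key=lambda v: len(graph[v]), reverse=True)[:k])

# The slide's claim: the largest component barely shrinks even after
# removing top_connectors(graph, k) for sizable k.
```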

8
Digression: Collective Intelligence Foo Camp @ Google
  • Sponsored by O'Reilly publishers; interesting
    history
  • Interesting attendees
  • Tim O'Reilly, Rod Brooks, Larry Page, many others
  • Lots of CI start-ups
  • Interesting topics
  • Web 2.0, Wikipedia, recommender systems
  • Prediction markets and corporate apps
  • How to design such systems?
  • How to trick people into working for free?
    (ESP Game and CAPTCHAs)
  • Decomposing more complex problems (see behavioral
    experiments to come)
  • Bad actors and malicious behavior
  • Ants

9
Beyond Macroscopic Structure
  • Such studies tell us the coarse overall structure
    of the web
  • Use and construction of the web are more
    fine-grained
  • people browse the web for certain information or
    topics
  • people build pages that link to related or
    similar pages
  • How do we quantify and analyze this more detailed
    structure?
  • We'll examine two related examples
  • Kleinberg's hubs and authorities
  • automatic identification of web communities
  • PageRank
  • automatic identification of important pages
  • one of the main criteria used by Google
  • both rely mainly on the link structure of the web
  • both have an algorithm and a supporting theory

10
Hubs and Authorities
  • Suppose we have a large collection of pages on
    some topic
  • possibly the results of a standard web search
  • Some of these pages are highly relevant, others
    not at all
  • How could we automatically identify the important
    ones?
  • What's a good definition of importance?
  • Kleinberg's idea: there are two kinds of
    important pages
  • authorities: highly relevant pages
  • hubs: pages that point to lots of relevant pages
  • If you buy this definition, it further stands to
    reason that
  • a good hub should point to lots of good
    authorities
  • a good authority should be pointed to by many
    good hubs
  • this logic is, of course, circular
  • We need some math and an algorithm to sort it out

11
The HITS System (Hyperlink-Induced Topic Search)
  • Given a user-supplied query Q
  • assemble root set S of pages (e.g. first 200
    pages by AltaVista)
  • grow S to base set T by adding all pages linked
    (undirected) to S
  • might bound number of links considered from each
    page in S
  • Now consider directed subgraph induced on just
    pages in T
  • For each page p in T, define its
  • hub weight h(p): initialize all to 1
  • authority weight a(p): initialize all to 1
  • Repeat forever
  • a(p) = sum of h(q) over all pages q → p
  • h(p) = sum of a(q) over all pages p → q
  • renormalize all the weights (see the sketch below)
  • This algorithm will always converge!
  • the weights computed are related to eigenvectors of
    the connectivity matrix
  • further substructure revealed by different
    eigenvectors
  • Here are some examples
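
A minimal sketch of the iteration just described, on a directed subgraph stored as a dict from page to the set of pages it links to. The fixed iteration count and the L2 renormalization are choices made for this sketch, not details from the slides. Sorting pages by authority (or hub) weight then gives candidate authorities (or hubs) for the query topic.

```python
import math

def hits(graph, iterations=50):
    """Iterative hub/authority computation on a directed subgraph
    (page -> set of pages it links to), following the two update rules."""
    pages = set(graph) | {q for outs in graph.values() for q in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a(p) = sum of h(q) over all pages q -> p
        new_auth = {p: 0.0 for p in pages}
        for q, outs in graph.items():
            for p in outs:
                new_auth[p] += hub[q]
        # h(p) = sum of a(q) over all pages p -> q
        new_hub = {p: sum(new_auth[q] for q in graph.get(p, ())) for p in pages}
        # renormalize both weight vectors
        na = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        nh = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / na for p, v in new_auth.items()}
        hub = {p: v / nh for p, v in new_hub.items()}
    return hub, auth
```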

12
The PageRank Algorithm
  • Let's define a measure of page importance we will
    call the rank
  • Notation: for any page p, let
  • N(p) be the number of forward links (pages p
    points to)
  • R(p) be the (to-be-defined) rank of p
  • Idea: important pages distribute importance over
    their forward links
  • So we might try defining
  • R(p) = sum of R(q)/N(q) over all pages q → p
  • can again define iterative algorithm for
    computing the R(p)
  • if it converges, solution again has an
    eigenvector interpretation
  • problem: cycles accumulate rank but never
    distribute it
  • The fix:
  • R(p) = sum of R(q)/N(q) over all pages q → p,
    plus E(p)
  • E(p) is some external or exogenous measure of
    importance
  • some technical details omitted here (e.g.
    normalization)
  • Let's play with the PageRank calculator (a sketch of
    the iteration follows)
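
A sketch of the fixed-up update rule, with the omitted normalization handled by rescaling every round. The iteration count and the treatment of pages with no forward links (their rank is simply redistributed by the renormalization) are assumptions of this sketch.

```python
def pagerank(graph, E, iterations=100):
    """Iteratively compute R(p) = sum over q -> p of R(q)/N(q), plus E(p),
    renormalizing each round. graph maps page -> set of forward links;
    E is the exogenous importance, assumed to sum to 1."""
    pages = set(graph) | {q for outs in graph.values() for q in outs} | set(E)
    R = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_R = {p: E.get(p, 0.0) for p in pages}
        for q, outs in graph.items():
            if outs:                       # N(q) = number of forward links of q
                share = R[q] / len(outs)
                for p in outs:
                    new_R[p] += share
        total = sum(new_R.values())        # the "normalization" detail
        R = {p: v / total for p, v in new_R.items()}
    return R

# Example (uniform E over three hypothetical pages):
# pagerank({"a": {"b"}, "b": {"a", "c"}, "c": {"a"}},
#          {"a": 1/3, "b": 1/3, "c": 1/3})
```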

13
The Random Surfer Model
  • Let's suppose that E(p) sums to 1 (normalized)
  • Then the resulting PageRank solution R(p) will
  • also be normalized
  • can be interpreted as a probability distribution
  • R(p) is the stationary distribution of the
    following process
  • starting from some random page, just keep
    following random links
  • if stuck in a loop, jump to a random page drawn
    according to E(p)
  • so the surfer periodically gets bored and jumps to
    a new page
  • E(p) can thus be personalized for each surfer
  • An important component of Google's search criteria
    (a small simulation sketch follows)
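
A small simulation of the surfer process. For simplicity this sketch assumes E(p) is uniform over a supplied list of pages and uses an explicit "boredom" probability; that probability is a modeling choice of the sketch, not a number from the slides.

```python
import random

def random_surfer(graph, pages, steps=100_000, boredom=0.15):
    """Follow a random outgoing link; when stuck (no links) or bored,
    jump to a page drawn uniformly from `pages` (a uniform E(p)).
    Visit frequencies approximate the stationary distribution R(p)."""
    all_pages = set(graph) | {q for outs in graph.values() for q in outs} | set(pages)
    visits = {p: 0 for p in all_pages}
    page = random.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        outs = graph.get(page)
        if not outs or random.random() < boredom:
            page = random.choice(pages)        # bored or stuck: jump via E
        else:
            page = random.choice(sorted(outs)) # follow a random link
    return {p: count / steps for p, count in visits.items()}
```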

14
But What About Content?
  • PageRank and Hubs & Authorities
  • both based purely on link structure
  • often applied to a pre-computed set of pages
    filtered for content
  • So how do (say) search engines do this filtering?
  • This is the domain of information retrieval

15
Basics of Information Retrieval
  • Represent a document as a bag of words
  • for each word in the English language, count the
    number of occurrences
  • so d_i is the number of times the i-th word
    appears in the document
  • usually ignore common words (the, and, of, etc.)
  • usually do some stemming (e.g. washed → wash)
  • vectors are very long (100Ks) but very sparse
  • need some special representation exploiting
    sparseness (see the sketch below)
  • Note all that we ignore or throw away
  • the order in which the words appear
  • the grammatical structure of sentences (parsing)
  • the sense in which a word is used
  • firing a gun or firing an employee
  • and much, much more
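
A tiny sketch of the representation: a Counter is a convenient sparse vector, since only words that actually occur get entries. The stopword list is a stand-in and no stemming is done here.

```python
import re
from collections import Counter

STOPWORDS = {"the", "and", "of", "a", "an", "to", "in"}   # common words to ignore

def bag_of_words(text):
    """Sparse bag-of-words vector: word -> number of occurrences."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

print(bag_of_words("The cat washed the other cat"))
# Counter({'cat': 2, 'washed': 1, 'other': 1})
```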

16
Bag of Words Document Comparison
  • View documents as vectors in a very
    high-dimensional space
  • Can now import geometry and linear algebra
    concepts
  • Similarity between documents d and e:
  • S = sum of d_i * e_i over all words i
  • may normalize d and e first
  • this is their projection onto each other
  • Improve by using TF/IDF weighting of words
  • term frequency --- how frequent is the word in
    this document?
  • inverse document frequency --- how frequent in
    all documents?
  • give high weight to words with high TF and low
    IDF
  • Search engines:
  • view the query as just another document
  • look for similar documents via the above (a TF-IDF
    sketch follows)
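
A sketch of TF/IDF reweighting and the normalized similarity above, operating on the sparse word-count vectors from the previous sketch; the particular IDF formula (log of the inverse document frequency) is one common choice, assumed here rather than taken from the slides.

```python
import math

def tfidf(doc_vectors):
    """Reweight sparse bag-of-words vectors (word -> count) by TF * IDF,
    where IDF = log(#documents / #documents containing the word)."""
    n = len(doc_vectors)
    df = {}                                   # document frequency of each word
    for d in doc_vectors:
        for w in d:
            df[w] = df.get(w, 0) + 1
    return [{w: tf * math.log(n / df[w]) for w, tf in d.items()}
            for d in doc_vectors]

def similarity(d, e):
    """Normalize d and e, then take their inner product:
    S = sum of d_i * e_i over all words i (cosine similarity)."""
    dot = sum(v * e.get(w, 0.0) for w, v in d.items())
    nd = math.sqrt(sum(v * v for v in d.values()))
    ne = math.sqrt(sum(v * v for v in e.values()))
    return dot / (nd * ne) if nd and ne else 0.0

# The query is treated as just another document: vectorize it the same way
# and rank documents by similarity() against the query vector.
```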

17
Looking Ahead: Left Side vs. Right Side
  • So far we have been discussing the left-hand search
    results on Google
  • a.k.a. "organic" search
  • Right-hand or "sponsored" search: paid
    advertisements in a formal market
  • We will spend a lecture on these markets later in
    the term
  • Same two types of search/results on Yahoo!, MSN, etc.
  • Common perception
  • organic results are objective, based on
    content, importance, etc.
  • sponsored results are subjective advertisements
  • But both sides are subject to gaming (strategic
    behavior)
  • organic: invisible terms in the HTML, link farms
    and web spam, reverse engineering
  • sponsored: bidding behavior, jamming
  • optimization of each side has its own industry:
    SEO and SEM
  • and perhaps to outright fraud
  • organic: typo-squatting
  • sponsored: click fraud
  • More later