Transcript and Presenter's Notes

Title: Network Science and the Web


1
Network Science and the Web
  • Networked Life
  • CIS 112
  • Spring 2008
  • Prof. Michael Kearns

2
The Web as Network
  • Consider the web as a network
  • vertices: individual (HTML) pages
  • edges: hyperlinks between pages
  • we will view it as both a directed and an undirected
    graph (see the sketch below)
  • What is the structure of this network?
  • connected components
  • degree distributions
  • etc.
  • What does it say about the people building and
    using it?
  • page and link generation
  • visitation statistics
  • What are the algorithmic consequences?
  • web search
  • community identification
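
A minimal sketch of the two views, using made-up page names rather than anything from the crawls discussed later: the web is stored as a dict from each page to the set of pages it links to, and the undirected view simply ignores link direction.

```python
# Hypothetical pages; vertices are pages, directed edges are hyperlinks.
web = {
    "a.html": {"b.html", "c.html"},   # page a links to b and c
    "b.html": {"c.html"},
    "c.html": {"a.html"},
    "d.html": set(),                  # a page with no outgoing links
}

def undirected_view(graph):
    """Ignore link direction: p and q are neighbors if either links to the other."""
    und = {p: set() for p in graph}
    for p, outs in graph.items():
        for q in outs:
            und.setdefault(p, set()).add(q)
            und.setdefault(q, set()).add(p)
    return und

print(undirected_view(web)["c.html"])   # {'a.html', 'b.html'}
```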

3
Graph Structure in the Web (Broder et al. paper)
  • Report on the results of two massive web crawls
  • Executed by AltaVista in May and October 1999
  • Details of the crawls
  • automated script following hyperlinks (URLs) from
    pages found
  • large set of starting points collected over time
  • crawl implemented as breadth-first search (see the
    sketch below)
  • have to deal with web spam, infinite paths,
    timeouts, duplicates, etc.
  • May 99 crawl
  • 200 million pages, 1.5 billion links
  • Oct 99 crawl
  • 271 million pages, 2.1 billion links
  • Unaudited, self-reported Sep 03 stats
  • 3 major search engines claim > 3 billion pages
    indexed
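
Since the crawls were implemented as breadth-first search, here is a toy sketch of that idea. The `fetch_links` helper is hypothetical (it stands in for downloading a page and extracting its URLs), and the page bound is an assumption of the sketch; a real crawler must additionally handle the spam, timeout, and duplicate issues noted above.

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=1000):
    """Toy breadth-first crawl from a set of starting URLs.
    fetch_links(url) is assumed to return the hyperlink targets of a page."""
    seen = set(seeds)           # pages discovered so far
    frontier = deque(seeds)     # BFS queue
    edges = []                  # hyperlinks observed, as (source, target) pairs
    while frontier:
        page = frontier.popleft()
        for target in fetch_links(page):
            edges.append((page, target))
            if target not in seen and len(seen) < max_pages:
                seen.add(target)
                frontier.append(target)
    return seen, edges
```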

4
Five Easy Pieces
  • Authors did two kinds of breadth-first search
  • ignoring link direction → weak connectivity
  • only following forward links → strong
    connectivity
  • They then identify five different regions of the
    web
  • strongly connected component (SCC)
  • can reach any page in SCC from any other in
    directed fashion
  • component IN
  • can reach any page in SCC in directed fashion,
    but not reverse
  • component OUT
  • can be reached from any page in SCC, but not
    reverse
  • component TENDRILS
  • weakly connected to all of the above, but cannot
    reach SCC or be reached from SCC in directed
    fashion (e.g. pointed to by IN)
  • SCC + IN + OUT + TENDRILS form the weakly connected
    component (WCC)
  • everything else is called DISC (disconnected from
    the above)
  • here is a visualization of this structure (a sketch of
    how to compute the pieces follows)
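
One way to compute the pieces is sketched below. It assumes we already know one page `seed` inside the giant SCC and that the graph fits in memory as a dict from page to the set of pages it links to; this illustrates the definitions, not the procedure Broder et al. actually used at web scale.

```python
from collections import deque

def reachable(graph, start_set):
    """Breadth-first reachability from a set of start vertices."""
    seen, frontier = set(start_set), deque(start_set)
    while frontier:
        v = frontier.popleft()
        for w in graph.get(v, ()):
            if w not in seen:
                seen.add(w)
                frontier.append(w)
    return seen

def five_pieces(graph, seed):
    """Classify pages relative to the SCC containing `seed` (assumed giant)."""
    # reverse graph: who links *to* each page
    reverse = {}
    for v, outs in graph.items():
        reverse.setdefault(v, set())
        for w in outs:
            reverse.setdefault(w, set()).add(v)
    forward = reachable(graph, {seed})      # reachable from the seed (SCC + OUT)
    backward = reachable(reverse, {seed})   # can reach the seed (SCC + IN)
    scc = forward & backward
    in_part, out_part = backward - scc, forward - scc
    # undirected view, used to find the weakly connected component
    undirected = {v: graph.get(v, set()) | reverse.get(v, set())
                  for v in set(graph) | set(reverse)}
    wcc = reachable(undirected, scc)
    tendrils = wcc - scc - in_part - out_part
    disc = set(undirected) - wcc
    return scc, in_part, out_part, tendrils, disc
```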

5
Size of the Five
  • SCC: 56M pages, 28%
  • IN: 43M pages, 21%
  • OUT: 43M pages, 21%
  • TENDRILS: 44M pages, 22%
  • DISC: 17M pages, 8%
  • WCC: > 91% of the web --- the giant component
  • One interpretation of the pieces
  • SCC: the heart of the web
  • IN: newer sites not yet discovered and linked to
  • OUT: insular pages like corporate web sites

6
Diameter Measurements
  • Directed worst-case diameter of the SCC
  • at least 28
  • Directed worst-case diameter of IN → SCC → OUT
  • at least 503
  • Over 75% of the time, there is no directed path
    between a random start and finish page in the WCC
  • when there is a directed path, average length is
    about 16
  • Average undirected distance in the WCC is 7
  • Moral:
  • the web is a small world when we ignore direction
  • otherwise the picture is more complex

7
Degree Distributions
  • They are, of course, heavy-tailed
  • Power law distribution of component size
  • consistent with the Erdős–Rényi model
  • Undirected connectivity of the web is not reliant on
    "connectors"
  • what happens as we remove high-degree vertices? (see
    the sketch below)
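
A small sketch of that experiment, assuming the graph is given as an undirected adjacency dict: delete the k highest-degree vertices and re-measure the largest connected component.

```python
from collections import deque

def largest_component_size(graph, removed=frozenset()):
    """Size of the largest connected component of an undirected graph,
    pretending the vertices in `removed` (and their edges) are gone."""
    seen, best = set(removed), 0
    for start in graph:
        if start in seen:
            continue
        size, frontier = 0, deque([start])
        seen.add(start)
        while frontier:
            v = frontier.popleft()
            size += 1
            for w in graph[v]:
                if w not in seen:
                    seen.add(w)
                    frontier.append(w)
        best = max(best, size)
    return best

def top_connectors(graph, k):
    """The k highest-degree vertices."""
    return set(sorted(graph, key=lambda v: len(graph[v]), reverse=True)[:k])

# The slide's claim: the largest component barely shrinks even after
# removing top_connectors(graph, k) for sizable k.
```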

8
Digression: Collective Intelligence Foo Camp @ Google
  • Sponsored by O'Reilly publishers; interesting
    history
  • Interesting attendees
  • Tim O'Reilly, Rod Brooks, Larry Page, many others
  • Lots of CI start-ups
  • Interesting topics
  • Web 2.0, Wikipedia, recommender systems
  • Prediction markets and corporate apps
  • How to design such systems?
  • How to trick people into working for free?
    (ESP Game and CAPTCHAs)
  • Decomposing more complex problems (see behavioral
    experiments to come)
  • Bad actors and malicious behavior
  • Ants

9
Beyond Macroscopic Structure
  • Such studies tell us the coarse overall structure
    of the web
  • Use and construction of the web are more
    fine-grained
  • people browse the web for certain information or
    topics
  • people build pages that link to related or
    similar pages
  • How do we quantify and analyze this more detailed
    structure?
  • We'll examine two related examples
  • Kleinberg's hubs and authorities
  • automatic identification of web communities
  • PageRank
  • automatic identification of important pages
  • one of the main criteria used by Google
  • both rely mainly on the link structure of the web
  • both have an algorithm and a supporting theory

10
Hubs and Authorities
  • Suppose we have a large collection of pages on
    some topic
  • possibly the results of a standard web search
  • Some of these pages are highly relevant, others
    not at all
  • How could we automatically identify the important
    ones?
  • What's a good definition of importance?
  • Kleinberg's idea: there are two kinds of
    important pages
  • authorities: highly relevant pages
  • hubs: pages that point to lots of relevant pages
  • If you buy this definition, it further stands to
    reason that
  • a good hub should point to lots of good
    authorities
  • a good authority should be pointed to by many
    good hubs
  • this logic is, of course, circular
  • We need some math and an algorithm to sort it out

11
The HITS System (Hyperlink-Induced Topic Search)
  • Given a user-supplied query Q
  • assemble root set S of pages (e.g. first 200
    pages by AltaVista)
  • grow S to base set T by adding all pages linked
    (undirected) to S
  • might bound number of links considered from each
    page in S
  • Now consider directed subgraph induced on just
    pages in T
  • For each page p in T, define its
  • hub weight h(p): initialize all to 1
  • authority weight a(p): initialize all to 1
  • Repeat forever
  • a(p) = sum of h(q) over all pages q → p
  • h(p) = sum of a(q) over all pages p → q
  • renormalize all the weights (see the sketch below)
  • This algorithm will always converge!
  • the weights computed are related to eigenvectors of
    the connectivity matrix
  • further substructure revealed by different
    eigenvectors
  • Here are some examples
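
A minimal sketch of the iteration just described, on a directed subgraph stored as a dict from page to the set of pages it links to. The fixed iteration count and the L2 renormalization are choices made for this sketch, not details from the slides. Sorting pages by authority (or hub) weight then gives candidate authorities (or hubs) for the query topic.

```python
import math

def hits(graph, iterations=50):
    """Iterative hub/authority computation on a directed subgraph
    (page -> set of pages it links to), following the two update rules."""
    pages = set(graph) | {q for outs in graph.values() for q in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a(p) = sum of h(q) over all pages q -> p
        new_auth = {p: 0.0 for p in pages}
        for q, outs in graph.items():
            for p in outs:
                new_auth[p] += hub[q]
        # h(p) = sum of a(q) over all pages p -> q
        new_hub = {p: sum(new_auth[q] for q in graph.get(p, ())) for p in pages}
        # renormalize both weight vectors
        na = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        nh = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / na for p, v in new_auth.items()}
        hub = {p: v / nh for p, v in new_hub.items()}
    return hub, auth
```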

12
The PageRank Algorithm
  • Let's define a measure of page importance we will
    call the rank
  • Notation: for any page p, let
  • N(p) be the number of forward links (pages p
    points to)
  • R(p) be the (to-be-defined) rank of p
  • Idea: important pages distribute importance over
    their forward links
  • So we might try defining
  • R(p) = sum of R(q)/N(q) over all pages q → p
  • can again define iterative algorithm for
    computing the R(p)
  • if it converges, solution again has an
    eigenvector interpretation
  • problem: cycles accumulate rank but never
    distribute it
  • The fix:
  • R(p) = sum of R(q)/N(q) over all pages q → p,
    plus E(p)
  • E(p) is some external or exogenous measure of
    importance
  • some technical details omitted here (e.g.
    normalization)
  • Let's play with the PageRank calculator (a sketch of
    the iteration follows)
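
A sketch of the fixed-up update rule, with the omitted normalization handled by rescaling every round. The iteration count and the treatment of pages with no forward links (their rank is simply redistributed by the renormalization) are assumptions of this sketch.

```python
def pagerank(graph, E, iterations=100):
    """Iteratively compute R(p) = sum over q -> p of R(q)/N(q), plus E(p),
    renormalizing each round. graph maps page -> set of forward links;
    E is the exogenous importance, assumed to sum to 1."""
    pages = set(graph) | {q for outs in graph.values() for q in outs} | set(E)
    R = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_R = {p: E.get(p, 0.0) for p in pages}
        for q, outs in graph.items():
            if outs:                       # N(q) = number of forward links of q
                share = R[q] / len(outs)
                for p in outs:
                    new_R[p] += share
        total = sum(new_R.values())        # the "normalization" detail
        R = {p: v / total for p, v in new_R.items()}
    return R

# Example (uniform E over three hypothetical pages):
# pagerank({"a": {"b"}, "b": {"a", "c"}, "c": {"a"}},
#          {"a": 1/3, "b": 1/3, "c": 1/3})
```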

13
The Random Surfer Model
  • Let's suppose that E(p) sums to 1 (normalized)
  • Then the resulting PageRank solution R(p) will
  • also be normalized
  • can be interpreted as a probability distribution
  • R(p) is the stationary distribution of the
    following process
  • starting from some random page, just keep
    following random links
  • if stuck in a loop, jump to a random page drawn
    according to E(p)
  • so the surfer periodically gets bored and jumps to
    a new page
  • E(p) can thus be personalized for each surfer
  • An important component of Google's search criteria
    (a small simulation sketch follows)
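
A small simulation of the surfer process. For simplicity this sketch assumes E(p) is uniform over a supplied list of pages and uses an explicit "boredom" probability; that probability is a modeling choice of the sketch, not a number from the slides.

```python
import random

def random_surfer(graph, pages, steps=100_000, boredom=0.15):
    """Follow a random outgoing link; when stuck (no links) or bored,
    jump to a page drawn uniformly from `pages` (a uniform E(p)).
    Visit frequencies approximate the stationary distribution R(p)."""
    all_pages = set(graph) | {q for outs in graph.values() for q in outs} | set(pages)
    visits = {p: 0 for p in all_pages}
    page = random.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        outs = graph.get(page)
        if not outs or random.random() < boredom:
            page = random.choice(pages)        # bored or stuck: jump via E
        else:
            page = random.choice(sorted(outs)) # follow a random link
    return {p: count / steps for p, count in visits.items()}
```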

14
But What About Content?
  • PageRank and Hubs & Authorities
  • both based purely on link structure
  • often applied to a pre-computed set of pages
    filtered for content
  • So how do (say) search engines do this filtering?
  • This is the domain of information retrieval

15
Basics of Information Retrieval
  • Represent a document as a bag of words
  • for each word in the English language, count the
    number of occurrences
  • so d_i is the number of times the i-th word
    appears in the document
  • usually ignore common words (the, and, of, etc.)
  • usually do some stemming (e.g. washed → wash)
  • vectors are very long (100Ks) but very sparse
  • need some special representation exploiting
    sparseness (see the sketch below)
  • Note all that we ignore or throw away
  • the order in which the words appear
  • the grammatical structure of sentences (parsing)
  • the sense in which a word is used
  • firing a gun or firing an employee
  • and much, much more
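
A tiny sketch of the representation: a Counter is a convenient sparse vector, since only words that actually occur get entries. The stopword list is a stand-in and no stemming is done here.

```python
import re
from collections import Counter

STOPWORDS = {"the", "and", "of", "a", "an", "to", "in"}   # common words to ignore

def bag_of_words(text):
    """Sparse bag-of-words vector: word -> number of occurrences."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

print(bag_of_words("The cat washed the other cat"))
# Counter({'cat': 2, 'washed': 1, 'other': 1})
```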

16
Bag of Words Document Comparison
  • View documents as vectors in a very
    high-dimensional space
  • Can now import geometry and linear algebra
    concepts
  • Similarity between documents d and e:
  • S = sum of d_i * e_i over all words i
  • may normalize d and e first
  • this is their projection onto each other
  • Improve by using TF/IDF weighting of words
  • term frequency --- how frequent is the word in
    this document?
  • inverse document frequency --- how frequent in
    all documents?
  • give high weight to words with high TF and low
    IDF
  • Search engines:
  • view the query as just another document
  • look for similar documents via the above (a TF-IDF
    sketch follows)
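
A sketch of TF/IDF reweighting and the normalized similarity above, operating on the sparse word-count vectors from the previous sketch; the particular IDF formula (log of the inverse document frequency) is one common choice, assumed here rather than taken from the slides.

```python
import math

def tfidf(doc_vectors):
    """Reweight sparse bag-of-words vectors (word -> count) by TF * IDF,
    where IDF = log(#documents / #documents containing the word)."""
    n = len(doc_vectors)
    df = {}                                   # document frequency of each word
    for d in doc_vectors:
        for w in d:
            df[w] = df.get(w, 0) + 1
    return [{w: tf * math.log(n / df[w]) for w, tf in d.items()}
            for d in doc_vectors]

def similarity(d, e):
    """Normalize d and e, then take their inner product:
    S = sum of d_i * e_i over all words i (cosine similarity)."""
    dot = sum(v * e.get(w, 0.0) for w, v in d.items())
    nd = math.sqrt(sum(v * v for v in d.values()))
    ne = math.sqrt(sum(v * v for v in e.values()))
    return dot / (nd * ne) if nd and ne else 0.0

# The query is treated as just another document: vectorize it the same way
# and rank documents by similarity() against the query vector.
```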

17
Looking Ahead: Left Side vs. Right Side
  • So far we have been discussing the left-hand search
    results on Google
  • a.k.a. "organic" search
  • Right-hand or "sponsored" search: paid
    advertisements in a formal market
  • We will spend a lecture on these markets later in
    the term
  • Same two types of search/results on Yahoo!, MSN, etc.
  • Common perception
  • organic results are objective, based on
    content, importance, etc.
  • sponsored results are subjective advertisements
  • But both sides are subject to gaming (strategic
    behavior)
  • organic: invisible terms in the HTML, link farms
    and web spam, reverse engineering
  • sponsored: bidding behavior, jamming
  • optimization of each side has its own industry:
    SEO and SEM
  • and perhaps to outright fraud
  • organic: typo-squatting
  • sponsored: click fraud
  • More later