Title: Web Modeling


1
Web Modeling, Mark Joseph, SI 767, 2-15-06
2
Why do we want to know what the Web "looks like"?
Why do we care how it grows? Why should we care
how pages are connected? Simple reason: as NLP
and IR professionals, this knowledge can help us
design techniques to retrieve information.
3
I am going to build on the foundation that the
Web can be modeled as a mathematical graph by
considering its pages to be nodes connected by
arcs corresponding to hyperlinks (Thelwall &
Wilkinson; we'll see this later).
4
  • The Web is the largest repository of data, and it
    grows exponentially.
  • 320 million Web pages (Lawrence & Giles, 1998)
  • 800 million Web pages, 15 TB (Lawrence & Giles,
    1999)
  • 8 billion Web pages indexed (Google, 2005)
  • Amount of data: roughly 200 TB (Lyman et al.,
    2003)
  • To search the Web, and because the Web is ever
    changing, we need a model that explains how the
    Web grows.

(Radev lecture, 2005)
5
To start: Graph Theory
6
There is no central authority over the Web. Web
pages are created by individuals, who then
connect their pages to other Web pages using
hyperlinks. These designers hope to be linked to
in turn, to draw more traffic to their sites and
thus justify, inform through, or profit from
their pages. With no one coordinating it, the
linking looks random. Hence, the Random Graph.
7
  • The Random Graph was first defined by Paul Erdős
    and Alfréd Rényi in their 1959 paper "On Random
    Graphs I". It works something like this:
  • 1. Start with n vertices.
  • 2. After each time unit, one new edge is created
    at random, with every possible edge that is not
    yet present equally likely.

(Wikipedia, http://en.wikipedia.org/wiki/Random_graph)
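The two steps above can be sketched in a few lines of Python. This is only an illustration, not code from the lecture: the function name random_graph_process and its parameters are made up here, and it assumes undirected edges with no duplicates and no self-loops.

```python
import itertools
import random

def random_graph_process(n, steps, seed=None):
    """Start with n isolated vertices and add one random new edge per time unit."""
    rng = random.Random(seed)
    # Every possible undirected edge (u, v) with u < v.
    candidates = list(itertools.combinations(range(n), 2))
    # A prefix of a uniform shuffle is the same as picking, at each step,
    # one of the not-yet-present edges with equal probability.
    rng.shuffle(candidates)
    return candidates[:steps]

edges = random_graph_process(n=6, steps=5, seed=0)
print(edges)
```

Running it for more steps than there are possible edges simply returns the complete graph.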
23
Random Graph Measurements
24
Diameter: the longest shortest path in a network
(excluding backtracks, detours, and loops).
Assuming I'm not losing it, the diameter here is
3.
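As a rough sketch of how a diameter can be computed, the Python below runs a breadth-first search from every node and keeps the largest distance found. The slide's figure is not in the transcript, so the four-node path graph here is a stand-in chosen only because its diameter also happens to be 3; the function names are illustrative.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest hop counts from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(adj):
    """Longest shortest path over all pairs that can reach each other."""
    return max(max(bfs_distances(adj, s).values()) for s in adj)

# Hypothetical example: a simple path 0 - 1 - 2 - 3.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(diameter(path_graph))
```

This all-pairs approach costs one search per node, which is fine for toy graphs but too slow for Web-scale data without sampling.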
25
Clustering Coefficient: a test for a small-world
network, introduced by Duncan J. Watts and Steven
Strogatz in 1998. C_i = (number of connections
between i's neighbors) / (maximum number of
possible connections between those same neighbors)
26
Because the Web is directed (hyperlinks only go
one way), the calculation is as follows:
$C_i = e_{jk} / (k_i (k_i - 1))$, where e_jk is the
number of directed connections between i's
neighbors and k_i (k_i - 1) is the number of
potential connections
27
Clustering Coefficient
28
Clustering Coefficient
$C_i = \frac{e_{jk}}{k_i (k_i - 1)}$, with $e_{jk} = 3$ and $k_i = 6$:
$C_i = \frac{3}{6(6 - 1)} = \frac{3}{30} = \frac{1}{10}$
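The same arithmetic can be checked with a short Python sketch. The graph below is invented to match the numbers on the slide (a node "i" with 6 neighbors and 3 directed links among them); it is not the graph from the missing figure, and the function name is illustrative.

```python
def clustering_coefficient(out_links, i):
    """Directed clustering coefficient e_jk / (k_i * (k_i - 1)) for node i."""
    # Neighbors of i: pages i links to, plus pages that link to i.
    neighbors = set(out_links.get(i, ()))
    neighbors |= {src for src, targets in out_links.items() if i in targets}
    neighbors.discard(i)
    k = len(neighbors)
    if k < 2:
        return 0.0
    # Count directed links that start and end inside the neighborhood.
    e = sum(
        1
        for j in neighbors
        for t in set(out_links.get(j, ()))
        if t in neighbors and t != j
    )
    return e / (k * (k - 1))

out_links = {
    "i": ["a", "b", "c", "d", "e", "f"],
    "a": ["b"],
    "b": ["c"],
    "c": ["d"],
}
print(clustering_coefficient(out_links, "i"))  # 3 / 30
```

Averaging this value over every node gives the coefficient for the whole network.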
29
The clustering coefficient for an entire network
is
the average of all clustering coefficients in the
network
30
Degree of a node/vertex: the number of arcs or
edges connected to that node/vertex.
In-degree and out-degree: the number of directed
arcs going in and out, respectively, in a
directed graph
31
Degree Distributions (Mark Newman, The Structure
and Function of Complex Networks): p_k is the
fraction of vertices in the network that have
degree k; equivalently, p_k is the probability
that a vertex chosen uniformly at random has
degree k
32
A plot of pk for any given network can be formed
by making a histogram of the degrees of vertices.
This histogram is known as the degree
distribution. NOTE: These are often plotted on a
log-log scale (typically with binned or
cumulative counts) to cut down on the noise in
the tail
Another important note when graphing a Web
network: because the representation is a
directed graph, each vertex has both an in-degree
and an out-degree, so the degree distribution
becomes a function p_jk, representing the
fraction of vertices that simultaneously have
in-degree j and out-degree k.
Mark Newman in The Structure and Function of
Complex Networks
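A small Python sketch may make p_k and p_jk concrete. The edge list is a made-up four-page example, and the function name and return layout are choices made here for illustration.

```python
from collections import Counter

def degree_distributions(edges):
    """Return (p_in, p_out, p_joint) for a list of directed (source, target) links."""
    nodes = {page for edge in edges for page in edge}
    in_deg = Counter(target for _, target in edges)
    out_deg = Counter(source for source, _ in edges)
    n = len(nodes)
    # Counter returns 0 for pages with no in-links or no out-links.
    p_in = Counter(in_deg[v] for v in nodes)
    p_out = Counter(out_deg[v] for v in nodes)
    p_joint = Counter((in_deg[v], out_deg[v]) for v in nodes)

    def normalize(counts):
        return {key: count / n for key, count in counts.items()}

    return normalize(p_in), normalize(p_out), normalize(p_joint)

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
p_in, p_out, p_joint = degree_distributions(edges)
print(p_in)     # fraction of pages with each in-degree
print(p_joint)  # fraction of pages with each (in-degree, out-degree) pair
```

Here half of the pages have in-degree 0, and page "c" alone accounts for the (3, 0) entry of the joint distribution.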
34
  • Two Important Degree Distribution Models
  • Assuming the random graph from above, where:
  • 1. Start with n vertices.
  • 2. After each time unit, one new edge is created
    at random, with every possible edge that is not
    yet present equally likely.

35
Poisson distribution: if the adding of new edges
between the vertices is independent of the
presence or absence of any other edge, so that
each edge is present with independent probability
p, the degrees follow a Poisson distribution
(for large n).
Newman et al., Random Graphs with Arbitrary
Degree Distributions and Their Applications
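A short worked step shows where the Poisson form comes from. The symbols n (number of vertices) and z (mean degree) are introduced here for the sketch. With each of the n - 1 possible edges at a vertex present independently with probability p, the degree is binomial, and the binomial tends to a Poisson distribution when n is large and z is held fixed:

```latex
p_k = \binom{n-1}{k}\, p^{k} (1-p)^{\,n-1-k}
\;\approx\; \frac{z^{k} e^{-z}}{k!},
\qquad z = p\,(n-1), \quad n \to \infty .
```

The factorial in the denominator is why the probability of a very high degree drops off so quickly in this model.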
36
Problem: a number of studies of different
networks have found degree distributions that
are measurably different from a Poisson
distribution, and one of those networks is the
Web. A Poisson distribution predicts a fall-off
at least as fast as exponential toward larger
in- or out-degree; in these networks the tail
falls off much more slowly.
Solution: Power Law Distribution, or Scale-Free
Graph
37
A Quick Comparison of Poisson vs. Power Law
38
[Figure: Poisson graph; number of nodes found: 93]
Thanks Lada!!! Adamic lecture 2006
39
[Figure: power-law graph; number of nodes found: 94, 6, 2]
Thanks Lada!!! Adamic lecture 2006
40
Power law networks
  • Many real-world networks contain hubs: highly
    connected nodes
  • Usually the distribution of edges is extremely
    skewed

[Figure: number of nodes with so many edges,
plotted against number of edges. Many nodes have
few edges; the fat tail holds a few nodes with a
very large number of edges; there is no typical
number of edges.]
Thanks Lada!!! Adamic lecture 2006
41
But is it really a power-law?
  • A power law will appear as a straight line on a
    log-log plot
  • A deviation from a straight line could indicate a
    different distribution, such as:
  • exponential
  • lognormal

[Figure axes: log(nodes) vs. log(edges)]
Thanks Lada!!! Adamic lecture 2006
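The straight-line test can be tried numerically with a least-squares fit on the log-log points. The Python below is a quick sketch on synthetic counts generated here from an exact power law with exponent 2; the function name is illustrative, and a plain least-squares fit is only a rough check, not a rigorous way to estimate an exponent from noisy data.

```python
import math

def loglog_slope(ks, counts):
    """Least-squares slope of log(count) against log(k)."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic data: count proportional to k^-2, so the slope should be about -2.
ks = [1, 2, 4, 8, 16]
counts = [1000.0 * k ** -2.0 for k in ks]
slope = loglog_slope(ks, counts)
print(round(slope, 6))
```

An exponential or lognormal distribution fed through the same function would still return a slope, so the plot should be inspected for curvature before trusting the number.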
42
Barabasi/Albert Model for Preferential Attachment
43
Easiest way to describe a Power Law
Distribution: "The Rich Get Richer." Applied to
networks, the argument goes that vertices with a
higher degree have a higher probability of being
linked to by newly added vertices.
44
This theory applied to networks first gained
wider acceptance after the publication in 1999 of
a paper by Barabási and Albert entitled "The
Emergence of Scaling in Random Networks." In
this paper they coined the term "preferential
attachment" to define this weighted probability
of new vertex attachment.
Mark Newman in The Structure and Function of
Complex Networks
45
Barabási and Albert's model is undirected, which
can be construed as problematic when looking at
the Web, but what it sacrifices in realism it
makes up for in simplicity.
Mark Newman in The Structure and Function of
Complex Networks
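A compact Python sketch of undirected preferential attachment follows. It is an illustration under assumptions made here, not the authors' code: the process starts from a small complete graph on m + 1 nodes, each new node adds m links to distinct existing nodes, and the names barabasi_albert, n, m, and seed are made up for the example.

```python
import random
from collections import Counter

def barabasi_albert(n, m, seed=None):
    """Grow an undirected graph where new nodes prefer high-degree targets."""
    rng = random.Random(seed)
    # Assumed starting point: a complete graph on m + 1 nodes.
    edges = [(u, v) for u in range(m + 1) for v in range(u)]
    # Each node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list picks a node with probability proportional to degree.
    endpoints = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for target in targets:
            edges.append((new, target))
            endpoints.extend((new, target))
    return edges

m = 2
ba_edges = barabasi_albert(n=200, m=m, seed=42)
degree = Counter(node for edge in ba_edges for node in edge)
print(degree.most_common(3))  # the oldest nodes tend to become hubs
```

Plotting a histogram of `degree.values()` on log-log axes is the usual next step for seeing the skew.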
46
Broder et al.: Bow Tie Structure of the Web
47
Numerous large-scale studies have developed the
"Bow Tie Structure of the Web," a model
initially coined by Andrei Broder, Ravi Kumar,
Farzin Maghoul, Prabhakar Raghavan, Sridhar
Rajagopalan, Raymie Stata, Andrew Tomkins, and
Janet Wiener in their paper "Graph Structure in
the Web." These studies have been duplicated a
number of times.
48
Bow-tie model of the Web
[Figure: bow tie, with region sizes in millions of pages]
TENDRILS: 44M
SCC: 56M
OUT: 44M
IN: 44M
DISC: 17M
Broder et al., WWW 2000; Dill et al., VLDB 2001
Thanks Drago
49
Statistics from the bow-tie study of Broder et al.:
SCC 27.5%; IN and OUT 21.5% each; tendrils and
tubes 21.5%; disconnected 8%; 24% of pages
reachable from a given page
Broder et al., WWW 2000; Dill et al., VLDB 2001
Thanks Drago and Lada
50
Now to Get Into the Mathematics of the Power Law
in the Internet
"Heuristically Optimized Trade-offs: A New
Paradigm for Power Laws in the Internet," by Alex
Fabrikant, Elias Koutsoupias, and Christos H.
Papadimitriou
Fabrikant et al.
51
First and foremost: the Web is scale-free.
The model of the Web: a tree is built as nodes
arrive uniformly at random. When the i-th node
arrives, it attaches itself to one of the
previous nodes.
Fabrikant et al.
52
We assume this node would like to connect to a
centrally located node: a node whose distance to
other nodes is minimized. Node i attaches to the
earlier node j that minimizes $\alpha d_{ij} + h_j$,
where d_ij is the Euclidean distance between i
and j, h_j is some measure of the centrality of
node j, and α is a parameter (a function of the
final number n of points) gauging the relative
importance of the two objectives
Fabrikant et al.
53
Fabrikant et al. define 3 possible measures of
centrality:
1. The average number of hops from other nodes
2. The maximum number of hops from another node
3. The number of hops from a fixed center of the
tree
Fabrikant et al.
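A minimal Python sketch of this growth process, under assumptions made here: points are drawn uniformly in the unit square, node 0 is the fixed center, centrality is measure 3 (hops to node 0), and each arriving node attaches to the earlier node with the smallest weighted sum of Euclidean distance and centrality. The weight is called alpha in the code, and the function name fkp_tree is illustrative.

```python
import math
import random

def fkp_tree(n, alpha, seed=None):
    """Return parent[i] for a tree grown by a distance/centrality trade-off."""
    rng = random.Random(seed)
    points = [(rng.random(), rng.random())]
    hops = [0]        # hops[j]: number of hops from node j to node 0
    parent = [None]   # node 0 is the root
    for i in range(1, n):
        p = (rng.random(), rng.random())
        # Cost of attaching to j: alpha * distance(i, j) + centrality(j).
        best = min(
            range(i),
            key=lambda j: alpha * math.dist(p, points[j]) + hops[j],
        )
        points.append(p)
        parent.append(best)
        hops.append(hops[best] + 1)
    return parent

star = fkp_tree(50, alpha=0.0, seed=1)    # distance ignored
tree = fkp_tree(50, alpha=10.0, seed=1)   # distance matters more
print(star[:5], tree[:5])
```

With alpha set to 0 every node attaches straight to the root, which is the star case; raising alpha makes nearby nodes more attractive than central ones.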
54
α is the crux of the theorem! Why? Here are
some examples
Fabrikant et al.
55
If α is too low, then the Euclidean distances
become unimportant, and the network resembles a
star
Fabrikant et al.
56
But if α grows at least as fast as √n, where n is
the final number of points, then distance becomes
too important: the tree resembles a minimum
spanning tree, in which high degrees occur, but
with exponentially vanishing probability; thus
not a power law. If α is anywhere in between, we
have a power law. Through a rather complex and
elaborate proof, Fabrikant et al. prove this
initial assumption will produce a power-law
distribution. I'll save you the math!
Fabrikant et al.
57
Information Retrieval Applications
"Growing and navigating the small world Web
by local content," Filippo Menczer
58
The degree sequence of Web pages has a power-law
distribution, $\Pr(k) \sim k^{-\gamma}$, where k is the
degree of a page (number of in-links or
out-links) and γ is a constant exponent
Menczer
59
The goal of Menczer's study:
"to propose a Web growth model that is shown to
accurately predict the distribution of Web page
degree, based on textual content and assuming
only local knowledge of degree for existing
pages." The study also argues that "efficient
paths can be discovered by decentralized Web
navigation algorithms based on textual and/or
categorical cues."
So let's step through the argument
Menczer
60
Menczer's Model to Explain How Web Pages are
Generated and Why the Popular are Popular.
Menczer
61
To gain insight into the Web's scale-free growth
and mechanisms for efficient navigation, I want
to study the connection between the two
topologies induced over the Web by links and
textual content.
Menczer
62
Start by introducing a distance measure r(p1, p2)
based on lexical similarity, where (p1, p2) is a
pair of Web pages and
$\sigma(p_1, p_2) = \frac{\sum_k w_{kp_1} w_{kp_2}}{\sqrt{\sum_k w_{kp_1}^2 \sum_k w_{kp_2}^2}}$
is the cosine similarity function traditionally
used in information retrieval (w_kp is some
weight function for term k in page p, e.g., term
frequency).
Menczer
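A small Python sketch of a lexical distance built on cosine similarity. Two assumptions are made here and are not taken from the slide text: the weights are raw term frequencies from whitespace-split text, and the distance is formed as r = 1/σ - 1, so that identical pages are at distance 0 and pages with no shared terms are infinitely far apart. The function names are illustrative.

```python
import math
from collections import Counter

def cosine_similarity(text1, text2):
    """Cosine similarity of two texts using term-frequency weights."""
    w1 = Counter(text1.lower().split())
    w2 = Counter(text2.lower().split())
    dot = sum(w1[term] * w2[term] for term in w1.keys() & w2.keys())
    norm = math.sqrt(
        sum(v * v for v in w1.values()) * sum(v * v for v in w2.values())
    )
    return dot / norm if norm else 0.0

def lexical_distance(text1, text2):
    """Assumed form r = 1/sigma - 1; infinite when no terms are shared."""
    sigma = cosine_similarity(text1, text2)
    return math.inf if sigma == 0 else 1.0 / sigma - 1.0

print(lexical_distance("web graph", "web graph"))  # identical text
print(lexical_distance("web graph", "web model"))  # one shared term of two
print(lexical_distance("web", "cat"))              # nothing shared
```

A real crawl would normally strip markup and stop words first, and might use TF-IDF weights instead of raw counts.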
63
Finding the relationship between lexical topology
(r from above) and the link topology requires
measuring the probability that two pages at a
certain lexical distance have a direct link
between them. But, this measure is extremely
hard to get because the size of the web makes
this probability negligibly small. Instead,
focus on a neighborhood link relation in link
space, which approximates link probability but is
easier to measure and is used to identify Web
communities.
Menczer
64
NOTE: A neighborhood is the set of URLs
representing a Web page, all of its in-links,
and all of its out-links.
Menczer
65
Measure the frequency of neighbor pairs of pages
as a function of the lexical distance, where
U_p is the URL set representing p's neighborhood
and λ is the neighborhood threshold, which models
the ratio of local versus long-range links
Menczer
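One way to make "neighbor pairs" concrete is sketched below in Python. It assumes, for illustration only, that the overlap between two neighborhoods is measured as the size of their intersection divided by the size of their union, and that two pages count as neighbors when that overlap exceeds the threshold. The link data and function names are hypothetical.

```python
def neighborhood(page, out_links):
    """URL set for a page: the page itself, its out-links, and its in-links."""
    urls = {page} | set(out_links.get(page, ()))
    urls |= {src for src, targets in out_links.items() if page in targets}
    return urls

def link_overlap(p, q, out_links):
    """Assumed overlap measure: |U_p & U_q| / |U_p | U_q|."""
    up = neighborhood(p, out_links)
    uq = neighborhood(q, out_links)
    return len(up & uq) / len(up | uq)

def are_neighbors(p, q, out_links, threshold):
    """True when the neighborhood overlap exceeds the threshold."""
    return link_overlap(p, q, out_links) > threshold

out_links = {"a": ["b", "c"], "b": ["c"], "d": ["c"]}
print(link_overlap("a", "b", out_links))  # same neighborhood {a, b, c}
print(link_overlap("a", "d", out_links))  # only "c" is shared
```

Counting, for each lexical distance, the share of page pairs for which `are_neighbors` is true gives the kind of frequency the slide describes.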
66
Here is the plot of Pr(λ | ρ) against lexical
distance ρ for various λ
Menczer
67
As you can see from the graph, up to about ρ < 1
there is no correlation, but after that point
the probability that two pages are neighbors
decreases with lexical distance, like a power
law
Menczer
68
The conclusion: aside from immediate neighbors
(one step away from the original link), for whom
the relation shows no distinct features, pages
more similar in content have a higher likelihood
of being neighbors.
Menczer
69
Next we get Menczer's Web Growth Model based on
Generative Models,
but first, a short history of proposed Web growth
models
70
Barabási-Albert (BA) Model: a preferential
attachment model. Node i receives a new edge with
probability proportional to its current degree,
Pr(i) ∝ k(i).
Pros: produces power-law degree distributions.
Cons: based on the unrealistic assumption that
Web authors have complete global knowledge of
Web degree
Menczer
71
Extension of BA Model: Pr(i) ∝ η(i) k(i), where
η(i) is the fitness of page i.
Pros: still yields power-law degree distributions.
Cons: over time, pages with high fitness win out
Menczer
72
Another Extension of BA Model: a link goes to a
node chosen in proportion to its degree with
probability f, or to a uniformly chosen node with
probability 1 - f.
Pros: fits not only the power-law degree
distribution of the entire Web, but also the
unimodal degree distribution of subsets of Web
pages (like university, company, or newspaper
homepages).
Cons: still relies on global knowledge of degree
Menczer
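The mixture rule in this last model is easy to sketch in Python. The function below picks one link target: with probability f in proportion to degree, otherwise uniformly. The names pick_target and degrees, and the two-page example, are made up for illustration; setting f to 1 recovers plain preferential attachment.

```python
import random

def pick_target(degrees, f, rng):
    """Choose a link target: preferential with probability f, else uniform."""
    nodes = list(degrees)
    if rng.random() < f:
        # Degree-proportional choice; assumes at least one node has degree > 0.
        weights = [degrees[v] for v in nodes]
        return rng.choices(nodes, weights=weights)[0]
    return rng.choice(nodes)

rng = random.Random(0)
degrees = {"hub": 5, "leaf": 0}
always_preferential = {pick_target(degrees, 1.0, rng) for _ in range(100)}
always_uniform = {pick_target(degrees, 0.0, rng) for _ in range(200)}
print(always_preferential, always_uniform)
```

Note that the preferential branch can never reach a page with degree 0, which is why the uniform branch matters for new or obscure pages.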
73
Menczer's Proposal: Content-Based Generative
Model. It attempts to model the urge of page
authors to link to similar (hence probably
related) and popular (hence probably important)
sites. It also makes the assumption that page
popularity is correlated with degree, but that a
page author only has local knowledge of degree.
Menczer
74
At each step t, one new page p_t is added, and m
new links are created from p_t to m existing pages
p_i, i < t, each selected with a probability
defined in terms of the following: k(i) is the
in-degree of p_i, ρ* is a lexical distance
threshold, and c1 and α are constants
Menczer
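The selection rule can be sketched as a piecewise function in Python. The exact formula is not in the transcript, so the form below is an assumption chosen to match the description: degree-proportional (normalized by m·t) for pages closer than the threshold, and an inverse power law in lexical distance beyond it. All names and the sample numbers are hypothetical.

```python
def link_probability(in_degree, r, m, t, rho_star, c1, alpha):
    """Assumed piecewise rule for linking from the new page p_t to page p_i.

    in_degree: k(i), the in-degree of p_i
    r:         lexical distance between p_i and p_t
    """
    if r < rho_star:
        # Lexically close pages: only (local) popularity matters.
        return in_degree / (m * t)
    # Lexically distant pages: probability decays as a power of distance.
    return c1 * r ** (-alpha)

close = link_probability(in_degree=4, r=0.5, m=2, t=10,
                         rho_star=1.0, c1=0.1, alpha=2.0)
far = link_probability(in_degree=4, r=2.0, m=2, t=10,
                       rho_star=1.0, c1=0.1, alpha=2.0)
print(close, far)
```

In the example the close page gets probability 4 / 20, while the distant page's in-degree plays no role at all.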
75
Now we have a growth process that is driven by
local link decisions based on content and that
mirrors this phase transition: lexical
independence for close pages and an inverse
power-law dependence for distant pages
Menczer
76
Next Step: Define an Optimal Navigation Algorithm
for Small World Networks for Efficient Web
Crawling
77
Given the Web's small world and power-law
topology, its diameter scales as
$\Theta(\log N / \log \log N)$; therefore, if two pages
belong to a connected component of the Web, some
short path exists between them. Given the need to
find unknown target pages, we are interested only
in decentralized crawling algorithms, which can
use only information available locally about a
page and its neighborhood.
Menczer
78
Starting from some source Web page, we aim to
reach a target page by visiting ℓ ≪ N pages,
where N is the size of the Web, several billion
pages. The Web is a small world network, so we
know its diameter, or the diameter of the largest
connected component, scales logarithmically with N
Menczer
79
Therefore a short path, of length on the order of
that diameter, is likely to exist between some
source (a bookmarked page or a search engine
result) and some unknown relevant target page.
Menczer
80
Simple greedy algorithms that always pick the
neighbor with the highest degree end up being too
costly, so what is the alternative? Use
Kleinberg's Hierarchical Model and knowledge of
Semantic Distance
Menczer
81
First define a semantic distance between topics,
where p0 is the lowest common ancestor of p1 and
p2, and Pr(p) represents the fraction of pages
classified at node p
Menczer
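The formula itself did not survive in the transcript. As a hedged illustration, the Python below uses one common information-theoretic choice built from exactly the quantities named above: one minus twice the log probability of the lowest common ancestor, divided by the sum of the log probabilities of the two topics. It assumes Pr counts pages at or below a node, so an ancestor never has a smaller value than its descendants. The function name and sample fractions are invented.

```python
import math

def semantic_distance(pr_p1, pr_p2, pr_ancestor):
    """Assumed form: 1 - 2*log Pr(p0) / (log Pr(p1) + log Pr(p2))."""
    denom = math.log(pr_p1) + math.log(pr_p2)
    if denom == 0:
        # Both topics are the root of the hierarchy.
        return 0.0
    return 1.0 - 2.0 * math.log(pr_ancestor) / denom

same_topic = semantic_distance(0.01, 0.01, 0.01)
siblings = semantic_distance(0.01, 0.02, 0.05)
only_root_shared = semantic_distance(0.01, 0.02, 1.0)
print(same_topic, siblings, only_root_shared)
```

Under this form, two pages in the same topic are at distance 0 and two topics whose only common ancestor is the root are at distance 1.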
82
The relationship between this measure of semantic
distance and link topology can be analyzed as was
done for lexical distance earlier by measuring
the frequency of neighbor pairs of pages as a
function of semantic distance
Menczer
83
Which yields this plot of Pr(λ | µ) versus
semantic distance µ for various values of λ
Menczer
84
Menczer observed a good fit between the data and
the exponential model, and using Kleinberg's
greedy algorithm, the majority of relevant pages
are located based on local content before 10^4
pages have been crawled.
Menczer
85
"Graph Structure in Three National Academic
Webs: Power Laws with Anomalies," Mike Thelwall
and David Wilkinson
86
What Anomalies can be Found in the Network of the
Web? To start, a summation of the Broder et al.
2000 study of two complete crawls from AltaVista
and the connectivity of the Web, on which
Thelwall and Wilkinson's work is based.
Thelwall & Wilkinson
87
Thelwall & Wilkinson
88
Five parts: IN, OUT, STRONGLY CONNECTED COMPONENT
(SCC), TENDRILS, DISCONNECTED. The first four had
roughly equal sizes
Thelwall & Wilkinson
89
An SCC is a collection of pages in which a crawl
following only links in pages could start
anywhere in the set and reach every other page in
the set.
OUT is the set of pages outside the SCC that can
be reached from the SCC but do not link back to
it.
IN is the set of pages that link to the SCC but
cannot be reached from it, so a crawl starting in
IN would contain all of the SCC and OUT.
TENDRILS is a separate set of pages that are
reachable from IN, or that link into OUT, but
are not in IN, OUT, or the SCC.
DISCONNECTED pages are just that: disconnected
from the other four.
Thelwall & Wilkinson
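These definitions translate fairly directly into reachability tests. The Python sketch below assumes a starting page known to lie in the SCC and uses a tiny hypothetical link graph; it lumps tendrils and disconnected pages together as OTHER to stay short. The function names are illustrative.

```python
def reachable(start, links):
    """All pages reachable from start by following links (start included)."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in links.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def bow_tie(out_links, scc_page):
    """Split pages into SCC, IN, OUT, and OTHER relative to scc_page's component."""
    nodes = set(out_links) | {v for ts in out_links.values() for v in ts}
    # Reverse every link so we can search backwards.
    in_links = {}
    for u, targets in out_links.items():
        for v in targets:
            in_links.setdefault(v, []).append(u)
    forward = reachable(scc_page, out_links)   # SCC plus OUT
    backward = reachable(scc_page, in_links)   # SCC plus IN
    scc = forward & backward
    return {
        "SCC": scc,
        "OUT": forward - scc,
        "IN": backward - scc,
        "OTHER": nodes - forward - backward,
    }

out_links = {
    "in1": ["a", "t"],   # links into the core, and to a tendril page "t"
    "a": ["b"],
    "b": ["a", "out1"],
    "x": [],             # disconnected page
}
parts = bow_tie(out_links, "a")
print(parts)
```

A crawler that starts inside the SCC only ever sees `forward`, which is why the SCC and OUT are the easiest components to study.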
90
  • Methodological Issues that Arise from this Study
  • 1. Without access to a major search engine
    database, researchers may only be able to study
    SCC and OUT systematically.
  • 2. A crawler may not be able to retrieve all the
    out-links, especially those created by
    JavaScript, server-side image maps, and embedded
    applications. So any of these components could
    be larger.
  • 3. Some links may be ignored due to a policy
    decision, such as database queries (including a
    ? in the URL) and frameset pages. Search
    engines may also ban spam sites, thus losing out
    on more potential links.
  • 4. The AltaVista data set included duplicates,
    which, if eliminated, could have shrunk some of
    the components.
  • 5. The AltaVista data set only included HTML
    pages, thus missing out on other potential
    resources.

Thelwall & Wilkinson
91
Thelwall and Wilkinson then studied the publicly
indexable university Web sites of three
countries, with some updates to fix the
methodological issues just mentioned in Broder et
al.'s 2000 study.
Thelwall & Wilkinson
92
Their Results
Thelwall & Wilkinson
93
And here are the logarithmic graphs of in-link
and out-link counts for the schools
Thelwall & Wilkinson
94
Australia
Thelwall & Wilkinson
95
New Zealand
Thelwall & Wilkinson
96
the United Kingdom
Thelwall & Wilkinson
97
Explanations for the Anomalies: in New Zealand,
it was because of a set of highly interlinked
software documentation. In Australia, the
biggest came from an online course handbook with
a standard navigation bar. In the UK, the huge
in-link counts also came from a standard
navigation bar. The biggest anomalies are
produced by internal links within data-driven
sites.
Thelwall & Wilkinson
98
Some other interesting observations:
1. All SCC pages have out-degree of at least 1, by
definition, whereas the median out-degree of OUT
pages is 0.
2. The median out-degree for SCC, between 6 and 8,
shows that there is significantly more
out-linking from within SCC pages. The same is
true for in-linking, but to a lesser degree. In
fact, in-linking and out-linking from both SCC
and OUT display power laws, so although there are
generally more links within SCC, OUT also has a
spectrum of the more highly connected pages.
3. As a final point, a median of two or three
in-links for SCC shows that this area is actually
very sparsely interconnected. The typical SCC
page can only be reached from two (Australia) or
three (UK, New Zealand) other SCC pages.
Thelwall & Wilkinson
99
Another Strange Anomaly: they also found a huge
number of components of size 1 (2,220,070,
containing 49% of pages for the UK), all of which
must be linked to by pages from the SCC,
indicating that the SCC must be surrounded by a
fuzz of individual pages that do not link to any
other national university pages. Many of these
were non-HTML resources that cannot or do not
contain crawled links (PDF, PPT, image files).
Thelwall & Wilkinson
100
The Longest Shortest Paths: a shortest path
between two pages is the least number of links
that need to be traversed to get from the first
to the second.
Australia: 362 links
New Zealand: 1445 links
UK: 1022 links
From these results it can be seen that very long
paths do exist in the data set, with the end
pages being buried deeply in obscure places.
Thelwall & Wilkinson
101
Summation of Thelwall and Wilkinson:
Power laws are clearly evident in many aspects of
the topology of national university Webs.
There is evidence for a rich-get-richer model of
new links.
However, there is also evidence for a small
degree of linking at random.
Thelwall & Wilkinson
102
Also, anomalies were present, caused by:
1. Automatically generated pages served to the
crawler, and those produced by automatically
fixed link errors
2. The inclusion of non-HTML Web pages, in
particular because these cannot host links, or
the crawler did not extract links from them
3. Large resource-driven Web sites
These anomalies need to be accounted for or
segregated to get the most meaningful results
about the Web.
Thelwall & Wilkinson