Importance of Web Pages - PowerPoint PPT Presentation

Provided by: jeff452
1
Importance of Web Pages
  • PageRank
  • Topic-Specific PageRank
  • Hubs and Authorities
  • Combatting Link Spam

2
PageRank
  • Intuition: solve the recursive equation "a page
    is important if important pages link to it."
  • In high-falutin terms: importance = the
    principal eigenvector of the transition matrix of
    the Web.
  • A few fixups needed.

3
Transition Matrix of the Web
  • Enumerate pages.
  • Page i corresponds to row and column i.
  • M[i, j] = 1/n if page j links to n pages,
    including page i; 0 if j does not link to i.
  • M[i, j] is the probability we'll next be at
    page i if we are now at page j.
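
A minimal sketch of building M from out-link lists, following the definition above. The function name `transition_matrix`, the dict-of-lists input format, and the five page names are assumptions made for illustration; exact fractions are used so the entries read like the slides.

```python
from fractions import Fraction


def transition_matrix(pages, links):
    """Return M as a list of rows, with M[i][j] = 1/n when page j links
    to n pages including page i, and 0 otherwise.

    pages: list of page names; its order fixes the row/column numbering.
    links: dict mapping a page name to the list of distinct pages it links to.
    """
    index = {page: k for k, page in enumerate(pages)}
    size = len(pages)
    M = [[Fraction(0) for _ in range(size)] for _ in range(size)]
    for source, targets in links.items():
        j = index[source]
        for target in targets:
            M[index[target]][j] = Fraction(1, len(targets))
    return M


# Hypothetical pages: j links to i, p and q, but not to x.
pages = ["i", "j", "x", "p", "q"]
M = transition_matrix(pages, {"j": ["i", "p", "q"]})
print(M[0][1], M[2][1])  # the entries M[i, j] and M[x, j]
```

Each column of a page with out-links sums to 1, which is what lets M v be read as a probability distribution.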

4
Example Transition Matrix
Suppose page j links to 3 pages, including i
but not x.
[Figure: column j of M has entry 1/3 in row i and entry 0 in row x.]
5
Random Walks on the Web
  • Suppose v is a vector whose i-th component is
    the probability that each random walker is at
    page i at a certain time.
  • If each walker follows a link from i at random,
    the probability distribution for walkers is then
    given by the vector M v.

6
Random Walks (2)
  • Starting from any vector v, the limit M (M (…
    M (M v ) …)) is the long-term distribution of
    walkers.
  • Intuition: pages are important in proportion to
    how likely a walker is to be there.
  • The math: limiting distribution = principal
    eigenvector of M = PageRank.

7
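Example: The Web in 1839
[Figure: link graph on Yahoo (y), Amazon (a), and Msoft (m).]
Transition matrix M (column = source page, row = destination page):
         y      a      m
  y     1/2    1/2     0
  a     1/2     0      1
  m      0     1/2     0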
8
Solving The Equations
  • Because there are no constant terms, the
    equations v = M v do not have a unique solution.
  • In Web-sized examples, we cannot solve by
    Gaussian elimination anyway; we need to use
    relaxation (= iterative solution).
  • Can work if you start with a fixed v.

9
Simulating a Random Walk
  • Start with the vector v = [1, 1, …, 1],
    representing the idea that each Web page is given
    one unit of importance.
  • Repeatedly apply the matrix M to v, allowing the
    importance to flow like a random walk.
  • About 50 iterations is sufficient to estimate the
    limiting solution.

10
Example: Iterating Equations
  • Equations v = M v:
  • y = y/2 + a/2
  • a = y/2 + m
  • m = a/2

Note: "=" is really assignment.
    y      a      m
    1      1      1
    1     3/2    1/2
   5/4     1     3/4
   9/8   11/8    1/2
   . . .
   6/5    6/5    3/5   (limit)
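
The iteration above can be reproduced in a few lines. This is a sketch: the helper name `step` and the choice of 100 further floating-point iterations are assumptions, and M is the matrix from the Web-in-1839 example.

```python
from fractions import Fraction as F

# Transition matrix from the "Web in 1839" example; rows and columns are y, a, m.
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(1)],
     [F(0),    F(1, 2), F(0)]]


def step(M, v):
    """One relaxation step: return the vector M v."""
    n = len(v)
    return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]


# Exact arithmetic for the first three steps, starting from [1, 1, 1].
v = [F(1), F(1), F(1)]
history = [v]
for _ in range(3):
    v = step(M, v)
    history.append(v)

for row in history:
    print(*row)

# Continue in floating point to approach the limiting vector.
w = [float(x) for x in v]
for _ in range(100):
    w = step(M, w)
print(w)
```

The total importance stays at 3 throughout, because every column of this M sums to 1.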
11-14
The Walkers
[Animation frames: walkers on the Yahoo/Amazon/Msoft graph.]
15
In the Limit
[Figure: walkers on the Yahoo/Amazon/Msoft graph in the limit.]
Real-World Problems
  • Some pages are dead ends (have no links out).
  • Such a page causes importance to leak out.
  • Other groups of pages are spider traps (all
    out-links are within the group).
  • Eventually spider traps absorb all importance.

17
Microsoft Becomes Dead End
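[Figure: link graph on Yahoo (y), Amazon (a), and Msoft (m); Msoft has no out-links.]
Transition matrix M (column = source page, row = destination page):
         y      a      m
  y     1/2    1/2     0
  a     1/2     0      0
  m      0     1/2     0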
18
Example: Effect of Dead Ends
  • Equations v = M v:
  • y = y/2 + a/2
  • a = y/2
  • m = a/2

    y      a      m
    1      1      1
    1     1/2    1/2
   3/4    1/2    1/4
   5/8    3/8    1/4
   . . .
    0      0      0    (limit)
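
A short sketch of the leak, assuming the dead-end matrix above: tracking the sum of the components shows how much importance is left after each step. The names `M_dead`, `step`, and `totals` are illustrative.

```python
from fractions import Fraction as F

# Msoft is a dead end, so its column (the third) is all zeros.
M_dead = [[F(1, 2), F(1, 2), F(0)],
          [F(1, 2), F(0),    F(0)],
          [F(0),    F(1, 2), F(0)]]


def step(M, v):
    """One relaxation step: return the vector M v."""
    n = len(v)
    return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]


v = [F(1), F(1), F(1)]
totals = [sum(v)]          # total importance before any step
for _ in range(3):
    v = step(M_dead, v)
    totals.append(sum(v))  # total importance after each step

print(*totals)
```

Whatever sits on the dead-end page at one step is simply dropped at the next, so the total shrinks every iteration.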
19-22
Microsoft Becomes a Dead End
[Animation frames: walkers on the Yahoo/Amazon/Msoft graph.]
23
In the Limit
[Figure: walkers on the Yahoo/Amazon/Msoft graph in the limit.]
24
Msoft Becomes Spider Trap
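[Figure: link graph on Yahoo (y), Amazon (a), and Msoft (m); Msoft links only to itself.]
Transition matrix M (column = source page, row = destination page):
         y      a      m
  y     1/2    1/2     0
  a     1/2     0      0
  m      0     1/2     1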
25
Example: Effect of Spider Trap
  • Equations v = M v:
  • y = y/2 + a/2
  • a = y/2
  • m = a/2 + m

    y      a      m
    1      1      1
    1     1/2    3/2
   3/4    1/2    7/4
   5/8    3/8     2
   . . .
    0      0      3    (limit)
26-28
Microsoft Becomes a Spider Trap
[Animation frames: walkers on the Yahoo/Amazon/Msoft graph.]
29
In the Limit
[Figure: walkers on the Yahoo/Amazon/Msoft graph in the limit.]
30
PageRank Solution to Traps, Etc.
  • Tax each page a fixed percentage at each
    iteration.
  • Add a fixed constant to all pages.
  • Models a random walk with a fixed probability of
    leaving the system, and a fixed number of new
    walkers injected into the system at each step.

31
Example: Microsoft is a Spider Trap; 20% Tax
  • Equations v = 0.8(M v) + 0.2:
  • y = 0.8(y/2 + a/2) + 0.2
  • a = 0.8(y/2) + 0.2
  • m = 0.8(a/2 + m) + 0.2

    y       a       m
    1       1       1
   1.00    0.60    1.40
   0.84    0.60    1.56
   0.776   0.536   1.688
   . . .
   7/11    5/11    21/11   (limit)
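
The taxed iteration can be sketched as follows, using the spider-trap matrix and the 20% tax from this example. The function name `taxed_step` and its parameter names `keep` and `bonus` are assumptions for illustration.

```python
from fractions import Fraction as F

# Spider-trap matrix: Msoft links only to itself.
M_trap = [[F(1, 2), F(1, 2), F(0)],
          [F(1, 2), F(0),    F(0)],
          [F(0),    F(1, 2), F(1)]]


def taxed_step(M, v, keep=F(4, 5), bonus=F(1, 5)):
    """One step of v = keep * (M v) + bonus; the defaults model a 20% tax."""
    n = len(v)
    return [keep * sum(M[i][j] * v[j] for j in range(n)) + bonus
            for i in range(n)]


# Exact arithmetic for the first three steps, starting from [1, 1, 1].
v = [F(1), F(1), F(1)]
rows = [v]
for _ in range(3):
    v = taxed_step(M_trap, v)
    rows.append(v)

for row in rows:
    print(*(float(x) for x in row))

# Continue in floating point to approach the limit.
w = [float(x) for x in v]
for _ in range(200):
    w = taxed_step(M_trap, w)
print(w)
```

With the tax in place Msoft still ends up with the largest share, but Yahoo and Amazon no longer drain to zero.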
32
General Case
  • In this example, because there are no dead-ends,
    the total importance remains at 3.
  • In examples with dead-ends, some importance leaks
    out, but total remains finite.

33
Solving the Equations
  • Because there are constant terms, we can expect
    to solve small examples by Gaussian elimination.
  • Web-sized examples still need to be solved by
    relaxation.
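
For a small example, the taxed equations v = 0.8(M v) + 0.2 can be rearranged to (I - 0.8 M) v = 0.2 and solved directly. The sketch below assumes the spider-trap matrix from the earlier example; the helper `solve` is a plain Gauss-Jordan elimination written for illustration, not a routine taken from the slides.

```python
from fractions import Fraction as F


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions.

    Assumes A is square and nonsingular.
    """
    n = len(A)
    aug = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        # Swap a row with a nonzero entry in this column into pivot position.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot becomes 1.
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


M_trap = [[F(1, 2), F(1, 2), F(0)],
          [F(1, 2), F(0),    F(0)],
          [F(0),    F(1, 2), F(1)]]
keep, bonus = F(4, 5), F(1, 5)
n = len(M_trap)

# Coefficient matrix I - keep * M and constant vector of bonuses.
A = [[(1 if i == j else 0) - keep * M_trap[i][j] for j in range(n)]
     for i in range(n)]
b = [bonus] * n

solution = solve(A, b)
print(*solution)
```

The constant terms are what make the system nonsingular; with v = M v alone the same elimination would hit a zero pivot.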

34
Finding a Good Starting Vector
  • Newton-like prediction of where components of the
    principal eigenvector are heading.
  • Take advantage of locality in the Web.
  • Each technique can reduce the number of
    iterations by 50%.
  • Important: PageRank takes time!

35
Predicting Component Values
  • Three consecutive values for the importance of a
    page suggest where the limit might be.
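
The slides do not say which predictor is meant. One standard way to guess a limit from three consecutive values is Aitken's delta-squared extrapolation, which is assumed here purely for illustration; the function name `aitken` and the sample sequence are hypothetical.

```python
from fractions import Fraction as F


def aitken(x0, x1, x2):
    """Predict the limit of a sequence from three consecutive values.

    The prediction is exact when the error shrinks by a constant factor
    at each step, i.e. when x_k = limit + c * r**k.
    """
    d1 = x1 - x0
    d2 = x2 - x1
    denom = d2 - d1
    if denom == 0:
        # The differences are equal, so there is no geometric trend to exploit.
        return x2
    return x2 - d2 * d2 / denom


# Hypothetical component values 1 + (1/2)**k for k = 0, 1, 2.
guess = aitken(F(2), F(3, 2), F(5, 4))
print(guess)
```

Real PageRank components mix several decaying modes, so a prediction like this is only a better starting point, not the answer.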

36
Exploiting Substructure
  • Pages from particular domains, hosts, or
    directories, like stanford.edu or
    infolab.stanford.edu/ullman tend to have many
    internal links.
  • Initialize PageRank using ranks within your local
    cluster, then rank the clusters themselves.

37
Strategy
  • Compute local PageRanks (in parallel?).
  • Use local weights to establish weights on edges
    between clusters.
  • Compute PageRank on graph of clusters.
  • Initial rank of a page is the product of its
    local rank and the rank of its cluster.
  • Clusters are appropriately sized regions with
    common domain or lower-level detail.

38
In Pictures
39
Topic-Specific Page Rank
  • Goal: evaluate Web pages not just according to
    their popularity, but by how close they are to a
    particular topic, e.g. sports or history.
  • Allows search queries to be answered based on
    interests of the user.
  • Example: the query "Maccabi" wants different
    pages depending on whether you are interested in
    sports or history.

40
Teleport Sets
  • Assume each walker has a small probability of
    teleporting at any tick.
  • Teleport can go to:
  • Any page with equal probability, as in the
    taxation scheme.
  • A topic-specific set of relevant pages (the
    teleport set), for topic-specific PageRank.

41
Example: Topic = Software
  • Only Microsoft is in the teleport set.
  • Assume 20% tax.
  • I.e., probability of a teleport is 20%.
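
A sketch of topic-specific PageRank with a teleport set. Two assumptions are made here that the slides do not state: the graph is the original Yahoo/Amazon/Msoft graph from the Web-in-1839 example, and the vector is normalized to sum to 1 rather than to the number of pages. The function name `topic_pagerank` is illustrative.

```python
def topic_pagerank(M, teleport, tax=0.2, iterations=100):
    """Iterate v = (1 - tax) * M v + tax * (uniform distribution on teleport).

    M: column-stochastic transition matrix as a list of rows.
    teleport: set of page indices that teleporting walkers land on.
    """
    n = len(M)
    jump = [tax / len(teleport) if i in teleport else 0.0 for i in range(n)]
    v = [1.0 / n] * n
    for _ in range(iterations):
        v = [(1 - tax) * sum(M[i][j] * v[j] for j in range(n)) + jump[i]
             for i in range(n)]
    return v


# Assumed graph: rows and columns are Yahoo, Amazon, Msoft.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]

uniform = topic_pagerank(M, {0, 1, 2})   # teleport to any page
software = topic_pagerank(M, {2})        # teleport only to Msoft
print(uniform)
print(software)
```

Restricting the teleport set to Msoft raises Msoft's rank and, through its out-link, shifts weight toward the pages it points to.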

42-48
Only Microsoft in Teleport Set
[Animation frames: walkers on the Yahoo/Amazon/Msoft graph.]
49
Picking the Teleport Set
  • Choose the pages belonging to the topic in Open
    Directory.
  • Learn from examples the typical words in pages
    belonging to the topic; use pages heavy in those
    words as the teleport set.

50
Application: Link Spam
  • Spam farmers today create networks of millions of
    pages designed to focus PageRank on a few
    undeserving pages.
  • To minimize their influence, use a teleport set
    consisting of trusted pages only.
  • Example: home pages of universities.

51
Hubs and Authorities
  • Mutually recursive definition: a hub links to
    many authorities; an authority is linked to by
    many hubs.
  • Authorities turn out to be places where
    information can be found. Example: course home
    pages.
  • Hubs tell where the authorities are. Example: CS
    Dept. course-listing page.

52
Transition Matrix A
  • H&A uses a matrix A[i, j] = 1 if page i links
    to page j, 0 if not.
  • A^T, the transpose of A, is similar to the
    PageRank matrix M, but A^T has 1s where M has
    fractions.

53
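Example: H&A Transition Matrix
[Figure: link graph on Yahoo (y), Amazon (a), and Msoft (m).]
Matrix A (row = source page, column = destination page):
         y    a    m
  y      1    1    1
  a      1    0    1
  m      0    1    0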
54
Using Matrix A for H&A
  • Powers of A and A^T have elements of exponential
    size, so we need scale factors.
  • Let h and a be vectors measuring the hubbiness
    and authority of each page.
  • Equations: h = λAa; a = μA^T h.
  • Hubbiness = scaled sum of authorities of
    successor pages (out-links).
  • Authority = scaled sum of hubbiness of
    predecessor pages (in-links).

55
Consequences of Basic Equations
  • From h = λAa; a = μA^T h we can derive:
  • h = λμ AA^T h
  • a = λμ A^TA a
  • Compute h and a by iteration, assuming initially
    each page has one unit of hubbiness and one unit
    of authority.
  • Pick an appropriate value of λμ.
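
The substitution behind the two derived equations, written out with λ and μ denoting the two scale factors:

```latex
\begin{aligned}
h &= \lambda A a = \lambda A\,(\mu A^{T} h) = \lambda\mu\, A A^{T} h,\\
a &= \mu A^{T} h = \mu A^{T}(\lambda A a) = \lambda\mu\, A^{T} A\, a.
\end{aligned}
```

So h is an eigenvector of AA^T and a is an eigenvector of A^TA, both for the eigenvalue 1/(λμ).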

56
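Example: Iterating H&A
Matrices (rows separated by semicolons):
  A    = [1 1 1; 1 0 1; 0 1 0]
  A^T  = [1 1 0; 1 0 1; 1 1 0]
  AA^T = [3 2 1; 2 2 0; 1 0 1]
  A^TA = [2 1 2; 1 2 1; 2 1 2]

Iterations (a multiplied by A^TA, h by AA^T; limits shown up to scale):
  a(yahoo)     1    5    24    114   . . .   1+√3
  a(amazon)    1    4    18     84   . . .   2
  a(msoft)     1    5    24    114   . . .   1+√3
  h(yahoo)     1    6    28    132   . . .   1.000
  h(amazon)    1    4    20     96   . . .   0.732
  h(msoft)     1    2     8     36   . . .   0.268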
57
Solving H&A in Practice
  • Iterate as for PageRank; don't try to solve
    equations.
  • But keep components within bounds.
  • Example: scale to keep the largest component of
    the vector at 1.
  • Trick: start with h = [1, 1, …, 1]; multiply by
    A^T to get the first a; scale, then multiply by A
    to get the next h, and so on.
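
The trick above can be sketched as an alternating loop that rescales after every multiplication. The function name `hits` and the fixed count of 50 rounds are assumptions; the matrix is the three-page example A, and the code assumes the graph has at least one link so the largest component is never zero.

```python
def hits(A, rounds=50):
    """Alternate a = A^T h and h = A a, scaling each so its largest entry is 1.

    A[i][j] is 1 if page i links to page j, else 0.
    Returns the pair (h, a) of hubbiness and authority vectors.
    """
    n = len(A)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(rounds):
        # Authority of j: sum of hubbiness over pages i that link to j.
        a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]
        top = max(a)
        a = [x / top for x in a]
        # Hubbiness of i: sum of authority over pages j that i links to.
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        top = max(h)
        h = [x / top for x in h]
    return h, a


# Rows and columns are Yahoo, Amazon, Msoft.
A = [[1, 1, 1],
     [1, 0, 1],
     [0, 1, 0]]
h, a = hits(A)
print([round(x, 3) for x in h])
print([round(x, 3) for x in a])
```

Only A itself is ever touched, so the loop keeps the sparsity of the link matrix.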

58
Solving H&A (2)
  • You may be tempted to compute AA^T and A^TA first,
    then iterate these matrices as for PageRank.
  • Bad, because these matrices are not nearly as
    sparse as A and A^T.

59
H&A Versus PageRank
  • If you talk to someone from IBM, they may tell
    you IBM invented PageRank.
  • What they mean is that H&A was invented by Jon
    Kleinberg when he was at IBM.
  • But these are not the same.
  • H&A does not appear to be a substitute for
    PageRank.
  • But may be used by Ask.com.

60
Spam on the Web
  • Search has become the default gateway to the web.
  • Very high premium to appear on the first page of
    search results.

61
What is Web Spam?
  • Spamming = any action whose purpose is to boost
    a web page's position in search engine results,
    without providing additional value.
  • Spam = Web pages used for spamming.
  • Approximately 10-15% of Web pages are spam.

62
Web-Spam Taxonomy
  • Boosting techniques: techniques for increasing
    the probability a Web page will be a highly
    ranked answer to a search query.
  • Hiding techniques: techniques to hide the use of
    boosting from humans and Web crawlers.

63
Hiding techniques
  • Content hiding: use the same color for text and
    page background.
  • Cloaking: return a different page to crawlers
    and browsers.
  • Redirection: redirects are followed by browsers
    but not crawlers.

64
Boosting Techniques
  • Term spamming: manipulating the text of Web
    pages in order to appear relevant to queries.
    Why? You can run ads that are relevant to the
    query.
  • Link spamming: creating link structures that
    boost PageRank.

65
Term Spamming (1)
  • Repetition of one or a few specific terms, e.g.,
    free, cheap, Viagra.
  • Dumping of a large number of unrelated terms,
    e.g., copying entire dictionaries.

66
Term Spamming (2)
  • Weaving: copy legitimate pages and insert spam
    terms at random positions.
  • Phrase stitching: glue together sentences and
    phrases from different sources, e.g., use the
    top-ranked pages on the topic you want to look
    like.

67
The Google Solution to Term Spamming
  • In addition to PageRank, the original Google
    engine had another innovation: it trusted what
    people said about you in preference to what you
    said about yourself.
  • Give more weight to words that appear in or near
    anchor text than to words that appear in the page
    itself.

68
The Google Solution (2)
  • Today, the Google formula for matching terms to
    documents involves over 250 factors.
  • E.g., does the word appear in a header?
  • As closely guarded as the formula for Coke.

69
Link Spam
  • Three kinds of Web pages from a spammer's point
    of view:
  • Own pages: completely controlled by the spammer.
  • Accessible pages: e.g., Web-log comment pages,
    where the spammer can post links to his pages.
  • Inaccessible pages.

70
Spam Farms (1)
  • Spammer's goal: maximize the PageRank of target
    page t.
  • Technique: get as many links from accessible
    pages as possible to target page t, and
    construct a link farm to get a PageRank
    multiplier effect.

71
Spam Farms (2)
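[Figure: spam farm with regions labeled Inaccessible, Accessible, and Own; the own pages are the target page t and farm pages 1, 2, …, M.]
Goal: boost PageRank of page t. One of the most
common and effective organizations for a spam
farm.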
72
Analysis (1)
  • Suppose rank from accessible pages = x.
  • PageRank of target page = y.
  • Taxation rate = 1-b.
  • Rank of each farm page = by/M + (1-b)/N, where M
    is the number of farm pages and N is the total
    number of pages.

73
Analysis (2)
  • y = x + bM[by/M + (1-b)/N] + (1-b)/N
  • y = x + b^2 y + b(1-b)M/N, dropping the small
    final term (1-b)/N
  • y = x/(1-b^2) + cM/N, where c = b/(1+b)

74
Analysis (3)
  • y = x/(1-b^2) + cM/N, where c = b/(1+b).
  • For b = 0.85, 1/(1-b^2) = 3.6.
  • Multiplier effect for acquired page rank.
  • By making M large, we can make y as large as we
    want.
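
A small numeric check of the closed form, as a sketch. The function name `target_rank` and the sample values of x, M, and N are hypothetical; only b = 0.85 comes from the slide.

```python
def target_rank(x, b, M, N):
    """Rank of the target page: y = x/(1 - b^2) + c*M/N with c = b/(1 + b)."""
    c = b / (1 + b)
    return x / (1 - b * b) + c * M / N


b = 0.85
multiplier = 1 / (1 - b * b)
print(round(multiplier, 1))

# Hypothetical numbers: rank x from accessible pages, M farm pages, N pages in all.
x, M, N = 0.001, 1000, 1_000_000
y = target_rank(x, b, M, N)

# The closed form should satisfy the equation it was solved from.
residual = y - (x + b * b * y + b * (1 - b) * M / N)
print(y, residual)

# Doubling the farm size raises y by c*M/N.
print(target_rank(x, b, 2 * M, N) - y)
```

The first term is the multiplier effect on rank acquired from accessible pages; the second grows linearly with the number of farm pages.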

75
Detecting Link-Spam
  • Topic-specific PageRank with a set of trusted
    pages as the teleport set is called TrustRank.
  • Spam Mass = (PageRank - TrustRank)/PageRank.
  • High spam mass means most of your PageRank comes
    from untrusted sources; you may be link-spam.
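
The spam-mass formula as a one-line function. The name `spam_mass` and the two sample scores are hypothetical; the slides give no threshold for "high", so none is assumed here.

```python
def spam_mass(pagerank, trustrank):
    """Fraction of a page's PageRank that does not come from trusted sources."""
    return (pagerank - trustrank) / pagerank


# Hypothetical scores for two pages.
suspicious = spam_mass(pagerank=0.004, trustrank=0.001)
ordinary = spam_mass(pagerank=0.004, trustrank=0.0036)
print(suspicious, ordinary)
```

A value near 1 means almost none of the page's rank is reachable from the trusted set.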

76
Picking the Trusted Set
  • Two conflicting considerations:
  • A human has to inspect each seed page, so the
    seed set must be as small as possible.
  • Must ensure every good page gets adequate
    TrustRank, so all good pages should be reachable
    from the trusted set by short paths.

77
Approaches to Picking the Trusted Set
  • Pick the top k pages by PageRank: it is almost
    impossible to get a spam page to the very top of
    the PageRank order.
  • Pick the home pages of universities: domains
    like .edu are controlled.