Crawling and Ranking - PowerPoint PPT Presentation

About This Presentation

Title:

Crawling and Ranking

Description:

... etc.) to text browsers (lynx, links, w3m, etc.) to all other user agents including Web crawlers The HTML language Text and tags Tags define structure Used for ... – PowerPoint PPT presentation

Number of Views:79

Avg rating:3.0/5.0

Slides: 58

Provided by: ams61

Category:

more less

Transcript and Presenter's Notes

Title: Crawling and Ranking

1
Crawling and Ranking
2
HTML (HyperText Markup Language)

Described the structure and content of a (web)
document
HTML 4.01 most common version, W3C standard
XHTML 1.0 XML-ization of HTML 4.01, minor
differences
Validation (http//validator.w3.org/) against a
schema. Checks the conformity of a Web page with
respect to recommendations, for accessibility
to all graphical browsers (IE, Firefox, Safari,
Opera, etc.)
to text browsers (lynx, links, w3m, etc.)
to all other user agents including Web crawlers

3
The HTML language

Text and tags
Tags define structure
Used for instance by a browser to lay out the
document.
Header and Body

4
HTML structure

lt!DOCTYPE html gt
lthtml lang"en"gt
ltheadgt
lt!-- Header of the document --gt
lt/headgt
ltbodygt
lt!-- Body of the document --gt
lt/bodygt
lt/htmlgt

lt!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0
Strict//EN "http//www.w3.org/TR/xhtml1/DTD/xhtml
1-strict.dtd"gt
lthtml xmlnshttp//www.w3.org/1999/xhtml
lang"en" xmllang"en"gt
ltheadgt
ltmeta http-equiv"Content-Type
content"text/html charsetutf-8" /gt
lttitlegtExample XHTML documentlt/titlegt
lt/headgt
ltbodygt
ltpgtThis is a lta href"http//www.w3.org/"gtlink to
the
W3Clt/agtlt/pgt
lt/bodygt
lt/htmlgt

6
Header

Appears between the tags ltheadgt ... lt/headgt
Includes meta-data such as language, encoding
Also include document title
Used by (e.g.) the browser to decipher the body

7
Body

Between ltbodygt ... lt/bodygt tags
The body is structured into sections, paragraphs,
lists, etc.
lth1gtTitle of the pagelt/h1gt
lth2gtTitle of a main sectionlt/h2gt
lth3gtTitle of a subsectionlt/h3gt
. . .
ltpgt ... lt/pgt define paragraphs
More block elements such as table, list

8
HTTP

Application protocol
Client request
GET /MarkUp/ HTTP/1.1
Host www.google.com
Server response
HTTP/1.1 200 OK
Two main HTTP methods GET and POST

9
GET

URL http//www.google.com/search?qBGU
Corresponding HTTP GET request
GET /search?qBGU HTTP/1.1
Host www.google.com

10
POST

Used for submitting forms
POST /php/test.php HTTP/1.1
Host www.bgu.ac.il
Content-Type application/x-www-formurlencoded
Content-Length 100

11
Status codes

HTTP response always starts with a status code
followed by a human-readable message (e.g., 200
OK)
First digit indicates the class of the response
1 Information
2 Success
3 Redirection
4 Client-side error
5 Server-side error

12
Authentication

HTTPS is a variant of HTTP that includes
encryption, cryptographic authentication, session
tracking, etc.
It can be used instead to transmit sensitive data
GET ... HTTP/1.1
Authorization Basic dG90bzp0aXRp

13
Cookies

Key/value pairs, that a server asks a client to
store and retransmit with each HTTP request (for
a given domain name).
Can be used to keep information on users between
visits
Often what is stored is a session ID
Connected, on the server side, to all session
information

14
Crawling
15
Basics of Crawling

Crawlers, (Web) spiders, (Web) robots autonomous
agents that retrieve pages from the Web
Basics crawling algorithm
1. Start from a given URL or set of URLs
2. Retrieve and process the corresponding
page
3. Discover new URLs (next slide)
4. Repeat on each found URL
Problem The web is huge!

16
Discovering new URLs

Browse the "internet graph" (following e.g.
hyperlinks)
Site maps (sitemap.org)

17
The internet graph

At least 14.06 billion nodes pages
At least 140 billion edges links
Lots of "junk"

18
Graph-browsing algorithms

Depth-?rst
Breath-first
Combinations..
Parallel crawling

19
Duplicates

Identifying duplicates or near-duplicates on the
Web to prevent multiple indexing
Trivial duplicates same resource at the same
canonized URL
http//example.com80/toto
http//example.com/titi/../toto
Exact duplicates identi?cation by hashing
near-duplicates (timestamps, tip of the day,
etc.) more complex!

20
Near-duplicate detection

Edit distance
Good measure of similarity,
Does not scale to a large collection of documents
(unreasonable to compute
the edit distance for every pair!).
Shingles two documents similar if they mostly
share the same succession of k-grams

21
Crawling ethics

robots.txt at the root of a Web server
User-agent
Allow /searchhistory/
Disallow /search
Per-page exclusion (de facto standard).
ltmeta name"ROBOTS" content"NOINDEX,NOFOLLOW"
gt
Per-link exclusion (de facto standard).
lta href"toto.html" rel"nofollow"gtTotolt/agt
Avoid Denial Of Service (DOS), wait 100ms/1s
between two
repeated requests to the same Web server

22
Overview

Crawl
Retrieve relevant documents
How?
To define relevance, to find relevant docs..
Rank
How?

23
Relevance

Input keyword (or set of keywords), the web
First question how to define the relevance of a
page with respect to a key word?
Second question how to store pages such that the
relevant ones for a given keyword are easily
retrieved?

24
Relevance definition

Boolean based on existence of a word in the
document
Synonyms
Disadvantages?
Word count
Synonyms
Disadvantages?
Can we do better?

25
TF-IDF
26
Storing pages

Offline pre-processing can help online search
Offline preprocessing includes stemming, stop
words removal
As well as the creation of an index

27
Inverted Index
28
More advanced text analysis

N-grams
HMM language models
PCFG langage models
We will discuss all that later in the course!

29
Ranking
30
Why Ranking?

Huge number of pages
Huge even if we filter according to relevance
Keep only pages that include the keywords
A lot of the pages are not informative
And anyway it is impossible for users to go
through 10K results

31
When to rank?

Before retrieving results
Advantage offline!
Disadvantage huge set
After retrieving results
Advantage smaller set
Disadvantage online, user is waiting..

32
How to rank?

Observation links are very informative!
Not just for discovering new sites, but also for
estimating the importance of a site
CNN.com has more links to it than my homepage
Quality and Efficiency are key factors

33
Authority and Hubness

Authority a site is very authoritative if it
receives many citations. Citation from
important sites has more weight than citations
from less-important sites
A(v) The authority of v
Hubness A good hub is a site that links to many
authoritative sites
H(v) The hubness of v

34
HITS

Recursive dependency
a(v) S(u,v) h(u)
h(v) S(v,u) a(u)
Normalize (when?) according to square root of
sum of squares of authorities \ hubness values
Start by setting all values to 1
We could also add bias
We can show that a(v) and h(v) converge

35
HITS (cont.)

Works rather well if applied only on relevant web
pages
E.g. pages that include the input keywords
The results are less satisfying if applied on
the whole web
On the other hand, online ranking is a problem

36
Google PageRank

Works offline, i.e. computes for every web-site a
score that can then be used online
Extremely efficient and high-quality
The PageRank algorithm that we will describe here
appears in Brin Page, 1998

37
Random Surfer Model

Consider a "random surfer"
At each point chooses a link and clicks on it
A link is chosen with uniform distribution
A simplifying assumption..
What is the probability of being, at a random
time, at a web-page W?

38
Recursive definition

If PageRank reflects the probability of being in
a web-page (PR(w) P(w)) then
PR(W) PR(W1) (1/O(W1))
PR(Wn) (1/O(Wn))
Where O(W) is the out-degree of W

39
Problems

A random surfer may get stuck in one component of
the graph
May get stuck in loops
Rank Sink Problem
Many Web pages have no inlinks/outlinks

40
Damping Factor

Add some probability d for "jumping" to a random
page
Now PR(W) (1-d) PR(W1) (1/O(W1))
PR(Wn) (1/O(Wn)) d 1/N
Where N is the number of pages in the index

41
How to compute PR?

Simulation
Analytical methods
Can we solve the equations?

42
Simulation A random surfer algorithm

Start from an arbitrary page
Toss a coin to decide if you want to follow a
link or to randomly choose a new page
Then toss another coin to decide which link to
follow \ which page to go to
Keep record of the frequency of the web-pages
visited

43
Convergence

Not guaranteed without the damping factor!
(Partial) intuition if unlucky, the algorithm
may get stuck forever in a connected component
Claim with damping, the probability of getting
stuck forever is 0
More difficult claim with damping, convergence
is guaranteed

44
Markov Chain Monte Carlo (MCMC)

A class of very useful algorithms for sampling a
given distribution
We first need to know what is a Markov Chain

45
Markov Chain

A finite or countably infinite state machine
We will consider the case of finitely many states
Transitions are associated with probabilities
Markovian property given the present state,
future choices are independent from the past

46
MCMC framework

Construct (explicitly or implicitly) a Markov
Chain (MC) that describes the desired
distribution
Perform a random walk on the MC, keeping track of
the proportion of state visits
Discard samples made before Mixing
Return proportion as an approximation of the
correct distribution

47
Properties of Markov Chains

A Markov Chain defines a distribution on the
different states (P(state) probability of being
in the state at a random time)
We want conditions on when this distribution is
unique, and when will a random walk approximate
it

48
Properties

Periodicity
A state i has period k if any return to state i
must occur in multiples of k time steps
Aperiodic period 1 for all states
Reducibility
An MC is irreducible if there is a probability 1
of (eventually) getting from every state to every
state
Theorem A finite-state MC has a unique
stationary distribution if it is aperiodic and
irreducible

49
Back to PageRank