IRLbot: Scaling to 6 Billion Pages and Beyond

1
IRLbot Scaling to 6 Billion Pages and Beyond
  • Presented by Rohit Tummalapalli and Sashank Jupudi

2
Agenda
  • Introduction  
  • Challenges
  • Overcoming the challenges
  • Scalability
  • Reputation and spam
  • Politeness
  • Experimental results
  • Conclusion

3
Introduction
  • Search engines consist of two fundamental
    components
  •     web crawlers
  •     data miners
  • What is a web crawler?
  • Challenges
  •     scalability
  •     spam avoidance
  •     politeness

4
Challenges - Scalability
  • Crawling the Internet involves a trade-off between
    scalability (N), performance (S), and resource
    usage (E).
  • Previous work?
  • Our goal: sustain high performance for very large N
    on a single server with fixed resources.

5
Challenges - spam avoidance
  • The crawler can get bogged down in synchronized
    delay attacks mounted by certain spammers.
  • Prior research?
  • Our goal: keep spam from consuming the crawler's
    resources.

6
Challenges - politeness
  • Web crawlers often get in trouble with webmasters
    for slowing down their servers.
  • Even with spam avoidance, the entire RAM
    eventually gets filled with URLs from a small set
    of hosts and the crawler simply chokes on its
    politeness.
  • Previous algorithms? None handle this case.
  • Our goal

7
Scalability
  • Bottleneck caused by the complexity of verifying
    uniqueness of URLs and compliance with robots.txt.
  • Need for an algorithm that can perform very fast
    checks for large N.
  • The authors introduce DRUM (Disk Repository with
    Update Management).
  • DRUM can store large volumes of arbitrary data on
    disk and implement very fast checks using bucket
    sort.
  • It is more efficient and faster than prior
    disk-based methods.

8
Disk Repository with Update Management
  • The purpose of DRUM is to allow for efficient
    storage of large collections of <key, value>
    pairs
  •     key is a unique identifier of some data
  •     value is arbitrary information attached to the
    key
  • Three supported operations
  •     check, update, check+update (see the sketch
    below)
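The minimal Python model below is an added illustration (not from the original slides) of what the three operations mean; a real DRUM batches requests and resolves them against disk buckets as shown on the next slides, and the class and method names here are assumptions.

    # In-memory model of DRUM's three operations (illustrative only).
    class DrumModel:
        def __init__(self):
            self.store = {}          # key -> value; stands in for the disk cache Z

        def check(self, key):
            """Return (already_seen, stored_value) without modifying the store."""
            return (key in self.store), self.store.get(key)

        def update(self, key, value):
            """Write or overwrite the value associated with key."""
            self.store[key] = value

        def check_update(self, key, value):
            """Combined pass: report whether key was seen, store it if new."""
            seen, old = self.check(key)
            if not seen:
                self.update(key, value)
            return seen, old

    # Example: URL-seen test via check+update
    drum = DrumModel()
    print(drum.check_update("hash-of-url-A", b""))   # (False, None) -> new URL
    print(drum.check_update("hash-of-url-A", b""))   # (True, b'')   -> duplicate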

9
(No Transcript)
10
cont..
  • Input: a continuous stream of <key, value, aux>
    tuples.
  • DRUM spreads <key, value> pairs across K disk
    buckets according to their key.
  • All buckets are pre-allocated on disk before usage.

11
cont..
  • Once the largest bucket reaches size r < R, the
    following process is repeated for all i = 1..K
    (see the sketch below):
  • 1) Bucket QiH is read into the bucket buffer.
  • 2) Disk cache Z is sequentially read and compared
    with the bucket buffer.
  • 3) Disk cache Z is updated.
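A simplified sketch of one such merge pass (my own illustration, not the authors' code); Python lists stand in for the sequential disk files, and the synchronized scan of Z is replaced by a single pass that builds a key set.

    # Illustrative sketch of one DRUM merge pass (steps 1-3 above).
    def merge_bucket(bucket, disk_cache):
        """bucket: list of (key, value) pairs accumulated in one disk bucket.
        disk_cache: sorted list of (key, value) pairs (the cache Z).
        Returns (results, new_cache); results marks which keys were duplicates."""
        # 1) Read the bucket into the bucket buffer and sort it by key so it
        #    can be compared against Z sequentially.
        buf = sorted(bucket, key=lambda kv: kv[0])

        # 2) Read Z sequentially and compare it with the bucket buffer.
        cached = {k for k, _ in disk_cache}          # one pass over Z
        results = [(k, v, k in cached) for k, v in buf]

        # 3) Update disk cache Z: merge the previously unseen keys back in.
        unseen = [(k, v) for k, v, dup in results if not dup]
        new_cache = sorted(disk_cache + unseen, key=lambda kv: kv[0])
        return results, new_cache

    # Example
    z = [("a", 1)]
    res, z = merge_bucket([("b", 2), ("a", 9)], z)
    print(res)   # [('a', 9, True), ('b', 2, False)]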

12
DRUM used for storing crawler data
  • DRUM objects:
  • URLseen
  •     check+update operation
  • RobotsCache
  •     check and update operations
  •     caches robots.txt files
  • RobotsRequested
  •     stores hashes of sites for which robots.txt
    has been requested

13
 
(No Transcript)

14
Organisation of IRLbot
  • URLseen discards duplicate URLs; unique ones are
    sent for the BUDGET and STAR checks.
  • URLs pass or fail based on matching against the
    cached robots.txt rules.
  • URLs whose robots.txt is not yet cached are sent
    back to Qr, and their host names are passed
    through RobotsRequested.
  • Site hashes not already present are fed into Qd,
    which performs DNS lookups (see the flow sketch
    below).
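The following Python-style sketch of this flow is an added paraphrase of the slide; the queue names follow the slide, but all helper functions, the stand-in data structures, and the stubbed BUDGET/STAR and robots checks are assumptions for illustration.

    # High-level sketch of the URL flow on this slide (illustrative only).
    import hashlib
    from urllib.parse import urlsplit

    url_seen, robots_requested = set(), set()   # stand-ins for the DRUM objects
    robots_cache = {}                           # host hash -> parsed robots rules
    q_r, q_d, q_ready = [], [], []              # requeue, DNS, ready-to-crawl queues

    def h(s):                                   # key hashing
        return hashlib.sha1(s.encode()).digest()

    def passes_budget_and_star(url):            # placeholder BUDGET/STAR check
        return True

    def allowed_by_robots(rules, url):          # placeholder robots.txt match
        return True

    def process_parsed_urls(parsed_urls):
        for url in parsed_urls:
            key = h(url)
            if key in url_seen:                 # URLseen discards duplicates
                continue
            url_seen.add(key)                   # (check+update)
            if not passes_budget_and_star(url): # BUDGET and STAR checks
                continue
            host = urlsplit(url).hostname or ""
            rules = robots_cache.get(h(host))
            if rules is not None:
                if allowed_by_robots(rules, url):
                    q_ready.append(url)         # admitted for download
            else:
                q_r.append(url)                 # requeued until robots.txt is known
                if h(host) not in robots_requested:
                    robots_requested.add(h(host))
                    q_d.append(host)            # Qd performs the DNS lookup

    process_parsed_urls(["http://example.com/a", "http://example.com/a"])
    print(len(q_r), len(q_d))                   # 1 1: duplicate dropped, robots fetch queued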

15
    
  • Overhead metric ω: the number of bytes written
    to/read from disk during uniqueness checks of lN
    URLs (N pages, l parsed links per page), so
    ω = α · blN, where
  •     α is the number of times they are written
    to/read from disk
  •     blN is the number of bytes in all parsed
    URLs (b bytes per URL on average)
  • Maximum download rate (in pages/s) supported by
    the disk portion of URL uniqueness checks, for
    disk bandwidth W bytes/s: NW / ω = W / (αbl)
    (worked example below)
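To make the metric concrete, here is a small back-of-the-envelope calculation; every numeric value below (l, b, α, W) is an assumed example figure, not a number from the paper or the slides.

    # Back-of-the-envelope illustration of the overhead metric (assumed numbers).
    N     = 6_000_000_000        # pages crawled
    l     = 50                   # assumed average parsed links per page
    b     = 60                   # assumed average URL size in bytes
    alpha = 2                    # assumed times each URL is written to/read from disk
    W     = 100 * 1024**2        # assumed sustained disk bandwidth in bytes/s

    omega = alpha * b * l * N    # total bytes moved on disk for uniqueness checks
    rate  = W * N / omega        # = W / (alpha * b * l), pages/s

    print(f"omega ~ {omega / 1024**4:.1f} TB moved on disk")
    print(f"max disk-supported crawl rate ~ {rate:.0f} pages/s")

The point of the metric is visible here: α multiplies the entire parsed-URL volume, so keeping α small directly raises the sustainable crawl rate.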

16
spam avoidance
  • BFS is a poor technique in the presence of spam
    farms and infinite webs.
  • Simply restricting the branching factor or the
    maximum number of pages/hosts per domain is not a
    viable solution.
  • Computing traditional PageRank for each page
    could be prohibitively expensive in large crawls.
  • Spam can be effectively deterred by budgeting the
    number of allowed pages per pay-level domain
    (PLD).

17
cont..
  • We call this solution Spam Tracking and Avoidance
    through Reputation (STAR).
  • Each PLD x has a budget Bx that represents the
    number of pages that are allowed to pass from x
    (including all subdomains) to the crawling threads
    every T time units.
  • Each PLD x starts with a default budget B0, which
    is then dynamically adjusted as its in-degree dx
    changes over time.
  • DRUM is used to store PLD budgets and aggregate
    PLD-PLD link information (illustrative sketch
    below).
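The slides do not give the exact adjustment rule, so the sketch below only illustrates the general idea of a per-PLD budget that grows with the PLD's in-degree; the logarithmic scaling, the constants, and the function names are assumptions, not the authors' formula.

    # Illustrative STAR-style budget bookkeeping (adjustment rule is assumed).
    import math

    B0 = 10                      # assumed default per-PLD budget
    pld_links = {}               # PLD x -> set of distinct PLDs linking to x

    def record_pld_link(src_pld, dst_pld):
        """Aggregate PLD-PLD link information (kept in DRUM in IRLbot)."""
        if src_pld != dst_pld:
            pld_links.setdefault(dst_pld, set()).add(src_pld)

    def budget(pld):
        """Pages allowed to pass from this PLD per T time units; grows slowly
        with the PLD's in-degree d_x (log scaling is an assumption)."""
        d_x = len(pld_links.get(pld, ()))
        return B0 * (1 + int(math.log2(1 + d_x)))

    record_pld_link("cnn.com", "example.com")
    record_pld_link("bbc.co.uk", "example.com")
    print(budget("example.com"), budget("unknown-spam-farm.biz"))   # 20 10

The intent matches the slide: well-linked PLDs earn a larger budget Bx, while unknown or spam PLDs stay near the default B0.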

18
  
(No Transcript)

19
Politeness
  • Prior work has only enforced a certain per-host
    access delay of τ seconds.
  •     This makes it easy to crash servers that
    co-locate 1000s of virtual hosts.
  • To decide which URLs are admitted into RAM, we use
    a method called Budget Enforcement with Anti-Spam
    Tactics (BEAST).
  • BEAST does not discard URLs, but rather delays
    their download until it knows more about the PLD
    they belong to.

20
  
(No Transcript)

21
cont..
  • A naive implementation is to maintain two queues
    (a toy version is sketched below):
  •     Q contains URLs that passed the budget check
  •     QF contains those that failed
  • After Q is emptied, QF is read and again split
    into two queues Q and QF; these re-checks occur
    at gradually increasing intervals.
  • For N → ∞, the required disk speed approaches
    2Sb(2L + 1), a constant.
  • This is roughly four times the speed needed to
    write all unique URLs to disk as they are
    discovered during the crawl.
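A toy Python version of the naive two-queue scheme described above (my illustration; the budget check is a stub and the rescan scheduling is reduced to a single immediate re-read of QF).

    # Toy version of the naive two-queue budget enforcement (illustrative only).
    from collections import deque

    def crawl_round(q, q_f, downloads_left, passes_budget):
        """Drain Q, admitting URLs whose PLD still has budget; the rest go to QF.
        When Q is empty, QF is re-read and becomes the next Q."""
        while q and downloads_left > 0:
            url = q.popleft()
            if passes_budget(url):
                print("download", url)          # URL admitted to crawling threads
                downloads_left -= 1
            else:
                q_f.append(url)                 # failed the budget: delayed, not discarded
        # After Q is emptied, QF is rescanned (in IRLbot at increasing intervals).
        q.extend(q_f)
        q_f.clear()
        return downloads_left

    budgets = {"good.com": 2, "spam.biz": 0}
    def passes_budget(url):
        host = url.split("/")[2]
        if budgets.get(host, 1) > 0:
            budgets[host] = budgets.get(host, 1) - 1
            return True
        return False

    q  = deque(["http://good.com/a", "http://spam.biz/x", "http://good.com/b"])
    qf = deque()
    crawl_round(q, qf, downloads_left=10, passes_budget=passes_budget)
    print(list(q))   # the spam.biz URL waits for a later rescan instead of being dropped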

22
Experiments - summary
  • Active crawling period of 41 days in summer 2007.
  • IRLbot attempted 7.6 billion connections and
    received 7.4 billion valid HTTP replies.
  • Average download rate of 319 Mb/s, i.e. 1,789
    pages/s (quick sanity check below).
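As a quick added sanity check (simple arithmetic, not a figure from the slides), the two reported rates imply an average page size of roughly 22 KB:

    # Implied average page size from the reported download rates.
    bits_per_second  = 319e6                    # 319 Mb/s
    pages_per_second = 1789
    bytes_per_page = bits_per_second / 8 / pages_per_second
    print(f"~{bytes_per_page / 1024:.1f} KB per page on average")   # ~21.8 KB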

23
Conclusion
  • This paper tackled the issue of scaling web
    crawlers to billions and even trillions of pages
  •     on a single server with constant CPU, disk,
    and memory speed
  • We identified several bottlenecks in building an
    efficient large-scale crawler and presented our
    solutions to these problems
  •     low-overhead disk-based data structures
  •     non-BFS crawling order
  •     real-time reputation to guide the crawling rate

24
Future work
  • Refining reputation algorithms and assessing their
    performance.
  • Mining the collected data.