IRLbot: Scaling to 6 Billion Pages and Beyond - PowerPoint PPT Presentation

About This Presentation
Title:

IRLbot: Scaling to 6 Billion Pages and Beyond

Description:

... Conclusion Introduction Search engines consist of two fundamental components web crawlers data miners What is a web crawler? ... PowerPoint Presentation ... – PowerPoint PPT presentation

Number of Views:75
Avg rating:3.0/5.0
Slides: 25
Provided by: Goog6450
Learn more at: https://crystal.uta.edu
Category:

less

Transcript and Presenter's Notes

Title: IRLbot: Scaling to 6 Billion Pages and Beyond


1
IRLbot Scaling to 6 Billion Pages and Beyond
  • Presented by rohit  tummalapalli
  •                   sashank jupudi

2
Agenda
  • Introduction  
  • Challenges
  • Overcoming the challenges
  • Scalability
  • Reputation and spam
  • Politeness
  • Experimental results
  • Conclusion

3
  •    Introduction
  • Search engines consist of two fundamental
    components
  • web crawlers
  • data miners
  • What is a web crawler?
  • Challenges
  • scalability
  • spam avoidance
  • politeness 
  •      

4
Challenges - Scalability
  • Internet trade off between scalability(N) ,
    performance(S) and resource usage(E).
  • Previous works?
  • Our goal

5
Challenges - spam avoidance
  • Crawler is bogged down in synchronized delay
    attacks of certain spammers
  • Prior research?
  • Our goal

6
Challenges - politeness
  • Web crawlers often get in trouble with webmasters
    for slowing down their servers.
  • Even with spam avoidance, the entire RAM
    eventually gets filled with URLs from a small set
    of hosts and the crawler simply chokes on its
    politeness.
  • Previous algorithms No
  • Our goal

7
Scalability
  • Bottleneck caused by complexity of verifying
    uniqueness of        URL's and compliance with
    robots.txt .
  • Need for an algorithm that can perform very
    fast checks for    large N
  • Authors introduced DRUM(Disk Repository with
    Update    Management)
  • DRUM can store large volumes of arbitrary data
    on disk to implement very fast checks using
    bucket sort.
  • Efficient and faster than prior disk based
    methods 

8
Disk Repository with Update Mngmt
  • The purpose of DRUM is to allow for efficient
    storage of large collections of ltkey, valuegt
    pairs
  • ?Key is a unique identifier of some data
  • ?Value is arbitrary information attached to keys 
  • Three supported operations
  •    check, update, checkupdate

9
(No Transcript)
10
cont..
  •  input  Continuous tuples of ltkey,value,auxgt
  • DRUM spreads ltkey,valuegt pairs in K disk buckets
    according to key.
  • All buckets pre-allocated on disk before usage.

11
cont..
  • Once largest bucket reaches size r lt R foll.
    process repeated for all i 1..k
  • 1) Bucket QiH read into bucket buffer
  • 2) Disk Cache Z sequentially read and compared
    with bucket buffer.
  • 3) Updating disk cache Z

12
DRUM used for storing crawler data
  • DRUM objects 
  • URLseen
  •     checkupdate operation
  •     
  • RobotsCache
  •     check and update operations
  •     caching robots.txt file
  • RobotsRequested
  •     stores hashes of sites for which robots.txt
    been requested

13
 
  •  

14
Organisation of IRLbot
  • URLseen discards duplicate URL's , unique one's
    are sent for BUDGET and STAR check.
  • passing or failing based on matching Robots.txt
  • sent back to Qr and host names passed through
    RobotsReq
  • Sites hashes not already present are fed into Qd
    which performs DNS lokups

15
    
  • Overhead metric ? of bytes written to/read
    from disk during uniqueness checks of lN URLs 
  •    
  •     ?a is the number of times they are written
    to/read from disk
  •     ?blN is the number of bytes in all parsed
    URLs
  • Maximum download rate (in pages/s) supported by
    the disk portion of URL uniqueness
  •             
  •     

16
spam avoidance
  • BFS is a poor technique in presence of spam farms
    and infinite webs.
  • Simply restricting the branching factor or the
    maximum number of pages/hosts per domain is not a
    viable solution.
  • Computing traditional PageRank for each page
    could be prohibitively expensive in large crawls.
  • Spam can be effectively deterred by budgeting the
    number of allowed pages per pay-level domain
    (PLD).

17
cont..
  • This solution we call Spam Tracking and Avoidance
    through Reputation (STAR).
  • Each PLD x has budget Bx that represents the
    number of pages that are allowed to pass from x
    (including all subdomains) to crawling threads
    every T time units.
  • Each PLD x starts with a default budget B0 ,
    which is then dynamically adjusted as its
    in-degree dx changes over time
  • DRUM is used to store PLD budgets and aggregate
    PLD-PLD link information.

18
  
  •   

19
Politeness
  • Prior work has only enforced a certain per-host
    access delay th seconds
  •     ?Easy to crash servers that co-locate 1000s
    of virtual hosts.
  • To admit URLs into RAM we have a method called
    Budget Enforcement with Anti-Spam Tactics
    (BEAST).
  • BEAST does not discard URLs, but rather delays
    their download until it knows more about the PLD
    they belong to.

20
  
  •   

21
cont..
  • A naive implementation is to maintain two queues
  •      ?Q contains URLs that passed the budget
    check
  •     ?QF contains those that failed
  • After Q is emptied, QF is read and again split
    into two queues Q and QF checks at regular
    increasing intervals.
  • For N ?8, disk speed ? ? 2Sb(2L 1) constant
  • It is roughly four times the speed needed to
    write all unique URLs to disk as they are
    discovered during the crawl
  •  

22
Experiments - summary
  • Active crawling period of 41 days in summer 2007.
  • IRLbot attempted 7.6 billion connections and
    received 7.4 billion valid HTTP replies.
  • Average download rate 319 mb/s (1,789 pages/s).

23
Conclusion
  • This paper tackled the issue of scaling web
    crawlers to billions and even trillions of pages
  •     ?Single server with constant CPU, disk, and
    memory speed
  •  
  • We identified several bottlenecks in building an
    efficient large-scale crawler and presented our
    solution to these problems
  •     ?Low-overhead disk-based data structures  
     ?Non-BFS crawling order    ?Real-time
    reputation to guide the crawling rate

24
Future work
  • Refining reputation algorithms, accessing their
    performance.
  • Mining the collected data.
Write a Comment
User Comments (0)
About PowerShow.com