1
Web Crawling for Search: What's hard after ten years?
  • Raymie Stata
  • Chief Architect, Yahoo! Search and Marketplace

2
Agenda
  • Introduction
  • What makes crawling hard for beginners
  • What remains hard for experts

3
Introduction
  • Web crawling is the primary means of obtaining
    data for search engines
    – Tens of billions of pages downloaded
    – Hundreds of billions of pages known
    – Average page <10 days old
  • Web crawling is as old as the Web
    – Large-scale crawling is about ten years old
    – Much has been published, but secret sauce still
      exists
  • Must support RCF
    – Relevance, Comprehensiveness, Freshness

4
Components of a crawler
[Architecture diagram, components only: Downloaders
(facing the Internet, DNS as well as HTTP), Page
processing, Page storage, Prioritization, Enrichment,
and the Web DB, plus external inputs: feeds and click
streams.]
5
Baseline challenges: overall scale
  • 100s of machines dedicated to each component
  • Must be good at logistics (purchasing and
    deployment), operations, and distributed
    programming (fault tolerance included), …

6
Baseline challenges: downloaders
  • DNS scaling (multi-threading)
  • Bandwidth
  • Async I/O vs. threads
  • Clustering/distribution
  • Non-conformance
  • Politeness
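
A minimal sketch of the "async I/O plus politeness" combination from the
bullets above, using asyncio with the third-party aiohttp library (both
the library choice and the one-second per-host delay are assumptions,
not details from the talk):

    import asyncio
    import time
    from urllib.parse import urlsplit

    import aiohttp

    POLITENESS_DELAY = 1.0  # assumed per-host delay, in seconds

    class PoliteDownloader:
        def __init__(self):
            self._next_ok = {}           # host -> earliest allowed fetch time
            self._lock = asyncio.Lock()

        async def fetch(self, session, url):
            host = urlsplit(url).netloc
            # Reserve a time slot for this host so that concurrent
            # tasks remain polite to the same server.
            async with self._lock:
                now = time.monotonic()
                start = max(now, self._next_ok.get(host, now))
                self._next_ok[host] = start + POLITENESS_DELAY
            await asyncio.sleep(start - now)
            async with session.get(url) as resp:
                return await resp.read()

    async def crawl(urls):
        downloader = PoliteDownloader()
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(
                *(downloader.fetch(session, u) for u in urls))

Async I/O lets one process keep thousands of connections in flight,
where a thread-per-connection design would exhaust memory first.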

7
Baseline challenges: page processing
  • File-cracking
    – HTML, Word, PDF, JPG, MPEG, …
    – Non-conformance
  • Higher-level processing
    – JavaScript, sessions, information extraction, …
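
As an illustration of "file-cracking", here is a toy HTML cracker built
on Python's standard library. A real crawler would register additional
crackers for Word, PDF, images, and video; the function names here are
invented for this sketch:

    from html.parser import HTMLParser

    class LinkAndTextExtractor(HTMLParser):
        """Crude HTML cracker: collects outlinks and visible text."""
        def __init__(self):
            super().__init__(convert_charrefs=True)
            self.links, self.text = [], []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)

        def handle_data(self, data):
            self.text.append(data)

    def crack(content_type, payload):
        # Dispatch on MIME type; errors="replace" is one cheap defense
        # against the non-conformant encodings mentioned above.
        if content_type.startswith("text/html"):
            parser = LinkAndTextExtractor()
            parser.feed(payload.decode("utf-8", errors="replace"))
            return parser.links, " ".join(parser.text)
        raise ValueError("no cracker registered for " + content_type)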

8
Baseline challenges: Web DB and enrichment
  • Scale
  • Update rate
  • Extraction rate
  • Duplication detection
  • Alias detection
  • Checkpoints
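
One standard approach to duplication detection is shingling (the
"shingleprints" stored in the Web DB, per slide 11): hash every k-word
window of a page and compare the resulting sets. A minimal sketch,
where k=4 and the 64-bit hash truncation are illustrative choices:

    import hashlib

    def shingleprints(text, k=4):
        """Hash every k-word window ("shingle") of a document."""
        words = text.split()
        return {
            int.from_bytes(
                hashlib.md5(" ".join(words[i:i + k]).encode()).digest()[:8],
                "big")
            for i in range(max(1, len(words) - k + 1))
        }

    def resemblance(a, b):
        """Jaccard similarity of two shingle sets; ~1.0 means near-duplicate."""
        return len(a & b) / len(a | b) if (a | b) else 1.0

At crawl scale one would keep only a small min-hash sample of each
shingle set rather than the full set.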

9
Baseline challenges: prioritization
  • Quality ranking
  • Spam and crawler traps
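
Crawler traps (e.g., endless calendars or session-ID link loops) are
typically caught with cheap URL heuristics before download. A sketch
with invented thresholds:

    from urllib.parse import urlsplit, parse_qs

    def looks_like_trap(url, max_depth=12, max_params=8):
        """Heuristic trap filter; thresholds are illustrative guesses."""
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s]
        if len(segments) > max_depth:
            return True
        # The same path segment repeating is a classic loop symptom.
        if any(segments.count(s) >= 3 for s in set(segments)):
            return True
        return len(parse_qs(parts.query)) > max_params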

10
Evergreen problems
  • Relevance
    – Page quality, spam
    – Page processing, prioritization techniques
  • Comprehensiveness
    – Sheer scale
    – Sheer machine count (expensive)
    – Scaling of the Web DB
    – Deep Web, information extraction
    – Page processing
  • Freshness
    – Discovery, frequency, long tail

11
Web DB: more details
  • For each URL, the Web DB contains
    – In- and outlinks
    – Anchor text
    – Various dates: last downloaded, last changed, …
    – Decorations from various processors: language,
      topic, spam scores, term-vectors, fingerprints,
      shingleprints, many more
  • Subset of the above stored for several instances
    – That is, we keep track of the history of a page
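
The per-URL record described above might be modeled like this (the talk
gives no schema, so all field names are invented for illustration):

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class UrlRecord:
        url: str
        inlinks: list = field(default_factory=list)
        outlinks: list = field(default_factory=list)
        anchor_text: list = field(default_factory=list)  # inbound anchor strings
        last_downloaded: Optional[float] = None          # epoch seconds
        last_changed: Optional[float] = None
        decorations: dict = field(default_factory=dict)
        # e.g. {"language": "en", "spam_score": 0.02, "shingleprints": ...}
        history: list = field(default_factory=list)      # earlier instances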

12
Web DB: update volume
  • When a page is downloaded, we need to update
    inlink and anchor-text info for each page it
    points to
  • A page has 20 outlinks on it
  • We download 1,000s of pages per second
  • At peak, need well over 100K updates/sec
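
The arithmetic behind that peak figure, with an assumed mid-range rate
(the talk says only "1,000s" of pages per second):

    pages_per_sec = 5_000        # assumed; "1,000s of pages per second"
    outlinks_per_page = 20
    updates_per_sec = pages_per_sec * outlinks_per_page
    print(updates_per_sec)       # 100,000 inlink/anchor-text updates/sec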

13
Web DB: scaling techniques
  • Perform updates in large batches
    – Solves bandwidth problems
    – But introduces latency problems, in particular
      the time to discover new links
  • Solve latency with a short-circuit for discovery
    – But this bypasses the full prioritization logic,
      which introduces quality problems that need to
      be solved with more special solutions, and
      before long, oi, it's all getting very
      complicated (see the sketch below)
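
A sketch of the batching-plus-short-circuit pattern described on this
slide. The db and frontier interfaces, and the batch size, are invented
for illustration:

    from collections import deque

    BATCH_SIZE = 100_000    # assumed flush threshold

    class BatchingWebDB:
        def __init__(self, db, frontier):
            self.db = db               # bulk-update backend (assumed API)
            self.frontier = frontier   # prioritizer / crawl frontier
            self.pending = []          # buffered (url, update) pairs
            self.discovery = deque()   # fast path for brand-new URLs

        def on_page_downloaded(self, url, outlinks, anchors):
            # outlinks and anchors are assumed to be parallel lists.
            for link, anchor in zip(outlinks, anchors):
                self.pending.append((link, {"inlink": url, "anchor": anchor}))
                if not self.db.known(link):
                    # Short-circuit: hand brand-new URLs to the frontier
                    # immediately instead of waiting for the next batch,
                    # at the cost of skipping full prioritization (the
                    # quality problem noted above).
                    self.discovery.append(link)
            if len(self.pending) >= BATCH_SIZE:
                self.db.apply_batch(self.pending)   # amortizes I/O
                self.pending.clear()
            while self.discovery:
                self.frontier.enqueue(self.discovery.popleft())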

14
DHTML: the enemy of crawling
  • Increasing use of client-side scripting (aka
    DHTML) is making more of the Web opaque to
    crawlers
    – AJAX: Asynchronous JavaScript and XML
    – (The end of crawling?)
  • Not (yet) a major barrier to Web search, but it
    is a barrier to shopping and other specialized
    search, where we also have to deal with
    – Form-filling and sessions
    – Information extraction

15
Conclusions
  • Large-scale Web crawling is not trivial
  • Smart, well-funded people could figure it out
    from the literature
  • But secret sauce remains in
    – Prioritization
    – Scaling the Web DB
    – JavaScript, form-filling, information extraction

16
The future
  • Will life get easier?
    – Ping plus feeds
  • Will life get harder?
    – DHTML -> Ajax -> Avalon
  • A little bit of both?
    – Publishers regain control
    – But, net, comprehensiveness improves