1
Web Crawling for Search: What's hard after ten years?
  • Raymie Stata
  • Chief Architect, Yahoo! Search and Marketplace

2
Agenda
  • Introduction
  • What makes crawling hard for beginners
  • What remains hard for experts

3
Introduction
  • Web crawling is the primary means of obtaining
    data for search engines
    – Tens of billions of pages downloaded
    – Hundreds of billions of pages known
    – Average page <10 days old
  • Web crawling is as old as the Web
    – Large-scale crawling is about ten years old
    – Much has been published, but secret sauce still
      exists
  • Must support RCF
    – Relevance, Comprehensiveness, Freshness

4
Components of a crawler
[Architecture diagram, components only: Downloaders
(facing the Internet, DNS as well as HTTP), Page
processing, Page storage, Prioritization, Enrichment,
and the Web DB, plus external inputs: feeds and click
streams.]
5
Baseline challenges: overall scale
  • 100s of machines dedicated to each component
  • Must be good at logistics (purchasing and
    deployment), operations, and distributed
    programming (fault tolerance included), …

6
Baseline challenges: downloaders
  • DNS scaling (multi-threading)
  • Bandwidth
  • Async I/O vs. threads
  • Clustering/distribution
  • Non-conformance
  • Politeness
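
A minimal sketch of the "async I/O plus politeness" combination from the
bullets above, using asyncio with the third-party aiohttp library (both
the library choice and the one-second per-host delay are assumptions,
not details from the talk):

    import asyncio
    import time
    from urllib.parse import urlsplit

    import aiohttp

    POLITENESS_DELAY = 1.0  # assumed per-host delay, in seconds

    class PoliteDownloader:
        def __init__(self):
            self._next_ok = {}           # host -> earliest allowed fetch time
            self._lock = asyncio.Lock()

        async def fetch(self, session, url):
            host = urlsplit(url).netloc
            # Reserve a time slot for this host so that concurrent
            # tasks remain polite to the same server.
            async with self._lock:
                now = time.monotonic()
                start = max(now, self._next_ok.get(host, now))
                self._next_ok[host] = start + POLITENESS_DELAY
            await asyncio.sleep(start - now)
            async with session.get(url) as resp:
                return await resp.read()

    async def crawl(urls):
        downloader = PoliteDownloader()
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(
                *(downloader.fetch(session, u) for u in urls))

Async I/O lets one process keep thousands of connections in flight,
where a thread-per-connection design would exhaust memory first.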

7
Baseline challenges: page processing
  • File-cracking
    – HTML, Word, PDF, JPG, MPEG, …
    – Non-conformance
  • Higher-level processing
    – JavaScript, sessions, information extraction, …
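
As an illustration of "file-cracking", here is a toy HTML cracker built
on Python's standard library. A real crawler would register additional
crackers for Word, PDF, images, and video; the function names here are
invented for this sketch:

    from html.parser import HTMLParser

    class LinkAndTextExtractor(HTMLParser):
        """Crude HTML cracker: collects outlinks and visible text."""
        def __init__(self):
            super().__init__(convert_charrefs=True)
            self.links, self.text = [], []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)

        def handle_data(self, data):
            self.text.append(data)

    def crack(content_type, payload):
        # Dispatch on MIME type; errors="replace" is one cheap defense
        # against the non-conformant encodings mentioned above.
        if content_type.startswith("text/html"):
            parser = LinkAndTextExtractor()
            parser.feed(payload.decode("utf-8", errors="replace"))
            return parser.links, " ".join(parser.text)
        raise ValueError("no cracker registered for " + content_type)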

8
Baseline challenges: Web DB and enrichment
  • Scale
  • Update rate
  • Extraction rate
  • Duplication detection
  • Alias detection
  • Checkpoints
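
One standard approach to duplication detection is shingling (the
"shingleprints" stored in the Web DB, per slide 11): hash every k-word
window of a page and compare the resulting sets. A minimal sketch,
where k=4 and the 64-bit hash truncation are illustrative choices:

    import hashlib

    def shingleprints(text, k=4):
        """Hash every k-word window ("shingle") of a document."""
        words = text.split()
        return {
            int.from_bytes(
                hashlib.md5(" ".join(words[i:i + k]).encode()).digest()[:8],
                "big")
            for i in range(max(1, len(words) - k + 1))
        }

    def resemblance(a, b):
        """Jaccard similarity of two shingle sets; ~1.0 means near-duplicate."""
        return len(a & b) / len(a | b) if (a | b) else 1.0

At crawl scale one would keep only a small min-hash sample of each
shingle set rather than the full set.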

9
Baseline challenges: prioritization
  • Quality ranking
  • Spam and crawler traps
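
Crawler traps (e.g., endless calendars or session-ID link loops) are
typically caught with cheap URL heuristics before download. A sketch
with invented thresholds:

    from urllib.parse import urlsplit, parse_qs

    def looks_like_trap(url, max_depth=12, max_params=8):
        """Heuristic trap filter; thresholds are illustrative guesses."""
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s]
        if len(segments) > max_depth:
            return True
        # The same path segment repeating is a classic loop symptom.
        if any(segments.count(s) >= 3 for s in set(segments)):
            return True
        return len(parse_qs(parts.query)) > max_params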

10
Evergreen problems
  • Relevance
    – Page quality, spam
    – Page processing, prioritization techniques
  • Comprehensiveness
    – Sheer scale
    – Sheer machine count (expensive)
    – Scaling of the Web DB
    – Deep Web, information extraction
    – Page processing
  • Freshness
    – Discovery, frequency, long tail

11
Web DB: more details
  • For each URL, the Web DB contains
    – In- and outlinks
    – Anchor text
    – Various dates: last downloaded, last changed, …
    – Decorations from various processors: language,
      topic, spam scores, term-vectors, fingerprints,
      shingleprints, many more
  • Subset of the above stored for several instances
    – That is, we keep track of the history of a page
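
The per-URL record described above might be modeled like this (the talk
gives no schema, so all field names are invented for illustration):

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class UrlRecord:
        url: str
        inlinks: list = field(default_factory=list)
        outlinks: list = field(default_factory=list)
        anchor_text: list = field(default_factory=list)  # inbound anchor strings
        last_downloaded: Optional[float] = None          # epoch seconds
        last_changed: Optional[float] = None
        decorations: dict = field(default_factory=dict)
        # e.g. {"language": "en", "spam_score": 0.02, "shingleprints": ...}
        history: list = field(default_factory=list)      # earlier instances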

12
Web DB: update volume
  • When a page is downloaded, we need to update
    inlink and anchor-text info for each page it
    points to
  • A page has 20 outlinks on it
  • We download 1,000s of pages per second
  • At peak, need well over 100K updates/sec
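
The arithmetic behind that peak figure, with an assumed mid-range rate
(the talk says only "1,000s" of pages per second):

    pages_per_sec = 5_000        # assumed; "1,000s of pages per second"
    outlinks_per_page = 20
    updates_per_sec = pages_per_sec * outlinks_per_page
    print(updates_per_sec)       # 100,000 inlink/anchor-text updates/sec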

13
Web DB: scaling techniques
  • Perform updates in large batches
    – Solves bandwidth problems
    – But introduces latency problems, in particular
      the time to discover new links
  • Solve latency with a short-circuit for discovery
    – But this bypasses the full prioritization logic,
      which introduces quality problems that need to
      be solved with more special solutions, and
      before long, oi, it's all getting very
      complicated (see the sketch below)
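
A sketch of the batching-plus-short-circuit pattern described on this
slide. The db and frontier interfaces, and the batch size, are invented
for illustration:

    from collections import deque

    BATCH_SIZE = 100_000    # assumed flush threshold

    class BatchingWebDB:
        def __init__(self, db, frontier):
            self.db = db               # bulk-update backend (assumed API)
            self.frontier = frontier   # prioritizer / crawl frontier
            self.pending = []          # buffered (url, update) pairs
            self.discovery = deque()   # fast path for brand-new URLs

        def on_page_downloaded(self, url, outlinks, anchors):
            # outlinks and anchors are assumed to be parallel lists.
            for link, anchor in zip(outlinks, anchors):
                self.pending.append((link, {"inlink": url, "anchor": anchor}))
                if not self.db.known(link):
                    # Short-circuit: hand brand-new URLs to the frontier
                    # immediately instead of waiting for the next batch,
                    # at the cost of skipping full prioritization (the
                    # quality problem noted above).
                    self.discovery.append(link)
            if len(self.pending) >= BATCH_SIZE:
                self.db.apply_batch(self.pending)   # amortizes I/O
                self.pending.clear()
            while self.discovery:
                self.frontier.enqueue(self.discovery.popleft())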

14
DHTML: the enemy of crawling
  • Increasing use of client-side scripting (aka
    DHTML) is making more of the Web opaque to
    crawlers
    – AJAX: Asynchronous JavaScript and XML
    – (The end of crawling?)
  • Not (yet) a major barrier to Web search, but it
    is a barrier to shopping and other specialized
    search, where we also have to deal with
    – Form-filling and sessions
    – Information extraction

15
Conclusions
  • Large-scale Web crawling is not trivial
  • Smart, well-funded people could figure it out
    from the literature
  • But secret sauce remains in
    – Prioritization
    – Scaling the Web DB
    – JavaScript, form-filling, information extraction

16
The future
  • Will life get easier?
    – Ping plus feeds
  • Will life get harder?
    – DHTML -> Ajax -> Avalon
  • A little bit of both?
    – Publishers regain control
    – But, net, comprehensiveness improves