Intro to Web Search

Transcript and Presenter's Notes
1
Intro to Web Search
  • Michael J. Cafarella
  • December 5, 2007

2
Searching the Web
  • Web Search is basically a database problem, but
    no one uses SQL databases
  • Every query is a top-k query
  • Every query plan is the same
  • Massive volumes of queries and data
  • Read-only data
  • A search query can be thought of in SQL terms,
    but the engineered system is completely different

3
Nutch & Hadoop: a case study
  • Open-source, free to use and change
  • Nutch is a search engine, can handle 200M pages
  • Hadoop is backend infrastructure, biggest
    deployment is a 2000 machine cluster
  • There have been many different search engine
    designs, but Nutch is pretty standard and easy to
    learn from

4
Outline
  • Search basics
  • What are the elementary steps?
  • Nutch design
  • Link database, fetcher, indexer, etc.
  • Hadoop support
  • Distributed filesystem, job control

5
Search document model
  • Think of a web document as a tuple with several
    columns
  • Incoming link text
  • Title
  • Page content (maybe many sub-parts)
  • Unique docid
  • A web search is really SELECT * FROM docs WHERE
    docs.text LIKE '%userquery%' AND docs.title LIKE
    '%userquery%' ORDER BY relevance
  • Where "relevance" is very complicated

6
Search document model (2)
  • Three main challenges to processing a query
  • Processing speed
  • Result relevance
  • Scaling to many documents

7
Processing speed
  • You could grep, but each query will need to touch
    each document
  • Key to fast processing is the inverted index
  • Basic idea: for each word, list all the
    documents where that word can be found (see the
    sketch below)
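A minimal sketch of the idea in Python (toy code; build_index and the sample docs are illustrative, not from Nutch or Lucene):

    from collections import defaultdict

    def build_index(docs):
        """docs: dict of docid -> document text. Returns word -> sorted docid list."""
        index = defaultdict(set)
        for docid, text in docs.items():
            for word in text.lower().split():
                index[word].add(docid)
        # Postings are kept sorted so query-time merging is a linear scan.
        return {word: sorted(ids) for word, ids in index.items()}

    docs = {0: "mayors give nickels", 1: "such friendly mayors", 2: "such as seattle"}
    index = build_index(docs)
    # index["mayors"] == [0, 1]; a query only touches the lists for its terms.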

8
as
billy
cities
friendly
give
mayors
nickels
seattle
such
words
9
Query: "such as"
[Figure: the sorted posting lists for "such" and "as" (docid0 … docid|docs|-1) are scanned with two pointers; docids appearing in both lists are emitted as the returned docs]
  1. Test for equality
  2. Advance smaller pointer
  3. Abort when a list is exhausted
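Steps 1-3 are the standard sorted-list intersection. A hedged sketch in Python (intersect is my name for it, not Nutch's):

    def intersect(a, b):
        """Intersect two sorted posting lists of docids."""
        i, j, out = 0, 0, []
        while i < len(a) and j < len(b):    # 3. abort when a list is exhausted
            if a[i] == b[j]:                # 1. test for equality: a hit
                out.append(a[i])
                i, j = i + 1, j + 1
            elif a[i] < b[j]:               # 2. advance the smaller pointer
                i += 1
            else:
                j += 1
        return out

    # intersect(index["such"], index["as"]) returns the docs matching both terms.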
10
Result Relevance
  • Modern search engines use hundreds of different
    clues for ranking
  • Page title, meta tags, bold text, text position
    on page, word frequency, etc.
  • A few are usually considered standard
  • tfidf(t, d) = freq(t in d) / freq(t in corpus)
    (sketched below)
  • Link analysis: link counting, PageRank, etc.
  • Incoming anchor text
  • Big gains are now hard to find
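A sketch of the slide's simplified tf-idf ratio (real engines use log-scaled tf and idf variants; the names here are illustrative):

    def tfidf(term, doc_words, corpus_words):
        # Slide's ratio: how often the term appears in this doc,
        # relative to how often it appears in the whole corpus.
        return doc_words.count(term) / max(1, corpus_words.count(term))

    doc = "such friendly mayors give nickels".split()
    corpus = doc + "seattle cities such such as billy".split()
    print(tfidf("such", doc, corpus))   # 1 / 3 ≈ 0.33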

11
Scaling to many documents
  • Not even the inverted index can handle billions
    of docs on a single machine
  • Need to parallelize query
  • Segment by document
  • Segment by search term

12
Scaling (2): doc segmenting
[Figure: the query "britney" is broadcast to five machines, each holding one segment (Docs 0-1M, 1-2M, 2-3M, 3-4M, 4-5M); each returns its local matches (docs 1 and 29; 1.2M and 1.7M; 2.3M and 2.9M; 3.1M and 3.2M; 4.4M and 4.5M), which are merged into one result list]
13
Scaling (3)
  • Segment by document, pros/cons (see the sketch
    after this list)
  • Easy to partition (just MOD the docid)
  • Easy to add new documents
  • If a machine fails, quality goes down but queries
    don't die
  • Segment by term, pros/cons
  • Harder to partition (term frequencies are uneven)
  • Trickier to add a new document (need to touch
    many machines)
  • If a machine fails, a search term might disappear,
    but not critical pages (e.g., yahoo.com/index.html)
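A sketch of the document-segmented scheme (toy, single-process; a real engine fans the query out over the network and merges by relevance score):

    def route(docid, k):
        # Easy to partition: just MOD the docid.
        return docid % k

    def search(segments, term, topk=10):
        """segments: one inverted index (word -> sorted docids) per machine."""
        hits = []
        for seg in segments:                # queried in parallel in practice;
            hits.extend(seg.get(term, []))  # a dead segment contributes nothing
        return sorted(hits)[:topk]          # merge the partial result lists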

14
Intro to Nutch
  • A search engine is more than just the query
    system. Simply obtaining the pages and
    constructing the index is a lot of work

15
WebDB
[Figure: WebDB diagram]
16
Moving Parts
  • Acquisition cycle
  • WebDB
  • Fetcher
  • Index generation
  • Indexing
  • Link analysis (maybe)
  • Serving results

17
WebDB
  • Contains info on all pages, links
  • Per page: URL, last download, failures, link score,
    content hash, ref counting
  • Per link: source hash, target URL
  • Must always be consistent
  • Designed to minimize disk seeks
  • 19 ms seek time × 200M new pages/mo
  • = 44 days of disk seeks! (0.019 s × 200,000,000 ≈ 3.8M s)

18
Fetcher
  • The fetcher is very stupid; it is not a crawler
  • Divide the "to-fetch list" into k pieces, one for
    each fetcher machine (see the sketch below)
  • URLs for one domain go to the same list, otherwise
    random
  • Politeness w/o inter-fetcher protocols
  • Can observe robots.txt similarly
  • Better DNS, robots caching
  • Easy parallelism
  • Two outputs: pages, WebDB edits
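A sketch of the split (the CRC-based hash is my choice; any stable hash of the domain works):

    import zlib
    from urllib.parse import urlparse

    def make_fetchlists(urls, k):
        """All URLs of one domain land in the same piece, so each fetcher
        can be polite to its domains without talking to the others."""
        pieces = [[] for _ in range(k)]
        for url in urls:
            domain = urlparse(url).netloc
            pieces[zlib.crc32(domain.encode()) % k].append(url)
        return pieces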

19
WebDB/Fetcher Updates
[Figure: two sorted streams merged side by side. Fetcher edits (e.g., DOWNLOAD_CONTENT for http://www.cnn.com/index.html with its new content hash; NEW_LINK for http://www.flickr.com/index.html) are merged against existing WebDB rows (URL, LastUpdated, ContentHash), producing updated rows such as cnn.com with LastUpdated: Today!]
  1. Write down fetcher edits
  2. Sort edits (externally, if necessary)
  3. Read streams in parallel, emitting new database
  4. Repeat for other tables
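A sketch of step 3, assuming both streams are already sorted by URL (in-memory lists stand in for the on-disk streams):

    def merge_webdb(db, edits):
        """db, edits: lists of (url, record) sorted by url.
        Emit a new sorted database; an edit for an existing url wins."""
        i, j, out = 0, 0, []
        while i < len(db) and j < len(edits):
            if db[i][0] < edits[j][0]:
                out.append(db[i]); i += 1
            elif db[i][0] > edits[j][0]:
                out.append(edits[j]); j += 1          # NEW_LINK: brand-new row
            else:
                out.append(edits[j]); i += 1; j += 1  # DOWNLOAD_CONTENT: replace row
        return out + db[i:] + edits[j:]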
20
Indexing
  • Iterate through all k page sets in parallel,
    constructing inverted index
  • Creates a searchable document like we saw
    earlier
  • URL text
  • Content text
  • Incoming anchor text
  • Inverted index provided by the Lucene open source
    project

21
Administering Nutch
  • Admin costs are critical
  • It's a hassle when you have 25 machines
  • Google has maybe >100k
  • Files
  • WebDB content, working files
  • Fetchlists, fetched pages
  • Link analysis outputs, working files
  • Inverted indices
  • Jobs
  • Emit fetchlists, fetch, update WebDB
  • Run link analysis
  • Build inverted indices

22
Administering Nutch (2)
  • Admin sounds boring, but it's not!
  • Really
  • I swear
  • Large-file maintenance
  • Google File System (Ghemawat, Gobioff, Leung)
  • Nutch Distributed File System
  • Job Control
  • Map/Reduce (Dean and Ghemawat)
  • Result: Hadoop, a Nutch spinoff

23
Nutch Distributed File System
  • Similar, but not identical, to GFS
  • Requirements are fairly strange
  • Extremely large files
  • Most files read once, from start to end
  • Low admin costs per GB
  • Equally strange design
  • Write-once, with delete
  • Single file can exist across many machines
  • Wholly automatic failure recovery

24
NDFS (2)
  • Data divided into blocks
  • Blocks can be copied, replicated
  • Datanodes hold and serve blocks
  • Namenode holds metainfo
  • Filename → block list
  • Block → datanode location(s)
  • Datanodes report in to namenode every few seconds
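The namenode's bookkeeping, sketched as two plain maps (using the crawl.txt example from the next slide; field names are mine):

    # filename -> ordered block list
    name_to_blocks = {"crawl.txt": [33, 95, 65]}
    # block -> datanodes currently holding a copy
    block_to_nodes = {33: [1, 4], 95: [0, 2], 65: [1, 4, 5]}

    def locate(filename):
        """What a client needs: each block, with the datanodes that serve it."""
        return [(b, block_to_nodes[b]) for b in name_to_blocks[filename]]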

25
NDFS File Read
[Figure: a client, the namenode, and datanodes 0-5; crawl.txt maps to
(block-33 / datanodes 1, 4) (block-95 / datanodes 0, 2) (block-65 / datanodes 1, 4, 5)]
  1. Client asks namenode for filename info
  2. Namenode responds with blocklist, and location(s)
    for each block
  3. Client fetches each block, in sequence, from a
    datanode
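A client-side sketch of the three steps, continuing the toy namenode maps above (fetch_block is a hypothetical stand-in for the real block-transfer call):

    def read_file(filename, fetch_block):
        data = b""
        for block_id, nodes in locate(filename):     # 1-2: ask the namenode
            data += fetch_block(nodes[0], block_id)  # 3: any replica will do
        return data

    # Example with a fake transport:
    # read_file("crawl.txt", lambda node, blk: b"<block %d from dn%d>" % (blk, node))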

26
NDFS Replication
[Figure: namenode and datanodes 0-5 with their block lists: dn0 (33, 95),
dn1 (46, 95), dn2 (33, 104), dn3 (21, 33, 46), dn4 (90), dn5 (21, 90, 104);
arrow: "blk 90 to dn 0"]
  1. Always keep at least k copies of each blk
  2. Imagine datanode 4 dies; blk 90 is lost
  3. Namenode loses the heartbeat and decrements blk 90's
    reference count; it asks datanode 5 to replicate
    blk 90 to datanode 0
  4. Choosing the replication target is tricky
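A namenode-side sketch of steps 2-3 (pick_target is a deliberately naive placeholder; real placement also weighs load, disk space, and racks):

    ALL_NODES = [0, 1, 2, 3, 4, 5]
    MIN_COPIES = 2    # "k" in step 1

    def pick_target(holders):
        # Naive: any node not already holding the block.
        return next(n for n in ALL_NODES if n not in holders)

    def on_heartbeat_lost(dead, block_to_nodes):
        for blk, holders in block_to_nodes.items():
            if dead in holders:
                holders.remove(dead)                 # decrement the ref count
                if len(holders) < MIN_COPIES:
                    src, dst = holders[0], pick_target(holders)
                    print(f"ask dn{src} to replicate blk {blk} to dn{dst}")

    # on_heartbeat_lost(4, {90: [4, 5], 33: [0, 2, 3]})
    #   -> ask dn5 to replicate blk 90 to dn0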

27
Map/Reduce
  • Map/Reduce is a programming model borrowed from
    Lisp (and other places)
  • Easy to distribute across nodes
  • Nice retry/failure semantics
  • map(key, val) is run on each item in set
  • emits key/val pairs
  • reduce(key, vals) is run for each unique key
    emitted by map()
  • emits final output
  • Many problems can be phrased this way
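A single-process sketch of the model (map_reduce here is a toy driver, not Hadoop's API):

    from itertools import groupby

    def map_reduce(inputs, mapper, reducer):
        """inputs: iterable of (key, val). Run mapper on each item, group
        everything it emits by key, then run reducer once per unique key."""
        emitted = []
        for key, val in inputs:
            emitted.extend(mapper(key, val))               # map phase
        emitted.sort(key=lambda kv: kv[0])                 # shuffle/sort
        out = []
        for k, group in groupby(emitted, key=lambda kv: kv[0]):
            out.extend(reducer(k, [v for _, v in group]))  # reduce phase
        return out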

28
Map/Reduce (2)
  • Task: count words in docs
  • Input consists of (url, contents) pairs
  • map(key=url, val=contents)
  • For each word w in contents, emit (w, 1)
  • reduce(key=word, values=uniq_counts)
  • Sum all the 1s in the values list
  • Emit result (word, sum)
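Word count with the toy driver above (the two functions mirror the slide's pseudocode):

    def wc_map(url, contents):
        return [(w, 1) for w in contents.split()]   # emit (w, 1) per word

    def wc_reduce(word, values):
        return [(word, sum(values))]                # sum all the 1s

    pages = [("u0", "such friendly mayors"), ("u1", "mayors give nickels")]
    print(map_reduce(pages, wc_map, wc_reduce))
    # [('friendly', 1), ('give', 1), ('mayors', 2), ('nickels', 1), ('such', 1)]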

29
Map/Reduce (3)
  • Task: grep
  • Input consists of (url+offset, single line)
  • map(key=url+offset, val=line)
  • If the line matches the regexp, emit (line, 1)
  • reduce(key=line, values=uniq_counts)
  • Don't do anything; just emit line
  • We can also do graph inversion, link analysis,
    WebDB updates, etc.
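Grep with the same toy driver (the pattern is hard-wired here for brevity):

    import re

    def grep_map(key, line):
        return [(line, 1)] if re.search(r"nickels", line) else []

    def grep_reduce(line, values):
        return [line]                  # don't do anything; just emit the line

    lines = [("u0+0", "mayors give nickels"), ("u0+20", "such as seattle")]
    print(map_reduce(lines, grep_map, grep_reduce))   # ['mayors give nickels']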

30
Map/Reduce (4)
  • How is this distributed?
  • Partition input key/value pairs into chunks, run
    map() tasks in parallel
  • After all map()s are complete, consolidate all
    emitted values for each unique emitted key
  • Now partition space of output map keys, and run
    reduce() in parallel
  • If a map() or reduce() task fails, just re-execute it!
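The key step is deterministic partitioning, so every value for a given key reaches the same reduce task. A sketch (the hash choice is mine):

    import zlib

    def partition(key, num_reduce_tasks):
        """All map outputs with this key go to the same reducer."""
        return zlib.crc32(str(key).encode()) % num_reduce_tasks

    # partition("britney", 6) always routes to the same one of 6 reducers.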

31
Map/Reduce Job Processing
[Figure: a JobTracker coordinating TaskTrackers 0-5 on a grep job]
  1. Client submits grep job, indicating code and
    input files
  2. JobTracker breaks input file into k chunks (in
    this case 6). Assigns work to tasktrackers.
  3. After map(), tasktrackers exchange map-output to
    build reduce() keyspace
  4. JobTracker breaks reduce() keyspace into m chunks
    (in this case 6). Assigns work.
  5. reduce() output may go to NDFS

32
Conclusion
  • http://www.nutch.org/
  • Partial documentation
  • Source code
  • Developer discussion board
  • Lucene in Action by Hatcher, Gospodnetic
  • http://www.hadoop.org/
  • Or, take 490H