Web Archaeology - PowerPoint PPT Presentation

About This Presentation

Title:

Web Archaeology

Description:

Quality of almost-breadth-first crawling. Structure of the Web ... The Mercator web crawler. A high-performance web crawler. Downloads and processes web pages ...

Slides: 37

Provided by: joyda

Learn more at: https://dsf.berkeley.edu

Transcript and Presenter's Notes

1
Web Archaeology

Raymie Stata
Compaq Systems Research Center
raymie.stata_at_compaq.com
www.research.compaq.com/SRC

2
What is Web Archaeology?

The study of the content of the Web
exploring the Web
sifting through data
making valuable discoveries
Difficult! Because the Web is
Boundless
Dynamic
Radically decentralized

3
Some recent results

Empirical studies
Quality of almost-breadth-first crawling
Structure of the Web
Random walks (size of search engines)
Improving the Web experience
Better and more precise search results
Surfing assistants and tools
Data mining
Technologies for page scraping

4
Tools for Web scale research

Apps: use data
Search quality, Crawl quality, Duplicate elimination, Web characterization

Feature Databases: access subset of data fast
Full-text index, shingleprints, Connectivity, Term vectors

Data storage: store and access web pages
Myriad

Data collection: download web pages
Mercator, Web Language
5
Web-scale crawling
Mercator
Atrax
6
The Mercator web crawler

A high-performance web crawler
Downloads and processes web pages
Written entirely in Java
Runs on any Java-capable platform
Extensible
Wide range of possible configurations
Users can plug in new modules at run-time

7
Mercator design points

Extensible
Well-chosen extensibility points
Framework for configuration
Multiple threads, synchronous I/O
vs. single thread, asynchronous I/O
Checkpointing
Allows crawls to be restarted
Modules export prepare, commit
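
A minimal sketch of how a prepare/commit checkpoint could work, assuming a two-phase scheme in which every module first stages its state and the staged files only replace the previous checkpoint once all modules have staged successfully. The class FrontierModule, its JSON file format, and the checkpoint helper are hypothetical illustrations, not Mercator's actual API (Mercator itself is written in Java).

```python
import json
import os
import tempfile


class FrontierModule:
    """Hypothetical crawler module holding the URLs still to be fetched."""

    def __init__(self, path):
        self.path = path      # location of the committed checkpoint file
        self.pending = []     # in-memory state to be checkpointed
        self._staged = None   # temporary file written by prepare()

    def prepare(self):
        # Phase 1: write the state to a temporary file beside the target.
        directory = os.path.dirname(self.path) or "."
        fd, self._staged = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(self.pending, f)

    def commit(self):
        # Phase 2: atomically replace the previous checkpoint.
        os.replace(self._staged, self.path)
        self._staged = None

    def restore(self):
        # Reload the state from the last committed checkpoint.
        with open(self.path) as f:
            self.pending = json.load(f)


def checkpoint(modules):
    # Stage every module first; commit only after all have staged.
    for module in modules:
        module.prepare()
    for module in modules:
        module.commit()
```

Restarting a crawl would then amount to constructing the modules and calling restore() on each before resuming.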

8
System Architecture
9
Crawl quality
10
Atrax, a distributed version of Mercator

Distributes load across cluster of crawlers
Partitions data structures across crawlers
No central bottleneck
Network bandwidth is limiting factor
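
One way to partition crawl state with no central coordinator is to hash each URL's host name to a crawler index, so every crawler can decide locally who owns a URL. The sketch below assumes host-based hashing; the slide does not say which partitioning function Atrax uses, and crawler_for is a hypothetical name.

```python
import hashlib
from urllib.parse import urlsplit


def crawler_for(url, num_crawlers):
    """Return the index of the crawler responsible for this URL (assumed scheme)."""
    host = urlsplit(url).hostname or ""
    # A stable hash is needed: Python's built-in hash() is salted per process.
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers
```

Because all URLs on one host map to the same crawler, per-host data such as politeness state would stay on a single machine under this assumption.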

11
Performance of Atrax vs Mercator
12
Myriad -- new project

A very large, archival storage system
Scalable to a petabyte
With function shipping
Supports data mining

13
Myriad Requirements

Large (up to 10K disks)
Commodity hardware (low cost)
Easy to manage
Easy to use (queries vs. code)
Fault tolerance containment
No backups, tape or otherwise

14
Two phases of Myriad project

Define service-level interface
Implemented to run on collections of files
Testing and tuning
Build scalable implementation
Cluster storage and processing
Designing now, prototype in summer
Won't describe today

15
New service level interface
file systems, databases, Myriad

Better suited to this problem and scale
Supports function shipping

16
Myriad interface

Single table database
Stored vs. virtual columns
Virtual columns computed by injected code
Bulk input of new records
Management of code defining virtual columns
Output via select/project queries
User-defined code run implicitly
Support for repeatable random sampling

17
Example Myriad query

samplingprob = 0.1,
samplingseed = 321223421332
select name, length where
insertionDate < Date(00/01/01)
mimeType = text/html
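
The query above can be mimicked in a few lines to show how select/project, virtual columns and repeatable sampling fit together. This is a sketch under several assumptions: records are plain dictionaries, length is treated as a virtual column computed from a hypothetical body field, sampling hashes the seed together with the record name, and Date(00/01/01) is read as 1 January 2000. None of this is Myriad's actual interface.

```python
import hashlib
from datetime import date


def sampled(key, prob, seed):
    # Deterministic per-record coin flip: same seed and key give the same answer.
    h = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    return int.from_bytes(h[:8], "big") / 2**64 < prob


def query(records, columns, where, prob=1.0, seed=0, virtual=None):
    """Select/project over a single table; no sorts, joins or groupings."""
    virtual = virtual or {}
    for rec in records:
        if not sampled(rec["name"], prob, seed):
            continue
        row = dict(rec)
        for col, fn in virtual.items():
            row[col] = fn(rec)          # user-defined code run implicitly
        if where(row):
            yield {c: row[c] for c in columns}


records = [
    {"name": "a.html", "body": "hello", "insertionDate": date(1999, 5, 1), "mimeType": "text/html"},
    {"name": "b.gif", "body": "xx", "insertionDate": date(1999, 6, 1), "mimeType": "image/gif"},
    {"name": "c.html", "body": "newer page", "insertionDate": date(2000, 3, 1), "mimeType": "text/html"},
]
rows = list(query(
    records,
    ["name", "length"],
    lambda r: r["insertionDate"] < date(2000, 1, 1) and r["mimeType"] == "text/html",
    virtual={"length": lambda r: len(r["body"])},
))
```

Running the same query twice with the same samplingprob and samplingseed visits the same subset of records, which is what makes the sample repeatable.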

18
Model for large-scale data mining

Step 1: make an extract
Do data-parallel select and project
Don't do any sorts, joins, groupings
Step 2: put extract into high-power analysis tool
Joins, sorts, groupings

19
Feature Databases

URL DB
URL → pgid
Host DB
pgid → hostid
Link DB
out: pgid → pgid
in: pgid → pgid
Term vector DB
pgid → term vector

20
URL database prefix compression
http://kiva.net/markh/surnames.html
http://kiwi-us.com/amigo/links/index.htm
http://kiwi.emse.fr/
http://kiwi.etri.re.kr/khshim/internet/bookmark.html
http://kiwi.etri.re.kr/ksw/bookmark
http://kiwi.futuris.net/linen
http://kiwi.futuris.net/linen/special/backiss.html

Prefix compress

0 http://kiva.net/markh/surnames.html
9 wi-us.com/amigo/links/index.htm
11 .emse.fr/
13 tri.re.kr/khshim/internet/bookmark.html
25 sw/bookmark
12 futuris.net/linen
29 /special/backiss.html
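
The prefix compression shown on this slide stores, for each URL in a sorted list, the number of leading characters it shares with the previous URL followed by the remaining suffix. A minimal sketch follows; the function names are illustrative, and the on-disk layout of the real URL database is not described in the slides.

```python
def prefix_compress(sorted_urls):
    """Turn a sorted URL list into (shared_prefix_length, suffix) pairs."""
    pairs = []
    prev = ""
    for url in sorted_urls:
        n = 0
        limit = min(len(prev), len(url))
        while n < limit and prev[n] == url[n]:
            n += 1
        pairs.append((n, url[n:]))
        prev = url
    return pairs


def prefix_decompress(pairs):
    """Rebuild the URL list by reusing the prefix of the previous URL."""
    urls = []
    prev = ""
    for n, suffix in pairs:
        prev = prev[:n] + suffix
        urls.append(prev)
    return urls
```

Decompression is a single pass of slicing and concatenation per URL, which fits the slide's point that prefix-compressed URLs are fast to decode. Recovering one URL means decoding forward from an entry with prefix length 0.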
21
URL compression

Prefix compression
44 → 14.5 bytes/URL
Fast to decompress (10 µs)
ZIP compression
14.5 → 9.2 bytes/URL
Slow to decompress (80 µs)

22
Term vector basics

Basic abstraction for information retrieval
Useful for measuring semantic similarity of text

A row in the above table is a term vector
Columns are word stems and phrases
Trying to capture meaning

23
Compressing term vectors

Sparse representation
Only store columns with non-zero counts
Lossy representation
Only store important columns
Importance determined by
Count of term on page (high => important)
Number of pages with term (low => important)
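
The two importance signals above (a high count on the page, a low number of pages containing the term) are the ingredients of a tf-idf style weight. The sketch below assumes that exact weighting, count × log(N / document frequency), and keeps only the top few terms; the slides do not give the actual formula or cutoff, so both are assumptions, as are the function and parameter names.

```python
import math
from collections import Counter


def compress_term_vector(terms, doc_freq, num_docs, keep=50):
    """Sparse, lossy term vector: keep the `keep` most important terms."""
    counts = Counter(terms)  # sparse: only terms that occur on the page
    # Assumed importance score: term count times log inverse document frequency.
    score = {
        t: c * math.log(num_docs / doc_freq.get(t, 1))
        for t, c in counts.items()
    }
    top = sorted(score, key=lambda t: (-score[t], t))[:keep]
    return {t: counts[t] for t in top}
```

For example, with 1000 pages in total, a page with terms web, web, crawler, the, the, the and document frequencies web=50, crawler=10, the=1000 keeps web and crawler when keep=2, since a term that appears on every page scores zero.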

24
TVDB Builder
25
Applications

Categorizing pages
Topic distillation
Filtering pages
Identifying languages
Identifying running text
Relevance feedback (more like this)
Abstracting pages

26
Categorization

Bulls take over

27
How to categorize a page

Off line
Collect training set of pages per category (30K)
Combine training pages into category vectors
10K terms per category vector
On line
Use term vector DB to look up vector of page
Find category vector that best matches this page vector
Use a Bayesian classifier to match vectors
Give no category if match not definitive
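
The on-line step can be sketched with a multinomial naive Bayes match between the page's term vector and each category vector. The slide only says "a Bayesian classifier", so the specific model, the add-one smoothing, the uniform prior and the min_posterior cutoff used to decide whether a match is definitive are all assumptions.

```python
import math


def categorize(page_vec, category_vecs, vocab_size, min_posterior=0.9):
    """Return the best-matching category, or None if the match is not definitive."""
    log_scores = {}
    for cat, cvec in category_vecs.items():
        total = sum(cvec.values())
        score = 0.0
        for term, count in page_vec.items():
            # Add-one smoothing so unseen terms do not zero out the likelihood.
            p = (cvec.get(term, 0) + 1) / (total + vocab_size)
            score += count * math.log(p)
        log_scores[cat] = score
    best = max(log_scores, key=log_scores.get)
    # Posterior of the best category under a uniform prior (log-sum-exp trick).
    top = log_scores[best]
    posterior = 1.0 / sum(math.exp(s - top) for s in log_scores.values())
    return best if posterior >= min_posterior else None
```

A page whose terms are equally likely under two categories gets a posterior of about 0.5 for each and is left uncategorized.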

28
Topic drift in topic distillation

Some Web IR algorithms have this structure
Compute a seed set on a query
Find neighborhood by following links
Rank this neighborhood
Topic drift (a problem)
The neighborhood graph includes off-topic nodes
Download Microsoft Explorer → MS Home page

29
Avoid topic drift with term vectors

Combine term vectors of seed set into topic vector
Detecting topic drift in neighboring nodes
Combine topic vector with node's term vector
Inner product works fine
Expunge or weight
Integration of feature databases helps!!
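
A small sketch of the drift check: sum the seed set's term vectors into a topic vector, score each neighboring node by its inner product with that vector, and expunge nodes that score too low. The sketch normalizes the inner product (cosine similarity) so that a single threshold works across pages of different length; the normalization, the threshold value and the function names are assumptions.

```python
import math
from collections import Counter


def combine(vectors):
    """Sum the seed pages' term vectors into one topic vector."""
    topic = Counter()
    for vec in vectors:
        topic.update(vec)
    return topic


def relevance(topic, node_vec):
    """Normalized inner product between the topic vector and a node's vector."""
    dot = sum(w * node_vec.get(t, 0) for t, w in topic.items())
    norm = math.sqrt(sum(w * w for w in topic.values())) * \
        math.sqrt(sum(w * w for w in node_vec.values()))
    return dot / norm if norm else 0.0


def prune(neighborhood, topic, threshold=0.2):
    """Expunge nodes whose term vector has drifted away from the topic."""
    return {
        url: vec for url, vec in neighborhood.items()
        if relevance(topic, vec) >= threshold
    }
```

Instead of expunging, the relevance score could be used as a weight on the node when ranking the neighborhood.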

30
Link database

Goals
Fit links into RAM (fast lookup)
Build in 24 hours
Applications
Page ranking
Web structure
Mirror site detection
Related page detection

31
Link storage baseline design
[Figure: labels Links, Starts, Id; first row of values ... 104 105 106 107 108 ...; second row 106 115 101 72 208 111 ...]
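
A baseline layout consistent with the figure labels is two flat arrays: a Starts array with one entry per page id, giving the offset of that page's first link, and a Links array holding all target ids back to back. The sketch below assumes this reading of the figure; the way the example links are grouped per page is made up for illustration.

```python
def build_link_store(adjacency):
    """Flatten per-page link lists into (starts, links) arrays."""
    starts = []
    links = []
    for targets in adjacency:
        starts.append(len(links))
        links.extend(targets)
    starts.append(len(links))  # sentinel marking the end of the last list
    return starts, links


def out_links(starts, links, first_id, pgid):
    """Look up the links of page `pgid` with two array reads and a slice."""
    i = pgid - first_id
    return links[starts[i]:starts[i + 1]]


# Hypothetical grouping of the figure's link values for pages 104..107.
starts, links = build_link_store([[106, 115], [101], [72, 208], [111]])
```

Because lookup is plain array indexing, this layout gives fast access as long as both arrays fit in RAM.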
32
Link storage deltas
[Figure: labels Link Deltas, Starts, Id; rows of values ... 104 105 106 107 108 ...; 106 115 101 72 208 111 ...; deltas 2 9 -4 -31 136 4 ...]
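
Delta storage replaces each target id with a small difference. The sketch below assumes the first link of a page is stored relative to the page's own id and each later link relative to the previous link. That assumption reproduces the first deltas in the figure (page 104 with links 106 and 115 gives 2 and 9), but the slide does not state the rule explicitly.

```python
def to_deltas(source, targets):
    """Encode a page's links as differences (assumed scheme, see lead-in)."""
    deltas = []
    prev = source
    for t in targets:
        deltas.append(t - prev)
        prev = t
    return deltas


def from_deltas(source, deltas):
    """Rebuild the absolute target ids from the stored differences."""
    targets = []
    prev = source
    for d in deltas:
        prev += d
        targets.append(prev)
    return targets
```

Links tend to point at pages with nearby ids, so the deltas are mostly small numbers that compress well.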
33
Link storage compression
Variable-length encoding: 1.7 bytes/link
[Figure: labels Link Deltas, Starts, Id; rows of values ... 104 105 106 107 108 ...; 106 115 101 72 208 111 ...; deltas 2 9 -4 -31 136 4 ...; encoded sizes 4 bits, 8 bits, 8 bits, 8 bits, 12 bits, 8 bits]
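
One variable-length code that reproduces the sizes listed on this slide (4, 8, 8, 8, 12 and 8 bits for the deltas 2, 9, -4, -31, 136 and 4) uses 4-bit units, each carrying one continuation bit and three data bits, with the sign stored in the lowest data bit. This is a guess that happens to match the slide's numbers, not a confirmed description of the link database's format.

```python
def encode_delta(d):
    """Encode one signed delta as a list of 4-bit units (assumed format)."""
    value = (abs(d) << 1) | (1 if d < 0 else 0)  # sign in the lowest bit
    groups = []
    while True:
        groups.append(value & 0b111)  # three data bits per unit
        value >>= 3
        if value == 0:
            break
    groups.reverse()
    # Continuation bit (0b1000) is set on every unit except the last.
    return [g | 0b1000 for g in groups[:-1]] + [groups[-1]]


def decode_deltas(nibbles):
    """Decode a stream of 4-bit units back into signed deltas."""
    deltas = []
    value = 0
    for n in nibbles:
        value = (value << 3) | (n & 0b111)
        if not n & 0b1000:  # last unit of this delta
            magnitude, negative = value >> 1, value & 1
            deltas.append(-magnitude if negative else magnitude)
            value = 0
    return deltas


slide_deltas = [2, 9, -4, -31, 136, 4]
bit_sizes = [4 * len(encode_delta(d)) for d in slide_deltas]
```

In a real store, two such units would be packed per byte; the list-of-integers form here is only for readability.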
34
LDBng
35
The future of Web Archaeology