Title: Web Archaeology
1Web Archaeology
- Raymie Stata
- Compaq Systems Research Center
- raymie.stata_at_compaq.com
- www.research.compaq.com/SRC
2What is Web Archaeology?
- The study of the content of the Web
- exploring the Web
- sifting through data
- making valuable discoveries
- Difficult! Because the Web is
- Boundless
- Dynamic
- Radically decentralized
3Some recent results
- Empirical studies
- Quality of almost-breadth-first crawling
- Structure of the Web
- Random walks (size of search engines)
- Improving the Web experience
- Better and more precise search results
- Surfing assistants and tools
- Data mining
- Technologies for page scraping
4Tools for Web scale research
Apps
- Use data
- Search quality
- Crawl quality
- Duplicate elimination
- Web characterization
Feature Databases
- Access subset of data fast
- Full-text index, shingleprints
- Connectivity, term vectors
Data storage
- Store and access web pages
- Myriad
Data collection
- Download web pages
- Mercator, Web Language
5Web-scale crawling
Mercator
Atrax
6The Mercator web crawler
- A high-performance web crawler
- Downloads and processes web pages
- Written entirely in Java
- Runs on any Java-capable platform
- Extensible
- Wide range of possible configurations
- Users can plug in new modules at run-time
7Mercator design points
- Extensible
- Well-chosen extensibility points
- Framework for configuration
- Multiple threads, synchronous I/O
- vs. single thread, asynchronous I/O
- Checkpointing
- Allows crawls to be restarted
- Modules export prepare, commit
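The prepare/commit pair above suggests a two-phase checkpoint. The following is a minimal sketch of that idea, not Mercator's actual code (Mercator is written in Java): the class `CheckpointableModule`, the `checkpoint` function, and the file layout are all hypothetical names chosen for illustration.

```python
import json
import os
import tempfile


class CheckpointableModule:
    """Hypothetical crawler module that saves its state in two phases."""

    def __init__(self, name, directory):
        self.name = name
        self.state = {}
        self._final = os.path.join(directory, name + ".ckpt")
        self._pending = self._final + ".tmp"

    def prepare(self):
        # Phase 1: write state to a side file; the last checkpoint is untouched.
        with open(self._pending, "w") as f:
            json.dump(self.state, f)

    def commit(self):
        # Phase 2: atomically replace the previous checkpoint with the new one.
        os.replace(self._pending, self._final)

    def restore(self):
        # Reload the most recently committed checkpoint.
        with open(self._final) as f:
            self.state = json.load(f)


def checkpoint(modules):
    # Commit only after every module has prepared successfully.
    for m in modules:
        m.prepare()
    for m in modules:
        m.commit()


workdir = tempfile.mkdtemp()
frontier = CheckpointableModule("frontier", workdir)
frontier.state = {"queue": ["http://example.com/"]}
checkpoint([frontier])
frontier.state = {}   # simulate a crash that loses in-memory state
frontier.restore()    # the crawl can resume from the checkpoint
```

Splitting the work this way means a failure during any module's prepare leaves every module's previous checkpoint intact.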
8System Architecture
9Crawl quality
10Atrax, a distributed version of Mercator
- Distributes load across cluster of crawlers
- Partitions data structures across crawlers
- No central bottleneck
- Network bandwidth is limiting factor
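One way to partition work with no central coordinator is to let a hash of each URL's host decide which crawler owns it. The sketch below assumes host-based hashing; the slides do not say which key Atrax partitions on, and `owner_of` and `route` are hypothetical names.

```python
import hashlib
from urllib.parse import urlsplit


def owner_of(url, num_crawlers):
    """Return the index of the crawler responsible for this URL's host."""
    host = urlsplit(url).hostname or ""
    # A cryptographic digest gives the same answer on every machine,
    # unlike Python's built-in hash(), which is randomized per process.
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers


def route(urls, num_crawlers):
    """Group newly discovered URLs by the crawler that should fetch them."""
    batches = {i: [] for i in range(num_crawlers)}
    for url in urls:
        batches[owner_of(url, num_crawlers)].append(url)
    return batches
```

Because every crawler computes the same function locally, a discovered link can be forwarded directly to its owner without consulting any shared table.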
11Performance of Atrax vs Mercator
12Myriad -- new project
- A very large, archival storage system
- Scalable to a petabyte
- With function shipping
- Supports data mining
13Myriad Requirements
- Large (up to 10K disks)
- Commodity hardware (low cost)
- Easy to manage
- Easy to use (queries vs. code)
- Fault tolerance containment
- No backups, tape or otherwise
14Two phases of Myriad project
- Define service-level interface
- Implemented to run on collections of files
- Testing and tuning
- Build scalable implementation
- Cluster storage and processing
- Designing now, prototype in summer
- Won't describe today
15New service level interface
file systems, databases, Myriad
- Better suited to this problem and scale
- Supports function shipping
16Myriad interface
- Single table database
- Stored vs. virtual columns
- Virtual columns computed by injected code
- Bulk input of new records
- Management of code defining virtual columns
- Output via select/project queries
- User-defined code run implicitly
- Support for repeatable random sampling
17Example Myriad query
- samplingprob=0.1,
- samplingseed=321223421332
- select name, length where
- insertionDate < Date(00/01/01)
- mimeType = text/html
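To make the query semantics concrete, here is a small in-memory sketch of select/project with repeatable sampling and a virtual column. It is not Myriad's interface: `in_sample` and `query` are hypothetical, `name` is assumed to be a unique record key, `length` is assumed to be a virtual column, and `Date(00/01/01)` is read as 1 January 2000.

```python
import hashlib
from datetime import date


def in_sample(record_key, prob, seed):
    """Decide deterministically whether a record belongs to the sample."""
    digest = hashlib.sha1(f"{seed}:{record_key}".encode("utf-8")).digest()
    # Compare as integers so prob=1.0 always accepts and prob=0.0 never does.
    return int.from_bytes(digest[:8], "big") < prob * 2**64


def query(records, columns, where, prob=1.0, seed=0, virtual=None):
    """Select rows matching `where`, then project them onto `columns`."""
    virtual = virtual or {}
    for rec in records:
        if not in_sample(rec["name"], prob, seed):
            continue
        row = dict(rec)
        # Virtual columns are computed on demand by user-supplied code.
        for col, fn in virtual.items():
            row[col] = fn(rec)
        if where(row):
            yield tuple(row[c] for c in columns)


records = [
    {"name": "a.html", "body": "<html>hi</html>",
     "insertionDate": date(1999, 6, 1), "mimeType": "text/html"},
    {"name": "b.gif", "body": "GIF89a",
     "insertionDate": date(1999, 7, 1), "mimeType": "image/gif"},
    {"name": "c.html", "body": "<p>new</p>",
     "insertionDate": date(2000, 3, 1), "mimeType": "text/html"},
]
virtual = {"length": lambda r: len(r["body"])}


def where(r):
    return r["insertionDate"] < date(2000, 1, 1) and r["mimeType"] == "text/html"


rows = list(query(records, ["name", "length"], where, virtual=virtual))
```

Deriving the sampling decision from the seed and the record key, rather than from a running random-number generator, is what makes the sample repeatable: the same seed selects the same records on every run.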
18Model for large-scale data mining
- Step 1: make an extract
- Do data-parallel select and project
- Don't do any sorts, joins, groupings
- Step 2: put extract into high-power analysis tool
- Joins, sorts, groupings
19Feature Databases
- URL DB
- URL → pgid
- Host DB
- pgid → hostid
- Link DB
- out: pgid → pgid
- in: pgid → pgid
- Term vector DB
- pgid → term vector
20URL database prefix compression
http://kiva.net/markh/surnames.html
http://kiwi-us.com/amigo/links/index.htm
http://kiwi.emse.fr/
http://kiwi.etri.re.kr/~khshim/internet/bookmark.html
http://kiwi.etri.re.kr/~ksw/bookmark
http://kiwi.futuris.net/linen
http://kiwi.futuris.net/linen/special/backiss.html
Prefix compress
0 http://kiva.net/markh/surnames.html
9 wi-us.com/amigo/links/index.htm
11 .emse.fr/
13 tri.re.kr/~khshim/internet/bookmark.html
25 sw/bookmark
12 futuris.net/linen
29 /special/backiss.html
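The scheme above stores, for each URL in sorted order, the number of leading characters it shares with the previous URL, followed by the remaining suffix. A minimal sketch follows; the function names are hypothetical, and the URLs are written with `://` and with `~` in the two etri.re.kr paths, which is the reading that the shared-prefix count of 25 implies.

```python
import os


def prefix_compress(sorted_urls):
    """Encode each URL as (shared prefix length, remaining suffix)."""
    entries = []
    prev = ""
    for url in sorted_urls:
        # commonprefix compares character by character, not path components.
        shared = len(os.path.commonprefix([prev, url]))
        entries.append((shared, url[shared:]))
        prev = url
    return entries


def prefix_decompress(entries):
    """Rebuild the URL list by reusing the prefix of the previous URL."""
    urls = []
    prev = ""
    for shared, suffix in entries:
        prev = prev[:shared] + suffix
        urls.append(prev)
    return urls


urls = [
    "http://kiva.net/markh/surnames.html",
    "http://kiwi-us.com/amigo/links/index.htm",
    "http://kiwi.emse.fr/",
    "http://kiwi.etri.re.kr/~khshim/internet/bookmark.html",
    "http://kiwi.etri.re.kr/~ksw/bookmark",
    "http://kiwi.futuris.net/linen",
    "http://kiwi.futuris.net/linen/special/backiss.html",
]
entries = prefix_compress(urls)
```

Decompression is a single pass with one string slice per URL. The cost is that reading one URL requires decoding from the start of its block, so a practical store would likely restart the encoding at regular intervals.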
21URL compression
- Prefix compression
- 44 → 14.5 bytes/URL
- Fast to decompress (10 µs)
- ZIP compression
- 14.5 → 9.2 bytes/URL
- Slow to decompress (80 µs)
22Term vector basics
- Basic abstraction for information retrieval
- Useful for measuring semantic similarity of text
- A row in the above table is a term vector
- Columns are word stems and phrases
- Trying to capture meaning
23Compressing term vectors
- Sparse representation
- Only store columns with non-zero counts
- Lossy representation
- Only store important columns
- Importance determined by
- Count of term on page (high => important)
- Number of pages with term (low => important)
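The two steps above can be sketched as follows. The slide only says that a high count on the page and a low number of pages containing the term both raise importance; the specific weight used here (count times log of inverse document frequency) is an assumed stand-in, and all names are hypothetical.

```python
import math
from collections import Counter


def term_vector(tokens):
    """Sparse representation: only terms that occur get an entry."""
    return Counter(tokens)


def compress(vector, doc_freq, num_pages, keep):
    """Lossy step: keep only the `keep` highest-weighted terms."""

    def weight(term):
        # Frequent on this page and rare across pages => important.
        return vector[term] * math.log(num_pages / doc_freq[term])

    # Break ties alphabetically so the result is deterministic.
    top = sorted(vector, key=lambda t: (-weight(t), t))[:keep]
    return {t: vector[t] for t in top}


page = term_vector("the crawler the web the crawler mercator".split())
doc_freq = {"the": 1000, "crawler": 10, "web": 500, "mercator": 2}
small = compress(page, doc_freq, num_pages=1000, keep=2)
```

In this toy example "the" is dropped despite having the highest count, because it appears on every page and so carries no weight.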
24TVDB Builder
25Applications
- Categorizing pages
- Topic distillation
- Filtering pages
- Identifying languages
- Identifying running text
- Relevance feedback (more like this)
- Abstracting pages
26Categorization
27How to categorize a page
- Off line
- Collect training set of pages per category (30K)
- Combine training pages into category vectors
- 10K terms per category vector
- On line
- Use term vector DB to look up vector of page
- Find category vector that best matches this page vector
- Use a Bayesian classifier to match vectors
- Give no category if match not definitive
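The on-line step can be illustrated with a small multinomial naive Bayes matcher. This is a sketch under assumptions, not the classifier used in the talk: uniform category priors, add-one smoothing, and a log-likelihood margin (`min_margin`) as the test for a "definitive" match are all choices made here for illustration.

```python
import math


def log_likelihood(page_vec, cat_vec, vocab_size):
    """Multinomial naive Bayes score with add-one smoothing."""
    total = sum(cat_vec.values())
    return sum(
        count * math.log((cat_vec.get(term, 0) + 1) / (total + vocab_size))
        for term, count in page_vec.items()
    )


def categorize(page_vec, category_vecs, min_margin=1.0):
    """Return the best category, or None if the match is not definitive."""
    vocab = set(page_vec)
    for vec in category_vecs.values():
        vocab.update(vec)
    scores = sorted(
        ((log_likelihood(page_vec, vec, len(vocab)), name)
         for name, vec in category_vecs.items()),
        reverse=True,
    )
    if not scores:
        return None
    # Refuse to answer when the runner-up is almost as likely as the winner.
    if len(scores) > 1 and scores[0][0] - scores[1][0] < min_margin:
        return None
    return scores[0][1]


categories = {
    "sports": {"goal": 50, "team": 40, "match": 10},
    "tech": {"crawler": 40, "java": 40, "web": 20},
}
label = categorize({"crawler": 3, "web": 2}, categories)
unsure = categorize({"unknown": 1}, categories)
```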
28Topic drift in topic distillation
- Some Web IR algorithms have this structure
- Compute a seed set on a query
- Find neighborhood by following links
- Rank this neighborhood
- Topic drift (a problem)
- The neighborhood graph includes off-topic nodes
- Download Microsoft Explorer → MS Home page
29Avoid topic drift with term vectors
- Combine term vectors of seed set into topic vector
- Detecting topic drift in neighboring nodes
- Combine topic vector with node's term vector
- Inner product works fine
- Expunge or weight
- Integration of feature databases helps!!
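The procedure above can be sketched in a few lines. Summing the seed vectors to form the topic vector and using a fixed cut-off for expunging are assumptions made for this illustration; the slide says only that the vectors are combined and that an inner product works.

```python
from collections import Counter


def topic_vector(seed_vectors):
    """Combine the seed set's term vectors by summing them."""
    topic = Counter()
    for vec in seed_vectors:
        topic.update(vec)
    return topic


def inner_product(a, b):
    """Dot product of two sparse term vectors."""
    return sum(count * b.get(term, 0) for term, count in a.items())


def prune(neighborhood, topic, threshold):
    """Expunge nodes scoring below `threshold`; return weights for the rest."""
    weights = {}
    for node, vec in neighborhood.items():
        score = inner_product(topic, vec)
        if score >= threshold:
            weights[node] = score
    return weights


seeds = [{"browser": 2, "download": 1}, {"browser": 1, "explorer": 3}]
topic = topic_vector(seeds)
neighborhood = {
    "on_topic": {"browser": 1, "explorer": 1},
    "drifted": {"stock": 5, "investor": 2},
}
kept = prune(neighborhood, topic, threshold=1)
```

The returned scores can be used directly as node weights in the ranking step, which covers the "weight" alternative to expunging.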
30Link database
- Goals
- Fit links into RAM (fast lookup)
- Build in 24 hours
- Applications
- Page ranking
- Web structure
- Mirror site detection
- Related page detection
31Link storage baseline design
Id: ... 104 105 106 107 108 ...
Starts: ... (indexed by Id, points into Links)
Links: ... 106 115 101 72 208 111 ...
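The baseline layout can be sketched as two flat arrays: `starts[pgid]` is the offset in `links` where that page's out-links begin. The function names are hypothetical, and sorting each page's links and adding a final sentinel offset are assumptions of this sketch.

```python
def build_link_db(adjacency, num_pages):
    """Flatten {pgid: [target pgids]} into `starts` and `links` arrays."""
    starts, links = [], []
    for pgid in range(num_pages):
        starts.append(len(links))
        links.extend(sorted(adjacency.get(pgid, [])))
    starts.append(len(links))  # sentinel so every page has an end offset
    return starts, links


def out_links(starts, links, pgid):
    """Look up a page's out-links with two array reads and a slice."""
    return links[starts[pgid]:starts[pgid + 1]]


starts, links = build_link_db({0: [2, 1], 2: [0]}, num_pages=3)
```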
32Link storage deltas
Link Deltas
Id: ... 104 105 106 107 108 ...
Starts: ... (indexed by Id, points into Links)
Links: ... 106 115 101 72 208 111 ...
Deltas: ... 2 9 -4 -31 136 4 ...
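A sketch of one plausible delta scheme: the first link of a page is stored relative to the page's own Id, and each later link relative to the previous link. This reading is an assumption; it reproduces the 2, 9 and -4 shown above for Ids 104 and 105, but the transcript does not show how the remaining entries are divided among Ids.

```python
def to_deltas(pgid, targets):
    """Replace absolute link targets with differences."""
    deltas, prev = [], pgid
    for target in targets:
        deltas.append(target - prev)
        prev = target
    return deltas


def from_deltas(pgid, deltas):
    """Recover absolute link targets from the differences."""
    targets, prev = [], pgid
    for delta in deltas:
        prev += delta
        targets.append(prev)
    return targets
```

Because pages tend to link to pages with nearby Ids, the differences are usually much smaller than the Ids themselves, which is what makes a variable-length encoding pay off.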
33Link storage compression
Link Deltas
Variable-length encoding: 1.7 bytes/link
Id: ... 104 105 106 107 108 ...
Starts: ... (indexed by Id, points into Links)
Links: ... 106 115 101 72 208 111 ...
Deltas: ... 2 9 -4 -31 136 4 ...
Bits: ... 4 8 8 8 12 8 ...
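The slide does not specify the code itself. One encoding that is consistent with the bit widths listed above is a nibble-based code in which each 4-bit unit carries a continuation flag and 3 payload bits, with the sign stored as the lowest payload bit. The sketch below implements that assumed scheme; the function names are hypothetical.

```python
def encode_delta(delta):
    """Encode one signed delta as a list of 4-bit values."""
    # Sign goes in the lowest bit, magnitude in the bits above it.
    payload = (abs(delta) << 1) | (1 if delta < 0 else 0)
    nibbles = []
    while True:
        group = payload & 0b111
        payload >>= 3
        if payload:
            nibbles.append(0b1000 | group)  # high bit set: more nibbles follow
        else:
            nibbles.append(group)           # high bit clear: last nibble
            return nibbles


def decode_deltas(nibbles):
    """Decode a stream of 4-bit values back into signed deltas."""
    deltas = []
    payload, shift = 0, 0
    for nib in nibbles:
        payload |= (nib & 0b111) << shift
        shift += 3
        if not nib & 0b1000:
            magnitude = payload >> 1
            deltas.append(-magnitude if payload & 1 else magnitude)
            payload, shift = 0, 0
    return deltas


deltas = [2, 9, -4, -31, 136, 4]
widths = [4 * len(encode_delta(d)) for d in deltas]
stream = [nib for d in deltas for nib in encode_delta(d)]
```

For the six deltas on the slide this scheme yields 4, 8, 8, 8, 12 and 8 bits, matching the widths shown. A byte-oriented varint would be simpler to decode but could never use fewer than 8 bits per link.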
34LDBng
35The future of Web Archaeology
- Driving applications
- Web search -- finding things on the web
- Page classification (topic, community, type)
- Purpose-specific search
- Web asset management (what's on my site?)
- Automated information extraction (price robots)
- Multi-billion page web
- Dynamics