1
The Web Laboratory
Goals, progress report, and research challenges
http://www.cs.cornell.edu/wya/weblab/
A project of Cornell University and the Internet
Archive
2
The Internet Archive
3
The Internet Archive Web Collection
The Data
• Complete crawls of the Web, every two months since 1996, with some gaps
• Range of formats and depth of crawl have increased with time
• No sites that are protected by robots.txt or where owners requested not to be archived
• Some missing or lost data
4
The Internet Archive Web Collection
Sizes
• Current crawls are about 40-60 TByte (compressed)
• Total archive is about 600 TByte (compressed)
• Compression ratio is up to 25:1; best guess of the overall average is 10:1
• Rate of increase is about 1 TByte/day (compressed)
• Note that the total storage requirement is reduced because much data does not change between crawls
(A quick arithmetic check of these figures follows below.)
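A rough consistency check of the sizes above, using only the figures on this slide (the 50 TByte midpoint is my interpolation):

```python
# Back-of-the-envelope arithmetic from the figures on this slide.
crawl_compressed_tb = 50     # midpoint of the 40-60 TByte range (assumed)
total_compressed_tb = 600
avg_ratio = 10               # best-guess overall compression ratio, 10:1

# Implied uncompressed sizes under the stated average ratio.
print(f"one crawl, uncompressed: ~{crawl_compressed_tb * avg_ratio} TByte")
print(f"total archive, uncompressed: ~{total_compressed_tb * avg_ratio / 1000:.0f} PByte")
# 600 TByte x 10 = 6 PByte -- hence a petabyte-scale data store.
```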
5
Motivation: Social Science Research
The Web as a social phenomenon
• Political campaigns
• Online retailing
The Web as evidence
• The spread of urban legends ("Einstein failed mathematics")
• Development of legal concepts across time
6
The Petabyte Data Store
• A project of the Cornell CS database group and the Theory Center to support research projects that manage large data sets
• Physical Gantry
  • Measure light-scattering properties of objects
  • Create accurate physical models for graphical rendering
  • Each dataset is 14 TB
• Arecibo Telescope
  • Perform surveys of parts of the sky
  • Analyze the data to find high red-shift pulsars
  • 1 TB/day
• The Web Laboratory

7
Year One System
• 2 × 16-processor Unisys ES7000 servers
• 64 GByte RAM
• 8 GByte/sec aggregate I/O bandwidth
• 2 × 50 TByte RAID online storage
• ADIC Scalar 10K robotic tape library for archive

8
Unisys Server ES7000/430
9
RAID Storage System
10
Web Laboratory
The petabyte data store will allow us to mount several very large portions of the Web online for all types of web research.
• Copy snapshots of the Web from the Internet Archive
• Transport the data to Cornell on a regular basis
• Store it at Cornell and load parts on demand
• Extract feature sets
• Present APIs to researchers (program API, Web Services API); a sketch follows below
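To make the last point concrete, here is a purely hypothetical sketch of what a researcher-facing program API could look like; none of these class or method names come from the project itself:

```python
# Hypothetical illustration only -- not the Web Laboratory's actual API.
class WebLabSnapshot:
    """Handle on one bimonthly crawl snapshot in the petabyte store."""

    def __init__(self, crawl_date: str):
        self.crawl_date = crawl_date
        self._content: dict[str, bytes] = {}    # url -> stored page bytes
        self._links: dict[str, list[str]] = {}  # url -> extracted outlinks

    def load(self, url: str) -> bytes:
        """Load one page from the content store on demand."""
        return self._content[url]

    def outlinks(self, url: str) -> list[str]:
        """Return the extracted link feature set for a page."""
        return self._links.get(url, [])

# A Web Services API would expose the same operations over HTTP.
```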
11
Research Using Web Data
The Web Graph
• Structure and evolution of the Web graph
• Hubs & authorities, PageRank, etc.
• Social networks
Many of the basic studies have only been done once; there are few if any large-scale studies across time.
Typical research needs:
• Graphs of 1 billion pages should be possible in memory (64 GBytes; a rough estimate below)
• Algorithms are needed for processing larger graphs
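A hedged estimate of why a billion-page graph can fit in 64 GBytes, assuming a compressed sparse row layout with 4-byte node IDs; the average out-degree is an illustrative assumption, not a figure from the talk:

```python
# Rough memory estimate for an in-memory Web graph in CSR form.
nodes = 1_000_000_000
avg_out_degree = 10                 # assumed for illustration
edges = nodes * avg_out_degree

# CSR needs one row offset per node (+1) and one column index per edge.
bytes_needed = 4 * (nodes + 1) + 4 * edges
print(f"~{bytes_needed / 2**30:.0f} GiB")   # ~41 GiB, within 64 GBytes
```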
12
In Memory Web Graph
The graph is represented by its adjacency matrix, using a compressed sparse row (CSR) representation. The Cuthill-McKee algorithm is used to reorder the nodes to create dense blocks within the matrix (see the sketch below).
Work by Karthik Jeyabalan and Jerrin Kallukalam
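A minimal sketch of both ideas on a toy graph, using SciPy's CSR matrix and its reverse Cuthill-McKee routine (the reverse variant is what SciPy provides; the example links are made up):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

# Toy adjacency matrix for a 5-page graph; row i holds page i's outlinks.
edges = [(0, 3), (1, 2), (2, 4), (3, 1), (4, 0), (4, 3)]
rows, cols = zip(*edges)
data = np.ones(len(edges), dtype=np.int8)
adj = csr_matrix((data, (rows, cols)), shape=(5, 5))

# Reorder nodes so nonzeros cluster near the diagonal,
# creating the dense blocks mentioned on the slide.
perm = reverse_cuthill_mckee(adj, symmetric_mode=False)
adj_reordered = adj[perm][:, perm]
print(adj_reordered.toarray())
```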
13
Research Using Web Data
Pseudo-crawling: experiments on crawled data
• Focused or selective Web crawling
• Burst analysis
• Digital library selection
Crawling the live Web is complex and unsatisfactory:
• Time consuming and unreliable
• Experiments cannot be repeated because the data changes
• Cannot study changes across time
A pseudo-crawl applies the same algorithms but retrieves the pages from the Web Laboratory; filters allow experiments on subsets of the Web (sketched below).
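A minimal sketch of a pseudo-crawl, assuming hypothetical load_page, extract_links, and keep helpers backed by the stored snapshot rather than the live Web:

```python
from collections import deque

def pseudo_crawl(seeds, load_page, extract_links, keep, limit=10_000):
    """Run ordinary crawler frontier logic, but fetch pages from the
    stored snapshot (load_page) instead of the live Web, so experiments
    are repeatable.  `keep` filters the crawl to a subset of the Web.
    All callables here are hypothetical stand-ins."""
    frontier = deque(seeds)
    seen = set(seeds)
    while frontier and len(seen) < limit:
        url = frontier.popleft()
        page = load_page(url)          # the stored data never changes
        if page is None:
            continue
        for link in extract_links(page):
            if link not in seen and keep(link):
                seen.add(link)
                frontier.append(link)
    return seen
```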
14
Storing the Web Data
• SQL Server database of structural metadata and links
• Content store: page content keyed by a content hash (MD5), zip-compressed (sketched below)
• Web graphs
Work by Pavel Dmitriev and Richard Wang
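A small sketch of the content-store idea, assuming hashing by MD5 and compression as on the slide (zlib stands in here for the slide's "zip"; the helper names are illustrative):

```python
import hashlib
import zlib

content_store: dict[str, bytes] = {}   # MD5 hex digest -> compressed page

def store_page(content: bytes) -> str:
    """Store a page keyed by its content hash, so identical pages
    across crawls are stored only once."""
    digest = hashlib.md5(content).hexdigest()
    if digest not in content_store:     # unchanged pages cost nothing
        content_store[digest] = zlib.compress(content)
    return digest

def fetch_page(digest: str) -> bytes:
    return zlib.decompress(content_store[digest])
```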
15
(No Transcript)
16
Benchmarking the Synthetic Web
A Synthetic Web is a generated graph whose graph properties and distributions of domain names are similar to a Web crawl.
• The R-MAT algorithm is used to generate (URL1, URL2) link pairs (sketched below)
• Satisfies Web graph power laws
• Used for benchmarking and experimentation
Work by Pavel Dmitriev and Shantanu Shah
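A minimal R-MAT edge generator; the quadrant probabilities below are common illustrative values, not parameters from the project, and their skew is what produces the power-law degree distributions mentioned above:

```python
import random

def rmat_edge(scale, a=0.57, b=0.19, c=0.19, d=0.05):
    """Generate one (src, dst) edge of a 2**scale-node R-MAT graph by
    recursively descending into one quadrant of the adjacency matrix
    with probability a, b, c, or d (a + b + c + d == 1)."""
    src = dst = 0
    for _ in range(scale):
        r = random.random()
        src <<= 1
        dst <<= 1
        if r < a:
            pass                     # top-left quadrant
        elif r < a + b:
            dst |= 1                 # top-right
        elif r < a + b + c:
            src |= 1                 # bottom-left
        else:
            src |= 1; dst |= 1       # bottom-right
    return src, dst

edges = [rmat_edge(scale=20) for _ in range(1000)]   # toy-sized sample
```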
17
Social Science Research
The Web as a social phenomenon
• Political campaigns
• Online retailing
The Web as evidence
• Urban legends
• Development of legal concepts across time
Requires:
• Access to Web data by content (Quark?)
• Automated tools to replace hand-coding (NLP, Machine Learning)
• Straightforward interfaces for non-computing specialists (HCI)
18
Work Flow System
• Transfer 1 TByte per day (Internet2 -- 1 gigabit, off peak)
• Store 10 GByte batches of compressed files
• Process raw data: uncompress and unpack ARC and DAT files; create IDs for pages and content hashes; extract links from HTML pages (sketched below)
• Database: load batches of pages, metadata, and links
• Compress and store content files
Work by Mayank Gandhi and Jimmy Sun
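A small sketch of the per-page processing step, assuming an already-unpacked batch of HTML pages; link extraction here uses the Python standard library's HTMLParser, and the record layout is hypothetical:

```python
import hashlib
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags in one HTML page."""
    def __init__(self):
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def process_page(url: str, html: str) -> dict:
    """One workflow step: ID the page by content hash, extract links.
    The returned record would be batch-loaded into the database."""
    parser = LinkExtractor()
    parser.feed(html)
    return {
        "url": url,
        "md5": hashlib.md5(html.encode()).hexdigest(),
        "links": parser.links,
    }
```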
19
Current Status
• Data capture: delays in connecting the Internet Archive to Internet2; testing with a 250 GByte test data set
• Ingest and workflow: under test -- performance challenges
• Database and content store: under test for scalability
• Synthetic Web: 500 million links generated -- data structure under revision
• Web graph: completion scheduled for end of semester
20
The Cornell Team
Researchers: William Arms, Dan Huttenlocher, Jon Kleinberg, Carl Lagoze
Ph.D. Students: Pavel Dmitriev, Selcuk Aya
M.Eng. Students: Mayank Gandhi, Karthik Jeyabalan, Jerrin Kallukalam, Shantanu Shah, Jimmy Yanbo Sun, Richard Wang
Petabyte Data Store: Al Demers, Johannes Gehrke, Dave Lifka, Jai Shanmugasundaram, John Zollweg
21
Thanks
This work would not be possible without the
forethought and longstanding commitment of the
Internet Archive to capture and preserve the
content of the Web for future generations. The
petabyte data store is funded in part by National
Science Foundation grant 0403340, with equipment
support from Unisys. The Cornell Theory Center's
support for this project is funded in part by
Microsoft, Dell and Intel.