1 Cornell Information Science Research Seminar: The Web Lab
http://weblab.infosci.cornell.edu/
- William Y. Arms
- Manuel Calimlim
- Lucy Walle
- Felix Weigel
- January 23, 2007
2 The Web Lab: A Joint Project of Cornell University and the Internet Archive
Faculty: William Arms, Johannes Gehrke, Dan Huttenlocher, Jon Kleinberg, Michael Macy, David Strang, ...
Researchers: Manuel Calimlim, Dave Lifka, Ruth Mitchell, Lucia Walle, Felix Weigel, ...
Students: Selcuk Aya, Pavel Dmitriev, Blazej Kot, with more than 50 M.Eng. and undergraduate students from Information Science and Computer Science
Internet Archive: Brewster Kahle, Tracey Jacquith, Michael Stack, Kris Carpenter, ...
3 Introduction to the Web Lab: Mining the History of the Web
- The Internet Archive's Web Collection
- Complete crawls of the Web, every two months since 1996
- Total archive is about 110,000,000,000 pages (110 billion)
- Recent crawls are about 60 TByte (compressed)
- Total archive is about 1,900 TByte (compressed)
- Metadata contains format, links, anchor text
4 The Library Stacks: the Internet Archive
5 The Wayback Machine
- Demo
- http://www.archive.org/
6 Research using Metadata about Web Pages
- Current NSF grant
- Research using anchor text
- links to microsoft.com and google.com
- Changes to the link structure of the Web
- differences between crawls
- densification (increases in average node degree)
- Formation of online groups
7 Example of Past Work: Social and Information Networks, Joining a Community
- Close to one billion (user, community) instances
- Work by Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan
8 The Never-ending Research Dialog
[Diagram: an ongoing dialog between RESEARCHER and INFORMATION SCIENTIST]
"Here's an analysis we would like to do..."
"Not as you suggest it, but here's another idea..."
"We don't know how to do that analysis. Would this be any use to you?"
"That might be possible, with the following modification..."
"Let's try it and see."
9 The Role of Web Data for Social Science Research
- Social networks are an important research topic
- Emergence of global phenomena from local effects
- Viral spreading of rumors
- Behavior of individuals in a community
- Roles in discussion threads, herd behavior in opinion polls
- Network structure and dynamics
- Strength of weak ties, triangle relations, homophily
10 How to Observe a Social Network?
- Social network research before the web
- Talk to people, make notes
- Distribute questionnaires, gather statistics
- Problems with this approach
- Tedious task
- Small scale
- The Internet Archive is a great resource for research
- Contains web pages with social networks
- Records the history of the pages
11 Social Networks on the Web
- The web contains many social networks
- Sites for social networking, social bookmarking, file sharing
- MySpace, Facebook, Flickr, Delicious
- Community portals
- Yahoo Groups, DBLife
- Encyclopedia and folksonomy projects
- Wikipedia, Wikia
- Review sites and customer comments
- Amazon, Netflix
- Blogs, web forums, Usenet
12 The Bliss and Curse of Digital Data
- Opportunities
- Collecting network data at an unprecedented scale
- Verifying hypotheses in many different networks
- Monitoring communities at a finer granularity
- Mining and searching social networks
- Challenges
- Finding suitable information on the web
- Extracting information from web pages
- Making web data persistent
- Processing very large data sets
- Access rights and privacy
13 Web Lab and Social Science Research
- Collaboration with Cornell's Institute for the Social Sciences
- Our goal: make data available to researchers
- Large web graph database with multiple crawls
- Packaged subsets of crawls for analysis
- Visual extraction tool for creating new data sets (ongoing)
- Small-scale crawling for adding new web sites (starting)
- Full-text indexing (planned)
- Demo of the extraction tool available at
- http://www.cs.cornell.edu/weigel/WrapperDemo/
14 Web Data Extraction
- Researchers often don't care about whole web pages, but about specific substructures inside the pages
- Blog postings
- Web forums
- Social tagging
- News headlines
- Tables of content
- Bibliographies
- Product details
- Customer reviews
15 Web Data Collaboration Server
- Data extraction
- Writing extraction code is a tedious task
- Create tools to make the data easily accessible in a structured format (e.g., tables in a database); see the sketch after this list
- Data sharing
- Extracting the same data repeatedly is a waste of time and storage space
- Let users share their data and extraction rules
- Data curation
- Web data is often incomplete and erroneous
- Let users collaborate to correct and complete the data
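A rough illustration of the extraction idea (not the Web Lab's actual tool; the HTML pattern and table here are invented for this example):

```python
import re
import sqlite3

# Hypothetical extraction rule: a pattern that pulls (author, date, body)
# triples out of blog-posting HTML and stores them as rows in a table.
# The real Web Lab tool is visual; this only illustrates the
# "web page -> structured table" idea.
POSTING = re.compile(
    r'<div class="post">\s*<span class="author">(.*?)</span>\s*'
    r'<span class="date">(.*?)</span>\s*<p>(.*?)</p>',
    re.DOTALL,
)

def extract_postings(html):
    """Apply the extraction rule to one page, yielding structured rows."""
    for author, date, body in POSTING.findall(html):
        yield author.strip(), date.strip(), body.strip()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posting (author TEXT, date TEXT, body TEXT)")

page = """<div class="post"><span class="author">alice</span>
<span class="date">2007-01-23</span><p>Hello, Web Lab!</p></div>"""

db.executemany("INSERT INTO posting VALUES (?, ?, ?)", extract_postings(page))
print(db.execute("SELECT * FROM posting").fetchall())
```

Sharing the compiled rule rather than the extracted rows is what saves the repeated work described above.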
16 Demonstration
- Demo of the extraction tool available at
- http://www.cs.cornell.edu/weigel/WrapperDemo/
17 The Web Lab System
[Diagram: system components at the INTERNET ARCHIVE (Web Collection, Wayback Machine, text indexes) and at CORNELL UNIVERSITY (file server, computer cluster, national supercomputers, structure database, text indexes, page store)]
18 Technical Processing in the Web Lab
- Networking: Internet2, National LambdaRail
- Wayback Machine: commodity computers with local file systems
- Structure database: relational database system on a large shared-memory computer
- Data analysis: specialized Linux cluster with the Hadoop distributed file system and MapReduce programming
Different types of computer for different functions.
19 The Research Process
- Select a sub-set for analysis
- SQL query the relational database directly
- Use the GetPages tool on the Web site to send an SQL query
- Download the sub-set
- To the researcher's computer
- To the Web Lab file server
- Clean up the data
- MapReduce tasks on the Hadoop cluster
- Data analysis
- MapReduce tasks on the Hadoop cluster
20 Selection Methods
- By known identifier (Wayback Machine)
- web pages with the URL http://www.nsf.gov/
- By character string (full-text indexing) -- future
- all pages containing "Internet is doubling every six months"
- all pages containing the SARS-CoV genetic sequence
- By metadata criteria
- all web pages that link to microsoft.com but not to google.com (see the sketch after this list)
- all email addresses that I used to receive mail from but have not had mail from recently (example provided by Marc Smith)
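The metadata criterion above ("link to microsoft.com but not to google.com") could be phrased as a query along these lines; the tables and columns are toy guesses for illustration, not the Web Lab's actual schema:

```python
import sqlite3

# Toy database with guessed tables (Page, Link, Host); see the schema
# slide later in the deck for the real table names.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Page (id INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE Host (id INTEGER PRIMARY KEY, hostname TEXT);
CREATE TABLE Link (from_page INTEGER, to_host INTEGER);
INSERT INTO Page VALUES (1, 'http://a.example/'), (2, 'http://b.example/');
INSERT INTO Host VALUES (10, 'microsoft.com'), (11, 'google.com');
INSERT INTO Link VALUES (1, 10), (2, 10), (2, 11);
""")

# Pages that link to microsoft.com but not to google.com.
rows = db.execute("""
SELECT DISTINCT p.url
FROM Page p JOIN Link l ON l.from_page = p.id
            JOIN Host h ON h.id = l.to_host
WHERE h.hostname = 'microsoft.com'
  AND p.id NOT IN (SELECT l2.from_page
                   FROM Link l2 JOIN Host h2 ON h2.id = l2.to_host
                   WHERE h2.hostname = 'google.com')
""").fetchall()
print(rows)   # [('http://a.example/',)]
```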
21 Benefits of Using a Relational Database
- Simple query language for retrieving data
- Transaction support
- Concurrency control for parallel queries
- Multiple indices for high performance
- Reliability, since databases have built-in recovery functionality
22 Metadata Loading
- The crawler outputs compressed metadata files (DAT files).
- Each DAT file has a set of crawled pages with page metadata, including crawl time, IP address, MIME type, language encoding, etc.
- Most importantly, the outgoing links from each page are parsed, including the full URL and associated anchor text.
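A schematic sketch of the loading step, assuming a simplified tab-separated record layout (the Internet Archive's real DAT format differs):

```python
import gzip

# Toy loader: each line is url <tab> crawl_time <tab> mime_type <tab>
# space-separated outgoing links. The real DAT format is richer.
def load_records(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            url, crawl_time, mime_type, links = line.rstrip("\n").split("\t")
            yield url, crawl_time, mime_type, links.split()
```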
23 Database Schema
- Crawl: name of the crawl from which data is loaded
- Page: metadata about each web page, plus fields to help find and extract the full HTML text
- Link: the outgoing links from crawled pages
- Url: lookup table for unique URLs
- Host: lookup table for unique hostnames
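One plausible rendering of these tables as SQL DDL; the column names and types are assumptions for illustration, not the actual Web Lab schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Crawl (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Host  (id INTEGER PRIMARY KEY, hostname TEXT UNIQUE);
CREATE TABLE Url   (id INTEGER PRIMARY KEY, url TEXT UNIQUE,
                    host INTEGER REFERENCES Host(id));
CREATE TABLE Page  (id INTEGER PRIMARY KEY,
                    crawl INTEGER REFERENCES Crawl(id),
                    url INTEGER REFERENCES Url(id),
                    crawl_time TEXT,
                    mime_type TEXT,
                    -- guessed fields for locating the full HTML text:
                    arc_file TEXT,
                    arc_offset INTEGER);
CREATE TABLE Link  (from_page INTEGER REFERENCES Page(id),
                    to_url INTEGER REFERENCES Url(id),
                    anchor_text TEXT);
""")
```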
24 Crawls Loaded Into SQL DB
25 Selection from the Database
- SQL query the relational database directly (contact Manuel Calimlim)
- Use the GetPages tool on the Web site to send an SQL query -- work in progress
26 Demonstration
- Demonstration of the Web Lab web site
- http://weblab.infosci.cornell.edu/
- and the GetPages tool
27 Massive Data Analysis by Non-Specialists
- A typical scientist or social scientist
- Has deep domain knowledge
- Has good algorithmic understanding
- Is often a competent computer user, or has a research assistant who is familiar with languages such as Fortran, Python, and Matlab, or application packages such as SAS and Excel
- But...
- Has limited understanding of large-scale data analysis
- Is not skilled at any form of computing that requires parallel computing or concurrency
- Typical problem of scale: given 100 billion URLs, how do you identify duplicates? (sketched below)
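A minimal sketch of the MapReduce answer to the duplicates question: the map step emits (url, 1) pairs and the reduce step counts per URL. This simulation runs on one machine, but the same two functions can be distributed across a cluster:

```python
from itertools import groupby

def map_phase(urls):
    return [(url, 1) for url in urls]          # mapper emits (url, 1)

def reduce_phase(pairs):
    # The framework sorts and groups pairs by key before reduce runs.
    for url, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        count = sum(1 for _ in group)
        if count > 1:
            yield url, count                   # seen more than once

urls = ["http://a.example/", "http://b.example/", "http://a.example/"]
print(list(reduce_phase(map_phase(urls))))
# [('http://a.example/', 2)]
```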
28 Hadoop and MapReduce Programming
Hadoop: An open source distributed file system similar to the Google File System. It supports MapReduce programming. http://lucene.apache.org/hadoop/
MapReduce: A functional programming style to support large-scale data analysis without the need for global data structures.
In the 1960s, Fortran gave scientists a simple way to translate mathematical problems into efficient computer codes. MapReduce programming gives researchers a simple way to run massive data analysis on large computer clusters.
29 The MapReduce Paradigm
[Diagram: input data split into files (split 0 ... split 4) feeds M map tasks; each map task writes an intermediate file divided into R partitions; each of the R reduce tasks corresponds to one partition, reads that partition from every intermediate file, and writes one output file (Output 0, Output 1)]
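The routing step in the diagram (which of the R partitions an intermediate key lands in) is a simple hash. Hadoop's default partitioner hashes the key modulo R, roughly:

```python
# Sketch of hash partitioning: every occurrence of a key goes to the
# same partition, so one reduce task sees all values for that key.
def partition(key, R):
    return hash(key) % R

R = 2
for key, value in [("apple", 1), ("pear", 1), ("apple", 1)]:
    print(key, "-> partition", partition(key, R))
```

(Python's hash() is randomized per process; Hadoop uses the key's stable hashCode instead.)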
30 A Web Graph Example
[Diagram: a small example web graph with six numbered nodes (1-6)]
31 Building the Web Graph
- URLs, pages, and links
- URLs contained in Web pages may link to pages never crawled
- URLs are not canonicalized: different URLs may refer to the same page
- Links are from a page to a URL
- Web graph from crawl data
- Nodes are the union of pages crawled and URLs seen
- Each node and edge has time interval(s) over which it exists
32 Web Graph Example
Problem: Given a set of URL pairs in uncanonicalized form (u0, v0), create a list of all the edges that point to each node of the web graph.
- Replace each u0 or v0 with its canonicalized form u or v.
- Create a list of all nodes of the graph, i.e., the set of unique u.
- Discard all (u, v) pairs where u = v, or v is not a node of the graph.
- Discard all duplicate edges.
- For each node v, create a list (v, u), where u is the set of nodes that have edges to node v.
Each step is a simple programming task for a small number of links on a single computer. How can this simplicity be retained with huge numbers of links on a very large computer cluster?
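A single-machine sketch of these steps; canonicalize() is a crude stand-in, since real URL canonicalization is much more involved:

```python
def canonicalize(url):
    """Stand-in canonicalization: lowercase and strip a trailing slash."""
    return url.lower().rstrip("/")

def in_edges(pairs):
    pairs = [(canonicalize(u0), canonicalize(v0)) for u0, v0 in pairs]
    nodes = {u for u, _ in pairs}               # unique from-URLs
    edges = {(u, v) for u, v in pairs           # the set drops duplicates;
             if u != v and v in nodes}          # skip self-links and
                                                # targets that are not nodes
    result = {}
    for u, v in edges:                          # build (v, [u, ...]) lists
        result.setdefault(v, []).append(u)
    return result

pairs = [("http://A.example/", "http://b.example"),
         ("http://b.example/", "http://a.example"),
         ("http://B.example", "http://a.example/")]
print(in_edges(pairs))
# {'http://a.example': ['http://b.example'],
#  'http://b.example': ['http://a.example']}  (order may vary)
```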
33 MapReduce Example
Map task
Input: (u0, v0)
Output: (u, d) // indicates that u is a from-URL
(v, u) // indicates that v is a to-URL with a link from u
d is a dummy marker. Do not output if u = v.
This is simple application code to write.
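A sketch of that map task in Python, with the same stand-in canonicalize() and "d" as the dummy marker:

```python
D = "d"   # dummy marker

def canonicalize(url):
    """Stand-in canonicalization, as in the previous sketch."""
    return url.lower().rstrip("/")

def map_task(u0, v0):
    """Emit (u, D) to mark u as a from-URL, and (v, u) to record the link."""
    u, v = canonicalize(u0), canonicalize(v0)
    if u == v:
        return []                 # do not output self-links
    return [(u, D), (v, u)]

print(map_task("http://A.example/", "http://b.example"))
# [('http://a.example', 'd'), ('http://b.example', 'http://a.example')]
```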
34 A MapReduce Example
Merge: The input to the reduce process merges the output values from the map tasks that correspond to each URL. For each URL w, it creates a list:
w, d, ..., d, u1, ..., uk
This merge is performed automatically by the system libraries.
35 A MapReduce Example
Reduce
Input: w, d, ..., d, u1, ..., uk, where w is any URL.
Output: If there is no marker d in the list, discard and do not output; this corresponds to a URL that never appears as the first element of a (u, v) pair. Otherwise remove duplicates from u1, ..., uk and output. The output is a to-URL and a list of the nodes that link to it:
v, u1, ..., uk
This is simple application code to write.
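A matching sketch of the reduce task:

```python
D = "d"   # dummy marker, as emitted by the map task

def reduce_task(w, values):
    """values is the merged list d, ..., d, u1, ..., uk for URL w."""
    if D not in values:
        return None               # w is never a from-URL, so not a node
    sources = sorted({v for v in values if v != D})   # dedupe u1..uk
    return (w, sources)

print(reduce_task("http://a.example",
                  ["d", "http://b.example", "http://b.example"]))
# ('http://a.example', ['http://b.example'])
```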
36 For the Future: Examples of Tools and Services
- The Web Lab is steadily building a set of tools for researchers
- API and Web services
- GetPages: Web forms to select a dataset by query of a relational database, with indexes by date, URL, domain name, file type, anchor text, etc.
- Focused Web crawling (modification of the Heritrix crawler)
- Extraction of the Web graph from a subset, and calculations, e.g., PageRank, hubs and authorities (see the sketch after this list)
- Graph visualization
- Natural language processing of anchor text
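To illustrate the kind of calculation meant above, a minimal PageRank over an in-edge list like the one built on the earlier slides; this is a sketch, not the Web Lab's implementation:

```python
def pagerank(in_edges, nodes, damping=0.85, iterations=50):
    """Iterative PageRank over {node: [nodes linking to it]}.
    Assumes every node has at least one outgoing link."""
    rank = {n: 1.0 / len(nodes) for n in nodes}
    out_degree = {n: 0 for n in nodes}
    for sources in in_edges.values():
        for u in sources:
            out_degree[u] += 1
    for _ in range(iterations):
        rank = {v: (1 - damping) / len(nodes)
                   + damping * sum(rank[u] / out_degree[u]
                                   for u in in_edges.get(v, []))
                for v in nodes}
    return rank

in_edges = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}   # toy graph
print(pagerank(in_edges, {"a", "b", "c"}))
```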
37 The Web Lab is Ready for Use
- We are ready to work with a number of researchers
- Systems
- Relational database operational
- Hadoop pilot cluster (large cluster soon)
- File server and web server operational
- People
- Manuel Calimlim (database)
- Lucy Walle (Hadoop MapReduce)
- Tools
- A variety of tools in prototype
- Experience with large volumes of anchor text and URLs
38 Thanks
This work would not be possible without the forethought and long-standing commitment of Brewster Kahle and the Internet Archive to capture and preserve the content of the Web for future generations. This work has been funded in part by the National Science Foundation, grants CNS-0403340, DUE-0127308, SES-0537606, IIS-0634677, and IIS-0705774.
39 Cornell Information Science Research Seminar: The Web Lab
http://weblab.infosci.cornell.edu/
- William Y. Arms
- Manuel Calimlim
- Lucy Walle
- Felix Weigel
- January 23, 2007