Title: Web Crawling/Collection Aggregation
1. Web Crawling/Collection Aggregation
- CS431, Spring 2004, Carl Lagoze
- April 5, Lecture 19
2. The Web is a BIG Graph
- Diameter of the Web
- Cannot crawl even the static part completely
- New technology: the focused crawl
3. Crawling and Crawlers
- The Web overlays the internet
- A crawl overlays the Web
- (Figure: a crawl spreading outward from a seed page)
4. Crawler Issues
- System Considerations
- The URL itself
- Politeness
- Visit Order
- Robot Traps
- The hidden web
5. Standard for Robot Exclusion
- Martijn Koster (1994)
- http://any-server:80/robots.txt
- Maintained by the webmaster
- Forbids access to pages, directories
- Commonly excluded: /cgi-bin/
- Adherence is voluntary for the crawler (see the sketch below)
- Specification: http://www.robotstxt.org/wc/norobots.html
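As a concrete illustration of voluntary adherence, a polite crawler can check robots.txt before every fetch. A minimal sketch using Python's standard urllib.robotparser is below; the host name and user-agent string are placeholders, not from the slides.

```python
from urllib.robotparser import RobotFileParser

# Placeholder host and user-agent; substitute the site actually being crawled.
rp = RobotFileParser()
rp.set_url("http://any-server/robots.txt")
rp.read()  # fetch and parse the exclusion file maintained by the webmaster

# Returns False for paths the webmaster has forbidden, e.g. /cgi-bin/
print(rp.can_fetch("CS431Bot", "http://any-server/cgi-bin/search"))
print(rp.can_fetch("CS431Bot", "http://any-server/index.html"))
```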
6. Visit Order
- The frontier (sketched below as concrete data structures)
- Breadth-first: FIFO queue
- Depth-first: LIFO queue
- Best-first: priority queue
- Random
- Refresh rate
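A minimal sketch of how these visit orders map onto frontier data structures, using Python's standard library; the scores used for best-first ordering are an assumption here, standing in for whatever relevance measure a given crawler uses.

```python
import heapq
import random
from collections import deque

frontier = deque(["http://seed.example/"])

def next_breadth_first():
    return frontier.popleft()   # FIFO queue: oldest discovered URL first

def next_depth_first():
    return frontier.pop()       # LIFO queue: most recently discovered URL first

def next_random():
    i = random.randrange(len(frontier))
    frontier.rotate(-i)
    return frontier.popleft()   # pick a queued URL uniformly at random

# Best-first: a priority queue keyed by a score (e.g. similarity to a topic).
pq = []
heapq.heappush(pq, (-0.9, "http://seed.example/page"))  # negate: heapq is a min-heap

def next_best_first():
    _, url = heapq.heappop(pq)
    return url
```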
7. Robot Traps
- Cycles in the Web graph
- Infinite links on a page
- Traps set out by the webmaster (a simple defensive check is sketched below)
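The slides name the problem but not a particular defense; a common, minimal precaution is to remember normalized URLs already visited and to cap how deep any one path may grow. The normalization and the depth limit below are assumptions for illustration.

```python
from urllib.parse import urldefrag

MAX_DEPTH = 10       # assumed cap; stops infinitely deep trap hierarchies
visited = set()

def should_visit(url, depth):
    url, _ = urldefrag(url)      # drop #fragments so cycles are detected
    url = url.rstrip("/")        # crude normalization; real crawlers do more
    if depth > MAX_DEPTH or url in visited:
        return False             # already seen (cycle) or suspiciously deep (trap)
    visited.add(url)
    return True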
8. The Hidden Web
- Dynamic pages increasing
- Subscription pages
- Username and password pages
- Research in progress on how crawlers can get into the hidden Web
9. Redefining Order Making for Networked Information
- Challenge: accommodate, not impose, ordering mechanisms
- Ordering mechanisms should be independent of
- Physical location
- Who owns the content
- Who manages the content
10. Tools for Order Making
- Better search engines
- Google
- Better metadata
- Dublin Core, INDECS, IMS
- Tools for selection and specialization
- Collection Services
11. Collections in the Traditional Library
- Selection: defining the resources
- Specialization: defining the mechanisms
- Management: defining the policies
- http://campusgw.library.cornell.edu/about/spcollections.html
- http://scriptorium.lib.duke.edu/
12. Traditional Model Doesn't Map
- Irrelevance of locality, both among and within resources
- Blurring of containment: inter-resource linkages
- Loss of permanence: ephemeral resources are the norm
13. Defining a Digital Collection
- A criterion for selecting a set of resources, possibly spread across multiple distributed repositories
14. Collection Synthesis
- The NSDL
- National Science Digital Library
- Educational materials for K-thru-grave learners
- A collection of digital collections
- Collection (automatically derived)
- 20-50 items on a topic, represented by their URLs, expository in nature; precision trumps recall
- Collection description (automatically derived)
15. Crawler is the Key
- A general search engine is good for precise results, few in number
- A search engine must cover all topics, not just scientific ones
- For automatic collection assembly, a Web crawler is needed
- A focused crawler is the key
16. Focused Crawling
17. Focused Crawling
(Figure: a breadth-first crawl and a focused crawl expanding from the same root R; node numbers show visit order)
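A focused crawl replaces the FIFO frontier with a priority queue ordered by how promising a link looks for the target topic. The sketch below assumes a centroid term vector and cosine-similarity scoring as on the following slides; fetch, term_vector, cosine, and extract_links are hypothetical helpers, and the relevance threshold and scoring of children by their parent's score are illustrative choices, not the lecture's exact algorithm.

```python
import heapq

def focused_crawl(seeds, centroid, budget=1000):
    # Frontier is a max-priority queue (negated scores) instead of a FIFO queue.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen, collection = set(seeds), []

    while frontier and len(collection) < budget:
        neg_score, url = heapq.heappop(frontier)
        page = fetch(url)                # hypothetical: download the page
        vec = term_vector(page)          # hypothetical: build its term vector
        score = cosine(vec, centroid)
        if score >= 0.5:                 # assumed relevance threshold
            collection.append(url)
        for link in extract_links(page): # hypothetical: outgoing URLs
            if link not in seen:
                seen.add(link)
                # Children inherit the parent's score as their priority.
                heapq.heappush(frontier, (-score, link))
    return collection
```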
18. Collections and Clusters
- Traditionally, the document universe is divided into clusters, or collections
- Each collection is represented by its centroid
- On the Web, the size of the document universe is effectively infinite
- Agglomerative clustering is used instead
- Two aspects:
- Collection descriptor
- Rule for when items belong to that collection
19. (Figure: example items with scores Q = 0.2 and Q = 0.6)
20. The Setup
A virtual collection of items about Chebyshev Polynomials
21. Adding a Centroid
An empty collection of items about Chebyshev Polynomials
22. Document Vector Space
- Classic information retrieval technique
- Each word is a dimension in N-space
- Each document is a vector in N-space
- Example: <0, 0.003, 0, 0, .01, .984, 0, .001>
- Normalize the weights
- Both the centroid and the downloaded document are term vectors (see the sketch below)
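A minimal sketch of the term-vector machinery just described: raw term frequencies are normalized to unit length, and the same representation serves both the centroid and a downloaded document, so relevance can be scored with a dot product (cosine similarity). The tokenization is deliberately naive and the example strings are placeholders.

```python
import math
import re

def term_vector(text):
    """Map a document to a dict of term -> normalized weight (unit length)."""
    freqs = {}
    for term in re.findall(r"[a-z]+", text.lower()):  # naive tokenization
        freqs[term] = freqs.get(term, 0) + 1
    norm = math.sqrt(sum(f * f for f in freqs.values()))
    return {t: f / norm for t, f in freqs.items()} if norm else {}

def cosine(v1, v2):
    """Dot product of two unit-length term vectors = cosine similarity."""
    return sum(w * v2.get(t, 0.0) for t, w in v1.items())

doc = term_vector("Chebyshev polynomials are orthogonal polynomials ...")
centroid = term_vector("Chebyshev polynomial recurrence orthogonal approximation")
print(cosine(doc, centroid))
```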
23. Agglomerate
A collection with 3 items about Ch. Polys.
24. Where does the Centroid come from?
(Figure: a really good centroid for a collection about Chebyshev Polynomials)
25. Building a Centroid
1. Google("Chebyshev Polynomials") → url1, url2, ...
2. Let H be a hash (k, v) where k = word, v = frequency
3. For each url in url1, url2, ... do:
     D ← download(url); V ← term vector(D)
     For each term t in V do:
       If t is not in H, add it with value 0
       Increment H(t)
4. Compute tf-idf weights. C ← top 20 terms (by weight).
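A sketch of steps 2-4 in Python, under the usual tf-idf assumption (weight = tf × log(N/df)); the Google search and page download steps are stubbed out, since the slides assume live search results that are not reproduced here.

```python
import math
import re
from collections import Counter

def build_centroid(pages, top_k=20):
    """pages: list of document texts already downloaded for the seed query."""
    H = Counter()                 # step 2: hash of word -> total frequency
    df = Counter()                # document frequency, for the idf factor
    for text in pages:            # step 3: accumulate term counts per page
        terms = re.findall(r"[a-z]+", text.lower())
        H.update(terms)
        df.update(set(terms))
    n = len(pages)
    # Step 4: tf-idf weight each term, keep the top_k heaviest as the centroid.
    weights = {t: f * math.log(n / df[t]) for t, f in H.items()}
    return dict(sorted(weights.items(), key=lambda kv: -kv[1])[:top_k])

# Usage: pages would come from downloading the top Google hits for the topic.
centroid = build_centroid(["chebyshev polynomials are ...", "the chebyshev recurrence ..."])
```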
26. Dictionary
- Given centroids C1, C2, C3, ...
- Dictionary is C1 ∪ C2 ∪ C3
- Terms are the union of the terms in the Ci
- Term frequencies are the total frequency across the Ci
- Document frequency is how many Ci contain t
- Term IDF is based on Berkeley's DocFreqs
- Dictionary is 300-500 terms
27. Tunneling with Cutoff
- Nugget, dud, dud, dud, nugget (a tunneling sketch follows this list)
- Notation: 0 X X … X 0
- Fixed cutoff: 0 X1 X2 … Xc
- Adaptive cutoff: 0 X1 X2 … X?
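A sketch of the fixed-cutoff variant: the crawler "tunnels" through up to `cutoff` consecutive off-topic pages (duds) in the hope of reaching another nugget, then abandons that path. The relevance test, the threshold, and the helpers fetch, term_vector, cosine, and pick_next_link are assumptions for illustration, not the lecture's exact procedure.

```python
def tunnel(path_start_url, centroid, cutoff=3):
    """Follow a single path, tolerating at most `cutoff` consecutive duds."""
    duds = 0
    url = path_start_url
    while url is not None:
        page = fetch(url)                               # hypothetical: download the page
        if cosine(term_vector(page), centroid) >= 0.5:  # nugget: on-topic page
            yield url
            duds = 0                                    # reset the dud counter at a nugget
        else:
            duds += 1                                   # dud: off-topic page on the path
            if duds > cutoff:
                return                                  # fixed cutoff reached, abandon path
        url = pick_next_link(page)                      # hypothetical: choose an out-link
```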
28. Statistics Collected
- 500,000 documents
- Number of seeds: 4
- Path data for all but seeds
- 6620 completed paths (0-X…X-0)
- 100,000s of incomplete paths (0-X…X…)
29. Nuggets that are x steps from a nugget
30. Nuggets that are x steps from a seed and/or a nugget
31. Better parents have better children.
32. NSDL
http://www.nsdl.org
33. Metadata Repository
- Central storage of all metadata about all resources in the NSDL
- Defines the extent of the NSDL collection
- Metadata includes collections, items, annotations, etc.
- MR main functions:
- Aggregation
- Normalization
- Redistribution
- Ingest of metadata by various means
- Harvesting, manual, automatic, cross-walking
- Open access to MR contents for service builders via OAI-PMH (an example request is sketched below)
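OAI-PMH requests are plain HTTP GETs that return XML; a minimal sketch of harvesting Dublin Core records is below. The base URL is a placeholder, since the slides do not give the MR's actual endpoint.

```python
from urllib.request import urlopen

BASE = "http://example.org/oai"   # placeholder OAI-PMH base URL, not the real MR endpoint

# Standard OAI-PMH verb: list records in the unqualified Dublin Core format.
url = BASE + "?verb=ListRecords&metadataPrefix=oai_dc"
xml = urlopen(url).read()
print(xml[:300])                  # start of the XML response containing <record> elements
```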
34. Metadata Strategy
- Collect and redistribute any native (XML) metadata format
- Provide crosswalks to Dublin Core from eight standard formats
- Dublin Core, DC-GEM, LTSC (IMS), ADL (SCORM), MARC, FGDC, EAD
- Concentrate on collection-level metadata
- Use automatic generation to augment item-level metadata
35. Importing metadata into the MR
36. Exporting metadata from the MR
37. NSDL Data Warehouse: A Web of Entities and Relationships
38. Portals
(Figure: NSDL architecture. Digital sources feed harvesting, gathering, and normalization into the NSDL Data Warehouse of entities and their relationships (wholesale); specialized mining and annotation/augmentation services support a diverse network of specialized partners and portals (retail))