Title: Web Crawling/Collection Aggregation
1. Web Crawling/Collection Aggregation
- CS431, Spring 2004, Carl Lagoze
- April 5, Lecture 19
2. The Web is a BIG Graph
- Diameter of the Web
- Cannot crawl even the static part completely
- New technology: the focused crawl
3. Crawling and Crawlers
- The Web overlays the internet
- A crawl overlays the Web
- (Figure: a crawl spreading outward from a seed page)
4. Crawler Issues
- System Considerations
- The URL itself
- Politeness
- Visit Order
- Robot Traps
- The hidden web
5. Standard for Robot Exclusion
- Martijn Koster (1994)
- http://any-server:80/robots.txt
- Maintained by the webmaster
- Forbids access to pages, directories
- Commonly excluded: /cgi-bin/
- Adherence is voluntary for the crawler (see the sketch below)
- Specification: http://www.robotstxt.org/wc/norobots.html
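As a concrete illustration of voluntary adherence, a polite crawler can check robots.txt before every fetch. A minimal sketch using Python's standard urllib.robotparser is below; the host name and user-agent string are placeholders, not from the slides.

```python
from urllib.robotparser import RobotFileParser

# Placeholder host and user-agent; substitute the site actually being crawled.
rp = RobotFileParser()
rp.set_url("http://any-server/robots.txt")
rp.read()  # fetch and parse the exclusion file maintained by the webmaster

# Returns False for paths the webmaster has forbidden, e.g. /cgi-bin/
print(rp.can_fetch("CS431Bot", "http://any-server/cgi-bin/search"))
print(rp.can_fetch("CS431Bot", "http://any-server/index.html"))
```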
6. Visit Order
- The frontier (sketched below as concrete data structures)
- Breadth-first: FIFO queue
- Depth-first: LIFO queue
- Best-first: priority queue
- Random
- Refresh rate
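A minimal sketch of how these visit orders map onto frontier data structures, using Python's standard library; the scores used for best-first ordering are an assumption here, standing in for whatever relevance measure a given crawler uses.

```python
import heapq
import random
from collections import deque

frontier = deque(["http://seed.example/"])

def next_breadth_first():
    return frontier.popleft()   # FIFO queue: oldest discovered URL first

def next_depth_first():
    return frontier.pop()       # LIFO queue: most recently discovered URL first

def next_random():
    i = random.randrange(len(frontier))
    frontier.rotate(-i)
    return frontier.popleft()   # pick a queued URL uniformly at random

# Best-first: a priority queue keyed by a score (e.g. similarity to a topic).
pq = []
heapq.heappush(pq, (-0.9, "http://seed.example/page"))  # negate: heapq is a min-heap

def next_best_first():
    _, url = heapq.heappop(pq)
    return url
```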
7. Robot Traps
- Cycles in the Web graph
- Infinite links on a page
- Traps set out by the webmaster (a simple defensive check is sketched below)
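The slides name the problem but not a particular defense; a common, minimal precaution is to remember normalized URLs already visited and to cap how deep any one path may grow. The normalization and the depth limit below are assumptions for illustration.

```python
from urllib.parse import urldefrag

MAX_DEPTH = 10       # assumed cap; stops infinitely deep trap hierarchies
visited = set()

def should_visit(url, depth):
    url, _ = urldefrag(url)      # drop #fragments so cycles are detected
    url = url.rstrip("/")        # crude normalization; real crawlers do more
    if depth > MAX_DEPTH or url in visited:
        return False             # already seen (cycle) or suspiciously deep (trap)
    visited.add(url)
    return True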
8. The Hidden Web
- Dynamic pages increasing
- Subscription pages
- Username and password pages
- Research in progress on how crawlers can get into the hidden Web
9. Redefining Order Making for Networked Information
- Challenge: accommodate, not impose, ordering mechanisms
- Ordering mechanisms should be independent of
- Physical location
- Who owns the content
- Who manages the content
10. Tools for Order Making
- Better search engines
- Google
- Better metadata
- Dublin Core, INDECS, IMS
- Tools for selection and specialization
- Collection Services
11. Collections in the Traditional Library
- Selection: defining the resources
- Specialization: defining the mechanisms
- Management: defining the policies
- http://campusgw.library.cornell.edu/about/spcollections.html
- http://scriptorium.lib.duke.edu/
12. Traditional Model Doesn't Map
- Irrelevance of locality, both among and within resources
- Blurring of containment: inter-resource linkages
- Loss of permanence: ephemeral resources are the norm
13. Defining a Digital Collection
- A criterion for selecting a set of resources, possibly spread across multiple distributed repositories
14. Collection Synthesis
- The NSDL
- National Science Digital Library
- Educational materials for K-thru-grave learners
- A collection of digital collections
- Collection (automatically derived)
- 20-50 items on a topic, represented by their URLs, expository in nature; precision trumps recall
- Collection description (automatically derived)
15. Crawler is the Key
- A general search engine is good for precise results, few in number
- A search engine must cover all topics, not just scientific ones
- For automatic collection assembly, a Web crawler is needed
- A focused crawler is the key
16. Focused Crawling
17. Focused Crawling
(Figure: a breadth-first crawl and a focused crawl expanding from the same root R; node numbers show visit order)
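A focused crawl replaces the FIFO frontier with a priority queue ordered by how promising a link looks for the target topic. The sketch below assumes a centroid term vector and cosine-similarity scoring as on the following slides; fetch, term_vector, cosine, and extract_links are hypothetical helpers, and the relevance threshold and scoring of children by their parent's score are illustrative choices, not the lecture's exact algorithm.

```python
import heapq

def focused_crawl(seeds, centroid, budget=1000):
    # Frontier is a max-priority queue (negated scores) instead of a FIFO queue.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen, collection = set(seeds), []

    while frontier and len(collection) < budget:
        neg_score, url = heapq.heappop(frontier)
        page = fetch(url)                # hypothetical: download the page
        vec = term_vector(page)          # hypothetical: build its term vector
        score = cosine(vec, centroid)
        if score >= 0.5:                 # assumed relevance threshold
            collection.append(url)
        for link in extract_links(page): # hypothetical: outgoing URLs
            if link not in seen:
                seen.add(link)
                # Children inherit the parent's score as their priority.
                heapq.heappush(frontier, (-score, link))
    return collection
```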
18. Collections and Clusters
- Traditionally, the document universe is divided into clusters, or collections
- Each collection is represented by its centroid
- On the Web, the size of the document universe is effectively infinite
- Agglomerative clustering is used instead
- Two aspects:
- Collection descriptor
- Rule for when items belong to that collection
19. (Figure: example items with scores Q = 0.2 and Q = 0.6)
20. The Setup
A virtual collection of items about Chebyshev Polynomials
21. Adding a Centroid
An empty collection of items about Chebyshev Polynomials
22. Document Vector Space
- Classic information retrieval technique
- Each word is a dimension in N-space
- Each document is a vector in N-space
- Example: <0, 0.003, 0, 0, .01, .984, 0, .001>
- Normalize the weights
- Both the centroid and the downloaded document are term vectors (see the sketch below)
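A minimal sketch of the term-vector machinery just described: raw term frequencies are normalized to unit length, and the same representation serves both the centroid and a downloaded document, so relevance can be scored with a dot product (cosine similarity). The tokenization is deliberately naive and the example strings are placeholders.

```python
import math
import re

def term_vector(text):
    """Map a document to a dict of term -> normalized weight (unit length)."""
    freqs = {}
    for term in re.findall(r"[a-z]+", text.lower()):  # naive tokenization
        freqs[term] = freqs.get(term, 0) + 1
    norm = math.sqrt(sum(f * f for f in freqs.values()))
    return {t: f / norm for t, f in freqs.items()} if norm else {}

def cosine(v1, v2):
    """Dot product of two unit-length term vectors = cosine similarity."""
    return sum(w * v2.get(t, 0.0) for t, w in v1.items())

doc = term_vector("Chebyshev polynomials are orthogonal polynomials ...")
centroid = term_vector("Chebyshev polynomial recurrence orthogonal approximation")
print(cosine(doc, centroid))
```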
23. Agglomerate
A collection with 3 items about Ch. Polys.
24. Where does the Centroid come from?
(Figure: a really good centroid for a collection about Chebyshev Polynomials)
25. Building a Centroid
1. Google("Chebyshev Polynomials") → url1, url2, ...
2. Let H be a hash (k, v) where k = word, v = frequency
3. For each url in url1, url2, ... do:
     D ← download(url); V ← term vector(D)
     For each term t in V do:
       If t is not in H, add it with value 0
       Increment H(t)
4. Compute tf-idf weights. C ← top 20 terms (by weight).
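A sketch of steps 2-4 in Python, under the usual tf-idf assumption (weight = tf × log(N/df)); the Google search and page download steps are stubbed out, since the slides assume live search results that are not reproduced here.

```python
import math
import re
from collections import Counter

def build_centroid(pages, top_k=20):
    """pages: list of document texts already downloaded for the seed query."""
    H = Counter()                 # step 2: hash of word -> total frequency
    df = Counter()                # document frequency, for the idf factor
    for text in pages:            # step 3: accumulate term counts per page
        terms = re.findall(r"[a-z]+", text.lower())
        H.update(terms)
        df.update(set(terms))
    n = len(pages)
    # Step 4: tf-idf weight each term, keep the top_k heaviest as the centroid.
    weights = {t: f * math.log(n / df[t]) for t, f in H.items()}
    return dict(sorted(weights.items(), key=lambda kv: -kv[1])[:top_k])

# Usage: pages would come from downloading the top Google hits for the topic.
centroid = build_centroid(["chebyshev polynomials are ...", "the chebyshev recurrence ..."])
```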
26. Dictionary
- Given centroids C1, C2, C3, ...
- Dictionary is C1 ∪ C2 ∪ C3
- Terms are the union of the terms in the Ci
- Term frequencies are the total frequency across the Ci
- Document frequency is how many Ci contain t
- Term IDF is based on Berkeley's DocFreqs
- Dictionary is 300-500 terms
27. Tunneling with Cutoff
- Nugget, dud, dud, dud, nugget (a tunneling sketch follows this list)
- Notation: 0 X X … X 0
- Fixed cutoff: 0 X1 X2 … Xc
- Adaptive cutoff: 0 X1 X2 … X?
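A sketch of the fixed-cutoff variant: the crawler "tunnels" through up to `cutoff` consecutive off-topic pages (duds) in the hope of reaching another nugget, then abandons that path. The relevance test, the threshold, and the helpers fetch, term_vector, cosine, and pick_next_link are assumptions for illustration, not the lecture's exact procedure.

```python
def tunnel(path_start_url, centroid, cutoff=3):
    """Follow a single path, tolerating at most `cutoff` consecutive duds."""
    duds = 0
    url = path_start_url
    while url is not None:
        page = fetch(url)                               # hypothetical: download the page
        if cosine(term_vector(page), centroid) >= 0.5:  # nugget: on-topic page
            yield url
            duds = 0                                    # reset the dud counter at a nugget
        else:
            duds += 1                                   # dud: off-topic page on the path
            if duds > cutoff:
                return                                  # fixed cutoff reached, abandon path
        url = pick_next_link(page)                      # hypothetical: choose an out-link
```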
28. Statistics Collected
- 500,000 documents
- Number of seeds: 4
- Path data for all but seeds
- 6620 completed paths (0-X…X-0)
- 100,000s of incomplete paths (0-X…X…)
29. Nuggets that are x steps from a nugget
30. Nuggets that are x steps from a seed and/or a nugget
31. Better parents have better children.
32. NSDL
http://www.nsdl.org
33. Metadata Repository
- Central storage of all metadata about all resources in the NSDL
- Defines the extent of the NSDL collection
- Metadata includes collections, items, annotations, etc.
- MR main functions:
- Aggregation
- Normalization
- Redistribution
- Ingest of metadata by various means
- Harvesting, manual, automatic, cross-walking
- Open access to MR contents for service builders via OAI-PMH (an example request is sketched below)
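OAI-PMH requests are plain HTTP GETs that return XML; a minimal sketch of harvesting Dublin Core records is below. The base URL is a placeholder, since the slides do not give the MR's actual endpoint.

```python
from urllib.request import urlopen

BASE = "http://example.org/oai"   # placeholder OAI-PMH base URL, not the real MR endpoint

# Standard OAI-PMH verb: list records in the unqualified Dublin Core format.
url = BASE + "?verb=ListRecords&metadataPrefix=oai_dc"
xml = urlopen(url).read()
print(xml[:300])                  # start of the XML response containing <record> elements
```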
34. Metadata Strategy
- Collect and redistribute any native (XML) metadata format
- Provide crosswalks to Dublin Core from eight standard formats
- Dublin Core, DC-GEM, LTSC (IMS), ADL (SCORM), MARC, FGDC, EAD
- Concentrate on collection-level metadata
- Use automatic generation to augment item-level metadata
35. Importing metadata into the MR
36. Exporting metadata from the MR
37. NSDL Data Warehouse: A Web of Entities and Relationships
38. Portals
(Figure: NSDL architecture. Digital sources feed harvesting, gathering, and normalization into the NSDL Data Warehouse of entities and their relationships (wholesale); specialized mining and annotation/augmentation services support a diverse network of specialized partners and portals (retail))