Web Crawling/Collection Aggregation

CS431, Spring 2004, Carl Lagoze, April 5, Lecture 19



1
Web Crawling/Collection Aggregation
  • CS431, Spring 2004, Carl Lagoze
  • April 5 Lecture 19

2
The Web is a BIG Graph
  • Diameter of the Web
  • Cannot completely crawl even the static part
  • New technology: the focused crawl

3
Crawling and Crawlers
  • Web overlays the internet
  • A crawl overlays the web

[Figure label: seed]
4
Crawler Issues
  • System Considerations
  • The URL itself
  • Politeness
  • Visit Order
  • Robot Traps
  • The hidden web

5
Standard for Robot Exclusion
  • Martijn Koster (1994)
  • http://any-server:80/robots.txt
  • Maintained by the webmaster
  • Forbids access to pages, directories
  • Commonly excluded: /cgi-bin/
  • Adherence is voluntary for the crawler
  • Specification: http://www.robotstxt.org/wc/norobots.html
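
A polite crawler consults robots.txt before fetching a page. The following is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the agent name "demo-bot", and the URLs are illustrative assumptions, and the rules are parsed from a string so that the example needs no network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, mirroring the commonly excluded /cgi-bin/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""

def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

if __name__ == "__main__":
    for url in ("http://any-server/index.html",
                "http://any-server/cgi-bin/search"):
        print(url, allowed(ROBOTS_TXT, "demo-bot", url))
```

Because adherence is voluntary, a check like this only helps if the crawler actually calls it before every fetch.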

6
Visit Order
  • The frontier
  • Breadth-first: FIFO queue
  • Depth-first: LIFO queue
  • Best-first: priority queue
  • Random
  • Refresh rate
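
The choice of frontier data structure determines the visit order. The sketch below walks a toy link graph under the three queue disciplines. The graph, the page names, and the relevance scores are invented for illustration, and a real crawler would download pages rather than look them up in a dictionary.

```python
import heapq
from collections import deque

def crawl_order(graph, seed, policy="breadth", score=None):
    """Return the order in which pages are visited under one frontier policy.

    policy: "breadth" (FIFO), "depth" (LIFO), or "best" (priority queue).
    score:  optional dict of page -> relevance, used only by "best".
    """
    score = score or {}
    counter = 0  # tie-breaker so heap entries never compare page names
    if policy == "best":
        frontier = [(-score.get(seed, 0.0), counter, seed)]
    else:
        frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        if policy == "breadth":
            page = frontier.popleft()          # oldest entry first
        elif policy == "depth":
            page = frontier.pop()              # newest entry first
        else:
            page = heapq.heappop(frontier)[2]  # highest score first
        order.append(page)
        for link in graph.get(page, []):
            if link in seen:
                continue
            seen.add(link)
            if policy == "best":
                counter += 1
                heapq.heappush(frontier, (-score.get(link, 0.0), counter, link))
            else:
                frontier.append(link)
    return order

# Hypothetical toy web: page -> outgoing links.
TOY_WEB = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
TOY_SCORE = {"B": 0.9, "C": 0.5, "D": 0.1, "E": 0.8}

if __name__ == "__main__":
    for policy in ("breadth", "depth", "best"):
        print(policy, crawl_order(TOY_WEB, "A", policy, TOY_SCORE))
```

On this toy graph the three policies give A B C D E, A C E B D, and A B C E D respectively.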

7
Robot Traps
  • Cycles in the Web graph
  • Infinite links on a page
  • Traps set out by the Webmaster
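
One common defense against traps, sketched here under assumptions, is to canonicalize each URL, remember what has already been queued, and refuse URLs that are suspiciously long or deep. The two thresholds below are hypothetical values chosen for illustration, not figures from the lecture.

```python
from urllib.parse import urldefrag, urlsplit, urlunsplit

MAX_PATH_SEGMENTS = 8   # hypothetical depth limit
MAX_URL_LENGTH = 256    # hypothetical length limit

def canonicalize(url):
    """Drop the fragment, lowercase scheme and host, and default the path to "/"."""
    url, _fragment = urldefrag(url)
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))

def should_enqueue(url, seen):
    """Return True (and record the URL) only if it is new and looks safe."""
    url = canonicalize(url)
    if url in seen:
        return False  # already queued: breaks cycles in the Web graph
    if len(url) > MAX_URL_LENGTH:
        return False  # overly long URLs often come from link generators
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > MAX_PATH_SEGMENTS:
        return False  # unusually deep paths suggest an endless directory trap
    seen.add(url)
    return True
```

The seen set handles cycles; the length and depth limits are only heuristics against pages that generate links without end.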

8
The Hidden Web
  • Dynamic pages increasing
  • Subscription pages
  • Username and password pages
  • Research in progress on how crawlers can get
    into the hidden web

9
Redefining Order Making for Networked Information
  • Challenge: accommodate, not impose, ordering
    mechanisms
  • Ordering mechanisms should be independent of:
      • Physical location
      • Who owns the content
      • Who manages the content

10
Tools for Order Making
  • Better search engines
      • Google
  • Better metadata
      • Dublin Core, INDECS, IMS
  • Tools for selection and specialization
      • Collection Services

11
Collections in the Traditional Library
  • Selection: defining the resources
  • Specialization: defining the mechanisms
  • Management: defining the policies
  • http://campusgw.library.cornell.edu/about/spcollections.html
  • http://scriptorium.lib.duke.edu/

12
Traditional Model Doesn't Map
  • Irrelevance of locality: both among and within
    resources
  • Blurring of containment: inter-resource linkages
  • Loss of permanence: ephemeral resources are the
    norm

13
Defining a Digital Collection
  • A criterion for selecting a set of resources,
    possibly spread across multiple distributed
    repositories

14
Collection Synthesis
  • The NSDL
      • National Science Digital Library
      • Educational materials for K-thru-grave
      • A collection of digital collections
  • Collection (automatically derived)
      • 20-50 items on a topic, represented by their
        URLs, expository in nature; precision trumps
        recall
  • Collection description (automatically derived)

15
Crawler is the Key
  • A general search engine is good for precise
    results, few in number
  • A search engine must cover all topics, not just
    scientific
  • For automatic collection assembly, a Web crawler
    is needed
  • A focused crawler is the key

16
Focused Crawling
17
Focused Crawling
[Figure: focused crawl compared with breadth-first crawl; nodes labeled R and 1-7]
18
Collections and Clusters
  • Traditional: document universe is divided into
    clusters, or collections
  • Each collection represented by its centroid
  • Web: size of document universe is infinite
  • Agglomerative clustering is used instead
  • Two aspects:
      • Collection descriptor
      • Rule for when items belong to that Collection

19
[Figure: panels labeled Q = 0.2 and Q = 0.6]
20
The Setup
A virtual collection of items about Chebyshev
Polynomials
21
Adding a Centroid
An empty collection of items about Chebyshev
Polynomials
22
Document Vector Space
  • Classic information retrieval technique
  • Each word is a dimension in N-space
  • Each document is a vector in N-space
  • Example: <0, 0.003, 0, 0, 0.01, 0.984, 0, 0.001>
  • Normalize the weights
  • Both the centroid and the downloaded document
    are term vectors
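
A small sketch of the vector-space idea follows. It builds a length-normalized term vector from raw text and compares two vectors. The slides do not name the similarity measure, so cosine similarity is an assumption here, as is the simple lowercase word tokenizer.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Map each word to its frequency, scaled so the vector has unit length."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

if __name__ == "__main__":
    centroid = term_vector("Chebyshev polynomials orthogonal polynomials")
    document = term_vector("Chebyshev polynomials of the first kind")
    print(round(cosine(centroid, document), 3))
```

Storing only nonzero weights in a dictionary keeps the vectors small even when the vocabulary (the N in N-space) is large.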

23
Agglomerate
A collection with 3 items about Ch. Polys.
24
Where does the Centroid come from?
Chebyshev Polynomials → a really good centroid for a collection about
C.P.s
25
Building a Centroid
1. Google("Chebyshev Polynomials") → url1, url2, ...
2. Let H be a hash (k, v) where k = word, v = freq
3. For each url in url1, url2, ... do
     D ← download(url)
     V ← term_vector(D)
     For each term t in V do
       If t not in H, add it with value 0
       Increment H(t)
4. Compute tf-idf weights. C ← top 20 terms (by
   weight).
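
The steps above can be sketched in Python as follows. This is an illustration under assumptions, not the lecture's implementation: the search and download steps are skipped and the function takes page texts directly, doc_freq and n_docs are a hypothetical document-frequency table and corpus size, and the smoothed logarithmic IDF formula is one common choice among several.

```python
import math
import re
from collections import Counter

def build_centroid(pages, doc_freq, n_docs, top_k=20):
    """Return the top_k terms by tf-idf weight over the given page texts.

    pages:    texts already downloaded from the seed URLs (assumed input)
    doc_freq: hypothetical table of term -> number of documents containing it
    n_docs:   hypothetical size of the corpus behind doc_freq
    """
    H = Counter()  # term -> total frequency across all pages
    for text in pages:
        H.update(re.findall(r"[a-z]+", text.lower()))
    weights = {
        t: tf * math.log(n_docs / (1 + doc_freq.get(t, 0)))
        for t, tf in H.items()
    }
    # Sort by descending weight, breaking ties alphabetically.
    top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return dict(top)

if __name__ == "__main__":
    pages = ["Chebyshev polynomials are orthogonal polynomials",
             "the Chebyshev polynomials of the first kind"]
    common = {"the": 900, "are": 800, "of": 900}
    print(build_centroid(pages, common, n_docs=1000, top_k=2))
```

Frequent words such as "the" receive a low IDF and fall out of the centroid, which is the point of weighting by tf-idf rather than raw frequency.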
26
Dictionary
  • Given centroids C1, C2, C3
  • Dictionary is C1 ∪ C2 ∪ C3
  • Terms are union of terms in Ci
  • Term Frequencies are total frequency in Ci
  • Document Frequency is how many Cs have t
  • Term IDF is based on Berkeley's DocFreqs
  • Dictionary is 300-500 terms
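
A minimal sketch of merging centroids into one dictionary, assuming each centroid is represented as a mapping from term to frequency (the slides do not specify the representation). The example centroids are invented.

```python
from collections import Counter

def build_dictionary(centroids):
    """Merge centroids into (term_freq, doc_freq) tables.

    term_freq[t]: total frequency of t summed over all centroids
    doc_freq[t]:  number of centroids that contain t
    """
    term_freq = Counter()
    doc_freq = Counter()
    for centroid in centroids:
        term_freq.update(centroid)         # a mapping adds its values
        doc_freq.update(centroid.keys())   # an iterable counts each key once
    return term_freq, doc_freq

if __name__ == "__main__":
    c1 = {"chebyshev": 5, "polynomial": 3}
    c2 = {"polynomial": 2, "fourier": 4}
    tf, df = build_dictionary([c1, c2])
    print(sorted(tf.items()), sorted(df.items()))
```

The dictionary's terms are the keys of term_freq, which is the union of the terms in the individual centroids.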

27
Tunneling with Cutoff
  • Nugget dud dud dud nugget
  • Notation: 0 X X - X 0
  • Fixed cutoff: 0 X1 X2 - Xc
  • Adaptive cutoff: 0 X1 X2 - X?
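
The fixed-cutoff case can be sketched as a crawl that keeps following links through duds but abandons a path once it has seen c consecutive duds. This is one reading of the notation above, stated as an assumption: the dud Xc is downloaded but its links are not followed. The graph and the nugget test are toy stand-ins for real downloads and a real relevance judgment.

```python
from collections import deque

def tunnel_crawl(graph, is_nugget, seed, cutoff):
    """Breadth-first crawl that stops a path after `cutoff` consecutive duds.

    Returns the nuggets found, in visit order.
    """
    frontier = deque([(seed, 0)])  # (page, consecutive duds before this page)
    seen = {seed}
    nuggets = []
    while frontier:
        page, duds = frontier.popleft()
        if is_nugget(page):
            nuggets.append(page)
            duds = 0               # a nugget resets the dud count
        else:
            duds += 1
            if duds >= cutoff:
                continue           # cutoff reached: do not expand this page
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append((link, duds))
    return nuggets

# Hypothetical chain: nugget, dud, dud, dud, nugget.
CHAIN = {"n0": ["d1"], "d1": ["d2"], "d2": ["d3"], "d3": ["n4"], "n4": []}

if __name__ == "__main__":
    for c in (3, 4):
        print(c, tunnel_crawl(CHAIN, lambda p: p.startswith("n"), "n0", c))
```

With a cutoff of 3 the crawl gives up at the third dud and never reaches the second nugget; with a cutoff of 4 it tunnels through and finds it.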

28
Statistics Collected
  • 500,000 documents
  • Number of seeds: 4
  • Path data for all but seeds
  • 6620 completed paths (0-xx-0)
  • 100,000s of incomplete paths (0-xx..)

29
Nuggets that are x steps from a nugget
30
Nuggets that are x steps from a seed and/or a
nugget
31
Better parents have better children.
32
NSDL
http://www.nsdl.org
33
Metadata Repository
  • Central storage of all metadata about all
    resources in the NSDL
  • Defines the extent of NSDL collection
  • Metadata includes collections, items,
    annotations, etc.
  • MR main functions
      • Aggregation
      • Normalization
      • Redistribution
  • Ingest of metadata by various means
      • Harvesting, manual, automatic, cross-walking
  • Open access to MR contents for service builders
    via OAI-PMH
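
For service builders, harvesting over OAI-PMH comes down to issuing HTTP requests with a verb and a few arguments. The sketch below only builds a ListRecords request URL. The base URL is a placeholder rather than the actual NSDL endpoint, and fetching and parsing the XML response are left out.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc",
                     set_spec=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    In OAI-PMH a resumptionToken is an exclusive argument, so when one is
    given it is sent alone with the verb.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"

if __name__ == "__main__":
    # Placeholder base URL for illustration only.
    print(list_records_url("http://repository.example.org/oai"))
```

The oai_dc prefix requests unqualified Dublin Core, which every OAI-PMH repository is required to support.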

34
Metadata Strategy
  • Collect and redistribute any native (XML)
    metadata format
  • Provide crosswalks to Dublin Core from eight
    standard formats
      • Dublin Core, DC-GEM, LTSC (IMS), ADL (SCORM),
        MARC, FGDC, EAD
  • Concentrate on collection-level metadata
  • Use automatic generation to augment item-level
    metadata
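
A crosswalk is essentially a field mapping from a native format to Dublin Core. The sketch below shows the idea with a heavily simplified table covering four MARC tags. The record layout (tag mapped to a list of values) is an assumption made for illustration, and a production crosswalk would also handle subfields, indicators, and many more fields.

```python
# Simplified, illustrative subset of a MARC-to-Dublin-Core mapping.
MARC_TO_DC = {
    "245": "title",
    "100": "creator",
    "650": "subject",
    "520": "description",
}

def crosswalk(record, mapping=MARC_TO_DC):
    """Translate a native record (tag -> list of values) into Dublin Core.

    Fields without a mapping are dropped; repeated elements are kept as lists.
    """
    dc = {}
    for tag, values in record.items():
        element = mapping.get(tag)
        if element is None:
            continue  # no Dublin Core equivalent in this table
        dc.setdefault(element, []).extend(values)
    return dc

if __name__ == "__main__":
    record = {"245": ["Chebyshev Polynomials"],
              "650": ["Mathematics", "Approximation theory"],
              "999": ["local note"]}
    print(crosswalk(record))
```

Supporting another native format then means supplying another mapping table rather than changing the code.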

35
Importing metadata into the MR
36
Exporting metadata from the MR
37
NSDL Data Warehouse: A Web of Entities and
Relationships
38
Portals
Diverse Network of Specialized Partners (retail)
Specialized Mining
Annotation Augmentation
NSDL Data Warehouse: Entities and their
Relationships (wholesale)
Harvesting, Gathering, Normalization
Digital Sources