Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery

Slides: 35

Provided by: sou59

Transcript and Presenter's Notes

1
Focused Crawling: A New Approach to
Topic-Specific Web Resource Discovery

Soumen Chakrabarti
Martin van den Berg
Byron Dom

2
Portals and portholes

Popular search portals and directories
Useful for generic needs
Difficult to do serious research
Information needs of net-savvy users are getting
very sophisticated
Relatively little business incentive
Need handmade specialty sites ("portholes")
Resource discovery must be personalized

3
Quote

The emergence of portholes will be one of the
major Internet trends of 1999. As people become
more savvy users of the Net, they want things
which are better focused on meeting their
specific needs. We're going to see a whole lot
more of this, and it's going to potentially erode
the user base of some of the big portals.
Jim Hake (Founder, Global Information
Infrastructure Awards)

4
Scenario

Disk drive research group wants to track magnetic
surface technologies
Compiler research group wants to trawl the web
for graduate student résumés
____ wants to enhance his/her collection of
bookmarks about ____ with prominent and relevant
links
Virtual libraries like the Open Directory Project
and the Mining Co.

5
Structured web queries

How many links were found from an environment
protection agency site to a site about oil and
natural gas in the last year?
Apart from cycling, what is the most common topic
cited by pages on cycling?
Find Web research pages which are widely cited by
Hawaiian vacation pages

6
Goal

Automatically construct a focused portal
(porthole) containing resources that are
Relevant to the user's focus of interest
Of high influence and quality
Collectively comprehensive
Answer structured web queries by selectively
exploring the topics involved in the query

7
Tools at hand

Keyword search engines
Synonymy, polysemy
Abundance, lack of quality
Hand compiled topic directories
Labor intensive, subjective judgements
Resources automatically located using keyword
search and link graph distillation
Dependence on large crawls and indices

8
Estimating popularity

Extensive research on social network theory
Wasserman and Faust
Hyperlink based
Large in-degree indicates popularity/authority
Not all votes are worth the same
Several similar ideas and refinements
Google (Page and Brin) and HITS (Kleinberg)
Resource compilation (Chakrabarti et al)
Topic distillation (Bharat and Henzinger)

9
Topic distillation overview

Given web graph and query
Search engine selects sub-graph
Expansion, pruning and edge weights
Nodes iteratively transfer authority to cited
neighbors
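
The iterative authority transfer on this slide can be sketched roughly as follows. This is a minimal HITS-style illustration, not the presenters' distiller: the toy edge list, the names `distill` and `normalize`, and the fixed iteration count are all assumptions, and the expansion, pruning and edge-weight steps are left out.

```python
import math

def normalize(scores):
    """Scale scores to unit Euclidean length so they stay bounded."""
    norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
    return {n: s / norm for n, s in scores.items()}

def distill(edges, iterations=50):
    """HITS-style iteration: hubs pass score to the pages they cite,
    and authorities pass score back to the pages that cite them."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of v = sum of hub scores of the pages citing v.
        new_auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            new_auth[v] += hub[u]
        # Hub of u = sum of authority scores of the pages u cites.
        new_hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            new_hub[u] += new_auth[v]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth

# Hypothetical selected subgraph, as (citing page, cited page) pairs.
EDGES = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, auth = distill(EDGES)
top_authority = max(auth, key=auth.get)
top_hub = max(hub, key=hub.get)
```

In this toy graph the most-cited page ends up as the top authority, and the page citing the most authorities ends up as the top hub.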

[Figure labels: Query, Search Engine, The Web, Selected subgraph]

10
Preliminary distillation-based approach

Design a keyword query to represent a topic
Run topic distillation periodically
Refine query through trial-and-error
Works well if answer is partially known, e.g.,
European airlines
swissair iberia klm

11
(No Transcript)
12
Problems with preliminary approach

Dependence on large web crawl and index
System = crawler + index + distiller
Unreliability of keyword match
Engines differ significantly on a given query due
to small overlap (Bharat and Broder)
Narrow, arbitrary view of relevant subgraph
Topic model does not improve over time
Difficulty of query construction
Lack of output sensitivity

13
Query construction
/Companies/Electronics/Power_Supply
power suppl
switch mode smps
-multiprocessor
uninterrupt power suppl ups
-parcel
14
Query complexity

Complex queries (966 trials)
Average words 7.03
Average operators (") 4.34
Typical Alta Vista queries are much simpler
(Silverstein, Henzinger, Marais and Moricz)
Average query words 2.35
Average operators (") 0.41
Forcibly adding a hub or authority node helped in
86% of the queries

15
Query complexity

Complex queries needed for distillation
Typical Alta Vista queries are much simpler
(Silverstein, Henzinger, Marais and Moricz)
Forcing a hub or authority helps 86% of the time

16
Output sensitivity

Say the goal is to find a comprehensive
collection of recreational and competitive
bicycling sites and pages
Ideally effort should scale with size of the
result
Time spent crawling and indexing sites unrelated
to the topic is wasted
Likewise, time that does not improve
comprehensiveness is wasted

17
Proposed solution

Resource discovery system that can be customized
to crawl for any topic by giving examples
Hypertext mining algorithms learn to recognize
pages and sites about the given topic, and a
measure of their centrality
Crawler has guidance hooks controlled by these
two scores
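
One way to picture a crawler steered by such scores is a priority-ordered frontier. The sketch below is only an illustration under stated assumptions: the in-memory `WEB` dictionary, the keyword-based `relevance` function (a stand-in for the learned hypertext classifier), and the rule that a link inherits the relevance of the page it was found on are all hypothetical, and the centrality score is left out for brevity.

```python
import heapq

# Hypothetical toy web: url -> (page text, outgoing links).
WEB = {
    "seed": ("cycling club rides", ["bikes", "news"]),
    "bikes": ("mountain bike cycling gear", ["trails"]),
    "news": ("stock market report", ["finance"]),
    "trails": ("cycling trails map", []),
    "finance": ("bond yields", []),
}

def relevance(text, topic_words=("cycling", "bike")):
    """Stand-in for the learned classifier: fraction of topic words present."""
    words = text.split()
    return sum(w in words for w in topic_words) / len(topic_words)

def focused_crawl(seeds, budget):
    """Visit at most `budget` pages, preferring links found on relevant pages."""
    # heapq is a min-heap, so priorities are negated to pop the best first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen, visited = set(seeds), []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        text, links = WEB[url]
        score = relevance(text)
        visited.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                # Guidance hook: a link inherits the relevance of its source page.
                heapq.heappush(frontier, (-score, link))
    return visited

visited = focused_crawl(["seed"], budget=4)
order = [url for url, _ in visited]
```

With a budget of four pages, the crawl follows the cycling pages before the news page and never reaches the finance page, which is the output-sensitive behavior the earlier slides ask for.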

18
Administration scenario

[Figure labels: Taxonomy Editor, Current Examples, Drag, Suggested Additional Examples]

19
Relevance

[Figure: topic taxonomy tree with nodes All, BusEcon, Recreation, Arts, Companies, Cycling, ..., Bike Shops, Clubs, Mt.Biking; legend: path nodes, good nodes, subsumed nodes]
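
The three node labels on this slide can be illustrated with a small sketch. It assumes that path nodes are the ancestors of a good node and subsumed nodes are its descendants; the parent links in `PARENT` and the choice of "Cycling" as the good node are guesses made for illustration and are not stated in the slide text.

```python
# Hypothetical taxonomy (child -> parent); parent links are assumed.
PARENT = {
    "BusEcon": "All", "Recreation": "All", "Arts": "All",
    "Companies": "BusEcon", "Cycling": "Recreation",
    "Bike Shops": "Cycling", "Clubs": "Cycling", "Mt.Biking": "Cycling",
}

def ancestors(node):
    """Return the chain of ancestors from the node's parent up to the root."""
    chain = []
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain

def label_nodes(good):
    """Label every taxonomy node relative to the user-marked good nodes."""
    labels = {}
    for node in set(PARENT) | set(PARENT.values()):
        if node in good:
            labels[node] = "good"
        elif any(a in good for a in ancestors(node)):
            labels[node] = "subsumed"   # lies below a good node
        elif any(node in ancestors(g) for g in good):
            labels[node] = "path"       # lies on the path from root to a good node
        else:
            labels[node] = "other"
    return labels

labels = label_nodes({"Cycling"})
```

Under these assumptions, marking Cycling as good makes Recreation and All path nodes, and makes Bike Shops, Clubs and Mt.Biking subsumed nodes.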
20
Classification