Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery


1
Focused Crawling: A New Approach to
Topic-Specific Web Resource Discovery
  • Soumen Chakrabarti
  • Martin van den Berg
  • Byron Dom

2
Portals and portholes
  • Popular search portals and directories
  • Useful for generic needs
  • Difficult to do serious research
  • Information needs of net-savvy users are getting
    very sophisticated
  • Relatively little business incentive
  • Need handmade specialty sites (portholes)
  • Resource discovery must be personalized

3
Quote
  • The emergence of portholes will be one of the
    major Internet trends of 1999. As people become
    more savvy users of the Net, they want things
    which are better focused on meeting their
    specific needs. We're going to see a whole lot
    more of this, and it's going to potentially erode
    the user base of some of the big portals.
  • Jim Hake (Founder, Global Information
    Infrastructure Awards)

4
Scenario
  • Disk drive research group wants to track magnetic
    surface technologies
  • Compiler research group wants to trawl the web
    for graduate student resumés
  • ____ wants to enhance his/her collection of
    bookmarks about ____ with prominent and relevant
    links
  • Virtual libraries like the Open Directory Project
    and the Mining Co.

5
Structured web queries
  • How many links were found from an environment
    protection agency site to a site about oil and
    natural gas in the last year?
  • Apart from cycling, what is the most common topic
    cited by pages on cycling?
  • Find Web research pages which are widely cited by
    Hawaiian vacation pages

6
Goal
  • Automatically construct a focused portal
    (porthole) containing resources that are:
  • Relevant to the user's focus of interest
  • Of high influence and quality
  • Collectively comprehensive
  • Answer structured web queries by selectively
    exploring the topics involved in the query

7
Tools at hand
  • Keyword search engines
  • Synonymy, polysemy
  • Abundance, lack of quality
  • Hand compiled topic directories
  • Labor intensive, subjective judgements
  • Resources automatically located using keyword
    search and link graph distillation
  • Dependence on large crawls and indices

8
Estimating popularity
  • Extensive research on social network theory
  • Wasserman and Faust
  • Hyperlink based
  • Large in-degree indicates popularity/authority
  • Not all votes are worth the same
  • Several similar ideas and refinements
  • Google (Page and Brin) and HITS (Kleinberg)
  • Resource compilation (Chakrabarti et al.)
  • Topic distillation (Bharat and Henzinger)

9
Topic distillation overview
  • Given web graph and query
  • Search engine selects sub-graph
  • Expansion, pruning and edge weights
  • Nodes iteratively transfer authority to cited
    neighbors
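
The iterative transfer of authority can be sketched as a HITS-style power iteration. This is a minimal sketch under stated assumptions, not the distiller used by the authors: the toy graph, the function name `hits`, and the fixed iteration count are all hypothetical, and the expansion, pruning, and edge-weighting steps listed above are left out.

```python
from math import sqrt

def hits(graph, iterations=50):
    """HITS-style iteration on a dict {page: [pages it links to]}."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages citing it.
        auth = {n: 0.0 for n in nodes}
        for u, targets in graph.items():
            for v in targets:
                auth[v] += hub[u]
        # A page's hub score is the sum of the authority scores it cites.
        hub = {n: sum(auth[v] for v in graph.get(n, [])) for n in nodes}
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Hypothetical subgraph: two hub pages citing overlapping authorities.
toy_graph = {
    "hub1": ["a", "b"],
    "hub2": ["a", "b", "c"],
    "c": ["a"],
}
hub, auth = hits(toy_graph)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

In this toy graph the page cited by every other page ends up with the highest authority score, and the page citing the most authorities ends up as the best hub.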

[Figure labels: The Web, Search Engine, Query, Selected subgraph]

10
Preliminary distillation-based approach
  • Design a keyword query to represent a topic
  • Run topic distillation periodically
  • Refine query through trial-and-error
  • Works well if answer is partially known, e.g.,
    European airlines
  • swissair iberia klm

11
(No Transcript)
12
Problems with preliminary approach
  • Dependence on large web crawl and index
  • System = crawler + index + distiller
  • Unreliability of keyword match
  • Engines differ significantly on a given query due
    to small overlap (Bharat and Broder)
  • Narrow, arbitrary view of relevant subgraph
  • Topic model does not improve over time
  • Difficulty of query construction
  • Lack of output sensitivity

13
Query construction
/Companies/Electronics/Power_Supply
power suppl
switch mode smps
-multiprocessor
uninterrupt power suppl ups
-parcel
14
Query complexity
  • Complex queries (966 trials)
  • Average words: 7.03
  • Average operators ("): 4.34
  • Typical Alta Vista queries are much simpler
    (Silverstein, Henzinger, Marais and Moricz)
  • Average query words: 2.35
  • Average operators ("): 0.41
  • Forcibly adding a hub or authority node helped in
    86% of the queries

15
Query complexity
  • Complex queries needed for distillation
  • Typical Alta Vista queries are much simpler
    (Silverstein, Henzinger, Marais and Moricz)
  • Forcing a hub or authority helps 86% of the time

16
Output sensitivity
  • Say the goal is to find a comprehensive
    collection of recreational and competitive
    bicycling sites and pages
  • Ideally effort should scale with size of the
    result
  • Time spent crawling and indexing sites unrelated
    to the topic is wasted
  • Likewise, time that does not improve
    comprehensiveness is wasted

17
Proposed solution
  • Resource discovery system that can be customized
    to crawl for any topic by giving examples
  • Hypertext mining algorithms learn to recognize
    pages and sites about the given topic, and a
    measure of their centrality
  • Crawler has guidance hooks controlled by these
    two scores
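
A crawler guided by a relevance score can be sketched as a best-first traversal of the link graph. This is only an illustration under stated assumptions: the function `focused_crawl`, the in-memory toy web, and the hand-written relevance table are hypothetical stand-ins for real page fetches and a trained classifier, and the sketch uses the relevance score alone, leaving out the centrality score.

```python
import heapq

def focused_crawl(seeds, fetch_links, relevance, budget=10):
    """Best-first crawl: visit pages cited by the most relevant pages first."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    visited = []
    seen = set(seeds)
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        score = relevance(url)  # classifier's relevance estimate in [0, 1]
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                # An unvisited link inherits the relevance of the citing page.
                heapq.heappush(frontier, (-score, link))
    return visited

# Hypothetical toy web and relevance scores (no network access involved).
toy_web = {
    "seed": ["cycling-club", "car-dealer"],
    "cycling-club": ["bike-shop", "mt-biking"],
    "car-dealer": ["car-loans", "car-parts"],
}
toy_relevance = {"seed": 1.0, "cycling-club": 0.9, "bike-shop": 0.8,
                 "mt-biking": 0.95, "car-dealer": 0.05}

order = focused_crawl(
    ["seed"],
    fetch_links=lambda url: toy_web.get(url, []),
    relevance=lambda url: toy_relevance.get(url, 0.0),
    budget=5,
)
```

Both links from the seed are fetched because they inherit the seed's score, but the links found on the low-relevance page stay at the back of the frontier and are not reached within the budget.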

18
Administration scenario
[Figure: administration interface; labels: Current Examples, Drag, Taxonomy Editor, Suggested Additional Examples]

19
Relevance
[Figure: topic taxonomy tree with nodes All; BusEcon, Recreation, Arts; Companies, Cycling, ...; Bike Shops, Clubs, Mt.Biking; legend: Path nodes, Good nodes, Subsumed nodes]

20
Classification
  • How relevant is a document w.r.t. a class?
  • Supervised learning, filtering, classification,
    categorization
  • Many types of classifiers
  • Bayesian, nearest neighbor, rule-based
  • Hypertext
  • Both text and links are class-dependent clues
  • How to model link-based features?
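
One way to score how relevant a document is to a class is a Bayesian text classifier. The following is a minimal sketch of multinomial naive Bayes with Laplace smoothing, using text features only; the training examples, the function names `train` and `posterior`, and the choice of smoothing are assumptions made for illustration, not details taken from the slides.

```python
from collections import Counter
from math import exp, log

def train(docs_by_class):
    """Estimate class priors and smoothed term probabilities from examples."""
    vocab = {t for docs in docs_by_class.values() for d in docs for t in d.split()}
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for c, docs in docs_by_class.items():
        counts = Counter(t for d in docs for t in d.split())
        total = sum(counts.values())
        # Laplace smoothing keeps unseen terms from zeroing the product.
        term_logp = {t: log((counts[t] + 1) / (total + len(vocab))) for t in vocab}
        model[c] = (log(len(docs) / total_docs), term_logp)
    return model

def posterior(model, text):
    """Return Pr[c | document] for every class c, ignoring unknown terms."""
    scores = {}
    for c, (log_prior, term_logp) in model.items():
        scores[c] = log_prior + sum(term_logp.get(t, 0.0) for t in text.split())
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores.values())
    weights = {c: exp(s - top) for c, s in scores.items()}
    z = sum(weights.values())
    return {c: w / z for c, w in weights.items()}

# Hypothetical example documents for two topics.
examples = {
    "cycling": ["bike race mountain trail", "bike shop wheel repair"],
    "finance": ["stock market bank loan", "bank interest rate"],
}
model = train(examples)
relevance = posterior(model, "mountain bike trail")
```

The posterior for the class of interest can then serve as the relevance score of a page.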

21
The bag-of-words document model
  • Decide topic: topic c is picked with prior
    probability π(c), where Σ_c π(c) = 1
  • Each c has parameters θ(c,t) for terms t
  • Coin with face probabilities θ(c,t), where Σ_t θ(c,t) = 1
  • Fix document length and keep tossing coin
  • Given c, probability of document is
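
The formula itself did not survive in the transcript. Under the coin-tossing model described on this slide, the natural candidate is the multinomial likelihood, so the following is a reconstruction under that assumption rather than the slide's verbatim content. Here θ(c,t) is the per-class term parameter from the slide, while n(d,t) (the number of times term t occurs in document d) and n(d) (the document length) are notation introduced for this sketch.

```latex
\Pr[d \mid c] = \binom{n(d)}{\{n(d,t)\}} \prod_{t \in d} \theta(c,t)^{n(d,t)},
\qquad n(d) = \sum_{t} n(d,t)
```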

22
Exploiting link features
  • c = class, t = text, N = neighbors
  • Text-only model: Pr[t | c]
  • Using neighbors' text to judge my topic:
    Pr[t, t(N) | c]
  • Better model: Pr[t, c(N) | c]
  • Non-linear relaxation

23
Improvement using link features
  • 9600 patents from 12 classes marked by USPTO
  • Patents have text and cite other patents
  • Expand test patent to include neighborhood
  • Forget a fraction of neighbors' classes

24
Putting it together
25
Monitoring the crawler
[Figure: relevance over time; series: One URL, Moving Average]

26
Measures of success
  • Harvest rate
  • What fraction of crawled pages are relevant
  • Robustness across seed sets
  • Separate crawls with random disjoint samples
  • Measure overlap in URLs and servers crawled
  • Measure agreement in best-rated resources
  • Evidence of non-trivial work
  • Links from start set to the best resources
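
The first two measures can be computed from crawl logs along the following lines. This is a sketch under assumptions: the slides do not define a relevance threshold or an exact overlap formula, so the 0.5 cutoff, the "fraction of the first crawl also found in the second" definition, and the example URLs are all hypothetical.

```python
from urllib.parse import urlparse

def harvest_rate(relevances, threshold=0.5):
    """Fraction of crawled pages whose relevance score reaches the threshold."""
    if not relevances:
        return 0.0
    return sum(r >= threshold for r in relevances) / len(relevances)

def overlap(first, second):
    """Fraction of items from the first crawl that also appear in the second."""
    first, second = set(first), set(second)
    return len(first & second) / len(first) if first else 0.0

def servers(urls):
    """Host names of the given URLs."""
    return {urlparse(u).netloc for u in urls}

# Hypothetical crawl logs.
crawl_1 = ["http://a.example/x", "http://a.example/y", "http://b.example/"]
crawl_2 = ["http://a.example/x", "http://b.example/z", "http://c.example/"]

rate = harvest_rate([0.9, 0.8, 0.2, 0.7])  # 3 of 4 pages are relevant
url_overlap = overlap(crawl_1, crawl_2)    # 1 of 3 URLs is shared
server_overlap = overlap(servers(crawl_1), servers(crawl_2))  # 2 of 2 servers
```

Two crawls can share few URLs while still visiting the same servers, which is why the two overlaps are measured separately.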

27
Harvest rate
[Figure: harvest rate plots; panels: Unfocused, Focused]

28
Crawl robustness
[Figure: crawl robustness plots; panels: URL Overlap, Server Overlap; series: Crawl 1, Crawl 2]

29
Top resources after one hour
  • Recreational and competitive cycling
  • http://www.truesport.com/Bike/links.htm
  • http://reality.sgi.com/billh_hampton/jrvs/links.html
  • http://www.acs.ucalgary.ca/bentley/mark_links.html
  • HIV/AIDS research and treatment
  • http://www.stopaids.org/Otherorgs.html
  • http://www.iohk.com/UserPages/mlau/aidsinfo.html
  • http://www.ahandyguide.com/cat1/a/a66.htm
  • Purer and better than root set

30
(No Transcript)
31
(No Transcript)
32
Distance to best resources
33
Robustness of resource discovery
  • Sample disjoint sets of starting URLs
  • Two separate crawls
  • Find best authorities
  • Order by rank
  • Find overlap in the top-rated resources
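
The last step above can be sketched as a top-k comparison of two ranked lists. The function name `top_k_overlap` and the two rankings are hypothetical, and the slides do not state which overlap formula was used, so the definition below is an assumption.

```python
def top_k_overlap(ranking_a, ranking_b, k):
    """Fraction of the top-k items of ranking_a also in the top-k of ranking_b."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    return len(top_a & top_b) / len(top_a) if top_a else 0.0

# Hypothetical authority rankings (best first) from two separate crawls.
crawl_1_rank = ["p1", "p2", "p3", "p4", "p5"]
crawl_2_rank = ["p2", "p1", "p6", "p3", "p4"]

overlaps = [top_k_overlap(crawl_1_rank, crawl_2_rank, k) for k in range(1, 6)]
```

Plotting the overlap against k shows how quickly two crawls started from disjoint seeds come to agree on the best resources.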

34
Related work
  • WebWatcher, HotList and ColdList
  • Filtering as post-processing, not acquisition
  • ReferralWeb
  • Social network on the Web
  • Ahoy!, Cora
  • Hand-crafted to find home pages and papers
  • WebCrawler, Fish, Shark, Fetuccino, agents
  • Crawler guided by query keyword matches

35
Comparison with agents
  • Agents usually look for keywords and hand-crafted
    patterns
  • Cannot learn new vocabulary dynamically
  • Do not use distance-2 centrality information
  • Client-side assistant
  • We use taxonomy with statistical topic models
  • Models can evolve as crawl proceeds
  • Combine relevance and centrality
  • Broader scope: inter-community linkage analysis
    and querying

36
Conclusion
  • New architecture for example-driven
    topic-specific web resource discovery
  • No dependence on full web crawl and index
  • Modest desktop hardware adequate
  • Variable radius goal-directed crawling
  • High harvest rate
  • High quality resources found far from keyword
    query response nodes