Strategies For An Intelligent Search Tool - PowerPoint PPT Presentation

Slides: 14
Provided by: ipipa
Transcript and Presenter's Notes


1
Strategies For An Intelligent Search Tool
  • Mieczyslaw Klopotek (1), Mariusz Sado (2), Arkadiusz
    Dzierzanowski (3), Marcin Pionnier (2)
  • (1) IPI PAN, (2) MINI PW
  • (3) Ministry of Economy, Employment and Social
    Affairs

2
Motivation
  • The Internet is growing in information content
  • It becomes more difficult to find the information
    one needs at a given time.
  • Search engines help, but only primitive queries
    are possible
  • Personal search tools have to be developed to
    match the needs of a particular user.
  • A goal-oriented crawler

3
Crawlers
  • A general-purpose crawler collects any page it
    encounters.
  • It has been hypothesized that pages related to a
    particular topic tend to be linked together
  • Linkage locality: Web pages on a given topic are
    more likely to link to those on the same topic.
  • Sibling locality: If a web page points to certain
    web pages on a given topic, then it is likely to
    point to other pages on the same topic.
  • => The focused crawling strategy
  • These assumptions do not always hold
  • => The focused crawler assumption is too crisp

4
Learning and Ranking Crawlers
  • An intermediate approach between general purpose
    and focussed crawler
  • Rank pages to be visited first
  • More promising first
  • Learn which pages are more promising
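
A minimal sketch of the "more promising first" idea, assuming the frontier is kept in a priority queue ordered by a page score. The names `ranked_crawl`, `get_links`, `score` and `is_relevant` are hypothetical, the link graph is a toy in-memory dictionary rather than the live Web, and the score is fixed here, whereas a learning crawler would update it during the crawl.

```python
import heapq

def ranked_crawl(seeds, get_links, score, is_relevant, budget):
    """Visit up to `budget` pages, always taking the highest-scored one next."""
    # heapq is a min-heap, so scores are negated to pop the best page first.
    frontier = [(-score(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited, hits = [], 0
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        if is_relevant(url):
            hits += 1
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return visited, hits

# Toy data (invented for illustration only).
graph = {"seed": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
scores = {"seed": 1.0, "a": 0.2, "b": 0.9, "c": 0.1, "d": 0.8}

visited, hits = ranked_crawl(
    ["seed"],
    get_links=lambda u: graph.get(u, []),
    score=lambda u: scores[u],
    is_relevant=lambda u: scores[u] > 0.5,
    budget=3,
)
print(visited, hits)
```

With a budget of three pages, the crawl follows the higher-scored branch and never spends a visit on the low-scored page "a".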

5
Ranking Strategies
  • similarity-based
  • PageRank-based
  • QDPageRank-based and Personalized PageRank-based
  • PHITS/PLSA-based
  • ArbitraryPredicate-based
  • Content Based Learning
  • URL token based learning
  • Link Based Learning (from focussed crawl)
  • Sibling Based Learning (from focussed crawl)
  • Effects combined
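
As an illustration of the first strategy in the list, the sketch below ranks pages by their similarity to a query. The slides do not say which similarity measure was used; cosine similarity over raw term counts is assumed here, and the function name and sample texts are invented for the example.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical query and page texts.
query = "web crawler"
pages = {
    "p1": "focused web crawler ranking",
    "p2": "cooking recipes",
    "p3": "web crawler",
}

# Pages most similar to the query come first.
ranking = sorted(pages, key=lambda p: cosine_similarity(query, pages[p]), reverse=True)
print(ranking)
```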

6
Arbitrary predicates versus random harvest rate

7
Arbitrary predicates: contributions to combined effect
  • Figure legend: Combined, Content, URLToken, Sibling, Link

8
Arbitrary predicates: information reuse

9
Comparison of ArbitraryPredicates with other ranking strategies
  • Figure label: good pages for a set of queries

10
Combining ranking strategies
  • ArbitraryPredicates
  • Combined by the sum of weights
  • Combined by rotating algorithms (if performance was
    above average, the algorithm stayed)
  • Figure label: good pages for a set of queries
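
The two combination schemes named on this slide can be read in more than one way; the sketch below shows one possible reading and is not taken from the presentation. In `combine_by_weighted_sum` each strategy's page scores are scaled by a per-strategy weight and added up. In `next_strategy` the current algorithm is kept while its recent harvest rate is above the average, and otherwise the next one in the list takes over. All names, weights and scores are hypothetical.

```python
def combine_by_weighted_sum(scores_by_strategy, weights):
    """Add up each strategy's page scores, scaled by that strategy's weight."""
    combined = {}
    for name, scores in scores_by_strategy.items():
        for page, value in scores.items():
            combined[page] = combined.get(page, 0.0) + weights[name] * value
    return combined

def next_strategy(current, strategies, recent_rate, average_rate):
    """Keep the current strategy while it beats the average, else rotate."""
    if recent_rate > average_rate:
        return current
    i = strategies.index(current)
    return strategies[(i + 1) % len(strategies)]

# Invented example scores for two pages under two strategies.
scores_by_strategy = {
    "content": {"p1": 0.5, "p2": 0.25},
    "link": {"p1": 0.0, "p2": 1.0},
}
weights = {"content": 1.0, "link": 0.5}
print(combine_by_weighted_sum(scores_by_strategy, weights))

strategies = ["content", "link"]
print(next_strategy("content", strategies, recent_rate=0.5, average_rate=0.3))
print(next_strategy("link", strategies, recent_rate=0.1, average_rate=0.3))
```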

11
Spearman rank correlation between page evaluations for various ranking strategies
  • Figure labels: Random Similarity
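
For reference, a small sketch of how a Spearman rank correlation between two strategies' page evaluations could be computed. It uses the textbook formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which is only valid when there are no tied scores; the score lists are invented and are not data from the presentation.

```python
def ranks(values):
    """Rank positions (1 = smallest value); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation of two equally long, tie-free score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical scores given to the same four pages by two strategies.
strategy_a = [0.9, 0.1, 0.5, 0.3]
strategy_b = [0.8, 0.2, 0.3, 0.6]
print(spearman(strategy_a, strategy_b))
```

A value near 1 means the two strategies order the pages almost identically, and a value near -1 means they order them in reverse.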

12
Diversity of servers from which pages are drawn for various queries
  • Figure labels: Random Similarity
  • Figure labels: Common Intersection

13
Results - Summary
  • The resulting harvest rate of the crawl is
    generally at least twice that of a random
    crawler
  • usually between 20% and 50%, with an average of
    40%, while the random crawler gives only 15%.
  • Such a harvest rate illustrates that the
    intelligent crawler is an efficient option for
    finding highly specific resources
  • If one reuses data collected from previous
    crawls (similar query), the crawler's initial
    learning effort is reduced
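
The slides do not define harvest rate. The sketch below assumes the usual meaning in focused crawling, namely the fraction of crawled pages that turn out to be relevant. The function name and the sample flags are hypothetical.

```python
def harvest_rate(relevant_flags):
    """Fraction of crawled pages judged relevant (0.0 for an empty crawl)."""
    if not relevant_flags:
        return 0.0
    return sum(1 for flag in relevant_flags if flag) / len(relevant_flags)

# One flag per crawled page: True if the page satisfied the query.
crawl = [True, False, True, False, False]
print(f"harvest rate: {harvest_rate(crawl):.0%}")
```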