Title: Strategies For An Intelligent Search Tool
- Mieczyslaw Klopotek (1), Mariusz Sado (2), Arkadiusz Dzierzanowski (3), Marcin Pionnier (2)
- (1) IPI PAN, (2) MINI PW
- (3) Ministry of Economy, Employment and Social
- The Internet is growing in information content
- It becomes more difficult to find the information one needs at a given time
- Search engines help, but only primitive queries are possible
- Personal search tools have to be developed to match the needs of a particular user
- A goal-oriented crawler
- A general-purpose crawler collects any page it encounters
- It has been conjectured that pages related to a particular topic may be linked together
- Linkage Locality: Web pages on a given topic are more likely to link to those on the same topic
- Sibling Locality: If a web page points to certain web pages on a given topic, then it is likely to point to other pages on the same topic
- => The focussed crawling strategy
- These assumptions need not always hold
- => The focussed crawler assumption is too crisp
Learning and Ranking Crawlers
- An intermediate approach between general-purpose and focussed crawlers
- Rank pages to be visited first
- More promising first
- Learn which pages are more promising
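
The "more promising first" idea can be sketched as a best-first crawl over a priority queue. This is a minimal illustration, not the authors' implementation: the in-memory link graph `LINKS`, the fixed `SCORES` table and the function name `ranked_crawl` are all assumptions made for the example. In a learning crawler the scores would be re-estimated as pages are fetched rather than looked up from a fixed table.

```python
import heapq

def ranked_crawl(seeds, links, score, budget):
    """Visit up to `budget` pages, always taking the highest-scored
    page from the frontier next (best-first order)."""
    # heapq is a min-heap, so scores are negated to pop the best page first.
    frontier = [(-score(u), u) for u in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score(nxt), nxt))
    return visited

# Hypothetical toy web: page -> outgoing links.
LINKS = {"seed": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
# Hypothetical "promise" of each page (higher is better).
SCORES = {"seed": 1.0, "a": 0.2, "b": 0.9, "c": 0.1, "d": 0.8}

order = ranked_crawl(["seed"], LINKS, SCORES.get, budget=4)
```

Unlike a focussed crawler, nothing is discarded outright here: low-scored pages stay in the frontier and are simply visited later, if the budget allows.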
Ranking Strategies
- similarity-based
- PageRank-based
- QDPageRank-based and Personalized PageRank-based
- PHITS/PLSA-based
- ArbitraryPredicate-based
- Content Based Learning
- URL token based learning
- Link Based Learning (from focussed crawl)
- Sibling Based Learning (from focussed crawl)
- Effects combined
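
As an illustration of the simplest entry in the list, a similarity-based rank can be sketched as the cosine similarity between the query and the page text. The slides do not say which similarity measure was used, so the bag-of-words cosine below, and the name `cosine_similarity`, are assumptions chosen for the example.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts under a plain bag-of-words model
    (lower-cased, whitespace-tokenised, raw term counts)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

A crawler could use such a score to order its frontier, for instance by scoring the anchor text or the linking page against the user's query.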
Arbitrary predicates versus random harvest rate
Arbitrary predicates: contributions to combined effect
Arbitrary predicates: information reuse
Comparison of ArbitraryPredicates with other ranking strategies
(Figure: good pages for a set of queries)
Combining ranking strategies
- Combined by the sum of weights
- Combined by rotating algorithms (if performance was above average, the algorithm stayed)
(Figure: good pages for a set of queries)
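
The two combination schemes named on this slide might look roughly as follows. The slide gives no formulas, so this is one plausible reading: "sum of weights" is taken as a weighted sum of per-strategy page scores, and "rotating" as a round-robin in which the current strategy is kept only while its recent harvest rate is above the average. All names and the example numbers are hypothetical.

```python
def combine_by_weighted_sum(scores_by_strategy, weights):
    """Add up each strategy's score for a page, scaled by that strategy's weight."""
    combined = {}
    for name, scores in scores_by_strategy.items():
        w = weights.get(name, 0.0)  # strategies without a weight contribute nothing
        for page, s in scores.items():
            combined[page] = combined.get(page, 0.0) + w * s
    return combined

def next_strategy(strategies, current, recent_rate, average_rate):
    """Rotation rule: the current strategy stays if it performs above
    average, otherwise the next one in the list takes over."""
    if recent_rate > average_rate:
        return current
    i = strategies.index(current)
    return strategies[(i + 1) % len(strategies)]

# Hypothetical scores for two pages under two strategies.
scores = {
    "similarity": {"p1": 0.8, "p2": 0.2},
    "pagerank": {"p1": 0.1, "p2": 0.5},
}
combined = combine_by_weighted_sum(scores, {"similarity": 1.0, "pagerank": 1.0})
```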
Spearman rank correlation between page evaluations for various ranking strategies
(Table: column labels include Random, Similarity)
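
For reference, the Spearman rank correlation between two strategies' evaluations of the same n pages can be computed from the rank differences d_i as rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)). The sketch below assumes there are no tied scores, where this closed form is exact; the function names are illustrative only.

```python
def ranks(values):
    """Rank positions (1 = smallest value); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman rank correlation of two equally long score lists (no ties)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two score lists of equal length >= 2")
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

A value near 1 means two strategies order the pages almost identically, a value near 0 means their orderings are essentially unrelated, and a value near -1 means one ordering is the reverse of the other.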
Diversity of Servers to draw pages for various
(Table: labels include Random, Similarity, Common, Intersection)
Results - Summary
- The resulting harvest rate of the crawl is generally at least twice that of a random crawler
- It is usually between 20% and 50%, with an average of 40%, while the random crawler gives only 15%
- Such a harvest rate illustrates that the intelligent crawler is an efficient option for finding highly specific resources
- If one reuses data collected from previous crawls (similar query), the crawler's initial learning effort is reduced
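
The slides do not define harvest rate explicitly; the sketch below assumes the usual meaning, the fraction of crawled pages that turn out to be relevant to the query. The function name and the sample data are illustrative.

```python
def harvest_rate(relevance_flags):
    """Fraction of crawled pages judged relevant.

    `relevance_flags` holds one boolean per crawled page, in crawl order.
    """
    if not relevance_flags:
        return 0.0  # nothing crawled yet
    return sum(1 for flag in relevance_flags if flag) / len(relevance_flags)

# Hypothetical crawl of five pages, two of them relevant.
rate = harvest_rate([True, False, True, False, False])
```

Under this definition, the reported average of 40% against 15% for a random crawler means roughly two relevant pages in every five fetched, compared with about one in seven.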