Title: Strategies For An Intelligent Search Tool
1. Strategies For An Intelligent Search Tool
- Mieczyslaw Klopotek (1), Mariusz Sado (2), Arkadiusz Dzierzanowski (3), Marcin Pionnier (2)
- (1) IPI PAN, (2) MINI PW
- (3) Ministry of Economy, Employment and Social Affairs
2. Motivation
- The Internet is growing in information content
- It becomes more difficult to find the information one needs at a given time
- Search engines help, but only primitive queries are possible
- Personal search tools have to be developed to match the needs of a particular user
- A goal-oriented crawler
3. Crawlers
- A general-purpose crawler collects any page it encounters
- It has been conjectured that pages related to a particular topic tend to be linked together
- Linkage locality: Web pages on a given topic are more likely to link to those on the same topic
- Sibling locality: if a web page points to certain web pages on a given topic, then it is likely to point to other pages on the same topic
- => The focused crawling strategy
- This does not always have to be the case
- => The focused crawler assumption is too crisp
4. Learning and Ranking Crawlers
- An intermediate approach between the general-purpose and the focused crawler
- Rank pages to be visited first
- More promising pages first
- Learn which pages are more promising
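
The "more promising first" ordering can be illustrated with a priority queue. The following is only a minimal sketch: the function `best_first_crawl`, the `score` callable and the toy link graph are hypothetical names introduced here, and a dictionary stands in for actual page fetching. A learning crawler would additionally update `score` as pages are visited; here it is fixed.

```python
import heapq

def best_first_crawl(seeds, links, score, limit):
    """Visit up to `limit` pages, always expanding the highest-scored URL next.

    links: dict mapping a URL to the URLs it points to (stands in for fetching).
    score: hypothetical callable estimating how promising a URL is.
    """
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-score(u), u) for u in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-score(target), target))
    return visited

# Toy data (assumed, for illustration only).
toy_links = {"a": ["b", "c"], "b": ["d"], "c": ["e"]}
toy_scores = {"a": 0.5, "b": 0.1, "c": 0.9, "d": 0.2, "e": 0.8}
order = best_first_crawl(["a"], toy_links, lambda u: toy_scores.get(u, 0.0), limit=4)
print(order)  # ['a', 'c', 'e', 'b']
```

A general-purpose crawler corresponds to a constant `score`, and a strictly focused crawler to one that discards every link from a non-relevant page; the ranking crawler sits between the two by only reordering the frontier.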
5. Ranking Strategies
- Similarity-based
- PageRank-based
- QDPageRank-based and Personalized PageRank-based
- PHITS/PLSA-based
- ArbitraryPredicates-based
  - Content-based learning
  - URL-token-based learning
  - Link-based learning (from focused crawl)
  - Sibling-based learning (from focused crawl)
  - Effects combined
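
As an illustration of URL-token-based learning, the sketch below keeps per-token statistics and scores a candidate URL by how strongly its tokens have been associated with pages that satisfied the predicate. The slides do not give the formula the authors used; the ratio P(good | token) / P(good), the tokenisation, and the class name `UrlTokenLearner` are all assumptions made for this example.

```python
import re
from collections import Counter

class UrlTokenLearner:
    """Hypothetical learner: counts how often URL tokens co-occur with good pages."""

    def __init__(self):
        self.pages = 0                      # pages crawled so far
        self.good_pages = 0                 # pages that satisfied the predicate
        self.token_pages = Counter()        # pages whose URL contained the token
        self.token_good_pages = Counter()   # ... and that satisfied the predicate

    @staticmethod
    def tokens(url):
        # Assumed tokenisation: split on anything that is not a letter or digit.
        return {t for t in re.split(r"[^a-z0-9]+", url.lower()) if t}

    def observe(self, url, is_good):
        self.pages += 1
        self.good_pages += int(is_good)
        for t in self.tokens(url):
            self.token_pages[t] += 1
            self.token_good_pages[t] += int(is_good)

    def interest_ratio(self, token):
        """P(good | token) / P(good); 1.0 means the token carries no evidence."""
        if self.good_pages == 0 or self.token_pages[token] == 0:
            return 1.0
        p_good = self.good_pages / self.pages
        p_good_given_token = self.token_good_pages[token] / self.token_pages[token]
        return p_good_given_token / p_good

    def score(self, url):
        """Mean interest ratio over the tokens of a candidate URL."""
        toks = self.tokens(url)
        if not toks:
            return 1.0
        return sum(self.interest_ratio(t) for t in toks) / len(toks)

learner = UrlTokenLearner()
learner.observe("http://example.org/jazz/history.html", True)
learner.observe("http://example.org/jazz/bands.html", True)
learner.observe("http://example.org/shop/cart.html", False)
learner.observe("http://example.org/shop/jazz.html", False)
print(learner.score("http://example.org/jazz/new.html"))
print(learner.score("http://example.org/shop/new.html"))
```

The same bookkeeping could in principle be applied to words in page content, to in-linking pages, or to siblings, which is one way to read the other learning variants in the list; that reading is an interpretation, not something the slides state.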
6. Arbitrary predicates versus random harvest rate
7. Arbitrary predicates contributions to combined effect
[Chart; series: Combined, Content, URLToken, Sibling, Link]
8. Arbitrary predicates information reuse
9. Comparison of ArbitraryPredicates with other ranking strategies
[Chart: good pages for a set of queries]
10. Combining ranking strategies
[Chart: good pages for a set of queries; series: ArbitraryPredicates; combined by the sum of weights; combined by rotating algorithms (if performance was above average, the algorithm stayed)]
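
The two combination schemes named on this slide can be sketched as follows. The slide gives no details, so this is one possible reading: `combine_by_weighted_sum` adds the per-strategy scores of each page multiplied by a strategy weight, and `next_strategy` keeps the current algorithm while its harvest rate is above the mean over all algorithms and otherwise moves to the next one. Both function names, the weights and the toy scores are assumptions.

```python
def combine_by_weighted_sum(scores_by_strategy, weights):
    """scores_by_strategy: {strategy: {url: score}}; weights: {strategy: weight}."""
    combined = {}
    for name, scores in scores_by_strategy.items():
        for url, s in scores.items():
            combined[url] = combined.get(url, 0.0) + weights[name] * s
    return combined

def next_strategy(current, strategies, harvest_by_strategy):
    """Keep `current` while its harvest rate is above the mean; otherwise rotate."""
    mean = sum(harvest_by_strategy.values()) / len(harvest_by_strategy)
    if harvest_by_strategy[current] > mean:
        return current
    i = strategies.index(current)
    return strategies[(i + 1) % len(strategies)]

# Toy scores (assumed, for illustration only).
scores = {
    "similarity": {"p1": 0.8, "p2": 0.2},
    "pagerank": {"p1": 0.1, "p2": 0.6},
}
weights = {"similarity": 0.5, "pagerank": 0.5}
combined = combine_by_weighted_sum(scores, weights)
ranked = sorted(combined, key=combined.get, reverse=True)
print(ranked)  # ['p1', 'p2']
```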
11. Spearman rank correlation between page evaluations for various ranking strategies
[Table; surviving labels: Random, Similarity]
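
For reference, Spearman's rank correlation is the Pearson correlation computed on ranks rather than raw values. The sketch below is a plain-Python version with average ranks for ties; the function names and the sample scores are illustrative and are not taken from the slides.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks of x and y.

    Raises ZeroDivisionError if all values in x (or in y) are equal.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Scores that two hypothetical ranking strategies assign to the same three pages.
strategy_a = [0.9, 0.1, 0.5]
strategy_b = [0.2, 0.3, 0.8]
print(spearman(strategy_a, strategy_b))  # -0.5
```

A value near 1 means two strategies order the pages almost identically, a value near 0 means their orderings are unrelated, and a negative value means they tend to disagree.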
12. Diversity of servers from which pages were drawn for various queries
[Chart; surviving labels: Random, Similarity, Common Intersection]
13. Results - Summary
- The resulting harvest rate of the crawl is generally at least twice that of a random crawler
- It is usually between 20% and 50%, with an average of 40%, while the random crawler gives only 15%
- Such a harvest rate illustrates that the intelligent crawler is an efficient option for finding highly specific resources
- If one reuses data collected from previous crawls (for a similar query), the crawler's initial learning effort is reduced
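
Harvest rate is understood here in its usual sense: the fraction of crawled pages that turned out to be relevant. A minimal sketch of how it could be computed from a crawl log follows; the function names and the sample flags are made up for illustration and are not measurements from the slides.

```python
def harvest_rate(relevance_flags):
    """Fraction of crawled pages that satisfied the predicate."""
    if not relevance_flags:
        return 0.0
    return sum(relevance_flags) / len(relevance_flags)

def running_harvest_rate(relevance_flags):
    """Harvest rate after each crawled page, e.g. for plotting learning progress."""
    rates, good = [], 0
    for n, flag in enumerate(relevance_flags, start=1):
        good += int(flag)
        rates.append(good / n)
    return rates

# Illustrative crawl log: True marks a page that satisfied the predicate.
flags = [False, False, True, False, True, True, True, True]
print(harvest_rate(flags))              # 0.625
print(running_harvest_rate(flags)[:3])
```

The running variant makes the effect of learning visible: a crawler that starts with reused statistics from an earlier, similar query would be expected to reach a given rate after fewer pages.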