Strategies For An Intelligent Search Tool

Slides: 14

Provided by: ipipa

1
Strategies For An Intelligent Search Tool

Mieczyslaw Klopotek (1), Mariusz Sado (2), Arkadiusz Dzierzanowski (3), Marcin Pionnier (2)
(1) IPI PAN, (2) MINI PW,
(3) Ministry of Economy, Employment and Social Affairs

2
Motivation

The Internet is growing in information content.
It is becoming more difficult to find the information one needs at a given time.
Search engines help, but only primitive queries are possible.
Personal search tools have to be developed to match the needs of a particular user.
A goal-oriented crawler

3
Crawlers

A general-purpose crawler collects any page it encounters.
It has been conjectured that pages related to a particular topic tend to be linked together:
Linkage locality: Web pages on a given topic are more likely to link to pages on the same topic.
Sibling locality: If a web page points to certain web pages on a given topic, then it is likely to point to other pages on the same topic.
=> The focused crawling strategy
These assumptions do not always hold.
=> The focused crawler assumption is too crisp

4
Learning and Ranking Crawlers

An intermediate approach between the general-purpose and the focused crawler
Rank the pages to decide which to visit first
More promising pages first
Learn which pages are more promising
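
The ranking idea on this slide can be sketched as a best-first crawl over a priority queue. This is only an illustration, not the authors' implementation: `ranked_crawl`, `fetch_links` and `score` are hypothetical names, and the learning step is reduced to a caller-supplied scoring function.

```python
import heapq
from typing import Callable, Iterable, List


def ranked_crawl(
    seeds: Iterable[str],
    fetch_links: Callable[[str], Iterable[str]],  # hypothetical: returns out-links of a page
    score: Callable[[str], float],                # hypothetical: higher means more promising
    max_pages: int,
) -> List[str]:
    """Visit pages in order of decreasing score (best-first)."""
    seeds = list(seeds)
    # heapq is a min-heap, so scores are negated to pop the best page first.
    frontier = [(-score(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited: List[str] = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return visited
```

A learning crawler would additionally update whatever model backs `score` as pages are evaluated; that part is left out here.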

5
Ranking Strategies

similarity-based
PageRank-based
QDPageRank-based and Personalized PageRank-based
PHITS/PLSA-based
ArbitraryPredicate-based
Content-based learning
URL-token-based learning
Link-based learning (from focused crawl)
Sibling-based learning (from focused crawl)
Effects combined
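
As an illustration of the first item in the list, a similarity-based ranker can score a page by the cosine similarity between the query's and the page's term-frequency vectors. The slides do not say which similarity measure was used, so the cosine measure, the whitespace tokenization and the name `cosine_similarity` are assumptions made for this sketch.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of two texts over raw term counts (assumed measure)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Pages whose text scores higher against the query would be placed earlier in the crawl queue.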

6
Arbitrary predicates versus random crawl: harvest rate
7
Arbitrary predicates: contributions to the combined effect
[Figure legend: Combined, Content, URLToken, Sibling, Link]
8
Arbitrary predicates information reuse
9
Comparison of ArbitraryPredicates with other ranking strategies
[Figure: good pages for a set of queries]
10
Combining ranking strategies
ArbitraryPredicates
Combined by the sum of weights
Combined by rotating algorithms (an algorithm stays in use while its performance is above average)
[Figure: good pages for a set of queries]
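
The two combination schemes named on this slide might look roughly as follows. The slide gives no details, so everything here is an assumption: the function names, the use of each strategy's recent harvest rate as the "performance" measure, and the round-robin order of rotation.

```python
from typing import Callable, Sequence


def combined_score(url: str, strategies: Sequence[Callable[[str], float]]) -> float:
    """Sum-of-weights combination: add up the weight each strategy gives the page."""
    return sum(strategy(url) for strategy in strategies)


def next_strategy(current: int, performance: Sequence[float]) -> int:
    """Rotation: keep the current strategy while it performs above average,
    otherwise move on to the next one (round-robin order is an assumption)."""
    average = sum(performance) / len(performance)
    if performance[current] > average:
        return current
    return (current + 1) % len(performance)
```

With `performance = [0.5, 0.1, 0.3]`, strategy 0 is kept because it is above the average of 0.3, while strategy 1 would hand over to strategy 2.
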
11
Spearman rank correlation between page evaluations for various ranking strategies
[Figure labels: Random Similarity]
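
For reference, the Spearman rank correlation between two strategies' page evaluations can be computed as below. The sketch uses the textbook formula rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which is only valid when there are no tied scores; how the authors handled ties is not stated, and the function names are hypothetical.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Rank positions 1..n by increasing value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation of two equally long score lists without ties."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two score lists of equal length >= 2")
    # d is the difference between the ranks a page gets under each strategy.
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

Here `x` and `y` would hold the scores two ranking strategies assign to the same pages, in the same page order.
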
12
Diversity of servers from which pages are drawn for various queries
[Figure labels: Random Similarity, Common Intersection]
13
Results - Summary

The harvest rate of the resulting crawl is generally at least twice that of a random crawler:
usually between 20% and 50%, with an average of 40%, while the random crawler gives only 15%.
Such a harvest rate shows that the intelligent crawler is an efficient option for finding highly specific resources.
If one reuses data collected from previous crawls (with a similar query), the crawler's initial learning effort is reduced.
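
The slides do not define harvest rate; the sketch below assumes the usual meaning in the crawling literature, namely the fraction of crawled pages that turn out to be relevant. The name `harvest_rate` and the boolean-flag input are illustrative choices.

```python
from typing import Iterable


def harvest_rate(relevance_flags: Iterable[bool]) -> float:
    """Fraction of crawled pages judged relevant (assumed definition)."""
    flags = list(relevance_flags)
    if not flags:
        return 0.0  # nothing crawled yet
    return sum(1 for flag in flags if flag) / len(flags)
```

For example, a crawl in which 2 of 5 fetched pages are relevant has a harvest rate of 0.4.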