Towards a Highly-Scalable and Effective Metasearch Engine

1
Towards a Highly-Scalable and Effective
Metasearch Engine

Authors: Zonghuan Wu, Weiyi Meng, Clement Yu,
Zhuogang Li

Presented By: Suganya Ravikumar
Mira Chokshi Jitendra
Bethina Vijayaram Bethina
2
What is a Metasearch Engine?

A metasearch engine is a system that supports
unified access to multiple local search engines.

3
Purpose of Metasearch Engine

General-purpose search engines aim to cover
all pages on the web.
Special-purpose search engines focus on confined
domains.
Combining several search engines therefore
increases the coverage of documents on the web
and retrieves more useful documents.

4
How to Build a Metasearch Engine?

A metasearch engine does not maintain its own
index of web pages. Instead, it maintains specific
information about each underlying local search
engine.

5
How does a Metasearch Engine work?
6
Challenges

Database Selection
Collection Fusion

Goal of Paper

Develop and experimentally evaluate a new
technique to rank databases optimally.

7
Related Work

Existing Approaches for Database Selection
Rank the databases for a given query by a certain
usefulness measure.
Examples: gGlOSS, CORI Net, D-WISE, Q-Pilot

Existing Approaches for Collection Fusion
Retrieve more documents from databases that
have higher ranking scores, and use adjusted
local similarities of documents to merge the
retrieved documents.
Examples: CORI Net, D-WISE, ProFusion, MRDD

8
Framework

The references "Methodology for Retrieving Text
Documents from Multiple Databases" and "Efficient
and Effective Metasearch for large number of Text
Databases" serve as the framework for this
approach.

Approach in Framework for Database Selection
The similarity of the most similar document in
each database is used as the measure for ranking
databases.

Approach in Framework for Collection Fusion
Real global similarities of documents are used to
merge the documents retrieved from the
optimally selected databases.
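
The database-selection idea in the framework can be illustrated with a small sketch. It assumes documents and queries are sparse term-weight dictionaries and that similarity is cosine similarity; the slides name neither, and `cosine`, `rank_databases`, and the toy data are hypothetical. The sketch scans every document, so it shows only the ideal ordering that a selection method tries to approximate without looking inside each database.

```python
import math

def cosine(query, doc):
    """Cosine similarity between two sparse term-weight dicts (assumed measure)."""
    dot = sum(w * doc.get(t, 0.0) for t, w in query.items())
    nq = math.sqrt(sum(w * w for w in query.values()))
    nd = math.sqrt(sum(w * w for w in doc.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def rank_databases(query, databases):
    """Order database names by the similarity of their most similar document."""
    best = {
        name: max((cosine(query, d) for d in docs), default=0.0)
        for name, docs in databases.items()
    }
    return sorted(best, key=best.get, reverse=True)

# Hypothetical toy data: three databases of term-weight vectors.
databases = {
    "D1": [{"metasearch": 1.0, "engine": 1.0}, {"index": 2.0}],
    "D2": [{"metasearch": 1.0}],
    "D3": [{"database": 1.0}],
}
ranking = rank_databases({"metasearch": 1.0}, databases)
```

Here D2 holds the single best match, so it is ranked first even though D1 contains more documents.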

9
Logical Ordering of Steps to Build a Meta Search
Engine

Build an integrated representative of the
databases.
Find the databases that contain the m documents
most similar to a query.
Rank the databases.
Retrieve the top documents from selected
databases.

10
Why do we need a Single Integrated Representative?

Scalability.
Less Storage Space.
Computational Efficiency.

11
Contents of the Integrated representative.

Adjusted Maximum Normalized Weight for a term i
and database Dj:
gidf_i * mnw_{i,j}

12
Global idfs

The idf weight based upon the global document
frequency of the term

Maximum Normalized Weight

The maximum normalized term frequency of a term
over the documents in that database.
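
A minimal sketch of how these two quantities could be combined into the adjusted maximum normalized weight. It makes several assumptions the slides do not state: the global idf is taken as log(N / df), a term frequency is normalized by the Euclidean length of its document, and each database is a list of term-frequency dictionaries. The function name `build_representative` is hypothetical.

```python
import math
from collections import defaultdict

def build_representative(databases):
    """Return (gidf, amnw) where amnw[term][db] = gidf[term] * mnw[term][db]."""
    n_docs = sum(len(docs) for docs in databases.values())

    # Global document frequency: counted over all databases together.
    df = defaultdict(int)
    for docs in databases.values():
        for doc in docs:
            for term in doc:
                df[term] += 1
    gidf = {t: math.log(n_docs / n) for t, n in df.items()}  # assumed idf form

    # Maximum normalized weight of each term within each database.
    mnw = defaultdict(dict)
    for name, docs in databases.items():
        for doc in docs:
            norm = math.sqrt(sum(tf * tf for tf in doc.values()))
            if norm == 0:
                continue
            for term, tf in doc.items():
                mnw[term][name] = max(mnw[term].get(name, 0.0), tf / norm)

    amnw = {t: {db: gidf[t] * w for db, w in per_db.items()}
            for t, per_db in mnw.items()}
    return gidf, amnw

# Hypothetical toy data: raw term frequencies per document.
gidf, amnw = build_representative({
    "D1": [{"a": 3, "b": 4}],
    "D2": [{"a": 1}, {"c": 2}],
})
```

Only one number per (term, database) pair is kept, which is what makes a single integrated representative compact compared with storing per-document statistics.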

13
New Ranking Measure Based on the Integrated
Representative.

For query terms i = 1, ..., k
[ranking formula not captured in this transcript]

14
OptDocRetrv Algorithm

Suppose databases (D1, D2, ..., DN) are optimally
ranked for a given query on the basis of the global
similarity of the most similar document in
database Di with the query.
Select the first s databases.
Each selected database sends the actual global
similarity of its most similar document to the
metasearch engine, which computes the minimum
of these values, min_sim.

15
OptDocRetrv Algorithm

Each of these s search engines now returns to
the metasearch engine those documents whose
global similarities are greater than or equal to
min_sim.
If m or more documents are returned from the s
search engines, they are sorted in
descending order of similarity and the first m
documents are returned to the user.
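
The steps on the two OptDocRetrv slides can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: each database is modeled as an in-memory list of (doc_id, global_similarity) pairs, the starting value of s is a parameter, and the behavior when fewer than m documents come back (add the next database in the ranking and repeat) is an assumption, since the slides describe only the case where m or more documents are returned.

```python
def opt_doc_retrv(ranked_dbs, m, s=1):
    """ranked_dbs: list of (name, docs) in ranked order;
    docs is a list of (doc_id, global_similarity) pairs."""
    while True:
        s = min(s, len(ranked_dbs))
        selected = [docs for _, docs in ranked_dbs[:s] if docs]
        if not selected:
            return []
        # Each selected database reports the similarity of its best document;
        # min_sim is the smallest of those reported values.
        min_sim = min(max(sim for _, sim in docs) for docs in selected)
        # Each selected database returns the documents at or above min_sim.
        returned = [(doc_id, sim)
                    for docs in selected
                    for doc_id, sim in docs
                    if sim >= min_sim]
        if len(returned) >= m or s == len(ranked_dbs):
            returned.sort(key=lambda pair: pair[1], reverse=True)
            return returned[:m]
        s += 1  # assumed: widen the selection when too few documents come back

# Hypothetical toy data, already ranked by best-document similarity.
ranked = [
    ("D1", [("a", 0.9), ("b", 0.4)]),
    ("D2", [("c", 0.8), ("d", 0.7)]),
    ("D3", [("e", 0.5)]),
]
top = opt_doc_retrv(ranked, m=3)
```

In this toy run the threshold drops from 0.9 to 0.8 to 0.5 as databases are added, and document b (0.4) is never transferred because it stays below min_sim.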

16
Experimental Results

Databases used: 221
Source of databases: 5 TREC document collections
Total size of databases: 2 GB
Number of distinct terms: over 1 million
Queries used: 1000 Internet queries

17
Performance Measures

Effectiveness (Quality) Measures
cor_iden_db: Percentage of correctly identified
databases

18
Performance Measures

Effectiveness (Quality) Measures
cor_iden_doc: Percentage of correctly identified
documents

19
Performance Measures

Efficiency Measures
db_effort: Database search effort

20
Performance Measures

Efficiency Measures
doc_effort: Document search effort
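
The four measures can be illustrated with a short sketch. The slides give only the names and one-line descriptions, so the exact definitions here are assumptions: each measure is taken as a ratio against the ideal result, meaning the databases that actually hold the m most similar documents and those m documents themselves. The function `evaluate` and its inputs are hypothetical.

```python
def evaluate(selected_dbs, ideal_dbs, retrieved_docs, ideal_docs):
    """Compute the four measures as ratios against the ideal result (assumed)."""
    selected_dbs, ideal_dbs = set(selected_dbs), set(ideal_dbs)
    retrieved_docs, ideal_docs = set(retrieved_docs), set(ideal_docs)
    return {
        # Effectiveness: how much of the ideal result was found.
        "cor_iden_db": len(selected_dbs & ideal_dbs) / len(ideal_dbs),
        "cor_iden_doc": len(retrieved_docs & ideal_docs) / len(ideal_docs),
        # Efficiency: how much was searched or transferred to find it.
        "db_effort": len(selected_dbs) / len(ideal_dbs),
        "doc_effort": len(retrieved_docs) / len(ideal_docs),
    }

# Hypothetical run: three databases searched, four documents transferred,
# while the ideal answer is three documents held in two databases.
scores = evaluate(
    selected_dbs=["D1", "D2", "D3"],
    ideal_dbs=["D1", "D2"],
    retrieved_docs=["a", "c", "d", "e"],
    ideal_docs=["a", "c", "d"],
)
```

Under these assumed definitions, effectiveness values of 1.0 mean nothing was missed, and effort values above 1.0 mean more databases or documents were touched than strictly necessary.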

21
Prototype System

CSams (Computer Science Academic Metasearch
engine)
URL: http://slate.cs.binghamton.edu:8080/CSams/

Future Improvements

Dependencies of Terms
Query Sampling Techniques

22
Conclusion

The database selection and document retrieval
method can achieve close to ideal performance.
In general, this method performs well even for
multi-term queries and performs much better for
short queries.
Improved scalability over previous systems.
