Title: Towards a Highly-Scalable and Effective Metasearch Engine
1. Towards a Highly-Scalable and Effective
Metasearch Engine
- Authors: Zonghuan Wu, Weiyi Meng, Clement Yu,
Zhuogang Li
- Presented by: Suganya Ravikumar
Mira Chokshi Jitendra
Bethina Vijayaram Bethina
2. What is a Metasearch Engine?
- A metasearch engine is a system that supports
unified access to multiple local search engines.
3. Purpose of Metasearch Engine
- General-purpose search engines focus on finding
all pages on the web.
- Special-purpose search engines focus on confined
domains.
- Hence several search engines can be combined to
increase the coverage of documents over the web
and retrieve more useful documents.
4. How to Build a Metasearch Engine?
- A Metasearch engine does not maintain its own
index on web pages but maintains specific
information about each underlying local search
engine.
5. How does a Metasearch Engine work?
6. Challenges
- Database Selection
- Collection Fusion
Goal of Paper
- Develop and experimentally evaluate a new
technique for ranking databases optimally.
7. Related Work
- Existing approaches for database selection:
rank the databases for a given query on a certain
usefulness measure.
- Examples: gGlOSS, CORI Net, D-WISE, Q-Pilot
- Existing approaches for collection fusion:
retrieve more documents from databases that
have higher ranking scores, and use adjusted
local similarities of documents to merge the
retrieved documents.
- Examples: CORI Net, D-WISE, ProFusion, MRDD
8. Framework
- The references "Methodology for Retrieving Text
Documents from Multiple Databases" and "Efficient
and Effective Metasearch for large number of Text
Databases" serve as the framework for this
approach.
- Approach in the framework for database selection:
the similarity of the most similar document in
a database is used as the measure for ranking
databases.
- Approach in the framework for collection fusion:
the real global similarities of documents are used
to merge the documents retrieved from the
optimally selected databases.
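
The two framework ideas above can be illustrated with a minimal Python sketch. It assumes the global similarity of every document is already known, which a real metasearch engine would have to estimate from database representatives; the function names and the sample similarity values are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Tuple


def rank_databases(doc_sims: Dict[str, List[float]]) -> List[str]:
    """Order databases by the similarity of their most similar document."""
    best = {name: max(sims) for name, sims in doc_sims.items() if sims}
    return sorted(best, key=lambda name: best[name], reverse=True)


def merge_by_global_similarity(
    doc_sims: Dict[str, List[float]], selected: List[str], m: int
) -> List[Tuple[float, str]]:
    """Merge documents from the selected databases by global similarity."""
    merged = [(sim, name) for name in selected for sim in doc_sims[name]]
    merged.sort(reverse=True)  # highest global similarity first
    return merged[:m]


# Hypothetical global similarities of each database's documents to one query.
doc_sims = {
    "D1": [0.30, 0.10],
    "D2": [0.90, 0.20],
    "D3": [0.60, 0.55],
}

ranking = rank_databases(doc_sims)
top3 = merge_by_global_similarity(doc_sims, ranking[:2], 3)
print(ranking)
print(top3)
```

In this toy data, searching only the two top-ranked databases is enough to recover the three globally most similar documents, which is the property the ranking measure is meant to provide.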
9. Logical Ordering of Steps to Build a Metasearch
Engine
- Build an integrated representative of the
databases.
- Find the databases with the m most similar
documents to a query.
- Rank the databases.
- Retrieve the top documents from the selected
databases.
10. Why do we need a Single Integrated Representative?
- Scalability.
- Less Storage Space.
- Computational Efficiency.
11. Contents of the Integrated Representative
- Adjusted maximum normalized weight for a term i
and database Dj: $gidf_i \cdot mnw_{i,j}$
12. Global idfs
- The idf weight based upon the global document
frequency of the term
Maximum Normalized Weight
- The maximum normalized term frequency of a term
over the documents in that database.
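
As a rough illustration of how such a representative could be assembled, the sketch below computes a global idf and a maximum normalized weight per term and database, then multiplies them. The slides do not give the exact formulas, so two choices here are assumptions: the global idf is taken as log(N / df) over all documents in all databases, and a term's normalized weight is its term frequency divided by the Euclidean norm of the document's term-frequency vector. The function name and sample data are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List

Doc = Dict[str, int]  # term -> raw term frequency


def build_representative(databases: Dict[str, List[Doc]]) -> Dict[str, Dict[str, float]]:
    """Return term -> {database: gidf * mnw} (adjusted maximum normalized weight)."""
    total_docs = sum(len(docs) for docs in databases.values())
    global_df: Dict[str, int] = defaultdict(int)
    mnw: Dict[str, Dict[str, float]] = defaultdict(dict)

    for name, docs in databases.items():
        for doc in docs:
            # Assumed normalization: divide by the document vector length.
            norm = math.sqrt(sum(tf * tf for tf in doc.values()))
            for term, tf in doc.items():
                global_df[term] += 1
                weight = tf / norm
                if weight > mnw[term].get(name, 0.0):
                    mnw[term][name] = weight  # keep the per-database maximum

    rep: Dict[str, Dict[str, float]] = {}
    for term, per_db in mnw.items():
        gidf = math.log(total_docs / global_df[term])  # assumed idf formula
        rep[term] = {name: gidf * w for name, w in per_db.items()}
    return rep


# Hypothetical toy collections.
databases = {
    "D1": [{"search": 3, "engine": 4}, {"web": 1}],
    "D2": [{"search": 1}, {"engine": 1, "web": 1}],
}
rep = build_representative(databases)
for term, per_db in sorted(rep.items()):
    print(term, per_db)
```

Keeping one structure keyed by term, rather than one representative per database, mirrors the motivation given earlier for a single integrated representative.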
13. New Ranking Measure Based on the Integrated
Representative
14. OptDocRetrv Algorithm
- Suppose databases (D1, D2, ..., DN) are optimally
ranked for a given query on the basis of the global
similarity of the most similar document in
database Di with the query.
- Select the first s databases.
- Each selected database sends the actual global
similarity of its most similar document to the
metasearch engine, which computes the minimum of
these values, min_sim.
15. OptDocRetrv Algorithm
- Each of these s search engines now returns to
the metasearch engine those documents whose
global similarities are greater than or equal to
min_sim.
- If m or more documents are returned from the s
search engines, then they are sorted in
descending order of similarity and the first m
documents are returned to the user.
16. Experimental Results
- Databases used: 221
- Source of databases: 5 TREC document collections
- Total size of databases: 2 GB
- Number of distinct terms: over 1 million
- Queries used: 1000 Internet queries
17. Performance Measures
- Effectiveness (quality) measure cor_iden_db:
percentage of correctly identified databases.
18. Performance Measures
- Effectiveness (quality) measure cor_iden_doc:
percentage of correctly identified documents.
19. Performance Measures
- Efficiency measure db_effort: database search
effort.
20. Performance Measures
- Efficiency measure doc_effort: document search
effort.
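
The slides name the four measures without giving formulas, so the following Python sketch rests on assumed definitions: the "true" databases are those containing the m most similar documents, the two effectiveness measures are the fractions of true databases and true documents that were found, and the two effort measures are the number of databases searched and documents retrieved relative to the ideal counts. All names and sample values are hypothetical.

```python
from typing import Dict, Set


def performance_measures(
    true_dbs: Set[str],
    searched_dbs: Set[str],
    true_docs: Set[str],
    retrieved_docs: Set[str],
) -> Dict[str, float]:
    """Compute the four measures under the assumed definitions above."""
    m = len(true_docs)  # number of documents the user asked for
    return {
        # Fraction of the true databases that were actually searched.
        "cor_iden_db": len(true_dbs & searched_dbs) / len(true_dbs),
        # Fraction of the m most similar documents that were retrieved.
        "cor_iden_doc": len(true_docs & retrieved_docs) / m,
        # Databases searched relative to the number that needed searching.
        "db_effort": len(searched_dbs) / len(true_dbs),
        # Documents retrieved relative to the m documents requested.
        "doc_effort": len(retrieved_docs) / m,
    }


# Hypothetical outcome for one query with m = 4.
measures = performance_measures(
    true_dbs={"D1", "D2"},
    searched_dbs={"D1", "D2", "D3"},
    true_docs={"a", "b", "c", "d"},
    retrieved_docs={"a", "b", "c", "x", "y", "z"},
)
print(measures)
```

Under these definitions, effectiveness values close to 1.0 and effort values close to 1.0 would both indicate near-ideal behavior.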
21. Prototype System
- CSams (Computer Science Academic Metasearch
engine)
- URL: http://slate.cs.binghamton.edu:8080/CSams/
Future Improvements
- Dependencies of Terms
- Query Sampling Techniques
22. Conclusion
- The database selection and document retrieval
method can achieve close to ideal performance.
- In general, this method performs well even for
multi-term queries and performs much better for
short queries.
- Improved scalability over previous systems.