Towards a Highly-Scalable and Effective Metasearch Engine

1
Towards a Highly-Scalable and Effective
Metasearch Engine
  • Authors: Zonghuan Wu, Weiyi Meng, Clement Yu,
    Zhuogang Li

Presented By: Suganya Ravikumar
Mira Chokshi Jitendra
Bethina Vijayaram Bethina
2
What is a Metasearch Engine?
  • A metasearch engine is a system that supports
    unified access to multiple local search engines.

3
Purpose of Metasearch Engine
  • General-purpose search engines focus on finding
    all pages on the web.
  • Special-purpose search engines focus on confined
    domains.
  • Hence several search engines can be combined to
    increase the coverage of documents over the web
    and retrieve more useful documents.

4
How to Build a Metasearch Engine?
  • A metasearch engine does not maintain its own
    index of web pages but maintains specific
    information about each underlying local search
    engine.

5
How does a Metasearch Engine work?
6
Challenges
  • Database Selection
  • Collection Fusion

Goal of Paper
  • Develop and experimentally evaluate a new
    technique to rank databases optimally.

7
Related Work
  • Existing Approaches for Database Selection: rank
    the databases for a given query by a certain
    usefulness measure. Examples: gGlOSS, CORI Net,
    D-WISE, Q-Pilot.
  • Existing Approaches for Collection Fusion:
    retrieve more documents from databases that have
    higher ranking scores, and use adjusted local
    similarities of documents to merge the retrieved
    documents. Examples: CORI Net, D-WISE,
    ProFusion, MRDD.

8
Framework
  • The references "Methodology for Retrieving Text
    Documents from Multiple Databases" and "Efficient
    and Effective Metasearch for large number of Text
    Databases" serve as the framework for this
    approach.
  • Approach in Framework for Database Selection:
    the similarity of the most similar document in
    the database is used as the measure for ranking
    databases.
  • Approach in Framework for Collection Fusion:
    the real global similarities of documents are
    used to merge the documents retrieved from the
    optimally selected databases.

9
Logical Ordering of Steps to Build a Metasearch
Engine
  • Build an integrated representative of the
    databases.
  • Find the databases that contain the m most
    similar documents to a query.
  • Rank the databases.
  • Retrieve the top documents from selected
    databases.

10
Why do we need a Single Integrated Representative?
  • Scalability.
  • Less Storage Space.
  • Computational Efficiency.

11
Contents of the Integrated Representative
  • Adjusted Maximum Normalized Weight for a term i
    and database Dj:
  • $gidf_i \cdot mnw_{i,j}$

12
Global idfs
  • The idf weight based upon the global document
    frequency of the term

Maximum Normalized Weight
  • The maximum normalized term frequency of a term
    over the documents in that database.
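
As a rough illustration of the quantities defined above, the sketch below computes a global idf per term and a maximum normalized weight per (term, database) pair, then multiplies them into an adjusted weight. The idf formula log(N / df), the Euclidean length normalization, and the name build_representative are assumptions made for illustration; the slides do not specify them.

```python
import math
from collections import Counter

def build_representative(databases):
    """databases: {db_name: [token list per document]}.

    Returns (gidf, mnw, adjusted). All formulas here are illustrative
    assumptions, not taken from the slides.
    """
    total_docs = sum(len(docs) for docs in databases.values())
    doc_freq = Counter()  # global document frequency of each term
    mnw = {}              # mnw[(term, db)] = maximum normalized weight
    for db, docs in databases.items():
        for tokens in docs:
            tf = Counter(tokens)
            # Assumption: normalize term frequency by the document's Euclidean length.
            norm = math.sqrt(sum(c * c for c in tf.values()))
            for term, count in tf.items():
                doc_freq[term] += 1
                key = (term, db)
                mnw[key] = max(mnw.get(key, 0.0), count / norm)
    # Assumption: global idf = log(N / df), with N the total number of documents.
    gidf = {t: math.log(total_docs / df) for t, df in doc_freq.items()}
    # Adjusted maximum normalized weight: gidf_i * mnw_{i,j}.
    adjusted = {(t, db): gidf[t] * w for (t, db), w in mnw.items()}
    return gidf, mnw, adjusted

# Tiny made-up collection, for demonstration only.
toy = {
    "D1": [["web", "search"], ["web", "web", "index"]],
    "D2": [["search", "engine"]],
}
gidf, mnw, adjusted = build_representative(toy)
print(adjusted[("web", "D1")])
```

Only one number per (term, database) pair is kept, which is what makes a single integrated representative small compared with a full index.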

13
New Ranking Measure Based on the Integrated
Representative
  • For query terms $i = 1, \dots, k$
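
The formula from this slide did not survive in the transcript. Purely as an illustration of how a ranking over the integrated representative could work, the sketch below scores each database by summing the stored adjusted weights of the query terms. That summation, the function name rank_databases, and the sample weights are all assumptions and may differ from the measure in the paper.

```python
def rank_databases(query_terms, adjusted, db_names):
    """Rank databases by an assumed score: the sum of adjusted weights
    (gidf_i * mnw_{i,j}) over the query terms. Illustrative only."""
    scores = {
        db: sum(adjusted.get((term, db), 0.0) for term in query_terms)
        for db in db_names
    }
    # Highest score first; ties keep the order given in db_names.
    ranking = sorted(db_names, key=lambda db: scores[db], reverse=True)
    return ranking, scores

# Made-up adjusted weights, for demonstration only.
adjusted = {
    ("web", "D1"): 0.36,
    ("search", "D1"): 0.29,
    ("search", "D2"): 0.29,
    ("engine", "D2"): 0.78,
}
ranking, scores = rank_databases(["search", "engine"], adjusted, ["D1", "D2"])
print(ranking)
```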

14
OptDocRetrv Algorithm
  • Suppose databases (D1, D2, ..., DN) are optimally
    ranked for a given query on the basis of the
    global similarity of the most similar document
    in each database Di with the query.
  • Select the first s databases.
  • Each selected database sends the actual global
    similarity of its most similar document to the
    metasearch engine, which computes the minimum
    of these values, min_sim.

15
OptDocRetrv Algorithm
  • Each of these s search engines now returns to
    the metasearch engine those documents whose
    global similarities are greater than or equal to
    min_sim
  • If m or more documents are returned from the s
    search engines, then they are sorted in
    descending order of similarity and the first m
    documents are returned to the user.
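
The steps of OptDocRetrv described on these two slides can be sketched as follows. Each database is modeled as an in-memory list of (doc_id, global_similarity) pairs, which is a simplification. The slides do not say what happens when fewer than m documents come back; this sketch assumes the next-ranked database is then added and the process repeats. The function name and the default s = 2 are also assumptions.

```python
def opt_doc_retrv(ranked_dbs, m, s=2):
    """ranked_dbs: non-empty databases in ranked order, each a non-empty
    list of (doc_id, global_similarity) pairs.
    Returns up to m (doc_id, similarity) pairs, most similar first."""
    while True:
        s = min(s, len(ranked_dbs))
        selected = ranked_dbs[:s]
        # Each selected database reports the similarity of its most similar
        # document; min_sim is the smallest of those values.
        min_sim = min(max(sim for _, sim in db) for db in selected)
        # Each selected database returns documents with similarity >= min_sim.
        returned = [(doc, sim) for db in selected for doc, sim in db if sim >= min_sim]
        returned.sort(key=lambda pair: pair[1], reverse=True)
        if len(returned) >= m or s == len(ranked_dbs):
            return returned[:m]
        # Assumption (not on the slides): otherwise include the next database.
        s += 1

# Made-up databases, already ranked by their most similar document.
d1 = [("a", 0.9), ("b", 0.5)]
d2 = [("c", 0.8), ("d", 0.7)]
d3 = [("e", 0.6), ("f", 0.2)]
top = opt_doc_retrv([d1, d2, d3], m=3)
print([doc for doc, _ in top])
```

With m = 3, the first two databases return only two documents at min_sim = 0.8, so the third database is added and min_sim drops to 0.6.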

16
Experimental Results
  • Databases Used: 221
  • Source of Databases: 5 TREC document collections
  • Total Size of Databases: 2 GB
  • Number of Distinct Terms: over 1 million
  • Queries Used: 1000 Internet queries

17
Performance Measures
  • Effectiveness (Quality) Measure: cor_iden_db,
    the percentage of correctly identified databases.

18
Performance Measures
  • Effectiveness (Quality) Measure: cor_iden_doc,
    the percentage of correctly identified documents.

19
Performance Measures
  • Efficiency Measure: db_effort, the database
    search effort.

20
Performance Measures
  • Efficiency Measure: doc_effort, the document
    search effort.
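
The exact formulas for the four measures are not in this transcript. The sketch below shows one plausible reading, stated here as an assumption: each measure is a ratio against the ideal set, where the ideal databases are those holding the m most similar documents and the ideal documents are those m documents. The function and parameter names are hypothetical.

```python
def performance_measures(searched_dbs, ideal_dbs, received_docs, ideal_docs):
    """Compute assumed versions of the four measures as ratios
    (multiply by 100 for percentages). ideal_dbs and ideal_docs must be
    non-empty. Definitions are illustrative, not taken from the slides."""
    searched_dbs, ideal_dbs = set(searched_dbs), set(ideal_dbs)
    received_docs, ideal_docs = set(received_docs), set(ideal_docs)
    return {
        # Share of the ideal databases that were actually searched.
        "cor_iden_db": len(searched_dbs & ideal_dbs) / len(ideal_dbs),
        # Share of the ideal documents that were actually received.
        "cor_iden_doc": len(received_docs & ideal_docs) / len(ideal_docs),
        # Databases searched relative to the number needed (1.0 is ideal).
        "db_effort": len(searched_dbs) / len(ideal_dbs),
        # Documents received relative to the number needed (1.0 is ideal).
        "doc_effort": len(received_docs) / len(ideal_docs),
    }

# Made-up run: three databases searched where two would have sufficed.
result = performance_measures(
    searched_dbs=["D1", "D2", "D3"],
    ideal_dbs=["D1", "D2"],
    received_docs=["a", "c", "d", "e"],
    ideal_docs=["a", "c", "d"],
)
print(result)
```

Under this reading, effort values above 1.0 indicate extra databases searched or extra documents transferred beyond what an ideal selection would need.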

21
Prototype System
  • CSams (Computer Science Academic Metasearch
    engine). URL: http://slate.cs.binghamton.edu:8080/CSams/

Future Improvements
  • Dependencies of Terms
  • Query Sampling Techniques

22
Conclusion
  • The database selection and document retrieval
    method can achieve close-to-ideal performance.
  • In general, this method performs well even for
    multi-term queries and performs much better for
    short queries.
  • Improved scalability over previous systems.