Query Expansion - PowerPoint PPT Presentation

Slides: 25
Provided by: seanmcg

Transcript and Presenter's Notes

Title: Query Expansion


1
Query Expansion
  • By Sean McGettrick

2
What is Query Expansion?
  • Query expansion is the term for a search engine
    adding search terms to a user's weighted search.
  • The goal is to improve precision and/or recall.
  • Example: User Query: 'car'; Expanded Query: 'car
    cars automobile automobiles auto', etc.
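
The 'car' example above can be sketched in a few lines of Python. This is only an illustration: the THESAURUS dictionary and the expand_query function are hypothetical names, and a real engine would also attach weights to the added terms.

```python
# Hypothetical hand-built thesaurus; the entries are illustrative only.
THESAURUS = {
    "car": ["cars", "automobile", "automobiles", "auto"],
}


def expand_query(query, thesaurus):
    """Return the query terms followed by any related terms from the thesaurus."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:  # keep order, drop duplicates
                expanded.append(candidate)
    return expanded


print(expand_query("car", THESAURUS))
# ['car', 'cars', 'automobile', 'automobiles', 'auto']
```

Terms with no thesaurus entry pass through unchanged, so the expanded query is never smaller than the original.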

3
Classes of Query Expansion
  • Human and/or computer generated thesauri
  • Relevance feedback
  • Automatic query expansion

4
Query Expansion Issues
  • Two major issues
  • Which terms to include?
  • Which terms to weight more?
  • Concept-Based vs. Term-Based Query Expansion
  • Is it better to expand based upon the individual
    terms in the query, or the overall concept of the
    query?

5
Relevance of Query Expansion
  • Query expansion is very important on the web.
  • The amount of information on the web is always
    increasing.
  • In 1999, Google had 135 million pages. It now
    has over 3 billion.
  • Search engine users follow specific trends with
    their searches:
  • Queries are 2-3 words long.
  • Search terms are broad.
  • Users do not like to expand their queries, either
    by refining search terms or by using Boolean
    operators.

6
Thesauri
  • What is a thesaurus in the IR world?
  • Any data structure that defines semantic
    relatedness between words.
  • Schutze and Pedersen (1997)
  • Often more complex than normal thesauri.
  • Thought to be too broad to be useful.

7
The Need For Thesauri
  • It is naturally assumed that pulling words from a
    thesaurus would increase:
  • The number of documents retrieved.
  • Possibly precision.
  • The car example: 'car' vs. 'car, auto,
    automobile, vehicle, sedan', etc.
  • Which would retrieve the largest number of
    documents?
  • Is larger necessarily better?

8
Human and Automatically Generated Thesauri
  • Earliest work began in the 1950s.
  • H.P. Luhn
  • Thesaurofacet: detailed list of engineering
    terms
  • Largely used in such industries as medicine,
    aerospace, and other technological fields.

9
Drawbacks of Handcrafted Thesauri
  • Cost
  • Development.
  • Maintenance.
  • Cost often outweighs benefit.
  • Time
  • It often takes a long time for thesauri to
    develop.
  • Hard to keep up with the pace of scientific and
    technological development.

10
Automatically Generated Thesauri
  • Need grew from limitations of handcrafted
    thesauri.
  • Removes the cost of paying experts to generate
    thesauri.

11
Automatically Generated Thesauri
  • 3 Steps.
  • Extract word co-occurrences.
  • Define word similarities.
  • Based upon word co-occurrence or lexical
    relationship.
  • Cluster words based upon their similarities.
  • Not proven very successful.
  • As late as 1990 many industries were still using
    handcrafted thesauri.
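
The three steps above can be sketched on a toy collection. This is a minimal illustration under stated assumptions, not a method taken from the cited work: co-occurrence is counted per document, similarity is the cosine between co-occurrence vectors, and the clustering is a simple single-pass scheme. The function names and the 0.9 threshold are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_vectors(documents):
    """Step 1: for each word, count the words it shares a document with."""
    vectors = {}
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        for a, b in combinations(words, 2):
            vectors.setdefault(a, Counter())[b] += 1
            vectors.setdefault(b, Counter())[a] += 1
    return vectors


def cosine(u, v):
    """Step 2: similarity of two words = cosine of their co-occurrence vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(vectors, threshold=0.9):
    """Step 3: single-pass clustering (an assumption made for brevity).

    A word joins the first cluster whose first member is similar enough;
    otherwise it starts a new cluster.
    """
    clusters = []
    for word in sorted(vectors):
        for group in clusters:
            if cosine(vectors[word], vectors[group[0]]) >= threshold:
                group.append(word)
                break
        else:
            clusters.append([word])
    return clusters


docs = [
    "car engine road",
    "automobile engine road",
    "banana fruit peel",
    "apple fruit peel",
]
clusters = cluster(cooccurrence_vectors(docs))
print(clusters)
```

In this toy collection 'car' and 'automobile' never appear in the same document, yet they land in the same cluster because they co-occur with the same words.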

12
Relevance Feedback
  • Began in the 1960s.
  • Significant improvement in recall and precision
    over early query expansion work.
  • The basic process is as follows:
  • The user creates their initial query which
    returns an initial result set.
  • The user then selects a list of documents that
    are relevant to their search.
  • The system then re-weights and/or expands the
    query based upon the terms in the documents.

13
Relevance Feedback Models
  • Many different types of models.
  • Depend on methods and theories behind them.
  • Vector Space.
  • Probabilistic.
  • Boolean.

14
Ide dec-hi Method
  • In this method, all the top-ranked relevant
    documents are used, as is the highest-ranked
    non-relevant document.
  • The non-relevant document is used as a point in
    the vector space from which the feedback query
    is moved away.
  • Up to 160% improvement over non-expanded queries.
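
In vector terms, the Ide dec-hi update adds every relevant document vector to the query and subtracts the single highest-ranked non-relevant one. A minimal sketch follows, with term vectors held as plain dictionaries. The function name and the toy weights are hypothetical, and dropping terms whose weight is not positive is an assumption made here, not something stated on the slide.

```python
from collections import Counter


def ide_dec_hi(query, relevant_docs, top_nonrelevant_doc):
    """new query = old query + sum(relevant docs) - top non-relevant doc."""
    new_query = Counter(query)
    for doc in relevant_docs:
        new_query.update(doc)                # add each relevant document vector
    new_query.subtract(top_nonrelevant_doc)  # move away from the non-relevant one
    # Assumption: terms with zero or negative weight are dropped.
    return {term: w for term, w in new_query.items() if w > 0}


query = {"car": 1}
relevant = [{"car": 2, "automobile": 1}, {"car": 1, "sedan": 1}]
non_relevant = {"car": 1, "rental": 3}

print(ide_dec_hi(query, relevant, non_relevant))
# {'car': 3, 'automobile': 1, 'sedan': 1}
```

The query both gains new terms ('automobile', 'sedan') and has its existing term re-weighted, while 'rental', seen only in the non-relevant document, is kept out.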

15
Interactive Query Expansion
  • Uses a thesaurus.
  • After initial query is submitted, the system
    returns a list of associated and relevant words
    derived from both the result set and a thesaurus.
  • Useful, but more research is needed.

16
Pseudo-relevance Feedback
  • Grew from problems involved in implementing
    relevance feedback systems.
  • Users do not like to give manual feedback to the
    system.

17
Pseudo-relevance Feedback Process
  • The system returns an initial set of documents.
  • The system assumes that the top n documents are
    relevant to the query.
  • The system takes terms from these documents to
    re-weight the query.
  • Relies largely on the system's ability to
    initially retrieve relevant documents.
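
The process above can be sketched as a short loop. Everything specific here is an assumption for illustration: the toy term-count ranking function, the parameters n (documents assumed relevant), k (terms taken from them) and beta (how strongly they are weighted), and all function names.

```python
from collections import Counter


def score(query, doc_terms):
    """Toy ranking function: query weight times term count, summed over terms."""
    counts = Counter(doc_terms)
    return sum(weight * counts[term] for term, weight in query.items())


def rank(query, docs):
    """Return document ids ordered from highest to lowest score."""
    return sorted(docs, key=lambda doc_id: score(query, docs[doc_id]), reverse=True)


def pseudo_relevance_feedback(query_terms, docs, n=2, k=2, beta=0.5):
    """Re-weight the query using terms from the top n initially retrieved docs."""
    query = {term: 1.0 for term in query_terms}
    top_docs = rank(query, docs)[:n]          # assumed relevant, no user input
    term_counts = Counter()
    for doc_id in top_docs:
        term_counts.update(docs[doc_id])
    for term, count in term_counts.most_common(k):
        query[term] = query.get(term, 0.0) + beta * count / n
    return query


docs = {
    "d1": "car dealer sells car and automobile parts".split(),
    "d2": "car automobile insurance quotes".split(),
    "d3": "banana bread recipe".split(),
}
new_query = pseudo_relevance_feedback(["car"], docs)
print(new_query)
print(rank(new_query, docs))
```

If the initial ranking had put an irrelevant document in the top n, its terms would be pulled into the query just as readily, which is the dependence on initial retrieval quality noted in the last bullet.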

19
Automatic Query Expansion
  • Automatic query expansion uses computer-generated
    thesauri.
  • Works somewhat like pseudo-relevance feedback.
  • Implementations are not as useful, but the
    approach is still widely researched.

20
Term Co-occurrence Measures
  • Process of developing relationships between
    words based upon their co-occurrence in
    documents.
  • Clustering
  • Documents that share a significant number of
    terms are grouped together.
  • A thesaurus is then generated from the terms in
    these categories.
  • Categories sometimes too narrow or broad.
  • Does not account for synonyms.

21
Lexical Co-Occurrence Measures
  • Instead of the frequency of terms in a document,
    the proximity of words within a document is
    examined.
  • Context of words becomes important.
  • Some performance improvement shown in small
    document collections.
  • Not quite as good as relevance feedback, but
    better than pseudo-relevance feedback.
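
A proximity-based count can be sketched with a sliding window over the tokens of a document. The window size of 2 and the function name are hypothetical choices for illustration; published measures differ in how they define and weight proximity.

```python
from collections import Counter


def window_cooccurrence(tokens, window=2):
    """Count unordered word pairs that occur within `window` positions of each other."""
    pairs = Counter()
    for i, word in enumerate(tokens):
        for neighbor in tokens[i + 1 : i + 1 + window]:
            if neighbor != word:
                pairs[tuple(sorted((word, neighbor)))] += 1
    return pairs


tokens = "the red car drove past the blue automobile".split()
pairs = window_cooccurrence(tokens)

print(pairs[("car", "red")])         # adjacent words: counted
print(pairs[("automobile", "car")])  # same document, but too far apart: not counted
```

A document-level count would treat 'car' and 'automobile' as co-occurring here; the window-based count does not, which is how the context of a word comes into play.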

22
Current State of Query Expansion
  • Query expansion technology has reached something
    of a plateau.
  • This is due to limiting factors of relevance
    feedback and word co-occurrence.
  • Current research is attempting to refine previous
    research in the field.

23
Where To Go From Here?
  • Grammar-Based Thesauri
  • Syntactical relationship between words
  • Words placed into classes
  • Some improvement on small document collections.
    Failed on larger ones.
  • AI Searching
  • Mostly theory
  • Intelligent Agents
  • Could be customized to reflect specific needs of
    the user
  • Next logical step in IR, but still far off from
    commercial use

24
Works Cited
  • Attardi, G., S. Di Marco and F. Sebastiani. 1998.
    Automated Generation of Category-Specific
    Thesauri for Interactive Query Expansion.
  • Grefenstette, G. 1992. Use of Syntactic Context
    to Produce Term Association Lists for Text
    Retrieval. In Proceedings of the 15th Annual
    International ACM-SIGIR Conference on Research
    and Development in Information Retrieval,
    Copenhagen, Denmark, ed. N. Belkin, P. Ingwersen
    and A. M. Pejtersen, pp. 89-97. New York: ACM
    Press.
  • Ide, E. 1971. New Experiments in Relevance
    Feedback. In G. Salton. The SMART Retrieval
    System: Experiments in Automatic Document
    Processing. Englewood Cliffs, NJ: Prentice-Hall.
  • Qiu, Y. 1993. Concept Based Query Expansion. In
    Proceedings of SIGIR-93, 16th ACM International
    Conference on Research and Development in
    Information Retrieval.
  • Schutze, H. and J. Pedersen. 1997. A
    Cooccurrence-based Thesaurus and Two Applications
    to Information Retrieval. Information Processing
    and Management 33, no. 3, pp. 307-318.
  • Walker, D. 2001. Query Expansion Using Thesauri.