Query Expansion - PowerPoint PPT Presentation

About This Presentation

Title:

Query Expansion

Slides: 25

Provided by: seanmcg

Learn more at: http://faculty.ist.psu.edu

1
Query Expansion

By Sean McGettrick

2
What is Query Expansion?

Query expansion is the term for a search engine
adding search terms to a user's weighted
search.
The goal is to improve precision and/or recall.
Example: User Query: 'car'; Expanded Query: 'car
cars automobile automobiles auto', etc.
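
The 'car' example can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the SYNONYMS table and the expand_query function are hypothetical names, and the table is hand-written here where a real system would consult a thesaurus.

```python
# Hypothetical hand-written synonym table (assumption: a real system
# would load these entries from a thesaurus).
SYNONYMS = {
    "car": ["cars", "automobile", "automobiles", "auto"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the query terms plus their synonyms, without duplicates."""
    expanded = []
    for term in query.lower().split():
        # Keep the original term first, then append its synonyms.
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query("car"))
# prints ['car', 'cars', 'automobile', 'automobiles', 'auto']
```

Terms with no entry in the table pass through unchanged, so the expanded query is never shorter than the original.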

3
Classes of Query Expansion

Human and/or computer generated thesauri
Relevance feedback
Automatic query expansion

4
Query Expansion Issues

Two major issues
Which terms to include?
Which terms to weight more?
Concept-Based vs. Term-Based Query Expansion
Is it better to expand based upon the individual
terms in the query, or the overall concept of the
query?

5
Relevance of Query Expansion

Query expansion is very important on the web.
The amount of information on the web is always
increasing.
In 1999, Google had 135 million pages. It now
has over 3 billion.
Search engine users follow specific trends with
their searches.
Queries are 2-3 words long.
Search terms are broad.
Users do not like to expand their queries, either
by refining search terms or by using Boolean
operators.

6
Thesauri

What is a thesaurus in the IR world?
Any data structure that defines semantic
relatedness between words.
Schutze and Pedersen (1997)
Often more complex than a normal thesaurus.
Thought to be too broad to be useful.

7
The Need For Thesauri

It is naturally assumed that pulling words from a
thesaurus would increase:
The number of documents retrieved.
Possibly precision.
The car example: 'car' vs. 'car, auto,
automobile, vehicle, sedan', etc.
Which would retrieve the largest number of
documents?
Is larger necessarily better?
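
One way to make the "is larger better?" question concrete is to compute precision and recall for a narrow and a broad result set. The sketch below uses made-up document IDs and an assumed set of relevant documents. The precision_recall helper is a hypothetical name, and the numbers illustrate the trade-off only; they are not measurements.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a list of retrieved document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Assumed ground truth: four documents are actually relevant.
relevant = {"d1", "d2", "d3", "d4"}

narrow = ["d1", "d2"]                          # e.g. results for 'car'
broad = ["d1", "d2", "d3", "d5", "d6", "d7"]   # e.g. results for the expanded query

print(precision_recall(narrow, relevant))  # prints (1.0, 0.5)
print(precision_recall(broad, relevant))   # prints (0.5, 0.75)
```

In this toy case the expanded query retrieves more documents and raises recall, but half of what it returns is not relevant, so precision drops.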

8
Human Generated Thesauri

Earliest work began in the 1950s.
H.P. Luhn
Thesaurofacet: a detailed list of engineering
terms
Largely used in such industries as medicine,
aerospace, and other technological fields.

9
Drawbacks of Handcrafted Thesauri

Cost
Development.
Maintenance.
Cost often outweighs benefit.
Time
It often takes a long time for thesauri to
develop.
Hard to keep up with the pace of scientific and
technological development.

10
Automatically Generated Thesauri

Need grew from limitations of handcrafted
thesauri.
Removes the cost of experts needed to generate
thesauri.

11
Automatically Generated Thesauri

3 Steps.
Extract word co-occurrences.
Define word similarities.
Based upon word co-occurrence or lexical
relationship.
Cluster words based upon their similarities.
Not proven very successful.
As late as 1990 many industries were still using
handcrafted thesauri.
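
The three steps above can be sketched as follows. The specific choices are assumptions made for illustration, since the slides do not name a measure or an algorithm: co-occurrence is counted per document, similarity is the Dice coefficient, and clustering is simple single-link grouping above a threshold. build_thesaurus is a hypothetical name.

```python
from collections import Counter
from itertools import combinations


def build_thesaurus(docs, threshold=0.8):
    """Group terms into clusters based on document-level co-occurrence."""
    # Step 1: extract word co-occurrences (terms sharing a document).
    doc_freq = Counter()
    cooc = Counter()
    for doc in docs:
        terms = sorted(set(doc.lower().split()))
        doc_freq.update(terms)
        cooc.update(combinations(terms, 2))

    # Step 2: define word similarities (Dice coefficient, an assumption).
    similarity = {
        (a, b): 2 * n / (doc_freq[a] + doc_freq[b])
        for (a, b), n in cooc.items()
    }

    # Step 3: cluster words whose similarity reaches the threshold,
    # using a small union-find structure (single-link clustering).
    parent = {term: term for term in doc_freq}

    def find(term):
        while parent[term] != term:
            parent[term] = parent[parent[term]]
            term = parent[term]
        return term

    for (a, b), score in similarity.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for term in doc_freq:
        clusters.setdefault(find(term), set()).add(term)
    return list(clusters.values())


docs = ["car engine wheel", "car engine", "cat dog", "dog cat food"]
for cluster in build_thesaurus(docs):
    print(sorted(cluster))
```

With these four toy documents, 'car' and 'engine' land in one cluster and 'cat' and 'dog' in another, while 'wheel' and 'food' stay on their own because they co-occur too rarely to pass the threshold.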

12
Relevance Feedback

Began in the 1960s.
Significant improvement in recall and precision
over early query expansion work.
Basic process as follows.
The user creates their initial query which
returns an initial result set.
The user then selects a list of documents that
are relevant to their search.
The system then re-weights and/or expands the
query based upon the terms in the documents.

13
Relevance Feedback Models

Many different types of models.
Depend on methods and theories behind them.
Vector Space.
Probabilistic.
Boolean.

14
Ide dec-hi Method

In this method, all the top-ranked relevant
documents are used, as is the highest-ranked
non-relevant document.
The non-relevant document is used as a point in the
vector space from which the feedback query is
moved away.
Up to 160% improvement over non-expanded queries.
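
A minimal sketch of this update, assuming the usual statement of Ide dec-hi: the new query is the old query plus the vectors of all relevant documents, minus the vector of the single highest-ranked non-relevant document. Vectors are plain term-to-weight dictionaries here, and ide_dec_hi and the sample data are hypothetical.

```python
def ide_dec_hi(query, ranked_docs, relevant_ids):
    """Return a feedback query built with the Ide dec-hi update.

    query:        dict mapping term -> weight
    ranked_docs:  list of (doc_id, vector) pairs in rank order
    relevant_ids: set of doc IDs the user judged relevant
    """
    new_query = dict(query)

    # Add the vector of every relevant document.
    for doc_id, vector in ranked_docs:
        if doc_id in relevant_ids:
            for term, weight in vector.items():
                new_query[term] = new_query.get(term, 0.0) + weight

    # Subtract only the highest-ranked non-relevant document.
    for doc_id, vector in ranked_docs:
        if doc_id not in relevant_ids:
            for term, weight in vector.items():
                new_query[term] = new_query.get(term, 0.0) - weight
            break

    return new_query


ranked = [
    ("d1", {"car": 1.0, "auto": 1.0}),
    ("d2", {"car": 1.0, "racing": 1.0}),   # highest-ranked non-relevant
    ("d3", {"auto": 1.0, "engine": 1.0}),
    ("d4", {"racing": 1.0}),               # non-relevant, but not used
]
print(ide_dec_hi({"car": 1.0}, ranked, {"d1", "d3"}))
# prints {'car': 1.0, 'auto': 2.0, 'engine': 1.0, 'racing': -1.0}
```

The sketch leaves negative weights in place so the effect of the subtraction stays visible. A practical system might drop or clip them.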

15
Interactive Query Expansion

Uses a thesaurus.
After initial query is submitted, the system
returns a list of associated and relevant words
derived from both the result set and a thesaurus.
Useful, but more research is needed.

16
Pseudo-relevance Feedback

Grew from problems involved in implementing
relevance feedback systems.
Users do not like to give manual feedback to the
system.

17
Pseudo-relevance Feedback Process

The system returns an initial set of documents.
The system assumes that the top n number of
documents are relevant to the query.
The system takes terms from these documents to
re-weight the query.
Relies largely on the system's ability to
initially retrieve relevant documents.
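
The process can be sketched as below. Selecting new terms by raw frequency in the top n documents is an assumption made for brevity. The slides do not say how terms are chosen or weighted, and a real system would at least filter stop words. pseudo_relevance_expand is a hypothetical name.

```python
from collections import Counter


def pseudo_relevance_expand(query_terms, ranked_docs, top_n=2, num_new_terms=2):
    """Expand a query with frequent terms from the top-ranked documents."""
    # Assume the top n documents are relevant; no user feedback is needed.
    counts = Counter()
    for doc in ranked_docs[:top_n]:
        counts.update(doc.lower().split())

    # Terms already in the query are not candidates for expansion.
    for term in query_terms:
        counts.pop(term, None)

    new_terms = [term for term, _ in counts.most_common(num_new_terms)]
    return list(query_terms) + new_terms


ranked = [
    "car automobile dealer automobile",
    "automobile car repair repair",
    "banana bread",   # ranked third, so ignored when top_n=2
]
print(pseudo_relevance_expand(["car"], ranked))
# prints ['car', 'automobile', 'repair']
```

If the initial ranking had put the off-topic document first, its terms would have been added to the query instead, which is the dependence on initial retrieval quality noted above.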

19
Automatic Query Expansion

The process of automatic query expansion using
computer generated thesauri.
Works somewhat like pseudo-relevance feedback.
Implementation not as useful, but still widely
researched.

20
Term Co-occurrence Measures

Process of developing relationships between
words based upon their co-occurrence in
documents.
Clustering
Documents that share a significant number of
terms are grouped together.
A thesaurus is then generated from the terms in
these categories.
Categories sometimes too narrow or broad.
Does not account for synonyms.

21
Lexical Co-Occurrence Measures

Instead of looking at the frequency of terms in
a document, this approach looks at the proximity
of words in a document.
Context of words becomes important.
Some performance improvement shown in small
document collections.
Not quite as good as relevance feedback, but
better than pseudo-relevance feedback.
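
A minimal sketch of proximity-based counting, assuming a fixed sliding window over whitespace-separated tokens. The window size and the window_cooccurrence name are illustrative choices, not taken from the slides.

```python
from collections import Counter


def window_cooccurrence(text, window=2):
    """Count word pairs that appear within `window` tokens of each other."""
    tokens = text.lower().split()
    counts = Counter()
    for i, left in enumerate(tokens):
        # Pair each token with the next `window` tokens only.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts


counts = window_cooccurrence("the red car hit the blue car")
print(counts[("car", "red")])   # prints 1
print(counts[("blue", "red")])  # prints 0
```

'red' and 'blue' appear in the same sentence but never within two tokens of each other, so they get no count. A document-level co-occurrence measure would have paired them.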

22
Current State of Query Expansion

Query expansion technology has reached something
of a plateau.
This is due to the limiting factors of relevance
feedback and word co-occurrence.
Current research attempts to refine previous
research in the field.

23
Where To Go From Here?

Grammatical Based Thesauri
Syntactical relationship between words
Words placed into classes
Some improvement on small document collections.
Failed on larger ones.
AI Searching
Mostly theory
Intelligent Agents
Could be customized to reflect specific needs of the
user
Next logical step in IR, but still far off from
commercial use

24
Works Cited

Attardi, G., S. Di Marco and F. Sebastiani. 1998.
Automated Generation of Category-Specific
Thesauri for Interactive Query Expansion.
Grefenstette, G. 1992. Use of Syntactic Context
to Produce Term Association Lists for Text
Retrieval. In Proceedings of the 15th Annual
International ACM-SIGIR Conference on Research
and Development in Information Retrieval,
Copenhagen, Denmark, ed. N. Belkin, P. Ingwersen
and A. M. Pejtersen, pp. 89-97. New York: ACM
Press.
Ide, E. 1971. New Experiments in Relevance
Feedback. In G. Salton, ed. The SMART Retrieval
System: Experiments in Automatic Document
Processing. Englewood Cliffs, NJ: Prentice-Hall.
Qiu, Y. 1993. Concept Based Query Expansion. In
Proceedings of SIGIR-93, 16th ACM International
Conference on Research and Development in
Information Retrieval.
Schutze, H. and J. Pedersen. 1997. A
Cooccurrence-based Thesaurus and Two Applications
to Information Retrieval. Information Processing
and Management 33, no. 3, pp. 307-318.
Walker, D. 2001. Query Expansion Using Thesauri.