Title: Query Expansion
1Query Expansion
2What is Query Expansion?
- Query expansion is the term for a search engine adding search terms to a user's weighted search.
- The goal is to improve precision and/or recall.
- Example: User query: car. Expanded query: car, cars, automobile, automobiles, auto, etc.
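The car example can be sketched as a simple lookup. This is a minimal illustration, not a production design: the `THESAURUS` dictionary and the `expand_query` function are hypothetical names, and term weighting is left out.

```python
# Hypothetical hand-built thesaurus; the entries mirror the "car" example.
THESAURUS = {
    "car": ["cars", "automobile", "automobiles", "auto"],
}

def expand_query(query, thesaurus=THESAURUS):
    """Return the query terms followed by their related terms, without duplicates."""
    expanded = []
    for term in query.lower().split():
        # Keep the original term first, then append anything the thesaurus lists for it.
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_query("car"))
# ['car', 'cars', 'automobile', 'automobiles', 'auto']
```

Terms with no thesaurus entry pass through unchanged, so the expanded query always contains the original one.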
3Classes of Query Expansion
- Human and/or computer generated thesauri
- Relevance feedback
- Automatic query expansion
4Query Expansion Issues
- Two major issues
- Which terms to include?
- Which terms to weight more?
- Concept-Based vs. Term-Based Query Expansion
- Is it better to expand based upon the individual
terms in the query, or the overall concept of the
query?
5Relevance of Query Expansion
- Query expansion is very important on the web.
- The amount of information on the web is always increasing.
- In 1999, Google had 135 million pages. It now has over 3 billion.
- Search engine users follow specific trends with their searches:
- Queries are short, typically 2-3 words.
- Search terms are broad.
- Users do not like to expand their queries, either by refining search terms or by using Boolean operators.
6Thesauri
- What is a thesaurus in the IR world?
- Any data structure that defines semantic relatedness between words.
- Schutze and Pedersen (1997)
- Often more complex than a normal thesaurus.
- Thought to be too broad to be useful.
7The Need For Thesauri
- It is naturally assumed that pulling words from a thesaurus would increase:
- The number of documents retrieved.
- Possibly precision.
- The car example: "car" vs. "car, auto, automobile, vehicle, sedan, etc."
- Which would retrieve the largest number of documents?
- Is larger necessarily better?
8Human and Automatically Generated Thesauri
- Earliest work began in the 1950s.
- H.P. Luhn
- Thesaurofacet: a detailed list of engineering terms.
- Largely used in industries such as medicine, aerospace, and other technological fields.
9Drawbacks of Handcrafted Thesauri
- Cost
- Development.
- Maintenance.
- Cost often outweighs benefit.
- Time
- Thesauri often take a long time to develop.
- It is hard to keep up with the pace of scientific and technological development.
10Automatically Generated Thesauri
- The need grew from the limitations of handcrafted thesauri.
- Removes the cost of having experts generate thesauri.
11Automatically Generated Thesauri
- 3 Steps.
- Extract word co-occurrences.
- Define word similarities.
- Similarities are based upon word co-occurrence or lexical relationship.
- Cluster words based upon their similarities.
- Has not proven very successful.
- As late as 1990 many industries were still using
handcrafted thesauri.
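The three steps can be sketched roughly as follows. The slides do not name a similarity measure or a clustering algorithm, so the Dice coefficient, the single-link merging, the `threshold` value, and all function names here are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(docs):
    """Step 1: count how many documents each term and each term pair appears in."""
    term_df, pair_df = Counter(), Counter()
    for doc in docs:
        terms = sorted(set(doc.lower().split()))
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))  # pairs come out in sorted order
    return term_df, pair_df

def dice(a, b, term_df, pair_df):
    """Step 2 (assumed measure): Dice coefficient, 2 * df(a, b) / (df(a) + df(b))."""
    pair = tuple(sorted((a, b)))
    return 2 * pair_df[pair] / (term_df[a] + term_df[b])

def cluster(term_df, pair_df, threshold=0.8):
    """Step 3 (assumed method): single-link merge of terms at or above the threshold."""
    parent = {t: t for t in term_df}

    def find(t):
        while parent[t] != t:
            t = parent[t]
        return t

    for a, b in pair_df:
        if dice(a, b, term_df, pair_df) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for t in term_df:
        groups.setdefault(find(t), set()).add(t)
    # Only groups with more than one term are useful as thesaurus entries.
    return [g for g in groups.values() if len(g) > 1]

docs = ["car auto engine", "car auto dealer", "flower garden", "flower garden seed"]
clusters = cluster(*cooccurrence_counts(docs))
print(sorted(sorted(c) for c in clusters))
```

On this toy collection, "car" and "auto" always appear together, as do "flower" and "garden", so each pair ends up in its own cluster while the rarer terms stay unclustered.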
12Relevance Feedback
- Began in the 1960s.
- Significant improvement in recall and precision over early query expansion work.
- The basic process is as follows:
- The user creates an initial query, which returns an initial result set.
- The user then selects a list of documents that are relevant to their search.
- The system then re-weights and/or expands the query based upon the terms in those documents.
13Relevance Feedback Models
- Many different types of models.
- Depend on methods and theories behind them.
- Vector Space.
- Probabilistic.
- Boolean.
14Ide dec-hi Method
- In this method, all of the top-ranked relevant documents are used, as is the highest-ranked non-relevant document.
- The non-relevant document is used as a point in the vector space from which the feedback query is moved away.
- Up to 160% improvement over non-expanded queries.
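As a rough sketch, the dec-hi update adds every judged-relevant document vector to the query and subtracts only the highest-ranked non-relevant one. The dictionary representation of term vectors, the function name `ide_dec_hi`, and the choice to drop non-positive weights are assumptions made for this illustration.

```python
from collections import Counter

def ide_dec_hi(query, relevant_docs, top_nonrelevant):
    """Q' = Q + sum(relevant document vectors) - top non-relevant document vector."""
    new_query = Counter(query)
    for doc in relevant_docs:
        new_query.update(doc)            # add every judged-relevant document
    new_query.subtract(top_nonrelevant)  # subtract only the highest-ranked non-relevant one
    # Assumption: terms whose weight is not positive are dropped from the new query.
    return {term: w for term, w in new_query.items() if w > 0}

query = {"car": 1}
relevant = [{"car": 2, "auto": 1}, {"car": 1, "sedan": 1}]
nonrelevant = {"car": 1, "rental": 2}
print(ide_dec_hi(query, relevant, nonrelevant))
# {'car': 3, 'auto': 1, 'sedan': 1}
```

The term "rental" appears only in the non-relevant document, so it receives a negative weight and is dropped, which moves the query away from that document.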
15Interactive Query Expansion
- Uses a thesaurus.
- After the initial query is submitted, the system returns a list of associated and relevant words derived from both the result set and a thesaurus.
- Useful, but more research is needed.
16Pseudo-relevance Feedback
- Grew from problems involved in implementing relevance feedback systems.
- Users do not like to give manual feedback to the system.
17Pseudo-relevance Feedback Process
- The system returns an initial set of documents.
- The system assumes that the top n documents are relevant to the query.
- The system takes terms from these documents to re-weight the query.
- Relies largely on the system's ability to initially retrieve relevant documents.
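The process above might look like this in a very reduced form. The sketch assumes `ranked_docs` is already ordered by some initial retrieval step, picks expansion terms by raw frequency, and does no stop-word removal or weighting; the function name and the parameters `n` and `k` are hypothetical.

```python
from collections import Counter

def pseudo_relevance_feedback(query_terms, ranked_docs, n=2, k=2):
    """Expand the query with the k most frequent new terms from the top n documents."""
    counts = Counter()
    for doc in ranked_docs[:n]:          # assume the top n results are relevant
        counts.update(doc.lower().split())
    for term in query_terms:             # ignore terms already in the query
        counts.pop(term, None)
    return list(query_terms) + [term for term, _ in counts.most_common(k)]

ranked = [
    "car dealer sells used car",
    "used car prices and dealer reviews",
    "car wash",
]
print(pseudo_relevance_feedback(["car"], ranked))
```

If the initial ranking is poor, the added terms come from irrelevant documents and pull the query off topic, which is the dependency noted in the last bullet.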
19Automatic Query Expansion
- The process of automatic query expansion using computer generated thesauri.
- Works somewhat like pseudo-relevance feedback.
- Implementation not as useful, but still widely researched.
20Term Co-occurrence Measures
- The process of developing relationships between words based upon their co-occurrence in documents.
- Clustering:
- Documents that share a significant number of terms are grouped together.
- A thesaurus is then generated from the terms in these categories.
- Categories are sometimes too narrow or too broad.
- Does not account for synonyms.
21Lexical Co-Occurrence Measures
- Instead of looking at the frequency of terms in a document, this approach looks at the proximity of words in a document.
- The context of words becomes important.
- Some performance improvement shown in small document collections.
- Not quite as good as relevance feedback, but better than pseudo-relevance feedback.
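One simple way to capture proximity is to count only the word pairs that fall within a fixed window of each other. This is an illustrative sketch, not the measure used in the cited work: the window size, the whitespace tokenization, and the function name are all assumptions.

```python
from collections import Counter

def window_cooccurrence(text, window=2):
    """Count term pairs that occur within `window` tokens of each other."""
    tokens = text.lower().split()
    pairs = Counter()
    for i, left in enumerate(tokens):
        # Look only at the next `window` tokens, so distant words never pair up.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    return pairs

pairs = window_cooccurrence("the red car passed the blue car")
print(pairs[("car", "red")], pairs[("blue", "red")])
# 1 0
```

Here "red" and "car" are adjacent and are counted, while "red" and "blue" share the sentence but are too far apart, which a document-level co-occurrence count would not distinguish.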
22Current State of Query Expansion
- Query expansion technology has reached somewhat of a plateau.
- This is due to the limiting factors of relevance feedback and word co-occurrence.
- Current research is attempting to refine previous research in the field.
23Where To Go From Here?
- Grammatical Based Thesauri
- Syntactical relationship between words
- Words placed into classes
- Some improvement on small document collections, but failed on larger ones.
- AI Searching
- Mostly theory
- Intelligent Agents
- Could be customized to reflect the specific needs of the user
- Next logical step in IR, but still far off from commercial use
24Works Cited
- Attardi, G., S. Di Marco and F. Sebastiani. 1998. Automated Generation of Category-Specific Thesauri for Interactive Query Expansion.
- Grefenstette, G. 1992. Use of Syntactic Context to Produce Term Association Lists for Text Retrieval. In Proceedings of the 15th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, Copenhagen, Denmark, ed. N. Belkin, P. Ingwersen and A. M. Pejtersen, pp. 89-97. New York: ACM Press.
- Ide, E. 1971. New Experiments in Relevance Feedback. In G. Salton, ed., The SMART Retrieval System: Experiments in Automatic Document Processing. Englewood Cliffs, NJ: Prentice-Hall.
- Qiu, Y. 1993. Concept Based Query Expansion. In Proceedings of SIGIR-93, 16th ACM International Conference on Research and Development in Information Retrieval.
- Schutze, H. and J. Pedersen. 1997. A Cooccurrence-based Thesaurus and Two Applications to Information Retrieval. Information Processing and Management 33, no. 3, pp. 307-318.
- Walker, D. 2001. Query Expansion Using Thesauri.