Title: CS 430: Information Discovery
1 CS 430: Information Discovery
Lecture 21: Interactive Retrieval
2 Course Administration
Wireless laptop experiment: During the semester, we have been logging the URLs used via the nomad proxy server. Working with the HCI Group, we would like to analyze these URLs to study students' patterns of use of online information. The analysis will be completely anonymous. This requires your consent. If you have not signed a consent form, we have forms here for your signature. If you do not sign a consent form, the data will be discarded without being looked at.
3 The Human in the Loop
[Diagram: the user searches the index, which returns hits, and browses the repository, which returns objects.]
4 Query Refinement
[Flowchart: query formulation and search, then display the number of hits. If there are no hits, reformulate the query or start a new query. Otherwise display the retrieved information and decide the next step: reformulate the query or start a new query.]
5 Reformulation of Query
Manual:
- Add or remove search terms
- Change Boolean operators
- Change wild cards
Automatic:
- Remove search terms
- Change weighting of search terms
- Add new search terms
6 Query Reformulation: Vocabulary Tools
Feedback:
- Information about stop lists, stemming, etc.
- Numbers of hits on each term or phrase
Suggestions:
- Thesaurus
- Browse lists of terms in the inverted index
- Controlled vocabulary
7 Query Reformulation: Document Tools
Feedback to the user consists of document excerpts or surrogates, which show the user how the system has interpreted the query.
- Effective at suggesting how to restrict a search: shows examples of false hits.
- Less good at suggesting how to expand a search: shows no examples of missed items.
8 Example: Tilebars
The figure represents a set of hits from a text
search. Each large rectangle represents a
document or section of text. Each row represents
a search term or subquery. The density of each
small square indicates the frequency with which a
term appears in a section of a document.
Hearst 1995
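The densities in such a display can be computed by splitting each document into equal segments and counting term occurrences per segment. The sketch below illustrates the idea; the segmentation scheme and function name are illustrative assumptions, not Hearst's exact method.

```python
# Sketch: computing tilebar cell densities for a document.
# One row per search term, one cell per document segment.
# This is an illustrative assumption of the computation, not
# the exact method of Hearst (1995).

def tilebar_densities(text, terms, n_segments=8):
    """For each term, count its occurrences in each of
    n_segments equal slices of the document."""
    words = text.lower().split()
    seg_len = max(1, len(words) // n_segments)
    rows = {}
    for term in terms:
        counts = []
        for i in range(n_segments):
            # Trailing words beyond the last full segment are
            # ignored in this sketch.
            segment = words[i * seg_len:(i + 1) * seg_len]
            counts.append(segment.count(term))
        rows[term] = counts
    return rows

# Made-up document: one topic dominates the first half,
# another the second half.
doc = "cold cold flu cold advice advice advice flu"
print(tilebar_densities(doc, ["cold", "flu"], n_segments=2))
```

A renderer would then map each count to the shading of one small square, giving the user an at-a-glance view of where in the document each subquery matches.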
9 Document Vectors as Points on a Surface
Normalize all document vectors to be of length 1. Then the ends of the vectors all lie on a surface with unit radius. For similar documents, we can represent parts of this surface as a flat region. Similar documents are represented as points that are close together on this surface.
From Lecture 9
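A minimal numpy sketch of this picture, with made-up term vectors: after normalization every document sits on the unit sphere, and the dot product of two unit vectors is exactly their cosine similarity.

```python
import numpy as np

# Sketch: normalized document vectors on the unit sphere.
# The three term vectors are made-up illustrations.

def normalize(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

d1 = normalize(np.array([3.0, 1.0, 0.0]))
d2 = normalize(np.array([2.0, 1.0, 0.0]))   # similar to d1
d3 = normalize(np.array([0.0, 1.0, 4.0]))   # dissimilar to d1

# All three now lie on the surface of unit radius ...
assert abs(np.linalg.norm(d1) - 1.0) < 1e-9

# ... and for unit vectors, cosine similarity is just the dot product.
print(float(d1 @ d2), float(d1 @ d3))
```

Similar documents (d1, d2) give a dot product near 1, i.e., points close together on the sphere; dissimilar ones (d1, d3) give a value near 0.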
10 Theoretically Best Query
[Figure: documents plotted as points; x marks non-relevant documents, o marks relevant documents. The optimal query is marked near the relevant documents.]
11 Theoretically Best Query
For a specific query Q, let:
- D_R be the set of all relevant documents
- D_N-R be the set of all non-relevant documents
- sim(Q, D_R) be the mean similarity between query Q and the documents in D_R
- sim(Q, D_N-R) be the mean similarity between query Q and the documents in D_N-R
The theoretically best query would maximize:
  F = sim(Q, D_R) - sim(Q, D_N-R)
12 Estimating the Best Query
In practice, D_R and D_N-R are not known. (The objective is to find them.) However, the results of an initial query can be used to estimate sim(Q, D_R) and sim(Q, D_N-R).
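The estimation can be sketched directly from the definitions above: take the judged hits of the initial query as stand-ins for D_R and D_N-R and compute the mean similarities. The function names and vectors below are illustrative assumptions; all vectors are assumed to be unit length, so the dot product gives cosine similarity.

```python
import numpy as np

# Sketch: estimating F = sim(Q, D_R) - sim(Q, D_N-R) from the
# judged hits of an initial query. Names and data are made up
# for illustration.

def mean_similarity(query, docs):
    """Mean cosine similarity between a query and a set of
    unit-length document vectors."""
    if not docs:
        return 0.0
    return float(np.mean([query @ d for d in docs]))

def objective(query, relevant, non_relevant):
    """Estimate F = sim(Q, D_R) - sim(Q, D_N-R)."""
    return (mean_similarity(query, relevant)
            - mean_similarity(query, non_relevant))

# Made-up unit vectors: the query is close to the judged-relevant
# documents and far from the judged-non-relevant ones.
q = np.array([1.0, 0.0])
relevant = [np.array([1.0, 0.0])]
non_relevant = [np.array([0.0, 1.0])]
print(objective(q, relevant, non_relevant))
```

A larger value of this estimate suggests the query separates the judged-relevant hits from the judged-non-relevant ones well; relevance feedback then tries to move the query to increase it.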
13 Relevance Feedback (concept)
[Figure: hits from the original search; x marks documents identified as non-relevant, o marks documents identified as relevant, ? marks the original query. An arrow shows the move from the original query to the reformulated query, toward the relevant documents.]
From Lecture 9
14 Rocchio's Modified Query
Modified query vector =
  original query vector
  + mean of relevant documents found by original query
  - mean of non-relevant documents found by original query
15 Query Modification
  Q1 = Q0 + (1/n1) Σ Ri - (1/n2) Σ Si
where:
  Q0 = vector for the initial query
  Q1 = vector for the modified query
  Ri = vector for relevant document i
  Si = vector for non-relevant document i
  n1 = number of relevant documents
  n2 = number of non-relevant documents
Rocchio 1971
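Rocchio's modified query is a few lines of numpy. This is a sketch of the unweighted form, Q1 = Q0 + (1/n1) Σ Ri - (1/n2) Σ Si; the term vectors in the example are made up.

```python
import numpy as np

# Sketch of Rocchio's modified query (Rocchio 1971):
#   Q1 = Q0 + (1/n1) * sum(Ri) - (1/n2) * sum(Si)

def rocchio(q0, relevant, non_relevant):
    """Return the modified query vector Q1.

    q0           -- initial query vector Q0
    relevant     -- list of vectors Ri for judged-relevant documents
    non_relevant -- list of vectors Si for judged-non-relevant documents
    """
    q1 = np.asarray(q0, dtype=float).copy()
    if relevant:
        q1 += np.mean(relevant, axis=0)      # + (1/n1) * sum(Ri)
    if non_relevant:
        q1 -= np.mean(non_relevant, axis=0)  # - (1/n2) * sum(Si)
    return q1

# Made-up example: the relevant documents share a term the
# original query lacked, so the modified query picks it up.
q0 = np.array([1.0, 0.0, 0.0])
relevant = [np.array([0.0, 1.0, 0.0]), np.array([0.0, 1.0, 1.0])]
non_relevant = [np.array([1.0, 0.0, 1.0])]
print(rocchio(q0, relevant, non_relevant))
```

Note how the modified query gains weight on terms common to the relevant documents and loses weight on terms of the non-relevant one; negative components are often clipped to zero in practice.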
16 Difficulties with Relevance Feedback
[Figure: the same document cloud; x marks non-relevant documents, o marks relevant documents, ? marks the original query, with an arrow to the reformulated query. Hits from the initial query are contained in the gray shaded area; the optimal query lies outside it, among relevant documents the initial query never retrieved.]
17 Effectiveness of Relevance Feedback
Relevance feedback works best when:
- Relevant documents are tightly clustered (similarities are large)
- Similarities between relevant and non-relevant documents are small
18Positive and Negative Feedback
?, ? and ? are weights that adjust the importance
of the three vectors. If ? 0, the weights
provide positive feedback, by emphasizing the
relevant documents in the initial set. If ? 0,
the weights provide negative feedback, by
reducing the emphasis on the non-relevant
documents in the initial set.
19 When to Use Relevance Feedback
Relevance feedback is most important when the user wishes to increase recall, i.e., when it is important to find all relevant documents. Under these circumstances, users can be expected to put effort into searching:
- Formulate queries thoughtfully, with many terms
- Review results carefully to provide feedback
- Iterate several times
- Combine automatic query enhancement with studies of thesauruses and other manual enhancements