Query operations
1
Query operations
  • 1- Introduction
  • 2- Relevance feedback with user relevance
    information
  • 3- Relevance feedback without user relevance
    information
  • - Local analysis (pseudo-relevance feedback)
  • - Global analysis (thesaurus)
  • 4- Evaluation
  • 5- Issues

2
Introduction (1)
  • Users have no detailed knowledge of the
    collection and the retrieval environment
  • This makes it difficult to formulate queries
    well designed for retrieval
  • Effective retrieval usually needs several
    formulations of the query
  • The first formulation is often a naïve attempt
    to retrieve relevant information
  • Documents initially retrieved
  • are examined for relevance information (by the
    user, or automatically)
  • and used to improve the query formulation so as
    to retrieve additional relevant documents
  • Query reformulation
  • Expanding the original query with new terms
  • Reweighting the terms in the expanded query

3
Introduction (2)
  • Approaches based on feedback from users
    (relevance feedback)
  • Approaches based on information derived from
    the set of initially retrieved documents (the
    local document set)
  • Approaches based on global information derived
    from document collection

4
Relevance feedback with user relevance
information (1)
  • Most popular query reformulation strategy
  • Cycle
  • User is presented with a list of retrieved
    documents
  • User marks those that are relevant
  • In practice, the top 10-20 ranked documents are
    examined
  • The process is incremental
  • Select important terms from the documents
    assessed relevant by the user
  • Enhance the importance of these terms in a new
    query
  • Expected effect
  • The new query moves towards the relevant
    documents and away from the non-relevant ones

5
Relevance feedback with user relevance
information (2)
  • Two basic techniques
  • Query expansion
  • Add new terms from relevant documents
  • Term reweighting
  • Modify term weights based on user relevance
    judgements
  • Advantages
  • Shields users from the details of the query
    reformulation process
  • Breaks the search down into a sequence of small
    steps
  • A controlled process
  • Emphasise some terms (relevant ones)
  • De-emphasise others (non-relevant ones)

6
Relevance feedback with user relevance
information (3)
  • Query expansion and term reweighting in the
    vector space model
  • Term reweighting in the probabilistic model

7
Query expansion and term reweighting in the
vector space model
  • Term weight vectors of documents assessed
    relevant
  • are similar among themselves
  • Term weight vectors of documents assessed
    non-relevant
  • are dissimilar to those of the relevant
    documents
  • Reformulated query
  • is moved closer to the term weight vectors of
    the relevant documents

8
Query expansion and term reweighting in the
vector space model
  • For query q
  • Dr: set of relevant documents among the
    retrieved documents
  • Dn: set of non-relevant documents among the
    retrieved documents
  • Cr: set of relevant documents among all
    documents in the collection
  • α, β, γ: tuning constants
  • Assume that Cr is known (unrealistic!)
  • Best query vector for distinguishing relevant
    documents from non-relevant documents (below)
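Under this assumption, the optimal query vector separating Cr from the rest of the collection is commonly written as (with N the number of documents in the collection):

    \vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j
                  - \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j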

9
Query expansion and term reweighting in
thevector space model
  • Problem: Cr is unknown
  • Approach
  • Formulate an initial query
  • Incrementally change the initial query vector
  • Use Dr and Dn instead of Cr
  • Rocchio formula
  • Ide formula

10
Rocchio formula
  • Direct application of the previous formula,
    with the original query added in (see the
    formula below)
  • Initial formulation: α = 1
  • Usually the information in relevant documents
    is more important than that in non-relevant
    documents (γ << β)
  • Positive relevance feedback: γ = 0
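In standard notation, with α, β, γ weighting the original query, the relevant documents and the non-relevant documents respectively:

    \vec{q}_m = \alpha \vec{q} + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j
              - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j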

11
Rocchio formula in practice (SMART)
  • α = 1
  • Terms kept
  • Those in the original query
  • Those appearing in more relevant documents than
    non-relevant documents
  • Those appearing in more than half the relevant
    documents
  • Negative weights are ignored (see the sketch
    below)
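As an illustration, a minimal sketch of Rocchio reweighting over term-weight vectors; the function name and the β, γ defaults are illustrative choices, not values prescribed by SMART:

    import numpy as np

    def rocchio(q, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.25):
        # q          : original query term-weight vector
        # rel_docs   : term-weight vectors of documents judged relevant (Dr)
        # nonrel_docs: term-weight vectors of documents judged non-relevant (Dn)
        # beta/gamma defaults are common choices, not taken from the slides
        q_new = alpha * np.asarray(q, dtype=float)
        if len(rel_docs) > 0:
            q_new += beta * np.mean(rel_docs, axis=0)
        if len(nonrel_docs) > 0:
            q_new -= gamma * np.mean(nonrel_docs, axis=0)
        # SMART practice: ignore (clip) negative weights
        return np.maximum(q_new, 0.0)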

12
Ide formula
  • Initial formulation: α = β = γ = 1
  • Same comments as for the Rocchio formula
  • Neither Ide nor Rocchio rests on an optimality
    criterion
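The Ide (regular) formula sums document vectors without normalising by set size; the dec-hi variant subtracts only the highest-ranked non-relevant document:

    \vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j
              - \gamma \sum_{\vec{d}_j \in D_n} \vec{d}_j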

13
Term reweighting for the probabilistic model
  • (see the note on the BIR model)
  • Use idf to rank documents for the original
    query
  • Calculate term weights from the relevance
    information
  • Predict relevance
  • Improved (optimal) retrieval function (below)
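One common form of this retrieval function, with p_i = P(t_i present | relevant) and u_i = P(t_i present | non-relevant) estimated from the feedback counts:

    sim(d_j, q) \propto \sum_{t_i \in q \cap d_j} \log \frac{p_i (1 - u_i)}{u_i (1 - p_i)},
    \qquad p_i \approx \frac{r_i}{R}, \quad u_i \approx \frac{n_i - r_i}{N - R}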

14
Term reweighting for the probabilistic model
  • Independence assumptions
  • I1: terms are distributed independently in the
    relevant documents, and independently in all
    documents
  • I2: terms are distributed independently in the
    relevant documents, and independently in the
    non-relevant documents
  • Ordering principles
  • O1: probable relevance is based on the presence
    of search terms in documents
  • O2: probable relevance is based on both the
    presence of search terms in documents and
    their absence from documents

15
Term reweighting for the probabilistic model
  • Combining the independence assumptions with
    the ordering principles yields four weighting
    formulas: F1 (I1, O1), F2 (I2, O1), F3 (I1,
    O2) and F4 (I2, O2)

16
Term reweighting for the probabilistic model
  • F1 formula (below)
  • the ratio of the proportion of relevant
    documents in which the query term ti occurs to
    the proportion of all documents in which ti
    occurs
  • ri: number of relevant documents containing ti
  • ni: number of documents containing ti
  • R: number of relevant documents
  • N: number of documents in the collection
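In symbols (the Robertson/Sparck Jones weights):

    F_1 = \log \frac{r_i / R}{n_i / N}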

17
Term reweighting for the probabilistic model
  • F2 formula (below)
  • the ratio of the proportion of relevant
    documents in which ti occurs to the proportion
    of non-relevant documents in which ti occurs
  • ri: number of relevant documents containing ti
  • ni: number of documents containing ti
  • R: number of relevant documents
  • N: number of documents in the collection
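In symbols:

    F_2 = \log \frac{r_i / R}{(n_i - r_i) / (N - R)}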

18
Term reweighting for the probabilistic model
  • F3 formula (below)
  • the ratio of the relevance odds (relevant
    documents containing ti to relevant documents
    not containing ti) to the collection odds
    (documents containing ti to documents not
    containing ti)
  • ri: number of relevant documents containing ti
  • ni: number of documents containing ti
  • R: number of relevant documents
  • N: number of documents in the collection
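In symbols:

    F_3 = \log \frac{r_i / (R - r_i)}{n_i / (N - n_i)}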

19
Term reweighting for the probabilistic model
  • F4 formula (below)
  • the ratio of the relevance odds to the
    non-relevance odds (non-relevant documents
    containing ti to non-relevant documents not
    containing ti)
  • ri: number of relevant documents containing ti
  • ni: number of documents containing ti
  • R: number of relevant documents
  • N: number of documents in the collection
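In symbols:

    F_4 = \log \frac{r_i / (R - r_i)}{(n_i - r_i) / (N - n_i - R + r_i)}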

20
Experiments
  • F1, F2, F3 and F4 all outperform no relevance
    weighting and plain idf
  • F1 and F2 perform in the same range, as do F3
    and F4
  • F3 and F4 > F1 and F2
  • F4 slightly > F3
  • So O2 is the better ordering principle (look at
    both the presence and the absence of terms)
  • No firm conclusion with respect to I1 and I2,
    although I2 seems the more realistic assumption

21
Relevance feedback without user relevance
information
  • With user relevance information
  • Clustering hypothesis: known relevant documents
    contain terms which can be used to describe a
    larger cluster of relevant documents
  • Description of the cluster built interactively,
    with user assistance
  • Without user relevance information
  • Obtain the cluster description automatically
  • Identify terms related to the query terms
  • (e.g. synonyms, stemming variations, terms
    close to the query terms in the text)
  • Local strategies
  • Global strategies

22
Local analysis (pseudo-relevance feedback)
  • Examine the documents retrieved for the query
    to determine the query expansion
  • No user assistance
  • Uses clustering techniques
  • Risk: query drift (the expanded query can move
    away from the user's intent)

23
Clusters (1)
  • Synonymy association (one example): terms that
    frequently co-occur inside the local set of
    documents
  • Term-term (e.g., stem-stem) association matrix
    (normalised, as below)
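A standard normalised association matrix over the local document set D_l, with f_{u,j} the frequency of term t_u in document d_j:

    c_{u,v} = \sum_{d_j \in D_l} f_{u,j} \, f_{v,j},
    \qquad m_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}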

24
Clusters (2)
  • For term ti
  • Take the n largest values m_{i,j}
  • The resulting terms tj form the cluster for ti
  • For query q
  • Find the clusters for the query terms of q
  • Keep the clusters small
  • Expand the original query, as in the sketch
    below
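A minimal sketch of this local clustering step; the function name and the guard against zero denominators are illustrative assumptions:

    import numpy as np

    def association_clusters(tf, query_term_ids, n=3):
        # tf: |terms| x |local docs| matrix of term frequencies f[u, j]
        c = tf @ tf.T                                # c[u,v] = sum_j f[u,j]*f[v,j]
        diag = np.diag(c).astype(float)
        denom = diag[:, None] + diag[None, :] - c
        m = np.divide(c, denom, out=np.zeros_like(denom),
                      where=denom > 0)               # normalised association
        clusters = {}
        for ti in query_term_ids:
            sims = m[ti].copy()
            sims[ti] = -np.inf                       # exclude the term itself
            clusters[ti] = np.argsort(sims)[::-1][:n]  # n largest m_{i,j}
        return clusters                              # candidate expansion terms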

25
Global analysis
  • Expand the query using information from the
    whole set of documents in the collection
  • Build a thesaurus-like structure from all
    documents
  • An approach to automatically build the
    thesaurus
  • (e.g. a similarity thesaurus based on
    co-occurrence frequencies)
  • An approach to select terms for query expansion

26
Evaluation of relevance feedback strategies
  • Use the original query q_i and compute a
    precision-recall graph
  • Use the reformulated query q_{i+1} and compute
    a precision-recall graph
  • Use all documents in the collection
  • Spectacular improvements
  • but partly because already-seen relevant
    documents are ranked higher
  • these documents are already known to the user
  • Must evaluate with respect to the documents not
    seen by the user
  • Three techniques (next slides)

27
Evaluation of relevance feedback strategies
  • Freezing
  • Full freezing
  • The top n documents are frozen (the ones used
    in relevance feedback)
  • The remaining documents are re-ranked
  • Precision-recall computed on the whole ranking
  • Changes in effectiveness thus come from the
    unseen documents
  • With many iterations, the growing contribution
    of frozen documents may lead to a decrease in
    measured effectiveness
  • Modified freezing
  • Freeze up to the rank position of the last
    document marked relevant

28
Evaluation of relevance feedback strategies
  • Test and control groups
  • Random split of the documents into test
    documents and control documents
  • Query reformulation performed on the test
    documents
  • The new query is run against the control
    documents
  • Relevance feedback thus never sees the control
    group
  • Difficulty: how to split the collection
  • (distribution of relevant documents)

29
Evaluation of relevance feedback strategies
  • Residual ranking (sketch below)
  • The documents used in assessing relevance are
    removed from the collection
  • Precision-recall computed on the residual
    collection
  • Considers only the effect on unseen documents
  • Results not comparable with the original
    ranking (fewer relevant documents)
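A minimal sketch of residual-collection evaluation; the function name and the precision-at-k choice are illustrative, not a standard API:

    def residual_precision_at_k(ranking, assessed, relevant, k=10):
        # Drop every document the user already assessed during feedback,
        # then measure precision at k on what remains.
        residual = [d for d in ranking if d not in assessed]
        top = residual[:k]
        return sum(1 for d in top if d in relevant) / k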

30
Issues
  • Interface
  • Should let the user quickly identify relevant
    and non-relevant documents
  • What happens with 2D and 3D visualisation?
  • Global analysis
  • On the web?
  • e.g. Yahoo!
  • Local analysis
  • Computational cost (performed on-line)
  • Interactive query expansion
  • The user chooses the terms to be added

31
Negative relevance feedback
  • Documents explicitly marked as non-relevant by
    users
  • Implementation
  • Clarity
  • Usability