Title: Modern Information Retrieval Chapter 5 Query Operations
1 Modern Information Retrieval, Chapter 5: Query Operations
2 Introduction
- It is difficult to formulate queries that are well designed for retrieval purposes.
- The initial query formulation can be improved through query expansion and term reweighting.
- Approaches are based on:
  - relevance feedback information from the user
  - information derived from the set of documents initially retrieved (called the local set of documents)
  - global information derived from the document collection
3 User Relevance Feedback
- The user is presented with a list of the retrieved documents and, after examining them, marks those which are relevant.
- Two basic operations:
  - Query expansion: addition of new terms taken from the relevant documents
  - Term reweighting: modification of term weights based on the user's relevance judgements
4 User Relevance Feedback (cont'd)
- Uses of user relevance feedback:
  - expand queries with the vector model
  - reweight query terms with the probabilistic model
  - reweight query terms with a variant of the probabilistic model
5 Vector Model
- Definitions:
  - Weight: let k_i be a generic index term in the set K = {k_1, ..., k_t}. A weight w_{i,j} > 0 is associated with each index term k_i of a document d_j.
  - Document index term vector: the document d_j is associated with an index term vector represented by d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j}).
6 Vector Model (cont'd)
- Definitions (term weighting from Chapter 2):
  - normalized frequency: f_{i,j} = freq_{i,j} / max_l freq_{l,j}, where freq_{i,j} is the raw frequency of k_i in the document d_j
  - inverse document frequency for k_i: idf_i = log(N / n_i), where N is the number of documents and n_i the number of documents containing k_i
  - document term weight: w_{i,j} = f_{i,j} * log(N / n_i)
  - query term weight: w_{i,q} = (0.5 + 0.5 * freq_{i,q} / max_l freq_{l,q}) * log(N / n_i)
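The Chapter 2 weighting scheme above can be sketched in code. This is a minimal illustration assuming documents are given as token lists; the function name `tf_idf_weights` is mine, not from the chapter:

```python
import math

def tf_idf_weights(docs):
    """Compute w_{i,j} = f_{i,j} * log(N / n_i) for each term of each document,
    where f_{i,j} is the raw frequency of term k_i in document d_j divided by
    the largest raw frequency in d_j."""
    N = len(docs)
    vocab = {t for d in docs for t in d}
    # n_i: number of documents containing term k_i
    n = {t: sum(1 for d in docs if t in d) for t in vocab}
    weights = []
    for d in docs:
        max_freq = max(d.count(t) for t in set(d))
        weights.append({t: (d.count(t) / max_freq) * math.log(N / n[t])
                        for t in set(d)})
    return weights
```

Note that a term occurring in every document gets weight 0, since log(N/N) = 0.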
7 Vector Model (cont'd)
- Definitions:
  - Query vector: the query vector q is defined as q = (w_{1,q}, w_{2,q}, ..., w_{t,q})
  - D_r: set of relevant documents, as identified by the user, among the retrieved documents
  - D_n: set of non-relevant documents among the retrieved documents
  - C_r: set of relevant documents among all documents in the collection
  - alpha, beta, gamma: tuning constants
8 Query Expansion and Term Reweighting for the Vector Model
- Ideal case: C_r, the complete set of relevant documents for a given query q, is known.
- The best query vector is then given by
  q_opt = (1/|C_r|) * sum_{d_j in C_r} d_j - (1/(N - |C_r|)) * sum_{d_j not in C_r} d_j
- However, the relevant documents C_r are not known a priori; they are precisely what we are looking for.
9 Query Expansion and Term Reweighting for the Vector Model (cont'd)
- Three classic, similar ways to calculate the modified query q_m:
  - Standard_Rocchio: q_m = alpha*q + (beta/|D_r|) * sum_{d_j in D_r} d_j - (gamma/|D_n|) * sum_{d_j in D_n} d_j
  - Ide_Regular: q_m = alpha*q + beta * sum_{d_j in D_r} d_j - gamma * sum_{d_j in D_n} d_j
  - Ide_Dec_Hi: q_m = alpha*q + beta * sum_{d_j in D_r} d_j - gamma * max_non-relevant(d_j), where max_non-relevant(d_j) is the highest-ranked non-relevant document
- D_r and D_n are the document sets judged by the user.
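The Standard_Rocchio formula can be sketched as follows, assuming query and document vectors are represented as sparse dicts mapping terms to weights. The function name and the default constants are illustrative choices, not prescribed by the chapter:

```python
def rocchio(q, Dr, Dn, alpha=1.0, beta=0.75, gamma=0.15):
    """Standard Rocchio modified query:
    q_m = alpha*q + (beta/|Dr|) * sum(Dr) - (gamma/|Dn|) * sum(Dn).
    Vectors are dicts term -> weight; non-positive weights are dropped."""
    qm = {t: alpha * w for t, w in q.items()}
    for d in Dr:
        for t, w in d.items():
            qm[t] = qm.get(t, 0.0) + beta * w / len(Dr)
    for d in Dn:
        for t, w in d.items():
            qm[t] = qm.get(t, 0.0) - gamma * w / len(Dn)
    return {t: w for t, w in qm.items() if w > 0}
```

Terms absent from the original query but present in relevant documents enter q_m with positive weight, which is exactly the query-expansion effect.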
10 Term Reweighting for the Probabilistic Model
- Similarity in the vector model is the correlation between the vectors d_j and q, quantified, for instance, by the cosine of the angle between them.
- The probabilistic model instead ranks documents according to the probabilistic ranking principle. Define:
  - P(k_i|R): the probability of observing the term k_i in the set R of relevant documents
  - P(k_i|R̄): the probability of observing the term k_i in the set R̄ of non-relevant documents
- sim(d_j, q) ∝ sum_i w_{i,q} * w_{i,j} * ( log(P(k_i|R) / (1 - P(k_i|R))) + log((1 - P(k_i|R̄)) / P(k_i|R̄)) )   (5.2)
11 Term Reweighting for the Probabilistic Model (cont'd)
- The similarity of a document d_j to a query q can be expressed by equation (5.2).
- For the initial search, the equation is estimated with the following assumptions, where n_i is the number of documents which contain the index term k_i:
  P(k_i|R) = 0.5 and P(k_i|R̄) = n_i / N
- which yields sim(d_j, q) ∝ sum_i w_{i,q} * w_{i,j} * log((N - n_i) / n_i)
12 Term Reweighting for the Probabilistic Model (cont'd)
- For the feedback search:
  - P(k_i|R) and P(k_i|R̄) can be approximated as
    P(k_i|R) = |D_{r,i}| / |D_r| and P(k_i|R̄) = (n_i - |D_{r,i}|) / (N - |D_r|)
    where D_r is the set of relevant documents according to the user judgement and D_{r,i} is the subset of D_r composed of the documents which contain the term k_i.
  - The similarity of d_j to q is then recomputed with equation (5.2) using these estimates.
- No query expansion occurs in this procedure; only the query term weights change.
13 Term Reweighting for the Probabilistic Model (cont'd)
- Adjustment factor:
  - Because |D_r| and |D_{r,i}| are usually small, a 0.5 adjustment factor is added to the estimates of P(k_i|R) and P(k_i|R̄):
    P(k_i|R) = (|D_{r,i}| + 0.5) / (|D_r| + 1),  P(k_i|R̄) = (n_i - |D_{r,i}| + 0.5) / (N - |D_r| + 1)
  - An alternative adjustment factor is n_i/N:
    P(k_i|R) = (|D_{r,i}| + n_i/N) / (|D_r| + 1),  P(k_i|R̄) = (n_i - |D_{r,i}| + n_i/N) / (N - |D_r| + 1)
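The adjusted estimates plug into the log-odds term weight of equation (5.2). A minimal sketch, assuming the 0.5 adjustment factor and a function name of my own choosing:

```python
import math

def prob_term_weight(n_i, N, dr, dri):
    """Feedback term weight for index term k_i, using the 0.5 adjustment:
    P(k_i|R)    ~ (|D_{r,i}| + 0.5) / (|D_r| + 1)
    P(k_i|notR) ~ (n_i - |D_{r,i}| + 0.5) / (N - |D_r| + 1)
    Returns log(P/(1-P)) + log((1-Q)/Q), the factor in equation (5.2)."""
    p = (dri + 0.5) / (dr + 1)
    q = (n_i - dri + 0.5) / (N - dr + 1)
    return math.log(p / (1 - p)) + math.log((1 - q) / q)
```

A term present in most judged-relevant documents but rare in the collection gets a large positive weight; a term as frequent among non-relevant documents as among relevant ones gets a weight near zero.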
14 A Variant of Probabilistic Term Reweighting
- In 1983, Croft extended the above weighting scheme by suggesting distinct initial search methods and by adapting the probabilistic formula to include within-document frequency weights.
- In this variant of probabilistic term reweighting, F_{i,j,q} is a factor which depends on the triple (k_i, d_j, q).
15 A Variant of Probabilistic Term Reweighting (cont'd)
- Distinct formulations are used for the initial search and for the feedback searches.
- Initial search: f_{i,j} is a normalized within-document frequency, f_{i,j} = K + (1 - K) * freq_{i,j} / max_l freq_{l,j}; the constants C and K should be adjusted according to the collection.
- Feedback searches: the factor F_{i,j,q} combines the probabilistic log-odds weights of equation (5.2) with the within-document frequency factor f_{i,j}.
16 Automatic Local Analysis
- Clustering: the grouping of documents which satisfy a set of common properties.
- The goal is to automatically obtain a description for a larger cluster of relevant documents, i.e., to identify terms which are related to the query terms, such as:
  - synonyms
  - stemming variations
  - terms with a distance of at most k words from a query term
17 Automatic Local Analysis (cont'd)
- In a local strategy, the documents retrieved for a given query q are examined at query time to determine terms for query expansion.
- Two basic types of local strategy:
  - Local clustering
  - Local context analysis
- Local strategies are suited to intranet environments rather than to web documents.
18 Query Expansion Through Local Clustering
- Local feedback strategies expand the query with terms correlated to the query terms. Such correlated terms are those present in local clusters built from the local document set.
19 Query Expansion Through Local Clustering (cont'd)
- Definitions:
  - Stem: let V(s) be a non-empty subset of words which are grammatical variants of each other. A canonical form s of V(s) is called a stem.
    - Example: if V(s) = {polish, polishing, polished}, then s = polish.
  - D_l: the local document set, i.e., the set of documents retrieved for a given query q
- Strategies for building local clusters:
  - Association clusters
  - Metric clusters
  - Scalar clusters
20 Association Clusters
- An association cluster is based on the co-occurrence of stems inside documents.
- Definitions:
  - fs_{i,j}: the frequency of a stem s_i in a document d_j
  - Let m = (m_{ij}) be an association matrix with |S_l| rows and |D_l| columns, where m_{ij} = fs_{i,j}.
  - The matrix s = m m^t is a local stem-stem association matrix.
  - Each element s_{u,v} in s expresses a correlation c_{u,v} between the stems s_u and s_v:
    c_{u,v} = sum_{d_j in D_l} fs_{u,j} * fs_{v,j}
21 Association Clusters (cont'd)
- The correlation factor c_{u,v} quantifies the absolute frequencies of co-occurrence.
- Unnormalized association matrix: s_{u,v} = c_{u,v}
- Normalized: s_{u,v} = c_{u,v} / (c_{u,u} + c_{v,v} - c_{u,v})
22 Association Clusters (cont'd)
- To build local association clusters:
  - Consider the u-th row in the association matrix.
  - Let S_u(n) be a function which takes the u-th row and returns the set of n largest values s_{u,v}, where v varies over the set of local stems and v ≠ u.
  - Then S_u(n) defines a local association cluster around the stem s_u.
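The construction above can be sketched end to end. A minimal illustration assuming the local documents are given as lists of stems and using the normalized correlation; the function name is mine:

```python
def association_clusters(docs, n):
    """Build local association clusters S_u(n) for every stem.
    c_{u,v} = sum_j fs_{u,j} * fs_{v,j}; clusters rank stems v != u by the
    normalized correlation s_{u,v} = c_{u,v} / (c_{u,u} + c_{v,v} - c_{u,v})."""
    stems = sorted({t for d in docs for t in d})
    # fs_{i,j}: frequency of stem s_i in document d_j
    freq = [{t: d.count(t) for t in stems} for d in docs]
    c = {u: {v: sum(f[u] * f[v] for f in freq) for v in stems} for u in stems}
    clusters = {}
    for u in stems:
        s = {v: c[u][v] / (c[u][u] + c[v][v] - c[u][v])
             for v in stems if v != u}
        clusters[u] = sorted(s, key=s.get, reverse=True)[:n]
    return clusters
```

The normalization denominator is never zero, since every stem in the vocabulary occurs in at least one local document, so c_{u,u} > 0.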
23 Metric Clusters
- Two terms which occur in the same sentence seem more correlated than two terms which occur far apart in a document.
- It might be worthwhile to factor in the distance between two terms when computing their correlation factor.
24 Metric Clusters (cont'd)
- Let r(k_i, k_j) be the distance (in words) between two keywords k_i and k_j in the same document.
- If k_i and k_j are in distinct documents, we take r(k_i, k_j) = ∞.
- A local stem-stem metric correlation matrix s is defined such that each element s_{u,v} of s expresses a metric correlation c_{u,v} between the stems s_u and s_v:
  c_{u,v} = sum_{k_i in V(s_u)} sum_{k_j in V(s_v)} 1 / r(k_i, k_j)
25 Metric Clusters (cont'd)
- Given a local metric matrix s, to build local metric clusters:
  - Consider the u-th row in the metric correlation matrix.
  - Let S_u(n) be a function which takes the u-th row and returns the set of n largest values s_{u,v}, where v varies over the set of local stems and v ≠ u.
  - Then S_u(n) defines a local metric cluster around the stem s_u.
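The metric correlation can be sketched as follows. The representation is an assumption of mine: each stem's keyword occurrences are given as a dict mapping a document id to a list of word positions, so occurrences in distinct documents never pair up (distance ∞ contributes 0):

```python
def metric_correlation(positions_u, positions_v):
    """c_{u,v} = sum over occurrence pairs of 1 / r(k_i, k_j), where r is the
    word distance inside one document. Pairs from distinct documents are
    skipped, matching r = infinity (a 1/inf contribution of 0)."""
    c = 0.0
    for doc, pos_u in positions_u.items():
        for i in pos_u:
            for j in positions_v.get(doc, []):
                if i != j:  # avoid a zero distance for identical positions
                    c += 1.0 / abs(i - j)
    return c
```

Adjacent occurrences contribute a full 1.0 each, while occurrences far apart contribute almost nothing, which is precisely the distance weighting the slide motivates.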
26 Scalar Clusters
- Two stems with similar neighborhoods have some synonymity relationship.
- One way to quantify such neighborhood relationships is to arrange all correlation values s_{u,i} in a vector s_u, to arrange all correlation values s_{v,i} in another vector s_v, and to compare these vectors through a scalar measure.
27 Scalar Clusters (cont'd)
- Let s_u = (s_{u,1}, s_{u,2}, ..., s_{u,n}) and s_v = (s_{v,1}, s_{v,2}, ..., s_{v,n}) be two vectors of correlation values for the stems s_u and s_v.
- Let s = (s_{u,v}) be a scalar association matrix. Each s_{u,v} can be defined as the cosine:
  s_{u,v} = (s_u · s_v) / (|s_u| |s_v|)
- Let S_u(n) be a function which returns the set of n largest values s_{u,v}, v ≠ u. Then S_u(n) defines a scalar cluster around the stem s_u.
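The cosine comparison of two neighborhood vectors can be sketched as (function name mine; vectors are plain lists of correlation values):

```python
import math

def scalar_correlation(su, sv):
    """s_{u,v} = (s_u . s_v) / (|s_u| |s_v|): cosine between two rows of
    correlation values. Returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(su, sv))
    norm_u = math.sqrt(sum(a * a for a in su))
    norm_v = math.sqrt(sum(b * b for b in sv))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Stems with identical neighborhoods score 1.0 regardless of magnitude, which is what makes the cosine a pure measure of neighborhood similarity.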
28 Interactive Search Formulation
- Stems (or terms) that belong to clusters associated with the query stems (or terms) can be used to expand the original query.
- A stem s_u which belongs to a cluster (of size n) associated with another stem s_v (i.e., s_u ∈ S_v(n)) is said to be a neighbor of s_v.
29 Interactive Search Formulation (cont'd)
- [Figure: the stem s_u as a neighbor of the stem s_v]
30 Interactive Search Formulation (cont'd)
- For each query stem s_v, select m neighbor stems from the cluster S_v(n) (which might be of type association, metric, or scalar) and add them to the query.
- Hopefully, the additional neighbor stems will retrieve new relevant documents.
- S_v(n) may be composed of stems obtained using normalized or unnormalized correlation factors:
  - a normalized cluster tends to group stems which are more rare.
  - an unnormalized cluster tends to group stems due to their large frequencies.
31 Interactive Search Formulation (cont'd)
- Information about correlated stems can also be used to improve the search.
- Let two stems s_u and s_v be correlated with a correlation factor c_{u,v}.
- If c_{u,v} is larger than a predefined threshold, then a neighbor stem of s_u can also be interpreted as a neighbor stem of s_v, and vice versa.
- This provides greater flexibility, particularly with Boolean queries:
  - Consider the expression (s_u ∨ s_v), where the symbol ∨ stands for disjunction.
  - Let s_u' be a neighbor stem of s_u.
  - Then one can try (s_u' ∨ s_v) as a synonym search expression, because of the correlation given by c_{u,v}.
32 Query Expansion Through Local Context Analysis
- The local context analysis procedure operates in three steps:
  1. Retrieve the top n ranked passages using the original query. This is accomplished by breaking up the documents initially retrieved by the query into fixed-length passages (for instance, of size 300 words) and ranking these passages as if they were documents.
  2. For each concept c in the top ranked passages, the similarity sim(q, c) between the whole query q (not individual query terms) and the concept c is computed using a variant of tf-idf ranking.
33 Query Expansion Through Local Context Analysis (cont'd)
  3. The top m ranked concepts (according to sim(q, c)) are added to the original query q. Each added concept is assigned a weight of 1 - 0.9 * i/m, where i is the position of the concept in the final concept ranking. The terms in the original query q might be stressed by assigning a weight equal to 2 to each of them.
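Step 3's weighting can be sketched as follows, assuming the concepts arrive already ranked by sim(q, c); the function name and the dict representation of the expanded query are my own:

```python
def lca_expand(query_terms, ranked_concepts, m, query_weight=2.0):
    """Local context analysis expansion, step 3: add the top m ranked
    concepts to the query. The concept at 1-based rank i gets weight
    1 - 0.9 * i / m; original query terms keep weight 2."""
    expanded = {t: query_weight for t in query_terms}
    for i, concept in enumerate(ranked_concepts[:m], start=1):
        # setdefault keeps the original query weight if the concept
        # is already a query term
        expanded.setdefault(concept, 1 - 0.9 * i / m)
    return expanded
```

The weight decays linearly from 0.1 below 1.0 for the top concept down to 0.1 for the m-th, so added concepts never dominate the doubled original query terms.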