IST2140 Information Storage and Retrieval - PowerPoint PPT Presentation

1
IST2140Information Storage and Retrieval
  • Week 4
  • Information Retrieval Models II

2
Information Retrieval Models
  • A model is an embodiment of the theory in which
    we define a set of objects about which assertions
    can be made and restrict the ways in which
    classes of objects can interact
  • A retrieval model specifies the representations
    used for documents and information needs, and how
    they are compared.
  • (Turtle & Croft, 1992)

3
Information Retrieval Models
  • An IR model specifies the representations used
    for documents and information needs, and how they
    are compared
  • Three classic models
  • Boolean Model
  • Vector Space Model
  • Probabilistic Model

4
Boolean Model
  • Exact match system
  • Document features are binary variables {0, 1}
  • Features can be document terms, also more complex
    information (dates, source, authors, etc.)
  • Query is a Boolean expression using feature
    variables related by AND, OR, NOT operators
  • All documents that match the query are retrieved
    and are treated as equally relevant (no ranking)
  • Extensions to model
  • Add additional operators, e.g. positional
  • Introduce term weights
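
The exact-match behaviour above can be sketched in a few lines. The documents and terms here are invented for illustration; each document is reduced to a set of binary term features, and a query is evaluated with ordinary set operations, so every matching document is returned as an equal:

```python
# Minimal sketch of exact-match Boolean retrieval (illustrative data).
docs = {
    1: {"information", "retrieval", "boolean"},
    2: {"information", "storage"},
    3: {"vector", "retrieval"},
}

def having(term):
    """Ids of documents whose binary feature for `term` is 1 (present)."""
    return {d for d, terms in docs.items() if term in terms}

and_result = having("information") & having("retrieval")   # AND
or_result = having("storage") | having("vector")           # OR
not_result = having("retrieval") - having("boolean")       # AND NOT

print(sorted(and_result), sorted(or_result), sorted(not_result))
```

Note that the result sets carry no ordering information: every retrieved document is an equally good match, which is exactly the weakness the weighted extensions address.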

5
Vector space model
  • Documents and queries are represented as vectors
    in l-dimensional hyperspace
  • Each dimension corresponds to possible document
    feature
  • Vector elements are usually weighted 0 to 1
  • Matching function is a distance metric that
    operates on document and query vectors
  • Documents can be ranked in order of distance from
    query
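
A minimal sketch of the idea, using invented documents and cosine similarity as the matching function (weights in [0, 1]; each dict key is one dimension of the hyperspace):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    terms = set(u) | set(v)
    dot = sum(u.get(t, 0.0) * v.get(t, 0.0) for t in terms)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = {
    "d1": {"information": 0.8, "retrieval": 0.6},
    "d2": {"storage": 0.9, "information": 0.3},
}
query = {"information": 1.0, "retrieval": 1.0}

# rank documents by decreasing similarity to the query vector
ranked = sorted(docs, key=lambda d: cosine(docs[d], query), reverse=True)
print(ranked)
```

Unlike the Boolean model, this produces a ranking rather than an unordered match set.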

6
Probabilistic Model
  • First proposed by Maron & Kuhns (1960)
  • Further developed by (inter alia)
  • Maron & Cooper
  • Van Rijsbergen and colleagues
  • Croft and Turtle (InQuery)
  • Robertson and colleagues (Okapi)

7
Probabilistic Model
  • Views retrieval as an attempt to answer a basic
    question: what is the probability that this
    document is relevant to this query?
  • expressed as
  • P(REL|D)
  • i.e. P(x|y), the probability of x given y (here,
    the probability of relevance given a particular
    document D)

8
Assumptions
  • document here means the content representation or
    description, i.e. surrogate
  • relevance is binary
  • relevance of a document is independent of
    relevance of other documents
  • terms are independent of one another

9
Independence Assumption
  • General model is developed by making a strong
    Independence Assumption
  • Given relevance, the attributes are statistically
    independent
  • i.e. within each class of documents, relevant or
    non-relevant, the attributes are statistically
    independent
  • Simplifies theory

10
Probability Ranking Principle
  • If retrieved documents are ordered by decreasing
    probability of relevance on the data available,
    then the system's effectiveness is the best that
    can be obtained for that data

11
Formal Model
  • Given a document D and query Q, there are two
    possible events
  • REL, that D is relevant to Q
  • notREL, that D is not relevant to Q
  • We want to calculate the probability
  • P(REL|D),
  • i.e. the probability that a document is relevant
    given its content

12
Formal Model
  • Assuming relevance is binary
  • P(notREL) = 1 - P(REL)
  • Probability that a document is relevant is
  • P(REL) = n / N, where
  • n = no. of relevant docs
  • N = no. of docs
  • Probability that a document is not relevant is
  • P(notREL) = 1 - P(REL) = (N - n) / N

13
  • Assuming term independence, then
  • P(D|REL) = ∏i P(ai|REL)
  • i.e. probability that a document is relevant is
    based on the probability of relevance of
    individual terms
  • Can derive a matching score of a document which
    is the sum of the weights of the matching
    (present) terms
  • For simplicity use log-odds rather than odds

14
What term weights to use?
  • What contribution does the presence of a term in
    specific documents make to a document's
    probability of relevance?
  • E.g.
  • CFW = collection frequency weight = log(N/ni)
  • RW = relevance weight
  • CW = combined weight (using in-document frequency)
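
A minimal sketch of the collection frequency weight, CFW = log(N/ni), and of the matching score as a sum of the weights of the query terms present in a document. The collection counts below are invented for illustration; note how the frequent term "the" contributes almost nothing:

```python
import math

N = 1000                      # documents in the collection (invented)
n = {"information": 400, "retrieval": 50, "the": 990}   # docs containing term

def cfw(term):
    """Collection frequency weight (idf): log(N / n_i)."""
    return math.log(N / n[term])

def score(doc_terms, query_terms):
    """Sum the weights of the matching (present) query terms."""
    return sum(cfw(t) for t in query_terms if t in doc_terms and t in n)

doc = {"information", "retrieval", "the"}
print(round(score(doc, {"information", "retrieval"}), 3))
```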

15
Formula for relevance weights
  •          (r + 0.5)(N - n - R + r + 0.5)
  • RW = log -------------------------------
  •          (R - r + 0.5)(n - r + 0.5)
  • where r = no. of relevant docs containing term i
  • n = no. of docs containing term i
  • R = no. of relevant docs for the query
  • N = no. of docs in the collection
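
The relevance weight formula (with the standard 0.5 corrections) can be computed directly. The counts below are invented for illustration; a term concentrated in the known relevant documents gets a large positive weight:

```python
import math

def relevance_weight(r, n, R, N):
    """Relevance weight with 0.5 corrections.
    r = relevant docs containing the term, n = docs containing the term,
    R = relevant docs, N = docs in the collection."""
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((R - r + 0.5) * (n - r + 0.5)))

# a term in 8 of 10 known relevant docs but only 20 of 1000 docs overall
print(round(relevance_weight(r=8, n=20, R=10, N=1000), 3))
```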

16
Probabilistic Model
  • Extensively tested
  • See for instance Sparck Jones et al., A
    probabilistic model of information retrieval:
    development and comparative experiments, Parts I
    and II. Information Processing & Management 36:
    779-840.
  • -based on performance figures from Cranfield,
    UKCIS, NPL, TREC

17
Sparck Jones Evaluation
  • CW > RW > CFW (idf) > UW
  • i.e. performance improves in steps from
  • unweighted terms, to
  • collection frequency weights, to
  • relevance weights, to
  • combined weights (using in-document frequency)
  • Largest single gains from using term frequency
    information
  • Also very noticeable gain from using relevance
    information

18
Classic IR Models
  • Boolean considered weakest
  • Vector vs. probabilistic
  • Through several different measures, Salton and
    Buckley showed that the vector model is expected
    to outperform the probabilistic model with
    general collections. This also seems to be the
    dominant thought among researchers,
    practitioners, and the Web community, where the
    popularity of the vector model runs high.
    (Baeza-Yates & Ribeiro-Neto, p. 34)

19
Classic IR Models
  • Vector vs. probabilistic
  • Numerous experiments demonstrate that
    probabilistic retrieval procedures yield good
    results. However, the results have not been
    sufficiently better than those obtained using
    Boolean or vector techniques to convince system
    developers to move heavily in this direction
    (Korfhage, p. 92)
  • However, see recent Sparck Jones et al article,
    and success of probabilistic systems such as
    InQuery and Okapi at TREC.

20
Latest Model: the Language Model
  • a statistical language model is a probabilistic
    model for generating text
  • In the late 1990s the language model was proposed
    for information retrieval
  • (not a new model, but new for IR)
  • generates scores for probability that query is
    generated from the document model

21
Language Models in IR
  • Views documents as models and considers queries
    as strings of texts randomly sampled from models
  • documents are ranked by probability that a query
    Q would be observed during repeated random
    sampling from the document model MD
  • P(Q|MD)
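
A minimal sketch of query-likelihood ranking: each document is treated as a language model MD, and documents are scored by P(Q|MD). The documents are invented; Jelinek-Mercer (linear) smoothing with the collection model is one common way to avoid zero probabilities, and lambda = 0.5 is an arbitrary illustrative choice:

```python
from collections import Counter

LAMBDA = 0.5   # illustrative smoothing weight

docs = {
    "d1": "information retrieval retrieval models".split(),
    "d2": "information storage systems".split(),
}
collection = [t for d in docs.values() for t in d]
coll_counts, coll_len = Counter(collection), len(collection)

def p_query(query, doc):
    """P(Q|M_D): product over query terms of the smoothed term probability."""
    counts, dlen = Counter(doc), len(doc)
    p = 1.0
    for t in query:
        p_doc = counts[t] / dlen              # maximum-likelihood doc model
        p_coll = coll_counts[t] / coll_len    # collection (background) model
        p *= LAMBDA * p_doc + (1 - LAMBDA) * p_coll
    return p

query = "retrieval models".split()
ranked = sorted(docs, key=lambda d: p_query(query, docs[d]), reverse=True)
print(ranked)
```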

22
Other IR Models
  • Extended Boolean
  • Fuzzy Set
  • Cluster-based Retrieval

23
Cluster-based retrieval
  • Cluster analysis is a technique in multivariate
    analysis used to generate a category structure
    that fits a set of observations
  • A form of automatic classification, but the
    classes are not known prior to processing; they
    are defined by the clustering process

24
Clustering of Text
  • Cluster documents on basis of terms they contain
  • Cluster documents on basis of co-occurring
    citations
  • Cluster terms on basis of documents they occur in

25
Applications in IR
  • Cluster-based retrieval: retrieve documents in
    the same cluster or hierarchy
  • Text categorization: used to categorize documents
  • Term clusters: build a thesaurus of like terms
    for use in query expansion
  • Web IR: use post-hoc clustering to group
    retrieved web pages (e.g. jaguar: car, animal,
    software)
  • Use in text mining to find patterns

26
Practicalities of clustering
  • Determine attributes clusters will be based on
  • Choose a clustering method
  • Choose a similarity function
  • Create the clusters
  • Assess validity of result

27
Similarity Matrix
  • Many similarity measures for text: Dice,
    Jaccard, and cosine are commonly used
  • Calculate the similarity matrix (lower triangle)
  • S21
  • S31 S32
  • S41 S42 S43
  • ...
  • Sn1 Sn2 Sn3 ... Sn(n-1)
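
Building the lower-triangular similarity matrix can be sketched as follows, with invented documents represented as term sets and the Dice coefficient as the similarity measure (Jaccard or cosine could be substituted):

```python
docs = [
    {"information", "retrieval"},
    {"information", "storage"},
    {"storage", "systems"},
    {"retrieval", "models"},
]

def dice(a, b):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))

# S[i] holds the similarities of document i with each earlier document j < i
S = [[dice(docs[i], docs[j]) for j in range(i)] for i in range(len(docs))]
for row in S[1:]:
    print([round(s, 2) for s in row])
```

Only the lower triangle is stored because similarity is symmetric and Sii is trivially 1.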

28
Clustering methods
  • Non-hierarchical
  • Single pass
  • Reallocation
  • Hierarchical
  • Single link
  • Complete link
  • Ward's method

29
Single Pass Method
  • First document becomes representative for first
    cluster
  • For each document Di, calculate Sim with each
    cluster representative
  • If Simmax > threshold SimT, add Di to the
    corresponding cluster; otherwise start a new
    cluster
  • If any document Di remains, return to step 2
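
The steps above can be sketched directly. The documents, the Dice similarity measure, the 0.3 threshold, and the naive union-based representative update are all illustrative choices:

```python
SIM_T = 0.3   # illustrative similarity threshold

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

def single_pass(docs):
    clusters = []          # each cluster: {"rep": set, "members": [ids]}
    for i, d in enumerate(docs):
        if clusters:
            sims = [dice(d, c["rep"]) for c in clusters]
            best = max(range(len(sims)), key=lambda k: sims[k])
            if sims[best] > SIM_T:
                clusters[best]["members"].append(i)
                clusters[best]["rep"] |= d   # naive representative update
                continue
        clusters.append({"rep": set(d), "members": [i]})
    return [c["members"] for c in clusters]

docs = [
    {"information", "retrieval"},
    {"information", "storage"},
    {"cats", "dogs"},
    {"dogs", "pets"},
]
print(single_pass(docs))
```

Because it needs only one pass over the collection it is cheap, but the result depends on document order.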

30
Reallocation Method
  • Select M cluster representatives or centroids
  • For i = 1 to N, assign Di to the most similar
    centroid
  • For j = 1 to M, recalculate cluster centroid Cj
  • Repeat steps 2 and 3 until cluster membership
    stabilizes
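
The reallocation loop can be sketched on 2-D points with Euclidean distance (the points and the M = 2 starting centroids are invented for illustration): assign each item to its nearest centroid, recompute the centroids, and repeat until membership stabilizes:

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def reallocate(points, centroids):
    assignment = None
    while True:
        new = [min(range(len(centroids)),
                   key=lambda j: dist(p, centroids[j])) for p in points]
        if new == assignment:            # membership has stabilized
            return assignment, centroids
        assignment = new
        for j in range(len(centroids)):  # recalculate each centroid Cj
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
assignment, centroids = reallocate(points, [(0, 0), (10, 10)])
print(assignment)
```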

31
HACM (Hierarchical agglomerative clustering
methods)
  • Identify two closest points and combine them in a
    cluster
  • Identify and combine the next two closest points,
    treating existing clusters as points
  • If more than one cluster remains, return to step 1
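
The agglomerative steps above can be sketched on 1-D points, using the single-link definition of "closest" (distance between the closest pair of members); the input points are invented for illustration:

```python
def single_link(a, b):
    """Single-link distance: closest pair of points across two clusters."""
    return min(abs(x - y) for x in a for y in b)

def hacm(points):
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # identify the two closest clusters (clusters treated as points)
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: single_link(clusters[ij[0]],
                                                     clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]   # combine into one cluster
        del clusters[j]
    return merges

print(hacm([1, 2, 9, 10, 30]))
```

The sequence of merges is exactly what a dendrogram records; swapping `single_link` for a least-similar-pair function gives complete link.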

32
HACM
  • Differ based on
  • Definition of most similar pair
  • Cluster representation used
  • High storage and time requirements
  • Output presented as a dendrogram

33
HACM
  • Single link
  • Joins most similar pair of objects not yet in
    same cluster
  • Complete link
  • Uses least similar pair between clusters to
    measure intercluster similarity
  • Ward's method
  • Merges the pair that minimizes the increase in
    within-group error sum of squares; uses Euclidean
    distance; the cluster center is a weighted average

34
How many clusters?
  • HACMs give a complete hierarchy of clusters, from
    1 to N
  • Use a stopping rule to determine when to stop the
    process; this yields the number of clusters

35
Cluster-based Retrieval
  • Nearest neighbour cluster (most similar documents)
  • Rank clusters rather than documents; retrieve
    entire clusters
  • Use the hierarchy structure: top-down or
    bottom-up search