Title: Topics in Information Retrieval (IR)
1Topics in Information Retrieval (IR)
- Luo Weihua
- MITEL, ICT
- 2007-10-12
- In Reading Group
2Outline
- Background
- Evaluation: principle and measure
- IR models
- Query expansion
- Bibliography
3Outline
- Background
- Evaluation: principle and measure
- IR models
- Query expansion
- Bibliography
4Background
Query
IR system
Document collection
Retrieval
Answer list
5Background
- The goal
- find documents relevant to an information need from document repositories
6[Architecture diagram: Documents and the Query each pass through a Representation Function; the Document Representation is stored in an Index; a Comparison Function matches the Query Representation against the Index to produce Hits]
7Background
- History
- 1945, The Memex Machine
- Vannevar Bush, As We May Think
- 1948, Information Retrieval
- C.N. Mooers from MIT
- 1960s-1970s, IR models and evaluation
- Cleverdon, Cranfield Experiments
- Salton from Cornell Univ., SMART system based on the VSM model
- Robertson from London City Univ. and Sparck Jones from Cambridge Univ., probabilistic model
8Background
- History (cont'd)
- 1980s, RDBMSs
- 1986, ANSI SQL released
- 1990s, Search Engine
- 1990, McGill Univ., Archie (ftp search tool)
- 1992, Donna Harman from NIST, TREC
- 1994, Carnegie Mellon Univ., Lycos
- 1995, David Filo and Jerry Yang from Stanford Univ., Yahoo!
- 1998, Larry Page and Sergey Brin from Stanford Univ., Google
- 1998, Ponte and Croft from UMass, IR model based on language model
9Background
- History (cont'd)
- 2000-present, branches of IR
- 2001, Li Yanhong, Baidu Inc.
- TREC Q/A track
- TREC Video track
- CLEF and NTCIR
10Outline
- Background
- Evaluation: principle and measure
- IR models
- Query expansion
- Bibliography
11Evaluation: principle and measure
- What is evaluated in IR?
- Effectiveness
- Precision
- Recall
- Precision of ranked list
- Efficiency
- Time complexity
- Space complexity
- Response time
- Coverage
- Frequency of data update
12Evaluation: principle and measure
- Probability ranking principle (PRP)
- ranking documents in order of decreasing probability of relevance is optimal
- Fits ad-hoc retrieval
- Assumptions
- Documents are independent
- A complex information need can be broken up into a number of queries
- The probability of relevance is only estimated
13Evaluation: principle and measure
- Partition of the document set
- RR: retrieved and relevant
- RN: retrieved, not relevant
- NR: not retrieved, relevant
- NN: not retrieved, not relevant
14Evaluation: principle and measure
- Recall = RR / (RR + NR)
- Precision = RR / (RR + RN)
- F = 1 / (a/P + (1-a)/R)
- Pooling for evaluation of large-scale data
- [Venn diagram: returned documents vs. correct documents; the overlap is RR, returned-only is RN, correct-only is NR]
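In code, the contingency counts give precision, recall and F directly; a minimal sketch (the document IDs are invented for illustration):

```python
def evaluate(retrieved, relevant):
    """Compute precision, recall, and F (alpha = 0.5 gives the usual F1)."""
    retrieved, relevant = set(retrieved), set(relevant)
    rr = len(retrieved & relevant)  # retrieved AND relevant
    precision = rr / len(retrieved) if retrieved else 0.0
    recall = rr / len(relevant) if relevant else 0.0
    alpha = 0.5
    # F = 1 / (a/P + (1-a)/R); defined only when at least one hit exists
    f = 1.0 / (alpha / precision + (1 - alpha) / recall) if rr else 0.0
    return precision, recall, f

# 2 of 4 retrieved docs are relevant; 2 of 3 relevant docs are retrieved
p, r, f = evaluate(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```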
15Evaluation: principle and measure
16Evaluation: principle and measure
- P_at_N
- Precision at a particular cutoff N
- Ignores recall
- Uninterpolated average precision
- Estimate precision at each recall point (each rank where a relevant document appears) and compute the average value
- E.g., with relevant documents at ranks 2, 3, 6, 7 and 8:
- AP = (1/2 + 2/3 + 3/6 + 4/7 + 5/8) / 5 ≈ 0.5726
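The AP computation can be sketched as follows; the ranking is hypothetical, chosen so that the relevant documents fall at ranks 2, 3, 6, 7 and 8 and reproduce the slide's precision points:

```python
def average_precision(ranking, relevant, total_relevant):
    """Uninterpolated AP: average the precision values measured at each
    relevant document's rank, dividing by the total number of relevant docs."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this recall point
    return sum(precisions) / total_relevant

# relevant docs r1..r5 sit at ranks 2, 3, 6, 7, 8
ranking = ["n1", "r1", "r2", "n2", "n3", "r3", "r4", "r5", "n4", "n5"]
ap = average_precision(ranking, ["r1", "r2", "r3", "r4", "r5"], 5)
# ap ≈ 0.5726, matching the slide's example
```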
17Evaluation: principle and measure
- Interpolated average precision
- At the point where recall reaches a, compute precision b
- If precision later goes up, take the highest value of precision anywhere at or beyond the point where recall a was first reached
18- Recall (%) vs. interpolated precision
- 0: 2/3
- 10: 2/3
- 20: 2/3
- 30: 2/3
- 40: 2/3
- 50: 5/8
- 60: 5/8
- 70: 5/8
- 80: 5/8
- 90: 5/8
- 100: 5/8
- Int. AP = 0.6460
Ranking list: d6, d1, d2, d10, d9, d3, d5, d4, d7, d8
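The interpolation rule (take the maximum precision at or beyond each recall level) can be sketched as follows; the (recall, precision) points are invented for illustration and reproduce the 2/3 and 5/8 plateaus of the table above:

```python
def interpolated_precision(pr_points, recall_levels):
    """Interpolated precision at each recall level: the maximum precision
    observed at any recall >= that level (0.0 if no such point exists)."""
    out = []
    for level in recall_levels:
        candidates = [p for r, p in pr_points if r >= level]
        out.append(max(candidates) if candidates else 0.0)
    return out

# hypothetical (recall, precision) points from a ranked list
points = [(0.2, 1/2), (0.4, 2/3), (0.6, 3/6), (0.8, 4/7), (1.0, 5/8)]
levels = [i / 10 for i in range(11)]  # the 11 standard recall levels
interp = interpolated_precision(points, levels)
# levels 0%..40% interpolate to 2/3; levels 50%..100% interpolate to 5/8
```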
19[Plot: uninterpolated vs. interpolated average precision]
20Evaluation: principle and measure
- TREC (the Text REtrieval Conference)
- Established in 1992 to evaluate large-scale IR
- Retrieving documents from a gigabyte collection
- Has run continuously since then
- The TREC 2007 (16th) meeting is in November
- Run by NIST's Information Access Division
- Probably the best-known IR evaluation setting
- Started with 25 participating organizations in the 1992 evaluation
- Proceedings available on-line (http://trec.nist.gov)
21Outline
- Background
- Evaluation: principle and measure
- IR models
- Query expansion
- Bibliography
22IR models
- Set Theoretic models
- Boolean model
- Rough set based model
- Extended boolean model
- Algebraic models
- Vector space model
- Latent semantic indexing model
- Probabilistic models
- Logistic regression model
- Binary independence relevance model
- Statistical language model based model
23IR models
- Boolean model
- Representation of query and documents
- boolean expression
- w1, w2, ..., wn: words in document D
- D = w1 AND w2 AND ... AND wn
- Relevance estimation
- R(D,Q) = 1 if the boolean expression of the query matches the document (D ⊇ Q)
- R(D,Q) = 0 otherwise
24IR models
Query: 2008 AND Beijing AND NOT Olympic
Doc1: "Beijing will take measures to protect environment in 2008." -> match
Doc2: "The 29th Olympic games will be held in Beijing in fall of 2008." -> no match
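The matching above can be sketched with set operations; `matches` is a hypothetical helper, not a real engine:

```python
def matches(doc_terms, required, excluded=()):
    """Boolean AND / AND NOT matching over the set of a document's terms."""
    doc = set(doc_terms)
    return all(t in doc for t in required) and not any(t in doc for t in excluded)

doc1 = "Beijing will take measures to protect environment in 2008".split()
doc2 = "The 29th Olympic games will be held in Beijing in fall of 2008".split()

# Query: 2008 AND Beijing AND NOT Olympic
matches(doc1, ["2008", "Beijing"], ["Olympic"])  # True: no "Olympic" in doc1
matches(doc2, ["2008", "Beijing"], ["Olympic"])  # False: "Olympic" excludes doc2
```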
25IR models
- Boolean model
- Pros
- Simple and straightforward
- Fit for special situation
- Cons
- Unordered list
- Exact match will return empty or huge result sets
- Difficult for users to construct queries
26IR models
- Vector space model
- Representation of query and documents
- Document
- D = <a1, a2, a3, ..., an>
- ai = weight of term ti in D
- Query
- Q = <b1, b2, b3, ..., bn>
- bi = weight of term ti in Q
- Term
- character, word, phrase, n-gram, etc.
- Dimension reduction: stop-word list, stemming, word clustering, etc.
27IR models
- Vector space model
- Relevance estimation (similarity measures)
- Euclidean distance
- Cosine: sim(D,Q) = (D·Q) / (|D| |Q|)
- Dice coefficient
- Jaccard coefficient
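Of these measures, cosine is the most commonly used; a minimal sketch over plain weight lists (the vectors are invented for illustration):

```python
import math

def cosine(d, q):
    """Cosine of the angle between two term-weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(d, q))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

# hypothetical 3-term vocabulary
d1 = [1.0, 0.0, 1.0]
q = [1.0, 0.0, 0.0]
cosine(d1, q)  # 1/sqrt(2) ≈ 0.707: one shared term out of two
```

Because cosine normalizes by vector length, long and short documents compete on equal footing, which is why it is preferred over raw dot products or Euclidean distance.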
28Ranking list: d2 > d1 > d3
29IR models
- Vector space model
- Term weighting
- Goal
- determine the most representative terms (words) for a document (query)
- weight their importance
- General idea
- More frequent terms are more salient
- But specificity (discriminative power) must also be measured
30IR models
- Vector space model
- Term weighting
- Quantities
- tfi,j: frequency of term ti in document dj
- dfi: number of documents containing ti
- cfi: total occurrences of ti in the collection
- Note dfi <= cfi
31IR models
- Vector space model
- Term weighting (IDF schemes)
- tf (term frequency)
- f(tf) = 1 + log(tf)
- df (document frequency)
- f(df) = 1 + log(N/df)
- tf-idf
- f(tf,df) = (1 + log(tf)) · (1 + log(N/df)) if tf > 0
- f(tf,df) = 0 if tf = 0
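A sketch combining the two damped schemes above into one weight, assuming the product form (1 + log tf) · (1 + log(N/df)); the counts in the example are invented:

```python
import math

def tfidf(tf, df, n_docs):
    """tf-idf weight with the damped schemes above:
    (1 + log tf) * (1 + log(N/df)), and 0 when the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + math.log(tf)) * (1 + math.log(n_docs / df))

# term occurs 3 times in the doc and appears in 10 of 1000 docs
tfidf(3, 10, 1000)
```

The log damping keeps a term that occurs 20 times from being treated as 20 times more important than a term that occurs once.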
32IR models
- Vector space model
- Pros
- Conceptual simplicity: spatial proximity stands for semantic proximity
- Partial and fuzzy retrieval
- Good performance
- Cons
- Term-independence assumption is not true
- Cannot handle polysemy (Java the language vs. Java the island), synonymy (computer and PC), etc.
33IR models
- Term distribution model
- Estimate the distribution of a word
- Characterize the importance of a word for IR
- Capture regularities of word occurrence in subunits of a corpus
- Distinguish content words from non-content words
34IR models
- Term distribution model
- Integration into IR
- Replacement for IDF weights
- Potential of accessing a term's properties more accurately
- Better estimation of query-document similarity
35IR models
- Term distribution model
- Poisson distribution
- 2-poisson distribution
- Katz's K mixture
- Residual inverse document frequency
36IR models
- Poisson distribution
- p(k; λi) = e^(-λi) · λi^k / k!
- λi = cfi / N
- Pi(k) = p(k; λi)
37IR models
- Poisson distribution
- Assumptions
- The probability of one occurrence of the term in a piece of text is proportional to the length of the text
- The probability of more than one occurrence of a term in a short piece of text is negligible compared to the probability of one occurrence
- Occurrence events in non-overlapping intervals of text are independent
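The Poisson model above in code; `poisson_pmf` is a direct transcription of p(k; λ) with λ = cf/N (the counts are invented for illustration):

```python
import math

def poisson_pmf(k, lam):
    """p(k; lambda) = e^(-lambda) * lambda^k / k!
    Probability that a document contains exactly k occurrences of the term."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# lambda_i = cf_i / N: e.g. 500 occurrences spread over 10,000 documents
lam = 500 / 10_000
poisson_pmf(0, lam)  # most documents contain no occurrence at all
```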
38IR models
39IR models
- Problems with Poisson model
- Fits non-content words well
- The independence assumption does not hold for content words (burstiness)
- Documents are not a uniform unit, since they differ in size
40IR models
- 2-Poisson model
- Pi(k) = π · p(k; λ1) + (1-π) · p(k; λ2)
- 2 classes of documents associated with a term
- Non-privileged class (rate λ1)
- Privileged class (rate λ2)
41IR models
- 2-Poisson model
- Better fit to the frequency distribution of content words
- Ridiculous prediction
- Pi(2) < Pi(3) or Pi(4)
- In reality, Pi(0) > Pi(1) > Pi(2) > Pi(3)
42IR models
- Katz's K mixture
- Pi(k) = (1-α) · δk,0 + (α/(β+1)) · (β/(β+1))^k
- δk,0 = 1 if k = 0; δk,0 = 0 otherwise
- β = (cf - df)/df, α = λ/β with λ = cf/N
43IR models
44IR models
- Good approximation for non-content words
- Derivation: Pi(k)/Pi(k+1) = c for k >= 1
- Does not hold perfectly for content words
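A sketch of the K mixture, assuming the parameterisation given in Manning and Schütze (listed in the bibliography): β = (cf - df)/df and α = λ/β with λ = cf/N; the counts are invented, and cf > df is required so that β > 0:

```python
def k_mixture(k, cf, df, n_docs):
    """Katz's K mixture:
    P(k) = (1-a)*delta(k,0) + (a/(b+1)) * (b/(b+1))**k
    with b = (cf - df)/df and a = lam/b, lam = cf/N.
    Assumes cf > df (otherwise b = 0 and the geometric part vanishes)."""
    lam = cf / n_docs
    b = (cf - df) / df
    a = lam / b
    delta = 1.0 if k == 0 else 0.0
    return (1 - a) * delta + (a / (b + 1)) * (b / (b + 1)) ** k

# a term with 800 occurrences in 400 of 10,000 documents
probs = [k_mixture(k, cf=800, df=400, n_docs=10_000) for k in range(60)]
# for k >= 1 the probabilities decay geometrically, so P(k)/P(k+1) is constant
```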
45IR models
- Residual inverse document frequency
- RIDF = IDF - log2(1 / (1 - p(0; λi)))
- IDF = log2(N/df)
- A good predictor of the degree to which a word is a content word
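RIDF in code, assuming the Poisson-predicted IDF is -log2(1 - p(0; λ)) with λ = cf/N; the counts are invented for illustration:

```python
import math

def ridf(cf, df, n_docs):
    """Residual IDF: observed IDF minus the IDF a Poisson model predicts.
    A bursty content word occurs in fewer documents than Poisson expects,
    so its observed IDF is higher and RIDF comes out clearly positive."""
    lam = cf / n_docs
    idf = math.log2(n_docs / df)
    expected_idf = -math.log2(1 - math.exp(-lam))  # Poisson-predicted IDF
    return idf - expected_idf

# bursty content word: 800 occurrences packed into only 100 documents
ridf(cf=800, df=100, n_docs=10_000)   # clearly positive
# Poisson-like function word: occurrences spread thinly, RIDF near zero
ridf(cf=100, df=100, n_docs=10_000)
```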
46IR models
- Latent semantic indexing model(LSI)
- Motivation
- Word co-occurrence implies semantic similarity
- LSI projects queries and documents into a space with latent semantic dimensions
- Fewer dimensions -> dimension reduction
- Keeping only the k strongest dimensions removes noise
47IR models
- Latent semantic indexing model
- Create a new representation space (by SVD)
- Linear combination of the original term and document dimensions
- Remove the least-weighted dimensions (noise)
48IR models
- Singular Value Decomposition(SVD)
- Least-squares method
- Given (x1,y1), (x2,y2), ..., (xn,yn)
- Fit F(x) = mx + b
- minimizing SS(m,b) = Σi (yi - (m·xi + b))²
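The least-squares line has a closed-form solution; a minimal sketch for F(x) = mx + b (the sample points are invented):

```python
def least_squares(points):
    """Closed-form fit of F(x) = m*x + b minimizing the sum of squared
    residuals SS(m, b) = sum_i (y_i - (m*x_i + b))**2."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - m * sx) / n
    return m, b

m, b = least_squares([(0, 1), (1, 3), (2, 5)])  # points on the line y = 2x + 1
```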
50IR models
- Singular Value Decomposition(SVD)
- Given a document-by-term matrix A
- Compute the latent semantic matrix via SVD
51Singular Value Decomposition
- A = U S V^T
- U U^T = I, V V^T = I
- S: diagonal matrix of singular values
52Truncated-SVD
- Ak = Uk Sk Vk^T
- the best rank-k approximation of A (in the least-squares sense)
53IR models
54IR models
55IR models
56IR models
- Latent semantic indexing model
- Pros and Cons
- A clean formal framework
- Computational cost (SVD)
- Setting of k is critical (empirically 300-1000)
- Effective on small collections (e.g. CACM), but variable on large collections (TREC)
57IR models
- Binary independence relevance model
- For the same query, rank documents by P(R=1|D)
- R: relevance
- D: document
- Q: query
- By Bayes' rule, P(R=1|D) = P(D|R=1) P(R=1) / P(D)
58IR models
- Binary independence relevance model
- Ranking function
- rank by the odds O(R|D,Q) = P(R=1|D,Q) / P(R=0|D,Q)
- in practice the log-odds are used, since log is monotonic
59IR models
- Binary independence relevance model
- Ranking function
- Assume D = (x1, x2, ..., xn), xi ∈ {0, 1}
- RSV(D,Q) = Σ over query terms present in D of log [ pi (1-qi) / (qi (1-pi)) ]
- pi = P(xi=1 | R=1), qi = P(xi=1 | R=0)
60IR models
- Binary independence relevance model
- Query: China, economic, rapid
- D: economic, surprising
61IR models
- Binary independence relevance model
- How to estimate pi and qi?
- From a set of N judged samples (R of them relevant): ni = docs containing ti, ri = relevant docs containing ti
- pi = ri / R
- qi = (ni - ri) / (N - R)
62IR models
- Binary independence relevance model
- Smoothing (Robertson-Sparck Jones formula)
- When no sample is available:
- pi = 0.5
- qi = (ni + 0.5) / (N + 0.5) ≈ ni / N
63OKAPI
- Binary independence relevance model + frequency and length heuristics
- score(D,Q) = Σ over t∈Q of w_t · ((k1+1)·tf_t / (K + tf_t)) · ((k3+1)·qtf_t / (k3 + qtf_t))
- K = k1·((1-b) + b·dl/avdl): document-length normalization
- the tf_t and qtf_t ratios: TF factors
- w_t: the Robertson-Sparck Jones term weight above
- k1, k2, k3, b: parameters
- tf_t, qtf_t: doc./query term frequency
- dl: document length
- avdl: average document length
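A sketch of an Okapi-style score, assuming the common BM25 form; the constants k1 = 1.2, b = 0.75, k3 = 7 are conventional defaults, not from the slide, the k2 query-length correction term is omitted, and the document/df counts are invented:

```python
import math

def bm25_score(query, doc, df, n_docs, avdl, k1=1.2, b=0.75, k3=7.0):
    """Okapi BM25-style score: RSJ-smoothed idf weight times a
    length-normalized TF factor times a query-TF factor."""
    dl = len(doc)
    score = 0.0
    for term in set(query):
        tf = doc.count(term)
        if tf == 0 or term not in df:
            continue  # term contributes nothing
        qtf = query.count(term)
        # RSJ-style idf with the 0.5 smoothing from the previous slide
        idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5))
        K = k1 * ((1 - b) + b * dl / avdl)  # document-length normalization
        score += idf * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score

doc = "beijing olympic games held beijing 2008".split()
df = {"beijing": 50, "2008": 200}  # hypothetical document frequencies
bm25_score(["beijing", "2008"], doc, df, n_docs=10_000, avdl=6.0)
```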
64IR models
- Binary independence relevance model
- Pros and cons
- Theoretically well founded
- Difficult to implement without simplification
- Effectiveness often depends on heuristics (e.g. Okapi)
- Term independence assumed
65IR models
- Statistical language model based model
- Motivation
- create a statistical model with which one can calculate the probability of a sentence s = w1, w2, ..., wn in a language
66IR models
- Statistical language model based model
- For d = w1 w2 ... wn, build a document model Md
- Relevance: score by the query likelihood P(Q|Md) = Πi P(qi|Md)
67IR models
- Statistical language model based model
- General approach
- [Diagram: training data -> probabilities of the observed elements; a new string s is scored by the model as P(s)]
68IR models
- Statistical language model based model
- Maximum likelihood estimation: P(w|Md) = tf(w,d) / |d|
69IR models
- Statistical language model based model
- Smoothing
- If a query word does not appear in the document, P(Q|MD) = 0
- General form
- P(w|MD) = Ps(w|MD) if w is seen in D; αD · P(w|C) otherwise
- αD: normalization coefficient; C: the collection
70Some smoothing methods used in IR
- Jelinek-Mercer interpolation: P(w|d) = (1-λ) · Pml(w|d) + λ · P(w|C)
- Dirichlet prior: P(w|d) = (tf(w,d) + μ·P(w|C)) / (|d| + μ)
- Absolute discounting
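Query-likelihood scoring with Jelinek-Mercer smoothing, assuming the interpolation P(w|d) = (1-λ)·Pml(w|Md) + λ·P(w|C); the toy document and collection are invented:

```python
import math

def jm_score(query, doc, collection, lam=0.5):
    """Log query likelihood with Jelinek-Mercer smoothing: each query word
    mixes the document's ML estimate with the collection model, so a word
    absent from the document no longer zeroes out the whole score."""
    score = 0.0
    for w in query:
        p_doc = doc.count(w) / len(doc)                  # P_ml(w | M_d)
        p_col = collection.count(w) / len(collection)    # P(w | C)
        score += math.log((1 - lam) * p_doc + lam * p_col)
    return score

collection = "the olympic games will be held in beijing in 2008 the games".split()
doc = "olympic games beijing".split()
# "2008" is absent from doc, yet the smoothed score stays finite
jm_score(["2008"], doc, collection)
```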
71IR models
- Statistical language model based model
- Final ranking score: score(Q,D) = Σ over w∈Q of log P(w|MD)
72IR models
- Statistical language model based model
- Pros and cons
- Theoretical formalization
- Term independence is not required
- Data sparseness
- Parameter estimation
73Outline
- Background
- Evaluation: principle and measure
- IR models
- Query expansion
- Bibliography
74Query expansion
- Motivation
- A user knows his information need, but cannot construct a good query for it
- A user is not clear about his information need, and wishes to refine it through initial queries
75Query expansion
- Relevance feedback (RF)
- User RF
- Users judge documents in the returned lists manually
- Pseudo RF
- Take the top N documents of a returned list as relevant answers
76Query expansion
- Query reformulation
- Thesaurus
- WordNet
- HowNet
- Co-occurrences
- Relevance feedback
77Query expansion
- Query reformulation for VSM
- Rocchio formula: Q' = α·Q + (β/|Dr|)·Σ over d∈Dr of d - (γ/|Dnr|)·Σ over d∈Dnr of d
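A sketch of the Rocchio update over term-weight vectors; the weights α = 1, β = 0.75, γ = 0.15 are conventional choices, not from the slide, and the vectors are invented:

```python
def rocchio(query, relevant_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio reformulation: move the query vector toward the centroid of
    relevant documents and away from the centroid of non-relevant ones.
    All vectors are equal-length lists of term weights."""
    n = len(query)

    def centroid(docs):
        if not docs:
            return [0.0] * n
        return [sum(d[i] for d in docs) / len(docs) for i in range(n)]

    cr, cn = centroid(relevant_docs), centroid(nonrel_docs)
    # negative weights are conventionally clipped to zero
    return [max(0.0, alpha * query[i] + beta * cr[i] - gamma * cn[i])
            for i in range(n)]

# 3-term vocabulary: term 2 appears only in the relevant doc,
# term 3 only in the non-relevant doc
q2 = rocchio([1.0, 0.0, 0.0], [[0.0, 1.0, 0.0]], [[0.0, 0.0, 1.0]])
```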
78Query expansion
- Query expansion for language model
- Bi-grams
- Bi-terms
- like bi-grams, but word order is ignored
- (analysis, data) = (data, analysis)
- Term dependency
- Determine the strongest dependencies statistically
- e.g. "parallel computer architecture"
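Bi-terms can be represented as unordered pairs of adjacent words, e.g. with frozensets:

```python
def biterms(tokens):
    """Unordered adjacent word pairs: (analysis, data) == (data, analysis)."""
    return {frozenset(pair) for pair in zip(tokens, tokens[1:])}

# the two orderings produce the same bi-term set
biterms("data analysis".split()) == biterms("analysis data".split())  # True
```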
79Bibliography
- Christopher Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. 1999
- Jian-Yun Nie. IR Models and Some Recent Trends. 2006
- Wang Bin. Teaching materials of Modern Information Retrieval. 2007
- Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. 1999
80Thanks!