Title: Topics in Information Retrieval (IR)
1Topics in Information Retrieval (IR)
- Luo Weihua
- MITEL, ICT
- 2007-10-12
- In Reading Group
2Outline
- Background
- Evaluation: principle and measure
- IR models
- Query expansion
- Bibliography
3Outline
- Background
- Evaluation: principle and measure
- IR models
- Query expansion
- Bibliography
4Background
Query
IR system
Document collection
Retrieval
Answer list
5Background
- The goal
- find documents relevant to an information need from document repositories
6[Architecture diagram: Documents and the Query each pass through a Representation Function; the Document Representation is stored in an Index; a Comparison Function matches the Query Representation against the Index to produce Hits]
7Background
- History
- 1945, The Memex Machine
- Vannevar Bush, As We May Think
- 1948, Information Retrieval
- C.N. Mooers from MIT
- 1960s-1970s, IR models and evaluation
- Cleverdon, Cranfield Experiments
- Salton from Cornell Univ., SMART system based on the VSM model
- Robertson from London City Univ. and Sparck Jones from Cambridge Univ., probabilistic model
8Background
- History (cont'd)
- 1980s, RDBMSs
- 1986, ANSI SQL released
- 1990s, Search Engine
- 1990, McGill Univ., Archie (ftp search tool)
- 1992, Donna Harman from NIST, TREC
- 1994, Carnegie Mellon Univ., Lycos
- 1995, David Filo and Jerry Yang from Stanford Univ., Yahoo!
- 1998, Larry Page and Sergey Brin from Stanford Univ., Google
- 1998, Ponte and Croft from UMass, IR model based on language model
9Background
- History (cont'd)
- 2000-present, branches of IR
- 2001, Li Yanhong, Baidu Inc.
- TREC Q/A track
- TREC Video track
- CLEF and NTCIR
10Outline
- Background
- Evaluation: principle and measure
- IR models
- Query expansion
- Bibliography
11Evaluation: principle and measure
- What is evaluated in IR?
- Effectiveness
- Precision
- Recall
- Precision of ranked list
- Efficiency
- Time complexity
- Space complexity
- Response time
- Coverage
- Frequency of data update
12Evaluation: principle and measure
- Probability ranking principle (PRP)
- ranking documents in order of decreasing probability of relevance is optimal
- Fits ad-hoc retrieval
- Assumptions
- Documents are independent
- A complex information need can be broken up into a number of queries
- The probability of relevance is only estimated
13Evaluation: principle and measure
- Partition of the document set
- RR: retrieved and relevant
- RN: retrieved, not relevant
- NR: not retrieved, relevant
- NN: not retrieved, not relevant
14Evaluation: principle and measure
- Recall = RR / (RR + NR)
- Precision = RR / (RR + RN)
- F = 1 / (a/P + (1-a)/R)
- Pooling for evaluation of large-scale data
- [Venn diagram: returned documents vs. correct documents; the overlap is RR, returned-only is RN, correct-only is NR]
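In code, the contingency counts give precision, recall and F directly; a minimal sketch (the document IDs are invented for illustration):

```python
def evaluate(retrieved, relevant):
    """Compute precision, recall, and F (alpha = 0.5 gives the usual F1)."""
    retrieved, relevant = set(retrieved), set(relevant)
    rr = len(retrieved & relevant)  # retrieved AND relevant
    precision = rr / len(retrieved) if retrieved else 0.0
    recall = rr / len(relevant) if relevant else 0.0
    alpha = 0.5
    # F = 1 / (a/P + (1-a)/R); defined only when at least one hit exists
    f = 1.0 / (alpha / precision + (1 - alpha) / recall) if rr else 0.0
    return precision, recall, f

# 2 of 4 retrieved docs are relevant; 2 of 3 relevant docs are retrieved
p, r, f = evaluate(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```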
15Evaluation: principle and measure
16Evaluation: principle and measure
- P_at_N
- Precision at a particular cutoff N
- Ignores recall
- Uninterpolated average precision
- Estimate precision at each recall point (each rank where a relevant document appears) and compute the average value
- E.g., with relevant documents at ranks 2, 3, 6, 7 and 8:
- AP = (1/2 + 2/3 + 3/6 + 4/7 + 5/8) / 5 ≈ 0.5726
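The AP computation can be sketched as follows; the ranking is hypothetical, chosen so that the relevant documents fall at ranks 2, 3, 6, 7 and 8 and reproduce the slide's precision points:

```python
def average_precision(ranking, relevant, total_relevant):
    """Uninterpolated AP: average the precision values measured at each
    relevant document's rank, dividing by the total number of relevant docs."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this recall point
    return sum(precisions) / total_relevant

# relevant docs r1..r5 sit at ranks 2, 3, 6, 7, 8
ranking = ["n1", "r1", "r2", "n2", "n3", "r3", "r4", "r5", "n4", "n5"]
ap = average_precision(ranking, ["r1", "r2", "r3", "r4", "r5"], 5)
# ap ≈ 0.5726, matching the slide's example
```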
17Evaluation: principle and measure
- Interpolated average precision
- At the point where recall reaches a, compute precision b
- If precision later goes up, take the highest value of precision anywhere at or beyond the point where recall a was first reached
18- Recall (%) vs. interpolated precision
- 0: 2/3
- 10: 2/3
- 20: 2/3
- 30: 2/3
- 40: 2/3
- 50: 5/8
- 60: 5/8
- 70: 5/8
- 80: 5/8
- 90: 5/8
- 100: 5/8
- Int. AP = 0.6460
Ranking list: d6, d1, d2, d10, d9, d3, d5, d4, d7, d8
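The interpolation rule (take the maximum precision at or beyond each recall level) can be sketched as follows; the (recall, precision) points are invented for illustration and reproduce the 2/3 and 5/8 plateaus of the table above:

```python
def interpolated_precision(pr_points, recall_levels):
    """Interpolated precision at each recall level: the maximum precision
    observed at any recall >= that level (0.0 if no such point exists)."""
    out = []
    for level in recall_levels:
        candidates = [p for r, p in pr_points if r >= level]
        out.append(max(candidates) if candidates else 0.0)
    return out

# hypothetical (recall, precision) points from a ranked list
points = [(0.2, 1/2), (0.4, 2/3), (0.6, 3/6), (0.8, 4/7), (1.0, 5/8)]
levels = [i / 10 for i in range(11)]  # the 11 standard recall levels
interp = interpolated_precision(points, levels)
# levels 0%..40% interpolate to 2/3; levels 50%..100% interpolate to 5/8
```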
19[Plot: uninterpolated vs. interpolated average precision]
20Evaluation: principle and measure
- TREC (the Text REtrieval Conference)
- Established in 1992 to evaluate large-scale IR
- Retrieving documents from a gigabyte collection
- Has run continuously since then
- The TREC 2007 (16th) meeting is in November
- Run by NIST's Information Access Division
- Probably the best-known IR evaluation setting
- Started with 25 participating organizations in the 1992 evaluation
- Proceedings available on-line (http://trec.nist.gov)
21Outline
- Background
- Evaluation: principle and measure
- IR models
- Query expansion
- Bibliography
22IR models
- Set Theoretic models
- Boolean model
- Rough set based model
- Extended boolean model
- Algebraic models
- Vector space model
- Latent semantic indexing model
- Probabilistic models
- Logistic regression model
- Binary independence relevance model
- Statistical language model based model
23IR models
- Boolean model
- Representation of query and documents
- boolean expression
- w1, w2, ..., wn: words in document D
- D = w1 AND w2 AND ... AND wn
- Relevance estimation
- R(D,Q) = 1 if the boolean expression of the query matches the document (D ⊇ Q)
- R(D,Q) = 0 otherwise
24IR models
Query: 2008 AND Beijing AND NOT Olympic
Doc1: "Beijing will take measures to protect environment in 2008." -> match
Doc2: "The 29th Olympic games will be held in Beijing in fall of 2008." -> no match
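The matching above can be sketched with set operations; `matches` is a hypothetical helper, not a real engine:

```python
def matches(doc_terms, required, excluded=()):
    """Boolean AND / AND NOT matching over the set of a document's terms."""
    doc = set(doc_terms)
    return all(t in doc for t in required) and not any(t in doc for t in excluded)

doc1 = "Beijing will take measures to protect environment in 2008".split()
doc2 = "The 29th Olympic games will be held in Beijing in fall of 2008".split()

# Query: 2008 AND Beijing AND NOT Olympic
matches(doc1, ["2008", "Beijing"], ["Olympic"])  # True: no "Olympic" in doc1
matches(doc2, ["2008", "Beijing"], ["Olympic"])  # False: "Olympic" excludes doc2
```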
25IR models
- Boolean model
- Pros
- Simple and straightforward
- Fit for special situation
- Cons
- Unordered list
- Exact match will return empty or huge result sets
- Difficult for users to construct queries
26IR models
- Vector space model
- Representation of query and documents
- Document
- D = <a1, a2, a3, ..., an>
- ai = weight of term ti in D
- Query
- Q = <b1, b2, b3, ..., bn>
- bi = weight of term ti in Q
- Term
- character, word, phrase, n-gram, etc.
- Dimension reduction: stop-word list, stemming, word clustering, etc.
27IR models
- Vector space model
- Relevance estimation (similarity measures)
- Euclidean distance
- Cosine: sim(D,Q) = (D·Q) / (|D| |Q|)
- Dice coefficient
- Jaccard coefficient
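Of these measures, cosine is the most commonly used; a minimal sketch over plain weight lists (the vectors are invented for illustration):

```python
import math

def cosine(d, q):
    """Cosine of the angle between two term-weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(d, q))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

# hypothetical 3-term vocabulary
d1 = [1.0, 0.0, 1.0]
q = [1.0, 0.0, 0.0]
cosine(d1, q)  # 1/sqrt(2) ≈ 0.707: one shared term out of two
```

Because cosine normalizes by vector length, long and short documents compete on equal footing, which is why it is preferred over raw dot products or Euclidean distance.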
28Ranking list: d2 > d1 > d3
29IR models
- Vector space model
- Term weighting
- Goal
- determine the most representative terms (words) for a document (query)
- weight their importance
- General idea
- More frequent terms are more salient
- But specificity (discriminative power) must also be measured
30IR models
- Vector space model
- Term weighting
- Quantities
- tfi,j: frequency of term ti in document dj
- dfi: number of documents containing ti
- cfi: total occurrences of ti in the collection
- Note dfi <= cfi
31IR models
- Vector space model
- Term weighting (IDF schemes)
- tf (term frequency)
- f(tf) = 1 + log(tf)
- df (document frequency)
- f(df) = 1 + log(N/df)
- tf-idf
- f(tf,df) = (1 + log(tf)) · (1 + log(N/df)) if tf > 0
- f(tf,df) = 0 if tf = 0
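A sketch combining the two damped schemes above into one weight, assuming the product form (1 + log tf) · (1 + log(N/df)); the counts in the example are invented:

```python
import math

def tfidf(tf, df, n_docs):
    """tf-idf weight with the damped schemes above:
    (1 + log tf) * (1 + log(N/df)), and 0 when the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + math.log(tf)) * (1 + math.log(n_docs / df))

# term occurs 3 times in the doc and appears in 10 of 1000 docs
tfidf(3, 10, 1000)
```

The log damping keeps a term that occurs 20 times from being treated as 20 times more important than a term that occurs once.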
32IR models
- Vector space model
- Pros
- Conceptual simplicity: spatial proximity stands for semantic proximity
- Partial and fuzzy retrieval
- Good performance
- Cons
- Term-independence assumption is not true
- Cannot handle polysemy (Java the language vs. Java the island), synonymy (computer and PC), etc.
33IR models
- Term distribution model
- Estimate the distribution of a word
- Characterize the importance of a word for IR
- Capture regularities of word occurrence in subunits of a corpus
- Distinguish content words from non-content words
34IR models
- Term distribution model
- Integration into IR
- Replacement for IDF weights
- Potential of accessing a term's properties more accurately
- Better estimation of query-document similarity
35IR models
- Term distribution model
- Poisson distribution
- 2-poisson distribution
- Katz's K mixture
- Residual inverse document frequency
36IR models
- Poisson distribution
- p(k; λi) = e^(-λi) · λi^k / k!
- λi = cfi / N
- Pi(k) = p(k; λi)
37IR models
- Poisson distribution
- Assumptions
- The probability of one occurrence of the term in a piece of text is proportional to the length of the text
- The probability of more than one occurrence of a term in a short piece of text is negligible compared to the probability of one occurrence
- Occurrence events in non-overlapping intervals of text are independent
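The Poisson model above in code; `poisson_pmf` is a direct transcription of p(k; λ) with λ = cf/N (the counts are invented for illustration):

```python
import math

def poisson_pmf(k, lam):
    """p(k; lambda) = e^(-lambda) * lambda^k / k!
    Probability that a document contains exactly k occurrences of the term."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# lambda_i = cf_i / N: e.g. 500 occurrences spread over 10,000 documents
lam = 500 / 10_000
poisson_pmf(0, lam)  # most documents contain no occurrence at all
```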
38IR models
39IR models
- Problems with Poisson model
- Fits non-content words well
- The independence assumption does not hold for content words (burstiness)
- Documents are not a uniform unit, since they differ in size
40IR models
- 2-Poisson model
- Pi(k) = π · p(k; λ1) + (1-π) · p(k; λ2)
- 2 classes of documents associated with a term
- Non-privileged class (rate λ1)
- Privileged class (rate λ2)
41IR models
- 2-Poisson model
- Better fit to the frequency distribution of content words
- Ridiculous prediction
- Pi(2) < Pi(3) or Pi(4)
- In reality, Pi(0) > Pi(1) > Pi(2) > Pi(3)
42IR models
- Katz's K mixture
- Pi(k) = (1-α) · δk,0 + (α/(β+1)) · (β/(β+1))^k
- δk,0 = 1 if k = 0; δk,0 = 0 otherwise
- β = (cf - df)/df, α = λ/β with λ = cf/N
43IR models
44IR models
- Good approximation for non-content words
- Derivation: Pi(k)/Pi(k+1) = c for k >= 1
- Does not hold perfectly for content words
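A sketch of the K mixture, assuming the parameterisation given in Manning and Schütze (listed in the bibliography): β = (cf - df)/df and α = λ/β with λ = cf/N; the counts are invented, and cf > df is required so that β > 0:

```python
def k_mixture(k, cf, df, n_docs):
    """Katz's K mixture:
    P(k) = (1-a)*delta(k,0) + (a/(b+1)) * (b/(b+1))**k
    with b = (cf - df)/df and a = lam/b, lam = cf/N.
    Assumes cf > df (otherwise b = 0 and the geometric part vanishes)."""
    lam = cf / n_docs
    b = (cf - df) / df
    a = lam / b
    delta = 1.0 if k == 0 else 0.0
    return (1 - a) * delta + (a / (b + 1)) * (b / (b + 1)) ** k

# a term with 800 occurrences in 400 of 10,000 documents
probs = [k_mixture(k, cf=800, df=400, n_docs=10_000) for k in range(60)]
# for k >= 1 the probabilities decay geometrically, so P(k)/P(k+1) is constant
```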
45IR models
- Residual inverse document frequency
- RIDF = IDF - log2(1 / (1 - p(0; λi)))
- IDF = log2(N/df)
- A good predictor of the degree to which a word is a content word
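RIDF in code, assuming the Poisson-predicted IDF is -log2(1 - p(0; λ)) with λ = cf/N; the counts are invented for illustration:

```python
import math

def ridf(cf, df, n_docs):
    """Residual IDF: observed IDF minus the IDF a Poisson model predicts.
    A bursty content word occurs in fewer documents than Poisson expects,
    so its observed IDF is higher and RIDF comes out clearly positive."""
    lam = cf / n_docs
    idf = math.log2(n_docs / df)
    expected_idf = -math.log2(1 - math.exp(-lam))  # Poisson-predicted IDF
    return idf - expected_idf

# bursty content word: 800 occurrences packed into only 100 documents
ridf(cf=800, df=100, n_docs=10_000)   # clearly positive
# Poisson-like function word: occurrences spread thinly, RIDF near zero
ridf(cf=100, df=100, n_docs=10_000)
```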
46IR models
- Latent semantic indexing model(LSI)
- Motivation
- Word co-occurrence implies semantic similarity
- LSI projects queries and documents into a space with latent semantic dimensions
- Fewer dimensions -> dimension reduction
- Keeping only the k strongest dimensions removes noise
47IR models
- Latent semantic indexing model
- Create a new representation space (by SVD)
- Linear combination of the original term and document dimensions
- Remove the least-weighted dimensions (noise)
48IR models
- Singular Value Decomposition(SVD)
- Least-squares method
- Given (x1,y1), (x2,y2), ..., (xn,yn)
- Fit F(x) = mx + b
- minimizing SS(m,b) = Σi (yi - (m·xi + b))²
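The least-squares line has a closed-form solution; a minimal sketch for F(x) = mx + b (the sample points are invented):

```python
def least_squares(points):
    """Closed-form fit of F(x) = m*x + b minimizing the sum of squared
    residuals SS(m, b) = sum_i (y_i - (m*x_i + b))**2."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - m * sx) / n
    return m, b

m, b = least_squares([(0, 1), (1, 3), (2, 5)])  # points on the line y = 2x + 1
```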
50IR models
- Singular Value Decomposition(SVD)
- Given a document-by-term matrix A
- Compute the latent semantic matrix via SVD
51Singular Value Decomposition
- A = U S V^T
- U U^T = I, V V^T = I
- S: diagonal matrix of singular values
52Truncated-SVD
- Ak = Uk Sk Vk^T
- the best rank-k approximation of A (in the least-squares sense)
53IR models
54IR models
55IR models
56IR models
- Latent semantic indexing model
- Pros and Cons
- A clean formal framework
- Computational cost (SVD)
- Setting of k is critical (empirically 300-1000)
- Effective on small collections (e.g. CACM), but variable on large collections (TREC)
57IR models
- Binary independence relevance model
- For the same query, rank documents by P(R=1|D)
- R: relevance
- D: document
- Q: query
- By Bayes' rule, P(R=1|D) = P(D|R=1) P(R=1) / P(D)
58IR models
- Binary independence relevance model
- Ranking function
- rank by the odds O(R|D,Q) = P(R=1|D,Q) / P(R=0|D,Q)
- in practice the log-odds are used, since log is monotonic
59IR models
- Binary independence relevance model
- Ranking function
- Assume D = (x1, x2, ..., xn), xi ∈ {0, 1}
- RSV(D,Q) = Σ over query terms present in D of log [ pi (1-qi) / (qi (1-pi)) ]
- pi = P(xi=1 | R=1), qi = P(xi=1 | R=0)
60IR models
- Binary independence relevance model
- Query: China, economic, rapid
- D: economic, surprising
61IR models
- Binary independence relevance model
- How to estimate pi and qi?
- From a set of N judged samples (R of them relevant): ni = docs containing ti, ri = relevant docs containing ti
- pi = ri / R
- qi = (ni - ri) / (N - R)
62IR models
- Binary independence relevance model
- Smoothing (Robertson-Sparck Jones formula)
- When no sample is available:
- pi = 0.5
- qi = (ni + 0.5) / (N + 0.5) ≈ ni / N
63OKAPI
- Binary independence relevance model + frequency and length heuristics
- score(D,Q) = Σ over t∈Q of w_t · ((k1+1)·tf_t / (K + tf_t)) · ((k3+1)·qtf_t / (k3 + qtf_t))
- K = k1·((1-b) + b·dl/avdl): document-length normalization
- the tf_t and qtf_t ratios: TF factors
- w_t: the Robertson-Sparck Jones term weight above
- k1, k2, k3, b: parameters
- tf_t, qtf_t: doc./query term frequency
- dl: document length
- avdl: average document length
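A sketch of an Okapi-style score, assuming the common BM25 form; the constants k1 = 1.2, b = 0.75, k3 = 7 are conventional defaults, not from the slide, the k2 query-length correction term is omitted, and the document/df counts are invented:

```python
import math

def bm25_score(query, doc, df, n_docs, avdl, k1=1.2, b=0.75, k3=7.0):
    """Okapi BM25-style score: RSJ-smoothed idf weight times a
    length-normalized TF factor times a query-TF factor."""
    dl = len(doc)
    score = 0.0
    for term in set(query):
        tf = doc.count(term)
        if tf == 0 or term not in df:
            continue  # term contributes nothing
        qtf = query.count(term)
        # RSJ-style idf with the 0.5 smoothing from the previous slide
        idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5))
        K = k1 * ((1 - b) + b * dl / avdl)  # document-length normalization
        score += idf * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score

doc = "beijing olympic games held beijing 2008".split()
df = {"beijing": 50, "2008": 200}  # hypothetical document frequencies
bm25_score(["beijing", "2008"], doc, df, n_docs=10_000, avdl=6.0)
```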
64IR models
- Binary independence relevance model
- Pros and cons
- Theoretically well founded
- Difficult to implement without simplification
- Effectiveness often depends on heuristics (e.g. Okapi)
- Term independence assumed
65IR models
- Statistical language model based model
- Motivation
- create a statistical model with which one can calculate the probability of a sentence s = w1, w2, ..., wn in a language
66IR models
- Statistical language model based model
- For d = w1 w2 ... wn, build a document model Md
- Relevance: score by the query likelihood P(Q|Md) = Πi P(qi|Md)
67IR models
- Statistical language model based model
- General approach
- [Diagram: training data -> probabilities of the observed elements; a new string s is scored by the model as P(s)]
68IR models
- Statistical language model based model
- Maximum likelihood estimation: P(w|Md) = tf(w,d) / |d|
69IR models
- Statistical language model based model
- Smoothing
- If a query word does not appear in the document, P(Q|MD) = 0
- General form
- P(w|MD) = Ps(w|MD) if w is seen in D; αD · P(w|C) otherwise
- αD: normalization coefficient; C: the collection
70Some smoothing methods used in IR
- Jelinek-Mercer interpolation: P(w|d) = (1-λ) · Pml(w|d) + λ · P(w|C)
- Dirichlet prior: P(w|d) = (tf(w,d) + μ·P(w|C)) / (|d| + μ)
- Absolute discounting
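Query-likelihood scoring with Jelinek-Mercer smoothing, assuming the interpolation P(w|d) = (1-λ)·Pml(w|Md) + λ·P(w|C); the toy document and collection are invented:

```python
import math

def jm_score(query, doc, collection, lam=0.5):
    """Log query likelihood with Jelinek-Mercer smoothing: each query word
    mixes the document's ML estimate with the collection model, so a word
    absent from the document no longer zeroes out the whole score."""
    score = 0.0
    for w in query:
        p_doc = doc.count(w) / len(doc)                  # P_ml(w | M_d)
        p_col = collection.count(w) / len(collection)    # P(w | C)
        score += math.log((1 - lam) * p_doc + lam * p_col)
    return score

collection = "the olympic games will be held in beijing in 2008 the games".split()
doc = "olympic games beijing".split()
# "2008" is absent from doc, yet the smoothed score stays finite
jm_score(["2008"], doc, collection)
```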
71IR models
- Statistical language model based model
- Final ranking score: score(Q,D) = Σ over w∈Q of log P(w|MD)
72IR models
- Statistical language model based model
- Pros and cons
- Theoretical formalization
- Term independence is not required
- Data sparseness
- Parameter estimation
73Outline
- Background
- Evaluation: principle and measure
- IR models
- Query expansion
- Bibliography
74Query expansion
- Motivation
- A user knows his information need, but cannot construct a good query for it
- A user is not clear about his information need, and wishes to refine it through initial queries
75Query expansion
- Relevance feedback (RF)
- User RF
- Users judge documents in the returned lists manually
- Pseudo RF
- Take the top N documents of a returned list as relevant answers
76Query expansion
- Query reformulation
- Thesaurus
- WordNet
- HowNet
- Co-occurrences
- Relevance feedback
77Query expansion
- Query reformulation for VSM
- Rocchio formula: Q' = α·Q + (β/|Dr|)·Σ over d∈Dr of d - (γ/|Dnr|)·Σ over d∈Dnr of d
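A sketch of the Rocchio update over term-weight vectors; the weights α = 1, β = 0.75, γ = 0.15 are conventional choices, not from the slide, and the vectors are invented:

```python
def rocchio(query, relevant_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio reformulation: move the query vector toward the centroid of
    relevant documents and away from the centroid of non-relevant ones.
    All vectors are equal-length lists of term weights."""
    n = len(query)

    def centroid(docs):
        if not docs:
            return [0.0] * n
        return [sum(d[i] for d in docs) / len(docs) for i in range(n)]

    cr, cn = centroid(relevant_docs), centroid(nonrel_docs)
    # negative weights are conventionally clipped to zero
    return [max(0.0, alpha * query[i] + beta * cr[i] - gamma * cn[i])
            for i in range(n)]

# 3-term vocabulary: term 2 appears only in the relevant doc,
# term 3 only in the non-relevant doc
q2 = rocchio([1.0, 0.0, 0.0], [[0.0, 1.0, 0.0]], [[0.0, 0.0, 1.0]])
```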
78Query expansion
- Query expansion for language model
- Bi-grams
- Bi-terms
- like bi-grams, but word order is ignored
- (analysis, data) = (data, analysis)
- Term dependency
- Determine the strongest dependencies statistically
- e.g. "parallel computer architecture"
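Bi-terms can be represented as unordered pairs of adjacent words, e.g. with frozensets:

```python
def biterms(tokens):
    """Unordered adjacent word pairs: (analysis, data) == (data, analysis)."""
    return {frozenset(pair) for pair in zip(tokens, tokens[1:])}

# the two orderings produce the same bi-term set
biterms("data analysis".split()) == biterms("analysis data".split())  # True
```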
79Bibliography
- Christopher Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. 1999
- Jian-Yun Nie. IR Models and Some Recent Trends. 2006
- Wang Bin. Teaching materials of Modern Information Retrieval. 2007
- Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. 1999
80Thanks!