1
Topics in Information Retrieval (IR)
  • Luo Weihua
  • MITEL, ICT
  • 2007-10-12
  • In Reading Group

2
Outline
  • Background
  • Evaluation: principle & measure
  • IR models
  • Query expansion
  • Bibliography

3
Outline
  • Background
  • Evaluation: principle & measure
  • IR models
  • Query expansion
  • Bibliography

4
Background
(Figure: a query is submitted to the IR system, which
retrieves from the document collection and returns an
answer list.)
5
Background
  • The Goal
  • find documents relevant to an information need
    from document repositories

6
(Figure: a representation function maps the query to a
query representation, and another maps the documents,
via an index, to a document representation; a comparison
function matches the two and produces hits.)
7
Background
  • History
  • 1945, The Memex Machine
  • Vannevar Bush, As We May Think
  • 1948, Information Retrieval
  • C.N. Mooers from MIT
  • 1960s-1970s, IR models and evaluation
  • Cleverdon, Cranfield Experiments
  • Salton from Cornell Univ., SMART system based on
    VSM Model
  • Robertson from London City Univ., Sparck Jones
    from Cambridge Univ., Probabilistic Model

8
Background
  • History (cont'd)
  • 1980s, RDBMSs
  • 1986, ANSI SQL released
  • 1990s, Search Engine
  • 1990, McGill Univ., Archie (ftp search tool)
  • 1992, Donna Harman from NIST, TREC
  • 1994, CMU Univ., Lycos
  • 1995, David Filo & Jerry Yang from Stanford
    Univ., Yahoo!
  • 1998, Larry Page & Sergey Brin from Stanford
    Univ., Google
  • 1998, Ponte & Croft from UMass, IR model based on
    language models

9
Background
  • History (cont'd)
  • 2000 onward, branches of IR
  • 2001, Li Yanhong, Baidu Inc.
  • TREC Q/A track
  • TREC Video track
  • CLEF & NTCIR

10
Outline
  • Background
  • Evaluation: principle & measure
  • IR models
  • Query expansion
  • Bibliography

11
Evaluation: principle and measure
  • What is to be evaluated in IR?
  • Effectiveness
  • Precision
  • Recall
  • Precision of ranked list
  • Efficiency
  • Time complexity
  • Space complexity
  • Response time
  • Coverage
  • Frequency of data update

12
Evaluation: principle and measure
  • Probability ranking principle(PRP)
  • ranking documents in order of decreasing
    probability of relevance is optimal
  • Fit for ad-hoc retrieval
  • Assumption
  • Documents are independent
  • A complex information need can be broken up into
    a number of queries
  • The probability of relevance is only estimated

13
Evaluation: principle and measure
  • The document set splits into four classes
  • RR: retrieved & relevant
  • RN: retrieved & not relevant
  • NR: not retrieved & relevant
  • NN: not retrieved & not relevant
14
Evaluation: principle and measure
  • Recall = RR / (RR + NR)
  • Precision = RR / (RR + RN)
  • F = 1 / (α/P + (1-α)/R)
  • Pooling for evaluation of large-scale data

(Figure: the returned documents overlap the correct
documents; the intersection is RR, the returned-only
part RN, and the correct-only part NR.)
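A minimal Python sketch of the three measures above (the counts in the example call are made up):

```python
def precision_recall_f(rr, rn, nr, alpha=0.5):
    """Compute precision, recall and the F measure from the contingency
    counts: RR (retrieved & relevant), RN (retrieved & not relevant),
    NR (not retrieved & relevant)."""
    precision = rr / (rr + rn)
    recall = rr / (rr + nr)
    # F = 1 / (alpha/P + (1-alpha)/R); alpha = 0.5 gives the harmonic mean
    f = 1.0 / (alpha / precision + (1 - alpha) / recall)
    return precision, recall, f

p, r, f = precision_recall_f(rr=8, rn=2, nr=8)
```

With alpha = 0.5 the F measure reduces to the usual harmonic mean 2PR/(P+R).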
15
Evaluation: principle and measure
16
Evaluation: principle and measure
  • P@N
  • Precision at a particular cutoff N
  • Takes no account of recall
  • Uninterpolated Average Precision (AP)
  • Estimate precision at each recall point (each
    relevant document retrieved), and compute the
    average value
  • E.g., with relevant documents at ranks 2, 3, 6, 7, 8
  • AP = (1/2 + 2/3 + 3/6 + 4/7 + 5/8) / 5
  • ≈ 0.5726
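The AP computation above can be sketched as:

```python
def average_precision(relevant_ranks, num_relevant):
    """Uninterpolated AP: average the precision measured at the rank
    of each relevant document retrieved."""
    ranks = sorted(relevant_ranks)
    # the (i+1)-th relevant document at rank r gives precision (i+1)/r
    return sum((i + 1) / r for i, r in enumerate(ranks)) / num_relevant

# The slide's example: relevant documents retrieved at ranks 2, 3, 6, 7, 8
ap = average_precision([2, 3, 6, 7, 8], num_relevant=5)
```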

17
Evaluation: principle and measure
  • Interpolated average precision
  • At the point where recall reaches a, compute
    precision b
  • If precision goes up later, take the highest value
    of precision anywhere beyond the point where a was
    first reached
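A sketch of the interpolation rule, assuming the observed precision/recall points are given as (recall, precision) pairs:

```python
def interpolated_precision(pr_points, level):
    """Interpolated precision at recall level `level`: the maximum
    precision over all observed points whose recall is >= level."""
    candidates = [p for r, p in pr_points if r >= level]
    return max(candidates) if candidates else 0.0
```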

18
  • Recall (%)  Interpolated Precision
  • 0           2/3
  • 10          2/3
  • 20          2/3
  • 30          2/3
  • 40          2/3
  • 50          5/8
  • 60          5/8
  • 70          5/8
  • 80          5/8
  • 90          5/8
  • 100         5/8
  • Int. AP = 0.6460

Ranking list: d6 → d1 → d2 → d10 → d9 → d3 → d5 → d4 → d7 → d8
19
(Figure: uninterpolated vs. interpolated AP.)
20
Evaluation: principle and measure
  • TREC (The Text REtrieval Conference)
  • Established in 1992 to evaluate large-scale IR
  • Retrieving documents from a gigabyte collection
  • Has run continuously since then
  • The TREC 2007 (16th) meeting is in November
  • Run by NIST's Information Access Division
  • Probably the best-known IR evaluation setting
  • Started with 25 participating organizations in
    the 1992 evaluation
  • Proceedings available on-line
    (http://trec.nist.gov)

21
Outline
  • Background
  • Evaluation: principle & measure
  • IR models
  • Query expansion
  • Bibliography

22
IR models
  • Set Theoretic models
  • Boolean model
  • Rough set based model
  • Extended boolean model
  • Algebraic models
  • Vector space model
  • Latent semantic Indexing model
  • Probabilistic models
  • Logistic regression model
  • Binary independence relevance model
  • Statistical language model based model

23
IR models
  • Boolean model
  • Representation of query and documents
  • boolean expression
  • w1, w2, …, wn: words in document D
  • D = w1 AND w2 AND … AND wn
  • Relevance estimation
  • R(D,Q) = 1 if D satisfies the boolean expression
    of the query Q
  • R(D,Q) = 0 otherwise

24
IR models
Doc1: Beijing will take measures to protect the
environment in 2008. → match
Query: 2008 AND Beijing AND NOT Olympic
Doc2: The 29th Olympic games will be held in
Beijing in the fall of 2008. → no match
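The slide's example, sketched with Python term sets (the lower-cased tokenization is an assumption):

```python
def matches(doc_terms, required, forbidden):
    """Boolean AND/NOT matching over a document's set of terms:
    all required terms present, no forbidden term present."""
    return required <= doc_terms and not (forbidden & doc_terms)

doc1 = set("beijing will take measures to protect environment in 2008".split())
doc2 = set("the 29th olympic games will be held in beijing in fall of 2008".split())

# Query: 2008 AND Beijing AND NOT Olympic
req, forb = {"2008", "beijing"}, {"olympic"}
```

`matches(doc1, req, forb)` holds, while doc2 is rejected by the NOT clause.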
25
IR models
  • Boolean model
  • Pros
  • Simple and straightforward
  • Fit for special situation
  • Cons
  • Unordered list
  • Exact match will return empty or huge result sets
  • Difficult for users to construct queries

26
IR models
  • Vector space model
  • Representation of query and documents
  • Document
  • D = <a1, a2, a3, …, an>
  • ai: weight of term ti in D
  • Query
  • Q = <b1, b2, b3, …, bn>
  • bi: weight of term ti in Q
  • Term
  • Character, word, phrase, n-gram, etc
  • Dimension reduction: stop-word list, stemming,
    word clustering, etc.

27
IR models
  • Vector space model
  • Relevance estimation
  • Euclidean
  • Cosine
  • Dice
  • Jaccard
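A sketch of the cosine measure (Dice and Jaccard follow the same dot-product/norm pattern):

```python
import math

def cosine(d, q):
    """Cosine similarity between two term-weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(d, q))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0
```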

28
Ranking list: d2 > d1 > d3
29
IR models
  • Vector space model
  • Term weighting
  • Goal
  • Determine the most representative terms (words)
    for a document (query)
  • Weight their importance
  • General idea
  • A more frequent term is more salient
  • But we also need to measure specificity
    (discriminative power)

30
IR models
  • Vector space model
  • Term weighting
  • Quantities: term frequency tf, document frequency
    df, collection frequency cf
  • Note that dfi ≤ cfi

31
IR models
  • Vector space model
  • Term weighting (TF and IDF schemes)
  • tf
  • f(tf)
  • f(tf) = 1 + log(tf)
  • df
  • f(df) = 1 + log(N/df)
  • tf·idf
  • f(tf,df) = f(tf) · f(df) if tf > 0
  • = 0 if tf = 0
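A sketch combining the two damped factors above; treating the weight as their product is an assumption consistent with the scheme listed:

```python
import math

def tfidf(tf, df, n_docs):
    """Term weight: f(tf) * f(df) with the damped schemes
    f(tf) = 1 + log(tf) and f(df) = 1 + log(N/df); zero when tf == 0."""
    if tf == 0:
        return 0.0
    return (1 + math.log(tf)) * (1 + math.log(n_docs / df))
```

A term appearing once in every document (tf = 1, df = N) gets weight 1; rarer terms get more.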

32
IR models
  • Vector space model
  • Pros
  • Conceptual simplicity: spatial proximity stands
    in for semantic proximity
  • Partial matching and fuzzy retrieval
  • Good performance
  • Cons
  • The term-independence assumption is not true
  • Cannot handle polysemy (Java the language vs. Java
    the island), synonymy (computer and PC), etc.

33
IR models
  • Term distribution model
  • Estimate distribution of a word
  • Characterize the importance of a word for IR
  • Capture regularities of word occurrence in
    subunits of a corpus
  • Distinguish content words from non-content words

34
IR models
  • Term distribution model
  • Integration into IR
  • Replacement for IDF weights
  • Potential for assessing a term's properties more
    accurately
  • Better estimation of query-document similarity

35
IR models
  • Term distribution model
  • Poisson distribution
  • 2-poisson distribution
  • Katz's K mixture
  • Residual inverse document frequency

36
IR models
  • Poisson distribution
  • λi = cfi / N
  • Pi(k) = p(k; λi) = e^(-λi) · λi^k / k!
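The Poisson probability can be sketched directly from its standard pmf:

```python
import math

def poisson_pmf(k, lam):
    """P(k; lambda) = e^(-lambda) * lambda^k / k!  -- the modeled
    probability that a term occurs k times in a document, with
    lambda_i = cf_i / N."""
    return math.exp(-lam) * lam ** k / math.factorial(k)
```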

37
IR models
  • Poisson distribution
  • Assumption
  • The probability of one occurrence of the term in
    a piece of text is proportional to the length of
    text
  • The probability of more than one occurrence of a
    term in a short piece of text is negligible
    compared to the probability of one occurrence
  • Occurrence events in non-overlapping intervals of
    text are independent

38
IR models
39
IR models
  • Problems with Poisson model
  • Good for non-content words
  • Assumption of independence is not true for
    content words (burstiness)
  • Documents are not uniform units, since they
    differ in size

40
IR models
  • 2-Poisson model
  • 2 classes of documents associated with a term
  • Non-privileged class
  • Privileged class

41
IR models
  • 2-Poisson model
  • Better fit to the frequency distribution of
    content words
  • Ridiculous prediction
  • Pi(2) < Pi(3) or Pi(4)
  • In reality, Pi(0) > Pi(1) > Pi(2) > Pi(3)

42
IR models
  • Katz's K mixture
  • Pi(k) = (1 - α)·δk,0 + (α/(β+1))·(β/(β+1))^k
  • δk,0 = 1 if k = 0; δk,0 = 0 otherwise

43
IR models
44
IR models
  • A good approximation for non-content words
  • Derivation: Pi(k)/Pi(k+1) = c for k ≥ 1
  • Does not hold perfectly for content words

45
IR models
  • Residual inverse document frequency (RIDF)
  • IDF = log2(N/df)
  • RIDF = IDF + log2(1 - p(0; λi)): observed IDF
    minus the IDF a Poisson model predicts
  • A good predictor of the degree to which a word is
    a content word

46
IR models
  • Latent semantic indexing model(LSI)
  • Motivation
  • Word co-occurrence implies semantic similarity
  • LSI projects queries and documents into a space
    with latent semantic dimensions
  • Fewer dimensions → dimension reduction
  • Keep the k strongest dimensions, remove noise

47
IR models
  • Latent semantic indexing model
  • Create a new representation space (by SVD)
  • Linear combination of the original term and
    document dimensions
  • Remove the least weighted dimensions (noise)

48
IR models
  • Singular Value Decomposition(SVD)
  • Least squares methods
  • Given (x1,y1), (x2,y2), …, (xn,yn)
  • Fit F(x) = mx + b
  • Minimize SS(m,b) = Σi (yi - m·xi - b)²

49
50
IR models
  • Singular Value Decomposition(SVD)
  • Given document-by-term matrix A
  • Compute latent semantic matrix

51
Singular Value Decomposition
(Figure: the matrix A factored into U, S, and VT, with
rows u1…uM of U and v1T…vNT of VT.)
  • A = U·S·VT
  • U·UT = I, V·VT = I
  • S: the singular values

52
Truncated-SVD
  • Ak = Uk·Sk·VkT
  • The best rank-k approximation of A
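A pure-Python sketch of the k = 1 case via power iteration (real LSI uses a full SVD routine such as `numpy.linalg.svd`; the iteration count is an arbitrary choice):

```python
import math

def rank1_approx(a, iters=100):
    """Approximate the top singular triple (u1, s1, v1) of matrix `a`
    by power iteration on A^T A, and return s1*u1*v1^T -- the best
    rank-1 (k = 1 truncated-SVD) approximation in the least-squares
    sense."""
    m, n = len(a), len(a[0])
    v = [1.0] * n
    for _ in range(iters):
        # v <- normalize(A^T A v)
        av = [sum(a[i][j] * v[j] for j in range(n)) for i in range(m)]
        atav = [sum(a[i][j] * av[i] for i in range(m)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in atav))
        v = [x / norm for x in atav]
    av = [sum(a[i][j] * v[j] for j in range(n)) for i in range(m)]
    s = math.sqrt(sum(x * x for x in av))   # top singular value
    u = [x / s for x in av]                 # top left singular vector
    return [[s * ui * vj for vj in v] for ui in u]
```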

53
IR models
54
IR models
55
IR models
56
IR models
  • Latent semantic indexing model
  • Pros and Cons
  • A clean formal framework
  • Computation cost (SVD)
  • Setting of k critical (empirically 300-1000)
  • Effective on small collections (CACM, …), but
    variable on large collections (TREC)

57
IR models
  • Binary independence relevance model
  • For the same query, rank documents by P(R=1|D)
  • R: relevance
  • D: document
  • Q: query

58
IR models
  • Binary independence relevance model
  • Ranking function

59
IR models
  • Binary independence relevance model
  • Ranking function
  • Assume D = (x1, x2, …), xi ∈ {0, 1}

60
IR models
  • Binary independence relevance model
  • Query = {China, economic, rapid}
  • D = {economic, surprising}

61
IR models
  • Binary independence relevance model
  • How to estimate pi and qi ?
  • A set of N relevant and irrelevant samples

62
IR models
  • Binary independence relevance model
  • Smoothing (Robertson-Sparck-Jones formula)
  • When no sample is available
  • pi = 0.5
  • qi = (ni + 0.5) / (N + 0.5) ≈ ni / N
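A sketch using the standard RSJ log-odds term weight; the slide's ranking-function image is not in the transcript, so the exact form here is an assumption:

```python
import math

def rsj_weight(p_i, q_i):
    """Robertson-Sparck-Jones term weight: log[ p(1-q) / (q(1-p)) ]."""
    return math.log(p_i * (1 - q_i) / (q_i * (1 - p_i)))

def rsj_no_samples(n_i, n_docs):
    """No relevance samples: p_i = 0.5 and the smoothed
    q_i = (n_i + 0.5)/(N + 0.5), giving an IDF-like weight."""
    q_i = (n_i + 0.5) / (n_docs + 0.5)
    return rsj_weight(0.5, q_i)
```

With p_i = 0.5 the weight reduces to log((1-q_i)/q_i), which grows as the term gets rarer.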

63
OKAPI (BM25)
  • Binary independence relevance model plus
    frequency, length, and other heuristics
  • k1, k2, k3, b: parameters
  • tfi, qtfi: document / query term frequency
  • dl: document length
  • avdl: average document length

(Figure: the formula's document-length normalization
and TF factors.)
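A sketch of one term's BM25 contribution, using a common formulation (k2, which weights a query-length correction, is omitted; the slide's exact variant may differ):

```python
import math

def bm25_term(tf, qtf, df, n_docs, dl, avdl, k1=1.2, k3=8.0, b=0.75):
    """One term's contribution to an Okapi BM25 score."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))    # RSJ-style idf
    k = k1 * ((1 - b) + b * dl / avdl)                  # doc length normalization
    tf_factor = tf * (k1 + 1) / (k + tf)                # document TF factor
    qtf_factor = qtf * (k3 + 1) / (k3 + qtf)            # query TF factor
    return idf * tf_factor * qtf_factor
```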
64
IR models
  • Binary independence relevance model
  • Pros & cons
  • Theoretically well founded
  • Difficult to implement without simplification
  • Effectiveness often depends on heuristics (e.g.
    Okapi)
  • Term-independence assumption

65
IR models
  • Statistical language model based model
  • Motivation
  • Create a statistical model so that one can
    calculate the probability of s = w1 w2 … wn in
    a language

66
IR models
  • Statistical language model based model
  • For d = w1 w2 … wn
  • Relevance: score by P(Q|Md)

67
IR models
  • Statistical language model based model
  • General approach

(Figure: training data yields probabilities of the
observed elements; a string s is then scored as P(s).)
68
IR models
  • Statistical language model based model
  • Maximum likelihood estimation:
    Pml(w|Md) = tf(w,d) / |d|

69
IR models
  • Statistical language model based model
  • Smoothing
  • If a query word does not appear in a document,
    P(Q|MD) = 0
  • General form
  • αD: normalization coefficient

70
Some smoothing methods used in IR
  • Jelinek-Mercer Interpolation
  • Dirichlet Prior
  • Absolute discounting
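The first of these can be sketched as a query likelihood under Jelinek-Mercer smoothing (the mixing weight 0.5 is an arbitrary default):

```python
def jm_query_likelihood(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    """P(Q|M_D) with Jelinek-Mercer interpolation:
    P(w|M_D) = (1-lam) * P_ml(w|D) + lam * P(w|collection),
    so unseen query words no longer zero out the score."""
    score = 1.0
    for w in query:
        p_doc = doc_tf.get(w, 0) / doc_len
        p_coll = coll_tf.get(w, 0) / coll_len
        score *= (1 - lam) * p_doc + lam * p_coll
    return score
```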

71
IR models
  • Statistical language model based model
  • Final ranking score

72
IR models
  • Statistical language model based model
  • Pros & cons
  • Theoretical formalization
  • No term-independence assumption needed
  • Data sparseness
  • Parameter estimation

73
Outline
  • Background
  • Evaluation: principle & measure
  • IR models
  • Query expansion
  • Bibliography

74
Query expansion
  • Motivation
  • A user knows his information need, but cannot
    construct a good query for it
  • A user is unsure of his information need, and
    wishes to refine it through initial queries

75
Query expansion
  • Relevance feedback(RF)
  • User RF
  • Users judge documents in returned lists manually
  • Pseudo RF
  • Select top N documents in a returned list as
    answers

76
Query expansion
  • Query reformulation
  • Thesaurus
  • WordNet
  • HowNet
  • Co-occurrences
  • Relevance feedback

77
Query expansion
  • Query reformulation for VSM
  • Rocchio formula
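The Rocchio formula can be sketched as follows; the α, β, γ defaults and the clipping of negative weights to zero are common choices, not necessarily the slide's:

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio reformulation: q' = alpha*q + beta*centroid(relevant)
    - gamma*centroid(nonrelevant), per vector component."""
    dim = len(query)

    def centroid(docs):
        if not docs:
            return [0.0] * dim
        return [sum(d[i] for d in docs) / len(docs) for i in range(dim)]

    cr, cn = centroid(relevant), centroid(nonrelevant)
    # negative weights are clipped to zero, a common convention
    return [max(0.0, alpha * query[i] + beta * cr[i] - gamma * cn[i])
            for i in range(dim)]
```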

78
Query expansion
  • Query expansion for language model
  • Bi-grams
  • Bi-term
  • Do not consider word order in bi-grams
  • (analysis, data) = (data, analysis)
  • Term dependency
  • Determine statistically the strongest
    dependencies
  • parallel computer architecture

79
Bibliography
  • Christopher D. Manning & Hinrich Schütze.
    Foundations of Statistical Natural Language
    Processing. 1999
  • Jian-Yun Nie. IR Models and Some Recent Trends.
    2006
  • Wang Bin. Teaching materials of Modern
    Information Retrieval. 2007
  • Ricardo Baeza-Yates & Berthier Ribeiro-Neto.
    Modern Information Retrieval. 1999

80
Thanks!