A Formal Study of Information Retrieval Heuristics

1
A Formal Study of Information Retrieval Heuristics
  • Hui Fang, Tao Tao, ChengXiang Zhai
  • University of Illinois at Urbana-Champaign, Urbana
  • SIGIR 2004

Presented by CHU Huei-Ming 2004/01/17
2
Outline
  • Formal Definitions of Heuristic Retrieval
    Constraints
  • Analysis of Three Representative Retrieval
    Formulas
  • Pivoted Normalization Method
  • Okapi Method
  • Dirichlet Prior Method
  • Experiments
  • Conclusion and Future Work

3
Formal Definitions of Heuristic Retrieval
Constraints
  • Six intuitive and desirable constraints that any
    reasonable retrieval formula should satisfy
  • Term Frequency Constraints (TFC1, TFC2)
  • Term Discrimination Constraint (TDC)
  • Length Normalization Constraints (LNC1, LNC2)
  • TF-Length Constraint (TF-LNC)

4
Formal Definitions of Heuristic Retrieval
Constraints
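  • Term Frequency Constraints (TFCs)
  • TFC1
  • Let $q = \{w\}$ be a query with a single term $w$, and assume $|d_1| = |d_2|$.
    If $c(w,d_1) > c(w,d_2)$, then $f(d_1,q) > f(d_2,q)$
  • TFC2
  • Let $q = \{w\}$, and assume $|d_1| = |d_2| = |d_3|$ and $c(w,d_1) > 0$
  • If $c(w,d_2) - c(w,d_1) = 1$ and $c(w,d_3) - c(w,d_2) = 1$,
  • then $f(d_2,q) - f(d_1,q) > f(d_3,q) - f(d_2,q)$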
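Because both constraints fix the document length and use a single-term query, they can be checked numerically on the term-frequency component alone. The following is a minimal sketch under that assumption; the helper names and the two sample TF functions are illustrative and do not come from the slides.

```python
import math

def raw_tf(count):
    """Linear TF: the score grows in proportion to the raw count (illustrative)."""
    return float(count)

def log_tf(count):
    """Sublinear TF, 1 + ln(1 + ln(c)) for c > 0 and 0 otherwise (illustrative)."""
    return 1.0 + math.log(1.0 + math.log(count)) if count > 0 else 0.0

def satisfies_tfc1(tf, max_count=50):
    """TFC1: with equal document lengths, a higher count must score strictly higher."""
    return all(tf(c + 1) > tf(c) for c in range(0, max_count))

def satisfies_tfc2(tf, max_count=50):
    """TFC2: for c >= 1, each extra occurrence must add strictly less than the previous one."""
    return all(tf(c + 1) - tf(c) > tf(c + 2) - tf(c + 1) for c in range(1, max_count))

for name, tf in (("raw_tf", raw_tf), ("log_tf", log_tf)):
    print(name, "TFC1:", satisfies_tfc1(tf), "TFC2:", satisfies_tfc2(tf))
```

A linear TF passes TFC1 but fails TFC2, since every additional occurrence adds the same amount; the sublinear form passes both over the tested range.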

5
Formal Definitions of Heuristic Retrieval
Constraints
  • Term Discrimination Constraint (TDC)
  • Let q be a query, and let $w_1, w_2 \in q$ be two query terms
  • Assume $|d_1| = |d_2|$ and $c(w_1,d_1) + c(w_2,d_1) = c(w_1,d_2) + c(w_2,d_2)$
  • If $\mathrm{idf}(w_1) \ge \mathrm{idf}(w_2)$ and $c(w_1,d_1) > c(w_1,d_2)$,
    then $f(d_1,q) \ge f(d_2,q)$

6
Formal Definitions of Heuristic Retrieval
Constraints
  • Length Normalization Constraints (LNCs)
  • LNC1
  • Let q be a query, and let $d_1$ and $d_2$ be two documents
  • If $c(w',d_2) = c(w',d_1) + 1$ for some word $w' \notin q$,
    but $c(w,d_2) = c(w,d_1)$ for every query term $w$,
    then $f(d_1,q) \ge f(d_2,q)$
  • LNC2
  • Let q be a query and, for any $k > 1$, let $d_1$ and $d_2$ be two
    documents
  • If $|d_1| = k \cdot |d_2|$ and $c(w,d_1) = k \cdot c(w,d_2)$ for all terms $w$,
  • then $f(d_1,q) \ge f(d_2,q)$

7
Formal Definitions of Heuristic Retrieval
Constraints
  • TF-Length Constraint (TF-LNC)
  • Let $q = \{w\}$, and let $d_1$ and $d_2$ be two documents
  • If $c(w,d_1) > c(w,d_2)$ and $|d_1| = |d_2| + c(w,d_1) - c(w,d_2)$,
  • then $f(d_1,q) > f(d_2,q)$

8
Formal Definitions of Heuristic Retrieval
Constraints

9
Analysis of Three Representative Retrieval
Formulas
  • Pivoted Normalization Method
  • Okapi Method
  • Dirichlet Prior Method

10
Analysis of Three Representative Retrieval
Formulas: Pivoted Normalization Method
  • Retrieval function
  • Analyzing
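The retrieval function on this slide was an image and did not survive in the transcript. As a point of reference, the sketch below assumes the form of pivoted normalization commonly given in the literature, where each matched term contributes (1 + ln(1 + ln c(w,d))) / ((1 - s) + s |d| / avdl) times c(w,q) times ln((N + 1) / df(w)). The function name, the argument names, and the default s = 0.2 are illustrative assumptions.

```python
import math
from collections import Counter

def pivoted_score(query, doc, df, n_docs, avdl, s=0.2):
    """Score a tokenized document against a tokenized query.

    df maps term -> document frequency, n_docs is the collection size,
    avdl is the average document length, and s is the slope parameter.
    """
    q_counts, d_counts = Counter(query), Counter(doc)
    length_norm = (1.0 - s) + s * len(doc) / avdl
    score = 0.0
    for w, qtf in q_counts.items():
        c = d_counts.get(w, 0)
        if c == 0:
            continue  # only terms present in both query and document contribute
        tf = 1.0 + math.log(1.0 + math.log(c))
        idf = math.log((n_docs + 1) / df[w])
        score += tf / length_norm * qtf * idf
    return score

print(pivoted_score(["a"], ["a", "b", "c", "d"], {"a": 10}, n_docs=99, avdl=4.0))
```

For a document of average length containing the query term once, the TF and length factors are both 1, so the score reduces to the IDF factor.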

11
Analysis of Three Representative Retrieval
Formulas: Pivoted Normalization Method
  • Check the TF-LNC constraint: when $|d_1| = \mathrm{avdl}$, it is
    equivalent to a condition on the parameter s
  • TF-LNC is satisfied only if s is below a certain
    upper bound

12
Analysis of Three Representative Retrieval
Formulas: Pivoted Normalization Method
  • Check the LNC2 constraint

13
Analysis of Three Representative Retrieval
Formulas: Pivoted Normalization Method
  • Consider the common case when $|d_2| = \mathrm{avdl}$
  • Performance can be poor when s is large
14
Analysis of Three Representative Retrieval
Formulas: Pivoted Normalization Method
  • Check the TDC constraint
  • It is equivalent to $c(w_2,d_1) \ge c(w_1,d_2)$, so TDC is
    only conditionally satisfied
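The conditional nature of TDC can be seen with a small numeric check. This sketch assumes the pivoted TF component 1 + ln(1 + ln c), equal document lengths (so the length normalizer cancels), and, as a worst case chosen for illustration, equal IDF values for the two query terms. The function names are hypothetical.

```python
import math

def pivoted_tf(count):
    """Assumed pivoted TF component: 1 + ln(1 + ln(c)) for c > 0, else 0."""
    return 1.0 + math.log(1.0 + math.log(count)) if count > 0 else 0.0

def tdc_holds(c_w1_d1, c_w2_d1, c_w1_d2, c_w2_d2, idf_w1=1.0, idf_w2=1.0):
    """Check f(d1,q) >= f(d2,q) for a two-term query and equal-length documents.

    The length normalizer is the same for both documents, so it is omitted.
    """
    if c_w1_d1 + c_w2_d1 != c_w1_d2 + c_w2_d2:
        raise ValueError("TDC assumes the same total number of query-term occurrences")
    f_d1 = idf_w1 * pivoted_tf(c_w1_d1) + idf_w2 * pivoted_tf(c_w2_d1)
    f_d2 = idf_w1 * pivoted_tf(c_w1_d2) + idf_w2 * pivoted_tf(c_w2_d2)
    return f_d1 >= f_d2

# d1 has more occurrences of w1 than d2 in both cases.
print(tdc_holds(2, 3, 1, 4))  # c(w2,d1) = 3, c(w1,d2) = 1
print(tdc_holds(5, 1, 4, 2))  # c(w2,d1) = 1, c(w1,d2) = 4
```

In the first call c(w2,d1) is at least c(w1,d2) and the check passes; in the second it is smaller and the check fails, because the concave TF rewards the extra occurrence of the rarer-in-document term w2 more than the extra occurrence of w1.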

15
Analysis of Three Representative Retrieval
Formulas: Okapi Method
  • Retrieval function
  • Parameters: $k_1$ (between 1.0 and 2.0), $b$ (usually 0.75), and $k_3$
    (between 0 and 1000)
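The formula image is missing from the transcript. The sketch below assumes the widely used Okapi BM25 form, with IDF ln((N - df + 0.5) / (df + 0.5)), a document TF part (k1 + 1) c / (k1 ((1 - b) + b |d| / avdl) + c), and a query TF part (k3 + 1) c(w,q) / (k3 + c(w,q)). The function and argument names are illustrative; the defaults follow the parameter ranges listed on the slide.

```python
import math
from collections import Counter

def okapi_score(query, doc, df, n_docs, avdl, k1=1.2, b=0.75, k3=1000.0):
    """Score a tokenized document against a tokenized query (assumed BM25 form)."""
    q_counts, d_counts = Counter(query), Counter(doc)
    length_norm = (1.0 - b) + b * len(doc) / avdl
    score = 0.0
    for w, qtf in q_counts.items():
        c = d_counts.get(w, 0)
        if c == 0:
            continue  # unmatched query terms contribute nothing
        idf = math.log((n_docs - df[w] + 0.5) / (df[w] + 0.5))
        tf_part = (k1 + 1.0) * c / (k1 * length_norm + c)
        qtf_part = (k3 + 1.0) * qtf / (k3 + qtf)
        score += idf * tf_part * qtf_part
    return score

print(okapi_score(["a"], ["a", "b", "c", "d"], {"a": 10}, n_docs=100, avdl=4.0))
```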

16
Analysis of Three Representative Retrieval
Formulas: Okapi Method
  • Analysis
  • When $\mathrm{df}(w) > N/2$, the IDF part of the formula
    is negative
  • When the IDF part is positive (mostly true for
    keyword queries)
  • TFCs and LNCs are satisfied
  • TF-LNC: in the common case when $|d_2| = \mathrm{avdl}$, the
    constraint is equivalent to $b \le \mathrm{avdl} / c(w,d_2)$
  • TDC is equivalent to $c(w_2,d_1) \ge c(w_1,d_2)$, the same
    condition as for the pivoted normalization formula
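The sign change of the IDF part is easy to verify directly. This is a small sketch assuming the usual Okapi IDF, ln((N - df + 0.5) / (df + 0.5)); the function name is illustrative.

```python
import math

def okapi_idf(df, n_docs):
    """Assumed Okapi IDF: negative once a term occurs in more than half the documents."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))

n_docs = 100
for df in (10, 40, 50, 60, 90):
    print(df, okapi_idf(df, n_docs))
```

With N = 100 the value is positive for df = 40, exactly zero at df = 50, and negative for df = 60, so a matched common word lowers a document's score instead of raising it.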

17
Analysis of Three Representative Retrieval
Formulas: Okapi Method
  • Modified Okapi method
  • Solves the problem of negative IDF
  • Replace the original IDF in Okapi with the
    regular IDF in the pivoted normalization formula
  • Performance is better on verbose queries
  • Analysis result
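The modification can be sketched by making the IDF a pluggable function. The code below assumes the common BM25 TF parts and takes ln((N + 1) / df) as the "regular" pivoted IDF; all names are hypothetical and the numbers are made up for illustration.

```python
import math
from collections import Counter

def okapi_idf(df, n_docs):
    """Assumed original Okapi IDF; can be negative for common terms."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))

def pivoted_idf(df, n_docs):
    """Assumed pivoted-normalization IDF; positive whenever df <= N."""
    return math.log((n_docs + 1.0) / df)

def okapi_like_score(query, doc, df, n_docs, avdl, idf_fn, k1=1.2, b=0.75, k3=1000.0):
    """Okapi-style score with the IDF component supplied by idf_fn."""
    q_counts, d_counts = Counter(query), Counter(doc)
    length_norm = (1.0 - b) + b * len(doc) / avdl
    score = 0.0
    for w, qtf in q_counts.items():
        c = d_counts.get(w, 0)
        if c == 0:
            continue
        tf_part = (k1 + 1.0) * c / (k1 * length_norm + c)
        qtf_part = (k3 + 1.0) * qtf / (k3 + qtf)
        score += idf_fn(df[w], n_docs) * tf_part * qtf_part
    return score

# A verbose query tends to contain common words such as "the".
df = {"the": 90, "cat": 5}
doc = ["the", "cat"]
print(okapi_like_score(["the"], doc, df, 100, 2.0, okapi_idf))
print(okapi_like_score(["the"], doc, df, 100, 2.0, pivoted_idf))
```

With the original IDF, matching the common word yields a negative contribution; with the substituted IDF the contribution stays positive.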

18
Analysis of Three Representative Retrieval
Formulas: Dirichlet Prior Method
  • Retrieval function
  • Use the Dirichlet prior smoothing method to smooth the
    document language model
  • Rank documents by the likelihood of the query under
    the estimated language model of each document
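The two steps above can be sketched as follows, assuming the standard Dirichlet-smoothed estimate p(w|d) = (c(w,d) + mu p(w|C)) / (|d| + mu). The function name, the argument names, and the default mu = 2000 are illustrative assumptions; the slide's own formula is not in the transcript.

```python
import math
from collections import Counter

def dirichlet_log_likelihood(query, doc, p_coll, mu=2000.0):
    """Log-likelihood of the query under the Dirichlet-smoothed document model.

    p_coll maps each query term to its collection probability p(w|C),
    which is assumed to be positive.
    """
    d_counts = Counter(doc)
    doc_len = len(doc)
    log_lik = 0.0
    for w in query:  # repeated query terms are counted once per occurrence
        p_w_d = (d_counts.get(w, 0) + mu * p_coll[w]) / (doc_len + mu)
        log_lik += math.log(p_w_d)
    return log_lik

p_coll = {"a": 0.1}
print(dirichlet_log_likelihood(["a"], ["a", "a", "b", "c"], p_coll, mu=10.0))
```

Documents are then ranked by this value in descending order.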

19
Analysis of Three Representative Retrieval
Formulas: Dirichlet Prior Method
  • Analysis
  • The LNC2 constraint is equivalent to $c(w,d_2) \ge |d_2| \cdot p(w|C)$,
  • which is usually satisfied for content-carrying
    words
  • The TDC constraint leads to a lower bound on the
    parameter $\mu$
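The LNC2 condition can be checked numerically for a single-term query. This is a minimal sketch assuming the Dirichlet-smoothed score ln((c(w,d) + mu p(w|C)) / (|d| + mu)); the helper names and sample numbers are hypothetical.

```python
import math

def dirichlet_term_score(count, doc_len, p_w_coll, mu):
    """Assumed single-term Dirichlet score: log of the smoothed p(w|d)."""
    return math.log((count + mu * p_w_coll) / (doc_len + mu))

def lnc2_holds(count_d2, len_d2, p_w_coll, k, mu=2000.0):
    """Compare d2 with d1, where d1 is d2 concatenated with itself k times."""
    score_d2 = dirichlet_term_score(count_d2, len_d2, p_w_coll, mu)
    score_d1 = dirichlet_term_score(k * count_d2, k * len_d2, p_w_coll, mu)
    return score_d1 >= score_d2

# |d2| * p(w|C) = 100 * 0.1 = 10 is the threshold on c(w,d2).
print(lnc2_holds(20, 100, 0.1, k=2))  # count above the threshold
print(lnc2_holds(5, 100, 0.1, k=2))   # count below the threshold
```

The first call passes and the second fails: LNC2 holds when the term occurs in d2 more often than the collection model would predict for a document of that length.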

20
Analysis of Three Representative Retrieval
Formulas: Dirichlet Prior Method
  • Analysis
  • TDC: consider the common case where, for $w_2$,
    $p(w_2|C) = 1/\mathrm{avdl}$
  • This means that, for discriminative words with a high term
    frequency in a document, $\mu$ needs to be
    sufficiently large
  • in order to balance TF and IDF appropriately

21
Experiments: Setup
  • Document sets
  • AP (news articles), DOE (technical reports), FR
    (government documents)
  • ADF: combination of AP, DOE, and FR
  • Web: web data used in TREC8
  • Trec7: ad hoc data used in TREC7
  • Trec8: ad hoc data used in TREC8

22
Experiments: Setup
  • Query combinations
  • Short-keyword (SK, keyword title)
  • Short-verbose (SV, one-sentence description)
  • Long-keyword (LK, keyword list)
  • Long-verbose (LV, multiple sentences)
  • Preprocessing
  • Stemming only, with the Porter stemmer
  • No stop words were removed

23
Experiments: Parameter Sensitivity
  • Pivoted normalization method
  • The analysis of the LNC2 constraint for the pivoted
    normalization method suggests that s should be
    smaller than 0.4

24
Experiments: Parameter Sensitivity
  • Okapi method: $k_1 = 1.2$, $k_3 = 1000$, $b$ varies from
    0.1 to 1.0

25
Experiments: Parameter Sensitivity
  • Dirichlet prior method

26
Experiments: Parameter Sensitivity
  • Dirichlet prior method

27
Experiments: Performance Comparison

28
Experiments: Performance Comparison
  • For any query type, the performance of Dirichlet
    prior method is comparable to pivoted
    normalization method
  • For keyword queries, the performance of Okapi is
    comparable to the other two retrieval formulas
  • For verbose queries, the performance of Okapi may
    be worse than others due to the possible negative
    IDF part in the formula

29
Experiments: Performance Comparison
  • Average precision comparison

30
Conclusion and Future Work
  • Define six basic constraints that any reasonable
    retrieval function should satisfy
  • When a constraint is not satisfied, it often
    indicates non-optimality of the method