Field-Weighted XML Retrieval Based on BM25

1
Field-Weighted XML Retrieval Based on BM25
  • W. Lu (Reed), Center for Studies of Information Resources,
    Wuhan University, China, sa713_at_soi.city.ac.uk
  • S. E. Robertson, Microsoft Research Cambridge,
    ser_at_microsoft.com
  • A. Macfarlane, Centre for Interactive Systems Research,
    City University London, andym_at_soi.city.ac.uk

2
Outline
  • Basic work for INEX 2005
  • Our approach
  • Field-weighted model BM25F
  • Element-weighted model BM25E
  • Experiments
  • Results
  • Future work

3
Basic work for INEX (first year for us)
  • Developed a path indexing system
  • Revised Okapi's index structure to combine it with
    the path indexing system
  • Developed a query parser, BM25E-based retrieval,
    and output interfaces

4
Our approach
  • BM25F, proposed by Robertson [11]
  • It is a linear combination of field-weighted term
    frequencies (tf), rather than a combination of
    field-weighted scores
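
To illustrate the distinction on this slide, here is a minimal sketch contrasting the two ways of combining fields. It assumes a simplified BM25 saturation function with no length normalisation; the field names reuse atl and bdy from this deck, while the weights and counts are made up for the example.

```python
def saturate(tf, k1=1.2):
    """BM25-style tf saturation (length normalisation omitted for brevity)."""
    return tf * (k1 + 1) / (k1 + tf)

def combine_tf_then_saturate(field_tf, weights, k1=1.2):
    """BM25F idea: sum the weighted tfs first, then saturate once."""
    weighted_tf = sum(weights[f] * tf for f, tf in field_tf.items())
    return saturate(weighted_tf, k1)

def saturate_then_combine(field_tf, weights, k1=1.2):
    """Alternative: score each field separately, then sum the weighted scores."""
    return sum(weights[f] * saturate(tf, k1) for f, tf in field_tf.items())

# Hypothetical counts and weights, for illustration only.
field_tf = {"atl": 1, "bdy": 1}
weights = {"atl": 3.0, "bdy": 1.0}

tf_combined = combine_tf_then_saturate(field_tf, weights)
score_combined = saturate_then_combine(field_tf, weights)
```

With tf combination the result stays below k1 + 1 however large the field weights are, so the saturation behaviour of BM25 is preserved. With score combination the result grows linearly with the weights.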

5
Our approach
  • BM25F
  • This is a field-weighted version of BM25
  • It differs from BM25 in that (primes mark the weighted quantities)
  • $tf'_j$ is the weighted tf
  • $dl'$ is the weighted document length
  • $avdl'$ is the weighted average dl
    across the collection
  • $K_1'$ is the weighted free parameter:
  • $K_1' = K_1 \cdot avdl' / avdl$
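
As a rough sketch of how these weighted quantities could enter a term weight, the code below plugs them into the standard BM25 form with the Robertson/Sparck Jones idf. The slide does not show the exact formula, so that form, the parameter b, and all function and argument names are assumptions.

```python
import math

def rescale_k1(k1, avdl_weighted, avdl):
    """Rescale K1 by the ratio of weighted to unweighted average length."""
    return k1 * avdl_weighted / avdl

def bm25f_term_weight(tf_w, dl_w, avdl_w, k1_w, b, n_docs, doc_freq):
    """Standard BM25 term weight (assumed form) using weighted tf, dl, avdl, K1."""
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = k1_w * ((1 - b) + b * dl_w / avdl_w)
    return idf * tf_w * (k1_w + 1) / (length_norm + tf_w)

# Hypothetical numbers: field weighting triples the average length.
k1_w = rescale_k1(1.2, avdl_weighted=300.0, avdl=100.0)
w_low = bm25f_term_weight(2.0, 300.0, 300.0, k1_w, 0.75, n_docs=1000, doc_freq=10)
w_high = bm25f_term_weight(6.0, 300.0, 300.0, k1_w, 0.75, n_docs=1000, doc_freq=10)
```

Rescaling K1 keeps the saturation point in step with the inflated term frequencies. Without it, large field weights would push every weighted tf deep into the flat part of the curve.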

6
Our approach
  • BM25F

Suppose we have $n_F$ fields $f = 1, \ldots, n_F$. In a
given document $d$, term $t$ has frequency $tf_{d,t,f}$
in field $f$. Then, counting indexed terms (tokens),
the length of the field in this document is
$$dl_{d,f} = \sum_{t \in V} tf_{d,t,f}$$
where $V$ is the vocabulary, i.e. all indexed terms.
7
Our approach
  • BM25F

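With no field weighting, the term frequency of $t$
in the whole document is
$$tf_{d,t} = \sum_{f=1}^{n_F} tf_{d,t,f}$$
and the document length is
$$dl_d = \sum_{t \in V} tf_{d,t}$$
Average document length is
$$avdl = \frac{1}{N} \sum_{d} dl_d$$
where $N$ is the number of documents in the collection.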
8
Our approach
  • BM25F

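With field weights $W_f$, these are modified as
follows
$$tf'_{d,t} = \sum_{f=1}^{n_F} W_f \cdot tf_{d,t,f}$$
$$dl'_d = \sum_{t \in V} tf'_{d,t}$$
$$avdl' = \frac{1}{N} \sum_{d} dl'_d$$
with the average taken over all $N$ documents in the collection.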
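
The weighted term frequency, document length and average document length for this slide can be sketched as below. The nested-dictionary layout, the function name, and the default weight of 1.0 for unlisted fields are assumptions made for the example; the toy counts are invented.

```python
def weighted_stats(docs, weights):
    """docs: list of {field: {term: tf}}.

    Returns (per-document weighted tf, per-document weighted dl, weighted avdl).
    Fields missing from `weights` get weight 1.0 (an assumption of this sketch).
    """
    all_tf, all_dl = [], []
    for doc in docs:
        tf = {}
        for field, counts in doc.items():
            w = weights.get(field, 1.0)
            for term, count in counts.items():
                tf[term] = tf.get(term, 0.0) + w * count
        all_tf.append(tf)
        all_dl.append(sum(tf.values()))  # weighted dl = sum of weighted tfs
    avdl = sum(all_dl) / len(docs)
    return all_tf, all_dl, avdl

# Toy collection with two documents (invented counts).
docs = [
    {"atl": {"xml": 1}, "bdy": {"xml": 2, "retrieval": 3}},
    {"bdy": {"bm25": 4}},
]
tf_w, dl_w, avdl_w = weighted_stats(docs, {"atl": 5.0, "bdy": 1.0})
tf_u, dl_u, avdl_u = weighted_stats(docs, {})  # all weights 1: unweighted case
```

Setting every weight to 1 recovers the unweighted quantities, which gives a quick sanity check on any implementation.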
9
Our approach
  • BM25E (BM25F applied to element retrieval)

Where
$tf'_j$ denotes the weighted term frequency of the jth term
t in element e,
$el'$ is the weighted element length,
$avel'$ is the weighted average element length across
the collection, and
$K_1'$ is the weighted free parameter.
10
Our approach
  • BM25E (BM25F applied to element retrieval)
  • Our basic view is that an element is to be
    treated like a document, except that it may
    inherit information from other elements (atl, abs,
    st) in the document.
  • The key is to tune the parameter Wf for each
    selected field (element) that contributes to the
    specified elements.

11
Our Experiments
  • Assumption 1: Elements in one document have no
    effect on elements in other documents. Elements
    other than atl, abs and st also have no effect on
    elements in the same document that are not their
    ancestors.
  • Assumption 2: Elements atl and abs contribute to
    the weight of elements bdy, bm and their child
    elements. An st element contributes to the weight
    of the section it belongs to, and also to that
    section's child elements and the article element.
    All st elements have the same Wf, regardless of
    the level they belong to.
  • Assumption 3: Because computing the parameters
    avel and K1 for every element is complex, we
    assume the article-level values can be used
    for all elements.
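
One possible reading of Assumption 2 in code: the weighted term frequency of an element is its own counts plus Wf times the counts of each field it inherits. The XPath-like path format, the function names, and the convention that `own_tf` excludes the text of the inherited fields are all assumptions of this sketch. The weights are one of the Wf groups reported in this deck.

```python
def inherits_atl_abs(path):
    """True for bdy, bm and their descendants (Assumption 2)."""
    tags = [step.split("[")[0] for step in path.strip("/").split("/")]
    return "bdy" in tags or "bm" in tags

def element_weighted_tf(path, own_tf, atl_tf, abs_tf, st_tf, weights):
    """own_tf: counts inside the element, excluding inherited fields.

    st_tf: counts of the enclosing section title, or {} if there is none.
    """
    tf = dict(own_tf)  # the element's own text counts with weight 1
    inherited = {"st": st_tf}
    if inherits_atl_abs(path):
        inherited.update({"atl": atl_tf, "abs": abs_tf})
    for field, counts in inherited.items():
        for term, count in counts.items():
            tf[term] = tf.get(term, 0) + weights[field] * count
    return tf

weights = {"atl": 15, "abs": 4, "st": 8}  # Wf group (15, 4, 8) from the deck
para_tf = element_weighted_tf(
    "/article[1]/bdy[1]/sec[1]/p[1]",
    own_tf={"xml": 2}, atl_tf={"xml": 1}, abs_tf={"retrieval": 1},
    st_tf={"xml": 1}, weights=weights,
)
abs_own = element_weighted_tf(
    "/article[1]/fm[1]/abs[1]",
    own_tf={"retrieval": 1}, atl_tf={"xml": 1}, abs_tf={"retrieval": 1},
    st_tf={}, weights=weights,
)
```

Under Assumption 3, the resulting element-level tf would then be scored with the article-level avel and K1 rather than element-specific values.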

12
Our Experiments
Experiment Procedure
(1) Select atl, abs and st as the tuned fields.
(2) Use the INEX 04 data set, CO topics (40) and relevance
assessments to tune Wf at document level for
atl, abs and st. We get the peak value at (2356, 4, 22)
for Wf(atl, abs, st). (Metric: average precision)
(3) We select 6 groups of tuned Wf values for INEX 05
retrieval and submission: (2356, 4, 22), (1000, 4, 22)
and (15, 4, 8) for CO.Thorough runs; (1000, 4, 22),
(300, 4, 18) and (98, 4, 13) for CO.FetchBrowse runs.
Note: only article, abs, bdy, bm, bib, section elements
and paragraph elements are treated as retrievable elements.
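
The tuning in step (2) can be pictured as scoring each candidate Wf triple by mean average precision over the training topics and keeping the best one. The sketch below uses the common non-interpolated definition of average precision, which may differ in detail from the INEX 04 evaluation. The `search` callable and both function names are hypothetical stand-ins for the retrieval system.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for one topic."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank  # precision at each relevant document
    return total / len(relevant_ids) if relevant_ids else 0.0

def pick_best_weights(candidates, search, topics):
    """Return the candidate (atl, abs, st) weights with the highest mean AP.

    search(weights, query) -> ranked list of ids (hypothetical interface);
    topics: list of (query, set of relevant ids).
    """
    def mean_ap(weights):
        scores = [average_precision(search(weights, q), rel) for q, rel in topics]
        return sum(scores) / len(scores)
    return max(candidates, key=mean_ap)

ap_example = average_precision(["a", "b", "c"], {"a", "c"})
```

An exhaustive search over three weights is expensive, so in practice the candidate list would likely come from a coarse grid or from tuning one field at a time. The deck does not say which strategy was used.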
13
Results and evaluation
(1) Our CO.Thorough runs do well, especially for
nxCG(25, 50) or ep/gr with strict quantization and
overlap off. With generalized quantization, our runs
perform about average.
(2) Runs using Wf (2356, 4, 22) and (1000, 4, 22) do
better than (15, 4, 8) for CO.Thorough runs.
(3) Results show our method is worth exploring further.
They also show that tuning the selected elements atl, abs
and st is really beneficial.
14
Future work
(1) Tune Wf at element level, not only
at document level.
(2) Investigate parameters such as avel and K1
at element level.
(3) Upgrade our system so that we can submit more
runs and take part in more tasks next year.
15
Thanks!