Title: An Information Retrieval Approach based on Discourse Type
1An Information Retrieval Approach based on
Discourse Type
NLDB 2006
- D. Y. Wang, R. W. P. Luk, K. F. Wong (1) and K. L. Kwok (2)
Department of Computing, The Hong Kong Polytechnic University
(1) Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong
(2) Department of Computer Science, City University of New York
2Content
- Introduction
- Motivation
- Discourse Type
- Information Unit
- Problem Formulation
- Score of topic terms
- Score of discourse type
- Document Re-ranking
- Experimental Results
- Conclusion
3Motivation
- The effectiveness of information retrieval (IR) systems varies substantially from one topic to another.
- One reason: users' information needs are very diverse.
- Our approach: find the discourse type of the topic and adopt an appropriate strategy.
4Discourse Type
- Definition of discourse type
- The functions (including properties and
relations that cannot exist independently) of the
independent entities
5Performance Difference
Average 0.2768
6Why Choose Advantage / Disadvantage as our
example?
- Its performance is worse than the average
- 0.204 vs. 0.277
- It is relatively abstract and is therefore unlikely to have been investigated before.
- Compared with concrete things (e.g., people, country)
- It is related to some cue phrases (e.g., "more than") that are composed of stop words.
- Conventional IR ignores stop words
7Why Choose Advantage / Disadvantage as example?
(cont.)
- It is a popular discourse type of information need.
- We found at least 40 questions asking about the advantages and disadvantages of something at one website (http://www.answerbag.com).
- It has a reasonable number (i.e., eight) of TREC topics for investigation.
- See next slide
8Eight Queries with discourse type Advantage /
Disadvantage
9Information Unit (IU)
Figure: a document containing the topic terms term1 and term2. An IU is the window of w words on each side of a topic term t.
10Why IU?
- Assumption: terms inside an IU (around topic terms) are more important to the relevance of a document than terms outside the IU
- Simplify the processing of the documents
- Compute a score for each IU
- Aggregate the scores of all IUs as the score of the document
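The IU idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `extract_ius` and `document_score` are hypothetical, the window is taken as w tokens on each side of every topic-term occurrence, and summation is assumed as the aggregation step because the slides do not say which aggregate is used.

```python
from typing import Callable, List, Sequence, Set


def extract_ius(tokens: Sequence[str], topic_terms: Set[str], w: int) -> List[List[str]]:
    """Return one IU per topic-term occurrence: up to w tokens on each side."""
    ius = []
    for i, tok in enumerate(tokens):
        if tok in topic_terms:
            start = max(0, i - w)  # clip the window at the document start
            ius.append(list(tokens[start:i + w + 1]))
    return ius


def document_score(
    tokens: Sequence[str],
    topic_terms: Set[str],
    w: int,
    iu_score: Callable[[List[str]], float],
) -> float:
    """Aggregate IU scores into a document score (summation is an assumption)."""
    return sum(iu_score(iu) for iu in extract_ius(tokens, topic_terms, w))


# Made-up document and topic term, for illustration only.
doc = "smoking has more risks than benefits but some people defend smoking".split()
ius = extract_ius(doc, {"smoking"}, w=2)
```

Any per-IU scoring function can be passed as `iu_score`; the text outside the windows is never inspected, which is the simplification the slide refers to.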
11Score of Topic Terms
- sumtf: 4
- Dtf: 3 (d: distinct)
- Graph-based model
- at S3: 1/1, 1/5, 1/3
- at S4: 1/5, 1/3
- Figure labels: 1, 5, 3
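One possible reading of the first two quantities, offered as an assumption rather than as the authors' definition: sumtf is the total number of topic-term occurrences inside an IU, and Dtf is the number of distinct topic terms that occur in it. The helper name `topic_term_counts` and the sample IU are hypothetical. The graph-based model is not sketched because the transcript does not preserve enough of it.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def topic_term_counts(iu_tokens: Iterable[str], topic_terms: Set[str]) -> Tuple[int, int]:
    """Return (sumtf, dtf) for one IU under the interpretation assumed above."""
    counts = Counter(t for t in iu_tokens if t in topic_terms)
    sumtf = sum(counts.values())  # total occurrences of topic terms
    dtf = len(counts)             # number of distinct topic terms present
    return sumtf, dtf


# Made-up IU: "hydrogen" occurs twice, "fuel" and "car" once each.
iu = ["hydrogen", "fuel", "car", "runs", "on", "hydrogen"]
sumtf, dtf = topic_term_counts(iu, {"hydrogen", "fuel", "car"})
```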
12Example Score of Discourse Type
- more (comparative words): 3
- support: 'back', 'confirm', 'contest', 'contrari', 'defend', 'encourag', 'endors', 'object', 'oppon', 'oppos', 'opposit', 'prove', 'quibbl', 'refer', 'sponsor', 'support' (from www.answers.com)
- support: 2
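A rough sketch of how cue words for the discourse type could be counted inside an IU. It assumes the IU tokens have already been stemmed (the word list on the slide appears to be stemmed), and it only counts matches; how the counts are turned into a score is not stated in the slides. The names `COMPARATIVE_WORDS`, `SUPPORT_STEMS` and `cue_counts` are hypothetical, and the comparative set contains only the single word given on the slide.

```python
from typing import Dict, Iterable

# Only "more" is given on the slide; a real list would be longer.
COMPARATIVE_WORDS = {"more"}

# Stems listed on the slide for "support".
SUPPORT_STEMS = {
    "back", "confirm", "contest", "contrari", "defend", "encourag",
    "endors", "object", "oppon", "oppos", "opposit", "prove",
    "quibbl", "refer", "sponsor", "support",
}


def cue_counts(stemmed_iu: Iterable[str]) -> Dict[str, int]:
    """Count comparative words and support-related stems in a stemmed IU."""
    tokens = list(stemmed_iu)
    return {
        "comparative": sum(1 for t in tokens if t in COMPARATIVE_WORDS),
        "support": sum(1 for t in tokens if t in SUPPORT_STEMS),
    }


# Made-up, already-stemmed IU.
iu = "more peopl support the plan and more oppon want more".split()
counts = cue_counts(iu)
```

Note that "more" is usually a stop word, so a pipeline that removes stop words before this step would never see it.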
13Documents Re-ranking
- IU score before re-ranking: S0
- S0: similarity score of the document that contains the IU
- IU re-ranking score: S
- S combines S0 with the score of topic terms
- S combines S0 with the score of discourse type
- S combines S0 with the score of topic terms and the score of discourse type
- Aggregate the re-ranking scores of all IUs in a document as the final score of the document.
- Re-rank the documents by the final score.
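The re-ranking steps above can be sketched as follows. Two things are assumptions, because this transcript does not preserve them: S0 is combined with the other scores by multiplication, and IU scores are aggregated by summation. The function name `rerank` and the input layout (document id mapped to S0 and a list of per-IU score pairs) are hypothetical.

```python
from typing import Dict, List, Tuple

# doc_id -> (S0, [(topic_term_score, discourse_type_score), ...] per IU)
DocInput = Dict[str, Tuple[float, List[Tuple[float, float]]]]


def rerank(docs: DocInput) -> List[Tuple[str, float]]:
    """Return (doc_id, final_score) pairs sorted by descending final score."""
    final_scores = []
    for doc_id, (s0, iu_scores) in docs.items():
        # Assumed form: S = S0 * topic score * discourse score, summed over IUs.
        final = sum(s0 * topic * discourse for topic, discourse in iu_scores)
        final_scores.append((doc_id, final))
    return sorted(final_scores, key=lambda pair: pair[1], reverse=True)


# Made-up scores for two documents.
docs = {
    "d1": (0.8, [(1.0, 0.0), (2.0, 1.0)]),
    "d2": (0.5, [(2.0, 3.0)]),
}
ranked = rerank(docs)
```

Under these assumptions a document with a lower initial similarity score can overtake a higher-scoring one when its IUs contain both topic terms and discourse cues, as d2 does here.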
14Re-ranking Results in MAP
15Conclusion
- Re-ranking based on topic terms and re-ranking based on discourse type can each improve retrieval performance.
- Combining the two improves the results the most, and the improvement is significant (at the 95% confidence level, already taking the sample size into account).
- This approach is promising and worth further investigation.
Acknowledgement: We thank the Center for Intelligent Information Retrieval, University of Massachusetts, for helping Robert Luk develop the basic IR system while he was on leave there. This work is supported by the CERG Project PolyU 5226/05E.