Transcript and Presenter's Notes

Title: High Accuracy Retrieval from Documents HARD TRACK in TREC2004


1
High Accuracy Retrieval from Documents (HARD)
Track in TREC2004
  • SLIS Research Forum 2004, Ning Yu

2
Background
  • What is the Text REtrieval Conference (TREC)?
  • An annual information retrieval conference
  • Attended by international researchers from
    academic, commercial, and government institutions
    (93 groups from 22 countries in 2003)
  • Each TREC task is called a Track. Tracks vary
    almost every year. (Track Timeline)
  • What is the HARD track?
  • It stands for High Accuracy Retrieval from
    Documents
  • It started in 2003 as an evaluation track
  • Goal: to achieve high-accuracy retrieval from
    documents by leveraging additional information
    about the searcher and/or the search context,
    through techniques such as passage retrieval and
    very targeted interaction with the searcher

3
HARD Track in 2004 - Corpus
  • News from 2003
  • 650k docs (1.6 GB)
  • 8 sources
  • Data format
  • Files are organized by source on a daily basis.
    Each file contains multiple documents identified
    by unique document IDs.
  • In addition, each document has some or all of the
    following components (a parsing sketch follows
    this list):
  • - Keyword (optional), surrounded by <KEYWORD> tags
    - Date/time (optional), surrounded by <DATE_TIME> tags
    - Headline, surrounded by <HEADLINE> tags
    - Main part, surrounded by <TEXT> tags. <P> tags
      are used within this part to identify paragraph
      boundaries.
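A minimal parsing sketch under the assumptions above (the tag names follow the slide; the function and field names are hypothetical, not part of any official HARD tooling):

  import re
  from typing import Optional

  # Sketch: pull the listed components out of one document's raw text.
  def parse_hard_document(doc_text: str) -> dict:
      def grab(tag: str) -> Optional[str]:
          m = re.search(rf"<{tag}>(.*?)</{tag}>", doc_text, re.DOTALL)
          return m.group(1).strip() if m else None

      body = grab("TEXT") or ""
      return {
          "keyword": grab("KEYWORD"),      # optional
          "date_time": grab("DATE_TIME"),  # optional
          "headline": grab("HEADLINE"),
          # <P> tags mark paragraph boundaries inside the <TEXT> part
          "paragraphs": [p.strip() for p in re.split(r"</?P>", body) if p.strip()],
      }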

4
HARD Track in 2004 - Topic
  • Follows the basic TREC style
  • 50 topics
  • Contains metadata that describes the searcher and
    the context of the query
  • Typical topic example (a parsing sketch follows
    the example):
  • <topics>
      <topic>
        <number>Hard-nnn</number>
        <title>Short, few words description of
          topic</title>
        <description>Sentence-length description of
          topic.</description>
        <topic-narrative>Paragraph-length description
          of topic. No mention of restrictions captured
          in the metadata should occur in this section.
          This is intended primarily to help future
          relevance assessors. No specific format is
          required.</topic-narrative>
      </topic>
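A minimal sketch of reading these fields, assuming the topic file is well-formed XML with the element names from the example (load_topics is a hypothetical helper, not track-provided code):

  import xml.etree.ElementTree as ET

  # Sketch: read the topic fields shown in the example above.
  def load_topics(path):
      topics = []
      for topic in ET.parse(path).getroot().findall("topic"):
          topics.append({
              "number": topic.findtext("number"),
              "title": topic.findtext("title"),
              "description": topic.findtext("description"),
              "narrative": topic.findtext("topic-narrative"),
          })
      return topics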

5
HARD Track in 2004 - Metadata
  • HARD topics are distinguished from the basic TREC
    style by being annotated with metadata
  • Five metadata items this year
  • Familiarity: little, much
  • Genre: news-report, opinion-editorial, other, any
  • Geography: US, non-US
  • Subject: sports, science, economics, etc.
    (distribution chart)
  • Related text: on-topic, relevant text
  • Three levels of relevance
  • Off-topic
  • On-topic: satisfies the topic but does not satisfy
    some of the metadata
  • Relevant: satisfies both the topic and the
    metadata

6
HARD Track in 2004 - Clarification Form
  • A clarification form allows participants to get
    additional information from the searcher
  • The maximum time the assessor can spend on the
    form per topic is 3 minutes
  • Strict rules govern the form design
  • Participants can submit up to 2 clarification
    forms per topic

7
HARD Track in 2004 - Results Submission and
Evaluation
  • Results submission
  • 1. Baseline Run
  • 2. Clarification Form
  • 3. Final Run
  • Evaluation (see the sketch after this list)
  • 1. SOFT-DOC is the most generous measure:
    ON-TOPIC documents are considered relevant.
  • 2. HARD-DOC is the same, but only RELEVANT
    documents are considered. This measure tests
    whether sites are able to leverage the metadata
    information.
  • 3. SOFT-PSG is the passage-level version of
    SOFT-DOC.
  • 4. HARD-PSG is the most stringent
    evaluation, where only indicated passages (where
    appropriate) of RELEVANT documents are
    considered.
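As a rough illustration of the document-level measures, the only difference between SOFT-DOC and HARD-DOC is which judgment levels count as positive. The helper below is a hypothetical sketch, not the official evaluation script:

  # Sketch: which judged documents count as positive under each measure.
  # judgments maps doc_id -> "OFF-TOPIC" | "ON-TOPIC" | "RELEVANT".
  def positive_docs(judgments, hard=False):
      accepted = {"RELEVANT"} if hard else {"RELEVANT", "ON-TOPIC"}
      return {doc_id for doc_id, level in judgments.items() if level in accepted}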

8
INDU in HARD04 - Overview
  • Participants
  • Chris Friend, Ning Yu, Kiduk Yang
  • Research Methods
  • 1. A fusion approach that combines
    different retrieval techniques as well as data
    resources to optimize the retrieval system.
  • 2. A dynamic web-based tuning interface to
    improve tuning of a retrieval system with
    multiple variables.

9
INDU in HARD04 - System Architecture
10
INDU in HARD04 - Metadata Strategy
  • Geography
  • -- Create US and non-US location lexicons by
    querying Yahoo! and other web resources, and
    judge the ambiguous duplicates (e.g. Paris,
    Vancouver)
  • -- Search for geography cues in the first line
    of the news story or in the keywords field
  • Genre
  • -- Opinion-editorial documents are identified by
    a high proportion of quoted strings (single and
    double quotes)
  • -- No explicit cue for news-report and other,
    though
  • Familiarity
  • -- Create a rare-word lexicon
  • -- Score documents by (rare words / total words)
  • Subject
  • -- Create a subject lexicon for each subject
    value by querying Yahoo! categories and WordNet
    hyponyms ("is a kind of" the subject)
  • -- Search for cues in the keyword field of the
    documents
  • All of the above metadata are considered in
    post-retrieval re-ranking (a sketch of the
    familiarity and genre cues follows this list).
  • Related text
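A rough sketch of the familiarity and genre heuristics just described; the rare-word lexicon and the exact thresholds are assumptions, and only double-quoted strings are counted here for simplicity:

  import re

  # Sketch of two re-ranking cues described above.
  def familiarity_score(tokens, rare_words):
      # Familiarity cue: fraction of document tokens found in a rare-word lexicon.
      if not tokens:
          return 0.0
      return sum(1 for t in tokens if t.lower() in rare_words) / len(tokens)

  def opinion_editorial_score(text):
      # Genre cue: proportion of the text inside quoted strings
      # (only double quotes handled here for simplicity).
      quoted_chars = sum(len(m) for m in re.findall(r'"[^"]*"', text))
      return quoted_chars / max(len(text), 1)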

11
INDU in HARD04 - Query Expansion
  • We believe that acronyms, nouns, and noun phrases
    in the title of each topic are more descriptive
    than other words, so we expand the query by
    repeating those terms once (see the sketch after
    this list).
  • Synsets and definitions are used to expand rare
    words (e.g. Cryptozoology) or new words (e.g.
    Weblog). Because this approach brings a lot of
    noise, we did not use it to expand the query
    directly. Instead, we presented the candidate
    terms to the user and let them choose the proper
    ones for us.
  • Pseudo relevance feedback really hurt retrieval
    performance; we still have to figure out why.
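A minimal sketch of the title-term emphasis, using NLTK as an assumed stand-in for whatever tagger the actual system used (noun-phrase chunking is omitted for brevity):

  import nltk  # requires the 'punkt' tokenizer and 'averaged_perceptron_tagger' data

  # Sketch: repeat the title's acronyms and nouns once so they get extra
  # weight in the bag-of-words query.
  def expand_query(title, description):
      tagged = nltk.pos_tag(nltk.word_tokenize(title))
      emphasized = [w for w, tag in tagged
                    if tag.startswith("NN")            # nouns
                    or (w.isupper() and len(w) > 1)]   # crude acronym test
      return nltk.word_tokenize(title + " " + description) + emphasized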

12
INDU in HARD04 - Dynamic Tuning Interface
  • The Dynamic Tuning Interface is a web-based
    interface that facilitates retrieval system
    tuning by providing an easy, visual way to
    identify the best combination of variables.
  • What does it look like?
  • We believe that this is a unique feature and
    could be a contribution to tuning retrieval
    systems with multiple variables.

13
INDU in HARD04 - Passage Retrieval
  • Each sentence (really a paragraph, since it is a
    line of text ending in </P>) is scored by summing
    the products of term frequency and term weight
    for each term that occurs in the topic.
  • The sentence (paragraph) with the highest score
    is chosen for use in passage retrieval AND for
    use in the clarification form for that topic.
  • The Okapi term weight is applied (see the sketch
    after this list).
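A sketch of that scoring rule; the slide does not give the exact Okapi formula used, so the idf form below is an assumption:

  import math

  # Sketch: score each <P> paragraph by summing tf * term_weight over topic
  # terms, with a BM25/Okapi-style idf as the term weight (exact form assumed).
  def okapi_weight(df, num_docs):
      return math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)

  def score_paragraph(tokens, topic_terms, df, num_docs):
      score = 0.0
      for term in topic_terms:
          tf = tokens.count(term)
          if tf:
              score += tf * okapi_weight(df.get(term, 0), num_docs)
      return score

  # The highest-scoring paragraph is used both for passage retrieval and
  # in the clarification form for that topic.
  def best_paragraph(paragraphs, topic_terms, df, num_docs):
      return max(paragraphs,
                 key=lambda p: score_paragraph(p, topic_terms, df, num_docs))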

14
INDU in HARD04 - Results Submission
  • Baseline Submission
  • -- wdvqlz1: VSM weight, long query, acronym and
    noun, combo stemmer
  • -- wdvqlz: VSM weight, long query, acronym and
    noun, simple stemmer
  • -- wdoqlz: Okapi weight, long query, acronym and
    noun, simple stemmer
  • -- wdoqlz1: Okapi weight, long query, acronym and
    noun, combo stemmer
  • Clarification Form
  • -- INDU1: synsets and top sentences
  • -- INDU2: noun phrases from related text and top
    sentences
  • Final Submission
  • -- wdvqlzcf1: CF query expansion
  • -- wdvqlzp: passage retrieval

15
Future Work
  • Improve the dynamic tuning interface (e.g. add a
    history function)
  • How to properly use stopping and stemming? Which
    stemmer to choose (plural, simple, or combo)?
    Where to stem (not acronyms, not proper names)?
    When to stem (pre-stop -> stem -> field-specific
    post-stop)?
  • Implement phrase matching in the post-retrieval
    stage (match top relevant docs against acronyms,
    proper phrases, and noun phrases)
  • Find out the reason for some odd performance
    results:
  • -- VSM beats the Okapi weight, which normally
    works much better
  • -- Applying different stemmers to query and
    document indexing turns out to perform better
    than keeping the stemmer consistent
  • -- Pseudo feedback really hurt the results in the
    HARD case but not in the ROBUST case
  • -- etc.

16
HARD resources
  • HARD04 Guideline
  • HARD04 Overview
  • Task-Specific Query Expansion (MultiText
    Experiments for TREC 2003)
  • Rutgers' HARD and Web Interactive Track
    Experiments at TREC 2003
  • TREC 2003 Robust, HARD and QA Track Experiments
    using PIRCS

17
Question Time
  • 5 minutes

18
Thank You.
19
TREC TIMELINE
20
Subject Distribution