Transcript and Presenter's Notes

Title: High Accuracy Retrieval from Documents HARD TRACK in TREC2004


1
High Accuracy Retrieval from Documents (HARD)
Track in TREC2004
  • SLIS Research Forum 2004, Ning Yu

2
Background
  • What is the Text REtrieval Conference (TREC)?
  • An annual information retrieval conference
  • Attended by international researchers from
    academic, commercial, and government institutions
    (93 groups from 22 countries in 2003)
  • Each TREC task is called a Track. Tracks vary
    almost every year. (Track Timeline)
  • What is the HARD track?
  • It stands for High Accuracy Retrieval from
    Documents
  • It started in 2003 as an evaluation track
  • Goal: to achieve high-accuracy retrieval from
    documents by leveraging additional information
    about the searcher and/or the search context,
    through techniques such as passage retrieval and
    very targeted interaction with the searcher

3
HARD Track in 2004 - Corpus
  • News from 2003
  • 650k docs (1.6 GB)
  • 8 sources
  • Data format
  • Files are organized by source on a daily basis.
    Each file contains multiple documents identified
    by unique document IDs.
  • In addition, each document has some or all of the
    following components (a parsing sketch follows
    this list):
  • - Keyword (optional), surrounded by <KEYWORD> tags
    - Date/time (optional), surrounded by <DATE_TIME> tags
    - Headline, surrounded by <HEADLINE> tags
    - Main part, surrounded by <TEXT> tags. <P> tags
      are used within this part to identify paragraph
      boundaries.
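A minimal parsing sketch under the assumptions above (the tag names follow the slide; the function and field names are hypothetical, not part of any official HARD tooling):

  import re
  from typing import Optional

  # Sketch: pull the listed components out of one document's raw text.
  def parse_hard_document(doc_text: str) -> dict:
      def grab(tag: str) -> Optional[str]:
          m = re.search(rf"<{tag}>(.*?)</{tag}>", doc_text, re.DOTALL)
          return m.group(1).strip() if m else None

      body = grab("TEXT") or ""
      return {
          "keyword": grab("KEYWORD"),      # optional
          "date_time": grab("DATE_TIME"),  # optional
          "headline": grab("HEADLINE"),
          # <P> tags mark paragraph boundaries inside the <TEXT> part
          "paragraphs": [p.strip() for p in re.split(r"</?P>", body) if p.strip()],
      }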

4
HARD Track in 2004 - Topic
  • Follows the basic TREC style
  • 50 topics
  • Contains metadata that describes the searcher and
    the context of the query
  • Typical topic example (a parsing sketch follows
    the example):
  • <topics>
      <topic>
        <number>Hard-nnn</number>
        <title>Short, few words description of
          topic</title>
        <description>Sentence-length description of
          topic.</description>
        <topic-narrative>Paragraph-length description
          of topic. No mention of restrictions captured
          in the metadata should occur in this section.
          This is intended primarily to help future
          relevance assessors. No specific format is
          required.</topic-narrative>
      </topic>
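A minimal sketch of reading these fields, assuming the topic file is well-formed XML with the element names from the example (load_topics is a hypothetical helper, not track-provided code):

  import xml.etree.ElementTree as ET

  # Sketch: read the topic fields shown in the example above.
  def load_topics(path):
      topics = []
      for topic in ET.parse(path).getroot().findall("topic"):
          topics.append({
              "number": topic.findtext("number"),
              "title": topic.findtext("title"),
              "description": topic.findtext("description"),
              "narrative": topic.findtext("topic-narrative"),
          })
      return topics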

5
HARD Track in 2004 - Metadata
  • HARD topics are distinguished from the basic TREC
    style by being annotated with metadata
  • Five metadata items this year
  • Familiarity: little, much
  • Genre: news-report, opinion-editorial, other, any
  • Geography: US, non-US
  • Subject: sports, science, economics, etc.
    (distribution chart)
  • Related text: on-topic, relevant text
  • Three levels of relevance
  • Off-topic
  • On-topic: satisfies the topic but does not satisfy
    some of the metadata
  • Relevant: satisfies both the topic and the
    metadata

6
HARD Track in 2004 - Clarification Form
  • A clarification form allows participants to get
    additional information from the searcher
  • The maximum time the assessor can spend on the
    form per topic is 3 minutes
  • Strict rules govern the form design
  • Participants can submit up to 2 clarification
    forms per topic

7
HARD Track in 2004 - Results Submission and
Evaluation
  • Results submission
  • 1. Baseline Run
  • 2. Clarification Form
  • 3. Final Run
  • Evaluation (see the sketch after this list)
  • 1. SOFT-DOC is the most generous measure:
    ON-TOPIC documents are considered relevant.
  • 2. HARD-DOC is the same, but only RELEVANT
    documents are considered. This measure tests
    whether sites are able to leverage the metadata
    information.
  • 3. SOFT-PSG is the passage-level version of
    SOFT-DOC.
  • 4. HARD-PSG is the most stringent
    evaluation, where only indicated passages (where
    appropriate) of RELEVANT documents are
    considered.
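As a rough illustration of the document-level measures, the only difference between SOFT-DOC and HARD-DOC is which judgment levels count as positive. The helper below is a hypothetical sketch, not the official evaluation script:

  # Sketch: which judged documents count as positive under each measure.
  # judgments maps doc_id -> "OFF-TOPIC" | "ON-TOPIC" | "RELEVANT".
  def positive_docs(judgments, hard=False):
      accepted = {"RELEVANT"} if hard else {"RELEVANT", "ON-TOPIC"}
      return {doc_id for doc_id, level in judgments.items() if level in accepted}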

8
INDU in HARD04 - Overview
  • Participants
  • Chris Friend, Ning Yu, Kiduk Yang
  • Research Methods
  • 1. A fusion approach that combines
    different retrieval techniques as well as data
    resources to optimize the retrieval system.
  • 2. A dynamic web-based tuning interface to
    improve tuning of a retrieval system with
    multiple variables.

9
INDU in HARD04 - System Architecture
10
INDU in HARD04 - Metadata Strategy
  • Geography
  • -- Create US and non-US location lexicons by
    querying Yahoo! and other web resources, and
    judge the ambiguous duplicates (e.g. Paris,
    Vancouver)
  • -- Search for geography cues in the first line
    of the news story or in the keywords field
  • Genre
  • -- Opinion-editorial documents are identified by
    a high proportion of quoted strings (single and
    double quotes)
  • -- No explicit cue for news-report and other,
    though
  • Familiarity
  • -- Create a rare-word lexicon
  • -- Score documents by (rare words / total words)
  • Subject
  • -- Create a subject lexicon for each subject
    value by querying Yahoo! categories and WordNet
    hyponyms ("is a kind of" the subject)
  • -- Search for cues in the keyword field of the
    documents
  • All of the above metadata are considered in
    post-retrieval re-ranking (a sketch of the
    familiarity and genre cues follows this list).
  • Related text
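A rough sketch of the familiarity and genre heuristics just described; the rare-word lexicon and the exact thresholds are assumptions, and only double-quoted strings are counted here for simplicity:

  import re

  # Sketch of two re-ranking cues described above.
  def familiarity_score(tokens, rare_words):
      # Familiarity cue: fraction of document tokens found in a rare-word lexicon.
      if not tokens:
          return 0.0
      return sum(1 for t in tokens if t.lower() in rare_words) / len(tokens)

  def opinion_editorial_score(text):
      # Genre cue: proportion of the text inside quoted strings
      # (only double quotes handled here for simplicity).
      quoted_chars = sum(len(m) for m in re.findall(r'"[^"]*"', text))
      return quoted_chars / max(len(text), 1)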

11
INDU in HARD04 - Query Expansion
  • We believe that acronyms, nouns, and noun phrases
    in the title of each topic are more descriptive
    than other words, so we expand the query by
    repeating those terms once (see the sketch after
    this list).
  • Synsets and definitions are used to expand rare
    words (e.g. Cryptozoology) or new words (e.g.
    Weblog). Because this approach brings a lot of
    noise, we did not use it to expand the query
    directly. Instead, we presented the candidate
    terms to the user and let them choose the proper
    ones for us.
  • Pseudo relevance feedback really hurt retrieval
    performance; we still have to figure out why.
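A minimal sketch of the title-term emphasis, using NLTK as an assumed stand-in for whatever tagger the actual system used (noun-phrase chunking is omitted for brevity):

  import nltk  # requires the 'punkt' tokenizer and 'averaged_perceptron_tagger' data

  # Sketch: repeat the title's acronyms and nouns once so they get extra
  # weight in the bag-of-words query.
  def expand_query(title, description):
      tagged = nltk.pos_tag(nltk.word_tokenize(title))
      emphasized = [w for w, tag in tagged
                    if tag.startswith("NN")            # nouns
                    or (w.isupper() and len(w) > 1)]   # crude acronym test
      return nltk.word_tokenize(title + " " + description) + emphasized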

12
INDU in HARD04 - Dynamic Tuning Interface
  • The Dynamic Tuning Interface is a web-based
    interface that facilitates retrieval system
    tuning by providing an easy, visual way to
    identify the best combination of variables.
  • What does it look like?
  • We believe that this is a unique feature and
    could be a contribution to tuning retrieval
    systems with multiple variables.

13
INDU in HARD04 - Passage Retrieval
  • Each sentence (really a paragraph, since it is a
    line of text ending in </P>) is scored by summing
    the products of term frequency and term weight
    for each term that occurs in the topic.
  • The sentence (paragraph) with the highest score
    is chosen for use in passage retrieval AND for
    use in the clarification form for that topic.
  • The Okapi term weight is applied (see the sketch
    after this list).
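A sketch of that scoring rule; the slide does not give the exact Okapi formula used, so the idf form below is an assumption:

  import math

  # Sketch: score each <P> paragraph by summing tf * term_weight over topic
  # terms, with a BM25/Okapi-style idf as the term weight (exact form assumed).
  def okapi_weight(df, num_docs):
      return math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)

  def score_paragraph(tokens, topic_terms, df, num_docs):
      score = 0.0
      for term in topic_terms:
          tf = tokens.count(term)
          if tf:
              score += tf * okapi_weight(df.get(term, 0), num_docs)
      return score

  # The highest-scoring paragraph is used both for passage retrieval and
  # in the clarification form for that topic.
  def best_paragraph(paragraphs, topic_terms, df, num_docs):
      return max(paragraphs,
                 key=lambda p: score_paragraph(p, topic_terms, df, num_docs))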

14
INDU in HARD04 - Results Submission
  • Baseline Submission
  • -- wdvqlz1: VSM weight, long query, acronym and
    noun, combo stemmer
  • -- wdvqlz: VSM weight, long query, acronym and
    noun, simple stemmer
  • -- wdoqlz: Okapi weight, long query, acronym and
    noun, simple stemmer
  • -- wdoqlz1: Okapi weight, long query, acronym and
    noun, combo stemmer
  • Clarification Form
  • -- INDU1: synsets and top sentences
  • -- INDU2: noun phrases from related text and top
    sentences
  • Final Submission
  • -- wdvqlzcf1: CF query expansion
  • -- wdvqlzp: passage retrieval

15
Future Work
  • Improve the dynamic tuning interface (e.g. add a
    history function)
  • How to properly use stopping and stemming? Which
    stemmer to choose (plural, simple, or combo)?
    Where to stem (not acronyms, not proper names)?
    When to stem (pre-stop -> stem -> field-specific
    post-stop)?
  • Implement phrase matching in the post-retrieval
    stage (match top relevant docs against acronyms,
    proper phrases, and noun phrases)
  • Find out the reason for some odd performance
    results:
  • -- VSM beats the Okapi weight, which normally
    works much better
  • -- Applying different stemmers to query and
    document indexing turns out to perform better
    than keeping the stemmer consistent
  • -- Pseudo feedback really hurt the results in the
    HARD case but not in the ROBUST case
  • -- etc.

16
HARD resources
  • HARD04 Guideline
  • HARD04 Overview
  • Task-Specific Query Expansion (MultiText
    Experiments for TREC 2003)
  • Rutgers' HARD and Web Interactive Track
    Experiments at TREC 2003
  • TREC 2003 Robust, HARD and QA Track Experiments
    using PIRCS

17
Question Time
  • 5 minutes

18
Thank You.
19
TREC TIMELINE
20
Subject Distribution