Title: Exploring Redundancy in Question Answering
1Exploring Redundancy in Question Answering
Charles L. A. Clarke, Gordon V. Cormack, Thomas R. Lynam
Group 4: Kuljit Singh, Lesley Ponneri, Phebu George, Pradeep Nayar
2Question Answering
- Find concise answers for short questions
- Exploiting Redundancy in Question Answering
Example Question: Who is the President of the USA?
Answer: George Bush
3What is QA
- A typical QA task requires concise answers to short questions
- A large target corpus is used as the source for these answers
- Answers take the form of values, names, phrases, sentences or brief text fragments
Unlike an IR system, which retrieves a full-length document, a QA system retrieves a paragraph or a text fragment
4Architecture
[Architecture diagram: Question -> Parsing -> Query -> Passage Retrieval (over the Corpus) -> Passages -> Answer Selection (using Selection Rules) -> Answers]
5Components
- Parser
The parser analyzes the question to generate two types of information:
1) A query for submission to the passage retrieval component
2) A set of selection rules for extracting answers of the expected type, such as person, monetary value or date
- Passage Retrieval
This component executes the query over the target corpus, retrieving a ranked list of the top k passages for further analysis by the answer selection component
6Components
- Answer Selection
This component identifies possible answers in the passages and then ranks them using a variety of heuristics
The heuristics take into account:
1) The number of times a candidate occurs in the retrieved passages
2) Its location
3) Other special-case information provided by the selection rules
7Passage Retrieval
Each document in the corpus is treated as an ordered sequence of terms D = d1, d2, d3, ..., dm
A query is generated from the question and takes the form of a set of terms Q = {q1, q2, q3, ...}
A term set T is a subset of Q, e.g. T = {q2, q3}
A passage from D is represented as an extent (u,v), an ordered pair of coordinates with 1 ≤ u ≤ v ≤ m
An extent (u,v) satisfies a term set T ⊆ Q if the subsequence of D defined by the extent contains at least one occurrence of each of the terms in T
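The definition of an extent satisfying a term set can be sketched in a few lines of Python. This is only an illustration: the function name `satisfies` and the whitespace tokenization are assumptions, not part of the original system.

```python
def satisfies(doc, u, v, term_set):
    """Return True if extent (u, v) of doc contains every term in term_set.

    doc is a list of terms; u and v are 1-based, inclusive coordinates,
    matching the slide's convention that u and v lie between 1 and m.
    """
    if not (1 <= u <= v <= len(doc)):
        raise ValueError("extent out of range")
    window = set(doc[u - 1:v])          # terms d_u .. d_v
    return set(term_set) <= window      # every term of T occurs in the window


# Hypothetical six-term document, tokenized on whitespace.
doc = "web services software will allow developers".split()
print(satisfies(doc, 1, 3, {"services", "software"}))  # True
print(satisfies(doc, 3, 6, {"services", "software"}))  # False: "services" is missing
```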
8Example
Suppose we have the passage: "Microsoft's new Web services software will allow developers to create secure applications more easily and screen out the kind of unauthorized commands that are commonly used by malicious hackers."
The query run is "Web services software", where q1, q2, q3 are Web, services and software
Let the term set T be {services, software}
The extent "services software" (dotted on the slide) is a cover for T: the subsequence of D defined by the extent contains at least one occurrence of each of the terms in T, and it contains no subsequence that also satisfies T
9Passage Retrieval
An extent(u,v) is a cover for T if (u,v)
satisfies T and the subsequence corresponding to
(u,v) contains no subsequence that also satisfies
T
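One way to enumerate the covers of a term set in a single left-to-right pass is sketched below. The slides do not describe the authors' algorithm, so this is an assumed approach: track the most recent position of each term, and emit an extent whenever its start position moves to the right. The name `find_covers` is hypothetical.

```python
def find_covers(doc, term_set):
    """Return all covers of term_set in doc as 1-based (u, v) pairs."""
    terms = set(term_set)
    last_seen = {}       # term -> most recent 1-based position
    covers = []
    prev_start = 0       # start of the last emitted cover
    for v, token in enumerate(doc, start=1):
        if token not in terms:
            continue     # a cover must end on a term of T
        last_seen[token] = v
        if len(last_seen) < len(terms):
            continue     # not every term has been seen yet
        u = min(last_seen.values())   # latest start that still satisfies T
        if u > prev_start:            # otherwise an earlier cover is nested inside
            covers.append((u, v))
            prev_start = u
    return covers


# Opening of the example passage, lowercased and split on whitespace.
tokens = "microsoft's new web services software will allow developers".split()
print(find_covers(tokens, {"services", "software"}))         # [(4, 5)]
print(find_covers(tokens, {"web", "services", "software"}))  # [(3, 5)]
```

Each emitted extent satisfies T, and no shorter extent inside it does, which is the definition of a cover given above.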
Estimating the probability that an extent (u,v) of length l contains all the terms from T leads to the score
$\sum_{t \in T} \log(N / f_t) - |T| \log(l)$ .... eqn 1
f_t is the total number of times t appears in the target corpus
N is the total length of all the documents
l is the length of the extent
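Eqn 1 can be evaluated directly once the corpus statistics are known. The sketch below assumes that the extent length is l = v - u + 1 and that the term frequencies are supplied in a dictionary; the function name and the sample numbers are made up for illustration.

```python
import math


def extent_score(term_set, u, v, corpus_freq, corpus_length):
    """Score an extent (u, v) for term_set following eqn 1.

    corpus_freq[t] is f_t, corpus_length is N.
    Assumption: the extent length is l = v - u + 1.
    """
    length = v - u + 1
    rarity = sum(math.log(corpus_length / corpus_freq[t]) for t in term_set)
    return rarity - len(term_set) * math.log(length)


# Hypothetical statistics: N = 1000 terms, "services" seen 10 times, "software" 100 times.
freq = {"services": 10, "software": 100}
short = extent_score({"services", "software"}, 4, 5, freq, 1000)
longer = extent_score({"services", "software"}, 4, 13, freq, 1000)
print(short > longer)  # True: shorter extents with the same terms score higher
```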
10Passage Retrieval
Eqn 1 assigns a higher score to a passage whose probability of occurrence is lower
For a given query Q, generate all covers for all subsets of Q and score them using eqn 1
All but the highest-ranking passage from each document are discarded, and the top k are used for further analysis
Implementation of this technique depends on a
fast algorithm that computes all covers
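The filtering step described above (keep the best passage per document, then take the top k) might look like the following. The tuple layout `(doc_id, extent, score)` and the function name are assumptions made for this sketch.

```python
def top_k_passages(scored_covers, k):
    """Keep the best-scoring cover per document, then return the top k.

    scored_covers is an iterable of (doc_id, (u, v), score) tuples.
    """
    best_per_doc = {}
    for doc_id, extent, score in scored_covers:
        current = best_per_doc.get(doc_id)
        if current is None or score > current[2]:
            best_per_doc[doc_id] = (doc_id, extent, score)
    ranked = sorted(best_per_doc.values(), key=lambda p: p[2], reverse=True)
    return ranked[:k]


# Hypothetical scored covers from three documents.
covers = [
    ("d1", (4, 5), 5.5),
    ("d1", (3, 5), 7.1),
    ("d2", (10, 30), 2.0),
    ("d3", (1, 2), 6.0),
]
print(top_k_passages(covers, 2))  # [('d1', (3, 5), 7.1), ('d3', (1, 2), 6.0)]
```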
11Answer Selection
- The goal of the TREC-9 QA experiments was to select answer fragments of length 50 to 250 bytes
- Candidates are single terms, chosen according to the category of the question
- If the question asks for a proper noun, the candidates consist of those terms that match a simple syntactic pattern for a proper noun
- If the question asks for a length, the candidates consist of those numeric values that precede appropriate units
- If the question is not classified, the candidates consist of all the non-query, non-stop-word terms appearing in the retrieved passages
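The category-dependent candidate rules can be illustrated with simple regular expressions. The slides do not give the actual patterns, so the patterns, the unit list and the stop-word list below are placeholders chosen for this sketch only.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the authors' list).
STOP_WORDS = {"the", "a", "an", "of", "is", "in", "to", "and", "by", "was"}
# Assumed "simple syntactic pattern": a single capitalized word.
PROPER_NOUN = re.compile(r"\b[A-Z][a-z]+\b")
# Assumed units for length questions.
LENGTH_VALUE = re.compile(r"\b(\d+(?:\.\d+)?)\s+(?:miles|feet|meters|km|inches)\b")


def candidates(passage, category, query_terms=()):
    """Return candidate answer terms from a passage for a question category."""
    if category == "proper_noun":
        return PROPER_NOUN.findall(passage)
    if category == "length":
        return LENGTH_VALUE.findall(passage)   # numeric values preceding a unit
    # Unclassified question: all non-query, non-stop-word terms.
    excluded = STOP_WORDS | {q.lower() for q in query_terms}
    words = re.findall(r"[A-Za-z0-9']+", passage)
    return [w for w in words if w.lower() not in excluded]


print(candidates("George Bush was elected", "proper_noun"))   # ['George', 'Bush']
print(candidates("The bridge is 4200 feet long", "length"))   # ['4200']
print(candidates("The tower was built in Paris", "other", ["tower"]))  # ['built', 'Paris']
```

Note that the capitalized-word pattern also matches sentence-initial words, so a real system would need a stricter rule.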
12Answer Selection
Each candidate term is assigned a weight $W = c_t \log(N / f_t)$
c_t: the number of passages in which the term appeared
log(N / f_t): a factor based on the relative frequency of the term in the database
c_t represents the redundancy
The weights of the candidates are used to select answer fragments
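As a rough illustration of the weight W = ct log(N/ft), the sketch below counts the number of passages containing each candidate and multiplies by the frequency factor. The input layout (one list of candidates per passage) and the sample statistics are assumptions.

```python
import math
from collections import Counter


def candidate_weights(passage_candidates, corpus_freq, corpus_length):
    """Compute W = c_t * log(N / f_t) for every candidate term.

    passage_candidates holds one list of candidate terms per retrieved passage.
    """
    passage_count = Counter()
    for cands in passage_candidates:
        passage_count.update(set(cands))   # a passage counts once per term
    return {
        t: c * math.log(corpus_length / corpus_freq[t])
        for t, c in passage_count.items()
    }


# Hypothetical candidates from three passages and made-up corpus statistics.
passages = [["bush", "clinton"], ["bush"], ["bush", "bush"]]
freq = {"bush": 100, "clinton": 100}
weights = candidate_weights(passages, freq, 10000)
print(weights["bush"] > weights["clinton"])  # True: "bush" appears in more passages
```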
13Answer Selection
The score of each answer fragment is the sum of the weights of the candidates in the fragment
- Other heuristics used:
a) Rank of the passage in which the fragment appears
b) Location of the fragment relative to the center point of the passage
Once the highest-scoring fragment is selected, the weights of the candidates in that fragment are reduced to zero
The remaining fragments are re-scored, and the next highest-scoring fragment is selected
- This process is repeated until 5 fragments are selected
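The select, zero and re-score loop can be sketched as a greedy procedure. This simplified version ignores the passage-rank and location heuristics and sums the weight of each distinct candidate once per fragment; both simplifications are assumptions of the sketch, and `select_fragments` is a hypothetical name.

```python
def select_fragments(fragments, weights, n=5):
    """Greedily pick up to n fragments, zeroing used candidate weights.

    fragments is a list of lists of candidate terms; weights maps term -> weight.
    Returns the indices of the chosen fragments in selection order.
    """
    remaining = dict(weights)              # copy so the caller's weights survive
    available = list(range(len(fragments)))
    chosen = []

    def score(i):
        return sum(remaining.get(t, 0.0) for t in set(fragments[i]))

    while available and len(chosen) < n:
        best = max(available, key=score)   # ties go to the earliest fragment
        chosen.append(best)
        available.remove(best)
        for t in fragments[best]:
            remaining[t] = 0.0             # selected candidates no longer count
    return chosen


# Hypothetical fragments and weights.
fragments = [["bush", "president"], ["bush"], ["clinton"]]
weights = {"bush": 3.0, "president": 1.0, "clinton": 2.0}
print(select_fragments(fragments, weights, n=2))  # [0, 2]
```

Without the zeroing step the second pick would be fragment 1, which repeats "bush"; reducing the weights to zero steers later picks toward different candidates.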
14Answer Selection
$W = c_t \log(N / f_t)$
The weights assigned depend on a redundancy factor (c_t) and a term frequency factor (log(N / f_t))
The individual contribution of each factor can be ascertained by setting the other to a constant
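Setting one factor to a constant can be expressed with two switches on the weight function. The flag names and the choice of 1 as the constant are assumptions for this sketch.

```python
import math


def weight(c_t, f_t, n, use_redundancy=True, use_frequency=True):
    """W = c_t * log(N / f_t), with either factor optionally fixed at 1."""
    redundancy = c_t if use_redundancy else 1
    frequency = math.log(n / f_t) if use_frequency else 1.0
    return redundancy * frequency


# Hypothetical term: appears in 3 passages, 100 times in a corpus of 10000 terms.
print(weight(3, 100, 10000, use_frequency=False))   # 3.0, redundancy only
print(weight(3, 100, 10000, use_redundancy=False) == math.log(100))  # True, frequency only
```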
15Exploring Redundancy
- A single category of questions, those requiring the name of a person as the answer, was explored
- Redundancy was used to isolate the required name from the top-ranking passages
- The 100GB TREC VLC2 corpus was used; it is of lower quality, so there is no guarantee that an answer to every question is present
- A simple syntactic pattern was used to identify candidate answers
16Exploring Redundancy
- For each query the top k passages were retrieved using Equation 1
- Each passage was expanded symmetrically about its center point to w bytes
- For the experiment the parameters k and w, the depth and width respectively, were varied
- Candidate answers were identified in the passages and assigned a score which is the count of the number of distinct passages in which the candidate appeared (ct)
- Ties were broken by applying a rule that takes into account the distance of each candidate from the center point of the passages
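The experimental procedure can be approximated as follows. The slides do not state the exact tie-breaking rule, so the sketch assumes that the candidate with the smallest distance to a passage center wins a tie; the function names, the 0-based half-open byte offsets and the clipping at text boundaries are likewise assumptions.

```python
from collections import defaultdict


def expand_extent(start, end, width, text_length):
    """Expand a byte range about its center to width bytes, clipped to the text."""
    center = (start + end) // 2
    new_start = max(0, center - width // 2)
    new_end = min(text_length, new_start + width)
    return new_start, new_end


def rank_by_redundancy(passages):
    """Rank candidates by the number of distinct passages containing them.

    passages is a list of lists of (candidate, distance_from_center) pairs.
    Ties are broken by the smallest distance seen for the candidate (assumed rule).
    """
    count = defaultdict(int)
    closest = {}
    for passage in passages:
        seen = set()
        for cand, dist in passage:
            if cand not in seen:
                seen.add(cand)
                count[cand] += 1          # one vote per passage
            closest[cand] = min(dist, closest.get(cand, dist))
    return sorted(count, key=lambda c: (-count[c], closest[c]))


# Hypothetical candidates with byte distances from each passage center.
passages = [
    [("Bush", 10), ("Clinton", 5)],
    [("Bush", 40)],
    [("Clinton", 30), ("Gore", 2)],
]
print(expand_extent(100, 120, 50, 1000))  # (85, 135)
print(rank_by_redundancy(passages))       # ['Clinton', 'Bush', 'Gore']
```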
17Results
- Experiments with TREC-8 questions suggested that a depth of k = 50 and a width of w = 1000 would produce reasonable results
- Using these parameters, 49 (56%) of the 87 questions are answered correctly, and for 34 (39%) a correct answer is ranked first
- Question runs for a range of depth and width values are listed
18Table
19Opinion
- We rate the authors' work at 7 on a scale from 1 to 10
Reasons
- Has a clear explanation of a QA system
- The experiments were conducted on the TREC 9 corpus
- The impact of their experiments on lower-quality corpora, where there is no guarantee that answers exist, was justified
20Conclusion
Redundancy can thus be used as a method for answer validation in Question Answering systems