1. JAVELIN Project Briefing
- February 16, 2007
- Language Technologies Institute, Carnegie Mellon University
2-9. MLQA Architecture (walkthrough)
[Architecture diagram, repeated on each of these slides, with components: QA (Question Analyzer), Keyword Translator, RS (Retrieval Strategist), Chinese Corpus, Japanese Corpus, Chinese IX and Japanese IX (answer extraction), and AG (Answer Generator).]
- Example question: How much did the Japan Bank for International Cooperation decide to loan to the Taiwan High-Speed Corporation?
- Question analysis output: Answer Type: MONEY; Keywords: _____________
- Retrieval output: DocID JY-20010705J1TYMCC1300010, Confidence 44.01; DocID JY-20011116J1TYMCB1300010, Confidence 42.95
- Extraction output: answer candidate with Confidence 0.0718 and its supporting passage
- Answer generation: cluster and re-rank answer candidates.
10. Question Analyzer
- Primary Subtasks
- Question Classification
- Key Term Identification
- Semantic Analysis
11. Question Classification
- Hybrid Approach
- machine learning + rule-based (same features)
- Features
- Lexical
- unigrams, bigrams
- Syntactic
- focus adjective, main verb, wh-word, determiner status of wh-word
- Semantic
- focus word type
12. Question Classification: Focus Words
- Examples
- Which town hosted the 2002 Winter Olympics?
- How long is the Golden Gate Bridge?
- Determining the semantic type of focus nouns (see the sketch below)
- Look up in WordNet
- town → town-8135936
- Use a manually-created mapping
- town-8135936 → CITY
- city-metropolis-urban_center-8005407 → CITY
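A minimal sketch of this two-step lookup (WordNet synset, then a hand-built synset-to-answer-type table), using NLTK's WordNet interface. The synset names and mapping entries are illustrative, not the actual JAVELIN tables.

```python
from nltk.corpus import wordnet as wn  # requires the NLTK WordNet data

# Illustrative hand-built mapping from WordNet synset names to answer types;
# the real JAVELIN mapping was manually created and much larger.
SYNSET_TO_ATYPE = {
    "town.n.01": "CITY",
    "city.n.01": "CITY",
    "bridge.n.01": "OBJECT",
}

def focus_word_type(focus_noun: str) -> str:
    """Map a question's focus noun to a coarse answer type."""
    for synset in wn.synsets(focus_noun, pos=wn.NOUN):
        atype = SYNSET_TO_ATYPE.get(synset.name())
        if atype is not None:
            return atype
    return "UNKNOWN"

print(focus_word_type("town"))  # -> CITY (given the example mapping above)
```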
13. Question Classification: Algorithms
- Machine Learning
- Hierarchical classifier
- E-C: MAX_ENT, MAX_ENT
- E-J: MAX_ENT, ADABOOST_OVA
- Rule-based
- Example
- MONEY ← WH_WORD=how_much, FOCUS_ADJ=expensive, FOCUS_TYPE=money
- Hybrid Approach (sketched below)
- Try both ML and rule-based classification
- If the rule-based classification succeeded, use it
- Else use the ML-based classification
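A minimal sketch of the hybrid fallback logic described above, assuming a rule-based classifier that returns None when no rule fires and any trained ML classifier with a predict method; both interfaces are placeholders, not the actual JAVELIN components.

```python
from typing import Optional

def rule_based_classify(features: dict) -> Optional[str]:
    """Toy rule base; the real system uses a larger manually written rule set."""
    if (features.get("WH_WORD") == "how_much"
            and features.get("FOCUS_TYPE") == "money"):
        return "MONEY"
    return None  # no rule fired

def hybrid_classify(features: dict, ml_classifier) -> str:
    """Prefer the rule-based answer type; fall back to the ML classifier."""
    atype = rule_based_classify(features)
    if atype is not None:
        return atype
    return ml_classifier.predict(features)
```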
14. Key Term Identification
- Sources of evidence
- Syntactic category (POS): NN, JJ, VB, CD
- Common phrases in dictionary
- Named entity tags
- Quoted text
- Unification procedure based on priority of evidence source (sketched below)
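A minimal sketch of priority-based unification, assuming each evidence source proposes candidate key-term spans and higher-priority sources win on overlap. The priority order and data shapes are illustrative assumptions, not the JAVELIN-internal ones.

```python
# Illustrative priority order (highest first).
PRIORITY = ["quoted_text", "named_entity", "dictionary_phrase", "pos_pattern"]

def unify_key_terms(candidates):
    """candidates: list of (source, start, end, text) spans proposed by each evidence source.
    Keep a span unless it overlaps a span from a higher-priority source."""
    rank = {src: i for i, src in enumerate(PRIORITY)}
    kept = []
    for cand in sorted(candidates, key=lambda c: rank.get(c[0], len(PRIORITY))):
        _, start, end, _ = cand
        overlaps = any(not (end <= s or start >= e) for _, s, e, _ in kept)
        if not overlaps:
            kept.append(cand)
    return [text for _, _, _, text in kept]

terms = unify_key_terms([
    ("pos_pattern", 0, 1, "Golden"),
    ("named_entity", 0, 3, "Golden Gate Bridge"),
])
print(terms)  # ['Golden Gate Bridge'] -- the NE span outranks the POS-pattern span
```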
15. Semantic Analysis
- Semantic Role Labeling
- ASSERT v0.1
- Back-up: KANTOO (for "be", "have", etc.)
- more on SRL later
- Semantic Predicate Structures
- Produced from SRL annotations and key terms
- Focus argument is identified
- Semantic Predicate Expansion
- Using a small, manually-created ontology
- Relations: is-a, inverse, implies, reflexive
16. Plans for Future Development
- Question Classification
- Replace manually-created knowledge sources and heuristics with learners
- Re-architect to place learned components as supporting agents to rule-based control
- Semantic Role Labeling
- Nominalizations?
- Predicate Expansion
- Learn the expansion ontology automatically from labeled corpora
17. Retrieval Strategist for NTCIR 6
- Sentence and block retrieval
- Blocks are overlapping windows, each containing three sentences
- Annotated Corpora
- Chinese
- Sentence and block boundaries
- Named Entity Types: www, phone, cardinal, time, percent, person, quoted, money, booktitle, date, ordinal, email, location, duration, organization, measure
- Japanese (CLQA and QAC)
- Sentence segmentation, blocks
- Named Entity Types: time, date, optional, location, demo, organization, artifact, made, misc, money, any, person, numex
- Named Entity Subtypes: misc, people, cardinal, age, weight, length, speed, information, area
- Question focus terms: person_bio, reason, method, definition
- Japanese case markers: mo, totomoni, ka, no, tte, nado, wo, ni, ga, toshite, e, wa, yori, nitsuite, dake, kara, to, ya
- Propbank-style semantic roles: target, arg0-4, argx
18. Query Formulation for NTCIR 6
- Retrieve, rank and score blocks
- One weighted synonym inner clause for each keyterm, containing alternate forms, weighted by confidence; for example (Indri-style syntax; a sketch of building such a query follows below):

#weight[block]( weight1 #wsyn( 1.0 term1 0.85 alt1a 0.60 alt1b )
                weight2 #wsyn( 1.0 term2 0.75 alt2a ) )
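A minimal sketch of assembling such a query string from keyterms and their weighted alternates. The #weight/#wsyn operators follow Indri query-language conventions; the exact operator spelling used by the Retrieval Strategist is an assumption here.

```python
def build_block_query(keyterms):
    """keyterms: list of (term_weight, [(form, confidence), ...]).
    Returns an Indri-style query with one weighted-synonym clause per keyterm."""
    clauses = []
    for term_weight, forms in keyterms:
        inner = " ".join(f"{conf:g} {form}" for form, conf in forms)
        clauses.append(f"{term_weight:g} #wsyn( {inner} )")
    return "#weight[block]( " + " ".join(clauses) + " )"

query = build_block_query([
    (1.0, [("term1", 1.0), ("alt1a", 0.85), ("alt1b", 0.60)]),
    (0.8, [("term2", 1.0), ("alt2a", 0.75)]),
])
print(query)
```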
19. Translation Module Overview

20. Outline
- What TM does
- Then (TM at NTCIR-5)
- Now (TM at NTCIR-6)
- Going Forward
21. Translation Module
- Responsible for all translation-related tasks within JAVELIN
- Currently, the TM's main task is to translate keyterms (given by the Question Analyzer) from the source language (the language of the user's input question) to the target languages (the languages of the data collections where answers may be found), so that the answer can be located and extracted based on the translated keyterms
22. Then (NTCIR-5)
- Goal: Produce a high-quality translation for each keyterm based on question context
- View: A translation problem
- Evaluation: Based on gold-standard translations
- How
- Use multiple translation resources
- Dictionaries
- MT systems
- Web-mining techniques
- Use web co-occurrence statistics to select the best combination of translated keyterms for a given question
23. Then (NTCIR-5)
[TM architecture diagram: Source Language Keyterms → Translation Gathering (Dictionaries, MT Systems, Web Mining over the World Wide Web) → Translation Selection → Target Language Keyterms.]
24. Then (NTCIR-5)
- Problems
- A correct translation may not be useful in document retrieval and answer extraction
- "Bill Clinton", "William J. Clinton", "President Clinton" (alternate forms)
- "Took over", "invaded", "attacked", "occupied" (near-synonyms)
- Which one is correct? Which one(s) are good for retrieval and extraction in a QA system?
- Needs better translation of named entities
- Accessing the web for gathering statistics could be slow
25. Now (NTCIR-6)
- Goal: For each keyterm, produce a SET of translations useful for retrieval and extraction
- View: A Cross-Lingual Information Retrieval (CLIR) problem
- Evaluation: No direct evaluation; based on retrieval results
- How
- Use multiple translation resources
- Dictionaries
- MT systems
- Better web-mining techniques
- Wikipedia
- Named entity lists
- More of everything
- Use multiple translation candidates
- Better for retrieval and extraction recall
- But need to minimize noise while retaining recall
- Rank translation candidates
- Use simpler web statistics
26. Now (NTCIR-6)
[Updated TM diagram: Source Language Keyterms → Translation Gathering (Named Entity Lists, Dictionaries, MT Systems, Wikipedia, Web Mining over the World Wide Web) → Translation Alternatives Scoring → ranked sets of Target Language Keyterms.]
27. Now (NTCIR-6)
[Retrieval-results table not reproduced here.]
Retrieval results show that good translation coverage is important. Gold: manually created gold-standard translation, one per keyterm. TM: automatically generated translations using the TM, multiple translations per keyterm. Note: the unit of retrieval is a 3-sentence block, not a document, and relevance judgment is based on the answer pattern.
30. Going Forward
- Improving Translation Coverage
- Improve web-mining translators
- Improve keyterm extraction
- May need alternate forms of the source-language keyterm
- May need to segment/transform extracted keyterms
- Japanese translation is poor
- Named entities not translated properly
- Keyterm segmentation problems
- Improving CLIR
- Use corpus statistics for ranking translation candidates
- Use established data sets for CLIR experiments (TREC, NTCIR)
31. Chinese Answer Extraction Module
- Outline
- Review answer extraction in NTCIR5 through an example
- Explain the new techniques we developed for NTCIR6
32. NTCIR5 Chinese Answer Extractor Module (we use an English sentence as our running example; the techniques in this module are language-independent)
Q: What percent of the nation's cheese does Wisconsin produce?
T: In Wisconsin, where farmers produce roughly 28 percent of the nation's cheese ...
33. NTCIR5 Chinese Answer Extractor Module
1. Identify named entities
Q: What percent of the nation's cheese does [Wisconsin]Location produce?
T: In [Wisconsin]Location, where farmers produce roughly [28 percent]Percent of the nation's cheese ...
34. NTCIR5 Chinese Answer Extractor Module
2. Identify the expected answer type
Q: What percent of the nation's cheese does [Wisconsin]Location produce? → Answer Type: PERCENT
T: In [Wisconsin]Location, where farmers produce roughly [28 percent]Percent of the nation's cheese ...
35. NTCIR5 Chinese Answer Extractor Module
3. Extract answer candidates whose named-entity type matches the expected answer type
Q: What percent of the nation's cheese does [Wisconsin]Location produce? → Answer Type: PERCENT
T: In [Wisconsin]Location, where farmers produce roughly [28 percent]Percent of the nation's cheese ... → candidate: "28 percent"
36. NTCIR5 Chinese Answer Extractor Module
4. Score each answer candidate by its surface distance to the key terms, and select the candidate closest to all key terms (a small scoring sketch follows below).
In T, the candidate "28 percent" is 5 word tokens away from the key term "Wisconsin".
Q: What percent of the nation's cheese does [Wisconsin]Location produce? → Answer Type: PERCENT
T: In [Wisconsin]Location, where farmers produce roughly [28 percent]Percent of the nation's cheese ...
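A minimal sketch of this distance-based scoring, assuming the score is an average token distance from the candidate to each matched key term; the exact NTCIR5 scoring function is not specified on the slide, so the formula here is illustrative.

```python
def distance_score(candidate_idx, keyterm_indices):
    """Score an answer candidate by its average token distance to the matched key terms.
    Smaller distances give higher scores (illustrative, not the exact NTCIR5 formula)."""
    if not keyterm_indices:
        return 0.0
    avg_dist = sum(abs(candidate_idx - i) for i in keyterm_indices) / len(keyterm_indices)
    return 1.0 / (1.0 + avg_dist)

passage = "In Wisconsin , where farmers produce roughly 28 percent of the nation's cheese".split()
candidate_idx = passage.index("28")          # candidate "28 percent"
keyterm_idx = [passage.index("Wisconsin")]   # matched key term from the question
print(distance_score(candidate_idx, keyterm_idx))  # positional gap of 6 tokens -> ~0.14
```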
37. NTCIR6 Chinese Answer Extractor Module
Q: What percent of the nation's cheese does Wisconsin produce?
A: In Wisconsin, where farmers produce roughly 28 percent of the nation's cheese ...
38. NTCIR6 Chinese Answer Extractor Module
Find the best alignment of key terms between Q and A using a max-flow dynamic programming algorithm.
Q: What percent of the nation's cheese does Wisconsin produce?
A: In Wisconsin, where farmers produce roughly 28 percent of the nation's cheese ...
39. NTCIR6 Chinese Answer Extractor Module
Using the max-flow algorithm, we take into account partial matching of terms and synonym expansion by assigning different scores to these types of matching (a small matching sketch follows below).
[Alignment example: the Q term "percent" partially matches "28 percent" in A, and the Q term "make" matches "produce" in A via synonym expansion; such matches receive reduced alignment scores (0.8 and 0.9 in the figure).]
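A minimal sketch of scoring a single question-term-to-answer-term match with exact, partial, and synonym cases. The weights and the synonym table are illustrative, and the actual system solves the full alignment with a max-flow / dynamic-programming formulation rather than scoring pairs independently.

```python
# Illustrative match weights; the real system assigns its own scores.
EXACT_SCORE, PARTIAL_SCORE, SYNONYM_SCORE = 1.0, 0.8, 0.9

# Toy synonym table standing in for a thesaurus / expansion resource.
SYNONYMS = {"make": {"produce", "manufacture"}}

def match_score(q_term: str, a_term: str) -> float:
    """Score one candidate alignment between a question term and an answer-side term."""
    if q_term == a_term:
        return EXACT_SCORE
    if q_term in a_term or a_term in q_term:      # e.g. "percent" vs "28 percent"
        return PARTIAL_SCORE
    if a_term in SYNONYMS.get(q_term, set()):     # e.g. "make" vs "produce"
        return SYNONYM_SCORE
    return 0.0

print(match_score("percent", "28 percent"))  # 0.8 (partial matching)
print(match_score("make", "produce"))        # 0.9 (synonym expansion)
```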
40. NTCIR6 Chinese Answer Extractor Module
Identify the answer type and select answer candidates with a matching NE type.
Q: What percent of the nation's cheese does Wisconsin produce? → Answer Type: PERCENT
T: In Wisconsin, where farmers produce roughly [28 percent]Percent of the nation's cheese ...
41. NTCIR6 Chinese Answer Extractor Module
Produce dependency parse trees for Q and T.
[Dependency-parse figures for the question and the answer sentence, using relations such as whn, head, pcomp-n, subj, obj, gen, prep, det, mod, i and root.]
42. NTCIR6 Chinese Answer Extractor Module
Extract relation triples among the matching terms.
[Figure: dependency relations among the matched terms of Q ("percent", "of", "the nation's cheese", "Wisconsin", "produce") and of T ("Wisconsin", "produce", "28 percent", "of", "the nation's cheese").]
43. NTCIR6 Chinese Answer Extractor Module
Extract relation triples (cont.)
[Figure: the extracted triples, connecting "percent" / "28 percent", "of", "cheese", "Wisconsin" and "produce" with relations such as subj, obj, prep, pcomp-n and mod, for the question and the answer sentence respectively.]
44. NTCIR6 Chinese Answer Extractor Module
Combine multiple sources of information using a maximum-entropy model. Features include:
1. Answer-type / NE matching (Answer Type PERCENT vs. the Percent entity "28 percent")
2. Dependency path matching (e.g., the path linking "percent" and "cheese" through "of" in Q against the corresponding path linking "28 percent" and "cheese" in T)
3. Term alignment score
4. Sentence term occurrence
5. Passage term occurrence
6. etc.
(A maximum-entropy-style scoring sketch follows below.)
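A minimal sketch of the log-linear (maximum-entropy-style) combination of per-candidate features. The feature names mirror the list above, but the weights, bias, and sigmoid normalization are illustrative placeholders for a trained model.

```python
import math

# Illustrative weights; a real maximum-entropy model learns these from training data.
WEIGHTS = {
    "atype_ne_match": 2.0,
    "dep_path_match": 1.5,
    "alignment_score": 1.0,
    "sentence_term_occurrence": 0.5,
    "passage_term_occurrence": 0.3,
}
BIAS = -2.0

def candidate_score(features: dict) -> float:
    """Probability-like score for one answer candidate from its feature values."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

print(candidate_score({
    "atype_ne_match": 1.0,          # candidate NE type matches expected answer type
    "dep_path_match": 0.7,          # partial dependency-path overlap
    "alignment_score": 0.8,         # from the term-alignment step
    "sentence_term_occurrence": 3,  # key terms found in the candidate sentence
    "passage_term_occurrence": 5,   # key terms found in the passage
}))
```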
45. Future Work for Chinese IX
- Currently building a more powerful and expressive model to learn the syntactic and semantic transformation from question to answer.
- Plug more external resources, such as a paraphrase database and semantic resources (gazetteers, WordNet, thesauri), into the new model.
46. Japanese Answer Extraction

47. NTCIR6 CLQA EJ/JJ Task: Answer Extraction
- Given retrieved documents, we want to pick named entities that belong to the expected answer type or another relevant type.
- Named Entity tagging
- Used CaboCha for 9 NE classes.
- A pattern-based NE tagger is also used for the NUMEX, DATE, TIME, PERCENT classes and for more fine-grained NEs (ORGANIZATION.UNIVERSITY)
- NE family assumption
- Families
- LOCATION, PERSON, ORGANIZATION, ARTIFACT
- NUMEX, PERCENT
- DATE, TIME
- If the answer type is LOCATION, pick the other members of the family into the answer candidate pool too, because the NE tagger may have mistakenly tagged a LOCATION as PERSON
- The MaxEnt learner learns different weights for LOCATION-LOCATION and LOCATION-PERSON
- Then we estimate the probability of each named entity being an answer.
- Used a Maximum Entropy model into which we can easily incorporate customizable features
- We can model both proximity (used in JAVELIN II's LIGHT IX) and patterns (used in JAVELIN II's FST IX)
48. NTCIR6 CLQA EJ/JJ Task: Answer Extraction
- Numeric features (Q denotes the question sentence, A denotes the answer candidate sentence)
- KEYTERM: number of key terms from Q found in A
- ALIAS: number of aliases (obtained from Wikipedia and Eijiro) from Q found in A
- RELATED_TERM: number of related terms (obtained from web mining) from Q found in A
- KEYTERM_DIST: closest sentence-level distance of a key term from Q
- ALIAS_DIST: closest sentence-level distance of an alias from Q
- RELATED_TERM_DIST: closest sentence-level distance of a related term from Q
- PREDICATE_ARGUMENT: degree to which the predicate-argument structures of Q and A are similar
- Binary features
- ATYPE: pairs of answer types in Q and A
- KEYTERM_ATTACHMENT (-NO, -NI, -WA, -GA, -MO, -WO, -KANJI, -PAREN): whether a certain word occurs directly after the key term in A
(A sketch of assembling such features appears below.)
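A minimal sketch of turning one question/candidate-sentence pair into a feature dictionary for such a MaxEnt learner. The counting and distance logic is simplified, and the tokenization, alias lookup, and attachment detection are stand-ins for the real components.

```python
def extract_features(q_keyterms, q_aliases, a_tokens, answer_idx):
    """Build a simplified feature dict for one answer candidate sentence.
    q_keyterms / q_aliases: terms derived from the question.
    a_tokens: tokens of the candidate sentence; answer_idx: position of the candidate NE."""
    features = {}
    keyterm_positions = [i for i, tok in enumerate(a_tokens) if tok in q_keyterms]
    alias_positions = [i for i, tok in enumerate(a_tokens) if tok in q_aliases]

    features["KEYTERM"] = len(keyterm_positions)
    features["ALIAS"] = len(alias_positions)
    features["KEYTERM_DIST"] = (min(abs(answer_idx - i) for i in keyterm_positions)
                                if keyterm_positions else len(a_tokens))
    features["ALIAS_DIST"] = (min(abs(answer_idx - i) for i in alias_positions)
                              if alias_positions else len(a_tokens))
    # Binary attachment feature: the token right after a key term (e.g. a particle).
    for i in keyterm_positions:
        if i + 1 < len(a_tokens):
            features[f"KEYTERM_ATTACHMENT_{a_tokens[i + 1]}"] = 1
    return features
```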
49. NTCIR6 QAC Overview
- Japanese-to-Japanese complex (non-factoid) QA task
- The answer unit is larger than in the factoid QA task
- Phrases, sentences, multiple sentences, summarized text
- In reality, it was a pilot task
- The kinds of questions were not predefined
- Small training data
- Evaluation: currently, human judgment only
- Unknown N in the number of top-N answer candidates to evaluate
- Data
- Corpus: Mainichi newspaper, 1998-2001
- Training: 30 questions
- Formal run: 100 questions
50. NTCIR6 QAC: Complex Questions
- Example questions by expected answer type (translated)
- Relationship, difference
- What is the difference between skeleton and luge?
- Reason, cause
- Why was it easy to predict the eruption of Mt. Usu?
- What is the background of the rise of Islamic fundamentalism?
- Definition, description, (person bio)
- What is the NPO law?
- What are the problems of the aging Mir space station?
- Effect, result
- How does dioxin affect the human body?
- Method, process
- Degree
- Opinion
- What did Ryoko Tamura comment after winning the gold medal?
51. NTCIR6 QAC: Our Approach
- Question Analysis
- Keyterm extraction, dictionary-based keyterm expansion, answer type analysis
- Document Retrieval
- Same as factoid QA: block (3-sentence) retrieval with Indri
- Answer Extraction
- Machine learning (Maximum Entropy model) using keyterms, answer types, and patterns as features
- Answer Selection
- Duplicate answer merging
52. NTCIR6 QAC: Answer Types
- A-type categories we defined, with example cue phrases (Japanese cues garbled in this transcript):
- METHOD: How do you choose ...?
- PROCESS: In what process ...? / how ...?
- REASON: Why ...? What is the reason for ...?
- RESULT: (Japanese cue phrases)
- CONDITION: In what condition ...?
- DEFINITION: What is ...? What is the advantage of ...?
- PERSON_BIO: Who is ...?
- DEGREE: How much damage ...?
- How to make use of the answer type?
- As a feature in the answer extraction phase
53. NTCIR6 QAC: Keyterm Expansion
- Based on observations, we found some vocabulary mismatches between questions and answers
- Created a synonym/alias dictionary from Wikipedia and Eijiro (an English-to-Japanese dictionary); see the sketch below
- From Wikipedia
- Use redirection information, from which aliases can be extracted
- E.g., Carnegie Mellon ↔ CMU
- From Eijiro
- Assume the target words are synonyms of each other
- Risk of treating "financial bank" and "river bank" as synonyms
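A minimal sketch of building such an alias dictionary from redirect pairs, assuming the redirects have already been exported as (alias, canonical title) pairs; the input format is illustrative, not the actual pipeline.

```python
from collections import defaultdict

def build_alias_dict(redirect_pairs):
    """redirect_pairs: iterable of (alias, canonical_title) pairs,
    e.g. extracted from Wikipedia redirect records.
    Returns a symmetric alias dictionary: every surface form maps to its variants."""
    aliases = defaultdict(set)
    for alias, title in redirect_pairs:
        aliases[alias].add(title)
        aliases[title].add(alias)
    return aliases

alias_dict = build_alias_dict([
    ("CMU", "Carnegie Mellon University"),
    ("Carnegie Mellon", "Carnegie Mellon University"),
])
print(alias_dict["Carnegie Mellon University"])  # {'CMU', 'Carnegie Mellon'}
```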
54. NTCIR6 QAC: Answer Extraction
- Used a Maximum Entropy model into which we can easily incorporate customizable features
- One-sentence assumption
- Finding answer boundaries is difficult, because non-factoid QA requires more text understanding
- So we assumed one sentence is an appropriate span to start with
- The answer extraction problem then became more like an answer selection problem
- Return the top N answer candidates
- As long as the score given to the answer candidate is over the threshold
55. NTCIR6 QAC: Answer Extraction Features
- Numeric features (Q denotes the question sentence, A denotes the answer candidate sentence)
- KEYTERM: number of key terms from Q found in A
- ALIAS: number of aliases (obtained from Wikipedia and Eijiro) from Q found in A
- RELATED_TERM: number of related terms (obtained from web mining) from Q found in A
- KEYTERM_DIST: closest sentence-level distance of a key term from Q
- ALIAS_DIST: closest sentence-level distance of an alias from Q
- RELATED_TERM_DIST: closest sentence-level distance of a related term from Q
- SENTENCE_LENGTH: length of the A sentence, under a normal distribution
56. NTCIR6 QAC: Answer Extraction Features
- Binary features (Q denotes the question sentence, A denotes the answer candidate sentence)
- PATTERN_CUE: if there is a hand-crafted cue in A
- HAS_SUBJ: if there is a subject in A
- HAS_PRON: if there is a pronoun in A
- ATYPE: answer type analyzed from Q
- LIST_QUE: if list cues "(1)", "???", "?", "?" are found in A
- PAREN: if "?", "?" or "(", ")" are found in A
- PARAGRAPH_HEAD: if A is the beginning of a paragraph
- KEYTERM_ATTACHMENT (-NO, -NI, -WA, -GA, -MO, -WO, -KANJI, -PAREN): if a certain word occurs directly after the key term in A
- Observation: -NO and -KANJI are strong features
57. NTCIR6 QAC: Human Judgment Results
- Manual judgment was done by a person outside of NTCIR for all 100 questions
- The top 4 answer candidates were evaluated
- Answer candidates are labeled as
- A: the candidate contains the answer
- B: the candidate contains the answer, but the main topic of the candidate is not the answer
- C: the candidate contains a part of the answer
- D: the candidate does not contain the answer
- Judged result
- A: 24, B: 30, C: 13, D: 310, out of 377 answer candidates
- Our interpretation of the result
- Precision is 18% ((A+B+C)/(A+B+C+D)) for loose evaluation
- For 42% of questions, we were able to return at least one candidate with an A, B, or C label
58. Future Plans
- In answer type analysis, classify multiple binary features of the question instead of picking out only one A-type category
- Instead of introducing the one-sentence assumption, view answer extraction as an answer segmentation problem
- Automatic evaluation metrics from the text-segmentation task, such as COAP (Co-Occurrence Agreement Probability), will be available even if factoid and non-factoid questions are mixed together (cf. the Basic Element approach)
59. NTCIR6 CLQA EJ/JJ Task: Future Work
- Use or develop a more accurate NE tagger, trading off against speed
- True annotation
- <PERSON>??</PERSON>????????<PERSON>???</PERSON>????????
- Output from CaboCha
- <LOCATION>?</LOCATION>?????????<PERSON>??</PERSON><ORGANIZATION>?</ORGANIZATION>????????
- Output from Bar
- <PERSON>??</PERSON>????????<PERSON>??</PERSON><ORGANIZATION>?</ORGANIZATION>????????
- Try other classifier/learner algorithms
- SVM, AdaBoost, Decision Tree, Voted Perceptron, etc.
- Feature engineering and beyond
- "We added these features and got the best accuracy." So what?
- We want to interpret the result by answering the question: how much does feature A contribute to extracting the answer?
60. Answer Generator
- NTCIR5
- Cluster similar or redundant answers
- For a cluster containing K answers whose extraction confidence scores are S1, S2, ..., SK, the cluster confidence is computed from these scores (formula not reproduced here)
- NTCIR6
- Apply an answer ranking model to estimate the probability of an answer given multiple answer relevance and similarity features
61. Answer Ranking Model
- Two subtasks for answer ranking
- Identify relevant answer candidates: estimate P(correct(Ai) | Ai, Q)
- Exploit answer redundancy: estimate P(correct(Ai) | Ai, Aj)
- Goal: estimate P(correct(Ai) | Q, A1, ..., AN)
- Use logistic regression to estimate answer probability given the degree of answer relevance and the amount of supporting evidence provided in the set of answer candidates
62. Answer Ranking Model (2)
[The logistic-regression formula was shown as an image on the original slide; a reconstruction follows below.]
Where:
- simk(Ai, Aj): a scoring function used to calculate the answer similarity between Ai and Aj
- relk(Ai): a feature function used to produce an answer relevance score for answer Ai
- K1, K2: the number of feature functions for answer validity and answer similarity scores, respectively
- N: the number of answer candidates
- α0, βk, γk: weights learned from training data
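A plausible reconstruction of the missing formula from the definitions above, assuming the standard logistic-regression form with separate relevance and averaged-similarity terms (this exact form is an assumption, not taken from the slide):

```latex
P\bigl(\mathrm{correct}(A_i)\mid Q, A_1,\dots,A_N\bigr)
\;\approx\;
\frac{1}{1+\exp\!\Bigl(-\alpha_0
  - \sum_{k=1}^{K_1}\beta_k\,\mathrm{rel}_k(A_i)
  - \sum_{k=1}^{K_2}\gamma_k\,\tfrac{1}{N}\sum_{j\neq i}\mathrm{sim}_k(A_i,A_j)\Bigr)}
```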
63. Feature Representation
- Answer Relevance Features
- Knowledge-based Features
- Data-driven Features
- Answer Similarity Features
- String distance metrics
- Synonyms
- Each feature produces an answer relevance or
answer similarity score
64. Knowledge-based Feature: Gazetteers
- Electronic gazetteers provide geographic information
- English
- Use the Tipster Gazetteer, the CIA World Factbook, and information about the US states (www.50states.com)
- Japanese
- Extract Japanese location information from Yahoo
- Use Gengo GoiTaikei location names
- Chinese
- Extract location names from the Web and HowNet
- Translated names
- Translate country names provided by the CIA World Factbook and the Tipster gazetteers into Chinese and Japanese names
- The top 3 translations were used
65. Relevance Score (Gazetteers)

66. Knowledge-based Feature: Ontologies
- Ontologies such as WordNet contain information about relationships between words and general meaning types (synsets, semantic categories, etc.)
- English
- WordNet: WordNet 2.1 contains 155,327 words, 117,597 synsets and 207,016 word-sense pairs
- Japanese
- Gengo GoiTaikei contains 300,000 Japanese words with their associated 3,000 semantic classes
- Chinese
- HowNet contains 65,000 Chinese concepts and 75,000 corresponding English equivalents

67. Relevance Score (WordNet)
68. Data-driven Feature: Google
- Use Google for English, Japanese and Chinese
- For each answer candidate Ai:
- 1. Initialize the Google score gs(Ai) = 0
- 2. Create a query
- 3. Retrieve the top 10 snippets from Google
- 4. For each snippet s:
- 4.1. Initialize the co-occurrence score cs(s) = 1
- 4.2. For each keyterm translation k in s:
- 4.2.1. Compute the distance d, the minimum number of words between k and the answer candidate
- 4.2.2. Update the snippet co-occurrence score (update formula not reproduced here)
- 4.3. gs(Ai) = gs(Ai) + cs(s)
- Example
- Question: What is the prefectural capital city whose name is written in hiragana?
- Keyterms and their translations (with confidences):
- prefectural: ???? (0.75), ?????? (0.25)
- capital city: ?? (0.78), ????? (0.11), ?? (0.11)
- written: ?? (0.6)
- hiragana: ??? (0.5), ?? (0.3), ???? (0.11), ???? (0.1)
- Answer candidate: ?????
- Query: ????? (???? OR ??????) (?? OR ????? OR ??) (??) (? ?? OR ?? OR ???? OR ????)
(A code sketch of this scoring loop follows below.)
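A minimal sketch of this scoring loop. The snippet retrieval is stubbed out (fetch_top_snippets is a placeholder, not a real Google API), and the co-occurrence update in step 4.2.2 is not given on the slide, so the distance-discounted update used here (adding the keyterm's translation confidence divided by distance) is an illustrative assumption.

```python
def fetch_top_snippets(query, n=10):
    """Placeholder for a web-search call returning the top-n snippet strings."""
    raise NotImplementedError("hook up a search backend here")

def min_word_distance(tokens, term, candidate):
    """Minimum number of words between occurrences of term and candidate in a snippet."""
    t_pos = [i for i, w in enumerate(tokens) if term in w]
    c_pos = [i for i, w in enumerate(tokens) if candidate in w]
    if not t_pos or not c_pos:
        return None
    return min(abs(i - j) for i in t_pos for j in c_pos)

def google_score(candidate, query, keyterm_translations):
    """keyterm_translations: {translated_keyterm: confidence}."""
    gs = 0.0                                   # 1. initialize gs(Ai)
    for snippet in fetch_top_snippets(query):  # 3-4. top snippets
        cs = 1.0                               # 4.1. co-occurrence score
        tokens = snippet.split()
        for k, conf in keyterm_translations.items():  # 4.2. keyterm translations in s
            d = min_word_distance(tokens, k, candidate)
            if d is not None:
                cs += conf / (d + 1)           # 4.2.2. assumed distance-discounted update
        gs += cs                               # 4.3. accumulate
    return gs
```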
69. Data-driven Feature: Wikipedia
- Use Wikipedia for English, Japanese and Chinese
- Algorithm
70. Similarity Features
- String Distance
- Levenshtein, Cosine, Jaro and Jaro-Winkler
- Synonyms
- Binary similarity score for synonyms
- English: WordNet synonyms, Wikipedia redirection, CIA World Factbook
- Japanese: Wikipedia redirection, EIJIRO dictionary
- Chinese: Wikipedia redirection

sim(Ai, Aj) = 1 if Ai is a synonym of Aj, 0 otherwise (a small sketch of both feature types follows below)
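A minimal sketch of these two kinds of similarity features: a string-distance score (difflib's ratio stands in here for Levenshtein/Jaro-style metrics) and the binary synonym score defined above. The synonym set is an illustrative stand-in for WordNet/Wikipedia-redirect lookups.

```python
import difflib

# Illustrative synonym pairs; the real system consults WordNet, Wikipedia redirects, etc.
SYNONYM_PAIRS = {frozenset({"William J. Clinton", "Bill Clinton"})}

def string_similarity(a_i: str, a_j: str) -> float:
    """String-distance feature (difflib ratio as a stand-in for Levenshtein/Jaro-Winkler)."""
    return difflib.SequenceMatcher(None, a_i.lower(), a_j.lower()).ratio()

def synonym_similarity(a_i: str, a_j: str) -> int:
    """Binary synonym feature: sim(Ai, Aj) = 1 if Ai is a synonym of Aj, else 0."""
    return 1 if frozenset({a_i, a_j}) in SYNONYM_PAIRS else 0

print(string_similarity("Bill Clinton", "President Clinton"))
print(synonym_similarity("Bill Clinton", "William J. Clinton"))  # 1
```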
71. Answer Similarity Using Canonical Forms
- Type-specific conversion rules

72. AG Results (E-J)

73. Breakdown by Answer Type
[Results charts for E-J and E-C.]

74. Effects of Keyterm Translation on Answer Ranking
75. Future Work
- Continue to analyze the effects of keyterm translation on answer ranking
- Improve Web validation
- Query relaxation when there are no matched Web documents
- e.g., In which city in Japan is the "Ramen Museum" located?
- "Ramen Museum" is translated into "ramen ???" and there are no matched Web documents
- Change the query to (ramen AND ???) or incorporate the English keyterm (Ramen Museum)
- Extend our joint prediction model to CLQA
- Apply a probabilistic graphical model to estimate the joint probability of all answer candidates, from which the probability of an answer can be inferred
76. NTCIR Evaluation Results

77. Evaluation Metrics
- Datasets
- Monthly Evaluation
- Current Performance
- Performance History
- Periodic Analysis
- Error Analysis (on Development Set)
- Plans for Future Development

78. Evaluation Metrics Report
- End-to-end and modular evaluation
- Evaluation of speed and accuracy
- Summary of internal HTML evaluation reports
- http://durazno.lti.cs.cmu.edu/wiki/moin.cgi/Javelin_Project/Multilingual/Evaluation
79. Plans for Future Development
- Question Classification
- Replace manually-created knowledge sources and heuristics with learners
- Re-architect to place learned components as supporting agents to rule-based control
- Semantic Role Labeling
- Nominalizations
- Semantic Predicate Expansion
- Automatic ontology acquisition
80. Structured Retrieval for Question Answering

81. Standard Approach to Retrieval for QA
[Pipeline figure: Input Question → ... → Output Answers]
- Question Analysis
- Determines what linguistic and semantic constraints must hold for a document to contain a valid answer
- Formulates a bag-of-words query using question keywords and a named-entity placeholder representing the expected answer type
- Document Retrieval
- Corpus indexed for keywords and named entity annotations [13]
- Provides best-match documents containing keywords and NEs
- Answer Extraction and Post Processing
- Checks constraints and extracts NE answers
82. Issues with the Standard Approach
- Why is the standard approach sometimes sub-optimal for QA?
- May not scale to large collections
- When question keywords are frequent and co-occur frequently, many documents that do not answer the question may be matched, e.g., "What country is Berlin in?"
- Named entities can help narrow the search space, but still match sentences such as "Berlin is near Poland."
- May be slow
- If answer extraction or constraint checking is not a cheap operation, the current approach may retrieve large numbers of irrelevant documents that need to be checked
- May be ineffective for non-factoid (e.g., relationship) questions
- Can we reduce the number of documents we need to process in order to find an answer more quickly?
- More relevant documents, more highly ranked
83. Alternative: Structured Retrieval
- Using higher-order information can distinguish relevant vs. non-relevant results that look the same to a bag-of-words retrieval model
- Linguistic and semantic analyses are stored as annotations and indexed as fields
- Constraint checking at retrieval time can improve document ranking based on matching constraints, thereby reducing the post-processing burden
84. The Role of Retrieval in QA
[Pipeline figure: Input Question → ... → Output Answers]
- Coarse, first-pass filter to narrow the search space for answers
- Finding actual answers requires checking linguistic and semantic constraints
- Bag-of-words retrieval does not support such constraint checking at retrieval time
- May need to process a large number of irrelevant documents to find the best answers
- Want to improve document ranking based on constraints
85. Retrieval Approaches for QA
[Pipeline figure: Input Question → ... → Output Answers]
- System A
- Query composed of question keywords and a named-entity placeholder
- Bag-of-words retrieval
- Constraint checking using ASSERT, answer extraction
- System B
- Likely answer-bearing structures posited, one query per structure
- Structured retrieval with constraint checking
- Answer extraction
86. Research Questions
- How can we compare Systems A and B?
- Experiment: Answer-Bearing Sentence Retrieval
- How does the effectiveness of the structured approach compare to bag-of-words?
- Does structured retrieval effectiveness vary with question complexity?
- Experiment: The Effect of Annotation Quality
- To what degree is the effectiveness of structured retrieval dependent on the quality of the annotations?
87. Experiment: Answer-Bearing Sentence Retrieval
- Hypothesis: structured retrieval retrieves more relevant documents, more highly ranked, compared to bag-of-words retrieval
- AQUAINT Corpus (LDC2002T31)
- Sentence segmentation by MXTerminator [14]
- Named entity recognition by BBN IdentiFinder [1]
- Semantic role labels by ASSERT [12]
- 109 TREC 2002 factoid questions
- Exhaustive document-level judgments over AQUAINT [2, 8]
- Training (55) and test (54) sets, with similar answer type distribution
- Answer-bearing sentences manually identified
- Must completely contain the answer without requiring inference or aggregation of information across multiple sentences
- Gold-standard question analysis/query formulation
88. Example Answer-Bearing Sentence
Q1402: What year did Wilt Chamberlain score 100 points?
A: At the time of his 100-point game with the Philadelphia Warriors in 1962, Chamberlain was renting an apartment in New York.
[SRL annotation figure; TARGET: renting]
89. Question-Structure Mapping
Q1402: What year did Wilt Chamberlain score 100 points?
[Figure: an answer-bearing structure with TARGET and ARGM-TMP ("100 points", a date) slots, and the structured query that retrieves instances of this structure (Indri syntax):]

#combine[sentence]( #max( #combine[target](
    #max( #combine[./argm-tmp]( 100 point #any:date ) )
    #max( #combine[./arg0]( #max( #combine[person]( chamberlain ) ) ) ) ) ) )
90. Answer-Bearing Sentence Retrieval
- Two experimental conditions
- Single structure: one structured query
- Only answer-bearing sentences matching a single structure are considered relevant
- Every structure: many queries, round robin
- Any answer-bearing sentence is considered relevant
- Most QA systems are somewhere in between, querying for several, but not all, structures
- Keyword + Named Entity baseline
91. Results: Training Topics
[Results chart; values shown include 12.8 and 96.9.]
Optimal smoothing parameters: Jelinek-Mercer [19], with the collection language model weighted 0.2 and the document language model weighted 0.2.

92. Results: Test Topics
[Results chart; values shown include 11.4 and 46.6.]

93. Results
[Combined results chart: Training Topics (96.9) and Test Topics (46.6).]
Optimal smoothing parameters: Jelinek-Mercer [19], with the collection language model weighted 0.2 and the document language model weighted 0.2 (the standard Jelinek-Mercer form is recalled below).
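For reference, Jelinek-Mercer smoothing [19] linearly interpolates the maximum-likelihood document model with the collection model; the 0.2/0.2 weighting reported on the slide presumably refers to this interpolation, though the exact parameterization shown there is not fully legible:

```latex
p_{\lambda}(w \mid d) \;=\; (1-\lambda)\,\hat{p}_{\mathrm{ML}}(w \mid d) \;+\; \lambda\,p(w \mid C)
```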
94. Structure Complexity
- Results show that, on average, structured retrieval has superior recall of answer-bearing sentences
- For what types of queries is structured retrieval most helpful?
- Analyze recall at 200 for queries of different levels of complexity
- Complexity of a structure is estimated by counting the number of #combine operators, not including the outermost one

95. The more complex the structure sought, the more useful knowledge of that structure is in ranking answer-bearing sentences.

96. In the test set there are fewer queries in total, and fewer highly complex queries. This widens confidence intervals, but there is still a range where the 95% confidence intervals do not overlap much or at all.
97. The Effect of Annotation Quality
- Penn Treebank WSJ [9] corpus
- WSJ_GOLD: gold-standard Propbank [6] annotations
- WSJ_DEGRADED: semantic role labeling by ASSERT (88.8% accurate)
- All questions answerable over the corpus
- Exhaustively generated sentence-level relevance judgments
- 10,690 questions having more than one answer
98. Question and Judgment Generation
- Each sentence that contains a Propbank annotation can answer at least one question
- "Dow Jones publishes The Wall Street Journal, Barron's magazine, other periodicals and community newspapers."
- What does Dow Jones publish?
- Who publishes The Wall Street Journal, Barron's ...?
- Does Dow Jones publish The Wall Street Journal, Barron's ...?
[SRL annotation figure; TARGET: publishes]
99. What does Dow Jones publish?
- The group of sentences that answer this question
- WSJ_0427: Dow Jones publishes The Wall Street Journal, Barron's magazine, other periodicals and community newspapers and operates electronic business information services.
- WSJ_0152: Dow Jones publishes The Wall Street Journal, Barron's magazine, and community newspapers and operates financial news services and computer data bases.
- WSJ_1551: Dow Jones also publishes Barron's magazine, other periodicals and community newspapers and operates electronic business information services.
100. Judgments for WSJ_DEGRADED
- Sentences relevant for WSJ_GOLD are not relevant for WSJ_DEGRADED if ASSERT omits or mislabels an argument
- This models the reality of a QA system that cannot determine relevance if annotations are missing or incorrect, or if the sentence cannot be analyzed
- Constraint checking and answer extraction both depend on the analysis
101. Annotation Quality Results
Structured retrieval is robust to degraded annotation quality.

102. Structured Retrieval: Recall
- Structured retrieval ranks sentences that satisfy the constraints highly
- Structured retrieval outperforms the bag-of-words approach in terms of recall of relevant sentences
- Structured retrieval performs best when query structures anticipate answer-bearing structures, and when these structures are complex
103. Structured Retrieval: Precision
- For questions with keywords that frequently co-locate in the corpus, structured retrieval should offer a sizable precision advantage, e.g., "What country is Berlin in?"
- Querying on "Berlin" alone matches over 6,000 documents in the AQUAINT collection, most of which do not answer the question
- Questions such as this were intentionally excluded during construction of the test collection to ease the human assessment burden
104. Structured Retrieval: Efficiency
- Structured queries are slower to evaluate, but retrieve more relevant results, more highly ranked, compared to bag-of-words queries
- A QA system seeking to achieve a certain recall threshold will have to process fewer documents
- Processing fewer results can improve end-to-end system runtime, even for systems in which answer extraction cost is low
- The structured retrieval approach requires that the corpus be pre-processed off-line
- Using the bag-of-words approach, a QA system is free to run analysis tools on the fly, but this could negatively impact the latency of an interactive system
105. Structured Retrieval: Robustness
- Although accuracy degrades when the annotation quality degrades, the relative performance edge that structured retrieval enjoys over bag-of-words is maintained (details in the paper)
106. Exploring the Problem Space
[Table contrasting a corpus-based view and a query-based view of the problem space along three dimensions:]
- Domain: Newswire, WMD, Medical (keyword distribution over the corpus)
- Language: EN, JP, CH
- Annotations: NE, SRL, NomSRL, Syntax, special-purpose event frames (annotation or structure distribution over the corpus)
There is a hypothesized sub-space in which structured retrieval consistently outperforms. We may be able to determine the boundaries experimentally and then generalize.
107. Conclusions
- Structured retrieval retrieves more relevant documents, more highly ranked, compared to bag-of-words retrieval
- The better ranking means the QA system needs to process fewer documents to achieve a certain level of recall of answer-bearing sentences
- Although accuracy degrades when the annotation quality degrades, the relative performance edge that structured retrieval enjoys over bag-of-words is maintained
- Details are in the paper (submitted to SIGIR)
108. Future Work
- Question Analysis for Structured Retrieval
- Map question structures into likely answer-bearing structures
- Mitigate the computational burden of corpus annotation
- How to merge results from different structured queries in the event that more than one structure is considered relevant?
109. Experimental Plan
110. References
- [1] Bikel, Schwartz and Weischedel. An algorithm that learns what's in a name. Machine Learning, 34(1-3):211-231, 1999.
- [2] Bilotti, Katz and Lin. What works better for question answering: stemming or morphological query expansion? In Proc. of the IR4QA Workshop at SIGIR 2004. 2004.
- [6] Kingsbury, Palmer and Marcus. Adding semantic annotation to the Penn Treebank. In Proc. of HLT 2002. 2002.
- [8] Lin and Katz. Building a reusable test collection for question answering. JASIST, 57(7):851-861. 2006.
- [9] Marcus, Marcinkiewicz and Santorini. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313-330. 1993.
- [12] Pradhan, Ward, Hacioglu, Martin and Jurafsky. Shallow semantic parsing using support vector machines. In Proc. of HLT/NAACL 2004. 2004.
- [13] Prager, Brown, Coden and Radev. Question-answering by predictive annotation. In Proc. of SIGIR 2000. 2000.
- [14] Reynar and Ratnaparkhi. A maximum entropy approach to identifying sentence boundaries. In Proc. of ANLP 1997. 1997.
- [19] Zhai and Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proc. of SIGIR 2001. 2001.
111. Semantic Role Labeling for JAVELIN
- Introduction to Semantic Role Labeling
- Extracting information such as WHO did WHAT to WHOM, WHEN and HOW from a sentence
- The predicate describes the action/event; its arguments give information about who, what, whom, when, etc.
- Useful for Information Extraction, Question Answering and Summarization; e.g., "bromocriptine-induced activation of p38 MAP kinase" contains information used to answer questions such as "What activates p38 MAP kinase?" or "What induces the activation of p38 MAP kinase?"
112. Semantic Role Labeling for JAVELIN
- English SRL: ASSERT from U. Colorado
- ASSERT performance (F-measure)
- Hand-corrected parses: 89.4
- Automatic parsing: 79.4
- Upgrade to ASSERT 0.14b
- The current ASSERT is slow: the model is loaded once per document
- Explore using the remote/client service options from the new ASSERT
113. Semantic Role Labeling for JAVELIN
- Example from ASSERT
- "I mean, the line was out the door, when I first got there."
- [ARG0 I] [TARGET mean] [ARG1 the line was out the door when I first got there]
- I mean the line was out [ARGM-TMP the door] [ARGM-TMP when] [ARG0 I] [ARGM-TMP first] [TARGET got] [ARGM-LOC there]
- ASSERT misses the "be" and "have" verbs. KANTOO's rule-based system is used to handle these cases. "be" and "have" occur frequently in questions.
- PROPBANK doesn't have any examples of the predicate "be" in the training corpus
- Plan for our own future work on SRL for questions
- SRL for Chinese
- C-ASSERT Chinese ASSERT
- 82.02 F-score
- Chinese extension of English ASSERT
- Example
- ????????????????????
- ARG0 ??? ?? ARGM-ADV ? ??? ARG0 ? ?? ???
TARGET ?? ARG1 ?? ?
115Semantic Role Labeling for JAVELIN
- SRL for Japanese Not much work done. Develop
SRL system in house starting as a class project - Recently released - NAIST Text Corpus v1.2 beta-
includes verbal and nominative predicates with
labeled arguments - The Kyoto Text Corpus v4.0 includes POS,
non-projective dependency parse for 40,000
sentences and case role, anaphora, ellipsis and
co-reference for 5,000 sentences - Use CRF and Tree-CRF for the learning task
116Questions ?