Priberam's question answering system for Portuguese

1
Priberam's question answering system for Portuguese

Carlos Amaral, Helena Figueira, André Martins,
Afonso Mendes, Pedro Mendes, Cláudia Pinto

2
Summary

Introduction
A workbench for NLP
  Lexical resources
  Software tools
  Question categorization
System description
  Indexing process
  Question analysis
  Document retrieval
  Sentence retrieval
  Answer extraction
Evaluation Results
Conclusions

3
Introduction

Goal: to build a question answering (QA) engine that finds a unique, exact answer to natural-language (NL) questions.
Evaluation: QA@CLEF Portuguese monolingual task.
Previous work by Priberam on this subject:
  LegiX: a juridical information system
  SintaGest: a workbench for NLP
  TRUST project (Text Retrieval Using Semantics Technology): development of the Portuguese module in a cross-language environment.

4
Lexical resources

Lexicon
  Lemmas, inflections and POS
  Sense definitions (*)
  Semantic features, subcategorization and selection restrictions
  Ontological and terminological domains
  English and French equivalents (*)
  Lexical-semantic relations (e.g. derivations).
  (*) Not used in the QA system.
Thesaurus
Ontology
  Multilingual (**) (English, French, Portuguese); enables translations
  Designed by Synapse Développement for TRUST
  (**) Only Portuguese information is used in the QA system.

5
Software tools

Priberam's SintaGest: an NLP application that allows:
  Building & testing a context-free grammar (CFG)
  Building & testing contextual rules for:
    Morphological disambiguation
    Named entity & fixed expression recognition
  Building & testing patterns for question categorization/answer extraction
  Compressing & compiling all data into binary files.
Statistical POS tagger
  Used together with contextual rules for morphological disambiguation
  HMM-based (2nd order), trained with the CETEMPúblico corpus
  Fast & efficient performance thanks to the Viterbi algorithm.
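
As a rough illustration of the decoding step behind the POS tagger, here is a minimal Viterbi sketch in Python. It is a simplification, not the tagger described on this slide: it uses a first-order HMM (the slide mentions a 2nd-order one), and the tag set, the probabilities and the `floor` smoothing value are all made-up toy assumptions.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Most probable tag sequence for a non-empty `words` list (first-order HMM)."""
    def lp(p):
        # Log-probability with a small floor so unseen events do not give log(0).
        return math.log(max(p, floor))

    # best[i][t] = (log-prob of the best path ending in tag t at word i, previous tag)
    best = [{t: (lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p][0] + lp(trans_p[p].get(t, 0))) for p in tags),
                key=lambda pair: pair[1],
            )
            column[t] = (score + lp(emit_p[t].get(words[i], 0)), prev)
        best.append(column)

    # Backtrack from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

# Toy model (hypothetical numbers, for illustration only).
TAGS = ["DET", "N", "V"]
START = {"DET": 0.6, "N": 0.3, "V": 0.1}
TRANS = {
    "DET": {"N": 0.9, "V": 0.05, "DET": 0.05},
    "N": {"V": 0.6, "N": 0.3, "DET": 0.1},
    "V": {"DET": 0.5, "N": 0.4, "V": 0.1},
}
EMIT = {
    "DET": {"o": 0.7, "a": 0.3},
    "N": {"presidente": 0.5, "casa": 0.4, "fala": 0.1},
    "V": {"fala": 0.8, "casa": 0.2},
}
tagged = viterbi(["o", "presidente", "fala"], TAGS, START, TRANS, EMIT)
```

Dynamic programming keeps the cost linear in sentence length, which is presumably the efficiency the slide refers to; a 2nd-order model would track pairs of tags per position instead of single tags.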

6
Question categorization (I)

86 question categories, flat structure:
  <DENOMINATION>, <DATE OF EVENT>, <TOWN NAME>, <BIRTH DATE>, <FUNCTION>, ...
Categorization performed through rich patterns (more powerful than regular expressions)
More than one category is allowed (avoiding hard decisions)
Rich patterns are conditional expressions with words (Word), lemmas (Root), POS (Cat), ontology entries (Ont), question identifiers (QuestIdent), and constant phrases
Everything built & tested through SintaGest.

7
Question categorization (II)

There are 3 kinds of patterns:
  Question patterns (QPs): for question categorization.
  Answer patterns (APs): for sentence categorization (during indexation).
  Question answering patterns (QAPs): for answer extraction.

Example rules for the FUNCTION category (the number closing each pattern is its heuristic score):

QPs
  Question (FUNCTION)
    Word(quem) Distance(0,3) Root(ser) AnyCat(Nprop, ENT) = 15
      // e.g. Quem é Jorge Sampaio? (Who is Jorge Sampaio?)
    Word(que) QuestIdent(FUNCTION_N) Distance(0,3) QuestIdent(FUNCTION_V) = 15
      // e.g. Que cargo desempenha Jorge Sampaio? (What office does Jorge Sampaio hold?)
QAPs
  Answer
    Pivot AnyCat(Nprop, ENT) Root(ser) Definition With Ergonym? = 20
      // e.g. Jorge Sampaio é o Presidente da República... (Jorge Sampaio is the President of the Republic...)
    NounPhrase With Ergonym? AnyCat(Trav, Vg) Pivot AnyCat(Nprop, ENT) = 15
      // e.g. O presidente da República, Jorge Sampaio... (The President of the Republic, Jorge Sampaio...)
APs
  Answer (FUNCTION)
    QuestIdent(FUNCTION_N) = 10
    Ergonym = 10
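
The way such rich patterns could fire on a question can be sketched in Python. This is not SintaGest's engine: the element names (`Word`, `Root`, `AnyCat`, `QuestIdent`, `Distance`) are borrowed from the slide, but their exact semantics, the `Token` fields, the `QUEST_IDENT` word lists and the reading of `Distance(0,3)` as "skip 0 to 3 tokens" are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    word: str  # surface form
    root: str  # lemma
    cat: str   # POS or entity tag

# Hypothetical question-identifier lists (not taken from the slides).
QUEST_IDENT = {"FUNCTION_N": {"cargo", "função"},
               "FUNCTION_V": {"desempenhar", "exercer"}}

# Pattern elements are plain tuples.
def Word(w): return ("word", w)
def Root(r): return ("root", r)
def AnyCat(*cats): return ("cat", frozenset(cats))
def QuestIdent(name): return ("ident", name)
def Distance(lo, hi): return ("gap", lo, hi)

def fits(elem, tok):
    """Does a single (non-gap) pattern element accept this token?"""
    kind, value = elem[0], elem[1]
    if kind == "word":
        return tok.word.lower() == value
    if kind == "root":
        return tok.root == value
    if kind == "cat":
        return tok.cat in value
    if kind == "ident":
        return tok.root in QUEST_IDENT[value]
    return False

def match_at(pattern, tokens, pos):
    """True if `pattern` matches `tokens` starting exactly at `pos`."""
    if not pattern:
        return True
    head, rest = pattern[0], pattern[1:]
    if head[0] == "gap":
        _, lo, hi = head
        return any(match_at(rest, tokens, pos + skip)
                   for skip in range(lo, hi + 1) if pos + skip <= len(tokens))
    return (pos < len(tokens) and fits(head, tokens[pos])
            and match_at(rest, tokens, pos + 1))

def categorize(tokens, rules):
    """Return {category: best heuristic score}; several categories may fire."""
    scores = {}
    for category, pattern, score in rules:
        if any(match_at(pattern, tokens, start) for start in range(len(tokens))):
            scores[category] = max(score, scores.get(category, 0))
    return scores

RULES = [
    ("FUNCTION", [Word("quem"), Distance(0, 3), Root("ser"),
                  AnyCat("Nprop", "ENT")], 15),
    ("FUNCTION", [Word("que"), QuestIdent("FUNCTION_N"), Distance(0, 3),
                  QuestIdent("FUNCTION_V")], 15),
]

question = [Token("Quem", "quem", "Pron"), Token("é", "ser", "V"),
            Token("Jorge Sampaio", "Jorge Sampaio", "Nprop")]
categories = categorize(question, RULES)
```

Returning a dictionary rather than a single label mirrors the slides' point that more than one category is allowed.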
8
QA system overview

The system architecture is composed of 5 major modules: indexing, question analysis, document retrieval, sentence retrieval and answer extraction.

9
Indexing process

The collection of target documents is analysed (off-line) and the information is stored in an index database.
Each document first feeds the sentence analyser.
Sentence categorization: each sentence is classified with one or more question categories through the APs.
We build indices for:
  Lemmas
  Heads of derivation
  NEs and fixed expressions
  Question categories
  Ontology domains (at document level)
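
A minimal Python sketch of what such indices might look like follows. The input layout (`documents` as nested dictionaries) and the choice of sentence-level postings for lemmas and categories are assumptions; the slides only state which indices exist and that ontology domains are kept at document level. Indices for heads of derivation and for NEs/fixed expressions would have the same shape as the lemma index and are omitted.

```python
from collections import defaultdict

def build_indices(documents):
    """Build inverted indices from already analysed documents.

    `documents` maps doc_id -> {"domains": [...],
                                "sentences": [{"lemmas": [...], "categories": [...]}]}
    (a hypothetical layout chosen for this sketch).
    """
    lemma_index = defaultdict(set)     # lemma -> {(doc_id, sentence_no)}
    category_index = defaultdict(set)  # question category -> {(doc_id, sentence_no)}
    domain_index = defaultdict(set)    # ontology domain -> {doc_id}
    for doc_id, doc in documents.items():
        for domain in doc["domains"]:
            domain_index[domain].add(doc_id)
        for n, sentence in enumerate(doc["sentences"]):
            for lemma in sentence["lemmas"]:
                lemma_index[lemma].add((doc_id, n))
            for category in sentence["categories"]:
                category_index[category].add((doc_id, n))
    return lemma_index, category_index, domain_index

docs = {
    "d1": {"domains": ["politics"],
           "sentences": [{"lemmas": ["presidente", "Albânia"], "categories": ["FUNCTION"]},
                         {"lemmas": ["governo"], "categories": []}]},
    "d2": {"domains": ["sports"],
           "sentences": [{"lemmas": ["presidente", "clube"], "categories": []}]},
}
lemmas, categories_idx, domains = build_indices(docs)
```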

10
Question analysis

Input
  An NL question (e.g. Quem é o presidente da Albânia? "Who is the president of Albania?")
Procedure
  Sentence analysis
  Question categorization & activation of QAPs (through the QPs)
  Extraction of pivots (words, NEs, phrases, dates, abbreviations, ...)
  Query expansion (heads of derivation & synonyms)
Output
  Pivots: lemmas, heads & synonyms (e.g. presidente, Albânia, presidir, albanês, chefe de estado)
  Question categories (e.g. <FUNCTION>, <DENOMINATION>)
  Relevant ontological domains
  Active QAPs

11
Document retrieval

Input
  Pivots: lemmas (w_Li), heads (w_Hi) & synonyms (w_Sij)
  Question categories (c_k) & ontological domains (o_l)

Procedure
  Word weighting ω(w) according to:
    POS
    ilf (inverse lexical frequency)
    idf (inverse document frequency).
  Each document d is given a score ρ_d:

    ρ_d ← 0
    For Each pivot i
      If d contains lemma w_Li Then
        ρ_d += K_L ω(w_Li)
      Else If d contains head w_Hi Then
        ρ_d += K_H ω(w_Hi)
      Else If d contains any synonym w_Sij Then
        ρ_d += max_j(K_S σ(w_Sij, w_Li) ω(w_Sij))
    If d contains any question category c_k Then ρ_d += K_C
    If d contains any ontology domain o_l Then ρ_d += K_O
    ρ_d ← RewardPivotProximity(d, ρ_d)

Output
  The top 30 scored documents.
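
The scoring loop on this slide can be sketched in Python as follows. Several pieces are assumptions rather than details from the presentation: the constant values for K_L, K_H, K_S, K_C and K_O, the reading of each step as an additive update, the `weight` and `similarity` callables (standing in for the word weighting and the synonym-to-lemma factor), and the identity default for the proximity reward, whose definition the slides do not give.

```python
# Hypothetical constants; the slides name them but give no values.
K_L, K_H, K_S, K_C, K_O = 1.0, 0.8, 0.6, 0.5, 0.3

def score_document(doc, pivots, categories, domains, weight, similarity,
                   reward_proximity=lambda doc, score: score):
    """Score one document following the per-pivot cascade lemma > head > synonym."""
    score = 0.0
    for pivot in pivots:
        if pivot["lemma"] in doc["lemmas"]:
            score += K_L * weight(pivot["lemma"])
        elif pivot.get("head") in doc["heads"]:
            score += K_H * weight(pivot["head"])
        else:
            found = [s for s in pivot.get("synonyms", ()) if s in doc["lemmas"]]
            if found:
                score += max(K_S * similarity(s, pivot["lemma"]) * weight(s)
                             for s in found)
    if doc["categories"] & set(categories):
        score += K_C
    if doc["domains"] & set(domains):
        score += K_O
    return reward_proximity(doc, score)

def top_documents(docs, n=30, **query):
    """Return (doc_id, score) pairs for the n best-scoring documents."""
    scored = [(doc_id, score_document(doc, **query)) for doc_id, doc in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# Toy data; the head/synonym assignments are illustrative guesses.
pivots = [{"lemma": "presidente", "head": "presidir", "synonyms": ["chefe de estado"]},
          {"lemma": "Albânia", "head": "albanês"}]
docs = {
    "d1": {"lemmas": {"presidente", "Albânia"}, "heads": set(),
           "categories": {"FUNCTION"}, "domains": set()},
    "d2": {"lemmas": {"chefe de estado"}, "heads": {"albanês"},
           "categories": set(), "domains": set()},
}
ranking = top_documents(docs, pivots=pivots, categories=["FUNCTION"], domains=[],
                        weight=lambda w: 1.0, similarity=lambda s, l: 0.9)
```

With these toy values, a document matching the pivots by lemma outranks one that matches only through a head of derivation and a synonym, which is the ordering the cascade is meant to produce.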
12
Sentence retrieval

Input
  Scored documents (d, ρ_d) with relevant sentences marked.
Procedure
  Sentence analysis
  Sentence scoring: each sentence s is given a score ρ_s according to:
    Number of pivots' lemmas, heads & synonyms matching s
    Number of partial matches (Fidel ↔ Fidel Castro)
    Order & proximity of pivots in s
    Existence of common question categories between q and s
    Score ρ_d of document d containing s.
Output
  Scored sentences (s, ρ_s) above a fixed threshold.
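
The slides list the criteria for the sentence score but not how they are combined. The Python sketch below assumes a simple weighted sum; every weight, the word-subset test used for partial matches, and the proximity bonus are illustrative assumptions, and pivot order is ignored for brevity.

```python
def partial_match(pivot, phrase):
    """True if every word of the pivot occurs in the phrase (Fidel vs Fidel Castro)."""
    return set(pivot.lower().split()) <= set(phrase.lower().split())

def score_sentence(sentence, pivots, question_categories, doc_score,
                   w_match=1.0, w_partial=0.5, w_prox=1.0, w_cat=0.5, w_doc=0.1):
    """Weighted-sum sentence score (hypothetical weights, not from the slides)."""
    score = 0.0
    positions = []
    for pivot in pivots:
        for pos, phrase in enumerate(sentence["phrases"]):
            if phrase.lower() == pivot.lower():
                score += w_match          # full match
            elif partial_match(pivot, phrase):
                score += w_partial        # partial match
            else:
                continue
            positions.append(pos)
            break                         # count each pivot once
    if len(positions) > 1:
        span = max(positions) - min(positions)
        score += w_prox / (1 + span)      # closer pivots earn a larger bonus
    if set(sentence["categories"]) & set(question_categories):
        score += w_cat
    return score + w_doc * doc_score

def retrieve_sentences(sentences, threshold, **query):
    """Keep (sentence, score) pairs whose score exceeds the fixed threshold."""
    scored = [(s, score_sentence(s, **query)) for s in sentences]
    return [(s, sc) for s, sc in scored if sc > threshold]

sentence = {"phrases": ["O", "presidente", "de", "Cuba", "Fidel Castro", "discursou"],
            "categories": ["FUNCTION"]}
rho_s = score_sentence(sentence, ["Fidel", "Cuba"], ["FUNCTION"], doc_score=2.0)
```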

13
Answer extraction

Input
  Scored sentences (s, ρ_s)
  Active QAPs (from the Question Analysis module)
Procedure
  Answer extraction & scoring through the QAPs
    e.g. O Presidente da Albânia, Sali Berisha, tentou evitar o pior, afirmando que não está provado que o Governo grego esteja envolvido no ataque. ("The President of Albania, Sali Berisha, tried to avoid the worst, stating that it has not been proven that the Greek Government was involved in the attack.")
  Answer coherence
    Each answer a is rescored to ρ_a, taking into account its coherence with the whole collection of candidate answers (e.g., Sali Berisha, Ramiz Alia, Berisha)
  Selection of the final answer.
Output
  The answer a with the highest ρ_a, or NIL if no answer was extracted.
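
One possible reading of the coherence step, sketched in Python: candidates that are compatible with each other (here, one is a word-subset of the other, as with Berisha and Sali Berisha) pool their scores. The compatibility test, the additive rescoring and the tie-break in favour of the longer answer are assumptions; the slides do not specify the actual formula.

```python
def coherent(a, b):
    """Two candidates support each other if one's words are a subset of the other's."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return wa <= wb or wb <= wa

def select_answer(candidates):
    """candidates: list of (answer, score). Returns the best answer, or None for NIL."""
    if not candidates:
        return None
    rescored = []
    for i, (answer, score) in enumerate(candidates):
        # Add the scores of all other candidates that are coherent with this one.
        support = sum(s for j, (other, s) in enumerate(candidates)
                      if j != i and coherent(answer, other))
        # Ties are broken in favour of the longer (more complete) answer.
        rescored.append((score + support, len(answer), answer))
    return max(rescored)[2]

# Scores are made up; the candidate strings come from the slide's example.
final = select_answer([("Sali Berisha", 15), ("Ramiz Alia", 20), ("Berisha", 10)])
```

In this toy run the two mutually supporting candidates overtake the one with the highest individual score, which is the kind of effect a coherence rescoring is meant to achieve.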

14
Results evaluation (I)

QA@CLEF evaluation
  Portuguese monolingual task
  210,734 target documents (564 MB) from Portuguese & Brazilian newspaper corpora: Público 1994, Público 1995, Folha 1994, Folha 1995
  Test set of 200 questions (in Brazilian and European Portuguese).
Results
  64.5% of right answers (R)

15
Results evaluation (II)

Reasons for bad answers (W+X+U: wrong, inexact and unsupported)

16
Conclusions

Priberam's QA system exhibited encouraging results:
  State-of-the-art accuracy (64.5%) in the QA@CLEF evaluation
Possible advantages over other systems:
  Adjustable & powerful patterns for categorization & extraction (SintaGest)
  Query expansion through heads of derivation & synonyms
  Use of ontology to introduce semantic knowledge
Some future work:
  Confidence measure for final answer validation
  Handling of list, how and temporally restricted questions
  Semantic disambiguation & further exploitation of the ontology
  Syntactical parsing & anaphora resolution
  Refinement for Web & book searching

18
Ontology

Concept-based
Tree-structured, 4 levels
Nodes are concepts
Leaves are senses of words
Words are translated into several languages
(English, French, Portuguese, Italian, Polish,
and soon Spanish and Czech)
There are 3387 terminal nodes (the most specific
concepts)
