Title: Priberam's question answering system for Portuguese
1 Priberam's question answering system for Portuguese
- Carlos Amaral, Helena Figueira, André Martins,
Afonso Mendes, Pedro Mendes, Cláudia Pinto
2 Summary
- Introduction
- A workbench for NLP
- Lexical resources
- Software tools
- Question categorization
- System description
- Indexing process
- Question analysis
- Document retrieval
- Sentence retrieval
- Answer extraction
- Evaluation Results
- Conclusions
3 Introduction
- Goal: to build a question answering (QA) engine that finds a unique exact answer to NL questions.
- Evaluation: QA@CLEF Portuguese monolingual task.
- Previous work by Priberam on this subject
- LegiX: a legal information system
- SintaGest: a workbench for NLP
- TRUST project (Text Retrieval Using Semantics Technology): development of the Portuguese module in a cross-language environment.
4 Lexical resources
- Lexicon
- Lemmas, inflections and POS
- Sense definitions (*)
- Semantic features, subcategorization and selection restrictions
- Ontological and terminological domains
- English and French equivalents (*)
- Lexical-semantic relations (e.g. derivations)
- (*) Not used in the QA system.
- Thesaurus
- Ontology
- Multilingual (*) (English, French, Portuguese): enables translations
- Designed by Synapse Développement for TRUST
- (*) Only Portuguese information is used in the QA system.
5 Software tools
- Priberam's SintaGest: an NLP application that allows
- Building & testing a context-free grammar (CFG)
- Building & testing contextual rules for
- Morphological disambiguation
- Named entity & fixed expression recognition
- Building & testing patterns for question categorization / answer extraction
- Compressing & compiling all data into binary files.
- Statistical POS tagger
- Used together w/ contextual rules for morphological disambiguation
- HMM-based (2nd order), trained with the CETEMPúblico corpus
- Fast & efficient performance → Viterbi algorithm (see the sketch below).
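The decoding step can be pictured with a minimal first-order Viterbi decoder in Python. This is an illustrative sketch only (the actual tagger is 2nd-order and trained on CETEMPúblico), and all probability tables below are toy, hypothetical values.

# Minimal first-order Viterbi decoder, for illustration only; the tag set and
# probability tables are toy values, not the system's.

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for `words` under a 1st-order HMM."""
    # best[i][tag] = (score of the best path ending in `tag` at position i, backpointer)
    best = [{tag: (start_p[tag] * emit_p[tag].get(words[0], 1e-6), None)
             for tag in tags}]
    for i in range(1, len(words)):
        column = {}
        for tag in tags:
            score, prev = max(
                (best[i - 1][p][0] * trans_p[p][tag] * emit_p[tag].get(words[i], 1e-6), p)
                for p in tags)
            column[tag] = (score, prev)
        best.append(column)
    # Backtrack from the best final state.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))

if __name__ == "__main__":
    tags = ["N", "V"]
    start_p = {"N": 0.7, "V": 0.3}
    trans_p = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.7, "V": 0.3}}
    emit_p = {"N": {"presidente": 0.5, "cargo": 0.5}, "V": {"desempenha": 1.0}}
    print(viterbi(["presidente", "desempenha", "cargo"], tags, start_p, trans_p, emit_p))
    # -> ['N', 'V', 'N']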
6 Question categorization (I)
- 86 question categories, flat structure
- <DENOMINATION>, <DATE OF EVENT>, <TOWN NAME>, <BIRTH DATE>, <FUNCTION>, ...
- Categorization performed through rich patterns (more powerful than regular expressions)
- More than one category is allowed (avoiding hard decisions)
- Rich patterns are conditional expressions w/ words (Word), lemmas (Root), POS (Cat), ontology entries (Ont), question identifiers (QuestIdent), and constant phrases
- Everything built & tested through SintaGest.
7 Question categorization (II)
- There are 3 kinds of patterns:
- Question patterns (QPs) for question categorization.
- Answer patterns (APs) for sentence categorization (during indexation).
- Question answering patterns (QAPs) for answer extraction.
- Heuristic scores for example patterns of the <FUNCTION> category:
- QPs: Question (FUNCTION)
- Word(quem) Distance(0,3) Root(ser) AnyCat(Nprop, ENT) → 15 // e.g. "Quem é Jorge Sampaio?" ("Who is Jorge Sampaio?")
- Word(que) QuestIdent(FUNCTION_N) Distance(0,3) QuestIdent(FUNCTION_V) → 15 // e.g. "Que cargo desempenha Jorge Sampaio?" ("Which office does Jorge Sampaio hold?")
- QAPs: Answer
- Pivot AnyCat(Nprop, ENT) Root(ser) Definition With Ergonym? → 20 // e.g. "Jorge Sampaio é o Presidente da República..." ("Jorge Sampaio is the President of the Republic...")
- NounPhrase With Ergonym? AnyCat(Trav, Vg) Pivot AnyCat(Nprop, ENT) → 15 // e.g. "O presidente da República, Jorge Sampaio..." ("The President of the Republic, Jorge Sampaio...")
- APs: Answer (FUNCTION)
- QuestIdent(FUNCTION_N) → 10
- Ergonym → 10
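To make the mechanism concrete, here is a toy Python sketch of how such a rich pattern could be evaluated against an analysed question. The predicate names mirror the slide (Word, Root, AnyCat, Distance), but the token format and matching logic are hypothetical and do not reflect SintaGest's actual representation.

# Toy illustration of "rich pattern" matching; tokens are assumed to come from
# the sentence analyser, with NEs already grouped into single tokens.
tokens = [
    {"word": "quem", "root": "quem", "cat": "Pron"},
    {"word": "é", "root": "ser", "cat": "V"},
    {"word": "Jorge Sampaio", "root": "jorge sampaio", "cat": "Nprop"},
]

def Word(value):
    return lambda tok: tok["word"].lower() == value

def Root(value):
    return lambda tok: tok["root"] == value

def AnyCat(*cats):
    return lambda tok: tok["cat"] in cats

def match(pattern, tokens, max_gap=3):
    """Match the predicates in order, allowing up to `max_gap` skipped tokens
    between consecutive matches (a crude stand-in for Distance(0,3))."""
    pos = 0
    for pred in pattern:
        for j in range(pos, min(pos + max_gap + 1, len(tokens))):
            if pred(tokens[j]):
                pos = j + 1
                break
        else:
            return False
    return True

# QP for <FUNCTION>, score 15: Word(quem) Distance(0,3) Root(ser) AnyCat(Nprop, ENT)
pattern, category, score = [Word("quem"), Root("ser"), AnyCat("Nprop", "ENT")], "FUNCTION", 15
if match(pattern, tokens):
    print(f"question categorized as <{category}> with score {score}")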
8 QA system overview
- The system architecture is composed of 5 major modules: indexing, question analysis, document retrieval, sentence retrieval and answer extraction (described in the following slides).
9 Indexing process
- The collection of target documents is analysed (off-line) and the information is stored in an index database.
- Each document first feeds the sentence analyser
- Sentence categorization: each sentence is classified with one or more question categories through the APs.
- We build indices (see the sketch after this list) for
- Lemmas
- Heads of derivation
- NEs and fixed expressions
- Question categories
- Ontology domains (at document level)
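A minimal Python sketch of the kind of inverted indices listed above; the data structures, field names and example analysis are hypothetical, not the system's actual index format.

# Minimal sketch of the indexing step; the per-sentence analysis (lemmas, heads,
# NEs, categories) is assumed to come from the sentence analyser and the APs.
from collections import defaultdict

class Index:
    def __init__(self):
        # term/category -> set of (doc_id, sentence_id); domain -> set of doc_id
        self.lemmas = defaultdict(set)
        self.heads = defaultdict(set)
        self.named_entities = defaultdict(set)
        self.categories = defaultdict(set)
        self.domains = defaultdict(set)        # document-level, as on the slide

    def add_sentence(self, doc_id, sent_id, analysis):
        for lemma in analysis["lemmas"]:
            self.lemmas[lemma].add((doc_id, sent_id))
        for head in analysis["heads"]:
            self.heads[head].add((doc_id, sent_id))
        for ne in analysis["named_entities"]:
            self.named_entities[ne].add((doc_id, sent_id))
        for cat in analysis["categories"]:     # assigned by the APs
            self.categories[cat].add((doc_id, sent_id))

    def add_document_domains(self, doc_id, domains):
        for dom in domains:
            self.domains[dom].add(doc_id)

index = Index()
index.add_sentence("publico-1995-001", 0, {
    "lemmas": {"presidente", "albânia", "ser"},
    "heads": {"presidir"},
    "named_entities": {"Sali Berisha"},
    "categories": {"FUNCTION"},
})
index.add_document_domains("publico-1995-001", {"politics"})
print(sorted(index.lemmas["presidente"]))   # [('publico-1995-001', 0)]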
10 Question analysis
- Input
- A NL question (e.g. "Quem é o presidente da Albânia?", i.e. "Who is the president of Albania?")
- Procedure
- Sentence analysis
- Question categorization: activation of QAPs (through the QPs)
- Extraction of pivots (words, NEs, phrases, dates, abbreviations, ...)
- Query expansion (heads of derivation & synonyms)
- Output (see the sketch below)
- Pivots' lemmas, heads & synonyms (e.g. presidente, Albânia, presidir, albanês, chefe de estado)
- Question categories (e.g. <FUNCTION>, <DENOMINATION>)
- Relevant ontological domains
- Active QAPs
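The output of this module can be pictured as a small record; the Python sketch below is purely illustrative and its field names are assumptions, not the system's actual data structures.

# Illustrative container for the question analysis output described above.
from dataclasses import dataclass, field

@dataclass
class QuestionAnalysis:
    pivots: list        # pivot lemmas, e.g. ["presidente", "Albânia"]
    heads: list         # heads of derivation, e.g. ["presidir", "albanês"]
    synonyms: dict      # pivot -> synonyms, e.g. {"presidente": ["chefe de estado"]}
    categories: list    # question categories, e.g. ["FUNCTION", "DENOMINATION"]
    domains: list = field(default_factory=list)       # relevant ontological domains
    active_qaps: list = field(default_factory=list)   # QAPs activated through the QPs

q = QuestionAnalysis(
    pivots=["presidente", "Albânia"],
    heads=["presidir", "albanês"],
    synonyms={"presidente": ["chefe de estado"]},
    categories=["FUNCTION", "DENOMINATION"],
)
print(q.categories)   # ['FUNCTION', 'DENOMINATION']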
11 Document retrieval
- Input
- Pivots' lemmas (wLi), heads (wHi) & synonyms (wSij)
- Question categories (ck) & ontological domains (ol)
- Procedure
- Word weighting ω(w) according to
- POS
- ilf (inverse lexical frequency)
- idf (inverse document frequency)
- Each document d is given a score ρd, computed as shown below
- Output
- The top 30 scored documents.
ρd ← 0
For each pivot i:
    If d contains lemma wLi Then ρd += KL · ω(wLi)
    Else If d contains head wHi Then ρd += KH · ω(wHi)
    Else If d contains any synonym wSij Then ρd += maxj(KS · σ(wSij, wLi) · ω(wSij))
If d contains any question category ck Then ρd += KC
If d contains any ontology domain ol Then ρd += KO
ρd ← RewardPivotProximity(d, ρd)
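The loop above translates almost directly into Python. In the sketch below the constants KL, KH, KS, KC, KO, the weight function ω (word_weight), the synonym factor σ (synonym_similarity) and RewardPivotProximity are all placeholders with made-up values; only the control flow follows the slide.

from dataclasses import dataclass, field

# Hypothetical constants and helper functions; only the scoring loop's control
# flow mirrors the slide.
K_L, K_H, K_S, K_C, K_O = 1.0, 0.8, 0.6, 0.5, 0.3

@dataclass
class Pivot:
    lemma: str
    head: str = ""
    synonyms: tuple = ()

@dataclass
class Document:
    lemmas: set
    categories: set = field(default_factory=set)
    domains: set = field(default_factory=set)

def word_weight(w):
    # Stand-in for the POS/ilf/idf-based weighting omega(w).
    return min(len(w) / 10.0, 1.0)

def synonym_similarity(syn, lemma):
    # Stand-in for the factor applied to synonym matches.
    return 0.5

def reward_pivot_proximity(doc, rho):
    # Stand-in for RewardPivotProximity: no change in this sketch.
    return rho

def score_document(doc, pivots, q_categories, q_domains):
    rho = 0.0
    for p in pivots:
        if p.lemma in doc.lemmas:                       # lemma match
            rho += K_L * word_weight(p.lemma)
        elif p.head and p.head in doc.lemmas:           # head-of-derivation match
            rho += K_H * word_weight(p.head)
        else:                                           # best synonym match
            scores = [K_S * synonym_similarity(s, p.lemma) * word_weight(s)
                      for s in p.synonyms if s in doc.lemmas]
            if scores:
                rho += max(scores)
    if q_categories & doc.categories:                   # shared question category
        rho += K_C
    if q_domains & doc.domains:                         # shared ontology domain
        rho += K_O
    return reward_pivot_proximity(doc, rho)

doc = Document(lemmas={"presidente", "albânia", "berisha"},
               categories={"FUNCTION"}, domains={"politics"})
pivots = [Pivot("presidente", head="presidir", synonyms=("chefe de estado",)),
          Pivot("albânia")]
print(score_document(doc, pivots, {"FUNCTION"}, {"politics"}))   # 2.5 with these toy weights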
12 Sentence retrieval
- Input
- Scored documents (d, ρd) w/ relevant sentences marked.
- Procedure
- Sentence analysis
- Sentence scoring: each sentence s is given a score ρs (sketched after this list) according to
- pivots' lemmas, heads & synonyms matching s
- partial matches (e.g. Fidel → Fidel Castro)
- order & proximity of pivots in s
- existence of common question categories between q and s
- the score ρd of the document d containing s.
- Output
- Scored sentences (s, ρs) above a fixed threshold.
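A rough Python sketch of these criteria; the weights and the partial-match heuristic (token containment, as in Fidel / Fidel Castro) are assumptions, not the system's actual formula.

# Rough sketch of the sentence-scoring criteria listed above.
def score_sentence(sentence_tokens, sentence_categories, pivots,
                   question_categories, doc_score):
    rho = 0.0
    positions = []
    for pivot in pivots:
        hit = None
        for pos, tok in enumerate(sentence_tokens):
            if tok == pivot:                     # full match
                rho += 1.0
                hit = pos
                break
            if pivot in tok.split():             # partial match, e.g. Fidel -> Fidel Castro
                rho += 0.5
                hit = pos
                break
        if hit is not None:
            positions.append(hit)
    # Reward pivots that occur close together and in question order.
    if len(positions) > 1:
        span = max(positions) - min(positions) + 1
        rho += len(positions) / span
        if positions == sorted(positions):
            rho += 0.5
    if set(sentence_categories) & set(question_categories):
        rho += 1.0                               # shared question category
    rho += 0.1 * doc_score                       # carry over the document score
    return rho

tokens = ["O", "presidente", "da", "Albânia", ",", "Sali Berisha", ",",
          "tentou", "evitar", "o", "pior"]
print(score_sentence(tokens, ["FUNCTION"], ["presidente", "Albânia", "Berisha"],
                     ["FUNCTION"], doc_score=2.5))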
13 Answer extraction
- Input
- Scored sentences (s, ρs)
- Active QAPs (from the Question Analysis module)
- Procedure
- Answer extraction & scoring through the QAPs
- Answer coherence
- Each answer a is rescored to ρa taking into account its coherence with the whole collection of candidate answers (e.g. Sali Berisha, Ramiz Alia, Berisha); a toy sketch follows below.
- Selection of the final answer.
- e.g. "O Presidente da Albânia, Sali Berisha, tentou evitar o pior, afirmando que não está provado que o Governo grego esteja envolvido no ataque." ("The President of Albania, Sali Berisha, tried to avoid the worst, stating that it has not been proven that the Greek Government is involved in the attack.")
- Output
- The answer a with the highest ρa, or NIL if no answer was extracted.
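A toy illustration of the coherence rescoring: candidates whose tokens overlap (e.g. "Sali Berisha" and "Berisha") reinforce each other, so the most supported and most complete form wins. The grouping heuristic and weights below are assumptions, not the system's actual rescoring.

# Toy answer-coherence rescoring over a candidate list.
def rescore(candidates):
    """candidates: list of (answer_string, extraction_score)."""
    rescored = []
    for ans, score in candidates:
        # Support from other candidates sharing at least one token.
        support = sum(s for other, s in candidates
                      if other != ans and
                      set(other.lower().split()) & set(ans.lower().split()))
        rescored.append((ans, score + 0.5 * support))
    return max(rescored, key=lambda x: x[1]) if rescored else ("NIL", 0.0)

candidates = [("Sali Berisha", 2.0), ("Ramiz Alia", 1.5), ("Berisha", 1.0)]
print(rescore(candidates))   # "Sali Berisha" is reinforced by "Berisha"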
14 Results & evaluation (I)
- QA@CLEF evaluation
- Portuguese monolingual task
- 210,734 target documents (564 MB) from Portuguese & Brazilian newspaper corpora: Público 1994, Público 1995, Folha 1994, Folha 1995
- Test set of 200 questions (in Brazilian and European Portuguese).
- Results
- 64.5% of right answers (R)
15 Results & evaluation (II)
- Reasons for bad answers (W: wrong, X: inexact, U: unsupported)
16 Conclusions
- Priberam's QA system exhibited encouraging results
- State-of-the-art accuracy (64.5%) in the QA@CLEF evaluation
- Possible advantages over other systems
- Adjustable & powerful patterns for categorization & extraction (SintaGest)
- Query expansion through heads of derivation & synonyms
- Use of the ontology to introduce semantic knowledge
- Some future work
- Confidence measure for final answer validation
- Handling of list, how and temporally restricted questions
- Semantic disambiguation, further exploiting the ontology
- Syntactic parsing & anaphora resolution
- Refinement for Web & book searching
17 Priberam's question answering system for Portuguese
- Carlos Amaral, Helena Figueira, André Martins,
Afonso Mendes, Pedro Mendes, Cláudia Pinto
18 Ontology
- Concept-based
- Tree-structured, 4 levels
- Nodes are concepts
- Leaves are senses of words
- Words are translated into several languages (English, French, Portuguese, Italian, Polish, and soon Spanish and Czech)
- There are 3387 terminal nodes (the most specific concepts); a toy sketch of such a concept tree follows.
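A minimal sketch of a concept tree with this shape (internal concept nodes, word senses at the leaves); the concept and sense names below are invented for illustration and are not taken from the TRUST ontology.

# Minimal concept tree: internal nodes are concepts, leaves hold word senses.
class ConceptNode:
    def __init__(self, name, children=None, senses=None):
        self.name = name
        self.children = children or []     # sub-concepts
        self.senses = senses or []         # word senses attached to a terminal concept

    def domains_of(self, sense, path=()):
        """Return the concept path leading to a given word sense, if any."""
        here = path + (self.name,)
        if sense in self.senses:
            return here
        for child in self.children:
            found = child.domains_of(sense, here)
            if found:
                return found
        return None

ontology = ConceptNode("root", children=[
    ConceptNode("society", children=[
        ConceptNode("politics", children=[
            ConceptNode("offices", senses=["presidente#1", "chefe de estado#1"]),
        ]),
    ]),
])
print(ontology.domains_of("presidente#1"))
# ('root', 'society', 'politics', 'offices')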