Priberam's question answering system for Portuguese

1
Priberam's question answering system for
Portuguese
  • Carlos Amaral, Helena Figueira, André Martins,
    Afonso Mendes, Pedro Mendes, Cláudia Pinto

2
Summary
  • Introduction
  • A workbench for NLP
  • Lexical resources
  • Software tools
  • Question categorization
  • System description
  • Indexing process
  • Question analysis
  • Document retrieval
  • Sentence retrieval
  • Answer extraction
  • Evaluation Results
  • Conclusions

3
Introduction
  • Goal: to build a question answering (QA) engine
    that finds a unique exact answer for natural
    language (NL) questions.
  • Evaluation: QA@CLEF Portuguese monolingual task.
  • Previous work by Priberam on this subject:
  • LegiX: a juridical information system
  • SintaGest: a workbench for NLP
  • TRUST project (Text Retrieval Using Semantics
    Technology): development of the Portuguese
    module in a cross-language environment.

4
Lexical resources
  • Lexicon
  • Lemmas, inflections and POS
  • Sense definitions (*)
  • Semantic features, subcategorization and
    selection restrictions
  • Ontological and terminological domains
  • English and French equivalents (*)
  • Lexical-semantic relations (e.g. derivations).
  • (*) Not used in the QA system.
  • Thesaurus
  • Ontology
  • Multilingual (**) (English, French, Portuguese):
    enables translations
  • Designed by Synapse Développement for TRUST
  • (**) Only Portuguese information is used in the
    QA system.

5
Software tools
  • Priberam's SintaGest: an NLP application that
    allows:
  • Building & testing a context-free grammar (CFG)
  • Building & testing contextual rules for:
  • Morphological disambiguation
  • Named entity & fixed expressions recognition
  • Building & testing patterns for question
    categorization/answer extraction
  • Compressing & compiling all data into binary
    files.
  • Statistical POS tagger
  • Used together w/ contextual rules for
    morphological disambiguation
  • HMM-based (2nd order), trained with the
    CETEMPúblico corpus
  • Fast & efficient performance => Viterbi algorithm.
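
The decoding step of such a tagger can be illustrated with a minimal Viterbi sketch. For brevity it assumes a first-order (bigram) HMM rather than the second-order model named on the slide, and the tag set, probabilities and the `unk` smoothing value are toy assumptions, not values trained on CETEMPúblico.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, unk=1e-6):
    """Most likely tag sequence for `words` under a first-order HMM.

    `unk` is a hypothetical probability for unseen (tag, word) pairs.
    """
    if not words:
        return []

    def lp(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][tag] = (log-prob of best path ending in `tag` at position i, previous tag)
    best = [{t: (lp(start_p[t]) + lp(emit_p[t].get(words[0], unk)), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p][0] + lp(trans_p[p][t])) for p in tags),
                key=lambda pair: pair[1],
            )
            column[t] = (score + lp(emit_p[t].get(words[i], unk)), prev)
        best.append(column)

    # Backtrack from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]

# Toy model (illustrative values only).
TAGS = ["DET", "N", "V"]
START = {"DET": 0.6, "N": 0.3, "V": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "N": 0.9, "V": 0.05},
    "N": {"DET": 0.1, "N": 0.2, "V": 0.7},
    "V": {"DET": 0.5, "N": 0.4, "V": 0.1},
}
EMIT = {
    "DET": {"o": 0.7, "a": 0.3},
    "N": {"presidente": 0.5, "casa": 0.4, "fala": 0.1},
    "V": {"fala": 0.8, "casa": 0.2},
}

tagged = viterbi(["o", "presidente", "fala"], TAGS, START, TRANS, EMIT)
print(tagged)
```

Dynamic programming keeps the cost linear in sentence length, which is presumably what the slide means by fast and efficient performance; a second-order model would track pairs of tags as states instead of single tags.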

6
Question categorization (I)
  • 86 question categories, flat structure
  • <DENOMINATION>, <DATE OF EVENT>, <TOWN NAME>,
    <BIRTH DATE>, <FUNCTION>, ...
  • Categorization performed through rich patterns
    (more powerful than regular expressions)
  • More than one category is allowed (avoiding hard
    decisions)
  • Rich patterns are conditional expressions w/
    words (Word), lemmas (Root), POS (Cat), ontology
    entries (Ont), question identifiers (QuestIdent),
    and constant phrases
  • Everything built & tested through SintaGest.

7
Question categorization (II)
  • There are 3 kinds of patterns:
  • Question patterns (QPs): for question
    categorization.
  • Answer patterns (APs): for sentence
    categorization (during indexing).
  • Question answering patterns (QAPs): for answer
    extraction.

Example patterns (the number closing each pattern is its heuristic score)

QPs:
Question (FUNCTION)
    Word(quem) Distance(0,3) Root(ser) AnyCat(Nprop, ENT) 15
        // e.g. Quem é Jorge Sampaio? ("Who is Jorge Sampaio?")
    Word(que) QuestIdent(FUNCTION_N) Distance(0,3) QuestIdent(FUNCTION_V) 15
        // e.g. Que cargo desempenha Jorge Sampaio? ("What office does Jorge Sampaio hold?")

QAPs:
Answer
    Pivot AnyCat (Nprop, ENT) Root(ser) Definition With Ergonym? 20
        // e.g. Jorge Sampaio é o Presidente da República... ("Jorge Sampaio is the President of the Republic...")
    NounPhrase With Ergonym? AnyCat (Trav, Vg) Pivot AnyCat (Nprop, ENT) 15
        // e.g. O presidente da República, Jorge Sampaio... ("The President of the Republic, Jorge Sampaio...")

APs:
Answer (FUNCTION)
    QuestIdent(FUNCTION_N) 10
    Ergonym 10

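
To make the idea of a rich pattern concrete, here is a minimal sketch of how the first question pattern above might be matched against analysed tokens. The `Token` class, the helper constructors, the reading of `Distance(lo, hi)` as "between lo and hi tokens may be skipped before the next element", and the greedy matching without backtracking are all assumptions made for illustration; SintaGest's actual pattern engine is not described on the slides.

```python
from dataclasses import dataclass

@dataclass
class Token:
    word: str  # surface form
    root: str  # lemma
    cat: str   # POS / entity category

def Word(w):
    return lambda tok: tok.word.lower() == w

def Root(r):
    return lambda tok: tok.root == r

def AnyCat(*cats):
    return lambda tok: tok.cat in cats

@dataclass
class Distance:
    lo: int
    hi: int

def match(pattern, tokens):
    """Greedy left-to-right match anchored at the first token (no backtracking)."""
    pos, gap = 0, (0, 0)
    for elem in pattern:
        if isinstance(elem, Distance):
            gap = (elem.lo, elem.hi)
            continue
        for skip in range(gap[0], gap[1] + 1):
            if pos + skip < len(tokens) and elem(tokens[pos + skip]):
                pos += skip + 1
                break
        else:
            return False  # no position within the allowed gap satisfied this element
        gap = (0, 0)
    return True

# category -> list of (pattern, heuristic score)
PATTERNS = {
    "FUNCTION": [
        ([Word("quem"), Distance(0, 3), Root("ser"), AnyCat("Nprop", "ENT")], 15),
    ],
}

def categorize(tokens):
    """Return every matching category with its best score (no hard decision)."""
    scores = {}
    for category, rules in PATTERNS.items():
        for pattern, score in rules:
            if match(pattern, tokens):
                scores[category] = max(scores.get(category, 0), score)
    return scores

question = [
    Token("Quem", "quem", "Pron"),
    Token("é", "ser", "V"),
    Token("Jorge Sampaio", "Jorge Sampaio", "Nprop"),
]
result = categorize(question)
print(result)
```

Returning a dictionary of scored categories rather than a single label mirrors the slide's point that more than one category is allowed.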
8
QA system overview
  • The system architecture is composed of 5 major
    modules: indexing, question analysis, document
    retrieval, sentence retrieval and answer
    extraction.

9
Indexing process
  • The collection of target documents is analysed
    (off-line) and information is stored in an index
    database.
  • Each document first feeds the sentence analyser.
  • Sentence categorization: each sentence is
    classified with one or more question categories
    through the APs.
  • We build indices for:
  • Lemmas
  • Heads of derivation
  • NEs and fixed expressions
  • Question categories
  • Ontology domains (at document level)
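
As a rough illustration of the indices listed above, the sketch below keeps one inverted index per key type, mapping each key to the sentences that contain it, plus a document-level table of ontology domains. The class name, method names and the in-memory dictionaries are assumptions; the slides do not describe the actual index database.

```python
from collections import defaultdict

class SentenceIndex:
    """Toy in-memory stand-in for the index database."""

    KEY_TYPES = ("lemma", "head", "entity", "category")

    def __init__(self):
        # one inverted index per key type: key -> set of (doc_id, sentence_no)
        self.indices = {name: defaultdict(set) for name in self.KEY_TYPES}
        # ontology domains are stored at document level
        self.doc_domains = defaultdict(set)

    def add_sentence(self, doc_id, sent_no, lemmas=(), heads=(), entities=(), categories=()):
        loc = (doc_id, sent_no)
        groups = zip(self.KEY_TYPES, (lemmas, heads, entities, categories))
        for name, keys in groups:
            for key in keys:
                self.indices[name][key].add(loc)

    def add_document_domains(self, doc_id, domains):
        self.doc_domains[doc_id].update(domains)

    def lookup(self, name, key):
        # .get avoids creating empty entries for unseen keys
        return self.indices[name].get(key, set())

index = SentenceIndex()
index.add_sentence(
    "doc1", 0,
    lemmas=["presidente", "Albânia", "ser"],
    heads=["presidir"],
    entities=["Sali Berisha"],
    categories=["FUNCTION"],
)
index.add_document_domains("doc1", ["politics"])  # hypothetical domain label
hits = index.lookup("lemma", "presidente")
print(hits)
```

Storing question categories per sentence at indexing time is what lets the later retrieval steps reward sentences that already look like answers to a given kind of question.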

10
Question analysis
  • Input
  • An NL question (e.g. Quem é o presidente da
    Albânia?, "Who is the president of Albania?")
  • Procedure
  • Sentence analysis
  • Question categorization & activation of QAPs
    (through the QPs)
  • Extraction of pivots (words, NEs, phrases, dates,
    abbreviations, ...)
  • Query expansion (heads of derivation &
    synonyms)
  • Output
  • Pivots' lemmas, heads & synonyms (e.g.
    presidente, Albânia, presidir, albanês, chefe de
    estado; i.e. president, Albania, to preside,
    Albanian, head of state)
  • Question categories (e.g. <FUNCTION>,
    <DENOMINATION>)
  • Relevant ontological domains
  • Active QAPs
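
The query expansion step can be sketched as a lookup of each pivot lemma in two lexical tables. The table contents below are toy data built from the slide's example, and the assignment of each expanded term to "head of derivation" or "synonym" is a guess made for illustration; the function names are hypothetical.

```python
# Toy lexical tables (assumed; the real lexicon and thesaurus are far larger).
HEAD_OF_DERIVATION = {"presidente": "presidir", "Albânia": "albanês"}
SYNONYMS = {"presidente": ["chefe de estado"]}

def expand_query(pivot_lemmas):
    """Attach a head of derivation and synonyms to each pivot lemma."""
    return [
        {
            "lemma": lemma,
            "head": HEAD_OF_DERIVATION.get(lemma),  # None when unknown
            "synonyms": list(SYNONYMS.get(lemma, [])),
        }
        for lemma in pivot_lemmas
    ]

def search_terms(expanded):
    """Flatten expanded pivots into an ordered list of distinct search terms."""
    terms = []
    for pivot in expanded:
        for term in [pivot["lemma"], pivot["head"], *pivot["synonyms"]]:
            if term and term not in terms:
                terms.append(term)
    return terms

pivots = expand_query(["presidente", "Albânia"])
terms = search_terms(pivots)
print(terms)
```

Keeping the lemma, head and synonyms separate per pivot, rather than flattening immediately, matters downstream because each kind of match can then be weighted differently.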

11
Document retrieval
  • Input
  • Pivots' lemmas (wLi), heads (wHi) & synonyms
    (wSij)
  • Question categories (ck) & ontological domains
    (ol)
  • Procedure
  • Each word w is given a weight, weight(w),
    according to:
  • POS
  • ilf (inverse lexical frequency)
  • idf (inverse document frequency).
  • Each document d is given a score, score(d)
  • Output
  • The top 30 scored documents.

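Document scoring (KL, KH, KS, KC and KO are constants;
sim is a synonym-to-lemma similarity):

score(d) = 0
For Each pivot i
    If d contains lemma wLi Then
        score(d) += KL * weight(wLi)
    Else If d contains head wHi Then
        score(d) += KH * weight(wHi)
    Else If d contains any synonym wSij Then
        score(d) += max_j(KS * sim(wSij, wLi) * weight(wSij))
If d contains any question category ck Then score(d) += KC
If d contains any ontology domain ol Then score(d) += KO
score(d) = RewardPivotProximity(d, score(d))
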
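
The scoring procedure above can be sketched in Python as follows. Several things here are assumptions: reading each step as adding to the running score, the numeric values of `KL`, `KH`, `KS`, `KC` and `KO` (the slides name the constants but give no values), the `weight` and `sim` callables, the dictionary layout of documents and pivots, and the omission of the pivot-proximity reward.

```python
import heapq

# Hypothetical constants, chosen only so that lemma > head > synonym matches.
KL, KH, KS, KC, KO = 1.0, 0.8, 0.6, 0.5, 0.3

def score_document(doc, pivots, categories, domains, weight, sim):
    """Score one document; `doc` has 'terms', 'categories' and 'domains' sets."""
    score = 0.0
    for p in pivots:
        if p["lemma"] in doc["terms"]:
            score += KL * weight(p["lemma"])
        elif p["head"] in doc["terms"]:
            score += KH * weight(p["head"])
        else:
            found = [s for s in p["synonyms"] if s in doc["terms"]]
            if found:
                score += max(KS * sim(s, p["lemma"]) * weight(s) for s in found)
    if doc["categories"] & set(categories):
        score += KC
    if doc["domains"] & set(domains):
        score += KO
    return score  # the pivot-proximity reward is not modelled in this sketch

def top_documents(docs, n=30, **kwargs):
    """Return the n highest-scoring (score, doc_id) pairs."""
    scored = ((score_document(d, **kwargs), doc_id) for doc_id, d in docs.items())
    return heapq.nlargest(n, scored)

docs = {
    "doc1": {"terms": {"presidir", "Albânia"}, "categories": {"FUNCTION"}, "domains": set()},
    "doc2": {"terms": {"futebol"}, "categories": set(), "domains": set()},
}
pivots = [
    {"lemma": "presidente", "head": "presidir", "synonyms": ["chefe de estado"]},
    {"lemma": "Albânia", "head": None, "synonyms": []},
]
ranking = top_documents(
    docs, pivots=pivots, categories=["FUNCTION"], domains=[],
    weight=lambda w: 1.0,       # placeholder for the POS/ilf/idf weighting
    sim=lambda syn, lemma: 0.5,  # placeholder synonym similarity
)
print(ranking)
```

The `if`/`elif`/`else` chain means each pivot contributes at most once, through its strongest available evidence: an exact lemma match first, then the head of derivation, then the best synonym.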
12
Sentence retrieval
  • Input
  • Scored documents (d, score(d)) w/ relevant
    sentences marked.
  • Procedure
  • Sentence analysis
  • Sentence scoring: each sentence s is given a
    score, score(s), according to:
  • Pivots' lemmas, heads & synonyms matching s
  • Partial matches (e.g. Fidel vs. Fidel Castro)
  • Order & proximity of pivots in s
  • Existence of common question categories between q
    and s
  • Score score(d) of the document d containing s.
  • Output
  • Scored sentences (s, score(s)) above a fixed
    threshold.

13
Answer extraction
  • Input
  • Scored sentences (s, score(s))
  • Active QAPs (from the Question Analysis module)
  • Procedure
  • Answer extraction & scoring through the QAPs
  • e.g. O Presidente da Albânia, Sali Berisha,
    tentou evitar o pior, afirmando que não está
    provado que o Governo grego esteja envolvido no
    ataque. ("The President of Albania, Sali Berisha,
    tried to avoid the worst, stating that it has not
    been proven that the Greek government is involved
    in the attack.")
  • Answer coherence
  • Each answer a is rescored to score(a), taking
    into account its coherence with the whole
    collection of candidate answers (e.g., Sali
    Berisha, Ramiz Alia, Berisha)
  • Selection of the final answer.
  • Output
  • The answer a with the highest score(a), or NIL if
    no answer was extracted.
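
One way to picture the coherence step is to let candidates that partially match each other reinforce one another. The slides do not say how coherence is actually computed, so everything below is an assumption: the token-subset test for "related" answers, adding the related candidates' scores, the tie-break in favour of the longer answer, and the toy extraction scores.

```python
def related(a, b):
    """Assumed coherence test: one answer's tokens are a subset of the other's."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return ta <= tb or tb <= ta

def rescore(candidates):
    """candidates: dict answer -> extraction score. Returns dict answer -> rescored value."""
    return {
        a: s + sum(s2 for b, s2 in candidates.items() if b != a and related(a, b))
        for a, s in candidates.items()
    }

def select_answer(candidates):
    """Pick the best rescored answer, or None (standing in for NIL) if there is none."""
    if not candidates:
        return None
    rescored = rescore(candidates)
    # Ties are broken in favour of the answer with more tokens (an assumption).
    return max(rescored, key=lambda a: (rescored[a], len(a.split())))

# Toy extraction scores for the slide's three candidates.
candidates = {"Sali Berisha": 20, "Ramiz Alia": 25, "Berisha": 15}
final = select_answer(candidates)
print(rescore(candidates), final)
```

In this toy run "Ramiz Alia" has the highest individual score, but "Sali Berisha" wins after rescoring because "Berisha" supports it, which is the kind of effect the coherence step is meant to capture.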

14
Results evaluation (I)
  • QA@CLEF evaluation
  • Portuguese monolingual task
  • 210734 target documents (564 MB) from Portuguese
    & Brazilian newspaper corpora: Público1994,
    Público1995, Folha1994, Folha1995
  • Test set of 200 questions (in Brazilian and
    European Portuguese).
  • Results
  • 64.5% of right answers (R)

15
Results evaluation (II)
  • Reasons for bad answers (W+X+U)

16
Conclusions
  • Priberam's QA system exhibited encouraging
    results:
  • State-of-the-art accuracy (64.5%) in the QA@CLEF
    evaluation
  • Possible advantages over other systems:
  • Adjustable & powerful patterns for categorization
    & extraction (SintaGest)
  • Query expansion through heads of derivation &
    synonyms
  • Use of ontology to introduce semantic knowledge
  • Some future work:
  • Confidence measure for final answer validation
  • Handling of list, how and temporally restricted
    questions
  • Semantic disambiguation & further exploitation of
    the ontology
  • Syntactical parsing & anaphora resolution
  • Refinement for Web & book searching

17
Priberam's question answering system for
Portuguese
  • Carlos Amaral, Helena Figueira, André Martins,
    Afonso Mendes, Pedro Mendes, Cláudia Pinto

18
Ontology
  • Concept-based
  • Tree-structured, 4 levels
  • Nodes are concepts
  • Leaves are senses of words
  • Words are translated in several languages
    (English, French, Portuguese, Italian, Polish,
    and soon Spanish and Czech)
  • There are 3387 terminal nodes (the most specific
    concepts)