QuASI: Question Answering using Statistics, Semantics, and Inference

1
QuASI Question Answering using Statistics,
Semantics, and Inference
  • Marti Hearst, Jerry Feldman, Chris Manning,
    Srini Narayanan
  • Univ. of California-Berkeley / ICSI / Stanford
    University

2
Outline
  • TREC Genomics Track
  • Semantic Relation Classification

3
TREC Task 1 Overview
  • Search 525,938 MEDLINE records
  • Titles, abstracts, MeSH category terms, citation
    information
  • Topics
  • Taken from the GeneRIF portion of the LocusLink
    database
  • We are supplied with a set of gene names
  • Definition of a GeneRIF
  • For gene X, find all MEDLINE references that
    focus on the basic biology of the gene or its
    protein products from the designated organism. 
    Basic biology includes isolation, structure,
    genetics and function of genes/proteins in normal
    and disease states.

4
TREC Task 1 Sample Query
  • 3 2120 Homo sapiens OFFICIAL_GENE_NAME ets
    variant gene 6 (TEL oncogene)
  • 3 2120 Homo sapiens OFFICIAL_SYMBOL ETV6
  • 3 2120 Homo sapiens ALIAS_SYMBOL TEL
  • 3 2120 Homo sapiens PREFERRED_PRODUCT ets variant
    gene 6
  • 3 2120 Homo sapiens PRODUCT ets variant gene 6
  • 3 2120 Homo sapiens ALIAS_PROT TEL1 oncogene
  • The first column is the official topic number
    (1-50).
  • The second column contains the LocusLink ID for
    the gene.
  • The third column contains the name of organism.
  • The fourth column contains the gene name type.
  • The fifth column contains the gene name.
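The five-column layout above can be parsed directly. A minimal sketch (the function name is an assumption; it assumes whitespace-delimited fields and a two-token organism name such as "Homo sapiens"):

```python
def parse_topic_line(line):
    """Split one TREC Genomics topic line into its five fields.
    Assumes the organism name is exactly two tokens; the gene
    name itself may contain spaces."""
    tokens = line.split()
    topic_no = int(tokens[0])          # official topic number (1-50)
    locuslink_id = tokens[1]           # LocusLink ID for the gene
    organism = " ".join(tokens[2:4])   # organism name
    name_type = tokens[4]              # gene name type
    gene_name = " ".join(tokens[5:])   # the gene name
    return topic_no, locuslink_id, organism, name_type, gene_name
```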

5
TREC Task 1 Approach
  • Two main components
  • Retrieve relevant docs
  • May miss many because of variation in how gene
    names are expressed
  • Rank order them

6
TREC Task 1 Approach
  • Retrieval
  • Normalization of query terms
  • Special characters are replaced with spaces in
    both queries and documents.
  • Term expansion
  • A set of pattern-based rules is applied to the
    original list of query terms to expand the
    original set and increase recall.
  • Rules with lower confidence get a lower weight
    in the ranking step.
  • Stop word removal
  • Organism identification
  • Gene names are often shared across different
    organisms
  • Developed a method to automatically determine
    which MeSH terms correspond to LocusLink Organism
    terms
  • Retrieved MEDLINE docs indicated by LocusLink
    links corresponding to a given organism
  • Organism terms were the most frequent MeSH
    categories among the selected docs
  • Used these terms to identify the organism term in
    MEDLINE
  • An example of playing two databases off each
    other.
  • MeSH concepts
  • When an exact match is found between one of the
    query terms and a MeSH term assigned to a
    document, the document is retrieved.

7
Gene Name Expansion
8
Organism Filtering
9
TREC Task 1 Approach
  • Relevance ranking
  • IBM's DB2 Net Search Extender was used as the
    text search engine.
  • Scoring
  • Each query is the union of 5 different
    sub-queries:
  • titles,
  • abstracts,
  • titles using low confidence expansion rules,
  • abstracts using low confidence expansion rules,
    and
  • MeSH concepts.
  • Each sub-query returns a set of documents with a
    relevance score from the text search engine (or a
    fixed value for MeSH matches)
  • The aggregated score is the weighted SUM of the
    individual scores with optional weights applied
    to each sub-query score.
  • SUM performs better than MAX, since it gives
    higher confidence to documents found in multiple
    sub-queries.
  • Scores are normalized to be in the (0,1) range,
    by dividing the score by the highest aggregated
    score achieved for the query.
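The weighted-SUM aggregation and normalization described above can be sketched as follows (function and parameter names are assumptions; the actual per-sub-query weights are not given in the slides):

```python
def aggregate_scores(subquery_hits, weights):
    """Weighted SUM of per-sub-query relevance scores, normalized
    by the highest aggregated score so results fall in (0, 1].
    subquery_hits: one {doc_id: score} dict per sub-query."""
    totals = {}
    for hits, w in zip(subquery_hits, weights):
        for doc, score in hits.items():
            # SUM rather than MAX: documents found by several
            # sub-queries accumulate evidence and rank higher.
            totals[doc] = totals.get(doc, 0.0) + w * score
    top = max(totals.values())
    return {doc: s / top for doc, s in totals.items()}
```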

10
TREC Task 1 Approach
  • GeneRIF classification
  • A Naïve Bayes model is used to assign to each
    document the probability it is a GeneRIF.
  • MeSH terms are used as features.
  • Combination of text retrieval score and GeneRIF
    classification score.
  • We tried both an additive and a multiplicative
    approach. Both behave similarly, with slightly
    better performance from the additive one.
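A minimal sketch of both steps, assuming a Bernoulli-style Naive Bayes over MeSH-term presence and an additive score combination. The Laplace smoothing and the mixing weight `alpha` are assumptions; the slides specify neither:

```python
from collections import Counter
import math

def train_nb(docs, labels):
    """Minimal Bernoulli Naive Bayes over MeSH-term features
    (a sketch of the approach described, not the actual system).
    docs: list of sets of MeSH terms; labels: 1 = GeneRIF, 0 = not."""
    vocab = set().union(*docs)
    priors, cond = {}, {}
    for c in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == c]
        priors[c] = len(idx) / len(labels)
        counts = Counter(t for i in idx for t in docs[i])
        # Laplace smoothing (an assumption)
        cond[c] = {t: (counts[t] + 1) / (len(idx) + 2) for t in vocab}
    return vocab, priors, cond

def prob_generif(doc, vocab, priors, cond):
    """P(GeneRIF | doc) via Bayes' rule over present/absent terms."""
    logp = {}
    for c in (0, 1):
        lp = math.log(priors[c])
        for t in vocab:
            p = cond[c][t]
            lp += math.log(p if t in doc else 1 - p)
        logp[c] = lp
    m = max(logp.values())
    e0, e1 = math.exp(logp[0] - m), math.exp(logp[1] - m)
    return e1 / (e0 + e1)

def combined_score(retrieval_score, generif_prob, alpha=0.5):
    # Additive combination; alpha is an assumed mixing weight.
    return alpha * retrieval_score + (1 - alpha) * generif_prob
```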

11
TREC Task 1 Results
  • Performance is measured using the standard
    trec_eval program.
  • On training data
  • Best published result: 0.4125
  • With GeneRIF classifier: 0.5101
  • Without GeneRIF classifier: 0.5028
  • On testing data (turned in 8/4/03)
  • With GeneRIF classifier: 0.3933
  • Without GeneRIF classifier: 0.3768

12
TREC Task 2
  • Problem Definition
  • Given GeneRIFs formatted as
  • 1    355    12107169    J Biol Chem 2002 Sep
    13;277(37):34343-8.    the death effector domain
    of FADD is involved in interaction with Fas.
  • 2    355    12177303    Nucleic Acids Res 2002
    Aug 15;30(16):3609-14.    In the case of
    Fas-mediated apoptosis, when we transiently
    introduced these hybrid-ribozyme libraries into
    Fas-expressing HeLa cells, we were able to
    isolate surviving clones that were resistant to
    or exhibited a delay in Fas-mediated apoptosis w
  • reproduce the GeneRIF from the MEDLINE record.  
