QuASI: Question Answering using Statistics, Semantics, and Inference

1
QuASI Question Answering using Statistics,
Semantics, and Inference
  • Marti Hearst, Jerry Feldman, Chris Manning,
    Srini Narayanan
  • Univ. of California-Berkeley / ICSI / Stanford
    University

2
Outline
  • TREC Genomics Track
  • Semantic Relation Classification

3
TREC Task 1 Overview
  • Search 525,938 MEDLINE records
  • Titles, abstracts, MeSH category terms, citation
    information
  • Topics
  • Taken from the GeneRIF portion of the LocusLink
    database
  • We are supplied with a set of gene names
  • Definition of a GeneRIF
  • For gene X, find all MEDLINE references that
    focus on the basic biology of the gene or its
    protein products from the designated organism. 
    Basic biology includes isolation, structure,
    genetics and function of genes/proteins in normal
    and disease states.

4
TREC Task 1 Sample Query
  • 3 2120 Homo sapiens OFFICIAL_GENE_NAME ets
    variant gene 6 (TEL oncogene)
  • 3 2120 Homo sapiens OFFICIAL_SYMBOL ETV6
  • 3 2120 Homo sapiens ALIAS_SYMBOL TEL
  • 3 2120 Homo sapiens PREFERRED_PRODUCT ets variant
    gene 6
  • 3 2120 Homo sapiens PRODUCT ets variant gene 6
  • 3 2120 Homo sapiens ALIAS_PROT TEL1 oncogene
  • The first column is the official topic number
    (1-50).
  • The second column contains the LocusLink ID for
    the gene.
  • The third column contains the name of organism.
  • The fourth column contains the gene name type.
  • The fifth column contains the gene name.
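The five-column layout above can be parsed directly. A minimal sketch (the function name is an assumption; it assumes whitespace-delimited fields and a two-token organism name such as "Homo sapiens"):

```python
def parse_topic_line(line):
    """Split one TREC Genomics topic line into its five fields.
    Assumes the organism name is exactly two tokens; the gene
    name itself may contain spaces."""
    tokens = line.split()
    topic_no = int(tokens[0])          # official topic number (1-50)
    locuslink_id = tokens[1]           # LocusLink ID for the gene
    organism = " ".join(tokens[2:4])   # organism name
    name_type = tokens[4]              # gene name type
    gene_name = " ".join(tokens[5:])   # the gene name
    return topic_no, locuslink_id, organism, name_type, gene_name
```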

5
TREC Task 1 Approach
  • Two main components
  • Retrieve relevant docs
  • May miss many because of variation in how gene
    names are expressed
  • Rank order them

6
TREC Task 1 Approach
  • Retrieval
  • Normalization of query terms
  • Special characters are replaced with spaces in
    both queries and documents.
  • Term expansion
  • A set of pattern-based rules is applied to the
    original list of query terms to expand the
    original set and increase recall.
  • Rules with lower confidence get a lower weight
    in the ranking step.
  • Stop word removal
  • Organism identification
  • Gene names are often shared across different
    organisms
  • Developed a method to automatically determine
    which MeSH terms correspond to LocusLink Organism
    terms
  • Retrieved MEDLINE docs indicated by LocusLink
    links corresponding to a given organism
  • Organism terms were the most frequent MeSH
    categories among the selected docs
  • Used these terms to identify the organism term in
    MEDLINE
  • An example of playing two databases off each
    other.
  • MeSH concepts
  • When an exact match is found between one of the
    query terms and a MeSH term assigned to a
    document, the document is retrieved.

7
Gene Name Expansion
8
Organism Filtering
9
TREC Task 1 Approach
  • Relevance ranking
  • IBM's DB2 Net Search Extender was used as the
    text search engine.
  • Scoring
  • Each query is the union of 5 different
    sub-queries:
  • titles,
  • abstracts,
  • titles using low confidence expansion rules,
  • abstracts using low confidence expansion rules,
    and
  • MeSH concepts.
  • Each sub-query returns a set of documents with a
    relevance score from the text search engine (or a
    fixed value for MeSH matches)
  • The aggregated score is the weighted SUM of the
    individual scores with optional weights applied
    to each sub-query score.
  • SUM performs better than MAX, since it gives
    higher confidence to documents found in multiple
    sub-queries.
  • Scores are normalized to be in the (0,1) range,
    by dividing the score by the highest aggregated
    score achieved for the query.
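The weighted-SUM aggregation and normalization described above can be sketched as follows (function and parameter names are assumptions; the actual per-sub-query weights are not given in the slides):

```python
def aggregate_scores(subquery_hits, weights):
    """Weighted SUM of per-sub-query relevance scores, normalized
    by the highest aggregated score so results fall in (0, 1].
    subquery_hits: one {doc_id: score} dict per sub-query."""
    totals = {}
    for hits, w in zip(subquery_hits, weights):
        for doc, score in hits.items():
            # SUM rather than MAX: documents found by several
            # sub-queries accumulate evidence and rank higher.
            totals[doc] = totals.get(doc, 0.0) + w * score
    top = max(totals.values())
    return {doc: s / top for doc, s in totals.items()}
```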

10
TREC Task 1 Approach
  • GeneRIF classification
  • A Naïve Bayes model is used to assign to each
    document the probability it is a GeneRIF.
  • MeSH terms are used as features.
  • Combination of text retrieval score and GeneRIF
    classification score.
  • We tried both an additive and a multiplicative
    approach. Both behave similarly, with slightly
    better performance from the additive one.
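A minimal sketch of both steps, assuming a Bernoulli-style Naive Bayes over MeSH-term presence and an additive score combination. The Laplace smoothing and the mixing weight `alpha` are assumptions; the slides specify neither:

```python
from collections import Counter
import math

def train_nb(docs, labels):
    """Minimal Bernoulli Naive Bayes over MeSH-term features
    (a sketch of the approach described, not the actual system).
    docs: list of sets of MeSH terms; labels: 1 = GeneRIF, 0 = not."""
    vocab = set().union(*docs)
    priors, cond = {}, {}
    for c in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == c]
        priors[c] = len(idx) / len(labels)
        counts = Counter(t for i in idx for t in docs[i])
        # Laplace smoothing (an assumption)
        cond[c] = {t: (counts[t] + 1) / (len(idx) + 2) for t in vocab}
    return vocab, priors, cond

def prob_generif(doc, vocab, priors, cond):
    """P(GeneRIF | doc) via Bayes' rule over present/absent terms."""
    logp = {}
    for c in (0, 1):
        lp = math.log(priors[c])
        for t in vocab:
            p = cond[c][t]
            lp += math.log(p if t in doc else 1 - p)
        logp[c] = lp
    m = max(logp.values())
    e0, e1 = math.exp(logp[0] - m), math.exp(logp[1] - m)
    return e1 / (e0 + e1)

def combined_score(retrieval_score, generif_prob, alpha=0.5):
    # Additive combination; alpha is an assumed mixing weight.
    return alpha * retrieval_score + (1 - alpha) * generif_prob
```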

11
TREC Task 1 Results
  • Performance is measured using the standard
    trec_eval program.
  • On training data
  • Best published result: 0.4125
  • With GeneRIF classifier: 0.5101
  • Without GeneRIF classifier: 0.5028
  • On testing data (turned in 8/4/03)
  • With GeneRIF classifier: 0.3933
  • Without GeneRIF classifier: 0.3768

12
TREC Task 2
  • Problem Definition
  • Given GeneRIFs formatted as
  • 1    355    12107169    J Biol Chem 2002 Sep
    13;277(37):34343-8.    the death effector domain
    of FADD is involved in interaction with Fas.
  • 2    355    12177303    Nucleic Acids Res 2002
    Aug 15;30(16):3609-14.    In the case of
    Fas-mediated apoptosis, when we transiently
    introduced these hybrid-ribozyme libraries into
    Fas-expressing HeLa cells, we were able to
    isolate surviving clones that were resistant to
    or exhibited a delay in Fas-mediated apoptosis w
  • reproduce the GeneRIF from the MEDLINE record.  
