Using Concept-based Indexing to Improve Language Modeling Approach to Genomic IR


Provided by: NanZ1

Transcript and Presenter's Notes

1
Using Concept-based Indexing to Improve Language Modeling Approach to Genomic IR

Xiaohua (Davis) Zhou, Xiaodan Zhang, Tony Hu
Data Mining and Bioinformatics Lab
College of Information Science and Technology
Drexel University, Philadelphia, PA, United States

2
Background -1

What is Genomic Information Retrieval (GIR)?
Genomic Sequence Search (e.g. BLAST, see the previous talk)
Genomic Network Search (sequence, protein, gene, cell, disease, drug, etc.): Semantic Web
Textual Information Retrieval: this is what we are doing
Collections such as Medline (1.4 million scientific papers in the domain of biomedicine)
TREC Genomics Track 2003/2004/2005
Why GIR?
Most biological and medical findings are scattered in textual publications rather than structured databases.

3
Background -2

Who will use GIR systems?
Life scientists, e.g. to discover proteins related to a certain breast cancer.
Text mining systems, which retrieve documents relevant to a topic for text mining purposes, e.g. clustering and summarizing proteins that affect the development of breast cancers.

4
Background -3

What are the characteristics of GIR?
Phrase: Many concepts are multi-word phrases, e.g. high blood pressure, receptor antagonist activity.
Synonym: One concept has many synonyms, e.g. YPL042c: SSN3, GIG2, NUT7, RYE5, SRB10, UME5.
Polysemy: Authors like to use partial concept names, short names, and abbreviations when describing a concept in an article, which causes ambiguity.
Relationship: Documents address various biological relationships, e.g. protein-protein interactions and gene functions.

5
Background -4

Unified Medical Language System (UMLS)
http://www.nlm.nih.gov/research/umls
Integration of many biomedical vocabularies such as MeSH and SNOMED
A concept is a unique meaning representing a set of synonymous terms.
C0020538 is the concept for hypertension; it represents a set of synonymous terms including high blood pressure, hypertension, and hypertensive disease.
1 million concepts, 135 semantic types, and 54 relation types.

6
Agenda

Background
Problem Motivation
Our Approach
Retrieval Model
Concept Extraction and Document Indexing
Experiment
Conclusions and Future Work

7
Problem Motivation

Problem Description
This paper targets two problems, synonymy and polysemy, identified in the background section.
It is a very old problem: word sense disambiguation (WSD) for IR
The results are inconclusive.
WSD accuracy is low.
Why Study It Again?
Target a new specific domain
Large-scale ontology (e.g. UMLS) is available
New formal retrieval frameworks, e.g. language models

8
Existing Approaches -1

Dictionary-based Query Expansion
Example: on seeing YPL042c in the original query, add SSN3, GIG2, NUT7, RYE5, SRB10, UME5
Heuristic approach: it is difficult to determine the weight of the expanded terms.
Noise: it is difficult to disambiguate the terms in the original query.
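The expansion step on this slide can be sketched as follows. This is a minimal illustration, not the approach of any particular system: the function name `expand_query` and the weight `0.5` are assumptions, and the arbitrary weight is exactly the heuristic choice the slide criticizes.

```python
# Synonym dictionary; the gene aliases are the ones listed on the slide.
SYNONYMS = {
    "YPL042c": ["SSN3", "GIG2", "NUT7", "RYE5", "SRB10", "UME5"],
}

def expand_query(terms, synonyms, expansion_weight=0.5):
    """Return {term: weight}. Original terms get 1.0, added synonyms a hand-picked weight."""
    weighted = {}
    for term in terms:
        weighted[term] = 1.0
        for alias in synonyms.get(term, []):
            # Keep the weight 1.0 if the alias was already typed by the user.
            weighted.setdefault(alias, expansion_weight)
    return weighted

expanded = expand_query(["YPL042c", "function"], SYNONYMS)
print(expanded)
```

Nothing in the dictionary says whether `0.5` is better than `0.1` or `0.9`, which is why the weighting has to be tuned by hand.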

9
Existing Approaches -2

Translation-based Language Model
Example: auto could be translated to car with a high probability, but to computer with a low probability.
A formal approach to query expansion
Computationally intensive
Difficult to obtain large real training data to estimate translation probabilities.
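A small sketch of the translation idea, assuming the usual formulation p(q|d) = sum over document words w of t(q|w) p(w|d). The translation table and its probabilities are invented for illustration; a real table would have to be estimated from training data, which is the difficulty noted above.

```python
from collections import Counter

# Toy translation table t(query_term | document_term). All values are made up.
TRANSLATION = {
    "auto": {"auto": 0.60, "car": 0.35, "computer": 0.05},
    "repair": {"repair": 1.0},
}

def translation_prob(query_term, doc_tokens, translation):
    """p(q|d) = sum_w t(q|w) * p(w|d), with p(w|d) estimated by maximum likelihood."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return sum(
        translation.get(word, {}).get(query_term, 0.0) * count / total
        for word, count in counts.items()
    )

doc = ["auto", "repair"]
p_car = translation_prob("car", doc, TRANSLATION)
p_computer = translation_prob("computer", doc, TRANSLATION)
print(p_car, p_computer)
```

Each query term requires a sum over every distinct word in the document, which gives a sense of why the approach is computationally intensive.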

10
Our Idea

Existing Approaches
Solve the problem during the stage of query-document term matching.
Research Question
Can we partially solve the problem during indexing?

Example: A recent epidemiological study (C0002783) revealed that obesity (C0028754) is an independent risk factor for periodontal disease (C0031090).
Phrase Index: epidemiological study, obesity, periodontal disease
Concept Index: C0002783, C0028754, C0031090

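The difference between the two indexes can be sketched as a pair of inverted indexes. The concept IDs are the ones shown on the slides; the lookup table `PHRASE_TO_CONCEPT`, the function `build_indexes`, and document `d2` are assumptions made for this illustration, standing in for a real UMLS lookup and a real collection.

```python
from collections import defaultdict

# Phrase -> concept ID, a tiny stand-in for a UMLS lookup.
PHRASE_TO_CONCEPT = {
    "epidemiological study": "C0002783",
    "obesity": "C0028754",
    "periodontal disease": "C0031090",
    "hypertension": "C0020538",
    "high blood pressure": "C0020538",
}

def build_indexes(docs):
    """docs: {doc_id: [phrases]}. Returns (phrase_index, concept_index) as inverted indexes."""
    phrase_index, concept_index = defaultdict(set), defaultdict(set)
    for doc_id, phrases in docs.items():
        for phrase in phrases:
            phrase_index[phrase].add(doc_id)
            concept = PHRASE_TO_CONCEPT.get(phrase)
            if concept is not None:
                concept_index[concept].add(doc_id)
    return phrase_index, concept_index

docs = {
    "d1": ["epidemiological study", "obesity", "periodontal disease"],
    "d2": ["high blood pressure"],  # hypothetical second document
}
phrase_index, concept_index = build_indexes(docs)

# The query "hypertension" misses d2 in the phrase index but finds it via the concept.
query = "hypertension"
phrase_hits = phrase_index.get(query, set())
concept_hits = concept_index.get(PHRASE_TO_CONCEPT[query], set())
print(phrase_hits, concept_hits)
```

Because synonyms collapse to one concept ID at indexing time, the query side needs no expansion to match them.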
11
Concept-based Indexing

Strengths
With a large ontology available, it is technically feasible
Makes document-query matching simpler
If necessary, we can still use query expansion during the matching stage
Weaknesses
Indexing is slow
Depends heavily on the domain ontology
More Specific Research Question
Does concept-based indexing outperform phrase-based indexing?

12
Concept Extraction

MaxMatcher
Developed by our lab.
Extracts UMLS concepts from texts using approximate dictionary lookup.
Basic Ideas
Biological terms have many variants. Exact string matching will cause low recall.
Match core tokens instead of all tokens in a concept name.
Example: Dictionary: CD28 protein; Text: CD28
For details, refer to the paper or talk to me in person after the presentation.
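The core-token idea can be illustrated with a toy lookup. This is not MaxMatcher itself, whose details are in the paper. It is a simplified sketch under stated assumptions: core tokens are taken to be whatever remains after dropping a hand-written list of generic words (`GENERIC_TOKENS`), and `CUI_CD28` is a placeholder, not a real UMLS identifier.

```python
# Hypothetical dictionary: concept ID -> concept name.
DICTIONARY = {"CUI_CD28": "CD28 protein"}

# Assumption: generic head words add little identifying information.
GENERIC_TOKENS = {"protein", "gene", "receptor"}

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    stripped = (t.strip(".,;:()") for t in text.lower().split())
    return [t for t in stripped if t]

def core_tokens(name):
    tokens = tokenize(name)
    core = {t for t in tokens if t not in GENERIC_TOKENS}
    return core or set(tokens)  # fall back to all tokens if every token is generic

def extract_exact(text, dictionary):
    """Exact string matching: the full concept name must appear in the text."""
    lowered = text.lower()
    return sorted(cid for cid, name in dictionary.items() if name.lower() in lowered)

def extract_approx(text, dictionary):
    """Approximate matching: only the core tokens must appear in the text."""
    text_tokens = set(tokenize(text))
    return sorted(cid for cid, name in dictionary.items()
                  if core_tokens(name) <= text_tokens)

text = "Expression of CD28 was reduced."
exact = extract_exact(text, DICTIONARY)
approx = extract_approx(text, DICTIONARY)
print(exact, approx)
```

Exact matching misses the mention because the text never says "CD28 protein", while matching on the core token alone recovers it.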

13
Experiment Settings

TREC 2004 Genomics Track
4.6 million Medline abstracts. We use a subset of the collection, 42,255 documents, which were manually judged.
50 ad hoc queries
Retrieval Model
Basic unigram document language model
Baseline
Phrase-based indexing
Measures
MAP, P@10, P@100
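A basic unigram document language model ranks documents by the likelihood of the query under each document's term distribution. The slide does not say which smoothing method was used, so the sketch below assumes Jelinek-Mercer smoothing with an arbitrary `lam=0.5`; the two documents are hypothetical. The index terms can be phrases or concept IDs, since the model itself is the same for both.

```python
import math
from collections import Counter

def score(query_terms, doc_terms, collection_counts, collection_len, lam=0.5):
    """log p(query | doc) under a unigram model with Jelinek-Mercer smoothing (assumed)."""
    doc_counts = Counter(doc_terms)
    log_p = 0.0
    for term in query_terms:
        p_doc = doc_counts[term] / len(doc_terms) if doc_terms else 0.0
        p_coll = collection_counts[term] / collection_len
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            return float("-inf")  # term never occurs in the collection
        log_p += math.log(p)
    return log_p

# Hypothetical documents, indexed by concept ID.
docs = {
    "d1": ["C0028754", "C0031090", "C0002783"],
    "d2": ["C0020538", "C0028754"],
}
collection_counts = Counter(t for terms in docs.values() for t in terms)
collection_len = sum(collection_counts.values())

query = ["C0028754", "C0031090"]
ranking = sorted(
    docs,
    key=lambda d: score(query, docs[d], collection_counts, collection_len),
    reverse=True,
)
print(ranking)
```

Document `d1` contains both query concepts and therefore ranks above `d2`, which matches only one and relies on the collection model for the other.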

14
Result Analysis
Fig. 2. Comparison of the concept approach with the baseline approach on 50 ad hoc topics. The paired-sample t-test (M = 7.77, t = 3.316, df = 49, p = 0.002) shows that the concept approach is significantly better than the baseline approach in terms of MAP.

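For readers unfamiliar with the test, a paired-sample t statistic is computed from the per-topic differences between the two systems. The sketch below uses four invented average-precision values per system, not the 50 topic scores behind the figure.

```python
import math
from statistics import mean, stdev

def paired_t(xs, ys):
    """Return (mean difference, t statistic, degrees of freedom) for paired samples."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return mean(diffs), t, n - 1

# Made-up per-topic average precision for two systems.
concept = [0.50, 0.40, 0.30, 0.60]
baseline = [0.40, 0.35, 0.30, 0.45]
m, t, df = paired_t(concept, baseline)
print(m, t, df)
```

The degrees of freedom equal the number of topics minus one, which is consistent with df = 49 for 50 topics. Turning t into a p-value needs the t distribution, for example via `scipy.stats.ttest_rel` if SciPy is available.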
15
Result Analysis
Table 1. Comparison of our runs with the official runs that participated in the TREC 2004 Genomics Track. Runs in TREC are ranked by Mean Average Precision (MAP).

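As a reminder of how the ranking measure is defined, here is a small sketch of precision at k, average precision, and MAP. The rankings and relevance judgments are invented and the function names are illustrative. Official TREC scores are produced by the track's own evaluation tooling, not by this code.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant document retrieved."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_doc_ids, relevant_set), one entry per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two made-up topics.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6", "d7"], {"d6"}),
]
map_score = mean_average_precision(runs)
p_at_2 = precision_at_k(runs[0][0], runs[0][1], 2)
print(map_score, p_at_2)
```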
16
Conclusions

Major Findings
Concept-based indexing improves retrieval performance in comparison with phrase-based indexing.
The performance of our run (concept-based indexing + basic language model) is much better than the mean MAP of all participating groups in TREC 2004, and is comparable to the best run.

17
Future Work

Add sub-concepts to the index
e.g. Breast Cancer: Breast, Cancer
Explore concept-based indexing with additional IR techniques
Document model expansion
Blind feedback
Compare word-based indexing with concept-based indexing and phrase-based indexing
Incorporate relationships into the retrieval model
Two related follow-up papers
Relation-based Document Retrieval for Biomedical Literature Databases, DASFAA 2006, 689-701
Context-Sensitive Semantic Smoothing for the Language Modeling Approach to Genomic IR, SIGIR 2006