Title: Using Concept-based Indexing to Improve Language Modeling Approach to Genomic IR
1. Using Concept-based Indexing to Improve Language Modeling Approach to Genomic IR
- Xiaohua (Davis) Zhou, Xiaodan Zhang, Tony Hu
- Data Mining and Bioinformatics Lab
- College of Information Science and Technology
- Drexel University, Philadelphia, PA, United States
2. Background -1
- What is Genomic Information Retrieval (GIR)?
- Genomic Sequence Search (e.g. BLAST, previous talk)
- Genomic Network Search (sequence, protein, gene, cell, disease, drug, etc.) (Semantic Web)
- Textual Information Retrieval (what we are doing)
- Collections such as Medline (1.4 million scientific papers in the domain of biomedicine)
- TREC Genomic Track 2003/2004/2005
- Why GIR?
- Most biological and medical findings are scattered in textual publications rather than structured databases.
3. Background -2
- Who will use GIR systems?
- Life scientists, e.g. to discover proteins related to certain breast cancers.
- Text mining systems
- Retrieve documents relevant to a topic for text mining purposes, e.g. clustering and summarizing proteins affecting the development of breast cancers.
4. Background -3
- What are the characteristics of GIR?
- Phrase: Many concepts are multi-word phrases
- High blood pressure, receptor antagonist activity
- Synonym: One concept has many synonyms
- YPL042c: SSN3, GIG2, NUT7, RYE5, SRB10, UME5
- Polysemy: Authors like to use partial concept names, short names, and abbreviations when describing a concept in an article, which causes ambiguity.
- Relationship: Documents address various biological relationships, e.g. protein-protein interactions and gene functions.
5. Background -4
- Unified Medical Language System (UMLS)
- http://www.nlm.nih.gov/research/umls
- Integration of many biomedical vocabularies such as MeSH and SNOMED
- A concept is a unique meaning representing a set of synonymous terms.
- C0020538 is the concept for hypertension; it represents a set of synonymous terms including high blood pressure, hypertension, and hypertensive disease.
- 1 million concepts, 135 semantic types, and 54 relation types.
6. Agenda
- Background
- Problem Motivation
- Our Approach
- Retrieval Model
- Concept Extraction and Document Indexing
- Experiment
- Conclusions and Future Work
7. Problem Motivation
- Problem Descriptions
- This paper targets two problems, synonymy and polysemy, identified in the background section.
- It is a very old problem: word sense disambiguation (WSD) for IR
- The results are inconclusive.
- WSD accuracy is low.
- Why Study It Again?
- Target a new specific domain
- Large scale ontology (e.g. UMLS) is available
- New formal retrieval frameworks, e.g. Language Models
8. Existing Approaches -1
- Dictionary-based Query Expansion
- Example
- See YPL042c in the original query, then add SSN3, GIG2, NUT7, RYE5, SRB10, UME5
- Heuristic approach: difficult to determine the weight of the expanded terms.
- Noise: difficult to disambiguate the term in the original query.
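The dictionary-based expansion described on this slide can be sketched roughly as follows. This is only an illustration, not the setup of any particular system: the `SYNONYMS` table holds just the slide's example, and `expansion_weight` (0.5 here) is an arbitrary assumption, which is exactly the heuristic choice the slide points out as a difficulty.

```python
from typing import Dict, List

# Toy synonym dictionary containing only the example from the slide.
SYNONYMS: Dict[str, List[str]] = {
    "YPL042c": ["SSN3", "GIG2", "NUT7", "RYE5", "SRB10", "UME5"],
}


def expand_query(terms: List[str], expansion_weight: float = 0.5) -> Dict[str, float]:
    """Map each term to a weight: 1.0 for original terms, expansion_weight for added synonyms."""
    weighted: Dict[str, float] = {term: 1.0 for term in terms}
    for term in terms:
        for synonym in SYNONYMS.get(term, []):
            # setdefault keeps the weight of a term the user typed explicitly.
            weighted.setdefault(synonym, expansion_weight)
    return weighted


# "function" is a made-up second query term with no dictionary entry.
expanded = expand_query(["YPL042c", "function"])
print(expanded)
```

Note that the sketch expands every dictionary hit blindly; it makes no attempt to disambiguate the query term first, which is the noise problem named above.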
9. Existing Approaches -2
- Translation-based Language Model
- Example
- "auto" could be translated to "car" with a high probability, but to "computer" with a low probability.
- A formal approach to query expansion
- Computationally intensive
- Difficult to obtain large real training data to estimate translation probabilities.
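A minimal sketch of translation-based scoring, assuming the common form log p(q|d) = sum over query terms q_i of log( sum over document terms w of t(q_i|w) * p(w|d) ). The `TRANSLATION` table and its probabilities are invented for the "auto" example, and no smoothing is applied, so this is an illustration rather than a usable retrieval function.

```python
import math
from collections import Counter
from typing import Dict, List

# Hypothetical translation table: TRANSLATION[w][q] = t(q | w).
# The probabilities are made up to mirror the slide's "auto" example.
TRANSLATION: Dict[str, Dict[str, float]] = {
    "auto": {"auto": 0.60, "car": 0.35, "computer": 0.05},
}


def translation_log_likelihood(query: List[str], document: List[str]) -> float:
    """Score a document by translating its terms into each query term."""
    counts = Counter(document)
    length = len(document)
    log_likelihood = 0.0
    for q in query:
        # Sum over every distinct document term: this inner loop is what
        # makes the model expensive on large vocabularies.
        prob = sum(
            TRANSLATION.get(w, {}).get(q, 0.0) * (count / length)
            for w, count in counts.items()
        )
        if prob == 0.0:
            return float("-inf")  # unsmoothed: an unreachable query term zeroes the score
        log_likelihood += math.log(prob)
    return log_likelihood


document = ["auto", "repair"]
score_car = translation_log_likelihood(["car"], document)
score_computer = translation_log_likelihood(["computer"], document)
print(score_car > score_computer)
```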
10. Our Idea
- Existing Approaches
- Solve the problem during the stage of query-document term matching.
- Research Questions
- Can we partially solve the problem during indexing?
A recent epidemiological study (C0002783) revealed that obesity (C0028754) is an independent risk factor for periodontal disease (C0031090).
Phrase Index: epidemiological study, obesity, periodontal disease
Concept Index: C0002783, C0028754, C0031090
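The difference between the two indexes can be sketched as follows. The phrase-to-concept pairs are taken from the slides (including the hypertension example from the UMLS slide); the substring matching and the texts of documents `d2` and `d3` are simplifying assumptions made up for this illustration.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, Set

# Surface phrase -> UMLS concept ID, limited to pairs that appear on the slides.
PHRASE_TO_CONCEPT: Dict[str, str] = {
    "epidemiological study": "C0002783",
    "obesity": "C0028754",
    "periodontal disease": "C0031090",
    "high blood pressure": "C0020538",
    "hypertension": "C0020538",
}


def index_document(
    doc_id: str,
    text: str,
    phrase_index: DefaultDict[str, Set[str]],
    concept_index: DefaultDict[str, Set[str]],
) -> None:
    """Add one document to both inverted indexes."""
    lowered = text.lower()
    for phrase, concept_id in PHRASE_TO_CONCEPT.items():
        if phrase in lowered:  # naive substring match, assumed for brevity
            phrase_index[phrase].add(doc_id)
            concept_index[concept_id].add(doc_id)


phrase_index: DefaultDict[str, Set[str]] = defaultdict(set)
concept_index: DefaultDict[str, Set[str]] = defaultdict(set)
docs = {
    "d1": "A recent epidemiological study revealed that obesity is an "
          "independent risk factor for periodontal disease.",
    "d2": "High blood pressure is common in adults.",  # made-up text
    "d3": "Hypertension is common in adults.",         # made-up text
}
for doc_id, text in docs.items():
    index_document(doc_id, text, phrase_index, concept_index)

print(sorted(phrase_index["hypertension"]))
print(sorted(concept_index["C0020538"]))
```

In the phrase index, `d2` and `d3` end up under different keys, so a query for one synonym misses the other document; in the concept index both share the posting list of C0020538.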
11. Concept-based Indexing
- Strength
- With a large ontology available, it is technically feasible
- Makes document-query matching simpler
- If necessary, we can still use query expansion during the matching stage
- Weakness
- Indexing is slow
- Highly dependent on the domain ontology
- More Specific Research Questions
- Does concept-based indexing outperform phrase-based indexing?
12. Concept Extraction
- MaxMatcher
- Developed by our lab.
- Extracts UMLS concepts from texts using approximate dictionary lookup.
- Basic Ideas
- Biological terms have many variants. Exact string matching will cause low recall.
- Match core tokens instead of all tokens in a concept name.
- Dictionary: CD28 protein; Text: CD28
- For details, refer to the paper or talk to me in person after the presentation.
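The core-token idea can be illustrated with a rough sketch. This is not the MaxMatcher algorithm itself, whose details are in the paper; the `GENERIC_TOKENS` list, the `CONCEPT_CD28` identifier, and the rule "a token is core if it is not generic" are all assumptions made for this example.

```python
import re
from typing import Dict, List, Set

# Hypothetical list of low-information tokens (an assumption, not from the slides).
GENERIC_TOKENS: Set[str] = {"protein", "gene", "of", "the"}

# Concept dictionary; "CD28 protein" is the slide's example, the ID is a placeholder.
DICTIONARY: Dict[str, str] = {"CONCEPT_CD28": "CD28 protein"}


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def core_tokens(name: str) -> Set[str]:
    """Tokens of a concept name that must appear in the text for a match."""
    tokens = set(tokenize(name))
    core = tokens - GENERIC_TOKENS
    return core or tokens  # fall back to all tokens if every token is generic


def extract_concepts(text: str) -> List[str]:
    """Return IDs of concepts whose core tokens all occur in the text."""
    text_tokens = set(tokenize(text))
    return [cid for cid, name in DICTIONARY.items() if core_tokens(name) <= text_tokens]


sentence = "Expression of CD28 was reduced."  # made-up sentence
exact_hit = "cd28 protein" in sentence.lower()
approximate_hits = extract_concepts(sentence)
print(exact_hit, approximate_hits)
```

Exact string lookup misses the mention because the text says only "CD28", while matching on the core token alone recovers it.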
13. Experiment Settings
- TREC 2004 Genomic Track
- 4.6 million Medline abstracts. We use a subset of the collection, 42,255 documents, which are manually judged.
- 50 ad hoc queries
- Retrieval Model
- Basic Unigram Document Language Model
- Measures
- Phrase-based indexing as the baseline
- MAP, P@10, P@100
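A unigram document language model ranks documents by the likelihood of the query under each document's model. The slides do not say which smoothing was used; the sketch below assumes Dirichlet prior smoothing with `mu = 1000`, and the two toy documents are made up. Index terms can be phrases or concept IDs; concept IDs are used here.

```python
import math
from collections import Counter
from typing import Dict, List


def query_log_likelihood(
    query: List[str],
    doc: List[str],
    collection_counts: Dict[str, int],
    collection_length: int,
    mu: float = 1000.0,  # assumed smoothing parameter
) -> float:
    """log p(q|d) under a Dirichlet-smoothed unigram document model."""
    doc_counts = Counter(doc)
    total = 0.0
    for term in query:
        p_collection = collection_counts.get(term, 0) / collection_length
        p = (doc_counts[term] + mu * p_collection) / (len(doc) + mu)
        if p == 0.0:
            return float("-inf")  # term unseen in the whole collection
        total += math.log(p)
    return total


# Toy collection of documents already converted to concept IDs (made up).
docs = {
    "d1": ["C0002783", "C0028754", "C0031090"],
    "d2": ["C0020538", "C0002783"],
}
collection_counts = Counter(term for doc in docs.values() for term in doc)
collection_length = sum(collection_counts.values())

query = ["C0028754"]  # the obesity concept
ranking = sorted(
    docs,
    key=lambda d: query_log_likelihood(query, docs[d], collection_counts, collection_length),
    reverse=True,
)
print(ranking)
```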
14. Result Analysis
Fig. 2. Comparison of the concept approach with the baseline approach on 50 ad hoc topics. The paired-sample t-test (M = 7.77, t = 3.316, df = 49, p = 0.002) shows that the concept approach is significantly better than the baseline approach in terms of MAP.
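For readers unfamiliar with the test, a paired-sample t statistic is the mean of the per-topic differences divided by its standard error, with df = n - 1 (hence df = 49 for 50 topics). The sketch below uses four invented per-topic average precision values; they are not the study's data.

```python
import math
from statistics import mean, stdev
from typing import List, Tuple


def paired_t(baseline: List[float], treatment: List[float]) -> Tuple[float, float, int]:
    """Return (mean difference, t statistic, degrees of freedom)."""
    diffs = [t - b for b, t in zip(baseline, treatment)]
    n = len(diffs)
    mean_diff = mean(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)  # stdev is the sample standard deviation
    return mean_diff, mean_diff / standard_error, n - 1


# Illustrative per-topic average precision values (made up, not from the study).
baseline_ap = [0.20, 0.35, 0.10, 0.50]
concept_ap = [0.30, 0.40, 0.20, 0.55]

mean_diff, t_stat, df = paired_t(baseline_ap, concept_ap)
print(mean_diff, t_stat, df)
```

The p-value is then read from the t distribution with `df` degrees of freedom, which is omitted here to keep the sketch dependency-free.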
15. Result Analysis
Table 1. Comparison of our runs with the official runs that participated in the TREC 2004 Genomics Track. Runs in TREC are ranked by Mean Average Precision (MAP).
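The measures used in the experiments follow their standard definitions, sketched below. The function names and the toy ranking are chosen for this example only; official TREC scores are computed with the track's own evaluation tooling.

```python
from typing import Dict, List, Set


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / k


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]) -> float:
    """Average the per-topic average precision over all judged topics."""
    return sum(average_precision(runs[topic], qrels[topic]) for topic in qrels) / len(qrels)


# One toy topic with made-up document IDs and judgments.
runs = {"topic1": ["d1", "d2", "d3", "d4"]}
qrels = {"topic1": {"d1", "d3"}}

map_score = mean_average_precision(runs, qrels)
p_at_2 = precision_at_k(runs["topic1"], qrels["topic1"], 2)
print(map_score, p_at_2)
```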
16. Conclusions
- Major Findings
- Concept-based indexing improves retrieval performance in comparison with phrase-based indexing
- The performance of our run (concept-based indexing with the basic language model) is much better than the mean MAP of all participating groups in TREC 2004, and is comparable to the best run.
17. Future Work
- Add sub-concepts to the index
- Breast Cancer: Breast, Cancer
- Explore concept-based indexing with extra IR techniques
- Document model expansion
- Blind feedback
- Compare word-based indexing with concept-based indexing and phrase-based indexing
- Incorporate relationships into the retrieval model
- Two related follow-up papers
- Relation-based Document Retrieval for Biomedical Literature Databases, DASFAA 2006, 689-701
- Context-Sensitive Semantic Smoothing for the Language Modeling Approach to Genomic IR, SIGIR 2006
18. Future Work
- DASFAA Paper
- Use relationships to index and search biomedical literature
- SIGIR Paper
- Use concepts and sub-concepts
- Context-sensitive semantic smoothing
- Translate concept pairs to single concepts
19. Questions and Comments