Using Concept-based Indexing to Improve Language Modeling Approach to Genomic IR

1
Using Concept-based Indexing to Improve Language
Modeling Approach to Genomic IR
  • Xiaohua (Davis) Zhou, Xiaodan Zhang, Tony Hu
  • Data Mining and Bioinformatics Lab
  • College of Information Science and Technology
    Drexel University, Philadelphia, PA, United
    States

2
Background -1
  • What is Genomic Information Retrieval (GIR)?
  • Genomic Sequence Search (e.g. BLAST, previous
    talk)
  • Genomic Network Search (sequence, protein, gene,
    cell, disease, drug, etc) Semantic Web
  • Textual Information Retrieval (what we are doing)
  • Collections such as Medline (1.4 million
    scientific papers in the domain of biomedicine)
  • TREC Genomics Track 2003/2004/2005
  • Why GIR?
  • Most biological and medical findings are
    scattered in textual publications rather than
    structured databases.

3
Background -2
  • Who will use GIR systems?
  • For life scientists, e.g., discovering proteins
    related to certain breast cancers.
  • For text mining systems
  • Retrieve documents relevant to a topic for text
    mining purposes, e.g., clustering and summarizing
    proteins affecting the development of breast
    cancers.

4
Background -3
  • What are the characteristics of GIR?
  • Phrase: Many concepts are multi-word phrases
  • e.g., high blood pressure, receptor antagonist
    activity
  • Synonym: One concept has many synonyms
  • YPL042c: SSN3, GIG2, NUT7, RYE5, SRB10, UME5
  • Polysemy: Authors like to use partial concept
    names, short names, and abbreviations when
    describing a concept in an article, which causes
    ambiguity.
  • Relationship: Documents address various
    biological relationships, e.g. protein-protein
    interactions and gene functions.

5
Background -4
  • Unified Medical Language System (UMLS)
  • http://www.nlm.nih.gov/research/umls
  • Integration of many biomedical vocabularies such
    as MeSH and SNOMED
  • A concept is a unique meaning representing a set
    of synonymous terms.
  • C0020538 is the concept for hypertension; it
    represents a set of synonymous terms including
    high blood pressure, hypertension, and
    hypertensive disease
  • 1 million concepts, 135 semantic types, and 54
    relation types.
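
The concept-as-synonym-set idea on this slide can be sketched as a small lookup table. This is only an illustration, not the UMLS API: the names `CONCEPTS`, `TERM_TO_CONCEPT`, and `concept_of` are hypothetical, and only the single concept quoted on the slide is included.

```python
# Toy concept table: one UMLS-style concept ID mapped to its synonymous terms.
# C0020538 and its three terms come from the slide; the data structure itself
# is an assumption made for illustration.
CONCEPTS = {
    "C0020538": {"high blood pressure", "hypertension", "hypertensive disease"},
}

# Invert the table so that every synonym resolves to the same concept ID.
TERM_TO_CONCEPT = {
    term: cui for cui, terms in CONCEPTS.items() for term in terms
}


def concept_of(term):
    """Return the concept ID for a term, or None if the term is unknown."""
    return TERM_TO_CONCEPT.get(term.lower().strip())
```

With this table, `concept_of("High Blood Pressure")` and `concept_of("hypertension")` resolve to the same ID, which is the property concept-based indexing relies on.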

6
Agenda
  • Background
  • Problem Motivation
  • Our Approach
  • Retrieval Model
  • Concept Extraction and Document Indexing
  • Experiment
  • Conclusions and Future Work

7
Problem Motivation
  • Problem Descriptions
  • This paper targets two problems, synonymy and
    polysemy, identified in the background section.
  • It is a very old problem: word sense
    disambiguation (WSD) for IR
  • The results are inconclusive.
  • WSD accuracy is low.
  • Why Study It Again?
  • Target a new specific domain
  • Large scale ontology (e.g. UMLS) is available
  • New formal retrieval frameworks, e.g. Language
    Models

8
Existing Approaches -1
  • Dictionary-based Query Expansion
  • Example
  • See YPL042c in the original query, then add
  • SSN3, GIG2, NUT7, RYE5, SRB10, UME5
  • Heuristic approach: it is difficult to determine
    the weights of the expanded terms.
  • Noise: it is difficult to disambiguate the terms
    in the original query.
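
A minimal sketch of dictionary-based query expansion, using the gene synonyms from the slide. The `SYNONYMS` table, the `expand_query` helper, and the default weight of 0.5 are all assumptions for illustration; the arbitrary weight is exactly the heuristic the slide criticizes.

```python
# Hypothetical synonym dictionary; the gene names are the ones on the slide.
SYNONYMS = {
    "YPL042c": ["SSN3", "GIG2", "NUT7", "RYE5", "SRB10", "UME5"],
}


def expand_query(terms, expansion_weight=0.5):
    """Return (term, weight) pairs: originals at 1.0, synonyms at expansion_weight."""
    weighted = []
    for term in terms:
        weighted.append((term, 1.0))
        # Every dictionary synonym is added, with no check that the query
        # term was used in the sense the dictionary assumes (the "noise" issue).
        for synonym in SYNONYMS.get(term, []):
            weighted.append((synonym, expansion_weight))
    return weighted
```

For example, `expand_query(["YPL042c"])` yields the original term followed by its six synonyms, each carrying the same hand-picked weight.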

9
Existing Approaches -2
  • Translation-based Language Model
  • Example
  • "auto" could be translated to "car" with a high
    probability, but to "computer" with a low
    probability.
  • A formal approach to query expansion
  • Computationally intensive
  • Difficult to obtain large real training data to
    estimate translation probabilities.
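
One common formulation of a translation-based language model scores a query term q against a document d as the sum, over document terms w, of t(q | w) · p(w | d). The sketch below assumes that formulation. The `TRANSLATION` values are made up for the auto/car/computer example and are not estimates from any real data.

```python
from collections import Counter

# Hypothetical translation probabilities, keyed as (query_term, document_term).
TRANSLATION = {
    ("auto", "auto"): 0.6,
    ("auto", "car"): 0.35,
    ("auto", "computer"): 0.01,
}


def translation_prob(query_term, doc_tokens):
    """p(q | d) = sum over document terms w of t(q | w) * p(w | d)."""
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    # The sum visits every distinct document term for every query term,
    # which is where the computational cost comes from.
    return sum(
        TRANSLATION.get((query_term, word), 0.0) * count / total
        for word, count in counts.items()
    )
```

A document that only mentions "car" still receives a non-zero score for the query term "auto", while a document about "computer" receives a much smaller one.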

10
Our Idea
  • Existing Approaches
  • Solve the problem during the stage of
    query-document term matching.
  • Research Questions
  • Can we partially solve the problem during
    indexing?

Example: A recent epidemiological study (C0002783)
revealed that obesity (C0028754) is an
independent risk factor for periodontal disease
(C0031090).
Phrase Index: epidemiological study, obesity,
periodontal disease
Concept Index: C0002783, C0028754, C0031090
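
The difference between the two indexes can be sketched with a pair of inverted indexes. The code assumes a concept extractor has already produced (phrase, concept ID) pairs. `DOC_ANNOTATIONS` and `build_indexes` are hypothetical names, and `doc2` and `doc3` are invented documents that reuse the hypertension synonyms from the UMLS slide.

```python
from collections import defaultdict

# Assumed extractor output: document ID -> list of (surface phrase, concept ID).
DOC_ANNOTATIONS = {
    "doc1": [
        ("epidemiological study", "C0002783"),
        ("obesity", "C0028754"),
        ("periodontal disease", "C0031090"),
    ],
    "doc2": [("high blood pressure", "C0020538")],
    "doc3": [("hypertension", "C0020538")],
}


def build_indexes(annotations):
    """Build a phrase index and a concept index (term -> set of document IDs)."""
    phrase_index = defaultdict(set)
    concept_index = defaultdict(set)
    for doc_id, pairs in annotations.items():
        for phrase, cui in pairs:
            phrase_index[phrase].add(doc_id)
            concept_index[cui].add(doc_id)
    return phrase_index, concept_index
```

In the phrase index, "hypertension" points only to `doc3`. In the concept index, C0020538 points to both `doc2` and `doc3`, so the synonym problem is handled at indexing time rather than at matching time.
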
11
Concept-based Indexing
  • Strength
  • With large ontology available, it is technically
    feasible
  • Make document-query matching simpler
  • If necessary, we can still use query expansion
    during the matching stage
  • Weakness
  • Indexing is slow
  • Depends heavily on the domain ontology
  • More Specific Research Questions
  • Does concept-based indexing outperform
    phrase-based indexing?

12
Concept Extraction
  • MaxMatcher
  • Developed by our lab.
  • Extracting UMLS concepts from texts using
    approximate dictionary lookup.
  • Basic Ideas
  • Biological terms have many variants. Exact string
    matching will cause low recall.
  • Match core tokens instead of all tokens in a
    concept name.
  • Dictionary: CD28 protein; Text: CD28
  • For details, refer to the paper or talk to me
    in person after the presentation.
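
The core-token idea can be illustrated with a toy lookup. This is not MaxMatcher's actual algorithm, which is described in the paper. The `NON_CORE` stop list, the `DICTIONARY` entry, and the placeholder ID `CONCEPT_CD28` are assumptions made for this sketch.

```python
# Generic tokens treated as non-core; this stop list is an assumption.
NON_CORE = {"protein", "gene", "antigen"}

# Hypothetical dictionary: concept name -> placeholder ID (not a real UMLS ID).
DICTIONARY = {"CD28 protein": "CONCEPT_CD28"}


def core_tokens(name):
    """Lowercased tokens of a concept name with generic tokens removed."""
    return {tok for tok in name.lower().split() if tok not in NON_CORE}


def lookup(text):
    """Return IDs of concepts whose core tokens all occur in the text."""
    text_tokens = set(text.lower().split())
    return [
        concept_id
        for name, concept_id in DICTIONARY.items()
        if core_tokens(name) and core_tokens(name) <= text_tokens
    ]
```

Exact string matching would miss "CD28" in running text because the dictionary name is "CD28 protein". Matching on the core token alone recovers it, at the cost of needing some way to decide which tokens are core.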

13
Experiment Settings
  • TREC 2004 Genomic Track
  • 4.6 million Medline abstracts. We use a subset of
    the collection: the 42,255 documents that were
    manually judged.
  • 50 ad hoc queries
  • Retrieval Model
  • Basic Unigram Document Language Model
  • Measures
  • Phrase-based Indexing as the baseline
  • MAP, P@10, P@100
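
A basic unigram document language model ranks documents by the likelihood of the query under each document's term distribution. The slides do not say which smoothing method was used, so the sketch below assumes Jelinek-Mercer smoothing with an arbitrary `lam=0.5`. The function name and parameters are hypothetical, and the "terms" can be phrases or concept IDs depending on the index.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, lam=0.5):
    """log p(query | doc) under a smoothed unigram model.

    Jelinek-Mercer smoothing and lam=0.5 are assumptions. `doc` and
    `collection` are non-empty lists of index terms.
    """
    doc_counts, coll_counts = Counter(doc), Counter(collection)
    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / len(doc)
        p_coll = coll_counts[term] / len(collection)
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            return float("-inf")  # term never occurs in the collection
        score += math.log(p)
    return score
```

Documents are then sorted by this score in descending order. Swapping the phrase index for the concept index changes only what the tokens in `doc` and `query` are, not the scoring function.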

14
Result Analysis
Fig. 2. The comparison of the Concept Approach
with the Baseline Approach on 50 ad hoc topics.
The paired-sample t test (M = 7.77, t = 3.316,
df = 49, p = 0.002) shows the concept approach is
significantly better than the baseline approach
in terms of MAP.
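
For reference, the paired-sample t statistic is the mean of the per-topic differences divided by its standard error, with n - 1 degrees of freedom (49 for 50 topics). The sketch below uses only the standard library. The `paired_t` helper is a hypothetical name, and the p-value is not computed because that requires the t distribution.

```python
import math
import statistics


def paired_t(scores_a, scores_b):
    """Return (mean difference, t statistic, degrees of freedom) for paired samples.

    Assumes at least two pairs and that the differences are not all identical.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff, mean_diff / std_err, n - 1
```

Here `scores_a` and `scores_b` would be the per-topic average precision values of the concept run and the baseline run, in the same topic order.
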
15
Result Analysis
Table 1. Comparison of our runs with the
official runs that participated in the TREC 2004
Genomics Track. Runs in TREC are ranked by Mean
Average Precision (MAP).
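
The ranking measures can be sketched as follows, using the usual definitions of precision at rank k, average precision, and MAP. The function names are hypothetical, and this is not the official TREC evaluation tool.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked list, relevant set) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

MAP rewards placing relevant documents early across all topics, which is why it serves as the ranking criterion for runs.
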
16
Conclusions
  • Major Findings
  • Concept-based indexing improves retrieval
    performance in comparison with phrase-based
    indexing
  • The performance of our run (concept-based
    indexing plus the basic language model) is much
    better than the mean MAP of all participating
    groups in TREC 2004, and is comparable to the
    best run.

17
Future Work
  • Add sub-concepts to the index
  • e.g., Breast Cancer: Breast, Cancer
  • Explore the concept-based indexing with extra IR
    techniques
  • document model expansion
  • Blind feedback
  • Compare word-based indexing with concept-based
    indexing and phrase-based indexing
  • Incorporate relationships into the retrieval
    model
  • Two related follow-up papers
  • Relation-based Document Retrieval for Biomedical
    Literature Databases, DASFAA 2006, 689-701
  • Context-Sensitive Semantic Smoothing for the
    Language Modeling Approach to Genomic IR, SIGIR
    2006

18
Future Work
  • DASFAA Paper
  • Use relationships to index and search biomedical
    literature
  • SIGIR Paper
  • Use concepts and sub-concepts
  • Context-sensitive semantic smoothing
  • Translate concept pairs to single concepts

19
Questions and Comments
  • ?