Title: Free-text Medical Document Retrieval via Phrase-based Vector Space Model
1Free-text Medical Document Retrieval via
Phrase-based Vector Space Model
- Wenlei Mao, MS and Wesley W. Chu, PhD
- wenlei_at_cs.ucla.edu and wwc_at_cs.ucla.edu
- Computer Science Department
- University of California, Los Angeles
2Outline
- Vector space model (VSM) in document retrieval
- Stem-based VSM
- Concept-based VSM
- Conceptual similarity
- Phrase-based VSM
- Retrieval effectiveness comparison
- Conclusion
3Document Retrieval
- Find free-text documents to answer queries like:
- Hyperthermia, leukocytosis, increased intracranial pressure, and central herniation. Cerebral edema secondary to infection, diagnosis and treatment.
4Vector Space Model (VSM)
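As a reminder of what a VSM computes, here is a minimal sketch, assuming raw term frequencies as weights and whitespace tokenization (the slides do not specify a weighting scheme). Documents and queries become term vectors, and similarity is the cosine of the angle between them. The function name `cosine_similarity` is illustrative.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between two raw term-frequency vectors."""
    vec_a = Counter(doc_a.lower().split())
    vec_b = Counter(doc_b.lower().split())
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document matches nothing
    return dot / (norm_a * norm_b)

print(cosine_similarity("cerebral edema", "edema treatment"))
```

Which strings count as terms (stems, concepts, or phrases) is exactly what the remaining slides vary.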
5Stem-based VSM
- Morphological variants bear similar content
- E.g., edema and edemas
- Use stemmer to extract stems
- Lovins stemmer and Porter stemmer
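To illustrate how stemming conflates morphological variants, here is a deliberately crude sketch. The function `toy_stem` is a hypothetical stand-in that only strips a trailing plural "s"; it is not the Lovins or Porter algorithm, both of which apply many more rules.

```python
def toy_stem(word):
    """Crude plural stripper for illustration only; NOT Lovins or Porter."""
    word = word.lower()
    # Strip one trailing "s", but leave short words and "-ss" endings alone.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def stem_terms(text):
    """Turn free text into the list of stems used as VSM terms."""
    return [toy_stem(w) for w in text.split()]

print(stem_terms("edema edemas"))
```

Both variants map to the same term, so a query containing one form can match a document containing the other.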
6Shortcomings of Stem-based VSM
- Inability to capture multi-word concepts
- Increased intracranial pressure
- Inability to utilize the relations between concepts
- Synonyms: hyperthermia and fever
- IS-A relation: hyperthermia and body temperature elevation
7Concept-based VSM
- Uses concepts in knowledge base (KB) as terms
- KB: Metathesaurus in UMLS
- Captures multi-word concepts
- Captures synonyms
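A minimal sketch of concept-based indexing, assuming a tiny hand-made knowledge base in place of the UMLS Metathesaurus. The concept IDs (`C_FEVER` and so on), the dictionary `TOY_KB`, and the function `extract_concepts` are all hypothetical; real UMLS identifiers and lookup tools differ. Longest-match lookup captures multi-word concepts, and mapping synonyms to one ID captures synonymy.

```python
# Hypothetical toy knowledge base: phrase -> made-up concept ID (not real UMLS IDs).
TOY_KB = {
    ("hyperthermia",): "C_FEVER",
    ("fever",): "C_FEVER",
    ("increased", "intracranial", "pressure"): "C_ICP_UP",
    ("cerebral", "edema"): "C_CEREBRAL_EDEMA",
}
MAX_PHRASE_LEN = max(len(phrase) for phrase in TOY_KB)

def extract_concepts(text):
    """Greedy longest-match lookup of KB phrases in the text."""
    words = text.lower().replace(",", " ").split()
    concepts = []
    i = 0
    while i < len(words):
        for n in range(min(MAX_PHRASE_LEN, len(words) - i), 0, -1):
            phrase = tuple(words[i:i + n])
            if phrase in TOY_KB:
                concepts.append(TOY_KB[phrase])
                i += n
                break
        else:
            i += 1  # word not covered by the KB, so it yields no concept
    return concepts

print(extract_concepts("Hyperthermia, increased intracranial pressure"))
```

Note that a word missing from the knowledge base simply disappears from the index, which is the incompleteness problem discussed on the following slides.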
8Shortcomings of Concept-based VSM
- Concepts may be related
- E.g., hyperthermia and body temperature elevation are not identical but related concepts
- Need to quantify conceptual relations
- Knowledge bases are often incomplete, which reduces the retrieval effectiveness
9Conceptual Similarity Evaluation
10Deriving Conceptual Similarity From Hypernym
Hierarchy
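The transcript does not preserve the formula used on this slide, so the following is only a generic sketch of one way to derive a similarity from an IS-A hierarchy: the inverse of one plus the number of IS-A edges on the path through the nearest common ancestor. This path-based measure is an assumption for illustration, not necessarily the authors' definition of s(ci,cj). Only the first link in `HYPERNYM` comes from the slides; the other links are hypothetical.

```python
# Child -> parent IS-A links. Only the first pair appears in the slides;
# the remaining links are hypothetical, added to make the example non-trivial.
HYPERNYM = {
    "hyperthermia": "body temperature elevation",
    "body temperature elevation": "body temperature change",
    "hypothermia": "body temperature change",
}

def ancestors(concept):
    """Return [concept, parent, grandparent, ...] up to the root."""
    chain = [concept]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def path_similarity(ci, cj):
    """1 / (1 + IS-A edges between ci and cj); 0.0 if they share no ancestor."""
    up_i = ancestors(ci)
    up_j = ancestors(cj)
    for dist_i, node in enumerate(up_i):
        if node in up_j:
            return 1.0 / (1 + dist_i + up_j.index(node))
    return 0.0

print(path_similarity("hyperthermia", "body temperature elevation"))
```

Under this assumed measure, identical concepts score 1.0, a concept and its direct hypernym score 0.5, and concepts in unrelated parts of the hierarchy score 0.0.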
11Shortcomings of Concept-based VSM
- Concepts may be related
- The conceptual similarity measure, s(ci,cj), quantifies relations between concepts.
- Knowledge bases are often incomplete, which reduces the retrieval effectiveness.
12Incompleteness of the Knowledge Bases
- In general, concept-based VSM cannot outperform
stem-based VSM
13Phrase-based Indexing Examples
14Evaluate Phrase-based Document Similarity
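The formula from this slide is not preserved in the transcript. As a rough sketch of the idea that document similarity combines a stem contribution and a concept contribution, the code below assumes a weighted sum normalized like a cosine. The combination rule, the default weights `f_s` and `f_c`, the function names, and the toy measure `toy_s` are all assumptions and may differ from the authors' definition.

```python
import math
from collections import Counter

def toy_s(ci, cj):
    """Hypothetical conceptual similarity: 1 if identical, 0.5 for one related pair."""
    if ci == cj:
        return 1.0
    if {ci, cj} == {"hyperthermia", "body temperature elevation"}:
        return 0.5
    return 0.0

def raw_similarity(x, y, s, f_s, f_c):
    """Weighted sum of stem overlap and pairwise conceptual similarity."""
    stems_x, concepts_x = x
    stems_y, concepts_y = y
    stem_part = sum(stems_x[t] * stems_y[t] for t in stems_x.keys() & stems_y.keys())
    concept_part = sum(
        wx * wy * s(ci, cj)
        for ci, wx in concepts_x.items()
        for cj, wy in concepts_y.items()
    )
    return f_s * stem_part + f_c * concept_part

def phrase_similarity(x, y, s, f_s=1.0, f_c=1.0):
    """Normalize so that a document compared with itself scores 1."""
    denom = math.sqrt(raw_similarity(x, x, s, f_s, f_c) * raw_similarity(y, y, s, f_s, f_c))
    return raw_similarity(x, y, s, f_s, f_c) / denom if denom else 0.0

# Each document is a pair: (stem counts, concept counts).
query = (Counter({"hypertherm": 1}), Counter({"hyperthermia": 1}))
doc = (Counter({"bodi": 1, "temperatur": 1, "elev": 1}),
       Counter({"body temperature elevation": 1}))
print(phrase_similarity(query, doc, toy_s))
```

In this example the query and document share no stems, so any nonzero score comes entirely from the concept contribution; setting `f_c=0.0` reduces the sketch to a purely stem-based comparison.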
15To Compare Retrieval Effectiveness
- The test set: OHSUMED
- 106 queries, 14K documents
- Expert relevance judgment: R or N
- Retrieval effectiveness
- Recall: the percentage of relevant documents retrieved so far
- Precision: the percentage of retrieved documents that are relevant
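The two measures above can be computed at any cutoff in a ranked result list. A minimal sketch follows; the function name `precision_recall` and the document IDs are illustrative, not taken from OHSUMED.

```python
def precision_recall(ranked_doc_ids, relevant_ids, k):
    """Precision and recall after the top-k documents have been retrieved."""
    retrieved = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in retrieved if doc_id in relevant_ids)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

ranked = ["d3", "d1", "d7", "d2"]   # system output, best match first
relevant = {"d1", "d2", "d9"}       # documents judged relevant by experts
for k in (2, 4):
    print(k, precision_recall(ranked, relevant, k))
```

Sweeping `k` over the ranked list yields the precision-recall pairs from which effectiveness curves are typically drawn.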
16Retrieval Effectiveness Comparison (Corpus: OHSUMED, KB: UMLS)
16100 queries vs. 5 50 queries
17Stem and Concept Similarity Contribution Weights
18Sensitivity of Retrieval Effectiveness to fs and
fc
19Computation Complexity Using Phrase-based VSM
- Data reorganization
- Build separate indexes on stems and concepts
- Keep a list of related concepts cj, with their conceptual similarity s(ci,cj), for each concept ci.
- Time complexities of document similarity calculation: same order of magnitude
- Stem-based VSM
- Phrase-based VSM
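The data reorganization above can be sketched as two inverted indexes plus a precomputed table of related concepts. This is a minimal illustration under assumed data shapes; `build_indexes`, `candidate_docs`, the `related` table, and the concept IDs are hypothetical names, not the authors' implementation.

```python
from collections import defaultdict

def build_indexes(docs):
    """docs: {doc_id: (stems, concepts)} -> separate stem and concept inverted indexes."""
    stem_index = defaultdict(set)
    concept_index = defaultdict(set)
    for doc_id, (stems, concepts) in docs.items():
        for stem in stems:
            stem_index[stem].add(doc_id)
        for concept in concepts:
            concept_index[concept].add(doc_id)
    return stem_index, concept_index

def candidate_docs(query_stems, query_concepts, stem_index, concept_index, related):
    """related: {ci: [(cj, s_ij), ...]}, the precomputed related-concept lists."""
    candidates = set()
    for stem in query_stems:
        candidates |= stem_index.get(stem, set())
    for ci in query_concepts:
        candidates |= concept_index.get(ci, set())
        # Follow the stored list so related concepts are found without a KB search.
        for cj, _similarity in related.get(ci, []):
            candidates |= concept_index.get(cj, set())
    return candidates

docs = {
    "d1": (["fever"], ["C_FEVER"]),
    "d2": (["bodi", "temperatur", "elev"], ["C_TEMP_UP"]),
    "d3": (["edema"], ["C_EDEMA"]),
}
related = {"C_FEVER": [("C_TEMP_UP", 0.5)]}
stem_index, concept_index = build_indexes(docs)
print(sorted(candidate_docs(["hypertherm"], ["C_FEVER"], stem_index, concept_index, related)))
```

Because the related concepts are looked up from a stored list, each query concept adds only a bounded number of extra index probes, which is consistent with the claim that the cost stays in the same order of magnitude as the stem-based VSM.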
20Conclusion
- A new document indexing paradigm based on phrases is proposed
- Use phrases (concept and its word stems) as terms
- Document similarity is derived from both the stem and the concept contributions
- Conceptual similarity quantifies the concept relations and improves retrieval effectiveness
- Stems remedy the incomplete coverage of the knowledge base (missing concepts and missing links between related concepts)
- Experimental results reveal a significant retrieval effectiveness improvement of the phrase-based VSM over the stem-based VSM
21Acknowledgement
This research is supported in part by NIC/NIH Grant 4442511-33780
22Model Comparison