Title: The Hows and Whys of Semantic Distance
1 The Hows and Whys of Semantic Distance
- Vipul Kashyap, National Library of Medicine, NIH
- kashyap_at_nlm.nih.gov
- NIST Semantic Distance Workshop, Gaithersburg, MD
- November 11, 2003
2 Outline
- Why Semantic Distance?
- Some applications of Semantic Distance
- The Semantic Conveyance problem
- Translations across multiple ontologies
- Role of Semantic Distance
- One approach to measuring Semantic Distance
- The Hows of Semantic Distance
- Types of Semantic Distance measures
3 Why do we need Semantic Distance (in Healthcare)?
- Healthcare information is characterized by multiple terminologies
- E.g., MeSH, CPT, LOINC, SnoMed, etc.
- Interoperability across terminologies is crucial to healthcare information system interoperability
- Which terminology do I interoperate with?
- What criteria/measure do we use?
- Application independent v/s application specific
- Should the measure be machine understandable?
- Should the measure be human understandable?
4 Applications of Semantic Distance
- Information Retrieval
- To improve automated assignment of indexing based descriptors
- Semantic Vocabulary Integration
- To choose the closest related concepts while translating in and out of the multiple vocabularies
- Look at the Semantic Vocabulary Interoperability Project at http://cgsb2.nlm.nih.gov/kashyap/projects/SVIP
5 Semantic Conveyance
[Figure: the Sender transmits the message "XXX Book" to the Receiver]
- The Sender has his own ontology (The Red Ontology)
- The Receiver has his own (The Blue Ontology)
- For communication to take place,
- The receiver should translate the message (content) from the Red Ontology to the Blue Ontology
- Questions
- Is it always possible?
- How many candidate possibilities are there?
- How do we choose from them, Semantic Distance?
6 Terminology 1: The Red Terminology
[Figure: fragment of the Red Terminology; the surviving node labels are Instructions and Reference-Manual]
http://www.cogsci.princeton.edu/wn/w3wn.html
7 Terminology 2: The Blue Terminology
[Figure: fragment of the Blue Terminology; the surviving node labels are Conference, Agent, Person, Organization, Author, Publisher, University, Thesis and Periodical-Publication]
http://www-ksl.stanford.edu/knowledge-sharing/ontologies/html/bibliographic-data/
8 Inter-terminological relationships
Typically represented in the UMLS Metathesaurus
- Synonyms
- semantics preserving
- Hyponyms/Hypernyms
- semantics altering
- typically results in loss of information
- List of Hyponyms
- technical-manual hyponym manual
- book hyponym book
- proceedings hyponym book
- thesis hyponym book
- misc-publication hyponym book
- technical-reports hyponym book
- press hyponym periodical-publication
- periodical hyponym periodical-publication
9 Translations across multiple terminologies
[Figure: candidate translations across the terminologies; the surviving labels are union(Book, Proceedings, ..., Misc-Publication), document, Technical-Manual and GuideBook]
10 Semantic Conveyance
[Figure: the Sender transmits "XXX Book"; the Receiver may read it as "XXX Document" or "XXX union(...)"]
- XXX Document
- XXX union(...)
- XXX ...
- How do we choose between the various alternatives?
- Semantic Distance to the rescue!!
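11 Proposal for Semantic Distance: Extensional Measure
[Figure: the overlapping extensions Ext(Term) and Ext(Translation), with regions labeled Loss in Precision and Loss in Recall]
- Precision = $\dfrac{|\text{Ext}(\text{Term}) \cap \text{Ext}(\text{Translation})|}{|\text{Ext}(\text{Translation})|}$
- Recall = $\dfrac{|\text{Ext}(\text{Term}) \cap \text{Ext}(\text{Translation})|}{|\text{Ext}(\text{Term})|}$
- Percentage Loss: a ratio whose numerator and denominator both combine Ext(Term) and Ext(Translation); the operators are illegible in the source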
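The extensional measure can be illustrated with ordinary set arithmetic. The following is a minimal sketch, assuming each extension is available as a finite set of instance identifiers. The function names, the document identifiers and the choice of a balanced F-measure are illustrative assumptions, not part of the original proposal.

```python
def precision_recall(term_ext, translation_ext):
    """Return (precision, recall) of a translation relative to the original term.

    Both arguments are collections of instance identifiers (the extensions).
    """
    term_ext, translation_ext = set(term_ext), set(translation_ext)
    overlap = len(term_ext & translation_ext)
    # Precision: share of the translation's instances that belong to the term.
    precision = overlap / len(translation_ext) if translation_ext else 0.0
    # Recall: share of the term's instances that the translation still covers.
    recall = overlap / len(term_ext) if term_ext else 0.0
    return precision, recall


def f_measure(precision, recall, beta=1.0):
    """Standard F-measure from information retrieval (beta=1 weighs both equally)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical extensions: every Book is a Document, but not vice versa.
book_ext = {"doc1", "doc2", "doc3", "doc4"}
document_ext = book_ext | {"doc5", "doc6", "doc7", "doc8"}

precision, recall = precision_recall(book_ext, document_ext)
print(precision, recall, f_measure(precision, recall))
```

With these made-up sets, translating Book to Document keeps every book (recall 1.0) but only half of what comes back is a book (precision 0.5).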
12 Choosing an optimal translation: Local v/s Global Decision Making
[Figure: translation graph over the terms Book, Publication, Document and Journal, with edges labeled LOSS(Document, Book), LOSS(Publication, Journal), LOSS(Journal, Book) and LOSS(Document, Publication)]
- Local Decision Making
- LOSS(Publication, Journal) > LOSS(Document, Publication)
- Document is chosen as the translation
- But LOSS(Book, Document) > LOSS(Book, Journal) !!
- Global Decision Making
- Both translations Document, Journal are passed on to the next level
- Journal is chosen as the appropriate translation
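The difference between the two strategies can be sketched in a few lines. The loss values below are hypothetical numbers chosen only to reproduce the situation on this slide, and `loss`, `choose_locally` and `choose_globally` are illustrative names rather than functions from any existing system.

```python
# Hypothetical loss values; keys are unordered pairs because the slide
# writes LOSS arguments in both orders.
LOSS = {
    frozenset({"Publication", "Document"}): 0.2,
    frozenset({"Publication", "Journal"}): 0.3,
    frozenset({"Book", "Document"}): 0.6,
    frozenset({"Book", "Journal"}): 0.4,
}


def loss(a, b):
    """Look up the (assumed, precomputed) information loss between two terms."""
    return LOSS[frozenset({a, b})]


def choose_locally(previous_term, candidates):
    """Local decision: minimise loss relative to the immediately preceding term."""
    return min(candidates, key=lambda c: loss(previous_term, c))


def choose_globally(original_term, candidates):
    """Global decision: keep all candidates and minimise loss relative to the original term."""
    return min(candidates, key=lambda c: loss(original_term, c))


candidates = ["Document", "Journal"]
local_pick = choose_locally("Publication", candidates)
global_pick = choose_globally("Book", candidates)
print(local_pick, global_pick)
```

With these numbers the local strategy commits to Document one step too early, while carrying both candidates forward lets the final comparison against Book select Journal.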
13 Using Subsumption for tighter bounds on Semantic Distance
- Term subsumes Translation
- $\text{Ext}(\text{Translation}) \subseteq \text{Ext}(\text{Term}) \Rightarrow \text{Ext}(\text{Term}) \cap \text{Ext}(\text{Translation}) = \text{Ext}(\text{Translation})$
- Precision = 1
- Recall = $\dfrac{|\text{Ext}(\text{Translation})|}{|\text{Ext}(\text{Term})|}$
- Should be able to incorporate other application-specific measures to adapt distance measures
- The same terminological translation might have different semantic distances based on application-specific adaptations
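The subsumption bound at the top of this slide follows from the usual set-based definitions of precision and recall. The derivation below spells out that step. The second case (Translation subsumes Term) is not on the slide and is included only as the symmetric consequence of the same definitions.

```latex
% Case 1: Term subsumes Translation, i.e. Ext(Translation) \subseteq Ext(Term)
\begin{align*}
\text{Precision}
  &= \frac{|\text{Ext}(\text{Term}) \cap \text{Ext}(\text{Translation})|}{|\text{Ext}(\text{Translation})|}
   = \frac{|\text{Ext}(\text{Translation})|}{|\text{Ext}(\text{Translation})|} = 1 \\
\text{Recall}
  &= \frac{|\text{Ext}(\text{Term}) \cap \text{Ext}(\text{Translation})|}{|\text{Ext}(\text{Term})|}
   = \frac{|\text{Ext}(\text{Translation})|}{|\text{Ext}(\text{Term})|}
\end{align*}

% Case 2 (symmetric, not on the slide): Translation subsumes Term,
% i.e. Ext(Term) \subseteq Ext(Translation)
\begin{align*}
\text{Recall} &= \frac{|\text{Ext}(\text{Term})|}{|\text{Ext}(\text{Term})|} = 1 \\
\text{Precision} &= \frac{|\text{Ext}(\text{Term})|}{|\text{Ext}(\text{Translation})|}
\end{align*}
```

In either case one of the two quantities is fixed at 1, so only a single ratio of extension sizes has to be estimated.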
14 Proposal for Semantic Distance: Intensional Measure
- Difference in Translation
- Book → union(Book, Thesis, Proceedings, Technical-Manual, Misc-Publication)
- Terminological Difference
- Book ≡ (AND Publication (ATLEAST 1 ISBN))
- Publication ≡ (AND document (ATLEAST 1 PLACE-OF-PUBLICATION))
- Book ≡ (AND document (ATLEAST 1 ISBN) (ATLEAST 1 PLACE-OF-PUBLICATION))
- Loss of Information
- (-) union(Trade-Book, Brochure, SongBook, PrayerBook, TextBook)
- information related to trade books, brochures, song books, prayer books and text books is lost
- (+) (AND (ATLEAST 1 ISBN) (ATLEAST 1 PLACE-OF-PUBLICATION))
- spurious documents that do not have an ISBN number and a place of publication are gained
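The terminological difference above amounts to expanding each definition down to its primitive ancestor and comparing the accumulated constraints. A minimal sketch follows, assuming definitions are stored as a parent plus a set of constraint strings. This representation, and the names `DEFINITIONS`, `expand` and `dropped_constraints`, are illustrative assumptions and not the notation of any particular description logic system.

```python
# Concept definitions transcribed from the slide as (parent, own constraints).
# "document" is treated as a primitive concept here, which is an assumption.
DEFINITIONS = {
    "Book": ("Publication", {"(ATLEAST 1 ISBN)"}),
    "Publication": ("document", {"(ATLEAST 1 PLACE-OF-PUBLICATION)"}),
    "document": (None, set()),
}


def expand(concept):
    """Return (primitive ancestor, all constraints inherited along the parent chain)."""
    constraints = set()
    while True:
        parent, own = DEFINITIONS[concept]
        constraints |= own
        if parent is None:
            return concept, constraints
        concept = parent


def dropped_constraints(term, translation):
    """Constraints required by the term that the translation does not enforce."""
    _, term_constraints = expand(term)
    _, translation_constraints = expand(translation)
    return term_constraints - translation_constraints


print(expand("Book"))
print(sorted(dropped_constraints("Book", "document")))
```

Translating Book to document drops both constraints, which corresponds to the gained documents that lack an ISBN and a place of publication.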
15 Measures for Semantic Distance: Pros and Cons
- Intensional Measure
- May not make sense as it mixes two vocabularies,
- e.g., does Book - Book make any sense?
- The problem becomes worse if the two terminologies are in different languages
- Makes it hard for the system to differentiate between the various alternatives
- Extensional Measure
- Based on Standard Information Retrieval Measures (F-measure)
- Can be tailored to reflect change in semantic distance for different applications
- However
- Probability distributions of various terms need to be estimated
- An information loss interval doesn't make much sense to the user.
16 How do we measure Semantic Distance?
- Extensional Approaches
- Intensional Approaches (based on concept definitions)
- Combination of the above approaches
- Combining the above using some weighting schemes
- Applying semantics of subsumption to the semantic distance measures
- Applying other constraints specified in the semantic network
17 Types of Semantic Distance Metrics: Intensional
- Numerical
- Based on features (e.g., Tversky's measure)
- Based on traversal of specific conceptual relationships (is-a, part-of) and arbitrary domain specific relationships
- Non-numerical
- Based on semantic concept differences, e.g. a book without a publication date
- Important for human understandability
18 Proposal for Semantic Distance: Tversky's measure from Psycho-semantics
- $S(a, b) = \dfrac{|A \cap B|}{|A \cap B| + \alpha(a, b)\,|A \setminus B| + (1 - \alpha(a, b))\,|B \setminus A|}$
- S(a, b) is the similarity between two arbitrary objects, a, b
- A and B are feature sets of a, b respectively
- $\alpha$ is a real number, $0 \le \alpha \le 1$
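The ratio on this slide can be computed directly from two feature sets. The sketch below assumes the weight is supplied as a plain number rather than derived from the two objects, and the feature names are made up for illustration.

```python
def tversky_similarity(features_a, features_b, alpha=0.5):
    """Feature-contrast similarity S(a, b) for feature sets A and B.

    alpha weighs features unique to A; (1 - alpha) weighs features unique to B.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    a, b = set(features_a), set(features_b)
    common = len(a & b)
    denominator = common + alpha * len(a - b) + (1 - alpha) * len(b - a)
    # The ratio is undefined when the denominator is zero (for example, two
    # empty feature sets); returning 0.0 is a convention chosen for this sketch.
    return common / denominator if denominator else 0.0


# Hypothetical feature sets for two concepts.
book = {"has-author", "has-isbn", "has-publisher"}
thesis = {"has-author", "has-university"}

for alpha in (0.0, 0.5, 1.0):
    print(alpha, tversky_similarity(book, thesis, alpha))
```

Unless alpha is 0.5 the measure is asymmetric, so S(a, b) and S(b, a) can differ. This is one reason the weight has to be chosen per application.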
19 Types of Semantic Distance Metrics: Extensional
- Numerical: Based on estimation of underlying concept extensions
- Computation of joint and conditional probability distributions
- Computation of concept co-occurrences in documents
- Computation of cosine measures in a vector space model
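As an illustration of the co-occurrence and cosine ideas, the sketch below represents each concept by the set of documents it is indexed in and compares concepts by the cosine of those vectors. The tiny corpus and the binary weighting are assumptions made for the example. A real system would more likely use frequency-based weights.

```python
import math


def concept_vectors(doc_concepts):
    """Build one sparse vector per concept from {doc_id: set of concepts}.

    Each vector maps a document id to 1.0 when the concept occurs in it.
    """
    vectors = {}
    for doc_id, concepts in doc_concepts.items():
        for concept in concepts:
            vectors.setdefault(concept, {})[doc_id] = 1.0
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical indexed corpus.
corpus = {
    "d1": {"Book", "Thesis"},
    "d2": {"Book"},
    "d3": {"Journal"},
    "d4": {"Book", "Thesis"},
}
vectors = concept_vectors(corpus)
book_thesis = cosine_similarity(vectors["Book"], vectors["Thesis"])
book_journal = cosine_similarity(vectors["Book"], vectors["Journal"])
print(book_thesis, book_journal)
```

Concepts that never share a document get a similarity of 0, and a distance can be taken as one minus the cosine if a distance rather than a similarity is needed.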
20 A Classification of Numerical (Intensional and Extensional) Approaches
- Traversal of graph-based information models
- Traversal of Hierarchical Relationships
- Intensional
- Feature contrast based approaches (e.g., Tversky)
- Intensional
- Probabilistic approaches (e.g., Precision, Recall, F-measure)
- Based on estimation of extensions/distributions of concepts
- Some combination of the above?
21 Counter-Example: Hierarchical approaches
[Figure: concept hierarchy over C1, C2, C3 and C4]
- Hierarchical approach
- semantic-distance(C1, C2) < semantic-distance(C1, C4)
- However, if we look at concept extensions, it might be the case that
- only 10% of C2 is C1 and
- 90% of C3 is C1 and 20% of C4 is C3
- This implies, 18% of C4 is C1
- Thus, semantic-distance(C1, C2) > semantic-distance(C1, C4)
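The counter-example can be checked numerically. The sketch below assumes a hierarchy in which C1 is a child of both C2 and C3, and C3 is a child of C4. The original figure is not available, so this shape is an assumption chosen to be consistent with the bullets. Edge counting stands in for the hierarchical measure, and the share of a concept's extension that belongs to C1 stands in for the extensional one.

```python
from collections import deque
from fractions import Fraction

# Assumed is-a links (child -> parents); the slide's figure is not recoverable.
IS_A = {"C1": ["C2", "C3"], "C2": [], "C3": ["C4"], "C4": []}


def edge_distance(start, goal):
    """Number of is-a edges on the shortest path, ignoring edge direction."""
    neighbours = {node: set(parents) for node, parents in IS_A.items()}
    for child, parents in IS_A.items():
        for parent in parents:
            neighbours[parent].add(child)
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no path between the two concepts


# Extension shares stated on the slide, kept exact with Fraction.
share_of_c2_that_is_c1 = Fraction(10, 100)
share_of_c3_that_is_c1 = Fraction(90, 100)
share_of_c4_that_is_c3 = Fraction(20, 100)
share_of_c4_that_is_c1 = share_of_c3_that_is_c1 * share_of_c4_that_is_c3

print(edge_distance("C1", "C2"), edge_distance("C1", "C4"))
print(share_of_c2_that_is_c1, share_of_c4_that_is_c1)
```

By edge count C2 is nearer to C1 than C4 is, yet a larger share of C4 than of C2 consists of C1 instances, so the two kinds of measure rank the concepts in opposite order.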
22 Counter-Example: Probabilistic approaches
[Figure: concepts C1, C2 and C3]
- Probabilistic approach
- semantic-distance(C1, C2) and semantic-distance(C2, C3) are given
- Can we compute semantic-distance(C1, C3)?
- It is quite possible that
- semantic-distance(C1, C3) < semantic-distance(C1, C2) and
- semantic-distance(C1, C3) < semantic-distance(C2, C3)
- Feature contrast based approach
- If the features of the concepts are known, Tversky's measures may be used
- However, the Semantic Proximity algorithm has to identify C3 as a concept that may have semantic resemblance with C1
23 Research areas dealing with Semantic Distances
- Knowledge Representation
- primarily non-numerical, intensional
- Statistical Clustering
- primarily numerical, vector space model
- Data Mining
- primarily numerical, probabilistic
- Machine Learning
- primarily numerical
- Information Retrieval
- primarily numerical, probabilistic
- Medical Informatics
- primarily numerical, intensional
- Natural Language Processing
- ??
- ... any other field
24 Conclusions
- Semantic Distance measures need to be application specific
- Text Retrieval
- (Structured) Data Retrieval
- Domain and Context Specific
- Semantic Distance measures should be both human and machine processable
- They should be based on standard measures as far as possible
- E.g., F-measure from Information Retrieval
- There is a need for estimation of various distributions of medical concepts in a given population
- E.g. May need to mine CDC databases