1 TaxaMiner: An Experimental Framework for Automated Taxonomy Bootstrapping
- Vipul Kashyap
- National Library of Medicine
- kashyap_at_nlm.nih.gov
- http://cgsb2.nlm.nih.gov/kashyap
- May 27, 2004
2 Outline
- Motivation
- TaxaMiner Approach
- TaxaMiner Methodology
- Taxonomy Evaluation
- Experimental Setup
- Initial Results
- Discussion of Problems and Approaches
- LSI and Term Neighborhood Expansion
- Conclusions and Future Work
3 Motivation
- Vast amounts of biomedical research literature
- Taxonomies/Thesauri: a useful and popular form of knowledge organization
- MeSH, Gene Ontology
- Yahoo! Taxonomy
- Largely manual efforts used to create taxonomies/thesauri
- Huge investments of time and resources
- Not scalable
- Semi-automatic Taxonomy Generation
- Bootstrap rough taxonomies/thesauri
- Human involvement to create and refine taxonomies
- TaxaMiner Project
4 TaxaMiner Approach
[Figure: processing pipeline: Data Extraction and Sampling → Pre-process data using NLP techniques → Document Indexing → Document Clustering → Taxonomy Extraction → Label Generation and Smoothing → Taxonomy Evaluation]
5 TaxaMiner Methodology
- NLP Techniques for Pre-processing
- Phrase X Parser
- Document Indexing
- SMART, LSI
- Document Clustering
- Bisecting K-Means
- Taxonomy Extraction
- Label Generation and Smoothing
- Term Neighborhood Expansion
6 TaxaMiner Methodology: NLP-based Pre-processing of Documents
- Example sentence: Kupffer cells from halothane-exposed guinea pigs carry trifluoroacetylated protein adducts
- Simple Noun Phrases
- Kupffer cells
- Halothane-exposed guinea pigs
- Trifluoroacetylated protein adducts
- Macro Noun Phrase
- Kupffer cells from halothane-exposed guinea pigs
- Mega Noun Phrases
- Kupffer cells from halothane-exposed guinea pigs
- Trifluoroacetylated protein adducts
7 TaxaMiner Methodology: Document Indexing
- SMART Indexing tool (from Cornell University)
- Document vectors consist of word-based features
- TF-IDF weights
- Log entropy weighting function
- Latent Semantic Indexing
- SVD analysis of term/document matrix
- Underlying Latent Dimensions
- Document vectors consist of weighting of these
latent features
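The two indexing options can be sketched in a few lines of NumPy. This is an illustrative sketch only: the toy corpus, the plain log-IDF weighting, and the function names are this sketch's assumptions, not the SMART tool's actual weighting scheme.

```python
import numpy as np

def tfidf_matrix(docs):
    """Build a TF-IDF weighted term-document matrix (terms x documents)."""
    vocab = sorted({w for d in docs for w in d})
    tf = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d:
            tf[vocab.index(w), j] += 1
    df = (tf > 0).sum(axis=1)            # document frequency per term
    idf = np.log(len(docs) / df)         # inverse document frequency
    return tf * idf[:, None], vocab

def lsi(matrix, k):
    """Project documents onto the top-k latent dimensions via SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (np.diag(s[:k]) @ vt[:k]).T   # one k-dimensional vector per document

# Toy corpus: each document is a bag of (pre-processed) terms
docs = [["kupffer", "cells", "adducts"],
        ["guinea", "pigs", "halothane"],
        ["kupffer", "cells", "halothane"]]
A, vocab = tfidf_matrix(docs)
doc_vecs = lsi(A, k=2)                   # LSI document vectors over latent features
```

In the SMART view each document vector row spans word-based features; in the LSI view the SVD compresses those into k latent dimensions.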
8 TaxaMiner Methodology: Document Clustering
[Figure: Bisecting K-Means clustering transforms the document set D into a document cluster hierarchy H]
9 TaxaMiner Methodology: Bisecting K-Means
- Given a set of document vectors D = {d1, …, dM}
- Iterative splitting of a chosen cluster at each stage
- Initial cluster is the whole document set
- Until a termination condition is reached
- Compute
- Centroid of a cluster
- Cohesiveness of a particular cluster
- Quality of a partition
- Cluster choice criterion: choose the cluster with the least cohesiveness
- Termination condition: decrease in partition quality
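The loop above can be sketched as follows. This is a minimal sketch under assumptions: cohesiveness is taken as average cosine similarity to the centroid, a fixed leaf count stands in for the partition-quality termination test, and all function names are illustrative rather than TaxaMiner's actual code.

```python
import numpy as np

def cohesiveness(X):
    """Average cosine similarity of a cluster's vectors to its centroid."""
    c = X.mean(axis=0)
    return float(np.mean(
        X @ c / (np.linalg.norm(X, axis=1) * np.linalg.norm(c) + 1e-12)))

def two_means(X, iters=10, seed=0):
    """Split one cluster into two with basic k-means (k = 2)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=2, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        assign = d.argmin(axis=1)
        if (assign == assign[0]).all():          # never leave one side empty
            assign[d[:, assign[0]].argmax()] = 1 - assign[0]
        for j in (0, 1):
            centers[j] = X[assign == j].mean(axis=0)
    return np.where(assign == 0)[0], np.where(assign == 1)[0]

def bisecting_kmeans(X, max_leaves=3):
    """Iteratively bisect the least cohesive cluster."""
    clusters = [np.arange(len(X))]               # initial cluster = whole set
    while len(clusters) < max_leaves:
        splittable = [i for i, c in enumerate(clusters) if len(c) > 1]
        i = min(splittable, key=lambda i: cohesiveness(X[clusters[i]]))
        left, right = two_means(X[clusters[i]])
        idx = clusters.pop(i)
        clusters += [idx[left], idx[right]]
    return clusters

X = np.array([[1.0, 0.0], [1.1, 0.0], [5.0, 5.0],
              [5.1, 5.0], [0.0, 9.0], [0.0, 9.1]])
clusters = bisecting_kmeans(X, max_leaves=3)
```

Recording each split, rather than just the leaves, yields the document cluster hierarchy H of the previous slide.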
10 TaxaMiner Methodology: Taxonomy Extraction
[Figure: the document cluster hierarchy H over D is cut at cohesiveness thresholds θ1 and θ2 to extract the taxonomy]
11 TaxaMiner Methodology: Taxonomy Extraction
- Concept differentiation in a taxonomy is captured by the difference in cluster cohesiveness
- Observation
- Successive values of cohesiveness down a cluster hierarchy are monotonically increasing
- Input
- A set of thresholds θ1 ≤ θ2 ≤ … ≤ θN
- Output
- A taxonomy T that corresponds to the taxonomy creator's notion of differentiation
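One way to realize the threshold cut is sketched below. The dict-based tree encoding and the rule of promoting a below-threshold cluster's children are assumptions of this sketch, not the paper's data structures: a cluster becomes a taxonomy node at level i only once its cohesiveness crosses the i-th threshold.

```python
def extract_taxonomy(node, thresholds, level=0):
    """Collapse a cluster hierarchy into a taxonomy using cohesiveness thresholds.

    A cluster whose cohesiveness reaches thresholds[level] becomes a taxonomy
    node; otherwise it is skipped and its children are promoted in its place.
    """
    if level >= len(thresholds):
        return []                                    # finer distinctions ignored
    if node["cohesiveness"] >= thresholds[level]:
        children = []
        for c in node.get("children", []):
            children.extend(extract_taxonomy(c, thresholds, level + 1))
        return [{"cluster": node["id"], "children": children}]
    out = []                                         # below threshold: promote children
    for c in node.get("children", []):
        out.extend(extract_taxonomy(c, thresholds, level))
    return out

# Cohesiveness increases monotonically down the hierarchy
tree = {"id": 0, "cohesiveness": 0.2, "children": [
    {"id": 1, "cohesiveness": 0.5, "children": [
        {"id": 3, "cohesiveness": 0.8, "children": []},
        {"id": 4, "cohesiveness": 0.6, "children": []}]},
    {"id": 2, "cohesiveness": 0.35, "children": []}]}
taxo = extract_taxonomy(tree, [0.3, 0.7])
```

Here the loose root cluster disappears, clusters 1 and 2 become top-level concepts, and only cluster 3 is cohesive enough to survive as a second-level concept.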
12 TaxaMiner Methodology: Taxonomy Extraction
[Figure: taxonomy extraction and labeling: the document cluster hierarchy H over D is cut at thresholds θ1 and θ2, and labels (L1, L2, …, L14) are attached to the resulting taxonomy nodes]
13 TaxaMiner Methodology: Label Assignment
- Input: centroid vector of the taxonomy node
- SMART-based indexing
- Choose the top K weights in the centroid vector
- Labels of the node: words corresponding to those weights
- LSI-based indexing
- Compute the cosine between the term vectors and the centroid vector
- Choose the terms corresponding to the top K cosine values
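Both label-assignment variants are compact enough to show directly. The toy vocabulary, latent vectors, and function names below are illustrative assumptions, not values from the experiments.

```python
import numpy as np

def smart_labels(centroid, vocab, k=3):
    """SMART-style: labels are the terms with the top-k centroid weights."""
    top = np.argsort(centroid)[::-1][:k]
    return [vocab[i] for i in top]

def lsi_labels(centroid, term_vectors, vocab, k=3):
    """LSI-style: labels are the terms whose latent vectors have the highest
    cosine similarity with the node centroid."""
    sims = term_vectors @ centroid / (
        np.linalg.norm(term_vectors, axis=1) * np.linalg.norm(centroid) + 1e-12)
    return [vocab[i] for i in np.argsort(sims)[::-1][:k]]

vocab = ["cell", "tumor", "gene"]
node_centroid = np.array([0.1, 0.9, 0.4])      # word weights (SMART view)
labels_smart = smart_labels(node_centroid, vocab, k=2)

latent_terms = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # term vectors (LSI view)
latent_centroid = np.array([1.0, 0.1])         # node centroid in latent space
labels_lsi = lsi_labels(latent_centroid, latent_terms, vocab, k=1)
```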
14 TaxaMiner Methodology: Label Refinement
[Figure: label refinement example with a parent labeled (A, B, D) and children labeled (C, D, E), (D, F, G), (D, H, I): when label D is shared across the children, propagate label D to the parent and drop it from the children; when D occurs in only one child, keep label D at that child]
15 Taxonomy Evaluation
- Taxonomy Content Quality
- Precision-based measure (CQM-P)
- Fraction of all the labels generated that are present in the Gold Standard Taxonomy
- Recall-based measure (CQM-R)
- Fraction of all the labels in the Gold Standard Taxonomy that were generated
- Taxonomy Structural Quality
- Precision-based measure (SQM-P)
- Fraction of all the generated parent-child relationships that are reflected consistently in the Gold Standard Taxonomy
- Recall-based measure (SQM-R)
- Fraction of all the parent-child relationships in the Gold Standard Taxonomy that are reflected consistently
- Consistency: a parent-child relationship is consistent if it appears
- As a parent-child relationship, OR
- As an ancestor-descendant relationship in the Gold Standard Taxonomy
- Bake-off in an application context
- Use of Gold Standard and Generated Taxonomies to create search expressions
16 Taxonomy Evaluation: Content
- Let labels(T) be the set of node labels in taxonomy T, and G be the Gold Standard Taxonomy
- CQM-P can be defined as
- CQM-P(T) = |labels(T) ∩ labels(G)| / |labels(T)|
- CQM-R can be defined as
- CQM-R(T) = |labels(T) ∩ labels(G)| / |labels(G)|
17 Taxonomy Evaluation: Structure
- Let
- pcLinks(T) = {⟨a, b⟩ | a is the parent of b in T}
- adLinks(T) = {⟨a, b⟩ | a is an ancestor of b in T}
- adLinks(T) ⊇ pcLinks(T)
- SQM-P can be defined as
- SQM-P(T) = |pcLinks(T) ∩ adLinks(G)| / |pcLinks(T)|
- SQM-R can be defined as
- SQM-R(T) = |pcLinks(G) ∩ adLinks(T)| / |pcLinks(G)|
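The four measures can be computed directly over label sets and parent-child link sets. The pure-Python encoding below, including the transitive-closure helper for ancestor-descendant links, is an assumption of this sketch rather than the evaluation code used in the experiments.

```python
def cqm(generated_labels, gold_labels):
    """Content quality: (precision, recall) of generated labels vs. gold labels."""
    hit = generated_labels & gold_labels
    return len(hit) / len(generated_labels), len(hit) / len(gold_labels)

def ad_links(pc_links):
    """Transitive closure of parent-child links -> ancestor-descendant links."""
    ad = set(pc_links)
    while True:
        new = {(a, d) for (a, b) in ad for (c, d) in ad if b == c} - ad
        if not new:
            return ad
        ad |= new

def sqm(gen_pc, gold_pc):
    """Structural quality: a link counts as consistent if it appears as a
    parent-child or ancestor-descendant link on the other side."""
    gen_ad, gold_ad = ad_links(gen_pc), ad_links(gold_pc)
    precision = len({l for l in gen_pc if l in gold_ad}) / len(gen_pc)
    recall = len({l for l in gold_pc if l in gen_ad}) / len(gold_pc)
    return precision, recall
```

For example, a generated chain a→b→c scored against a gold standard with links a→b and a→c gives SQM-P = 0.5 (b→c has no gold counterpart) but SQM-R = 1.0 (a→c is covered as an ancestor-descendant link).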
18 Experimental Setup
- MeSH Subtree under Neoplasms chosen as the Gold Standard
- Tree number C04, 649 concepts
- Identify MEDLINE citations that have as annotations concepts from C04
- Extract the raw text from the MEDLINE citations; these form the documents in the setup
- Apply the TaxaMiner methodology to generate a taxonomy from these documents
19 Initial Results: Is there an optimal data set size?
20 Initial Results: Does recall improve?
21 Initial Results: Does increase in data set size inhibit learning of taxonomic structure?
22 Initial Results: Does increase in data set size inhibit learning of taxonomic structure?
23 Initial Results: Does NLP improve taxonomy quality?
24 Initial Results: Does NLP improve taxonomy quality?
25 Gold Standard Taxonomy
26 Generated Taxonomy
27 Initial Results: Improving Taxonomy Quality
- Problems
- Poor structural quality
- Poor precision (< 20%)
- Performed OK on recall (achieved up to 58% in some cases)
- Approaches
- Involve the user in determining the θ values
- Improve label refinement algorithms
- Use LSI and term neighborhood expansion (TNE) to identify a representative set of labels at each node
28 Term Neighborhood Expansion (TNE)
- Given
- A taxonomy node N
- A lexicon L of terms created from the underlying document corpus
- For t ∈ labels(N), let t also denote the term vector corresponding to t
- neighborhood(t) = {⟨w, σt⟩ | w ∈ L, σt = similarity between w and t}
- central-terms(N) = {⟨w, σ⟩ | t ∈ labels(N), ⟨w, σt⟩ ∈ neighborhood(t), σ aggregates the σt}
- core-terms(N) = top K central terms
- = {w | ⟨w, σ⟩ ∈ central-terms(N), σ is among the top K values}
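The definitions above can be sketched with a toy term-similarity table. The dict-of-dicts similarity structure, the similarity threshold, and summation as the aggregation of the σt values are all assumptions of this sketch.

```python
def neighborhood(term, sim, threshold=0.3):
    """Terms whose similarity to `term` meets the threshold."""
    return {w: s for w, s in sim.get(term, {}).items() if s >= threshold}

def core_terms(labels, sim, k=2, threshold=0.3):
    """Pool the neighborhoods of all node labels and keep the top-k terms by
    accumulated similarity -- the node's core terms."""
    scores = {}
    for t in labels:
        for w, s in neighborhood(t, sim, threshold).items():
            scores[w] = scores.get(w, 0.0) + s
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy term-term similarities (e.g., cosines between LSI term vectors)
sim = {"tumor": {"neoplasm": 0.9, "cancer": 0.8, "cell": 0.2},
       "cancer": {"neoplasm": 0.7, "carcinoma": 0.6}}
core = core_terms(["tumor", "cancer"], sim, k=2)
```

A term like "neoplasm" that sits in the neighborhood of several node labels accumulates a high score and surfaces as a core term even if it was never assigned as a label itself.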
29 Comparison of SMART + NLP with LSI (without NLP) + TNE
30 Comparison of SMART + NLP with LSI (without NLP) + TNE
31 Comparison of SMART + NLP with LSI (without NLP) + TNE
33 Experimental Framework
- Sampling
- Uniform sampling vs. density-biased sampling
- Natural Language Processing
- Noun Phrases: (i) Simple, (ii) Macro, (iii) Mega
- Verb Phrases
- Indexing
- Term-based dimensions: word-based vs. phrase-based
- SVD eigenvector-based dimensions
- Clustering
- Document-based vs. term-based clustering
- Bisecting K-Means vs. Principal Direction Divisive Partitioning
- Distance Measures
- Euclidean vs. cosine
- Cluster Quality Measures
- Internal measures: pairwise distance vs. distance from centroid
- External measures
- K-Means: number of iterations
- Label assignment
- Threshold (value of top K)
34 Conclusions
- Critical need for automating the creation of domain-specific taxonomies/thesauri
- The TaxaMiner methodology and approach shows promise and encouraging results
- Initial results showed reasonable recall
- Use of LSI and Term Neighborhood Expansion improves the quality of the generated taxonomy
- Definite reduction in cost and time
- Subject matter experts don't have to start from scratch by reading all the documents
35 Future Work
- Perform more extensive experimentation
- Investigate the various dimensions of the experimental framework
- Investigate the use of LSI and TNE vs. NLP
- The role of co-occurrence patterns?
- Explore and design internal taxonomy quality metrics based on
- Coverage, discrimination, consistency, etc.
- Enhance techniques for Ontology Learning
- Use of lexico-syntactic patterns
- Use of general thesauri, e.g., WordNet
- Investigate spin-offs
- document classification, semantic annotations
36 Acknowledgements
- Collaborators
- University of Georgia: Cartic Ramakrishnan, Christopher Thomas
- NLM/LHC: Tom Rindflesch
- Telcordia Technologies: Devasis Bassu
- LSI data and related discussions
- NLM/NCBI: John Wilbur
- Penn State University: Hui Han, Hongyuan Zha
- Lister Hill Center
- Tom Rindflesch, Olivier Bodenreider, Anantha Bangalore, Mehmet Kayaalp, Samir Antani