1 TaxaMiner: An Experimental Framework for Automated Taxonomy Bootstrapping
- Vipul Kashyap
- National Library of Medicine
- kashyap_at_nlm.nih.gov
- http://cgsb2.nlm.nih.gov/kashyap
- May 27, 2004
2 Outline
- Motivation
- TaxaMiner Approach
- TaxaMiner Methodology
- Taxonomy Evaluation
- Experimental Setup
- Initial Results
- Discussion of Problems and Approaches
- LSI and Term Neighborhood Expansion
- Conclusions and Future Work
3 Motivation
- Vast amounts of biomedical research literature
- Taxonomies/Thesauri: a useful and popular form of knowledge organization
- MeSH, Gene Ontology
- Yahoo! Taxonomy
- Largely manual efforts used to create taxonomies/thesauri
- Huge investments of time and resources
- Not scalable
- Semi-automatic Taxonomy Generation
- Bootstrap rough taxonomies/thesauri
- Human involvement to create and refine taxonomies
- TaxaMiner Project
4 TaxaMiner Approach
[Figure: processing pipeline: Data Extraction and Sampling → Pre-process data using NLP techniques → Document Indexing → Document Clustering → Taxonomy Extraction → Label Generation and Smoothing → Taxonomy Evaluation]
5 TaxaMiner Methodology
- NLP Techniques for Pre-processing
- Phrase X Parser
- Document Indexing
- SMART, LSI
- Document Clustering
- Bisecting K-Means
- Taxonomy Extraction
- Label Generation and Smoothing
- Term Neighborhood Expansion
6 TaxaMiner Methodology: NLP-based Pre-processing of Documents
- Example sentence: Kupffer cells from halothane-exposed guinea pigs carry trifluoroacetylated protein adducts
- Simple Noun Phrases
- Kupffer cells
- Halothane-exposed guinea pigs
- Trifluoroacetylated protein adducts
- Macro Noun Phrase
- Kupffer cells from halothane-exposed guinea pigs
- Mega Noun Phrases
- Kupffer cells from halothane-exposed guinea pigs
- Trifluoroacetylated protein adducts
7 TaxaMiner Methodology: Document Indexing
- SMART Indexing tool (from Cornell University)
- Document vectors consist of word-based features
- TF-IDF weights
- Log entropy weighting function
- Latent Semantic Indexing
- SVD analysis of term/document matrix
- Underlying Latent Dimensions
- Document vectors consist of weighting of these
latent features
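The two indexing options can be sketched in a few lines of NumPy. This is an illustrative sketch only: the toy corpus, the plain log-IDF weighting, and the function names are this sketch's assumptions, not the SMART tool's actual weighting scheme.

```python
import numpy as np

def tfidf_matrix(docs):
    """Build a TF-IDF weighted term-document matrix (terms x documents)."""
    vocab = sorted({w for d in docs for w in d})
    tf = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d:
            tf[vocab.index(w), j] += 1
    df = (tf > 0).sum(axis=1)            # document frequency per term
    idf = np.log(len(docs) / df)         # inverse document frequency
    return tf * idf[:, None], vocab

def lsi(matrix, k):
    """Project documents onto the top-k latent dimensions via SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (np.diag(s[:k]) @ vt[:k]).T   # one k-dimensional vector per document

# Toy corpus: each document is a bag of (pre-processed) terms
docs = [["kupffer", "cells", "adducts"],
        ["guinea", "pigs", "halothane"],
        ["kupffer", "cells", "halothane"]]
A, vocab = tfidf_matrix(docs)
doc_vecs = lsi(A, k=2)                   # LSI document vectors over latent features
```

In the SMART view each document vector row spans word-based features; in the LSI view the SVD compresses those into k latent dimensions.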
8 TaxaMiner Methodology: Document Clustering
[Figure: Bisecting K-Means clustering transforms the document set D into a document cluster hierarchy H]
9 TaxaMiner Methodology: Bisecting K-Means
- Given a set of document vectors D = {d1, …, dM}
- Iterative splitting of a chosen cluster at each stage
- Initial cluster is the whole document set
- Until a termination condition is reached
- Compute
- Centroid of a cluster
- Cohesiveness of a particular cluster
- Quality of a partition
- Cluster choice criterion: choose the cluster with the least cohesiveness
- Termination condition: decrease in partition quality
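The loop above can be sketched as follows. This is a minimal sketch under assumptions: cohesiveness is taken as average cosine similarity to the centroid, a fixed leaf count stands in for the partition-quality termination test, and all function names are illustrative rather than TaxaMiner's actual code.

```python
import numpy as np

def cohesiveness(X):
    """Average cosine similarity of a cluster's vectors to its centroid."""
    c = X.mean(axis=0)
    return float(np.mean(
        X @ c / (np.linalg.norm(X, axis=1) * np.linalg.norm(c) + 1e-12)))

def two_means(X, iters=10, seed=0):
    """Split one cluster into two with basic k-means (k = 2)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=2, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        assign = d.argmin(axis=1)
        if (assign == assign[0]).all():          # never leave one side empty
            assign[d[:, assign[0]].argmax()] = 1 - assign[0]
        for j in (0, 1):
            centers[j] = X[assign == j].mean(axis=0)
    return np.where(assign == 0)[0], np.where(assign == 1)[0]

def bisecting_kmeans(X, max_leaves=3):
    """Iteratively bisect the least cohesive cluster."""
    clusters = [np.arange(len(X))]               # initial cluster = whole set
    while len(clusters) < max_leaves:
        splittable = [i for i, c in enumerate(clusters) if len(c) > 1]
        i = min(splittable, key=lambda i: cohesiveness(X[clusters[i]]))
        left, right = two_means(X[clusters[i]])
        idx = clusters.pop(i)
        clusters += [idx[left], idx[right]]
    return clusters

X = np.array([[1.0, 0.0], [1.1, 0.0], [5.0, 5.0],
              [5.1, 5.0], [0.0, 9.0], [0.0, 9.1]])
clusters = bisecting_kmeans(X, max_leaves=3)
```

Recording each split, rather than just the leaves, yields the document cluster hierarchy H of the previous slide.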
10 TaxaMiner Methodology: Taxonomy Extraction
[Figure: the document cluster hierarchy H over D is cut at cohesiveness thresholds θ1 and θ2 to extract the taxonomy]
11 TaxaMiner Methodology: Taxonomy Extraction
- Concept differentiation in a taxonomy is captured by the difference in cluster cohesiveness
- Observation
- Successive values of cohesiveness down a cluster hierarchy are monotonically increasing
- Input
- A set of thresholds θ1 ≤ θ2 ≤ … ≤ θN
- Output
- A taxonomy T that corresponds to the taxonomy creator's notion of differentiation
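One way to realize the threshold cut is sketched below. The dict-based tree encoding and the rule of promoting a below-threshold cluster's children are assumptions of this sketch, not the paper's data structures: a cluster becomes a taxonomy node at level i only once its cohesiveness crosses the i-th threshold.

```python
def extract_taxonomy(node, thresholds, level=0):
    """Collapse a cluster hierarchy into a taxonomy using cohesiveness thresholds.

    A cluster whose cohesiveness reaches thresholds[level] becomes a taxonomy
    node; otherwise it is skipped and its children are promoted in its place.
    """
    if level >= len(thresholds):
        return []                                    # finer distinctions ignored
    if node["cohesiveness"] >= thresholds[level]:
        children = []
        for c in node.get("children", []):
            children.extend(extract_taxonomy(c, thresholds, level + 1))
        return [{"cluster": node["id"], "children": children}]
    out = []                                         # below threshold: promote children
    for c in node.get("children", []):
        out.extend(extract_taxonomy(c, thresholds, level))
    return out

# Cohesiveness increases monotonically down the hierarchy
tree = {"id": 0, "cohesiveness": 0.2, "children": [
    {"id": 1, "cohesiveness": 0.5, "children": [
        {"id": 3, "cohesiveness": 0.8, "children": []},
        {"id": 4, "cohesiveness": 0.6, "children": []}]},
    {"id": 2, "cohesiveness": 0.35, "children": []}]}
taxo = extract_taxonomy(tree, [0.3, 0.7])
```

Here the loose root cluster disappears, clusters 1 and 2 become top-level concepts, and only cluster 3 is cohesive enough to survive as a second-level concept.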
12 TaxaMiner Methodology: Taxonomy Extraction
[Figure: taxonomy extraction and labeling: the document cluster hierarchy H over D is cut at thresholds θ1 and θ2, and labels (L1, L2, …, L14) are attached to the resulting taxonomy nodes]
13 TaxaMiner Methodology: Label Assignment
- Input: centroid vector of the taxonomy node
- SMART-based indexing
- Choose the top K weights in the centroid vector
- Labels of the node: words corresponding to those weights
- LSI-based indexing
- Compute the cosine between the term vectors and the centroid vector
- Choose the terms corresponding to the top K cosine values
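Both label-assignment variants are compact enough to show directly. The toy vocabulary, latent vectors, and function names below are illustrative assumptions, not values from the experiments.

```python
import numpy as np

def smart_labels(centroid, vocab, k=3):
    """SMART-style: labels are the terms with the top-k centroid weights."""
    top = np.argsort(centroid)[::-1][:k]
    return [vocab[i] for i in top]

def lsi_labels(centroid, term_vectors, vocab, k=3):
    """LSI-style: labels are the terms whose latent vectors have the highest
    cosine similarity with the node centroid."""
    sims = term_vectors @ centroid / (
        np.linalg.norm(term_vectors, axis=1) * np.linalg.norm(centroid) + 1e-12)
    return [vocab[i] for i in np.argsort(sims)[::-1][:k]]

vocab = ["cell", "tumor", "gene"]
node_centroid = np.array([0.1, 0.9, 0.4])      # word weights (SMART view)
labels_smart = smart_labels(node_centroid, vocab, k=2)

latent_terms = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # term vectors (LSI view)
latent_centroid = np.array([1.0, 0.1])         # node centroid in latent space
labels_lsi = lsi_labels(latent_centroid, latent_terms, vocab, k=1)
```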
14 TaxaMiner Methodology: Label Refinement
[Figure: label refinement example with a parent labeled (A, B, D) and children labeled (C, D, E), (D, F, G), (D, H, I): when label D is shared across the children, propagate label D to the parent and drop it from the children; when D occurs in only one child, keep label D at that child]
15 Taxonomy Evaluation
- Taxonomy Content Quality
- Precision-based measure (CQM-P)
- Fraction of all the labels generated that are present in the Gold Standard Taxonomy
- Recall-based measure (CQM-R)
- Fraction of all the labels in the Gold Standard Taxonomy that were generated
- Taxonomy Structural Quality
- Precision-based measure (SQM-P)
- Fraction of all the generated parent-child relationships that are reflected consistently in the Gold Standard Taxonomy
- Recall-based measure (SQM-R)
- Fraction of all the parent-child relationships in the Gold Standard Taxonomy that are reflected consistently
- Consistency: a parent-child relationship is consistent if it appears
- As a parent-child relationship, OR
- As an ancestor-descendant relationship in the Gold Standard Taxonomy
- Bake-off in an application context
- Use of Gold Standard and Generated Taxonomies to create search expressions
16 Taxonomy Evaluation: Content
- Let labels(T) be the set of node labels in taxonomy T, and G be the Gold Standard Taxonomy
- CQM-P can be defined as
- CQM-P(T) = |labels(T) ∩ labels(G)| / |labels(T)|
- CQM-R can be defined as
- CQM-R(T) = |labels(T) ∩ labels(G)| / |labels(G)|
17 Taxonomy Evaluation: Structure
- Let
- pcLinks(T) = {⟨a, b⟩ | a is the parent of b in T}
- adLinks(T) = {⟨a, b⟩ | a is an ancestor of b in T}
- adLinks(T) ⊇ pcLinks(T)
- SQM-P can be defined as
- SQM-P(T) = |pcLinks(T) ∩ adLinks(G)| / |pcLinks(T)|
- SQM-R can be defined as
- SQM-R(T) = |pcLinks(G) ∩ adLinks(T)| / |pcLinks(G)|
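The four measures can be computed directly over label sets and parent-child link sets. The pure-Python encoding below, including the transitive-closure helper for ancestor-descendant links, is an assumption of this sketch rather than the evaluation code used in the experiments.

```python
def cqm(generated_labels, gold_labels):
    """Content quality: (precision, recall) of generated labels vs. gold labels."""
    hit = generated_labels & gold_labels
    return len(hit) / len(generated_labels), len(hit) / len(gold_labels)

def ad_links(pc_links):
    """Transitive closure of parent-child links -> ancestor-descendant links."""
    ad = set(pc_links)
    while True:
        new = {(a, d) for (a, b) in ad for (c, d) in ad if b == c} - ad
        if not new:
            return ad
        ad |= new

def sqm(gen_pc, gold_pc):
    """Structural quality: a link counts as consistent if it appears as a
    parent-child or ancestor-descendant link on the other side."""
    gen_ad, gold_ad = ad_links(gen_pc), ad_links(gold_pc)
    precision = len({l for l in gen_pc if l in gold_ad}) / len(gen_pc)
    recall = len({l for l in gold_pc if l in gen_ad}) / len(gold_pc)
    return precision, recall
```

For example, a generated chain a→b→c scored against a gold standard with links a→b and a→c gives SQM-P = 0.5 (b→c has no gold counterpart) but SQM-R = 1.0 (a→c is covered as an ancestor-descendant link).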
18 Experimental Setup
- MeSH Subtree under Neoplasms chosen as the Gold Standard
- Tree number C04, 649 concepts
- Identify MEDLINE citations that have as annotations concepts from C04
- Extract the raw text from the MEDLINE citations; these form the documents in the setup
- Apply the TaxaMiner methodology to generate a taxonomy from these documents
19 Initial Results: Is there an optimal data set size?
20 Initial Results: Does recall improve?
21 Initial Results: Does increase in data set size inhibit learning of taxonomic structure?
22 Initial Results: Does increase in data set size inhibit learning of taxonomic structure?
23 Initial Results: Does NLP improve taxonomy quality?
24 Initial Results: Does NLP improve taxonomy quality?
25 Gold Standard Taxonomy
26 Generated Taxonomy
27 Initial Results: Improving Taxonomy Quality
- Problems
- Poor structural quality
- Poor precision (< 20%)
- Performed OK on recall (achieved up to 58% in some cases)
- Approaches
- Involve the user in determining the θ values
- Improve label refinement algorithms
- Use LSI and term neighborhood expansion (TNE) to identify a representative set of labels at each node
28 Term Neighborhood Expansion (TNE)
- Given
- A taxonomy node N
- A lexicon L of terms created from the underlying document corpus
- For t ∈ labels(N), let t also denote the term vector corresponding to t
- neighborhood(t) = {⟨w, σt⟩ | w ∈ L, σt = similarity between w and t}
- central-terms(N) = {⟨w, σ⟩ | t ∈ labels(N), ⟨w, σt⟩ ∈ neighborhood(t), σ aggregates the σt}
- core-terms(N) = top K central terms
- = {w | ⟨w, σ⟩ ∈ central-terms(N), σ is among the top K values}
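The definitions above can be sketched with a toy term-similarity table. The dict-of-dicts similarity structure, the similarity threshold, and summation as the aggregation of the σt values are all assumptions of this sketch.

```python
def neighborhood(term, sim, threshold=0.3):
    """Terms whose similarity to `term` meets the threshold."""
    return {w: s for w, s in sim.get(term, {}).items() if s >= threshold}

def core_terms(labels, sim, k=2, threshold=0.3):
    """Pool the neighborhoods of all node labels and keep the top-k terms by
    accumulated similarity -- the node's core terms."""
    scores = {}
    for t in labels:
        for w, s in neighborhood(t, sim, threshold).items():
            scores[w] = scores.get(w, 0.0) + s
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy term-term similarities (e.g., cosines between LSI term vectors)
sim = {"tumor": {"neoplasm": 0.9, "cancer": 0.8, "cell": 0.2},
       "cancer": {"neoplasm": 0.7, "carcinoma": 0.6}}
core = core_terms(["tumor", "cancer"], sim, k=2)
```

A term like "neoplasm" that sits in the neighborhood of several node labels accumulates a high score and surfaces as a core term even if it was never assigned as a label itself.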
29 Comparison of SMART + NLP with LSI (without NLP) + TNE
30 Comparison of SMART + NLP with LSI (without NLP) + TNE
31 Comparison of SMART + NLP with LSI (without NLP) + TNE
33 Experimental Framework
- Sampling
- Uniform sampling vs. density-biased sampling
- Natural Language Processing
- Noun Phrases: (i) Simple, (ii) Macro, (iii) Mega
- Verb Phrases
- Indexing
- Term-based dimensions: word-based vs. phrase-based
- SVD eigenvector-based dimensions
- Clustering
- Document-based vs. term-based clustering
- Bisecting K-Means vs. Principal Direction Divisive Partitioning
- Distance Measures
- Euclidean vs. cosine
- Cluster Quality Measures
- Internal measures: pairwise distance vs. distance from centroid
- External measures
- K-Means: number of iterations
- Label assignment
- Threshold (value of top K)
34 Conclusions
- Critical need for automating the creation of domain-specific taxonomies/thesauri
- The TaxaMiner methodology and approach shows promise and encouraging results
- Initial results showed reasonable recall
- Use of LSI and Term Neighborhood Expansion improves the quality of the generated taxonomy
- Definite reduction in cost and time
- Subject matter experts don't have to start from scratch by reading all the documents
35 Future Work
- Perform more extensive experimentation
- Investigate the various dimensions of the experimental framework
- Investigate the use of LSI and TNE vs. NLP
- The role of co-occurrence patterns?
- Explore and design internal taxonomy quality metrics based on
- Coverage, discrimination, consistency, etc.
- Enhance techniques for Ontology Learning
- Use of lexico-syntactic patterns
- Use of general thesauri, e.g., WordNet
- Investigate spin-offs
- document classification, semantic annotations
36 Acknowledgements
- Collaborators
- University of Georgia: Cartic Ramakrishnan, Christopher Thomas
- NLM/LHC: Tom Rindflesch
- Telcordia Technologies: Devasis Bassu
- LSI data and related discussions
- NLM/NCBI: John Wilbur
- Penn State University: Hui Han, Hongyuan Zha
- Lister Hill Center
- Tom Rindflesch, Olivier Bodenreider, Anantha Bangalore, Mehmet Kayaalp, Samir Antani