Vipul Kashyap - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Vipul Kashyap


1
TaxaMiner: An Experimental Framework
for Automated Taxonomy Bootstrapping
  • Vipul Kashyap
  • National Library of Medicine
  • kashyap_at_nlm.nih.gov
  • http://cgsb2.nlm.nih.gov/kashyap
  • May 27, 2004

2
Outline
  • Motivation
  • TaxaMiner Approach
  • TaxaMiner Methodology
  • Taxonomy Evaluation
  • Experimental Setup
  • Initial Results
  • Discussion of Problems and Approaches
  • LSI and Term Neighborhood Expansion
  • Conclusions and Future Work

3
Motivation
  • Vast amounts of biomedical research literature
  • Taxonomies/Thesauri: a useful and popular form of
    knowledge organization
  • MeSH, Gene Ontology
  • Yahoo! Taxonomy
  • Largely manual efforts used to create
    taxonomies/thesauri
  • Huge investments of time and resources
  • Not scalable
  • Semi-automatic Taxonomy Generation
  • Bootstrap rough taxonomies/thesauri
  • Human involvement to create and refine taxonomies
  • TaxaMiner Project

4
TaxaMiner Approach
  • Data Extraction and Sampling
  • Pre-process data using NLP techniques
  • Document Indexing
  • Taxonomy Evaluation
  • Document Clustering
  • Label Generation and Smoothing
  • Taxonomy Extraction

5
TaxaMiner Methodology
  • NLP Techniques for Pre-processing
  • Phrase X Parser
  • Document Indexing
  • SMART, LSI
  • Document Clustering
  • Bisecting K-Means
  • Taxonomy Extraction
  • Label Generation and Smoothing
  • Term Neighborhood Expansion

6
TaxaMiner Methodology: NLP-based pre-processing
of documents
  • Kupffer cells from halothane-exposed guinea pigs
    carry trifluoroacetylated protein adducts
  • Simple Noun Phrases
  • Kupffer cells
  • Halothane exposed guinea pigs
  • Trifluoroacetylated protein adducts
  • Macro Noun Phrase
  • Kupffer cells from halothane exposed guinea pigs
  • Mega Noun Phrases
  • Kupffer cells from halothane exposed guinea pigs
  • Trifluoroacetylated protein adducts

7
TaxaMiner Methodology: Document Indexing
  • SMART Indexing tool (from Cornell University)
  • Document vectors consist of word-based features
  • TF-IDF weights
  • Log entropy weighting function
  • Latent Semantic Indexing
  • SVD analysis of term/document matrix
  • Underlying Latent Dimensions
  • Document vectors consist of weighting of these
    latent features
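
The two indexing options above can be sketched in a few lines of NumPy. This is a minimal illustration, not the SMART tool or the LSI setup used in the talk: the helper names `tfidf_matrix` and `lsi`, the toy documents, and the plain log(N/df) IDF are all assumptions, and the log-entropy weighting mentioned on the slide is not implemented here.

```python
import numpy as np

def tfidf_matrix(docs):
    """Return (vocabulary, term-by-document TF-IDF matrix) for tokenized docs."""
    vocab = sorted({w for d in docs for w in d})
    index = {w: i for i, w in enumerate(vocab)}
    tf = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d:
            tf[index[w], j] += 1.0
    df = np.count_nonzero(tf, axis=1)      # number of documents containing each term
    idf = np.log(len(docs) / df)           # plain IDF (assumed variant)
    return vocab, tf * idf[:, None]

def lsi(term_doc, k):
    """Project terms and documents onto the top-k latent dimensions via SVD."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    term_vecs = u[:, :k] * s[:k]           # one row per term
    doc_vecs = vt[:k, :].T * s[:k]         # one row per document
    return term_vecs, doc_vecs

# Toy corpus (hypothetical), already tokenized.
docs = [
    ["kupffer", "cells", "protein", "adducts"],
    ["protein", "adducts", "halothane"],
    ["guinea", "pigs", "halothane"],
]
vocab, X = tfidf_matrix(docs)
term_vecs, doc_vecs = lsi(X, k=2)
```

With word-based indexing, the columns of `X` are the document vectors; with LSI, the rows of `doc_vecs` play that role, and `term_vecs` places terms in the same latent space.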

8
TaxaMiner Methodology: Document Clustering
[Figure: Bisecting K-Means Clustering takes the document set D to a Document Cluster Hierarchy H]
9
TaxaMiner Methodology: Bisecting K-Means
  • Given a set of document vectors D = {d1, ..., dM}
  • Iterative splitting of a chosen cluster at each
    stage
  • Initial cluster is the whole document set.
  • Till a terminating condition is reached
  • Compute
  • Centroid of a cluster
  • Cohesiveness of a particular cluster
  • Quality of a partition
  • Cluster choice criterion: choose the cluster with
    least cohesiveness
  • Termination condition: decrease in partition
    quality
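
The loop described above can be sketched as follows. The slides do not give formulas for cohesiveness or partition quality, so this sketch assumes cohesiveness is the mean cosine between members and their centroid and quality is the size-weighted mean cohesiveness. With those assumed definitions a split rarely lowers quality, so a hypothetical `max_clusters` cap is added as a second stopping rule. The sketch returns only the final partition, not the full hierarchy.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    v = np.asarray(v, dtype=float)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def cohesiveness(vectors):
    """Mean cosine between members and their centroid (assumed definition)."""
    c = unit(vectors.mean(axis=0))
    return float(np.mean(vectors @ c))

def quality(docs, clusters):
    """Size-weighted mean cohesiveness of a partition (assumed definition)."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) * cohesiveness(docs[c]) for c in clusters) / total

def two_means(vectors, iters=10):
    """Split unit vectors into two groups with cosine-based 2-means."""
    c = unit(vectors.mean(axis=0))
    a = vectors[np.argmin(vectors @ c)]    # seed 1: least central member
    b = vectors[np.argmin(vectors @ a)]    # seed 2: least similar to seed 1
    for _ in range(iters):
        left = (vectors @ a) >= (vectors @ b)
        if left.all() or not left.any():
            break                          # degenerate split, give up
        a = unit(vectors[left].mean(axis=0))
        b = unit(vectors[~left].mean(axis=0))
    return left

def bisecting_kmeans(docs, max_clusters=8):
    docs = np.array([unit(d) for d in docs])
    clusters = [np.arange(len(docs))]      # initial cluster: the whole document set
    while len(clusters) < max_clusters:
        splittable = [i for i, c in enumerate(clusters) if len(c) > 1]
        if not splittable:
            break
        # Cluster choice criterion: least cohesive cluster.
        worst = min(splittable, key=lambda i: cohesiveness(docs[clusters[i]]))
        left = two_means(docs[clusters[worst]])
        if left.all() or not left.any():
            break
        parts = [clusters[worst][left], clusters[worst][~left]]
        candidate = clusters[:worst] + parts + clusters[worst + 1:]
        # Termination condition: stop when partition quality decreases.
        if quality(docs, candidate) < quality(docs, clusters):
            break
        clusters = candidate
    return clusters

docs = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
clusters = bisecting_kmeans(docs, max_clusters=2)
```

On this toy input the first split separates documents 0 and 1 from documents 2 and 3.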

10
TaxaMiner Methodology: Taxonomy Extraction
[Figure: Taxonomy Extraction applied to the Document Cluster Hierarchy H for document set D, with thresholds τ1 and τ2 marked]
11
TaxaMiner Methodology: Taxonomy Extraction
  • Concept differentiation in a taxonomy is captured
    by the difference in cluster cohesiveness
  • Observation
  • Successive values of cohesiveness down a cluster
    hierarchy are monotonically increasing
  • Input
  • A set of thresholds τ1 ≤ ... ≤ τN
  • Output
  • A taxonomy T that corresponds to the taxonomy
    creator's notions of differentiation.
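
One way to read the input/output description above is sketched below. The slides do not spell out the extraction rule, so everything here is an assumption: the `Cluster` class, the `extract` function, and the rule that a cluster becomes a taxonomy node at level i when its cohesiveness first reaches the i-th threshold on the way down the hierarchy.

```python
from dataclasses import dataclass, field

@dataclass
class Cluster:
    name: str
    cohesiveness: float
    children: list = field(default_factory=list)

def extract(cluster, thresholds, level=0):
    """Return nested (name, [children]) taxonomy nodes found below `cluster`."""
    if level == len(thresholds):
        return []                          # all thresholds used: no deeper levels
    nodes = []
    for child in cluster.children:
        if child.cohesiveness >= thresholds[level]:
            # Cohesive enough: becomes a taxonomy node at this level.
            nodes.append((child.name, extract(child, thresholds, level + 1)))
        else:
            # Too loose: skip it and keep looking further down the hierarchy.
            nodes.extend(extract(child, thresholds, level))
    return nodes

# Hypothetical hierarchy; cohesiveness increases down each path.
root = Cluster("root", 0.2, [
    Cluster("A", 0.3, [Cluster("A1", 0.5), Cluster("A2", 0.7)]),
    Cluster("B", 0.6, [Cluster("B1", 0.9)]),
])
taxonomy = extract(root, thresholds=[0.5, 0.8])
```

Here cluster A is too loose for the first threshold, so its children A1 and A2 are promoted to the first taxonomy level next to B.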

12
TaxaMiner Methodology: Taxonomy Extraction
[Figure: Taxonomy Extraction and Labeling applied to the Document Cluster Hierarchy H for document set D, with thresholds τ1 and τ2 marked; each taxonomy node carries a set of labels (L1, L2, ...; L3, L4, ...; L5, L6, ...; and so on)]
13
TaxaMiner Methodology: Label Assignment
  • Input: centroid vector of the taxonomy node
  • SMART-based indexing
  • Choose the top K weights in the centroid vector
  • Labels of the node = words corresponding to those
    weights
  • LSI-based indexing
  • Compute cosine between the term vectors and the
    centroid vector
  • Choose the terms corresponding to top K cosine
    values
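
Both labeling rules above reduce to a top-K selection. The following is a minimal sketch: the function names, the toy vocabulary, and the toy vectors are assumptions made for illustration only.

```python
import numpy as np

def smart_labels(centroid, vocab, k):
    """Word-based indexing: words with the top-k weights in the centroid."""
    top = np.argsort(centroid)[::-1][:k]
    return [vocab[i] for i in top]

def lsi_labels(centroid, term_vecs, vocab, k):
    """LSI-based indexing: terms whose vectors have the top-k cosines with the centroid."""
    norms = np.linalg.norm(term_vecs, axis=1) * np.linalg.norm(centroid)
    cosines = term_vecs @ centroid / np.where(norms == 0, 1.0, norms)
    top = np.argsort(cosines)[::-1][:k]
    return [vocab[i] for i in top]

vocab = ["cells", "protein", "tumor"]

# Word-space centroid: one weight per vocabulary word.
word_labels = smart_labels(np.array([0.1, 0.7, 0.4]), vocab, k=2)

# Latent-space centroid and term vectors (two latent dimensions).
term_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
latent_labels = lsi_labels(np.array([0.9, 0.1]), term_vecs, vocab, k=2)
```

In the word-based case the centroid lives in term space, so its weights can be ranked directly; in the LSI case the centroid lives in latent space, so terms are ranked by cosine against it.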

14
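TaxaMiner Methodology: Label Refinement
[Figure: label refinement example. Node label sets before refinement: {A, B, D}, {C, D, E}, {D, F, G}, {D, H, I}; after refinement: {A, B, D}, {C, E}, {F, G}, {D, H, I}. Annotations: "Propagate label D to Parent" and "Keep label D at Child"]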
15
Taxonomy Evaluation
  • Taxonomy Content Quality
  • Precision-based measure (CQM-P)
  • Percentage of all the labels generated that are present in
    the Gold Standard Taxonomy
  • Recall-based measure (CQM-R)
  • Percentage of all the labels in the Gold Standard Taxonomy
    that were generated
  • Taxonomy Structural Quality
  • Precision-based measure (SQM-P)
  • Percentage of all the generated parent-child relationships
    that are reflected consistently in the Gold Standard
    Taxonomy
  • Recall-based measure (SQM-R)
  • Percentage of all the parent-child relationships in the
    Gold Standard Taxonomy that are reflected
    consistently in the generated taxonomy
  • A parent-child relationship is consistent if it
    appears
  • As a parent-child, OR
  • As an ancestor-descendant in the gold standard
    taxonomy
  • Bake-off in an application context
  • Use of Gold Standard and Generated Taxonomies to
    create search expressions

16
Taxonomy Evaluation: Content
  • Let
  • CQM-P can be defined as
  • CQM-R can be defined as
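
The formulas on this slide are not in the transcript. Going only by the verbal definitions on the Taxonomy Evaluation slide, one consistent way to write them is shown below. The notation is an assumption: labels(T) stands for the set of labels in taxonomy T, T_gen for the generated taxonomy and T_gold for the Gold Standard Taxonomy.

```latex
\[
\text{CQM-P} = \frac{\lvert \mathit{labels}(T_{gen}) \cap \mathit{labels}(T_{gold}) \rvert}
                    {\lvert \mathit{labels}(T_{gen}) \rvert}
\qquad
\text{CQM-R} = \frac{\lvert \mathit{labels}(T_{gen}) \cap \mathit{labels}(T_{gold}) \rvert}
                    {\lvert \mathit{labels}(T_{gold}) \rvert}
\]
```

Multiplying either ratio by 100 gives the percentage form.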

17
Taxonomy Evaluation: Structure
  • Let
  • pcLinks(T) = {<a, b> : a is parent of b in T}
  • adLinks(T) = {<a, b> : a is ancestor of b in T}
  • adLinks(T) ⊇ pcLinks(T)
  • SQM-P can be defined as
  • SQM-R can be defined as
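
The SQM formulas themselves are not in the transcript. The sketch below follows the verbal definition from the Taxonomy Evaluation slide: a parent-child link counts as consistent when it appears as a parent-child or ancestor-descendant link in the other taxonomy. Since adLinks(T) contains pcLinks(T), that reduces to membership in adLinks. The dictionary representation, the function names, and the toy taxonomies are assumptions; taxonomies are assumed to be non-empty and acyclic.

```python
def pc_links(tax):
    """All <parent, child> pairs of a taxonomy given as {parent: [children]}."""
    return {(p, c) for p, children in tax.items() for c in children}

def ad_links(tax):
    """All <ancestor, descendant> pairs, including direct parent-child pairs."""
    links = set()

    def walk(ancestors, node):
        path = ancestors + [node]
        for child in tax.get(node, []):
            links.update((a, child) for a in path)
            walk(path, child)

    roots = set(tax) - {c for children in tax.values() for c in children}
    for root in roots:
        walk([], root)
    return links

def sqm(generated, gold):
    """Return (SQM-P, SQM-R) as fractions under the assumed reading."""
    gen_pc, gold_pc = pc_links(generated), pc_links(gold)
    precision = len(gen_pc & ad_links(gold)) / len(gen_pc)
    recall = len(gold_pc & ad_links(generated)) / len(gold_pc)
    return precision, recall

# Toy taxonomies with placeholder concept names.
gold = {"A": ["B", "C"], "B": ["D"]}
generated = {"A": ["B", "D", "E"]}
precision, recall = sqm(generated, gold)
```

In the toy case the generated link <A, D> still counts toward precision because A is an ancestor of D in the gold taxonomy, while <A, E> does not.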

18
Experimental Setup
  • MeSH Subtree under Neoplasms chosen as the Gold
    Standard
  • Tree number C04, 649 Concepts
  • Identify MEDLINE citations that have as
    annotations concepts from C04
  • Extract the raw text from the MEDLINE citations.
    These form the documents in the setup
  • Apply the TaxaMiner methodology to generate a
    taxonomy from these documents.

19
Initial Results: Is there an optimal data set
size?
20
Initial Results: Does recall improve?
21
Initial Results: Does increase in data set size
inhibit learning of taxonomic structure?
22
Initial Results: Does increase in data set size
inhibit learning of taxonomic structure?
23
Initial Results: Does NLP improve taxonomy
quality?
24
Initial Results: Does NLP improve taxonomy
quality?
25
Gold Standard Taxonomy
26
Generated Taxonomy
27
Initial Results: Improving Taxonomy Quality
  • Problems
  • Poor Structural Quality
  • Poor Precision (< 20%)
  • Performed OK on the Recall (achieved up to 58% in
    some cases)
  • Approaches
  • Involve the user in determining the threshold (τ) values
  • Improve label refinement algorithms
  • Use of LSI and term neighborhood expansion (TNE)
    to identify representative set of labels at
    each node.

28
Term Neighborhood Expansion (TNE)
  • Given
  • A taxonomy node N,
  • a lexicon L of terms created from the underlying
    document corpus
  • t ∈ labels(N), with a term vector
    corresponding to t
  • neighborhood(t) = {<w, θt> : w ∈ L, θt = (formula not in transcript)}
  • central-terms(N) =
    {<w, θ> : t ∈ labels(N), <w, θt> ∈
    neighborhood(t), θ = (formula not in transcript)}
  • core-terms(N) = top K central terms =
    {w : <w, θ> ∈ central-terms(N), θ is
    among the top K values}
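
The definitions above leave the two scores unspecified in this transcript, so the sketch below fills them with explicit assumptions: the per-label score is the cosine between term vectors, and the combined score for a lexicon term is the sum of its per-label scores. The function names and the toy term vectors are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity, with 0.0 for zero-length vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def core_terms(labels, term_vecs, k):
    """Top-k lexicon terms by summed cosine to the node's labels (assumed scoring)."""
    scores = {}
    for t in labels:
        # neighborhood(t): every lexicon term w paired with its score against t
        for w, vec in term_vecs.items():
            scores[w] = scores.get(w, 0.0) + cosine(vec, term_vecs[t])
    # central terms ranked by combined score; keep the top k as core terms
    ranked = sorted(scores, key=lambda w: scores[w], reverse=True)
    return ranked[:k]

# Toy lexicon with two-dimensional term vectors (hypothetical values).
term_vecs = {
    "tumor": np.array([1.0, 0.0]),
    "neoplasm": np.array([0.9, 0.1]),
    "cancer": np.array([0.8, 0.3]),
    "liver": np.array([0.0, 1.0]),
}
core = core_terms(["tumor", "cancer"], term_vecs, k=3)
```

In this toy run "neoplasm" ranks first even though it was not one of the node's original labels, which is the kind of expansion the slide describes.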

29
Comparison of SMART + NLP with LSI without NLP
+ TNE
30
Comparison of SMART + NLP with LSI without NLP
+ TNE
31
Comparison of SMART + NLP with LSI without NLP
+ TNE
32
(No Transcript)
33
Experimental Framework
  • Sampling
  • Uniform sampling v/s Density biased sampling
  • Natural Language Processing
  • Noun Phrases: (i) Simple, (ii) Macro, (iii) Mega
  • Verb Phrases
  • Indexing
  • Term-based dimensions: Word-based vs. Phrase-based
  • SVD eigenvector-based dimensions
  • Clustering
  • Document v/s Term based clustering
  • Bisecting K-means v/s Principal Direction
    Divisive Partitioning
  • Distance Measures
  • Euclidean v/s Cosine
  • Cluster Quality Measures
  • Internal Measures: Pairwise distance v/s
    Distance from Centroid
  • External Measures
  • K-Means: Number of Iterations
  • Label assignment
  • Threshold (Value of Top K)

34
Conclusions
  • Critical need for automating the creation of
    domain specific taxonomies/thesauri
  • TaxaMiner methodology and approach shows promise
    and encouraging results
  • Initial results showed reasonable recall
  • Use of LSI and Term Neighborhood expansion
    improves the quality of taxonomy generated.
  • Definite reduction in cost and time
  • Subject Matter Experts don't have to start from
    scratch by reading all the documents

35
Future Work
  • Perform more extensive experimentation
  • Investigate the various dimensions of the
    experimental framework
  • Investigate the use of LSI and TNE v/s NLP
  • The role of co-occurrence patterns?
  • Explore and design internal taxonomy quality
    metrics based on
  • coverage, discrimination, consistency, etc.
  • Enhance techniques for Ontology Learning
  • Use of lexico-syntactic patterns
  • Use of general thesauri, e.g., WordNet
  • Investigate spin-offs
  • document classification, semantic annotations

36
Acknowledgements
  • Collaborators
  • University of Georgia: Cartic Ramakrishnan,
    Christopher Thomas
  • NLM/LHC: Tom Rindflesch
  • Telcordia Technologies: Devasis Bassu
  • LSI data and related discussions
  • NLM/NCBI: John Wilbur
  • Penn State University: Hui Han, Hongyuan Zha
  • Lister Hill Center
  • Tom Rindflesch, Olivier Bodenreider, Anantha
    Bangalore, Mehmet Kayaalp, Samir Antani