Castanet: Using WordNet to Build Facet Hierarchies - PowerPoint PPT Presentation


1
Castanet: Using WordNet to Build Facet Hierarchies
  • Emilia Stoica and Marti Hearst
  • School of Information, Berkeley

2
Focus: Search and Navigation of Large Collections
  • Shopping Sites
  • Digital Libraries
  • E-Government Sites
  • Image Collections
3
Problems with Site Search
  • Study by Vividence in 2001 on 69 Sites
  • 70% eCommerce
  • 31% Service
  • 21% Content
  • 2% Community
  • Poorly organized search results
  • Frustration and wasted time
  • Poor information architecture
  • Confusion
  • Dead ends
  • "back and forthing"
  • Forced to search

4
The Problem With Hierarchy
  • Most things can be classified in more than one
    way.
  • Most organizational systems do not handle this
    well.
  • Example: Animal Classification

[Figure: otter, penguin, robin, salmon, wolf, cobra, bat, each classifiable by Skin Covering, Locomotion, or Diet]
5
The Problem With Hierarchy
[Figure: a single hierarchy from "start" through locomotion (swim, fly, run, slither), with skin covering (fur, scales, feathers) repeated under the branches and diet (fish, rodents, insects) repeated under those, down to animals such as salmon, bat, robin, and wolf]
6
The Idea of Facets
  • Facets are a way of labeling data
  • A kind of Metadata (data about data)
  • Can be thought of as properties of items
  • Facets vs. Categories
  • Items are placed INTO a category system
  • Multiple facet labels are ASSIGNED TO items

7
The Idea of Facets
  • Hot and Sweet Chicken: 1 pepper, 2 apricots,
    1 pound chicken breast, 1 Tbsp gingerroot

Meat > Chicken
8
Using Facets
  • Now there are multiple ways to get to each item

  • Preparation Method: Fry, Saute, Boil, Bake, Broil, Freeze
  • Desserts: Cakes, Cookies, Dairy (Ice Cream, Sherbet, Flan)
  • Fruits: Cherries, Berries (Blueberries, Strawberries), Bananas, Pineapple

  • Item 1: Fruit > Pineapple, Dessert > Cake, Preparation > Bake
  • Item 2: Dessert > Dairy > Sherbet, Fruit > Berries > Strawberries,
    Preparation > Freeze
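
A minimal sketch of why facet labels give several routes to the same item, using the two label sets from this slide. The item names (pineapple_cake, strawberry_sherbet) and the helper items_under are hypothetical, invented only for illustration.

```python
# Each item carries several facet labels; each label is a path within one facet.
# Item names are invented for illustration; the paths come from the slide.
recipes = {
    "pineapple_cake": [
        ("Fruit", "Pineapple"),
        ("Dessert", "Cake"),
        ("Preparation", "Bake"),
    ],
    "strawberry_sherbet": [
        ("Dessert", "Dairy", "Sherbet"),
        ("Fruit", "Berries", "Strawberries"),
        ("Preparation", "Freeze"),
    ],
}


def items_under(prefix, collection):
    """Return the items that have at least one facet path starting with prefix."""
    prefix = tuple(prefix)
    return sorted(
        name
        for name, paths in collection.items()
        if any(path[: len(prefix)] == prefix for path in paths)
    )


# The same item is reachable from any of its facets.
by_fruit = items_under(["Fruit", "Berries"], recipes)
by_method = items_under(["Preparation", "Freeze"], recipes)
by_dessert = items_under(["Dessert"], recipes)
```

Because labels are assigned to items rather than items being filed in one place, adding a new facet only means adding one more path per item.
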
9
Castanet
  • Semi-automatic algorithm for creating
    hierarchical faceted metadata
  • Carves out a structure from the hypernym
    (IS-A) relations within WordNet
  • Produces surprisingly good results for a wide
    range of subjects
  • e.g., arts, medicine, recipes, math, news,
    bibliographical records

10
WordNet Challenges
  • A word may have more than one sense
  • - Fine granularity of word sense distinctions
  •   e.g., newspaper (1) - daily publication on folded sheets;
    newspaper (3) - physical object
  • - Ambiguity for the same sense

11
WordNet Challenges (cont.)
  • The hypernym path may be quite long (e.g., sense
    3 of tuna has 14 nodes)
  • Sparse coverage of proper names and noun phrases
    (not addressed)

12
Algorithm Goals
  • Build a set of facet hierarchies
  • Balance depth and breadth
  • Avoid skinny paths
  • Don't go too deep or too broad
  • Choose understandable labels
  • Disambiguate words
  • Currently a word can take on only one sense

13
Our Approach
[Figure: overview of the approach, starting from Documents]
14
1. Select Terms
  • Select well-distributed terms from the collection
  • Eliminate stopwords
  • Retain only those terms with a distribution
    higher than a threshold (default: top 10)
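
The selection step might look roughly like the sketch below. It assumes that "distribution" means the number of documents containing a term, and that the threshold is given as a fraction of the collection; both are assumptions, as are the tiny stopword list and the name select_terms.

```python
from collections import Counter

# Tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "of", "the", "with", "in"}


def select_terms(documents, min_fraction):
    """Keep non-stopword terms that occur in more than min_fraction of the documents."""
    doc_freq = Counter()
    for text in documents:
        # Count each term once per document (document frequency, not raw frequency).
        doc_freq.update(set(text.lower().split()) - STOPWORDS)
    cutoff = min_fraction * len(documents)
    return {term: n for term, n in doc_freq.items() if n > cutoff}


docs = [
    "chicken with apricots and pepper",
    "baked chicken",
    "sherbet with strawberries",
    "pineapple cake",
]
selected = select_terms(docs, min_fraction=0.3)
```

In this toy collection only "chicken" appears in more than 30% of the documents, so it is the only term retained.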

[Figure: pipeline with stages Documents, Select terms, Build core tree, Augment core tree, Compress tree, Remove top-level categories, and WordNet]
15
2. Build Core Tree
  • Build a backbone
  • Create paths from unambiguous terms only
  • Bias the structure towards appropriate senses of
    words
  • Get hypernym path if term
  • - has only one sense, or
  • - matches a pre-selected WordNet domain
  • Adding a new term increases a count at each node
    on its path by the number of docs with the term.

16
2. Build Core Tree (cont.)
  • Merge hypernym paths to build a tree
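
A rough sketch of merging hypernym paths into a core tree while accumulating counts. The paths and document counts are hard-coded assumptions so that the example runs on its own; in practice they would be looked up in WordNet. The name build_core_tree and the dict-based node layout are illustrative choices, not the authors' implementation.

```python
# Hypothetical hypernym paths (root first) for unambiguous terms, plus the
# number of documents containing each term.
paths = {
    "sherbet": (["food", "dessert", "frozen dessert", "sherbet"], 3),
    "parfait": (["food", "dessert", "frozen dessert", "parfait"], 2),
    "flan": (["food", "dessert", "flan"], 1),
}


def build_core_tree(term_paths):
    """Merge root-first paths into one tree; each node sums the doc counts below it."""
    tree = {}  # node name -> {"count": int, "children": {...}}
    for path, n_docs in term_paths.values():
        level = tree
        for name in path:
            node = level.setdefault(name, {"count": 0, "children": {}})
            node["count"] += n_docs  # every node on the path gets the term's doc count
            level = node["children"]
    return tree


core = build_core_tree(paths)
dessert = core["food"]["children"]["dessert"]
```

Shared prefixes such as food > dessert are stored once, and their counts reflect every term attached beneath them.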

17
3. Augment Core Tree
  • Attach to Core tree the terms with more than one
    sense
  • Favor the more common path over other alternatives
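
One possible reading of "favor the more common path" is sketched below: score each candidate sense by the counts of the core-tree nodes its hypernym path shares, and pick the highest. This scoring rule is an assumption, not something the slide specifies, and the core tree, the two senses of "cake", and all counts are invented.

```python
def path_score(tree, path):
    """Sum the counts of core-tree nodes that a root-first path shares with the tree."""
    score, level = 0, tree
    for name in path:
        if name not in level:
            break
        score += level[name]["count"]
        level = level[name]["children"]
    return score


def pick_sense(tree, sense_paths):
    """Return the index of the sense whose path is best supported by the core tree."""
    return max(range(len(sense_paths)), key=lambda i: path_score(tree, sense_paths[i]))


# A small core tree: each node is a dict with a count and children (counts invented).
core = {
    "food": {"count": 6, "children": {
        "baked goods": {"count": 4, "children": {}},
    }},
    "artifact": {"count": 1, "children": {}},
}

# Two hypothetical senses of the ambiguous term "cake", as root-first paths.
senses = [
    ["artifact", "block", "cake"],
    ["food", "baked goods", "cake"],
]
best = pick_sense(core, senses)
```

Here the food sense wins because its path runs through heavily populated nodes of the core tree.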

18
Augment Core Tree (cont.)
19
Optional Step: Domains
  • To disambiguate, use Domains
  • WordNet has 212 Domains
  • medicine, mathematics, biology, chemistry,
    linguistics, soccer, etc.
  • A better collection has been developed by Magnini
    (2000)
  • Assigns a domain to every noun synset
  • Automatically scan the collection to see which
    domains apply
  • The user selects which of the suggested domains
    to use or may add their own
  • Paths for terms that match the selected domains
    are added to the core tree

20
Using Domains
dip glosses
  • Sense 1: A depression in an otherwise level surface
  • Sense 2: The angle that a magnet needle makes with horizon
  • Sense 3: Tasty mixture into which bite-size foods are dipped
dip hypernyms
  • Sense 1: solid > concave shape > depression
  • Sense 2: shape, form > space > angle
  • Sense 3: food > ingredient, fixings > flavorer
Given domain "food", choose sense 3
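
A small sketch of domain-based sense selection for the "dip" example. The hypernym paths come from the slide; the domain labels attached to senses 1 and 2 are assumptions made for illustration (only "food" is named on the slide), and senses_in_domains is a hypothetical helper.

```python
# Senses of "dip" with root-first hypernym paths from the slide. The domain
# labels for senses 1 and 2 are assumptions; only "food" is named on the slide.
DIP_SENSES = [
    {"sense": 1, "domain": "geography", "path": ["solid", "concave shape", "depression"]},
    {"sense": 2, "domain": "physics", "path": ["shape, form", "space", "angle"]},
    {"sense": 3, "domain": "food", "path": ["food", "ingredient, fixings", "flavorer"]},
]


def senses_in_domains(senses, selected_domains):
    """Keep only the senses whose domain is one the user selected."""
    return [s for s in senses if s["domain"] in selected_domains]


matches = senses_in_domains(DIP_SENSES, {"food"})
chosen = matches[0]["sense"] if len(matches) == 1 else None  # None: still ambiguous
```

When exactly one sense falls in a selected domain, its path can be added to the core tree as if the term were unambiguous.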
21
4. Compress Tree
  • Rule 1
  • Eliminate a parent with fewer than k children
    unless it is the root or its distribution is
    larger than 0.1 * maxdist

[Figure: example tree with nodes dessert, frozen dessert, ice cream sundae, parfait, "sherbet, sorbet", sundae, sherbet]
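
Rule 1 could be sketched as follows on the dessert example. The tree shape is one reading of the figure, the distribution values are invented, and promoting the children of a removed parent to its own parent is an assumption about what "eliminate" means here.

```python
def make_node(name, dist, *children):
    """Build a tree node; dist is the node's distribution (values here are invented)."""
    return {"name": name, "dist": dist, "children": list(children)}


def compress_rule1(node, k, maxdist, is_root=True):
    """Return the node(s) that replace `node` after applying Rule 1 bottom-up.

    An internal node with fewer than k children is dropped and its children are
    promoted, unless it is the root or its distribution exceeds 0.1 * maxdist.
    Leaves are kept: they are terms, not parents.
    """
    new_children = []
    for child in node["children"]:
        new_children.extend(compress_rule1(child, k, maxdist, is_root=False))
    node = {**node, "children": new_children}
    keep = (
        is_root
        or not new_children
        or len(new_children) >= k
        or node["dist"] > 0.1 * maxdist
    )
    return [node] if keep else new_children


tree = make_node(
    "dessert", 100,
    make_node(
        "frozen dessert", 30,
        make_node("ice cream sundae", 3, make_node("sundae", 3)),
        make_node("parfait", 2),
        make_node("sherbet, sorbet", 4, make_node("sherbet", 4)),
    ),
)
result = compress_rule1(tree, k=2, maxdist=100)
frozen = result[0]["children"][0]
child_names = [c["name"] for c in frozen["children"]]
```

With k = 2, the single-child parents "ice cream sundae" and "sherbet, sorbet" are removed and their children move up under "frozen dessert".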
22
4. Compress Tree (cont.)
  • Rule 2
  • Eliminate a child whose name appears within the
    parent's name

[Figure: compressed tree with nodes dessert, frozen dessert, sundae, parfait, sherbet]
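
A literal sketch of Rule 2 as worded on the slide. Matching on whole words or phrases and re-attaching a removed child's own children to the parent are both assumptions; the slide does not say how either case is handled. The tree shape is one reading of the dessert figure.

```python
import re


def node(name, *children):
    """Build a tree node with a name and a list of child nodes."""
    return {"name": name, "children": list(children)}


def name_within(child_name, parent_name):
    """True if child_name occurs as a whole word or phrase inside parent_name."""
    pattern = r"\b" + re.escape(child_name.lower()) + r"\b"
    return re.search(pattern, parent_name.lower()) is not None


def compress_rule2(parent):
    """Drop children whose name appears within the parent's name (applied top-down).

    Assumption: a dropped child's own children are re-attached to the parent.
    """
    kept = []
    pending = list(parent["children"])
    while pending:
        child = pending.pop(0)
        if name_within(child["name"], parent["name"]):
            pending = child["children"] + pending  # promote grandchildren in place
        else:
            kept.append(compress_rule2(child))
    return {"name": parent["name"], "children": kept}


tree = node(
    "dessert",
    node(
        "frozen dessert",
        node("ice cream sundae", node("sundae")),
        node("parfait"),
        node("sherbet, sorbet", node("sherbet")),
    ),
)
result = compress_rule2(tree)
frozen = result["children"][0]
child_names = [c["name"] for c in frozen["children"]]
```

Under this reading, "sundae" and "sherbet" are removed because each repeats part of its parent's label, while "frozen dessert" stays because it is not contained in "dessert".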
23
5. Divide into Facets
24
5. Divide into Facets (Remove top levels)
  • Rule 1: Manually eliminate the top t levels (t = 4
    for recipe collection).
  • Rule 2: For each resulting tree, test if it has
    more than n children (n = 2). If yes, stop.
    If not, delete the root and test again.
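
The two rules can be sketched as below. The example tree is invented and uses t = 1 to stay small (the slide uses t = 4 for recipes); drop_top_levels and split_into_facets are hypothetical names. Following the rule literally, a subtree that never exceeds n children disappears entirely.

```python
def node(name, *children):
    """Build a tree node with a name and a list of child nodes."""
    return {"name": name, "children": list(children)}


def drop_top_levels(root, t):
    """Rule 1: remove the top t levels and return the subtrees that remain."""
    trees = [root]
    for _ in range(t):
        trees = [child for tree in trees for child in tree["children"]]
    return trees


def split_into_facets(tree, n):
    """Rule 2: keep a tree with more than n children; otherwise delete its root and retest."""
    if len(tree["children"]) > n:
        return [tree]
    facets = []
    for child in tree["children"]:
        facets.extend(split_into_facets(child, n))
    return facets


tree = node(
    "entity",
    node(
        "substance",
        node(
            "food",
            node("dessert", node("cake"), node("cookie"), node("flan")),
            node("fruit", node("cherry"), node("banana"), node("pineapple")),
        ),
    ),
)
facets = [
    facet
    for subtree in drop_top_levels(tree, t=1)
    for facet in split_into_facets(subtree, n=2)
]
facet_names = [f["name"] for f in facets]
```

Here "substance" and "food" are too narrow to serve as facet roots, so they are deleted and "dessert" and "fruit" become the facets.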
25
Example: Recipes (3500 docs)
26
Castanet Output (shown in Flamenco)
27
Castanet Output
28
Castanet Output
29
Castanet Output
30
Castanet Output
31
(No Transcript)
32
Castanet Evaluation
  • This is a tool for information architects, so
    information architects did the evaluation
  • We compared output on
  • Recipes
  • Biomedical journal titles
  • We compared to two state-of-the-art algorithms
  • LDA (Blei et al. 04)
  • Subsumption (Sanderson & Croft 99)

33
Subsumption Output
34
Subsumption Output
35
Subsumption Output
36
Subsumption Output
37
LDA Output
38
LDA Output
39
LDA Output
40
Evaluation Method
  • Information architects assessed the category
    systems
  • For each of 2 systems' output
  • Examined and commented on top-level
  • Examined and commented on two sub-levels
  • Then commented on overall properties
  • Meaningful?
  • Systematic?
  • Likely to use in your work?

41
Evaluation (cont.)
  • Sample questions for top-level categories
  • - Would you add/remove/rename any category?
  • - Did this category match your expectations?
  • Sample questions for a specific category
  • - Would you add/move/remove any sub-categories?
  • - Would you promote any sub-category to top level?
  • General questions
  • - Would you use Castanet?
  • - Would you use LDA?
  • - Would you use Subsumption?
  • - Would you use list of most frequent terms?

42
Evaluation Results
  • Results on recipes collection for
    "Would you use this system in your work?"
  • "Yes in some cases" or "yes, definitely"
  • Castanet 29/34
  • LDA 0/18
  • Subsumption 6/16
  • Baseline 25/34
  • Average response to questions about quality
    (4 = strongly agree)

43
Evaluation Results
  • Average responses for top-level categories
  • 4 = no changes, 1 = change many
  • Average responses for 2 subcategories

44
Needed Improvements
  • Take spelling variations and morphological
    variants into account
  • Use verbs and adjectives, not just nouns
  • Normalize noun phrases
  • Allow terms to have more than one sense
  • Improve algorithm for assigning documents to
    categories.

45
Opportunities for Tagging
  • New opportunity: tagging, folksonomies
  • (Flickr, del.icio.us)
  • People are creating facets in a decentralized
    manner
  • They are assigning multiple facets to items
  • This is done on a massive scale
  • This leads naturally to meaningful associations

46
Conclusions
  • Flexible application of hierarchical faceted
    metadata is a proven approach for navigating
    large information collections.
  • Midway in complexity between simple hierarchies
    and deep knowledge representation.
  • Currently in use on e-commerce sites; spreading
    to other domains
  • Systems are needed to help create faceted
    metadata structures
  • Our WordNet-based algorithm, while not perfect,
    seems like it will be a useful tool for
    Information Architects.

47
Conclusions
  • Castanet builds a set of faceted hierarchies by
    finding IS-A relations between terms using
    WordNet.
  • The method has been tested on various domains
  • medicine, recipes, math, news, arts,
    bibliographical records
  • Usability study shows
  • Castanet is preferred to other state-of-the-art
    solutions.
  • Information architects want to use the tool in
    their work.

48
Learn More
  • Funding
  • This work supported in part by NSF (IIS-9984741)
  • For more information
  • Stoica, E., Hearst, M., and Richardson, M.,
    Automating Creation of Hierarchical Faceted
    Metadata Structures, NAACL/HLT 2007
  • See http://flamenco.berkeley.edu

49
Motivation
  • Want to assign labels from multiple hierarchies

50
The Problem with Hierarchy
  • Inflexible
  • Force the user to start with a particular
    category
  • What if I don't know the animal's diet, but the
    interface makes me start with that category?
  • Wasteful
  • Have to repeat combinations of categories
  • Makes for extra clicking and extra coding
  • Difficult to modify
  • To add a new category type, must duplicate it
    everywhere or change things everywhere